Compare commits

..

15 Commits

Author SHA1 Message Date
1032d82643 Merge #4140
4140: Fix bug where search with distinct attribute has wrong totalHits and totalPages r=dureuill a=vivek-26

# Pull Request

## Related issue
Fixes #4130

## What does this PR do?
This PR:
- Fixes the bug where a search with a distinct attribute and pagination (and no ranking) returns incorrect values for `totalHits` and `totalPages`; a short sketch of the expected behaviour follows this list.
- Adds and updates unit and integration tests.
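The sketch below is a minimal, hypothetical illustration in plain Rust (in-memory data and made-up field names, not milli's actual search code): with a distinct attribute, `totalHits` and `totalPages` have to be derived from the whole deduplicated candidate set, not from the handful of documents kept for the requested page.

```rust
use std::collections::HashSet;

struct Doc {
    id: u64,
    color: &'static str, // the distinct attribute
}

/// Returns (page of document ids, totalHits, totalPages). Assumes `limit > 0`.
fn paginate_distinct(docs: &[Doc], offset: usize, limit: usize) -> (Vec<u64>, usize, usize) {
    // Deduplicate on the distinct attribute over *all* candidates first…
    let mut seen = HashSet::new();
    let distinct: Vec<u64> = docs.iter().filter(|d| seen.insert(d.color)).map(|d| d.id).collect();

    // …so the totals describe every distinct candidate, not just this page.
    let total_hits = distinct.len();
    let total_pages = (total_hits + limit - 1) / limit;
    let page = distinct.into_iter().skip(offset).take(limit).collect();
    (page, total_hits, total_pages)
}

fn main() {
    let docs = [
        Doc { id: 1, color: "red" },
        Doc { id: 2, color: "red" },
        Doc { id: 3, color: "blue" },
        Doc { id: 4, color: "green" },
    ];
    // Three distinct colors: totalHits = 3 and totalPages = 2 for limit = 2,
    // even though the first page only contains two hits.
    let (page, hits, pages) = paginate_distinct(&docs, 0, 2);
    println!("page={page:?} totalHits={hits} totalPages={pages}");
}
```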

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
2023-10-19 14:00:53 +00:00
18bbadf645 Add explanatory comment 2023-10-19 15:58:25 +02:00
aa2cd52797 add/update tests when search with distinct attribute & pagination with no ranking 2023-10-19 16:50:14 +05:30
460e61b853 compute all candidates correctly when skipping 2023-10-19 16:48:45 +05:30
ec90946bdf Merge #4137
4137: Update version for the next release (v1.4.2) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2023-10-19 08:52:54 +00:00
58864a4dfa Update version for the next release (v1.4.2) in Cargo.toml 2023-10-19 08:50:11 +00:00
f343ef5f2f Merge #4108
4108: Fix bug where search with distinct attribute and no ranking, returns offset+limit hits r=curquiza a=vivek-26

# Pull Request

## Related issue
Fixes #4078 

## What does this PR do?
This PR:
- Fixes the bug where a search with a distinct attribute and no ranking returns `offset + limit` hits instead of at most `limit`; see the sketch after this list.
- Adds unit and integration tests.
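As a rough, hypothetical sketch of the expected behaviour (plain Rust iterators, not the actual milli bucket logic): both the skip and the limit must be applied to the *deduplicated* results, otherwise a page can carry up to `offset + limit` hits.

```rust
use std::collections::HashSet;

/// One document per distinct value, then paginate: `offset` whole results are
/// skipped and at most `limit` results are returned.
fn distinct_page<'a>(colors: &[&'a str], offset: usize, limit: usize) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    colors
        .iter()
        .copied()
        .filter(|color| seen.insert(*color)) // keep the first document per distinct value
        .skip(offset) // skip deduplicated results, not raw candidates
        .take(limit) // never return more than `limit` hits
        .collect()
}

fn main() {
    let colors = ["red", "red", "blue", "green", "green", "black"];
    // Four distinct values; offset = 1 and limit = 2 must yield exactly two hits.
    assert_eq!(distinct_page(&colors, 1, 2), vec!["blue", "green"]);
    println!("ok");
}
```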

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
2023-10-12 07:51:29 +00:00
67a678cfb6 Merge #4089
4089: Use a bufreader and bufwriter everytime there is a grenad<file> r=curquiza a=irevoire

# Pull Request
Wrap all the files we give to a grenad in a `BufReader` or `BufWriter`.

The dump import I tried in the issue went from 2h to 10 minutes on my machine.

I also ran a bunch of benchmarks on my machine, and we're faster by a few seconds everywhere, but nothing huge.

-----

The one thing I'm worried about is whether we used to take the inner file out of a grenad and then read from it right away, without seeking back to the beginning of the file or reopening it.
Since we now use a `BufReader`, such a read would return bytes one buffer later and probably corrupt whatever we were supposed to read.

From what I see, it looks like it works, but I may have missed something; I don't know much about this part of the codebase.

This issue should not arise with the `BufWriter`, though: if we're not able to write the content of the buffer, I made sure that the `BufWriter`'s `into_inner` returns an internal error.
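For reference, here is a hedged, self-contained sketch of that idea using plain `std` types (hypothetical file path, not the actual grenad-facing code): the buffered writer's `into_inner` surfaces a failed flush as an `io::Error`, and the reader side seeks back to the start before wrapping the file in a `BufReader`.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufReader, BufWriter, Read, Seek, SeekFrom, Write};

fn write_then_read_back(path: &str, payload: &[u8]) -> io::Result<Vec<u8>> {
    // Open one handle for both writing and reading (hypothetical example file).
    let file = OpenOptions::new().read(true).write(true).create(true).truncate(true).open(path)?;

    // Wrap the file we would hand to grenad in a BufWriter.
    let mut writer = BufWriter::new(file);
    writer.write_all(payload)?;

    // `into_inner` flushes the buffer; a failed flush is turned back into a plain
    // io::Error instead of being silently ignored.
    let mut file = writer.into_inner().map_err(|e| e.into_error())?;

    // The caveat from above: the cursor now sits at the end of the file, so seek
    // back to the start (or reopen the file) before reading through a BufReader.
    file.seek(SeekFrom::Start(0))?;
    let mut reader = BufReader::new(file);
    let mut contents = Vec::new();
    reader.read_to_end(&mut contents)?;
    Ok(contents)
}

fn main() -> io::Result<()> {
    let contents = write_then_read_back("/tmp/grenad-example.bin", b"some sorted entries")?;
    assert_eq!(contents, b"some sorted entries");
    Ok(())
}
```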

## Related issue
Fixes #4087


Co-authored-by: Tamo <tamo@meilisearch.com>
2023-10-11 14:27:00 +00:00
d1331d8abf add integration test for distinct search with no ranking 2023-10-11 19:12:56 +05:30
19ba129165 add unit test for distinct search with no ranking 2023-10-11 19:02:27 +05:30
d4da06ff47 fix bug where distinct search with no ranking returns offset+limit hits 2023-10-11 19:02:16 +05:30
c0f2724c2d get rids of the new introduced error code in favor of an io::Error 2023-10-10 15:12:23 +02:00
d772073dfa use a bufreader everytime there is a grenad<file> 2023-10-10 15:00:30 +02:00
8fe8ddea79 Merge #4112
4112: Update version for the next release (v1.4.1) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2023-10-10 09:05:10 +00:00
8a95bf28e5 Update version for the next release (v1.4.1) in Cargo.toml 2023-10-10 09:01:45 +00:00
91 changed files with 1560 additions and 2814 deletions


@@ -74,4 +74,4 @@ jobs:
          echo "${{ steps.file.outputs.basename }}.json has just been pushed."
          echo 'How to compare this benchmark with another one?'
          echo ' - Check the available files with: ./benchmarks/scripts/list.sh'
-         echo " - Run the following command: ./benchmaks/scripts/compare.sh <file-to-compare-with> ${{ steps.file.outputs.basename }}.json"
+         echo " - Run the following command: ./benchmaks/scipts/compare.sh <file-to-compare-with> ${{ steps.file.outputs.basename }}.json"


@@ -2,8 +2,8 @@ name: Create issue to upgrade dependencies
on:
  schedule:
-   # Run the first of the month, every 6 month
-   - cron: '0 0 1 */6 *'
+   # Run the first of the month, every 3 month
+   - cron: '0 0 1 */3 *'
  workflow_dispatch:
jobs:


@@ -57,10 +57,10 @@ jobs:
          echo "date=$commit_date" >> $GITHUB_OUTPUT
      - name: Set up QEMU
-       uses: docker/setup-qemu-action@v3
+       uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
-       uses: docker/setup-buildx-action@v3
+       uses: docker/setup-buildx-action@v2
      - name: Login to Docker Hub
        uses: docker/login-action@v2
@@ -70,7 +70,7 @@ jobs:
      - name: Docker meta
        id: meta
-       uses: docker/metadata-action@v5
+       uses: docker/metadata-action@v4
        with:
          images: getmeili/meilisearch
          # Prevent `latest` to be updated for each new tag pushed.
@@ -83,7 +83,7 @@ jobs:
            type=raw,value=latest,enable=${{ steps.check-tag-format.outputs.stable == 'true' && steps.check-tag-format.outputs.latest == 'true' }}
      - name: Build and push
-       uses: docker/build-push-action@v5
+       uses: docker/build-push-action@v4
        with:
          push: true
          platforms: linux/amd64,linux/arm64


@@ -14,7 +14,6 @@ on:
env:
  MEILI_MASTER_KEY: 'masterKey'
  MEILI_NO_ANALYTICS: 'true'
-  DISABLE_COVERAGE: 'true'
jobs:
  define-docker-image:
@@ -31,117 +30,6 @@ jobs:
          if [[ $event == 'workflow_dispatch' ]]; then
            echo "docker-image=${{ github.event.inputs.docker_image }}" >> $GITHUB_OUTPUT
          fi
- name: Docker image is ${{ steps.define-image.outputs.docker-image }}
run: echo "Docker image is ${{ steps.define-image.outputs.docker-image }}"
##########
## SDKs ##
##########
meilisearch-dotnet-tests:
needs: define-docker-image
name: .NET SDK tests
runs-on: ubuntu-latest
env:
MEILISEARCH_VERSION: ${{ needs.define-docker-image.outputs.docker-image }}
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-dotnet
- name: Setup .NET Core
uses: actions/setup-dotnet@v3
with:
dotnet-version: "6.0.x"
- name: Install dependencies
run: dotnet restore
- name: Build
run: dotnet build --configuration Release --no-restore
- name: Meilisearch (latest version) setup with Docker
run: docker compose up -d
- name: Run tests
run: dotnet test --no-restore --verbosity normal
meilisearch-dart-tests:
needs: define-docker-image
name: Dart SDK tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-dart
- uses: dart-lang/setup-dart@v1
with:
sdk: 3.1.1
- name: Install dependencies
run: dart pub get
- name: Run integration tests
run: dart test --concurrency=4
meilisearch-go-tests:
needs: define-docker-image
name: Go SDK tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- name: Set up Go
uses: actions/setup-go@v4
with:
go-version: stable
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-go
- name: Get dependencies
run: |
go get -v -t -d ./...
if [ -f Gopkg.toml ]; then
curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh
dep ensure
fi
- name: Run integration tests
run: go test -v ./...
meilisearch-java-tests:
needs: define-docker-image
name: Java SDK tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-java
- name: Set up Java
uses: actions/setup-java@v3
with:
java-version: 8
distribution: 'zulu'
cache: gradle
- name: Grant execute permission for gradlew
run: chmod +x gradlew
- name: Build and run unit and integration tests
run: ./gradlew build integrationTest
  meilisearch-js-tests:
    needs: define-docker-image
@@ -178,6 +66,33 @@ jobs:
      - name: Run Browser env
        run: yarn test:env:browser
instant-meilisearch-tests:
needs: define-docker-image
name: instant-meilisearch tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/instant-meilisearch
- name: Setup node
uses: actions/setup-node@v3
with:
cache: yarn
- name: Install dependencies
run: yarn install
- name: Run tests
run: yarn test
- name: Build all the playgrounds and the packages
run: yarn build
  meilisearch-php-tests:
    needs: define-docker-image
    name: PHP SDK tests
@@ -196,6 +111,8 @@ jobs:
          repository: meilisearch/meilisearch-php
      - name: Install PHP
        uses: shivammathur/setup-php@v2
+       with:
+         coverage: none
      - name: Validate composer.json and composer.lock
        run: composer validate
      - name: Install dependencies
@@ -232,6 +149,36 @@ jobs:
      - name: Test with pytest
        run: pipenv run pytest
meilisearch-go-tests:
needs: define-docker-image
name: Go SDK tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- name: Set up Go
uses: actions/setup-go@v4
with:
go-version: stable
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-go
- name: Get dependencies
run: |
go get -v -t -d ./...
if [ -f Gopkg.toml ]; then
curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh
dep ensure
fi
- name: Run integration tests
run: go test -v ./...
  meilisearch-ruby-tests:
    needs: define-docker-image
    name: Ruby SDK tests
@@ -277,110 +224,3 @@ jobs:
        run: cargo build --verbose
      - name: Run tests
        run: cargo test --verbose
meilisearch-swift-tests:
needs: define-docker-image
name: Swift SDK tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-swift
- name: Run tests
run: swift test
########################
## FRONT-END PLUGINS ##
########################
meilisearch-js-plugins-tests:
needs: define-docker-image
name: meilisearch-js-plugins tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-js-plugins
- name: Setup node
uses: actions/setup-node@v3
with:
cache: yarn
- name: Install dependencies
run: yarn install
- name: Run tests
run: yarn test
- name: Build all the playgrounds and the packages
run: yarn build
########################
## BACK-END PLUGINS ###
########################
meilisearch-rails-tests:
needs: define-docker-image
name: meilisearch-rails tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-rails
- name: Set up Ruby 3
uses: ruby/setup-ruby@v1
with:
ruby-version: 3
bundler-cache: true
- name: Run tests
run: bundle exec rspec
meilisearch-symfony-tests:
needs: define-docker-image
name: meilisearch-symfony tests
runs-on: ubuntu-latest
services:
meilisearch:
image: getmeili/meilisearch:${{ needs.define-docker-image.outputs.docker-image }}
env:
MEILI_MASTER_KEY: ${{ env.MEILI_MASTER_KEY }}
MEILI_NO_ANALYTICS: ${{ env.MEILI_NO_ANALYTICS }}
ports:
- '7700:7700'
steps:
- uses: actions/checkout@v3
with:
repository: meilisearch/meilisearch-symfony
- name: Install PHP
uses: shivammathur/setup-php@v2
with:
tools: composer:v2, flex
- name: Validate composer.json and composer.lock
run: composer validate
- name: Install dependencies
run: composer install --prefer-dist --no-progress --quiet
- name: Remove doctrine/annotations
run: composer remove --dev doctrine/annotations
- name: Run test suite
run: composer test:unit


@@ -43,7 +43,7 @@ jobs:
          toolchain: nightly
          override: true
      - name: Cache dependencies
-       uses: Swatinem/rust-cache@v2.6.2
+       uses: Swatinem/rust-cache@v2.5.1
      - name: Run cargo check without any default features
        uses: actions-rs/cargo@v1
        with:
@@ -65,7 +65,7 @@ jobs:
    steps:
      - uses: actions/checkout@v3
      - name: Cache dependencies
-       uses: Swatinem/rust-cache@v2.6.2
+       uses: Swatinem/rust-cache@v2.5.1
      - name: Run cargo check without any default features
        uses: actions-rs/cargo@v1
        with:
@@ -123,10 +123,7 @@ jobs:
          override: true
      - name: Run cargo tree without default features and check lindera is not present
        run: |
-         if cargo tree -f '{p} {f}' -e normal --no-default-features | grep -vqz lindera; then
-           echo "lindera has been found in the sources and it shouldn't"
-           exit 1
-         fi
+         cargo tree -f '{p} {f}' -e normal --no-default-features | grep lindera -vqz
      - name: Run cargo tree with default features and check lindera is pressent
        run: |
          cargo tree -f '{p} {f}' -e normal | grep lindera -qz
@@ -149,7 +146,7 @@ jobs:
          toolchain: stable
          override: true
      - name: Cache dependencies
-       uses: Swatinem/rust-cache@v2.6.2
+       uses: Swatinem/rust-cache@v2.5.1
      - name: Run tests in debug
        uses: actions-rs/cargo@v1
        with:
@@ -168,7 +165,7 @@ jobs:
          override: true
          components: clippy
      - name: Cache dependencies
-       uses: Swatinem/rust-cache@v2.6.2
+       uses: Swatinem/rust-cache@v2.5.1
      - name: Run cargo clippy
        uses: actions-rs/cargo@v1
        with:
@@ -187,7 +184,7 @@ jobs:
          override: true
          components: rustfmt
      - name: Cache dependencies
-       uses: Swatinem/rust-cache@v2.6.2
+       uses: Swatinem/rust-cache@v2.5.1
      - name: Run cargo fmt
        # Since we never ran the `build.rs` script in the benchmark directory we are missing one auto-generated import file.
        # Since we want to trigger (and fail) this action as fast as possible, instead of building the benchmark crate


@ -1,84 +0,0 @@
name: Benchmarks (PR)
on: issue_comment
permissions:
issues: write
env:
GH_TOKEN: ${{ secrets.MEILI_BOT_GH_PAT }}
jobs:
run-benchmarks-on-comment:
name: Run and upload benchmarks
runs-on: benchmarks
timeout-minutes: 4320 # 72h
steps:
- uses: actions/checkout@v3
- uses: actions-rs/toolchain@v1
with:
profile: minimal
toolchain: stable
override: true
- name: Check for Command
id: command
uses: xt0rted/slash-command-action@v2
with:
command: benchmark
reaction-type: "eyes"
repo-token: ${{ env.GH_TOKEN }}
# Set variables
- name: Set current branch name
shell: bash
run: echo "name=$(echo ${GITHUB_REF#refs/heads/})" >> $GITHUB_OUTPUT
id: current_branch
- name: Set normalized current branch name # Replace `/` by `_` in branch name to avoid issues when pushing to S3
shell: bash
run: echo "name=$(echo ${GITHUB_REF#refs/heads/} | tr '/' '_')" >> $GITHUB_OUTPUT
id: normalized_current_branch
- name: Set shorter commit SHA
shell: bash
run: echo "short=$(echo $GITHUB_SHA | cut -c1-8)" >> $GITHUB_OUTPUT
id: commit_sha
- name: Set file basename with format "dataset_branch_commitSHA"
shell: bash
run: echo "basename=$(echo ${{ steps.command.outputs.command-arguments }}_${{ steps.normalized_current_branch.outputs.name }}_${{ steps.commit_sha.outputs.short }})" >> $GITHUB_OUTPUT
id: file
# Run benchmarks
- name: Run benchmarks - Dataset ${{ steps.command.outputs.command-arguments }} - Branch ${{ steps.current_branch.outputs.name }} - Commit ${{ steps.commit_sha.outputs.short }}
run: |
cd benchmarks
cargo bench --bench ${{ steps.command.outputs.command-arguments }} -- --save-baseline ${{ steps.file.outputs.basename }}
# Generate critcmp files
- name: Install critcmp
uses: taiki-e/install-action@v2
with:
tool: critcmp
- name: Export cripcmp file
run: |
critcmp --export ${{ steps.file.outputs.basename }} > ${{ steps.file.outputs.basename }}.json
# Upload benchmarks
- name: Upload ${{ steps.file.outputs.basename }}.json to DO Spaces # DigitalOcean Spaces = S3
uses: BetaHuhn/do-spaces-action@v2
with:
access_key: ${{ secrets.DO_SPACES_ACCESS_KEY }}
secret_key: ${{ secrets.DO_SPACES_SECRET_KEY }}
space_name: ${{ secrets.DO_SPACES_SPACE_NAME }}
space_region: ${{ secrets.DO_SPACES_SPACE_REGION }}
source: ${{ steps.file.outputs.basename }}.json
out_dir: critcmp_results
# Compute the diff of the benchmarks and send a message on the GitHub PR
- name: Compute and send a message in the PR
env:
GITHUB_TOKEN: ${{ secrets.MEILI_BOT_GH_PAT }}
run: |
export base=$(git log --pretty=%p -n 1)
echo 'Here are your benchmarks diff 👊' >> body.txt
echo '```' >> body.txt
./benchmarks/scripts/compare.sh $base ${{ steps.file.outputs.basename }}.json >> body.txt
echo '```' >> body.txt
gh pr comment ${GITHUB_REF#refs/heads/} --body-file body.txt

Cargo.lock (generated)

@ -468,7 +468,7 @@ checksum = "8c3c1a368f70d6cf7302d78f8f7093da241fb8e8807c05cc9e51a125895a6d5b"
[[package]] [[package]]
name = "benchmarks" name = "benchmarks"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"bytes", "bytes",
@ -1206,7 +1206,7 @@ dependencies = [
[[package]] [[package]]
name = "dump" name = "dump"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"big_s", "big_s",
@ -1417,7 +1417,7 @@ dependencies = [
[[package]] [[package]]
name = "file-store" name = "file-store"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"faux", "faux",
"tempfile", "tempfile",
@ -1439,7 +1439,7 @@ dependencies = [
[[package]] [[package]]
name = "filter-parser" name = "filter-parser"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"insta", "insta",
"nom", "nom",
@ -1459,7 +1459,7 @@ dependencies = [
[[package]] [[package]]
name = "flatten-serde-json" name = "flatten-serde-json"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"criterion", "criterion",
"serde_json", "serde_json",
@ -1577,7 +1577,7 @@ dependencies = [
[[package]] [[package]]
name = "fuzzers" name = "fuzzers"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"arbitrary", "arbitrary",
"clap", "clap",
@ -1891,7 +1891,7 @@ dependencies = [
[[package]] [[package]]
name = "index-scheduler" name = "index-scheduler"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"big_s", "big_s",
@ -2088,7 +2088,7 @@ dependencies = [
[[package]] [[package]]
name = "json-depth-checker" name = "json-depth-checker"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"criterion", "criterion",
"serde_json", "serde_json",
@ -2500,7 +2500,7 @@ checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771"
[[package]] [[package]]
name = "meili-snap" name = "meili-snap"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"insta", "insta",
"md5", "md5",
@ -2509,7 +2509,7 @@ dependencies = [
[[package]] [[package]]
name = "meilisearch" name = "meilisearch"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"actix-cors", "actix-cors",
"actix-http", "actix-http",
@ -2600,7 +2600,7 @@ dependencies = [
[[package]] [[package]]
name = "meilisearch-auth" name = "meilisearch-auth"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"base64 0.21.2", "base64 0.21.2",
"enum-iterator", "enum-iterator",
@ -2619,7 +2619,7 @@ dependencies = [
[[package]] [[package]]
name = "meilisearch-types" name = "meilisearch-types"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"actix-web", "actix-web",
"anyhow", "anyhow",
@ -2673,7 +2673,7 @@ dependencies = [
[[package]] [[package]]
name = "milli" name = "milli"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"big_s", "big_s",
"bimap", "bimap",
@ -2704,7 +2704,6 @@ dependencies = [
"logging_timer", "logging_timer",
"maplit", "maplit",
"md5", "md5",
"meili-snap",
"memmap2", "memmap2",
"mimalloc", "mimalloc",
"obkv", "obkv",
@ -2996,7 +2995,7 @@ checksum = "9b2a4787296e9989611394c33f193f676704af1686e70b8f8033ab5ba9a35a94"
[[package]] [[package]]
name = "permissive-json-pointer" name = "permissive-json-pointer"
version = "1.4.0" version = "1.4.2"
dependencies = [ dependencies = [
"big_s", "big_s",
"serde_json", "serde_json",
@ -3539,9 +3538,9 @@ dependencies = [
[[package]] [[package]]
name = "rustls-webpki" name = "rustls-webpki"
version = "0.100.2" version = "0.100.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e98ff011474fa39949b7e5c0428f9b4937eda7da7848bbb947786b7be0b27dab" checksum = "d6207cd5ed3d8dca7816f8f3725513a34609c0c765bf652b8c3cb4cfd87db46b"
dependencies = [ dependencies = [
"ring", "ring",
"untrusted", "untrusted",
@ -4249,7 +4248,7 @@ dependencies = [
"log", "log",
"once_cell", "once_cell",
"rustls 0.21.6", "rustls 0.21.6",
"rustls-webpki 0.100.2", "rustls-webpki 0.100.1",
"url", "url",
"webpki-roots 0.23.1", "webpki-roots 0.23.1",
] ]
@ -4444,9 +4443,9 @@ dependencies = [
[[package]] [[package]]
name = "webpki" name = "webpki"
version = "0.22.1" version = "0.22.0"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0e74f82d49d545ad128049b7e88f6576df2da6b02e9ce565c6f533be576957e" checksum = "f095d78192e208183081cc07bc5515ef55216397af48b873e5edcd72637fa1bd"
dependencies = [ dependencies = [
"ring", "ring",
"untrusted", "untrusted",
@ -4467,7 +4466,7 @@ version = "0.23.1"
source = "registry+https://github.com/rust-lang/crates.io-index" source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b03058f88386e5ff5310d9111d53f48b17d732b401aeb83a8d5190f2ac459338" checksum = "b03058f88386e5ff5310d9111d53f48b17d732b401aeb83a8d5190f2ac459338"
dependencies = [ dependencies = [
"rustls-webpki 0.100.2", "rustls-webpki 0.100.1",
] ]
[[package]] [[package]]


@@ -18,7 +18,7 @@ members = [
]
[workspace.package]
- version = "1.4.0"
+ version = "1.4.2"
authors = ["Quentin de Quelen <quentin@dequelen.me>", "Clément Renault <clement@meilisearch.com>"]
description = "Meilisearch HTTP server"
homepage = "https://meilisearch.com"


@@ -129,9 +129,6 @@ impl HeedAuthStore {
                Action::DumpsAll => {
                    actions.insert(Action::DumpsCreate);
                }
-               Action::SnapshotsAll => {
-                   actions.insert(Action::SnapshotsCreate);
-               }
                Action::TasksAll => {
                    actions.extend([Action::TasksGet, Action::TasksDelete, Action::TasksCancel]);
                }


@@ -257,12 +257,6 @@ pub enum Action {
    #[serde(rename = "dumps.create")]
    #[deserr(rename = "dumps.create")]
    DumpsCreate,
-   #[serde(rename = "snapshots.*")]
-   #[deserr(rename = "snapshots.*")]
-   SnapshotsAll,
-   #[serde(rename = "snapshots.create")]
-   #[deserr(rename = "snapshots.create")]
-   SnapshotsCreate,
    #[serde(rename = "version")]
    #[deserr(rename = "version")]
    Version,
@@ -315,7 +309,6 @@ impl Action {
            METRICS_GET => Some(Self::MetricsGet),
            DUMPS_ALL => Some(Self::DumpsAll),
            DUMPS_CREATE => Some(Self::DumpsCreate),
-           SNAPSHOTS_CREATE => Some(Self::SnapshotsCreate),
            VERSION => Some(Self::Version),
            KEYS_CREATE => Some(Self::KeysAdd),
            KEYS_GET => Some(Self::KeysGet),
@@ -360,7 +353,6 @@ pub mod actions {
    pub const METRICS_GET: u8 = MetricsGet.repr();
    pub const DUMPS_ALL: u8 = DumpsAll.repr();
    pub const DUMPS_CREATE: u8 = DumpsCreate.repr();
-   pub const SNAPSHOTS_CREATE: u8 = SnapshotsCreate.repr();
    pub const VERSION: u8 = Version.repr();
    pub const KEYS_CREATE: u8 = KeysAdd.repr();
    pub const KEYS_GET: u8 = KeysGet.repr();


@@ -1,5 +1,6 @@
mod mock_analytics;
-#[cfg(feature = "analytics")]
+// if we are in release mode and the feature analytics was enabled
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
mod segment_analytics;
use std::fs;
@@ -16,25 +17,26 @@ use serde_json::Value;
use crate::routes::indexes::documents::UpdateDocumentsQuery;
use crate::routes::tasks::TasksFilterQuery;
-// if the analytics feature is disabled
+// if we are in debug mode OR the analytics feature is disabled
// the `SegmentAnalytics` point to the mock instead of the real analytics
-#[cfg(not(feature = "analytics"))]
+#[cfg(any(debug_assertions, not(feature = "analytics")))]
pub type SegmentAnalytics = mock_analytics::MockAnalytics;
-#[cfg(not(feature = "analytics"))]
+#[cfg(any(debug_assertions, not(feature = "analytics")))]
pub type SearchAggregator = mock_analytics::SearchAggregator;
-#[cfg(not(feature = "analytics"))]
+#[cfg(any(debug_assertions, not(feature = "analytics")))]
pub type MultiSearchAggregator = mock_analytics::MultiSearchAggregator;
-#[cfg(not(feature = "analytics"))]
+#[cfg(any(debug_assertions, not(feature = "analytics")))]
pub type FacetSearchAggregator = mock_analytics::FacetSearchAggregator;
-// if the feature analytics is enabled we use the real analytics
-#[cfg(feature = "analytics")]
+// if we are in release mode and the feature analytics was enabled
+// we use the real analytics
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
pub type SegmentAnalytics = segment_analytics::SegmentAnalytics;
-#[cfg(feature = "analytics")]
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
pub type SearchAggregator = segment_analytics::SearchAggregator;
-#[cfg(feature = "analytics")]
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
pub type MultiSearchAggregator = segment_analytics::MultiSearchAggregator;
-#[cfg(feature = "analytics")]
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
pub type FacetSearchAggregator = segment_analytics::FacetSearchAggregator;
/// The Meilisearch config dir:

File diff suppressed because it is too large.


@@ -28,7 +28,7 @@ const MEILI_DB_PATH: &str = "MEILI_DB_PATH";
const MEILI_HTTP_ADDR: &str = "MEILI_HTTP_ADDR";
const MEILI_MASTER_KEY: &str = "MEILI_MASTER_KEY";
const MEILI_ENV: &str = "MEILI_ENV";
-#[cfg(feature = "analytics")]
+#[cfg(all(not(debug_assertions), feature = "analytics"))]
const MEILI_NO_ANALYTICS: &str = "MEILI_NO_ANALYTICS";
const MEILI_HTTP_PAYLOAD_SIZE_LIMIT: &str = "MEILI_HTTP_PAYLOAD_SIZE_LIMIT";
const MEILI_SSL_CERT_PATH: &str = "MEILI_SSL_CERT_PATH";
@@ -159,7 +159,7 @@ pub struct Opt {
    /// Meilisearch automatically collects data from all instances that do not opt out using this flag.
    /// All gathered data is used solely for the purpose of improving Meilisearch, and can be deleted
    /// at any time.
-   #[cfg(feature = "analytics")]
+   #[cfg(all(not(debug_assertions), feature = "analytics"))]
    #[serde(default)] // we can't send true
    #[clap(long, env = MEILI_NO_ANALYTICS)]
    pub no_analytics: bool,
@@ -390,7 +390,7 @@ impl Opt {
            ignore_missing_dump: _,
            ignore_dump_if_db_exists: _,
            config_file_path: _,
-           #[cfg(feature = "analytics")]
+           #[cfg(all(not(debug_assertions), feature = "analytics"))]
            no_analytics,
            experimental_enable_metrics: enable_metrics_route,
            experimental_reduce_indexing_memory_usage: reduce_indexing_memory_usage,
@@ -401,7 +401,7 @@ impl Opt {
            export_to_env_if_not_present(MEILI_MASTER_KEY, master_key);
        }
        export_to_env_if_not_present(MEILI_ENV, env);
-       #[cfg(feature = "analytics")]
+       #[cfg(all(not(debug_assertions), feature = "analytics"))]
        {
            export_to_env_if_not_present(MEILI_NO_ANALYTICS, no_analytics.to_string());
        }


@@ -24,7 +24,6 @@ pub mod features;
pub mod indexes;
mod metrics;
mod multi_search;
-mod snapshot;
mod swap_indexes;
pub mod tasks;
@@ -33,7 +32,6 @@ pub fn configure(cfg: &mut web::ServiceConfig) {
        .service(web::resource("/health").route(web::get().to(get_health)))
        .service(web::scope("/keys").configure(api_key::configure))
        .service(web::scope("/dumps").configure(dump::configure))
-       .service(web::scope("/snapshots").configure(snapshot::configure))
        .service(web::resource("/stats").route(web::get().to(get_stats)))
        .service(web::resource("/version").route(web::get().to(get_version)))
        .service(web::scope("/indexes").configure(indexes::configure))


@ -1,32 +0,0 @@
use actix_web::web::Data;
use actix_web::{web, HttpRequest, HttpResponse};
use index_scheduler::IndexScheduler;
use log::debug;
use meilisearch_types::error::ResponseError;
use meilisearch_types::tasks::KindWithContent;
use serde_json::json;
use crate::analytics::Analytics;
use crate::extractors::authentication::policies::*;
use crate::extractors::authentication::GuardedData;
use crate::extractors::sequential_extractor::SeqHandler;
use crate::routes::SummarizedTaskView;
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(web::resource("").route(web::post().to(SeqHandler(create_snapshot))));
}
pub async fn create_snapshot(
index_scheduler: GuardedData<ActionPolicy<{ actions::SNAPSHOTS_CREATE }>, Data<IndexScheduler>>,
req: HttpRequest,
analytics: web::Data<dyn Analytics>,
) -> Result<HttpResponse, ResponseError> {
analytics.publish("Snapshot Created".to_string(), json!({}), Some(&req));
let task = KindWithContent::SnapshotCreation;
let task: SummarizedTaskView =
tokio::task::spawn_blocking(move || index_scheduler.register(task)).await??.into();
debug!("returns: {:?}", task);
Ok(HttpResponse::Accepted().json(task))
}


@ -1,7 +1,8 @@
use std::{thread, time}; use std::{thread, time};
use crate::common::{Server, Value}; use serde_json::{json, Value};
use crate::json;
use crate::common::Server;
#[actix_rt::test] #[actix_rt::test]
async fn add_valid_api_key() { async fn add_valid_api_key() {
@ -161,7 +162,7 @@ async fn add_valid_api_key_null_description() {
server.use_api_key("MASTER_KEY"); server.use_api_key("MASTER_KEY");
let content = json!({ let content = json!({
"description": json!(null), "description": Value::Null,
"indexes": ["products"], "indexes": ["products"],
"actions": ["documents.add"], "actions": ["documents.add"],
"expiresAt": "2050-11-13T00:00:00" "expiresAt": "2050-11-13T00:00:00"
@ -364,7 +365,7 @@ async fn error_add_api_key_invalid_index_uids() {
server.use_api_key("MASTER_KEY"); server.use_api_key("MASTER_KEY");
let content = json!({ let content = json!({
"description": json!(null), "description": Value::Null,
"indexes": ["invalid index # / \\name with spaces"], "indexes": ["invalid index # / \\name with spaces"],
"actions": [ "actions": [
"documents.add" "documents.add"
@ -421,7 +422,7 @@ async fn error_add_api_key_invalid_parameters_actions() {
meili_snap::snapshot!(code, @"400 Bad Request"); meili_snap::snapshot!(code, @"400 Bad Request");
meili_snap::snapshot!(meili_snap::json_string!(response, { ".createdAt" => "[ignored]", ".updatedAt" => "[ignored]" }), @r###" meili_snap::snapshot!(meili_snap::json_string!(response, { ".createdAt" => "[ignored]", ".updatedAt" => "[ignored]" }), @r###"
{ {
"message": "Unknown value `doc.add` at `.actions[0]`: expected one of `*`, `search`, `documents.*`, `documents.add`, `documents.get`, `documents.delete`, `indexes.*`, `indexes.create`, `indexes.get`, `indexes.update`, `indexes.delete`, `indexes.swap`, `tasks.*`, `tasks.cancel`, `tasks.delete`, `tasks.get`, `settings.*`, `settings.get`, `settings.update`, `stats.*`, `stats.get`, `metrics.*`, `metrics.get`, `dumps.*`, `dumps.create`, `snapshots.*`, `snapshots.create`, `version`, `keys.create`, `keys.get`, `keys.update`, `keys.delete`, `experimental.get`, `experimental.update`", "message": "Unknown value `doc.add` at `.actions[0]`: expected one of `*`, `search`, `documents.*`, `documents.add`, `documents.get`, `documents.delete`, `indexes.*`, `indexes.create`, `indexes.get`, `indexes.update`, `indexes.delete`, `indexes.swap`, `tasks.*`, `tasks.cancel`, `tasks.delete`, `tasks.get`, `settings.*`, `settings.get`, `settings.update`, `stats.*`, `stats.get`, `metrics.*`, `metrics.get`, `dumps.*`, `dumps.create`, `version`, `keys.create`, `keys.get`, `keys.update`, `keys.delete`, `experimental.get`, `experimental.update`",
"code": "invalid_api_key_actions", "code": "invalid_api_key_actions",
"type": "invalid_request", "type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_api_key_actions" "link": "https://docs.meilisearch.com/errors#invalid_api_key_actions"
@ -506,7 +507,7 @@ async fn error_add_api_key_invalid_parameters_uid() {
async fn error_add_api_key_parameters_uid_already_exist() { async fn error_add_api_key_parameters_uid_already_exist() {
let mut server = Server::new_auth().await; let mut server = Server::new_auth().await;
server.use_api_key("MASTER_KEY"); server.use_api_key("MASTER_KEY");
let content: Value = json!({ let content = json!({
"uid": "4bc0887a-0e41-4f3b-935d-0c451dcee9c8", "uid": "4bc0887a-0e41-4f3b-935d-0c451dcee9c8",
"indexes": ["products"], "indexes": ["products"],
"actions": ["search"], "actions": ["search"],
@ -1145,7 +1146,7 @@ async fn patch_api_key_description() {
meili_snap::snapshot!(code, @"200 OK"); meili_snap::snapshot!(code, @"200 OK");
// Remove the description // Remove the description
let content = json!({ "description": null }); let content = json!({ "description": serde_json::Value::Null });
let (response, code) = server.patch_api_key(&uid, content).await; let (response, code) = server.patch_api_key(&uid, content).await;
meili_snap::snapshot!(meili_snap::json_string!(response, { ".createdAt" => "[ignored]", ".updatedAt" => "[ignored]", ".uid" => "[ignored]", ".key" => "[ignored]" }), @r###" meili_snap::snapshot!(meili_snap::json_string!(response, { ".createdAt" => "[ignored]", ".updatedAt" => "[ignored]", ".uid" => "[ignored]", ".key" => "[ignored]" }), @r###"


@ -3,10 +3,10 @@ use std::collections::{HashMap, HashSet};
use ::time::format_description::well_known::Rfc3339; use ::time::format_description::well_known::Rfc3339;
use maplit::{hashmap, hashset}; use maplit::{hashmap, hashset};
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use time::{Duration, OffsetDateTime}; use time::{Duration, OffsetDateTime};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
pub static AUTHORIZATIONS: Lazy<HashMap<(&'static str, &'static str), HashSet<&'static str>>> = pub static AUTHORIZATIONS: Lazy<HashMap<(&'static str, &'static str), HashSet<&'static str>>> =
Lazy::new(|| { Lazy::new(|| {
@ -54,7 +54,6 @@ pub static AUTHORIZATIONS: Lazy<HashMap<(&'static str, &'static str), HashSet<&'
("GET", "/indexes/products/stats") => hashset!{"stats.get", "stats.*", "*"}, ("GET", "/indexes/products/stats") => hashset!{"stats.get", "stats.*", "*"},
("GET", "/stats") => hashset!{"stats.get", "stats.*", "*"}, ("GET", "/stats") => hashset!{"stats.get", "stats.*", "*"},
("POST", "/dumps") => hashset!{"dumps.create", "dumps.*", "*"}, ("POST", "/dumps") => hashset!{"dumps.create", "dumps.*", "*"},
("POST", "/snapshots") => hashset!{"snapshots.create", "snapshots.*", "*"},
("GET", "/version") => hashset!{"version", "*"}, ("GET", "/version") => hashset!{"version", "*"},
("GET", "/metrics") => hashset!{"metrics.get", "metrics.*", "*"}, ("GET", "/metrics") => hashset!{"metrics.get", "metrics.*", "*"},
("PATCH", "/keys/mykey/") => hashset!{"keys.update", "*"}, ("PATCH", "/keys/mykey/") => hashset!{"keys.update", "*"},


@ -1,8 +1,8 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use uuid::Uuid; use uuid::Uuid;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn create_api_key_bad_description() { async fn create_api_key_bad_description() {
@ -90,7 +90,7 @@ async fn create_api_key_bad_actions() {
snapshot!(code, @"400 Bad Request"); snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###" snapshot!(json_string!(response), @r###"
{ {
"message": "Unknown value `doggo` at `.actions[0]`: expected one of `*`, `search`, `documents.*`, `documents.add`, `documents.get`, `documents.delete`, `indexes.*`, `indexes.create`, `indexes.get`, `indexes.update`, `indexes.delete`, `indexes.swap`, `tasks.*`, `tasks.cancel`, `tasks.delete`, `tasks.get`, `settings.*`, `settings.get`, `settings.update`, `stats.*`, `stats.get`, `metrics.*`, `metrics.get`, `dumps.*`, `dumps.create`, `snapshots.*`, `snapshots.create`, `version`, `keys.create`, `keys.get`, `keys.update`, `keys.delete`, `experimental.get`, `experimental.update`", "message": "Unknown value `doggo` at `.actions[0]`: expected one of `*`, `search`, `documents.*`, `documents.add`, `documents.get`, `documents.delete`, `indexes.*`, `indexes.create`, `indexes.get`, `indexes.update`, `indexes.delete`, `indexes.swap`, `tasks.*`, `tasks.cancel`, `tasks.delete`, `tasks.get`, `settings.*`, `settings.get`, `settings.update`, `stats.*`, `stats.get`, `metrics.*`, `metrics.get`, `dumps.*`, `dumps.create`, `version`, `keys.create`, `keys.get`, `keys.update`, `keys.delete`, `experimental.get`, `experimental.update`",
"code": "invalid_api_key_actions", "code": "invalid_api_key_actions",
"type": "invalid_request", "type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_api_key_actions" "link": "https://docs.meilisearch.com/errors#invalid_api_key_actions"


@ -7,9 +7,9 @@ mod tenant_token;
mod tenant_token_multi_search; mod tenant_token_multi_search;
use actix_web::http::StatusCode; use actix_web::http::StatusCode;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
impl Server { impl Server {
pub fn use_api_key(&mut self, api_key: impl AsRef<str>) { pub fn use_api_key(&mut self, api_key: impl AsRef<str>) {


@ -3,11 +3,11 @@ use std::collections::HashMap;
use ::time::format_description::well_known::Rfc3339; use ::time::format_description::well_known::Rfc3339;
use maplit::hashmap; use maplit::hashmap;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use time::{Duration, OffsetDateTime}; use time::{Duration, OffsetDateTime};
use super::authorization::{ALL_ACTIONS, AUTHORIZATIONS}; use super::authorization::{ALL_ACTIONS, AUTHORIZATIONS};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
fn generate_tenant_token( fn generate_tenant_token(
parent_uid: impl AsRef<str>, parent_uid: impl AsRef<str>,
@ -233,31 +233,31 @@ async fn search_authorized_simple_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": {}}), "searchRules" => json!({"*": {}}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["*"]), "searchRules" => json!(["*"]),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": {}}), "searchRules" => json!({"sales": {}}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sales"]), "searchRules" => json!(["sales"]),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sa*"]), "searchRules" => json!(["sa*"]),
"exp" => json!(null) "exp" => Value::Null
}, },
]; ];
@ -386,7 +386,7 @@ async fn error_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -398,7 +398,7 @@ async fn error_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -428,15 +428,15 @@ async fn error_search_forbidden_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": {}}), "searchRules" => json!({"products": {}}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": null}), "searchRules" => json!({"products": Value::Null}),
"exp" => json!(null) "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["products"]), "searchRules" => json!(["products"]),
"exp" => json!(null) "exp" => Value::Null
}, },
// expired token // expired token
hashmap! { hashmap! {
@ -444,7 +444,7 @@ async fn error_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -456,7 +456,7 @@ async fn error_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {


@ -3,11 +3,11 @@ use std::collections::HashMap;
use ::time::format_description::well_known::Rfc3339; use ::time::format_description::well_known::Rfc3339;
use maplit::hashmap; use maplit::hashmap;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use time::{Duration, OffsetDateTime}; use time::{Duration, OffsetDateTime};
use super::authorization::ALL_ACTIONS; use super::authorization::ALL_ACTIONS;
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
fn generate_tenant_token( fn generate_tenant_token(
parent_uid: impl AsRef<str>, parent_uid: impl AsRef<str>,
@ -512,31 +512,31 @@ async fn single_search_authorized_simple_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": {}}), "searchRules" => json!({"*": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["*"]), "searchRules" => json!(["*"]),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": {}}), "searchRules" => json!({"sales": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sales"]), "searchRules" => json!(["sales"]),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sa*"]), "searchRules" => json!(["sa*"]),
"exp" => json!(null), "exp" => Value::Null
}, },
]; ];
@ -564,31 +564,31 @@ async fn multi_search_authorized_simple_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": {}}), "searchRules" => json!({"*": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["*"]), "searchRules" => json!(["*"]),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": {}, "products": {}}), "searchRules" => json!({"sales": {}, "products": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null, "products": null}), "searchRules" => json!({"sales": Value::Null, "products": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sales", "products"]), "searchRules" => json!(["sales", "products"]),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sa*", "pro*"]), "searchRules" => json!(["sa*", "pro*"]),
"exp" => json!(null), "exp" => Value::Null
}, },
]; ];
@ -823,7 +823,7 @@ async fn error_single_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -835,7 +835,7 @@ async fn error_single_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -864,7 +864,7 @@ async fn error_multi_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -876,7 +876,7 @@ async fn error_multi_search_token_forbidden_parent_key() {
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null, "products": null}), "searchRules" => json!({"sales": Value::Null, "products": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() + Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -919,15 +919,15 @@ async fn error_single_search_forbidden_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": {}}), "searchRules" => json!({"products": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": null}), "searchRules" => json!({"products": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["products"]), "searchRules" => json!(["products"]),
"exp" => json!(null), "exp" => Value::Null
}, },
// expired token // expired token
hashmap! { hashmap! {
@ -935,7 +935,7 @@ async fn error_single_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -947,7 +947,7 @@ async fn error_single_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -978,15 +978,15 @@ async fn error_multi_search_forbidden_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": {}}), "searchRules" => json!({"products": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"products": null}), "searchRules" => json!({"products": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["products"]), "searchRules" => json!(["products"]),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": {}}), "searchRules" => json!({"sales": {}}),
@ -998,15 +998,15 @@ async fn error_multi_search_forbidden_token() {
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": {}}), "searchRules" => json!({"sales": {}}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null}), "searchRules" => json!({"sales": Value::Null}),
"exp" => json!(null), "exp" => Value::Null
}, },
hashmap! { hashmap! {
"searchRules" => json!(["sales"]), "searchRules" => json!(["sales"]),
"exp" => json!(null), "exp" => Value::Null
}, },
// expired token // expired token
hashmap! { hashmap! {
@ -1014,7 +1014,7 @@ async fn error_multi_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"*": null}), "searchRules" => json!({"*": Value::Null}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
@ -1026,7 +1026,7 @@ async fn error_multi_search_forbidden_token() {
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {
"searchRules" => json!({"sales": null, "products": {}}), "searchRules" => json!({"sales": Value::Null, "products": {}}),
"exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp()) "exp" => json!((OffsetDateTime::now_utc() - Duration::hours(1)).unix_timestamp())
}, },
hashmap! { hashmap! {

View File

@ -3,13 +3,12 @@ use std::panic::{catch_unwind, resume_unwind, UnwindSafe};
use std::time::Duration; use std::time::Duration;
use actix_web::http::StatusCode; use actix_web::http::StatusCode;
use serde_json::{json, Value};
use tokio::time::sleep; use tokio::time::sleep;
use urlencoding::encode as urlencode; use urlencoding::encode as urlencode;
use super::encoder::Encoder; use super::encoder::Encoder;
use super::service::Service; use super::service::Service;
use super::Value;
use crate::json;
pub struct Index<'a> { pub struct Index<'a> {
pub uid: String, pub uid: String,
@ -243,9 +242,7 @@ impl Index<'_> {
pub async fn delete_batch(&self, ids: Vec<u64>) -> (Value, StatusCode) { pub async fn delete_batch(&self, ids: Vec<u64>) -> (Value, StatusCode) {
let url = format!("/indexes/{}/documents/delete-batch", urlencode(self.uid.as_ref())); let url = format!("/indexes/{}/documents/delete-batch", urlencode(self.uid.as_ref()));
self.service self.service.post_encoded(url, serde_json::to_value(&ids).unwrap(), self.encoder).await
.post_encoded(url, serde_json::to_value(&ids).unwrap().into(), self.encoder)
.await
} }
pub async fn delete_batch_raw(&self, body: Value) -> (Value, StatusCode) { pub async fn delete_batch_raw(&self, body: Value) -> (Value, StatusCode) {

View File

@ -3,83 +3,9 @@ pub mod index;
pub mod server; pub mod server;
pub mod service; pub mod service;
use std::fmt::{self, Display};
pub use index::{GetAllDocumentsOptions, GetDocumentOptions}; pub use index::{GetAllDocumentsOptions, GetDocumentOptions};
use meili_snap::json_string;
use serde::{Deserialize, Serialize};
pub use server::{default_settings, Server}; pub use server::{default_settings, Server};
#[derive(Debug, Clone, Default, Serialize, Deserialize, PartialEq, Eq)]
pub struct Value(pub serde_json::Value);
impl Value {
pub fn uid(&self) -> u64 {
if let Some(uid) = self["uid"].as_u64() {
uid
} else if let Some(uid) = self["taskUid"].as_u64() {
uid
} else {
panic!("Didn't find any task id in: {self}");
}
}
}
impl From<serde_json::Value> for Value {
fn from(value: serde_json::Value) -> Self {
Value(value)
}
}
impl std::ops::Deref for Value {
type Target = serde_json::Value;
fn deref(&self) -> &Self::Target {
&self.0
}
}
impl PartialEq<serde_json::Value> for Value {
fn eq(&self, other: &serde_json::Value) -> bool {
&self.0 == other
}
}
impl PartialEq<Value> for serde_json::Value {
fn eq(&self, other: &Value) -> bool {
self == &other.0
}
}
impl PartialEq<&str> for Value {
fn eq(&self, other: &&str) -> bool {
self.0.eq(other)
}
}
impl Display for Value {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"{}",
json_string!(self, { ".enqueuedAt" => "[date]", ".processedAt" => "[date]", ".finishedAt" => "[date]", ".duration" => "[duration]" })
)
}
}
impl From<Vec<Value>> for Value {
fn from(value: Vec<Value>) -> Self {
Self(value.into_iter().map(|value| value.0).collect::<serde_json::Value>())
}
}
#[macro_export]
macro_rules! json {
($($json:tt)+) => {
$crate::common::Value(serde_json::json!($($json)+))
};
}
/// Performs a search test on both post and get routes /// Performs a search test on both post and get routes
#[macro_export] #[macro_export]
macro_rules! test_post_get_search { macro_rules! test_post_get_search {
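The `Value` newtype and the test-local `json!` macro in this hunk let the integration tests index into responses and call helpers such as `uid()` directly on them. Below is a minimal standalone sketch of the same pattern; the struct and macro are illustrative re-definitions, not the crate's actual test module.

use std::ops::Deref;

// Illustrative re-definition of the wrapper pattern shown above.
#[derive(Debug, Clone, PartialEq)]
pub struct Value(pub serde_json::Value);

impl Deref for Value {
    type Target = serde_json::Value;
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

// Wraps serde_json::json! so every literal written in a test is already a `Value`.
macro_rules! json {
    ($($json:tt)+) => {
        Value(serde_json::json!($($json)+))
    };
}

fn main() {
    let task = json!({ "taskUid": 4, "status": "enqueued" });
    // Deref lets the wrapper be indexed like a plain serde_json::Value.
    assert_eq!(task["status"], "enqueued");
    assert_eq!(task["taskUid"].as_u64(), Some(4));
}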

View File

@ -11,14 +11,13 @@ use clap::Parser;
use meilisearch::option::{IndexerOpts, MaxMemory, Opt}; use meilisearch::option::{IndexerOpts, MaxMemory, Opt};
use meilisearch::{analytics, create_app, setup_meilisearch}; use meilisearch::{analytics, create_app, setup_meilisearch};
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use tempfile::TempDir; use tempfile::TempDir;
use tokio::time::sleep; use tokio::time::sleep;
use super::index::Index; use super::index::Index;
use super::service::Service; use super::service::Service;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::Value;
use crate::json;
pub struct Server { pub struct Server {
pub service: Service, pub service: Service,
@ -157,10 +156,6 @@ impl Server {
self.service.post("/dumps", json!(null)).await self.service.post("/dumps", json!(null)).await
} }
pub async fn create_snapshot(&self) -> (Value, StatusCode) {
self.service.post("/snapshots", json!(null)).await
}
pub async fn index_swap(&self, value: Value) -> (Value, StatusCode) { pub async fn index_swap(&self, value: Value) -> (Value, StatusCode) {
self.service.post("/swap-indexes", value).await self.service.post("/swap-indexes", value).await
} }
@ -209,7 +204,7 @@ pub fn default_settings(dir: impl AsRef<Path>) -> Opt {
db_path: dir.as_ref().join("db"), db_path: dir.as_ref().join("db"),
dump_dir: dir.as_ref().join("dumps"), dump_dir: dir.as_ref().join("dumps"),
env: "development".to_owned(), env: "development".to_owned(),
#[cfg(feature = "analytics")] #[cfg(all(not(debug_assertions), feature = "analytics"))]
no_analytics: true, no_analytics: true,
max_index_size: Byte::from_unit(100.0, ByteUnit::MiB).unwrap(), max_index_size: Byte::from_unit(100.0, ByteUnit::MiB).unwrap(),
max_task_db_size: Byte::from_unit(1.0, ByteUnit::GiB).unwrap(), max_task_db_size: Byte::from_unit(1.0, ByteUnit::GiB).unwrap(),

View File

@ -7,9 +7,9 @@ use actix_web::test::TestRequest;
use index_scheduler::IndexScheduler; use index_scheduler::IndexScheduler;
use meilisearch::{analytics, create_app, Opt}; use meilisearch::{analytics, create_app, Opt};
use meilisearch_auth::AuthController; use meilisearch_auth::AuthController;
use serde_json::Value;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::Value;
pub struct Service { pub struct Service {
pub index_scheduler: Arc<IndexScheduler>, pub index_scheduler: Arc<IndexScheduler>,

View File

@ -3,8 +3,9 @@
mod common; mod common;
use actix_web::test; use actix_web::test;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
enum HttpVerb { enum HttpVerb {
Put, Put,

View File

@ -1,11 +1,11 @@
use actix_web::test; use actix_web::test;
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::{json, Value};
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::OffsetDateTime; use time::OffsetDateTime;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::{GetAllDocumentsOptions, Server, Value}; use crate::common::{GetAllDocumentsOptions, Server};
use crate::json;
/// This is the basic usage of our API and every other tests uses the content-type application/json /// This is the basic usage of our API and every other tests uses the content-type application/json
#[actix_rt::test] #[actix_rt::test]

View File

@ -1,7 +1,7 @@
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::json;
use crate::common::{GetAllDocumentsOptions, Server}; use crate::common::{GetAllDocumentsOptions, Server};
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn delete_one_document_unexisting_index() { async fn delete_one_document_unexisting_index() {

View File

@ -1,8 +1,8 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use urlencoding::encode; use urlencoding::encode;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn get_all_documents_bad_offset() { async fn get_all_documents_bad_offset() {

View File

@ -1,11 +1,11 @@
use actix_web::test; use actix_web::test;
use http::header::ACCEPT_ENCODING; use http::header::ACCEPT_ENCODING;
use meili_snap::*; use meili_snap::*;
use serde_json::{json, Value};
use urlencoding::encode as urlencode; use urlencoding::encode as urlencode;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::{GetAllDocumentsOptions, GetDocumentOptions, Server, Value}; use crate::common::{GetAllDocumentsOptions, GetDocumentOptions, Server};
use crate::json;
// TODO: partial test since we are testing error, and error is not yet fully implemented in // TODO: partial test since we are testing error, and error is not yet fully implemented in
// transplant // transplant
@ -40,7 +40,7 @@ async fn get_document() {
let server = Server::new().await; let server = Server::new().await;
let index = server.index("test"); let index = server.index("test");
index.create(None).await; index.create(None).await;
let documents = json!([ let documents = serde_json::json!([
{ {
"id": 0, "id": 0,
"nested": { "content": "foobar" }, "nested": { "content": "foobar" },
@ -53,7 +53,7 @@ async fn get_document() {
assert_eq!(code, 200); assert_eq!(code, 200);
assert_eq!( assert_eq!(
response, response,
json!({ serde_json::json!({
"id": 0, "id": 0,
"nested": { "content": "foobar" }, "nested": { "content": "foobar" },
}) })
@ -64,7 +64,7 @@ async fn get_document() {
assert_eq!(code, 200); assert_eq!(code, 200);
assert_eq!( assert_eq!(
response, response,
json!({ serde_json::json!({
"id": 0, "id": 0,
}) })
); );
@ -75,7 +75,7 @@ async fn get_document() {
assert_eq!(code, 200); assert_eq!(code, 200);
assert_eq!( assert_eq!(
response, response,
json!({ serde_json::json!({
"nested": { "content": "foobar" }, "nested": { "content": "foobar" },
}) })
); );
@ -122,7 +122,7 @@ async fn get_all_documents_no_options() {
assert_eq!(code, 200); assert_eq!(code, 200);
let arr = response["results"].as_array().unwrap(); let arr = response["results"].as_array().unwrap();
assert_eq!(arr.len(), 20); assert_eq!(arr.len(), 20);
let first = json!({ let first = serde_json::json!({
"id":0, "id":0,
"isActive":false, "isActive":false,
"balance":"$2,668.55", "balance":"$2,668.55",

View File

@ -1,8 +1,7 @@
use meili_snap::snapshot; use serde_json::json;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::{GetAllDocumentsOptions, Server}; use crate::common::{GetAllDocumentsOptions, Server};
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn error_document_update_create_index_bad_uid() { async fn error_document_update_create_index_bad_uid() {
@ -85,13 +84,7 @@ async fn update_document() {
let (response, code) = index.get_document(1, None).await; let (response, code) = index.get_document(1, None).await;
assert_eq!(code, 200); assert_eq!(code, 200);
snapshot!(response, @r###" assert_eq!(response.to_string(), r##"{"doc_id":1,"content":"foo","other":"bar"}"##);
{
"doc_id": 1,
"content": "foo",
"other": "bar"
}
"###);
} }
#[actix_rt::test] #[actix_rt::test]
@ -129,13 +122,7 @@ async fn update_document_gzip_encoded() {
let (response, code) = index.get_document(1, None).await; let (response, code) = index.get_document(1, None).await;
assert_eq!(code, 200); assert_eq!(code, 200);
snapshot!(response, @r###" assert_eq!(response.to_string(), r##"{"doc_id":1,"content":"foo","other":"bar"}"##);
{
"doc_id": 1,
"content": "foo",
"other": "bar"
}
"###);
} }
#[actix_rt::test] #[actix_rt::test]

View File

@ -2,10 +2,10 @@ mod data;
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use meilisearch::Opt; use meilisearch::Opt;
use serde_json::json;
use self::data::GetDump; use self::data::GetDump;
use crate::common::{default_settings, GetAllDocumentsOptions, Server}; use crate::common::{default_settings, GetAllDocumentsOptions, Server};
use crate::json;
// all the following test are ignored on windows. See #2364 // all the following test are ignored on windows. See #2364
#[actix_rt::test] #[actix_rt::test]

View File

@ -1,5 +1,6 @@
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
/// Feature name to test against. /// Feature name to test against.
/// This will have to be changed by a different one when that feature is stabilized. /// This will have to be changed by a different one when that feature is stabilized.

View File

@ -2,10 +2,10 @@ use actix_web::http::header::ContentType;
use actix_web::test; use actix_web::test;
use http::header::ACCEPT_ENCODING; use http::header::ACCEPT_ENCODING;
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::{json, Value};
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn create_index_no_primary_key() { async fn create_index_no_primary_key() {
@ -21,7 +21,7 @@ async fn create_index_no_primary_key() {
assert_eq!(response["status"], "succeeded"); assert_eq!(response["status"], "succeeded");
assert_eq!(response["type"], "indexCreation"); assert_eq!(response["type"], "indexCreation");
assert_eq!(response["details"]["primaryKey"], json!(null)); assert_eq!(response["details"]["primaryKey"], Value::Null);
} }
#[actix_rt::test] #[actix_rt::test]
@ -38,7 +38,7 @@ async fn create_index_with_gzip_encoded_request() {
assert_eq!(response["status"], "succeeded"); assert_eq!(response["status"], "succeeded");
assert_eq!(response["type"], "indexCreation"); assert_eq!(response["type"], "indexCreation");
assert_eq!(response["details"]["primaryKey"], json!(null)); assert_eq!(response["details"]["primaryKey"], Value::Null);
} }
#[actix_rt::test] #[actix_rt::test]
@ -86,7 +86,7 @@ async fn create_index_with_zlib_encoded_request() {
assert_eq!(response["status"], "succeeded"); assert_eq!(response["status"], "succeeded");
assert_eq!(response["type"], "indexCreation"); assert_eq!(response["type"], "indexCreation");
assert_eq!(response["details"]["primaryKey"], json!(null)); assert_eq!(response["details"]["primaryKey"], Value::Null);
} }
#[actix_rt::test] #[actix_rt::test]
@ -103,7 +103,7 @@ async fn create_index_with_brotli_encoded_request() {
assert_eq!(response["status"], "succeeded"); assert_eq!(response["status"], "succeeded");
assert_eq!(response["type"], "indexCreation"); assert_eq!(response["type"], "indexCreation");
assert_eq!(response["details"]["primaryKey"], json!(null)); assert_eq!(response["details"]["primaryKey"], Value::Null);
} }
#[actix_rt::test] #[actix_rt::test]
@ -136,7 +136,7 @@ async fn create_index_with_invalid_primary_key() {
let (response, code) = index.get().await; let (response, code) = index.get().await;
assert_eq!(code, 200); assert_eq!(code, 200);
assert_eq!(response["primaryKey"], json!(null)); assert_eq!(response["primaryKey"], Value::Null);
} }
#[actix_rt::test] #[actix_rt::test]

View File

@ -1,5 +1,6 @@
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn create_and_delete_index() { async fn create_and_delete_index() {

View File

@ -1,7 +1,7 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn get_indexes_bad_offset() { async fn get_indexes_bad_offset() {

View File

@ -1,5 +1,6 @@
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn stats() { async fn stats() {

View File

@ -1,9 +1,9 @@
use serde_json::json;
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::OffsetDateTime; use time::OffsetDateTime;
use crate::common::encoder::Encoder; use crate::common::encoder::Encoder;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn update_primary_key() { async fn update_primary_key() {

View File

@ -0,0 +1,241 @@
use meili_snap::snapshot;
use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::Server;
pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| {
json!([
{
"id": 1,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Brown"
},
{
"id": 2,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Black"
},
{
"id": 3,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Blue"
},
{
"id": 4,
"description": "T-Shirt",
"brand": "Nike",
"product_id": "789012",
"color": "Red"
},
{
"id": 5,
"description": "T-Shirt",
"brand": "Nike",
"product_id": "789012",
"color": "Blue"
},
{
"id": 6,
"description": "Running Shoes",
"brand": "Adidas",
"product_id": "456789",
"color": "Black"
},
{
"id": 7,
"description": "Running Shoes",
"brand": "Adidas",
"product_id": "456789",
"color": "White"
},
{
"id": 8,
"description": "Hoodie",
"brand": "Puma",
"product_id": "987654",
"color": "Gray"
},
{
"id": 9,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Green"
},
{
"id": 10,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Red"
},
{
"id": 11,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Blue"
},
{
"id": 12,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Indigo"
},
{
"id": 13,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Black"
},
{
"id": 14,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Stone Wash"
}
])
});
pub(self) static DOCUMENT_PRIMARY_KEY: &str = "id";
pub(self) static DOCUMENT_DISTINCT_KEY: &str = "product_id";
/// testing: https://github.com/meilisearch/meilisearch/issues/4078
#[actix_rt::test]
async fn distinct_search_with_offset_no_ranking() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, Some(DOCUMENT_PRIMARY_KEY)).await;
index.update_distinct_attribute(json!(DOCUMENT_DISTINCT_KEY)).await;
index.wait_task(1).await;
fn get_hits(response: &Value) -> Vec<&str> {
let hits_array = response["hits"].as_array().unwrap();
hits_array.iter().map(|h| h[DOCUMENT_DISTINCT_KEY].as_str().unwrap()).collect::<Vec<_>>()
}
let (response, code) = index.search_post(json!({"offset": 0, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["123456", "789012"]"#);
snapshot!(response["estimatedTotalHits"] , @"11");
let (response, code) = index.search_post(json!({"offset": 2, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["456789", "987654"]"#);
snapshot!(response["estimatedTotalHits"], @"10");
let (response, code) = index.search_post(json!({"offset": 4, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["234567", "345678"]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 5, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"1");
snapshot!(format!("{:?}", hits), @r#"["345678"]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 6, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 7, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["estimatedTotalHits"], @"6");
}
/// testing: https://github.com/meilisearch/meilisearch/issues/4130
#[actix_rt::test]
async fn distinct_search_with_pagination_no_ranking() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, Some(DOCUMENT_PRIMARY_KEY)).await;
index.update_distinct_attribute(json!(DOCUMENT_DISTINCT_KEY)).await;
index.wait_task(1).await;
fn get_hits(response: &Value) -> Vec<&str> {
let hits_array = response["hits"].as_array().unwrap();
hits_array.iter().map(|h| h[DOCUMENT_DISTINCT_KEY].as_str().unwrap()).collect::<Vec<_>>()
}
let (response, code) = index.search_post(json!({"page": 0, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["page"], @"0");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 1, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["123456", "789012"]"#);
snapshot!(response["page"], @"1");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 2, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["456789", "987654"]"#);
snapshot!(response["page"], @"2");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 3, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["234567", "345678"]"#);
snapshot!(response["page"], @"3");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 4, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["page"], @"4");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 2, "hitsPerPage": 3})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"3");
snapshot!(format!("{:?}", hits), @r#"["987654", "234567", "345678"]"#);
snapshot!(response["page"], @"2");
snapshot!(response["totalPages"], @"2");
snapshot!(response["totalHits"], @"6");
}
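For reference, the totals asserted above follow from the fixture: the 14 documents collapse to 6 distinct `product_id` values, and `totalPages` is the ceiling of `totalHits / hitsPerPage`. A small illustrative helper (not part of the test suite) spells out the arithmetic:

// Hypothetical helper, only to make the expected snapshot numbers explicit.
fn expected_totals(distinct_hits: usize, hits_per_page: usize) -> (usize, usize) {
    let total_pages = (distinct_hits + hits_per_page - 1) / hits_per_page; // ceiling division
    (distinct_hits, total_pages)
}

fn main() {
    assert_eq!(expected_totals(6, 2), (6, 3)); // hitsPerPage: 2 -> totalPages: 3
    assert_eq!(expected_totals(6, 3), (6, 2)); // hitsPerPage: 3 -> totalPages: 2
}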

View File

@ -1,8 +1,8 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use super::DOCUMENTS; use super::DOCUMENTS;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn search_unexisting_index() { async fn search_unexisting_index() {

View File

@ -1,8 +1,8 @@
use meili_snap::snapshot; use meili_snap::snapshot;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| { pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| {
json!([ json!([

View File

@ -1,8 +1,8 @@
use insta::{allow_duplicates, assert_json_snapshot}; use insta::{allow_duplicates, assert_json_snapshot};
use serde_json::json;
use super::*; use super::*;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn formatted_contain_wildcard() { async fn formatted_contain_wildcard() {

View File

@ -1,8 +1,8 @@
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| { pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| {
json!([ json!([

View File

@ -1,6 +1,7 @@
// This module contains all the tests concerning search. Each particular feature of the search // This module contains all the tests concerning search. Each particular feature of the search
// should be tested in its own module to isolate tests and keep the tests readable. // should be tested in its own module to isolate tests and keep the tests readable.
mod distinct;
mod errors; mod errors;
mod facet_search; mod facet_search;
mod formatted; mod formatted;
@ -10,9 +11,9 @@ mod pagination;
mod restrict_searchable; mod restrict_searchable;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| { pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| {
json!([ json!([

View File

@ -1,8 +1,8 @@
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::json;
use super::{DOCUMENTS, NESTED_DOCUMENTS}; use super::{DOCUMENTS, NESTED_DOCUMENTS};
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn search_empty_list() { async fn search_empty_list() {

View File

@ -1,5 +1,6 @@
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
use crate::search::DOCUMENTS; use crate::search::DOCUMENTS;
#[actix_rt::test] #[actix_rt::test]

View File

@ -1,9 +1,9 @@
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::index::Index; use crate::common::index::Index;
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
async fn index_with_documents<'a>(server: &'a Server, documents: &Value) -> Index<'a> { async fn index_with_documents<'a>(server: &'a Server, documents: &Value) -> Index<'a> {
let index = server.index("test"); let index = server.index("test");

View File

@ -1,5 +1,6 @@
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn set_and_reset_distinct_attribute() { async fn set_and_reset_distinct_attribute() {

View File

@ -1,7 +1,7 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn settings_bad_displayed_attributes() { async fn settings_bad_displayed_attributes() {

View File

@ -1,16 +1,16 @@
use std::collections::HashMap; use std::collections::HashMap;
use once_cell::sync::Lazy; use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::{Server, Value}; use crate::common::Server;
use crate::json;
static DEFAULT_SETTINGS_VALUES: Lazy<HashMap<&'static str, Value>> = Lazy::new(|| { static DEFAULT_SETTINGS_VALUES: Lazy<HashMap<&'static str, Value>> = Lazy::new(|| {
let mut map = HashMap::new(); let mut map = HashMap::new();
map.insert("displayed_attributes", json!(["*"])); map.insert("displayed_attributes", json!(["*"]));
map.insert("searchable_attributes", json!(["*"])); map.insert("searchable_attributes", json!(["*"]));
map.insert("filterable_attributes", json!([])); map.insert("filterable_attributes", json!([]));
map.insert("distinct_attribute", json!(null)); map.insert("distinct_attribute", json!(Value::Null));
map.insert( map.insert(
"ranking_rules", "ranking_rules",
json!(["words", "typo", "proximity", "attribute", "sort", "exactness"]), json!(["words", "typo", "proximity", "attribute", "sort", "exactness"]),
@ -229,7 +229,7 @@ macro_rules! test_setting_routes {
.chars() .chars()
.map(|c| if c == '_' { '-' } else { c }) .map(|c| if c == '_' { '-' } else { c })
.collect::<String>()); .collect::<String>());
let (response, code) = server.service.$write_method(url, serde_json::Value::Null.into()).await; let (response, code) = server.service.$write_method(url, serde_json::Value::Null).await;
assert_eq!(code, 202, "{}", response); assert_eq!(code, 202, "{}", response);
server.index("").wait_task(0).await; server.index("").wait_task(0).await;
let (response, code) = server.index("test").get().await; let (response, code) = server.index("test").get().await;
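The macro body above derives each setting's route from its Rust identifier by replacing underscores with dashes. A hypothetical sketch of that mapping, for illustration only:

// Mirrors the .chars().map(...) call in the macro above.
fn route_from_setting_name(name: &str) -> String {
    name.chars().map(|c| if c == '_' { '-' } else { c }).collect()
}

fn main() {
    assert_eq!(route_from_setting_name("distinct_attribute"), "distinct-attribute");
    assert_eq!(route_from_setting_name("ranking_rules"), "ranking-rules");
}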

View File

@ -1,7 +1,7 @@
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn set_and_reset() { async fn set_and_reset() {

View File

@ -1,13 +1,11 @@
use std::time::Duration; use std::time::Duration;
use actix_rt::time::sleep; use actix_rt::time::sleep;
use meili_snap::{json_string, snapshot};
use meilisearch::option::ScheduleSnapshot; use meilisearch::option::ScheduleSnapshot;
use meilisearch::Opt; use meilisearch::Opt;
use crate::common::server::default_settings; use crate::common::server::default_settings;
use crate::common::{GetAllDocumentsOptions, Server}; use crate::common::{GetAllDocumentsOptions, Server};
use crate::json;
macro_rules! verify_snapshot { macro_rules! verify_snapshot {
( (
@ -46,7 +44,7 @@ async fn perform_snapshot() {
let index = server.index("test"); let index = server.index("test");
index index
.update_settings(json! ({ .update_settings(serde_json::json! ({
"searchableAttributes": [], "searchableAttributes": [],
})) }))
.await; .await;
@ -92,95 +90,3 @@ async fn perform_snapshot() {
server.index("test1").settings(), server.index("test1").settings(),
); );
} }
#[actix_rt::test]
async fn perform_on_demand_snapshot() {
let temp = tempfile::tempdir().unwrap();
let snapshot_dir = tempfile::tempdir().unwrap();
let options =
Opt { snapshot_dir: snapshot_dir.path().to_owned(), ..default_settings(temp.path()) };
let server = Server::new_with_options(options).await.unwrap();
let index = server.index("catto");
index
.update_settings(json! ({
"searchableAttributes": [],
}))
.await;
index.load_test_set().await;
server.index("doggo").create(Some("bone")).await;
index.wait_task(2).await;
server.index("doggo").create(Some("bone")).await;
index.wait_task(2).await;
let (task, code) = server.create_snapshot().await;
snapshot!(code, @"202 Accepted");
snapshot!(json_string!(task, { ".enqueuedAt" => "[date]" }), @r###"
{
"taskUid": 4,
"indexUid": null,
"status": "enqueued",
"type": "snapshotCreation",
"enqueuedAt": "[date]"
}
"###);
let task = index.wait_task(task.uid()).await;
snapshot!(json_string!(task, { ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]", ".duration" => "[duration]" }), @r###"
{
"uid": 4,
"indexUid": null,
"status": "succeeded",
"type": "snapshotCreation",
"canceledBy": null,
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
let temp = tempfile::tempdir().unwrap();
let snapshots: Vec<String> = std::fs::read_dir(&snapshot_dir)
.unwrap()
.map(|entry| entry.unwrap().path().file_name().unwrap().to_str().unwrap().to_string())
.collect();
meili_snap::snapshot!(format!("{snapshots:?}"), @r###"["db.snapshot"]"###);
let snapshot_path = snapshot_dir.path().to_owned().join("db.snapshot");
#[cfg_attr(windows, allow(unused))]
let snapshot_meta = std::fs::metadata(&snapshot_path).unwrap();
#[cfg(unix)]
{
use std::os::unix::fs::PermissionsExt;
let mode = snapshot_meta.permissions().mode();
// rwxrwxrwx
meili_snap::snapshot!(format!("{:b}", mode), @"1000000100100100");
}
let options = Opt { import_snapshot: Some(snapshot_path), ..default_settings(temp.path()) };
let snapshot_server = Server::new_with_options(options).await.unwrap();
verify_snapshot!(server, snapshot_server, |server| =>
server.list_indexes(None, None),
// for some reason the db sizes differ. this may be due to the compaction options we have
// set when performing the snapshot
//server.stats(),
// The original instance contains the snapshotCreation task, while the snapshotted instance does not. For this reason we need to compare the task queue **after** task 4
server.tasks_filter("?from=2"),
server.index("catto").get_all_documents(GetAllDocumentsOptions::default()),
server.index("catto").settings(),
server.index("doggo").get_all_documents(GetAllDocumentsOptions::default()),
server.index("doggo").settings(),
);
}

View File

@ -1,8 +1,8 @@
use serde_json::json;
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::OffsetDateTime; use time::OffsetDateTime;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn get_settings_unexisting_index() { async fn get_settings_unexisting_index() {

View File

@ -1,7 +1,7 @@
use meili_snap::*; use meili_snap::*;
use serde_json::json;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn swap_indexes_bad_format() { async fn swap_indexes_bad_format() {

View File

@ -1,9 +1,9 @@
mod errors; mod errors;
use meili_snap::{json_string, snapshot}; use meili_snap::{json_string, snapshot};
use serde_json::json;
use crate::common::{GetAllDocumentsOptions, Server}; use crate::common::{GetAllDocumentsOptions, Server};
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn swap_indexes() { async fn swap_indexes() {

View File

@ -1,11 +1,11 @@
mod errors; mod errors;
use meili_snap::insta::assert_json_snapshot; use meili_snap::insta::assert_json_snapshot;
use serde_json::json;
use time::format_description::well_known::Rfc3339; use time::format_description::well_known::Rfc3339;
use time::OffsetDateTime; use time::OffsetDateTime;
use crate::common::Server; use crate::common::Server;
use crate::json;
#[actix_rt::test] #[actix_rt::test]
async fn error_get_unexisting_task_status() { async fn error_get_unexisting_task_status() {
@ -33,7 +33,7 @@ async fn get_task_status() {
index.create(None).await; index.create(None).await;
index index
.add_documents( .add_documents(
json!([{ serde_json::json!([{
"id": 1, "id": 1,
"content": "foobar", "content": "foobar",
}]), }]),

View File

@ -79,7 +79,6 @@ big_s = "1.0.2"
insta = "1.29.0" insta = "1.29.0"
maplit = "1.0.2" maplit = "1.0.2"
md5 = "0.7.0" md5 = "0.7.0"
meili-snap = { path = "../meili-snap" }
rand = { version = "0.8.5", features = ["small_rng"] } rand = { version = "0.8.5", features = ["small_rng"] }
[features] [features]

View File

@ -1,4 +1,5 @@
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use std::{io, str}; use std::{io, str};
use obkv::KvReader; use obkv::KvReader;
@ -19,14 +20,14 @@ use crate::FieldId;
pub struct EnrichedDocumentsBatchReader<R> { pub struct EnrichedDocumentsBatchReader<R> {
documents: DocumentsBatchReader<R>, documents: DocumentsBatchReader<R>,
primary_key: String, primary_key: String,
external_ids: grenad::ReaderCursor<File>, external_ids: grenad::ReaderCursor<BufReader<File>>,
} }
impl<R: io::Read + io::Seek> EnrichedDocumentsBatchReader<R> { impl<R: io::Read + io::Seek> EnrichedDocumentsBatchReader<R> {
pub fn new( pub fn new(
documents: DocumentsBatchReader<R>, documents: DocumentsBatchReader<R>,
primary_key: String, primary_key: String,
external_ids: grenad::Reader<File>, external_ids: grenad::Reader<BufReader<File>>,
) -> Result<Self, Error> { ) -> Result<Self, Error> {
if documents.documents_count() as u64 == external_ids.len() { if documents.documents_count() as u64 == external_ids.len() {
Ok(EnrichedDocumentsBatchReader { Ok(EnrichedDocumentsBatchReader {
@ -75,7 +76,7 @@ pub struct EnrichedDocument<'a> {
pub struct EnrichedDocumentsBatchCursor<R> { pub struct EnrichedDocumentsBatchCursor<R> {
documents: DocumentsBatchCursor<R>, documents: DocumentsBatchCursor<R>,
primary_key: String, primary_key: String,
external_ids: grenad::ReaderCursor<File>, external_ids: grenad::ReaderCursor<BufReader<File>>,
} }
impl<R> EnrichedDocumentsBatchCursor<R> { impl<R> EnrichedDocumentsBatchCursor<R> {

View File

@ -60,16 +60,12 @@ impl CboRoaringBitmapCodec {
/// if the merged values length is under the threshold, values are directly /// if the merged values length is under the threshold, values are directly
/// serialized in the buffer else a RoaringBitmap is created from the /// serialized in the buffer else a RoaringBitmap is created from the
/// values and is serialized in the buffer. /// values and is serialized in the buffer.
pub fn merge_into<I, A>(slices: I, buffer: &mut Vec<u8>) -> io::Result<()> pub fn merge_into(slices: &[Cow<[u8]>], buffer: &mut Vec<u8>) -> io::Result<()> {
where
I: IntoIterator<Item = A>,
A: AsRef<[u8]>,
{
let mut roaring = RoaringBitmap::new(); let mut roaring = RoaringBitmap::new();
let mut vec = Vec::new(); let mut vec = Vec::new();
for bytes in slices { for bytes in slices {
if bytes.as_ref().len() <= THRESHOLD * size_of::<u32>() { if bytes.len() <= THRESHOLD * size_of::<u32>() {
let mut reader = bytes.as_ref(); let mut reader = bytes.as_ref();
while let Ok(integer) = reader.read_u32::<NativeEndian>() { while let Ok(integer) = reader.read_u32::<NativeEndian>() {
vec.push(integer); vec.push(integer);
@ -89,7 +85,7 @@ impl CboRoaringBitmapCodec {
} }
} else { } else {
// We can unwrap safely because the vector is sorted above. // We can unwrap safely because the vector is sorted above.
let roaring = RoaringBitmap::from_sorted_iter(vec).unwrap(); let roaring = RoaringBitmap::from_sorted_iter(vec.into_iter()).unwrap();
roaring.serialize_into(buffer)?; roaring.serialize_into(buffer)?;
} }
} else { } else {
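The doc comment above describes the size-dependent encoding: small sets are written as raw integers, larger ones as a serialized RoaringBitmap. A rough, simplified sketch of that idea follows; the threshold value and the exact byte layout are assumptions for illustration, not the codec's definition.

use std::io;
use roaring::RoaringBitmap;

// Assumed value for illustration; the real constant lives next to the codec.
const THRESHOLD: usize = 7;

fn serialize_cbo(values: &[u32], buffer: &mut Vec<u8>) -> io::Result<()> {
    if values.len() <= THRESHOLD {
        // Small set: write the raw integers directly.
        for v in values {
            buffer.extend_from_slice(&v.to_ne_bytes());
        }
        Ok(())
    } else {
        // Large set: build and serialize a real RoaringBitmap.
        let bitmap: RoaringBitmap = values.iter().copied().collect();
        bitmap.serialize_into(buffer)
    }
}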

View File

@ -119,16 +119,16 @@ pub struct Index {
pub(crate) main: PolyDatabase, pub(crate) main: PolyDatabase,
/// A word and all the documents ids containing the word. /// A word and all the documents ids containing the word.
pub word_docids: Database<Str, CboRoaringBitmapCodec>, pub word_docids: Database<Str, RoaringBitmapCodec>,
/// A word and all the documents ids containing the word, from attributes for which typos are not allowed. /// A word and all the documents ids containing the word, from attributes for which typos are not allowed.
pub exact_word_docids: Database<Str, CboRoaringBitmapCodec>, pub exact_word_docids: Database<Str, RoaringBitmapCodec>,
/// A prefix of word and all the documents ids containing this prefix. /// A prefix of word and all the documents ids containing this prefix.
pub word_prefix_docids: Database<Str, CboRoaringBitmapCodec>, pub word_prefix_docids: Database<Str, RoaringBitmapCodec>,
/// A prefix of word and all the documents ids containing this prefix, from attributes for which typos are not allowed. /// A prefix of word and all the documents ids containing this prefix, from attributes for which typos are not allowed.
pub exact_word_prefix_docids: Database<Str, CboRoaringBitmapCodec>, pub exact_word_prefix_docids: Database<Str, RoaringBitmapCodec>,
/// Maps the proximity between a pair of words with all the docids where this relation appears. /// Maps the proximity between a pair of words with all the docids where this relation appears.
pub word_pair_proximity_docids: Database<U8StrStrCodec, CboRoaringBitmapCodec>, pub word_pair_proximity_docids: Database<U8StrStrCodec, CboRoaringBitmapCodec>,

View File

@ -53,11 +53,22 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
if excluded.contains(docid) { if excluded.contains(docid) {
continue; continue;
} }
distinct_single_docid(ctx.index, ctx.txn, distinct_fid, docid, &mut excluded)?; distinct_single_docid(ctx.index, ctx.txn, distinct_fid, docid, &mut excluded)?;
results.push(docid); results.push(docid);
} }
let mut all_candidates = universe - excluded; let mut all_candidates = universe - excluded;
all_candidates.extend(results.iter().copied()); all_candidates.extend(results.iter().copied());
// drain the results of the skipped elements
// this **must** be done **after** writing all the results into `all_candidates` to ensure
// that e.g. estimatedTotalHits is correct.
if results.len() >= from {
results.drain(..from);
} else {
results.clear();
}
return Ok(BucketSortOutput { return Ok(BucketSortOutput {
scores: vec![Default::default(); results.len()], scores: vec![Default::default(); results.len()],
docids: results, docids: results,
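A stripped-down sketch of why the order matters in the hunk above: the skipped distinct hits still have to be counted in `all_candidates` (which feeds estimatedTotalHits/totalHits) before the page offset is applied. Plain std collections stand in for the RoaringBitmap types here; this mirrors the logic, not the actual milli code.

use std::collections::BTreeSet;

fn paginate_distinct(
    mut results: Vec<u32>,
    universe: &BTreeSet<u32>,
    excluded: &BTreeSet<u32>,
    from: usize,
) -> (Vec<u32>, usize) {
    // 1. count every candidate, including the hits we are about to skip
    let mut all_candidates: BTreeSet<u32> = universe.difference(excluded).copied().collect();
    all_candidates.extend(results.iter().copied());
    // 2. only then drop the first `from` hits from the returned page
    if results.len() >= from {
        results.drain(..from);
    } else {
        results.clear();
    }
    (results, all_candidates.len())
}

fn main() {
    let universe: BTreeSet<u32> = (0..14).collect();
    let excluded: BTreeSet<u32> = (0..14).collect(); // every docid was either kept or deduplicated
    let kept = vec![0, 3, 5, 7, 9, 12]; // 6 distinct representatives
    let (page, total) = paginate_distinct(kept, &universe, &excluded, 4);
    assert_eq!(page, vec![9, 12]);
    assert_eq!(total, 6); // the skipped hits still count towards the total
}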

View File

@ -11,7 +11,9 @@ use super::interner::Interned;
use super::Word; use super::Word;
use crate::heed_codec::{BytesDecodeOwned, StrBEU16Codec}; use crate::heed_codec::{BytesDecodeOwned, StrBEU16Codec};
use crate::update::{merge_cbo_roaring_bitmaps, MergeFn}; use crate::update::{merge_cbo_roaring_bitmaps, MergeFn};
use crate::{CboRoaringBitmapCodec, CboRoaringBitmapLenCodec, Result, SearchContext}; use crate::{
CboRoaringBitmapCodec, CboRoaringBitmapLenCodec, Result, RoaringBitmapCodec, SearchContext,
};
/// A cache storing pointers to values in the LMDB databases. /// A cache storing pointers to values in the LMDB databases.
/// ///
@ -166,7 +168,7 @@ impl<'ctx> SearchContext<'ctx> {
merge_cbo_roaring_bitmaps, merge_cbo_roaring_bitmaps,
) )
} }
None => DatabaseCache::get_value::<_, _, CboRoaringBitmapCodec>( None => DatabaseCache::get_value::<_, _, RoaringBitmapCodec>(
self.txn, self.txn,
word, word,
self.word_interner.get(word).as_str(), self.word_interner.get(word).as_str(),
@ -180,7 +182,7 @@ impl<'ctx> SearchContext<'ctx> {
&mut self, &mut self,
word: Interned<String>, word: Interned<String>,
) -> Result<Option<RoaringBitmap>> { ) -> Result<Option<RoaringBitmap>> {
DatabaseCache::get_value::<_, _, CboRoaringBitmapCodec>( DatabaseCache::get_value::<_, _, RoaringBitmapCodec>(
self.txn, self.txn,
word, word,
self.word_interner.get(word).as_str(), self.word_interner.get(word).as_str(),
@ -228,7 +230,7 @@ impl<'ctx> SearchContext<'ctx> {
merge_cbo_roaring_bitmaps, merge_cbo_roaring_bitmaps,
) )
} }
None => DatabaseCache::get_value::<_, _, CboRoaringBitmapCodec>( None => DatabaseCache::get_value::<_, _, RoaringBitmapCodec>(
self.txn, self.txn,
prefix, prefix,
self.word_interner.get(prefix).as_str(), self.word_interner.get(prefix).as_str(),
@ -242,7 +244,7 @@ impl<'ctx> SearchContext<'ctx> {
&mut self, &mut self,
prefix: Interned<String>, prefix: Interned<String>,
) -> Result<Option<RoaringBitmap>> { ) -> Result<Option<RoaringBitmap>> {
DatabaseCache::get_value::<_, _, CboRoaringBitmapCodec>( DatabaseCache::get_value::<_, _, RoaringBitmapCodec>(
self.txn, self.txn,
prefix, prefix,
self.word_interner.get(prefix).as_str(), self.word_interner.get(prefix).as_str(),

View File

@ -13,7 +13,6 @@ This module tests the `sort` ranking rule:
use big_s::S; use big_s::S;
use maplit::hashset; use maplit::hashset;
use meili_snap::insta;
use crate::index::tests::TempIndex; use crate::index::tests::TempIndex;
use crate::search::new::tests::collect_field_values; use crate::search::new::tests::collect_field_values;

View File

@ -1,104 +0,0 @@
use obkv::Key;
pub type KvWriterDelAdd<W> = obkv::KvWriter<W, DelAdd>;
pub type KvReaderDelAdd<'a> = obkv::KvReader<'a, DelAdd>;
/// DelAdd defines the new value to add in the database and the old value to delete from the database.
///
/// It's used in an OBKV to be serialized in grenad files.
#[repr(u8)]
#[derive(Clone, Copy, PartialOrd, PartialEq, Debug)]
pub enum DelAdd {
Deletion = 0,
Addition = 1,
}
impl Key for DelAdd {
const BYTES_SIZE: usize = std::mem::size_of::<DelAdd>();
type BYTES = [u8; Self::BYTES_SIZE];
fn to_be_bytes(&self) -> Self::BYTES {
u8::to_be_bytes(*self as u8)
}
fn from_be_bytes(array: Self::BYTES) -> Self {
match u8::from_be_bytes(array) {
0 => Self::Deletion,
1 => Self::Addition,
otherwise => unreachable!("DelAdd has only 2 variants, unknown variant: {}", otherwise),
}
}
}
/// Creates a Kv<K, Kv<DelAdd, value>> from Kv<K, value>
///
/// if deletion is `true`, the value will be inserted behind a DelAdd::Deletion key.
/// if addition is `true`, the value will be inserted behind a DelAdd::Addition key.
/// if both deletion and addition are `true`, the value will be inserted in both keys.
pub fn into_del_add_obkv<K: obkv::Key + PartialOrd>(
reader: obkv::KvReader<K>,
deletion: bool,
addition: bool,
buffer: &mut Vec<u8>,
) -> Result<(), std::io::Error> {
let mut writer = obkv::KvWriter::new(buffer);
let mut value_buffer = Vec::new();
for (key, value) in reader.iter() {
value_buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
if deletion {
value_writer.insert(DelAdd::Deletion, value)?;
}
if addition {
value_writer.insert(DelAdd::Addition, value)?;
}
value_writer.finish()?;
writer.insert(key, &value_buffer)?;
}
writer.finish()
}
/// Creates a Kv<K, Kv<DelAdd, value>> from two Kv<K, value>
///
/// putting each deletion obkv's keys under a DelAdd::Deletion
/// and putting each addition obkv's keys under a DelAdd::Addition
pub fn del_add_from_two_obkvs<K: obkv::Key + PartialOrd + Ord>(
deletion: obkv::KvReader<K>,
addition: obkv::KvReader<K>,
buffer: &mut Vec<u8>,
) -> Result<(), std::io::Error> {
use itertools::merge_join_by;
use itertools::EitherOrBoth::{Both, Left, Right};
let mut writer = obkv::KvWriter::new(buffer);
let mut value_buffer = Vec::new();
for eob in merge_join_by(deletion.iter(), addition.iter(), |(b, _), (u, _)| b.cmp(u)) {
value_buffer.clear();
match eob {
Left((k, v)) => {
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
value_writer.insert(DelAdd::Deletion, v).unwrap();
writer.insert(k, value_writer.into_inner()?).unwrap();
}
Right((k, v)) => {
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
value_writer.insert(DelAdd::Addition, v).unwrap();
writer.insert(k, value_writer.into_inner()?).unwrap();
}
Both((k, deletion), (_, addition)) => {
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
value_writer.insert(DelAdd::Deletion, deletion).unwrap();
value_writer.insert(DelAdd::Addition, addition).unwrap();
writer.insert(k, value_writer.into_inner()?).unwrap();
}
}
}
writer.finish()
}
pub fn is_noop_del_add_obkv(del_add: KvReaderDelAdd) -> bool {
del_add.get(DelAdd::Deletion) == del_add.get(DelAdd::Addition)
}
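As a small usage sketch of the writer type defined at the top of this file (illustrative, assuming the module's items are in scope): a single field value wrapped under both keys, which is what `into_del_add_obkv` produces when `deletion` and `addition` are both `true`.

// Hedged sketch, not part of the module: wrap one value under both DelAdd keys.
fn wrap_value_both(value: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    let mut writer = KvWriterDelAdd::new(&mut buffer);
    writer.insert(DelAdd::Deletion, value)?;
    writer.insert(DelAdd::Addition, value)?;
    writer.finish()?;
    Ok(buffer)
}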

View File

@ -16,7 +16,9 @@ use crate::facet::FacetType;
use crate::heed_codec::facet::FieldDocIdFacetCodec; use crate::heed_codec::facet::FieldDocIdFacetCodec;
use crate::heed_codec::CboRoaringBitmapCodec; use crate::heed_codec::CboRoaringBitmapCodec;
use crate::index::Hnsw; use crate::index::Hnsw;
use crate::{ExternalDocumentsIds, FieldId, FieldIdMapMissingEntry, Index, Result, BEU32}; use crate::{
ExternalDocumentsIds, FieldId, FieldIdMapMissingEntry, Index, Result, RoaringBitmapCodec, BEU32,
};
pub struct DeleteDocuments<'t, 'u, 'i> { pub struct DeleteDocuments<'t, 'u, 'i> {
wtxn: &'t mut heed::RwTxn<'i, 'u>, wtxn: &'t mut heed::RwTxn<'i, 'u>,
@ -493,7 +495,7 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
fn remove_from_word_prefix_docids( fn remove_from_word_prefix_docids(
txn: &mut heed::RwTxn, txn: &mut heed::RwTxn,
db: &Database<Str, CboRoaringBitmapCodec>, db: &Database<Str, RoaringBitmapCodec>,
to_remove: &RoaringBitmap, to_remove: &RoaringBitmap,
) -> Result<fst::Set<Vec<u8>>> { ) -> Result<fst::Set<Vec<u8>>> {
let mut prefixes_to_delete = fst::SetBuilder::memory(); let mut prefixes_to_delete = fst::SetBuilder::memory();
@ -521,7 +523,7 @@ fn remove_from_word_prefix_docids(
fn remove_from_word_docids( fn remove_from_word_docids(
txn: &mut heed::RwTxn, txn: &mut heed::RwTxn,
db: &heed::Database<Str, CboRoaringBitmapCodec>, db: &heed::Database<Str, RoaringBitmapCodec>,
to_remove: &RoaringBitmap, to_remove: &RoaringBitmap,
words_to_keep: &mut BTreeSet<String>, words_to_keep: &mut BTreeSet<String>,
words_to_remove: &mut BTreeSet<String>, words_to_remove: &mut BTreeSet<String>,

View File

@ -1,5 +1,6 @@
use std::borrow::Cow; use std::borrow::Cow;
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use grenad::CompressionType; use grenad::CompressionType;
use heed::types::ByteSlice; use heed::types::ByteSlice;
@ -30,7 +31,7 @@ pub struct FacetsUpdateBulk<'i> {
facet_type: FacetType, facet_type: FacetType,
field_ids: Vec<FieldId>, field_ids: Vec<FieldId>,
// None if level 0 does not need to be updated // None if level 0 does not need to be updated
new_data: Option<grenad::Reader<File>>, new_data: Option<grenad::Reader<BufReader<File>>>,
} }
impl<'i> FacetsUpdateBulk<'i> { impl<'i> FacetsUpdateBulk<'i> {
@ -38,7 +39,7 @@ impl<'i> FacetsUpdateBulk<'i> {
index: &'i Index, index: &'i Index,
field_ids: Vec<FieldId>, field_ids: Vec<FieldId>,
facet_type: FacetType, facet_type: FacetType,
new_data: grenad::Reader<File>, new_data: grenad::Reader<BufReader<File>>,
group_size: u8, group_size: u8,
min_level_size: u8, min_level_size: u8,
) -> FacetsUpdateBulk<'i> { ) -> FacetsUpdateBulk<'i> {
@ -132,8 +133,6 @@ impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
self.db.delete_range(wtxn, &range).map(drop)?; self.db.delete_range(wtxn, &range).map(drop)?;
Ok(()) Ok(())
} }
// TODO the new_data is an Reader<Obkv<Key, Obkv<DelAdd, RoaringBitmap>>>
fn update_level0(&mut self, wtxn: &mut RwTxn) -> Result<()> { fn update_level0(&mut self, wtxn: &mut RwTxn) -> Result<()> {
let new_data = match self.new_data.take() { let new_data = match self.new_data.take() {
Some(x) => x, Some(x) => x,
@ -189,7 +188,7 @@ impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
&self, &self,
field_id: FieldId, field_id: FieldId,
txn: &RoTxn, txn: &RoTxn,
) -> Result<(Vec<grenad::Reader<File>>, RoaringBitmap)> { ) -> Result<(Vec<grenad::Reader<BufReader<File>>>, RoaringBitmap)> {
let mut all_docids = RoaringBitmap::new(); let mut all_docids = RoaringBitmap::new();
let subwriters = self.compute_higher_levels(txn, field_id, 32, &mut |bitmaps, _| { let subwriters = self.compute_higher_levels(txn, field_id, 32, &mut |bitmaps, _| {
for bitmap in bitmaps { for bitmap in bitmaps {
@ -261,7 +260,7 @@ impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
field_id: u16, field_id: u16,
level: u8, level: u8,
handle_group: &mut dyn FnMut(&[RoaringBitmap], &'t [u8]) -> Result<()>, handle_group: &mut dyn FnMut(&[RoaringBitmap], &'t [u8]) -> Result<()>,
) -> Result<Vec<grenad::Reader<File>>> { ) -> Result<Vec<grenad::Reader<BufReader<File>>>> {
if level == 0 { if level == 0 {
self.read_level_0(rtxn, field_id, handle_group)?; self.read_level_0(rtxn, field_id, handle_group)?;
// Level 0 is already in the database // Level 0 is already in the database

View File

@ -1,5 +1,6 @@
use std::collections::HashMap; use std::collections::HashMap;
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use heed::types::{ByteSlice, DecodeIgnore}; use heed::types::{ByteSlice, DecodeIgnore};
use heed::{BytesDecode, Error, RoTxn, RwTxn}; use heed::{BytesDecode, Error, RoTxn, RwTxn};
@ -34,14 +35,14 @@ pub struct FacetsUpdateIncremental<'i> {
index: &'i Index, index: &'i Index,
inner: FacetsUpdateIncrementalInner, inner: FacetsUpdateIncrementalInner,
facet_type: FacetType, facet_type: FacetType,
new_data: grenad::Reader<File>, new_data: grenad::Reader<BufReader<File>>,
} }
impl<'i> FacetsUpdateIncremental<'i> { impl<'i> FacetsUpdateIncremental<'i> {
pub fn new( pub fn new(
index: &'i Index, index: &'i Index,
facet_type: FacetType, facet_type: FacetType,
new_data: grenad::Reader<File>, new_data: grenad::Reader<BufReader<File>>,
group_size: u8, group_size: u8,
min_level_size: u8, min_level_size: u8,
max_group_size: u8, max_group_size: u8,

View File

@ -78,6 +78,7 @@ pub const FACET_MIN_LEVEL_SIZE: u8 = 5;
use std::collections::BTreeSet; use std::collections::BTreeSet;
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use std::iter::FromIterator; use std::iter::FromIterator;
use charabia::normalizer::{Normalize, NormalizerOption}; use charabia::normalizer::{Normalize, NormalizerOption};
@ -108,14 +109,17 @@ pub struct FacetsUpdate<'i> {
index: &'i Index, index: &'i Index,
database: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>, database: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
facet_type: FacetType, facet_type: FacetType,
new_data: grenad::Reader<File>, new_data: grenad::Reader<BufReader<File>>,
group_size: u8, group_size: u8,
max_group_size: u8, max_group_size: u8,
min_level_size: u8, min_level_size: u8,
} }
impl<'i> FacetsUpdate<'i> { impl<'i> FacetsUpdate<'i> {
// TODO grenad::Reader<Key, Obkv<DelAdd, RoaringBitmap>> pub fn new(
pub fn new(index: &'i Index, facet_type: FacetType, new_data: grenad::Reader<File>) -> Self { index: &'i Index,
facet_type: FacetType,
new_data: grenad::Reader<BufReader<File>>,
) -> Self {
let database = match facet_type { let database = match facet_type {
FacetType::String => index FacetType::String => index
.facet_id_string_docids .facet_id_string_docids

View File

@ -1,4 +1,4 @@
use std::io::{Read, Seek}; use std::io::{BufWriter, Read, Seek};
use std::result::Result as StdResult; use std::result::Result as StdResult;
use std::{fmt, iter}; use std::{fmt, iter};
@ -35,7 +35,7 @@ pub fn enrich_documents_batch<R: Read + Seek>(
let (mut cursor, mut documents_batch_index) = reader.into_cursor_and_fields_index(); let (mut cursor, mut documents_batch_index) = reader.into_cursor_and_fields_index();
let mut external_ids = tempfile::tempfile().map(grenad::Writer::new)?; let mut external_ids = tempfile::tempfile().map(BufWriter::new).map(grenad::Writer::new)?;
let mut uuid_buffer = [0; uuid::fmt::Hyphenated::LENGTH]; let mut uuid_buffer = [0; uuid::fmt::Hyphenated::LENGTH];
// The primary key *field id* that has already been set for this index or the one // The primary key *field id* that has already been set for this index or the one

View File

@ -1,19 +1,22 @@
use std::collections::{HashMap, HashSet}; use std::collections::{HashMap, HashSet};
use std::convert::TryInto; use std::convert::TryInto;
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use std::{io, mem, str}; use std::{io, mem, str};
use charabia::{Language, Script, SeparatorKind, Token, TokenKind, Tokenizer, TokenizerBuilder}; use charabia::{Language, Script, SeparatorKind, Token, TokenKind, Tokenizer, TokenizerBuilder};
use obkv::{KvReader, KvWriterU16}; use obkv::KvReader;
use roaring::RoaringBitmap; use roaring::RoaringBitmap;
use serde_json::Value; use serde_json::Value;
use super::helpers::{create_sorter, keep_latest_obkv, sorter_into_reader, GrenadParameters}; use super::helpers::{concat_u32s_array, create_sorter, sorter_into_reader, GrenadParameters};
use crate::error::{InternalError, SerializationError}; use crate::error::{InternalError, SerializationError};
use crate::update::del_add::{del_add_from_two_obkvs, DelAdd, KvReaderDelAdd}; use crate::update::index_documents::MergeFn;
use crate::{FieldId, Result, MAX_POSITION_PER_ATTRIBUTE, MAX_WORD_LENGTH}; use crate::{
absolute_from_relative_position, FieldId, Result, MAX_POSITION_PER_ATTRIBUTE, MAX_WORD_LENGTH,
};
pub type ScriptLanguageDocidsMap = HashMap<(Script, Language), (RoaringBitmap, RoaringBitmap)>; pub type ScriptLanguageDocidsMap = HashMap<(Script, Language), RoaringBitmap>;
/// Extracts the word and positions where this word appears and /// Extracts the word and positions where this word appears and
/// prefixes it by the document id. /// prefixes it by the document id.
@@ -29,160 +32,25 @@ pub fn extract_docid_word_positions<R: io::Read + io::Seek>(
     allowed_separators: Option<&[&str]>,
     dictionary: Option<&[&str]>,
     max_positions_per_attributes: Option<u32>,
-) -> Result<(RoaringBitmap, grenad::Reader<File>, ScriptLanguageDocidsMap)> {
+) -> Result<(RoaringBitmap, grenad::Reader<BufReader<File>>, ScriptLanguageDocidsMap)> {
     puffin::profile_function!();
     let max_positions_per_attributes = max_positions_per_attributes
         .map_or(MAX_POSITION_PER_ATTRIBUTE, |max| max.min(MAX_POSITION_PER_ATTRIBUTE));
     let max_memory = indexer.max_memory_by_thread();
-    // initialize destination values.
     let mut documents_ids = RoaringBitmap::new();
     let mut script_language_docids = HashMap::new();
     let mut docid_word_positions_sorter = create_sorter(
         grenad::SortAlgorithm::Stable,
-        keep_latest_obkv,
+        concat_u32s_array,
         indexer.chunk_compression_type,
         indexer.chunk_compression_level,
         indexer.max_nb_chunks,
         max_memory,
     );
-    // initialize buffers.
+    let mut buffers = Buffers::default();
let mut del_buffers = Buffers::default();
let mut add_buffers = Buffers::default();
let mut key_buffer = Vec::new();
let mut value_buffer = Vec::new();
// initialize tokenizer.
let mut builder = tokenizer_builder(stop_words, dictionary, allowed_separators, None);
let tokenizer = builder.build();
// iterate over documents.
let mut cursor = obkv_documents.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? {
let document_id = key
.try_into()
.map(u32::from_be_bytes)
.map_err(|_| SerializationError::InvalidNumberSerialization)?;
let obkv = KvReader::<FieldId>::new(value);
// if the searchable fields didn't change, skip the searchable indexing for this document.
if !searchable_fields_changed(&KvReader::<FieldId>::new(value), searchable_fields) {
continue;
}
documents_ids.push(document_id);
// Update key buffer prefix.
key_buffer.clear();
key_buffer.extend_from_slice(&document_id.to_be_bytes());
// Tokenize deletions and additions in 2 diffferent threads.
let (del, add): (Result<_>, Result<_>) = rayon::join(
|| {
// deletions
lang_safe_tokens_from_document(
&obkv,
searchable_fields,
&tokenizer,
stop_words,
allowed_separators,
dictionary,
max_positions_per_attributes,
DelAdd::Deletion,
&mut del_buffers,
)
},
|| {
// additions
lang_safe_tokens_from_document(
&obkv,
searchable_fields,
&tokenizer,
stop_words,
allowed_separators,
dictionary,
max_positions_per_attributes,
DelAdd::Addition,
&mut add_buffers,
)
},
);
let (del_obkv, del_script_language_word_count) = del?;
let (add_obkv, add_script_language_word_count) = add?;
// merge deletions and additions.
value_buffer.clear();
del_add_from_two_obkvs(
KvReader::<FieldId>::new(del_obkv),
KvReader::<FieldId>::new(add_obkv),
&mut value_buffer,
)?;
// write them into the sorter.
let obkv = KvReader::<FieldId>::new(value);
for (field_id, value) in obkv.iter() {
key_buffer.truncate(mem::size_of::<u32>());
key_buffer.extend_from_slice(&field_id.to_be_bytes());
docid_word_positions_sorter.insert(&key_buffer, value)?;
}
// update script_language_docids deletions.
for (script, languages_frequency) in del_script_language_word_count {
for (language, _) in languages_frequency {
let entry = script_language_docids
.entry((script, language))
.or_insert_with(|| (RoaringBitmap::new(), RoaringBitmap::new()));
entry.0.push(document_id);
}
}
// update script_language_docids additions.
for (script, languages_frequency) in add_script_language_word_count {
for (language, _) in languages_frequency {
let entry = script_language_docids
.entry((script, language))
.or_insert_with(|| (RoaringBitmap::new(), RoaringBitmap::new()));
entry.1.push(document_id);
}
}
}
sorter_into_reader(docid_word_positions_sorter, indexer)
.map(|reader| (documents_ids, reader, script_language_docids))
}
/// Check if any searchable fields of a document changed.
fn searchable_fields_changed(
obkv: &KvReader<FieldId>,
searchable_fields: &Option<HashSet<FieldId>>,
) -> bool {
for (field_id, field_bytes) in obkv.iter() {
if searchable_fields.as_ref().map_or(true, |sf| sf.contains(&field_id)) {
let del_add = KvReaderDelAdd::new(field_bytes);
match (del_add.get(DelAdd::Deletion), del_add.get(DelAdd::Addition)) {
// if both fields are None, check the next field.
(None, None) => (),
// if both contains a value and values are the same, check the next field.
(Some(del), Some(add)) if del == add => (),
// otherwise the fields are different, return true.
_otherwise => return true,
}
}
}
false
}
/// Factorize tokenizer building.
fn tokenizer_builder<'a>(
stop_words: Option<&'a fst::Set<&[u8]>>,
allowed_separators: Option<&'a [&str]>,
dictionary: Option<&'a [&str]>,
script_language: Option<&'a HashMap<Script, Vec<Language>>>,
) -> TokenizerBuilder<'a, &'a [u8]> {
let mut tokenizer_builder = TokenizerBuilder::new(); let mut tokenizer_builder = TokenizerBuilder::new();
if let Some(stop_words) = stop_words { if let Some(stop_words) = stop_words {
tokenizer_builder.stop_words(stop_words); tokenizer_builder.stop_words(stop_words);
@ -193,144 +61,130 @@ fn tokenizer_builder<'a>(
if let Some(separators) = allowed_separators { if let Some(separators) = allowed_separators {
tokenizer_builder.separators(separators); tokenizer_builder.separators(separators);
} }
let tokenizer = tokenizer_builder.build();
if let Some(script_language) = script_language { let mut cursor = obkv_documents.into_cursor()?;
tokenizer_builder.allow_list(&script_language); while let Some((key, value)) = cursor.move_on_next()? {
} let document_id = key
.try_into()
.map(u32::from_be_bytes)
.map_err(|_| SerializationError::InvalidNumberSerialization)?;
let obkv = KvReader::<FieldId>::new(value);
tokenizer_builder documents_ids.push(document_id);
} buffers.key_buffer.clear();
buffers.key_buffer.extend_from_slice(&document_id.to_be_bytes());
/// Extract words maped with their positions of a document, let mut script_language_word_count = HashMap::new();
/// ensuring no Language detection mistakes was made.
fn lang_safe_tokens_from_document<'a>(
obkv: &KvReader<FieldId>,
searchable_fields: &Option<HashSet<FieldId>>,
tokenizer: &Tokenizer,
stop_words: Option<&fst::Set<&[u8]>>,
allowed_separators: Option<&[&str]>,
dictionary: Option<&[&str]>,
max_positions_per_attributes: u32,
del_add: DelAdd,
buffers: &'a mut Buffers,
) -> Result<(&'a [u8], HashMap<Script, Vec<(Language, usize)>>)> {
let mut script_language_word_count = HashMap::new();
tokens_from_document( extract_tokens_from_document(
&obkv, &obkv,
searchable_fields, searchable_fields,
&tokenizer, &tokenizer,
max_positions_per_attributes, max_positions_per_attributes,
del_add, &mut buffers,
buffers, &mut script_language_word_count,
&mut script_language_word_count, &mut docid_word_positions_sorter,
)?; )?;
// if we detect a potetial mistake in the language detection, // if we detect a potetial mistake in the language detection,
// we rerun the extraction forcing the tokenizer to detect the most frequently detected Languages. // we rerun the extraction forcing the tokenizer to detect the most frequently detected Languages.
// context: https://github.com/meilisearch/meilisearch/issues/3565 // context: https://github.com/meilisearch/meilisearch/issues/3565
if script_language_word_count if script_language_word_count
.values() .values()
.map(Vec::as_slice) .map(Vec::as_slice)
.any(potential_language_detection_error) .any(potential_language_detection_error)
{ {
// build an allow list with the most frequent detected languages in the document. // build an allow list with the most frequent detected languages in the document.
let script_language: HashMap<_, _> = let script_language: HashMap<_, _> =
script_language_word_count.iter().filter_map(most_frequent_languages).collect(); script_language_word_count.iter().filter_map(most_frequent_languages).collect();
// if the allow list is empty, meaning that no Language is considered frequent, // if the allow list is empty, meaning that no Language is considered frequent,
// then we don't rerun the extraction. // then we don't rerun the extraction.
if !script_language.is_empty() { if !script_language.is_empty() {
// build a new temporary tokenizer including the allow list. // build a new temporary tokenizer including the allow list.
let mut builder = tokenizer_builder( let mut tokenizer_builder = TokenizerBuilder::new();
stop_words, if let Some(stop_words) = stop_words {
dictionary, tokenizer_builder.stop_words(stop_words);
allowed_separators, }
Some(&script_language), tokenizer_builder.allow_list(&script_language);
); let tokenizer = tokenizer_builder.build();
let tokenizer = builder.build();
script_language_word_count.clear(); script_language_word_count.clear();
// rerun the extraction. // rerun the extraction.
tokens_from_document( extract_tokens_from_document(
&obkv, &obkv,
searchable_fields, searchable_fields,
&tokenizer, &tokenizer,
max_positions_per_attributes, max_positions_per_attributes,
del_add, &mut buffers,
buffers, &mut script_language_word_count,
&mut script_language_word_count, &mut docid_word_positions_sorter,
)?; )?;
}
}
for (script, languages_frequency) in script_language_word_count {
for (language, _) in languages_frequency {
let entry = script_language_docids
.entry((script, language))
.or_insert_with(RoaringBitmap::new);
entry.push(document_id);
}
} }
} }
Ok((&buffers.obkv_buffer, script_language_word_count)) sorter_into_reader(docid_word_positions_sorter, indexer)
.map(|reader| (documents_ids, reader, script_language_docids))
} }
/// Extract words maped with their positions of a document. fn extract_tokens_from_document(
fn tokens_from_document<'a>(
obkv: &KvReader<FieldId>, obkv: &KvReader<FieldId>,
searchable_fields: &Option<HashSet<FieldId>>, searchable_fields: &Option<HashSet<FieldId>>,
tokenizer: &Tokenizer, tokenizer: &Tokenizer,
max_positions_per_attributes: u32, max_positions_per_attributes: u32,
del_add: DelAdd, buffers: &mut Buffers,
buffers: &'a mut Buffers,
script_language_word_count: &mut HashMap<Script, Vec<(Language, usize)>>, script_language_word_count: &mut HashMap<Script, Vec<(Language, usize)>>,
) -> Result<&'a [u8]> { docid_word_positions_sorter: &mut grenad::Sorter<MergeFn>,
buffers.obkv_buffer.clear(); ) -> Result<()> {
let mut document_writer = KvWriterU16::new(&mut buffers.obkv_buffer);
for (field_id, field_bytes) in obkv.iter() { for (field_id, field_bytes) in obkv.iter() {
// if field is searchable.
if searchable_fields.as_ref().map_or(true, |sf| sf.contains(&field_id)) { if searchable_fields.as_ref().map_or(true, |sf| sf.contains(&field_id)) {
// extract deletion or addition only. let value = serde_json::from_slice(field_bytes).map_err(InternalError::SerdeJson)?;
if let Some(field_bytes) = KvReaderDelAdd::new(field_bytes).get(del_add) { buffers.field_buffer.clear();
// parse json. if let Some(field) = json_to_string(&value, &mut buffers.field_buffer) {
let value = let tokens = process_tokens(tokenizer.tokenize(field))
serde_json::from_slice(field_bytes).map_err(InternalError::SerdeJson)?; .take_while(|(p, _)| (*p as u32) < max_positions_per_attributes);
// prepare writting destination. for (index, token) in tokens {
buffers.obkv_positions_buffer.clear(); // if a language has been detected for the token, we update the counter.
let mut writer = KvWriterU16::new(&mut buffers.obkv_positions_buffer); if let Some(language) = token.language {
let script = token.script;
// convert json into an unique string. let entry =
buffers.field_buffer.clear(); script_language_word_count.entry(script).or_insert_with(Vec::new);
if let Some(field) = json_to_string(&value, &mut buffers.field_buffer) { match entry.iter_mut().find(|(l, _)| *l == language) {
// create an iterator of token with their positions. Some((_, n)) => *n += 1,
let tokens = process_tokens(tokenizer.tokenize(field)) None => entry.push((language, 1)),
.take_while(|(p, _)| (*p as u32) < max_positions_per_attributes);
for (index, token) in tokens {
// if a language has been detected for the token, we update the counter.
if let Some(language) = token.language {
let script = token.script;
let entry =
script_language_word_count.entry(script).or_insert_with(Vec::new);
match entry.iter_mut().find(|(l, _)| *l == language) {
Some((_, n)) => *n += 1,
None => entry.push((language, 1)),
}
}
// keep a word only if it is not empty and fit in a LMDB key.
let token = token.lemma().trim();
if !token.is_empty() && token.len() <= MAX_WORD_LENGTH {
let position: u16 = index
.try_into()
.map_err(|_| SerializationError::InvalidNumberSerialization)?;
writer.insert(position, token.as_bytes())?;
} }
} }
let token = token.lemma().trim();
if !token.is_empty() && token.len() <= MAX_WORD_LENGTH {
buffers.key_buffer.truncate(mem::size_of::<u32>());
buffers.key_buffer.extend_from_slice(token.as_bytes());
// write positions into document. let position: u16 = index
let positions = writer.into_inner()?; .try_into()
document_writer.insert(field_id, positions)?; .map_err(|_| SerializationError::InvalidNumberSerialization)?;
let position = absolute_from_relative_position(field_id, position);
docid_word_positions_sorter
.insert(&buffers.key_buffer, position.to_ne_bytes())?;
}
} }
} }
} }
} }
Ok(document_writer.into_inner().map(|v| v.as_slice())?) Ok(())
} }
/// Transform a JSON value into a string that can be indexed. /// Transform a JSON value into a string that can be indexed.
@ -433,10 +287,10 @@ fn compute_language_frequency_threshold(languages_frequency: &[(Language, usize)
#[derive(Default)] #[derive(Default)]
struct Buffers { struct Buffers {
// the key buffer is the concatenation of the internal document id with the field id.
// The buffer has to be completelly cleared between documents,
// and the field id part must be cleared between each field.
key_buffer: Vec<u8>,
// the field buffer for each fields desserialization, and must be cleared between each field. // the field buffer for each fields desserialization, and must be cleared between each field.
field_buffer: String, field_buffer: String,
// buffer used to store the value data containing an obkv.
obkv_buffer: Vec<u8>,
// buffer used to store the value data containing an obkv of tokens with their positions.
obkv_positions_buffer: Vec<u8>,
} }
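Both sides of this compare count, per script, how often each detected language occurs so that a suspicious detection can trigger a re-run of the tokenizer with an allow list. A small, self-contained sketch of that counting structure; plain strings stand in for charabia's Script and Language types, which is an assumption made only for the example:

use std::collections::HashMap;

// Simplified stand-ins for charabia's enums (assumption for the example only).
type Script = &'static str;
type Language = &'static str;

fn main() {
    let detections = [("Latin", "eng"), ("Latin", "fra"), ("Latin", "eng"), ("Cyrillic", "rus")];

    // For every script, keep a list of (language, occurrence count) pairs,
    // like `script_language_word_count` in the extractor above.
    let mut script_language_word_count: HashMap<Script, Vec<(Language, usize)>> = HashMap::new();
    for (script, language) in detections {
        let entry = script_language_word_count.entry(script).or_insert_with(Vec::new);
        match entry.iter_mut().find(|(l, _)| *l == language) {
            Some((_, n)) => *n += 1,
            None => entry.push((language, 1)),
        }
    }

    assert_eq!(script_language_word_count["Latin"], vec![("eng", 2), ("fra", 1)]);
}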

@@ -1,15 +1,14 @@
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
 use heed::{BytesDecode, BytesEncode};
 use super::helpers::{
-    create_sorter, merge_deladd_cbo_roaring_bitmaps, sorter_into_reader, GrenadParameters,
+    create_sorter, merge_cbo_roaring_bitmaps, sorter_into_reader, GrenadParameters,
 };
 use crate::heed_codec::facet::{
     FacetGroupKey, FacetGroupKeyCodec, FieldDocIdFacetF64Codec, OrderedF64Codec,
 };
-use crate::update::del_add::{KvReaderDelAdd, KvWriterDelAdd};
 use crate::Result;
 /// Extracts the facet number and the documents ids where this facet number appear.
@@ -18,39 +17,30 @@ use crate::Result;
 /// documents ids from the given chunk of docid facet number positions.
 #[logging_timer::time]
 pub fn extract_facet_number_docids<R: io::Read + io::Seek>(
-    fid_docid_facet_number: grenad::Reader<R>,
+    docid_fid_facet_number: grenad::Reader<R>,
     indexer: GrenadParameters,
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let max_memory = indexer.max_memory_by_thread();
     let mut facet_number_docids_sorter = create_sorter(
         grenad::SortAlgorithm::Unstable,
-        merge_deladd_cbo_roaring_bitmaps,
+        merge_cbo_roaring_bitmaps,
         indexer.chunk_compression_type,
         indexer.chunk_compression_level,
         indexer.max_nb_chunks,
         max_memory,
     );
-    let mut buffer = Vec::new();
-    let mut cursor = fid_docid_facet_number.into_cursor()?;
-    while let Some((key_bytes, deladd_obkv_bytes)) = cursor.move_on_next()? {
+    let mut cursor = docid_fid_facet_number.into_cursor()?;
+    while let Some((key_bytes, _)) = cursor.move_on_next()? {
         let (field_id, document_id, number) =
             FieldDocIdFacetF64Codec::bytes_decode(key_bytes).unwrap();
         let key = FacetGroupKey { field_id, level: 0, left_bound: number };
         let key_bytes = FacetGroupKeyCodec::<OrderedF64Codec>::bytes_encode(&key).unwrap();
-        buffer.clear();
-        let mut obkv = KvWriterDelAdd::new(&mut buffer);
-        for (deladd_key, _) in KvReaderDelAdd::new(deladd_obkv_bytes).iter() {
-            obkv.insert(deladd_key, document_id.to_ne_bytes())?;
-        }
-        obkv.finish()?;
-        facet_number_docids_sorter.insert(key_bytes, &buffer)?;
+        facet_number_docids_sorter.insert(key_bytes, document_id.to_ne_bytes())?;
     }
     sorter_into_reader(facet_number_docids_sorter, indexer)
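On the head side of this hunk, each document id is inserted as native-endian bytes under its facet key and the sorter's merge function is left to fold duplicate keys into a single bitmap. A toy illustration of what that folding amounts to, using only the roaring crate; the real merge functions live in the helpers module and also handle the CBO encoding:

use roaring::RoaringBitmap;

fn main() {
    // Two partial postings for the same facet key, e.g. produced by two chunks.
    let mut first = RoaringBitmap::new();
    first.insert(1);
    first.insert(7);
    let mut second = RoaringBitmap::new();
    second.insert(7);
    second.insert(42);

    // Merging duplicate keys boils down to a union of their bitmaps.
    let merged = first | second;
    assert_eq!(merged.iter().collect::<Vec<u32>>(), vec![1, 7, 42]);
}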

@@ -1,14 +1,13 @@
 use std::fs::File;
-use std::{io, str};
+use std::io::{self, BufReader};
 use heed::BytesEncode;
 use super::helpers::{create_sorter, sorter_into_reader, try_split_array_at, GrenadParameters};
 use crate::heed_codec::facet::{FacetGroupKey, FacetGroupKeyCodec};
 use crate::heed_codec::StrRefCodec;
-use crate::update::del_add::{KvReaderDelAdd, KvWriterDelAdd};
-use crate::update::index_documents::helpers::merge_deladd_cbo_roaring_bitmaps;
-use crate::{FieldId, Result};
+use crate::update::index_documents::merge_cbo_roaring_bitmaps;
+use crate::{FieldId, Result, MAX_FACET_VALUE_LENGTH};
 /// Extracts the facet string and the documents ids where this facet string appear.
 ///
@@ -18,23 +17,22 @@ use crate::{FieldId, Result};
 pub fn extract_facet_string_docids<R: io::Read + io::Seek>(
     docid_fid_facet_string: grenad::Reader<R>,
     indexer: GrenadParameters,
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let max_memory = indexer.max_memory_by_thread();
     let mut facet_string_docids_sorter = create_sorter(
         grenad::SortAlgorithm::Stable,
-        merge_deladd_cbo_roaring_bitmaps,
+        merge_cbo_roaring_bitmaps,
         indexer.chunk_compression_type,
         indexer.chunk_compression_level,
         indexer.max_nb_chunks,
         max_memory,
     );
-    let mut buffer = Vec::new();
     let mut cursor = docid_fid_facet_string.into_cursor()?;
-    while let Some((key, deladd_original_value_bytes)) = cursor.move_on_next()? {
+    while let Some((key, _original_value_bytes)) = cursor.move_on_next()? {
         let (field_id_bytes, bytes) = try_split_array_at(key).unwrap();
         let field_id = FieldId::from_be_bytes(field_id_bytes);
@@ -42,17 +40,21 @@ pub fn extract_facet_string_docids<R: io::Read + io::Seek>(
             try_split_array_at::<_, 4>(bytes).unwrap();
         let document_id = u32::from_be_bytes(document_id_bytes);
-        let normalized_value = str::from_utf8(normalized_value_bytes)?;
-        let key = FacetGroupKey { field_id, level: 0, left_bound: normalized_value };
-        let key_bytes = FacetGroupKeyCodec::<StrRefCodec>::bytes_encode(&key).unwrap();
-        buffer.clear();
-        let mut obkv = KvWriterDelAdd::new(&mut buffer);
-        for (deladd_key, _) in KvReaderDelAdd::new(deladd_original_value_bytes).iter() {
-            obkv.insert(deladd_key, document_id.to_ne_bytes())?;
-        }
-        obkv.finish()?;
-        facet_string_docids_sorter.insert(&key_bytes, &buffer)?;
+        let mut normalised_value = std::str::from_utf8(normalized_value_bytes)?;
+        let normalised_truncated_value: String;
+        if normalised_value.len() > MAX_FACET_VALUE_LENGTH {
+            normalised_truncated_value = normalised_value
+                .char_indices()
+                .take_while(|(idx, _)| *idx < MAX_FACET_VALUE_LENGTH)
+                .map(|(_, c)| c)
+                .collect();
+            normalised_value = normalised_truncated_value.as_str();
+        }
+        let key = FacetGroupKey { field_id, level: 0, left_bound: normalised_value };
+        let key_bytes = FacetGroupKeyCodec::<StrRefCodec>::bytes_encode(&key).unwrap();
+        // document id is encoded in native-endian because of the CBO roaring bitmap codec
+        facet_string_docids_sorter.insert(&key_bytes, document_id.to_ne_bytes())?;
     }
     sorter_into_reader(facet_string_docids_sorter, indexer)
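The truncation added on the head side cuts the normalized facet string at character boundaries so the resulting key stays under the LMDB key-size limit. A standalone sketch of the same idiom; the constant here is a made-up stand-in for milli's MAX_FACET_VALUE_LENGTH:

fn main() {
    // Stand-in for milli's MAX_FACET_VALUE_LENGTH (assumption; the real
    // constant is larger and tied to the LMDB key size limit).
    const MAX_FACET_VALUE_LENGTH: usize = 8;

    let normalized = "héllo wörld, this is far too long";

    // Keep whole characters only: `char_indices` yields byte offsets, so the
    // cut never lands in the middle of a multi-byte character.
    let truncated: String = normalized
        .char_indices()
        .take_while(|(idx, _)| *idx < MAX_FACET_VALUE_LENGTH)
        .map(|(_, c)| c)
        .collect();

    // A final multi-byte character can overshoot by at most three bytes, which
    // is why the other extractor below uses `idx + 4 < MAX_FACET_VALUE_LENGTH`.
    assert!(truncated.len() <= MAX_FACET_VALUE_LENGTH + 3);
    assert_eq!(truncated, "héllo w");
}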

@@ -1,39 +1,27 @@
-use std::borrow::Cow;
 use std::collections::{BTreeMap, HashSet};
 use std::convert::TryInto;
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
 use std::mem::size_of;
-use std::result::Result as StdResult;
-use grenad::Sorter;
 use heed::zerocopy::AsBytes;
 use heed::BytesEncode;
-use itertools::EitherOrBoth;
-use ordered_float::OrderedFloat;
 use roaring::RoaringBitmap;
 use serde_json::{from_slice, Value};
-use FilterableValues::{Empty, Null, Values};
 use super::helpers::{create_sorter, keep_first, sorter_into_reader, GrenadParameters};
 use crate::error::InternalError;
 use crate::facet::value_encoding::f64_into_bytes;
-use crate::update::del_add::{DelAdd, KvWriterDelAdd};
 use crate::update::index_documents::{create_writer, writer_into_reader};
-use crate::{
-    CboRoaringBitmapCodec, DocumentId, Error, FieldId, Result, BEU32, MAX_FACET_VALUE_LENGTH,
-};
+use crate::{CboRoaringBitmapCodec, DocumentId, FieldId, Result, BEU32, MAX_FACET_VALUE_LENGTH};
-/// The length of the elements that are always in the buffer when inserting new values.
-const TRUNCATE_SIZE: usize = size_of::<FieldId>() + size_of::<DocumentId>();
 /// The extracted facet values stored in grenad files by type.
 pub struct ExtractedFacetValues {
-    pub fid_docid_facet_numbers_chunk: grenad::Reader<File>,
-    pub fid_docid_facet_strings_chunk: grenad::Reader<File>,
-    pub fid_facet_is_null_docids_chunk: grenad::Reader<File>,
-    pub fid_facet_is_empty_docids_chunk: grenad::Reader<File>,
-    pub fid_facet_exists_docids_chunk: grenad::Reader<File>,
+    pub docid_fid_facet_numbers_chunk: grenad::Reader<BufReader<File>>,
+    pub docid_fid_facet_strings_chunk: grenad::Reader<BufReader<File>>,
+    pub fid_facet_is_null_docids_chunk: grenad::Reader<BufReader<File>>,
+    pub fid_facet_is_empty_docids_chunk: grenad::Reader<BufReader<File>>,
+    pub fid_facet_exists_docids_chunk: grenad::Reader<BufReader<File>>,
 }
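On the base side of this compare, the exists/is-null/is-empty maps keep one pair of bitmaps per field: the first for document ids whose value is being removed, the second for ids whose value is being added. A minimal sketch of that bookkeeping, assuming only the roaring crate and a plain BTreeMap:

use std::collections::BTreeMap;

use roaring::RoaringBitmap;

type FieldId = u16;

fn main() {
    // One (Del, Add) bitmap pair per field id, as on the base branch.
    let mut facet_exists_docids: BTreeMap<FieldId, (RoaringBitmap, RoaringBitmap)> =
        BTreeMap::new();

    let field_id: FieldId = 3;
    let document: u32 = 17;
    let had_old_value = true;  // the field existed in the previous version of the document
    let has_new_value = false; // ...and was removed by the update

    let (del_exists, add_exists) = facet_exists_docids.entry(field_id).or_default();
    if had_old_value {
        del_exists.insert(document);
    }
    if has_new_value {
        add_exists.insert(document);
    }

    assert!(facet_exists_docids[&field_id].0.contains(document));
    assert!(!facet_exists_docids[&field_id].1.contains(document));
}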
/// Extracts the facet values of each faceted field of each document. /// Extracts the facet values of each faceted field of each document.
@ -70,150 +58,71 @@ pub fn extract_fid_docid_facet_values<R: io::Read + io::Seek>(
max_memory.map(|m| m / 2), max_memory.map(|m| m / 2),
); );
// The tuples represents the Del and Add side for a bitmap let mut facet_exists_docids = BTreeMap::<FieldId, RoaringBitmap>::new();
let mut facet_exists_docids = BTreeMap::<FieldId, (RoaringBitmap, RoaringBitmap)>::new(); let mut facet_is_null_docids = BTreeMap::<FieldId, RoaringBitmap>::new();
let mut facet_is_null_docids = BTreeMap::<FieldId, (RoaringBitmap, RoaringBitmap)>::new(); let mut facet_is_empty_docids = BTreeMap::<FieldId, RoaringBitmap>::new();
let mut facet_is_empty_docids = BTreeMap::<FieldId, (RoaringBitmap, RoaringBitmap)>::new();
// We create two buffer for mutable ref issues with closures.
let mut numbers_key_buffer = Vec::new();
let mut strings_key_buffer = Vec::new();
let mut key_buffer = Vec::new();
let mut cursor = obkv_documents.into_cursor()?; let mut cursor = obkv_documents.into_cursor()?;
while let Some((docid_bytes, value)) = cursor.move_on_next()? { while let Some((docid_bytes, value)) = cursor.move_on_next()? {
let obkv = obkv::KvReader::new(value); let obkv = obkv::KvReader::new(value);
for (field_id, field_bytes) in obkv.iter() { for (field_id, field_bytes) in obkv.iter() {
if faceted_fields.contains(&field_id) { if faceted_fields.contains(&field_id) {
numbers_key_buffer.clear(); key_buffer.clear();
strings_key_buffer.clear();
// Set key to the field_id // Set key to the field_id
// Note: this encoding is consistent with FieldIdCodec // Note: this encoding is consistent with FieldIdCodec
numbers_key_buffer.extend_from_slice(&field_id.to_be_bytes()); key_buffer.extend_from_slice(&field_id.to_be_bytes());
strings_key_buffer.extend_from_slice(&field_id.to_be_bytes());
// Here, we know already that the document must be added to the “field id exists” database
let document: [u8; 4] = docid_bytes[..4].try_into().ok().unwrap(); let document: [u8; 4] = docid_bytes[..4].try_into().ok().unwrap();
let document = BEU32::from(document).get(); let document = BEU32::from(document).get();
facet_exists_docids.entry(field_id).or_default().insert(document);
// For the other extraction tasks, prefix the key with the field_id and the document_id // For the other extraction tasks, prefix the key with the field_id and the document_id
numbers_key_buffer.extend_from_slice(docid_bytes); key_buffer.extend_from_slice(docid_bytes);
strings_key_buffer.extend_from_slice(docid_bytes);
let del_add_obkv = obkv::KvReader::new(field_bytes); let value = from_slice(field_bytes).map_err(InternalError::SerdeJson)?;
let del_value = match del_add_obkv.get(DelAdd::Deletion) {
Some(bytes) => from_slice(bytes).map_err(InternalError::SerdeJson)?,
None => None,
};
let add_value = match del_add_obkv.get(DelAdd::Addition) {
Some(bytes) => from_slice(bytes).map_err(InternalError::SerdeJson)?,
None => None,
};
// We insert the document id on the Del and the Add side if the field exists. match extract_facet_values(
let (ref mut del_exists, ref mut add_exists) = &value,
facet_exists_docids.entry(field_id).or_default(); geo_fields_ids.map_or(false, |(lat, lng)| field_id == lat || field_id == lng),
let (ref mut del_is_null, ref mut add_is_null) = ) {
facet_is_null_docids.entry(field_id).or_default(); FilterableValues::Null => {
let (ref mut del_is_empty, ref mut add_is_empty) = facet_is_null_docids.entry(field_id).or_default().insert(document);
facet_is_empty_docids.entry(field_id).or_default(); }
FilterableValues::Empty => {
facet_is_empty_docids.entry(field_id).or_default().insert(document);
}
FilterableValues::Values { numbers, strings } => {
// insert facet numbers in sorter
for number in numbers {
key_buffer.truncate(size_of::<FieldId>() + size_of::<DocumentId>());
if let Some(value_bytes) = f64_into_bytes(number) {
key_buffer.extend_from_slice(&value_bytes);
key_buffer.extend_from_slice(&number.to_be_bytes());
if del_value.is_some() { fid_docid_facet_numbers_sorter
del_exists.insert(document); .insert(&key_buffer, ().as_bytes())?;
} }
if add_value.is_some() { }
add_exists.insert(document);
}
let geo_support = // insert normalized and original facet string in sorter
geo_fields_ids.map_or(false, |(lat, lng)| field_id == lat || field_id == lng); for (normalized, original) in
let del_filterable_values = strings.into_iter().filter(|(n, _)| !n.is_empty())
del_value.map(|value| extract_facet_values(&value, geo_support)); {
let add_filterable_values = let normalized_truncated_value: String = normalized
add_value.map(|value| extract_facet_values(&value, geo_support)); .char_indices()
.take_while(|(idx, _)| idx + 4 < MAX_FACET_VALUE_LENGTH)
.map(|(_, c)| c)
.collect();
// Those closures are just here to simplify things a bit. key_buffer.truncate(size_of::<FieldId>() + size_of::<DocumentId>());
let mut insert_numbers_diff = |del_numbers, add_numbers| { key_buffer.extend_from_slice(normalized_truncated_value.as_bytes());
insert_numbers_diff( fid_docid_facet_strings_sorter
&mut fid_docid_facet_numbers_sorter, .insert(&key_buffer, original.as_bytes())?;
&mut numbers_key_buffer,
del_numbers,
add_numbers,
)
};
let mut insert_strings_diff = |del_strings, add_strings| {
insert_strings_diff(
&mut fid_docid_facet_strings_sorter,
&mut strings_key_buffer,
del_strings,
add_strings,
)
};
match (del_filterable_values, add_filterable_values) {
(None, None) => (),
(Some(del_filterable_values), None) => match del_filterable_values {
Null => {
del_is_null.insert(document);
}
Empty => {
del_is_empty.insert(document);
}
Values { numbers, strings } => {
insert_numbers_diff(numbers, vec![])?;
insert_strings_diff(strings, vec![])?;
}
},
(None, Some(add_filterable_values)) => match add_filterable_values {
Null => {
add_is_null.insert(document);
}
Empty => {
add_is_empty.insert(document);
}
Values { numbers, strings } => {
insert_numbers_diff(vec![], numbers)?;
insert_strings_diff(vec![], strings)?;
}
},
(Some(del_filterable_values), Some(add_filterable_values)) => {
match (del_filterable_values, add_filterable_values) {
(Null, Null) | (Empty, Empty) => (),
(Null, Empty) => {
del_is_null.insert(document);
add_is_empty.insert(document);
}
(Empty, Null) => {
del_is_empty.insert(document);
add_is_null.insert(document);
}
(Null, Values { numbers, strings }) => {
insert_numbers_diff(vec![], numbers)?;
insert_strings_diff(vec![], strings)?;
del_is_null.insert(document);
}
(Empty, Values { numbers, strings }) => {
insert_numbers_diff(vec![], numbers)?;
insert_strings_diff(vec![], strings)?;
del_is_empty.insert(document);
}
(Values { numbers, strings }, Null) => {
add_is_null.insert(document);
insert_numbers_diff(numbers, vec![])?;
insert_strings_diff(strings, vec![])?;
}
(Values { numbers, strings }, Empty) => {
add_is_empty.insert(document);
insert_numbers_diff(numbers, vec![])?;
insert_strings_diff(strings, vec![])?;
}
(
Values { numbers: del_numbers, strings: del_strings },
Values { numbers: add_numbers, strings: add_strings },
) => {
insert_numbers_diff(del_numbers, add_numbers)?;
insert_strings_diff(del_strings, add_strings)?;
}
} }
} }
} }
@@ -221,15 +130,14 @@ pub fn extract_fid_docid_facet_values<R: io::Read + io::Seek>(
         }
     }
-    let mut buffer = Vec::new();
     let mut facet_exists_docids_writer = create_writer(
         indexer.chunk_compression_type,
         indexer.chunk_compression_level,
         tempfile::tempfile()?,
     );
-    for (fid, (del_bitmap, add_bitmap)) in facet_exists_docids.into_iter() {
-        deladd_obkv_cbo_roaring_bitmaps(&mut buffer, &del_bitmap, &add_bitmap)?;
-        facet_exists_docids_writer.insert(fid.to_be_bytes(), &buffer)?;
+    for (fid, bitmap) in facet_exists_docids.into_iter() {
+        let bitmap_bytes = CboRoaringBitmapCodec::bytes_encode(&bitmap).unwrap();
+        facet_exists_docids_writer.insert(fid.to_be_bytes(), &bitmap_bytes)?;
     }
     let facet_exists_docids_reader = writer_into_reader(facet_exists_docids_writer)?;
@@ -238,9 +146,9 @@ pub fn extract_fid_docid_facet_values<R: io::Read + io::Seek>(
         indexer.chunk_compression_level,
         tempfile::tempfile()?,
     );
-    for (fid, (del_bitmap, add_bitmap)) in facet_is_null_docids.into_iter() {
-        deladd_obkv_cbo_roaring_bitmaps(&mut buffer, &del_bitmap, &add_bitmap)?;
-        facet_is_null_docids_writer.insert(fid.to_be_bytes(), &buffer)?;
+    for (fid, bitmap) in facet_is_null_docids.into_iter() {
+        let bitmap_bytes = CboRoaringBitmapCodec::bytes_encode(&bitmap).unwrap();
+        facet_is_null_docids_writer.insert(fid.to_be_bytes(), &bitmap_bytes)?;
     }
     let facet_is_null_docids_reader = writer_into_reader(facet_is_null_docids_writer)?;
@@ -249,156 +157,21 @@ pub fn extract_fid_docid_facet_values<R: io::Read + io::Seek>(
         indexer.chunk_compression_level,
         tempfile::tempfile()?,
     );
-    for (fid, (del_bitmap, add_bitmap)) in facet_is_empty_docids.into_iter() {
-        deladd_obkv_cbo_roaring_bitmaps(&mut buffer, &del_bitmap, &add_bitmap)?;
-        facet_is_empty_docids_writer.insert(fid.to_be_bytes(), &buffer)?;
+    for (fid, bitmap) in facet_is_empty_docids.into_iter() {
+        let bitmap_bytes = CboRoaringBitmapCodec::bytes_encode(&bitmap).unwrap();
+        facet_is_empty_docids_writer.insert(fid.to_be_bytes(), &bitmap_bytes)?;
     }
     let facet_is_empty_docids_reader = writer_into_reader(facet_is_empty_docids_writer)?;
     Ok(ExtractedFacetValues {
-        fid_docid_facet_numbers_chunk: sorter_into_reader(fid_docid_facet_numbers_sorter, indexer)?,
-        fid_docid_facet_strings_chunk: sorter_into_reader(fid_docid_facet_strings_sorter, indexer)?,
+        docid_fid_facet_numbers_chunk: sorter_into_reader(fid_docid_facet_numbers_sorter, indexer)?,
+        docid_fid_facet_strings_chunk: sorter_into_reader(fid_docid_facet_strings_sorter, indexer)?,
         fid_facet_is_null_docids_chunk: facet_is_null_docids_reader,
         fid_facet_is_empty_docids_chunk: facet_is_empty_docids_reader,
         fid_facet_exists_docids_chunk: facet_exists_docids_reader,
     })
 }
/// Generates a vector of bytes containing a DelAdd obkv with two bitmaps.
fn deladd_obkv_cbo_roaring_bitmaps(
buffer: &mut Vec<u8>,
del_bitmap: &RoaringBitmap,
add_bitmap: &RoaringBitmap,
) -> io::Result<()> {
buffer.clear();
let mut obkv = KvWriterDelAdd::new(buffer);
let del_bitmap_bytes = CboRoaringBitmapCodec::bytes_encode(del_bitmap).unwrap();
let add_bitmap_bytes = CboRoaringBitmapCodec::bytes_encode(add_bitmap).unwrap();
obkv.insert(DelAdd::Deletion, del_bitmap_bytes)?;
obkv.insert(DelAdd::Addition, add_bitmap_bytes)?;
obkv.finish()
}
/// Truncates a string to the biggest valid LMDB key size.
fn truncate_string(s: String) -> String {
s.char_indices()
.take_while(|(idx, _)| idx + 4 < MAX_FACET_VALUE_LENGTH)
.map(|(_, c)| c)
.collect()
}
/// Computes the diff between both Del and Add numbers and
/// only inserts the parts that differ in the sorter.
fn insert_numbers_diff<MF>(
fid_docid_facet_numbers_sorter: &mut Sorter<MF>,
key_buffer: &mut Vec<u8>,
mut del_numbers: Vec<f64>,
mut add_numbers: Vec<f64>,
) -> Result<()>
where
MF: for<'a> Fn(&[u8], &[Cow<'a, [u8]>]) -> StdResult<Cow<'a, [u8]>, Error>,
{
// We sort and dedup the float numbers
del_numbers.sort_unstable_by_key(|f| OrderedFloat(*f));
add_numbers.sort_unstable_by_key(|f| OrderedFloat(*f));
del_numbers.dedup_by_key(|f| OrderedFloat(*f));
add_numbers.dedup_by_key(|f| OrderedFloat(*f));
let merged_numbers_iter = itertools::merge_join_by(
del_numbers.into_iter().map(OrderedFloat),
add_numbers.into_iter().map(OrderedFloat),
|del, add| del.cmp(add),
);
// insert facet numbers in sorter
for eob in merged_numbers_iter {
key_buffer.truncate(TRUNCATE_SIZE);
match eob {
EitherOrBoth::Both(_, _) => (), // no need to touch anything
EitherOrBoth::Left(OrderedFloat(number)) => {
if let Some(value_bytes) = f64_into_bytes(number) {
key_buffer.extend_from_slice(&value_bytes);
key_buffer.extend_from_slice(&number.to_be_bytes());
// We insert only the Del part of the Obkv to inform
// that we only want to remove all those numbers.
let mut obkv = KvWriterDelAdd::memory();
obkv.insert(DelAdd::Deletion, ().as_bytes())?;
let bytes = obkv.into_inner()?;
fid_docid_facet_numbers_sorter.insert(&key_buffer, bytes)?;
}
}
EitherOrBoth::Right(OrderedFloat(number)) => {
if let Some(value_bytes) = f64_into_bytes(number) {
key_buffer.extend_from_slice(&value_bytes);
key_buffer.extend_from_slice(&number.to_be_bytes());
// We insert only the Del part of the Obkv to inform
// that we only want to remove all those numbers.
let mut obkv = KvWriterDelAdd::memory();
obkv.insert(DelAdd::Addition, ().as_bytes())?;
let bytes = obkv.into_inner()?;
fid_docid_facet_numbers_sorter.insert(&key_buffer, bytes)?;
}
}
}
}
Ok(())
}
/// Computes the diff between both Del and Add strings and
/// only inserts the parts that differ in the sorter.
fn insert_strings_diff<MF>(
fid_docid_facet_strings_sorter: &mut Sorter<MF>,
key_buffer: &mut Vec<u8>,
mut del_strings: Vec<(String, String)>,
mut add_strings: Vec<(String, String)>,
) -> Result<()>
where
MF: for<'a> Fn(&[u8], &[Cow<'a, [u8]>]) -> StdResult<Cow<'a, [u8]>, Error>,
{
// We sort and dedup the normalized and original strings
del_strings.sort_unstable();
add_strings.sort_unstable();
del_strings.dedup();
add_strings.dedup();
let merged_strings_iter = itertools::merge_join_by(
del_strings.into_iter().filter(|(n, _)| !n.is_empty()),
add_strings.into_iter().filter(|(n, _)| !n.is_empty()),
|del, add| del.cmp(add),
);
// insert normalized and original facet string in sorter
for eob in merged_strings_iter {
key_buffer.truncate(TRUNCATE_SIZE);
match eob {
EitherOrBoth::Both(_, _) => (), // no need to touch anything
EitherOrBoth::Left((normalized, original)) => {
let truncated = truncate_string(normalized);
key_buffer.extend_from_slice(truncated.as_bytes());
let mut obkv = KvWriterDelAdd::memory();
obkv.insert(DelAdd::Deletion, original)?;
let bytes = obkv.into_inner()?;
fid_docid_facet_strings_sorter.insert(&key_buffer, bytes)?;
}
EitherOrBoth::Right((normalized, original)) => {
let truncated = truncate_string(normalized);
key_buffer.extend_from_slice(truncated.as_bytes());
let mut obkv = KvWriterDelAdd::memory();
obkv.insert(DelAdd::Addition, original)?;
let bytes = obkv.into_inner()?;
fid_docid_facet_strings_sorter.insert(&key_buffer, bytes)?;
}
}
}
Ok(())
}
/// Represent what a document field contains. /// Represent what a document field contains.
enum FilterableValues { enum FilterableValues {
/// Corresponds to the JSON `null` value. /// Corresponds to the JSON `null` value.
@ -409,7 +182,6 @@ enum FilterableValues {
Values { numbers: Vec<f64>, strings: Vec<(String, String)> }, Values { numbers: Vec<f64>, strings: Vec<(String, String)> },
} }
/// Extracts the facet values of a JSON field.
fn extract_facet_values(value: &Value, geo_field: bool) -> FilterableValues { fn extract_facet_values(value: &Value, geo_field: bool) -> FilterableValues {
fn inner_extract_facet_values( fn inner_extract_facet_values(
value: &Value, value: &Value,

@@ -1,17 +1,16 @@
+use std::collections::HashMap;
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
-use obkv::KvReaderU16;
+use grenad::Sorter;
 use super::helpers::{
-    create_sorter, merge_cbo_roaring_bitmaps, sorter_into_reader, try_split_array_at,
-    GrenadParameters,
+    create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
+    try_split_array_at, GrenadParameters, MergeFn,
 };
 use crate::error::SerializationError;
 use crate::index::db_name::DOCID_WORD_POSITIONS;
-use crate::Result;
+use crate::{relative_from_absolute_position, DocumentId, FieldId, Result};
-const MAX_COUNTED_WORDS: usize = 30;
 /// Extracts the field id word count and the documents ids where
 /// this field id with this amount of words appear.
@@ -22,7 +21,7 @@ const MAX_COUNTED_WORDS: usize = 30;
 pub fn extract_fid_word_count_docids<R: io::Read + io::Seek>(
     docid_word_positions: grenad::Reader<R>,
     indexer: GrenadParameters,
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let max_memory = indexer.max_memory_by_thread();
@ -36,21 +35,63 @@ pub fn extract_fid_word_count_docids<R: io::Read + io::Seek>(
max_memory, max_memory,
); );
let mut key_buffer = Vec::new(); // This map is assumed to not consume a lot of memory.
let mut document_fid_wordcount = HashMap::new();
let mut current_document_id = None;
let mut cursor = docid_word_positions.into_cursor()?; let mut cursor = docid_word_positions.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? { while let Some((key, value)) = cursor.move_on_next()? {
let (document_id_bytes, fid_bytes) = try_split_array_at(key) let (document_id_bytes, _word_bytes) = try_split_array_at(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?; .ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = u32::from_be_bytes(document_id_bytes); let document_id = u32::from_be_bytes(document_id_bytes);
let word_count = KvReaderU16::new(&value).iter().take(MAX_COUNTED_WORDS + 1).count(); let curr_document_id = *current_document_id.get_or_insert(document_id);
if word_count <= MAX_COUNTED_WORDS { if curr_document_id != document_id {
key_buffer.clear(); drain_document_fid_wordcount_into_sorter(
key_buffer.extend_from_slice(fid_bytes); &mut fid_word_count_docids_sorter,
key_buffer.push(word_count as u8); &mut document_fid_wordcount,
fid_word_count_docids_sorter.insert(&key_buffer, document_id.to_ne_bytes())?; curr_document_id,
)?;
current_document_id = Some(document_id);
} }
for position in read_u32_ne_bytes(value) {
let (field_id, _) = relative_from_absolute_position(position);
let value = document_fid_wordcount.entry(field_id as FieldId).or_insert(0);
*value += 1;
}
}
if let Some(document_id) = current_document_id {
// We must make sure that don't lose the current document field id
// word count map if we break because we reached the end of the chunk.
drain_document_fid_wordcount_into_sorter(
&mut fid_word_count_docids_sorter,
&mut document_fid_wordcount,
document_id,
)?;
} }
sorter_into_reader(fid_word_count_docids_sorter, indexer) sorter_into_reader(fid_word_count_docids_sorter, indexer)
} }
fn drain_document_fid_wordcount_into_sorter(
fid_word_count_docids_sorter: &mut Sorter<MergeFn>,
document_fid_wordcount: &mut HashMap<FieldId, u32>,
document_id: DocumentId,
) -> Result<()> {
let mut key_buffer = Vec::new();
for (fid, count) in document_fid_wordcount.drain() {
if count <= 30 {
key_buffer.clear();
key_buffer.extend_from_slice(&fid.to_be_bytes());
key_buffer.push(count as u8);
fid_word_count_docids_sorter.insert(&key_buffer, document_id.to_ne_bytes())?;
}
}
Ok(())
}
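Both sides bucket documents by their per-field word count, capped at 30, and encode the bucket directly in the key: the big-endian field id followed by the count as a single byte. A small sketch of that key layout; the helper name below is invented for the example:

type FieldId = u16;

// Builds a `fid_word_count_docids`-style key: big-endian field id + capped count byte.
fn word_count_key(field_id: FieldId, word_count: usize) -> Option<Vec<u8>> {
    const MAX_COUNTED_WORDS: usize = 30;
    // Fields with more words than the cap are simply not indexed in this database.
    if word_count > MAX_COUNTED_WORDS {
        return None;
    }
    let mut key = Vec::with_capacity(std::mem::size_of::<FieldId>() + 1);
    key.extend_from_slice(&field_id.to_be_bytes());
    key.push(word_count as u8);
    Some(key)
}

fn main() {
    assert_eq!(word_count_key(2, 5), Some(vec![0, 2, 5]));
    assert_eq!(word_count_key(2, 31), None);
}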

@@ -1,5 +1,5 @@
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
 use concat_arrays::concat_arrays;
 use serde_json::Value;
@@ -18,7 +18,7 @@ pub fn extract_geo_points<R: io::Read + io::Seek>(
     indexer: GrenadParameters,
     primary_key_id: FieldId,
     (lat_fid, lng_fid): (FieldId, FieldId),
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let mut writer = create_writer(

@@ -1,6 +1,6 @@
 use std::convert::TryFrom;
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
 use bytemuck::cast_slice;
 use serde_json::{from_slice, Value};
@@ -18,7 +18,7 @@ pub fn extract_vector_points<R: io::Read + io::Seek>(
     indexer: GrenadParameters,
     primary_key_id: FieldId,
     vectors_fid: FieldId,
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let mut writer = create_writer(

@@ -1,20 +1,18 @@
-use std::collections::{BTreeSet, HashSet};
+use std::collections::HashSet;
 use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
+use std::iter::FromIterator;
-use heed::BytesDecode;
-use obkv::KvReaderU16;
+use roaring::RoaringBitmap;
 use super::helpers::{
-    create_sorter, create_writer, merge_deladd_cbo_roaring_bitmaps, sorter_into_reader,
-    try_split_array_at, writer_into_reader, GrenadParameters,
+    create_sorter, merge_roaring_bitmaps, serialize_roaring_bitmap, sorter_into_reader,
+    try_split_array_at, GrenadParameters,
 };
 use crate::error::SerializationError;
-use crate::heed_codec::StrBEU16Codec;
 use crate::index::db_name::DOCID_WORD_POSITIONS;
-use crate::update::del_add::{is_noop_del_add_obkv, DelAdd, KvReaderDelAdd, KvWriterDelAdd};
-use crate::update::MergeFn;
-use crate::{DocumentId, FieldId, Result};
+use crate::update::index_documents::helpers::read_u32_ne_bytes;
+use crate::{relative_from_absolute_position, FieldId, Result};
 /// Extracts the word and the documents ids where this word appear.
 ///
@@ -28,148 +26,65 @@ pub fn extract_word_docids<R: io::Read + io::Seek>(
     docid_word_positions: grenad::Reader<R>,
     indexer: GrenadParameters,
     exact_attributes: &HashSet<FieldId>,
-) -> Result<(grenad::Reader<File>, grenad::Reader<File>, grenad::Reader<File>)> {
+) -> Result<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)> {
     puffin::profile_function!();
     let max_memory = indexer.max_memory_by_thread();
let mut word_fid_docids_sorter = create_sorter(
grenad::SortAlgorithm::Unstable,
merge_deladd_cbo_roaring_bitmaps,
indexer.chunk_compression_type,
indexer.chunk_compression_level,
indexer.max_nb_chunks,
max_memory.map(|x| x / 3),
);
let mut key_buffer = Vec::new();
let mut del_words = BTreeSet::new();
let mut add_words = BTreeSet::new();
let mut cursor = docid_word_positions.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? {
let (document_id_bytes, fid_bytes) = try_split_array_at(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let (fid_bytes, _) = try_split_array_at(fid_bytes)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = u32::from_be_bytes(document_id_bytes);
let fid = u16::from_be_bytes(fid_bytes);
let del_add_reader = KvReaderDelAdd::new(&value);
// extract all unique words to remove.
if let Some(deletion) = del_add_reader.get(DelAdd::Deletion) {
for (_pos, word) in KvReaderU16::new(&deletion).iter() {
del_words.insert(word.to_vec());
}
}
// extract all unique additional words.
if let Some(addition) = del_add_reader.get(DelAdd::Addition) {
for (_pos, word) in KvReaderU16::new(&addition).iter() {
add_words.insert(word.to_vec());
}
}
words_into_sorter(
document_id,
fid,
&mut key_buffer,
&del_words,
&add_words,
&mut word_fid_docids_sorter,
)?;
del_words.clear();
add_words.clear();
}
let mut word_docids_sorter = create_sorter( let mut word_docids_sorter = create_sorter(
grenad::SortAlgorithm::Unstable, grenad::SortAlgorithm::Unstable,
merge_deladd_cbo_roaring_bitmaps, merge_roaring_bitmaps,
indexer.chunk_compression_type, indexer.chunk_compression_type,
indexer.chunk_compression_level, indexer.chunk_compression_level,
indexer.max_nb_chunks, indexer.max_nb_chunks,
max_memory.map(|x| x / 3), max_memory.map(|x| x / 2),
); );
let mut exact_word_docids_sorter = create_sorter( let mut exact_word_docids_sorter = create_sorter(
grenad::SortAlgorithm::Unstable, grenad::SortAlgorithm::Unstable,
merge_deladd_cbo_roaring_bitmaps, merge_roaring_bitmaps,
indexer.chunk_compression_type, indexer.chunk_compression_type,
indexer.chunk_compression_level, indexer.chunk_compression_level,
indexer.max_nb_chunks, indexer.max_nb_chunks,
max_memory.map(|x| x / 3), max_memory.map(|x| x / 2),
); );
let mut word_fid_docids_writer = create_writer( let mut value_buffer = Vec::new();
indexer.chunk_compression_type, let mut cursor = docid_word_positions.into_cursor()?;
indexer.chunk_compression_level, while let Some((key, positions)) = cursor.move_on_next()? {
tempfile::tempfile()?, let (document_id_bytes, word_bytes) = try_split_array_at(key)
);
let mut iter = word_fid_docids_sorter.into_stream_merger_iter()?;
// TODO: replace sorters by writers by accumulating values into a buffer before inserting them.
while let Some((key, value)) = iter.next()? {
// only keep the value if their is a change to apply in the DB.
if !is_noop_del_add_obkv(KvReaderDelAdd::new(value)) {
word_fid_docids_writer.insert(key, value)?;
}
let (word, fid) = StrBEU16Codec::bytes_decode(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?; .ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = u32::from_be_bytes(document_id_bytes);
// every words contained in an attribute set to exact must be pushed in the exact_words list. let bitmap = RoaringBitmap::from_iter(Some(document_id));
if exact_attributes.contains(&fid) { serialize_roaring_bitmap(&bitmap, &mut value_buffer)?;
exact_word_docids_sorter.insert(word.as_bytes(), &value)?;
// If there are no exact attributes, we do not need to iterate over positions.
if exact_attributes.is_empty() {
word_docids_sorter.insert(word_bytes, &value_buffer)?;
} else { } else {
word_docids_sorter.insert(word.as_bytes(), &value)?; let mut added_to_exact = false;
let mut added_to_word_docids = false;
for position in read_u32_ne_bytes(positions) {
// as soon as we know that this word had been to both readers, we don't need to
// iterate over the positions.
if added_to_exact && added_to_word_docids {
break;
}
let (fid, _) = relative_from_absolute_position(position);
if exact_attributes.contains(&fid) && !added_to_exact {
exact_word_docids_sorter.insert(word_bytes, &value_buffer)?;
added_to_exact = true;
} else if !added_to_word_docids {
word_docids_sorter.insert(word_bytes, &value_buffer)?;
added_to_word_docids = true;
}
}
} }
} }
Ok(( Ok((
sorter_into_reader(word_docids_sorter, indexer)?, sorter_into_reader(word_docids_sorter, indexer)?,
sorter_into_reader(exact_word_docids_sorter, indexer)?, sorter_into_reader(exact_word_docids_sorter, indexer)?,
writer_into_reader(word_fid_docids_writer)?,
)) ))
} }
fn words_into_sorter(
document_id: DocumentId,
fid: FieldId,
key_buffer: &mut Vec<u8>,
del_words: &BTreeSet<Vec<u8>>,
add_words: &BTreeSet<Vec<u8>>,
word_fid_docids_sorter: &mut grenad::Sorter<MergeFn>,
) -> Result<()> {
puffin::profile_function!();
use itertools::merge_join_by;
use itertools::EitherOrBoth::{Both, Left, Right};
let mut buffer = Vec::new();
for eob in merge_join_by(del_words.iter(), add_words.iter(), |d, a| d.cmp(a)) {
buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut buffer);
let word_bytes = match eob {
Left(word_bytes) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
word_bytes
}
Right(word_bytes) => {
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
word_bytes
}
Both(word_bytes, _) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
word_bytes
}
};
key_buffer.clear();
key_buffer.extend_from_slice(&word_bytes);
key_buffer.push(0);
key_buffer.extend_from_slice(&fid.to_be_bytes());
word_fid_docids_sorter.insert(&key_buffer, value_writer.into_inner().unwrap())?;
}
Ok(())
}
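On the base side, the deleted and added word sets of a document are both ordered, so one merge pass is enough to classify every word as removed, added, or present on both sides. A sketch of that three-way walk with itertools::merge_join_by, mirroring the loop above on plain string sets:

use std::collections::BTreeSet;

use itertools::merge_join_by;
use itertools::EitherOrBoth::{Both, Left, Right};

fn main() {
    // Words extracted from the old and new versions of one document.
    let del_words: BTreeSet<&str> = ["brown", "fox", "quick"].into_iter().collect();
    let add_words: BTreeSet<&str> = ["fox", "lazy", "quick"].into_iter().collect();

    // Both sets are sorted, so a single merge pass classifies every word.
    for eob in merge_join_by(del_words.iter(), add_words.iter(), |d, a| d.cmp(a)) {
        match eob {
            Left(word) => println!("{word}: only in the old version (deletion)"),
            Right(word) => println!("{word}: only in the new version (addition)"),
            Both(word, _) => println!("{word}: present in both versions"),
        }
    }
}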

@@ -0,0 +1,51 @@
use std::fs::File;
use std::io::{self, BufReader};
use super::helpers::{
create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
try_split_array_at, GrenadParameters,
};
use crate::error::SerializationError;
use crate::index::db_name::DOCID_WORD_POSITIONS;
use crate::{relative_from_absolute_position, DocumentId, Result};
/// Extracts the word, field id, and the documents ids where this word appear at this field id.
#[logging_timer::time]
pub fn extract_word_fid_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut word_fid_docids_sorter = create_sorter(
grenad::SortAlgorithm::Unstable,
merge_cbo_roaring_bitmaps,
indexer.chunk_compression_type,
indexer.chunk_compression_level,
indexer.max_nb_chunks,
max_memory,
);
let mut key_buffer = Vec::new();
let mut cursor = docid_word_positions.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? {
let (document_id_bytes, word_bytes) = try_split_array_at(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = DocumentId::from_be_bytes(document_id_bytes);
for position in read_u32_ne_bytes(value) {
key_buffer.clear();
key_buffer.extend_from_slice(word_bytes);
key_buffer.push(0);
let (fid, _) = relative_from_absolute_position(position);
key_buffer.extend_from_slice(&fid.to_be_bytes());
word_fid_docids_sorter.insert(&key_buffer, document_id.to_ne_bytes())?;
}
}
let word_fid_docids_reader = sorter_into_reader(word_fid_docids_sorter, indexer)?;
Ok(word_fid_docids_reader)
}
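extract_word_fid_docids recovers the field id from each absolute position with relative_from_absolute_position. A plausible sketch of that packing, assuming the field id occupies the high 16 bits of the u32 position and the in-field position the low 16 bits (the exact layout is an assumption of this example):

type FieldId = u16;
type Position = u32;

// Assumed layout: field id in the high 16 bits, in-field (relative) position in the low 16 bits.
fn absolute_from_relative_position(field_id: FieldId, relative: u16) -> Position {
    (field_id as u32) << 16 | relative as u32
}

fn relative_from_absolute_position(absolute: Position) -> (FieldId, u16) {
    ((absolute >> 16) as FieldId, (absolute & 0xFFFF) as u16)
}

fn main() {
    let absolute = absolute_from_relative_position(3, 42);
    assert_eq!(relative_from_absolute_position(absolute), (3, 42));
}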

@@ -1,17 +1,16 @@
-use std::collections::{BTreeMap, VecDeque};
+use std::cmp::Ordering;
+use std::collections::{BinaryHeap, HashMap};
 use std::fs::File;
-use std::{cmp, io};
+use std::io::BufReader;
+use std::{cmp, io, mem, str, vec};
-use obkv::KvReaderU16;
 use super::helpers::{
-    create_sorter, create_writer, merge_deladd_cbo_roaring_bitmaps, try_split_array_at,
-    writer_into_reader, GrenadParameters, MergeFn,
+    create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
+    try_split_array_at, GrenadParameters, MergeFn,
 };
 use crate::error::SerializationError;
 use crate::index::db_name::DOCID_WORD_POSITIONS;
-use crate::proximity::{index_proximity, MAX_DISTANCE};
+use crate::proximity::{positions_proximity, MAX_DISTANCE};
-use crate::update::del_add::{DelAdd, KvReaderDelAdd, KvWriterDelAdd};
 use crate::{DocumentId, Result};
 /// Extracts the best proximity between pairs of words and the documents ids where this pair appear.
@@ -22,143 +21,63 @@ use crate::{DocumentId, Result};
 pub fn extract_word_pair_proximity_docids<R: io::Read + io::Seek>(
     docid_word_positions: grenad::Reader<R>,
     indexer: GrenadParameters,
-) -> Result<grenad::Reader<File>> {
+) -> Result<grenad::Reader<BufReader<File>>> {
     puffin::profile_function!();
     let max_memory = indexer.max_memory_by_thread();
let mut word_pair_proximity_docids_sorters: Vec<_> = (1..MAX_DISTANCE) let mut word_pair_proximity_docids_sorter = create_sorter(
.into_iter() grenad::SortAlgorithm::Unstable,
.map(|_| { merge_cbo_roaring_bitmaps,
create_sorter( indexer.chunk_compression_type,
grenad::SortAlgorithm::Unstable, indexer.chunk_compression_level,
merge_deladd_cbo_roaring_bitmaps, indexer.max_nb_chunks,
indexer.chunk_compression_type, max_memory.map(|m| m / 2),
indexer.chunk_compression_level, );
indexer.max_nb_chunks,
max_memory.map(|m| m / MAX_DISTANCE as usize),
)
})
.collect();
let mut del_word_positions: VecDeque<(String, u16)> = // This map is assumed to not consume a lot of memory.
VecDeque::with_capacity(MAX_DISTANCE as usize); let mut document_word_positions_heap = BinaryHeap::new();
let mut add_word_positions: VecDeque<(String, u16)> =
VecDeque::with_capacity(MAX_DISTANCE as usize);
let mut del_word_pair_proximity = BTreeMap::new();
let mut add_word_pair_proximity = BTreeMap::new();
let mut current_document_id = None; let mut current_document_id = None;
let mut cursor = docid_word_positions.into_cursor()?; let mut cursor = docid_word_positions.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? { while let Some((key, value)) = cursor.move_on_next()? {
let (document_id_bytes, _fid_bytes) = try_split_array_at(key) let (document_id_bytes, word_bytes) = try_split_array_at(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?; .ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = u32::from_be_bytes(document_id_bytes); let document_id = u32::from_be_bytes(document_id_bytes);
let word = str::from_utf8(word_bytes)?;
// if we change document, we fill the sorter let curr_document_id = *current_document_id.get_or_insert(document_id);
if current_document_id.map_or(false, |id| id != document_id) { if curr_document_id != document_id {
puffin::profile_scope!("Document into sorter"); let document_word_positions_heap = mem::take(&mut document_word_positions_heap);
document_word_positions_into_sorter( document_word_positions_into_sorter(
current_document_id.unwrap(), curr_document_id,
&del_word_pair_proximity, document_word_positions_heap,
&add_word_pair_proximity, &mut word_pair_proximity_docids_sorter,
&mut word_pair_proximity_docids_sorters,
)?; )?;
del_word_pair_proximity.clear(); current_document_id = Some(document_id);
add_word_pair_proximity.clear();
} }
current_document_id = Some(document_id); let word = word.to_string();
let mut positions: Vec<_> = read_u32_ne_bytes(value).collect();
let (del, add): (Result<_>, Result<_>) = rayon::join( positions.sort_unstable();
|| { let mut iter = positions.into_iter();
// deletions if let Some(position) = iter.next() {
if let Some(deletion) = KvReaderDelAdd::new(&value).get(DelAdd::Deletion) { document_word_positions_heap.push(PeekedWordPosition { word, position, iter });
for (position, word) in KvReaderU16::new(deletion).iter() { }
// drain the proximity window until the head word is considered close to the word we are inserting.
while del_word_positions.get(0).map_or(false, |(_w, p)| {
index_proximity(*p as u32, position as u32) >= MAX_DISTANCE
}) {
word_positions_into_word_pair_proximity(
&mut del_word_positions,
&mut del_word_pair_proximity,
)?;
}
// insert the new word.
let word = std::str::from_utf8(word)?;
del_word_positions.push_back((word.to_string(), position));
}
while !del_word_positions.is_empty() {
word_positions_into_word_pair_proximity(
&mut del_word_positions,
&mut del_word_pair_proximity,
)?;
}
}
Ok(())
},
|| {
// additions
if let Some(addition) = KvReaderDelAdd::new(&value).get(DelAdd::Addition) {
for (position, word) in KvReaderU16::new(addition).iter() {
// drain the proximity window until the head word is considered close to the word we are inserting.
while add_word_positions.get(0).map_or(false, |(_w, p)| {
index_proximity(*p as u32, position as u32) >= MAX_DISTANCE
}) {
word_positions_into_word_pair_proximity(
&mut add_word_positions,
&mut add_word_pair_proximity,
)?;
}
// insert the new word.
let word = std::str::from_utf8(word)?;
add_word_positions.push_back((word.to_string(), position));
}
while !add_word_positions.is_empty() {
word_positions_into_word_pair_proximity(
&mut add_word_positions,
&mut add_word_pair_proximity,
)?;
}
}
Ok(())
},
);
del?;
add?;
} }
if let Some(document_id) = current_document_id { if let Some(document_id) = current_document_id {
puffin::profile_scope!("Final document into sorter"); // We must make sure that don't lose the current document field id
// word count map if we break because we reached the end of the chunk.
let document_word_positions_heap = mem::take(&mut document_word_positions_heap);
document_word_positions_into_sorter( document_word_positions_into_sorter(
document_id, document_id,
&del_word_pair_proximity, document_word_positions_heap,
&add_word_pair_proximity, &mut word_pair_proximity_docids_sorter,
&mut word_pair_proximity_docids_sorters,
)?; )?;
} }
{
puffin::profile_scope!("sorter_into_reader");
let mut writer = create_writer(
indexer.chunk_compression_type,
indexer.chunk_compression_level,
tempfile::tempfile()?,
);
for sorter in word_pair_proximity_docids_sorters { sorter_into_reader(word_pair_proximity_docids_sorter, indexer)
sorter.write_into_stream_writer(&mut writer)?;
}
writer_into_reader(writer)
}
} }
/// Fills the list of all pairs of words with the shortest proximity between 1 and 7 inclusive. /// Fills the list of all pairs of words with the shortest proximity between 1 and 7 inclusive.
@ -167,66 +86,96 @@ pub fn extract_word_pair_proximity_docids<R: io::Read + io::Seek>(
/// close to each other. /// close to each other.
fn document_word_positions_into_sorter( fn document_word_positions_into_sorter(
document_id: DocumentId, document_id: DocumentId,
del_word_pair_proximity: &BTreeMap<(String, String), u8>, mut word_positions_heap: BinaryHeap<PeekedWordPosition<vec::IntoIter<u32>>>,
add_word_pair_proximity: &BTreeMap<(String, String), u8>, word_pair_proximity_docids_sorter: &mut grenad::Sorter<MergeFn>,
word_pair_proximity_docids_sorters: &mut Vec<grenad::Sorter<MergeFn>>,
) -> Result<()> { ) -> Result<()> {
use itertools::merge_join_by; let mut word_pair_proximity = HashMap::new();
use itertools::EitherOrBoth::{Both, Left, Right}; let mut ordered_peeked_word_positions = Vec::new();
while !word_positions_heap.is_empty() {
while let Some(peeked_word_position) = word_positions_heap.pop() {
ordered_peeked_word_positions.push(peeked_word_position);
if ordered_peeked_word_positions.len() == 7 {
break;
}
}
if let Some((head, tail)) = ordered_peeked_word_positions.split_first() {
for PeekedWordPosition { word, position, .. } in tail {
let prox = positions_proximity(head.position, *position);
if prox > 0 && prox < MAX_DISTANCE {
word_pair_proximity
.entry((head.word.clone(), word.clone()))
.and_modify(|p| {
*p = cmp::min(*p, prox);
})
.or_insert(prox);
}
}
// Push the tail in the heap.
let tail_iter = ordered_peeked_word_positions.drain(1..);
word_positions_heap.extend(tail_iter);
// Advance the head and push it in the heap.
if let Some(mut head) = ordered_peeked_word_positions.pop() {
if let Some(next_position) = head.iter.next() {
let prox = positions_proximity(head.position, next_position);
if prox > 0 && prox < MAX_DISTANCE {
word_pair_proximity
.entry((head.word.clone(), head.word.clone()))
.and_modify(|p| {
*p = cmp::min(*p, prox);
})
.or_insert(prox);
}
word_positions_heap.push(PeekedWordPosition {
word: head.word,
position: next_position,
iter: head.iter,
});
}
}
}
}
let mut buffer = Vec::new();
let mut key_buffer = Vec::new(); let mut key_buffer = Vec::new();
for eob in for ((w1, w2), prox) in word_pair_proximity {
merge_join_by(del_word_pair_proximity.iter(), add_word_pair_proximity.iter(), |d, a| {
d.cmp(a)
})
{
buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut buffer);
let ((w1, w2), prox) = match eob {
Left(key_value) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
key_value
}
Right(key_value) => {
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
key_value
}
Both(key_value, _) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
key_value
}
};
key_buffer.clear(); key_buffer.clear();
key_buffer.push(*prox as u8); key_buffer.push(prox as u8);
key_buffer.extend_from_slice(w1.as_bytes()); key_buffer.extend_from_slice(w1.as_bytes());
key_buffer.push(0); key_buffer.push(0);
key_buffer.extend_from_slice(w2.as_bytes()); key_buffer.extend_from_slice(w2.as_bytes());
word_pair_proximity_docids_sorters[*prox as usize - 1] word_pair_proximity_docids_sorter.insert(&key_buffer, document_id.to_ne_bytes())?;
.insert(&key_buffer, value_writer.into_inner().unwrap())?;
} }
Ok(()) Ok(())
} }
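
Both columns of the function above build every pair key the same way: one proximity byte, then `w1`, a 0 separator, and `w2`; the right column stores the document id directly as the value, while the left column wraps it in a DelAdd obkv. The sketch below only mirrors the simpler, right-column encoding; it is not the crate's helper.

```rust
/// Sketch of the key/value used for word pair proximity entries:
/// key = [prox as u8][w1 bytes][0][w2 bytes], value = docid in native-endian bytes.
fn word_pair_proximity_entry(prox: u8, w1: &str, w2: &str, docid: u32) -> (Vec<u8>, [u8; 4]) {
    let mut key = Vec::with_capacity(1 + w1.len() + 1 + w2.len());
    key.push(prox);
    key.extend_from_slice(w1.as_bytes());
    key.push(0);
    key.extend_from_slice(w2.as_bytes());
    (key, docid.to_ne_bytes())
}

fn main() {
    let (key, value) = word_pair_proximity_entry(2, "hello", "world", 42);
    assert_eq!(key[0], 2);
    assert_eq!(&key[1..6], b"hello");
    assert_eq!(key[6], 0);
    assert_eq!(&key[7..], b"world");
    assert_eq!(value, 42u32.to_ne_bytes());
}
```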
fn word_positions_into_word_pair_proximity( struct PeekedWordPosition<I> {
word_positions: &mut VecDeque<(String, u16)>, word: String,
word_pair_proximity: &mut BTreeMap<(String, String), u8>, position: u32,
) -> Result<()> { iter: I,
let (head_word, head_position) = word_positions.pop_front().unwrap(); }
for (word, position) in word_positions.iter() {
let prox = index_proximity(head_position as u32, *position as u32) as u8; impl<I> Ord for PeekedWordPosition<I> {
if prox > 0 && prox < MAX_DISTANCE as u8 { fn cmp(&self, other: &Self) -> Ordering {
word_pair_proximity self.position.cmp(&other.position).reverse()
.entry((head_word.clone(), word.clone())) }
.and_modify(|p| { }
*p = cmp::min(*p, prox);
}) impl<I> PartialOrd for PeekedWordPosition<I> {
.or_insert(prox); fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
} Some(self.cmp(other))
} }
Ok(()) }
impl<I> Eq for PeekedWordPosition<I> {}
impl<I> PartialEq for PeekedWordPosition<I> {
fn eq(&self, other: &Self) -> bool {
self.position == other.position
}
} }
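
Rust's `BinaryHeap` is a max-heap, so `PeekedWordPosition` reverses the position comparison in its `Ord` impl to make the heap pop the entry with the smallest position first. Here is a standalone illustration of the same trick; the iterator field is left out for brevity.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Same idea as PeekedWordPosition: reverse the comparison so that
// BinaryHeap (a max-heap) pops the entry with the smallest position first.
#[derive(Debug)]
struct Peeked {
    word: String,
    position: u32,
}

impl Ord for Peeked {
    fn cmp(&self, other: &Self) -> Ordering {
        self.position.cmp(&other.position).reverse()
    }
}
impl PartialOrd for Peeked {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Peeked {
    fn eq(&self, other: &Self) -> bool {
        self.position == other.position
    }
}
impl Eq for Peeked {}

fn main() {
    let mut heap = BinaryHeap::new();
    heap.push(Peeked { word: "late".into(), position: 10 });
    heap.push(Peeked { word: "early".into(), position: 1 });
    heap.push(Peeked { word: "middle".into(), position: 5 });

    // Popping yields positions in ascending order: 1, 5, 10.
    let order: Vec<u32> = std::iter::from_fn(|| heap.pop()).map(|p| p.position).collect();
    assert_eq!(order, vec![1, 5, 10]);
}
```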

View File

@ -1,18 +1,13 @@
use std::collections::BTreeSet;
use std::fs::File; use std::fs::File;
use std::io; use std::io::{self, BufReader};
use obkv::KvReaderU16;
use super::helpers::{ use super::helpers::{
create_sorter, merge_deladd_cbo_roaring_bitmaps, sorter_into_reader, try_split_array_at, create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
GrenadParameters, try_split_array_at, GrenadParameters,
}; };
use crate::error::SerializationError; use crate::error::SerializationError;
use crate::index::db_name::DOCID_WORD_POSITIONS; use crate::index::db_name::DOCID_WORD_POSITIONS;
use crate::update::del_add::{DelAdd, KvReaderDelAdd, KvWriterDelAdd}; use crate::{bucketed_position, relative_from_absolute_position, DocumentId, Result};
use crate::update::MergeFn;
use crate::{bucketed_position, DocumentId, Result};
/// Extracts the word positions and the documents ids where this word appears. /// Extracts the word positions and the documents ids where this word appears.
/// ///
@ -22,117 +17,39 @@ use crate::{bucketed_position, DocumentId, Result};
pub fn extract_word_position_docids<R: io::Read + io::Seek>( pub fn extract_word_position_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>, docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters, indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> { ) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!(); puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread(); let max_memory = indexer.max_memory_by_thread();
let mut word_position_docids_sorter = create_sorter( let mut word_position_docids_sorter = create_sorter(
grenad::SortAlgorithm::Unstable, grenad::SortAlgorithm::Unstable,
merge_deladd_cbo_roaring_bitmaps, merge_cbo_roaring_bitmaps,
indexer.chunk_compression_type, indexer.chunk_compression_type,
indexer.chunk_compression_level, indexer.chunk_compression_level,
indexer.max_nb_chunks, indexer.max_nb_chunks,
max_memory, max_memory,
); );
let mut del_word_positions: BTreeSet<(u16, Vec<u8>)> = BTreeSet::new();
let mut add_word_positions: BTreeSet<(u16, Vec<u8>)> = BTreeSet::new();
let mut current_document_id: Option<u32> = None;
let mut key_buffer = Vec::new(); let mut key_buffer = Vec::new();
let mut cursor = docid_word_positions.into_cursor()?; let mut cursor = docid_word_positions.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? { while let Some((key, value)) = cursor.move_on_next()? {
let (document_id_bytes, _fid_bytes) = try_split_array_at(key) let (document_id_bytes, word_bytes) = try_split_array_at(key)
.ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?; .ok_or(SerializationError::Decoding { db_name: Some(DOCID_WORD_POSITIONS) })?;
let document_id = DocumentId::from_be_bytes(document_id_bytes); let document_id = DocumentId::from_be_bytes(document_id_bytes);
if current_document_id.map_or(false, |id| document_id != id) { for position in read_u32_ne_bytes(value) {
words_position_into_sorter( key_buffer.clear();
current_document_id.unwrap(), key_buffer.extend_from_slice(word_bytes);
&mut key_buffer, key_buffer.push(0);
&del_word_positions, let (_, position) = relative_from_absolute_position(position);
&add_word_positions, let position = bucketed_position(position);
&mut word_position_docids_sorter, key_buffer.extend_from_slice(&position.to_be_bytes());
)?; word_position_docids_sorter.insert(&key_buffer, document_id.to_ne_bytes())?;
del_word_positions.clear();
add_word_positions.clear();
}
current_document_id = Some(document_id);
let del_add_reader = KvReaderDelAdd::new(&value);
// extract all unique words to remove.
if let Some(deletion) = del_add_reader.get(DelAdd::Deletion) {
for (position, word_bytes) in KvReaderU16::new(deletion).iter() {
let position = bucketed_position(position);
del_word_positions.insert((position, word_bytes.to_vec()));
}
}
// extract all unique additional words.
if let Some(addition) = del_add_reader.get(DelAdd::Addition) {
for (position, word_bytes) in KvReaderU16::new(addition).iter() {
let position = bucketed_position(position);
add_word_positions.insert((position, word_bytes.to_vec()));
}
} }
} }
if let Some(document_id) = current_document_id {
words_position_into_sorter(
document_id,
&mut key_buffer,
&del_word_positions,
&add_word_positions,
&mut word_position_docids_sorter,
)?;
}
// TODO remove noop DelAdd OBKV
let word_position_docids_reader = sorter_into_reader(word_position_docids_sorter, indexer)?; let word_position_docids_reader = sorter_into_reader(word_position_docids_sorter, indexer)?;
Ok(word_position_docids_reader) Ok(word_position_docids_reader)
} }
fn words_position_into_sorter(
document_id: DocumentId,
key_buffer: &mut Vec<u8>,
del_word_positions: &BTreeSet<(u16, Vec<u8>)>,
add_word_positions: &BTreeSet<(u16, Vec<u8>)>,
word_position_docids_sorter: &mut grenad::Sorter<MergeFn>,
) -> Result<()> {
puffin::profile_function!();
use itertools::merge_join_by;
use itertools::EitherOrBoth::{Both, Left, Right};
let mut buffer = Vec::new();
for eob in merge_join_by(del_word_positions.iter(), add_word_positions.iter(), |d, a| d.cmp(a))
{
buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut buffer);
let (position, word_bytes) = match eob {
Left(key) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
key
}
Right(key) => {
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
key
}
Both(key, _) => {
value_writer.insert(DelAdd::Deletion, document_id.to_ne_bytes()).unwrap();
value_writer.insert(DelAdd::Addition, document_id.to_ne_bytes()).unwrap();
key
}
};
key_buffer.clear();
key_buffer.extend_from_slice(word_bytes);
key_buffer.push(0);
key_buffer.extend_from_slice(&position.to_be_bytes());
word_position_docids_sorter.insert(&key_buffer, value_writer.into_inner().unwrap())?;
}
Ok(())
}
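
In the right column each key is the word bytes, a 0 separator, and the big-endian bucketed relative position, while the left column routes the same information through a DelAdd obkv value. The sketch below shows only the key construction; the `bucketed_position` used here is a made-up stand-in, not milli's real bucketing scheme.

```rust
/// Hypothetical stand-in for milli's `bucketed_position`: keep small positions
/// exact and collapse larger ones into coarser buckets. Illustration only.
fn bucketed_position(relative: u16) -> u16 {
    if relative < 16 { relative } else { relative / 8 * 8 }
}

/// Sketch of the `word\0position` key used by extract_word_position_docids.
fn word_position_key(word: &[u8], relative_position: u16) -> Vec<u8> {
    let mut key = Vec::with_capacity(word.len() + 1 + 2);
    key.extend_from_slice(word);
    key.push(0);
    key.extend_from_slice(&bucketed_position(relative_position).to_be_bytes());
    key
}

fn main() {
    let key = word_position_key(b"hello", 3);
    assert_eq!(&key[..5], b"hello");
    assert_eq!(key[5], 0);
    assert_eq!(key[6..], 3u16.to_be_bytes());
}
```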

View File

@ -6,11 +6,13 @@ mod extract_fid_word_count_docids;
mod extract_geo_points; mod extract_geo_points;
mod extract_vector_points; mod extract_vector_points;
mod extract_word_docids; mod extract_word_docids;
mod extract_word_fid_docids;
mod extract_word_pair_proximity_docids; mod extract_word_pair_proximity_docids;
mod extract_word_position_docids; mod extract_word_position_docids;
use std::collections::HashSet; use std::collections::HashSet;
use std::fs::File; use std::fs::File;
use std::io::BufReader;
use crossbeam_channel::Sender; use crossbeam_channel::Sender;
use log::debug; use log::debug;
@ -24,11 +26,12 @@ use self::extract_fid_word_count_docids::extract_fid_word_count_docids;
use self::extract_geo_points::extract_geo_points; use self::extract_geo_points::extract_geo_points;
use self::extract_vector_points::extract_vector_points; use self::extract_vector_points::extract_vector_points;
use self::extract_word_docids::extract_word_docids; use self::extract_word_docids::extract_word_docids;
use self::extract_word_fid_docids::extract_word_fid_docids;
use self::extract_word_pair_proximity_docids::extract_word_pair_proximity_docids; use self::extract_word_pair_proximity_docids::extract_word_pair_proximity_docids;
use self::extract_word_position_docids::extract_word_position_docids; use self::extract_word_position_docids::extract_word_position_docids;
use super::helpers::{ use super::helpers::{
as_cloneable_grenad, merge_cbo_roaring_bitmaps, CursorClonableMmap, GrenadParameters, MergeFn, as_cloneable_grenad, merge_cbo_roaring_bitmaps, merge_roaring_bitmaps, CursorClonableMmap,
MergeableReader, GrenadParameters, MergeFn, MergeableReader,
}; };
use super::{helpers, TypedChunk}; use super::{helpers, TypedChunk};
use crate::{FieldId, Result}; use crate::{FieldId, Result};
@ -37,8 +40,8 @@ use crate::{FieldId, Result};
/// Send data in grenad file over provided Sender. /// Send data in grenad file over provided Sender.
#[allow(clippy::too_many_arguments)] #[allow(clippy::too_many_arguments)]
pub(crate) fn data_from_obkv_documents( pub(crate) fn data_from_obkv_documents(
original_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<File>>> + Send, original_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>> + Send,
flattened_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<File>>> + Send, flattened_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>> + Send,
indexer: GrenadParameters, indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>, lmdb_writer_sx: Sender<Result<TypedChunk>>,
searchable_fields: Option<HashSet<FieldId>>, searchable_fields: Option<HashSet<FieldId>>,
@ -91,9 +94,9 @@ pub(crate) fn data_from_obkv_documents(
let ( let (
docid_word_positions_chunks, docid_word_positions_chunks,
( (
fid_docid_facet_numbers_chunks, docid_fid_facet_numbers_chunks,
( (
fid_docid_facet_strings_chunks, docid_fid_facet_strings_chunks,
( (
facet_is_null_docids_chunks, facet_is_null_docids_chunks,
(facet_is_empty_docids_chunks, facet_exists_docids_chunks), (facet_is_empty_docids_chunks, facet_exists_docids_chunks),
@ -150,7 +153,7 @@ pub(crate) fn data_from_obkv_documents(
}); });
} }
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>( spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(), docid_word_positions_chunks.clone(),
indexer, indexer,
lmdb_writer_sx.clone(), lmdb_writer_sx.clone(),
@ -160,7 +163,7 @@ pub(crate) fn data_from_obkv_documents(
"word-pair-proximity-docids", "word-pair-proximity-docids",
); );
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>( spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(), docid_word_positions_chunks.clone(),
indexer, indexer,
lmdb_writer_sx.clone(), lmdb_writer_sx.clone(),
@ -173,24 +176,21 @@ pub(crate) fn data_from_obkv_documents(
spawn_extraction_task::< spawn_extraction_task::<
_, _,
_, _,
Vec<(grenad::Reader<File>, grenad::Reader<File>, grenad::Reader<File>)>, Vec<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)>,
>( >(
docid_word_positions_chunks.clone(), docid_word_positions_chunks.clone(),
indexer, indexer,
lmdb_writer_sx.clone(), lmdb_writer_sx.clone(),
move |doc_word_pos, indexer| extract_word_docids(doc_word_pos, indexer, &exact_attributes), move |doc_word_pos, indexer| extract_word_docids(doc_word_pos, indexer, &exact_attributes),
merge_cbo_roaring_bitmaps, merge_roaring_bitmaps,
|(word_docids_reader, exact_word_docids_reader, word_fid_docids_reader)| { |(word_docids_reader, exact_word_docids_reader)| TypedChunk::WordDocids {
TypedChunk::WordDocids { word_docids_reader,
word_docids_reader, exact_word_docids_reader,
exact_word_docids_reader,
word_fid_docids_reader,
}
}, },
"word-docids", "word-docids",
); );
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>( spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(), docid_word_positions_chunks.clone(),
indexer, indexer,
lmdb_writer_sx.clone(), lmdb_writer_sx.clone(),
@ -199,9 +199,18 @@ pub(crate) fn data_from_obkv_documents(
TypedChunk::WordPositionDocids, TypedChunk::WordPositionDocids,
"word-position-docids", "word-position-docids",
); );
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks,
indexer,
lmdb_writer_sx.clone(),
extract_word_fid_docids,
merge_cbo_roaring_bitmaps,
TypedChunk::WordFidDocids,
"word-fid-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>( spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
fid_docid_facet_strings_chunks, docid_fid_facet_strings_chunks,
indexer, indexer,
lmdb_writer_sx.clone(), lmdb_writer_sx.clone(),
extract_facet_string_docids, extract_facet_string_docids,
@ -210,8 +219,8 @@ pub(crate) fn data_from_obkv_documents(
"field-id-facet-string-docids", "field-id-facet-string-docids",
); );
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>( spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
fid_docid_facet_numbers_chunks, docid_fid_facet_numbers_chunks,
indexer, indexer,
lmdb_writer_sx, lmdb_writer_sx,
extract_facet_number_docids, extract_facet_number_docids,
@ -265,7 +274,7 @@ fn spawn_extraction_task<FE, FS, M>(
/// Extract chunked data and send it into lmdb_writer_sx sender: /// Extract chunked data and send it into lmdb_writer_sx sender:
/// - documents /// - documents
fn send_original_documents_data( fn send_original_documents_data(
original_documents_chunk: Result<grenad::Reader<File>>, original_documents_chunk: Result<grenad::Reader<BufReader<File>>>,
indexer: GrenadParameters, indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>, lmdb_writer_sx: Sender<Result<TypedChunk>>,
vectors_field_id: Option<FieldId>, vectors_field_id: Option<FieldId>,
@ -307,7 +316,7 @@ fn send_original_documents_data(
#[allow(clippy::too_many_arguments)] #[allow(clippy::too_many_arguments)]
#[allow(clippy::type_complexity)] #[allow(clippy::type_complexity)]
fn send_and_extract_flattened_documents_data( fn send_and_extract_flattened_documents_data(
flattened_documents_chunk: Result<grenad::Reader<File>>, flattened_documents_chunk: Result<grenad::Reader<BufReader<File>>>,
indexer: GrenadParameters, indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>, lmdb_writer_sx: Sender<Result<TypedChunk>>,
searchable_fields: &Option<HashSet<FieldId>>, searchable_fields: &Option<HashSet<FieldId>>,
@ -324,7 +333,10 @@ fn send_and_extract_flattened_documents_data(
grenad::Reader<CursorClonableMmap>, grenad::Reader<CursorClonableMmap>,
( (
grenad::Reader<CursorClonableMmap>, grenad::Reader<CursorClonableMmap>,
(grenad::Reader<File>, (grenad::Reader<File>, grenad::Reader<File>)), (
grenad::Reader<BufReader<File>>,
(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>),
),
), ),
), ),
)> { )> {
@ -344,7 +356,7 @@ fn send_and_extract_flattened_documents_data(
}); });
} }
let (docid_word_positions_chunk, fid_docid_facet_values_chunks): (Result<_>, Result<_>) = let (docid_word_positions_chunk, docid_fid_facet_values_chunks): (Result<_>, Result<_>) =
rayon::join( rayon::join(
|| { || {
let (documents_ids, docid_word_positions_chunk, script_language_pair) = let (documents_ids, docid_word_positions_chunk, script_language_pair) =
@ -372,8 +384,8 @@ fn send_and_extract_flattened_documents_data(
}, },
|| { || {
let ExtractedFacetValues { let ExtractedFacetValues {
fid_docid_facet_numbers_chunk, docid_fid_facet_numbers_chunk,
fid_docid_facet_strings_chunk, docid_fid_facet_strings_chunk,
fid_facet_is_null_docids_chunk, fid_facet_is_null_docids_chunk,
fid_facet_is_empty_docids_chunk, fid_facet_is_empty_docids_chunk,
fid_facet_exists_docids_chunk, fid_facet_exists_docids_chunk,
@ -384,26 +396,26 @@ fn send_and_extract_flattened_documents_data(
geo_fields_ids, geo_fields_ids,
)?; )?;
// send fid_docid_facet_numbers_chunk to DB writer // send docid_fid_facet_numbers_chunk to DB writer
let fid_docid_facet_numbers_chunk = let docid_fid_facet_numbers_chunk =
unsafe { as_cloneable_grenad(&fid_docid_facet_numbers_chunk)? }; unsafe { as_cloneable_grenad(&docid_fid_facet_numbers_chunk)? };
let _ = lmdb_writer_sx.send(Ok(TypedChunk::FieldIdDocidFacetNumbers( let _ = lmdb_writer_sx.send(Ok(TypedChunk::FieldIdDocidFacetNumbers(
fid_docid_facet_numbers_chunk.clone(), docid_fid_facet_numbers_chunk.clone(),
))); )));
// send fid_docid_facet_strings_chunk to DB writer // send docid_fid_facet_strings_chunk to DB writer
let fid_docid_facet_strings_chunk = let docid_fid_facet_strings_chunk =
unsafe { as_cloneable_grenad(&fid_docid_facet_strings_chunk)? }; unsafe { as_cloneable_grenad(&docid_fid_facet_strings_chunk)? };
let _ = lmdb_writer_sx.send(Ok(TypedChunk::FieldIdDocidFacetStrings( let _ = lmdb_writer_sx.send(Ok(TypedChunk::FieldIdDocidFacetStrings(
fid_docid_facet_strings_chunk.clone(), docid_fid_facet_strings_chunk.clone(),
))); )));
Ok(( Ok((
fid_docid_facet_numbers_chunk, docid_fid_facet_numbers_chunk,
( (
fid_docid_facet_strings_chunk, docid_fid_facet_strings_chunk,
( (
fid_facet_is_null_docids_chunk, fid_facet_is_null_docids_chunk,
(fid_facet_is_empty_docids_chunk, fid_facet_exists_docids_chunk), (fid_facet_is_empty_docids_chunk, fid_facet_exists_docids_chunk),
@ -413,5 +425,5 @@ fn send_and_extract_flattened_documents_data(
}, },
); );
Ok((docid_word_positions_chunk?, fid_docid_facet_values_chunks?)) Ok((docid_word_positions_chunk?, docid_fid_facet_values_chunks?))
} }
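
`data_from_obkv_documents` spawns a dedicated extraction task per target database and streams every finished chunk back to the single LMDB writer over a channel. The sketch below reduces that producer/consumer shape to `std::thread` and `std::sync::mpsc` as stand-ins for the crate's rayon pool and `crossbeam_channel`; the names and payloads are illustrative only.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for TypedChunk: each extractor sends its finished chunk tagged with a name.
#[derive(Debug)]
struct ExtractedChunk {
    name: &'static str,
    payload: Vec<u8>,
}

fn main() {
    let (sender, receiver) = mpsc::channel::<ExtractedChunk>();
    let chunks: Vec<Vec<u8>> = vec![vec![1, 2], vec![3, 4, 5]];

    // Spawn one task per chunk, the way spawn_extraction_task does with rayon;
    // every task reports back through a clone of the same sender.
    let mut handles = Vec::new();
    for chunk in chunks {
        let sender = sender.clone();
        handles.push(thread::spawn(move || {
            // "Extraction" is just a placeholder transformation here.
            let payload: Vec<u8> = chunk.iter().map(|b| b.wrapping_mul(2)).collect();
            sender.send(ExtractedChunk { name: "word-docids", payload }).unwrap();
        }));
    }
    drop(sender); // close the channel once all producers hold their own clone

    // The single writer thread drains the channel, like the LMDB writer does.
    for chunk in receiver {
        println!("received {} chunk of {} bytes", chunk.name, chunk.payload.len());
    }
    for handle in handles {
        handle.join().unwrap();
    }
}
```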

View File

@ -1,6 +1,6 @@
use std::borrow::Cow; use std::borrow::Cow;
use std::fs::File; use std::fs::File;
use std::io::{self, Seek}; use std::io::{self, BufReader, BufWriter, Seek};
use std::time::Instant; use std::time::Instant;
use grenad::{CompressionType, Sorter}; use grenad::{CompressionType, Sorter};
@ -17,13 +17,13 @@ pub fn create_writer<R: io::Write>(
typ: grenad::CompressionType, typ: grenad::CompressionType,
level: Option<u32>, level: Option<u32>,
file: R, file: R,
) -> grenad::Writer<R> { ) -> grenad::Writer<BufWriter<R>> {
let mut builder = grenad::Writer::builder(); let mut builder = grenad::Writer::builder();
builder.compression_type(typ); builder.compression_type(typ);
if let Some(level) = level { if let Some(level) = level {
builder.compression_level(level); builder.compression_level(level);
} }
builder.build(file) builder.build(BufWriter::new(file))
} }
pub fn create_sorter( pub fn create_sorter(
@ -53,8 +53,7 @@ pub fn create_sorter(
pub fn sorter_into_reader( pub fn sorter_into_reader(
sorter: grenad::Sorter<MergeFn>, sorter: grenad::Sorter<MergeFn>,
indexer: GrenadParameters, indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> { ) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let mut writer = create_writer( let mut writer = create_writer(
indexer.chunk_compression_type, indexer.chunk_compression_type,
indexer.chunk_compression_level, indexer.chunk_compression_level,
@ -65,16 +64,18 @@ pub fn sorter_into_reader(
writer_into_reader(writer) writer_into_reader(writer)
} }
pub fn writer_into_reader(writer: grenad::Writer<File>) -> Result<grenad::Reader<File>> { pub fn writer_into_reader(
let mut file = writer.into_inner()?; writer: grenad::Writer<BufWriter<File>>,
) -> Result<grenad::Reader<BufReader<File>>> {
let mut file = writer.into_inner()?.into_inner().map_err(|err| err.into_error())?;
file.rewind()?; file.rewind()?;
grenad::Reader::new(file).map_err(Into::into) grenad::Reader::new(BufReader::new(file)).map_err(Into::into)
} }
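
The right-hand version of `writer_into_reader` has to unwrap two layers, the grenad writer and the `BufWriter` it now wraps, and `BufWriter::into_inner` can fail if the final flush fails, hence the `.map_err(|err| err.into_error())`. Below is a plain-std sketch of the same flush, rewind and re-read sequence, assuming only the `tempfile` crate the extractors already depend on.

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Seek, Write};

/// Write bytes through a BufWriter, then hand the same file back as a BufReader.
/// Mirrors the flush + rewind handling in writer_into_reader, minus grenad.
fn write_then_reopen(bytes: &[u8]) -> io::Result<BufReader<File>> {
    let mut writer = BufWriter::new(tempfile::tempfile()?);
    writer.write_all(bytes)?;
    // into_inner flushes the buffer; a failed flush surfaces as an IntoInnerError,
    // converted back into a plain io::Error here, like `.map_err(|err| err.into_error())`.
    let mut file = writer.into_inner().map_err(|err| err.into_error())?;
    file.rewind()?;
    Ok(BufReader::new(file))
}

fn main() -> io::Result<()> {
    let mut reader = write_then_reopen(b"hello grenad")?;
    let mut contents = String::new();
    reader.read_to_string(&mut contents)?;
    assert_eq!(contents, "hello grenad");
    Ok(())
}
```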
pub unsafe fn as_cloneable_grenad( pub unsafe fn as_cloneable_grenad(
reader: &grenad::Reader<File>, reader: &grenad::Reader<BufReader<File>>,
) -> Result<grenad::Reader<CursorClonableMmap>> { ) -> Result<grenad::Reader<CursorClonableMmap>> {
let file = reader.get_ref(); let file = reader.get_ref().get_ref();
let mmap = memmap2::Mmap::map(file)?; let mmap = memmap2::Mmap::map(file)?;
let cursor = io::Cursor::new(ClonableMmap::from(mmap)); let cursor = io::Cursor::new(ClonableMmap::from(mmap));
let reader = grenad::Reader::new(cursor)?; let reader = grenad::Reader::new(cursor)?;
@ -90,8 +91,8 @@ where
fn merge(self, merge_fn: MergeFn, indexer: &GrenadParameters) -> Result<Self::Output>; fn merge(self, merge_fn: MergeFn, indexer: &GrenadParameters) -> Result<Self::Output>;
} }
impl MergeableReader for Vec<grenad::Reader<File>> { impl MergeableReader for Vec<grenad::Reader<BufReader<File>>> {
type Output = grenad::Reader<File>; type Output = grenad::Reader<BufReader<File>>;
fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> { fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> {
let mut merger = MergerBuilder::new(merge_fn); let mut merger = MergerBuilder::new(merge_fn);
@ -100,8 +101,8 @@ impl MergeableReader for Vec<grenad::Reader<File>> {
} }
} }
impl MergeableReader for Vec<(grenad::Reader<File>, grenad::Reader<File>)> { impl MergeableReader for Vec<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)> {
type Output = (grenad::Reader<File>, grenad::Reader<File>); type Output = (grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>);
fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> { fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> {
let mut m1 = MergerBuilder::new(merge_fn); let mut m1 = MergerBuilder::new(merge_fn);
@ -114,22 +115,6 @@ impl MergeableReader for Vec<(grenad::Reader<File>, grenad::Reader<File>)> {
} }
} }
impl MergeableReader for Vec<(grenad::Reader<File>, grenad::Reader<File>, grenad::Reader<File>)> {
type Output = (grenad::Reader<File>, grenad::Reader<File>, grenad::Reader<File>);
fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> {
let mut m1 = MergerBuilder::new(merge_fn);
let mut m2 = MergerBuilder::new(merge_fn);
let mut m3 = MergerBuilder::new(merge_fn);
for (r1, r2, r3) in self.into_iter() {
m1.push(r1)?;
m2.push(r2)?;
m3.push(r3)?;
}
Ok((m1.finish(params)?, m2.finish(params)?, m3.finish(params)?))
}
}
struct MergerBuilder<R>(grenad::MergerBuilder<R, MergeFn>); struct MergerBuilder<R>(grenad::MergerBuilder<R, MergeFn>);
impl<R: io::Read + io::Seek> MergerBuilder<R> { impl<R: io::Read + io::Seek> MergerBuilder<R> {
@ -142,7 +127,7 @@ impl<R: io::Read + io::Seek> MergerBuilder<R> {
Ok(()) Ok(())
} }
fn finish(self, params: &GrenadParameters) -> Result<grenad::Reader<File>> { fn finish(self, params: &GrenadParameters) -> Result<grenad::Reader<BufReader<File>>> {
let merger = self.0.build(); let merger = self.0.build();
let mut writer = create_writer( let mut writer = create_writer(
params.chunk_compression_type, params.chunk_compression_type,
@ -193,7 +178,7 @@ pub fn grenad_obkv_into_chunks<R: io::Read + io::Seek>(
reader: grenad::Reader<R>, reader: grenad::Reader<R>,
indexer: GrenadParameters, indexer: GrenadParameters,
documents_chunk_size: usize, documents_chunk_size: usize,
) -> Result<impl Iterator<Item = Result<grenad::Reader<File>>>> { ) -> Result<impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>>> {
let mut continue_reading = true; let mut continue_reading = true;
let mut cursor = reader.into_cursor()?; let mut cursor = reader.into_cursor()?;

View File

@ -6,13 +6,11 @@ use std::result::Result as StdResult;
use roaring::RoaringBitmap; use roaring::RoaringBitmap;
use crate::heed_codec::CboRoaringBitmapCodec; use crate::heed_codec::CboRoaringBitmapCodec;
use crate::update::del_add::{DelAdd, KvReaderDelAdd, KvWriterDelAdd};
use crate::update::index_documents::transform::Operation; use crate::update::index_documents::transform::Operation;
use crate::Result; use crate::Result;
pub type MergeFn = for<'a> fn(&[u8], &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>>; pub type MergeFn = for<'a> fn(&[u8], &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>>;
#[allow(unused)]
pub fn concat_u32s_array<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> { pub fn concat_u32s_array<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> {
if values.len() == 1 { if values.len() == 1 {
Ok(values[0].clone()) Ok(values[0].clone())
@ -77,123 +75,57 @@ pub fn keep_latest_obkv<'a>(_key: &[u8], obkvs: &[Cow<'a, [u8]>]) -> Result<Cow<
Ok(obkvs.last().unwrap().clone()) Ok(obkvs.last().unwrap().clone())
} }
pub fn merge_two_del_add_obkvs( pub fn merge_two_obkvs(base: obkv::KvReaderU16, update: obkv::KvReaderU16, buffer: &mut Vec<u8>) {
base: obkv::KvReaderU16,
update: obkv::KvReaderU16,
merge_additions: bool,
buffer: &mut Vec<u8>,
) {
use itertools::merge_join_by; use itertools::merge_join_by;
use itertools::EitherOrBoth::{Both, Left, Right}; use itertools::EitherOrBoth::{Both, Left, Right};
buffer.clear(); buffer.clear();
let mut writer = obkv::KvWriter::new(buffer); let mut writer = obkv::KvWriter::new(buffer);
let mut value_buffer = Vec::new();
for eob in merge_join_by(base.iter(), update.iter(), |(b, _), (u, _)| b.cmp(u)) { for eob in merge_join_by(base.iter(), update.iter(), |(b, _), (u, _)| b.cmp(u)) {
match eob { match eob {
Left((k, v)) => { Both(_, (k, v)) | Left((k, v)) | Right((k, v)) => writer.insert(k, v).unwrap(),
if merge_additions {
writer.insert(k, v).unwrap()
} else {
// If merge_additions is false, recreate an obkv keeping the deletions only.
value_buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
let base_reader = KvReaderDelAdd::new(v);
if let Some(deletion) = base_reader.get(DelAdd::Deletion) {
value_writer.insert(DelAdd::Deletion, deletion).unwrap();
value_writer.finish().unwrap();
writer.insert(k, &value_buffer).unwrap()
}
}
}
Right((k, v)) => writer.insert(k, v).unwrap(),
Both((k, base), (_, update)) => {
// merge deletions and additions.
value_buffer.clear();
let mut value_writer = KvWriterDelAdd::new(&mut value_buffer);
let base_reader = KvReaderDelAdd::new(base);
let update_reader = KvReaderDelAdd::new(update);
// keep newest deletion.
if let Some(deletion) = update_reader
.get(DelAdd::Deletion)
.or_else(|| base_reader.get(DelAdd::Deletion))
{
value_writer.insert(DelAdd::Deletion, deletion).unwrap();
}
// keep base addition only if merge_additions is true.
let base_addition =
merge_additions.then(|| base_reader.get(DelAdd::Addition)).flatten();
// keep newest addition.
// TODO use or_else
if let Some(addition) = update_reader.get(DelAdd::Addition).or(base_addition) {
value_writer.insert(DelAdd::Addition, addition).unwrap();
}
value_writer.finish().unwrap();
writer.insert(k, &value_buffer).unwrap()
}
} }
} }
writer.finish().unwrap(); writer.finish().unwrap();
} }
/// Merge all the obkvs from the newest to the oldest. /// Merge all the obkvs in the order we see them.
fn inner_merge_del_add_obkvs<'a>( pub fn merge_obkvs_and_operations<'a>(
_key: &[u8],
obkvs: &[Cow<'a, [u8]>], obkvs: &[Cow<'a, [u8]>],
merge_additions: bool,
) -> Result<Cow<'a, [u8]>> { ) -> Result<Cow<'a, [u8]>> {
// pop the newest operation from the list. // [add, add, delete, add, add]
let (newest, obkvs) = obkvs.split_last().unwrap(); // we can ignore everything that happened before the last delete.
// keep the operation type for the returned value. let starting_position =
let newest_operation_type = newest[0]; obkvs.iter().rposition(|obkv| obkv[0] == Operation::Deletion as u8).unwrap_or(0);
// treat the newest obkv as the starting point of the merge. // [add, add, delete]
let mut acc_operation_type = newest_operation_type; // if the last operation was a deletion then we simply return the deletion
let mut acc = newest[1..].to_vec(); if starting_position == obkvs.len() - 1 && obkvs.last().unwrap()[0] == Operation::Deletion as u8
let mut buffer = Vec::new(); {
// reverse iter from the most recent to the oldest. return Ok(obkvs[obkvs.len() - 1].clone());
for current in obkvs.into_iter().rev() {
// if in the previous iteration there was a complete deletion,
// stop the merge process.
if acc_operation_type == Operation::Deletion as u8 {
break;
}
let newest = obkv::KvReader::new(&acc);
let oldest = obkv::KvReader::new(&current[1..]);
merge_two_del_add_obkvs(oldest, newest, merge_additions, &mut buffer);
// we want the result of the merge into our accumulator.
std::mem::swap(&mut acc, &mut buffer);
acc_operation_type = current[0];
} }
let mut buffer = Vec::new();
acc.insert(0, newest_operation_type); // (add, add, delete) [add, add]
Ok(Cow::from(acc)) // in the other case, no deletion will be encountered during the merge
let mut ret =
obkvs[starting_position..].iter().cloned().fold(Vec::new(), |mut acc, current| {
let first = obkv::KvReader::new(&acc);
let second = obkv::KvReader::new(&current[1..]);
merge_two_obkvs(first, second, &mut buffer);
// we want the result of the merge into our accumulator
std::mem::swap(&mut acc, &mut buffer);
acc
});
ret.insert(0, Operation::Addition as u8);
Ok(Cow::from(ret))
} }
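
The right-column `merge_obkvs_and_operations` leans on the first byte of every buffered entry being an `Operation` tag: it looks for the last deletion, returns it directly when it is also the last entry, and otherwise folds only the entries from that point on. Here is a toy model of just that operation-byte bookkeeping; the obkv merging itself is elided.

```rust
// Toy model of the operation-byte convention used by the transform sorters:
// byte 0 is the Operation tag, the rest is the (obkv) payload.
#[repr(u8)]
enum Operation {
    Addition = 0,
    Deletion = 1,
}

/// Return the entries that can still affect the merged result: everything
/// strictly before the last deletion is ignored, and when the last entry is
/// itself a deletion it is returned as the final result.
fn surviving_entries(entries: &[Vec<u8>]) -> &[Vec<u8>] {
    let start = entries
        .iter()
        .rposition(|e| e[0] == Operation::Deletion as u8)
        .unwrap_or(0);
    &entries[start..]
}

fn main() {
    let add = |payload: &[u8]| {
        let mut entry = vec![Operation::Addition as u8];
        entry.extend_from_slice(payload);
        entry
    };
    let del = || vec![Operation::Deletion as u8];

    // [add, add, delete, add]: only [delete, add] can influence the merge.
    let history = vec![add(b"a"), add(b"b"), del(), add(b"c")];
    assert_eq!(surviving_entries(&history).len(), 2);

    // [add, delete]: the trailing deletion is the whole answer.
    let history = vec![add(b"a"), del()];
    let surviving = surviving_entries(&history);
    assert_eq!(surviving.len(), 1);
    assert_eq!(surviving[0][0], Operation::Deletion as u8);
}
```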
/// Merge all the obkvs from the newest to the oldest.
pub fn obkvs_merge_additions_and_deletions<'a>(
_key: &[u8],
obkvs: &[Cow<'a, [u8]>],
) -> Result<Cow<'a, [u8]>> {
inner_merge_del_add_obkvs(obkvs, true)
}
/// Merge all the obkvs deletions from the newest to the oldest and keep only the newest additions.
pub fn obkvs_keep_last_addition_merge_deletions<'a>(
_key: &[u8],
obkvs: &[Cow<'a, [u8]>],
) -> Result<Cow<'a, [u8]>> {
inner_merge_del_add_obkvs(obkvs, false)
}
/// Do a union of all the CboRoaringBitmaps in the values.
pub fn merge_cbo_roaring_bitmaps<'a>( pub fn merge_cbo_roaring_bitmaps<'a>(
_key: &[u8], _key: &[u8],
values: &[Cow<'a, [u8]>], values: &[Cow<'a, [u8]>],
@ -206,36 +138,3 @@ pub fn merge_cbo_roaring_bitmaps<'a>(
Ok(Cow::from(vec)) Ok(Cow::from(vec))
} }
} }
/// Do a union of CboRoaringBitmaps on both sides of a DelAdd obkv
/// separately and outputs a new DelAdd with both unions.
pub fn merge_deladd_cbo_roaring_bitmaps<'a>(
_key: &[u8],
values: &[Cow<'a, [u8]>],
) -> Result<Cow<'a, [u8]>> {
if values.len() == 1 {
Ok(values[0].clone())
} else {
// Retrieve the bitmaps from both sides
let mut del_bitmaps_bytes = Vec::new();
let mut add_bitmaps_bytes = Vec::new();
for value in values {
let obkv = KvReaderDelAdd::new(value);
if let Some(bitmap_bytes) = obkv.get(DelAdd::Deletion) {
del_bitmaps_bytes.push(bitmap_bytes);
}
if let Some(bitmap_bytes) = obkv.get(DelAdd::Addition) {
add_bitmaps_bytes.push(bitmap_bytes);
}
}
let mut output_deladd_obkv = KvWriterDelAdd::memory();
let mut buffer = Vec::new();
CboRoaringBitmapCodec::merge_into(del_bitmaps_bytes, &mut buffer)?;
output_deladd_obkv.insert(DelAdd::Deletion, &buffer)?;
buffer.clear();
CboRoaringBitmapCodec::merge_into(add_bitmaps_bytes, &mut buffer)?;
output_deladd_obkv.insert(DelAdd::Addition, &buffer)?;
output_deladd_obkv.into_inner().map(Cow::from).map_err(Into::into)
}
}
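
`merge_cbo_roaring_bitmaps` unions every serialized bitmap it receives into a single value, and the left-column `merge_deladd_cbo_roaring_bitmaps` does the same separately for the deletion and addition sides of a DelAdd obkv. The sketch below shows the union idea using the `roaring` crate's standard serialization instead of the crate's Cbo codec and `MergeFn` signature; it is a simplified shape, not the real function.

```rust
use std::borrow::Cow;

use roaring::RoaringBitmap;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

/// Simplified merge function: deserialize every value, union them all,
/// and serialize the result back into a single owned value.
fn merge_bitmaps<'a>(_key: &[u8], values: &[Cow<'a, [u8]>]) -> Result<Cow<'a, [u8]>> {
    if values.len() == 1 {
        return Ok(values[0].clone());
    }
    let mut union = RoaringBitmap::new();
    for value in values {
        union |= RoaringBitmap::deserialize_from(&value[..])?;
    }
    let mut out = Vec::with_capacity(union.serialized_size());
    union.serialize_into(&mut out)?;
    Ok(Cow::Owned(out))
}

fn main() -> Result<()> {
    let serialize = |ids: &[u32]| -> Result<Vec<u8>> {
        let bitmap: RoaringBitmap = ids.iter().copied().collect();
        let mut bytes = Vec::new();
        bitmap.serialize_into(&mut bytes)?;
        Ok(bytes)
    };
    let a = serialize(&[1, 2, 3])?;
    let b = serialize(&[3, 4])?;
    let merged = merge_bitmaps(b"word", &[Cow::Owned(a), Cow::Owned(b)])?;
    let bitmap = RoaringBitmap::deserialize_from(&merged[..])?;
    assert_eq!(bitmap.len(), 4);
    Ok(())
}
```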

View File

@ -14,8 +14,7 @@ pub use grenad_helpers::{
}; };
pub use merge_functions::{ pub use merge_functions::{
concat_u32s_array, keep_first, keep_latest_obkv, merge_btreeset_string, concat_u32s_array, keep_first, keep_latest_obkv, merge_btreeset_string,
merge_cbo_roaring_bitmaps, merge_deladd_cbo_roaring_bitmaps, merge_roaring_bitmaps, merge_cbo_roaring_bitmaps, merge_obkvs_and_operations, merge_roaring_bitmaps, merge_two_obkvs,
obkvs_keep_last_addition_merge_deletions, obkvs_merge_additions_and_deletions,
serialize_roaring_bitmap, MergeFn, serialize_roaring_bitmap, MergeFn,
}; };
@ -45,7 +44,6 @@ where
Some((head, tail)) Some((head, tail))
} }
#[allow(unused)]
pub fn read_u32_ne_bytes(bytes: &[u8]) -> impl Iterator<Item = u32> + '_ { pub fn read_u32_ne_bytes(bytes: &[u8]) -> impl Iterator<Item = u32> + '_ {
bytes.chunks_exact(4).flat_map(TryInto::try_into).map(u32::from_ne_bytes) bytes.chunks_exact(4).flat_map(TryInto::try_into).map(u32::from_ne_bytes)
} }
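
`read_u32_ne_bytes` decodes a buffer as a sequence of native-endian `u32`s, four bytes at a time; `chunks_exact` silently drops any trailing remainder. A tiny round-trip check of that behaviour:

```rust
/// Same body as the helper shown above: iterate over 4-byte chunks and
/// decode each as a native-endian u32; a trailing partial chunk is ignored.
fn read_u32_ne_bytes(bytes: &[u8]) -> impl Iterator<Item = u32> + '_ {
    bytes.chunks_exact(4).flat_map(TryInto::try_into).map(u32::from_ne_bytes)
}

fn main() {
    // Encode a few positions the same way the extractors store them...
    let positions = [3u32, 17, 42];
    let mut bytes: Vec<u8> = positions.iter().flat_map(|p| p.to_ne_bytes()).collect();
    // ...and add a stray trailing byte to show it is ignored by chunks_exact.
    bytes.push(0xFF);

    let decoded: Vec<u32> = read_u32_ne_bytes(&bytes).collect();
    assert_eq!(decoded, positions);
}
```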

View File

@ -38,7 +38,7 @@ use crate::update::{
self, DeletionStrategy, IndexerConfig, PrefixWordPairsProximityDocids, UpdateIndexingStep, self, DeletionStrategy, IndexerConfig, PrefixWordPairsProximityDocids, UpdateIndexingStep,
WordPrefixDocids, WordPrefixIntegerDocids, WordsPrefixesFst, WordPrefixDocids, WordPrefixIntegerDocids, WordsPrefixesFst,
}; };
use crate::{CboRoaringBitmapCodec, Index, Result}; use crate::{Index, Result, RoaringBitmapCodec};
static MERGED_DATABASE_COUNT: usize = 7; static MERGED_DATABASE_COUNT: usize = 7;
static PREFIX_DATABASE_COUNT: usize = 5; static PREFIX_DATABASE_COUNT: usize = 5;
@ -406,23 +406,13 @@ where
} }
let typed_chunk = match result? { let typed_chunk = match result? {
TypedChunk::WordDocids { TypedChunk::WordDocids { word_docids_reader, exact_word_docids_reader } => {
word_docids_reader,
exact_word_docids_reader,
word_fid_docids_reader,
} => {
let cloneable_chunk = unsafe { as_cloneable_grenad(&word_docids_reader)? }; let cloneable_chunk = unsafe { as_cloneable_grenad(&word_docids_reader)? };
word_docids = Some(cloneable_chunk); word_docids = Some(cloneable_chunk);
let cloneable_chunk = let cloneable_chunk =
unsafe { as_cloneable_grenad(&exact_word_docids_reader)? }; unsafe { as_cloneable_grenad(&exact_word_docids_reader)? };
exact_word_docids = Some(cloneable_chunk); exact_word_docids = Some(cloneable_chunk);
let cloneable_chunk = unsafe { as_cloneable_grenad(&word_fid_docids_reader)? }; TypedChunk::WordDocids { word_docids_reader, exact_word_docids_reader }
word_fid_docids = Some(cloneable_chunk);
TypedChunk::WordDocids {
word_docids_reader,
exact_word_docids_reader,
word_fid_docids_reader,
}
} }
TypedChunk::WordPairProximityDocids(chunk) => { TypedChunk::WordPairProximityDocids(chunk) => {
let cloneable_chunk = unsafe { as_cloneable_grenad(&chunk)? }; let cloneable_chunk = unsafe { as_cloneable_grenad(&chunk)? };
@ -434,6 +424,11 @@ where
word_position_docids = Some(cloneable_chunk); word_position_docids = Some(cloneable_chunk);
TypedChunk::WordPositionDocids(chunk) TypedChunk::WordPositionDocids(chunk)
} }
TypedChunk::WordFidDocids(chunk) => {
let cloneable_chunk = unsafe { as_cloneable_grenad(&chunk)? };
word_fid_docids = Some(cloneable_chunk);
TypedChunk::WordFidDocids(chunk)
}
otherwise => otherwise, otherwise => otherwise,
}; };
@ -475,14 +470,13 @@ where
let all_documents_ids = index_documents_ids | new_documents_ids; let all_documents_ids = index_documents_ids | new_documents_ids;
self.index.put_documents_ids(self.wtxn, &all_documents_ids)?; self.index.put_documents_ids(self.wtxn, &all_documents_ids)?;
// TODO: reactivate prefix DB with diff-indexing self.execute_prefix_databases(
// self.execute_prefix_databases( word_docids,
// word_docids, exact_word_docids,
// exact_word_docids, word_pair_proximity_docids,
// word_pair_proximity_docids, word_position_docids,
// word_position_docids, word_fid_docids,
// word_fid_docids, )?;
// )?;
Ok(all_documents_ids.len()) Ok(all_documents_ids.len())
} }
@ -696,8 +690,8 @@ where
fn execute_word_prefix_docids( fn execute_word_prefix_docids(
txn: &mut heed::RwTxn, txn: &mut heed::RwTxn,
reader: grenad::Reader<Cursor<ClonableMmap>>, reader: grenad::Reader<Cursor<ClonableMmap>>,
word_docids_db: Database<Str, CboRoaringBitmapCodec>, word_docids_db: Database<Str, RoaringBitmapCodec>,
word_prefix_docids_db: Database<Str, CboRoaringBitmapCodec>, word_prefix_docids_db: Database<Str, RoaringBitmapCodec>,
indexer_config: &IndexerConfig, indexer_config: &IndexerConfig,
new_prefix_fst_words: &[String], new_prefix_fst_words: &[String],
common_prefix_fst_words: &[&[String]], common_prefix_fst_words: &[&[String]],

View File

@ -7,20 +7,18 @@ use std::io::{Read, Seek};
use fxhash::FxHashMap; use fxhash::FxHashMap;
use heed::RoTxn; use heed::RoTxn;
use itertools::Itertools; use itertools::Itertools;
use obkv::{KvReader, KvReaderU16, KvWriter}; use obkv::{KvReader, KvWriter};
use roaring::RoaringBitmap; use roaring::RoaringBitmap;
use serde_json::Value; use serde_json::Value;
use smartstring::SmartString; use smartstring::SmartString;
use super::helpers::{ use super::helpers::{
create_sorter, create_writer, obkvs_keep_last_addition_merge_deletions, create_sorter, create_writer, keep_latest_obkv, merge_obkvs_and_operations, MergeFn,
obkvs_merge_additions_and_deletions, MergeFn,
}; };
use super::{IndexDocumentsMethod, IndexerConfig}; use super::{IndexDocumentsMethod, IndexerConfig};
use crate::documents::{DocumentsBatchIndex, EnrichedDocument, EnrichedDocumentsBatchReader}; use crate::documents::{DocumentsBatchIndex, EnrichedDocument, EnrichedDocumentsBatchReader};
use crate::error::{Error, InternalError, UserError}; use crate::error::{Error, InternalError, UserError};
use crate::index::{db_name, main_key}; use crate::index::{db_name, main_key};
use crate::update::del_add::into_del_add_obkv;
use crate::update::{AvailableDocumentsIds, ClearDocuments, UpdateIndexingStep}; use crate::update::{AvailableDocumentsIds, ClearDocuments, UpdateIndexingStep};
use crate::{ use crate::{
FieldDistribution, FieldId, FieldIdMapMissingEntry, FieldsIdsMap, Index, Result, BEU32, FieldDistribution, FieldId, FieldIdMapMissingEntry, FieldsIdsMap, Index, Result, BEU32,
@ -108,8 +106,8 @@ impl<'a, 'i> Transform<'a, 'i> {
// We must choose the appropriate merge function for when two or more documents // We must choose the appropriate merge function for when two or more documents
// with the same user id must be merged or fully replaced in the same batch. // with the same user id must be merged or fully replaced in the same batch.
let merge_function = match index_documents_method { let merge_function = match index_documents_method {
IndexDocumentsMethod::ReplaceDocuments => obkvs_keep_last_addition_merge_deletions, IndexDocumentsMethod::ReplaceDocuments => keep_latest_obkv,
IndexDocumentsMethod::UpdateDocuments => obkvs_merge_additions_and_deletions, IndexDocumentsMethod::UpdateDocuments => merge_obkvs_and_operations,
}; };
// We initialize the sorter with the user indexing settings. // We initialize the sorter with the user indexing settings.
@ -225,21 +223,19 @@ impl<'a, 'i> Transform<'a, 'i> {
let docid = match self.new_external_documents_ids_builder.entry((*external_id).into()) { let docid = match self.new_external_documents_ids_builder.entry((*external_id).into()) {
Entry::Occupied(entry) => *entry.get() as u32, Entry::Occupied(entry) => *entry.get() as u32,
Entry::Vacant(entry) => { Entry::Vacant(entry) => {
let docid = match external_documents_ids.get(entry.key()) { // If the document was already in the db we mark it as a replaced document.
Some(docid) => { // It'll be deleted later.
// If it was already in the list of replaced documents it means it was deleted if let Some(docid) = external_documents_ids.get(entry.key()) {
// by the remove_document method. We should start as if it never existed. // If it was already in the list of replaced documents it means it was deleted
if self.replaced_documents_ids.insert(docid) { // by the remove_document method. We should start as if it never existed.
original_docid = Some(docid); if self.replaced_documents_ids.insert(docid) {
} original_docid = Some(docid);
docid
} }
None => self }
.available_documents_ids let docid = self
.next() .available_documents_ids
.ok_or(UserError::DocumentLimitReached)?, .next()
}; .ok_or(UserError::DocumentLimitReached)?;
entry.insert(docid as u64); entry.insert(docid as u64);
docid docid
} }
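
For every incoming external id the transform either reuses the internal docid already present in `external_documents_ids` (remembering the old version as replaced) or draws a fresh id from `available_documents_ids`, failing with `DocumentLimitReached` once the id space is exhausted. Below is a stripped-down model of that decision; the types and names are stand-ins, not the crate's.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum AllocError {
    DocumentLimitReached,
}

/// Reuse the docid of an existing document (and remember it as replaced),
/// or hand out the next free internal docid.
fn resolve_docid(
    external_id: &str,
    existing: &HashMap<String, u32>,
    replaced: &mut Vec<u32>,
    next_free: &mut impl Iterator<Item = u32>,
) -> Result<u32, AllocError> {
    if let Some(&docid) = existing.get(external_id) {
        replaced.push(docid);
        return Ok(docid);
    }
    next_free.next().ok_or(AllocError::DocumentLimitReached)
}

fn main() {
    let existing = HashMap::from([("doc-a".to_string(), 0u32)]);
    let mut replaced = Vec::new();
    let mut next_free = 1u32..=2; // pretend only docids 1 and 2 are still available

    assert_eq!(resolve_docid("doc-a", &existing, &mut replaced, &mut next_free), Ok(0));
    assert_eq!(replaced, vec![0]);
    assert_eq!(resolve_docid("doc-b", &existing, &mut replaced, &mut next_free), Ok(1));
    assert_eq!(resolve_docid("doc-c", &existing, &mut replaced, &mut next_free), Ok(2));
    assert_eq!(
        resolve_docid("doc-d", &existing, &mut replaced, &mut next_free),
        Err(AllocError::DocumentLimitReached)
    );
}
```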
@ -267,28 +263,16 @@ impl<'a, 'i> Transform<'a, 'i> {
skip_insertion = true; skip_insertion = true;
} else { } else {
// we associate the base document with the new key, everything will get merged later. // we associate the base document with the new key, everything will get merged later.
let keep_original_version =
self.index_documents_method == IndexDocumentsMethod::UpdateDocuments;
document_sorter_buffer.clear(); document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Addition as u8); document_sorter_buffer.push(Operation::Addition as u8);
into_del_add_obkv( document_sorter_buffer.extend_from_slice(base_obkv);
KvReaderU16::new(base_obkv),
true,
keep_original_version,
&mut document_sorter_buffer,
)?;
self.original_sorter.insert(docid.to_be_bytes(), &document_sorter_buffer)?; self.original_sorter.insert(docid.to_be_bytes(), &document_sorter_buffer)?;
match self.flatten_from_fields_ids_map(KvReader::new(base_obkv))? { match self.flatten_from_fields_ids_map(KvReader::new(base_obkv))? {
Some(flattened_obkv) => { Some(flattened_obkv) => {
// we recreate our buffer with the flattened documents // we recreate our buffer with the flattened documents
document_sorter_buffer.clear(); document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Addition as u8); document_sorter_buffer.push(Operation::Addition as u8);
into_del_add_obkv( document_sorter_buffer.extend_from_slice(&flattened_obkv);
KvReaderU16::new(&flattened_obkv),
true,
keep_original_version,
&mut document_sorter_buffer,
)?;
self.flattened_sorter self.flattened_sorter
.insert(docid.to_be_bytes(), &document_sorter_buffer)? .insert(docid.to_be_bytes(), &document_sorter_buffer)?
} }
@ -304,12 +288,7 @@ impl<'a, 'i> Transform<'a, 'i> {
document_sorter_buffer.clear(); document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Addition as u8); document_sorter_buffer.push(Operation::Addition as u8);
into_del_add_obkv( document_sorter_buffer.extend_from_slice(&obkv_buffer);
KvReaderU16::new(&obkv_buffer),
false,
true,
&mut document_sorter_buffer,
)?;
// We use the extracted/generated user id as the key for this document. // We use the extracted/generated user id as the key for this document.
self.original_sorter.insert(docid.to_be_bytes(), &document_sorter_buffer)?; self.original_sorter.insert(docid.to_be_bytes(), &document_sorter_buffer)?;
@ -317,12 +296,7 @@ impl<'a, 'i> Transform<'a, 'i> {
Some(flattened_obkv) => { Some(flattened_obkv) => {
document_sorter_buffer.clear(); document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Addition as u8); document_sorter_buffer.push(Operation::Addition as u8);
into_del_add_obkv( document_sorter_buffer.extend_from_slice(&flattened_obkv);
KvReaderU16::new(&flattened_obkv),
false,
true,
&mut document_sorter_buffer,
)?;
self.flattened_sorter self.flattened_sorter
.insert(docid.to_be_bytes(), &document_sorter_buffer)? .insert(docid.to_be_bytes(), &document_sorter_buffer)?
} }
@ -380,25 +354,19 @@ impl<'a, 'i> Transform<'a, 'i> {
let external_documents_ids = self.index.external_documents_ids(wtxn)?; let external_documents_ids = self.index.external_documents_ids(wtxn)?;
let mut documents_deleted = 0; let mut documents_deleted = 0;
let mut document_sorter_buffer = Vec::new();
for to_remove in to_remove { for to_remove in to_remove {
if should_abort() { if should_abort() {
return Err(Error::InternalError(InternalError::AbortedIndexation)); return Err(Error::InternalError(InternalError::AbortedIndexation));
} }
// Check if the document has been added in the current indexing process. match self.new_external_documents_ids_builder.entry((*to_remove).into()) {
let deleted_from_current = match self
.new_external_documents_ids_builder
.entry((*to_remove).into())
{
// if the document was added in a previous iteration of the transform we mark it as deleted in the sorters. // if the document was added in a previous iteration of the transform we mark it as deleted in the sorters.
Entry::Occupied(entry) => { Entry::Occupied(entry) => {
let doc_id = *entry.get() as u32; let doc_id = *entry.get() as u32;
document_sorter_buffer.clear(); self.original_sorter
document_sorter_buffer.push(Operation::Deletion as u8); .insert(doc_id.to_be_bytes(), [Operation::Deletion as u8])?;
obkv::KvWriterU16::new(&mut document_sorter_buffer).finish().unwrap(); self.flattened_sorter
self.original_sorter.insert(doc_id.to_be_bytes(), &document_sorter_buffer)?; .insert(doc_id.to_be_bytes(), [Operation::Deletion as u8])?;
self.flattened_sorter.insert(doc_id.to_be_bytes(), &document_sorter_buffer)?;
// we must NOT update the list of replaced_documents_ids // we must NOT update the list of replaced_documents_ids
// Either: // Either:
@ -407,69 +375,21 @@ impl<'a, 'i> Transform<'a, 'i> {
// we're removing it there is nothing to do. // we're removing it there is nothing to do.
self.new_documents_ids.remove(doc_id); self.new_documents_ids.remove(doc_id);
entry.remove_entry(); entry.remove_entry();
true
} }
Entry::Vacant(_) => false, Entry::Vacant(entry) => {
}; // If the document was already in the db we mark it as a `to_delete` document.
// It'll be deleted later. We don't need to push anything to the sorters.
// If the document was already in the db we mark it as a `to_delete` document. if let Some(docid) = external_documents_ids.get(entry.key()) {
// Then we push the document in sorters in deletion mode. self.replaced_documents_ids.insert(docid);
let deleted_from_db = match external_documents_ids.get(&to_remove) { } else {
Some(docid) => { // if the document is nowhere to be found, there is nothing to do and we must NOT
self.replaced_documents_ids.insert(docid); // increment the count of documents_deleted
continue;
// fetch the obkv document
let original_key = BEU32::new(docid);
let base_obkv = self
.index
.documents
.remap_data_type::<heed::types::ByteSlice>()
.get(wtxn, &original_key)?
.ok_or(InternalError::DatabaseMissingEntry {
db_name: db_name::DOCUMENTS,
key: None,
})?;
// push it as to delete in the original_sorter
document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Deletion as u8);
into_del_add_obkv(
KvReaderU16::new(base_obkv),
true,
false,
&mut document_sorter_buffer,
)?;
self.original_sorter.insert(docid.to_be_bytes(), &document_sorter_buffer)?;
// flatten it and push it as to delete in the flattened_sorter
match self.flatten_from_fields_ids_map(KvReader::new(base_obkv))? {
Some(flattened_obkv) => {
// we recreate our buffer with the flattened documents
document_sorter_buffer.clear();
document_sorter_buffer.push(Operation::Deletion as u8);
into_del_add_obkv(
KvReaderU16::new(&flattened_obkv),
true,
false,
&mut document_sorter_buffer,
)?;
self.flattened_sorter
.insert(docid.to_be_bytes(), &document_sorter_buffer)?
}
None => self
.flattened_sorter
.insert(docid.to_be_bytes(), &document_sorter_buffer)?,
} }
true
} }
None => false,
}; };
// increase counter only if the document existed somewhere before. documents_deleted += 1;
if deleted_from_current || deleted_from_db {
documents_deleted += 1;
}
} }
Ok(documents_deleted) Ok(documents_deleted)
@ -669,7 +589,9 @@ impl<'a, 'i> Transform<'a, 'i> {
let mut documents_count = 0; let mut documents_count = 0;
while let Some((key, val)) = iter.next()? { while let Some((key, val)) = iter.next()? {
// skip first byte corresponding to the operation type (Deletion or Addition). if val[0] == Operation::Deletion as u8 {
continue;
}
let val = &val[1..]; let val = &val[1..];
// send a callback to show at which step we are // send a callback to show at which step we are
@ -709,7 +631,9 @@ impl<'a, 'i> Transform<'a, 'i> {
// We get rid of the `Operation` byte and skip the deleted documents as well. // We get rid of the `Operation` byte and skip the deleted documents as well.
let mut iter = self.flattened_sorter.into_stream_merger_iter()?; let mut iter = self.flattened_sorter.into_stream_merger_iter()?;
while let Some((key, val)) = iter.next()? { while let Some((key, val)) = iter.next()? {
// skip first byte corresponding to the operation type (Deletion or Addition). if val[0] == Operation::Deletion as u8 {
continue;
}
let val = &val[1..]; let val = &val[1..];
writer.insert(key, val)?; writer.insert(key, val)?;
} }
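
When the sorters are drained into the final writers, the right-column code checks the leading `Operation` byte, skips entries tagged as deletions, and strips that byte from everything it keeps. A compact sketch of that filtering pass over (key, value) pairs:

```rust
// Sketch of the drain step: entries whose first byte marks a deletion are
// dropped, and the operation byte is stripped from the surviving payloads.
const ADDITION: u8 = 0;
const DELETION: u8 = 1;

fn drain_into_writer(entries: Vec<(Vec<u8>, Vec<u8>)>) -> Vec<(Vec<u8>, Vec<u8>)> {
    entries
        .into_iter()
        .filter(|(_key, val)| val[0] != DELETION)
        .map(|(key, val)| (key, val[1..].to_vec()))
        .collect()
}

fn main() {
    let entries = vec![
        (b"1".to_vec(), vec![ADDITION, 0xAA, 0xBB]),
        (b"2".to_vec(), vec![DELETION]),
        (b"3".to_vec(), vec![ADDITION, 0xCC]),
    ];
    let kept = drain_into_writer(entries);
    assert_eq!(kept.len(), 2);
    assert_eq!(kept[0].1, vec![0xAA, 0xBB]);
    assert_eq!(kept[1].1, vec![0xCC]);
}
```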
@ -735,8 +659,10 @@ impl<'a, 'i> Transform<'a, 'i> {
new_documents_ids: self.new_documents_ids, new_documents_ids: self.new_documents_ids,
replaced_documents_ids: self.replaced_documents_ids, replaced_documents_ids: self.replaced_documents_ids,
documents_count: self.documents_count, documents_count: self.documents_count,
original_documents, original_documents: original_documents.into_inner().map_err(|err| err.into_error())?,
flattened_documents, flattened_documents: flattened_documents
.into_inner()
.map_err(|err| err.into_error())?,
}) })
} }
@ -787,7 +713,6 @@ impl<'a, 'i> Transform<'a, 'i> {
); );
let mut obkv_buffer = Vec::new(); let mut obkv_buffer = Vec::new();
let mut document_sorter_buffer = Vec::new();
for result in self.index.all_documents(wtxn)? { for result in self.index.all_documents(wtxn)? {
let (docid, obkv) = result?; let (docid, obkv) = result?;
@ -802,9 +727,7 @@ impl<'a, 'i> Transform<'a, 'i> {
            }
            let buffer = obkv_writer.into_inner()?;
-            document_sorter_buffer.clear();
-            into_del_add_obkv(KvReaderU16::new(buffer), false, true, &mut document_sorter_buffer)?;
-            original_writer.insert(docid.to_be_bytes(), &document_sorter_buffer)?;
+            original_writer.insert(docid.to_be_bytes(), &buffer)?;
            // Once we have the document. We're going to flatten it
            // and insert it in the flattened sorter.
@ -839,9 +762,7 @@ impl<'a, 'i> Transform<'a, 'i> {
                let value = serde_json::to_vec(&value).map_err(InternalError::SerdeJson)?;
                writer.insert(fid, &value)?;
            }
-            document_sorter_buffer.clear();
-            into_del_add_obkv(KvReaderU16::new(&buffer), false, true, &mut document_sorter_buffer)?;
-            flattened_writer.insert(docid.to_be_bytes(), &document_sorter_buffer)?;
+            flattened_writer.insert(docid.to_be_bytes(), &buffer)?;
        }
        // Once we have written all the documents, we extract
@ -860,8 +781,10 @@ impl<'a, 'i> Transform<'a, 'i> {
            new_documents_ids: documents_ids,
            replaced_documents_ids: RoaringBitmap::default(),
            documents_count,
-            original_documents,
-            flattened_documents,
+            original_documents: original_documents.into_inner().map_err(|err| err.into_error())?,
+            flattened_documents: flattened_documents
+                .into_inner()
+                .map_err(|err| err.into_error())?,
        };
        let new_facets = output.compute_real_facets(wtxn, self.index)?;
@ -905,86 +828,38 @@ mod test {
    #[test]
    fn merge_obkvs() {
-        let mut additive_doc_0 = Vec::new();
-        let mut deletive_doc_0 = Vec::new();
-        let mut del_add_doc_0 = Vec::new();
-        let mut kv_writer = KvWriter::memory();
+        let mut doc_0 = Vec::new();
+        let mut kv_writer = KvWriter::new(&mut doc_0);
        kv_writer.insert(0_u8, [0]).unwrap();
-        let buffer = kv_writer.into_inner().unwrap();
-        into_del_add_obkv(KvReaderU16::new(&buffer), false, true, &mut additive_doc_0).unwrap();
-        additive_doc_0.insert(0, Operation::Addition as u8);
-        into_del_add_obkv(KvReaderU16::new(&buffer), true, false, &mut deletive_doc_0).unwrap();
-        deletive_doc_0.insert(0, Operation::Deletion as u8);
-        into_del_add_obkv(KvReaderU16::new(&buffer), true, true, &mut del_add_doc_0).unwrap();
-        del_add_doc_0.insert(0, Operation::Addition as u8);
+        kv_writer.finish().unwrap();
+        doc_0.insert(0, Operation::Addition as u8);
-        let mut additive_doc_1 = Vec::new();
-        let mut kv_writer = KvWriter::memory();
-        kv_writer.insert(1_u8, [1]).unwrap();
-        let buffer = kv_writer.into_inner().unwrap();
-        into_del_add_obkv(KvReaderU16::new(&buffer), false, true, &mut additive_doc_1).unwrap();
-        additive_doc_1.insert(0, Operation::Addition as u8);
-        let mut additive_doc_0_1 = Vec::new();
-        let mut kv_writer = KvWriter::memory();
-        kv_writer.insert(0_u8, [0]).unwrap();
-        kv_writer.insert(1_u8, [1]).unwrap();
-        let buffer = kv_writer.into_inner().unwrap();
-        into_del_add_obkv(KvReaderU16::new(&buffer), false, true, &mut additive_doc_0_1).unwrap();
-        additive_doc_0_1.insert(0, Operation::Addition as u8);
-        let ret = obkvs_merge_additions_and_deletions(&[], &[Cow::from(additive_doc_0.as_slice())])
-            .unwrap();
-        assert_eq!(*ret, additive_doc_0);
-        let ret = obkvs_merge_additions_and_deletions(
+        let ret = merge_obkvs_and_operations(&[], &[Cow::from(doc_0.as_slice())]).unwrap();
+        assert_eq!(*ret, doc_0);
+        let ret = merge_obkvs_and_operations(
            &[],
-            &[Cow::from(deletive_doc_0.as_slice()), Cow::from(additive_doc_0.as_slice())],
+            &[Cow::from([Operation::Deletion as u8].as_slice()), Cow::from(doc_0.as_slice())],
        )
        .unwrap();
-        assert_eq!(*ret, del_add_doc_0);
+        assert_eq!(*ret, doc_0);
-        let ret = obkvs_merge_additions_and_deletions(
+        let ret = merge_obkvs_and_operations(
            &[],
-            &[Cow::from(additive_doc_0.as_slice()), Cow::from(deletive_doc_0.as_slice())],
+            &[Cow::from(doc_0.as_slice()), Cow::from([Operation::Deletion as u8].as_slice())],
        )
        .unwrap();
-        assert_eq!(*ret, deletive_doc_0);
+        assert_eq!(*ret, [Operation::Deletion as u8]);
-        let ret = obkvs_merge_additions_and_deletions(
+        let ret = merge_obkvs_and_operations(
            &[],
            &[
-                Cow::from(additive_doc_1.as_slice()),
-                Cow::from(deletive_doc_0.as_slice()),
-                Cow::from(additive_doc_0.as_slice()),
+                Cow::from([Operation::Addition as u8, 1].as_slice()),
+                Cow::from([Operation::Deletion as u8].as_slice()),
+                Cow::from(doc_0.as_slice()),
            ],
        )
        .unwrap();
-        assert_eq!(*ret, del_add_doc_0);
+        assert_eq!(*ret, doc_0);
-        let ret = obkvs_merge_additions_and_deletions(
-            &[],
-            &[Cow::from(additive_doc_1.as_slice()), Cow::from(additive_doc_0.as_slice())],
-        )
-        .unwrap();
-        assert_eq!(*ret, additive_doc_0_1);
-        let ret = obkvs_keep_last_addition_merge_deletions(
-            &[],
-            &[Cow::from(additive_doc_1.as_slice()), Cow::from(additive_doc_0.as_slice())],
-        )
-        .unwrap();
-        assert_eq!(*ret, additive_doc_0);
-        let ret = obkvs_keep_last_addition_merge_deletions(
-            &[],
-            &[
-                Cow::from(deletive_doc_0.as_slice()),
-                Cow::from(additive_doc_1.as_slice()),
-                Cow::from(additive_doc_0.as_slice()),
-            ],
-        )
-        .unwrap();
-        assert_eq!(*ret, del_add_doc_0);
    }
}
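The values flowing through these merge helpers are plain byte buffers: one leading `Operation` byte (Addition or Deletion) followed by the obkv payload, which is why a bare deletion can be written as `[Operation::Deletion as u8]` in the assertions above. A minimal sketch of that framing, with the enum re-declared here purely for illustration (the real `Operation` lives in milli's Transform module and its exact discriminants are not shown in this diff):

```rust
// Illustrative re-declaration, not the milli definition.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Operation {
    Addition = 0,
    Deletion = 1,
}

// Frame a payload the way the sorter values above are framed:
// one operation byte followed by the (obkv) payload bytes.
fn frame(op: Operation, payload: &[u8]) -> Vec<u8> {
    let mut value = Vec::with_capacity(1 + payload.len());
    value.push(op as u8);
    value.extend_from_slice(payload);
    value
}

fn main() {
    let addition = frame(Operation::Addition, b"obkv bytes");
    assert_eq!(addition[0], Operation::Addition as u8);

    // A bare deletion carries no payload, exactly like
    // `[Operation::Deletion as u8]` in the tests above.
    let deletion = frame(Operation::Deletion, &[]);
    assert_eq!(deletion, [Operation::Deletion as u8]);
}
```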

View File

@ -2,7 +2,7 @@ use std::borrow::Cow;
use std::collections::HashMap;
use std::convert::TryInto;
use std::fs::File;
-use std::io;
+use std::io::{self, BufReader};
use bytemuck::allocation::pod_collect_to_vec;
use charabia::{Language, Script};
@ -27,23 +27,23 @@ pub(crate) enum TypedChunk {
    FieldIdDocidFacetStrings(grenad::Reader<CursorClonableMmap>),
    FieldIdDocidFacetNumbers(grenad::Reader<CursorClonableMmap>),
    Documents(grenad::Reader<CursorClonableMmap>),
-    FieldIdWordcountDocids(grenad::Reader<File>),
+    FieldIdWordcountDocids(grenad::Reader<BufReader<File>>),
    NewDocumentsIds(RoaringBitmap),
    WordDocids {
-        word_docids_reader: grenad::Reader<File>,
-        exact_word_docids_reader: grenad::Reader<File>,
-        word_fid_docids_reader: grenad::Reader<File>,
+        word_docids_reader: grenad::Reader<BufReader<File>>,
+        exact_word_docids_reader: grenad::Reader<BufReader<File>>,
    },
-    WordPositionDocids(grenad::Reader<File>),
-    WordPairProximityDocids(grenad::Reader<File>),
-    FieldIdFacetStringDocids(grenad::Reader<File>),
-    FieldIdFacetNumberDocids(grenad::Reader<File>),
-    FieldIdFacetExistsDocids(grenad::Reader<File>),
-    FieldIdFacetIsNullDocids(grenad::Reader<File>),
-    FieldIdFacetIsEmptyDocids(grenad::Reader<File>),
-    GeoPoints(grenad::Reader<File>),
-    VectorPoints(grenad::Reader<File>),
-    ScriptLanguageDocids(HashMap<(Script, Language), (RoaringBitmap, RoaringBitmap)>),
+    WordPositionDocids(grenad::Reader<BufReader<File>>),
+    WordFidDocids(grenad::Reader<BufReader<File>>),
+    WordPairProximityDocids(grenad::Reader<BufReader<File>>),
+    FieldIdFacetStringDocids(grenad::Reader<BufReader<File>>),
+    FieldIdFacetNumberDocids(grenad::Reader<BufReader<File>>),
+    FieldIdFacetExistsDocids(grenad::Reader<BufReader<File>>),
+    FieldIdFacetIsNullDocids(grenad::Reader<BufReader<File>>),
+    FieldIdFacetIsEmptyDocids(grenad::Reader<BufReader<File>>),
+    GeoPoints(grenad::Reader<BufReader<File>>),
+    VectorPoints(grenad::Reader<BufReader<File>>),
+    ScriptLanguageDocids(HashMap<(Script, Language), RoaringBitmap>),
}
impl TypedChunk {
@ -64,19 +64,17 @@ impl TypedChunk {
            TypedChunk::NewDocumentsIds(grenad) => {
                format!("NewDocumentsIds {{ number_of_entries: {} }}", grenad.len())
            }
-            TypedChunk::WordDocids {
-                word_docids_reader,
-                exact_word_docids_reader,
-                word_fid_docids_reader,
-            } => format!(
-                "WordDocids {{ word_docids_reader: {}, exact_word_docids_reader: {}, word_fid_docids_reader: {} }}",
+            TypedChunk::WordDocids { word_docids_reader, exact_word_docids_reader } => format!(
+                "WordDocids {{ word_docids_reader: {}, exact_word_docids_reader: {} }}",
                word_docids_reader.len(),
-                exact_word_docids_reader.len(),
-                word_fid_docids_reader.len()
+                exact_word_docids_reader.len()
            ),
            TypedChunk::WordPositionDocids(grenad) => {
                format!("WordPositionDocids {{ number_of_entries: {} }}", grenad.len())
            }
+            TypedChunk::WordFidDocids(grenad) => {
+                format!("WordFidDocids {{ number_of_entries: {} }}", grenad.len())
+            }
            TypedChunk::WordPairProximityDocids(grenad) => {
                format!("WordPairProximityDocids {{ number_of_entries: {} }}", grenad.len())
            }
@ -101,8 +99,8 @@ impl TypedChunk {
            TypedChunk::VectorPoints(grenad) => {
                format!("VectorPoints {{ number_of_entries: {} }}", grenad.len())
            }
-            TypedChunk::ScriptLanguageDocids(sl_map) => {
-                format!("ScriptLanguageDocids {{ number_of_entries: {} }}", sl_map.len())
+            TypedChunk::ScriptLanguageDocids(grenad) => {
+                format!("ScriptLanguageDocids {{ number_of_entries: {} }}", grenad.len())
            }
        }
    }
@ -140,11 +138,7 @@ pub(crate) fn write_typed_chunk_into_index(
        TypedChunk::NewDocumentsIds(documents_ids) => {
            return Ok((documents_ids, is_merged_database))
        }
-        TypedChunk::WordDocids {
-            word_docids_reader,
-            exact_word_docids_reader,
-            word_fid_docids_reader,
-        } => {
+        TypedChunk::WordDocids { word_docids_reader, exact_word_docids_reader } => {
            let word_docids_iter = unsafe { as_cloneable_grenad(&word_docids_reader) }?;
            append_entries_into_database(
                word_docids_iter.clone(),
@ -152,7 +146,7 @@ pub(crate) fn write_typed_chunk_into_index(
                wtxn,
                index_is_empty,
                |value, _buffer| Ok(value),
-                merge_cbo_roaring_bitmaps,
+                merge_roaring_bitmaps,
            )?;
            let exact_word_docids_iter = unsafe { as_cloneable_grenad(&exact_word_docids_reader) }?;
@ -162,17 +156,7 @@ pub(crate) fn write_typed_chunk_into_index(
                wtxn,
                index_is_empty,
                |value, _buffer| Ok(value),
-                merge_cbo_roaring_bitmaps,
-            )?;
-            let word_fid_docids_iter = unsafe { as_cloneable_grenad(&word_fid_docids_reader) }?;
-            append_entries_into_database(
-                word_fid_docids_iter,
-                &index.word_fid_docids,
-                wtxn,
-                index_is_empty,
-                |value, _buffer| Ok(value),
-                merge_cbo_roaring_bitmaps,
+                merge_roaring_bitmaps,
            )?;
            // create fst from word docids
@ -198,6 +182,17 @@ pub(crate) fn write_typed_chunk_into_index(
            )?;
            is_merged_database = true;
        }
+        TypedChunk::WordFidDocids(word_fid_docids_iter) => {
+            append_entries_into_database(
+                word_fid_docids_iter,
+                &index.word_fid_docids,
+                wtxn,
+                index_is_empty,
+                |value, _buffer| Ok(value),
+                merge_cbo_roaring_bitmaps,
+            )?;
+            is_merged_database = true;
+        }
        TypedChunk::FieldIdFacetNumberDocids(facet_id_number_docids_iter) => {
            let indexer = FacetsUpdate::new(index, FacetType::Number, facet_id_number_docids_iter);
            indexer.execute(wtxn)?;
@ -344,25 +339,22 @@ pub(crate) fn write_typed_chunk_into_index(
            log::debug!("There are {} entries in the HNSW so far", hnsw_length);
            index.put_vector_hnsw(wtxn, &new_hnsw)?;
        }
-        TypedChunk::ScriptLanguageDocids(sl_map) => {
-            for (key, (deletion, addition)) in sl_map {
-                let mut db_key_exists = false;
+        TypedChunk::ScriptLanguageDocids(hash_pair) => {
+            let mut buffer = Vec::new();
+            for (key, value) in hash_pair {
+                buffer.clear();
                let final_value = match index.script_language_docids.get(wtxn, &key)? {
                    Some(db_values) => {
-                        db_key_exists = true;
-                        (db_values - deletion) | addition
+                        let mut db_value_buffer = Vec::new();
+                        serialize_roaring_bitmap(&db_values, &mut db_value_buffer)?;
+                        let mut new_value_buffer = Vec::new();
+                        serialize_roaring_bitmap(&value, &mut new_value_buffer)?;
+                        merge_roaring_bitmaps(&new_value_buffer, &db_value_buffer, &mut buffer)?;
+                        RoaringBitmap::deserialize_from(&buffer[..])?
                    }
-                    None => addition,
+                    None => value,
                };
-                if final_value.is_empty() {
-                    // If the database entry exists, delete it.
-                    if db_key_exists == true {
-                        index.script_language_docids.delete(wtxn, &key)?;
-                    }
-                } else {
-                    index.script_language_docids.put(wtxn, &key, &final_value)?;
-                }
+                index.script_language_docids.put(wtxn, &key, &final_value)?;
            }
        }
    }
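The `+` side of this hunk reduces to a read-merge-write upsert: fetch the bitmap already stored for the `(Script, Language)` key, union it with the incoming one, and put the merged result back. A simplified in-memory sketch of that flow, assuming only the roaring crate (a `HashMap` and a generic key stand in for the LMDB database, so shapes and names here are illustrative):

```rust
use std::collections::HashMap;
use std::hash::Hash;

use roaring::RoaringBitmap;

// Union the incoming bitmap with whatever is already stored, then write it back.
// In the real code the key is (Script, Language) and the map is an LMDB database.
fn upsert_docids<K: Eq + Hash>(db: &mut HashMap<K, RoaringBitmap>, key: K, incoming: RoaringBitmap) {
    let merged = match db.get(&key) {
        Some(existing) => existing | &incoming, // never lose previously indexed docids
        None => incoming,
    };
    db.insert(key, merged);
}

fn main() {
    let mut db: HashMap<&str, RoaringBitmap> = HashMap::new();
    upsert_docids(&mut db, "latin/eng", [1u32, 2, 3].into_iter().collect());
    upsert_docids(&mut db, "latin/eng", [3u32, 4].into_iter().collect());
    assert_eq!(db["latin/eng"].len(), 4); // {1, 2, 3, 4}
}
```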
@ -387,6 +379,13 @@ fn merge_word_docids_reader_into_fst(
    Ok(builder.into_set())
}
+fn merge_roaring_bitmaps(new_value: &[u8], db_value: &[u8], buffer: &mut Vec<u8>) -> Result<()> {
+    let new_value = RoaringBitmap::deserialize_from(new_value)?;
+    let db_value = RoaringBitmap::deserialize_from(db_value)?;
+    let value = new_value | db_value;
+    Ok(serialize_roaring_bitmap(&value, buffer)?)
+}
fn merge_cbo_roaring_bitmaps(
    new_value: &[u8],
    db_value: &[u8],
@ -456,7 +455,6 @@ where
    R: io::Read + io::Seek,
    FS: for<'a> Fn(&'a [u8], &'a mut Vec<u8>) -> Result<&'a [u8]>,
    FM: Fn(&[u8], &[u8], &mut Vec<u8>) -> Result<()>,
-    K: for<'a> heed::BytesDecode<'a>,
{
    puffin::profile_function!(format!("number of entries: {}", data.len()));
@ -477,12 +475,6 @@ where
    let mut cursor = data.into_cursor()?;
    while let Some((key, value)) = cursor.move_on_next()? {
        if valid_lmdb_key(key) {
-            debug_assert!(
-                K::bytes_decode(&key).is_some(),
-                "Couldn't decode key with the database decoder, key length: {} - key bytes: {:x?}",
-                key.len(),
-                &key
-            );
            buffer.clear();
            let value = serialize_value(value, &mut buffer)?;
            unsafe { database.append(key, value)? };

View File

@ -21,7 +21,6 @@ pub use self::words_prefixes_fst::WordsPrefixesFst;
mod available_documents_ids;
mod clear_documents;
-pub(crate) mod del_add;
mod delete_documents;
pub(crate) mod facet;
mod index_documents;

View File

@ -1,6 +1,6 @@
use std::borrow::Cow;
use std::collections::HashSet;
-use std::io::BufReader;
+use std::io::{BufReader, BufWriter};
use grenad::CompressionType;
use heed::types::ByteSlice;
@ -119,9 +119,9 @@ pub fn insert_into_database(
pub fn write_into_lmdb_database_without_merging(
    wtxn: &mut heed::RwTxn,
    database: heed::PolyDatabase,
-    writer: grenad::Writer<std::fs::File>,
+    writer: grenad::Writer<BufWriter<std::fs::File>>,
) -> Result<()> {
-    let file = writer.into_inner()?;
+    let file = writer.into_inner()?.into_inner().map_err(|err| err.into_error())?;
    let reader = grenad::Reader::new(BufReader::new(file))?;
    if database.is_empty(wtxn)? {
        let mut out_iter = database.iter_mut::<_, ByteSlice, ByteSlice>(wtxn)?;

View File

@ -5,15 +5,15 @@ use heed::types::{ByteSlice, Str};
use heed::Database;
use crate::update::index_documents::{
-    create_sorter, merge_cbo_roaring_bitmaps, sorter_into_lmdb_database, valid_lmdb_key,
+    create_sorter, merge_roaring_bitmaps, sorter_into_lmdb_database, valid_lmdb_key,
    CursorClonableMmap, MergeFn,
};
-use crate::{CboRoaringBitmapCodec, Result};
+use crate::{Result, RoaringBitmapCodec};
pub struct WordPrefixDocids<'t, 'u, 'i> {
    wtxn: &'t mut heed::RwTxn<'i, 'u>,
-    word_docids: Database<Str, CboRoaringBitmapCodec>,
-    word_prefix_docids: Database<Str, CboRoaringBitmapCodec>,
+    word_docids: Database<Str, RoaringBitmapCodec>,
+    word_prefix_docids: Database<Str, RoaringBitmapCodec>,
    pub(crate) chunk_compression_type: CompressionType,
    pub(crate) chunk_compression_level: Option<u32>,
    pub(crate) max_nb_chunks: Option<usize>,
@ -23,8 +23,8 @@ pub struct WordPrefixDocids<'t, 'u, 'i> {
impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
    pub fn new(
        wtxn: &'t mut heed::RwTxn<'i, 'u>,
-        word_docids: Database<Str, CboRoaringBitmapCodec>,
-        word_prefix_docids: Database<Str, CboRoaringBitmapCodec>,
+        word_docids: Database<Str, RoaringBitmapCodec>,
+        word_prefix_docids: Database<Str, RoaringBitmapCodec>,
    ) -> WordPrefixDocids<'t, 'u, 'i> {
        WordPrefixDocids {
            wtxn,
@ -40,7 +40,6 @@ impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
    #[logging_timer::time("WordPrefixDocids::{}")]
    pub fn execute(
        self,
-        // TODO grenad::Reader<onkv::Reader<Word, obkv::Reader<DelAdd, CboRoaringBitmap>>>
        mut new_word_docids_iter: grenad::ReaderCursor<CursorClonableMmap>,
        new_prefix_fst_words: &[String],
        common_prefix_fst_words: &[&[String]],
@ -52,8 +51,7 @@ impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
        // and write into it at the same time, therefore we write into another file.
        let mut prefix_docids_sorter = create_sorter(
            grenad::SortAlgorithm::Unstable,
-            // TODO change to merge_deladd_cbo_roaring_bitmaps
-            merge_cbo_roaring_bitmaps,
+            merge_roaring_bitmaps,
            self.chunk_compression_type,
            self.chunk_compression_level,
            self.max_nb_chunks,
@ -98,7 +96,6 @@ impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
            let prefix = std::str::from_utf8(prefix.as_bytes())?;
            for result in db.prefix_iter(self.wtxn, prefix)? {
                let (_word, data) = result?;
-                // TODO fake a DelAdd -> Add(`data`)
                prefix_docids_sorter.insert(prefix, data)?;
            }
        }
@ -114,14 +111,11 @@ impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
        drop(iter);
        // We finally write the word prefix docids into the LMDB database.
-        // TODO introduce a new function that is similar to `append_entries_into_database`
-        // and accepts the `merge_deladd_cbo_roaring_bitmaps` function
        sorter_into_lmdb_database(
            self.wtxn,
            *self.word_prefix_docids.as_polymorph(),
            prefix_docids_sorter,
-            // TODO change to `merge_deladd_cbo_roaring_bitmaps`
-            merge_cbo_roaring_bitmaps,
+            merge_roaring_bitmaps,
        )?;
        Ok(())
@ -133,7 +127,6 @@ fn write_prefixes_in_sorter(
    sorter: &mut grenad::Sorter<MergeFn>,
) -> Result<()> {
    for (key, data_slices) in prefixes.drain() {
-        // TODO merge keys before inserting them in the sorter
        for data in data_slices {
            if valid_lmdb_key(&key) {
                sorter.insert(&key, data)?;
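With the codec switched back to `RoaringBitmapCodec` and the merge function to `merge_roaring_bitmaps`, the job of `WordPrefixDocids` itself is unchanged: for every prefix, union the docids bitmaps of all words that start with it. A condensed in-memory sketch of that aggregation, assuming the roaring crate (a `BTreeMap` and the function name stand in for the word_docids database and the grenad sorter, so they are illustrative):

```rust
use std::collections::BTreeMap;

use roaring::RoaringBitmap;

// For each prefix, union the bitmaps of every word that starts with it.
// In milli this runs through a grenad sorter merged with merge_roaring_bitmaps;
// a BTreeMap keeps the sketch small.
fn compute_prefix_docids(
    word_docids: &BTreeMap<String, RoaringBitmap>,
    prefixes: &[&str],
) -> BTreeMap<String, RoaringBitmap> {
    let mut prefix_docids = BTreeMap::new();
    for prefix in prefixes {
        let entry = prefix_docids.entry(prefix.to_string()).or_insert_with(RoaringBitmap::new);
        for (word, docids) in word_docids.range(prefix.to_string()..) {
            if !word.starts_with(*prefix) {
                break; // the map is sorted, so we are past this prefix
            }
            *entry |= docids; // same union as merge_roaring_bitmaps
        }
    }
    prefix_docids
}

fn main() {
    let mut word_docids = BTreeMap::new();
    word_docids.insert("hello".to_string(), [1u32, 2].into_iter().collect());
    word_docids.insert("help".to_string(), [3u32].into_iter().collect());
    word_docids.insert("world".to_string(), [4u32].into_iter().collect());

    let prefix_docids = compute_prefix_docids(&word_docids, &["hel"]);
    assert_eq!(prefix_docids["hel"].len(), 3); // {1, 2, 3}
}
```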

View File

@ -8,7 +8,7 @@ use Criterion::*;
use crate::search::{self, EXTERNAL_DOCUMENTS_IDS};
macro_rules! test_distinct {
-    ($func:ident, $distinct:ident, $exhaustive:ident, $limit:expr, $criteria:expr, $n_res:expr) => {
+    ($func:ident, $distinct:ident, $exhaustive:ident, $limit:expr, $offset:expr, $criteria:expr, $n_res:expr) => {
        #[test]
        fn $func() {
            let criteria = $criteria;
@ -27,6 +27,7 @@ macro_rules! test_distinct {
            let mut search = Search::new(&rtxn, &index);
            search.query(search::TEST_QUERY);
            search.limit($limit);
+            search.offset($offset);
            search.exhaustive_number_hits($exhaustive);
            search.terms_matching_strategy(TermsMatchingStrategy::default());
@ -47,6 +48,7 @@ macro_rules! test_distinct {
                        Some(d.id)
                    }
                })
+                .skip($offset)
                .take($limit)
                .collect();
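These `+` lines mirror the bug under test (issue #4078): with a distinct attribute, deduplication has to happen before `offset` and `limit` are applied, which is exactly what `.skip($offset).take($limit)` does after the distinct filter when the expected results are computed. A tiny stand-alone illustration of that ordering, with made-up values:

```rust
use std::collections::HashSet;

fn main() {
    // Six hits but only four distinct "tag" values: a, b, c, d (made-up data).
    let hits = vec!["a", "a", "b", "c", "c", "d"];
    let (offset, limit) = (1, 2);

    let mut seen = HashSet::new();
    let page: Vec<_> = hits
        .into_iter()
        .filter(|tag| seen.insert(*tag)) // distinct first...
        .skip(offset)                    // ...then pagination
        .take(limit)
        .collect();

    assert_eq!(page, vec!["b", "c"]); // the offset skips "a", the limit keeps "b" and "c"
}
```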
@ -61,6 +63,7 @@ test_distinct!(
    tag,
    true,
    1,
+    0,
    vec![Words, Typo, Proximity, Attribute, Exactness],
    3
);
@ -69,6 +72,7 @@ test_distinct!(
    asc_desc_rank,
    true,
    1,
+    0,
    vec![Words, Typo, Proximity, Attribute, Exactness],
    7
);
@ -77,6 +81,7 @@ test_distinct!(
    asc_desc_rank,
    true,
    0,
+    0,
    vec![Desc(S("attribute_rank")), Desc(S("exactness_rank")), Exactness, Typo],
    7
);
@ -86,6 +91,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Typo, Proximity, Attribute, Exactness],
    3
);
@ -94,6 +100,7 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Typo, Proximity, Attribute, Exactness],
    7
);
@ -102,6 +109,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words],
    3
);
@ -110,6 +118,7 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words],
    7
);
@ -118,6 +127,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Typo],
    3
);
@ -126,6 +136,7 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Typo],
    7
);
@ -134,6 +145,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Proximity],
    3
);
@ -142,6 +154,7 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Proximity],
    7
);
@ -150,6 +163,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Attribute],
    3
);
@ -158,6 +172,7 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Attribute],
    7
);
@ -166,6 +181,7 @@ test_distinct!(
    tag,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Exactness],
    3
);
@ -174,6 +190,47 @@ test_distinct!(
    asc_desc_rank,
    false,
    EXTERNAL_DOCUMENTS_IDS.len(),
+    0,
    vec![Words, Exactness],
    7
);
+test_distinct!(
+    // testing: https://github.com/meilisearch/meilisearch/issues/4078
+    distinct_string_limit_and_offset,
+    tag,
+    false,
+    EXTERNAL_DOCUMENTS_IDS.len(),
+    1,
+    vec![],
+    3
+);
+test_distinct!(
+    // testing: https://github.com/meilisearch/meilisearch/issues/4078
+    exhaustive_distinct_string_limit_and_offset,
+    tag,
+    true,
+    1,
+    2,
+    vec![],
+    3
+);
+test_distinct!(
+    // testing: https://github.com/meilisearch/meilisearch/issues/4078
+    distinct_number_limit_and_offset,
+    asc_desc_rank,
+    false,
+    EXTERNAL_DOCUMENTS_IDS.len(),
+    2,
+    vec![],
+    7
+);
+test_distinct!(
+    // testing: https://github.com/meilisearch/meilisearch/issues/4078
+    exhaustive_distinct_number_limit_and_offset,
+    asc_desc_rank,
+    true,
+    2,
+    4,
+    vec![],
+    7
+);
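The four new invocations above are the offset cases from issue #4078. The related issue #4130 is about the `totalHits`/`totalPages` reported by Meilisearch: once the distinct-aware total is correct, `totalPages` boils down to a ceiling division over that total. The helper below is an illustrative sketch of that arithmetic, not the actual Meilisearch code:

```rust
// Hedged sketch of the pagination arithmetic behind totalHits/totalPages.
// Function name and shape are illustrative only.
fn total_pages(total_hits: usize, hits_per_page: usize) -> usize {
    if hits_per_page == 0 {
        return 0;
    }
    (total_hits + hits_per_page - 1) / hits_per_page // ceiling division
}

fn main() {
    // e.g. 7 distinct documents shown 2 per page -> 4 pages.
    assert_eq!(total_pages(7, 2), 4);
    // An inflated or truncated total (say 3) would have reported 2 pages instead.
    assert_eq!(total_pages(3, 2), 2);
}
```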