Compare commits

...

142 Commits

Author SHA1 Message Date
e0f446e4d3 Remove a log that would log too much 2025-01-28 21:31:01 +01:00
3bbad823e0 Refine the env variable and the max readers 2025-01-28 21:31:01 +01:00
b605549bf2 Do not create too many rayon tasks 2025-01-28 21:31:01 +01:00
6a1062edf5 Add more logs to see calls to the embedders 2025-01-28 21:31:01 +01:00
426ea5aa97 Accept the max readers param by env var and increase it 2025-01-28 21:31:00 +01:00
e20b91210d Merge #5276
5276: Fix the stuck indexation due to the internal BBQueue capacity r=curquiza a=Kerollmops

Fixes https://github.com/meilisearch/meilisearch/issues/5277. Reduce the maximum reserve grant in the BBQueue so we are never stuck.

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2025-01-23 13:41:34 +00:00
17478301ab Merge #5278
5278: Update version for the next release (v1.12.7) in Cargo.toml r=dureuill a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2025-01-23 10:47:30 +00:00
968c9dff27 Update version for the next release (v1.12.7) in Cargo.toml 2025-01-23 10:17:23 +00:00
463553988c Support offline upgrade up to v1.12.7 2025-01-23 11:11:40 +01:00
c321fdb9c0 Comment the max grant of the bbqueue
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2025-01-23 11:09:20 +01:00
36b6e94b29 Give more RAM to bbqueue.
- bbqueue buffers used to have (5% * 2%) / num_threads
- they now have 5% / num_threads
2025-01-23 10:55:03 +01:00
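
The capacity change above is simple arithmetic; the following is a minimal sketch of it. The 5% and 2% ratios and the per-thread split come from the commit message, while the `total_memory` input, the function name, and the example figures are illustrative assumptions, not Meilisearch code.

```rust
/// Illustrative only: per-thread BBQueue buffer size before and after this
/// change, following the ratios quoted in the commit message.
fn bbqueue_capacities(total_memory: u64, num_threads: u64) -> (u64, u64) {
    // Before: (5% * 2%) of the indexing memory, split across threads.
    let before = total_memory * 5 / 100 * 2 / 100 / num_threads;
    // After: a flat 5% of the indexing memory, split across threads.
    let after = total_memory * 5 / 100 / num_threads;
    (before, after)
}

fn main() {
    // Hypothetical figures: 2 GiB of indexing memory and 16 indexing threads.
    let (before, after) = bbqueue_capacities(2 * 1024 * 1024 * 1024, 16);
    println!("before: {before} bytes per thread, after: {after} bytes per thread");
}
```
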
34dea863e5 Reduce the maximum grant possible we can store in the BBQueue 2025-01-23 10:43:28 +01:00
ad9d8e10f2 Merge #5260
5260: Update version for the next release (v1.12.6) in Cargo.toml r=Kerollmops a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>
2025-01-21 12:37:46 +00:00
f7f35ef37c Update version for the next release (v1.12.6) in Cargo.toml 2025-01-21 12:22:56 +00:00
c575d2693b Merge #5258
5258: Unify facet strings by their normalized value r=ManyTheFish a=dureuill

Fixes #5228: the "missing facet keys" issue.

- Before this PR, updating a document such that `"facet": "DUREUILL"` would become `"facet": "dureuill"` could cause the normalized facet value `dureuill` to be removed from the `field_id_docid_facet_strings` db.
- This PR makes sure to unify the intermediate representation of the facet strings by their field_id and **normalized** (and truncated) string value.
- The introduced test is testing only one of the two facet distribution algorithms.
- We removed the panic when the facet string was not found, and we instead returned the normalized string.

## Draft status

- [x] target release v1.12.6 branch and milestone
- [ ] ~consider meilitool offline upgrade to fix the corrupted dbs in the wild.~
   workaround: ~remove facets, then add them again... if your facet distribution is right.~ Just use a dump.
- [x] Add unit test demonstrating the issue fixed by this PR.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2025-01-21 11:02:33 +00:00
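
A minimal sketch of the unification idea described in #5258 above: intermediate facet strings are keyed by field id plus the normalized, truncated value, so `"DUREUILL"` and `"dureuill"` can no longer delete each other's entry. The lowercasing and the 100-character cap below merely stand in for Meilisearch's real normalization and truncation, which are more involved.

```rust
use std::collections::BTreeMap;

/// Illustrative unification key: field id plus the normalized (and truncated)
/// facet string. Lowercasing and the 100-character cap are stand-ins for the
/// real normalization rules.
fn facet_key(field_id: u16, raw: &str) -> (u16, String) {
    let normalized: String = raw.to_lowercase().chars().take(100).collect();
    (field_id, normalized)
}

fn main() {
    let mut facet_strings: BTreeMap<(u16, String), String> = BTreeMap::new();
    // Updating "DUREUILL" into "dureuill" now touches a single entry instead
    // of removing the key that the other casing had written.
    for raw in ["DUREUILL", "dureuill"] {
        facet_strings.entry(facet_key(0, raw)).or_insert_with(|| raw.to_string());
    }
    assert_eq!(facet_strings.len(), 1);
}
```
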
024e06f7e3 Do not panic when the facet string is not found 2025-01-21 12:01:26 +01:00
145fa3a8ff Add a test to check the facet casing is good 2025-01-21 11:42:25 +01:00
d3a7e10348 Unify facet strings by their normalized value 2025-01-21 00:11:50 +01:00
1c78447226 Merge #5246
5246: Fix dump import r=Kerollmops a=dureuill

- Fix: handle the change of format of the update files
  - Correctly handle update files as JSON stream rather than obkv when exporting a dump with enqueued tasks
  - Correctly recreate update files as JSON stream rather than obkv when importing a dump
  - As the dump format itself didn't change, all dumps are still compatible
- Temporary workaround for https://github.com/meilisearch/meilisearch/issues/5247: set the batch uid of tasks to `null` at dump export time.
- Changes to meilitool
  - Export dump with update files in new format if DB >= v1.12
  - offline upgrade now supports upgrading from [1.9.0-1.12.5] to [1.10.0-1.12.5].
  - offline upgrade supports no-op upgrades and has better error messages 

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2025-01-20 13:03:49 +00:00
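
The `batch.rs` hunk near the end of this diff reads update files exactly this way; here is a minimal, self-contained version of that pattern: each enqueued document is one JSON value in a stream rather than an entry in an obkv-encoded batch. The path handling and error type are simplified assumptions.

```rust
use std::fs::File;
use std::io::BufReader;
use std::path::Path;

/// Illustrative: stream the documents of an update file as individual JSON
/// values (the post-v1.12 format) instead of decoding an obkv batch.
fn read_update_file(path: &Path) -> serde_json::Result<Vec<serde_json::Value>> {
    let reader = BufReader::new(File::open(path).expect("update file should exist"));
    serde_json::de::Deserializer::from_reader(reader)
        .into_iter::<serde_json::Value>()
        .collect()
}
```

Dump export uses the same iteration, pushing each value into the dump's content file instead of collecting it.
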
c55891f73b Replace guards by OR patterns
Co-authored-by: Tamo <tamo@meilisearch.com>
2025-01-20 11:46:03 +01:00
40f8c0d840 Remove batch ids on export 2025-01-20 11:16:18 +01:00
34d8c1a903 Make offline upgrade more flexible 2025-01-20 10:43:47 +01:00
3c9483b6e0 meilitool dumps old-style dump for older DBs, otherwise new-style 2025-01-20 10:43:47 +01:00
8c789b3c7a Merge #5252
5252: Update version for the next release (v1.12.5) in Cargo.toml r=dureuill a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2025-01-20 09:03:35 +00:00
3403eae9ee Update version for the next release (v1.12.5) in Cargo.toml 2025-01-20 08:53:20 +00:00
11458eefd9 Handle empty payloads 2025-01-20 09:51:07 +01:00
289eb92bef Fix warnings 2025-01-20 09:51:07 +01:00
cea0c89212 Change format of update file when importing dump 2025-01-20 09:51:07 +01:00
1cadab9ad8 Also fix dump import from meilitool 2025-01-20 09:51:07 +01:00
6383f8f19e Do not explode on missing content file if the task has no docs 2025-01-20 09:51:06 +01:00
8a9f952bda Create update files in new format 2025-01-20 09:51:06 +01:00
a5c44b4d79 Merge #5242
5242: Fix infinite loop r=Kerollmops a=dureuill

- Fix possible infinite loop by releasing `writer_receiver` as soon as writing to DB panics
- Demote panic to error log

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2025-01-16 14:13:07 +00:00
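
A sketch of the shape of this fix, under assumptions: the channel, its element type, and the logging below are illustrative, not the actual milli types. The point is that the receiving end is released even when a write panics, so producers blocked on a bounded channel can fail instead of waiting forever, and the panic is demoted to an error log.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc::Receiver;

/// Illustrative write loop: even if writing an entry to the database panics,
/// the receiver is dropped so senders are not left blocking on a full
/// (bounded) channel, and the panic becomes an error log.
fn write_loop(writer_receiver: Receiver<Vec<u8>>) {
    let result = catch_unwind(AssertUnwindSafe(|| {
        for entry in &writer_receiver {
            // ... write `entry` to the database here ...
            let _ = entry;
        }
    }));
    drop(writer_receiver); // release the channel as soon as possible
    if result.is_err() {
        eprintln!("error: writing to the database panicked");
    }
}
```
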
8c35744848 Improve error log 2025-01-16 14:33:53 +01:00
c0d414fc3c Merge #5243
5243: Update version for the next release (v1.12.4) in Cargo.toml r=dureuill a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2025-01-16 10:57:55 +00:00
b56358f606 Update version for the next release (v1.12.4) in Cargo.toml 2025-01-16 10:47:17 +00:00
b84c0a5390 Demote panic to error log 2025-01-16 11:38:21 +01:00
ce621e447e Release writer_receiver as soon as writing to db panics 2025-01-16 11:37:37 +01:00
aee74f47aa Merge pull request #5229 from meilisearch/improve-unknown-entry-deletion-report
Improve the panic message when deleting an unknown entry
2025-01-13 14:20:24 +01:00
be2717edbd Merge pull request #5224 from meilisearch/fix-facet-distribution
Fix facet distribution
2025-01-13 14:20:09 +01:00
c66841626e Update after review 2025-01-13 10:43:26 +01:00
d0bc8c755a Improve the panic message when deleting an unknown entry 2025-01-13 10:30:53 +01:00
031abfd281 Merge #5227
5227: Update version for the next release (v1.12.3) in Cargo.toml r=dureuill a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2025-01-13 09:29:18 +00:00
27169bc7b4 Update version for the next release (v1.12.3) in Cargo.toml 2025-01-13 08:47:07 +00:00
181a01f8d8 Skip rebuilding field distribution if not coming from v1.12 2025-01-13 09:31:27 +01:00
1d153c1867 write stats after rebuilding facet distribution 2025-01-09 18:13:36 +01:00
5fde2a3ee1 Add support to upgrade to v1.12.3 in meilitool 2025-01-09 15:25:44 +01:00
4465a1a3c9 Fix test 2025-01-09 13:26:17 +01:00
e342ae1b46 Add currently failing test 2025-01-09 13:26:13 +01:00
dcb4c49cf2 Merge #5205
5205: Incremental facets on v1.12 r=curquiza a=dureuill

# Pull Request

## Related issue

Fixes https://github.com/meilisearch/meilisearch/issues/5213

## What does this PR do?
- Add a `new_incremental` module that computes incremental facet indexing for the new indexer
- Change the heuristics for the incremental vs. bulk facet choice: under 1000 operations is now always incremental, over 100_000 operations is now always bulk (a sketch of this choice follows this entry)
- Add sanity checks in debug

## Future improvements

- Use multi ops from Roaring to decrease the number of allocations
- Consider removing or adding multiple levels at once instead of max once per update
- Consider using information about the tree structure + the operations that were done (e.g. only addition) to avoid recomputing the group from all children 
- Consider making the algorithm parallel and looking into the roaring values to know which actually changed



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2025-01-08 21:52:02 +00:00
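
A sketch of the threshold choice referenced in the list above. The 1000 and 100_000 cut-offs are quoted from the PR description; how the range in between is decided is not stated here, so the ratio-based tie-breaker below is an explicit assumption made for illustration only.

```rust
/// Which facet indexing strategy to use for a batch of operations.
enum FacetIndexing {
    Incremental,
    Bulk,
}

/// Thresholds quoted from the PR description; the middle-range criterion is
/// an assumption made for this sketch.
fn choose_strategy(operation_count: usize, touched_ratio: f64) -> FacetIndexing {
    match operation_count {
        0..=999 => FacetIndexing::Incremental,           // always incremental
        100_000.. => FacetIndexing::Bulk,                // bulk at/above the upper threshold
        _ if touched_ratio > 0.5 => FacetIndexing::Bulk, // assumed tie-breaker
        _ => FacetIndexing::Incremental,
    }
}
```
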
e83c021755 When spilling on the next fid, no longer ignore children 2025-01-08 16:50:05 +01:00
7ec7200378 Check valid_facet_value as part of a filter of the iterator 2025-01-08 16:25:44 +01:00
6a577254fa No longer ignore the first child without parent 2025-01-08 16:25:30 +01:00
fd88c834c3 Modernize valid_lmdb_key 2025-01-08 15:22:11 +01:00
b4005593f4 Switch to an iterative algorithm for find_changed_parents 2025-01-08 14:57:14 +01:00
8ee3793259 Update after review 2025-01-08 13:58:14 +01:00
3648abbfd5 Remove unused FacetFieldIdOperation 2025-01-07 15:26:09 +01:00
4d2433de12 center groups 2025-01-06 18:23:35 +01:00
28cc6df7a3 Fix uselessly deep stack trace 2025-01-06 18:07:49 +01:00
7b14cb10a1 Merge #5207
5207: Update version for the next release (v1.12.2) in Cargo.toml r=dureuill a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2025-01-06 16:04:03 +00:00
34f4602ae8 Update snapshot 2025-01-06 16:55:12 +01:00
12e21a177b Update version for the next release (v1.12.2) in Cargo.toml 2025-01-06 14:11:58 +00:00
7a9290aaae Use new incremental facet indexing and enable sanity checks in debug 2025-01-06 15:08:48 +01:00
5d219587b8 Add new incremental facet indexing 2025-01-06 15:08:36 +01:00
6e9aa49893 add valid_facet_value utility function 2025-01-06 15:08:07 +01:00
6b3a2c7281 Add sanity checks for facet values 2025-01-06 15:07:55 +01:00
5908aec6cb Merge #5192
5192: Fix empty document addition r=irevoire a=irevoire

# Pull Request

## Related issue
Fixes #5190

## What does this PR do?
- Improve a test just to make sure this issue never arises again
- Fix the issue

For the reviewer: calling `add_documents` with an empty `mmap` seems to work, but does it impact performance in a significant way?

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-12-31 17:11:10 +00:00
19f48c15fb Fix the addition of empty payload 2024-12-31 18:00:14 +01:00
47b484c07c update the test to ensure it works when specifying the primary key or not: it doesn't work 2024-12-31 17:24:32 +01:00
7d5e28b475 Merge #5193
5193: Update version for the next release (v1.12.1) in Cargo.toml r=irevoire a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2024-12-31 09:40:31 +00:00
0648e06aa2 Update version for the next release (v1.12.1) in Cargo.toml 2024-12-30 17:36:46 +00:00
33921747b7 stop skipping empty tasks when adding documents 2024-12-30 17:48:25 +01:00
970a489dcc add a test reproducing the bug 2024-12-30 16:21:06 +01:00
ba11121cfc Merge #5159
5159: Fix the New Indexer Spilling r=irevoire a=Kerollmops

Fix two bugs in the merging of the spilled caches. Thanks to `@ManyTheFish` and `@irevoire` 👏

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-12-12 17:16:53 +00:00
acdd5aa6ea Use the thread source id instead of the destination id
when filtering on the cache to merge
2024-12-12 18:12:00 +01:00
2f3cc8cdd2 Fix the merge_caches_sorted function 2024-12-12 16:15:37 +01:00
7a95fed23f Merge #5158
5158: Indexer edition 2024 fix facet fst r=Kerollmops a=ManyTheFish

# Pull Request
Fix a regression in the new indexer; when several filterable attributes containing strings were set, all the field IDs were shifted, and the last one was overwriting the previous FST.

## What does this PR do?
- Add a test reproducing the bug
- fix the bug

Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-12-12 14:14:44 +00:00
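
A stand-in sketch of the invariant restored by #5158: the per-field facet structure is keyed by its field id, so writing the structure for one filterable attribute can no longer overwrite the one built for a previous attribute. A `BTreeSet` stands in for the actual FST here, and the names are illustrative.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative: build one facet-string structure per field id. A BTreeSet
/// stands in for the FST used by milli.
fn build_facet_structures(facet_strings: &[(u16, &str)]) -> BTreeMap<u16, BTreeSet<String>> {
    let mut per_field: BTreeMap<u16, BTreeSet<String>> = BTreeMap::new();
    for &(field_id, value) in facet_strings {
        // Keyed by field id: a later field appends to its own entry instead of
        // replacing the structure written for an earlier field.
        per_field.entry(field_id).or_default().insert(value.to_string());
    }
    per_field
}
```
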
961de4d34e Fix facet fst 2024-12-12 15:12:28 +01:00
18ce95dcbf Add test reproducing the bug 2024-12-12 14:56:45 +01:00
c177210b1b Merge #5152
5152: Make xtasks be able to use the specified binary r=dureuill a=Kerollmops

Makes it possible to specify the binary to run. It is useful for running PGO-optimized binaries.

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-12-12 12:28:16 +00:00
1fc90fbacb Merge #5147
5147: Batch progress r=dureuill a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/5068

## What does this PR do?
- ...

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-12-12 09:15:54 +00:00
6c72559457 Update the binary-path description
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-12 09:39:39 +01:00
1fdfa3f208 Change the exit code to 130 when Ctrl-Ced 2024-12-12 09:26:14 +01:00
0d0c18f519 rename the Step::name into Step::current_step 2024-12-11 18:41:03 +01:00
d12364c1e0 fix the tests 2024-12-11 18:30:48 +01:00
8cd3a1aa57 fmt 2024-12-11 18:18:40 +01:00
08fd026ebd fix warning 2024-12-11 18:18:13 +01:00
75d5cea624 use a with_capacity while allocating the progress view 2024-12-11 18:17:33 +01:00
ab9213fa94 ensure we never write the progress to the db 2024-12-11 18:16:20 +01:00
45d5d4bf40 make the progressview public 2024-12-11 18:15:33 +01:00
fa885e75b4 rename the send_progress in progress 2024-12-11 18:13:12 +01:00
29fc77ee5b remove useless print 2024-12-11 18:11:19 +01:00
ad4dc70720 rename the ComputingTheChanges to ComputingDocumentChanges in the edit document progress 2024-12-11 18:09:54 +01:00
5d682b4700 rename the ComputingTheChanges to ComputingDocumentChanges 2024-12-11 18:08:45 +01:00
f1beb60204 make the progress use payload instead of documents 2024-12-11 18:07:45 +01:00
85577e70cd reuse the enqueued 2024-12-11 18:05:34 +01:00
c5536c37b5 rename the atomic::name to unit_name 2024-12-11 18:03:06 +01:00
9245c89cfe move the macros to milli 2024-12-11 18:00:46 +01:00
eaabc1af2f Merge #5144
5144: Exactly 512 bytes docid fails r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #5050 

## What does this PR do?
- Return a user error rather than an internal one for docids of exactly 512 bytes
- Fix up the error message to indicate that docids exactly 512 bytes long are not supported.
- Fix up the error message to reflect that index uids are actually limited to 400 bytes in length

## Impact

- Impacts docs: 
    - update [this paragraph](https://www.meilisearch.com/docs/learn/resources/known_limitations#length-of-primary-key-values) to say 511 bytes instead of 512 

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-11 15:41:05 +00:00
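
A minimal sketch of the validation described above, assuming the 511-byte limit mentioned in the docs note; the error type, message, and function name are illustrative, not milli's actual API.

```rust
/// Illustrative limit taken from the PR description: primary key values may be
/// at most 511 bytes, so a 512-byte docid is a *user* error, not an internal one.
const MAX_DOCID_BYTES: usize = 511;

fn validate_docid(docid: &str) -> Result<(), String> {
    if docid.len() > MAX_DOCID_BYTES {
        return Err(format!(
            "document identifier is {} bytes long, but at most {} bytes are supported",
            docid.len(),
            MAX_DOCID_BYTES,
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_docid(&"a".repeat(511)).is_ok());
    assert!(validate_docid(&"a".repeat(512)).is_err()); // the case fixed by this PR
}
```
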
04a24a9239 Kill Meilisearch with a TERM signal 2024-12-11 16:27:07 +01:00
1f54dfa883 update the macro to look more like an enum 2024-12-11 16:26:09 +01:00
786b0fabea implement the progress for almost all the tasks 2024-12-11 16:26:08 +01:00
26733c705d add progress for the task deletion and task cancelation 2024-12-11 16:25:02 +01:00
ab75f53efd update all snapshots 2024-12-11 16:25:02 +01:00
867e6a8f1d rename the send_progress field to progress since it's not sending anything 2024-12-11 16:25:01 +01:00
6f4823fc97 make the number of documents in the document tasks more incremental 2024-12-11 16:25:01 +01:00
df9b68f8ed initial implementation of the progress 2024-12-11 16:25:01 +01:00
5bc6391700 Merge #5153
5153: Return docid in case of errors while rendering the document template r=Kerollmops a=dureuill

Improves error message:

Before: 

```
ERROR index_scheduler: Batch failed Index `mieli`: user error: missing field in document: liquid: Unknown index
  with:
    variable=doc
    requested index=title
    available indexes=by, id, kids, parent, text, time, type
```

After:

```
ERROR index_scheduler: Batch failed Index `mieli`: user error: missing field in document `11345147`: liquid: Unknown index
  with:
    variable=doc
    requested index=title
    available indexes=by, id, kids, parent, text, time, type
```

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-11 15:01:40 +00:00
eaa897d983 Avoid compiling when unnecessary 2024-12-11 15:57:16 +01:00
bfca54cc2c Return docid in case of errors while rendering the document template 2024-12-11 15:26:18 +01:00
04a62d2b97 Compile Meilisearch or run the dedicated binary file 2024-12-11 14:57:07 +01:00
8c19cb0a0b Merge #5146
5146: Offline upgrade v1.12 r=irevoire a=ManyTheFish

# Pull Request

## Related issue
Fixes #4978 

## What does this PR do?
- add v1_11_to_v1_12 function to upgrade Meilisearch from v1.11 to v1.12
- Convert the update files from OBKV to ndjson format


Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2024-12-11 13:39:14 +00:00
5c492031d9 Update crates/meilitool/src/upgrade/v1_12.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-11 14:34:18 +01:00
fb1caa4724 Merge #5148
5148: Do not duplicate NDJson data when unnecessary r=dureuill a=Kerollmops

This PR improves the NDJSON support. Usually, we save all of the user's document content into a temporary file, validate its content, and then convert everything into NDJSON in the file store (update files in the tasks).

It is a waste of time when users are already sending NDJSON. So, this PR removes the last copy and directly stores the user content in the file store, validating it from the file store. If an issue arises, the file will not persist and will be dropped/deleted instead.

Related to #5078.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-11 13:00:50 +00:00
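
A sketch of the validate-in-place idea under assumptions: NDJSON sent by the user is written straight to the file store and validated from there, line by line, instead of being re-encoded; on failure the caller deletes the file rather than persisting it. The names and the error shape are illustrative.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

/// Illustrative: count and validate the documents of an NDJSON payload that
/// already sits in the file store, without copying or re-encoding it.
fn validate_ndjson(path: &Path) -> Result<u64, (u64, serde_json::Error)> {
    let reader = BufReader::new(File::open(path).expect("payload should exist"));
    let mut documents = 0;
    for line in reader.lines() {
        let line = line.expect("payload should be readable");
        if line.trim().is_empty() {
            continue; // blank lines are not documents
        }
        if let Err(error) = serde_json::from_str::<serde_json::Value>(&line) {
            // The caller deletes the file instead of persisting it.
            return Err((documents, error));
        }
        documents += 1;
    }
    Ok(documents)
}
```
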
5622b9607d Wrap the read NDJSON pass into a tokio blocking 2024-12-11 12:18:36 +01:00
01bcc601be Use a nonrandom hasher when decoding JSON 2024-12-11 12:04:29 +01:00
93fbdc06d3 Use a nonrandom hasher when decoding NDJSON 2024-12-11 12:03:09 +01:00
69c931334f Fix the error messages categorization with invalid NDJson 2024-12-11 12:02:48 +01:00
d683f5980c Do not duplicate NDJson when unnecessary 2024-12-11 12:02:48 +01:00
f8ba112f66 Merge #5150
5150: Reintroduce the Document Addition Logs r=dureuill a=Kerollmops

This PR reintroduces lost tracing logs showing some information about the number of indexed documents.

Related to #5078. Resolves [this comment](https://github.com/meilisearch/meilisearch/pull/4900/files?show-deleted-files=true&show-viewed-files=true&file-filters%5B%5D=#r1852158338) and [this other one](https://github.com/meilisearch/meilisearch/pull/4900/files?show-deleted-files=true&show-viewed-files=true&file-filters%5B%5D=#r1852159073).

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-11 10:48:48 +00:00
c614d0dd35 Add context when returning an error 2024-12-11 10:55:39 +01:00
479607e5dd Convert update files from OBKV to ndjson 2024-12-11 10:55:39 +01:00
bb00e70087 Reintroduce the document addition logs 2024-12-11 10:39:04 +01:00
e974be9518 Merge #5145
5145: Use bumparaw-collections in Meilisearch/milli r=dureuill a=Kerollmops

This PR is related to #5078. It uses the now published bumparaw-collections and (soon) makes the `RawMap` hasher nonrandom.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-10 15:51:01 +00:00
aeb6b74725 Make sure we use an FxHashBuilder on the Value 2024-12-10 15:52:22 +01:00
a751972c57 Prefer using a stable than a random hash builder 2024-12-10 14:25:53 +01:00
6b269795d2 Update bumparaw-collections to 0.1.2 2024-12-10 14:25:13 +01:00
d075be798a Fix tests 2024-12-10 13:39:07 +01:00
89637bcaaf Use bumparaw-collections in Meilisearch/milli 2024-12-10 11:52:20 +01:00
866ac91be3 Fix error messages 2024-12-10 11:06:58 +01:00
e610af36aa User failure for documents with docid of ==512 bytes 2024-12-10 11:06:24 +01:00
7cf6707ed3 Extend test to add the ==512 bytes case 2024-12-10 11:05:42 +01:00
1995040846 Merge #5142
5142: Try merge optimisation r=dureuill a=ManyTheFish

![Capture_decran_2024-12-09_a_11 59 42](https://github.com/user-attachments/assets/0dfc7e30-a603-4546-98d2-791990bdfcce)

Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-12-09 14:48:26 +00:00
07f42e8057 Do not index a field count when no word is counted 2024-12-09 15:45:12 +01:00
71f59749dc Reduce union impact in merging 2024-12-09 15:44:06 +01:00
3b0b9967f6 Merge #5141
5141: Use the right amount of max memory and not impact the settings r=curquiza a=Kerollmops

Fixes #5132. Related to #5125.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-09 10:40:46 +00:00
123b54a178 Merge #5056
5056: Attach index name in error message r=irevoire a=airycanon

# Pull Request

## Related issue
Fixes #4392 

## What does this PR do?
- ...

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: airycanon <airycanon@airycanon.me>
2024-12-09 09:59:12 +00:00
f5dd8dfc3e Rollback max memory usage changes 2024-12-09 10:26:30 +01:00
bcfed70888 Revert "Merge #5125"
This reverts commit 9a9383643f, reversing
changes made to cac355bfa7.
2024-12-09 10:08:02 +01:00
503ef3bbc9 Merge #5138
5138: Allow xtask bench to proceed without a commit message r=Kerollmops a=dureuill



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-09 09:00:12 +00:00
08f2c696b0 Allow xtask bench to proceed without a commit message 2024-12-09 09:36:59 +01:00
b75f1f4c17 fix tests
# Conflicts:
#	crates/index-scheduler/src/batch.rs
#	crates/index-scheduler/src/snapshots/lib.rs/fail_in_process_batch_for_document_deletion/after_removing_the_documents.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_bad_primary_key/fifth_task_succeeds.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_bad_primary_key/fourth_task_fails.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_multiple_primary_key/second_task_fails.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_multiple_primary_key/third_task_fails.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_multiple_primary_key_batch_wrong_key/second_and_third_tasks_fails.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_set_and_null_primary_key_inference_works/all_other_tasks_succeeds.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_set_and_null_primary_key_inference_works/second_task_fails.snap
#	crates/index-scheduler/src/snapshots/lib.rs/test_document_addition_with_set_and_null_primary_key_inference_works/third_task_succeeds.snap

# Conflicts:
#	crates/index-scheduler/src/batch.rs
#	crates/meilisearch/src/search/mod.rs
#	crates/meilisearch/tests/vector/mod.rs

# Conflicts:
#	crates/index-scheduler/src/batch.rs
2024-12-06 02:03:02 +08:00
95ed079761 attach index name in errors
# Conflicts:
#	crates/index-scheduler/src/batch.rs

# Conflicts:
#	crates/index-scheduler/src/batch.rs
#	crates/meilisearch/src/search/mod.rs
2024-12-06 01:12:13 +08:00
106 changed files with 3770 additions and 1229 deletions

Cargo.lock generated

@ -496,7 +496,7 @@ source = "git+https://github.com/meilisearch/bbqueue#cbb87cc707b5af415ef203bdaf2
[[package]]
name = "benchmarks"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"bumpalo",
@ -689,7 +689,7 @@ dependencies = [
[[package]]
name = "build-info"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"time",
@ -706,6 +706,20 @@ dependencies = [
"serde",
]
[[package]]
name = "bumparaw-collections"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4ce682bdc86c2e25ef5cd95881d9d6a1902214eddf74cf9ffea88fe1464377e8"
dependencies = [
"allocator-api2",
"bitpacking",
"bumpalo",
"hashbrown 0.15.1",
"serde",
"serde_json",
]
[[package]]
name = "byte-unit"
version = "5.1.4"
@ -1650,7 +1664,7 @@ dependencies = [
[[package]]
name = "dump"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"big_s",
@ -1862,7 +1876,7 @@ checksum = "486f806e73c5707928240ddc295403b1b93c96a02038563881c4a2fd84b81ac4"
[[package]]
name = "file-store"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"tempfile",
"thiserror",
@ -1884,7 +1898,7 @@ dependencies = [
[[package]]
name = "filter-parser"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"insta",
"nom",
@ -1904,7 +1918,7 @@ dependencies = [
[[package]]
name = "flatten-serde-json"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"criterion",
"serde_json",
@ -2043,7 +2057,7 @@ dependencies = [
[[package]]
name = "fuzzers"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"arbitrary",
"bumpalo",
@ -2610,13 +2624,15 @@ checksum = "206ca75c9c03ba3d4ace2460e57b189f39f43de612c2f85836e65c929701bb2d"
[[package]]
name = "index-scheduler"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"arroy 0.5.0 (registry+https://github.com/rust-lang/crates.io-index)",
"big_s",
"bincode",
"bumpalo",
"bumparaw-collections",
"convert_case 0.6.0",
"crossbeam-channel",
"csv",
"derive_builder 0.20.0",
@ -2631,7 +2647,6 @@ dependencies = [
"meilisearch-types",
"memmap2",
"page_size",
"raw-collections",
"rayon",
"roaring",
"serde",
@ -2647,12 +2662,12 @@ dependencies = [
[[package]]
name = "indexmap"
version = "2.2.6"
version = "2.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "168fb715dda47215e360912c096649d23d58bf392ac62f73919e831745e40f26"
checksum = "62f822373a4fe84d4bb149bf54e584a7f4abec90e072ed49cda0edea5b95471f"
dependencies = [
"equivalent",
"hashbrown 0.14.3",
"hashbrown 0.15.1",
"serde",
]
@ -2807,7 +2822,7 @@ dependencies = [
[[package]]
name = "json-depth-checker"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"criterion",
"serde_json",
@ -3426,7 +3441,7 @@ checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771"
[[package]]
name = "meili-snap"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"insta",
"md5",
@ -3435,7 +3450,7 @@ dependencies = [
[[package]]
name = "meilisearch"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"actix-cors",
"actix-http",
@ -3525,7 +3540,7 @@ dependencies = [
[[package]]
name = "meilisearch-auth"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"base64 0.22.1",
"enum-iterator",
@ -3544,11 +3559,12 @@ dependencies = [
[[package]]
name = "meilisearch-types"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"actix-web",
"anyhow",
"bumpalo",
"bumparaw-collections",
"convert_case 0.6.0",
"csv",
"deserr",
@ -3561,8 +3577,8 @@ dependencies = [
"meili-snap",
"memmap2",
"milli",
"raw-collections",
"roaring",
"rustc-hash 2.1.0",
"serde",
"serde-cs",
"serde_json",
@ -3576,16 +3592,19 @@ dependencies = [
[[package]]
name = "meilitool"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"arroy 0.5.0 (git+https://github.com/meilisearch/arroy/?tag=DO-NOT-DELETE-upgrade-v04-to-v05)",
"clap",
"dump",
"file-store",
"indexmap",
"meilisearch-auth",
"meilisearch-types",
"serde",
"serde_json",
"tempfile",
"time",
"uuid",
]
@ -3608,7 +3627,7 @@ dependencies = [
[[package]]
name = "milli"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"allocator-api2",
"arroy 0.5.0 (registry+https://github.com/rust-lang/crates.io-index)",
@ -3618,6 +3637,7 @@ dependencies = [
"bincode",
"bstr",
"bumpalo",
"bumparaw-collections",
"bytemuck",
"byteorder",
"candle-core",
@ -3656,13 +3676,12 @@ dependencies = [
"once_cell",
"ordered-float",
"rand",
"raw-collections",
"rayon",
"rayon-par-bridge",
"rhai",
"roaring",
"rstar",
"rustc-hash 2.0.0",
"rustc-hash 2.1.0",
"serde",
"serde_json",
"slice-group-by",
@ -4064,7 +4083,7 @@ checksum = "e3148f5046208a5d56bcfc03053e3ca6334e51da8dfb19b6cdc8b306fae3283e"
[[package]]
name = "permissive-json-pointer"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"big_s",
"serde_json",
@ -4411,7 +4430,7 @@ dependencies = [
"bytes",
"rand",
"ring",
"rustc-hash 2.0.0",
"rustc-hash 2.1.0",
"rustls",
"slab",
"thiserror",
@ -4487,19 +4506,6 @@ dependencies = [
"rand",
]
[[package]]
name = "raw-collections"
version = "0.1.0"
source = "git+https://github.com/meilisearch/raw-collections.git#15e5d7bdebc0c149b2a28b2454f307c717d07f8a"
dependencies = [
"allocator-api2",
"bitpacking",
"bumpalo",
"hashbrown 0.15.1",
"serde",
"serde_json",
]
[[package]]
name = "raw-cpuid"
version = "10.7.0"
@ -4797,9 +4803,9 @@ checksum = "08d43f7aa6b08d49f382cde6a7982047c3426db949b1424bc4b7ec9ae12c6ce2"
[[package]]
name = "rustc-hash"
version = "2.0.0"
version = "2.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "583034fd73374156e66797ed8e5b0d5690409c9226b22d87cb7f19821c05d152"
checksum = "c7fb8039b3032c191086b10f11f319a6e99e1e82889c5cc6046f515c9db1d497"
[[package]]
name = "rustc_version"
@ -4968,9 +4974,9 @@ dependencies = [
[[package]]
name = "serde_json"
version = "1.0.132"
version = "1.0.133"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d726bfaff4b320266d395898905d0eba0345aae23b54aee3a737e260fd46db03"
checksum = "c7fceb2473b9166b2294ef05efcb65a3db80803f0b03ef86a5fc88a2b85ee377"
dependencies = [
"indexmap",
"itoa",
@ -6480,7 +6486,7 @@ dependencies = [
[[package]]
name = "xtask"
version = "1.12.0"
version = "1.12.7"
dependencies = [
"anyhow",
"build-info",


@ -22,7 +22,7 @@ members = [
]
[workspace.package]
version = "1.12.0"
version = "1.12.7"
authors = [
"Quentin de Quelen <quentin@dequelen.me>",
"Clément Renault <clement@meilisearch.com>",


@ -8,6 +8,7 @@ use bumpalo::Bump;
use criterion::{criterion_group, criterion_main, Criterion};
use milli::documents::PrimaryKey;
use milli::heed::{EnvOpenOptions, RwTxn};
use milli::progress::Progress;
use milli::update::new::indexer;
use milli::update::{IndexDocumentsMethod, IndexerConfig, Settings};
use milli::vector::EmbeddingConfigs;
@ -151,7 +152,7 @@ fn indexing_songs_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -166,7 +167,7 @@ fn indexing_songs_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -218,7 +219,7 @@ fn reindexing_songs_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -233,7 +234,7 @@ fn reindexing_songs_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -263,7 +264,7 @@ fn reindexing_songs_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -278,7 +279,7 @@ fn reindexing_songs_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -332,7 +333,7 @@ fn deleting_songs_in_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -347,7 +348,7 @@ fn deleting_songs_in_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -409,7 +410,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -424,7 +425,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -454,7 +455,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -469,7 +470,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -495,7 +496,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -510,7 +511,7 @@ fn indexing_songs_in_three_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -563,7 +564,7 @@ fn indexing_songs_without_faceted_numbers(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -578,7 +579,7 @@ fn indexing_songs_without_faceted_numbers(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -630,7 +631,7 @@ fn indexing_songs_without_faceted_fields(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -645,7 +646,7 @@ fn indexing_songs_without_faceted_fields(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -697,7 +698,7 @@ fn indexing_wiki(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -712,7 +713,7 @@ fn indexing_wiki(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -763,7 +764,7 @@ fn reindexing_wiki(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -778,7 +779,7 @@ fn reindexing_wiki(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -808,7 +809,7 @@ fn reindexing_wiki(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -823,7 +824,7 @@ fn reindexing_wiki(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -876,7 +877,7 @@ fn deleting_wiki_in_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -891,7 +892,7 @@ fn deleting_wiki_in_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -953,7 +954,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -968,7 +969,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -999,7 +1000,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1014,7 +1015,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1041,7 +1042,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1056,7 +1057,7 @@ fn indexing_wiki_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1108,7 +1109,7 @@ fn indexing_movies_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1123,7 +1124,7 @@ fn indexing_movies_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1174,7 +1175,7 @@ fn reindexing_movies_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1189,7 +1190,7 @@ fn reindexing_movies_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1219,7 +1220,7 @@ fn reindexing_movies_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1234,7 +1235,7 @@ fn reindexing_movies_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1287,7 +1288,7 @@ fn deleting_movies_in_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1302,7 +1303,7 @@ fn deleting_movies_in_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1350,7 +1351,7 @@ fn delete_documents_from_ids(index: Index, document_ids_to_delete: Vec<RoaringBi
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1400,7 +1401,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1415,7 +1416,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1445,7 +1446,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1460,7 +1461,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1486,7 +1487,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1501,7 +1502,7 @@ fn indexing_movies_in_three_batches(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1576,7 +1577,7 @@ fn indexing_nested_movies_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1591,7 +1592,7 @@ fn indexing_nested_movies_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1667,7 +1668,7 @@ fn deleting_nested_movies_in_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1682,7 +1683,7 @@ fn deleting_nested_movies_in_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1750,7 +1751,7 @@ fn indexing_nested_movies_without_faceted_fields(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1765,7 +1766,7 @@ fn indexing_nested_movies_without_faceted_fields(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1817,7 +1818,7 @@ fn indexing_geo(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1832,7 +1833,7 @@ fn indexing_geo(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1883,7 +1884,7 @@ fn reindexing_geo(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1898,7 +1899,7 @@ fn reindexing_geo(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1928,7 +1929,7 @@ fn reindexing_geo(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -1943,7 +1944,7 @@ fn reindexing_geo(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
@ -1996,7 +1997,7 @@ fn deleting_geo_in_batches_default(c: &mut Criterion) {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2011,7 +2012,7 @@ fn deleting_geo_in_batches_default(c: &mut Criterion) {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();


@ -10,6 +10,7 @@ use bumpalo::Bump;
use criterion::BenchmarkId;
use memmap2::Mmap;
use milli::heed::EnvOpenOptions;
use milli::progress::Progress;
use milli::update::new::indexer;
use milli::update::{IndexDocumentsMethod, IndexerConfig, Settings};
use milli::vector::EmbeddingConfigs;
@ -110,7 +111,7 @@ pub fn base_setup(conf: &Conf) -> Index {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -125,7 +126,7 @@ pub fn base_setup(conf: &Conf) -> Index {
&document_changes,
EmbeddingConfigs::default(),
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();


@ -136,6 +136,14 @@ pub struct File {
}
impl File {
pub fn from_parts(path: PathBuf, file: Option<NamedTempFile>) -> Self {
Self { path, file }
}
pub fn into_parts(self) -> (PathBuf, Option<NamedTempFile>) {
(self.path, self.file)
}
pub fn dry_file() -> Result<Self> {
Ok(Self { path: PathBuf::new(), file: None })
}


@ -10,6 +10,7 @@ use either::Either;
use fuzzers::Operation;
use milli::documents::mmap_from_objects;
use milli::heed::EnvOpenOptions;
use milli::progress::Progress;
use milli::update::new::indexer;
use milli::update::{IndexDocumentsMethod, IndexerConfig};
use milli::vector::EmbeddingConfigs;
@ -128,7 +129,7 @@ fn main() {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -143,7 +144,7 @@ fn main() {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();


@ -13,6 +13,9 @@ license.workspace = true
[dependencies]
anyhow = "1.0.86"
bincode = "1.3.3"
bumpalo = "3.16.0"
bumparaw-collections = "0.1.2"
convert_case = "0.6.0"
csv = "1.3.0"
derive_builder = "0.20.0"
dump = { path = "../dump" }
@ -21,8 +24,8 @@ file-store = { path = "../file-store" }
flate2 = "1.0.30"
meilisearch-auth = { path = "../meilisearch-auth" }
meilisearch-types = { path = "../meilisearch-types" }
memmap2 = "0.9.4"
page_size = "0.6.0"
raw-collections = { git = "https://github.com/meilisearch/raw-collections.git", version = "0.1.0" }
rayon = "1.10.0"
roaring = { version = "0.10.7", features = ["serde"] }
serde = { version = "1.0.204", features = ["derive"] }
@ -30,7 +33,6 @@ serde_json = { version = "1.0.120", features = ["preserve_order"] }
synchronoise = "1.0.1"
tempfile = "3.10.1"
thiserror = "1.0.61"
memmap2 = "0.9.4"
time = { version = "0.3.36", features = [
"serde-well-known",
"formatting",
@ -40,7 +42,6 @@ time = { version = "0.3.36", features = [
tracing = "0.1.40"
ureq = "2.10.0"
uuid = { version = "1.10.0", features = ["serde", "v4"] }
bumpalo = "3.16.0"
[dev-dependencies]
arroy = "0.5.0"


@ -22,27 +22,26 @@ use std::ffi::OsStr;
use std::fmt;
use std::fs::{self, File};
use std::io::BufWriter;
use std::sync::atomic::{self, AtomicU64};
use std::time::Duration;
use std::sync::atomic::Ordering;
use bumpalo::collections::CollectIn;
use bumpalo::Bump;
use dump::IndexMetadata;
use meilisearch_types::batches::BatchId;
use meilisearch_types::error::Code;
use meilisearch_types::heed::{RoTxn, RwTxn};
use meilisearch_types::milli::documents::{obkv_to_object, DocumentsBatchReader, PrimaryKey};
use meilisearch_types::milli::documents::PrimaryKey;
use meilisearch_types::milli::heed::CompactionOption;
use meilisearch_types::milli::progress::Progress;
use meilisearch_types::milli::update::new::indexer::{self, UpdateByFunction};
use meilisearch_types::milli::update::{IndexDocumentsMethod, Settings as MilliSettings};
use meilisearch_types::milli::update::{
DocumentAdditionResult, IndexDocumentsMethod, Settings as MilliSettings,
};
use meilisearch_types::milli::vector::parsed_vectors::{
ExplicitVectors, VectorOrArrayOfVectors, RESERVED_VECTORS_FIELD_NAME,
};
use meilisearch_types::milli::{self, Filter, ThreadPoolNoAbortBuilder};
use meilisearch_types::settings::{apply_settings_to_builder, Settings, Unchecked};
use meilisearch_types::tasks::{
Details, IndexSwap, Kind, KindWithContent, Status, Task, TaskProgress,
};
use meilisearch_types::tasks::{Details, IndexSwap, Kind, KindWithContent, Status, Task};
use meilisearch_types::{compression, Index, VERSION_FILE_NAME};
use roaring::RoaringBitmap;
use time::macros::format_description;
@ -50,6 +49,13 @@ use time::OffsetDateTime;
use uuid::Uuid;
use crate::autobatcher::{self, BatchKind};
use crate::processing::{
AtomicBatchStep, AtomicDocumentStep, AtomicTaskStep, AtomicUpdateFileStep, CreateIndexProgress,
DeleteIndexProgress, DocumentDeletionProgress, DocumentEditionProgress,
DocumentOperationProgress, DumpCreationProgress, InnerSwappingTwoIndexes, SettingsProgress,
SnapshotCreationProgress, SwappingTheIndexes, TaskCancelationProgress, TaskDeletionProgress,
UpdateIndexProgress, VariableNameStep,
};
use crate::utils::{self, swap_index_uid_in_task, ProcessingBatch};
use crate::{Error, IndexScheduler, Result, TaskId};
@ -560,11 +566,12 @@ impl IndexScheduler {
/// The list of tasks that were processed. The metadata of each task in the returned
/// list is updated accordingly, with the exception of the its date fields
/// [`finished_at`](meilisearch_types::tasks::Task::finished_at) and [`started_at`](meilisearch_types::tasks::Task::started_at).
#[tracing::instrument(level = "trace", skip(self, batch), target = "indexing::scheduler", fields(batch=batch.to_string()))]
#[tracing::instrument(level = "trace", skip(self, batch, progress), target = "indexing::scheduler", fields(batch=batch.to_string()))]
pub(crate) fn process_batch(
&self,
batch: Batch,
current_batch: &mut ProcessingBatch,
progress: Progress,
) -> Result<Vec<Task>> {
#[cfg(test)]
{
@ -584,8 +591,13 @@ impl IndexScheduler {
};
let rtxn = self.env.read_txn()?;
let mut canceled_tasks =
self.cancel_matched_tasks(&rtxn, task.uid, current_batch, matched_tasks)?;
let mut canceled_tasks = self.cancel_matched_tasks(
&rtxn,
task.uid,
current_batch,
matched_tasks,
&progress,
)?;
task.status = Status::Succeeded;
match &mut task.details {
@ -616,7 +628,8 @@ impl IndexScheduler {
}
let mut wtxn = self.env.write_txn()?;
let mut deleted_tasks = self.delete_matched_tasks(&mut wtxn, &matched_tasks)?;
let mut deleted_tasks =
self.delete_matched_tasks(&mut wtxn, &matched_tasks, &progress)?;
wtxn.commit()?;
for task in tasks.iter_mut() {
@ -642,6 +655,8 @@ impl IndexScheduler {
Ok(tasks)
}
Batch::SnapshotCreation(mut tasks) => {
progress.update_progress(SnapshotCreationProgress::StartTheSnapshotCreation);
fs::create_dir_all(&self.snapshots_path)?;
let temp_snapshot_dir = tempfile::tempdir()?;
@ -662,6 +677,7 @@ impl IndexScheduler {
// two read operations as the task processing is synchronous.
// 2.1 First copy the LMDB env of the index-scheduler
progress.update_progress(SnapshotCreationProgress::SnapshotTheIndexScheduler);
let dst = temp_snapshot_dir.path().join("tasks");
fs::create_dir_all(&dst)?;
self.env.copy_to_file(dst.join("data.mdb"), CompactionOption::Enabled)?;
@ -674,27 +690,41 @@ impl IndexScheduler {
fs::create_dir_all(&update_files_dir)?;
// 2.4 Only copy the update files of the enqueued tasks
for task_id in self.get_status(&rtxn, Status::Enqueued)? {
progress.update_progress(SnapshotCreationProgress::SnapshotTheUpdateFiles);
let enqueued = self.get_status(&rtxn, Status::Enqueued)?;
let (atomic, update_file_progress) =
AtomicUpdateFileStep::new(enqueued.len() as u32);
progress.update_progress(update_file_progress);
for task_id in enqueued {
let task = self.get_task(&rtxn, task_id)?.ok_or(Error::CorruptedTaskQueue)?;
if let Some(content_uuid) = task.content_uuid() {
let src = self.file_store.get_update_path(content_uuid);
let dst = update_files_dir.join(content_uuid.to_string());
fs::copy(src, dst)?;
}
atomic.fetch_add(1, Ordering::Relaxed);
}
// 3. Snapshot every indexes
for result in self.index_mapper.index_mapping.iter(&rtxn)? {
progress.update_progress(SnapshotCreationProgress::SnapshotTheIndexes);
let index_mapping = self.index_mapper.index_mapping;
let nb_indexes = index_mapping.len(&rtxn)? as u32;
for (i, result) in index_mapping.iter(&rtxn)?.enumerate() {
let (name, uuid) = result?;
progress.update_progress(VariableNameStep::new(name, i as u32, nb_indexes));
let index = self.index_mapper.index(&rtxn, name)?;
let dst = temp_snapshot_dir.path().join("indexes").join(uuid.to_string());
fs::create_dir_all(&dst)?;
index.copy_to_file(dst.join("data.mdb"), CompactionOption::Enabled)?;
index
.copy_to_file(dst.join("data.mdb"), CompactionOption::Enabled)
.map_err(|e| Error::from_milli(e, Some(name.to_string())))?;
}
drop(rtxn);
// 4. Snapshot the auth LMDB env
progress.update_progress(SnapshotCreationProgress::SnapshotTheApiKeys);
let dst = temp_snapshot_dir.path().join("auth");
fs::create_dir_all(&dst)?;
// TODO We can't use the open_auth_store_env function here but we should
@ -707,6 +737,7 @@ impl IndexScheduler {
auth.copy_to_file(dst.join("data.mdb"), CompactionOption::Enabled)?;
// 5. Copy and tarball the flat snapshot
progress.update_progress(SnapshotCreationProgress::CreateTheTarball);
// 5.1 Find the original name of the database
// TODO find a better way to get this path
let mut base_path = self.env.path().to_owned();
@ -739,6 +770,7 @@ impl IndexScheduler {
Ok(tasks)
}
Batch::Dump(mut task) => {
progress.update_progress(DumpCreationProgress::StartTheDumpCreation);
let started_at = OffsetDateTime::now_utc();
let (keys, instance_uid) =
if let KindWithContent::DumpCreation { keys, instance_uid } = &task.kind {
@ -749,6 +781,7 @@ impl IndexScheduler {
let dump = dump::DumpWriter::new(*instance_uid)?;
// 1. dump the keys
progress.update_progress(DumpCreationProgress::DumpTheApiKeys);
let mut dump_keys = dump.create_keys()?;
for key in keys {
dump_keys.push_key(key)?;
@ -758,7 +791,13 @@ impl IndexScheduler {
let rtxn = self.env.read_txn()?;
// 2. dump the tasks
progress.update_progress(DumpCreationProgress::DumpTheTasks);
let mut dump_tasks = dump.create_tasks_queue()?;
let (atomic, update_task_progress) =
AtomicTaskStep::new(self.all_tasks.len(&rtxn)? as u32);
progress.update_progress(update_task_progress);
for ret in self.all_tasks.iter(&rtxn)? {
if self.must_stop_processing.get() {
return Err(Error::AbortedTask);
@ -780,6 +819,13 @@ impl IndexScheduler {
t.started_at = Some(started_at);
t.finished_at = Some(finished_at);
}
// Patch the task to remove the batch uid, because as of v1.12.5 batches are not persisted.
// This prevents referencing *future* batches that are not actually associated with the task.
//
// See <https://github.com/meilisearch/meilisearch/issues/5247> for details.
t.batch_uid = None;
let mut dump_content_file = dump_tasks.push_task(&t.into())?;
// 2.1. Dump the `content_file` associated with the task if there is one and the task is not finished yet.
@ -790,51 +836,82 @@ impl IndexScheduler {
if status == Status::Enqueued {
let content_file = self.file_store.get_update(content_file)?;
let reader = DocumentsBatchReader::from_reader(content_file)
.map_err(milli::Error::from)?;
let (mut cursor, documents_batch_index) =
reader.into_cursor_and_fields_index();
while let Some(doc) =
cursor.next_document().map_err(milli::Error::from)?
for document in
serde_json::de::Deserializer::from_reader(content_file).into_iter()
{
dump_content_file
.push_document(&obkv_to_object(doc, &documents_batch_index)?)?;
let document = document.map_err(|e| {
Error::from_milli(
milli::InternalError::SerdeJson(e).into(),
None,
)
})?;
dump_content_file.push_document(&document)?;
}
dump_content_file.flush()?;
}
}
atomic.fetch_add(1, Ordering::Relaxed);
}
dump_tasks.flush()?;
// 3. Dump the indexes
progress.update_progress(DumpCreationProgress::DumpTheIndexes);
let nb_indexes = self.index_mapper.index_mapping.len(&rtxn)? as u32;
let mut count = 0;
self.index_mapper.try_for_each_index(&rtxn, |uid, index| -> Result<()> {
progress.update_progress(VariableNameStep::new(
uid.to_string(),
count,
nb_indexes,
));
count += 1;
let rtxn = index.read_txn()?;
let metadata = IndexMetadata {
uid: uid.to_owned(),
primary_key: index.primary_key(&rtxn)?.map(String::from),
created_at: index.created_at(&rtxn)?,
updated_at: index.updated_at(&rtxn)?,
created_at: index
.created_at(&rtxn)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?,
updated_at: index
.updated_at(&rtxn)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?,
};
let mut index_dumper = dump.create_index(uid, &metadata)?;
let fields_ids_map = index.fields_ids_map(&rtxn)?;
let all_fields: Vec<_> = fields_ids_map.iter().map(|(id, _)| id).collect();
let embedding_configs = index.embedding_configs(&rtxn)?;
let embedding_configs = index
.embedding_configs(&rtxn)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
let nb_documents = index
.number_of_documents(&rtxn)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?
as u32;
let (atomic, update_document_progress) = AtomicDocumentStep::new(nb_documents);
progress.update_progress(update_document_progress);
let documents = index
.all_documents(&rtxn)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
// 3.1. Dump the documents
for ret in index.all_documents(&rtxn)? {
for ret in documents {
if self.must_stop_processing.get() {
return Err(Error::AbortedTask);
}
let (id, doc) = ret?;
let (id, doc) =
ret.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
let mut document = milli::obkv_to_json(&all_fields, &fields_ids_map, doc)?;
let mut document =
milli::obkv_to_json(&all_fields, &fields_ids_map, doc)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
'inject_vectors: {
let embeddings = index.embeddings(&rtxn, id)?;
let embeddings = index
.embeddings(&rtxn, id)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
if embeddings.is_empty() {
break 'inject_vectors;
@ -845,7 +922,7 @@ impl IndexScheduler {
.or_insert(serde_json::Value::Object(Default::default()));
let serde_json::Value::Object(vectors) = vectors else {
return Err(milli::Error::UserError(
let user_err = milli::Error::UserError(
milli::UserError::InvalidVectorsMapType {
document_id: {
if let Ok(Some(Ok(index))) = index
@ -859,8 +936,9 @@ impl IndexScheduler {
},
value: vectors.clone(),
},
)
.into());
);
return Err(Error::from_milli(user_err, Some(uid.to_string())));
};
for (embedder_name, embeddings) in embeddings {
@ -883,6 +961,7 @@ impl IndexScheduler {
}
index_dumper.push_document(&document)?;
atomic.fetch_add(1, Ordering::Relaxed);
}
// 3.2. Dump the settings
@ -890,12 +969,14 @@ impl IndexScheduler {
index,
&rtxn,
meilisearch_types::settings::SecretPolicy::RevealSecrets,
)?;
)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?;
index_dumper.settings(&settings)?;
Ok(())
})?;
// 4. Dump experimental feature settings
progress.update_progress(DumpCreationProgress::DumpTheExperimentalFeatures);
let features = self.features().runtime_features();
dump.create_experimental_features(features)?;
@ -906,6 +987,7 @@ impl IndexScheduler {
if self.must_stop_processing.get() {
return Err(Error::AbortedTask);
}
progress.update_progress(DumpCreationProgress::CompressTheDump);
let path = self.dumps_path.join(format!("{}.dump", dump_uid));
let file = File::create(path)?;
dump.persist_to(BufWriter::new(file))?;
@ -931,7 +1013,7 @@ impl IndexScheduler {
.set_currently_updating_index(Some((index_uid.clone(), index.clone())));
let mut index_wtxn = index.write_txn()?;
let tasks = self.apply_index_operation(&mut index_wtxn, &index, op)?;
let tasks = self.apply_index_operation(&mut index_wtxn, &index, op, progress)?;
{
let span = tracing::trace_span!(target: "indexing::scheduler", "commit");
@ -946,7 +1028,8 @@ impl IndexScheduler {
// the entire batch.
let res = || -> Result<()> {
let index_rtxn = index.read_txn()?;
let stats = crate::index_mapper::IndexStats::new(&index, &index_rtxn)?;
let stats = crate::index_mapper::IndexStats::new(&index, &index_rtxn)
.map_err(|e| Error::from_milli(e, Some(index_uid.to_string())))?;
let mut wtxn = self.env.write_txn()?;
self.index_mapper.store_stats_of(&mut wtxn, &index_uid, &stats)?;
wtxn.commit()?;
@ -964,6 +1047,8 @@ impl IndexScheduler {
Ok(tasks)
}
Batch::IndexCreation { index_uid, primary_key, task } => {
progress.update_progress(CreateIndexProgress::CreatingTheIndex);
let wtxn = self.env.write_txn()?;
if self.index_mapper.exists(&wtxn, &index_uid)? {
return Err(Error::IndexAlreadyExists(index_uid));
@ -973,9 +1058,11 @@ impl IndexScheduler {
self.process_batch(
Batch::IndexUpdate { index_uid, primary_key, task },
current_batch,
progress,
)
}
Batch::IndexUpdate { index_uid, primary_key, mut task } => {
progress.update_progress(UpdateIndexProgress::UpdatingTheIndex);
let rtxn = self.env.read_txn()?;
let index = self.index_mapper.index(&rtxn, &index_uid)?;
@ -988,10 +1075,12 @@ impl IndexScheduler {
);
builder.set_primary_key(primary_key);
let must_stop_processing = self.must_stop_processing.clone();
builder.execute(
|indexing_step| tracing::debug!(update = ?indexing_step),
|| must_stop_processing.get(),
)?;
builder
.execute(
|indexing_step| tracing::debug!(update = ?indexing_step),
|| must_stop_processing.get(),
)
.map_err(|e| Error::from_milli(e, Some(index_uid.to_string())))?;
index_wtxn.commit()?;
}
@ -1008,7 +1097,8 @@ impl IndexScheduler {
let res = || -> Result<()> {
let mut wtxn = self.env.write_txn()?;
let index_rtxn = index.read_txn()?;
let stats = crate::index_mapper::IndexStats::new(&index, &index_rtxn)?;
let stats = crate::index_mapper::IndexStats::new(&index, &index_rtxn)
.map_err(|e| Error::from_milli(e, Some(index_uid.clone())))?;
self.index_mapper.store_stats_of(&mut wtxn, &index_uid, &stats)?;
wtxn.commit()?;
Ok(())
@ -1025,13 +1115,16 @@ impl IndexScheduler {
Ok(vec![task])
}
Batch::IndexDeletion { index_uid, index_has_been_created, mut tasks } => {
progress.update_progress(DeleteIndexProgress::DeletingTheIndex);
let wtxn = self.env.write_txn()?;
// it's possible that the index doesn't exist
let number_of_documents = || -> Result<u64> {
let index = self.index_mapper.index(&wtxn, &index_uid)?;
let index_rtxn = index.read_txn()?;
Ok(index.number_of_documents(&index_rtxn)?)
index
.number_of_documents(&index_rtxn)
.map_err(|e| Error::from_milli(e, Some(index_uid.to_string())))
}()
.unwrap_or_default();
@ -1056,6 +1149,8 @@ impl IndexScheduler {
Ok(tasks)
}
Batch::IndexSwap { mut task } => {
progress.update_progress(SwappingTheIndexes::EnsuringCorrectnessOfTheSwap);
let mut wtxn = self.env.write_txn()?;
let swaps = if let KindWithContent::IndexSwap { swaps } = &task.kind {
swaps
@ -1082,8 +1177,20 @@ impl IndexScheduler {
));
}
}
for swap in swaps {
self.apply_index_swap(&mut wtxn, task.uid, &swap.indexes.0, &swap.indexes.1)?;
progress.update_progress(SwappingTheIndexes::SwappingTheIndexes);
for (step, swap) in swaps.iter().enumerate() {
progress.update_progress(VariableNameStep::new(
format!("swapping index {} and {}", swap.indexes.0, swap.indexes.1),
step as u32,
swaps.len() as u32,
));
self.apply_index_swap(
&mut wtxn,
&progress,
task.uid,
&swap.indexes.0,
&swap.indexes.1,
)?;
}
wtxn.commit()?;
task.status = Status::Succeeded;
@ -1093,7 +1200,15 @@ impl IndexScheduler {
}
/// Swap the index `lhs` with the index `rhs`.
fn apply_index_swap(&self, wtxn: &mut RwTxn, task_id: u32, lhs: &str, rhs: &str) -> Result<()> {
fn apply_index_swap(
&self,
wtxn: &mut RwTxn,
progress: &Progress,
task_id: u32,
lhs: &str,
rhs: &str,
) -> Result<()> {
progress.update_progress(InnerSwappingTwoIndexes::RetrieveTheTasks);
// 1. Verify that both lhs and rhs are existing indexes
let index_lhs_exists = self.index_mapper.index_exists(wtxn, lhs)?;
if !index_lhs_exists {
@ -1111,14 +1226,21 @@ impl IndexScheduler {
index_rhs_task_ids.remove_range(task_id..);
// 3. before_name -> new_name in the task's KindWithContent
for task_id in &index_lhs_task_ids | &index_rhs_task_ids {
progress.update_progress(InnerSwappingTwoIndexes::UpdateTheTasks);
let tasks_to_update = &index_lhs_task_ids | &index_rhs_task_ids;
let (atomic, task_progress) = AtomicTaskStep::new(tasks_to_update.len() as u32);
progress.update_progress(task_progress);
for task_id in tasks_to_update {
let mut task = self.get_task(wtxn, task_id)?.ok_or(Error::CorruptedTaskQueue)?;
swap_index_uid_in_task(&mut task, (lhs, rhs));
self.all_tasks.put(wtxn, &task_id, &task)?;
atomic.fetch_add(1, Ordering::Relaxed);
}
// 4. remove the task from indexuid = before_name
// 5. add the task to indexuid = after_name
progress.update_progress(InnerSwappingTwoIndexes::UpdateTheIndexesMetadata);
self.update_index(wtxn, lhs, |lhs_tasks| {
*lhs_tasks -= &index_lhs_task_ids;
*lhs_tasks |= &index_rhs_task_ids;
@ -1140,7 +1262,7 @@ impl IndexScheduler {
/// The list of processed tasks.
#[tracing::instrument(
level = "trace",
skip(self, index_wtxn, index),
skip(self, index_wtxn, index, progress),
target = "indexing::scheduler"
)]
fn apply_index_operation<'i>(
@ -1148,48 +1270,18 @@ impl IndexScheduler {
index_wtxn: &mut RwTxn<'i>,
index: &'i Index,
operation: IndexOperation,
progress: Progress,
) -> Result<Vec<Task>> {
let indexer_alloc = Bump::new();
let started_processing_at = std::time::Instant::now();
let secs_since_started_processing_at = AtomicU64::new(0);
const PRINT_SECS_DELTA: u64 = 5;
let processing_tasks = self.processing_tasks.clone();
let must_stop_processing = self.must_stop_processing.clone();
let send_progress = |progress| {
let now = std::time::Instant::now();
let elapsed = secs_since_started_processing_at.load(atomic::Ordering::Relaxed);
let previous = started_processing_at + Duration::from_secs(elapsed);
let elapsed = now - previous;
if elapsed.as_secs() < PRINT_SECS_DELTA {
return;
}
secs_since_started_processing_at
.store((now - started_processing_at).as_secs(), atomic::Ordering::Relaxed);
let TaskProgress {
current_step,
finished_steps,
total_steps,
finished_substeps,
total_substeps,
} = processing_tasks.write().unwrap().update_progress(progress);
tracing::info!(
current_step,
finished_steps,
total_steps,
finished_substeps,
total_substeps
);
};
match operation {
IndexOperation::DocumentClear { mut tasks, .. } => {
let count = milli::update::ClearDocuments::new(index_wtxn, index).execute()?;
IndexOperation::DocumentClear { index_uid, mut tasks } => {
let count = milli::update::ClearDocuments::new(index_wtxn, index)
.execute()
.map_err(|e| Error::from_milli(e, Some(index_uid)))?;
let mut first_clear_found = false;
for task in &mut tasks {
@ -1209,12 +1301,13 @@ impl IndexScheduler {
Ok(tasks)
}
IndexOperation::DocumentOperation {
index_uid: _,
index_uid,
primary_key,
method,
operations,
mut tasks,
} => {
progress.update_progress(DocumentOperationProgress::RetrievingConfig);
// TODO: at some point, for better efficiency we might want to reuse the bumpalo for successive batches.
// this is made difficult by the fact we're doing private clones of the index scheduler and sending it
// to a fresh thread.
@ -1223,9 +1316,7 @@ impl IndexScheduler {
if let DocumentOperation::Add(content_uuid) = operation {
let content_file = self.file_store.get_update(*content_uuid)?;
let mmap = unsafe { memmap2::Mmap::map(&content_file)? };
if !mmap.is_empty() {
content_files.push(mmap);
}
content_files.push(mmap);
}
}
@ -1235,13 +1326,17 @@ impl IndexScheduler {
let mut content_files_iter = content_files.iter();
let mut indexer = indexer::DocumentOperation::new(method);
let embedders = index.embedding_configs(index_wtxn)?;
let embedders = self.embedders(embedders)?;
let embedders = index
.embedding_configs(index_wtxn)
.map_err(|e| Error::from_milli(e, Some(index_uid.clone())))?;
let embedders = self.embedders(index_uid.clone(), embedders)?;
for operation in operations {
match operation {
DocumentOperation::Add(_content_uuid) => {
let mmap = content_files_iter.next().unwrap();
indexer.add_documents(mmap)?;
indexer
.add_documents(mmap)
.map_err(|e| Error::from_milli(e, Some(index_uid.clone())))?;
}
DocumentOperation::Delete(document_ids) => {
let document_ids: bumpalo::collections::vec::Vec<_> = document_ids
@ -1266,19 +1361,22 @@ impl IndexScheduler {
}
};
let (document_changes, operation_stats, primary_key) = indexer.into_changes(
&indexer_alloc,
index,
&rtxn,
primary_key.as_deref(),
&mut new_fields_ids_map,
&|| must_stop_processing.get(),
&send_progress,
)?;
progress.update_progress(DocumentOperationProgress::ComputingDocumentChanges);
let (document_changes, operation_stats, primary_key) = indexer
.into_changes(
&indexer_alloc,
index,
&rtxn,
primary_key.as_deref(),
&mut new_fields_ids_map,
&|| must_stop_processing.get(),
progress.clone(),
)
.map_err(|e| Error::from_milli(e, Some(index_uid.clone())))?;
let mut addition = 0;
let mut candidates_count = 0;
for (stats, task) in operation_stats.into_iter().zip(&mut tasks) {
addition += stats.document_count;
candidates_count += stats.document_count;
match stats.error {
Some(error) => {
task.status = Status::Failed;
@ -1308,6 +1406,7 @@ impl IndexScheduler {
}
}
progress.update_progress(DocumentOperationProgress::Indexing);
if tasks.iter().any(|res| res.error.is_none()) {
indexer::index(
index_wtxn,
@ -1320,15 +1419,25 @@ impl IndexScheduler {
&document_changes,
embedders,
&|| must_stop_processing.get(),
&send_progress,
)?;
&progress,
)
.map_err(|e| Error::from_milli(e, Some(index_uid.clone())))?;
let addition = DocumentAdditionResult {
indexed_documents: candidates_count,
number_of_documents: index
.number_of_documents(index_wtxn)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?,
};
tracing::info!(indexing_result = ?addition, processed_in = ?started_processing_at.elapsed(), "document indexing done");
}
Ok(tasks)
}
IndexOperation::DocumentEdition { mut task, .. } => {
IndexOperation::DocumentEdition { index_uid, mut task } => {
progress.update_progress(DocumentEditionProgress::RetrievingConfig);
let (filter, code) = if let KindWithContent::DocumentEdition {
filter_expr,
context: _,
@ -1342,16 +1451,11 @@ impl IndexScheduler {
};
let candidates = match filter.as_ref().map(Filter::from_json) {
Some(Ok(Some(filter))) => {
filter.evaluate(index_wtxn, index).map_err(|err| match err {
milli::Error::UserError(milli::UserError::InvalidFilter(_)) => {
Error::from(err).with_custom_error_code(Code::InvalidDocumentFilter)
}
e => e.into(),
})?
}
Some(Ok(Some(filter))) => filter
.evaluate(index_wtxn, index)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?,
None | Some(Ok(None)) => index.documents_ids(index_wtxn)?,
Some(Err(e)) => return Err(e.into()),
Some(Err(e)) => return Err(Error::from_milli(e, Some(index_uid.clone()))),
};
let (original_filter, context, function) = if let Some(Details::DocumentEdition {
@ -1386,8 +1490,9 @@ impl IndexScheduler {
// candidates not empty => index not empty => a primary key is set
let primary_key = index.primary_key(&rtxn)?.unwrap();
let primary_key = PrimaryKey::new_or_insert(primary_key, &mut new_fields_ids_map)
.map_err(milli::Error::from)?;
let primary_key =
PrimaryKey::new_or_insert(primary_key, &mut new_fields_ids_map)
.map_err(|err| Error::from_milli(err.into(), Some(index_uid.clone())))?;
let result_count = Ok((candidates.len(), candidates.len())) as Result<_>;
@ -1405,13 +1510,22 @@ impl IndexScheduler {
}
};
let candidates_count = candidates.len();
progress.update_progress(DocumentEditionProgress::ComputingDocumentChanges);
let indexer = UpdateByFunction::new(candidates, context.clone(), code.clone());
let document_changes =
pool.install(|| indexer.into_changes(&primary_key)).unwrap()?;
let embedders = index.embedding_configs(index_wtxn)?;
let embedders = self.embedders(embedders)?;
let document_changes = pool
.install(|| {
indexer
.into_changes(&primary_key)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))
})
.unwrap()?;
let embedders = index
.embedding_configs(index_wtxn)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?;
let embedders = self.embedders(index_uid.clone(), embedders)?;
progress.update_progress(DocumentEditionProgress::Indexing);
indexer::index(
index_wtxn,
index,
@ -1423,10 +1537,18 @@ impl IndexScheduler {
&document_changes,
embedders,
&|| must_stop_processing.get(),
&send_progress,
)?;
&progress,
)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?;
// tracing::info!(indexing_result = ?addition, processed_in = ?started_processing_at.elapsed(), "document indexing done");
let addition = DocumentAdditionResult {
indexed_documents: candidates_count,
number_of_documents: index
.number_of_documents(index_wtxn)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?,
};
tracing::info!(indexing_result = ?addition, processed_in = ?started_processing_at.elapsed(), "document indexing done");
}
match result_count {
@ -1455,7 +1577,9 @@ impl IndexScheduler {
Ok(vec![task])
}
IndexOperation::DocumentDeletion { mut tasks, index_uid: _ } => {
IndexOperation::DocumentDeletion { mut tasks, index_uid } => {
progress.update_progress(DocumentDeletionProgress::RetrievingConfig);
let mut to_delete = RoaringBitmap::new();
let external_documents_ids = index.external_documents_ids();
@ -1476,35 +1600,23 @@ impl IndexScheduler {
deleted_documents: Some(will_be_removed),
});
}
KindWithContent::DocumentDeletionByFilter { index_uid: _, filter_expr } => {
KindWithContent::DocumentDeletionByFilter { index_uid, filter_expr } => {
let before = to_delete.len();
let filter = match Filter::from_json(filter_expr) {
Ok(filter) => filter,
Err(err) => {
// theoretically, this should be caught by deserr before reaching the index-scheduler and cannot happen
task.status = Status::Failed;
task.error = match err {
milli::Error::UserError(
milli::UserError::InvalidFilterExpression { .. },
) => Some(
Error::from(err)
.with_custom_error_code(Code::InvalidDocumentFilter)
.into(),
),
e => Some(e.into()),
};
task.error = Some(
Error::from_milli(err, Some(index_uid.clone())).into(),
);
None
}
};
if let Some(filter) = filter {
let candidates =
filter.evaluate(index_wtxn, index).map_err(|err| match err {
milli::Error::UserError(
milli::UserError::InvalidFilter(_),
) => Error::from(err)
.with_custom_error_code(Code::InvalidDocumentFilter),
e => e.into(),
});
let candidates = filter
.evaluate(index_wtxn, index)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())));
match candidates {
Ok(candidates) => to_delete |= candidates,
Err(err) => {
@ -1540,8 +1652,9 @@ impl IndexScheduler {
// to_delete not empty => index not empty => primary key set
let primary_key = index.primary_key(&rtxn)?.unwrap();
let primary_key = PrimaryKey::new_or_insert(primary_key, &mut new_fields_ids_map)
.map_err(milli::Error::from)?;
let primary_key =
PrimaryKey::new_or_insert(primary_key, &mut new_fields_ids_map)
.map_err(|err| Error::from_milli(err.into(), Some(index_uid.clone())))?;
if !tasks.iter().all(|res| res.error.is_some()) {
let local_pool;
@ -1557,12 +1670,17 @@ impl IndexScheduler {
}
};
progress.update_progress(DocumentDeletionProgress::DeleteDocuments);
let mut indexer = indexer::DocumentDeletion::new();
let candidates_count = to_delete.len();
indexer.delete_documents_by_docids(to_delete);
let document_changes = indexer.into_changes(&indexer_alloc, primary_key);
let embedders = index.embedding_configs(index_wtxn)?;
let embedders = self.embedders(embedders)?;
let embedders = index
.embedding_configs(index_wtxn)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?;
let embedders = self.embedders(index_uid.clone(), embedders)?;
progress.update_progress(DocumentDeletionProgress::Indexing);
indexer::index(
index_wtxn,
index,
@ -1574,15 +1692,24 @@ impl IndexScheduler {
&document_changes,
embedders,
&|| must_stop_processing.get(),
&send_progress,
)?;
&progress,
)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?;
// tracing::info!(indexing_result = ?addition, processed_in = ?started_processing_at.elapsed(), "document indexing done");
let addition = DocumentAdditionResult {
indexed_documents: candidates_count,
number_of_documents: index
.number_of_documents(index_wtxn)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?,
};
tracing::info!(indexing_result = ?addition, processed_in = ?started_processing_at.elapsed(), "document indexing done");
}
Ok(tasks)
}
IndexOperation::Settings { index_uid: _, settings, mut tasks } => {
IndexOperation::Settings { index_uid, settings, mut tasks } => {
progress.update_progress(SettingsProgress::RetrievingAndMergingTheSettings);
let indexer_config = self.index_mapper.indexer_config();
let mut builder = milli::update::Settings::new(index_wtxn, index, indexer_config);
@ -1596,10 +1723,13 @@ impl IndexScheduler {
task.status = Status::Succeeded;
}
builder.execute(
|indexing_step| tracing::debug!(update = ?indexing_step),
|| must_stop_processing.get(),
)?;
progress.update_progress(SettingsProgress::ApplyTheSettings);
builder
.execute(
|indexing_step| tracing::debug!(update = ?indexing_step),
|| must_stop_processing.get(),
)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?;
Ok(tasks)
}
@ -1616,12 +1746,14 @@ impl IndexScheduler {
index_uid: index_uid.clone(),
tasks: cleared_tasks,
},
progress.clone(),
)?;
let settings_tasks = self.apply_index_operation(
index_wtxn,
index,
IndexOperation::Settings { index_uid, settings, tasks: settings_tasks },
progress,
)?;
let mut tasks = settings_tasks;
@ -1638,15 +1770,18 @@ impl IndexScheduler {
&self,
wtxn: &mut RwTxn,
matched_tasks: &RoaringBitmap,
progress: &Progress,
) -> Result<RoaringBitmap> {
progress.update_progress(TaskDeletionProgress::DeletingTasksDateTime);
// 1. Remove from this list the tasks that we are not allowed to delete
let enqueued_tasks = self.get_status(wtxn, Status::Enqueued)?;
let processing_tasks = &self.processing_tasks.read().unwrap().processing.clone();
let all_task_ids = self.all_task_ids(wtxn)?;
let mut to_delete_tasks = all_task_ids & matched_tasks;
to_delete_tasks -= processing_tasks;
to_delete_tasks -= enqueued_tasks;
to_delete_tasks -= &**processing_tasks;
to_delete_tasks -= &enqueued_tasks;
// 2. We now have a list of tasks to delete, delete them
@ -1657,6 +1792,8 @@ impl IndexScheduler {
// The tasks that have been removed, *per batch*.
let mut affected_batches: HashMap<BatchId, RoaringBitmap> = HashMap::new();
let (atomic_progress, task_progress) = AtomicTaskStep::new(to_delete_tasks.len() as u32);
progress.update_progress(task_progress);
for task_id in to_delete_tasks.iter() {
let task = self.get_task(wtxn, task_id)?.ok_or(Error::CorruptedTaskQueue)?;
@ -1680,22 +1817,35 @@ impl IndexScheduler {
if let Some(batch_uid) = task.batch_uid {
affected_batches.entry(batch_uid).or_default().insert(task_id);
}
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
progress.update_progress(TaskDeletionProgress::DeletingTasksMetadata);
let (atomic_progress, task_progress) = AtomicTaskStep::new(
(affected_indexes.len() + affected_statuses.len() + affected_kinds.len()) as u32,
);
progress.update_progress(task_progress);
for index in affected_indexes.iter() {
self.update_index(wtxn, index, |bitmap| *bitmap -= &to_delete_tasks)?;
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
for status in affected_statuses.iter() {
self.update_status(wtxn, *status, |bitmap| *bitmap -= &to_delete_tasks)?;
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
for kind in affected_kinds.iter() {
self.update_kind(wtxn, *kind, |bitmap| *bitmap -= &to_delete_tasks)?;
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
progress.update_progress(TaskDeletionProgress::DeletingTasks);
let (atomic_progress, task_progress) = AtomicTaskStep::new(to_delete_tasks.len() as u32);
progress.update_progress(task_progress);
for task in to_delete_tasks.iter() {
self.all_tasks.delete(wtxn, &task)?;
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
for canceled_by in affected_canceled_by {
if let Some(mut tasks) = self.canceled_by.get(wtxn, &canceled_by)? {
@ -1707,6 +1857,9 @@ impl IndexScheduler {
}
}
}
progress.update_progress(TaskDeletionProgress::DeletingBatches);
let (atomic_progress, batch_progress) = AtomicBatchStep::new(affected_batches.len() as u32);
progress.update_progress(batch_progress);
for (batch_id, to_delete_tasks) in affected_batches {
if let Some(mut tasks) = self.batch_to_tasks_mapping.get(wtxn, &batch_id)? {
tasks -= &to_delete_tasks;
@ -1748,6 +1901,7 @@ impl IndexScheduler {
}
}
}
atomic_progress.fetch_add(1, Ordering::Relaxed);
}
Ok(to_delete_tasks)
@ -1762,21 +1916,36 @@ impl IndexScheduler {
cancel_task_id: TaskId,
current_batch: &mut ProcessingBatch,
matched_tasks: &RoaringBitmap,
progress: &Progress,
) -> Result<Vec<Task>> {
progress.update_progress(TaskCancelationProgress::RetrievingTasks);
// 1. Remove from this list the tasks that we are not allowed to cancel
// Notice that only the _enqueued_ ones are cancelable and we should
// have already aborted the indexation of the _processing_ ones
let cancelable_tasks = self.get_status(rtxn, Status::Enqueued)?;
let tasks_to_cancel = cancelable_tasks & matched_tasks;
// 2. We now have a list of tasks to cancel, cancel them
let mut tasks = self.get_existing_tasks(rtxn, tasks_to_cancel.iter())?;
let (task_progress, progress_obj) = AtomicTaskStep::new(tasks_to_cancel.len() as u32);
progress.update_progress(progress_obj);
// 2. We now have a list of tasks to cancel, cancel them
let mut tasks = self.get_existing_tasks(
rtxn,
tasks_to_cancel.iter().inspect(|_| {
task_progress.fetch_add(1, Ordering::Relaxed);
}),
)?;
progress.update_progress(TaskCancelationProgress::UpdatingTasks);
let (task_progress, progress_obj) = AtomicTaskStep::new(tasks_to_cancel.len() as u32);
progress.update_progress(progress_obj);
for task in tasks.iter_mut() {
task.status = Status::Canceled;
task.canceled_by = Some(cancel_task_id);
task.details = task.details.as_ref().map(|d| d.to_failed());
current_batch.processing(Some(task));
task_progress.fetch_add(1, Ordering::Relaxed);
}
Ok(tasks)

View File

@ -104,7 +104,7 @@ pub enum Error {
)]
InvalidTaskCanceledBy { canceled_by: String },
#[error(
"{index_uid} is not a valid index uid. Index uid can be an integer or a string containing only alphanumeric characters, hyphens (-) and underscores (_), and can not be more than 512 bytes."
"{index_uid} is not a valid index uid. Index uid can be an integer or a string containing only alphanumeric characters, hyphens (-) and underscores (_), and can not be more than 400 bytes."
)]
InvalidIndexUid { index_uid: String },
#[error("Task `{0}` not found.")]
@ -122,8 +122,11 @@ pub enum Error {
Dump(#[from] dump::Error),
#[error(transparent)]
Heed(#[from] heed::Error),
#[error(transparent)]
Milli(#[from] milli::Error),
#[error("{}", match .index_uid {
Some(uid) if !uid.is_empty() => format!("Index `{}`: {error}", uid),
_ => format!("{error}")
})]
Milli { error: milli::Error, index_uid: Option<String> },
#[error("An unexpected crash occurred when processing the task.")]
ProcessBatchPanicked,
#[error(transparent)]
@ -190,7 +193,7 @@ impl Error {
| Error::AbortedTask
| Error::Dump(_)
| Error::Heed(_)
| Error::Milli(_)
| Error::Milli { .. }
| Error::ProcessBatchPanicked
| Error::FileStore(_)
| Error::IoError(_)
@ -209,6 +212,20 @@ impl Error {
pub fn with_custom_error_code(self, code: Code) -> Self {
Self::WithCustomErrorCode(code, Box::new(self))
}
pub fn from_milli(err: milli::Error, index_uid: Option<String>) -> Self {
match err {
milli::Error::UserError(milli::UserError::InvalidFilter(_)) => {
Self::Milli { error: err, index_uid }
.with_custom_error_code(Code::InvalidDocumentFilter)
}
milli::Error::UserError(milli::UserError::InvalidFilterExpression { .. }) => {
Self::Milli { error: err, index_uid }
.with_custom_error_code(Code::InvalidDocumentFilter)
}
_ => Self::Milli { error: err, index_uid },
}
}
}
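
For illustration only, a minimal sketch of the wrapping pattern this variant enables (the `documents_count` helper and its signature are hypothetical; `Index::number_of_documents` and `Error::from_milli` are the calls used throughout the scheduler diff above, and `Result` is the crate's `Result<T, Error>` alias). Wrapping at the call site is what turns a bare milli message into "Index `doggos`: <message>" in task errors:

```rust
use meilisearch_types::heed::RoTxn;
use meilisearch_types::milli::Index;

// Hypothetical helper showing the pattern: any fallible milli call is wrapped
// with the index uid so the Display impl above can prefix the message with it.
fn documents_count(index: &Index, rtxn: &RoTxn, index_uid: &str) -> Result<u64> {
    index
        .number_of_documents(rtxn)
        .map_err(|e| Error::from_milli(e, Some(index_uid.to_string())))
}
```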
impl ErrorCode for Error {
@ -236,7 +253,7 @@ impl ErrorCode for Error {
// TODO: not sure of the Code to use
Error::NoSpaceLeftInTaskQueue => Code::NoSpaceLeftOnDevice,
Error::Dump(e) => e.error_code(),
Error::Milli(e) => e.error_code(),
Error::Milli { error, .. } => error.error_code(),
Error::ProcessBatchPanicked => Code::Internal,
Error::Heed(e) => e.error_code(),
Error::HeedTransaction(e) => e.error_code(),

View File

@ -1,16 +1,17 @@
use std::collections::BTreeMap;
use std::env::VarError;
use std::path::Path;
use std::str::FromStr;
use std::time::Duration;
use meilisearch_types::heed::{EnvClosingEvent, EnvFlags, EnvOpenOptions};
use meilisearch_types::milli::Index;
use meilisearch_types::milli::{Index, Result};
use time::OffsetDateTime;
use uuid::Uuid;
use super::IndexStatus::{self, Available, BeingDeleted, Closing, Missing};
use crate::clamp_to_page_size;
use crate::lru::{InsertionOutcome, LruMap};
use crate::{clamp_to_page_size, Result};
/// Keep an internally consistent view of the open indexes in memory.
///
/// This view is made of an LRU cache that will evict the least frequently used indexes when new indexes are opened.
@ -303,7 +304,15 @@ fn create_or_open_index(
) -> Result<Index> {
let mut options = EnvOpenOptions::new();
options.map_size(clamp_to_page_size(map_size));
options.max_readers(1024);
let max_readers = match std::env::var("MEILI_EXPERIMENTAL_INDEX_MAX_READERS") {
Ok(value) => u32::from_str(&value).unwrap(),
Err(VarError::NotPresent) => 1024,
Err(VarError::NotUnicode(value)) => panic!(
"Invalid unicode for the `MEILI_EXPERIMENTAL_INDEX_MAX_READERS` env var: {value:?}"
),
};
options.max_readers(max_readers);
if enable_mdb_writemap {
unsafe { options.flags(EnvFlags::WRITE_MAP) };
}
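
As a usage note (a sketch under the assumption that the parsing rules stay exactly as above): launching with, e.g., `MEILI_EXPERIMENTAL_INDEX_MAX_READERS=4096` raises the per-index LMDB reader limit, an unset variable falls back to the new default of 1024, and a non-numeric value panics on the `unwrap()`. The resolution logic, extracted as a standalone function:

```rust
use std::env::VarError;
use std::str::FromStr;

// Standalone sketch of the resolution above; not part of the diff itself.
fn max_readers_from_env() -> u32 {
    match std::env::var("MEILI_EXPERIMENTAL_INDEX_MAX_READERS") {
        // A set, numeric value overrides the default; a non-numeric value panics here.
        Ok(value) => u32::from_str(&value).unwrap(),
        // Unset: keep the default of 1024 readers.
        Err(VarError::NotPresent) => 1024,
        // Set but not valid unicode: fail loudly at startup.
        Err(VarError::NotUnicode(value)) => panic!(
            "Invalid unicode for the `MEILI_EXPERIMENTAL_INDEX_MAX_READERS` env var: {value:?}"
        ),
    }
}
```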

View File

@ -5,6 +5,7 @@ use std::{fs, thread};
use meilisearch_types::heed::types::{SerdeJson, Str};
use meilisearch_types::heed::{Database, Env, RoTxn, RwTxn};
use meilisearch_types::milli;
use meilisearch_types::milli::update::IndexerConfig;
use meilisearch_types::milli::{FieldDistribution, Index};
use serde::{Deserialize, Serialize};
@ -121,7 +122,7 @@ impl IndexStats {
/// # Parameters
///
/// - rtxn: a RO transaction for the index, obtained from `Index::read_txn()`.
pub fn new(index: &Index, rtxn: &RoTxn) -> Result<Self> {
pub fn new(index: &Index, rtxn: &RoTxn) -> milli::Result<Self> {
Ok(IndexStats {
number_of_documents: index.number_of_documents(rtxn)?,
database_size: index.on_disk_size()?,
@ -183,13 +184,18 @@ impl IndexMapper {
// Error if the UUIDv4 somehow already exists in the map, since it should be fresh.
// This is very unlikely to happen in practice.
// TODO: it would be better to lazily create the index. But we need an Index::open function for milli.
let index = self.index_map.write().unwrap().create(
&uuid,
&index_path,
date,
self.enable_mdb_writemap,
self.index_base_map_size,
)?;
let index = self
.index_map
.write()
.unwrap()
.create(
&uuid,
&index_path,
date,
self.enable_mdb_writemap,
self.index_base_map_size,
)
.map_err(|e| Error::from_milli(e, Some(uuid.to_string())))?;
wtxn.commit()?;
@ -357,7 +363,9 @@ impl IndexMapper {
};
let index_path = self.base_path.join(uuid.to_string());
// take the lock to reopen the environment.
reopen.reopen(&mut self.index_map.write().unwrap(), &index_path)?;
reopen
.reopen(&mut self.index_map.write().unwrap(), &index_path)
.map_err(|e| Error::from_milli(e, Some(uuid.to_string())))?;
continue;
}
BeingDeleted => return Err(Error::IndexNotFound(name.to_string())),
@ -372,13 +380,15 @@ impl IndexMapper {
Missing => {
let index_path = self.base_path.join(uuid.to_string());
break index_map.create(
&uuid,
&index_path,
None,
self.enable_mdb_writemap,
self.index_base_map_size,
)?;
break index_map
.create(
&uuid,
&index_path,
None,
self.enable_mdb_writemap,
self.index_base_map_size,
)
.map_err(|e| Error::from_milli(e, Some(uuid.to_string())))?;
}
Available(index) => break index,
Closing(_) => {
@ -460,6 +470,7 @@ impl IndexMapper {
let index = self.index(rtxn, index_uid)?;
let index_rtxn = index.read_txn()?;
IndexStats::new(&index, &index_rtxn)
.map_err(|e| Error::from_milli(e, Some(uuid.to_string())))
}
}
}

View File

@ -353,7 +353,7 @@ pub fn snapshot_canceled_by(rtxn: &RoTxn, db: Database<BEU32, RoaringBitmapCodec
pub fn snapshot_batch(batch: &Batch) -> String {
let mut snap = String::new();
let Batch { uid, details, stats, started_at, finished_at } = batch;
let Batch { uid, details, stats, started_at, finished_at, progress: _ } = batch;
if let Some(finished_at) = finished_at {
assert!(finished_at > started_at);
}

View File

@ -26,6 +26,7 @@ mod index_mapper;
#[cfg(test)]
mod insta_snapshot;
mod lru;
mod processing;
mod utils;
pub mod uuid_codec;
@ -54,14 +55,13 @@ use meilisearch_types::features::{InstanceTogglableFeatures, RuntimeTogglableFea
use meilisearch_types::heed::byteorder::BE;
use meilisearch_types::heed::types::{SerdeBincode, SerdeJson, Str, I128};
use meilisearch_types::heed::{self, Database, Env, PutFlags, RoTxn, RwTxn};
use meilisearch_types::milli::documents::DocumentsBatchBuilder;
use meilisearch_types::milli::index::IndexEmbeddingConfig;
use meilisearch_types::milli::update::new::indexer::document_changes::Progress;
use meilisearch_types::milli::update::IndexerConfig;
use meilisearch_types::milli::vector::{Embedder, EmbedderOptions, EmbeddingConfigs};
use meilisearch_types::milli::{self, CboRoaringBitmapCodec, Index, RoaringBitmapCodec, BEU32};
use meilisearch_types::task_view::TaskView;
use meilisearch_types::tasks::{Kind, KindWithContent, Status, Task, TaskProgress};
use meilisearch_types::tasks::{Kind, KindWithContent, Status, Task};
use processing::ProcessingTasks;
use rayon::current_num_threads;
use rayon::prelude::{IntoParallelIterator, ParallelIterator};
use roaring::RoaringBitmap;
@ -72,7 +72,8 @@ use utils::{filter_out_references_to_newer_tasks, keep_ids_within_datetimes, map
use uuid::Uuid;
use crate::index_mapper::IndexMapper;
use crate::utils::{check_index_swap_validity, clamp_to_page_size, ProcessingBatch};
use crate::processing::{AtomicTaskStep, BatchProgress};
use crate::utils::{check_index_swap_validity, clamp_to_page_size};
pub(crate) type BEI128 = I128<BE>;
@ -163,48 +164,6 @@ impl Query {
}
}
#[derive(Debug, Clone)]
pub struct ProcessingTasks {
batch: Option<ProcessingBatch>,
/// The list of tasks ids that are currently running.
processing: RoaringBitmap,
/// The progress on processing tasks
progress: Option<TaskProgress>,
}
impl ProcessingTasks {
/// Creates an empty `ProcessingAt` struct.
fn new() -> ProcessingTasks {
ProcessingTasks { batch: None, processing: RoaringBitmap::new(), progress: None }
}
/// Stores the currently processing tasks, and the date time at which it started.
fn start_processing(&mut self, processing_batch: ProcessingBatch, processing: RoaringBitmap) {
self.batch = Some(processing_batch);
self.processing = processing;
}
fn update_progress(&mut self, progress: Progress) -> TaskProgress {
self.progress.get_or_insert_with(TaskProgress::default).update(progress)
}
/// Set the processing tasks to an empty list
fn stop_processing(&mut self) -> Self {
self.progress = None;
Self {
batch: std::mem::take(&mut self.batch),
processing: std::mem::take(&mut self.processing),
progress: None,
}
}
/// Returns `true` if there, at least, is one task that is currently processing that we must stop.
fn must_cancel_processing_tasks(&self, canceled_tasks: &RoaringBitmap) -> bool {
!self.processing.is_disjoint(canceled_tasks)
}
}
#[derive(Default, Clone, Debug)]
struct MustStopProcessing(Arc<AtomicBool>);
@ -813,7 +772,7 @@ impl IndexScheduler {
let mut batch_tasks = RoaringBitmap::new();
for batch_uid in batch_uids {
if processing_batch.as_ref().map_or(false, |batch| batch.uid == *batch_uid) {
batch_tasks |= &processing_tasks;
batch_tasks |= &*processing_tasks;
} else {
batch_tasks |= self.tasks_in_batch(rtxn, *batch_uid)?;
}
@ -827,13 +786,13 @@ impl IndexScheduler {
match status {
// special case for Processing tasks
Status::Processing => {
status_tasks |= &processing_tasks;
status_tasks |= &*processing_tasks;
}
status => status_tasks |= &self.get_status(rtxn, *status)?,
};
}
if !status.contains(&Status::Processing) {
tasks -= &processing_tasks;
tasks -= &*processing_tasks;
}
tasks &= status_tasks;
}
@ -882,7 +841,7 @@ impl IndexScheduler {
// Once we have filtered the two subsets, we put them back together and assign it back to `tasks`.
tasks = {
let (mut filtered_non_processing_tasks, mut filtered_processing_tasks) =
(&tasks - &processing_tasks, &tasks & &processing_tasks);
(&tasks - &*processing_tasks, &tasks & &*processing_tasks);
// special case for Processing tasks
// A closure that clears the filtered_processing_tasks if their started_at date falls outside the given bounds
@ -1090,7 +1049,7 @@ impl IndexScheduler {
// Once we have filtered the two subsets, we put them back together and assign it back to `batches`.
batches = {
let (mut filtered_non_processing_batches, mut filtered_processing_batches) =
(&batches - &processing.processing, &batches & &processing.processing);
(&batches - &*processing.processing, &batches & &*processing.processing);
// special case for Processing batches
// A closure that clears the filtered_processing_batches if their started_at date falls outside the given bounds
@ -1606,7 +1565,8 @@ impl IndexScheduler {
// We reset the must_stop flag to be sure that we don't stop processing tasks
self.must_stop_processing.reset();
self.processing_tasks
let progress = self
.processing_tasks
.write()
.unwrap()
// We can clone the processing batch here because we don't want its modification to affect the view of the processing batches
@ -1619,11 +1579,12 @@ impl IndexScheduler {
let res = {
let cloned_index_scheduler = self.private_clone();
let processing_batch = &mut processing_batch;
let progress = progress.clone();
std::thread::scope(|s| {
let handle = std::thread::Builder::new()
.name(String::from("batch-operation"))
.spawn_scoped(s, move || {
cloned_index_scheduler.process_batch(batch, processing_batch)
cloned_index_scheduler.process_batch(batch, processing_batch, progress)
})
.unwrap();
handle.join().unwrap_or(Err(Error::ProcessBatchPanicked))
@ -1636,6 +1597,7 @@ impl IndexScheduler {
#[cfg(test)]
self.maybe_fail(tests::FailureLocation::AcquiringWtxn)?;
progress.update_progress(BatchProgress::WritingTasksToDisk);
processing_batch.finished();
let mut wtxn = self.env.write_txn().map_err(Error::HeedTransaction)?;
let mut canceled = RoaringBitmap::new();
@ -1645,12 +1607,15 @@ impl IndexScheduler {
#[cfg(test)]
self.breakpoint(Breakpoint::ProcessBatchSucceeded);
let (task_progress, task_progress_obj) = AtomicTaskStep::new(tasks.len() as u32);
progress.update_progress(task_progress_obj);
let mut success = 0;
let mut failure = 0;
let mut canceled_by = None;
#[allow(unused_variables)]
for (i, mut task) in tasks.into_iter().enumerate() {
task_progress.fetch_add(1, Ordering::Relaxed);
processing_batch.update(&mut task);
if task.status == Status::Canceled {
canceled.insert(task.uid);
@ -1678,9 +1643,10 @@ impl IndexScheduler {
tracing::info!("A batch of tasks was successfully completed with {success} successful tasks and {failure} failed tasks.");
}
// If we have an abortion error we must stop the tick here and re-schedule tasks.
Err(Error::Milli(milli::Error::InternalError(
milli::InternalError::AbortedIndexation,
)))
Err(Error::Milli {
error: milli::Error::InternalError(milli::InternalError::AbortedIndexation),
..
})
| Err(Error::AbortedTask) => {
#[cfg(test)]
self.breakpoint(Breakpoint::AbortedIndexation);
@ -1699,9 +1665,10 @@ impl IndexScheduler {
// 2. close the associated environment
// 3. resize it
// 4. re-schedule tasks
Err(Error::Milli(milli::Error::UserError(
milli::UserError::MaxDatabaseSizeReached,
))) if index_uid.is_some() => {
Err(Error::Milli {
error: milli::Error::UserError(milli::UserError::MaxDatabaseSizeReached),
..
}) if index_uid.is_some() => {
// fixme: add index_uid to match to avoid the unwrap
let index_uid = index_uid.unwrap();
// fixme: handle error more gracefully? not sure when this could happen
@ -1716,8 +1683,12 @@ impl IndexScheduler {
Err(err) => {
#[cfg(test)]
self.breakpoint(Breakpoint::ProcessBatchFailed);
let (task_progress, task_progress_obj) = AtomicTaskStep::new(ids.len() as u32);
progress.update_progress(task_progress_obj);
let error: ResponseError = err.into();
for id in ids.iter() {
task_progress.fetch_add(1, Ordering::Relaxed);
let mut task = self
.get_task(&wtxn, id)
.map_err(|e| Error::TaskDatabaseUpdate(Box::new(e)))?
@ -1943,6 +1914,7 @@ impl IndexScheduler {
// TODO: consider using a type alias or a struct embedder/template
pub fn embedders(
&self,
index_uid: String,
embedding_configs: Vec<IndexEmbeddingConfig>,
) -> Result<EmbeddingConfigs> {
let res: Result<_> = embedding_configs
@ -1953,8 +1925,12 @@ impl IndexScheduler {
config: milli::vector::EmbeddingConfig { embedder_options, prompt, quantized },
..
}| {
let prompt =
Arc::new(prompt.try_into().map_err(meilisearch_types::milli::Error::from)?);
let prompt = Arc::new(
prompt
.try_into()
.map_err(meilisearch_types::milli::Error::from)
.map_err(|err| Error::from_milli(err, Some(index_uid.clone())))?,
);
// optimistically return existing embedder
{
let embedders = self.embedders.read().unwrap();
@ -1970,7 +1946,9 @@ impl IndexScheduler {
let embedder = Arc::new(
Embedder::new(embedder_options.clone())
.map_err(meilisearch_types::milli::vector::Error::from)
.map_err(meilisearch_types::milli::Error::from)?,
.map_err(|err| {
Error::from_milli(err.into(), Some(index_uid.clone()))
})?,
);
{
let mut embedders = self.embedders.write().unwrap();
@ -2038,14 +2016,19 @@ impl<'a> Dump<'a> {
task: TaskDump,
content_file: Option<Box<UpdateFile>>,
) -> Result<Task> {
let task_has_no_docs = matches!(task.kind, KindDump::DocumentImport { documents_count, .. } if documents_count == 0);
let content_uuid = match content_file {
Some(content_file) if task.status == Status::Enqueued => {
let (uuid, mut file) = self.index_scheduler.create_update_file(false)?;
let mut builder = DocumentsBatchBuilder::new(&mut file);
let (uuid, file) = self.index_scheduler.create_update_file(false)?;
let mut writer = io::BufWriter::new(file);
for doc in content_file {
builder.append_json_object(&doc?)?;
let doc = doc?;
serde_json::to_writer(&mut writer, &doc).map_err(|e| {
Error::from_milli(milli::InternalError::SerdeJson(e).into(), None)
})?;
}
builder.into_inner()?;
let file = writer.into_inner().map_err(|e| e.into_error())?;
file.persist()?;
Some(uuid)
@ -2053,6 +2036,12 @@ impl<'a> Dump<'a> {
// If the task isn't `Enqueued` then just generate a recognisable `Uuid`
// in case we try to open it later.
_ if task.status != Status::Enqueued => Some(Uuid::nil()),
None if task.status == Status::Enqueued && task_has_no_docs => {
let (uuid, file) = self.index_scheduler.create_update_file(false)?;
file.persist()?;
Some(uuid)
}
_ => None,
};
@ -6171,7 +6160,7 @@ mod tests {
insta::assert_json_snapshot!(simple_hf_config.embedder_options);
let simple_hf_name = name.clone();
let configs = index_scheduler.embedders(configs).unwrap();
let configs = index_scheduler.embedders("doggos".to_string(), configs).unwrap();
let (hf_embedder, _, _) = configs.get(&simple_hf_name).unwrap();
let beagle_embed =
hf_embedder.embed_one(S("Intel the beagle best doggo"), None).unwrap();

View File

@ -0,0 +1,316 @@
use std::borrow::Cow;
use std::sync::Arc;
use enum_iterator::Sequence;
use meilisearch_types::milli::progress::{AtomicSubStep, NamedStep, Progress, ProgressView, Step};
use meilisearch_types::milli::{make_atomic_progress, make_enum_progress};
use roaring::RoaringBitmap;
use crate::utils::ProcessingBatch;
#[derive(Clone)]
pub struct ProcessingTasks {
pub batch: Option<Arc<ProcessingBatch>>,
/// The list of tasks ids that are currently running.
pub processing: Arc<RoaringBitmap>,
/// The progress on processing tasks
pub progress: Option<Progress>,
}
impl ProcessingTasks {
/// Creates an empty `ProcessingTasks` struct.
pub fn new() -> ProcessingTasks {
ProcessingTasks { batch: None, processing: Arc::new(RoaringBitmap::new()), progress: None }
}
pub fn get_progress_view(&self) -> Option<ProgressView> {
Some(self.progress.as_ref()?.as_progress_view())
}
/// Stores the currently processing tasks, and the date time at which it started.
pub fn start_processing(
&mut self,
processing_batch: ProcessingBatch,
processing: RoaringBitmap,
) -> Progress {
self.batch = Some(Arc::new(processing_batch));
self.processing = Arc::new(processing);
let progress = Progress::default();
progress.update_progress(BatchProgress::ProcessingTasks);
self.progress = Some(progress.clone());
progress
}
/// Set the processing tasks to an empty list
pub fn stop_processing(&mut self) -> Self {
self.progress = None;
Self {
batch: std::mem::take(&mut self.batch),
processing: std::mem::take(&mut self.processing),
progress: None,
}
}
/// Returns `true` if at least one of the currently processing tasks must be stopped.
pub fn must_cancel_processing_tasks(&self, canceled_tasks: &RoaringBitmap) -> bool {
!self.processing.is_disjoint(canceled_tasks)
}
}
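
A minimal lifecycle sketch, mirroring the unit tests at the bottom of this file: the scheduler keeps one `ProcessingTasks` behind a lock, hands the `Progress` returned by `start_processing` to the batch-processing thread, and renders `get_progress_view()` into the processing batch; `stop_processing` clears the progress for the next tick.

```rust
// Sketch only: ProcessingBatch::new(0) and the empty bitmap stand in for a real batch.
let mut processing = ProcessingTasks::new();
let progress = processing.start_processing(ProcessingBatch::new(0), RoaringBitmap::new());

// The batch thread pushes steps onto the shared Progress as it advances.
progress.update_progress(BatchProgress::WritingTasksToDisk);
assert!(processing.get_progress_view().is_some());

// Once the tick is over, the progress is dropped along with the processing set.
let _previous = processing.stop_processing();
assert!(processing.get_progress_view().is_none());
```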
make_enum_progress! {
pub enum BatchProgress {
ProcessingTasks,
WritingTasksToDisk,
}
}
make_enum_progress! {
pub enum TaskCancelationProgress {
RetrievingTasks,
UpdatingTasks,
}
}
make_enum_progress! {
pub enum TaskDeletionProgress {
DeletingTasksDateTime,
DeletingTasksMetadata,
DeletingTasks,
DeletingBatches,
}
}
make_enum_progress! {
pub enum SnapshotCreationProgress {
StartTheSnapshotCreation,
SnapshotTheIndexScheduler,
SnapshotTheUpdateFiles,
SnapshotTheIndexes,
SnapshotTheApiKeys,
CreateTheTarball,
}
}
make_enum_progress! {
pub enum DumpCreationProgress {
StartTheDumpCreation,
DumpTheApiKeys,
DumpTheTasks,
DumpTheIndexes,
DumpTheExperimentalFeatures,
CompressTheDump,
}
}
make_enum_progress! {
pub enum CreateIndexProgress {
CreatingTheIndex,
}
}
make_enum_progress! {
pub enum UpdateIndexProgress {
UpdatingTheIndex,
}
}
make_enum_progress! {
pub enum DeleteIndexProgress {
DeletingTheIndex,
}
}
make_enum_progress! {
pub enum SwappingTheIndexes {
EnsuringCorrectnessOfTheSwap,
SwappingTheIndexes,
}
}
make_enum_progress! {
pub enum InnerSwappingTwoIndexes {
RetrieveTheTasks,
UpdateTheTasks,
UpdateTheIndexesMetadata,
}
}
make_enum_progress! {
pub enum DocumentOperationProgress {
RetrievingConfig,
ComputingDocumentChanges,
Indexing,
}
}
make_enum_progress! {
pub enum DocumentEditionProgress {
RetrievingConfig,
ComputingDocumentChanges,
Indexing,
}
}
make_enum_progress! {
pub enum DocumentDeletionProgress {
RetrievingConfig,
DeleteDocuments,
Indexing,
}
}
make_enum_progress! {
pub enum SettingsProgress {
RetrievingAndMergingTheSettings,
ApplyTheSettings,
}
}
make_atomic_progress!(Task alias AtomicTaskStep => "task" );
make_atomic_progress!(Document alias AtomicDocumentStep => "document" );
make_atomic_progress!(Batch alias AtomicBatchStep => "batch" );
make_atomic_progress!(UpdateFile alias AtomicUpdateFileStep => "update file" );
pub struct VariableNameStep {
name: String,
current: u32,
total: u32,
}
impl VariableNameStep {
pub fn new(name: impl Into<String>, current: u32, total: u32) -> Self {
Self { name: name.into(), current, total }
}
}
impl Step for VariableNameStep {
fn name(&self) -> Cow<'static, str> {
self.name.clone().into()
}
fn current(&self) -> u32 {
self.current
}
fn total(&self) -> u32 {
self.total
}
}
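
For illustration, a short sketch of how the different kinds of steps compose on one `Progress` (the index uids are hypothetical; the pattern matches the snapshot, swap, and deletion code above): an enum step names the phase, a `VariableNameStep` names the current index, and an atomic step counts homogeneous units of work.

```rust
// Sketch only; Progress::default() is how start_processing seeds the tracker.
let progress = Progress::default();
progress.update_progress(SnapshotCreationProgress::SnapshotTheIndexes);

let index_uids = ["movies", "doggos"]; // hypothetical index uids
for (i, uid) in index_uids.iter().enumerate() {
    // One dynamically named sub-step per index being snapshotted.
    progress.update_progress(VariableNameStep::new(*uid, i as u32, index_uids.len() as u32));
}

// Atomic sub-steps count tasks, documents, batches, or update files.
let (counter, step) = AtomicTaskStep::new(42);
progress.update_progress(step);
counter.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
```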
#[cfg(test)]
mod test {
use std::sync::atomic::Ordering;
use meili_snap::{json_string, snapshot};
use super::*;
#[test]
fn one_level() {
let mut processing = ProcessingTasks::new();
processing.start_processing(ProcessingBatch::new(0), RoaringBitmap::new());
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "processing tasks",
"finished": 0,
"total": 2
}
],
"percentage": 0.0
}
"#);
processing.progress.as_ref().unwrap().update_progress(BatchProgress::WritingTasksToDisk);
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "writing tasks to disk",
"finished": 1,
"total": 2
}
],
"percentage": 50.0
}
"#);
}
#[test]
fn task_progress() {
let mut processing = ProcessingTasks::new();
processing.start_processing(ProcessingBatch::new(0), RoaringBitmap::new());
let (atomic, tasks) = AtomicTaskStep::new(10);
processing.progress.as_ref().unwrap().update_progress(tasks);
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "processing tasks",
"finished": 0,
"total": 2
},
{
"currentStep": "task",
"finished": 0,
"total": 10
}
],
"percentage": 0.0
}
"#);
atomic.fetch_add(6, Ordering::Relaxed);
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "processing tasks",
"finished": 0,
"total": 2
},
{
"currentStep": "task",
"finished": 6,
"total": 10
}
],
"percentage": 30.000002
}
"#);
processing.progress.as_ref().unwrap().update_progress(BatchProgress::WritingTasksToDisk);
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "writing tasks to disk",
"finished": 1,
"total": 2
}
],
"percentage": 50.0
}
"#);
let (atomic, tasks) = AtomicTaskStep::new(5);
processing.progress.as_ref().unwrap().update_progress(tasks);
atomic.fetch_add(4, Ordering::Relaxed);
snapshot!(json_string!(processing.get_progress_view()), @r#"
{
"steps": [
{
"currentStep": "writing tasks to disk",
"finished": 1,
"total": 2
},
{
"currentStep": "task",
"finished": 4,
"total": 5
}
],
"percentage": 90.0
}
"#);
}
}

View File

@ -9,8 +9,8 @@ source: crates/index-scheduler/src/lib.rs
0 {uid: 0, batch_uid: 0, status: succeeded, details: { settings: Settings { displayed_attributes: WildcardSetting(NotSet), searchable_attributes: WildcardSetting(NotSet), filterable_attributes: Set({"catto"}), sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: NotSet, search_cutoff_ms: NotSet, localized_attributes: NotSet, facet_search: NotSet, prefix_search: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> } }, kind: SettingsUpdate { index_uid: "doggos", new_settings: Settings { displayed_attributes: WildcardSetting(NotSet), searchable_attributes: WildcardSetting(NotSet), filterable_attributes: Set({"catto"}), sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: NotSet, search_cutoff_ms: NotSet, localized_attributes: NotSet, facet_search: NotSet, prefix_search: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> }, is_deletion: false, allow_index_creation: true }}
1 {uid: 1, batch_uid: 1, status: succeeded, details: { received_documents: 3, indexed_documents: Some(3) }, kind: DocumentAdditionOrUpdate { index_uid: "doggos", primary_key: Some("id"), method: ReplaceDocuments, content_file: 00000000-0000-0000-0000-000000000000, documents_count: 3, allow_index_creation: true }}
2 {uid: 2, batch_uid: 2, status: succeeded, details: { received_document_ids: 1, deleted_documents: Some(1) }, kind: DocumentDeletion { index_uid: "doggos", documents_ids: ["1"] }}
3 {uid: 3, batch_uid: 2, status: failed, error: ResponseError { code: 200, message: "Invalid type for filter subexpression: expected: String, Array, found: true.", error_code: "invalid_document_filter", error_type: "invalid_request", error_link: "https://docs.meilisearch.com/errors#invalid_document_filter" }, details: { original_filter: true, deleted_documents: Some(0) }, kind: DocumentDeletionByFilter { index_uid: "doggos", filter_expr: Bool(true) }}
4 {uid: 4, batch_uid: 2, status: failed, error: ResponseError { code: 200, message: "Attribute `id` is not filterable. Available filterable attributes are: `catto`.\n1:3 id = 2", error_code: "invalid_document_filter", error_type: "invalid_request", error_link: "https://docs.meilisearch.com/errors#invalid_document_filter" }, details: { original_filter: "id = 2", deleted_documents: Some(0) }, kind: DocumentDeletionByFilter { index_uid: "doggos", filter_expr: String("id = 2") }}
3 {uid: 3, batch_uid: 2, status: failed, error: ResponseError { code: 200, message: "Index `doggos`: Invalid type for filter subexpression: expected: String, Array, found: true.", error_code: "invalid_document_filter", error_type: "invalid_request", error_link: "https://docs.meilisearch.com/errors#invalid_document_filter" }, details: { original_filter: true, deleted_documents: Some(0) }, kind: DocumentDeletionByFilter { index_uid: "doggos", filter_expr: Bool(true) }}
4 {uid: 4, batch_uid: 2, status: failed, error: ResponseError { code: 200, message: "Index `doggos`: Attribute `id` is not filterable. Available filterable attributes are: `catto`.\n1:3 id = 2", error_code: "invalid_document_filter", error_type: "invalid_request", error_link: "https://docs.meilisearch.com/errors#invalid_document_filter" }, details: { original_filter: "id = 2", deleted_documents: Some(0) }, kind: DocumentDeletionByFilter { index_uid: "doggos", filter_expr: String("id = 2") }}
5 {uid: 5, batch_uid: 2, status: succeeded, details: { original_filter: "catto EXISTS", deleted_documents: Some(1) }, kind: DocumentDeletionByFilter { index_uid: "doggos", filter_expr: String("catto EXISTS") }}
----------------------------------------------------------------------
### Status:

View File

@ -134,6 +134,7 @@ impl ProcessingBatch {
pub fn to_batch(&self) -> Batch {
Batch {
uid: self.uid,
progress: None,
details: self.details.clone(),
stats: self.stats.clone(),
started_at: self.started_at,
@ -187,6 +188,7 @@ impl IndexScheduler {
&batch.uid,
&Batch {
uid: batch.uid,
progress: None,
details: batch.details,
stats: batch.stats,
started_at: batch.started_at,
@ -273,7 +275,9 @@ impl IndexScheduler {
.into_iter()
.map(|batch_id| {
if Some(batch_id) == processing.batch.as_ref().map(|batch| batch.uid) {
Ok(processing.batch.as_ref().unwrap().to_batch())
let mut batch = processing.batch.as_ref().unwrap().to_batch();
batch.progress = processing.get_progress_view();
Ok(batch)
} else {
self.get_batch(rtxn, batch_id)
.and_then(|task| task.ok_or(Error::CorruptedTaskQueue))
@ -287,7 +291,10 @@ impl IndexScheduler {
debug_assert!(old_task != *task);
debug_assert_eq!(old_task.uid, task.uid);
debug_assert!(old_task.batch_uid.is_none() && task.batch_uid.is_some());
debug_assert!(
old_task.batch_uid.is_none() && task.batch_uid.is_some(),
"\n==> old: {old_task:?}\n==> new: {task:?}"
);
if old_task.status != task.status {
self.update_status(wtxn, old_task.status, |bitmap| {

View File

@ -24,8 +24,9 @@ flate2 = "1.0.30"
fst = "0.4.7"
memmap2 = "0.9.4"
milli = { path = "../milli" }
raw-collections = { git = "https://github.com/meilisearch/raw-collections.git", version = "0.1.0" }
bumparaw-collections = "0.1.2"
roaring = { version = "0.10.7", features = ["serde"] }
rustc-hash = "2.1.0"
serde = { version = "1.0.204", features = ["derive"] }
serde-cs = "0.2.4"
serde_json = "1.0.120"

View File

@ -1,16 +1,16 @@
use milli::progress::ProgressView;
use serde::Serialize;
use time::{Duration, OffsetDateTime};
use crate::{
batches::{Batch, BatchId, BatchStats},
task_view::DetailsView,
tasks::serialize_duration,
};
use crate::batches::{Batch, BatchId, BatchStats};
use crate::task_view::DetailsView;
use crate::tasks::serialize_duration;
#[derive(Debug, Clone, Serialize)]
#[serde(rename_all = "camelCase")]
pub struct BatchView {
pub uid: BatchId,
pub progress: Option<ProgressView>,
pub details: DetailsView,
pub stats: BatchStats,
#[serde(serialize_with = "serialize_duration", default)]
@ -25,6 +25,7 @@ impl BatchView {
pub fn from_batch(batch: &Batch) -> Self {
Self {
uid: batch.uid,
progress: batch.progress.clone(),
details: batch.details.clone(),
stats: batch.stats.clone(),
duration: batch.finished_at.map(|finished_at| finished_at - batch.started_at),

View File

@ -1,12 +1,11 @@
use std::collections::BTreeMap;
use milli::progress::ProgressView;
use serde::{Deserialize, Serialize};
use time::OffsetDateTime;
use crate::{
task_view::DetailsView,
tasks::{Kind, Status},
};
use crate::task_view::DetailsView;
use crate::tasks::{Kind, Status};
pub type BatchId = u32;
@ -15,6 +14,8 @@ pub type BatchId = u32;
pub struct Batch {
pub uid: BatchId,
#[serde(skip)]
pub progress: Option<ProgressView>,
pub details: DetailsView,
pub stats: BatchStats,
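
Not part of the diff: a minimal sketch of the pattern used here, where the stored type skips the field during serialization and a separate view type exposes it (so batches without runtime progress serialize `"progress": null`). The `StoredBatch`/`BatchViewSketch` names are illustrative only; the sketch assumes `serde` (with derive) and `serde_json` as dependencies.

use serde::Serialize;

// Stored representation: progress is runtime-only state and is never persisted.
#[derive(Serialize)]
struct StoredBatch {
    uid: u32,
    #[serde(skip)]
    progress: Option<String>, // stand-in for milli's ProgressView
}

// API representation: progress is serialized, appearing as `null` when absent.
#[derive(Serialize)]
struct BatchViewSketch {
    uid: u32,
    progress: Option<String>,
}

fn main() {
    let stored = StoredBatch { uid: 0, progress: Some("indexing documents".into()) };
    // The view copies the runtime progress back in before serialization.
    let view = BatchViewSketch { uid: stored.uid, progress: stored.progress.clone() };
    // Prints {"uid":0,"progress":"indexing documents"}; a finished batch prints "progress":null.
    println!("{}", serde_json::to_string(&view).unwrap());
}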

View File

@ -4,10 +4,11 @@ use std::io::{self, BufWriter};
use std::marker::PhantomData;
use bumpalo::Bump;
use bumparaw_collections::RawMap;
use memmap2::Mmap;
use milli::documents::Error;
use milli::Object;
use raw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use serde::de::{SeqAccess, Visitor};
use serde::{Deserialize, Deserializer};
use serde_json::error::Category;
@ -220,7 +221,7 @@ pub fn read_json(input: &File, output: impl io::Write) -> Result<u64> {
let mut deserializer = serde_json::Deserializer::from_slice(&input);
let res = array_each(&mut deserializer, |obj: &RawValue| {
doc_alloc.reset();
let map = RawMap::from_raw_value(obj, &doc_alloc)?;
let map = RawMap::from_raw_value_and_hasher(obj, FxBuildHasher, &doc_alloc)?;
to_writer(&mut out, &map)
});
let count = match res {
@ -250,26 +251,25 @@ pub fn read_json(input: &File, output: impl io::Write) -> Result<u64> {
}
}
/// Reads NDJSON from file and write it in NDJSON in a file checking it along the way.
pub fn read_ndjson(input: &File, output: impl io::Write) -> Result<u64> {
/// Reads NDJSON from file and checks it.
pub fn read_ndjson(input: &File) -> Result<u64> {
// We memory map the input to be able to deserialize into a RawMap that
// does not allocate when possible and only materializes the first/top level.
let input = unsafe { Mmap::map(input).map_err(DocumentFormatError::Io)? };
let mut output = BufWriter::new(output);
let mut bump = Bump::with_capacity(1024 * 1024);
let mut count = 0;
for result in serde_json::Deserializer::from_slice(&input).into_iter() {
bump.reset();
count += 1;
result
.and_then(|raw: &RawValue| {
match result {
Ok(raw) => {
// try to deserialize as a map
let map = RawMap::from_raw_value(raw, &bump)?;
to_writer(&mut output, &map)
})
.map_err(|e| DocumentFormatError::from((PayloadType::Ndjson, e)))?;
RawMap::from_raw_value_and_hasher(raw, FxBuildHasher, &bump)
.map_err(|e| DocumentFormatError::from((PayloadType::Ndjson, e)))?;
count += 1;
}
Err(e) => return Err(DocumentFormatError::from((PayloadType::Ndjson, e))),
}
}
Ok(count)
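
Not part of the diff: a simplified sketch of the same validate-and-count idea, assuming one JSON document per line, which is stricter than the streaming deserializer used above and allocates where the real code relies on a bump-allocated RawMap. It assumes `serde_json` as a dependency.

use std::fs::File;
use std::io::{BufRead, BufReader};

/// Counts the documents in an NDJSON file, failing on the first malformed line.
fn count_ndjson_docs(file: &File) -> Result<u64, Box<dyn std::error::Error>> {
    let mut count = 0;
    for line in BufReader::new(file).lines() {
        let line = line?; // propagate I/O errors
        if line.trim().is_empty() {
            continue; // tolerate blank lines
        }
        // Fully parse the line as a JSON object; the real code avoids this
        // allocation by deserializing into a bump-allocated map instead.
        let _doc: serde_json::Map<String, serde_json::Value> = serde_json::from_str(&line)?;
        count += 1;
    }
    Ok(count)
}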

View File

@ -550,7 +550,7 @@ impl fmt::Display for deserr_codes::InvalidSimilarId {
"the value of `id` is invalid. \
A document identifier can be of type integer or string, \
only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), \
and can not be more than 512 bytes."
and can not be more than 511 bytes."
)
}
}

View File

@ -4,7 +4,6 @@ use std::fmt::{Display, Write};
use std::str::FromStr;
use enum_iterator::Sequence;
use milli::update::new::indexer::document_changes::Progress;
use milli::update::IndexDocumentsMethod;
use milli::Object;
use roaring::RoaringBitmap;
@ -41,62 +40,6 @@ pub struct Task {
pub kind: KindWithContent,
}
#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct TaskProgress {
pub current_step: &'static str,
pub finished_steps: u16,
pub total_steps: u16,
pub finished_substeps: Option<u32>,
pub total_substeps: Option<u32>,
}
impl Default for TaskProgress {
fn default() -> Self {
Self::new()
}
}
impl TaskProgress {
pub fn new() -> Self {
Self {
current_step: "start",
finished_steps: 0,
total_steps: 1,
finished_substeps: None,
total_substeps: None,
}
}
pub fn update(&mut self, progress: Progress) -> TaskProgress {
if self.finished_steps > progress.finished_steps {
return *self;
}
if self.current_step != progress.step_name {
self.current_step = progress.step_name
}
self.total_steps = progress.total_steps;
if self.finished_steps < progress.finished_steps {
self.finished_substeps = None;
self.total_substeps = None;
}
self.finished_steps = progress.finished_steps;
if let Some((finished_substeps, total_substeps)) = progress.finished_total_substep {
if let Some(task_finished_substeps) = self.finished_substeps {
if task_finished_substeps > finished_substeps {
return *self;
}
}
self.finished_substeps = Some(finished_substeps);
self.total_substeps = Some(total_substeps);
}
*self
}
}
impl Task {
pub fn index_uid(&self) -> Option<&str> {
use KindWithContent::*;

View File

@ -4,6 +4,7 @@ use byte_unit::{Byte, UnitType};
use meilisearch_types::document_formats::{DocumentFormatError, PayloadType};
use meilisearch_types::error::{Code, ErrorCode, ResponseError};
use meilisearch_types::index_uid::{IndexUid, IndexUidFormatError};
use meilisearch_types::milli;
use meilisearch_types::milli::OrderBy;
use serde_json::Value;
use tokio::task::JoinError;
@ -62,8 +63,11 @@ pub enum MeilisearchHttpError {
HeedError(#[from] meilisearch_types::heed::Error),
#[error(transparent)]
IndexScheduler(#[from] index_scheduler::Error),
#[error(transparent)]
Milli(#[from] meilisearch_types::milli::Error),
#[error("{}", match .index_name {
Some(name) if !name.is_empty() => format!("Index `{}`: {error}", name),
_ => format!("{error}")
})]
Milli { error: milli::Error, index_name: Option<String> },
#[error(transparent)]
Payload(#[from] PayloadError),
#[error(transparent)]
@ -76,6 +80,12 @@ pub enum MeilisearchHttpError {
MissingSearchHybrid,
}
impl MeilisearchHttpError {
pub(crate) fn from_milli(error: milli::Error, index_name: Option<String>) -> Self {
Self::Milli { error, index_name }
}
}
impl ErrorCode for MeilisearchHttpError {
fn error_code(&self) -> Code {
match self {
@ -95,7 +105,7 @@ impl ErrorCode for MeilisearchHttpError {
MeilisearchHttpError::SerdeJson(_) => Code::Internal,
MeilisearchHttpError::HeedError(_) => Code::Internal,
MeilisearchHttpError::IndexScheduler(e) => e.error_code(),
MeilisearchHttpError::Milli(e) => e.error_code(),
MeilisearchHttpError::Milli { error, .. } => error.error_code(),
MeilisearchHttpError::Payload(e) => e.error_code(),
MeilisearchHttpError::FileStore(_) => Code::Internal,
MeilisearchHttpError::DocumentFormat(e) => e.error_code(),
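
Not part of the diff: a minimal sketch of the message-prefixing behaviour introduced by the `Milli { error, index_name }` variant, using a plain `&str` in place of `milli::Error`.

/// Formats an error message, prefixing it with the index name when one is known.
/// Mirrors the `Display` logic of the new `Milli { error, index_name }` variant above.
fn prefixed_message(error: &str, index_name: Option<&str>) -> String {
    match index_name {
        Some(name) if !name.is_empty() => format!("Index `{name}`: {error}"),
        _ => error.to_string(),
    }
}

#[test]
fn prefixes_only_when_the_index_is_known() {
    assert_eq!(
        prefixed_message("Attribute `id` is not filterable.", Some("doggos")),
        "Index `doggos`: Attribute `id` is not filterable."
    );
    assert_eq!(prefixed_message("internal error", None), "internal error");
}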

View File

@ -395,6 +395,7 @@ fn import_dump(
for index_reader in dump_reader.indexes()? {
let mut index_reader = index_reader?;
let metadata = index_reader.metadata();
let uid = metadata.uid.clone();
tracing::info!("Importing index `{}`.", metadata.uid);
let date = Some((metadata.created_at, metadata.updated_at));
@ -432,7 +433,7 @@ fn import_dump(
let reader = DocumentsBatchReader::from_reader(reader)?;
let embedder_configs = index.embedding_configs(&wtxn)?;
let embedders = index_scheduler.embedders(embedder_configs)?;
let embedders = index_scheduler.embedders(uid, embedder_configs)?;
let builder = milli::update::IndexDocuments::new(
&mut wtxn,

View File

@ -129,6 +129,11 @@ async fn try_main() -> anyhow::Result<()> {
print_launch_resume(&opt, analytics.clone(), config_read_from);
tokio::spawn(async move {
tokio::signal::ctrl_c().await.unwrap();
std::process::exit(130);
});
run_http(index_scheduler, auth_controller, opt, log_handle, Arc::new(analytics)).await?;
Ok(())

View File

@ -654,9 +654,8 @@ impl Opt {
#[derive(Debug, Default, Clone, Parser, Deserialize)]
pub struct IndexerOpts {
/// Specifies the maximum resident memory that Meilisearch can use for indexing.
/// By default, Meilisearch limits the RAM usage to 5% of the total available memory.
/// Note that the underlying store utilizes memory-mapping and makes use of the rest.
/// Sets the maximum amount of RAM Meilisearch can use when indexing. By default, Meilisearch
/// uses no more than two thirds of available memory.
#[clap(long, env = MEILI_MAX_INDEXING_MEMORY, default_value_t)]
#[serde(default)]
pub max_indexing_memory: MaxMemory,
@ -715,7 +714,7 @@ impl TryFrom<&IndexerOpts> for IndexerConfig {
}
}
/// A type used to detect the max resident memory available and use 5% of it.
/// A type used to detect the max memory available and use 2/3 of it.
#[derive(Debug, Clone, Copy, Deserialize, Serialize)]
pub struct MaxMemory(Option<Byte>);
@ -729,7 +728,7 @@ impl FromStr for MaxMemory {
impl Default for MaxMemory {
fn default() -> MaxMemory {
MaxMemory(total_memory_bytes().map(|bytes| bytes * 5 / 100).map(Byte::from_u64))
MaxMemory(total_memory_bytes().map(|bytes| bytes * 2 / 3).map(Byte::from_u64))
}
}
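
Not part of the diff: a sketch of the new default, which caps indexing memory at two thirds of the detected total and falls back to no limit when detection fails. `detect_total_memory_bytes` is a hypothetical stand-in for the `total_memory_bytes()` helper used above.

/// Hypothetical stand-in for `total_memory_bytes()`; returns `None` when the
/// platform does not expose the total memory.
fn detect_total_memory_bytes() -> Option<u64> {
    None // assume detection is unavailable in this sketch
}

/// Default indexing memory budget: two thirds of total memory, when known.
fn default_max_indexing_memory() -> Option<u64> {
    detect_total_memory_bytes().map(|bytes| bytes * 2 / 3)
}

fn main() {
    match default_max_indexing_memory() {
        Some(limit) => println!("indexing memory capped at {limit} bytes"),
        None => println!("total memory unknown; no indexing memory cap applied"),
    }
}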

View File

@ -1,18 +1,18 @@
use actix_web::{
web::{self, Data},
HttpResponse,
};
use actix_web::web::{self, Data};
use actix_web::HttpResponse;
use deserr::actix_web::AwebQueryParameter;
use index_scheduler::{IndexScheduler, Query};
use meilisearch_types::{
batch_view::BatchView, batches::BatchId, deserr::DeserrQueryParamError, error::ResponseError,
keys::actions,
};
use meilisearch_types::batch_view::BatchView;
use meilisearch_types::batches::BatchId;
use meilisearch_types::deserr::DeserrQueryParamError;
use meilisearch_types::error::ResponseError;
use meilisearch_types::keys::actions;
use serde::Serialize;
use crate::extractors::{authentication::GuardedData, sequential_extractor::SeqHandler};
use super::{tasks::TasksFilterQuery, ActionPolicy};
use super::tasks::TasksFilterQuery;
use super::ActionPolicy;
use crate::extractors::authentication::GuardedData;
use crate::extractors::sequential_extractor::SeqHandler;
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(web::resource("").route(web::get().to(SeqHandler(get_batches))))

View File

@ -1,5 +1,5 @@
use std::collections::HashSet;
use std::io::ErrorKind;
use std::io::{ErrorKind, Seek as _};
use std::marker::PhantomData;
use actix_web::http::header::CONTENT_TYPE;
@ -572,7 +572,7 @@ async fn document_addition(
index_uid: IndexUid,
primary_key: Option<String>,
csv_delimiter: Option<u8>,
mut body: Payload,
body: Payload,
method: IndexDocumentsMethod,
task_id: Option<TaskId>,
dry_run: bool,
@ -609,54 +609,60 @@ async fn document_addition(
};
let (uuid, mut update_file) = index_scheduler.create_update_file(dry_run)?;
let documents_count = match format {
PayloadType::Ndjson => {
let (path, file) = update_file.into_parts();
let file = match file {
Some(file) => {
let (file, path) = file.into_parts();
let mut file = copy_body_to_file(file, body, format).await?;
file.rewind().map_err(|e| {
index_scheduler::Error::FileStore(file_store::Error::IoError(e))
})?;
Some(tempfile::NamedTempFile::from_parts(file, path))
}
None => None,
};
let temp_file = match tempfile() {
Ok(file) => file,
Err(e) => return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e)))),
let documents_count = tokio::task::spawn_blocking(move || {
let documents_count = file.as_ref().map_or(Ok(0), |ntf| {
read_ndjson(ntf.as_file()).map_err(MeilisearchHttpError::DocumentFormat)
})?;
let update_file = file_store::File::from_parts(path, file);
update_file.persist()?;
Ok(documents_count)
})
.await?;
Ok(documents_count)
}
PayloadType::Json | PayloadType::Csv { delimiter: _ } => {
let temp_file = match tempfile() {
Ok(file) => file,
Err(e) => return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e)))),
};
let read_file = copy_body_to_file(temp_file, body, format).await?;
tokio::task::spawn_blocking(move || {
let documents_count = match format {
PayloadType::Json => read_json(&read_file, &mut update_file)?,
PayloadType::Csv { delimiter } => {
read_csv(&read_file, &mut update_file, delimiter)?
}
PayloadType::Ndjson => {
unreachable!("We already wrote the user content into the update file")
}
};
// we NEED to persist the file here because we moved the `update_file` into another task.
update_file.persist()?;
Ok(documents_count)
})
.await
}
};
let async_file = File::from_std(temp_file);
let mut buffer = BufWriter::new(async_file);
let mut buffer_write_size: usize = 0;
while let Some(result) = body.next().await {
let byte = result?;
if byte.is_empty() && buffer_write_size == 0 {
return Err(MeilisearchHttpError::MissingPayload(format));
}
match buffer.write_all(&byte).await {
Ok(()) => buffer_write_size += 1,
Err(e) => return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e)))),
}
}
if let Err(e) = buffer.flush().await {
return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e))));
}
if buffer_write_size == 0 {
return Err(MeilisearchHttpError::MissingPayload(format));
}
if let Err(e) = buffer.seek(std::io::SeekFrom::Start(0)).await {
return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e))));
}
let read_file = buffer.into_inner().into_std().await;
let documents_count = tokio::task::spawn_blocking(move || {
let documents_count = match format {
PayloadType::Json => read_json(&read_file, &mut update_file)?,
PayloadType::Csv { delimiter } => read_csv(&read_file, &mut update_file, delimiter)?,
PayloadType::Ndjson => read_ndjson(&read_file, &mut update_file)?,
};
// we NEED to persist the file here because we moved the `update_file` into another task.
update_file.persist()?;
Ok(documents_count)
})
.await;
let documents_count = match documents_count {
Ok(Ok(documents_count)) => documents_count,
// in this case the file could not possibly have been persisted.
@ -703,6 +709,39 @@ async fn document_addition(
Ok(task.into())
}
async fn copy_body_to_file(
output: std::fs::File,
mut body: Payload,
format: PayloadType,
) -> Result<std::fs::File, MeilisearchHttpError> {
let async_file = File::from_std(output);
let mut buffer = BufWriter::new(async_file);
let mut buffer_write_size: usize = 0;
while let Some(result) = body.next().await {
let byte = result?;
if byte.is_empty() && buffer_write_size == 0 {
return Err(MeilisearchHttpError::MissingPayload(format));
}
match buffer.write_all(&byte).await {
Ok(()) => buffer_write_size += 1,
Err(e) => return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e)))),
}
}
if let Err(e) = buffer.flush().await {
return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e))));
}
if buffer_write_size == 0 {
return Err(MeilisearchHttpError::MissingPayload(format));
}
if let Err(e) = buffer.seek(std::io::SeekFrom::Start(0)).await {
return Err(MeilisearchHttpError::Payload(ReceivePayload(Box::new(e))));
}
let read_file = buffer.into_inner().into_std().await;
Ok(read_file)
}
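
Not part of the diff: a minimal sketch of the overall hand-off this refactor enables for NDJSON payloads — write the received body to a temporary file, rewind it, then validate it on a blocking thread so the async runtime stays free. It assumes `tokio` and `tempfile` as dependencies; `validate_ndjson` is a hypothetical stand-in for `read_ndjson`.

use std::io::{BufRead, BufReader, Seek, SeekFrom, Write};

/// Hypothetical stand-in for `read_ndjson`: counts non-empty lines.
fn validate_ndjson(file: &std::fs::File) -> std::io::Result<u64> {
    let mut count = 0;
    for line in BufReader::new(file).lines() {
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Write the (already received) payload to a temporary file on the async side...
    let mut file = tempfile::tempfile()?;
    file.write_all(b"{\"id\":1}\n{\"id\":2}\n")?;
    file.seek(SeekFrom::Start(0))?; // rewind before handing it to the validator
    // ...then move the std::fs::File to a blocking thread for validation.
    let count = tokio::task::spawn_blocking(move || validate_ndjson(&file))
        .await
        .expect("validation task panicked")?;
    println!("validated {count} documents");
    Ok(())
}
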
pub async fn delete_documents_batch(
index_scheduler: GuardedData<ActionPolicy<{ actions::DOCUMENTS_DELETE }>, Data<IndexScheduler>>,
index_uid: web::Path<String>,

View File

@ -185,7 +185,8 @@ pub async fn search(
let index = index_scheduler.index(&index_uid)?;
let features = index_scheduler.features();
let search_kind = search_kind(&search_query, &index_scheduler, &index, features)?;
let search_kind =
search_kind(&search_query, &index_scheduler, index_uid.to_string(), &index, features)?;
let permit = search_queue.try_get_search_permit().await?;
let search_result = tokio::task::spawn_blocking(move || {
perform_facet_search(

View File

@ -5,7 +5,7 @@ use actix_web::web::Data;
use actix_web::{web, HttpRequest, HttpResponse};
use deserr::actix_web::{AwebJson, AwebQueryParameter};
use deserr::{DeserializeError, Deserr, ValuePointerRef};
use index_scheduler::IndexScheduler;
use index_scheduler::{Error, IndexScheduler};
use meilisearch_types::deserr::query_params::Param;
use meilisearch_types::deserr::{immutable_field_error, DeserrJsonError, DeserrQueryParamError};
use meilisearch_types::error::deserr_codes::*;
@ -107,7 +107,10 @@ pub async fn list_indexes(
if !filters.is_index_authorized(uid) {
return Ok(None);
}
Ok(Some(IndexView::new(uid.to_string(), index)?))
Ok(Some(
IndexView::new(uid.to_string(), index)
.map_err(|e| Error::from_milli(e, Some(uid.to_string())))?,
))
})?;
// Won't cause to open all indexes because IndexView doesn't keep the `Index` opened.
let indexes: Vec<IndexView> = indexes.into_iter().flatten().collect();

View File

@ -243,11 +243,19 @@ pub async fn search_with_url_query(
let index = index_scheduler.index(&index_uid)?;
let features = index_scheduler.features();
let search_kind = search_kind(&query, index_scheduler.get_ref(), &index, features)?;
let search_kind =
search_kind(&query, index_scheduler.get_ref(), index_uid.to_string(), &index, features)?;
let retrieve_vector = RetrieveVectors::new(query.retrieve_vectors, features)?;
let permit = search_queue.try_get_search_permit().await?;
let search_result = tokio::task::spawn_blocking(move || {
perform_search(&index, query, search_kind, retrieve_vector, index_scheduler.features())
perform_search(
index_uid.to_string(),
&index,
query,
search_kind,
retrieve_vector,
index_scheduler.features(),
)
})
.await;
permit.drop().await;
@ -287,12 +295,20 @@ pub async fn search_with_post(
let features = index_scheduler.features();
let search_kind = search_kind(&query, index_scheduler.get_ref(), &index, features)?;
let search_kind =
search_kind(&query, index_scheduler.get_ref(), index_uid.to_string(), &index, features)?;
let retrieve_vectors = RetrieveVectors::new(query.retrieve_vectors, features)?;
let permit = search_queue.try_get_search_permit().await?;
let search_result = tokio::task::spawn_blocking(move || {
perform_search(&index, query, search_kind, retrieve_vectors, index_scheduler.features())
perform_search(
index_uid.to_string(),
&index,
query,
search_kind,
retrieve_vectors,
index_scheduler.features(),
)
})
.await;
permit.drop().await;
@ -314,6 +330,7 @@ pub async fn search_with_post(
pub fn search_kind(
query: &SearchQuery,
index_scheduler: &IndexScheduler,
index_uid: String,
index: &milli::Index,
features: RoFeatures,
) -> Result<SearchKind, ResponseError> {
@ -332,7 +349,7 @@ pub fn search_kind(
(None, _, None) => Ok(SearchKind::KeywordOnly),
// hybrid.semantic_ratio == 1.0 => vector
(_, Some(HybridQuery { semantic_ratio, embedder }), v) if **semantic_ratio == 1.0 => {
SearchKind::semantic(index_scheduler, index, embedder, v.map(|v| v.len()))
SearchKind::semantic(index_scheduler, index_uid, index, embedder, v.map(|v| v.len()))
}
// hybrid.semantic_ratio == 0.0 => keyword
(_, Some(HybridQuery { semantic_ratio, embedder: _ }), _) if **semantic_ratio == 0.0 => {
@ -340,13 +357,14 @@ pub fn search_kind(
}
// no query, hybrid, vector => semantic
(None, Some(HybridQuery { semantic_ratio: _, embedder }), Some(v)) => {
SearchKind::semantic(index_scheduler, index, embedder, Some(v.len()))
SearchKind::semantic(index_scheduler, index_uid, index, embedder, Some(v.len()))
}
// query, no hybrid, no vector => keyword
(Some(_), None, None) => Ok(SearchKind::KeywordOnly),
// query, hybrid, maybe vector => hybrid
(Some(_), Some(HybridQuery { semantic_ratio, embedder }), v) => SearchKind::hybrid(
index_scheduler,
index_uid,
index,
embedder,
**semantic_ratio,

View File

@ -103,8 +103,13 @@ async fn similar(
let index = index_scheduler.index(&index_uid)?;
let (embedder_name, embedder, quantized) =
SearchKind::embedder(&index_scheduler, &index, &query.embedder, None)?;
let (embedder_name, embedder, quantized) = SearchKind::embedder(
&index_scheduler,
index_uid.to_string(),
&index,
&query.embedder,
None,
)?;
tokio::task::spawn_blocking(move || {
perform_similar(

View File

@ -125,14 +125,28 @@ pub async fn multi_search_with_post(
})
.with_index(query_index)?;
let search_kind =
search_kind(&query, index_scheduler.get_ref(), &index, features)
.with_index(query_index)?;
let index_uid_str = index_uid.to_string();
let search_kind = search_kind(
&query,
index_scheduler.get_ref(),
index_uid_str.clone(),
&index,
features,
)
.with_index(query_index)?;
let retrieve_vector = RetrieveVectors::new(query.retrieve_vectors, features)
.with_index(query_index)?;
let search_result = tokio::task::spawn_blocking(move || {
perform_search(&index, query, search_kind, retrieve_vector, features)
perform_search(
index_uid_str.clone(),
&index,
query,
search_kind,
retrieve_vector,
features,
)
})
.await
.with_index(query_index)?;

View File

@ -560,7 +560,8 @@ pub fn perform_federated_search(
// use an immediately invoked lambda to capture the result without returning from the function
let res: Result<(), ResponseError> = (|| {
let search_kind = search_kind(&query, index_scheduler, &index, features)?;
let search_kind =
search_kind(&query, index_scheduler, index_uid.to_string(), &index, features)?;
let canonicalization_kind = match (&search_kind, &query.q) {
(SearchKind::SemanticOnly { .. }, _) => {
@ -636,7 +637,8 @@ pub fn perform_federated_search(
search.offset(0);
search.limit(required_hit_count);
let (result, _semantic_hit_count) = super::search_from_kind(search_kind, search)?;
let (result, _semantic_hit_count) =
super::search_from_kind(index_uid.to_string(), search_kind, search)?;
let format = AttributesFormat {
attributes_to_retrieve: query.attributes_to_retrieve,
retrieve_vectors,
@ -670,7 +672,10 @@ pub fn perform_federated_search(
let formatter_builder = HitMaker::formatter_builder(matching_words, tokenizer);
let hit_maker = HitMaker::new(&index, &rtxn, format, formatter_builder)?;
let hit_maker =
HitMaker::new(&index, &rtxn, format, formatter_builder).map_err(|e| {
MeilisearchHttpError::from_milli(e, Some(index_uid.to_string()))
})?;
results_by_query.push(SearchResultByQuery {
federation_options,

View File

@ -19,7 +19,9 @@ use meilisearch_types::locales::Locale;
use meilisearch_types::milli::score_details::{ScoreDetails, ScoringStrategy};
use meilisearch_types::milli::vector::parsed_vectors::ExplicitVectors;
use meilisearch_types::milli::vector::Embedder;
use meilisearch_types::milli::{FacetValueHit, OrderBy, SearchForFacetValues, TimeBudget};
use meilisearch_types::milli::{
FacetValueHit, InternalError, OrderBy, SearchForFacetValues, TimeBudget,
};
use meilisearch_types::settings::DEFAULT_PAGINATION_MAX_TOTAL_HITS;
use meilisearch_types::{milli, Document};
use milli::tokenizer::{Language, TokenizerBuilder};
@ -281,35 +283,38 @@ pub enum SearchKind {
impl SearchKind {
pub(crate) fn semantic(
index_scheduler: &index_scheduler::IndexScheduler,
index_uid: String,
index: &Index,
embedder_name: &str,
vector_len: Option<usize>,
) -> Result<Self, ResponseError> {
let (embedder_name, embedder, quantized) =
Self::embedder(index_scheduler, index, embedder_name, vector_len)?;
Self::embedder(index_scheduler, index_uid, index, embedder_name, vector_len)?;
Ok(Self::SemanticOnly { embedder_name, embedder, quantized })
}
pub(crate) fn hybrid(
index_scheduler: &index_scheduler::IndexScheduler,
index_uid: String,
index: &Index,
embedder_name: &str,
semantic_ratio: f32,
vector_len: Option<usize>,
) -> Result<Self, ResponseError> {
let (embedder_name, embedder, quantized) =
Self::embedder(index_scheduler, index, embedder_name, vector_len)?;
Self::embedder(index_scheduler, index_uid, index, embedder_name, vector_len)?;
Ok(Self::Hybrid { embedder_name, embedder, quantized, semantic_ratio })
}
pub(crate) fn embedder(
index_scheduler: &index_scheduler::IndexScheduler,
index_uid: String,
index: &Index,
embedder_name: &str,
vector_len: Option<usize>,
) -> Result<(String, Arc<Embedder>, bool), ResponseError> {
let embedder_configs = index.embedding_configs(&index.read_txn()?)?;
let embedders = index_scheduler.embedders(embedder_configs)?;
let embedders = index_scheduler.embedders(index_uid, embedder_configs)?;
let (embedder, _, quantized) = embedders
.get(embedder_name)
@ -890,6 +895,7 @@ fn prepare_search<'t>(
}
pub fn perform_search(
index_uid: String,
index: &Index,
query: SearchQuery,
search_kind: SearchKind,
@ -916,7 +922,7 @@ pub fn perform_search(
used_negative_operator,
},
semantic_hit_count,
) = search_from_kind(search_kind, search)?;
) = search_from_kind(index_uid, search_kind, search)?;
let SearchQuery {
q,
@ -1069,17 +1075,27 @@ fn compute_facet_distribution_stats<S: AsRef<str>>(
}
pub fn search_from_kind(
index_uid: String,
search_kind: SearchKind,
search: milli::Search<'_>,
) -> Result<(milli::SearchResult, Option<u32>), MeilisearchHttpError> {
let (milli_result, semantic_hit_count) = match &search_kind {
SearchKind::KeywordOnly => (search.execute()?, None),
SearchKind::KeywordOnly => {
let results = search
.execute()
.map_err(|e| MeilisearchHttpError::from_milli(e, Some(index_uid.to_string())))?;
(results, None)
}
SearchKind::SemanticOnly { .. } => {
let results = search.execute()?;
let results = search
.execute()
.map_err(|e| MeilisearchHttpError::from_milli(e, Some(index_uid.to_string())))?;
let semantic_hit_count = results.document_scores.len() as u32;
(results, Some(semantic_hit_count))
}
SearchKind::Hybrid { semantic_ratio, .. } => search.execute_hybrid(*semantic_ratio)?,
SearchKind::Hybrid { semantic_ratio, .. } => search
.execute_hybrid(*semantic_ratio)
.map_err(|e| MeilisearchHttpError::from_milli(e, Some(index_uid)))?,
};
Ok((milli_result, semantic_hit_count))
}
@ -1181,7 +1197,7 @@ impl<'a> HitMaker<'a> {
rtxn: &'a RoTxn<'a>,
format: AttributesFormat,
mut formatter_builder: MatcherBuilder<'a>,
) -> Result<Self, MeilisearchHttpError> {
) -> milli::Result<Self> {
formatter_builder.crop_marker(format.crop_marker);
formatter_builder.highlight_prefix(format.highlight_pre_tag);
formatter_builder.highlight_suffix(format.highlight_post_tag);
@ -1276,11 +1292,7 @@ impl<'a> HitMaker<'a> {
})
}
pub fn make_hit(
&self,
id: u32,
score: &[ScoreDetails],
) -> Result<SearchHit, MeilisearchHttpError> {
pub fn make_hit(&self, id: u32, score: &[ScoreDetails]) -> milli::Result<SearchHit> {
let (_, obkv) =
self.index.iter_documents(self.rtxn, std::iter::once(id))?.next().unwrap()?;
@ -1323,7 +1335,10 @@ impl<'a> HitMaker<'a> {
.is_some_and(|conf| conf.user_provided.contains(id));
let embeddings =
ExplicitVectors { embeddings: Some(vector.into()), regenerate: !user_provided };
vectors.insert(name, serde_json::to_value(embeddings)?);
vectors.insert(
name,
serde_json::to_value(embeddings).map_err(InternalError::SerdeJson)?,
);
}
document.insert("_vectors".into(), vectors.into());
}
@ -1369,7 +1384,7 @@ fn make_hits<'a>(
format: AttributesFormat,
matching_words: milli::MatchingWords,
documents_ids_scores: impl Iterator<Item = (u32, &'a Vec<ScoreDetails>)> + 'a,
) -> Result<Vec<SearchHit>, MeilisearchHttpError> {
) -> milli::Result<Vec<SearchHit>> {
let mut documents = Vec::new();
let dictionary = index.dictionary(rtxn)?;
@ -1697,12 +1712,12 @@ fn make_document(
displayed_attributes: &BTreeSet<FieldId>,
field_ids_map: &FieldsIdsMap,
obkv: &obkv::KvReaderU16,
) -> Result<Document, MeilisearchHttpError> {
) -> milli::Result<Document> {
let mut document = serde_json::Map::new();
// recreate the original json
for (key, value) in obkv.iter() {
let value = serde_json::from_slice(value)?;
let value = serde_json::from_slice(value).map_err(InternalError::SerdeJson)?;
let key = field_ids_map.name(key).expect("Missing field name").to_string();
document.insert(key, value);
@ -1727,7 +1742,7 @@ fn format_fields(
displayable_ids: &BTreeSet<FieldId>,
locales: Option<&[Language]>,
localized_attributes: &[LocalizedAttributesRule],
) -> Result<(Option<MatchesPosition>, Document), MeilisearchHttpError> {
) -> milli::Result<(Option<MatchesPosition>, Document)> {
let mut matches_position = compute_matches.then(BTreeMap::new);
let mut document = document.clone();
@ -1905,7 +1920,7 @@ fn parse_filter_array(arr: &[Value]) -> Result<Option<Filter>, MeilisearchHttpEr
}
}
Ok(Filter::from_array(ands)?)
Filter::from_array(ands).map_err(|e| MeilisearchHttpError::from_milli(e, None))
}
#[cfg(test)]

View File

@ -284,6 +284,7 @@ async fn test_summarized_document_addition_or_update() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
@ -314,6 +315,7 @@ async fn test_summarized_document_addition_or_update() {
@r#"
{
"uid": 1,
"progress": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
@ -349,6 +351,7 @@ async fn test_summarized_delete_documents_by_batch() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"providedIds": 3,
"deletedDocuments": 0
@ -380,6 +383,7 @@ async fn test_summarized_delete_documents_by_batch() {
@r#"
{
"uid": 2,
"progress": null,
"details": {
"providedIds": 1,
"deletedDocuments": 0
@ -416,6 +420,7 @@ async fn test_summarized_delete_documents_by_filter() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"providedIds": 0,
"deletedDocuments": 0,
@ -448,6 +453,7 @@ async fn test_summarized_delete_documents_by_filter() {
@r#"
{
"uid": 2,
"progress": null,
"details": {
"providedIds": 0,
"deletedDocuments": 0,
@ -480,6 +486,7 @@ async fn test_summarized_delete_documents_by_filter() {
@r#"
{
"uid": 4,
"progress": null,
"details": {
"providedIds": 0,
"deletedDocuments": 0,
@ -516,6 +523,7 @@ async fn test_summarized_delete_document_by_id() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"providedIds": 1,
"deletedDocuments": 0
@ -547,6 +555,7 @@ async fn test_summarized_delete_document_by_id() {
@r#"
{
"uid": 2,
"progress": null,
"details": {
"providedIds": 1,
"deletedDocuments": 0
@ -594,6 +603,7 @@ async fn test_summarized_settings_update() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"displayedAttributes": [
"doggos",
@ -638,6 +648,7 @@ async fn test_summarized_index_creation() {
@r#"
{
"uid": 0,
"progress": null,
"details": {},
"stats": {
"totalNbTasks": 1,
@ -665,6 +676,7 @@ async fn test_summarized_index_creation() {
@r#"
{
"uid": 1,
"progress": null,
"details": {
"primaryKey": "doggos"
},
@ -809,6 +821,7 @@ async fn test_summarized_index_update() {
@r#"
{
"uid": 0,
"progress": null,
"details": {},
"stats": {
"totalNbTasks": 1,
@ -836,6 +849,7 @@ async fn test_summarized_index_update() {
@r#"
{
"uid": 1,
"progress": null,
"details": {
"primaryKey": "bones"
},
@ -868,6 +882,7 @@ async fn test_summarized_index_update() {
@r#"
{
"uid": 3,
"progress": null,
"details": {},
"stats": {
"totalNbTasks": 1,
@ -895,6 +910,7 @@ async fn test_summarized_index_update() {
@r#"
{
"uid": 4,
"progress": null,
"details": {
"primaryKey": "bones"
},
@ -932,6 +948,7 @@ async fn test_summarized_index_swap() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"swaps": [
{
@ -972,6 +989,7 @@ async fn test_summarized_index_swap() {
@r#"
{
"uid": 3,
"progress": null,
"details": {
"swaps": [
{
@ -1014,6 +1032,7 @@ async fn test_summarized_batch_cancelation() {
@r#"
{
"uid": 1,
"progress": null,
"details": {
"matchedTasks": 1,
"canceledTasks": 0,
@ -1051,6 +1070,7 @@ async fn test_summarized_batch_deletion() {
@r#"
{
"uid": 1,
"progress": null,
"details": {
"matchedTasks": 1,
"deletedTasks": 1,
@ -1084,6 +1104,7 @@ async fn test_summarized_dump_creation() {
@r#"
{
"uid": 0,
"progress": null,
"details": {
"dumpUid": "[dumpUid]"
},

View File

@ -1220,9 +1220,89 @@ async fn replace_document() {
#[actix_rt::test]
async fn add_no_documents() {
let server = Server::new().await;
let index = server.index("test");
let (_response, code) = index.add_documents(json!([]), None).await;
let index = server.index("kefir");
let (task, code) = index.add_documents(json!([]), None).await;
snapshot!(code, @"202 Accepted");
let task = server.wait_task(task.uid()).await;
let task = task.succeeded();
snapshot!(task, @r#"
{
"uid": "[uid]",
"batchUid": "[batch_uid]",
"indexUid": "kefir",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 0,
"indexedDocuments": 0
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"#);
let (task, _code) = index.add_documents(json!([]), Some("kefkef")).await;
let task = server.wait_task(task.uid()).await;
let task = task.succeeded();
snapshot!(task, @r#"
{
"uid": "[uid]",
"batchUid": "[batch_uid]",
"indexUid": "kefir",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 0,
"indexedDocuments": 0
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"#);
let (task, _code) = index.add_documents(json!([{ "kefkef": 1 }]), None).await;
let task = server.wait_task(task.uid()).await;
let task = task.succeeded();
snapshot!(task, @r#"
{
"uid": "[uid]",
"batchUid": "[batch_uid]",
"indexUid": "kefir",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"#);
let (documents, _status) = index.get_all_documents(GetAllDocumentsOptions::default()).await;
snapshot!(documents, @r#"
{
"results": [
{
"kefkef": 1
}
],
"offset": 0,
"limit": 20,
"total": 1
}
"#);
}
#[actix_rt::test]
@ -1264,15 +1344,18 @@ async fn error_add_documents_bad_document_id() {
let server = Server::new().await;
let index = server.index("test");
index.create(Some("docid")).await;
// unsupported characters
let documents = json!([
{
"docid": "foo & bar",
"content": "foobar"
}
]);
index.add_documents(documents, None).await;
index.wait_task(1).await;
let (response, code) = index.get_task(1).await;
let (value, _code) = index.add_documents(documents, None).await;
index.wait_task(value.uid()).await;
let (response, code) = index.get_task(value.uid()).await;
snapshot!(code, @"200 OK");
snapshot!(json_string!(response, { ".duration" => "[duration]", ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]" }),
@r###"
@ -1288,7 +1371,81 @@ async fn error_add_documents_bad_document_id() {
"indexedDocuments": 0
},
"error": {
"message": "Document identifier `\"foo & bar\"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 512 bytes.",
"message": "Document identifier `\"foo & bar\"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes.",
"code": "invalid_document_id",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_id"
},
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
// More than 512 bytes
let documents = json!([
{
"docid": "a".repeat(600),
"content": "foobar"
}
]);
let (value, _code) = index.add_documents(documents, None).await;
index.wait_task(value.uid()).await;
let (response, code) = index.get_task(value.uid()).await;
snapshot!(code, @"200 OK");
snapshot!(json_string!(response, { ".duration" => "[duration]", ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]" }),
@r###"
{
"uid": 2,
"batchUid": 2,
"indexUid": "test",
"status": "failed",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 0
},
"error": {
"message": "Document identifier `\"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes.",
"code": "invalid_document_id",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_id"
},
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
// Exactly 512 bytes
let documents = json!([
{
"docid": "a".repeat(512),
"content": "foobar"
}
]);
let (value, _code) = index.add_documents(documents, None).await;
index.wait_task(value.uid()).await;
let (response, code) = index.get_task(value.uid()).await;
snapshot!(code, @"200 OK");
snapshot!(json_string!(response, { ".duration" => "[duration]", ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]" }),
@r###"
{
"uid": 3,
"batchUid": 3,
"indexUid": "test",
"status": "failed",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 0
},
"error": {
"message": "Document identifier `\"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes.",
"code": "invalid_document_id",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_id"
@ -1681,7 +1838,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "The `_geo` field in the document with the id: `\"11\"` is not an object. Was expecting an object with the `_geo.lat` and `_geo.lng` fields but instead got `\"foobar\"`.",
"message": "Index `test`: The `_geo` field in the document with the id: `\"11\"` is not an object. Was expecting an object with the `_geo.lat` and `_geo.lng` fields but instead got `\"foobar\"`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1719,7 +1876,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find latitude nor longitude in the document with the id: `\"11\"`. Was expecting `_geo.lat` and `_geo.lng` fields.",
"message": "Index `test`: Could not find latitude nor longitude in the document with the id: `\"11\"`. Was expecting `_geo.lat` and `_geo.lng` fields.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1757,7 +1914,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find latitude nor longitude in the document with the id: `\"11\"`. Was expecting `_geo.lat` and `_geo.lng` fields.",
"message": "Index `test`: Could not find latitude nor longitude in the document with the id: `\"11\"`. Was expecting `_geo.lat` and `_geo.lng` fields.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1795,7 +1952,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"message": "Index `test`: Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1833,7 +1990,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"message": "Index `test`: Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1871,7 +2028,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"message": "Index `test`: Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1909,7 +2066,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"message": "Index `test`: Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1947,7 +2104,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not parse latitude nor longitude in the document with the id: `\"11\"`. Was expecting finite numbers but instead got `false` and `true`.",
"message": "Index `test`: Could not parse latitude nor longitude in the document with the id: `\"11\"`. Was expecting finite numbers but instead got `false` and `true`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -1985,7 +2142,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"message": "Index `test`: Could not find longitude in the document with the id: `\"11\"`. Was expecting a `_geo.lng` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2023,7 +2180,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"message": "Index `test`: Could not find latitude in the document with the id: `\"11\"`. Was expecting a `_geo.lat` field.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2061,7 +2218,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not parse latitude nor longitude in the document with the id: `\"11\"`. Was expecting finite numbers but instead got `\"doggo\"` and `\"doggo\"`.",
"message": "Index `test`: Could not parse latitude nor longitude in the document with the id: `\"11\"`. Was expecting finite numbers but instead got `\"doggo\"` and `\"doggo\"`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2099,7 +2256,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "The `_geo` field in the document with the id: `\"11\"` contains the following unexpected fields: `{\"doggo\":\"are the best\"}`.",
"message": "Index `test`: The `_geo` field in the document with the id: `\"11\"` contains the following unexpected fields: `{\"doggo\":\"are the best\"}`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2138,7 +2295,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not parse longitude in the document with the id: `\"12\"`. Was expecting a finite number but instead got `null`.",
"message": "Index `test`: Could not parse longitude in the document with the id: `\"12\"`. Was expecting a finite number but instead got `null`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2175,7 +2332,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not parse latitude in the document with the id: `\"12\"`. Was expecting a finite number but instead got `null`.",
"message": "Index `test`: Could not parse latitude in the document with the id: `\"12\"`. Was expecting a finite number but instead got `null`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2212,7 +2369,7 @@ async fn add_documents_invalid_geo_field() {
"indexedDocuments": 0
},
"error": {
"message": "Could not parse latitude nor longitude in the document with the id: `\"13\"`. Was expecting finite numbers but instead got `null` and `null`.",
"message": "Index `test`: Could not parse latitude nor longitude in the document with the id: `\"13\"`. Was expecting finite numbers but instead got `null` and `null`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"
@ -2279,7 +2436,7 @@ async fn add_invalid_geo_and_then_settings() {
]
},
"error": {
"message": "Could not parse latitude in the document with the id: `\"11\"`. Was expecting a finite number but instead got `null`.",
"message": "Index `test`: Could not parse latitude in the document with the id: `\"11\"`. Was expecting a finite number but instead got `null`.",
"code": "invalid_document_geo_field",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_geo_field"

View File

@ -604,7 +604,7 @@ async fn delete_document_by_filter() {
"originalFilter": "\"doggo = bernese\""
},
"error": {
"message": "Attribute `doggo` is not filterable. This index does not have configured filterable attributes.\n1:6 doggo = bernese",
"message": "Index `EMPTY_INDEX`: Attribute `doggo` is not filterable. This index does not have configured filterable attributes.\n1:6 doggo = bernese",
"code": "invalid_document_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_filter"
@ -636,7 +636,7 @@ async fn delete_document_by_filter() {
"originalFilter": "\"catto = jorts\""
},
"error": {
"message": "Attribute `catto` is not filterable. Available filterable attributes are: `id`, `title`.\n1:6 catto = jorts",
"message": "Index `SHARED_DOCUMENTS`: Attribute `catto` is not filterable. Available filterable attributes are: `id`, `title`.\n1:6 catto = jorts",
"code": "invalid_document_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_filter"

View File

@ -172,7 +172,7 @@ async fn error_update_documents_bad_document_id() {
assert_eq!(
response["error"]["message"],
json!(
r#"Document identifier `"foo & bar"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 512 bytes."#
r#"Document identifier `"foo & bar"` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes."#
)
);
assert_eq!(response["error"]["code"], json!("invalid_document_id"));

View File

@ -95,7 +95,7 @@ async fn error_update_existing_primary_key() {
let response = index.wait_task(2).await;
let expected_response = json!({
"message": "Index already has a primary key: `id`.",
"message": "Index `test`: Index already has a primary key: `id`.",
"code": "index_primary_key_already_exists",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#index_primary_key_already_exists"

View File

@ -711,7 +711,7 @@ async fn filter_invalid_attribute_array() {
index.wait_task(task.uid()).await;
let expected_response = json!({
"message": "Attribute `many` is not filterable. Available filterable attributes are: `title`.\n1:5 many = Glass",
"message": format!("Index `{}`: Attribute `many` is not filterable. Available filterable attributes are: `title`.\n1:5 many = Glass", index.uid),
"code": "invalid_search_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_filter"
@ -733,7 +733,7 @@ async fn filter_invalid_attribute_string() {
index.wait_task(task.uid()).await;
let expected_response = json!({
"message": "Attribute `many` is not filterable. Available filterable attributes are: `title`.\n1:5 many = Glass",
"message": format!("Index `{}`: Attribute `many` is not filterable. Available filterable attributes are: `title`.\n1:5 many = Glass", index.uid),
"code": "invalid_search_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_filter"
@ -940,7 +940,7 @@ async fn sort_unsortable_attribute() {
index.wait_task(response.uid()).await.succeeded();
let expected_response = json!({
"message": "Attribute `title` is not sortable. Available sortable attributes are: `id`.",
"message": format!("Index `{}`: Attribute `title` is not sortable. Available sortable attributes are: `id`.", index.uid),
"code": "invalid_search_sort",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_sort"
@ -998,7 +998,7 @@ async fn sort_unset_ranking_rule() {
index.wait_task(response.uid()).await.succeeded();
let expected_response = json!({
"message": "You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time.",
"message": format!("Index `{}`: You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time.", index.uid),
"code": "invalid_search_sort",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_sort"
@ -1024,19 +1024,18 @@ async fn search_on_unknown_field() {
index.update_settings_searchable_attributes(json!(["id", "title"])).await;
index.wait_task(response.uid()).await.succeeded();
let expected_response = json!({
"message": format!("Index `{}`: Attribute `unknown` is not searchable. Available searchable attributes are: `id, title`.", index.uid),
"code": "invalid_search_attributes_to_search_on",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_attributes_to_search_on"
});
index
.search(
json!({"q": "Captain Marvel", "attributesToSearchOn": ["unknown"]}),
|response, code| {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Attribute `unknown` is not searchable. Available searchable attributes are: `id, title`.",
"code": "invalid_search_attributes_to_search_on",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_attributes_to_search_on"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
},
)
.await;
@ -1050,19 +1049,18 @@ async fn search_on_unknown_field_plus_joker() {
index.update_settings_searchable_attributes(json!(["id", "title"])).await;
index.wait_task(response.uid()).await.succeeded();
let expected_response = json!({
"message": format!("Index `{}`: Attribute `unknown` is not searchable. Available searchable attributes are: `id, title`.", index.uid),
"code": "invalid_search_attributes_to_search_on",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_attributes_to_search_on"
});
index
.search(
json!({"q": "Captain Marvel", "attributesToSearchOn": ["*", "unknown"]}),
|response, code| {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Attribute `unknown` is not searchable. Available searchable attributes are: `id, title`.",
"code": "invalid_search_attributes_to_search_on",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_attributes_to_search_on"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
},
)
.await;
@ -1071,15 +1069,8 @@ async fn search_on_unknown_field_plus_joker() {
.search(
json!({"q": "Captain Marvel", "attributesToSearchOn": ["unknown", "*"]}),
|response, code| {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Attribute `unknown` is not searchable. Available searchable attributes are: `id, title`.",
"code": "invalid_search_attributes_to_search_on",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_attributes_to_search_on"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
},
)
.await;
@ -1092,47 +1083,44 @@ async fn distinct_at_search_time() {
let (task, _) = index.create(None).await;
index.wait_task(task.uid()).await.succeeded();
let expected_response = json!({
"message": format!("Index `{}`: Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. This index does not have configured filterable attributes.", index.uid),
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
});
let (response, code) =
index.search_post(json!({"page": 0, "hitsPerPage": 2, "distinct": "doggo.truc"})).await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. This index does not have configured filterable attributes.",
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
let (task, _) = index.update_settings_filterable_attributes(json!(["color", "machin"])).await;
index.wait_task(task.uid()).await;
let expected_response = json!({
"message": format!("Index `{}`: Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. Available filterable attributes are: `color, machin`.", index.uid),
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
});
let (response, code) =
index.search_post(json!({"page": 0, "hitsPerPage": 2, "distinct": "doggo.truc"})).await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. Available filterable attributes are: `color, machin`.",
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
let (task, _) = index.update_settings_displayed_attributes(json!(["color"])).await;
index.wait_task(task.uid()).await;
let expected_response = json!({
"message": format!("Index `{}`: Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. Available filterable attributes are: `color, <..hidden-attributes>`.", index.uid),
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
});
let (response, code) =
index.search_post(json!({"page": 0, "hitsPerPage": 2, "distinct": "doggo.truc"})).await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Attribute `doggo.truc` is not filterable and thus, cannot be used as distinct attribute. Available filterable attributes are: `color, <..hidden-attributes>`.",
"code": "invalid_search_distinct",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_distinct"
}
"###);
assert_eq!(response, expected_response);
assert_eq!(code, 400);
let (response, code) =
index.search_post(json!({"page": 0, "hitsPerPage": 2, "distinct": true})).await;

View File

@ -57,6 +57,116 @@ async fn simple_facet_search() {
assert_eq!(response["facetHits"].as_array().unwrap().len(), 1);
}
#[actix_rt::test]
async fn simple_facet_search_on_movies() {
let server = Server::new().await;
let index = server.index("test");
let documents = json!([
{
"id": 1,
"title": "Carol",
"genres": [
"Romance",
"Drama"
],
"color": [
"red"
],
"platforms": [
"MacOS",
"Linux",
"Windows"
]
},
{
"id": 2,
"title": "Wonder Woman",
"genres": [
"Action",
"Adventure"
],
"color": [
"green"
],
"platforms": [
"MacOS"
]
},
{
"id": 3,
"title": "Life of Pi",
"genres": [
"Adventure",
"Drama"
],
"color": [
"blue"
],
"platforms": [
"Windows"
]
},
{
"id": 4,
"title": "Mad Max: Fury Road",
"genres": [
"Adventure",
"Science Fiction"
],
"color": [
"red"
],
"platforms": [
"MacOS",
"Linux"
]
},
{
"id": 5,
"title": "Moana",
"genres": [
"Fantasy",
"Action"
],
"color": [
"red"
],
"platforms": [
"Windows"
]
},
{
"id": 6,
"title": "Philadelphia",
"genres": [
"Drama"
],
"color": [
"blue"
],
"platforms": [
"MacOS",
"Linux",
"Windows"
]
}
]);
let (response, code) =
index.update_settings_filterable_attributes(json!(["genres", "color"])).await;
assert_eq!(202, code, "{:?}", response);
index.wait_task(response.uid()).await;
let (response, _code) = index.add_documents(documents, None).await;
index.wait_task(response.uid()).await;
let (response, code) =
index.facet_search(json!({"facetQuery": "", "facetName": "genres", "q": "" })).await;
assert_eq!(code, 200, "{}", response);
snapshot!(response["facetHits"], @r###"[{"value":"Action","count":2},{"value":"Adventure","count":3},{"value":"Drama","count":3},{"value":"Fantasy","count":1},{"value":"Romance","count":1},{"value":"Science Fiction","count":1}]"###);
}
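The snapshot above follows directly from the six documents: `Drama`, for instance, appears in documents 1, 3 and 6, hence a count of 3. Below is a standalone sketch of that counting, only the arithmetic behind the expected `facetHits`, not Meilisearch's implementation.

use std::collections::BTreeMap;

fn main() {
    // Genres of the six documents defined in the test above.
    let documents = [
        vec!["Romance", "Drama"],             // Carol
        vec!["Action", "Adventure"],          // Wonder Woman
        vec!["Adventure", "Drama"],           // Life of Pi
        vec!["Adventure", "Science Fiction"], // Mad Max: Fury Road
        vec!["Fantasy", "Action"],            // Moana
        vec!["Drama"],                        // Philadelphia
    ];
    let mut counts: BTreeMap<&str, u32> = BTreeMap::new();
    for genres in &documents {
        for &genre in genres {
            *counts.entry(genre).or_insert(0) += 1;
        }
    }
    // Matches the snapshot: Action 2, Adventure 3, Drama 3, Fantasy 1, Romance 1, Science Fiction 1.
    for (value, count) in counts {
        println!("{value}: {count}");
    }
}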
#[actix_rt::test]
async fn advanced_facet_search() {
let server = Server::new().await;

View File

@ -1746,3 +1746,57 @@ async fn change_attributes_settings() {
)
.await;
}
/// Modifying facets with different casing should work correctly
#[actix_rt::test]
async fn change_facet_casing() {
let server = Server::new().await;
let index = server.index("test");
let (response, code) = index
.update_settings(json!({
"filterableAttributes": ["dog"],
}))
.await;
assert_eq!("202", code.as_str(), "{:?}", response);
index.wait_task(response.uid()).await;
let (response, _code) = index
.add_documents(
json!([
{
"id": 1,
"dog": "Bouvier Bernois"
}
]),
None,
)
.await;
index.wait_task(response.uid()).await;
let (response, _code) = index
.add_documents(
json!([
{
"id": 1,
"dog": "bouvier bernois"
}
]),
None,
)
.await;
index.wait_task(response.uid()).await;
index
.search(json!({ "facets": ["dog"] }), |response, code| {
meili_snap::snapshot!(code, @"200 OK");
meili_snap::snapshot!(meili_snap::json_string!(response["facetDistribution"]), @r###"
{
"dog": {
"bouvier bernois": 1
}
}
"###);
})
.await;
}
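The final distribution has a single entry because facet values are apparently keyed by a normalized form: re-adding document 1 with `bouvier bernois` replaces the previously indexed `Bouvier Bernois` rather than leaving both behind. A standalone sketch of that idea, using plain lowercasing as a stand-in for whatever normalization is actually applied:

use std::collections::BTreeMap;

// Hypothetical normalization; the real one may differ.
fn normalize(facet: &str) -> String {
    facet.to_lowercase()
}

fn main() {
    let mut distribution: BTreeMap<String, u32> = BTreeMap::new();
    // First version of document 1.
    *distribution.entry(normalize("Bouvier Bernois")).or_insert(0) += 1;
    // Document 1 is replaced: decrement the old value through its *normalized* key...
    *distribution.get_mut(&normalize("Bouvier Bernois")).unwrap() -= 1;
    distribution.retain(|_, count| *count > 0);
    // ...then count the new value, which normalizes to the same key.
    *distribution.entry(normalize("bouvier bernois")).or_insert(0) += 1;
    // A single entry remains, as in the snapshot: {"bouvier bernois": 1}
    println!("{distribution:?}");
}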

View File

@ -1070,7 +1070,7 @@ async fn federation_one_query_error() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[1]`: Attribute `title` is not filterable. This index does not have configured filterable attributes.\n1:6 title = toto",
"message": "Inside `.queries[1]`: Index `nested`: Attribute `title` is not filterable. This index does not have configured filterable attributes.\n1:6 title = toto",
"code": "invalid_search_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_filter"
@ -1102,7 +1102,7 @@ async fn federation_one_query_sort_error() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[1]`: Attribute `doggos` is not sortable. This index does not have configured sortable attributes.",
"message": "Inside `.queries[1]`: Index `nested`: Attribute `doggos` is not sortable. This index does not have configured sortable attributes.",
"code": "invalid_search_sort",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_sort"
@ -1166,7 +1166,7 @@ async fn federation_multiple_query_errors() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[0]`: Attribute `title` is not filterable. This index does not have configured filterable attributes.\n1:6 title = toto",
"message": "Inside `.queries[0]`: Index `test`: Attribute `title` is not filterable. This index does not have configured filterable attributes.\n1:6 title = toto",
"code": "invalid_search_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_filter"
@ -1198,7 +1198,7 @@ async fn federation_multiple_query_sort_errors() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[0]`: Attribute `title` is not sortable. This index does not have configured sortable attributes.",
"message": "Inside `.queries[0]`: Index `test`: Attribute `title` is not sortable. This index does not have configured sortable attributes.",
"code": "invalid_search_sort",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_sort"
@ -1231,7 +1231,7 @@ async fn federation_multiple_query_errors_interleaved() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[1]`: Attribute `doggos` is not filterable. This index does not have configured filterable attributes.\n1:7 doggos IN [intel, kefir]",
"message": "Inside `.queries[1]`: Index `nested`: Attribute `doggos` is not filterable. This index does not have configured filterable attributes.\n1:7 doggos IN [intel, kefir]",
"code": "invalid_search_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_filter"
@ -1264,7 +1264,7 @@ async fn federation_multiple_query_sort_errors_interleaved() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Inside `.queries[1]`: Attribute `doggos` is not sortable. This index does not have configured sortable attributes.",
"message": "Inside `.queries[1]`: Index `nested`: Attribute `doggos` is not sortable. This index does not have configured sortable attributes.",
"code": "invalid_search_sort",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_search_sort"

View File

@ -79,7 +79,7 @@ async fn similar_bad_id() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Invalid value at `.id`: the value of `id` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 512 bytes.",
"message": "Invalid value at `.id`: the value of `id` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes.",
"code": "invalid_similar_id",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_similar_id"
@ -172,7 +172,7 @@ async fn similar_invalid_id() {
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Invalid value at `.id`: the value of `id` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 512 bytes.",
"message": "Invalid value at `.id`: the value of `id` is invalid. A document identifier can be of type integer or string, only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), and can not be more than 511 bytes.",
"code": "invalid_similar_id",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_similar_id"

View File

@ -448,7 +448,7 @@ async fn test_summarized_delete_documents_by_filter() {
"originalFilter": "\"doggo = bernese\""
},
"error": {
"message": "Attribute `doggo` is not filterable. This index does not have configured filterable attributes.\n1:6 doggo = bernese",
"message": "Index `test`: Attribute `doggo` is not filterable. This index does not have configured filterable attributes.\n1:6 doggo = bernese",
"code": "invalid_document_filter",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_document_filter"

View File

@ -318,7 +318,7 @@ async fn try_to_disable_binary_quantization() {
}
},
"error": {
"message": "`.embedders.manual.binaryQuantized`: Cannot disable the binary quantization.\n - Note: Binary quantization is a lossy operation that cannot be reverted.\n - Hint: Add a new embedder that is non-quantized and regenerate the vectors.",
"message": "Index `doggo`: `.embedders.manual.binaryQuantized`: Cannot disable the binary quantization.\n - Note: Binary quantization is a lossy operation that cannot be reverted.\n - Hint: Add a new embedder that is non-quantized and regenerate the vectors.",
"code": "invalid_settings_embedders",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_settings_embedders"

View File

@ -250,7 +250,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Missing field `._vectors.manual.regenerate`\n - note: `._vectors.manual` must be an array of floats, an array of arrays of floats, or an object with field `regenerate`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Missing field `._vectors.manual.regenerate`\n - note: `._vectors.manual` must be an array of floats, an array of arrays of floats, or an object with field `regenerate`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -280,7 +280,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Missing field `._vectors.manual.regenerate`\n - note: `._vectors.manual` must be an array of floats, an array of arrays of floats, or an object with field `regenerate`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Missing field `._vectors.manual.regenerate`\n - note: `._vectors.manual` must be an array of floats, an array of arrays of floats, or an object with field `regenerate`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -311,7 +311,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Could not parse `._vectors.manual.regenerate`: invalid type: string \"yes please\", expected a boolean at line 1 column 26",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Could not parse `._vectors.manual.regenerate`: invalid type: string \"yes please\", expected a boolean at line 1 column 26",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -340,7 +340,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings`: expected null or an array, but found a boolean: `true`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings`: expected null or an array, but found a boolean: `true`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -369,7 +369,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0]`: expected a number or an array, but found a boolean: `true`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0]`: expected a number or an array, but found a boolean: `true`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -398,7 +398,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0][0]`: expected a number, but found a boolean: `true`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0][0]`: expected a number, but found a boolean: `true`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -440,7 +440,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[1]`: expected a number, but found an array: `[0.2,0.3]`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[1]`: expected a number, but found an array: `[0.2,0.3]`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -469,7 +469,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[1]`: expected an array, but found a number: `0.3`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[1]`: expected an array, but found a number: `0.3`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -498,7 +498,7 @@ async fn user_provided_embeddings_error() {
"indexedDocuments": 0
},
"error": {
"message": "Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0][1]`: expected a number, but found a boolean: `true`",
"message": "Index `doggo`: Bad embedder configuration in the document with id: `0`. Invalid value type at `._vectors.manual.embeddings[0][1]`: expected a number, but found a boolean: `true`",
"code": "invalid_vectors_type",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vectors_type"
@ -539,7 +539,7 @@ async fn user_provided_vectors_error() {
"indexedDocuments": 0
},
"error": {
"message": "While embedding documents for embedder `manual`: no vectors provided for document `40` and at least 4 other document(s)\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: opt-out for a document with `_vectors.manual: null`",
"message": "Index `doggo`: While embedding documents for embedder `manual`: no vectors provided for document `40` and at least 4 other document(s)\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: opt-out for a document with `_vectors.manual: null`",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -569,7 +569,7 @@ async fn user_provided_vectors_error() {
"indexedDocuments": 0
},
"error": {
"message": "While embedding documents for embedder `manual`: no vectors provided for document `42`\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: try replacing `_vector` by `_vectors` in 1 document(s).",
"message": "Index `doggo`: While embedding documents for embedder `manual`: no vectors provided for document `42`\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: try replacing `_vector` by `_vectors` in 1 document(s).",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -599,7 +599,7 @@ async fn user_provided_vectors_error() {
"indexedDocuments": 0
},
"error": {
"message": "While embedding documents for embedder `manual`: no vectors provided for document `42`\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: try replacing `_vectors.manaul` by `_vectors.manual` in 1 document(s).",
"message": "Index `doggo`: While embedding documents for embedder `manual`: no vectors provided for document `42`\n- Note: `manual` has `source: userProvided`, so documents must provide embeddings as an array in `_vectors.manual`.\n- Hint: try replacing `_vectors.manaul` by `_vectors.manual` in 1 document(s).",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"

View File

@ -713,7 +713,7 @@ async fn bad_api_key() {
}
},
"error": {
"message": "While embedding documents for embedder `default`: user error: could not authenticate against OpenAI server\n - server replied with `{\"error\":{\"message\":\"Incorrect API key provided: Bearer doggo. You can find your API key at https://platform.openai.com/account/api-keys.\",\"type\":\"invalid_request_error\",\"param\":null,\"code\":\"invalid_api_key\"}}`\n - Hint: Check the `apiKey` parameter in the embedder configuration, and the `MEILI_OPENAI_API_KEY` and `OPENAI_API_KEY` environment variables",
"message": "Index `doggo`: While embedding documents for embedder `default`: user error: could not authenticate against OpenAI server\n - server replied with `{\"error\":{\"message\":\"Incorrect API key provided: Bearer doggo. You can find your API key at https://platform.openai.com/account/api-keys.\",\"type\":\"invalid_request_error\",\"param\":null,\"code\":\"invalid_api_key\"}}`\n - Hint: Check the `apiKey` parameter in the embedder configuration, and the `MEILI_OPENAI_API_KEY` and `OPENAI_API_KEY` environment variables",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -757,7 +757,7 @@ async fn bad_api_key() {
}
},
"error": {
"message": "While embedding documents for embedder `default`: user error: could not authenticate against OpenAI server\n - server replied with `{\"error\":{\"message\":\"You didn't provide an API key. You need to provide your API key in an Authorization header using Bearer auth (i.e. Authorization: Bearer YOUR_KEY), or as the password field (with blank username) if you're accessing the API from your browser and are prompted for a username and password. You can obtain an API key from https://platform.openai.com/account/api-keys.\",\"type\":\"invalid_request_error\",\"param\":null,\"code\":null}}`\n - Hint: Check the `apiKey` parameter in the embedder configuration, and the `MEILI_OPENAI_API_KEY` and `OPENAI_API_KEY` environment variables",
"message": "Index `doggo`: While embedding documents for embedder `default`: user error: could not authenticate against OpenAI server\n - server replied with `{\"error\":{\"message\":\"You didn't provide an API key. You need to provide your API key in an Authorization header using Bearer auth (i.e. Authorization: Bearer YOUR_KEY), or as the password field (with blank username) if you're accessing the API from your browser and are prompted for a username and password. You can obtain an API key from https://platform.openai.com/account/api-keys.\",\"type\":\"invalid_request_error\",\"param\":null,\"code\":null}}`\n - Hint: Check the `apiKey` parameter in the embedder configuration, and the `MEILI_OPENAI_API_KEY` and `OPENAI_API_KEY` environment variables",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"

View File

@ -985,7 +985,7 @@ async fn bad_settings() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting a single \"{{embedding}}\", expected `response` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected a sequence",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting a single \"{{embedding}}\", expected `response` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected a sequence",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1025,7 +1025,7 @@ async fn bad_settings() {
"indexedDocuments": 0
},
"error": {
"message": "While embedding documents for embedder `rest`: runtime error: was expecting embeddings of dimension `2`, got embeddings of dimensions `3`",
"message": "Index `doggo`: While embedding documents for embedder `rest`: runtime error: was expecting embeddings of dimension `2`, got embeddings of dimensions `3`",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1178,7 +1178,7 @@ async fn server_returns_bad_request() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: sent a bad request to embedding server\n - Hint: check that the `request` in the embedder configuration matches the remote server's API\n - server replied with `{\"error\":\"Invalid request: invalid type: string \\\"test\\\", expected struct MultipleRequest at line 1 column 6\"}`",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: sent a bad request to embedding server\n - Hint: check that the `request` in the embedder configuration matches the remote server's API\n - server replied with `{\"error\":\"Invalid request: invalid type: string \\\"test\\\", expected struct MultipleRequest at line 1 column 6\"}`",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1247,7 +1247,7 @@ async fn server_returns_bad_request() {
"indexedDocuments": 0
},
"error": {
"message": "While embedding documents for embedder `rest`: user error: sent a bad request to embedding server\n - Hint: check that the `request` in the embedder configuration matches the remote server's API\n - server replied with `{\"error\":\"Invalid request: invalid type: string \\\"name: kefir\\\\n\\\", expected struct MultipleRequest at line 1 column 15\"}`",
"message": "Index `doggo`: While embedding documents for embedder `rest`: user error: sent a bad request to embedding server\n - Hint: check that the `request` in the embedder configuration matches the remote server's API\n - server replied with `{\"error\":\"Invalid request: invalid type: string \\\"name: kefir\\\\n\\\", expected struct MultipleRequest at line 1 column 15\"}`",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1306,7 +1306,7 @@ async fn server_returns_bad_response() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting the array of \"{{embedding}}\"s, configuration expects `response` to be an array with at least 1 item(s) but server sent an object with 1 field(s)",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting the array of \"{{embedding}}\"s, configuration expects `response` to be an array with at least 1 item(s) but server sent an object with 1 field(s)",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1362,7 +1362,7 @@ async fn server_returns_bad_response() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting item #0 from the array of \"{{embedding}}\"s, expected `response` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected a sequence",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response`, while extracting item #0 from the array of \"{{embedding}}\"s, expected `response` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected a sequence",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1414,7 +1414,7 @@ async fn server_returns_bad_response() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.output`, while extracting a single \"{{embedding}}\", expected `output` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected f32",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.output`, while extracting a single \"{{embedding}}\", expected `output` to be an array of numbers, but failed to parse server response:\n - invalid type: map, expected f32",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1478,7 +1478,7 @@ async fn server_returns_bad_response() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.embedding`, while extracting item #0 from the array of \"{{embedding}}\"s, configuration expects `embedding` to be an object with key `data` but server sent an array of size 3",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.embedding`, while extracting item #0 from the array of \"{{embedding}}\"s, configuration expects `embedding` to be an object with key `data` but server sent an array of size 3",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1542,7 +1542,7 @@ async fn server_returns_bad_response() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.output[0]`, while extracting a single \"{{embedding}}\", configuration expects key \"embeddings\", which is missing in response\n - Hint: item #0 inside `output` has key `embedding`, did you mean `response.output[0].embedding` in embedder configuration?",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with runtime error: error extracting embeddings from the response:\n - in `response.output[0]`, while extracting a single \"{{embedding}}\", configuration expects key \"embeddings\", which is missing in response\n - Hint: item #0 inside `output` has key `embedding`, did you mean `response.output[0].embedding` in embedder configuration?",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1908,7 +1908,7 @@ async fn server_custom_header() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: could not authenticate against embedding server\n - server replied with `{\"error\":\"missing header 'my-nonstandard-auth'\"}`\n - Hint: Check the `apiKey` parameter in the embedder configuration",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: could not authenticate against embedding server\n - server replied with `{\"error\":\"missing header 'my-nonstandard-auth'\"}`\n - Hint: Check the `apiKey` parameter in the embedder configuration",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -1951,7 +1951,7 @@ async fn server_custom_header() {
}
},
"error": {
"message": "Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: could not authenticate against embedding server\n - server replied with `{\"error\":\"thou shall not pass, Balrog\"}`\n - Hint: Check the `apiKey` parameter in the embedder configuration",
"message": "Index `doggo`: Error while generating embeddings: runtime error: could not determine model dimensions:\n - test embedding failed with user error: could not authenticate against embedding server\n - server replied with `{\"error\":\"thou shall not pass, Balrog\"}`\n - Hint: Check the `apiKey` parameter in the embedder configuration",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
@ -2099,7 +2099,7 @@ async fn searchable_reindex() {
]
},
"error": {
"message": "While embedding documents for embedder `rest`: error: received unexpected HTTP 404 from embedding server\n - server replied with `{\"error\":\"text not found\",\"text\":\"breed: patou\\n\"}`",
"message": "Index `doggo`: While embedding documents for embedder `rest`: error: received unexpected HTTP 404 from embedding server\n - server replied with `{\"error\":\"text not found\",\"text\":\"breed: patou\\n\"}`",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"

View File

@ -10,12 +10,15 @@ license.workspace = true
[dependencies]
anyhow = "1.0.86"
arroy_v04_to_v05 = { package = "arroy", git = "https://github.com/meilisearch/arroy/", tag = "DO-NOT-DELETE-upgrade-v04-to-v05" }
clap = { version = "4.5.9", features = ["derive"] }
dump = { path = "../dump" }
file-store = { path = "../file-store" }
indexmap = { version = "2.7.0", features = ["serde"] }
meilisearch-auth = { path = "../meilisearch-auth" }
meilisearch-types = { path = "../meilisearch-types" }
serde = { version = "1.0.209", features = ["derive"] }
serde_json = { version = "1.0.133", features = ["preserve_order"] }
tempfile = "3.14.0"
time = { version = "0.3.36", features = ["formatting", "parsing", "alloc"] }
uuid = { version = "1.10.0", features = ["v4"], default-features = false }
arroy_v04_to_v05 = { package = "arroy", git = "https://github.com/meilisearch/arroy/", tag = "DO-NOT-DELETE-upgrade-v04-to-v05" }

View File

@ -73,7 +73,7 @@ enum Command {
///
/// Supported upgrade paths:
///
/// - v1.9.x -> v1.10.x -> v1.11.x
/// - v1.9.x -> v1.10.x -> v1.11.x -> v1.12.x
OfflineUpgrade {
#[arg(long)]
target_version: String,
@ -88,7 +88,7 @@ fn main() -> anyhow::Result<()> {
match command {
Command::ClearTaskQueue => clear_task_queue(db_path),
Command::ExportADump { dump_dir, skip_enqueued_tasks } => {
export_a_dump(db_path, dump_dir, skip_enqueued_tasks)
export_a_dump(db_path, dump_dir, skip_enqueued_tasks, detected_version)
}
Command::OfflineUpgrade { target_version } => {
let target_version = parse_version(&target_version).context("While parsing `--target-version`. Make sure `--target-version` is in the format MAJOR.MINOR.PATCH")?;
@ -187,6 +187,7 @@ fn export_a_dump(
db_path: PathBuf,
dump_dir: PathBuf,
skip_enqueued_tasks: bool,
detected_version: (String, String, String),
) -> Result<(), anyhow::Error> {
let started_at = OffsetDateTime::now_utc();
@ -238,9 +239,6 @@ fn export_a_dump(
if skip_enqueued_tasks {
eprintln!("Skip dumping the enqueued tasks...");
} else {
eprintln!("Dumping the enqueued tasks...");
// 3. dump the tasks
let mut dump_tasks = dump.create_tasks_queue()?;
let mut count = 0;
for ret in all_tasks.iter(&rtxn)? {
@ -254,18 +252,39 @@ fn export_a_dump(
if status == Status::Enqueued {
let content_file = file_store.get_update(content_file_uuid)?;
let reader =
DocumentsBatchReader::from_reader(content_file).with_context(|| {
format!("While reading content file {:?}", content_file_uuid)
})?;
let (mut cursor, documents_batch_index) = reader.into_cursor_and_fields_index();
while let Some(doc) = cursor.next_document().with_context(|| {
format!("While iterating on content file {:?}", content_file_uuid)
})? {
dump_content_file
.push_document(&obkv_to_object(doc, &documents_batch_index)?)?;
if (
detected_version.0.as_str(),
detected_version.1.as_str(),
detected_version.2.as_str(),
) < ("1", "12", "0")
{
eprintln!("Dumping the enqueued tasks reading them in obkv format...");
let reader =
DocumentsBatchReader::from_reader(content_file).with_context(|| {
format!("While reading content file {:?}", content_file_uuid)
})?;
let (mut cursor, documents_batch_index) =
reader.into_cursor_and_fields_index();
while let Some(doc) = cursor.next_document().with_context(|| {
format!("While iterating on content file {:?}", content_file_uuid)
})? {
dump_content_file
.push_document(&obkv_to_object(doc, &documents_batch_index)?)?;
}
} else {
eprintln!(
"Dumping the enqueued tasks reading them in JSON stream format..."
);
for document in
serde_json::de::Deserializer::from_reader(content_file).into_iter()
{
let document = document.with_context(|| {
format!("While reading content file {:?}", content_file_uuid)
})?;
dump_content_file.push_document(&document)?;
}
}
dump_content_file.flush()?;
count += 1;
}
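For reference, the "JSON stream format" branch above leans on `serde_json`'s streaming deserializer: each enqueued document is stored as its own JSON object and read back one at a time. A minimal standalone sketch of that read path (the sample payload is made up; `serde_json` is assumed as a dependency):

use std::io::Cursor;

fn main() -> Result<(), serde_json::Error> {
    // Two documents concatenated in one content file, one JSON object per document.
    let content = r#"{"id": 1, "dog": "Bouvier Bernois"} {"id": 2, "dog": "kefir"}"#;
    let reader = Cursor::new(content);
    for document in serde_json::Deserializer::from_reader(reader).into_iter::<serde_json::Value>() {
        let document = document?;
        println!("{document}");
    }
    Ok(())
}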

View File

@ -1,13 +1,14 @@
mod v1_10;
mod v1_11;
mod v1_12;
mod v1_9;
use std::path::{Path, PathBuf};
use anyhow::{bail, Context};
use meilisearch_types::versioning::create_version_file;
use v1_10::v1_9_to_v1_10;
use v1_12::{v1_11_to_v1_12, v1_12_to_v1_12_3};
use crate::upgrade::v1_11::v1_10_to_v1_11;
@ -19,11 +20,48 @@ pub struct OfflineUpgrade {
impl OfflineUpgrade {
pub fn upgrade(self) -> anyhow::Result<()> {
// Adding a version?
//
// 1. Update the LAST_SUPPORTED_UPGRADE_FROM_VERSION and LAST_SUPPORTED_UPGRADE_TO_VERSION.
// 2. Add new version to the upgrade list if necessary
// 3. Use `no_upgrade` as index for versions that are compatible.
if self.current_version == self.target_version {
println!("Database is already at the target version. Exiting.");
return Ok(());
}
if self.current_version > self.target_version {
bail!(
"Cannot downgrade from {}.{}.{} to {}.{}.{}. Downgrade not supported",
self.current_version.0,
self.current_version.1,
self.current_version.2,
self.target_version.0,
self.target_version.1,
self.target_version.2
);
}
const FIRST_SUPPORTED_UPGRADE_FROM_VERSION: &str = "1.9.0";
const LAST_SUPPORTED_UPGRADE_FROM_VERSION: &str = "1.12.7";
const FIRST_SUPPORTED_UPGRADE_TO_VERSION: &str = "1.10.0";
const LAST_SUPPORTED_UPGRADE_TO_VERSION: &str = "1.12.7";
let upgrade_list = [
(v1_9_to_v1_10 as fn(&Path) -> Result<(), anyhow::Error>, "1", "10", "0"),
(
v1_9_to_v1_10 as fn(&Path, &str, &str, &str) -> Result<(), anyhow::Error>,
"1",
"10",
"0",
),
(v1_10_to_v1_11, "1", "11", "0"),
(v1_11_to_v1_12, "1", "12", "0"),
(v1_12_to_v1_12_3, "1", "12", "3"),
];
let no_upgrade: usize = upgrade_list.len();
let (current_major, current_minor, current_patch) = &self.current_version;
let start_at = match (
@ -33,8 +71,13 @@ impl OfflineUpgrade {
) {
("1", "9", _) => 0,
("1", "10", _) => 1,
("1", "11", _) => 2,
("1", "12", "0" | "1" | "2") => 3,
("1", "12", "3" | "4" | "5" | "6" | "7") => no_upgrade,
_ => {
bail!("Unsupported current version {current_major}.{current_minor}.{current_patch}. Can only upgrade from v1.9 and v1.10")
bail!("Unsupported current version {current_major}.{current_minor}.{current_patch}. Can only upgrade from versions in range [{}-{}]",
FIRST_SUPPORTED_UPGRADE_FROM_VERSION,
LAST_SUPPORTED_UPGRADE_FROM_VERSION);
}
};
@ -43,20 +86,32 @@ impl OfflineUpgrade {
let ends_at = match (target_major.as_str(), target_minor.as_str(), target_patch.as_str()) {
("1", "10", _) => 0,
("1", "11", _) => 1,
("1", "12", "0" | "1" | "2") => 2,
("1", "12", "3" | "4" | "5" | "6" | "7") => 3,
(major, _, _) if major.starts_with('v') => {
bail!("Target version must not starts with a `v`. Instead of writing `v1.9.0` write `1.9.0` for example.")
}
_ => {
bail!("Unsupported target version {target_major}.{target_minor}.{target_patch}. Can only upgrade to v1.10 and v1.11")
bail!("Unsupported target version {target_major}.{target_minor}.{target_patch}. Can only upgrade to versions in range [{}-{}]",
FIRST_SUPPORTED_UPGRADE_TO_VERSION,
LAST_SUPPORTED_UPGRADE_TO_VERSION);
}
};
println!("Starting the upgrade from {current_major}.{current_minor}.{current_patch} to {target_major}.{target_minor}.{target_patch}");
if start_at == no_upgrade {
println!("No upgrade operation to perform, writing VERSION file");
create_version_file(&self.db_path, target_major, target_minor, target_patch)
.context("while writing VERSION file after the upgrade")?;
println!("Success");
return Ok(());
}
#[allow(clippy::needless_range_loop)]
for index in start_at..=ends_at {
let (func, major, minor, patch) = upgrade_list[index];
(func)(&self.db_path)?;
(func)(&self.db_path, current_major, current_minor, current_patch)?;
println!("Done");
// We're writing the version file just in case an issue arises _while_ upgrading.
// We don't want the DB to fail in an unknown state.
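Stripped of the error handling, the upgrade routine above picks a start index and an end index into an ordered list of migration functions and runs that slice in order, writing the VERSION file after each step (per the comment above). A standalone sketch of that control flow with dummy functions (names and signatures here are illustrative only):

fn v1_9_to_v1_10(db: &str) { println!("{db}: 1.9 -> 1.10"); }
fn v1_10_to_v1_11(db: &str) { println!("{db}: 1.10 -> 1.11"); }
fn v1_11_to_v1_12(db: &str) { println!("{db}: 1.11 -> 1.12"); }
fn v1_12_to_v1_12_3(db: &str) { println!("{db}: 1.12.0-2 -> 1.12.3"); }

fn main() {
    let upgrade_list: [(fn(&str), &str); 4] = [
        (v1_9_to_v1_10, "1.10.0"),
        (v1_10_to_v1_11, "1.11.0"),
        (v1_11_to_v1_12, "1.12.0"),
        (v1_12_to_v1_12_3, "1.12.3"),
    ];
    // e.g. current version 1.10.x, target 1.12.7: run entries 1..=3.
    let (start_at, ends_at): (usize, usize) = (1, 3);
    for (func, reaches) in &upgrade_list[start_at..=ends_at] {
        func("./data.ms");
        println!("database now matches what {reaches} expects");
    }
}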

View File

@ -1,18 +1,13 @@
use anyhow::bail;
use std::path::Path;
use anyhow::Context;
use meilisearch_types::{
heed::{
types::{SerdeJson, Str},
Database, Env, EnvOpenOptions, RoTxn, RwTxn, Unspecified,
},
milli::index::{db_name, main_key},
};
use crate::{try_opening_database, try_opening_poly_database, uuid_codec::UuidCodec};
use anyhow::{bail, Context};
use meilisearch_types::heed::types::{SerdeJson, Str};
use meilisearch_types::heed::{Database, Env, EnvOpenOptions, RoTxn, RwTxn, Unspecified};
use meilisearch_types::milli::index::{db_name, main_key};
use super::v1_9;
use crate::uuid_codec::UuidCodec;
use crate::{try_opening_database, try_opening_poly_database};
pub type FieldDistribution = std::collections::BTreeMap<String, u64>;
@ -156,7 +151,12 @@ fn date_round_trip(
Ok(())
}
pub fn v1_9_to_v1_10(db_path: &Path) -> anyhow::Result<()> {
pub fn v1_9_to_v1_10(
db_path: &Path,
_origin_major: &str,
_origin_minor: &str,
_origin_patch: &str,
) -> anyhow::Result<()> {
println!("Upgrading from v1.9.0 to v1.10.0");
// 2 changes here

View File

@ -7,14 +7,19 @@
use std::path::Path;
use anyhow::Context;
use meilisearch_types::{
heed::{types::Str, Database, EnvOpenOptions},
milli::index::db_name,
};
use meilisearch_types::heed::types::Str;
use meilisearch_types::heed::{Database, EnvOpenOptions};
use meilisearch_types::milli::index::db_name;
use crate::{try_opening_database, try_opening_poly_database, uuid_codec::UuidCodec};
use crate::uuid_codec::UuidCodec;
use crate::{try_opening_database, try_opening_poly_database};
pub fn v1_10_to_v1_11(db_path: &Path) -> anyhow::Result<()> {
pub fn v1_10_to_v1_11(
db_path: &Path,
_origin_major: &str,
_origin_minor: &str,
_origin_patch: &str,
) -> anyhow::Result<()> {
println!("Upgrading from v1.10.0 to v1.11.0");
let index_scheduler_path = db_path.join("tasks");

View File

@ -0,0 +1,298 @@
//! The breaking changes that happened between v1.11 and v1.12 are:
//! - The new indexer changed the update files format from OBKV to ndjson. https://github.com/meilisearch/meilisearch/pull/4900
use std::borrow::Cow;
use std::io::BufWriter;
use std::path::Path;
use std::sync::atomic::AtomicBool;
use anyhow::Context;
use file_store::FileStore;
use indexmap::IndexMap;
use meilisearch_types::milli::documents::DocumentsBatchReader;
use meilisearch_types::milli::heed::types::{SerdeJson, Str};
use meilisearch_types::milli::heed::{Database, EnvOpenOptions, RoTxn, RwTxn};
use meilisearch_types::milli::progress::Step;
use meilisearch_types::milli::{FieldDistribution, Index};
use serde::Serialize;
use serde_json::value::RawValue;
use tempfile::NamedTempFile;
use time::OffsetDateTime;
use uuid::Uuid;
use crate::try_opening_database;
use crate::uuid_codec::UuidCodec;
pub fn v1_11_to_v1_12(
db_path: &Path,
_origin_major: &str,
_origin_minor: &str,
_origin_patch: &str,
) -> anyhow::Result<()> {
println!("Upgrading from v1.11.0 to v1.12.0");
convert_update_files(db_path)?;
Ok(())
}
pub fn v1_12_to_v1_12_3(
db_path: &Path,
origin_major: &str,
origin_minor: &str,
origin_patch: &str,
) -> anyhow::Result<()> {
println!("Upgrading from v1.12.{{0, 1, 2}} to v1.12.3");
if origin_minor == "12" {
rebuild_field_distribution(db_path)?;
} else {
println!("Not rebuilding field distribution as it wasn't corrupted coming from v{origin_major}.{origin_minor}.{origin_patch}");
}
Ok(())
}
/// Convert the update files from OBKV to ndjson format.
///
/// 1) List all the update files using the file store.
/// 2) For each update file, read the update file into a DocumentsBatchReader.
/// 3) For each document in the update file, convert the document to a JSON object.
/// 4) Write the JSON object to a tmp file in the update files directory.
/// 5) Persist the tmp file replacing the old update file.
fn convert_update_files(db_path: &Path) -> anyhow::Result<()> {
let update_files_dir_path = db_path.join("update_files");
let file_store = FileStore::new(&update_files_dir_path).with_context(|| {
format!("while creating file store for update files dir {update_files_dir_path:?}")
})?;
for uuid in file_store.all_uuids().context("while retrieving uuids from file store")? {
let uuid = uuid.context("while retrieving uuid from file store")?;
let update_file_path = file_store.get_update_path(uuid);
let update_file = file_store
.get_update(uuid)
.with_context(|| format!("while getting update file for uuid {uuid:?}"))?;
let mut file =
NamedTempFile::new_in(&update_files_dir_path).map(BufWriter::new).with_context(
|| format!("while creating bufwriter for update file {update_file_path:?}"),
)?;
let reader = DocumentsBatchReader::from_reader(update_file).with_context(|| {
format!("while creating documents batch reader for update file {update_file_path:?}")
})?;
let (mut cursor, index) = reader.into_cursor_and_fields_index();
while let Some(document) = cursor.next_document().with_context(|| {
format!(
"while reading documents from batch reader for update file {update_file_path:?}"
)
})? {
let mut json_document = IndexMap::new();
for (fid, value) in document {
let field_name = index
.name(fid)
.with_context(|| format!("while getting field name for fid {fid} for update file {update_file_path:?}"))?;
let value: &RawValue = serde_json::from_slice(value)?;
json_document.insert(field_name, value);
}
serde_json::to_writer(&mut file, &json_document)?;
}
let file = file.into_inner().map_err(|e| e.into_error()).context(format!(
"while flushing update file bufwriter for update file {update_file_path:?}"
))?;
let _ = file
// atomically replace the obkv file with the rewritten NDJSON file
.persist(&update_file_path)
.with_context(|| format!("while persisting update file {update_file_path:?}"))?;
}
Ok(())
}
/// Rebuild field distribution as it was wrongly computed in v1.12.x if x < 3
fn rebuild_field_distribution(db_path: &Path) -> anyhow::Result<()> {
let index_scheduler_path = db_path.join("tasks");
let env = unsafe { EnvOpenOptions::new().max_dbs(100).open(&index_scheduler_path) }
.with_context(|| format!("While trying to open {:?}", index_scheduler_path.display()))?;
let mut sched_wtxn = env.write_txn()?;
let index_mapping: Database<Str, UuidCodec> =
try_opening_database(&env, &sched_wtxn, "index-mapping")?;
let stats_db: Database<UuidCodec, SerdeJson<IndexStats>> =
try_opening_database(&env, &sched_wtxn, "index-stats").with_context(|| {
format!("While trying to open {:?}", index_scheduler_path.display())
})?;
let index_count =
index_mapping.len(&sched_wtxn).context("while reading the number of indexes")?;
// FIXME: not ideal, we have to pre-populate all indexes to prevent double borrow of sched_wtxn
// 1. immutably for the iteration
// 2. mutably for updating index stats
let indexes: Vec<_> = index_mapping
.iter(&sched_wtxn)?
.map(|res| res.map(|(uid, uuid)| (uid.to_owned(), uuid)))
.collect();
let progress = meilisearch_types::milli::progress::Progress::default();
let finished = AtomicBool::new(false);
std::thread::scope(|scope| {
let display_progress = std::thread::Builder::new()
.name("display_progress".into())
.spawn_scoped(scope, || {
while !finished.load(std::sync::atomic::Ordering::Relaxed) {
std::thread::sleep(std::time::Duration::from_secs(5));
let view = progress.as_progress_view();
let Ok(view) = serde_json::to_string(&view) else {
continue;
};
println!("{view}");
}
})
.unwrap();
for (index_index, result) in indexes.into_iter().enumerate() {
let (uid, uuid) = result?;
progress.update_progress(VariableNameStep::new(
&uid,
index_index as u32,
index_count as u32,
));
let index_path = db_path.join("indexes").join(uuid.to_string());
println!(
"[{}/{index_count}]Updating index `{uid}` at `{}`",
index_index + 1,
index_path.display()
);
println!("\t- Rebuilding field distribution");
let index = meilisearch_types::milli::Index::new(EnvOpenOptions::new(), &index_path)
.with_context(|| {
format!("while opening index {uid} at '{}'", index_path.display())
})?;
let mut index_txn = index.write_txn()?;
meilisearch_types::milli::update::new::reindex::field_distribution(
&index,
&mut index_txn,
&progress,
)
.context("while rebuilding field distribution")?;
let stats = IndexStats::new(&index, &index_txn)
.with_context(|| format!("computing stats for index `{uid}`"))?;
store_stats_of(stats_db, uuid, &mut sched_wtxn, &uid, &stats)?;
index_txn.commit().context("while committing the write txn for the updated index")?;
}
sched_wtxn.commit().context("while committing the write txn for the index-scheduler")?;
finished.store(true, std::sync::atomic::Ordering::Relaxed);
if let Err(panic) = display_progress.join() {
let msg = match panic.downcast_ref::<&'static str>() {
Some(s) => *s,
None => match panic.downcast_ref::<String>() {
Some(s) => &s[..],
None => "Box<dyn Any>",
},
};
eprintln!("WARN: the display thread panicked with {msg}");
}
println!("Upgrading database succeeded");
Ok(())
})
}
pub struct VariableNameStep {
name: String,
current: u32,
total: u32,
}
impl VariableNameStep {
pub fn new(name: impl Into<String>, current: u32, total: u32) -> Self {
Self { name: name.into(), current, total }
}
}
impl Step for VariableNameStep {
fn name(&self) -> Cow<'static, str> {
self.name.clone().into()
}
fn current(&self) -> u32 {
self.current
}
fn total(&self) -> u32 {
self.total
}
}
pub fn store_stats_of(
stats_db: Database<UuidCodec, SerdeJson<IndexStats>>,
index_uuid: Uuid,
sched_wtxn: &mut RwTxn,
index_uid: &str,
stats: &IndexStats,
) -> anyhow::Result<()> {
stats_db
.put(sched_wtxn, &index_uuid, stats)
.with_context(|| format!("storing stats for index `{index_uid}`"))?;
Ok(())
}
/// The statistics that can be computed from an `Index` object.
#[derive(Serialize, Debug)]
pub struct IndexStats {
/// Number of documents in the index.
pub number_of_documents: u64,
/// Size taken up by the index' DB, in bytes.
///
/// This includes the size taken by both the used and free pages of the DB, and as the free pages
/// are not returned to the disk after a deletion, this number is typically larger than
/// `used_database_size` that only includes the size of the used pages.
pub database_size: u64,
/// Size taken by the used pages of the index' DB, in bytes.
///
/// As the DB backend does not return to the disk the pages that are not currently used by the DB,
/// this value is typically smaller than `database_size`.
pub used_database_size: u64,
/// Association of every field name with the number of times it occurs in the documents.
pub field_distribution: FieldDistribution,
/// Creation date of the index.
#[serde(with = "time::serde::rfc3339")]
pub created_at: OffsetDateTime,
/// Date of the last update of the index.
#[serde(with = "time::serde::rfc3339")]
pub updated_at: OffsetDateTime,
}
impl IndexStats {
/// Compute the stats of an index
///
/// # Parameters
///
/// - rtxn: a RO transaction for the index, obtained from `Index::read_txn()`.
pub fn new(index: &Index, rtxn: &RoTxn) -> meilisearch_types::milli::Result<Self> {
Ok(IndexStats {
number_of_documents: index.number_of_documents(rtxn)?,
database_size: index.on_disk_size()?,
used_database_size: index.used_size()?,
field_distribution: index.field_distribution(rtxn)?,
created_at: index.created_at(rtxn)?,
updated_at: index.updated_at(rtxn)?,
})
}
}

View File

@ -91,8 +91,8 @@ ureq = { version = "2.10.0", features = ["json"] }
url = "2.5.2"
rayon-par-bridge = "0.1.0"
hashbrown = "0.15.0"
raw-collections = { git = "https://github.com/meilisearch/raw-collections.git", version = "0.1.0" }
bumpalo = "3.16.0"
bumparaw-collections = "0.1.2"
thread_local = "1.1.8"
allocator-api2 = "0.2.18"
rustc-hash = "2.0.0"

View File

@ -280,7 +280,7 @@ fn starts_with(selector: &str, key: &str) -> bool {
pub fn validate_document_id_str(document_id: &str) -> Option<&str> {
if document_id.is_empty()
|| document_id.len() > 512
|| document_id.len() >= 512
|| !document_id.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
{
None

View File

@ -114,7 +114,7 @@ pub enum UserError {
"Document identifier `{}` is invalid. \
A document identifier can be of type integer or string, \
only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and underscores (_), \
and can not be more than 512 bytes.", .document_id.to_string()
and can not be more than 511 bytes.", .document_id.to_string()
)]
InvalidDocumentId { document_id: Value },
#[error("Invalid facet distribution, {}", format_invalid_filter_distribution(.invalid_facets_name, .valid_facets_name))]

View File

@ -1734,6 +1734,7 @@ pub(crate) mod tests {
use crate::error::{Error, InternalError};
use crate::index::{DEFAULT_MIN_WORD_LEN_ONE_TYPO, DEFAULT_MIN_WORD_LEN_TWO_TYPOS};
use crate::progress::Progress;
use crate::update::new::indexer;
use crate::update::settings::InnerIndexSettings;
use crate::update::{
@ -1810,7 +1811,7 @@ pub(crate) mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)?;
if let Some(error) = operation_stats.into_iter().find_map(|stat| stat.error) {
@ -1829,7 +1830,7 @@ pub(crate) mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
})
.unwrap()?;
@ -1901,7 +1902,7 @@ pub(crate) mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)?;
if let Some(error) = operation_stats.into_iter().find_map(|stat| stat.error) {
@ -1920,7 +1921,7 @@ pub(crate) mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
})
.unwrap()?;
@ -1982,7 +1983,7 @@ pub(crate) mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2001,7 +2002,7 @@ pub(crate) mod tests {
&document_changes,
embedders,
&|| should_abort.load(Relaxed),
&|_| (),
&Progress::default(),
)
})
.unwrap()

View File

@ -31,6 +31,7 @@ pub mod vector;
#[macro_use]
pub mod snapshot_tests;
mod fieldids_weights_map;
pub mod progress;
use std::collections::{BTreeMap, HashMap};
use std::convert::{TryFrom, TryInto};

View File

@ -0,0 +1,152 @@
use std::any::TypeId;
use std::borrow::Cow;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, RwLock};
use serde::Serialize;
pub trait Step: 'static + Send + Sync {
fn name(&self) -> Cow<'static, str>;
fn current(&self) -> u32;
fn total(&self) -> u32;
}
#[derive(Clone, Default)]
pub struct Progress {
steps: Arc<RwLock<Vec<(TypeId, Box<dyn Step>)>>>,
}
impl Progress {
pub fn update_progress<P: Step>(&self, sub_progress: P) {
let mut steps = self.steps.write().unwrap();
let step_type = TypeId::of::<P>();
if let Some(idx) = steps.iter().position(|(id, _)| *id == step_type) {
steps.truncate(idx);
}
steps.push((step_type, Box::new(sub_progress)));
}
// TODO: This code should be in meilisearch_types but cannot because milli can't depend on meilisearch_types
pub fn as_progress_view(&self) -> ProgressView {
let steps = self.steps.read().unwrap();
let mut percentage = 0.0;
let mut prev_factors = 1.0;
let mut step_view = Vec::with_capacity(steps.len());
for (_, step) in steps.iter() {
prev_factors *= step.total() as f32;
percentage += step.current() as f32 / prev_factors;
step_view.push(ProgressStepView {
current_step: step.name(),
finished: step.current(),
total: step.total(),
});
}
ProgressView { steps: step_view, percentage: percentage * 100.0 }
}
}
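The percentage computed by `as_progress_view` weights each nested step by the product of all totals seen so far, so a sub-step only moves the bar within the share left by its parent. A standalone arithmetic sketch with made-up step values:

fn main() {
    // (current, total) for an outer step and one nested sub-step.
    let steps = [(2u32, 4u32), (3u32, 10u32)];
    let mut percentage = 0.0f32;
    let mut prev_factors = 1.0f32;
    for (current, total) in steps {
        prev_factors *= total as f32;
        percentage += current as f32 / prev_factors;
    }
    // 2/4 + 3/(4 * 10) = 0.5 + 0.075 = 0.575 -> prints 57.5%
    println!("{}%", percentage * 100.0);
}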
/// This trait lets you use the AtomicSubStep defined right below.
/// The name must be a const that never changes, but that can't be enforced by the type system because it would make the trait non object-safe.
/// By forcing the Default trait + the &'static str we make it harder to misuse the trait.
pub trait NamedStep: 'static + Send + Sync + Default {
fn name(&self) -> &'static str;
}
/// Structure to quickly define steps that need fast, lockless updates of their current value.
/// You can use this struct if:
/// - The name of the step doesn't change
/// - The total number of steps doesn't change
pub struct AtomicSubStep<Name: NamedStep> {
unit_name: Name,
current: Arc<AtomicU32>,
total: u32,
}
impl<Name: NamedStep> AtomicSubStep<Name> {
pub fn new(total: u32) -> (Arc<AtomicU32>, Self) {
let current = Arc::new(AtomicU32::new(0));
(current.clone(), Self { current, total, unit_name: Name::default() })
}
}
impl<Name: NamedStep> Step for AtomicSubStep<Name> {
fn name(&self) -> Cow<'static, str> {
self.unit_name.name().into()
}
fn current(&self) -> u32 {
self.current.load(Ordering::Relaxed)
}
fn total(&self) -> u32 {
self.total
}
}
#[macro_export]
macro_rules! make_enum_progress {
($visibility:vis enum $name:ident { $($variant:ident,)+ }) => {
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Sequence)]
#[allow(clippy::enum_variant_names)]
$visibility enum $name {
$($variant),+
}
impl Step for $name {
fn name(&self) -> Cow<'static, str> {
use convert_case::Casing;
match self {
$(
$name::$variant => stringify!($variant).from_case(convert_case::Case::Camel).to_case(convert_case::Case::Lower).into()
),+
}
}
fn current(&self) -> u32 {
*self as u32
}
fn total(&self) -> u32 {
Self::CARDINALITY as u32
}
}
};
}
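As a hedged usage sketch of this macro (the enum name and variants are hypothetical; `Step`, `Cow`, `Sequence` and `convert_case` are assumed to be in scope as at the real call sites, and the rendered name in the comment assumes convert_case splits on the camel-case boundaries):
make_enum_progress! {
    pub enum HypotheticalIndexingStep {
        PreparingPayloads,
        WritingDocuments,
        FinalizingIndex,
    }
}
// `WritingDocuments` then reports the name "writing documents", current() == 1, total() == 3,
// so it plugs directly into Progress::as_progress_view above.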
#[macro_export]
macro_rules! make_atomic_progress {
($struct_name:ident alias $atomic_struct_name:ident => $step_name:literal) => {
#[derive(Default, Debug, Clone, Copy)]
pub struct $struct_name {}
impl NamedStep for $struct_name {
fn name(&self) -> &'static str {
$step_name
}
}
pub type $atomic_struct_name = AtomicSubStep<$struct_name>;
};
}
make_atomic_progress!(Document alias AtomicDocumentStep => "document");
make_atomic_progress!(Payload alias AtomicPayloadStep => "payload");
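And a minimal sketch of driving the atomic variant from a processing loop (illustrative only; the document count and the loop body are placeholders):
let progress = Progress::default();
let (counter, step) = AtomicDocumentStep::new(1_000);
progress.update_progress(step);
for _ in 0..1_000 {
    // ... process one document ...
    counter.fetch_add(1, Ordering::Relaxed);
}
// as_progress_view() now reports a "document" step whose `finished` value follows the counter.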
#[derive(Debug, Serialize, Clone)]
#[serde(rename_all = "camelCase")]
pub struct ProgressView {
pub steps: Vec<ProgressStepView>,
pub percentage: f32,
}
#[derive(Debug, Serialize, Clone)]
#[serde(rename_all = "camelCase")]
pub struct ProgressStepView {
pub current_step: Cow<'static, str>,
pub finished: u32,
pub total: u32,
}

View File

@ -3,12 +3,13 @@ use std::collections::BTreeMap;
use std::fmt::{self, Debug};
use bumpalo::Bump;
use bumparaw_collections::{RawMap, RawVec, Value};
use liquid::model::{
ArrayView, DisplayCow, KString, KStringCow, ObjectRender, ObjectSource, ScalarCow, State,
Value as LiquidValue,
};
use liquid::{ObjectView, ValueView};
use raw_collections::{RawMap, RawVec};
use rustc_hash::FxBuildHasher;
use serde_json::value::RawValue;
use crate::update::del_add::{DelAdd, KvReaderDelAdd};
@ -195,7 +196,7 @@ impl<'doc, D: DocumentTrait<'doc> + Debug> ObjectView for ParseableDocument<'doc
}
impl<'doc, D: DocumentTrait<'doc> + Debug> ValueView for ParseableDocument<'doc, D> {
fn as_debug(&self) -> &dyn fmt::Debug {
fn as_debug(&self) -> &dyn Debug {
self
}
fn render(&self) -> liquid::model::DisplayCow<'_> {
@ -243,14 +244,13 @@ impl<'doc, D: DocumentTrait<'doc> + Debug> ValueView for ParseableDocument<'doc,
}
}
#[derive(Debug)]
struct ParseableValue<'doc> {
value: raw_collections::Value<'doc>,
value: Value<'doc, FxBuildHasher>,
}
impl<'doc> ParseableValue<'doc> {
pub fn new(value: &'doc RawValue, doc_alloc: &'doc Bump) -> Self {
let value = raw_collections::Value::from_raw_value(value, doc_alloc).unwrap();
let value = Value::from_raw_value_and_hasher(value, FxBuildHasher, doc_alloc).unwrap();
Self { value }
}
@ -260,19 +260,19 @@ impl<'doc> ParseableValue<'doc> {
}
// transparent newtype for implementing ValueView
#[repr(transparent)]
#[derive(Debug)]
struct ParseableMap<'doc>(RawMap<'doc>);
#[repr(transparent)]
struct ParseableMap<'doc>(RawMap<'doc, FxBuildHasher>);
// transparent newtype for implementing ValueView
#[repr(transparent)]
#[derive(Debug)]
#[repr(transparent)]
struct ParseableArray<'doc>(RawVec<'doc>);
impl<'doc> ParseableMap<'doc> {
pub fn as_parseable<'a>(map: &'a RawMap<'doc>) -> &'a ParseableMap<'doc> {
pub fn as_parseable<'a>(map: &'a RawMap<'doc, FxBuildHasher>) -> &'a ParseableMap<'doc> {
// SAFETY: repr(transparent)
unsafe { &*(map as *const RawMap as *const Self) }
unsafe { &*(map as *const RawMap<FxBuildHasher> as *const Self) }
}
}
@ -447,8 +447,9 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn render(&self) -> DisplayCow<'_> {
use raw_collections::value::Number;
use raw_collections::Value;
use bumparaw_collections::value::Number;
use bumparaw_collections::Value;
match &self.value {
Value::Null => LiquidValue::Nil.render(),
Value::Bool(v) => v.render(),
@ -464,8 +465,9 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn source(&self) -> DisplayCow<'_> {
use raw_collections::value::Number;
use raw_collections::Value;
use bumparaw_collections::value::Number;
use bumparaw_collections::Value;
match &self.value {
Value::Null => LiquidValue::Nil.source(),
Value::Bool(v) => ValueView::source(v),
@ -481,8 +483,9 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn type_name(&self) -> &'static str {
use raw_collections::value::Number;
use raw_collections::Value;
use bumparaw_collections::value::Number;
use bumparaw_collections::Value;
match &self.value {
Value::Null => LiquidValue::Nil.type_name(),
Value::Bool(v) => v.type_name(),
@ -498,7 +501,8 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn query_state(&self, state: State) -> bool {
use raw_collections::Value;
use bumparaw_collections::Value;
match &self.value {
Value::Null => ValueView::query_state(&LiquidValue::Nil, state),
Value::Bool(v) => ValueView::query_state(v, state),
@ -515,7 +519,8 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn to_kstr(&self) -> KStringCow<'_> {
use raw_collections::Value;
use bumparaw_collections::Value;
match &self.value {
Value::Null => ValueView::to_kstr(&LiquidValue::Nil),
Value::Bool(v) => ValueView::to_kstr(v),
@ -527,12 +532,14 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn to_value(&self) -> LiquidValue {
use raw_collections::Value;
use bumparaw_collections::value::Number;
use bumparaw_collections::Value;
match &self.value {
Value::Null => LiquidValue::Nil,
Value::Bool(v) => LiquidValue::Scalar(liquid::model::ScalarCow::new(*v)),
Value::Number(number) => match number {
raw_collections::value::Number::PosInt(number) => {
Number::PosInt(number) => {
let number: i64 = match (*number).try_into() {
Ok(number) => number,
Err(_) => {
@ -541,12 +548,8 @@ impl<'doc> ValueView for ParseableValue<'doc> {
};
LiquidValue::Scalar(ScalarCow::new(number))
}
raw_collections::value::Number::NegInt(number) => {
LiquidValue::Scalar(ScalarCow::new(*number))
}
raw_collections::value::Number::Finite(number) => {
LiquidValue::Scalar(ScalarCow::new(*number))
}
Number::NegInt(number) => LiquidValue::Scalar(ScalarCow::new(*number)),
Number::Finite(number) => LiquidValue::Scalar(ScalarCow::new(*number)),
},
Value::String(s) => LiquidValue::Scalar(liquid::model::ScalarCow::new(s.to_string())),
Value::Array(raw_vec) => ParseableArray::as_parseable(raw_vec).to_value(),
@ -555,8 +558,9 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn as_scalar(&self) -> Option<liquid::model::ScalarCow<'_>> {
use raw_collections::value::Number;
use raw_collections::Value;
use bumparaw_collections::value::Number;
use bumparaw_collections::Value;
match &self.value {
Value::Bool(v) => Some(liquid::model::ScalarCow::new(*v)),
Value::Number(number) => match number {
@ -576,34 +580,41 @@ impl<'doc> ValueView for ParseableValue<'doc> {
}
fn is_scalar(&self) -> bool {
use raw_collections::Value;
use bumparaw_collections::Value;
matches!(&self.value, Value::Bool(_) | Value::Number(_) | Value::String(_))
}
fn as_array(&self) -> Option<&dyn liquid::model::ArrayView> {
if let raw_collections::Value::Array(array) = &self.value {
if let Value::Array(array) = &self.value {
return Some(ParseableArray::as_parseable(array) as _);
}
None
}
fn is_array(&self) -> bool {
matches!(&self.value, raw_collections::Value::Array(_))
matches!(&self.value, bumparaw_collections::Value::Array(_))
}
fn as_object(&self) -> Option<&dyn ObjectView> {
if let raw_collections::Value::Object(object) = &self.value {
if let Value::Object(object) = &self.value {
return Some(ParseableMap::as_parseable(object) as _);
}
None
}
fn is_object(&self) -> bool {
matches!(&self.value, raw_collections::Value::Object(_))
matches!(&self.value, bumparaw_collections::Value::Object(_))
}
fn is_nil(&self) -> bool {
matches!(&self.value, raw_collections::Value::Null)
matches!(&self.value, bumparaw_collections::Value::Null)
}
}
impl Debug for ParseableValue<'_> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("ParseableValue").field("value", &self.value).finish()
}
}

View File

@ -38,6 +38,16 @@ pub struct RenderPromptError {
pub fault: FaultSource,
}
impl RenderPromptError {
pub(crate) fn missing_context_with_external_docid(
external_docid: String,
inner: liquid::Error,
) -> RenderPromptError {
Self {
kind: RenderPromptErrorKind::MissingContextWithExternalDocid(external_docid, inner),
fault: FaultSource::User,
}
}
pub(crate) fn missing_context(inner: liquid::Error) -> RenderPromptError {
Self { kind: RenderPromptErrorKind::MissingContext(inner), fault: FaultSource::User }
}
@ -47,6 +57,8 @@ impl RenderPromptError {
pub enum RenderPromptErrorKind {
#[error("missing field in document: {0}")]
MissingContext(liquid::Error),
#[error("missing field in document `{0}`: {1}")]
MissingContextWithExternalDocid(String, liquid::Error),
}
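For illustration, the new variant's #[error] attribute renders the external document id directly in the message (the id below is hypothetical):
// RenderPromptErrorKind::MissingContextWithExternalDocid("movie-1984".to_string(), liquid_error)
// displays as:
//   missing field in document `movie-1984`: <the underlying liquid error>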
impl From<RenderPromptError> for crate::Error {

View File

@ -119,6 +119,7 @@ impl Prompt {
'doc: 'a, // lifetime of the allocator, will live for an entire chunk of documents
>(
&self,
external_docid: &str,
document: impl crate::update::new::document::Document<'a> + Debug,
field_id_map: &RefCell<GlobalFieldsIdsMap>,
doc_alloc: &'doc Bump,
@ -130,9 +131,12 @@ impl Prompt {
self.max_bytes.unwrap_or_else(default_max_bytes).get(),
doc_alloc,
);
self.template
.render_to(&mut rendered, &context)
.map_err(RenderPromptError::missing_context)?;
self.template.render_to(&mut rendered, &context).map_err(|liquid_error| {
RenderPromptError::missing_context_with_external_docid(
external_docid.to_owned(),
liquid_error,
)
})?;
Ok(std::str::from_utf8(rendered.into_bump_slice())
.expect("render can only write UTF-8 because all inputs and processing preserve utf-8"))
}

View File

@ -219,12 +219,19 @@ impl<'a> FacetDistribution<'a> {
let facet_key = StrRefCodec::bytes_decode(facet_key).unwrap();
let key: (FieldId, _, &str) = (field_id, any_docid, facet_key);
let original_string = self
.index
.field_id_docid_facet_strings
.get(self.rtxn, &key)?
.unwrap()
.to_owned();
let optional_original_string =
self.index.field_id_docid_facet_strings.get(self.rtxn, &key)?;
let original_string = match optional_original_string {
Some(original_string) => original_string.to_owned(),
None => {
tracing::error!(
"Missing original facet string. Using the normalized facet {} instead",
facet_key
);
facet_key.to_string()
}
};
distribution.insert(original_string, nbr_docids);
if distribution.len() == self.max_values_per_facet {

View File

@ -5,6 +5,7 @@ use bumpalo::Bump;
use heed::EnvOpenOptions;
use maplit::{btreemap, hashset};
use crate::progress::Progress;
use crate::update::new::indexer;
use crate::update::{IndexDocumentsMethod, IndexerConfig, Settings};
use crate::vector::EmbeddingConfigs;
@ -72,7 +73,7 @@ pub fn setup_search_index_with_criteria(criteria: &[Criterion]) -> Index {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -91,7 +92,7 @@ pub fn setup_search_index_with_criteria(criteria: &[Criterion]) -> Index {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();

View File

@ -1,4 +1,4 @@
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use rayon::{ThreadPool, ThreadPoolBuilder};
@ -9,6 +9,8 @@ use thiserror::Error;
#[derive(Debug)]
pub struct ThreadPoolNoAbort {
thread_pool: ThreadPool,
/// The number of active operations.
active_operations: AtomicUsize,
/// Set to true if the thread pool caught a panic.
pool_catched_panic: Arc<AtomicBool>,
}
@ -19,7 +21,9 @@ impl ThreadPoolNoAbort {
OP: FnOnce() -> R + Send,
R: Send,
{
self.active_operations.fetch_add(1, Ordering::Relaxed);
let output = self.thread_pool.install(op);
self.active_operations.fetch_sub(1, Ordering::Relaxed);
// While resetting the pool panic catcher, we return an error if we caught one.
if self.pool_catched_panic.swap(false, Ordering::SeqCst) {
Err(PanicCatched)
@ -31,6 +35,11 @@ impl ThreadPoolNoAbort {
pub fn current_num_threads(&self) -> usize {
self.thread_pool.current_num_threads()
}
/// The number of active operations.
pub fn active_operations(&self) -> usize {
self.active_operations.load(Ordering::Relaxed)
}
}
#[derive(Error, Debug)]
@ -64,6 +73,10 @@ impl ThreadPoolNoAbortBuilder {
let catched_panic = pool_catched_panic.clone();
move |_result| catched_panic.store(true, Ordering::SeqCst)
});
Ok(ThreadPoolNoAbort { thread_pool: self.0.build()?, pool_catched_panic })
Ok(ThreadPoolNoAbort {
thread_pool: self.0.build()?,
active_operations: AtomicUsize::new(0),
pool_catched_panic,
})
}
}
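A small hedged sketch of consuming the new counter (the helper and its log message are hypothetical; only the two accessors shown above are assumed to exist):
/// Illustrative helper: report how many `install` calls are currently in flight.
fn log_pool_load(pool: &ThreadPoolNoAbort) {
    tracing::info!(
        active = pool.active_operations(),
        threads = pool.current_num_threads(),
        "thread pool load"
    );
}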

View File

@ -79,22 +79,29 @@ pub const FACET_MIN_LEVEL_SIZE: u8 = 5;
use std::collections::BTreeSet;
use std::fs::File;
use std::io::BufReader;
use std::ops::Bound;
use grenad::Merger;
use heed::types::{Bytes, DecodeIgnore};
use heed::BytesDecode as _;
use roaring::RoaringBitmap;
use time::OffsetDateTime;
use tracing::debug;
use self::incremental::FacetsUpdateIncremental;
use super::{FacetsUpdateBulk, MergeDeladdBtreesetString, MergeDeladdCboRoaringBitmaps};
use crate::facet::FacetType;
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec};
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec, OrderedF64Codec,
};
use crate::heed_codec::BytesRefCodec;
use crate::search::facet::get_highest_level;
use crate::update::del_add::{DelAdd, KvReaderDelAdd};
use crate::{try_split_array_at, FieldId, Index, Result};
pub mod bulk;
pub mod incremental;
pub mod new_incremental;
/// A builder used to add new elements to the `facet_id_string_docids` or `facet_id_f64_docids` databases.
///
@ -646,3 +653,194 @@ mod comparison_bench {
}
}
}
/// Run sanity checks on the specified fid tree
///
/// 1. No "orphan" child value, any child value has a parent
/// 2. Any docid in the child appears in the parent
/// 3. No docid in the parent is missing from all its children
/// 4. No group is bigger than max_group_size
/// 5. Less than 50% of groups are bigger than group_size
/// 6. The group size matches the number of children
/// 7. max_level is < 255
pub(crate) fn sanity_checks(
index: &Index,
rtxn: &heed::RoTxn,
field_id: FieldId,
facet_type: FacetType,
group_size: usize,
_min_level_size: usize, // might add a check on level size later
max_group_size: usize,
) -> Result<()> {
tracing::info!(%field_id, ?facet_type, "performing sanity checks");
let database = match facet_type {
FacetType::String => {
index.facet_id_string_docids.remap_key_type::<FacetGroupKeyCodec<BytesRefCodec>>()
}
FacetType::Number => {
index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<BytesRefCodec>>()
}
};
let leaf_prefix: FacetGroupKey<&[u8]> = FacetGroupKey { field_id, level: 0, left_bound: &[] };
let leaf_it = database.prefix_iter(rtxn, &leaf_prefix)?;
let max_level = get_highest_level(rtxn, database, field_id)?;
if max_level == u8::MAX {
panic!("max_level == 255");
}
for leaf in leaf_it {
let (leaf_facet_value, leaf_docids) = leaf?;
let mut current_level = 0;
let mut current_parent_facet_value: Option<FacetGroupKey<&[u8]>> = None;
let mut current_parent_docids: Option<crate::heed_codec::facet::FacetGroupValue> = None;
loop {
current_level += 1;
if current_level >= max_level {
break;
}
let parent_key_right_bound = FacetGroupKey {
field_id,
level: current_level,
left_bound: leaf_facet_value.left_bound,
};
let (parent_facet_value, parent_docids) = database
.get_lower_than_or_equal_to(rtxn, &parent_key_right_bound)?
.expect("no parent found");
if parent_facet_value.level != current_level {
panic!(
"wrong parent level, found_level={}, expected_level={}",
parent_facet_value.level, current_level
);
}
if parent_facet_value.field_id != field_id {
panic!("wrong parent fid");
}
if parent_facet_value.left_bound > leaf_facet_value.left_bound {
panic!("wrong parent left bound");
}
if !leaf_docids.bitmap.is_subset(&parent_docids.bitmap) {
panic!(
"missing docids from leaf in parent, current_level={}, parent={}, child={}, missing={missing:?}, child_len={}, child={:?}",
current_level,
facet_to_string(parent_facet_value.left_bound, facet_type),
facet_to_string(leaf_facet_value.left_bound, facet_type),
leaf_docids.bitmap.len(),
leaf_docids.bitmap.clone(),
missing=leaf_docids.bitmap - parent_docids.bitmap,
)
}
if let Some(current_parent_facet_value) = current_parent_facet_value {
if current_parent_facet_value.field_id != parent_facet_value.field_id {
panic!("wrong parent parent fid");
}
if current_parent_facet_value.level + 1 != parent_facet_value.level {
panic!("wrong parent parent level");
}
if current_parent_facet_value.left_bound < parent_facet_value.left_bound {
panic!("wrong parent parent left bound");
}
}
if let Some(current_parent_docids) = current_parent_docids {
if !current_parent_docids.bitmap.is_subset(&parent_docids.bitmap) {
panic!("missing docids from intermediate node in parent, parent_level={}, parent={}, intermediate={}, missing={missing:?}, intermediate={:?}",
parent_facet_value.level,
facet_to_string(parent_facet_value.left_bound, facet_type),
facet_to_string(current_parent_facet_value.unwrap().left_bound, facet_type),
current_parent_docids.bitmap.clone(),
missing=current_parent_docids.bitmap - parent_docids.bitmap,
);
}
}
current_parent_facet_value = Some(parent_facet_value);
current_parent_docids = Some(parent_docids);
}
}
tracing::info!(%field_id, ?facet_type, "checked all leaves");
let mut current_level = max_level;
let mut greater_than_group = 0usize;
let mut total = 0usize;
loop {
if current_level == 0 {
break;
}
let child_level = current_level - 1;
tracing::info!(%field_id, ?facet_type, %current_level, "checked groups for level");
let level_groups_prefix: FacetGroupKey<&[u8]> =
FacetGroupKey { field_id, level: current_level, left_bound: &[] };
let mut level_groups_it = database.prefix_iter(rtxn, &level_groups_prefix)?.peekable();
'group_it: loop {
let Some(group) = level_groups_it.next() else { break 'group_it };
let (group_facet_value, group_docids) = group?;
let child_left_bound = group_facet_value.left_bound.to_owned();
let mut expected_docids = RoaringBitmap::new();
let mut expected_size = 0usize;
let right_bound = level_groups_it
.peek()
.and_then(|res| res.as_ref().ok())
.map(|(key, _)| key.left_bound);
let child_left_bound = FacetGroupKey {
field_id,
level: child_level,
left_bound: child_left_bound.as_slice(),
};
let child_left_bound = Bound::Included(&child_left_bound);
let child_right_bound;
let child_right_bound = if let Some(right_bound) = right_bound {
child_right_bound =
FacetGroupKey { field_id, level: child_level, left_bound: right_bound };
Bound::Excluded(&child_right_bound)
} else {
Bound::Unbounded
};
let children = database.range(rtxn, &(child_left_bound, child_right_bound))?;
for child in children {
let (child_facet_value, child_docids) = child?;
if child_facet_value.field_id != field_id {
break;
}
if child_facet_value.level != child_level {
break;
}
expected_size += 1;
expected_docids |= &child_docids.bitmap;
}
assert_eq!(expected_size, group_docids.size as usize);
assert!(expected_size <= max_group_size);
assert_eq!(expected_docids, group_docids.bitmap);
total += 1;
if expected_size > group_size {
greater_than_group += 1;
}
}
current_level -= 1;
}
if greater_than_group * 2 > total {
panic!("too many groups have a size > group_size");
}
tracing::info!("sanity checks OK");
Ok(())
}
fn facet_to_string(facet_value: &[u8], facet_type: FacetType) -> String {
match facet_type {
FacetType::String => bstr::BStr::new(facet_value).to_string(),
FacetType::Number => match OrderedF64Codec::bytes_decode(facet_value) {
Ok(value) => value.to_string(),
Err(e) => format!("error: {e} (bytes: {facet_value:?})"),
},
}
}
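A hedged sketch of invoking these checks, e.g. from a crate-internal test after an incremental update; the index handle and field id are placeholders, and the FACET_* sizing constants are assumed to be the module-level defaults (only FACET_MIN_LEVEL_SIZE is visible in this diff):
let rtxn = index.read_txn()?;
sanity_checks(
    &index,
    &rtxn,
    field_id,
    FacetType::String,
    FACET_GROUP_SIZE as usize,     // assumed module constant
    FACET_MIN_LEVEL_SIZE as usize,
    FACET_MAX_GROUP_SIZE as usize, // assumed module constant
)?;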

View File

@ -0,0 +1,498 @@
use std::ops::Bound;
use heed::types::{Bytes, DecodeIgnore};
use heed::{BytesDecode as _, Database, RwTxn};
use roaring::RoaringBitmap;
use crate::facet::FacetType;
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValue, FacetGroupValueCodec,
};
use crate::heed_codec::BytesRefCodec;
use crate::search::facet::get_highest_level;
use crate::update::valid_facet_value;
use crate::{FieldId, Index, Result};
pub struct FacetsUpdateIncremental {
inner: FacetsUpdateIncrementalInner,
delta_data: Vec<FacetFieldIdChange>,
}
struct FacetsUpdateIncrementalInner {
db: Database<FacetGroupKeyCodec<BytesRefCodec>, FacetGroupValueCodec>,
field_id: FieldId,
group_size: u8,
min_level_size: u8,
max_group_size: u8,
}
impl FacetsUpdateIncremental {
pub fn new(
index: &Index,
facet_type: FacetType,
field_id: FieldId,
delta_data: Vec<FacetFieldIdChange>,
group_size: u8,
min_level_size: u8,
max_group_size: u8,
) -> Self {
FacetsUpdateIncremental {
inner: FacetsUpdateIncrementalInner {
db: match facet_type {
FacetType::String => index
.facet_id_string_docids
.remap_key_type::<FacetGroupKeyCodec<BytesRefCodec>>(),
FacetType::Number => index
.facet_id_f64_docids
.remap_key_type::<FacetGroupKeyCodec<BytesRefCodec>>(),
},
field_id,
group_size,
min_level_size,
max_group_size,
},
delta_data,
}
}
#[tracing::instrument(level = "trace", skip_all, target = "indexing::facets::incremental")]
pub fn execute(mut self, wtxn: &mut RwTxn) -> Result<()> {
if self.delta_data.is_empty() {
return Ok(());
}
self.delta_data.sort_unstable_by(
|FacetFieldIdChange { facet_value: left, .. },
FacetFieldIdChange { facet_value: right, .. }| {
left.cmp(right)
// sort in **reverse** lexicographic order
.reverse()
},
);
self.inner.find_changed_parents(wtxn, self.delta_data)?;
self.inner.add_or_delete_level(wtxn)
}
}
impl FacetsUpdateIncrementalInner {
/// WARNING: `changed_children` must be sorted in **reverse** lexicographic order.
fn find_changed_parents(
&self,
wtxn: &mut RwTxn,
mut changed_children: Vec<FacetFieldIdChange>,
) -> Result<()> {
let mut changed_parents = vec![];
for child_level in 0u8..u8::MAX {
// child_level < u8::MAX by construction
let parent_level = child_level + 1;
let parent_level_left_bound: FacetGroupKey<&[u8]> =
FacetGroupKey { field_id: self.field_id, level: parent_level, left_bound: &[] };
let mut last_parent: Option<Box<[u8]>> = None;
let mut child_it = changed_children
// drain all changed children
.drain(..)
// keep only children whose value is valid in the LMDB sense
.filter(|child| valid_facet_value(&child.facet_value));
// `while let` rather than `for` because we advance `child_it` inside of the loop
'current_level: while let Some(child) = child_it.next() {
if let Some(last_parent) = &last_parent {
if &child.facet_value >= last_parent {
self.compute_parent_group(wtxn, child_level, child.facet_value)?;
continue 'current_level;
}
}
// need to find a new parent
let parent_key_prefix = FacetGroupKey {
field_id: self.field_id,
level: parent_level,
left_bound: &*child.facet_value,
};
let parent = self
.db
.remap_data_type::<DecodeIgnore>()
.rev_range(
wtxn,
&(
Bound::Excluded(&parent_level_left_bound),
Bound::Included(&parent_key_prefix),
),
)?
.next();
match parent {
Some(Ok((parent_key, _parent_value))) => {
// found parent, cache it for next keys
last_parent = Some(parent_key.left_bound.to_owned().into_boxed_slice());
// add to modified list for parent level
changed_parents.push(FacetFieldIdChange {
facet_value: parent_key.left_bound.to_owned().into_boxed_slice(),
});
self.compute_parent_group(wtxn, child_level, child.facet_value)?;
}
Some(Err(err)) => return Err(err.into()),
None => {
// no parent for that key
let mut parent_it = self
.db
.remap_data_type::<DecodeIgnore>()
.prefix_iter_mut(wtxn, &parent_level_left_bound)?;
match parent_it.next() {
// 1. left of the current left bound, or
Some(Ok((first_key, _first_value))) => {
// make sure we don't spill on the neighboring fid (level also included defensively)
if first_key.field_id != self.field_id
|| first_key.level != parent_level
{
// max level reached, exit
drop(parent_it);
self.compute_parent_group(
wtxn,
child_level,
child.facet_value,
)?;
for child in child_it.by_ref() {
self.compute_parent_group(
wtxn,
child_level,
child.facet_value,
)?;
}
return Ok(());
}
// remove old left bound
unsafe { parent_it.del_current()? };
drop(parent_it);
changed_parents.push(FacetFieldIdChange {
facet_value: child.facet_value.clone(),
});
self.compute_parent_group(wtxn, child_level, child.facet_value)?;
// pop all elements in order to visit the new left bound
let new_left_bound =
&mut changed_parents.last_mut().unwrap().facet_value;
for child in child_it.by_ref() {
new_left_bound.clone_from(&child.facet_value);
self.compute_parent_group(
wtxn,
child_level,
child.facet_value,
)?;
}
}
Some(Err(err)) => return Err(err.into()),
// 2. max level reached, exit
None => {
drop(parent_it);
self.compute_parent_group(wtxn, child_level, child.facet_value)?;
for child in child_it.by_ref() {
self.compute_parent_group(
wtxn,
child_level,
child.facet_value,
)?;
}
return Ok(());
}
}
}
}
}
if changed_parents.is_empty() {
return Ok(());
}
drop(child_it);
std::mem::swap(&mut changed_children, &mut changed_parents);
// changed_parents is now empty because changed_children was emptied by the drain
}
Ok(())
}
fn compute_parent_group(
&self,
wtxn: &mut RwTxn<'_>,
parent_level: u8,
parent_left_bound: Box<[u8]>,
) -> Result<()> {
let mut range_left_bound: Vec<u8> = parent_left_bound.into();
if parent_level == 0 {
return Ok(());
}
let child_level = parent_level - 1;
let parent_key = FacetGroupKey {
field_id: self.field_id,
level: parent_level,
left_bound: &*range_left_bound,
};
let child_right_bound = self
.db
.remap_data_type::<DecodeIgnore>()
.get_greater_than(wtxn, &parent_key)?
.and_then(
|(
FacetGroupKey {
level: right_level,
field_id: right_fid,
left_bound: right_bound,
},
_,
)| {
if parent_level != right_level || self.field_id != right_fid {
// there was a greater key, but with a greater level or fid, so not a sibling to the parent: ignore
return None;
}
Some(right_bound.to_owned())
},
);
let child_right_bound = match &child_right_bound {
Some(right_bound) => Bound::Excluded(FacetGroupKey {
left_bound: right_bound.as_slice(),
field_id: self.field_id,
level: child_level,
}),
None => Bound::Unbounded,
};
let child_left_key = FacetGroupKey {
field_id: self.field_id,
level: child_level,
left_bound: &*range_left_bound,
};
let mut child_left_bound = Bound::Included(child_left_key);
loop {
// do a first pass on the range to find the number of children
let child_count = self
.db
.remap_data_type::<DecodeIgnore>()
.range(wtxn, &(child_left_bound, child_right_bound))?
.take(self.max_group_size as usize * 2)
.count();
let mut child_it = self.db.range(wtxn, &(child_left_bound, child_right_bound))?;
// pick the right group_size depending on the number of children
let group_size = if child_count >= self.max_group_size as usize * 2 {
// more than twice the max_group_size => there will be space for at least 2 groups of max_group_size
self.max_group_size as usize
} else if child_count >= self.group_size as usize {
// size in [group_size, max_group_size * 2[
// divided by 2 it is between [group_size / 2, max_group_size[
// this ensures that the tree is balanced
child_count / 2
} else {
// take everything
child_count
};
let res: Result<_> = child_it
.by_ref()
.take(group_size)
// stop if we go to the next level or field id
.take_while(|res| match res {
Ok((child_key, _)) => {
child_key.field_id == self.field_id && child_key.level == child_level
}
Err(_) => true,
})
.try_fold(
(None, FacetGroupValue { size: 0, bitmap: Default::default() }),
|(bounds, mut group_value), child_res| {
let (child_key, child_value) = child_res?;
let bounds = match bounds {
Some((left_bound, _)) => Some((left_bound, child_key.left_bound)),
None => Some((child_key.left_bound, child_key.left_bound)),
};
// max_group_size <= u8::MAX
group_value.size += 1;
group_value.bitmap |= &child_value.bitmap;
Ok((bounds, group_value))
},
);
let (bounds, group_value) = res?;
let Some((group_left_bound, right_bound)) = bounds else {
let update_key = FacetGroupKey {
field_id: self.field_id,
level: parent_level,
left_bound: &*range_left_bound,
};
drop(child_it);
if let Bound::Included(_) = child_left_bound {
self.db.delete(wtxn, &update_key)?;
}
break;
};
drop(child_it);
let current_left_bound = group_left_bound.to_owned();
let delete_old_bound = match child_left_bound {
Bound::Included(bound) => {
if bound.left_bound != current_left_bound {
Some(range_left_bound.clone())
} else {
None
}
}
_ => None,
};
range_left_bound.clear();
range_left_bound.extend_from_slice(right_bound);
let child_left_key = FacetGroupKey {
field_id: self.field_id,
level: child_level,
left_bound: range_left_bound.as_slice(),
};
child_left_bound = Bound::Excluded(child_left_key);
if let Some(old_bound) = delete_old_bound {
let update_key = FacetGroupKey {
field_id: self.field_id,
level: parent_level,
left_bound: old_bound.as_slice(),
};
self.db.delete(wtxn, &update_key)?;
}
let update_key = FacetGroupKey {
field_id: self.field_id,
level: parent_level,
left_bound: current_left_bound.as_slice(),
};
if group_value.bitmap.is_empty() {
self.db.delete(wtxn, &update_key)?;
} else {
self.db.put(wtxn, &update_key, &group_value)?;
}
}
Ok(())
}
/// Check whether the highest level has exceeded `min_level_size` * `self.group_size`.
/// If it has, we must build an additional level above it.
/// Otherwise, check whether the highest level is under `min_level_size`.
/// If it is, we must remove the complete level.
pub(crate) fn add_or_delete_level(&self, txn: &mut RwTxn<'_>) -> Result<()> {
let highest_level = get_highest_level(txn, self.db, self.field_id)?;
let mut highest_level_prefix = vec![];
highest_level_prefix.extend_from_slice(&self.field_id.to_be_bytes());
highest_level_prefix.push(highest_level);
let size_highest_level =
self.db.remap_types::<Bytes, Bytes>().prefix_iter(txn, &highest_level_prefix)?.count();
if size_highest_level >= self.group_size as usize * self.min_level_size as usize {
self.add_level(txn, highest_level, &highest_level_prefix, size_highest_level)
} else if size_highest_level < self.min_level_size as usize && highest_level != 0 {
self.delete_level(txn, &highest_level_prefix)
} else {
Ok(())
}
}
/// Delete a level.
fn delete_level(&self, txn: &mut RwTxn<'_>, highest_level_prefix: &[u8]) -> Result<()> {
let mut to_delete = vec![];
let mut iter =
self.db.remap_types::<Bytes, Bytes>().prefix_iter(txn, highest_level_prefix)?;
for el in iter.by_ref() {
let (k, _) = el?;
to_delete.push(
FacetGroupKeyCodec::<BytesRefCodec>::bytes_decode(k)
.map_err(heed::Error::Encoding)?
.into_owned(),
);
}
drop(iter);
for k in to_delete {
self.db.delete(txn, &k.as_ref())?;
}
Ok(())
}
/// Build an additional level for the field id.
fn add_level(
&self,
txn: &mut RwTxn<'_>,
highest_level: u8,
highest_level_prefix: &[u8],
size_highest_level: usize,
) -> Result<()> {
let mut groups_iter = self
.db
.remap_types::<Bytes, FacetGroupValueCodec>()
.prefix_iter(txn, highest_level_prefix)?;
let nbr_new_groups = size_highest_level / self.group_size as usize;
let nbr_leftover_elements = size_highest_level % self.group_size as usize;
let mut to_add = vec![];
for _ in 0..nbr_new_groups {
let mut first_key = None;
let mut values = RoaringBitmap::new();
for _ in 0..self.group_size {
let (key_bytes, value_i) = groups_iter.next().unwrap()?;
let key_i = FacetGroupKeyCodec::<BytesRefCodec>::bytes_decode(key_bytes)
.map_err(heed::Error::Encoding)?;
if first_key.is_none() {
first_key = Some(key_i);
}
values |= value_i.bitmap;
}
let key = FacetGroupKey {
field_id: self.field_id,
level: highest_level + 1,
left_bound: first_key.unwrap().left_bound,
};
let value = FacetGroupValue { size: self.group_size, bitmap: values };
to_add.push((key.into_owned(), value));
}
// now we add the rest of the level, in case its size is > group_size * min_level_size
// this can indeed happen if the min_level_size parameter changes between two calls to `insert`
if nbr_leftover_elements > 0 {
let mut first_key = None;
let mut values = RoaringBitmap::new();
for _ in 0..nbr_leftover_elements {
let (key_bytes, value_i) = groups_iter.next().unwrap()?;
let key_i = FacetGroupKeyCodec::<BytesRefCodec>::bytes_decode(key_bytes)
.map_err(heed::Error::Encoding)?;
if first_key.is_none() {
first_key = Some(key_i);
}
values |= value_i.bitmap;
}
let key = FacetGroupKey {
field_id: self.field_id,
level: highest_level + 1,
left_bound: first_key.unwrap().left_bound,
};
// Note: nbr_leftover_elements can be cast to a u8 since it is bounded by `max_group_size`
// when it is created above.
let value = FacetGroupValue { size: nbr_leftover_elements as u8, bitmap: values };
to_add.push((key.into_owned(), value));
}
drop(groups_iter);
for (key, value) in to_add {
self.db.put(txn, &key.as_ref(), &value)?;
}
Ok(())
}
}
#[derive(Debug)]
pub struct FacetFieldIdChange {
pub facet_value: Box<[u8]>,
}
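A hedged sketch of driving this updater (the index, write transaction, field id, changed values, and sizing parameters are all placeholders; only the constructor and `execute` defined above are assumed):
let delta_data = vec![
    FacetFieldIdChange { facet_value: Box::from(&b"dureuill"[..]) },
    FacetFieldIdChange { facet_value: Box::from(&b"kerollmops"[..]) },
];
FacetsUpdateIncremental::new(
    &index,
    FacetType::String,
    field_id,
    delta_data,
    4, // group_size (placeholder)
    5, // min_level_size (placeholder)
    8, // max_group_size (placeholder)
)
.execute(&mut wtxn)?;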

View File

@ -10,10 +10,14 @@ use fst::{IntoStreamer, Streamer};
pub use grenad_helpers::*;
pub use merge_functions::*;
use crate::MAX_WORD_LENGTH;
use crate::MAX_LMDB_KEY_LENGTH;
pub fn valid_lmdb_key(key: impl AsRef<[u8]>) -> bool {
key.as_ref().len() <= MAX_WORD_LENGTH * 2 && !key.as_ref().is_empty()
key.as_ref().len() <= MAX_LMDB_KEY_LENGTH - 3 && !key.as_ref().is_empty()
}
pub fn valid_facet_value(facet_value: impl AsRef<[u8]>) -> bool {
facet_value.as_ref().len() <= MAX_LMDB_KEY_LENGTH - 3 && !facet_value.as_ref().is_empty()
}
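A short sketch of the resulting boundary behaviour (MAX_LMDB_KEY_LENGTH's concrete value is not shown in this diff, so it is used symbolically):
assert!(!valid_facet_value(b""));                                 // empty values are rejected
assert!(valid_facet_value(vec![b'a'; MAX_LMDB_KEY_LENGTH - 3]));  // largest accepted length
assert!(!valid_facet_value(vec![b'a'; MAX_LMDB_KEY_LENGTH - 2])); // one byte too long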
/// Divides one slice into two at an index, returns `None` if mid is out of bounds.

View File

@ -766,6 +766,7 @@ mod tests {
use crate::documents::mmap_from_objects;
use crate::index::tests::TempIndex;
use crate::index::IndexEmbeddingConfig;
use crate::progress::Progress;
use crate::search::TermsMatchingStrategy;
use crate::update::new::indexer;
use crate::update::Setting;
@ -1964,7 +1965,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2148,7 +2149,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2163,7 +2164,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2210,7 +2211,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2225,7 +2226,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2263,7 +2264,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2278,7 +2279,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2315,7 +2316,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2330,7 +2331,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2369,7 +2370,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2384,7 +2385,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2428,7 +2429,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2443,7 +2444,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2480,7 +2481,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2495,7 +2496,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2532,7 +2533,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2547,7 +2548,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2726,7 +2727,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2741,7 +2742,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2785,7 +2786,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2800,7 +2801,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -2841,7 +2842,7 @@ mod tests {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -2856,7 +2857,7 @@ mod tests {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();
wtxn.commit().unwrap();
@ -3333,6 +3334,44 @@ mod tests {
rtxn.commit().unwrap();
}
#[test]
fn incremental_update_without_changing_facet_distribution() {
let index = TempIndex::new();
index
.add_documents(documents!([
{"id": 0, "some_field": "aaa", "other_field": "aaa" },
{"id": 1, "some_field": "bbb", "other_field": "bbb" },
]))
.unwrap();
{
let rtxn = index.read_txn().unwrap();
// count field distribution
let results = index.field_distribution(&rtxn).unwrap();
assert_eq!(Some(&2), results.get("id"));
assert_eq!(Some(&2), results.get("some_field"));
assert_eq!(Some(&2), results.get("other_field"));
}
let mut index = index;
index.index_documents_config.update_method = IndexDocumentsMethod::UpdateDocuments;
index
.add_documents(documents!([
{"id": 0, "other_field": "bbb" },
{"id": 1, "some_field": "ccc" },
]))
.unwrap();
{
let rtxn = index.read_txn().unwrap();
// count field distribution
let results = index.field_distribution(&rtxn).unwrap();
assert_eq!(Some(&2), results.get("id"));
assert_eq!(Some(&2), results.get("some_field"));
assert_eq!(Some(&2), results.get("other_field"));
}
}
#[test]
fn delete_words_exact_attributes() {
let index = TempIndex::new();

View File

@ -1,5 +1,5 @@
---
source: milli/src/update/index_documents/mod.rs
source: crates/milli/src/update/index_documents/mod.rs
---
3 0 48.9021 1 [19, ]
3 0 49.9314 1 [17, ]
@ -15,6 +15,11 @@ source: milli/src/update/index_documents/mod.rs
3 0 50.7453 1 [7, ]
3 0 50.8466 1 [10, ]
3 0 51.0537 1 [9, ]
3 1 48.9021 2 [17, 19, ]
3 1 50.1793 3 [13, 14, 15, ]
3 1 50.4502 4 [0, 3, 8, 12, ]
3 1 50.6312 2 [1, 2, ]
3 1 50.7453 3 [7, 9, 10, ]
4 0 2.271 1 [17, ]
4 0 2.3708 1 [19, ]
4 0 2.7637 1 [14, ]
@ -28,4 +33,3 @@ source: milli/src/update/index_documents/mod.rs
4 0 3.6957 1 [9, ]
4 0 3.9623 1 [12, ]
4 0 4.337 1 [10, ]

View File

@ -27,6 +27,12 @@ use crate::update::new::KvReaderFieldId;
use crate::vector::Embedding;
use crate::{CboRoaringBitmapCodec, DocumentId, Error, Index, InternalError};
/// Note that the FrameProducer requires up to 9 bytes to
/// encode the frame length, so the max grant is computed accordingly.
///
/// <https://docs.rs/bbqueue/latest/bbqueue/framed/index.html#frame-header>
const MAX_FRAME_HEADER_SIZE: usize = 9;
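A quick hedged check of the grant arithmetic used below, with a purely illustrative 10 MiB per-thread buffer:
let illustrative_capacity = 10 * 1024 * 1024; // 10 MiB, hypothetical
let max_grant = illustrative_capacity / 2 - MAX_FRAME_HEADER_SIZE;
assert_eq!(max_grant, 5_242_871);
// Entries larger than this either go through a temporary file or hit the explicit panics below.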
/// Creates a tuple of senders/receiver to be used by
/// the extractors and the writer loop.
///
@ -53,8 +59,9 @@ pub fn extractor_writer_bbqueue(
bbbuffers.resize_with(current_num_threads, || BBBuffer::new(bbbuffer_capacity));
let capacity = bbbuffers.first().unwrap().capacity();
// Read the field description to understand this
let capacity = capacity.checked_sub(9).unwrap();
// 1. Due to fragmentation in the bbbuffer, we can only accept up to half the capacity in a single message.
// 2. Read the documentation for `MAX_FRAME_HEADER_SIZE` for more information about why it is here.
let max_grant = capacity.saturating_div(2).checked_sub(MAX_FRAME_HEADER_SIZE).unwrap();
let producers = ThreadLocal::with_capacity(bbbuffers.len());
let consumers = rayon::broadcast(|bi| {
@ -65,7 +72,7 @@ pub fn extractor_writer_bbqueue(
});
let (sender, receiver) = flume::bounded(channel_capacity);
let sender = ExtractorBbqueueSender { sender, producers, capacity };
let sender = ExtractorBbqueueSender { sender, producers, max_grant };
let receiver = WriterBbqueueReceiver {
receiver,
look_at_consumer: (0..consumers.len()).cycle(),
@ -81,13 +88,10 @@ pub struct ExtractorBbqueueSender<'a> {
/// A memory buffer, one by thread, is used to serialize
/// the entries directly in this shared, lock-free space.
producers: ThreadLocal<FullySend<RefCell<FrameProducer<'a>>>>,
/// The capacity of this frame producer, will never be able to store more than that.
///
/// Note that the FrameProducer requires up to 9 bytes to encode the length,
/// the capacity has been shrunk accordingly.
///
/// <https://docs.rs/bbqueue/latest/bbqueue/framed/index.html#frame-header>
capacity: usize,
/// The maximum frame grant that a producer can reserve.
/// It will never be able to store more than that as the
/// buffer cannot split data into two parts.
max_grant: usize,
}
pub struct WriterBbqueueReceiver<'a> {
@ -443,14 +447,14 @@ impl<'b> ExtractorBbqueueSender<'b> {
}
fn delete_vector(&self, docid: DocumentId) -> crate::Result<()> {
let capacity = self.capacity;
let max_grant = self.max_grant;
let refcell = self.producers.get().unwrap();
let mut producer = refcell.0.borrow_mut_or_yield();
let payload_header = EntryHeader::ArroyDeleteVector(ArroyDeleteVector { docid });
let total_length = EntryHeader::total_delete_vector_size();
if total_length > capacity {
panic!("The entry is larger ({total_length} bytes) than the BBQueue capacity ({capacity} bytes)");
if total_length > max_grant {
panic!("The entry is larger ({total_length} bytes) than the BBQueue max grant ({max_grant} bytes)");
}
// Spin loop to have a frame the size we requested.
@ -468,7 +472,7 @@ impl<'b> ExtractorBbqueueSender<'b> {
embedder_id: u8,
embeddings: &[Vec<f32>],
) -> crate::Result<()> {
let capacity = self.capacity;
let max_grant = self.max_grant;
let refcell = self.producers.get().unwrap();
let mut producer = refcell.0.borrow_mut_or_yield();
@ -479,7 +483,7 @@ impl<'b> ExtractorBbqueueSender<'b> {
let arroy_set_vector = ArroySetVectors { docid, embedder_id, _padding: [0; 3] };
let payload_header = EntryHeader::ArroySetVectors(arroy_set_vector);
let total_length = EntryHeader::total_set_vectors_size(embeddings.len(), dimensions);
if total_length > capacity {
if total_length > max_grant {
let mut value_file = tempfile::tempfile().map(BufWriter::new)?;
for embedding in embeddings {
let mut embedding_bytes = bytemuck::cast_slice(embedding);
@ -540,14 +544,14 @@ impl<'b> ExtractorBbqueueSender<'b> {
where
F: FnOnce(&mut [u8], &mut [u8]) -> crate::Result<()>,
{
let capacity = self.capacity;
let max_grant = self.max_grant;
let refcell = self.producers.get().unwrap();
let mut producer = refcell.0.borrow_mut_or_yield();
let operation = DbOperation { database, key_length: Some(key_length) };
let payload_header = EntryHeader::DbOperation(operation);
let total_length = EntryHeader::total_key_value_size(key_length, value_length);
if total_length > capacity {
if total_length > max_grant {
let mut key_buffer = vec![0; key_length.get() as usize].into_boxed_slice();
let value_file = tempfile::tempfile()?;
value_file.set_len(value_length.try_into().unwrap())?;
@ -601,7 +605,7 @@ impl<'b> ExtractorBbqueueSender<'b> {
where
F: FnOnce(&mut [u8]) -> crate::Result<()>,
{
let capacity = self.capacity;
let max_grant = self.max_grant;
let refcell = self.producers.get().unwrap();
let mut producer = refcell.0.borrow_mut_or_yield();
@ -610,8 +614,8 @@ impl<'b> ExtractorBbqueueSender<'b> {
let operation = DbOperation { database, key_length: None };
let payload_header = EntryHeader::DbOperation(operation);
let total_length = EntryHeader::total_key_size(key_length);
if total_length > capacity {
panic!("The entry is larger ({total_length} bytes) than the BBQueue capacity ({capacity} bytes)");
if total_length > max_grant {
panic!("The entry is larger ({total_length} bytes) than the BBQueue max grant ({max_grant} bytes)");
}
// Spin loop to have a frame the size we requested.

View File

@ -1,7 +1,8 @@
use std::collections::{BTreeMap, BTreeSet};
use bumparaw_collections::RawMap;
use heed::RoTxn;
use raw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use serde_json::value::RawValue;
use super::vector_document::VectorDocument;
@ -385,12 +386,12 @@ pub type Entry<'doc> = (&'doc str, &'doc RawValue);
#[derive(Debug)]
pub struct Versions<'doc> {
data: RawMap<'doc>,
data: RawMap<'doc, FxBuildHasher>,
}
impl<'doc> Versions<'doc> {
pub fn multiple(
mut versions: impl Iterator<Item = Result<RawMap<'doc>>>,
mut versions: impl Iterator<Item = Result<RawMap<'doc, FxBuildHasher>>>,
) -> Result<Option<Self>> {
let Some(data) = versions.next() else { return Ok(None) };
let mut data = data?;
@ -403,7 +404,7 @@ impl<'doc> Versions<'doc> {
Ok(Some(Self::single(data)))
}
pub fn single(version: RawMap<'doc>) -> Self {
pub fn single(version: RawMap<'doc, FxBuildHasher>) -> Self {
Self { data: version }
}

View File

@ -69,12 +69,12 @@ use std::io::BufReader;
use std::{io, iter, mem};
use bumpalo::Bump;
use bumparaw_collections::bbbul::{BitPacker, BitPacker4x};
use bumparaw_collections::map::FrozenMap;
use bumparaw_collections::{Bbbul, FrozenBbbul};
use grenad::ReaderCursor;
use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
use raw_collections::bbbul::{BitPacker, BitPacker4x};
use raw_collections::map::FrozenMap;
use raw_collections::{Bbbul, FrozenBbbul};
use roaring::RoaringBitmap;
use rustc_hash::FxBuildHasher;
@ -177,12 +177,12 @@ impl<'extractor> BalancedCaches<'extractor> {
Ok(())
}
pub fn freeze(&mut self) -> Result<Vec<FrozenCache<'_, 'extractor>>> {
pub fn freeze(&mut self, source_id: usize) -> Result<Vec<FrozenCache<'_, 'extractor>>> {
match &mut self.caches {
InnerCaches::Normal(NormalCaches { caches }) => caches
.iter_mut()
.enumerate()
.map(|(bucket, map)| {
.map(|(bucket_id, map)| {
// safety: we are transmuting the Bbbul into a FrozenBbbul
// that are the same size.
let map = unsafe {
@ -201,14 +201,19 @@ impl<'extractor> BalancedCaches<'extractor> {
>,
>(map)
};
Ok(FrozenCache { bucket, cache: FrozenMap::new(map), spilled: Vec::new() })
Ok(FrozenCache {
source_id,
bucket_id,
cache: FrozenMap::new(map),
spilled: Vec::new(),
})
})
.collect(),
InnerCaches::Spilling(SpillingCaches { caches, spilled_entries, .. }) => caches
.iter_mut()
.zip(mem::take(spilled_entries))
.enumerate()
.map(|(bucket, (map, sorter))| {
.map(|(bucket_id, (map, sorter))| {
let spilled = sorter
.into_reader_cursors()?
.into_iter()
@ -234,7 +239,7 @@ impl<'extractor> BalancedCaches<'extractor> {
>,
>(map)
};
Ok(FrozenCache { bucket, cache: FrozenMap::new(map), spilled })
Ok(FrozenCache { source_id, bucket_id, cache: FrozenMap::new(map), spilled })
})
.collect(),
}
@ -440,7 +445,8 @@ fn spill_entry_to_sorter(
}
pub struct FrozenCache<'a, 'extractor> {
bucket: usize,
bucket_id: usize,
source_id: usize,
cache: FrozenMap<
'a,
'extractor,
@ -457,9 +463,9 @@ pub fn transpose_and_freeze_caches<'a, 'extractor>(
let width = caches.first().map(BalancedCaches::buckets).unwrap_or(0);
let mut bucket_caches: Vec<_> = iter::repeat_with(Vec::new).take(width).collect();
for thread_cache in caches {
for frozen in thread_cache.freeze()? {
bucket_caches[frozen.bucket].push(frozen);
for (thread_index, thread_cache) in caches.iter_mut().enumerate() {
for frozen in thread_cache.freeze(thread_index)? {
bucket_caches[frozen.bucket_id].push(frozen);
}
}
@ -477,21 +483,16 @@ where
F: for<'a> FnMut(&'a [u8], DelAddRoaringBitmap) -> Result<()>,
{
let mut maps = Vec::new();
let mut readers = Vec::new();
let mut current_bucket = None;
for FrozenCache { bucket, cache, ref mut spilled } in frozen {
assert_eq!(*current_bucket.get_or_insert(bucket), bucket);
maps.push(cache);
readers.append(spilled);
}
// First manage the spilled entries by looking into the HashMaps,
// merge them and mark them as dummy.
let mut heap = BinaryHeap::new();
for (source_index, source) in readers.into_iter().enumerate() {
let mut cursor = source.into_cursor()?;
if cursor.move_on_next()?.is_some() {
heap.push(Entry { cursor, source_index });
let mut current_bucket = None;
for FrozenCache { source_id, bucket_id, cache, spilled } in frozen {
assert_eq!(*current_bucket.get_or_insert(bucket_id), bucket_id);
maps.push((source_id, cache));
for reader in spilled {
let mut cursor = reader.into_cursor()?;
if cursor.move_on_next()?.is_some() {
heap.push(Entry { cursor, source_id });
}
}
}
@ -508,25 +509,29 @@ where
let mut output = DelAddRoaringBitmap::from_bytes(first_value)?;
while let Some(mut entry) = heap.peek_mut() {
if let Some((key, _value)) = entry.cursor.current() {
if first_key == key {
let new = DelAddRoaringBitmap::from_bytes(first_value)?;
output = output.merge(new);
// When we are done with the current value of this entry, we move
// its cursor forward and let the heap reorganize itself (on drop)
if entry.cursor.move_on_next()?.is_none() {
PeekMut::pop(entry);
}
} else {
if let Some((key, value)) = entry.cursor.current() {
if first_key != key {
break;
}
let new = DelAddRoaringBitmap::from_bytes(value)?;
output = output.merge(new);
// When we are done with the current value of this entry, we move
// its cursor forward and let the heap reorganize itself (on drop)
if entry.cursor.move_on_next()?.is_none() {
PeekMut::pop(entry);
}
}
}
// Once we merged all of the spilled bitmaps we must also
// fetch the entries from the non-spilled entries (the HashMaps).
for (map_index, map) in maps.iter_mut().enumerate() {
if first_entry.source_index != map_index {
for (source_id, map) in maps.iter_mut() {
debug_assert!(
!(map.get(first_key).is_some() && first_entry.source_id == *source_id),
"A thread should not have spiled a key that has been inserted in the cache"
);
if first_entry.source_id != *source_id {
if let Some(new) = map.get_mut(first_key) {
output.union_and_clear_bbbul(new);
}
@ -538,12 +543,12 @@ where
// Don't forget to put the first entry back into the heap.
if first_entry.cursor.move_on_next()?.is_some() {
heap.push(first_entry)
heap.push(first_entry);
}
}
// Then manage the content of the HashMap entries that weren't taken (mem::take).
while let Some(mut map) = maps.pop() {
while let Some((_, mut map)) = maps.pop() {
// Make sure we don't try to work with entries already managed by the spilled
let mut ordered_entries: Vec<_> =
map.iter_mut().filter(|(_, bbbul)| !bbbul.is_empty()).collect();
@ -553,7 +558,7 @@ where
let mut output = DelAddRoaringBitmap::empty();
output.union_and_clear_bbbul(bbbul);
for rhs in maps.iter_mut() {
for (_, rhs) in maps.iter_mut() {
if let Some(new) = rhs.get_mut(key) {
output.union_and_clear_bbbul(new);
}
@ -569,14 +574,14 @@ where
struct Entry<R> {
cursor: ReaderCursor<R>,
source_index: usize,
source_id: usize,
}
impl<R> Ord for Entry<R> {
fn cmp(&self, other: &Entry<R>) -> Ordering {
let skey = self.cursor.current().map(|(k, _)| k);
let okey = other.cursor.current().map(|(k, _)| k);
skey.cmp(&okey).then(self.source_index.cmp(&other.source_index)).reverse()
skey.cmp(&okey).then(self.source_id.cmp(&other.source_id)).reverse()
}
}
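The final `.reverse()` is what turns the max-heap `BinaryHeap` into a min-heap over the cursors' current keys; a hedged illustration (keys and source ids are hypothetical):
// With the ordering above, heap.peek() always yields the entry whose cursor currently
// points at the lexicographically smallest key, breaking ties on the smallest source_id:
//   (b"aaa", source_id 1) before (b"aaa", source_id 3) before (b"abc", source_id 0)
// which is exactly what the k-way merge over the spilled readers above relies on.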

View File

@ -89,7 +89,8 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for DocumentsExtractor<'a, 'b> {
.or_default();
*entry -= 1;
}
let content = update.updated();
let content =
update.merged(&context.rtxn, context.index, &context.db_fields_ids_map)?;
let geo_iter =
content.geo_field().transpose().map(|res| res.map(|rv| ("_geo", rv)));
for res in content.iter_top_level_fields().chain(geo_iter) {

View File

@ -16,10 +16,10 @@ use crate::update::del_add::DelAdd;
use crate::update::new::channel::FieldIdDocidFacetSender;
use crate::update::new::extract::perm_json_p;
use crate::update::new::indexer::document_changes::{
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext, Progress,
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext,
};
use crate::update::new::ref_cell_ext::RefCellExt as _;
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::{FullySend, ThreadLocal};
use crate::update::new::DocumentChange;
use crate::update::GrenadParameters;
@ -283,42 +283,60 @@ impl FacetedDocidsExtractor {
}
struct DelAddFacetValue<'doc> {
strings: HashMap<(FieldId, BVec<'doc, u8>), DelAdd, hashbrown::DefaultHashBuilder, &'doc Bump>,
strings: HashMap<
(FieldId, &'doc str),
Option<BVec<'doc, u8>>,
hashbrown::DefaultHashBuilder,
&'doc Bump,
>,
f64s: HashMap<(FieldId, BVec<'doc, u8>), DelAdd, hashbrown::DefaultHashBuilder, &'doc Bump>,
doc_alloc: &'doc Bump,
}
impl<'doc> DelAddFacetValue<'doc> {
fn new(doc_alloc: &'doc Bump) -> Self {
Self { strings: HashMap::new_in(doc_alloc), f64s: HashMap::new_in(doc_alloc) }
Self { strings: HashMap::new_in(doc_alloc), f64s: HashMap::new_in(doc_alloc), doc_alloc }
}
fn insert_add(&mut self, fid: FieldId, value: BVec<'doc, u8>, kind: FacetKind) {
let cache = match kind {
FacetKind::String => &mut self.strings,
FacetKind::Number => &mut self.f64s,
_ => return,
};
let key = (fid, value);
if let Some(DelAdd::Deletion) = cache.get(&key) {
cache.remove(&key);
} else {
cache.insert(key, DelAdd::Addition);
match kind {
FacetKind::Number => {
let key = (fid, value);
if let Some(DelAdd::Deletion) = self.f64s.get(&key) {
self.f64s.remove(&key);
} else {
self.f64s.insert(key, DelAdd::Addition);
}
}
FacetKind::String => {
if let Ok(s) = std::str::from_utf8(&value) {
let normalized = crate::normalize_facet(s);
let truncated = self.doc_alloc.alloc_str(truncate_str(&normalized));
self.strings.insert((fid, truncated), Some(value));
}
}
_ => (),
}
}
fn insert_del(&mut self, fid: FieldId, value: BVec<'doc, u8>, kind: FacetKind) {
let cache = match kind {
FacetKind::String => &mut self.strings,
FacetKind::Number => &mut self.f64s,
_ => return,
};
let key = (fid, value);
if let Some(DelAdd::Addition) = cache.get(&key) {
cache.remove(&key);
} else {
cache.insert(key, DelAdd::Deletion);
match kind {
FacetKind::Number => {
let key = (fid, value);
if let Some(DelAdd::Addition) = self.f64s.get(&key) {
self.f64s.remove(&key);
} else {
self.f64s.insert(key, DelAdd::Deletion);
}
}
FacetKind::String => {
if let Ok(s) = std::str::from_utf8(&value) {
let normalized = crate::normalize_facet(s);
let truncated = self.doc_alloc.alloc_str(truncate_str(&normalized));
self.strings.insert((fid, truncated), None);
}
}
_ => (),
}
}
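A hedged illustration of why the string map is now keyed by the normalized, truncated value (the document update below is hypothetical, and it assumes deletions of the old version are recorded before additions of the new one, as in the update path):
// Updating `"facet": "DUREUILL"` to `"facet": "dureuill"` on one document yields
//   insert_del(fid, b"DUREUILL")  then  insert_add(fid, b"dureuill").
// Both normalize to the key (fid, "dureuill"), so the later insert_add overwrites the
// pending deletion with Some(raw_value), and send_to_db keeps the
// field_id_docid_facet_strings entry for that normalized key instead of deleting it.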
@ -329,18 +347,14 @@ impl<'doc> DelAddFacetValue<'doc> {
doc_alloc: &Bump,
) -> crate::Result<()> {
let mut buffer = bumpalo::collections::Vec::new_in(doc_alloc);
for ((fid, value), deladd) in self.strings {
if let Ok(s) = std::str::from_utf8(&value) {
buffer.clear();
buffer.extend_from_slice(&fid.to_be_bytes());
buffer.extend_from_slice(&docid.to_be_bytes());
let normalized = crate::normalize_facet(s);
let truncated = truncate_str(&normalized);
buffer.extend_from_slice(truncated.as_bytes());
match deladd {
DelAdd::Deletion => sender.delete_facet_string(&buffer)?,
DelAdd::Addition => sender.write_facet_string(&buffer, &value)?,
}
for ((fid, truncated), value) in self.strings {
buffer.clear();
buffer.extend_from_slice(&fid.to_be_bytes());
buffer.extend_from_slice(&docid.to_be_bytes());
buffer.extend_from_slice(truncated.as_bytes());
match &value {
Some(value) => sender.write_facet_string(&buffer, value)?,
None => sender.delete_facet_string(&buffer)?,
}
}
@ -373,26 +387,16 @@ fn truncate_str(s: &str) -> &str {
impl FacetedDocidsExtractor {
#[tracing::instrument(level = "trace", skip_all, target = "indexing::extract::faceted")]
pub fn run_extraction<
'pl,
'fid,
'indexer,
'index,
'extractor,
DC: DocumentChanges<'pl>,
MSP,
SP,
>(
pub fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP>(
grenad_parameters: GrenadParameters,
document_changes: &DC,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
sender: &FieldIdDocidFacetSender,
step: Step,
step: IndexingStep,
) -> Result<Vec<BalancedCaches<'extractor>>>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
let index = indexing_context.index;
let rtxn = index.read_txn()?;

View File

@ -15,23 +15,22 @@ pub use geo::*;
pub use searchable::*;
pub use vectors::EmbeddingExtractor;
use super::indexer::document_changes::{DocumentChanges, IndexingContext, Progress};
use super::steps::Step;
use super::indexer::document_changes::{DocumentChanges, IndexingContext};
use super::steps::IndexingStep;
use super::thread_local::{FullySend, ThreadLocal};
use crate::update::GrenadParameters;
use crate::Result;
pub trait DocidsExtractor {
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP, SP>(
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP>(
grenad_parameters: GrenadParameters,
document_changes: &DC,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
step: Step,
step: IndexingStep,
) -> Result<Vec<BalancedCaches<'extractor>>>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync;
MSP: Fn() -> bool + Sync;
}
/// TODO move in permissive json pointer

View File

@ -11,10 +11,10 @@ use super::tokenize_document::{tokenizer_builder, DocumentTokenizer};
use crate::update::new::extract::cache::BalancedCaches;
use crate::update::new::extract::perm_json_p::contained_in;
use crate::update::new::indexer::document_changes::{
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext, Progress,
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext,
};
use crate::update::new::ref_cell_ext::RefCellExt as _;
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::{FullySend, MostlySend, ThreadLocal};
use crate::update::new::DocumentChange;
use crate::update::GrenadParameters;
@ -28,7 +28,7 @@ pub struct WordDocidsBalancedCaches<'extractor> {
exact_word_docids: BalancedCaches<'extractor>,
word_position_docids: BalancedCaches<'extractor>,
fid_word_count_docids: BalancedCaches<'extractor>,
fid_word_count: HashMap<FieldId, (usize, usize)>,
fid_word_count: HashMap<FieldId, (Option<usize>, Option<usize>)>,
current_docid: Option<DocumentId>,
}
@ -85,8 +85,8 @@ impl<'extractor> WordDocidsBalancedCaches<'extractor> {
self.fid_word_count
.entry(field_id)
.and_modify(|(_current_count, new_count)| *new_count += 1)
.or_insert((0, 1));
.and_modify(|(_current_count, new_count)| *new_count.get_or_insert(0) += 1)
.or_insert((None, Some(1)));
self.current_docid = Some(docid);
Ok(())
@ -130,8 +130,8 @@ impl<'extractor> WordDocidsBalancedCaches<'extractor> {
self.fid_word_count
.entry(field_id)
.and_modify(|(current_count, _new_count)| *current_count += 1)
.or_insert((1, 0));
.and_modify(|(current_count, _new_count)| *current_count.get_or_insert(0) += 1)
.or_insert((Some(1), None));
self.current_docid = Some(docid);
@ -141,14 +141,18 @@ impl<'extractor> WordDocidsBalancedCaches<'extractor> {
fn flush_fid_word_count(&mut self, buffer: &mut BumpVec<u8>) -> Result<()> {
for (fid, (current_count, new_count)) in self.fid_word_count.drain() {
if current_count != new_count {
if current_count <= MAX_COUNTED_WORDS {
if let Some(current_count) =
current_count.filter(|current_count| *current_count <= MAX_COUNTED_WORDS)
{
buffer.clear();
buffer.extend_from_slice(&fid.to_be_bytes());
buffer.push(current_count as u8);
self.fid_word_count_docids
.insert_del_u32(buffer, self.current_docid.unwrap())?;
}
if new_count <= MAX_COUNTED_WORDS {
if let Some(new_count) =
new_count.filter(|new_count| *new_count <= MAX_COUNTED_WORDS)
{
buffer.clear();
buffer.extend_from_slice(&fid.to_be_bytes());
buffer.push(new_count as u8);
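
Storing the per-field word counts as `(Option<usize>, Option<usize>)` distinguishes "this side was never seen" from "this side counted zero words", so the flush only writes or deletes field/count pairs that were actually observed. A small sketch of that bookkeeping, with `MAX_COUNTED_WORDS` assumed to be a small cap as in the extractor:

```rust
use std::collections::HashMap;

type FieldId = u16;
const MAX_COUNTED_WORDS: usize = 30; // assumed small cap, as in the extractor

fn main() {
    let mut fid_word_count: HashMap<FieldId, (Option<usize>, Option<usize>)> = HashMap::new();

    // A word seen in the *new* version of field 7 only touches the second slot.
    fid_word_count
        .entry(7)
        .and_modify(|(_current, new)| *new.get_or_insert(0) += 1)
        .or_insert((None, Some(1)));

    // At flush time, only sides that were actually counted (Some) and small
    // enough are emitted; a field absent from the current version no longer
    // produces a spurious deletion for a fake count of zero.
    for (fid, (current_count, new_count)) in fid_word_count.drain() {
        if current_count != new_count {
            if let Some(c) = current_count.filter(|c| *c <= MAX_COUNTED_WORDS) {
                println!("delete fid {fid} word-count {c}");
            }
            if let Some(n) = new_count.filter(|n| *n <= MAX_COUNTED_WORDS) {
                println!("write fid {fid} word-count {n}");
            }
        }
    }
}
```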
@ -235,25 +239,15 @@ impl<'a, 'extractor> Extractor<'extractor> for WordDocidsExtractorData<'a> {
pub struct WordDocidsExtractors;
impl WordDocidsExtractors {
pub fn run_extraction<
'pl,
'fid,
'indexer,
'index,
'extractor,
DC: DocumentChanges<'pl>,
MSP,
SP,
>(
pub fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP>(
grenad_parameters: GrenadParameters,
document_changes: &DC,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
step: Step,
step: IndexingStep,
) -> Result<WordDocidsCaches<'extractor>>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
let index = indexing_context.index;
let rtxn = index.read_txn()?;

View File

@ -14,9 +14,9 @@ use tokenize_document::{tokenizer_builder, DocumentTokenizer};
use super::cache::BalancedCaches;
use super::DocidsExtractor;
use crate::update::new::indexer::document_changes::{
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext, Progress,
extract, DocumentChangeContext, DocumentChanges, Extractor, IndexingContext,
};
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::{FullySend, ThreadLocal};
use crate::update::new::DocumentChange;
use crate::update::GrenadParameters;
@ -56,16 +56,15 @@ impl<'a, 'extractor, EX: SearchableExtractor + Sync> Extractor<'extractor>
}
pub trait SearchableExtractor: Sized + Sync {
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP, SP>(
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP>(
grenad_parameters: GrenadParameters,
document_changes: &DC,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
step: Step,
step: IndexingStep,
) -> Result<Vec<BalancedCaches<'extractor>>>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
let rtxn = indexing_context.index.read_txn()?;
let stop_words = indexing_context.index.stop_words(&rtxn)?;
@ -134,16 +133,15 @@ pub trait SearchableExtractor: Sized + Sync {
}
impl<T: SearchableExtractor> DocidsExtractor for T {
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP, SP>(
fn run_extraction<'pl, 'fid, 'indexer, 'index, 'extractor, DC: DocumentChanges<'pl>, MSP>(
grenad_parameters: GrenadParameters,
document_changes: &DC,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
indexing_context: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
step: Step,
step: IndexingStep,
) -> Result<Vec<BalancedCaches<'extractor>>>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
Self::run_extraction(
grenad_parameters,

View File

@ -176,9 +176,10 @@ pub fn tokenizer_builder<'a>(
#[cfg(test)]
mod test {
use bumpalo::Bump;
use bumparaw_collections::RawMap;
use charabia::TokenizerBuilder;
use meili_snap::snapshot;
use raw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use serde_json::json;
use serde_json::value::RawValue;
@ -234,7 +235,7 @@ mod test {
let bump = Bump::new();
let document: &RawValue = serde_json::from_str(&document).unwrap();
let document = RawMap::from_raw_value(document, &bump).unwrap();
let document = RawMap::from_raw_value_and_hasher(document, FxBuildHasher, &bump).unwrap();
let document = Versions::single(document);
let document = DocumentFromVersions::new(&document);
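
Across the diff, `raw_collections::RawMap::from_raw_value` is replaced by `bumparaw_collections::RawMap::from_raw_value_and_hasher` with an explicit `FxBuildHasher`. A minimal usage sketch of the call shape seen in this test, assuming the `bumpalo`, `bumparaw_collections`, `rustc-hash` and `serde_json` crates as used here:

```rust
use bumpalo::Bump;
use bumparaw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use serde_json::value::RawValue;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bump = Bump::new();
    let raw: &RawValue = serde_json::from_str(r#"{ "id": 1, "name": "kefir" }"#)?;

    // Same call shape as in the updated test: the hasher is now passed explicitly.
    let _map = RawMap::from_raw_value_and_hasher(raw, FxBuildHasher, &bump)?;
    // `_map` borrows its keys and values from `bump` and hashes keys with FxBuildHasher.
    Ok(())
}
```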

View File

@ -130,6 +130,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
);
} else if new_vectors.regenerate {
let new_rendered = prompt.render_document(
update.external_document_id(),
update.current(
&context.rtxn,
context.index,
@ -139,6 +140,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
&context.doc_alloc,
)?;
let old_rendered = prompt.render_document(
update.external_document_id(),
update.merged(
&context.rtxn,
context.index,
@ -158,6 +160,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
}
} else if old_vectors.regenerate {
let old_rendered = prompt.render_document(
update.external_document_id(),
update.current(
&context.rtxn,
context.index,
@ -167,6 +170,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
&context.doc_alloc,
)?;
let new_rendered = prompt.render_document(
update.external_document_id(),
update.merged(
&context.rtxn,
context.index,
@ -216,6 +220,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
);
} else if new_vectors.regenerate {
let rendered = prompt.render_document(
insertion.external_document_id(),
insertion.inserted(),
context.new_fields_ids_map,
&context.doc_alloc,
@ -229,6 +234,7 @@ impl<'a, 'b, 'extractor> Extractor<'extractor> for EmbeddingExtractor<'a, 'b> {
}
} else {
let rendered = prompt.render_document(
insertion.external_document_id(),
insertion.inserted(),
context.new_fields_ids_map,
&context.doc_alloc,

View File

@ -103,6 +103,8 @@ impl<'indexer> FacetSearchBuilder<'indexer> {
#[tracing::instrument(level = "trace", skip_all, target = "indexing::facet_fst")]
pub fn merge_and_write(self, index: &Index, wtxn: &mut RwTxn, rtxn: &RoTxn) -> Result<()> {
tracing::trace!("merge facet strings for facet search: {:?}", self.registered_facets);
let reader = self.normalized_facet_string_docids_sorter.into_reader_cursors()?;
let mut builder = grenad::MergerBuilder::new(MergeDeladdBtreesetString);
builder.extend(reader);
@ -118,12 +120,15 @@ impl<'indexer> FacetSearchBuilder<'indexer> {
BEU16StrCodec::bytes_decode(key).map_err(heed::Error::Encoding)?;
if current_field_id != Some(field_id) {
if let Some(fst_merger_builder) = fst_merger_builder {
if let (Some(current_field_id), Some(fst_merger_builder)) =
(current_field_id, fst_merger_builder)
{
let mmap = fst_merger_builder.build(&mut callback)?;
index
.facet_id_string_fst
.remap_data_type::<Bytes>()
.put(wtxn, &field_id, &mmap)?;
index.facet_id_string_fst.remap_data_type::<Bytes>().put(
wtxn,
&current_field_id,
&mmap,
)?;
}
fst = index.facet_id_string_fst.get(rtxn, &field_id)?;
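
The change above fixes a classic group-by-key pitfall: when the field id changes, the FST accumulated so far must be written under the previous field id, not under the one that triggered the flush. A schematic sketch of the corrected pattern, with a hypothetical `flush` standing in for the `facet_id_string_fst` write:

```rust
fn flush(field_id: u16, values: &[&str]) {
    println!("writing fst for field {field_id}: {values:?}");
}

fn main() {
    // (field_id, facet string) pairs, sorted by field id as produced by the merger.
    let entries = [(0u16, "blue"), (0, "red"), (1, "dog")];

    let mut current_field_id: Option<u16> = None;
    let mut buffer: Vec<&str> = Vec::new();

    for (field_id, value) in entries {
        if current_field_id != Some(field_id) {
            // Flush under the *previous* field id, if any; flushing under
            // `field_id` at this point was the bug fixed in the hunk above.
            if let Some(previous) = current_field_id.take() {
                flush(previous, &buffer);
                buffer.clear();
            }
            current_field_id = Some(field_id);
        }
        buffer.push(value);
    }
    if let Some(previous) = current_field_id {
        flush(previous, &buffer);
    }
}
```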

View File

@ -1,6 +1,8 @@
use std::ops::ControlFlow;
use bumpalo::Bump;
use bumparaw_collections::RawVec;
use rustc_hash::FxBuildHasher;
use serde::de::{DeserializeSeed, Deserializer as _, Visitor};
use serde_json::value::RawValue;
@ -360,7 +362,7 @@ impl<'a> DeserrRawValue<'a> {
}
pub struct DeserrRawVec<'a> {
vec: raw_collections::RawVec<'a>,
vec: RawVec<'a>,
alloc: &'a Bump,
}
@ -379,7 +381,7 @@ impl<'a> deserr::Sequence for DeserrRawVec<'a> {
}
pub struct DeserrRawVecIter<'a> {
it: raw_collections::vec::iter::IntoIter<'a>,
it: bumparaw_collections::vec::iter::IntoIter<'a>,
alloc: &'a Bump,
}
@ -393,7 +395,7 @@ impl<'a> Iterator for DeserrRawVecIter<'a> {
}
pub struct DeserrRawMap<'a> {
map: raw_collections::RawMap<'a>,
map: bumparaw_collections::RawMap<'a, FxBuildHasher>,
alloc: &'a Bump,
}
@ -416,7 +418,7 @@ impl<'a> deserr::Map for DeserrRawMap<'a> {
}
pub struct DeserrRawMapIter<'a> {
it: raw_collections::map::iter::IntoIter<'a>,
it: bumparaw_collections::map::iter::IntoIter<'a>,
alloc: &'a Bump,
}
@ -615,7 +617,7 @@ impl<'de> Visitor<'de> for DeserrRawValueVisitor<'de> {
where
A: serde::de::SeqAccess<'de>,
{
let mut raw_vec = raw_collections::RawVec::new_in(self.alloc);
let mut raw_vec = RawVec::new_in(self.alloc);
while let Some(next) = seq.next_element()? {
raw_vec.push(next);
}

View File

@ -1,4 +1,5 @@
use std::cell::{Cell, RefCell};
use std::sync::atomic::Ordering;
use std::sync::{Arc, RwLock};
use bumpalo::Bump;
@ -7,8 +8,9 @@ use rayon::iter::IndexedParallelIterator;
use super::super::document_change::DocumentChange;
use crate::fields_ids_map::metadata::FieldIdMapWithMetadata;
use crate::progress::{AtomicDocumentStep, Progress};
use crate::update::new::parallel_iterator_ext::ParallelIteratorExt as _;
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::{FullySend, MostlySend, ThreadLocal};
use crate::{FieldsIdsMap, GlobalFieldsIdsMap, Index, InternalError, Result};
@ -133,10 +135,8 @@ pub struct IndexingContext<
'indexer, // covariant lifetime of objects that are borrowed during the entire indexing operation
'index, // covariant lifetime of the index
MSP,
SP,
> where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
pub index: &'index Index,
pub db_fields_ids_map: &'indexer FieldsIdsMap,
@ -144,7 +144,7 @@ pub struct IndexingContext<
pub doc_allocs: &'indexer ThreadLocal<FullySend<Cell<Bump>>>,
pub fields_ids_map_store: &'indexer ThreadLocal<FullySend<RefCell<GlobalFieldsIdsMap<'fid>>>>,
pub must_stop_processing: &'indexer MSP,
pub send_progress: &'indexer SP,
pub progress: &'indexer Progress,
}
impl<
@ -152,18 +152,15 @@ impl<
'indexer, // covariant lifetime of objects that are borrowed during the entire indexing operation
'index, // covariant lifetime of the index
MSP,
SP,
> Copy
for IndexingContext<
'fid, // invariant lifetime of fields ids map
'indexer, // covariant lifetime of objects that are borrowed during the entire indexing operation
'index, // covariant lifetime of the index
MSP,
SP,
>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
}
@ -172,18 +169,15 @@ impl<
'indexer, // covariant lifetime of objects that are borrowed during the entire indexing operation
'index, // covariant lifetime of the index
MSP,
SP,
> Clone
for IndexingContext<
'fid, // invariant lifetime of fields ids map
'indexer, // covariant lifetime of objects that are borrowed during the entire indexing operation
'index, // covariant lifetime of the index
MSP,
SP,
>
where
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
fn clone(&self) -> Self {
*self
@ -202,7 +196,6 @@ pub fn extract<
EX,
DC: DocumentChanges<'pl>,
MSP,
SP,
>(
document_changes: &DC,
extractor: &EX,
@ -213,18 +206,18 @@ pub fn extract<
doc_allocs,
fields_ids_map_store,
must_stop_processing,
send_progress,
}: IndexingContext<'fid, 'indexer, 'index, MSP, SP>,
progress,
}: IndexingContext<'fid, 'indexer, 'index, MSP>,
extractor_allocs: &'extractor mut ThreadLocal<FullySend<Bump>>,
datastore: &'data ThreadLocal<EX::Data>,
step: Step,
step: IndexingStep,
) -> Result<()>
where
EX: Extractor<'extractor>,
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
tracing::trace!("We are resetting the extractor allocators");
progress.update_progress(step);
// Clean up and reuse the extractor allocs
for extractor_alloc in extractor_allocs.iter_mut() {
tracing::trace!("\tWith {} bytes reset", extractor_alloc.0.allocated_bytes());
@ -232,9 +225,11 @@ where
}
let total_documents = document_changes.len() as u32;
let (step, progress_step) = AtomicDocumentStep::new(total_documents);
progress.update_progress(progress_step);
let pi = document_changes.iter(CHUNK_SIZE);
pi.enumerate().try_arc_for_each_try_init(
pi.try_arc_for_each_try_init(
|| {
DocumentChangeContext::new(
index,
@ -247,13 +242,10 @@ where
move |index_alloc| extractor.init_data(index_alloc),
)
},
|context, (finished_documents, items)| {
|context, items| {
if (must_stop_processing)() {
return Err(Arc::new(InternalError::AbortedIndexation.into()));
}
let finished_documents = (finished_documents * CHUNK_SIZE) as u32;
(send_progress)(Progress::from_step_substep(step, finished_documents, total_documents));
// Clean up and reuse the document-specific allocator
context.doc_alloc.reset();
@ -264,6 +256,7 @@ where
});
let res = extractor.process(changes, context).map_err(Arc::new);
step.fetch_add(items.as_ref().len() as u32, Ordering::Relaxed);
// send back the doc_alloc in the pool
context.doc_allocs.get_or_default().0.set(std::mem::take(&mut context.doc_alloc));
@ -271,32 +264,7 @@ where
res
},
)?;
(send_progress)(Progress::from_step_substep(step, total_documents, total_documents));
step.store(total_documents, Ordering::Relaxed);
Ok(())
}
pub struct Progress {
pub finished_steps: u16,
pub total_steps: u16,
pub step_name: &'static str,
pub finished_total_substep: Option<(u32, u32)>,
}
impl Progress {
pub fn from_step(step: Step) -> Self {
Self {
finished_steps: step.finished_steps(),
total_steps: Step::total_steps(),
step_name: step.name(),
finished_total_substep: None,
}
}
pub fn from_step_substep(step: Step, finished_substep: u32, total_substep: u32) -> Self {
Self {
finished_total_substep: Some((finished_substep, total_substep)),
..Progress::from_step(step)
}
}
}
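
The `SP: Fn(Progress)` callback and the local `Progress` struct are gone: callers now share a `Progress` value, each step registers itself with `update_progress`, and per-document advancement is a relaxed atomic counter bumped from the workers (`AtomicDocumentStep` in `crate::progress`). A rough sketch of the same pattern with standard-library types only:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let total_documents = 1_000u32;
    let counter = Arc::new(AtomicU32::new(0)); // stands in for AtomicDocumentStep

    // Workers bump the counter by the size of each chunk they finish...
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..(total_documents / 4 / 50) {
                    // one 50-document chunk processed
                    counter.fetch_add(50, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // ...while anything holding the shared handle can read the current count
    // without a callback channel back to the caller.
    for handle in handles {
        handle.join().unwrap();
    }
    counter.store(total_documents, Ordering::Relaxed); // final clamp, as in `extract`
    assert_eq!(counter.load(Ordering::Relaxed), total_documents);
}
```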

View File

@ -92,11 +92,12 @@ mod test {
use crate::fields_ids_map::metadata::{FieldIdMapWithMetadata, MetadataBuilder};
use crate::index::tests::TempIndex;
use crate::progress::Progress;
use crate::update::new::indexer::document_changes::{
extract, DocumentChangeContext, Extractor, IndexingContext,
};
use crate::update::new::indexer::DocumentDeletion;
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::{MostlySend, ThreadLocal};
use crate::update::new::DocumentChange;
use crate::DocumentId;
@ -164,7 +165,7 @@ mod test {
doc_allocs: &doc_allocs,
fields_ids_map_store: &fields_ids_map_store,
must_stop_processing: &(|| false),
send_progress: &(|_progress| {}),
progress: &Progress::default(),
};
for _ in 0..3 {
@ -176,7 +177,7 @@ mod test {
context,
&mut extractor_allocs,
&datastore,
Step::ExtractingDocuments,
IndexingStep::ExtractingDocuments,
)
.unwrap();

View File

@ -1,19 +1,23 @@
use std::sync::atomic::Ordering;
use bumpalo::collections::CollectIn;
use bumpalo::Bump;
use bumparaw_collections::RawMap;
use hashbrown::hash_map::Entry;
use heed::RoTxn;
use memmap2::Mmap;
use raw_collections::RawMap;
use rayon::slice::ParallelSlice;
use rustc_hash::FxBuildHasher;
use serde_json::value::RawValue;
use serde_json::Deserializer;
use super::super::document_change::DocumentChange;
use super::document_changes::{DocumentChangeContext, DocumentChanges, Progress};
use super::document_changes::{DocumentChangeContext, DocumentChanges};
use super::retrieve_or_guess_primary_key;
use crate::documents::PrimaryKey;
use crate::progress::{AtomicPayloadStep, Progress};
use crate::update::new::document::Versions;
use crate::update::new::steps::Step;
use crate::update::new::steps::IndexingStep;
use crate::update::new::thread_local::MostlySend;
use crate::update::new::{Deletion, Insertion, Update};
use crate::update::{AvailableIds, IndexDocumentsMethod};
@ -44,7 +48,7 @@ impl<'pl> DocumentOperation<'pl> {
#[allow(clippy::too_many_arguments)]
#[tracing::instrument(level = "trace", skip_all, target = "indexing::document_operation")]
pub fn into_changes<MSP, SP>(
pub fn into_changes<MSP>(
self,
indexer: &'pl Bump,
index: &Index,
@ -52,12 +56,12 @@ impl<'pl> DocumentOperation<'pl> {
primary_key_from_op: Option<&'pl str>,
new_fields_ids_map: &mut FieldsIdsMap,
must_stop_processing: &MSP,
send_progress: &SP,
progress: Progress,
) -> Result<(DocumentOperationChanges<'pl>, Vec<PayloadStats>, Option<PrimaryKey<'pl>>)>
where
MSP: Fn() -> bool,
SP: Fn(Progress),
{
progress.update_progress(IndexingStep::PreparingPayloads);
let Self { operations, method } = self;
let documents_ids = index.documents_ids(rtxn)?;
@ -67,16 +71,14 @@ impl<'pl> DocumentOperation<'pl> {
let mut primary_key = None;
let payload_count = operations.len();
let (step, progress_step) = AtomicPayloadStep::new(payload_count as u32);
progress.update_progress(progress_step);
for (payload_index, operation) in operations.into_iter().enumerate() {
if must_stop_processing() {
return Err(InternalError::AbortedIndexation.into());
}
send_progress(Progress::from_step_substep(
Step::PreparingPayloads,
payload_index as u32,
payload_count as u32,
));
step.store(payload_index as u32, Ordering::Relaxed);
let mut bytes = 0;
let result = match operation {
@ -117,12 +119,7 @@ impl<'pl> DocumentOperation<'pl> {
};
operations_stats.push(PayloadStats { document_count, bytes, error });
}
send_progress(Progress::from_step_substep(
Step::PreparingPayloads,
payload_count as u32,
payload_count as u32,
));
step.store(payload_count as u32, Ordering::Relaxed);
// TODO We must drain the HashMap into a Vec because rayon::hash_map::IntoIter: !Clone
let mut docids_version_offsets: bumpalo::collections::vec::Vec<_> =
@ -166,8 +163,9 @@ fn extract_addition_payload_changes<'r, 'pl: 'r>(
// Only guess the primary key if it is the first document
let retrieved_primary_key = if previous_offset == 0 {
let doc =
RawMap::from_raw_value(doc, indexer).map(Some).map_err(UserError::SerdeJson)?;
let doc = RawMap::from_raw_value_and_hasher(doc, FxBuildHasher, indexer)
.map(Some)
.map_err(UserError::SerdeJson)?;
let result = retrieve_or_guess_primary_key(
rtxn,
@ -254,6 +252,24 @@ fn extract_addition_payload_changes<'r, 'pl: 'r>(
previous_offset = iter.byte_offset();
}
if payload.is_empty() {
let result = retrieve_or_guess_primary_key(
rtxn,
index,
new_fields_ids_map,
primary_key_from_op,
None,
);
match result {
Ok(Ok((pk, _))) => {
primary_key.get_or_insert(pk);
}
Ok(Err(UserError::NoPrimaryKeyCandidateFound)) => (),
Ok(Err(user_error)) => return Err(Error::UserError(user_error)),
Err(error) => return Err(error),
};
}
Ok(new_docids_version_offsets)
}
@ -545,8 +561,9 @@ impl MergeChanges for MergeDocumentForReplacement {
match operations.last() {
Some(InnerDocOp::Addition(DocumentOffset { content })) => {
let document = serde_json::from_slice(content).unwrap();
let document = raw_collections::RawMap::from_raw_value(document, doc_alloc)
.map_err(UserError::SerdeJson)?;
let document =
RawMap::from_raw_value_and_hasher(document, FxBuildHasher, doc_alloc)
.map_err(UserError::SerdeJson)?;
if is_new {
Ok(Some(DocumentChange::Insertion(Insertion::create(
@ -632,8 +649,9 @@ impl MergeChanges for MergeDocumentForUpdates {
}
};
let document = serde_json::from_slice(content).unwrap();
let document = raw_collections::RawMap::from_raw_value(document, doc_alloc)
.map_err(UserError::SerdeJson)?;
let document =
RawMap::from_raw_value_and_hasher(document, FxBuildHasher, doc_alloc)
.map_err(UserError::SerdeJson)?;
Some(Versions::single(document))
}
@ -647,8 +665,9 @@ impl MergeChanges for MergeDocumentForUpdates {
};
let document = serde_json::from_slice(content).unwrap();
let document = raw_collections::RawMap::from_raw_value(document, doc_alloc)
.map_err(UserError::SerdeJson)?;
let document =
RawMap::from_raw_value_and_hasher(document, FxBuildHasher, doc_alloc)
.map_err(UserError::SerdeJson)?;
Ok(document)
});
Versions::multiple(versions)?

View File

@ -4,7 +4,9 @@ use std::sync::{OnceLock, RwLock};
use std::thread::{self, Builder};
use big_s::S;
use document_changes::{extract, DocumentChanges, IndexingContext, Progress};
use bstr::ByteSlice as _;
use bumparaw_collections::RawMap;
use document_changes::{extract, DocumentChanges, IndexingContext};
pub use document_deletion::DocumentDeletion;
pub use document_operation::{DocumentOperation, PayloadStats};
use hashbrown::HashMap;
@ -13,7 +15,7 @@ use heed::{RoTxn, RwTxn};
use itertools::{merge_join_by, EitherOrBoth};
pub use partial_dump::PartialDump;
use rand::SeedableRng as _;
use raw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use time::OffsetDateTime;
pub use update_by_function::UpdateByFunction;
@ -21,7 +23,7 @@ use super::channel::*;
use super::extract::*;
use super::facet_search_builder::FacetSearchBuilder;
use super::merger::FacetFieldIdsDelta;
use super::steps::Step;
use super::steps::IndexingStep;
use super::thread_local::ThreadLocal;
use super::word_fst_builder::{PrefixData, PrefixDelta, WordFstBuilder};
use super::words_prefix_docids::{
@ -32,8 +34,11 @@ use crate::documents::{PrimaryKey, DEFAULT_PRIMARY_KEY};
use crate::facet::FacetType;
use crate::fields_ids_map::metadata::{FieldIdMapWithMetadata, MetadataBuilder};
use crate::index::main_key::{WORDS_FST_KEY, WORDS_PREFIXES_FST_KEY};
use crate::progress::Progress;
use crate::proximity::ProximityPrecision;
use crate::update::del_add::DelAdd;
use crate::update::facet::new_incremental::FacetsUpdateIncremental;
use crate::update::facet::{FACET_GROUP_SIZE, FACET_MAX_GROUP_SIZE, FACET_MIN_LEVEL_SIZE};
use crate::update::new::extract::EmbeddingExtractor;
use crate::update::new::merger::merge_and_send_rtree;
use crate::update::new::words_prefix_docids::compute_exact_word_prefix_docids;
@ -59,7 +64,7 @@ mod update_by_function;
///
/// TODO return stats
#[allow(clippy::too_many_arguments)] // clippy: 😝
pub fn index<'pl, 'indexer, 'index, DC, MSP, SP>(
pub fn index<'pl, 'indexer, 'index, DC, MSP>(
wtxn: &mut RwTxn,
index: &'index Index,
pool: &ThreadPoolNoAbort,
@ -70,34 +75,50 @@ pub fn index<'pl, 'indexer, 'index, DC, MSP, SP>(
document_changes: &DC,
embedders: EmbeddingConfigs,
must_stop_processing: &'indexer MSP,
send_progress: &'indexer SP,
progress: &'indexer Progress,
) -> Result<()>
where
DC: DocumentChanges<'pl>,
MSP: Fn() -> bool + Sync,
SP: Fn(Progress) + Sync,
{
let mut bbbuffers = Vec::new();
let finished_extraction = AtomicBool::new(false);
// We compute and remove the allocated BBQueues buffers capacity from the indexing memory.
let minimum_capacity = 50 * 1024 * 1024 * pool.current_num_threads(); // 50 MiB
// We reduce the actual memory used to 5%. The reason we do this here and not in Meilisearch
// is because we still use the old indexer for the settings and it is highly impacted by the
// max memory. So we keep the changes here and will remove these changes once we use the new
// indexer to also index settings. Related to #5125 and #5141.
let grenad_parameters = GrenadParameters {
max_memory: grenad_parameters.max_memory.map(|mm| mm * 5 / 100),
..grenad_parameters
};
// 5% of the allocated memory for the extractors, or a minimum of 100 MiB
// 5% of the allocated memory for the bbqueues, or a minimum of 50 MiB
//
// Minimum capacity for bbqueues
let minimum_total_bbbuffer_capacity = 50 * 1024 * 1024 * pool.current_num_threads(); // 50 MiB
let minimum_total_extractors_capacity = minimum_total_bbbuffer_capacity * 2;
let (grenad_parameters, total_bbbuffer_capacity) = grenad_parameters.max_memory.map_or(
(grenad_parameters, 2 * minimum_capacity), // 100 MiB by thread by default
(
GrenadParameters {
max_memory: Some(minimum_total_extractors_capacity),
..grenad_parameters
},
minimum_total_bbbuffer_capacity,
), // 100 MiB per thread by default
|max_memory| {
// 2% of the indexing memory
let total_bbbuffer_capacity = (max_memory / 100 / 2).max(minimum_capacity);
let total_bbbuffer_capacity = max_memory.max(minimum_total_bbbuffer_capacity);
let new_grenad_parameters = GrenadParameters {
max_memory: Some(
max_memory.saturating_sub(total_bbbuffer_capacity).max(100 * 1024 * 1024),
),
max_memory: Some(max_memory.max(minimum_total_extractors_capacity)),
..grenad_parameters
};
(new_grenad_parameters, total_bbbuffer_capacity)
},
);
let (extractor_sender, mut writer_receiver) = pool
let (extractor_sender, writer_receiver) = pool
.install(|| extractor_writer_bbqueue(&mut bbbuffers, total_bbbuffer_capacity, 1000))
.unwrap();
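
The rewritten budget keeps two floors: at least 50 MiB per thread for the BBQueue buffers and twice that for the extractors; when a memory limit is set (already scaled to 5% earlier in the function), each side simply takes the larger of that budget and its floor. A small arithmetic sketch of that decision, assuming the same constants:

```rust
/// Returns (extractors_max_memory, total_bbbuffer_capacity) in bytes.
/// `max_memory` is the already-scaled budget (5% of the user setting), as in `index`.
fn compute_budgets(max_memory: Option<usize>, num_threads: usize) -> (usize, usize) {
    let minimum_total_bbbuffer_capacity = 50 * 1024 * 1024 * num_threads; // 50 MiB per thread
    let minimum_total_extractors_capacity = minimum_total_bbbuffer_capacity * 2; // 100 MiB per thread

    match max_memory {
        None => (minimum_total_extractors_capacity, minimum_total_bbbuffer_capacity),
        Some(max_memory) => (
            max_memory.max(minimum_total_extractors_capacity),
            max_memory.max(minimum_total_bbbuffer_capacity),
        ),
    }
}

fn main() {
    // 4 threads, no memory limit: 400 MiB for the extractors, 200 MiB for the queues.
    assert_eq!(compute_budgets(None, 4), (400 * 1024 * 1024, 200 * 1024 * 1024));
    // 4 threads, 1 GiB already-scaled budget: both sides get the full gigabyte.
    let gib = 1024 * 1024 * 1024;
    assert_eq!(compute_budgets(Some(gib), 4), (gib, gib));
}
```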
@ -115,7 +136,7 @@ where
doc_allocs: &doc_allocs,
fields_ids_map_store: &fields_ids_map_store,
must_stop_processing,
send_progress,
progress,
};
let mut index_embeddings = index.embedding_configs(wtxn)?;
@ -149,7 +170,7 @@ where
indexing_context,
&mut extractor_allocs,
&datastore,
Step::ExtractingDocuments,
IndexingStep::ExtractingDocuments,
)?;
}
{
@ -181,7 +202,7 @@ where
indexing_context,
&mut extractor_allocs,
&extractor_sender.field_id_docid_facet_sender(),
Step::ExtractingFacets
IndexingStep::ExtractingFacets
)?
};
@ -193,6 +214,7 @@ where
caches,
FacetDatabases::new(index),
index,
&rtxn,
extractor_sender.facet_docids(),
)?;
}
@ -214,7 +236,7 @@ where
document_changes,
indexing_context,
&mut extractor_allocs,
Step::ExtractingWords
IndexingStep::ExtractingWords
)?
};
@ -292,7 +314,7 @@ where
document_changes,
indexing_context,
&mut extractor_allocs,
Step::ExtractingWordProximity,
IndexingStep::ExtractingWordProximity,
)?
};
@ -328,7 +350,7 @@ where
indexing_context,
&mut extractor_allocs,
&datastore,
Step::ExtractingEmbeddings,
IndexingStep::ExtractingEmbeddings,
)?;
}
{
@ -361,7 +383,7 @@ where
indexing_context,
&mut extractor_allocs,
&datastore,
Step::WritingGeoPoints
IndexingStep::WritingGeoPoints
)?;
}
@ -373,9 +395,7 @@ where
&indexing_context.must_stop_processing,
)?;
}
(indexing_context.send_progress)(Progress::from_step(Step::WritingToDatabase));
indexing_context.progress.update_progress(IndexingStep::WritingToDatabase);
finished_extraction.store(true, std::sync::atomic::Ordering::Relaxed);
Result::Ok((facet_field_ids_delta, index_embeddings))
@ -413,6 +433,7 @@ where
let mut arroy_writers = arroy_writers?;
{
let mut writer_receiver = writer_receiver;
let span = tracing::trace_span!(target: "indexing::write_db", "all");
let _entered = span.enter();
@ -475,7 +496,7 @@ where
)?;
}
(indexing_context.send_progress)(Progress::from_step(Step::WaitingForExtractors));
indexing_context.progress.update_progress(IndexingStep::WaitingForExtractors);
let (facet_field_ids_delta, index_embeddings) = extractor_handle.join().unwrap()?;
@ -488,10 +509,7 @@ where
break 'vectors;
}
(indexing_context.send_progress)(Progress::from_step(
Step::WritingEmbeddingsToDatabase,
));
indexing_context.progress.update_progress(IndexingStep::WritingEmbeddingsToDatabase);
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
for (_index, (_embedder_name, _embedder, writer, dimensions)) in &mut arroy_writers {
let dimensions = *dimensions;
@ -507,21 +525,19 @@ where
index.put_embedding_configs(wtxn, index_embeddings)?;
}
(indexing_context.send_progress)(Progress::from_step(Step::PostProcessingFacets));
indexing_context.progress.update_progress(IndexingStep::PostProcessingFacets);
if index.facet_search(wtxn)? {
compute_facet_search_database(index, wtxn, global_fields_ids_map)?;
}
compute_facet_level_database(index, wtxn, facet_field_ids_delta)?;
(indexing_context.send_progress)(Progress::from_step(Step::PostProcessingWords));
indexing_context.progress.update_progress(IndexingStep::PostProcessingWords);
if let Some(prefix_delta) = compute_word_fst(index, wtxn)? {
compute_prefix_database(index, wtxn, prefix_delta, grenad_parameters)?;
}
(indexing_context.send_progress)(Progress::from_step(Step::Finalizing));
indexing_context.progress.update_progress(IndexingStep::Finalizing);
Ok(()) as Result<_>
})?;
@ -577,7 +593,12 @@ fn write_from_bbqueue(
}
(key, None) => match database.delete(wtxn, key) {
Ok(false) => {
unreachable!("We tried to delete an unknown key: {key:?}")
tracing::error!(
database_name,
key_bytes = ?key,
formatted_key = ?key.as_bstr(),
"Attempt to delete an unknown key"
);
}
Ok(_) => (),
Err(error) => {
@ -732,27 +753,66 @@ fn compute_facet_search_database(
fn compute_facet_level_database(
index: &Index,
wtxn: &mut RwTxn,
facet_field_ids_delta: FacetFieldIdsDelta,
mut facet_field_ids_delta: FacetFieldIdsDelta,
) -> Result<()> {
if let Some(modified_facet_string_ids) = facet_field_ids_delta.modified_facet_string_ids() {
for (fid, delta) in facet_field_ids_delta.consume_facet_string_delta() {
let span = tracing::trace_span!(target: "indexing::facet_field_ids", "string");
let _entered = span.enter();
FacetsUpdateBulk::new_not_updating_level_0(
index,
modified_facet_string_ids,
FacetType::String,
)
.execute(wtxn)?;
match delta {
super::merger::FacetFieldIdDelta::Bulk => {
tracing::debug!(%fid, "bulk string facet processing");
FacetsUpdateBulk::new_not_updating_level_0(index, vec![fid], FacetType::String)
.execute(wtxn)?
}
super::merger::FacetFieldIdDelta::Incremental(delta_data) => {
tracing::debug!(%fid, len=%delta_data.len(), "incremental string facet processing");
FacetsUpdateIncremental::new(
index,
FacetType::String,
fid,
delta_data,
FACET_GROUP_SIZE,
FACET_MIN_LEVEL_SIZE,
FACET_MAX_GROUP_SIZE,
)
.execute(wtxn)?
}
}
}
if let Some(modified_facet_number_ids) = facet_field_ids_delta.modified_facet_number_ids() {
for (fid, delta) in facet_field_ids_delta.consume_facet_number_delta() {
let span = tracing::trace_span!(target: "indexing::facet_field_ids", "number");
let _entered = span.enter();
FacetsUpdateBulk::new_not_updating_level_0(
match delta {
super::merger::FacetFieldIdDelta::Bulk => {
tracing::debug!(%fid, "bulk number facet processing");
FacetsUpdateBulk::new_not_updating_level_0(index, vec![fid], FacetType::Number)
.execute(wtxn)?
}
super::merger::FacetFieldIdDelta::Incremental(delta_data) => {
tracing::debug!(%fid, len=%delta_data.len(), "incremental number facet processing");
FacetsUpdateIncremental::new(
index,
FacetType::Number,
fid,
delta_data,
FACET_GROUP_SIZE,
FACET_MIN_LEVEL_SIZE,
FACET_MAX_GROUP_SIZE,
)
.execute(wtxn)?
}
}
debug_assert!(crate::update::facet::sanity_checks(
index,
modified_facet_number_ids,
wtxn,
fid,
FacetType::Number,
FACET_GROUP_SIZE as usize,
FACET_MIN_LEVEL_SIZE as usize,
FACET_MAX_GROUP_SIZE as usize,
)
.execute(wtxn)?;
.is_ok());
}
Ok(())
@ -767,7 +827,7 @@ pub fn retrieve_or_guess_primary_key<'a>(
index: &Index,
new_fields_ids_map: &mut FieldsIdsMap,
primary_key_from_op: Option<&'a str>,
first_document: Option<RawMap<'a>>,
first_document: Option<RawMap<'a, FxBuildHasher>>,
) -> Result<StdResult<(PrimaryKey<'a>, bool), UserError>> {
// make sure that we have a declared primary key, either fetching it from the index or attempting to guess it.

View File

@ -1,6 +1,8 @@
use std::ops::DerefMut;
use bumparaw_collections::RawMap;
use rayon::iter::IndexedParallelIterator;
use rustc_hash::FxBuildHasher;
use serde_json::value::RawValue;
use super::document_changes::{DocumentChangeContext, DocumentChanges};
@ -75,7 +77,7 @@ where
self.primary_key.extract_fields_and_docid(document, fields_ids_map, doc_alloc)?;
let external_document_id = external_document_id.to_de();
let document = raw_collections::RawMap::from_raw_value(document, doc_alloc)
let document = RawMap::from_raw_value_and_hasher(document, FxBuildHasher, doc_alloc)
.map_err(InternalError::SerdeJson)?;
let insertion = Insertion::create(docid, external_document_id, Versions::single(document));

View File

@ -1,8 +1,9 @@
use raw_collections::RawMap;
use bumparaw_collections::RawMap;
use rayon::iter::IndexedParallelIterator;
use rayon::slice::ParallelSlice as _;
use rhai::{Dynamic, Engine, OptimizationLevel, Scope, AST};
use roaring::RoaringBitmap;
use rustc_hash::FxBuildHasher;
use super::document_changes::DocumentChangeContext;
use super::DocumentChanges;
@ -160,8 +161,12 @@ impl<'index> DocumentChanges<'index> for UpdateByFunctionChanges<'index> {
if document_id != new_document_id {
Err(Error::UserError(UserError::DocumentEditionCannotModifyPrimaryKey))
} else {
let raw_new_doc = RawMap::from_raw_value(raw_new_doc, doc_alloc)
.map_err(InternalError::SerdeJson)?;
let raw_new_doc = RawMap::from_raw_value_and_hasher(
raw_new_doc,
FxBuildHasher,
doc_alloc,
)
.map_err(InternalError::SerdeJson)?;
Ok(Some(DocumentChange::Update(Update::create(
docid,

View File

@ -1,6 +1,6 @@
use std::cell::RefCell;
use hashbrown::HashSet;
use hashbrown::HashMap;
use heed::types::Bytes;
use heed::{Database, RoTxn};
use memmap2::Mmap;
@ -12,6 +12,7 @@ use super::extract::{
merge_caches_sorted, transpose_and_freeze_caches, BalancedCaches, DelAddRoaringBitmap,
FacetKind, GeoExtractorData,
};
use crate::update::facet::new_incremental::FacetFieldIdChange;
use crate::{CboRoaringBitmapCodec, FieldId, GeoPoint, Index, InternalError, Result};
#[tracing::instrument(level = "trace", skip_all, target = "indexing::merge")]
@ -100,12 +101,18 @@ pub fn merge_and_send_facet_docids<'extractor>(
mut caches: Vec<BalancedCaches<'extractor>>,
database: FacetDatabases,
index: &Index,
rtxn: &RoTxn,
docids_sender: FacetDocidsSender,
) -> Result<FacetFieldIdsDelta> {
let max_string_count = (index.facet_id_string_docids.len(rtxn)? / 500) as usize;
let max_number_count = (index.facet_id_f64_docids.len(rtxn)? / 500) as usize;
let max_string_count = max_string_count.clamp(1000, 100_000);
let max_number_count = max_number_count.clamp(1000, 100_000);
transpose_and_freeze_caches(&mut caches)?
.into_par_iter()
.map(|frozen| {
let mut facet_field_ids_delta = FacetFieldIdsDelta::default();
let mut facet_field_ids_delta =
FacetFieldIdsDelta::new(max_string_count, max_number_count);
let rtxn = index.read_txn()?;
merge_caches_sorted(frozen, |key, DelAddRoaringBitmap { del, add }| {
let current = database.get_cbo_roaring_bytes_value(&rtxn, key)?;
@ -126,7 +133,10 @@ pub fn merge_and_send_facet_docids<'extractor>(
Ok(facet_field_ids_delta)
})
.reduce(|| Ok(FacetFieldIdsDelta::default()), |lhs, rhs| Ok(lhs?.merge(rhs?)))
.reduce(
|| Ok(FacetFieldIdsDelta::new(max_string_count, max_number_count)),
|lhs, rhs| Ok(lhs?.merge(rhs?)),
)
}
pub struct FacetDatabases<'a> {
@ -155,60 +165,131 @@ impl<'a> FacetDatabases<'a> {
}
}
#[derive(Debug, Default)]
#[derive(Debug)]
pub enum FacetFieldIdDelta {
Bulk,
Incremental(Vec<FacetFieldIdChange>),
}
impl FacetFieldIdDelta {
fn push(&mut self, facet_value: &[u8], max_count: usize) {
*self = match std::mem::replace(self, FacetFieldIdDelta::Bulk) {
FacetFieldIdDelta::Bulk => FacetFieldIdDelta::Bulk,
FacetFieldIdDelta::Incremental(mut v) => {
if v.len() >= max_count {
FacetFieldIdDelta::Bulk
} else {
v.push(FacetFieldIdChange { facet_value: facet_value.into() });
FacetFieldIdDelta::Incremental(v)
}
}
}
}
fn merge(&mut self, rhs: Option<Self>, max_count: usize) {
let Some(rhs) = rhs else {
return;
};
*self = match (std::mem::replace(self, FacetFieldIdDelta::Bulk), rhs) {
(FacetFieldIdDelta::Bulk, _) | (_, FacetFieldIdDelta::Bulk) => FacetFieldIdDelta::Bulk,
(
FacetFieldIdDelta::Incremental(mut left),
FacetFieldIdDelta::Incremental(mut right),
) => {
if left.len() + right.len() >= max_count {
FacetFieldIdDelta::Bulk
} else {
left.append(&mut right);
FacetFieldIdDelta::Incremental(left)
}
}
};
}
}
#[derive(Debug)]
pub struct FacetFieldIdsDelta {
/// The field ids that have been modified
modified_facet_string_ids: HashSet<FieldId>,
modified_facet_number_ids: HashSet<FieldId>,
modified_facet_string_ids: HashMap<FieldId, FacetFieldIdDelta, rustc_hash::FxBuildHasher>,
modified_facet_number_ids: HashMap<FieldId, FacetFieldIdDelta, rustc_hash::FxBuildHasher>,
max_string_count: usize,
max_number_count: usize,
}
impl FacetFieldIdsDelta {
fn register_facet_string_id(&mut self, field_id: FieldId) {
self.modified_facet_string_ids.insert(field_id);
pub fn new(max_string_count: usize, max_number_count: usize) -> Self {
Self {
max_string_count,
max_number_count,
modified_facet_string_ids: Default::default(),
modified_facet_number_ids: Default::default(),
}
}
fn register_facet_number_id(&mut self, field_id: FieldId) {
self.modified_facet_number_ids.insert(field_id);
fn register_facet_string_id(&mut self, field_id: FieldId, facet_value: &[u8]) {
self.modified_facet_string_ids
.entry(field_id)
.or_insert(FacetFieldIdDelta::Incremental(Default::default()))
.push(facet_value, self.max_string_count);
}
fn register_facet_number_id(&mut self, field_id: FieldId, facet_value: &[u8]) {
self.modified_facet_number_ids
.entry(field_id)
.or_insert(FacetFieldIdDelta::Incremental(Default::default()))
.push(facet_value, self.max_number_count);
}
fn register_from_key(&mut self, key: &[u8]) {
let (facet_kind, field_id) = self.extract_key_data(key);
match facet_kind {
FacetKind::Number => self.register_facet_number_id(field_id),
FacetKind::String => self.register_facet_string_id(field_id),
let (facet_kind, field_id, facet_value) = self.extract_key_data(key);
match (facet_kind, facet_value) {
(FacetKind::Number, Some(facet_value)) => {
self.register_facet_number_id(field_id, facet_value)
}
(FacetKind::String, Some(facet_value)) => {
self.register_facet_string_id(field_id, facet_value)
}
_ => (),
}
}
fn extract_key_data(&self, key: &[u8]) -> (FacetKind, FieldId) {
fn extract_key_data<'key>(&self, key: &'key [u8]) -> (FacetKind, FieldId, Option<&'key [u8]>) {
let facet_kind = FacetKind::from(key[0]);
let field_id = FieldId::from_be_bytes([key[1], key[2]]);
(facet_kind, field_id)
let facet_value = if key.len() >= 4 {
// level is also stored in the key at [3] (always 0)
Some(&key[4..])
} else {
None
};
(facet_kind, field_id, facet_value)
}
pub fn modified_facet_string_ids(&self) -> Option<Vec<FieldId>> {
if self.modified_facet_string_ids.is_empty() {
None
} else {
Some(self.modified_facet_string_ids.iter().copied().collect())
}
pub fn consume_facet_string_delta(
&mut self,
) -> impl Iterator<Item = (FieldId, FacetFieldIdDelta)> + '_ {
self.modified_facet_string_ids.drain()
}
pub fn modified_facet_number_ids(&self) -> Option<Vec<FieldId>> {
if self.modified_facet_number_ids.is_empty() {
None
} else {
Some(self.modified_facet_number_ids.iter().copied().collect())
}
pub fn consume_facet_number_delta(
&mut self,
) -> impl Iterator<Item = (FieldId, FacetFieldIdDelta)> + '_ {
self.modified_facet_number_ids.drain()
}
pub fn merge(mut self, rhs: Self) -> Self {
let Self { modified_facet_number_ids, modified_facet_string_ids } = rhs;
modified_facet_number_ids.into_iter().for_each(|fid| {
self.modified_facet_number_ids.insert(fid);
// rhs.max_xx_count is assumed to be equal to self.max_xx_count, so it goes unused here
let Self { modified_facet_number_ids, modified_facet_string_ids, .. } = rhs;
modified_facet_number_ids.into_iter().for_each(|(fid, mut delta)| {
let old_delta = self.modified_facet_number_ids.remove(&fid);
delta.merge(old_delta, self.max_number_count);
self.modified_facet_number_ids.insert(fid, delta);
});
modified_facet_string_ids.into_iter().for_each(|fid| {
self.modified_facet_string_ids.insert(fid);
modified_facet_string_ids.into_iter().for_each(|(fid, mut delta)| {
let old_delta = self.modified_facet_string_ids.remove(&fid);
delta.merge(old_delta, self.max_string_count);
self.modified_facet_string_ids.insert(fid, delta);
});
self
}
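
`FacetFieldIdDelta` now starts as an `Incremental` list of changed facet values and escalates to `Bulk` once the number of changes crosses a per-kind threshold (database length / 500, clamped between 1 000 and 100 000), both when pushing values and when merging deltas coming from parallel workers. A simplified sketch of that escalation with a plain `Vec<Vec<u8>>` payload and a hypothetical tiny threshold:

```rust
#[derive(Debug, PartialEq)]
enum Delta {
    Bulk,
    Incremental(Vec<Vec<u8>>),
}

impl Delta {
    fn push(&mut self, facet_value: &[u8], max_count: usize) {
        *self = match std::mem::replace(self, Delta::Bulk) {
            Delta::Bulk => Delta::Bulk,
            Delta::Incremental(mut values) => {
                if values.len() >= max_count {
                    // Too many individual changes: a bulk rebuild is cheaper.
                    Delta::Bulk
                } else {
                    values.push(facet_value.to_vec());
                    Delta::Incremental(values)
                }
            }
        };
    }
}

fn main() {
    // Hypothetical threshold; the real one is `db_len / 500` clamped to 1_000..=100_000.
    let max_count = 2;
    let mut delta = Delta::Incremental(Vec::new());
    delta.push(b"blue", max_count);
    delta.push(b"red", max_count);
    assert!(matches!(delta, Delta::Incremental(ref v) if v.len() == 2));
    delta.push(b"green", max_count); // the third change crosses the threshold
    assert_eq!(delta, Delta::Bulk);
}
```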
@ -235,8 +316,12 @@ fn merge_cbo_bitmaps(
(Some(_current), None, None) => Ok(Operation::Ignore), // but it's strange
(Some(current), None, Some(add)) => Ok(Operation::Write(current | add)),
(Some(current), Some(del), add) => {
debug_assert!(
del.is_subset(&current),
"del is not a subset of current, which must be impossible."
);
let output = match add {
Some(add) => (&current - del) | add,
Some(add) => (&current - (&del - &add)) | (add - del),
None => &current - del,
};
if output.is_empty() {
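
The new expression `(current - (del - add)) | (add - del)` only removes ids that are deleted and not re-added, and only inserts ids that are added and not also deleted; an id present in both deltas keeps whatever state `current` already has instead of being unconditionally re-inserted, which matters exactly when `del` is not a subset of `current` (the case the new `debug_assert!` makes visible). A worked example with the `roaring` crate:

```rust
use roaring::RoaringBitmap;

fn main() {
    let current: RoaringBitmap = [1u32, 2].into_iter().collect();
    let del: RoaringBitmap = [2u32, 3].into_iter().collect();
    let add: RoaringBitmap = [3u32, 4].into_iter().collect();

    // Old formula: an id that is both deleted and re-added (3) is unconditionally
    // inserted, even though it was never in `current`.
    let old = (&current - &del) | add.clone();
    assert_eq!(old.iter().collect::<Vec<_>>(), vec![1, 3, 4]);

    // New formula: 2 is removed, 4 is added, and 3 keeps the state it already
    // has in `current` (absent). Note that `del` is not a subset of `current`
    // here, which is the situation the added `debug_assert!` flags.
    let new = (&current - (&del - &add)) | (add - del);
    assert_eq!(new.iter().collect::<Vec<_>>(), vec![1, 4]);
}
```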

View File

@ -16,6 +16,7 @@ pub mod indexer;
mod merger;
mod parallel_iterator_ext;
mod ref_cell_ext;
pub mod reindex;
pub(crate) mod steps;
pub(crate) mod thread_local;
pub mod vector_document;

View File

@ -0,0 +1,38 @@
use heed::RwTxn;
use super::document::{Document, DocumentFromDb};
use crate::progress::{self, AtomicSubStep, Progress};
use crate::{FieldDistribution, Index, Result};
pub fn field_distribution(index: &Index, wtxn: &mut RwTxn<'_>, progress: &Progress) -> Result<()> {
let mut distribution = FieldDistribution::new();
let document_count = index.number_of_documents(wtxn)?;
let field_id_map = index.fields_ids_map(wtxn)?;
let (update_document_count, sub_step) =
AtomicSubStep::<progress::Document>::new(document_count as u32);
progress.update_progress(sub_step);
let docids = index.documents_ids(wtxn)?;
for docid in docids {
update_document_count.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
let Some(document) = DocumentFromDb::new(docid, wtxn, index, &field_id_map)? else {
continue;
};
let geo_iter = document.geo_field().transpose().map(|res| res.map(|rv| ("_geo", rv)));
for res in document.iter_top_level_fields().chain(geo_iter) {
let (field_name, _) = res?;
if let Some(count) = distribution.get_mut(field_name) {
*count += 1;
} else {
distribution.insert(field_name.to_owned(), 1);
}
}
}
index.put_field_distribution(wtxn, &distribution)?;
Ok(())
}

View File

@ -1,8 +1,12 @@
use std::borrow::Cow;
use enum_iterator::Sequence;
use crate::progress::Step;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Sequence)]
#[repr(u16)]
pub enum Step {
#[repr(u8)]
pub enum IndexingStep {
PreparingPayloads,
ExtractingDocuments,
ExtractingFacets,
@ -18,30 +22,31 @@ pub enum Step {
Finalizing,
}
impl Step {
pub fn name(&self) -> &'static str {
impl Step for IndexingStep {
fn name(&self) -> Cow<'static, str> {
match self {
Step::PreparingPayloads => "preparing update file",
Step::ExtractingDocuments => "extracting documents",
Step::ExtractingFacets => "extracting facets",
Step::ExtractingWords => "extracting words",
Step::ExtractingWordProximity => "extracting word proximity",
Step::ExtractingEmbeddings => "extracting embeddings",
Step::WritingGeoPoints => "writing geo points",
Step::WritingToDatabase => "writing to database",
Step::WaitingForExtractors => "waiting for extractors",
Step::WritingEmbeddingsToDatabase => "writing embeddings to database",
Step::PostProcessingFacets => "post-processing facets",
Step::PostProcessingWords => "post-processing words",
Step::Finalizing => "finalizing",
IndexingStep::PreparingPayloads => "preparing update file",
IndexingStep::ExtractingDocuments => "extracting documents",
IndexingStep::ExtractingFacets => "extracting facets",
IndexingStep::ExtractingWords => "extracting words",
IndexingStep::ExtractingWordProximity => "extracting word proximity",
IndexingStep::ExtractingEmbeddings => "extracting embeddings",
IndexingStep::WritingGeoPoints => "writing geo points",
IndexingStep::WritingToDatabase => "writing to database",
IndexingStep::WaitingForExtractors => "waiting for extractors",
IndexingStep::WritingEmbeddingsToDatabase => "writing embeddings to database",
IndexingStep::PostProcessingFacets => "post-processing facets",
IndexingStep::PostProcessingWords => "post-processing words",
IndexingStep::Finalizing => "finalizing",
}
.into()
}
pub fn finished_steps(self) -> u16 {
self as u16
fn current(&self) -> u32 {
*self as u32
}
pub const fn total_steps() -> u16 {
Self::CARDINALITY as u16
fn total(&self) -> u32 {
Self::CARDINALITY as u32
}
}
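
The indexing steps now implement the shared `Step` trait from `crate::progress` (`name`, `current`, `total`) rather than exposing ad-hoc `finished_steps`/`total_steps` methods. A hedged sketch of the pattern with a locally defined trait and a trimmed-down enum, since the real trait and the `Sequence`-derived cardinality live elsewhere:

```rust
use std::borrow::Cow;

// Local stand-in for the `crate::progress::Step` trait shown in the diff.
trait Step {
    fn name(&self) -> Cow<'static, str>;
    fn current(&self) -> u32;
    fn total(&self) -> u32;
}

#[derive(Debug, Clone, Copy)]
#[repr(u8)]
enum IndexingStep {
    PreparingPayloads,
    ExtractingDocuments,
    Finalizing,
}

impl IndexingStep {
    const COUNT: u32 = 3; // the real enum derives `enum_iterator::Sequence` and uses CARDINALITY
}

impl Step for IndexingStep {
    fn name(&self) -> Cow<'static, str> {
        match self {
            IndexingStep::PreparingPayloads => "preparing update file",
            IndexingStep::ExtractingDocuments => "extracting documents",
            IndexingStep::Finalizing => "finalizing",
        }
        .into()
    }

    fn current(&self) -> u32 {
        *self as u32
    }

    fn total(&self) -> u32 {
        Self::COUNT
    }
}

fn main() {
    let step = IndexingStep::ExtractingDocuments;
    println!("[{}/{}] {}", step.current() + 1, step.total(), step.name());
}
```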

View File

@ -1,9 +1,10 @@
use std::collections::BTreeSet;
use bumpalo::Bump;
use bumparaw_collections::RawMap;
use deserr::{Deserr, IntoValue};
use heed::RoTxn;
use raw_collections::RawMap;
use rustc_hash::FxBuildHasher;
use serde::Serialize;
use serde_json::value::RawValue;
@ -84,7 +85,7 @@ pub struct VectorDocumentFromDb<'t> {
docid: DocumentId,
embedding_config: Vec<IndexEmbeddingConfig>,
index: &'t Index,
vectors_field: Option<RawMap<'t>>,
vectors_field: Option<RawMap<'t, FxBuildHasher>>,
rtxn: &'t RoTxn<'t>,
doc_alloc: &'t Bump,
}
@ -102,9 +103,10 @@ impl<'t> VectorDocumentFromDb<'t> {
};
let vectors = document.vectors_field()?;
let vectors_field = match vectors {
Some(vectors) => {
Some(RawMap::from_raw_value(vectors, doc_alloc).map_err(InternalError::SerdeJson)?)
}
Some(vectors) => Some(
RawMap::from_raw_value_and_hasher(vectors, FxBuildHasher, doc_alloc)
.map_err(InternalError::SerdeJson)?,
),
None => None,
};
@ -220,7 +222,7 @@ fn entry_from_raw_value(
pub struct VectorDocumentFromVersions<'doc> {
external_document_id: &'doc str,
vectors: RawMap<'doc>,
vectors: RawMap<'doc, FxBuildHasher>,
embedders: &'doc EmbeddingConfigs,
}
@ -233,8 +235,8 @@ impl<'doc> VectorDocumentFromVersions<'doc> {
) -> Result<Option<Self>> {
let document = DocumentFromVersions::new(versions);
if let Some(vectors_field) = document.vectors_field()? {
let vectors =
RawMap::from_raw_value(vectors_field, bump).map_err(UserError::SerdeJson)?;
let vectors = RawMap::from_raw_value_and_hasher(vectors_field, FxBuildHasher, bump)
.map_err(UserError::SerdeJson)?;
Ok(Some(Self { external_document_id, vectors, embedders }))
} else {
Ok(None)

View File

@ -5,7 +5,7 @@ use rayon::slice::ParallelSlice as _;
use super::error::{EmbedError, EmbedErrorKind, NewEmbedderError, NewEmbedderErrorKind};
use super::rest::{Embedder as RestEmbedder, EmbedderOptions as RestEmbedderOptions};
use super::DistributionShift;
use super::{DistributionShift, REQUEST_PARALLELISM};
use crate::error::FaultSource;
use crate::vector::Embedding;
use crate::ThreadPoolNoAbort;
@ -113,20 +113,30 @@ impl Embedder {
texts: &[&str],
threads: &ThreadPoolNoAbort,
) -> Result<Vec<Vec<f32>>, EmbedError> {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
if threads.active_operations() >= REQUEST_PARALLELISM {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
} else {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
}
}
pub fn chunk_count_hint(&self) -> usize {

View File

@ -6,7 +6,7 @@ use rayon::slice::ParallelSlice as _;
use super::error::{EmbedError, NewEmbedderError};
use super::rest::{Embedder as RestEmbedder, EmbedderOptions as RestEmbedderOptions};
use super::DistributionShift;
use super::{DistributionShift, REQUEST_PARALLELISM};
use crate::error::FaultSource;
use crate::vector::error::EmbedErrorKind;
use crate::vector::Embedding;
@ -270,20 +270,29 @@ impl Embedder {
texts: &[&str],
threads: &ThreadPoolNoAbort,
) -> Result<Vec<Vec<f32>>, EmbedError> {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
if threads.active_operations() >= REQUEST_PARALLELISM {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
} else {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
}
}
pub fn chunk_count_hint(&self) -> usize {

View File

@ -203,20 +203,30 @@ impl Embedder {
texts: &[&str],
threads: &ThreadPoolNoAbort,
) -> Result<Vec<Embedding>, EmbedError> {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed_ref(chunk, None))
.collect();
if threads.active_operations() >= REQUEST_PARALLELISM {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed_ref(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
} else {
threads
.install(move || {
let embeddings: Result<Vec<Vec<Embedding>>, _> = texts
.par_chunks(self.prompt_count_in_chunk_hint())
.map(move |chunk| self.embed_ref(chunk, None))
.collect();
let embeddings = embeddings?;
Ok(embeddings.into_iter().flatten().collect())
})
.map_err(|error| EmbedError {
kind: EmbedErrorKind::PanicInThreadPool(error),
fault: FaultSource::Bug,
})?
}
}
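
Each of these embedders now checks `threads.active_operations()` against `REQUEST_PARALLELISM` and, when the pool is already saturated with outgoing requests, embeds the chunks sequentially on the current thread instead of spawning more rayon tasks on top of in-flight calls. A schematic sketch of that guard, where `active_operations`, `embed_chunk` and the value of `REQUEST_PARALLELISM` are hypothetical stand-ins:

```rust
use rayon::prelude::*;

const REQUEST_PARALLELISM: usize = 10; // assumed cap, standing in for the real constant

// Hypothetical stand-ins for the thread-pool gauge and the per-chunk HTTP call.
fn active_operations() -> usize { 12 }
fn embed_chunk(chunk: &[&str]) -> Result<Vec<Vec<f32>>, String> {
    Ok(chunk.iter().map(|_| vec![0.0f32; 4]).collect())
}

fn embed_index_ref(texts: &[&str], chunk_size: usize) -> Result<Vec<Vec<f32>>, String> {
    let embeddings: Result<Vec<Vec<Vec<f32>>>, String> =
        if active_operations() >= REQUEST_PARALLELISM {
            // The pool is already saturated with requests: stay on the current thread.
            texts.chunks(chunk_size).map(embed_chunk).collect()
        } else {
            // Otherwise, fan the chunks out on rayon as before.
            texts.par_chunks(chunk_size).map(embed_chunk).collect()
        };
    Ok(embeddings?.into_iter().flatten().collect())
}

fn main() {
    let texts = ["a", "b", "c", "d", "e"];
    let embeddings = embed_index_ref(&texts, 2).unwrap();
    assert_eq!(embeddings.len(), 5);
}
```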
pub fn chunk_count_hint(&self) -> usize {

View File

@ -3,6 +3,7 @@ use bumpalo::Bump;
use heed::EnvOpenOptions;
use maplit::hashset;
use milli::documents::mmap_from_objects;
use milli::progress::Progress;
use milli::update::new::indexer;
use milli::update::{IndexDocumentsMethod, IndexerConfig, Settings};
use milli::vector::EmbeddingConfigs;
@ -57,7 +58,7 @@ fn test_facet_distribution_with_no_facet_values() {
None,
&mut new_fields_ids_map,
&|| false,
&|_progress| (),
Progress::default(),
)
.unwrap();
@ -72,7 +73,7 @@ fn test_facet_distribution_with_no_facet_values() {
&document_changes,
embedders,
&|| false,
&|_| (),
&Progress::default(),
)
.unwrap();

Some files were not shown because too many files have changed in this diff.