Compare commits

...

198 Commits

Author SHA1 Message Date
f505fa4ae8 Add recommendation route 2024-04-09 12:30:24 +02:00
b4deb9b8db filtered_universe accepts index and txn instead of SearchContext 2024-04-09 12:03:03 +02:00
7476ad6599 Add error codes 2024-04-09 12:02:07 +02:00
d6b6cd322c Update sprint_issue.md (#4556) 2024-04-05 18:40:28 +02:00
b1844b0c27 Merge #4548
4548: v1.8 hybrid search changes r=dureuill a=dureuill

Implements the search changes from the [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#40f24df3da694428a39cc8043c9cfc64)

### ⚠️ Breaking changes in an experimental feature:

- Removed the `_semanticScore`. Use the `_rankingScore` instead.
- Removed `vector` in the response of the search (output was too big).
- Removed all the vectors from the `vectorSort` ranking score details
  - target vector appearing in the name of the rule
  - matched vector appearing in the details of the rule

### Other user-facing changes

- Added `semanticHitCount`, indicating how many hits were returned from the semantic search. This is especially useful in the hybrid search.
- Embed lazily: Meilisearch no longer generates an embedding when the keyword results are "good enough".
- Graceful embedding failure in hybrid search: when doing hybrid search (`semanticRatio in ]0.0, 1.0[`), an embedding failure no longer causes the search request to fail. Instead, only the keyword search is performed. When doing a full vector search (`semanticRatio==1.0`), a failure to embed will still result in failing that search.
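
A minimal Rust sketch of the lazy, failure-tolerant behavior described above. `SearchPlan` and `plan_search` are hypothetical names, not Meilisearch's actual API, and the sketch leaves out the "good enough" check on keyword results. It only illustrates two points: the embedder is not called for a pure keyword search, and an embedding failure is fatal only when `semanticRatio == 1.0`.

```rust
/// Hypothetical outcome of deciding how to run one search request.
#[derive(Debug, PartialEq)]
enum SearchPlan {
    KeywordOnly,
    Hybrid,
    VectorOnly,
    Fail(String),
}

/// Picks a plan from the semantic ratio. `embed` is only called when a
/// semantic component is requested, so a pure keyword search never embeds.
fn plan_search<F>(semantic_ratio: f32, embed: F) -> SearchPlan
where
    F: FnOnce() -> Result<Vec<f32>, String>,
{
    if semantic_ratio <= 0.0 {
        return SearchPlan::KeywordOnly;
    }
    match (embed(), semantic_ratio >= 1.0) {
        (Ok(_), true) => SearchPlan::VectorOnly,
        (Ok(_), false) => SearchPlan::Hybrid,
        // Full vector search: nothing to fall back on, so the request fails.
        (Err(message), true) => SearchPlan::Fail(message),
        // Hybrid search: degrade gracefully to the keyword results.
        (Err(_), false) => SearchPlan::KeywordOnly,
    }
}
```

For instance, `plan_search(0.5, || Err("embedder unreachable".to_string()))` yields `SearchPlan::KeywordOnly` in this sketch, while the same failure at ratio `1.0` yields `SearchPlan::Fail`.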

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-04 16:00:20 +00:00
a9013ed683 Fix comment mistake
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-04-04 17:21:47 +02:00
ca499a0302 Fix test after rebase 2024-04-04 16:04:07 +02:00
355e5282b2 Remove _semanticScore 2024-04-04 16:04:07 +02:00
7c27417a5d Add tests 2024-04-04 16:04:07 +02:00
1ff2a2d6fb Add semanticHitCount 2024-04-04 16:04:06 +02:00
3c6e9851a4 Correct error formatting 2024-04-04 15:58:19 +02:00
4564a38ae7 Bail earlier when the experimental feature is not enabled 2024-04-04 15:58:19 +02:00
466d718a05 Fix test 2024-04-04 15:58:19 +02:00
6ebb6b55a6 Lazily embed, don't fail hybrid search on embedding failure 2024-04-04 15:58:17 +02:00
fabc9cf14a milli: add Embedder::embed_one 2024-04-04 15:57:29 +02:00
00c4ed3bc2 milli: refactor getting embedder and embedder name 2024-04-04 15:57:29 +02:00
190933f6e1 Breaking: Remove vector from SearchResult 2024-04-04 15:57:29 +02:00
928e6e4c05 Breaking change: remove vector for score details 2024-04-04 15:57:29 +02:00
339a5e3431 Merge #4549
4549: Hugging Face embedder improvements r=dureuill a=dureuill

Architectural changes/Internal improvements

### 1. Prefer safetensors weights over pytorch weights when available

safetensors weights are memory mapped, which reduces memory usage of supported models.

### 2. Update candle

Updates candle to `0.4.1` (now targeting crates.io) and tokenizers to `v0.15.2` (still on GitHub).

This might fix https://github.com/meilisearch/meilisearch/issues/4399 thanks to the now included https://github.com/huggingface/candle/issues/1454

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-04 13:47:18 +00:00
5509bafff8 Merge #4535
4535: Support Negative Keywords r=ManyTheFish a=Kerollmops

This PR fixes #4422 by supporting `-` before any word in the query.

The minus symbol `-`, from the ASCII table, is not the only character that can be considered the negative operator. You can see the two other matching characters under the `Based on "-" (U+002D)` section on [this unicode reference website](https://www.compart.com/en/unicode/U+002D).

It's important to notice the strange behavior when a query includes and excludes the same word: only the derivatives (synonyms and splits) will be kept:
 - If you input `progamer -progamer`, the engine will still search for `pro gamer`.
 - If you have the synonym `like = love` and you input `like -like`, it will still search for `love`.
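
As a rough illustration of the operator, here is a hedged Rust sketch that splits a query into searched and excluded words. `split_negative_keywords` is a hypothetical helper, not the engine's tokenizer: it only handles the ASCII `-` placed directly before a word, and ignores phrases, the other Unicode dash characters, and the synonym/split behavior described above.

```rust
/// Splits a query into (words to search for, words to exclude).
/// Only a leading ASCII '-' directly attached to a word is recognized.
fn split_negative_keywords(query: &str) -> (Vec<String>, Vec<String>) {
    let mut positive = Vec::new();
    let mut negative = Vec::new();
    for token in query.split_whitespace() {
        match token.strip_prefix('-') {
            // "-word" excludes "word".
            Some(word) if !word.is_empty() => negative.push(word.to_string()),
            // Everything else, including a lone "-", is kept as a regular token.
            _ => positive.push(token.to_string()),
        }
    }
    (positive, negative)
}
```

With this sketch, `split_negative_keywords("pro gamer -noob")` separates `pro` and `gamer` from the excluded `noob`.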

## TODO
 - [x] Add analytics
 - [x] Add support to the `-` operator
 - [x] Make sure to support spaces around `-` well
 - [x] Support phrase negation
 - [x] Add tests


Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-04-04 13:10:27 +00:00
90e812fc0b Add some tests 2024-04-04 15:08:37 +02:00
58cafcc824 Update candle 2024-04-03 13:11:56 +02:00
56bf8503db Merge #4537
4537: Expose distribution shift in settings r=ManyTheFish a=dureuill

See [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#d652adc0890445658aaf36352dbc8802)

# Changes

- Distribution shift added to all embedders.
- Exposed in settings
- Changed the reindexing logic to not trigger a reindex operation when only the distribution shift or API key changes
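
The reindexing rule in the last bullet could be sketched as follows. This is an illustration only: `EmbedderSettings`, its fields, and `needs_reindex` are hypothetical stand-ins, and the `(mean, sigma)` shape of the distribution shift is an assumption, not taken from this PR.

```rust
/// Hypothetical, simplified embedder settings.
#[derive(Clone, Debug, PartialEq)]
struct EmbedderSettings {
    source: String,
    model: String,
    api_key: Option<String>,
    /// Assumed shape of the distribution shift: (mean, sigma).
    distribution: Option<(f32, f32)>,
}

/// Returns true only if something other than the API key or the
/// distribution shift differs between the old and new settings.
fn needs_reindex(old: &EmbedderSettings, new: &EmbedderSettings) -> bool {
    // Blank out the fields that never change the stored embeddings.
    let strip = |s: &EmbedderSettings| EmbedderSettings {
        api_key: None,
        distribution: None,
        ..s.clone()
    };
    strip(old) != strip(new)
}
```

Comparing stripped copies keeps the rule correct by default: any field added later triggers a reindex unless it is explicitly blanked out.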

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-03 09:08:58 +00:00
a1eccc762a Prefer safetensors to pytorch when both are available 2024-04-03 11:05:59 +02:00
75f81a0bab Merge #4547
4547: Fix milli/Cargo.toml for usage as dependency via git r=dureuill a=Toromyx

# Pull Request

## Related issues/discussions
This enables the usage of `milli` [via git repository](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-git-repositories) as mentioned in <https://github.com/meilisearch/meilisearch/issues/3367#issuecomment-1422613815>, <https://github.com/meilisearch/meilisearch/discussions/1523#discussioncomment-1039338>, and <https://github.com/meilisearch/meilisearch/discussions/1981#discussioncomment-1771568>

## What does this PR do?
Trying to depend on `milli` like

```
[dependencies.milli]
git = "https://github.com/meilisearch/meilisearch.git"
tag = "v1.7.4"
```

leads to the following error:

```
error: failed to select a version for the requirement `candle-core = "^0.3.1"`
candidate versions found which didn't match: 0.4.2
location searched: Git repository https://github.com/huggingface/candle.git
required by package `milli v1.7.4 (https://github.com/meilisearch/meilisearch.git?tag=v1.7.4#0259ad60)`
```

because the default branch of <https://github.com/huggingface/candle> does not contain the correct version.

To fix this, I added a `rev="..."` entry in the relevant dependencies, specifying the commit already present in the `Cargo.lock` file.
I also updated the version to the one in the Cargo.lock. This also updated `candle-kernels` sub-dependency from 0.3.1 to 0.3.3 which is probably correct?

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Thomas Gauges <thomas.gauges@gmail.com>
2024-04-03 07:31:36 +00:00
d55d496250 Fix milli/Cargo.toml for usage as dependency via git 2024-04-02 15:19:30 +02:00
5080bef0d6 Merge #4546
4546: Fix some typos in conments r=curquiza a=redistay

# Pull Request



## What does this PR do?
- fix some typos in comments

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: redistay <wujunjing@outlook.com>
2024-04-02 12:07:09 +00:00
182cb42953 chore: fix some typos in conments
Signed-off-by: redistay <wujunjing@outlook.com>
2024-04-02 19:37:55 +08:00
92a049c2dd Merge #4543
4543: Bring back changes from v1.7.4 into main r=Kerollmops a=dureuill



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2024-03-28 16:53:51 +00:00
78668584cd Merge #4533
4533: Hide api key in settings and task queue r=dureuill a=dureuill

# Pull Request

See [Usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#117f5ff7b19f4d95bb3ae0005f6c6633)

## Motivation

See [slack discussion (internal link)](https://meilisearch.slack.com/archives/C06GQP7FQ6P/p1709804022298749)


## Changes

- The value of the `apiKey` parameter is now hidden in the settings and the details of the task queue.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-28 16:02:53 +00:00
fa9748cc99 Merge #4536
4536: Limit concurrent search requests r=ManyTheFish a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4489

## What does this PR do?
- Adds a « search queue » that limits the number of search requests we can process at the same time and stores search requests to be processed
- Process only one search request per core/thread (we use available_parallelism)
- When the search queue is full, new search requests replace old ones **randomly**. The reason is that:
  - If we serve the oldest one first, like Typesense, we give the worst performances to everyone
  - If we serve the latest one, it gets too easy to DoS us (you just need to fill the queue with as many search requests as we can process simultaneously to ensure no other request will ever be processed)
  - By picking the search request randomly, we give a chance to recent search requests to be processed while ensuring that we can't be owned unless they fill our queue entirely and we start returning errors 5xx
- Adds an experimental parameter to control the size of the queue
- Adds a bunch of tests to ensure the search queue works correctly
- Checks in the health route that the loop consuming the search queue is running, and crashes if it is not
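
The random-replacement policy described above could look roughly like this. It is a sketch under stated assumptions, not the PR's implementation: `SearchQueue`, its methods, and the tiny xorshift generator (used only to avoid an external crate) are hypothetical, and rejecting the newcomer when the capacity is zero is an assumed behavior.

```rust
/// Hypothetical bounded queue of waiting search requests.
struct SearchQueue<T> {
    capacity: usize,
    waiting: Vec<T>,
    rng_state: u64,
}

impl<T> SearchQueue<T> {
    fn new(capacity: usize, seed: u64) -> Self {
        // xorshift must not start from 0, or it would only ever produce 0.
        Self { capacity, waiting: Vec::with_capacity(capacity), rng_state: seed.max(1) }
    }

    /// xorshift64: a tiny pseudo-random generator, good enough for a sketch.
    fn next_random(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Enqueues `request`. Returns the request that was dropped, if any:
    /// a randomly chosen waiting request when the queue is full.
    fn push(&mut self, request: T) -> Option<T> {
        if self.capacity == 0 {
            // Assumption: with no room to wait, the newcomer is rejected.
            return Some(request);
        }
        if self.waiting.len() < self.capacity {
            self.waiting.push(request);
            return None;
        }
        let victim = (self.next_random() % self.capacity as u64) as usize;
        Some(std::mem::replace(&mut self.waiting[victim], request))
    }

    fn len(&self) -> usize {
        self.waiting.len()
    }
}
```

The caller would answer the dropped request with an error, while the newly pushed one takes its slot. Because the victim is picked at random, neither the oldest nor the newest requests are systematically starved.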

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-28 15:01:52 +00:00
877f4b1045 Support negative phrases 2024-03-28 15:51:43 +01:00
781e2d7750 Merge #4532
4532: Add `url` and `api_key` to ollama r=ManyTheFish a=dureuill

See [Usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#5c77ef49e78e43388c1d3d5429151357)

### Motivation

- Before this PR, the URL for ollama was only read from the environment. This is a needless restriction that will be troublesome in settings where passing an environment variable is complex or impossible (e.g., the Cloud)
- Before this PR, ollama did not support an api_key. While ollama does not natively support API keys, [a common practice](https://github.com/ollama/ollama/issues/849) is to put a publicly accessible ollama server behind a proxy to support authentication.

### Skip changelog

ollama embedder was added to v1.8

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-28 12:35:19 +00:00
796213af9a Merge branch 'main' into tmp-release-v1.7.4 2024-03-28 10:51:49 +01:00
69f8b2730d Fix the tests 2024-03-28 10:47:04 +01:00
7385067c42 Merge #4542
4542: fixes typos r=irevoire a=brunoocasali

Just fix a typo 😬 

Co-authored-by: Bruno Casali <brunoocasali@gmail.com>
2024-03-27 18:21:48 +00:00
d1021c0f0d Merge #4520
4520: Add automation to create openAPI issue r=dureuill a=curquiza

Automatically create an issue to remind us to update the OpenAPI file when opening a milestone

Co-authored-by: curquiza <clementine@meilisearch.com>
2024-03-27 17:33:22 +00:00
8f2606d79d fixes typos 2024-03-27 14:26:47 -03:00
0259ad6082 Merge #4541
4541: Update version for the next release (v1.7.4) in Cargo.toml r=Kerollmops a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2024-03-27 16:49:40 +00:00
06a11b5b21 Improve error message 2024-03-27 17:34:49 +01:00
b50f518764 Update version for the next release (v1.7.4) in Cargo.toml 2024-03-27 16:12:54 +00:00
94b7afcc55 Merge #4539
4539: Don't optimize reindexing when fields contain dots r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4525

## What does this PR do?
- Don't try to optimize the number of reindexing operations when nested fields are used anywhere in:
    - the field distribution (e.g. a key actually contains a `.`)
    - the old faceted fields
    - the new faceted fields

This is because the facet distribution is not reporting on existing nested fields.
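
A hedged sketch of the guard described above, in Rust. `can_optimize_reindexing` and its slice-based parameters are hypothetical simplifications of the real data structures. The only point illustrated is that a single dotted name in any of the three lists disables the optimization.

```rust
/// Returns false as soon as any field name in the three lists contains a
/// dot, in which case the engine falls back to the regular reindexing path.
fn can_optimize_reindexing(
    field_distribution: &[&str],
    old_faceted: &[&str],
    new_faceted: &[&str],
) -> bool {
    !field_distribution
        .iter()
        .chain(old_faceted)
        .chain(new_faceted)
        .any(|field| field.contains('.'))
}
```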



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-27 16:07:49 +00:00
ee8cbea810 Don't optimize reindexing when fields contain dots 2024-03-27 17:04:45 +01:00
b7c582e4f3 connect the search queue with the health route 2024-03-27 15:49:43 +01:00
03c886ac1b adds a bit of documentation 2024-03-27 15:38:36 +01:00
cde7ce4f44 Add test 2024-03-27 14:02:09 +01:00
92224f109a Fix tests 2024-03-27 12:19:10 +01:00
0d27d50740 Merge #4516
4516: Update sprint_issue.md r=Kerollmops a=curquiza

Following decision made about specification

Also
- removed useless parts of the template
- added automatic labels: it is better to forget to remove them than to forget to add them (some mistakes happened in the past)

Co-authored-by: Clémentine U. - curqui <clementine@meilisearch.com>
2024-03-27 11:04:06 +00:00
572fb3a51d Finer granularity for embedder needs reindex 2024-03-27 12:01:34 +01:00
4ff0255783 remove unused function 2024-03-27 11:51:14 +01:00
a25456120d Expose distribution in settings 2024-03-27 11:51:04 +01:00
168ded3b9d Deserr for distribution 2024-03-27 11:50:33 +01:00
afd1da5642 Add distribution to all embedders 2024-03-27 11:50:22 +01:00
087a96d22e fix flaky test 2024-03-27 11:05:37 +01:00
34dfea72cc Merge #4509
4509: Rest embedder r=ManyTheFish a=dureuill

Fixes #4531 

See [Usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42?pvs=25#e6f58c3b742c4effb4ddc625ce12ee16)

### Implementation changes

- Remove tokio, futures, reqwest
- Add a new `milli::vector::rest::Embedder` embedder
- Update OpenAI and Ollama embedders to use the REST embedder internally
- Make Embedder::embed a sync method
- Add the new embedder source as described in the usage


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-27 09:27:46 +00:00
3a1f458139 fix a flaky test 2024-03-26 21:06:55 +01:00
55df9daaa0 adds a comment about the safety of an operation 2024-03-26 19:34:55 +01:00
2e36f069c2 fmt imports 2024-03-26 19:23:55 +01:00
8f5d9f501a update the discussion link 2024-03-26 19:18:32 +01:00
8127c9a115 handle the case of a queue of zero elements 2024-03-26 19:04:39 +01:00
e7704f1fc1 add a test to ensure we effectively returns a retry-after when the search queue is full 2024-03-26 18:08:59 +01:00
34262c7a0d Add analytics for the negative operator 2024-03-26 18:01:27 +01:00
e2a1bbae37 simplify and improve the http error 2024-03-26 17:53:37 +01:00
1da9e0f246 Better support space around the negative operator (-) 2024-03-26 17:47:13 +01:00
e4a3e603b3 Expose a first working version of the negative keyword 2024-03-26 17:47:13 +01:00
e433fd53e6 rename the method to get a permit and use it in all search requests 2024-03-26 17:28:03 +01:00
3f23fbb46d create the experimental CLI argument 2024-03-26 16:43:40 +01:00
c41e1274dc push and test the search queue datastructure 2024-03-26 15:56:43 +01:00
9a95ed619d Add tests 2024-03-26 10:36:56 +01:00
f82d056072 Hide secrets in settings and task queue 2024-03-26 10:36:24 +01:00
5ea017b922 Merge #4530
4530: fix: set the histogram bucket boundaries to follow the otel spec r=curquiza a=rohankmr414

# Pull Request

## What does this PR do?
- Fixes the HTTP request duration histogram bucket boundaries to follow the OpenTelemetry spec. Currently, the bucket boundaries are too granular and only track latencies below 1s.

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Rohan Kumar <rohankmr414@gmail.com>
2024-03-25 12:23:31 +00:00
817ccc089a also allow api_key 2024-03-25 11:50:00 +01:00
2ddd872ce6 Merge #4373
4373: feat: add status code label to prometheus http request counter r=irevoire a=rohankmr414

# Pull Request

## What does this PR do?
- This PR adds the `status` label (whose value is the HTTP status code) to the `meilisearch_http_requests_total` metric.

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Rohan Kumar <rohankmr414@gmail.com>
2024-03-25 10:40:50 +00:00
4136630ea5 Use constants instead of raw strings in set_*set() 2024-03-25 11:39:33 +01:00
58972f35cb Allow url parameter for ollama embedder 2024-03-25 11:32:55 +01:00
dfa5e41ea6 Check validity of the URL setting 2024-03-25 11:23:16 +01:00
a1db342f01 Expose REST embedder to the API 2024-03-25 11:23:15 +01:00
f87747f4d3 Remove unwraps 2024-03-25 11:23:04 +01:00
b6b4b6bab7 Remove the tokio and the reqwests 2024-03-25 11:23:03 +01:00
f649f58013 embed no longer async 2024-03-25 11:23:03 +01:00
ac52c857e8 Update ollama and openai impls to use the rest embedder internally 2024-03-25 11:23:03 +01:00
8708cbef25 Add RestEmbedder 2024-03-25 11:23:03 +01:00
c3d02f092d OpenAI sync 2024-03-25 11:23:03 +01:00
bc58e8a310 Documentation for the vector module 2024-03-25 11:23:03 +01:00
ec81c2bf1a Merge #4511
4511: Bump charabia to 0.8.8 r=ManyTheFish a=6543

... and update lock file

this will add the fix (https://github.com/meilisearch/charabia/pull/275) to support markdown formatted codeblocks

Co-authored-by: 6543 <6543@obermui.de>
2024-03-25 09:26:11 +00:00
13a84ae557 fix: set the histogram bucket boundaries to follow the otel spec 2024-03-25 11:20:30 +05:30
325435ad43 feat: add request rate and error rate panels to grafana dashboard 2024-03-25 10:49:40 +05:30
5833070358 feat: add status code label to prometheus http request counter 2024-03-25 10:49:40 +05:30
ae3c31a82c Merge #4526
4526: chore: remove repetitive word r=curquiza a=availhang

# Pull Request

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: availhang <mayangang@outlook.com>
2024-03-22 16:06:54 +00:00
9865c58046 chore: remove repetitive words
Signed-off-by: availhang <mayangang@outlook.com>
2024-03-22 15:23:13 +08:00
bf95438ea8 Merge #4522
4522: Brings back change to main r=curquiza a=irevoire

# Pull Request

Bring back changes to main

Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: irevoire <irevoire@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2024-03-21 15:57:50 +00:00
48d012c3e2 Merge branch 'main' into tmp-release-v1.7.3 2024-03-21 16:39:38 +01:00
8394be9484 Add automation to create openAPI issue 2024-03-21 15:52:11 +01:00
414fc14426 Merge #4519
4519: Update version for the next release (v1.7.3) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2024-03-21 11:21:56 +00:00
3b8e8b7f1a Update version for the next release (v1.7.3) in Cargo.toml 2024-03-21 11:20:30 +00:00
c67f04c746 Update sprint_issue.md 2024-03-20 18:45:56 +01:00
fc1c3f4a29 Merge #4466
4466: Implements the search cutoff r=irevoire a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4488

## What does this PR do?
- Adds a cutoff to the bucket sort after 150ms has been spent
- Adds a new setting to customize the default value of 150ms
- When the time is exceeded, we exit early with what we had the time to sort
- If the cutoff has been reached, the search details are updated with a new `Skip` ranking details for the ranking rules that were skipped
- Adds analytics to measure the total number of degraded search requests
- Adds the number of degraded search requests to the Prometheus metrics and Grafana dashboard
- The cutoff **must not** skip the filters; otherwise, we would leak documents to people who don’t have the right to see them
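
The early-exit behavior can be sketched in Rust as below. This is a simplification under stated assumptions: `RuleStatus`, `apply_rules_with_cutoff`, and the injected `elapsed` clock are hypothetical, and the check happens only between ranking rules rather than inside the bucket sort. Filtering is assumed to have happened before this function is called, so the cutoff can never skip it.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum RuleStatus {
    Applied,
    Skip,
}

/// Applies ranking rules in order until `cutoff` has been spent.
/// `elapsed` reports the time spent so far; rules reached after the
/// cutoff are not applied and are reported as `Skip`.
fn apply_rules_with_cutoff<'a>(
    rules: &[&'a str],
    cutoff: Duration,
    mut elapsed: impl FnMut() -> Duration,
    mut apply: impl FnMut(&str),
) -> Vec<(&'a str, RuleStatus)> {
    let mut details = Vec::with_capacity(rules.len());
    for &rule in rules {
        if elapsed() >= cutoff {
            details.push((rule, RuleStatus::Skip));
        } else {
            apply(rule);
            details.push((rule, RuleStatus::Applied));
        }
    }
    details
}
```

In real use the clock would be something like `let start = std::time::Instant::now();` followed by `|| start.elapsed()`. Injecting it keeps the sketch testable without sleeping.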


Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-20 13:06:53 +00:00
f2f1367ec3 add a timeout to the webhook 2024-03-20 13:59:43 +01:00
18f17ed728 Update version for the next release (v1.7.2) in Cargo.toml 2024-03-20 13:59:42 +01:00
4628b7b7bd bump charabia to 0.8.8
and update lock file
2024-03-20 13:39:00 +01:00
d49250358d Merge #4513
4513: Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1" r=Kerollmops a=irevoire

This reverts commit bd74cce86a, reversing changes made to d2f77e88bd.

This commit wasn’t supposed to be merged on the `release-v1.7.1` branch


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-20 09:57:24 +00:00
5046ffdf54 Merge #4512
4512: Revert "Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"" r=Kerollmops a=irevoire

Reverts meilisearch/meilisearch#4510

This PR was supposed to be merged on `release-v1.7.1` not main 🤦 

Co-authored-by: Tamo <irevoire@protonmail.ch>
2024-03-20 09:14:43 +00:00
c5322df519 Revert "Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"" 2024-03-20 10:08:28 +01:00
6079141ea6 snapshot the scores side by side with the score details 2024-03-19 18:30:14 +01:00
2c3af8e513 query the detailed score detail in the test 2024-03-19 18:09:02 +01:00
098ab594eb A score of 0.0 is now lesser than a sort result
handles the niche case 🐩 in the hybrid search where:
1. a sort ranking rule is the first rule.
2. the keyword search is skipped at the first rule.
3. the semantic search is not skipped at the first rule.

Previously, we would have the skipped search winning, whereas we want the non skipped one winning.
2024-03-19 17:32:32 +01:00
c495c8eb33 Merge #4510
4510: Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1" r=Kerollmops a=irevoire

In https://github.com/meilisearch/meilisearch/pull/4502 we merged main into release-v1.7.1 instead of a temporary branch thus we now need to revert this merge commit.

This reverts commit bd74cce86a, reversing changes made to d2f77e88bd.


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-19 16:02:24 +00:00
567194b925 Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"
This reverts commit bd74cce86a, reversing
changes made to d2f77e88bd.
2024-03-19 16:56:21 +01:00
d8fe4fe49d return the order in the score details 2024-03-19 15:45:04 +01:00
7b9e0d2944 forward the degraded parameter to the hybrid search 2024-03-19 15:11:21 +01:00
0ae39644f7 fix the facet search 2024-03-19 15:07:06 +01:00
bfec9468d4 Update milli/src/search/mod.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-19 14:49:15 +01:00
5233534dc0 Merge #4477
4477: Add documentation for benchmarks r=dureuill a=dureuill

See [CONTRIBUTING.md](https://github.com/meilisearch/meilisearch/blob/benchmark-docs/CONTRIBUTING.md#logging)

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-19 13:23:48 +00:00
fced2ff9ab Merge #4502
4502: Release v1.7.1 r=dureuill a=Kerollmops

Bring the v1.7.1 changes back to main.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
2024-03-19 12:41:28 +00:00
bd74cce86a Merge remote-tracking branch 'origin/main' into release-v1.7.1 2024-03-19 13:39:17 +01:00
f85c80d059 Merge #4503
4503: Add settings diff indexing benchmarks r=dureuill a=ManyTheFish

Add several benchmarks targeting settings diff-indexing enhancements

Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-03-19 10:35:46 +00:00
2a92c04100 Adding new assets 2024-03-19 11:31:32 +01:00
4369e9e97c add an error code test on the setting 2024-03-19 11:14:28 +01:00
e8516f00c4 move settings workload in root workload directory 2024-03-19 10:41:30 +01:00
7bd881b9bc adds the degraded searches to the prometheus dashboard 2024-03-19 10:35:47 +01:00
6a0c399c2f rename the search_cutoff parameter to search_cutoff_ms 2024-03-19 10:35:47 +01:00
038c26c118 stop returning the degraded boolean when a search was cutoff 2024-03-19 10:35:47 +01:00
ad9192fbbf reduce the size of an integration test 2024-03-19 10:35:47 +01:00
b8cda6c300 fix the search cutoff and add a test 2024-03-19 10:35:47 +01:00
b72495eb58 fix the settings tests 2024-03-19 10:28:23 +01:00
d1db495119 add a settings for the search cutoff 2024-03-19 10:28:23 +01:00
4a467739cd implements a first version of the cutoff without settings 2024-03-19 10:28:21 +01:00
29e71eedc7 Add benchmarks 2024-03-18 18:31:28 +01:00
10d053cd2f Merge #4500
4500: Don't display dimensions as 0 when it is not set r=ManyTheFish a=dureuill

Fixes a regression in embedders where `dimensions: 0` was displayed when it hadn't been set for the `openAi` source.

Was breaking a PHP SDK integration test: cbaecb8c55/tests/Settings/EmbeddersTest.php (L28)

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-18 15:21:24 +00:00
a302e258bd Don't display dimensions as 0 when it is not set 2024-03-18 16:10:12 +01:00
29840473b4 Merge #4499
4499: Fix milli link in contributing doc r=curquiza a=mohsen-alizadeh

# Pull Request

## Related issue
Fixes #4498

## What does this PR do?
The milli link in CONTRIBUTING.md targeted the archived milli repository. It has to be changed to target the milli crate in the main repo.

## PR checklist
Please check if your PR fulfills the following requirements:
- [X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [X] Have you read the contributing guidelines?
- [X] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Mohsen Alizadeh <mohsen@alizadeh.us>
Co-authored-by: Clémentine U. - curqui <clementine@meilisearch.com>
2024-03-18 14:39:26 +00:00
f4037c1a95 Update CONTRIBUTING.md
Co-authored-by: Clément Renault <renault.cle@gmail.com>
2024-03-18 15:39:01 +01:00
13cc62728b Fix milli link in contributing doc 2024-03-17 19:29:42 -07:00
f84bcb09e1 Merge #4491
4491: chore: remove repetitive words r=curquiza a=shuangcui

# Pull Request

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: shuangcui <fliter@qq.com>
2024-03-14 17:44:01 +00:00
5c95b5c933 chore: remove repetitive words
Signed-off-by: shuangcui <fliter@qq.com>
2024-03-14 21:28:55 +08:00
0b7bebeeb6 Merge #4483
4483: Workflows: Fix reason param when benches are triggered from a comment. r=irevoire a=dureuill



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-13 17:05:30 +00:00
d2f77e88bd Merge #4479
4479: Skip reindexing when modifying unknown faceted fields r=dureuill a=Kerollmops

This PR improves Meilisearch's decision to reindex when a faceted field is added to the settings, but not a single document contains this field. It is effectively a waste of time to reindex documents when no document contains the field.

This is related to a conversation [we have with our biggest customer (internal link)](https://discord.com/channels/1006923006964154428/1101213808627830794/1217112918857089187). They have 170 million documents, so reindexing this amount would be problematic.

---

The image is available by using the following Docker command. You can see the advancement of the image's build [on the GitHub CI page](https://github.com/meilisearch/meilisearch/actions/runs/8251688778).

```
docker pull getmeili/meilisearch:prototype-no-reindex-unknown-fields-0
```

Here is the hand-made test that shows that when modifying unknown filterable attributes, here `lol`, it doesn't reindex. However, when modifying the known `genre` field, it does reindex. You can see all that by looking at the time spent processing the update.

```json
{
  "uid": 3,
  "indexUid": "movies",
  "status": "succeeded",
  "type": "settingsUpdate",
  "canceledBy": null,
  "details": {
    "filterableAttributes": [
      "genres"
    ]
  },
  "error": null,
  "duration": "PT9.237703S",
  "enqueuedAt": "2024-03-12T15:34:26.836083Z",
  "startedAt": "2024-03-12T15:34:26.836374Z",
  "finishedAt": "2024-03-12T15:34:36.074077Z"
},
{
  "uid": 2,
  "indexUid": "movies",
  "status": "succeeded",
  "type": "settingsUpdate",
  "canceledBy": null,
  "details": {
    "filterableAttributes": [
      "lol"
    ]
  },
  "error": null,
  "duration": "PT0.000751S",
  "enqueuedAt": "2024-03-12T15:33:53.563923Z",
  "startedAt": "2024-03-12T15:33:53.565259Z",
  "finishedAt": "2024-03-12T15:33:53.56601Z"
},
{
  "uid": 0,
  "indexUid": "movies",
  "status": "succeeded",
  "type": "documentAdditionOrUpdate",
  "canceledBy": null,
  "details": {
    "receivedDocuments": 31944,
    "indexedDocuments": 31944
  },
  "error": null,
  "duration": "PT3.120723S",
  "enqueuedAt": "2024-02-17T10:35:55.042864Z",
  "startedAt": "2024-02-17T10:35:55.043505Z",
  "finishedAt": "2024-02-17T10:35:58.164228Z"
}
```

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-03-13 16:23:32 +00:00
1d8c13f595 Merge #4487
4487: Update version for the next release (v1.7.1) in Cargo.toml r=Kerollmops a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>
2024-03-13 15:41:10 +00:00
7f3c495f5c Update version for the next release (v1.7.1) in Cargo.toml 2024-03-13 14:49:21 +00:00
abd954755d Merge #4476
4476: Make the `/facet-search` route use the `sortFacetValuesBy` setting r=irevoire a=Kerollmops

This PR fixes #4423 by ensuring that the `/facet-search` route uses the `sortFacetValuesBy` setting.

Note for the documentation team (to be moved in the tracking issue): Using the new `sortFacetValuesBy` setting can slow down the facet-search requests as Meilisearch iterates over the whole list of facet values and computes the count of documents on every entry. That is hard, or even impossible, to optimize correctly.

### TODO
- [x] Create a custom HashMap wrapper for the facet `OrderBy` settings.
  This wrapper returns the `OrderBy` setting of the facet; if it is not defined, it uses the default `*` one, and if that is missing too (strange), it falls back on the lexicographic one.
- [x] Create a `ValuesCollection` wrapper that implements the logic for the lexicographic and count order by.
  - [x] Use it when there is no search query.
  - [x] Use it when there is a search query with and without allowed typos.
  - [x] Do not change the original logic, only use a wrapper.
- [x] Add tests
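
The fallback order described in the first TODO item can be sketched as follows. This is a minimal Python illustration, not the Rust wrapper introduced by the PR. The function name `order_by_for` and the order names `"alpha"` and `"count"` are assumptions made for the example.

```python
# Assumed name for the lexicographic order; used when nothing else is configured.
LEXICOGRAPHIC = "alpha"


def order_by_for(facet, settings):
    """Return the order for `facet`.

    Lookup order: the facet's own entry, then the `*` default,
    then the lexicographic order as a last resort.
    """
    if facet in settings:
        return settings[facet]
    return settings.get("*", LEXICOGRAPHIC)


# Hypothetical settings: one facet overrides the `*` default.
settings = {"*": "alpha", "genres": "count"}
genres_order = order_by_for("genres", settings)
director_order = order_by_for("director", settings)
missing_default_order = order_by_for("director", {})
```

Keeping the lookup in a single helper means callers never need to know whether a facet was configured explicitly.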

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-03-13 14:36:14 +00:00
f3fc2bd01f Address some issues with preallocations 2024-03-13 15:22:14 +01:00
6fa3872268 Workflows: Fix reason param when benches are triggered from a comment. 2024-03-13 13:46:43 +01:00
6c9823d7bb Add tests to sortFacetValuesBy count 2024-03-13 11:59:39 +01:00
e0dac5a22f Simplify the algorithm by using the new facet values collection wrapper 2024-03-13 11:31:34 +01:00
b918b55c6b Introduce a new facet value collection wrapper to simply the usage 2024-03-13 11:31:34 +01:00
07b1d0edaf Merge #4475
4475: Allow running benchmarks without sending results to the dashboard r=irevoire a=dureuill

Adds a `--no-dashboard` option to avoid sending results to the dashboard.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-13 09:59:52 +00:00
306b25ad3a Move the searchForFacetValues struct into a dedicated module 2024-03-13 10:24:21 +01:00
9f7a4fbfeb Return the facets of a placeholder facet-search sorted by count 2024-03-13 10:09:01 +01:00
5ed7b6a0b2 Merge #4456
4456: Add Ollama as an embeddings provider r=dureuill a=jakobklemm

# Pull Request

## Related issue
[Related Discord Thread](https://discord.com/channels/1006923006964154428/1211977150316683305)

## What does this PR do?
- Adds Ollama as a provider of Embeddings besides HuggingFace and OpenAI under the name `ollama`
- Adds the environment variable `MEILI_OLLAMA_URL` to set the embeddings URL of an Ollama instance with a default value of `http://localhost:11434/api/embeddings` if no variable is set
- Changes some of the structs and functions in `openai.rs` to be public so that they can be shared.
- Added more error variants for Ollama specific errors
- Uses the model `nomic-embed-text` by default. Any string value is allowed, but Meilisearch won't automatically check whether the model actually exists or is an embedding model

Tested against Ollama version `v0.1.27` and the `nomic-embed-text` model.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Co-authored-by: Jakob Klemm <jakob@jeykey.net>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
2024-03-13 08:48:47 +00:00
ae67d5eef0 Update milli/src/vector/error.rs
Fix Meilisearch capitalization
2024-03-13 09:45:04 +01:00
88bc9556a9 Add Ollama dimension inference and add clearer errors
Instead of the user manually specifying the model dimensions it will now automatically get determined
Just like with hf.rs the word "test" gets embedded to determine the dimensions of the output
Add a dedicated error type for if the model doesn't exist (don't automatically pull it though) and set the fault of that error to be the user
2024-03-12 19:59:11 +01:00
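
The dimension inference described in the commit above amounts to embedding a probe word and measuring the resulting vector. The sketch below is a hedged Python illustration, not the Rust code from the commit. `embed` is a hypothetical callable standing in for the request to the embeddings endpoint, and `infer_dimensions` is a name chosen for the example.

```python
def infer_dimensions(embed):
    """Embed the probe word "test" and return the length of the resulting vector.

    `embed` is a hypothetical callable taking a string and returning a list of floats.
    """
    vector = embed("test")
    if not vector:
        # An empty vector cannot describe a usable embedding space.
        raise ValueError("embedder returned an empty vector")
    return len(vector)


def fake_embed(text):
    """Stand-in embedder returning a fixed-size vector, for illustration only."""
    return [0.0] * 768


dimensions = infer_dimensions(fake_embed)
```

Passing the embedder in as a parameter keeps the sketch free of network calls and makes the inference step easy to test.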
ca4876fd10 Do not reindex when modifying unknown faceted field 2024-03-12 16:18:58 +01:00
d3a95ea2f6 Introduce a new OrderByMap struct to simplify the sort by usage 2024-03-12 13:56:56 +01:00
88d27949cd Add documentation for benchmarks 2024-03-12 10:56:16 +01:00
69c118ef76 Extract the facet order before extracting the facets values 2024-03-12 10:35:39 +01:00
d44e20aa89 Merge #4474
4474: Update cargo version r=irevoire a=curquiza

Fixes #4417

Co-authored-by: curquiza <clementine@meilisearch.com>
2024-03-12 09:27:22 +00:00
7b670a4afa Allow dry runs for benchmarks where reports are generated but not sent to the dashboard 2024-03-12 10:26:13 +01:00
fde209b7b6 Update cargo version 2024-03-12 10:20:07 +01:00
904b82a61d Merge #4473
4473: Bring back changes from v1.7.0 to main r=curquiza a=curquiza



Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
2024-03-11 15:02:47 +00:00
8ec3e30d2b Merge branch 'main' into tmp-release-v1.7.0 2024-03-11 15:39:51 +01:00
0a59cb9734 Merge #4463
4463: Add tests when the field limit is reached r=Kerollmops a=irevoire

# Pull Request

## Related issue
Related to https://github.com/meilisearch/meilisearch/discussions/4429#discussioncomment-8689101

This user found out that the error message we’re supposed to return when the maximum number of attributes is reached is _not_ returned in some cases

## What does this PR do?
- This PR adds four tests around the maximum number of attributes:
  1. Add a document with u16::MAX + 1 fields - Meilisearch panics
  2. Add two documents which together add up to u16::MAX + 1 fields - Meilisearch returns the expected error
  3. Add a document with u16::MAX + 1 **nested fields** - No error message but the document isn’t indexed
  4. Add two documents which together add up to u16::MAX + 1 nested fields - Meilisearch doesn’t return any error but doesn’t index the document
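
For illustration, payloads of the shape used in the first two cases could be generated as below. This is a sketch in Python rather than the Rust test code from the PR. The helper `make_document`, the `field_<n>` naming, and the choice to count `id` as one of the fields are all assumptions.

```python
U16_MAX = 2**16 - 1  # 65535, the value of Rust's u16::MAX


def make_document(doc_id, start, count):
    """Build a document with `count` distinct fields named field_<n>, plus an `id` field."""
    doc = {"id": doc_id}
    for n in range(start, start + count):
        doc[f"field_{n}"] = n
    return doc


# Case 1: a single document with u16::MAX + 1 fields (counting `id`).
single = make_document(1, 0, U16_MAX)

# Case 2: two documents whose distinct field names add up to u16::MAX + 1.
half = U16_MAX // 2
first = make_document(1, 0, half)
second = make_document(2, half, U16_MAX - half)
distinct_fields = set(first) | set(second)
```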

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-07 10:36:54 +00:00
f053c280e1 add tests when the field limit is reached 2024-03-06 18:42:41 +01:00
ee3076d5ba Merge #4462
4462: Divide threshold by ten r=dureuill a=ManyTheFish

Change the facet incremental vs bulk indexing threshold to better fit our users' needs. It might change again in the future if we gather more insights


Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-03-06 13:05:38 +00:00
ab1224bfa7 Merge #4458
4458: Replace logging timer by spans r=Kerollmops a=dureuill

- Remove logging timer dependency.
- Replace last uses in search by spans

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-05 16:43:23 +00:00
eefc1c421e Merge #4459
4459: Put a bound on OpenAI timeout r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #4460 

## What does this PR do?
- Makes sure that the timeout of the openai embedder is limited to max 1min, rather than the prior 15min+



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-05 15:18:51 +00:00
4d42a7af7c Merge #4445
4445: Add subcommand to run benchmarks r=irevoire a=dureuill

# Pull Request

## Related issue
Not user-facing, no issue

## What does this PR do?
- Adds a new `cargo xtask bench` subcommand that can run one or multiple workload files and report the results to a server
- A workload file is a JSON file with a specific schema
- Refactor our use of the `vergen` crate:
  - update to the beta `vergen-git2` crate
  - VERGEN_GIT_SEMVER_LIGHTWEIGHT => VERGEN_GIT_DESCRIBE
  - factor logic in a single `build-info` crate that is used both by meilisearch and xtask (prevents vergen variables from overriding themselves)
  - checked that defining the variables by hand when no git repo is available (docker build case) still works.
- Add CI to run `cargo xtask bench`

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-05 14:03:57 +00:00
7408db2a46 Meilisearch: fix date formatting 2024-03-05 14:56:48 +01:00
663629a9d6 Remove unused build dependency from xtask
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-05 14:45:06 +01:00
15c38dca78 Output RFC 3339 dates where we can
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-05 14:44:48 +01:00
7ee20b0895 Refactor xtask bench 2024-03-05 14:42:06 +01:00
bdd428c22e Merge #4450
4450: Add the content type in the webhook + improve the test r=Kerollmops a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4436

## What does this PR do?
- Specify the content type of the webhook
- Ensure it’s the case in the test

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-03-05 10:36:53 +00:00
b130917933 add the content type in the webhook + improve the test 2024-03-05 11:22:29 +01:00
25f64ce7df Replace logging timer by spans 2024-03-05 11:05:42 +01:00
adcd848809 CI: Add bench workflows 2024-03-05 11:02:05 +01:00
84ae0cd456 Merge #4457
4457: Bump mio from 0.8.9 to 0.8.11 r=Kerollmops a=dependabot[bot]

Bumps [mio](https://github.com/tokio-rs/mio) from 0.8.9 to 0.8.11.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/tokio-rs/mio/blob/master/CHANGELOG.md">mio's changelog</a>.</em></p>
<blockquote>
<h1>0.8.11</h1>
<ul>
<li>Fix receiving IOCP events after deregistering a Windows named pipe
(<a href="https://redirect.github.com/tokio-rs/mio/pull/1760">tokio-rs/mio#1760</a>, backport pr:
<a href="https://redirect.github.com/tokio-rs/mio/pull/1761">tokio-rs/mio#1761</a>).</li>
</ul>
<h1>0.8.10</h1>
<h2>Added</h2>
<ul>
<li>Solaris support
(<a href="https://redirect.github.com/tokio-rs/mio/pull/1724">tokio-rs/mio#1724</a>).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="0328bdef90"><code>0328bde</code></a> Release v0.8.11</li>
<li><a href="7084498512"><code>7084498</code></a> Fix warnings</li>
<li><a href="90d4fe00df"><code>90d4fe0</code></a> named-pipes: fix receiving IOCP events after deregister</li>
<li><a href="c710a307f8"><code>c710a30</code></a> Add v0.8.x to the CI</li>
<li><a href="c29e21c244"><code>c29e21c</code></a> Release v0.8.10</li>
<li><a href="f6a20da1c8"><code>f6a20da</code></a> Add Solaris operating system support (<a href="https://redirect.github.com/tokio-rs/mio/issues/1724">#1724</a>)</li>
<li>See full diff in <a href="https://github.com/tokio-rs/mio/compare/v0.8.9...v0.8.11">compare view</a></li>
</ul>
</details>
<br />



Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-05 09:35:17 +00:00
eee46b7537 Add first workloads 2024-03-05 10:13:11 +01:00
55f60a3638 Update .gitignore
- Ignore `/bench` directory for git purposes
- Ignore benchmark DB
2024-03-05 10:12:52 +01:00
c608b3f9b5 Factor vergen stuff to a build-info crate 2024-03-05 10:11:43 +01:00
86ce843f3d Add cargo xtask bench 2024-03-05 10:11:43 +01:00
b11df7ec34 Meilisearch: fix some wrong spans 2024-03-05 10:11:43 +01:00
6862caef64 Span Stats compute self-time 2024-03-05 10:11:43 +01:00
f75c7ac979 Compile xtask in --release 2024-03-05 10:11:43 +01:00
f07069094b Bump mio from 0.8.9 to 0.8.11
Bumps [mio](https://github.com/tokio-rs/mio) from 0.8.9 to 0.8.11.
- [Release notes](https://github.com/tokio-rs/mio/releases)
- [Changelog](https://github.com/tokio-rs/mio/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/mio/compare/v0.8.9...v0.8.11)

---
updated-dependencies:
- dependency-name: mio
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-04 22:03:25 +00:00
eada6de261 Divide threshold by ten 2024-03-04 18:02:54 +01:00
d3004d8040 Implemented Ollama as an embeddings provider
Initial prototype of Ollama embeddings actually working, error handling / retries still missing.

Allow model to be any String and require dimensions parameter

Fixed rustfmt formatting issues

There were some formatting issues in the initial PR and this should now make the changes comply with the Rust style guidelines

Because I accidentally didn't follow the style guide for commits in my commit messages I squashed them into one to comply
2024-03-04 15:09:43 +01:00
938149f814 Merge #4042
4042: Implements the new replication parameters r=ManyTheFish a=irevoire

### This PR implements the necessary parameters for the High Availability

- [ ] Update the spec

Introduce a new CLI flag called `--experimental-replication-parameters` that changes a few behaviors in the engine:
- [The auto-deletion of tasks is disabled](https://specs.meilisearch.com/specifications/text/0060-tasks-api.html#_2-technical-details)
- Upon registering a task, you can choose its task ID by sending a new header: `TaskId: 456645`. It must be a valid number, which must be greater than the last task ID ever seen.
- Add the ability to « dry-register » a task. That means meilisearch will answer to you with a valid task ID like everything went well, but won’t actually write anything in the database. To do that, you need to use the `DryRun: true` header.
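
The two headers described above can be illustrated with a small client-side sketch. It is written in Python and is not part of the PR. The helper names `replication_headers` and `is_acceptable_task_id` are hypothetical, and the check only mirrors the rule stated above rather than the engine's actual validation.

```python
def replication_headers(task_id=None, dry_run=False):
    """Build the extra HTTP headers for a request that registers a task."""
    headers = {}
    if task_id is not None:
        # Ask the server to register the task under this ID.
        headers["TaskId"] = str(task_id)
    if dry_run:
        # Ask for a valid task ID in the answer without writing anything to the database.
        headers["DryRun"] = "true"
    return headers


def is_acceptable_task_id(requested, last_seen):
    """A custom task ID must be greater than the last task ID ever seen."""
    return last_seen is None or requested > last_seen


headers = replication_headers(task_id=456645, dry_run=True)
```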

----

Old prototype `prototype-custom-task-id-0`:
-  Adds the capability to specify your own task ID via the `TaskId` http header
- Make the task IDs a u64 instead of a u32


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-02-26 11:37:34 +00:00
eb90f0b4fb fix and remove the file-store hack of /dev/null 2024-02-26 10:19:07 +01:00
c2e2003a80 create a test with the dry-run parameter enabled 2024-02-22 15:51:47 +01:00
693ba8dd15 rename the cli parameter 2024-02-21 14:33:40 +01:00
e1a3eed1eb update the discussion link 2024-02-21 12:30:28 +01:00
05ae291989 implement the dry run ha parameter 2024-02-21 11:21:26 +01:00
6ba9994916 disable the auto deletion of tasks when the ha mode is enabled 2024-02-20 12:23:39 +01:00
01ae46dd80 add an experimental cli parameter to allow specifying your task id 2024-02-20 11:24:44 +01:00
12f5389ba7 Merge #4416
4416: Create automation when creating Milestone to create update-version issue r=curquiza a=curquiza

Following our discussion `@irevoire` -> we are missing a reminder to update the cargo version BEFORE rc0

Issue template [here](https://github.com/meilisearch/engine-team/blob/main/issue-templates/update-version-issue.md)

Co-authored-by: curquiza <clementine@meilisearch.com>
2024-02-20 08:47:29 +00:00
9ee4f55e6c let you specify your task id 2024-02-19 14:29:33 +01:00
024de0dcf8 Create automation when creating Milestone to create update-version issue 2024-02-14 17:36:47 +01:00
82b43e9a7f Merge #4400
4400: Upgrade rustls to 0.21.10 and ring to 0.17 r=curquiza a=hack3ric

# Pull Request

## What does this PR do?
- Upgrade dependencies that use ring 0.16 so that they rely on ring 0.17 instead
- Use rustls 0.21 for actix-{http,tls}, since newer versions of rustls use ring 0.17
- Fix some trivial breaking API changes caused by above

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Eric Long <i@hack3r.moe>
2024-02-12 13:17:40 +00:00
c02d585f5b Upgrade rustls to 0.21.10 and ring to 0.17 2024-02-12 14:32:29 +08:00
120 changed files with 9067 additions and 2094 deletions

View File

@ -1,2 +1,2 @@
[alias]
xtask = "run --package xtask --"
xtask = "run --release --package xtask --"

View File

@ -2,14 +2,13 @@
name: New sprint issue
about: ⚠️ Should only be used by the engine team ⚠️
title: ''
labels: ''
labels: 'missing usage in PRD, impacts docs'
assignees: ''
---
Related product team resources: [PRD]() (_internal only_)
Related product discussion:
Related spec: WIP
## Motivation
@ -21,11 +20,7 @@ Related spec: WIP
## TODO
<!---Feel free to adapt this list with more technical/product steps-->
- [ ] Release a prototype
- [ ] If prototype validated, merge changes into `main`
- [ ] Update the spec
<!---If necessary, create a list with technical/product steps-->
### Reminders when modifying the Setting API

30
.github/workflows/bench-manual.yml vendored Normal file

@ -0,0 +1,30 @@
name: Bench (manual)
on:
  workflow_dispatch:
    inputs:
      workload:
        description: 'The path to the workloads to execute (workloads/...)'
        required: true
        default: 'workloads/movies.json'
env:
  WORKLOAD_NAME: ${{ github.event.inputs.workload }}
jobs:
  benchmarks:
    name: Run and upload benchmarks
    runs-on: benchmarks
    timeout-minutes: 180 # 3h
    steps:
      - uses: actions/checkout@v3
      - uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable
          override: true
      - name: Run benchmarks - workload ${WORKLOAD_NAME} - branch ${{ github.ref }} - commit ${{ github.sha }}
        run: |
          cargo xtask bench --api-key "${{ secrets.BENCHMARK_API_KEY }}" --dashboard-url "${{ vars.BENCHMARK_DASHBOARD_URL }}" --reason "Manual [Run #${{ github.run_id }}](https://github.com/meilisearch/meilisearch/actions/runs/${{ github.run_id }})" -- ${WORKLOAD_NAME}

46
.github/workflows/bench-pr.yml vendored Normal file

@ -0,0 +1,46 @@
name: Bench (PR)
on:
  issue_comment:
    types: [created]
permissions:
  issues: write
env:
  GH_TOKEN: ${{ secrets.MEILI_BOT_GH_PAT }}
jobs:
  run-benchmarks-on-comment:
    if: startsWith(github.event.comment.body, '/bench')
    name: Run and upload benchmarks
    runs-on: benchmarks
    timeout-minutes: 180 # 3h
    steps:
      - name: Check for Command
        id: command
        uses: xt0rted/slash-command-action@v2
        with:
          command: bench
          reaction-type: "rocket"
          repo-token: ${{ env.GH_TOKEN }}
      - uses: xt0rted/pull-request-comment-branch@v2
        id: comment-branch
        with:
          repo_token: ${{ env.GH_TOKEN }}
      - uses: actions/checkout@v3
        if: success()
        with:
          fetch-depth: 0 # fetch full history to be able to get main commit sha
          ref: ${{ steps.comment-branch.outputs.head_ref }}
      - uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable
          override: true
      - name: Run benchmarks on PR ${{ github.event.issue.id }}
        run: |
          cargo xtask bench --api-key "${{ secrets.BENCHMARK_API_KEY }}" --dashboard-url "${{ vars.BENCHMARK_DASHBOARD_URL }}" --reason "[Comment](${{ github.event.comment.html_url }}) on [#${{ github.event.issue.number }}](${{ github.event.issue.html_url }})" -- ${{ steps.command.outputs.command-arguments }}

View File

@ -0,0 +1,25 @@
name: Indexing bench (push)
on:
  push:
    branches:
      - main
jobs:
  benchmarks:
    name: Run and upload benchmarks
    runs-on: benchmarks
    timeout-minutes: 180 # 3h
    steps:
      - uses: actions/checkout@v3
      - uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable
          override: true
      # Run benchmarks
      - name: Run benchmarks - Dataset ${BENCH_NAME} - Branch main - Commit ${{ github.sha }}
        run: |
          cargo xtask bench --api-key "${{ secrets.BENCHMARK_API_KEY }}" --dashboard-url "${{ vars.BENCHMARK_DASHBOARD_URL }}" --reason "Push on `main` [Run #${{ github.run_id }}](https://github.com/meilisearch/meilisearch/actions/runs/${{ github.run_id }})" -- workloads/*.json

View File

@ -110,6 +110,44 @@ jobs:
--milestone $MILESTONE_VERSION \
--assignee curquiza
  create-update-version-issue:
    needs: get-release-version
    # Create the update-version issue even if the release is a patch release
    if: github.event.action == 'created'
    runs-on: ubuntu-latest
    env:
      ISSUE_TEMPLATE: issue-template.md
    steps:
      - uses: actions/checkout@v3
      - name: Download the issue template
        run: curl -s https://raw.githubusercontent.com/meilisearch/engine-team/main/issue-templates/update-version-issue.md > $ISSUE_TEMPLATE
      - name: Create the issue
        run: |
          gh issue create \
            --title "Update version in Cargo.toml for $MILESTONE_VERSION" \
            --label 'maintenance' \
            --body-file $ISSUE_TEMPLATE \
            --milestone $MILESTONE_VERSION
  create-update-openapi-issue:
    needs: get-release-version
    # Create the openAPI issue if the release is not only a patch release
    if: github.event.action == 'created' && needs.get-release-version.outputs.is-patch == 'false'
    runs-on: ubuntu-latest
    env:
      ISSUE_TEMPLATE: issue-template.md
    steps:
      - uses: actions/checkout@v3
      - name: Download the issue template
        run: curl -s https://raw.githubusercontent.com/meilisearch/engine-team/main/issue-templates/update-openapi-issue.md > $ISSUE_TEMPLATE
      - name: Create the issue
        run: |
          gh issue create \
            --title "Update Open API file for $MILESTONE_VERSION" \
            --label 'maintenance' \
            --body-file $ISSUE_TEMPLATE \
            --milestone $MILESTONE_VERSION
# ----------------
# MILESTONE CLOSED
# ----------------

2
.gitignore vendored

@ -9,6 +9,8 @@
/data.ms
/snapshots
/dumps
/bench
/_xtask_benchmark.ms
# Snapshots
## ... large

362
BENCHMARKS.md Normal file

@ -0,0 +1,362 @@
# Benchmarks
Currently this repository hosts two kinds of benchmarks:
1. The older "milli benchmarks", that use [criterion](https://github.com/bheisler/criterion.rs) and live in the "benchmarks" directory.
2. The newer "bench" that are workload-based and so split between the [`workloads`](./workloads/) directory and the [`xtask::bench`](./xtask/src/bench/) module.
This document describes the newer "bench" benchmarks. For more details on the "milli benchmarks", see [benchmarks/README.md](./benchmarks/README.md).
## Design philosophy for the benchmarks
The newer "bench" benchmarks are **integration** benchmarks, in the sense that they spawn an actual Meilisearch server and measure its performance end-to-end, including HTTP request overhead.
Since this is prone to fluctuating, the benchmarks regain a bit of precision by measuring the runtime of the individual spans using the [logging machinery](./CONTRIBUTING.md#logging) of Meilisearch.
A span roughly translates to a function call. The benchmark runner collects all the spans by name using the [logs route](https://github.com/orgs/meilisearch/discussions/721) and sums their runtime. The processed results are then sent to the [benchmark dashboard](https://bench.meilisearch.dev), which is in charge of storing and presenting the data.
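
The aggregation step described above, collecting spans by name and summing their runtime, can be sketched as follows. This is a minimal Python illustration and not the runner's actual code. The input shape (pairs of span name and duration in microseconds), the span names, and the function name `total_runtime_by_span` are assumptions.

```python
from collections import defaultdict


def total_runtime_by_span(spans):
    """Sum the durations of all spans that share the same name.

    `spans` is an iterable of (name, duration_in_microseconds) pairs.
    """
    totals = defaultdict(int)
    for name, duration in spans:
        totals[name] += duration
    return dict(totals)


# Hypothetical spans collected during one run.
collected = [
    ("index_documents", 1200),
    ("extract_facets", 300),
    ("index_documents", 800),
]
totals = total_runtime_by_span(collected)
```

Grouping by name means a function that is called many times contributes a single total, which is the figure that can then be compared between commits.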
## Running the benchmarks
Benchmarks can run locally or in CI.
### Locally
#### With a local benchmark dashboard
The benchmarks dashboard lives in its [own repository](https://github.com/meilisearch/benchboard). We provide binaries for Ubuntu/Debian, but you can build from source for other platforms (macOS should work, as the dashboard was developed on that platform).
Run the `benchboard` binary to create a fresh database of results. By default it will serve the results and the API to gather results on `http://localhost:9001`.
From the Meilisearch repository, you can then run benchmarks with:
```sh
cargo xtask bench -- workloads/my_workload_1.json ..
```
This command will build and run Meilisearch locally on port 7700, so make sure that this port is available.
To run benchmarks on a different commit, just use the usual git command to get back to the desired commit.
#### Without a local benchmark dashboard
To work with the raw results, you can also skip using a local benchmark dashboard.
Run:
```sh
cargo xtask bench --no-dashboard -- workloads/my_workload_1.json workloads/my_workload_2.json ..
```
For processing the results, look at [Looking at benchmark results/Without dashboard](#without-dashboard).
### In CI
We have dedicated runners to run workloads on CI. Currently, there are three ways of running the CI:
1. Automatically, on every push to `main`.
2. Manually, by clicking the [`Run workflow`](https://github.com/meilisearch/meilisearch/actions/workflows/bench-manual.yml) button and specifying the target reference (tag, commit or branch) as well as one or multiple workloads to run. The workloads must exist in the Meilisearch repository (conventionally, in the [`workloads`](./workloads/) directory) on the target reference. Globbing (e.g., `workloads/*.json`) works.
3. Manually on a PR, by posting a comment containing a `/bench` command, followed by one or multiple workloads to run. Globbing works. The workloads must exist in the Meilisearch repository in the branch of the PR.
```
/bench workloads/movies*.json /hackernews_1M.json
```
## Looking at benchmark results
### On the dashboard
Results are available on the global dashboard used by CI at <https://bench.meilisearch.dev> or on your [local dashboard](#with-a-local-benchmark-dashboard).
The dashboard homepage presents three sections:
1. The latest invocations (a call to `cargo xtask bench`, either local or by CI) with their reason (generally set to some helpful link in CI) and their status.
2. The latest workloads run on `main`.
3. The latest workloads run on other references.
By default, the workload shows the total runtime delta with the latest applicable commit on `main`. The latest applicable commit is the latest commit for workload invocations that do not originate on `main`, and the latest previous commit for workload invocations that originate on `main`.
You can explicitly request a detailed comparison by span with the `main` branch, the branch of origin, or any previous commit, by clicking the links at the bottom of the workload invocation.
In the detailed comparison view, the spans are sorted by improvements, regressions, stable (no statistically significant change) and unstable (the span runtime is comparable to its standard deviation).
You can click on the name of any span to get a box plot comparing the target commit with multiple commits of the selected branch.
### Without dashboard
After the workloads are done running, the reports will live in the Meilisearch repository, in the `bench/reports` directory (by default).
You can then convert these reports into other formats.
- To [Firefox profiler](https://profiler.firefox.com) format. Run:
```sh
cd bench/reports
cargo run --release --bin trace-to-firefox -- my_workload_1-0-trace.json
```
You can then upload the resulting `firefox-my_workload_1-0-trace.json` file to the online profiler.
## Designing benchmark workloads
Benchmark workloads conventionally live in the `workloads` directory of the Meilisearch repository.
They are JSON files with the following structure (comments are not actually supported; to make your own, remove them or copy an existing workload file):
```jsonc
{
// Name of the workload. Must be unique to the workload, as it will be used to group results on the dashboard.
"name": "hackernews.ndjson_1M,no-threads",
// Number of consecutive runs of the commands that should be performed.
// Each run uses a fresh instance of Meilisearch and a fresh database.
// Each run produces its own report file.
"run_count": 3,
// List of arguments to add to the Meilisearch command line.
"extra_cli_args": ["--max-indexing-threads=1"],
// List of named assets that can be used in the commands.
"assets": {
// name of the asset.
// Must be unique at the workload level.
// For better results, the same asset (same sha256) should have the same name across workloads.
// Having multiple assets with the same name and distinct hashes is supported across workloads,
// but will lead to superfluous downloads.
//
// Assets are stored in the `bench/assets/` directory by default.
"hackernews-100_000.ndjson": {
// If the asset exists in the local filesystem (Meilisearch repository or for your local workloads),
// its file path can be specified here.
// `null` if the asset should be downloaded from a remote location.
"local_location": null,
// URL of the remote location where the asset can be downloaded.
// Use the `--assets-key` of the runner to pass an API key in the `Authorization: Bearer` header of the download requests.
// `null` if the asset should be imported from a local location.
// if both local and remote locations are specified, then the local one is tried first, then the remote one
// if the file is locally missing or its hash differs.
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-100_000.ndjson",
// SHA256 of the asset.
// Optional, the `sha256` of the asset will be displayed during a run of the workload if it is missing.
// If present, the hash of the asset in the `bench/assets/` directory will be compared against this hash before
// running the workload. If the hashes differ, the asset will be downloaded anew.
"sha256": "60ecd23485d560edbd90d9ca31f0e6dba1455422f2a44e402600fbb5f7f1b213",
// Optional, one of "Auto", "Json", "NdJson" or "Raw".
// If missing, assumed to be "Auto".
// If "Auto", the format will be determined from the extension in the asset name.
"format": "NdJson"
},
"hackernews-200_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-200_000.ndjson",
"sha256": "785b0271fdb47cba574fab617d5d332276b835c05dd86e4a95251cf7892a1685"
},
"hackernews-300_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-300_000.ndjson",
"sha256": "de73c7154652eddfaf69cdc3b2f824d5c452f095f40a20a1c97bb1b5c4d80ab2"
},
"hackernews-400_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-400_000.ndjson",
"sha256": "c1b00a24689110f366447e434c201c086d6f456d54ed1c4995894102794d8fe7"
},
"hackernews-500_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-500_000.ndjson",
"sha256": "ae98f9dbef8193d750e3e2dbb6a91648941a1edca5f6e82c143e7996f4840083"
},
"hackernews-600_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-600_000.ndjson",
"sha256": "b495fdc72c4a944801f786400f22076ab99186bee9699f67cbab2f21f5b74dbe"
},
"hackernews-700_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-700_000.ndjson",
"sha256": "4b2c63974f3dabaa4954e3d4598b48324d03c522321ac05b0d583f36cb78a28b"
},
"hackernews-800_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-800_000.ndjson",
"sha256": "cb7b6afe0e6caa1be111be256821bc63b0771b2a0e1fad95af7aaeeffd7ba546"
},
"hackernews-900_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-900_000.ndjson",
"sha256": "e1154ddcd398f1c867758a93db5bcb21a07b9e55530c188a2917fdef332d3ba9"
},
"hackernews-1_000_000.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/hackernews-1_000_000.ndjson",
"sha256": "27e25efd0b68b159b8b21350d9af76938710cb29ce0393fa71b41c4f3c630ffe"
}
},
// Core of the workload.
// A list of commands to run sequentially.
// A command is a request to the Meilisearch instance that is executed while the profiling runs.
"commands": [
{
// Meilisearch route to call. `http://localhost:7700/` will be prepended.
"route": "indexes/movies/settings",
// HTTP method to call.
"method": "PATCH",
// If applicable, body of the request.
// Optional, if missing, the body will be empty.
"body": {
// One of "empty", "inline" or "asset".
// If using "empty", you can skip the entire "body" key.
"inline": {
// when "inline" is used, the body is the JSON object that is the value of the `"inline"` key.
"displayedAttributes": [
"title",
"by",
"score",
"time"
],
"searchableAttributes": [
"title"
],
"filterableAttributes": [
"by"
],
"sortableAttributes": [
"score",
"time"
]
}
},
// Whether to wait before running the next request.
// One of:
// - DontWait: run the next command without waiting for the response to this one.
// - WaitForResponse: run the next command as soon as the response from the server is received.
// - WaitForTask: run the next command once **all** the Meilisearch tasks created up to now have finished processing.
"synchronous": "DontWait"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
// When using "asset", use the name of an asset as value to use the content of that asset as body.
// The content type is derived from the format of the asset:
// "NdJson" => "application/x-ndjson"
// "Json" => "application/json"
// "Raw" => "application/octet-stream"
// See [AssetFormat::to_content_type](https://github.com/meilisearch/meilisearch/blob/7b670a4afadb132ac4a01b6403108700501a391d/xtask/src/bench/assets.rs#L30)
// for details and up-to-date list.
"asset": "hackernews-100_000.ndjson"
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-200_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-300_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-400_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-500_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-600_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-700_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-800_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-900_000.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-1_000_000.ndjson"
},
"synchronous": "WaitForTask"
}
]
}
```
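The asset-format-to-content-type mapping described in the workload comments can be sketched as follows. This is a simplified, hypothetical stand-in written for illustration: the enum and method names mirror the `AssetFormat::to_content_type` reference above, but the real definition in `xtask/src/bench/assets.rs` may differ in signature and in the variants it supports.

```rust
/// Simplified, hypothetical stand-in for the asset format of a workload file.
/// The authoritative definition lives in `xtask/src/bench/assets.rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AssetFormat {
    NdJson,
    Json,
    Raw,
}

impl AssetFormat {
    /// Content type sent with the request body, following the mapping
    /// listed in the workload comments.
    fn to_content_type(self) -> &'static str {
        match self {
            AssetFormat::NdJson => "application/x-ndjson",
            AssetFormat::Json => "application/json",
            AssetFormat::Raw => "application/octet-stream",
        }
    }
}

fn main() {
    // Print the mapping for every format known to this sketch.
    for format in [AssetFormat::NdJson, AssetFormat::Json, AssetFormat::Raw] {
        println!("{:?} => {}", format, format.to_content_type());
    }
}
```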
### Adding new assets
Assets reside in our DigitalOcean S3 space. Assuming you have team access to the DigitalOcean S3 space:
1. go to <https://cloud.digitalocean.com/spaces/milli-benchmarks?i=d1c552&path=bench%2Fdatasets%2F>
2. upload your dataset:
1. if your dataset is a single file, upload that single file using the "upload" button,
2. otherwise, create a folder using the "create folder" button, then inside that folder upload your individual files.
## Upgrading `https://bench.meilisearch.dev`
The URL of the server is in our password manager (look for "benchboard").
1. Make the needed modifications on the [benchboard repository](https://github.com/meilisearch/benchboard) and merge them to main.
2. Publish a new release to produce the Ubuntu/Debian binary.
3. Download the binary locally, send it to the server:
```
scp -6 ~/Downloads/benchboard root@\[<ipv6-address>\]:/bench/new-benchboard
```
Note that the IPv6 address must be between escaped square brackets for SCP.
4. SSH to the server:
```
ssh root@<ipv6-address>
```
Note that the IPv6 address must **NOT** be between escaped square brackets for SSH 🥲
5. On the server, set the correct permissions for the new binary:
```
chown bench:bench /bench/new-benchboard
chmod 700 /bench/new-benchboard
```
6. On the server, move the new binary to the location of the running binary (if unsure, start by making a backup of the running binary):
```
mv /bench/{new-,}benchboard
```
7. Restart the benchboard service:
```
systemctl restart benchboard
```
8. Check that the service runs correctly:
```
systemctl status benchboard
```
9. Check the availability of the service by going to <https://bench.meilisearch.dev> in your browser.

View File

@ -4,7 +4,7 @@ First, thank you for contributing to Meilisearch! The goal of this document is t
Remember that there are many ways to contribute other than writing code: writing [tutorials or blog posts](https://github.com/meilisearch/awesome-meilisearch), improving [the documentation](https://github.com/meilisearch/documentation), submitting [bug reports](https://github.com/meilisearch/meilisearch/issues/new?assignees=&labels=&template=bug_report.md&title=) and [feature requests](https://github.com/meilisearch/product/discussions/categories/feedback-feature-proposal)...
The code in this repository is only concerned with managing multiple indexes, handling the update store, and exposing an HTTP API. Search and indexation are the domain of our core engine, [`milli`](https://github.com/meilisearch/milli), while tokenization is handled by [our `charabia` library](https://github.com/meilisearch/charabia/).
Meilisearch can manage multiple indexes, handle the update store, and expose an HTTP API. Search and indexation are the domain of our core engine, [`milli`](https://github.com/meilisearch/meilisearch/tree/main/milli), while tokenization is handled by [our `charabia` library](https://github.com/meilisearch/charabia/).
If Meilisearch does not offer optimized support for your language, please consider contributing to `charabia` by following the [CONTRIBUTING.md file](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) and integrating your intended normalizer/segmenter.
@ -81,6 +81,30 @@ Meilisearch follows the [cargo xtask](https://github.com/matklad/cargo-xtask) wo
Run `cargo xtask --help` from the root of the repository to find out what is available.
### Logging
Meilisearch uses [`tracing`](https://lib.rs/crates/tracing) for logging purposes. Tracing logs are structured and can be displayed as JSON to the end user, so prefer passing arguments as fields rather than interpolating them in the message.
Refer to the [documentation](https://docs.rs/tracing/0.1.40/tracing/index.html#using-the-macros) for the syntax of the spans and events.
Logging spans are used for 3 distinct purposes:
1. Regular logging
2. Profiling
3. Benchmarking
As a result, the spans should follow some rules:
- They should not be put on functions that are called too often. That is because opening and closing a span causes some overhead. For regular logging, avoid putting spans on functions that are taking less than a few hundred nanoseconds. For profiling or benchmarking, avoid putting spans on functions that are taking less than a few microseconds.
- For profiling and benchmarking, use the `TRACE` level.
- For profiling and benchmarking, use the following `target` prefixes:
- `indexing::` for spans meant for profiling the indexing operations.
- `search::` for spans meant for profiling the search operations.
### Benchmarking
See [BENCHMARKS.md](./BENCHMARKS.md)
## Git Guidelines
### Git Branches

936
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@ -17,11 +17,11 @@ members = [
"benchmarks",
"fuzzers",
"tracing-trace",
"xtask",
"xtask", "build-info",
]
[workspace.package]
version = "1.7.0"
version = "1.8.0"
authors = [
"Quentin de Quelen <quentin@dequelen.me>",
"Clément Renault <clement@meilisearch.com>",

View File

@ -8,7 +8,7 @@ WORKDIR /
ARG COMMIT_SHA
ARG COMMIT_DATE
ARG GIT_TAG
ENV VERGEN_GIT_SHA=${COMMIT_SHA} VERGEN_GIT_COMMIT_TIMESTAMP=${COMMIT_DATE} VERGEN_GIT_SEMVER_LIGHTWEIGHT=${GIT_TAG}
ENV VERGEN_GIT_SHA=${COMMIT_SHA} VERGEN_GIT_COMMIT_TIMESTAMP=${COMMIT_DATE} VERGEN_GIT_DESCRIBE=${GIT_TAG}
ENV RUSTFLAGS="-C target-feature=-crt-static"
COPY . .


18
build-info/Cargo.toml Normal file
View File

@ -0,0 +1,18 @@
[package]
name = "build-info"
version.workspace = true
authors.workspace = true
description.workspace = true
homepage.workspace = true
readme.workspace = true
edition.workspace = true
license.workspace = true
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
time = { version = "0.3.34", features = ["parsing"] }
[build-dependencies]
anyhow = "1.0.80"
vergen-git2 = "1.0.0-beta.2"

22
build-info/build.rs Normal file
View File

@ -0,0 +1,22 @@
fn main() {
if let Err(err) = emit_git_variables() {
println!("cargo:warning=vergen: {}", err);
}
}
fn emit_git_variables() -> anyhow::Result<()> {
// Note: any code that needs VERGEN_ environment variables should take care to define them manually in the Dockerfile and pass them
// in the corresponding GitHub workflow (publish_docker.yml).
// This is due to the Dockerfile building the binary outside of the git directory.
let mut builder = vergen_git2::Git2Builder::default();
builder.branch(true);
builder.commit_timestamp(true);
builder.commit_message(true);
builder.describe(true, true, None);
builder.sha(false);
let git2 = builder.build()?;
vergen_git2::Emitter::default().fail_on_error().add_instructions(&git2)?.emit()
}

203
build-info/src/lib.rs Normal file
View File

@ -0,0 +1,203 @@
use time::format_description::well_known::Iso8601;
#[derive(Debug, Clone)]
pub struct BuildInfo {
pub branch: Option<&'static str>,
pub describe: Option<DescribeResult>,
pub commit_sha1: Option<&'static str>,
pub commit_msg: Option<&'static str>,
pub commit_timestamp: Option<time::OffsetDateTime>,
}
impl BuildInfo {
pub fn from_build() -> Self {
let branch: Option<&'static str> = option_env!("VERGEN_GIT_BRANCH");
let describe = DescribeResult::from_build();
let commit_sha1 = option_env!("VERGEN_GIT_SHA");
let commit_msg = option_env!("VERGEN_GIT_COMMIT_MESSAGE");
let commit_timestamp = option_env!("VERGEN_GIT_COMMIT_TIMESTAMP");
let commit_timestamp = commit_timestamp.and_then(|commit_timestamp| {
time::OffsetDateTime::parse(commit_timestamp, &Iso8601::DEFAULT).ok()
});
Self { branch, describe, commit_sha1, commit_msg, commit_timestamp }
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DescribeResult {
Prototype { name: &'static str },
Release { version: &'static str, major: u64, minor: u64, patch: u64 },
Prerelease { version: &'static str, major: u64, minor: u64, patch: u64, rc: u64 },
NotATag { describe: &'static str },
}
impl DescribeResult {
pub fn new(describe: &'static str) -> Self {
if let Some(name) = prototype_name(describe) {
Self::Prototype { name }
} else if let Some(release) = release_version(describe) {
release
} else if let Some(prerelease) = prerelease_version(describe) {
prerelease
} else {
Self::NotATag { describe }
}
}
pub fn from_build() -> Option<Self> {
let describe: &'static str = option_env!("VERGEN_GIT_DESCRIBE")?;
Some(Self::new(describe))
}
pub fn as_tag(&self) -> Option<&'static str> {
match self {
DescribeResult::Prototype { name } => Some(name),
DescribeResult::Release { version, .. } => Some(version),
DescribeResult::Prerelease { version, .. } => Some(version),
DescribeResult::NotATag { describe: _ } => None,
}
}
pub fn as_prototype(&self) -> Option<&'static str> {
match self {
DescribeResult::Prototype { name } => Some(name),
DescribeResult::Release { .. }
| DescribeResult::Prerelease { .. }
| DescribeResult::NotATag { .. } => None,
}
}
}
/// Parses the input as a prototype name.
///
/// Returns `Some(prototype_name)` if the following conditions are met on this value:
///
/// 1. starts with `prototype-`,
/// 2. ends with `-<some_number>`,
/// 3. does not end with `<some_number>-<some_number>`.
///
/// Otherwise, returns `None`.
fn prototype_name(describe: &'static str) -> Option<&'static str> {
if !describe.starts_with("prototype-") {
return None;
}
let mut rsplit_prototype = describe.rsplit('-');
// last component MUST be a number
rsplit_prototype.next()?.parse::<u64>().ok()?;
// the second-to-last component SHALL NOT be a number
rsplit_prototype.next()?.parse::<u64>().err()?;
Some(describe)
}
fn release_version(describe: &'static str) -> Option<DescribeResult> {
if !describe.starts_with('v') {
return None;
}
// full release versions don't contain a `-`
if describe.contains('-') {
return None;
}
// full release versions parse as vX.Y.Z, with X, Y, Z numbers.
let mut dots = describe[1..].split('.');
let major: u64 = dots.next()?.parse().ok()?;
let minor: u64 = dots.next()?.parse().ok()?;
let patch: u64 = dots.next()?.parse().ok()?;
if dots.next().is_some() {
return None;
}
Some(DescribeResult::Release { version: describe, major, minor, patch })
}
fn prerelease_version(describe: &'static str) -> Option<DescribeResult> {
// prerelease version is in the shape vM.N.P-rc.C
let mut hyphen = describe.rsplit('-');
let prerelease = hyphen.next()?;
if !prerelease.starts_with("rc.") {
return None;
}
let rc: u64 = prerelease[3..].parse().ok()?;
let release = hyphen.next()?;
let DescribeResult::Release { version: _, major, minor, patch } = release_version(release)?
else {
return None;
};
Some(DescribeResult::Prerelease { version: describe, major, minor, patch, rc })
}
#[cfg(test)]
mod test {
use super::DescribeResult;
fn assert_not_a_tag(describe: &'static str) {
assert_eq!(DescribeResult::NotATag { describe }, DescribeResult::new(describe))
}
fn assert_proto(describe: &'static str) {
assert_eq!(DescribeResult::Prototype { name: describe }, DescribeResult::new(describe))
}
fn assert_release(describe: &'static str, major: u64, minor: u64, patch: u64) {
assert_eq!(
DescribeResult::Release { version: describe, major, minor, patch },
DescribeResult::new(describe)
)
}
fn assert_prerelease(describe: &'static str, major: u64, minor: u64, patch: u64, rc: u64) {
assert_eq!(
DescribeResult::Prerelease { version: describe, major, minor, patch, rc },
DescribeResult::new(describe)
)
}
#[test]
fn not_a_tag() {
assert_not_a_tag("whatever-fuzzy");
assert_not_a_tag("whatever-fuzzy-5-ggg-dirty");
assert_not_a_tag("whatever-fuzzy-120-ggg-dirty");
// technically a tag, but not a proto nor a version, so not parsed as a tag
assert_not_a_tag("whatever");
// dirty version
assert_not_a_tag("v1.7.0-1-ggga-dirty");
assert_not_a_tag("v1.7.0-rc.1-1-ggga-dirty");
// after version
assert_not_a_tag("v1.7.0-1-ggga");
assert_not_a_tag("v1.7.0-rc.1-1-ggga");
// after proto
assert_not_a_tag("prototype-tag-0-1-ggga");
assert_not_a_tag("prototype-tag-0-1-ggga-dirty");
}
#[test]
fn prototype() {
assert_proto("prototype-tag-0");
assert_proto("prototype-tag-10");
assert_proto("prototype-long-name-tag-10");
}
#[test]
fn release() {
assert_release("v1.7.2", 1, 7, 2);
}
#[test]
fn prerelease() {
assert_prerelease("v1.7.2-rc.3", 1, 7, 2, 3);
}
}
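The prototype-tag rule documented on `prototype_name` can be restated as a standalone predicate. The sketch below is an illustration only: `is_prototype_tag` is a hypothetical helper that is not part of the `build-info` crate. It applies the same three conditions (starts with `prototype-`, last `-` component is a number, the one before it is not).

```rust
/// Hypothetical helper mirroring the three conditions documented on
/// `prototype_name`; it returns a bool instead of the original `Option`.
fn is_prototype_tag(describe: &str) -> bool {
    // 1. must start with `prototype-`
    if !describe.starts_with("prototype-") {
        return false;
    }
    let mut components = describe.rsplit('-');
    // 2. the last component must be a number
    let last_is_number = components.next().map_or(false, |c| c.parse::<u64>().is_ok());
    // 3. the component before it must exist and must not be a number
    let previous_is_number = components.next().map_or(true, |c| c.parse::<u64>().is_ok());
    last_is_number && !previous_is_number
}

fn main() {
    for describe in ["prototype-tag-0", "prototype-tag-0-1-ggga", "prototype-tag-0-1", "v1.7.2"] {
        println!("{describe}: {}", is_prototype_tag(describe));
    }
}
```

The third condition is what rejects `git describe` output for commits made after a prototype tag: such output ends in `-<commit_count>-g<hash>`, optionally followed by `-dirty`, so its last component is not a bare number.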

View File

@ -277,6 +277,7 @@ pub(crate) mod test {
}),
pagination: Setting::NotSet,
embedders: Setting::NotSet,
search_cutoff_ms: Setting::NotSet,
_kind: std::marker::PhantomData,
};
settings.check()

View File

@ -379,6 +379,7 @@ impl<T> From<v5::Settings<T>> for v6::Settings<v6::Unchecked> {
v5::Setting::NotSet => v6::Setting::NotSet,
},
embedders: v6::Setting::NotSet,
search_cutoff_ms: v6::Setting::NotSet,
_kind: std::marker::PhantomData,
}
}

View File

@ -61,7 +61,7 @@ pub enum IndexDocumentsMethod {
#[cfg_attr(test, derive(serde::Serialize))]
#[non_exhaustive]
pub enum UpdateFormat {
/// The given update is a real **comma seperated** CSV with headers on the first line.
/// The given update is a real **comma separated** CSV with headers on the first line.
Csv,
/// The given update is a JSON array with documents inside.
Json,

View File

@ -219,7 +219,7 @@ pub(crate) mod test {
fn _create_directory_hierarchy(dir: &Path, depth: usize) -> String {
let mut ret = String::new();
// the entries are not guarenteed to be returned in the same order thus we need to sort them.
// the entries are not guaranteed to be returned in the same order thus we need to sort them.
let mut entries =
fs::read_dir(dir).unwrap().collect::<std::result::Result<Vec<_>, _>>().unwrap();

View File

@ -42,7 +42,7 @@ fn quoted_by(quote: char, input: Span) -> IResult<Token> {
)));
}
}
// if it was preceeded by a `\` or if it was anything else we can continue to advance
// if it was preceded by a `\` or if it was anything else we can continue to advance
}
Ok((

View File

@ -870,7 +870,7 @@ mod tests {
debug_snapshot!(autobatch_from(false,None, [doc_imp(UpdateDocuments, false, None), settings(false), idx_del()]), @"Some((IndexDeletion { ids: [0, 2, 1] }, false))");
debug_snapshot!(autobatch_from(false,None, [doc_imp(ReplaceDocuments,false, None), settings(false), doc_clr(), idx_del()]), @"Some((IndexDeletion { ids: [1, 3, 0, 2] }, false))");
debug_snapshot!(autobatch_from(false,None, [doc_imp(UpdateDocuments, false, None), settings(false), doc_clr(), idx_del()]), @"Some((IndexDeletion { ids: [1, 3, 0, 2] }, false))");
// The third and final case is when the first task doesn't create an index but is directly followed by a task creating an index. In this case we can't batch whith what
// The third and final case is when the first task doesn't create an index but is directly followed by a task creating an index. In this case we can't batch whit what
// follows because we first need to process the erronous batch.
debug_snapshot!(autobatch_from(false,None, [doc_imp(ReplaceDocuments,false, None), settings(true), idx_del()]), @"Some((DocumentOperation { method: ReplaceDocuments, allow_index_creation: false, primary_key: None, operation_ids: [0] }, false))");
debug_snapshot!(autobatch_from(false,None, [doc_imp(UpdateDocuments, false, None), settings(true), idx_del()]), @"Some((DocumentOperation { method: UpdateDocuments, allow_index_creation: false, primary_key: None, operation_ids: [0] }, false))");

View File

@ -920,7 +920,11 @@ impl IndexScheduler {
}
// 3.2. Dump the settings
let settings = meilisearch_types::settings::settings(index, &rtxn)?;
let settings = meilisearch_types::settings::settings(
index,
&rtxn,
meilisearch_types::settings::SecretPolicy::RevealSecrets,
)?;
index_dumper.settings(&settings)?;
Ok(())
})?;

View File

@ -1301,8 +1301,8 @@ impl IndexScheduler {
wtxn.commit().map_err(Error::HeedTransaction)?;
// Once the tasks are commited, we should delete all the update files associated ASAP to avoid leaking files in case of a restart
tracing::debug!("Deleting the upadate files");
// Once the tasks are committed, we should delete all the update files associated ASAP to avoid leaking files in case of a restart
tracing::debug!("Deleting the update files");
//We take one read transaction **per thread**. Then, every thread is going to pull out new IDs from the roaring bitmap with the help of an atomic shared index into the bitmap
let idx = AtomicU32::new(0);
@ -1332,7 +1332,7 @@ impl IndexScheduler {
Ok(TickOutcome::TickAgain(processed_tasks))
}
/// Once the tasks changes have been commited we must send all the tasks that were updated to our webhook if there is one.
/// Once the tasks changes have been committed we must send all the tasks that were updated to our webhook if there is one.
fn notify_webhook(&self, updated: &RoaringBitmap) -> Result<()> {
if let Some(ref url) = self.webhook_url {
struct TaskReader<'a, 'b> {
@ -1395,7 +1395,10 @@ impl IndexScheduler {
// let reader = GzEncoder::new(BufReader::new(task_reader), Compression::default());
let reader = GzEncoder::new(BufReader::new(task_reader), Compression::default());
let request = ureq::post(url).set("Content-Encoding", "gzip");
let request = ureq::post(url)
.timeout(Duration::from_secs(30))
.set("Content-Encoding", "gzip")
.set("Content-Type", "application/x-ndjson");
let request = match &self.webhook_authorization_header {
Some(header) => request.set("Authorization", header),
None => request,
@ -3025,6 +3028,66 @@ mod tests {
snapshot!(serde_json::to_string_pretty(&documents).unwrap(), name: "documents");
}
#[test]
fn test_settings_update() {
use meilisearch_types::settings::{Settings, Unchecked};
use milli::update::Setting;
let (index_scheduler, mut handle) = IndexScheduler::test(true, vec![]);
let mut new_settings: Box<Settings<Unchecked>> = Box::default();
let mut embedders = BTreeMap::default();
let embedding_settings = milli::vector::settings::EmbeddingSettings {
source: Setting::Set(milli::vector::settings::EmbedderSource::Rest),
api_key: Setting::Set(S("My super secret")),
url: Setting::Set(S("http://localhost:7777")),
..Default::default()
};
embedders.insert(S("default"), Setting::Set(embedding_settings));
new_settings.embedders = Setting::Set(embedders);
index_scheduler
.register(
KindWithContent::SettingsUpdate {
index_uid: S("doggos"),
new_settings,
is_deletion: false,
allow_index_creation: true,
},
None,
false,
)
.unwrap();
index_scheduler.assert_internally_consistent();
snapshot!(snapshot_index_scheduler(&index_scheduler), name: "after_registering_settings_task");
{
let rtxn = index_scheduler.read_txn().unwrap();
let task = index_scheduler.get_task(&rtxn, 0).unwrap().unwrap();
let task = meilisearch_types::task_view::TaskView::from_task(&task);
insta::assert_json_snapshot!(task.details);
}
handle.advance_n_successful_batches(1);
snapshot!(snapshot_index_scheduler(&index_scheduler), name: "settings_update_processed");
{
let rtxn = index_scheduler.read_txn().unwrap();
let task = index_scheduler.get_task(&rtxn, 0).unwrap().unwrap();
let task = meilisearch_types::task_view::TaskView::from_task(&task);
insta::assert_json_snapshot!(task.details);
}
// has everything been pushed successfully in milli?
let index = index_scheduler.index("doggos").unwrap();
let rtxn = index.read_txn().unwrap();
let configs = index.embedding_configs(&rtxn).unwrap();
let (_, embedding_config) = configs.first().unwrap();
insta::assert_json_snapshot!(embedding_config.embedder_options);
}
#[test]
fn test_document_replace_without_autobatching() {
let (index_scheduler, mut handle) = IndexScheduler::test(false, vec![]);

View File

@ -0,0 +1,13 @@
---
source: index-scheduler/src/lib.rs
expression: task.details
---
{
"embedders": {
"default": {
"source": "rest",
"apiKey": "MyXXXX...",
"url": "http://localhost:7777"
}
}
}
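The masked `apiKey` in this snapshot is consistent with the length-based masking in `Settings::hide_secret`, which is part of this changeset. A minimal standalone sketch of that scheme follows. `mask_secret` is a hypothetical name, and unlike the original, which edits the `String` in place with byte ranges, this version returns a new `String` and keeps whole characters so that it cannot panic on non-ASCII input.

```rust
/// Hypothetical, non-mutating variant of the masking scheme: the longer the
/// secret, the more leading characters are kept and the more `X`s are appended.
fn mask_secret(secret: &str) -> String {
    // Thresholds are on the byte length, as in the original `hide_secret`.
    let (kept, mask) = match secret.len() {
        len if len < 10 => (0, "XXX..."),
        len if len < 20 => (2, "XXXX..."),
        len if len < 30 => (3, "XXXXX..."),
        _ => (5, "XXXXXX..."),
    };
    // Keep whole characters rather than bytes (an assumption of this sketch).
    let prefix: String = secret.chars().take(kept).collect();
    format!("{prefix}{mask}")
}

fn main() {
    for secret in ["short", "My super secret", "abcdefghijklmnopqrstuvwxy"] {
        println!("{}", mask_secret(secret));
    }
}
```

For ASCII secrets the result matches the in-place version: the 15-byte test key `My super secret` falls in the second bucket and becomes `MyXXXX...`, as shown in the snapshot.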

View File

@ -0,0 +1,23 @@
---
source: index-scheduler/src/lib.rs
expression: embedding_config.embedder_options
---
{
"Rest": {
"api_key": "My super secret",
"distribution": null,
"dimensions": null,
"url": "http://localhost:7777",
"query": null,
"input_field": [
"input"
],
"path_to_embeddings": [
"data"
],
"embedding_object": [
"embedding"
],
"input_type": "text"
}
}

View File

@ -0,0 +1,13 @@
---
source: index-scheduler/src/lib.rs
expression: task.details
---
{
"embedders": {
"default": {
"source": "rest",
"apiKey": "MyXXXX...",
"url": "http://localhost:7777"
}
}
}

View File

@ -0,0 +1,36 @@
---
source: index-scheduler/src/lib.rs
---
### Autobatching Enabled = true
### Processing Tasks:
[]
----------------------------------------------------------------------
### All Tasks:
0 {uid: 0, status: enqueued, details: { settings: Settings { displayed_attributes: NotSet, searchable_attributes: NotSet, filterable_attributes: NotSet, sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: Set({"default": Set(EmbeddingSettings { source: Set(Rest), model: NotSet, revision: NotSet, api_key: Set("My super secret"), dimensions: NotSet, document_template: NotSet, url: Set("http://localhost:7777"), query: NotSet, input_field: NotSet, path_to_embeddings: NotSet, embedding_object: NotSet, input_type: NotSet, distribution: NotSet })}), search_cutoff_ms: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> } }, kind: SettingsUpdate { index_uid: "doggos", new_settings: Settings { displayed_attributes: NotSet, searchable_attributes: NotSet, filterable_attributes: NotSet, sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: Set({"default": Set(EmbeddingSettings { source: Set(Rest), model: NotSet, revision: NotSet, api_key: Set("My super secret"), dimensions: NotSet, document_template: NotSet, url: Set("http://localhost:7777"), query: NotSet, input_field: NotSet, path_to_embeddings: NotSet, embedding_object: NotSet, input_type: NotSet, distribution: NotSet })}), search_cutoff_ms: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> }, is_deletion: false, allow_index_creation: true }}
----------------------------------------------------------------------
### Status:
enqueued [0,]
----------------------------------------------------------------------
### Kind:
"settingsUpdate" [0,]
----------------------------------------------------------------------
### Index Tasks:
doggos [0,]
----------------------------------------------------------------------
### Index Mapper:
----------------------------------------------------------------------
### Canceled By:
----------------------------------------------------------------------
### Enqueued At:
[timestamp] [0,]
----------------------------------------------------------------------
### Started At:
----------------------------------------------------------------------
### Finished At:
----------------------------------------------------------------------
### File Store:
----------------------------------------------------------------------

View File

@ -0,0 +1,40 @@
---
source: index-scheduler/src/lib.rs
---
### Autobatching Enabled = true
### Processing Tasks:
[]
----------------------------------------------------------------------
### All Tasks:
0 {uid: 0, status: succeeded, details: { settings: Settings { displayed_attributes: NotSet, searchable_attributes: NotSet, filterable_attributes: NotSet, sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: Set({"default": Set(EmbeddingSettings { source: Set(Rest), model: NotSet, revision: NotSet, api_key: Set("My super secret"), dimensions: NotSet, document_template: NotSet, url: Set("http://localhost:7777"), query: NotSet, input_field: NotSet, path_to_embeddings: NotSet, embedding_object: NotSet, input_type: NotSet, distribution: NotSet })}), search_cutoff_ms: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> } }, kind: SettingsUpdate { index_uid: "doggos", new_settings: Settings { displayed_attributes: NotSet, searchable_attributes: NotSet, filterable_attributes: NotSet, sortable_attributes: NotSet, ranking_rules: NotSet, stop_words: NotSet, non_separator_tokens: NotSet, separator_tokens: NotSet, dictionary: NotSet, synonyms: NotSet, distinct_attribute: NotSet, proximity_precision: NotSet, typo_tolerance: NotSet, faceting: NotSet, pagination: NotSet, embedders: Set({"default": Set(EmbeddingSettings { source: Set(Rest), model: NotSet, revision: NotSet, api_key: Set("My super secret"), dimensions: NotSet, document_template: NotSet, url: Set("http://localhost:7777"), query: NotSet, input_field: NotSet, path_to_embeddings: NotSet, embedding_object: NotSet, input_type: NotSet, distribution: NotSet })}), search_cutoff_ms: NotSet, _kind: PhantomData<meilisearch_types::settings::Unchecked> }, is_deletion: false, allow_index_creation: true }}
----------------------------------------------------------------------
### Status:
enqueued []
succeeded [0,]
----------------------------------------------------------------------
### Kind:
"settingsUpdate" [0,]
----------------------------------------------------------------------
### Index Tasks:
doggos [0,]
----------------------------------------------------------------------
### Index Mapper:
doggos: { number_of_documents: 0, field_distribution: {} }
----------------------------------------------------------------------
### Canceled By:
----------------------------------------------------------------------
### Enqueued At:
[timestamp] [0,]
----------------------------------------------------------------------
### Started At:
[timestamp] [0,]
----------------------------------------------------------------------
### Finished At:
[timestamp] [0,]
----------------------------------------------------------------------
### File Store:
----------------------------------------------------------------------

View File

@ -11,7 +11,7 @@ edition.workspace = true
license.workspace = true
[dependencies]
actix-web = { version = "4.4.1", default-features = false }
actix-web = { version = "4.5.1", default-features = false }
anyhow = "1.0.79"
convert_case = "0.6.0"
csv = "1.3.0"

View File

@ -2,6 +2,7 @@ use std::{fmt, io};
use actix_web::http::StatusCode;
use actix_web::{self as aweb, HttpResponseBuilder};
use aweb::http::header;
use aweb::rt::task::JoinError;
use convert_case::Casing;
use milli::heed::{Error as HeedError, MdbError};
@ -56,7 +57,14 @@ where
impl aweb::error::ResponseError for ResponseError {
fn error_response(&self) -> aweb::HttpResponse {
let json = serde_json::to_vec(self).unwrap();
HttpResponseBuilder::new(self.status_code()).content_type("application/json").body(json)
let mut builder = HttpResponseBuilder::new(self.status_code());
builder.content_type("application/json");
if self.code == StatusCode::SERVICE_UNAVAILABLE {
builder.insert_header((header::RETRY_AFTER, "10"));
}
builder.body(json)
}
fn status_code(&self) -> StatusCode {
@ -237,6 +245,7 @@ InvalidSearchCropMarker , InvalidRequest , BAD_REQUEST ;
InvalidSearchFacets , InvalidRequest , BAD_REQUEST ;
InvalidSearchSemanticRatio , InvalidRequest , BAD_REQUEST ;
InvalidFacetSearchFacetName , InvalidRequest , BAD_REQUEST ;
InvalidRecommendId , InvalidRequest , BAD_REQUEST ;
InvalidSearchFilter , InvalidRequest , BAD_REQUEST ;
InvalidSearchHighlightPostTag , InvalidRequest , BAD_REQUEST ;
InvalidSearchHighlightPreTag , InvalidRequest , BAD_REQUEST ;
@ -259,6 +268,7 @@ InvalidSettingsProximityPrecision , InvalidRequest , BAD_REQUEST ;
InvalidSettingsFaceting , InvalidRequest , BAD_REQUEST ;
InvalidSettingsFilterableAttributes , InvalidRequest , BAD_REQUEST ;
InvalidSettingsPagination , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSearchCutoffMs , InvalidRequest , BAD_REQUEST ;
InvalidSettingsEmbedders , InvalidRequest , BAD_REQUEST ;
InvalidSettingsRankingRules , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSearchableAttributes , InvalidRequest , BAD_REQUEST ;
@ -304,6 +314,7 @@ MissingSwapIndexes , InvalidRequest , BAD_REQUEST ;
MissingTaskFilters , InvalidRequest , BAD_REQUEST ;
NoSpaceLeftOnDevice , System , UNPROCESSABLE_ENTITY;
PayloadTooLarge , InvalidRequest , PAYLOAD_TOO_LARGE ;
TooManySearchRequests , System , SERVICE_UNAVAILABLE ;
TaskNotFound , InvalidRequest , NOT_FOUND ;
TooManyOpenFiles , System , UNPROCESSABLE_ENTITY ;
TooManyVectors , InvalidRequest , BAD_REQUEST ;
@ -352,6 +363,7 @@ impl ErrorCode for milli::Error {
| UserError::InvalidOpenAiModelDimensions { .. }
| UserError::InvalidOpenAiModelDimensionsMax { .. }
| UserError::InvalidSettingsDimensions { .. }
| UserError::InvalidUrl { .. }
| UserError::InvalidPrompt(_) => Code::InvalidSettingsEmbedders,
UserError::TooManyEmbedders(_) => Code::InvalidSettingsEmbedders,
UserError::InvalidPromptForEmbeddings(..) => Code::InvalidSettingsEmbedders,
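
Reviewer note: the `error_response` hunk above adds a `Retry-After: 10` header only when the status is 503 Service Unavailable, which pairs with the new `TooManySearchRequests` code. The rule can be sketched without actix as follows. `headers_for_status` is a hypothetical stand-in for the response builder, not a function from this change.

```rust
/// Hypothetical helper (not part of the diff): returns the headers that the
/// new `error_response` would attach for a given HTTP status code.
fn headers_for_status(status: u16) -> Vec<(&'static str, &'static str)> {
    // Every error body is serialized as JSON.
    let mut headers = vec![("content-type", "application/json")];
    // 503 Service Unavailable: hint to the client how many seconds to wait.
    if status == 503 {
        headers.push(("retry-after", "10"));
    }
    headers
}
```

The value `"10"` matches the "Retry after 10s" wording of the new `TooManySearchRequests` error message, so the two should be kept in sync if either changes.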


@ -202,12 +202,52 @@ pub struct Settings<T> {
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsEmbedders>)]
pub embedders: Setting<BTreeMap<String, Setting<milli::vector::settings::EmbeddingSettings>>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsSearchCutoffMs>)]
pub search_cutoff_ms: Setting<u64>,
#[serde(skip)]
#[deserr(skip)]
pub _kind: PhantomData<T>,
}
impl<T> Settings<T> {
pub fn hide_secrets(&mut self) {
let Setting::Set(embedders) = &mut self.embedders else {
return;
};
for mut embedder in embedders.values_mut() {
let Setting::Set(embedder) = &mut embedder else {
continue;
};
let Setting::Set(api_key) = &mut embedder.api_key else {
continue;
};
Self::hide_secret(api_key);
}
}
fn hide_secret(secret: &mut String) {
match secret.len() {
x if x < 10 => {
secret.replace_range(.., "XXX...");
}
x if x < 20 => {
secret.replace_range(2.., "XXXX...");
}
x if x < 30 => {
secret.replace_range(3.., "XXXXX...");
}
_x => {
secret.replace_range(5.., "XXXXXX...");
}
}
}
}
impl Settings<Checked> {
pub fn cleared() -> Settings<Checked> {
Settings {
@ -227,6 +267,7 @@ impl Settings<Checked> {
faceting: Setting::Reset,
pagination: Setting::Reset,
embedders: Setting::Reset,
search_cutoff_ms: Setting::Reset,
_kind: PhantomData,
}
}
@ -249,6 +290,7 @@ impl Settings<Checked> {
faceting,
pagination,
embedders,
search_cutoff_ms,
..
} = self;
@ -269,6 +311,7 @@ impl Settings<Checked> {
faceting,
pagination,
embedders,
search_cutoff_ms,
_kind: PhantomData,
}
}
@ -315,6 +358,7 @@ impl Settings<Unchecked> {
faceting: self.faceting,
pagination: self.pagination,
embedders: self.embedders,
search_cutoff_ms: self.search_cutoff_ms,
_kind: PhantomData,
}
}
@ -347,19 +391,40 @@ pub fn apply_settings_to_builder(
settings: &Settings<Checked>,
builder: &mut milli::update::Settings,
) {
match settings.searchable_attributes {
let Settings {
displayed_attributes,
searchable_attributes,
filterable_attributes,
sortable_attributes,
ranking_rules,
stop_words,
non_separator_tokens,
separator_tokens,
dictionary,
synonyms,
distinct_attribute,
proximity_precision,
typo_tolerance,
faceting,
pagination,
embedders,
search_cutoff_ms,
_kind,
} = settings;
match searchable_attributes {
Setting::Set(ref names) => builder.set_searchable_fields(names.clone()),
Setting::Reset => builder.reset_searchable_fields(),
Setting::NotSet => (),
}
match settings.displayed_attributes {
match displayed_attributes {
Setting::Set(ref names) => builder.set_displayed_fields(names.clone()),
Setting::Reset => builder.reset_displayed_fields(),
Setting::NotSet => (),
}
match settings.filterable_attributes {
match filterable_attributes {
Setting::Set(ref facets) => {
builder.set_filterable_fields(facets.clone().into_iter().collect())
}
@ -367,13 +432,13 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.sortable_attributes {
match sortable_attributes {
Setting::Set(ref fields) => builder.set_sortable_fields(fields.iter().cloned().collect()),
Setting::Reset => builder.reset_sortable_fields(),
Setting::NotSet => (),
}
match settings.ranking_rules {
match ranking_rules {
Setting::Set(ref criteria) => {
builder.set_criteria(criteria.iter().map(|c| c.clone().into()).collect())
}
@ -381,13 +446,13 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.stop_words {
match stop_words {
Setting::Set(ref stop_words) => builder.set_stop_words(stop_words.clone()),
Setting::Reset => builder.reset_stop_words(),
Setting::NotSet => (),
}
match settings.non_separator_tokens {
match non_separator_tokens {
Setting::Set(ref non_separator_tokens) => {
builder.set_non_separator_tokens(non_separator_tokens.clone())
}
@ -395,7 +460,7 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.separator_tokens {
match separator_tokens {
Setting::Set(ref separator_tokens) => {
builder.set_separator_tokens(separator_tokens.clone())
}
@ -403,31 +468,31 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.dictionary {
match dictionary {
Setting::Set(ref dictionary) => builder.set_dictionary(dictionary.clone()),
Setting::Reset => builder.reset_dictionary(),
Setting::NotSet => (),
}
match settings.synonyms {
match synonyms {
Setting::Set(ref synonyms) => builder.set_synonyms(synonyms.clone().into_iter().collect()),
Setting::Reset => builder.reset_synonyms(),
Setting::NotSet => (),
}
match settings.distinct_attribute {
match distinct_attribute {
Setting::Set(ref attr) => builder.set_distinct_field(attr.clone()),
Setting::Reset => builder.reset_distinct_field(),
Setting::NotSet => (),
}
match settings.proximity_precision {
match proximity_precision {
Setting::Set(ref precision) => builder.set_proximity_precision((*precision).into()),
Setting::Reset => builder.reset_proximity_precision(),
Setting::NotSet => (),
}
match settings.typo_tolerance {
match typo_tolerance {
Setting::Set(ref value) => {
match value.enabled {
Setting::Set(val) => builder.set_autorize_typos(val),
@ -482,7 +547,7 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match &settings.faceting {
match faceting {
Setting::Set(FacetingSettings { max_values_per_facet, sort_facet_values_by }) => {
match max_values_per_facet {
Setting::Set(val) => builder.set_max_values_per_facet(*val),
@ -504,7 +569,7 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.pagination {
match pagination {
Setting::Set(ref value) => match value.max_total_hits {
Setting::Set(val) => builder.set_pagination_max_total_hits(val),
Setting::Reset => builder.reset_pagination_max_total_hits(),
@ -514,16 +579,28 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.embedders.clone() {
Setting::Set(value) => builder.set_embedder_settings(value),
match embedders {
Setting::Set(value) => builder.set_embedder_settings(value.clone()),
Setting::Reset => builder.reset_embedder_settings(),
Setting::NotSet => (),
}
match search_cutoff_ms {
Setting::Set(cutoff) => builder.set_search_cutoff(*cutoff),
Setting::Reset => builder.reset_search_cutoff(),
Setting::NotSet => (),
}
}
pub enum SecretPolicy {
RevealSecrets,
HideSecrets,
}
pub fn settings(
index: &Index,
rtxn: &crate::heed::RoTxn,
secret_policy: SecretPolicy,
) -> Result<Settings<Checked>, milli::Error> {
let displayed_attributes =
index.displayed_fields(rtxn)?.map(|fields| fields.into_iter().map(String::from).collect());
@ -607,7 +684,9 @@ pub fn settings(
.collect();
let embedders = if embedders.is_empty() { Setting::NotSet } else { Setting::Set(embedders) };
Ok(Settings {
let search_cutoff_ms = index.search_cutoff(rtxn)?;
let mut settings = Settings {
displayed_attributes: match displayed_attributes {
Some(attrs) => Setting::Set(attrs),
None => Setting::Reset,
@ -633,8 +712,18 @@ pub fn settings(
faceting: Setting::Set(faceting),
pagination: Setting::Set(pagination),
embedders,
search_cutoff_ms: match search_cutoff_ms {
Some(cutoff) => Setting::Set(cutoff),
None => Setting::Reset,
},
_kind: PhantomData,
})
};
if let SecretPolicy::HideSecrets = secret_policy {
settings.hide_secrets()
}
Ok(settings)
}
#[derive(Debug, Clone, PartialEq, Eq, Deserr)]
@ -783,6 +872,7 @@ pub(crate) mod test {
faceting: Setting::NotSet,
pagination: Setting::NotSet,
embedders: Setting::NotSet,
search_cutoff_ms: Setting::NotSet,
_kind: PhantomData::<Unchecked>,
};
@ -809,6 +899,7 @@ pub(crate) mod test {
faceting: Setting::NotSet,
pagination: Setting::NotSet,
embedders: Setting::NotSet,
search_cutoff_ms: Setting::NotSet,
_kind: PhantomData::<Unchecked>,
};
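
Reviewer note: the masking tiers in `hide_secret` are easier to check with concrete inputs. The sketch below copies the function body from the hunk above into a free function. The `masked` wrapper is a hypothetical convenience added only for this illustration.

```rust
/// Same tiers as `Settings::hide_secret` in the diff, as a free function.
fn hide_secret(secret: &mut String) {
    match secret.len() {
        // Very short secrets are hidden entirely.
        x if x < 10 => secret.replace_range(.., "XXX..."),
        // Longer secrets keep a short prefix (2, 3, or 5 bytes).
        x if x < 20 => secret.replace_range(2.., "XXXX..."),
        x if x < 30 => secret.replace_range(3.., "XXXXX..."),
        _ => secret.replace_range(5.., "XXXXXX..."),
    }
}

/// Hypothetical wrapper: returns a masked copy instead of mutating in place.
fn masked(secret: &str) -> String {
    let mut copy = secret.to_owned();
    hide_secret(&mut copy);
    copy
}
```

For example, `masked("short")` gives `"XXX..."` and a 12-character key `"0123456789ab"` gives `"01XXXX..."`. One caveat: `String::replace_range` panics when a range bound is not on a `char` boundary, so this is only safe for keys whose first five bytes are ASCII.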


@ -86,7 +86,8 @@ impl From<Details> for DetailsView {
..DetailsView::default()
}
}
Details::SettingsUpdate { settings } => {
Details::SettingsUpdate { mut settings } => {
settings.hide_secrets();
DetailsView { settings: Some(settings), ..DetailsView::default() }
}
Details::IndexInfo { primary_key } => {


@ -14,18 +14,18 @@ default-run = "meilisearch"
[dependencies]
actix-cors = "0.7.0"
actix-http = { version = "3.5.1", default-features = false, features = [
actix-http = { version = "3.6.0", default-features = false, features = [
"compress-brotli",
"compress-gzip",
"rustls",
"rustls-0_21",
] }
actix-utils = "3.0.1"
actix-web = { version = "4.4.1", default-features = false, features = [
actix-web = { version = "4.5.1", default-features = false, features = [
"macros",
"compress-brotli",
"compress-gzip",
"cookies",
"rustls",
"rustls-0_21",
] }
actix-web-static-files = { git = "https://github.com/kilork/actix-web-static-files.git", rev = "2d3b6160", optional = true }
anyhow = { version = "1.0.79", features = ["backtrace"] }
@ -52,7 +52,7 @@ index-scheduler = { path = "../index-scheduler" }
indexmap = { version = "2.1.0", features = ["serde"] }
is-terminal = "0.4.10"
itertools = "0.11.0"
jsonwebtoken = "8.3.0"
jsonwebtoken = "9.2.0"
lazy_static = "1.4.0"
meilisearch-auth = { path = "../meilisearch-auth" }
meilisearch-types = { path = "../meilisearch-types" }
@ -75,7 +75,7 @@ reqwest = { version = "0.11.23", features = [
"rustls-tls",
"json",
], default-features = false }
rustls = "0.20.8"
rustls = "0.21.6"
rustls-pemfile = "1.0.2"
segment = { version = "0.2.3", optional = true }
serde = { version = "1.0.195", features = ["derive"] }
@ -107,6 +107,7 @@ tracing = "0.1.40"
tracing-subscriber = { version = "0.3.18", features = ["json"] }
tracing-trace = { version = "0.1.0", path = "../tracing-trace" }
tracing-actix-web = "0.7.9"
build-info = { version = "1.7.0", path = "../build-info" }
[dev-dependencies]
actix-rt = "2.9.0"
@ -131,7 +132,6 @@ reqwest = { version = "0.11.23", features = [
sha-1 = { version = "0.10.1", optional = true }
static-files = { version = "0.2.3", optional = true }
tempfile = { version = "3.9.0", optional = true }
vergen = { version = "7.5.1", default-features = false, features = ["git"] }
zip = { version = "0.6.6", optional = true }
[features]


@ -1,17 +1,4 @@
use vergen::{vergen, Config, SemverKind};
fn main() {
// Note: any code that needs VERGEN_ environment variables should take care to define them manually in the Dockerfile and pass them
// in the corresponding GitHub workflow (publish_docker.yml).
// This is due to the Dockerfile building the binary outside of the git directory.
let mut config = Config::default();
// allow using non-annotated tags
*config.git_mut().semver_kind_mut() = SemverKind::Lightweight;
if let Err(e) = vergen(config) {
println!("cargo:warning=vergen: {}", e);
}
#[cfg(feature = "mini-dashboard")]
mini_dashboard::setup_mini_dashboard().expect("Could not load the mini-dashboard assets");
}


@ -252,6 +252,7 @@ impl super::Analytics for SegmentAnalytics {
struct Infos {
env: String,
experimental_enable_metrics: bool,
experimental_search_queue_size: usize,
experimental_logs_mode: LogMode,
experimental_replication_parameters: bool,
experimental_enable_logs_route: bool,
@ -293,6 +294,7 @@ impl From<Opt> for Infos {
let Opt {
db_path,
experimental_enable_metrics,
experimental_search_queue_size,
experimental_logs_mode,
experimental_replication_parameters,
experimental_enable_logs_route,
@ -342,6 +344,7 @@ impl From<Opt> for Infos {
Self {
env,
experimental_enable_metrics,
experimental_search_queue_size,
experimental_logs_mode,
experimental_replication_parameters,
experimental_enable_logs_route,
@ -473,7 +476,9 @@ impl Segment {
create_all_stats(index_scheduler.into(), auth_controller.into(), &AuthFilter::default())
{
// Replace the version number with the prototype name if any.
let version = if let Some(prototype) = crate::prototype_name() {
let version = if let Some(prototype) = build_info::DescribeResult::from_build()
.and_then(|describe| describe.as_prototype())
{
prototype
} else {
env!("CARGO_PKG_VERSION")
@ -577,6 +582,8 @@ pub struct SearchAggregator {
// requests
total_received: usize,
total_succeeded: usize,
total_degraded: usize,
total_used_negative_operator: usize,
time_spent: BinaryHeap<usize>,
// sort
@ -751,14 +758,22 @@ impl SearchAggregator {
let SearchResult {
hits: _,
query: _,
vector: _,
processing_time_ms,
hits_info: _,
semantic_hit_count: _,
facet_distribution: _,
facet_stats: _,
degraded,
used_negative_operator,
} = result;
self.total_succeeded = self.total_succeeded.saturating_add(1);
if *degraded {
self.total_degraded = self.total_degraded.saturating_add(1);
}
if *used_negative_operator {
self.total_used_negative_operator = self.total_used_negative_operator.saturating_add(1);
}
self.time_spent.push(*processing_time_ms as usize);
}
@ -800,6 +815,8 @@ impl SearchAggregator {
semantic_ratio,
embedder,
hybrid,
total_degraded,
total_used_negative_operator,
} = other;
if self.timestamp.is_none() {
@ -814,6 +831,9 @@ impl SearchAggregator {
// request
self.total_received = self.total_received.saturating_add(total_received);
self.total_succeeded = self.total_succeeded.saturating_add(total_succeeded);
self.total_degraded = self.total_degraded.saturating_add(total_degraded);
self.total_used_negative_operator =
self.total_used_negative_operator.saturating_add(total_used_negative_operator);
self.time_spent.append(time_spent);
// sort
@ -919,6 +939,8 @@ impl SearchAggregator {
semantic_ratio,
embedder,
hybrid,
total_degraded,
total_used_negative_operator,
} = self;
if total_received == 0 {
@ -938,6 +960,8 @@ impl SearchAggregator {
"total_succeeded": total_succeeded,
"total_failed": total_received.saturating_sub(total_succeeded), // just to be sure we never panic
"total_received": total_received,
"total_degraded": total_degraded,
"total_used_negative_operator": total_used_negative_operator,
},
"sort": {
"with_geoPoint": sort_with_geo_point,


@ -23,12 +23,18 @@ pub enum MeilisearchHttpError {
InvalidContentType(String, Vec<String>),
#[error("Document `{0}` not found.")]
DocumentNotFound(String),
#[error("Document `{0}` not found.")]
InvalidDocumentId(String),
#[error("Sending an empty filter is forbidden.")]
EmptyFilter,
#[error("Invalid syntax for the filter parameter: `expected {}, found: {1}`.", .0.join(", "))]
InvalidExpression(&'static [&'static str], Value),
#[error("A {0} payload is missing.")]
MissingPayload(PayloadType),
#[error("Too many search requests running at the same time: {0}. Retry after 10s.")]
TooManySearchRequests(usize),
#[error("Internal error: Search limiter is down.")]
SearchLimiterIsDown,
#[error("The provided payload reached the size limit. The maximum accepted payload size is {}.", Byte::from_bytes(*.0 as u64).get_appropriate_unit(true))]
PayloadTooLarge(usize),
#[error("Two indexes must be given for each swap. The list `[{}]` contains {} indexes.",
@ -66,9 +72,12 @@ impl ErrorCode for MeilisearchHttpError {
MeilisearchHttpError::MissingPayload(_) => Code::MissingPayload,
MeilisearchHttpError::InvalidContentType(_, _) => Code::InvalidContentType,
MeilisearchHttpError::DocumentNotFound(_) => Code::DocumentNotFound,
MeilisearchHttpError::InvalidDocumentId(_) => Code::InvalidDocumentId,
MeilisearchHttpError::EmptyFilter => Code::InvalidDocumentFilter,
MeilisearchHttpError::InvalidExpression(_, _) => Code::InvalidSearchFilter,
MeilisearchHttpError::PayloadTooLarge(_) => Code::PayloadTooLarge,
MeilisearchHttpError::TooManySearchRequests(_) => Code::TooManySearchRequests,
MeilisearchHttpError::SearchLimiterIsDown => Code::Internal,
MeilisearchHttpError::SwapIndexPayloadWrongLength(_) => Code::InvalidSwapIndexes,
MeilisearchHttpError::IndexUid(e) => e.error_code(),
MeilisearchHttpError::SerdeJson(_) => Code::Internal,


@ -9,12 +9,14 @@ pub mod middleware;
pub mod option;
pub mod routes;
pub mod search;
pub mod search_queue;
use std::fs::File;
use std::io::{BufReader, BufWriter};
use std::num::NonZeroUsize;
use std::path::Path;
use std::sync::Arc;
use std::thread;
use std::thread::{self, available_parallelism};
use std::time::Duration;
use actix_cors::Cors;
@ -38,6 +40,7 @@ use meilisearch_types::versioning::{check_version_file, create_version_file};
use meilisearch_types::{compression, milli, VERSION_FILE_NAME};
pub use option::Opt;
use option::ScheduleSnapshot;
use search_queue::SearchQueue;
use tracing::{error, info_span};
use tracing_subscriber::filter::Targets;
@ -469,10 +472,15 @@ pub fn configure_data(
(logs_route, logs_stderr): (LogRouteHandle, LogStderrHandle),
analytics: Arc<dyn Analytics>,
) {
let search_queue = SearchQueue::new(
opt.experimental_search_queue_size,
available_parallelism().unwrap_or(NonZeroUsize::new(2).unwrap()),
);
let http_payload_size_limit = opt.http_payload_size_limit.get_bytes() as usize;
config
.app_data(index_scheduler)
.app_data(auth)
.app_data(web::Data::new(search_queue))
.app_data(web::Data::from(analytics))
.app_data(web::Data::new(logs_route))
.app_data(web::Data::new(logs_stderr))
@ -536,30 +544,3 @@ pub fn dashboard(config: &mut web::ServiceConfig, enable_frontend: bool) {
pub fn dashboard(config: &mut web::ServiceConfig, _enable_frontend: bool) {
config.service(web::resource("/").route(web::get().to(routes::running)));
}
/// Parses the output of
/// [`VERGEN_GIT_SEMVER_LIGHTWEIGHT`](https://docs.rs/vergen/latest/vergen/struct.Git.html#instructions)
/// as a prototype name.
///
/// Returns `Some(prototype_name)` if the following conditions are met on this value:
///
/// 1. starts with `prototype-`,
/// 2. ends with `-<some_number>`,
/// 3. does not end with `<some_number>-<some_number>`.
///
/// Otherwise, returns `None`.
pub fn prototype_name() -> Option<&'static str> {
let prototype: &'static str = option_env!("VERGEN_GIT_SEMVER_LIGHTWEIGHT")?;
if !prototype.starts_with("prototype-") {
return None;
}
let mut rsplit_prototype = prototype.rsplit('-');
// last component MUST be a number
rsplit_prototype.next()?.parse::<u64>().ok()?;
// before than last component SHALL NOT be a number
rsplit_prototype.next()?.parse::<u64>().err()?;
Some(prototype)
}
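
Reviewer note: the three conditions in the doc comment of the removed `prototype_name` can be tested on plain strings. The sketch below applies the same checks to an argument instead of the `VERGEN_GIT_SEMVER_LIGHTWEIGHT` build-time variable. The free function `as_prototype` is hypothetical and only borrows its name from the `describe.as_prototype()` call used elsewhere in this change, whose implementation is not shown here.

```rust
/// Hypothetical string-based variant of the removed `prototype_name`.
fn as_prototype(describe: &str) -> Option<&str> {
    // 1. must start with `prototype-`
    if !describe.starts_with("prototype-") {
        return None;
    }
    let mut components = describe.rsplit('-');
    // 2. the last component must be a number
    components.next()?.parse::<u64>().ok()?;
    // 3. the component before the last must not be a number
    components.next()?.parse::<u64>().err()?;
    Some(describe)
}
```

With these rules `"prototype-search-queue-3"` is accepted, while `"v1.8.0-3"` (wrong prefix), `"prototype-foo"` (no trailing number), and `"prototype-foo-3-4"` (two trailing numbers) are all rejected.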


@ -12,8 +12,8 @@ use is_terminal::IsTerminal;
use meilisearch::analytics::Analytics;
use meilisearch::option::LogMode;
use meilisearch::{
analytics, create_app, prototype_name, setup_meilisearch, LogRouteHandle, LogRouteType,
LogStderrHandle, LogStderrType, Opt, SubscriberForSecondLayer,
analytics, create_app, setup_meilisearch, LogRouteHandle, LogRouteType, LogStderrHandle,
LogStderrType, Opt, SubscriberForSecondLayer,
};
use meilisearch_auth::{generate_master_key, AuthController, MASTER_KEY_MIN_SIZE};
use mimalloc::MiMalloc;
@ -151,7 +151,7 @@ async fn run_http(
.keep_alive(KeepAlive::Os);
if let Some(config) = opt_clone.get_ssl_config()? {
http_server.bind_rustls(opt_clone.http_addr, config)?.run().await?;
http_server.bind_rustls_021(opt_clone.http_addr, config)?.run().await?;
} else {
http_server.bind(&opt_clone.http_addr)?.run().await?;
}
@ -163,8 +163,8 @@ pub fn print_launch_resume(
analytics: Arc<dyn Analytics>,
config_read_from: Option<PathBuf>,
) {
let commit_sha = option_env!("VERGEN_GIT_SHA").unwrap_or("unknown");
let commit_date = option_env!("VERGEN_GIT_COMMIT_TIMESTAMP").unwrap_or("unknown");
let build_info = build_info::BuildInfo::from_build();
let protocol =
if opt.ssl_cert_path.is_some() && opt.ssl_key_path.is_some() { "https" } else { "http" };
let ascii_name = r#"
@ -189,10 +189,18 @@ pub fn print_launch_resume(
eprintln!("Database path:\t\t{:?}", opt.db_path);
eprintln!("Server listening on:\t\"{}://{}\"", protocol, opt.http_addr);
eprintln!("Environment:\t\t{:?}", opt.env);
eprintln!("Commit SHA:\t\t{:?}", commit_sha.to_string());
eprintln!("Commit date:\t\t{:?}", commit_date.to_string());
eprintln!("Commit SHA:\t\t{:?}", build_info.commit_sha1.unwrap_or("unknown"));
eprintln!(
"Commit date:\t\t{:?}",
build_info
.commit_timestamp
.and_then(|commit_timestamp| commit_timestamp
.format(&time::format_description::well_known::Rfc3339)
.ok())
.unwrap_or("unknown".into())
);
eprintln!("Package version:\t{:?}", env!("CARGO_PKG_VERSION").to_string());
if let Some(prototype) = prototype_name() {
if let Some(prototype) = build_info.describe.and_then(|describe| describe.as_prototype()) {
eprintln!("Prototype:\t\t{:?}", prototype);
}


@ -4,24 +4,17 @@ use prometheus::{
register_int_gauge_vec, HistogramVec, IntCounterVec, IntGauge, IntGaugeVec,
};
/// Create evenly distributed buckets
fn create_buckets() -> [f64; 29] {
(0..10)
.chain((10..100).step_by(10))
.chain((100..=1000).step_by(100))
.map(|i| i as f64 / 1000.)
.collect::<Vec<_>>()
.try_into()
.unwrap()
}
lazy_static! {
pub static ref MEILISEARCH_HTTP_RESPONSE_TIME_CUSTOM_BUCKETS: [f64; 29] = create_buckets();
pub static ref MEILISEARCH_HTTP_REQUESTS_TOTAL: IntCounterVec = register_int_counter_vec!(
opts!("meilisearch_http_requests_total", "Meilisearch HTTP requests total"),
&["method", "path"]
&["method", "path", "status"]
)
.expect("Can't create a metric");
pub static ref MEILISEARCH_DEGRADED_SEARCH_REQUESTS: IntGauge = register_int_gauge!(opts!(
"meilisearch_degraded_search_requests",
"Meilisearch number of degraded search requests"
))
.expect("Can't create a metric");
pub static ref MEILISEARCH_DB_SIZE_BYTES: IntGauge =
register_int_gauge!(opts!("meilisearch_db_size_bytes", "Meilisearch DB Size In Bytes"))
.expect("Can't create a metric");
@ -42,7 +35,7 @@ lazy_static! {
"meilisearch_http_response_time_seconds",
"Meilisearch HTTP response times",
&["method", "path"],
MEILISEARCH_HTTP_RESPONSE_TIME_CUSTOM_BUCKETS.to_vec()
vec![0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
)
.expect("Can't create a metric");
pub static ref MEILISEARCH_NB_TASKS: IntGaugeVec = register_int_gauge_vec!(
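
Reviewer note: the effect of swapping the generated buckets for the literal list is easier to see side by side. The sketch below reproduces the removed `create_buckets` logic as `old_buckets` and the new literal list as `new_buckets`. Both function names are hypothetical and exist only for this comparison.

```rust
/// Buckets produced by the removed `create_buckets` (values in seconds).
fn old_buckets() -> Vec<f64> {
    (0..10)
        .chain((10..100).step_by(10))
        .chain((100..=1000).step_by(100))
        .map(|i| i as f64 / 1000.)
        .collect()
}

/// Buckets passed to the histogram after this change (values in seconds).
fn new_buckets() -> Vec<f64> {
    vec![0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
}
```

The old scheme yields 29 boundaries between 0 and 1 second, so anything slower than one second fell into the implicit `+Inf` bucket. The new list has 14 boundaries and extends to 10 seconds, trading resolution below 100 ms for visibility into slow requests.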


@ -65,9 +65,6 @@ where
.with_label_values(&[&request_method, request_path])
.start_timer(),
);
crate::metrics::MEILISEARCH_HTTP_REQUESTS_TOTAL
.with_label_values(&[&request_method, request_path])
.inc();
}
};
@ -76,6 +73,14 @@ where
Box::pin(async move {
let res = fut.await?;
crate::metrics::MEILISEARCH_HTTP_REQUESTS_TOTAL
.with_label_values(&[
res.request().method().as_str(),
res.request().path(),
res.status().as_str(),
])
.inc();
if let Some(histogram_timer) = histogram_timer {
histogram_timer.observe_duration();
};


@ -54,6 +54,7 @@ const MEILI_EXPERIMENTAL_LOGS_MODE: &str = "MEILI_EXPERIMENTAL_LOGS_MODE";
const MEILI_EXPERIMENTAL_REPLICATION_PARAMETERS: &str = "MEILI_EXPERIMENTAL_REPLICATION_PARAMETERS";
const MEILI_EXPERIMENTAL_ENABLE_LOGS_ROUTE: &str = "MEILI_EXPERIMENTAL_ENABLE_LOGS_ROUTE";
const MEILI_EXPERIMENTAL_ENABLE_METRICS: &str = "MEILI_EXPERIMENTAL_ENABLE_METRICS";
const MEILI_EXPERIMENTAL_SEARCH_QUEUE_SIZE: &str = "MEILI_EXPERIMENTAL_SEARCH_QUEUE_SIZE";
const MEILI_EXPERIMENTAL_REDUCE_INDEXING_MEMORY_USAGE: &str =
"MEILI_EXPERIMENTAL_REDUCE_INDEXING_MEMORY_USAGE";
const MEILI_EXPERIMENTAL_MAX_NUMBER_OF_BATCHED_TASKS: &str =
@ -344,6 +345,15 @@ pub struct Opt {
#[serde(default)]
pub experimental_enable_metrics: bool,
/// Experimental search queue size. For more information, see: <https://github.com/orgs/meilisearch/discussions/729>
///
/// Lets you customize the size of the search queue. Meilisearch processes your search requests as fast as possible but once the
/// queue is full it starts returning HTTP 503, Service Unavailable.
/// The default value is 1000.
#[clap(long, env = MEILI_EXPERIMENTAL_SEARCH_QUEUE_SIZE, default_value_t = 1000)]
#[serde(default)]
pub experimental_search_queue_size: usize,
/// Experimental logs mode feature. For more information, see: <https://github.com/orgs/meilisearch/discussions/723>
///
/// Change the mode of the logs on the console.
@ -473,6 +483,7 @@ impl Opt {
#[cfg(feature = "analytics")]
no_analytics,
experimental_enable_metrics,
experimental_search_queue_size,
experimental_logs_mode,
experimental_enable_logs_route,
experimental_replication_parameters,
@ -532,6 +543,10 @@ impl Opt {
MEILI_EXPERIMENTAL_ENABLE_METRICS,
experimental_enable_metrics.to_string(),
);
export_to_env_if_not_present(
MEILI_EXPERIMENTAL_SEARCH_QUEUE_SIZE,
experimental_search_queue_size.to_string(),
);
export_to_env_if_not_present(
MEILI_EXPERIMENTAL_LOGS_MODE,
experimental_logs_mode.to_string(),
@ -564,11 +579,11 @@ impl Opt {
}
if self.ssl_require_auth {
let verifier = AllowAnyAuthenticatedClient::new(client_auth_roots);
config.with_client_cert_verifier(verifier)
config.with_client_cert_verifier(Arc::from(verifier))
} else {
let verifier =
AllowAnyAnonymousOrAuthenticatedClient::new(client_auth_roots);
config.with_client_cert_verifier(verifier)
config.with_client_cert_verifier(Arc::from(verifier))
}
}
None => config.with_no_client_auth(),


@ -12,11 +12,13 @@ use tracing::debug;
use crate::analytics::{Analytics, FacetSearchAggregator};
use crate::extractors::authentication::policies::*;
use crate::extractors::authentication::GuardedData;
use crate::routes::indexes::search::search_kind;
use crate::search::{
add_search_rules, perform_facet_search, HybridQuery, MatchingStrategy, SearchQuery,
DEFAULT_CROP_LENGTH, DEFAULT_CROP_MARKER, DEFAULT_HIGHLIGHT_POST_TAG,
DEFAULT_HIGHLIGHT_PRE_TAG, DEFAULT_SEARCH_LIMIT, DEFAULT_SEARCH_OFFSET,
};
use crate::search_queue::SearchQueue;
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(web::resource("").route(web::post().to(search)));
@ -48,6 +50,7 @@ pub struct FacetSearchQuery {
pub async fn search(
index_scheduler: GuardedData<ActionPolicy<{ actions::SEARCH }>, Data<IndexScheduler>>,
search_queue: Data<SearchQueue>,
index_uid: web::Path<String>,
params: AwebJson<FacetSearchQuery, DeserrJsonError>,
req: HttpRequest,
@ -71,8 +74,10 @@ pub async fn search(
let index = index_scheduler.index(&index_uid)?;
let features = index_scheduler.features();
let search_kind = search_kind(&search_query, &index_scheduler, &index, features)?;
let _permit = search_queue.try_get_search_permit().await?;
let search_result = tokio::task::spawn_blocking(move || {
perform_facet_search(&index, search_query, facet_query, facet_name, features)
perform_facet_search(&index, search_query, facet_query, facet_name, search_kind)
})
.await?;


@ -27,6 +27,7 @@ use crate::Opt;
pub mod documents;
pub mod facet_search;
pub mod recommend;
pub mod search;
pub mod settings;
@ -48,6 +49,7 @@ pub fn configure(cfg: &mut web::ServiceConfig) {
.service(web::scope("/documents").configure(documents::configure))
.service(web::scope("/search").configure(search::configure))
.service(web::scope("/facet-search").configure(facet_search::configure))
.service(web::scope("/recommend").configure(recommend::configure))
.service(web::scope("/settings").configure(settings::configure)),
);
}


@ -0,0 +1,53 @@
use actix_web::web::{self, Data};
use actix_web::{HttpRequest, HttpResponse};
use deserr::actix_web::AwebJson;
use index_scheduler::IndexScheduler;
use meilisearch_types::deserr::DeserrJsonError;
use meilisearch_types::error::ResponseError;
use meilisearch_types::index_uid::IndexUid;
use meilisearch_types::keys::actions;
use tracing::debug;
use super::ActionPolicy;
use crate::analytics::Analytics;
use crate::extractors::authentication::GuardedData;
use crate::extractors::sequential_extractor::SeqHandler;
use crate::search::{perform_recommend, RecommendQuery, SearchKind};
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(web::resource("").route(web::post().to(SeqHandler(recommend))));
}
pub async fn recommend(
index_scheduler: GuardedData<ActionPolicy<{ actions::SEARCH }>, Data<IndexScheduler>>,
index_uid: web::Path<String>,
params: AwebJson<RecommendQuery, DeserrJsonError>,
_req: HttpRequest,
_analytics: web::Data<dyn Analytics>,
) -> Result<HttpResponse, ResponseError> {
let index_uid = IndexUid::try_from(index_uid.into_inner())?;
// TODO analytics
let query = params.into_inner();
debug!(parameters = ?query, "Recommend post");
let index = index_scheduler.index(&index_uid)?;
let features = index_scheduler.features();
features.check_vector("Using the recommend API.")?;
let (embedder_name, embedder) =
SearchKind::embedder(&index_scheduler, &index, query.embedder.as_deref(), None)?;
let recommendations = tokio::task::spawn_blocking(move || {
perform_recommend(&index, query, embedder_name, embedder)
})
.await?;
let recommendations = recommendations?;
debug!(returns = ?recommendations, "Recommend post");
Ok(HttpResponse::Ok().json(recommendations))
}


@ -1,27 +1,29 @@
use actix_web::web::Data;
use actix_web::{web, HttpRequest, HttpResponse};
use deserr::actix_web::{AwebJson, AwebQueryParameter};
use index_scheduler::IndexScheduler;
use index_scheduler::{IndexScheduler, RoFeatures};
use meilisearch_types::deserr::query_params::Param;
use meilisearch_types::deserr::{DeserrJsonError, DeserrQueryParamError};
use meilisearch_types::error::deserr_codes::*;
use meilisearch_types::error::ResponseError;
use meilisearch_types::index_uid::IndexUid;
use meilisearch_types::milli;
use meilisearch_types::milli::vector::DistributionShift;
use meilisearch_types::serde_cs::vec::CS;
use serde_json::Value;
use tracing::{debug, warn};
use tracing::debug;
use crate::analytics::{Analytics, SearchAggregator};
use crate::error::MeilisearchHttpError;
use crate::extractors::authentication::policies::*;
use crate::extractors::authentication::GuardedData;
use crate::extractors::sequential_extractor::SeqHandler;
use crate::metrics::MEILISEARCH_DEGRADED_SEARCH_REQUESTS;
use crate::search::{
add_search_rules, perform_search, HybridQuery, MatchingStrategy, SearchQuery, SemanticRatio,
DEFAULT_CROP_LENGTH, DEFAULT_CROP_MARKER, DEFAULT_HIGHLIGHT_POST_TAG,
add_search_rules, perform_search, HybridQuery, MatchingStrategy, SearchKind, SearchQuery,
SemanticRatio, DEFAULT_CROP_LENGTH, DEFAULT_CROP_MARKER, DEFAULT_HIGHLIGHT_POST_TAG,
DEFAULT_HIGHLIGHT_PRE_TAG, DEFAULT_SEARCH_LIMIT, DEFAULT_SEARCH_OFFSET, DEFAULT_SEMANTIC_RATIO,
};
use crate::search_queue::SearchQueue;
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(
@ -181,6 +183,7 @@ fn fix_sort_query_parameters(sort_query: &str) -> Vec<String> {
pub async fn search_with_url_query(
index_scheduler: GuardedData<ActionPolicy<{ actions::SEARCH }>, Data<IndexScheduler>>,
search_queue: web::Data<SearchQueue>,
index_uid: web::Path<String>,
params: AwebQueryParameter<SearchQueryGet, DeserrQueryParamError>,
req: HttpRequest,
@ -201,11 +204,11 @@ pub async fn search_with_url_query(
let index = index_scheduler.index(&index_uid)?;
let features = index_scheduler.features();
let distribution = embed(&mut query, index_scheduler.get_ref(), &index).await?;
let search_kind = search_kind(&query, index_scheduler.get_ref(), &index, features)?;
let _permit = search_queue.try_get_search_permit().await?;
let search_result =
tokio::task::spawn_blocking(move || perform_search(&index, query, features, distribution))
.await?;
tokio::task::spawn_blocking(move || perform_search(&index, query, search_kind)).await?;
if let Ok(ref search_result) = search_result {
aggregate.succeed(search_result);
}
@ -219,6 +222,7 @@ pub async fn search_with_url_query(
pub async fn search_with_post(
index_scheduler: GuardedData<ActionPolicy<{ actions::SEARCH }>, Data<IndexScheduler>>,
search_queue: web::Data<SearchQueue>,
index_uid: web::Path<String>,
params: AwebJson<SearchQuery, DeserrJsonError>,
req: HttpRequest,
@ -240,13 +244,16 @@ pub async fn search_with_post(
let features = index_scheduler.features();
let distribution = embed(&mut query, index_scheduler.get_ref(), &index).await?;
let search_kind = search_kind(&query, index_scheduler.get_ref(), &index, features)?;
let _permit = search_queue.try_get_search_permit().await?;
let search_result =
tokio::task::spawn_blocking(move || perform_search(&index, query, features, distribution))
.await?;
tokio::task::spawn_blocking(move || perform_search(&index, query, search_kind)).await?;
if let Ok(ref search_result) = search_result {
aggregate.succeed(search_result);
if search_result.degraded {
MEILISEARCH_DEGRADED_SEARCH_REQUESTS.inc();
}
}
analytics.post_search(aggregate);
@ -256,77 +263,58 @@ pub async fn search_with_post(
Ok(HttpResponse::Ok().json(search_result))
}
pub async fn embed(
query: &mut SearchQuery,
pub fn search_kind(
query: &SearchQuery,
index_scheduler: &IndexScheduler,
index: &milli::Index,
) -> Result<Option<DistributionShift>, ResponseError> {
match (&query.hybrid, &query.vector, &query.q) {
(Some(HybridQuery { semantic_ratio: _, embedder }), None, Some(q))
if !q.trim().is_empty() =>
{
let embedder_configs = index.embedding_configs(&index.read_txn()?)?;
let embedders = index_scheduler.embedders(embedder_configs)?;
features: RoFeatures,
) -> Result<SearchKind, ResponseError> {
if query.vector.is_some() {
features.check_vector("Passing `vector` as a query parameter")?;
}
let embedder = if let Some(embedder_name) = embedder {
embedders.get(embedder_name)
} else {
embedders.get_default()
};
if query.hybrid.is_some() {
features.check_vector("Passing `hybrid` as a query parameter")?;
}
let embedder = embedder
.ok_or(milli::UserError::InvalidEmbedder("default".to_owned()))
.map_err(milli::Error::from)?
.0;
let distribution = embedder.distribution();
let embeddings = embedder
.embed(vec![q.to_owned()])
.await
.map_err(milli::vector::Error::from)
.map_err(milli::Error::from)?
.pop()
.expect("No vector returned from embedding");
if embeddings.iter().nth(1).is_some() {
warn!("Ignoring embeddings past the first one in long search query");
query.vector = Some(embeddings.iter().next().unwrap().to_vec());
} else {
query.vector = Some(embeddings.into_inner());
}
Ok(distribution)
// regardless of anything, always do a keyword search when we don't have a vector and the query is whitespace or missing
if query.vector.is_none() {
match &query.q {
Some(q) if q.trim().is_empty() => return Ok(SearchKind::KeywordOnly),
None => return Ok(SearchKind::KeywordOnly),
_ => {}
}
(Some(hybrid), vector, _) => {
let embedder_configs = index.embedding_configs(&index.read_txn()?)?;
let embedders = index_scheduler.embedders(embedder_configs)?;
}
let embedder = if let Some(embedder_name) = &hybrid.embedder {
embedders.get(embedder_name)
} else {
embedders.get_default()
};
let embedder = embedder
.ok_or(milli::UserError::InvalidEmbedder("default".to_owned()))
.map_err(milli::Error::from)?
.0;
if let Some(vector) = vector {
if vector.len() != embedder.dimensions() {
return Err(meilisearch_types::milli::Error::UserError(
meilisearch_types::milli::UserError::InvalidVectorDimensions {
expected: embedder.dimensions(),
found: vector.len(),
},
)
.into());
}
}
Ok(embedder.distribution())
match &query.hybrid {
Some(HybridQuery { semantic_ratio, embedder }) if **semantic_ratio == 1.0 => {
Ok(SearchKind::semantic(
index_scheduler,
index,
embedder.as_deref(),
query.vector.as_ref().map(Vec::len),
)?)
}
_ => Ok(None),
Some(HybridQuery { semantic_ratio, embedder: _ }) if **semantic_ratio == 0.0 => {
Ok(SearchKind::KeywordOnly)
}
Some(HybridQuery { semantic_ratio, embedder }) => Ok(SearchKind::hybrid(
index_scheduler,
index,
embedder.as_deref(),
**semantic_ratio,
query.vector.as_ref().map(Vec::len),
)?),
None => match (query.q.as_deref(), query.vector.as_deref()) {
(_query, None) => Ok(SearchKind::KeywordOnly),
(None, Some(_vector)) => Ok(SearchKind::semantic(
index_scheduler,
index,
None,
query.vector.as_ref().map(Vec::len),
)?),
(Some(_), Some(_)) => Err(MeilisearchHttpError::MissingSearchHybrid.into()),
},
}
}
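
The decision table implemented by `search_kind` above can be modelled without the index or the embedders. The following is a minimal sketch, not the code from this change: `Kind`, `MissingSearchHybrid` and `classify` are hypothetical names, the embedder lookup and the experimental-feature checks are left out, and the presence of a user-supplied vector is reduced to a boolean.

```rust
/// Simplified stand-in for the `SearchKind` enum in the diff: the embedder
/// handles are left out so the example only depends on the standard library.
#[derive(Debug, PartialEq)]
pub enum Kind {
    KeywordOnly,
    SemanticOnly,
    Hybrid { semantic_ratio: f32 },
}

/// Hypothetical error standing in for `MeilisearchHttpError::MissingSearchHybrid`.
#[derive(Debug, PartialEq)]
pub struct MissingSearchHybrid;

/// Mirrors the branch order of `search_kind`, under the assumptions above.
pub fn classify(
    q: Option<&str>,
    has_vector: bool,
    semantic_ratio: Option<f32>,
) -> Result<Kind, MissingSearchHybrid> {
    // Without a vector, a missing or whitespace-only query is always a keyword search.
    if !has_vector && q.map_or(true, |q| q.trim().is_empty()) {
        return Ok(Kind::KeywordOnly);
    }
    match semantic_ratio {
        // A ratio of exactly 1.0 means only the semantic search contributes.
        Some(ratio) if ratio == 1.0 => Ok(Kind::SemanticOnly),
        // A ratio of exactly 0.0 means only the keyword search contributes.
        Some(ratio) if ratio == 0.0 => Ok(Kind::KeywordOnly),
        Some(ratio) => Ok(Kind::Hybrid { semantic_ratio: ratio }),
        // No `hybrid` object: decide from which of `q` and `vector` are present.
        None => match (q, has_vector) {
            (_, false) => Ok(Kind::KeywordOnly),
            (None, true) => Ok(Kind::SemanticOnly),
            (Some(_), true) => Err(MissingSearchHybrid),
        },
    }
}
```

For instance, `classify(Some("shoes"), true, None)` returns the error, which matches the last arm of the `None` branch in the diff: a query and a vector without a `hybrid` object is rejected.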

View File

@ -7,7 +7,7 @@ use meilisearch_types::error::ResponseError;
use meilisearch_types::facet_values_sort::FacetValuesSort;
use meilisearch_types::index_uid::IndexUid;
use meilisearch_types::milli::update::Setting;
use meilisearch_types::settings::{settings, RankingRuleView, Settings, Unchecked};
use meilisearch_types::settings::{settings, RankingRuleView, SecretPolicy, Settings, Unchecked};
use meilisearch_types::tasks::KindWithContent;
use serde_json::json;
use tracing::debug;
@ -134,7 +134,7 @@ macro_rules! make_setting_route {
let index = index_scheduler.index(&index_uid)?;
let rtxn = index.read_txn()?;
let settings = settings(&index, &rtxn)?;
let settings = settings(&index, &rtxn, meilisearch_types::settings::SecretPolicy::HideSecrets)?;
debug!(returns = ?settings, "Update settings");
let mut json = serde_json::json!(&settings);
@ -604,6 +604,8 @@ fn embedder_analytics(
EmbedderSource::OpenAi => sources.insert("openAi"),
EmbedderSource::HuggingFace => sources.insert("huggingFace"),
EmbedderSource::UserProvided => sources.insert("userProvided"),
EmbedderSource::Ollama => sources.insert("ollama"),
EmbedderSource::Rest => sources.insert("rest"),
};
}
};
@ -623,6 +625,25 @@ fn embedder_analytics(
)
}
make_setting_route!(
"/search-cutoff-ms",
put,
u64,
meilisearch_types::deserr::DeserrJsonError<
meilisearch_types::error::deserr_codes::InvalidSettingsSearchCutoffMs,
>,
search_cutoff_ms,
"searchCutoffMs",
analytics,
|setting: &Option<u64>, req: &HttpRequest| {
analytics.publish(
"Search Cutoff Updated".to_string(),
serde_json::json!({"search_cutoff_ms": setting }),
Some(req),
);
}
);
macro_rules! generate_configure {
($($mod:ident),*) => {
pub fn configure(cfg: &mut web::ServiceConfig) {
@ -653,7 +674,8 @@ generate_configure!(
typo_tolerance,
pagination,
faceting,
embedders
embedders,
search_cutoff_ms
);
pub async fn update_all(
@ -764,7 +786,8 @@ pub async fn update_all(
"synonyms": {
"total": new_settings.synonyms.as_ref().set().map(|synonyms| synonyms.len()),
},
"embedders": crate::routes::indexes::settings::embedder_analytics(new_settings.embedders.as_ref().set())
"embedders": crate::routes::indexes::settings::embedder_analytics(new_settings.embedders.as_ref().set()),
"search_cutoff_ms": new_settings.search_cutoff_ms.as_ref().set(),
}),
Some(&req),
);
@ -796,7 +819,7 @@ pub async fn get_all(
let index = index_scheduler.index(&index_uid)?;
let rtxn = index.read_txn()?;
let new_settings = settings(&index, &rtxn)?;
let new_settings = settings(&index, &rtxn, SecretPolicy::HideSecrets)?;
debug!(returns = ?new_settings, "Get all settings");
Ok(HttpResponse::Ok().json(new_settings))
}

View File

@ -15,6 +15,7 @@ use tracing::debug;
use crate::analytics::Analytics;
use crate::extractors::authentication::policies::*;
use crate::extractors::authentication::GuardedData;
use crate::search_queue::SearchQueue;
use crate::Opt;
const PAGINATION_DEFAULT_LIMIT: usize = 20;
@ -359,12 +360,18 @@ async fn get_version(
) -> HttpResponse {
analytics.publish("Version Seen".to_string(), json!(null), Some(&req));
let commit_sha = option_env!("VERGEN_GIT_SHA").unwrap_or("unknown");
let commit_date = option_env!("VERGEN_GIT_COMMIT_TIMESTAMP").unwrap_or("unknown");
let build_info = build_info::BuildInfo::from_build();
HttpResponse::Ok().json(VersionResponse {
commit_sha: commit_sha.to_string(),
commit_date: commit_date.to_string(),
commit_sha: build_info.commit_sha1.unwrap_or("unknown").to_string(),
commit_date: build_info
.commit_timestamp
.and_then(|commit_timestamp| {
commit_timestamp
.format(&time::format_description::well_known::Iso8601::DEFAULT)
.ok()
})
.unwrap_or("unknown".into()),
pkg_version: env!("CARGO_PKG_VERSION").to_string(),
})
}
@ -379,10 +386,12 @@ pub async fn get_health(
req: HttpRequest,
index_scheduler: Data<IndexScheduler>,
auth_controller: Data<AuthController>,
search_queue: Data<SearchQueue>,
analytics: web::Data<dyn Analytics>,
) -> Result<HttpResponse, ResponseError> {
analytics.health_seen(&req);
search_queue.health().unwrap();
index_scheduler.health().unwrap();
auth_controller.health().unwrap();

View File

@ -13,10 +13,11 @@ use crate::analytics::{Analytics, MultiSearchAggregator};
use crate::extractors::authentication::policies::ActionPolicy;
use crate::extractors::authentication::{AuthenticationError, GuardedData};
use crate::extractors::sequential_extractor::SeqHandler;
use crate::routes::indexes::search::embed;
use crate::routes::indexes::search::search_kind;
use crate::search::{
add_search_rules, perform_search, SearchQueryWithIndex, SearchResultWithIndex,
};
use crate::search_queue::SearchQueue;
pub fn configure(cfg: &mut web::ServiceConfig) {
cfg.service(web::resource("").route(web::post().to(SeqHandler(multi_search_with_post))));
@ -35,6 +36,7 @@ pub struct SearchQueries {
pub async fn multi_search_with_post(
index_scheduler: GuardedData<ActionPolicy<{ actions::SEARCH }>, Data<IndexScheduler>>,
search_queue: Data<SearchQueue>,
params: AwebJson<SearchQueries, DeserrJsonError>,
req: HttpRequest,
analytics: web::Data<dyn Analytics>,
@ -44,6 +46,10 @@ pub async fn multi_search_with_post(
let mut multi_aggregate = MultiSearchAggregator::from_queries(&queries, &req);
let features = index_scheduler.features();
// Since we don't want to process half of the search requests and then get a permit refused
// we're going to get one permit for the whole duration of the multi-search request.
let _permit = search_queue.try_get_search_permit().await?;
// Explicitly expect a `(ResponseError, usize)` for the error type rather than `ResponseError` only,
// so that `?` doesn't work if it doesn't use `with_index`, ensuring that it is not forgotten in case of code
// changes.
@ -75,15 +81,13 @@ pub async fn multi_search_with_post(
})
.with_index(query_index)?;
let distribution = embed(&mut query, index_scheduler.get_ref(), &index)
.await
let search_kind = search_kind(&query, index_scheduler.get_ref(), &index, features)
.with_index(query_index)?;
let search_result = tokio::task::spawn_blocking(move || {
perform_search(&index, query, features, distribution)
})
.await
.with_index(query_index)?;
let search_result =
tokio::task::spawn_blocking(move || perform_search(&index, query, search_kind))
.await
.with_index(query_index)?;
search_results.push(SearchResultWithIndex {
index_uid: index_uid.into_inner(),

View File

@ -1,20 +1,21 @@
use std::cmp::min;
use std::collections::{BTreeMap, BTreeSet, HashSet};
use std::str::FromStr;
use std::time::Instant;
use std::sync::Arc;
use std::time::{Duration, Instant};
use deserr::Deserr;
use either::Either;
use index_scheduler::RoFeatures;
use indexmap::IndexMap;
use meilisearch_auth::IndexSearchRules;
use meilisearch_types::deserr::DeserrJsonError;
use meilisearch_types::error::deserr_codes::*;
use meilisearch_types::error::ResponseError;
use meilisearch_types::heed::RoTxn;
use meilisearch_types::index_uid::IndexUid;
use meilisearch_types::milli::score_details::{self, ScoreDetails, ScoringStrategy};
use meilisearch_types::milli::vector::DistributionShift;
use meilisearch_types::milli::{FacetValueHit, OrderBy, SearchForFacetValues};
use meilisearch_types::milli::score_details::{ScoreDetails, ScoringStrategy};
use meilisearch_types::milli::vector::Embedder;
use meilisearch_types::milli::{FacetValueHit, OrderBy, SearchForFacetValues, TimeBudget};
use meilisearch_types::settings::DEFAULT_PAGINATION_MAX_TOTAL_HITS;
use meilisearch_types::{milli, Document};
use milli::tokenizer::TokenizerBuilder;
@ -90,13 +91,75 @@ pub struct SearchQuery {
#[derive(Debug, Clone, Default, PartialEq, Deserr)]
#[deserr(error = DeserrJsonError<InvalidHybridQuery>, rename_all = camelCase, deny_unknown_fields)]
pub struct HybridQuery {
/// TODO validate that semantic ratio is between 0.0 and 1.0
#[deserr(default, error = DeserrJsonError<InvalidSearchSemanticRatio>, default)]
pub semantic_ratio: SemanticRatio,
#[deserr(default, error = DeserrJsonError<InvalidEmbedder>, default)]
pub embedder: Option<String>,
}
pub enum SearchKind {
KeywordOnly,
SemanticOnly { embedder_name: String, embedder: Arc<Embedder> },
Hybrid { embedder_name: String, embedder: Arc<Embedder>, semantic_ratio: f32 },
}
impl SearchKind {
pub(crate) fn semantic(
index_scheduler: &index_scheduler::IndexScheduler,
index: &Index,
embedder_name: Option<&str>,
vector_len: Option<usize>,
) -> Result<Self, ResponseError> {
let (embedder_name, embedder) =
Self::embedder(index_scheduler, index, embedder_name, vector_len)?;
Ok(Self::SemanticOnly { embedder_name, embedder })
}
pub(crate) fn hybrid(
index_scheduler: &index_scheduler::IndexScheduler,
index: &Index,
embedder_name: Option<&str>,
semantic_ratio: f32,
vector_len: Option<usize>,
) -> Result<Self, ResponseError> {
let (embedder_name, embedder) =
Self::embedder(index_scheduler, index, embedder_name, vector_len)?;
Ok(Self::Hybrid { embedder_name, embedder, semantic_ratio })
}
pub(crate) fn embedder(
index_scheduler: &index_scheduler::IndexScheduler,
index: &Index,
embedder_name: Option<&str>,
vector_len: Option<usize>,
) -> Result<(String, Arc<Embedder>), ResponseError> {
let embedder_configs = index.embedding_configs(&index.read_txn()?)?;
let embedders = index_scheduler.embedders(embedder_configs)?;
let embedder_name = embedder_name.unwrap_or_else(|| embedders.get_default_embedder_name());
let embedder = embedders.get(embedder_name);
let embedder = embedder
.ok_or(milli::UserError::InvalidEmbedder(embedder_name.to_owned()))
.map_err(milli::Error::from)?
.0;
if let Some(vector_len) = vector_len {
if vector_len != embedder.dimensions() {
return Err(meilisearch_types::milli::Error::UserError(
meilisearch_types::milli::UserError::InvalidVectorDimensions {
expected: embedder.dimensions(),
found: vector_len,
},
)
.into());
}
}
Ok((embedder_name.to_owned(), embedder))
}
}
#[derive(Debug, Clone, Copy, PartialEq, Deserr)]
#[deserr(try_from(f32) = TryFrom::try_from -> InvalidSearchSemanticRatio)]
pub struct SemanticRatio(f32);
@ -249,6 +312,27 @@ impl SearchQueryWithIndex {
}
}
#[derive(Debug, Clone, Default, PartialEq, Deserr)]
#[deserr(error = DeserrJsonError, rename_all = camelCase, deny_unknown_fields)]
pub struct RecommendQuery {
#[deserr(default, error = DeserrJsonError<InvalidRecommendId>)]
pub id: String,
#[deserr(default = DEFAULT_SEARCH_OFFSET(), error = DeserrJsonError<InvalidSearchOffset>)]
pub offset: usize,
#[deserr(default = DEFAULT_SEARCH_LIMIT(), error = DeserrJsonError<InvalidSearchLimit>)]
pub limit: usize,
#[deserr(default, error = DeserrJsonError<InvalidSearchFilter>)]
pub filter: Option<Value>,
#[deserr(default, error = DeserrJsonError<InvalidEmbedder>, default)]
pub embedder: Option<String>,
#[deserr(default, error = DeserrJsonError<InvalidSearchAttributesToRetrieve>)]
pub attributes_to_retrieve: Option<BTreeSet<String>>,
#[deserr(default, error = DeserrJsonError<InvalidSearchShowRankingScore>, default)]
pub show_ranking_score: bool,
#[deserr(default, error = DeserrJsonError<InvalidSearchShowRankingScoreDetails>, default)]
pub show_ranking_score_details: bool,
}
#[derive(Debug, Copy, Clone, PartialEq, Eq, Deserr)]
#[deserr(rename_all = camelCase)]
pub enum MatchingStrategy {
@ -305,8 +389,6 @@ pub struct SearchHit {
pub ranking_score: Option<f64>,
#[serde(rename = "_rankingScoreDetails", skip_serializing_if = "Option::is_none")]
pub ranking_score_details: Option<serde_json::Map<String, serde_json::Value>>,
#[serde(rename = "_semanticScore", skip_serializing_if = "Option::is_none")]
pub semantic_score: Option<f32>,
}
#[derive(Serialize, Debug, Clone, PartialEq)]
@ -314,8 +396,6 @@ pub struct SearchHit {
pub struct SearchResult {
pub hits: Vec<SearchHit>,
pub query: String,
#[serde(skip_serializing_if = "Option::is_none")]
pub vector: Option<Vec<f32>>,
pub processing_time_ms: u128,
#[serde(flatten)]
pub hits_info: HitsInfo,
@ -323,6 +403,25 @@ pub struct SearchResult {
pub facet_distribution: Option<BTreeMap<String, IndexMap<String, u64>>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub facet_stats: Option<BTreeMap<String, FacetStats>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub semantic_hit_count: Option<u32>,
// These fields are only used for analytics purposes
#[serde(skip)]
pub degraded: bool,
#[serde(skip)]
pub used_negative_operator: bool,
}
#[derive(Serialize, Debug, Clone, PartialEq)]
#[serde(rename_all = "camelCase")]
pub struct RecommendResult {
pub hits: Vec<SearchHit>,
pub id: String,
pub processing_time_ms: u128,
#[serde(flatten)]
pub hits_info: HitsInfo,
}
#[derive(Serialize, Debug, Clone, PartialEq)]
@ -380,45 +479,36 @@ fn prepare_search<'t>(
index: &'t Index,
rtxn: &'t RoTxn,
query: &'t SearchQuery,
features: RoFeatures,
distribution: Option<DistributionShift>,
search_kind: &SearchKind,
time_budget: TimeBudget,
) -> Result<(milli::Search<'t>, bool, usize, usize), MeilisearchHttpError> {
let mut search = index.search(rtxn);
search.time_budget(time_budget);
if query.vector.is_some() {
features.check_vector("Passing `vector` as a query parameter")?;
}
if query.hybrid.is_some() {
features.check_vector("Passing `hybrid` as a query parameter")?;
}
if query.hybrid.is_none() && query.q.is_some() && query.vector.is_some() {
return Err(MeilisearchHttpError::MissingSearchHybrid);
}
search.distribution_shift(distribution);
if let Some(ref vector) = query.vector {
match &query.hybrid {
// If semantic ratio is 0.0, only the query search will impact the search results,
// skip the vector
Some(hybrid) if *hybrid.semantic_ratio == 0.0 => (),
_otherwise => {
search.vector(vector.clone());
}
}
}
if let Some(ref q) = query.q {
match &query.hybrid {
// If semantic ratio is 1.0, only the vector search will impact the search results,
// skip the query
Some(hybrid) if *hybrid.semantic_ratio == 1.0 => (),
_otherwise => {
match search_kind {
SearchKind::KeywordOnly => {
if let Some(q) = &query.q {
search.query(q);
}
}
SearchKind::SemanticOnly { embedder_name, embedder } => {
let vector = match query.vector.clone() {
Some(vector) => vector,
None => embedder
.embed_one(query.q.clone().unwrap())
.map_err(milli::vector::Error::from)
.map_err(milli::Error::from)?,
};
search.semantic(embedder_name.clone(), embedder.clone(), Some(vector));
}
SearchKind::Hybrid { embedder_name, embedder, semantic_ratio: _ } => {
if let Some(q) = &query.q {
search.query(q);
}
// will be embedded in hybrid search if necessary
search.semantic(embedder_name.clone(), embedder.clone(), query.vector.clone());
}
}
if let Some(ref searchable) = query.attributes_to_search_on {
@ -441,10 +531,6 @@ fn prepare_search<'t>(
ScoringStrategy::Skip
});
if let Some(HybridQuery { embedder: Some(embedder), .. }) = &query.hybrid {
search.embedder_name(embedder);
}
// compute the offset on the limit depending on the pagination mode.
let (offset, limit) = if is_finite_pagination {
let limit = query.hits_per_page.unwrap_or_else(DEFAULT_SEARCH_LIMIT);
@ -487,23 +573,37 @@ fn prepare_search<'t>(
pub fn perform_search(
index: &Index,
query: SearchQuery,
features: RoFeatures,
distribution: Option<DistributionShift>,
search_kind: SearchKind,
) -> Result<SearchResult, MeilisearchHttpError> {
let before_search = Instant::now();
let rtxn = index.read_txn()?;
let time_budget = match index.search_cutoff(&rtxn)? {
Some(cutoff) => TimeBudget::new(Duration::from_millis(cutoff)),
None => TimeBudget::default(),
};
let (search, is_finite_pagination, max_total_hits, offset) =
prepare_search(index, &rtxn, &query, features, distribution)?;
prepare_search(index, &rtxn, &query, &search_kind, time_budget)?;
let milli::SearchResult { documents_ids, matching_words, candidates, document_scores, .. } =
match &query.hybrid {
Some(hybrid) => match *hybrid.semantic_ratio {
ratio if ratio == 0.0 || ratio == 1.0 => search.execute()?,
ratio => search.execute_hybrid(ratio)?,
},
None => search.execute()?,
};
let (
milli::SearchResult {
documents_ids,
matching_words,
candidates,
document_scores,
degraded,
used_negative_operator,
},
semantic_hit_count,
) = match &search_kind {
SearchKind::KeywordOnly => (search.execute()?, None),
SearchKind::SemanticOnly { .. } => {
let results = search.execute()?;
let semantic_hit_count = results.document_scores.len() as u32;
(results, Some(semantic_hit_count))
}
SearchKind::Hybrid { semantic_ratio, .. } => search.execute_hybrid(*semantic_ratio)?,
};
let fields_ids_map = index.fields_ids_map(&rtxn).unwrap();
@ -530,7 +630,7 @@ pub fn perform_search(
// The attributes to retrieve are the ones explicitly marked as to retrieve (all by default),
// but these attributes must also be present
// - in the fields_ids_map
// - in the the displayed attributes
// - in the displayed attributes
let to_retrieve_ids: BTreeSet<_> = query
.attributes_to_retrieve
.as_ref()
@ -612,18 +712,6 @@ pub fn perform_search(
insert_geo_distance(sort, &mut document);
}
let mut semantic_score = None;
for details in &score {
if let ScoreDetails::Vector(score_details::Vector {
target_vector: _,
value_similarity: Some((_matching_vector, similarity)),
}) = details
{
semantic_score = Some(*similarity);
break;
}
}
let ranking_score =
query.show_ranking_score.then(|| ScoreDetails::global_score(score.iter()));
let ranking_score_details =
@ -635,7 +723,6 @@ pub fn perform_search(
matches_position,
ranking_score_details,
ranking_score,
semantic_score,
};
documents.push(hit);
}
@ -671,27 +758,16 @@ pub fn perform_search(
let sort_facet_values_by =
index.sort_facet_values_by(&rtxn).map_err(milli::Error::from)?;
let default_sort_facet_values_by =
sort_facet_values_by.get("*").copied().unwrap_or_default();
if fields.iter().all(|f| f != "*") {
let fields: Vec<_> = fields
.iter()
.map(|n| {
(
n,
sort_facet_values_by
.get(n)
.copied()
.unwrap_or(default_sort_facet_values_by),
)
})
.collect();
let fields: Vec<_> =
fields.iter().map(|n| (n, sort_facet_values_by.get(n))).collect();
facet_distribution.facets(fields);
}
let distribution = facet_distribution
.candidates(candidates)
.default_order_by(default_sort_facet_values_by)
.default_order_by(sort_facet_values_by.get("*"))
.execute()?;
let stats = facet_distribution.compute_stats()?;
(Some(distribution), Some(stats))
@ -707,10 +783,12 @@ pub fn perform_search(
hits: documents,
hits_info,
query: query.q.unwrap_or_default(),
vector: query.vector,
processing_time_ms: before_search.elapsed().as_millis(),
facet_distribution,
facet_stats,
degraded,
used_negative_operator,
semantic_hit_count,
};
Ok(result)
}
@ -720,14 +798,21 @@ pub fn perform_facet_search(
search_query: SearchQuery,
facet_query: Option<String>,
facet_name: String,
features: RoFeatures,
search_kind: SearchKind,
) -> Result<FacetSearchResult, MeilisearchHttpError> {
let before_search = Instant::now();
let rtxn = index.read_txn()?;
let time_budget = match index.search_cutoff(&rtxn)? {
Some(cutoff) => TimeBudget::new(Duration::from_millis(cutoff)),
None => TimeBudget::default(),
};
let (search, _, _, _) = prepare_search(index, &rtxn, &search_query, features, None)?;
let mut facet_search =
SearchForFacetValues::new(facet_name, search, search_query.hybrid.is_some());
let (search, _, _, _) = prepare_search(index, &rtxn, &search_query, &search_kind, time_budget)?;
let mut facet_search = SearchForFacetValues::new(
facet_name,
search,
matches!(search_kind, SearchKind::Hybrid { .. }),
);
if let Some(facet_query) = &facet_query {
facet_search.query(facet_query);
}
@ -742,6 +827,131 @@ pub fn perform_facet_search(
})
}
pub fn perform_recommend(
index: &Index,
query: RecommendQuery,
embedder_name: String,
embedder: Arc<Embedder>,
) -> Result<RecommendResult, MeilisearchHttpError> {
let before_search = Instant::now();
let rtxn = index.read_txn()?;
let internal_id = index
.external_documents_ids()
.get(&rtxn, &query.id)?
.ok_or_else(|| MeilisearchHttpError::DocumentNotFound(query.id.clone()))?;
let mut recommend = milli::Recommend::new(
internal_id,
query.offset,
query.limit,
index,
&rtxn,
embedder_name,
embedder,
);
if let Some(ref filter) = query.filter {
if let Some(facets) = parse_filter(filter)? {
recommend.filter(facets);
}
}
let milli::SearchResult {
documents_ids,
matching_words: _,
candidates,
document_scores,
degraded: _,
used_negative_operator: _,
} = recommend.execute()?;
let fields_ids_map = index.fields_ids_map(&rtxn).unwrap();
let displayed_ids = index
.displayed_fields_ids(&rtxn)?
.map(|fields| fields.into_iter().collect::<BTreeSet<_>>())
.unwrap_or_else(|| fields_ids_map.iter().map(|(id, _)| id).collect());
let fids = |attrs: &BTreeSet<String>| {
let mut ids = BTreeSet::new();
for attr in attrs {
if attr == "*" {
ids = displayed_ids.clone();
break;
}
if let Some(id) = fields_ids_map.id(attr) {
ids.insert(id);
}
}
ids
};
// The attributes to retrieve are the ones explicitly marked as to retrieve (all by default),
// but these attributes must also be present
// - in the fields_ids_map
// - in the displayed attributes
let to_retrieve_ids: BTreeSet<_> = query
.attributes_to_retrieve
.as_ref()
.map(fids)
.unwrap_or_else(|| displayed_ids.clone())
.intersection(&displayed_ids)
.cloned()
.collect();
let mut documents = Vec::new();
let documents_iter = index.documents(&rtxn, documents_ids)?;
for ((_id, obkv), score) in documents_iter.into_iter().zip(document_scores.into_iter()) {
// First generate a document with all the displayed fields
let displayed_document = make_document(&displayed_ids, &fields_ids_map, obkv)?;
// select the attributes to retrieve
let attributes_to_retrieve = to_retrieve_ids
.iter()
.map(|&fid| fields_ids_map.name(fid).expect("Missing field name"));
let document =
permissive_json_pointer::select_values(&displayed_document, attributes_to_retrieve);
let ranking_score =
query.show_ranking_score.then(|| ScoreDetails::global_score(score.iter()));
let ranking_score_details =
query.show_ranking_score_details.then(|| ScoreDetails::to_json_map(score.iter()));
let hit = SearchHit {
document,
formatted: Default::default(),
matches_position: None,
ranking_score_details,
ranking_score,
};
documents.push(hit);
}
let max_total_hits = index
.pagination_max_total_hits(&rtxn)
.map_err(milli::Error::from)?
.map(|x| x as usize)
.unwrap_or(DEFAULT_PAGINATION_MAX_TOTAL_HITS);
let number_of_hits = min(candidates.len() as usize, max_total_hits);
let hits_info = HitsInfo::OffsetLimit {
limit: query.limit,
offset: query.offset,
estimated_total_hits: number_of_hits,
};
let result = RecommendResult {
hits: documents,
hits_info,
id: query.id,
processing_time_ms: before_search.elapsed().as_millis(),
};
Ok(result)
}
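
The attribute selection used by `perform_recommend` (and by `perform_search`) boils down to "requested fields, with `*` expanding to everything displayed, intersected with the displayed fields". A minimal sketch of that rule follows. It is an illustration only: `fields_to_retrieve` is a hypothetical helper, and it works on attribute names rather than on the field ids resolved through `fields_ids_map` in the actual code.

```rust
use std::collections::BTreeSet;

/// Returns the attributes to put in each hit.
/// `requested` is `None` when the caller did not pass `attributesToRetrieve`.
pub fn fields_to_retrieve(
    requested: Option<&BTreeSet<String>>,
    displayed: &BTreeSet<String>,
) -> BTreeSet<String> {
    let selected = match requested {
        // Nothing requested: everything displayed is returned.
        None => displayed.clone(),
        // The wildcard expands to every displayed attribute.
        Some(attrs) if attrs.contains("*") => displayed.clone(),
        Some(attrs) => attrs.clone(),
    };
    // Unknown or non-displayed attributes are silently dropped.
    selected.intersection(displayed).cloned().collect()
}
```

Requesting `title` and an attribute that is not displayed therefore yields only `title`, without raising an error.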
fn insert_geo_distance(sorts: &[String], document: &mut Document) {
lazy_static::lazy_static! {
static ref GEO_REGEX: Regex =

View File

@ -0,0 +1,130 @@
//! This file implements a queue of searches to process and the ability to control how many searches can be run in parallel.
//! We need this because we don't want to process more search requests than we have cores.
//! That slows down everything and consumes RAM for no reason.
//! The steps to do a search are to get the `SearchQueue` data structure and try to get a search permit.
//! This can fail if the queue is full, and we need to drop your search request to register a new one.
//!
//! ### How to do a search request
//!
//! In order to do a search request you should try to get a search permit.
//! Retrieve the `SearchQueue` structure from actix-web (`search_queue: Data<SearchQueue>`)
//! and, right before processing the search, call the `SearchQueue::try_get_search_permit` method: `search_queue.try_get_search_permit().await?;`
//!
//! What is going to happen at this point is that you're going to send a `oneshot::Sender` over an async mpsc channel.
//! Then, the queue/scheduler is going to either:
//! - Drop your oneshot channel => that means there are too many searches going on, and yours won't be executed.
//! You should exit and free all the RAM you use ASAP.
//! - Send you a `Permit` => that will unlock the method, and you will be able to process your search.
//! You should drop the `Permit` only once you have freed all the RAM consumed by the method.
use std::num::NonZeroUsize;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tokio::sync::{mpsc, oneshot};
use crate::error::MeilisearchHttpError;
#[derive(Debug)]
pub struct SearchQueue {
sender: mpsc::Sender<oneshot::Sender<Permit>>,
capacity: usize,
}
/// You should only run search requests while holding this permit.
/// Once it's dropped, a new search request will be able to process.
#[derive(Debug)]
pub struct Permit {
sender: mpsc::Sender<()>,
}
impl Drop for Permit {
fn drop(&mut self) {
// if the channel is closed then the whole instance is down
let _ = futures::executor::block_on(self.sender.send(()));
}
}
impl SearchQueue {
pub fn new(capacity: usize, parallelism: NonZeroUsize) -> Self {
// Search requests are going to wait until we're available anyway,
// so let's not allocate any RAM and keep a capacity of 1.
let (sender, receiver) = mpsc::channel(1);
tokio::task::spawn(Self::run(capacity, parallelism, receiver));
Self { sender, capacity }
}
/// This function is the main loop; it's in charge of scheduling which search request should execute first and
/// how many should execute at the same time.
///
/// It **must never** panic or exit.
async fn run(
capacity: usize,
parallelism: NonZeroUsize,
mut receive_new_searches: mpsc::Receiver<oneshot::Sender<Permit>>,
) {
let mut queue: Vec<oneshot::Sender<Permit>> = Default::default();
let mut rng: StdRng = StdRng::from_entropy();
let mut searches_running: usize = 0;
// By having a capacity of `parallelism` we ensure that every time a search finishes it can release its RAM ASAP
let (sender, mut search_finished) = mpsc::channel(parallelism.into());
loop {
tokio::select! {
// biased select because we want to free up space before trying to register new tasks
biased;
_ = search_finished.recv() => {
searches_running = searches_running.saturating_sub(1);
if !queue.is_empty() {
// Can't panic: the queue wasn't empty thus the range isn't empty.
let remove = rng.gen_range(0..queue.len());
let channel = queue.swap_remove(remove);
let _ = channel.send(Permit { sender: sender.clone() });
}
},
search_request = receive_new_searches.recv() => {
// this unwrap is safe because we're sure the `SearchQueue` still lives somewhere in actix-web
let search_request = search_request.unwrap();
if searches_running < usize::from(parallelism) && queue.is_empty() {
searches_running += 1;
// if the search requests die it's not a hard error on our side
let _ = search_request.send(Permit { sender: sender.clone() });
continue;
} else if capacity == 0 {
// in the very specific case where we have a capacity of zero
// we must refuse the request straight away without going through
// the queue stuff.
drop(search_request);
continue;
} else if queue.len() >= capacity {
let remove = rng.gen_range(0..queue.len());
let thing = queue.swap_remove(remove); // this will drop the channel and notify the search that it won't be processed
drop(thing);
}
queue.push(search_request);
},
}
}
}
/// Returns a search `Permit`.
/// It should be dropped as soon as you've freed all the RAM associated with the search request being processed.
pub async fn try_get_search_permit(&self) -> Result<Permit, MeilisearchHttpError> {
let (sender, receiver) = oneshot::channel();
self.sender.send(sender).await.map_err(|_| MeilisearchHttpError::SearchLimiterIsDown)?;
receiver.await.map_err(|_| MeilisearchHttpError::TooManySearchRequests(self.capacity))
}
/// Returns `Ok(())` if everything seems normal.
/// Returns `Err(MeilisearchHttpError::SearchLimiterIsDown)` if the search limiter seems down.
pub fn health(&self) -> Result<(), MeilisearchHttpError> {
if self.sender.is_closed() {
Err(MeilisearchHttpError::SearchLimiterIsDown)
} else {
Ok(())
}
}
}
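
The admission policy of the `run` loop above can be studied in isolation with a small synchronous model. This is a sketch under stated assumptions, not the queue itself: `QueueModel` and `Admission` are hypothetical names, requests are plain `u32` ids instead of `oneshot::Sender<Permit>`, and the random index is passed in as `pick` so the behaviour is deterministic. The sketch also counts a promoted request as running again, which is a choice made for the model.

```rust
/// Outcome of registering one search request in the model.
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// A permit is granted immediately.
    Run,
    /// The request waits; `evicted` is the waiting request dropped to make room, if any.
    Queued { evicted: Option<u32> },
    /// The queue has zero capacity and every slot is busy.
    Refused,
}

pub struct QueueModel {
    capacity: usize,
    parallelism: usize,
    running: usize,
    waiting: Vec<u32>,
}

impl QueueModel {
    pub fn new(capacity: usize, parallelism: usize) -> Self {
        Self { capacity, parallelism, running: 0, waiting: Vec::new() }
    }

    /// Registers a request. `pick` replaces the random index drawn by the real loop.
    pub fn on_request(&mut self, id: u32, pick: usize) -> Admission {
        if self.running < self.parallelism && self.waiting.is_empty() {
            self.running += 1;
            return Admission::Run;
        }
        if self.capacity == 0 {
            return Admission::Refused;
        }
        // When the queue is full, one waiting request is dropped to make room.
        let evicted = if self.waiting.len() >= self.capacity {
            let idx = pick % self.waiting.len();
            Some(self.waiting.swap_remove(idx))
        } else {
            None
        };
        self.waiting.push(id);
        Admission::Queued { evicted }
    }

    /// Signals that a search finished. Returns the waiting request that gets the permit, if any.
    pub fn on_finished(&mut self, pick: usize) -> Option<u32> {
        self.running = self.running.saturating_sub(1);
        if self.waiting.is_empty() {
            return None;
        }
        let idx = pick % self.waiting.len();
        let promoted = self.waiting.swap_remove(idx);
        // Assumption of this sketch: the promoted search counts as running again.
        self.running += 1;
        Some(promoted)
    }
}
```

With `QueueModel::new(2, 1)`, the first request runs, the next two wait, and a fourth request evicts one of the waiting ones instead of being refused. This favours recent requests over old ones once the queue is saturated.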

View File

@ -328,6 +328,11 @@ impl Index<'_> {
self.service.patch_encoded(url, settings, self.encoder).await
}
pub async fn update_settings_search_cutoff_ms(&self, settings: Value) -> (Value, StatusCode) {
let url = format!("/indexes/{}/settings/search-cutoff-ms", urlencode(self.uid.as_ref()));
self.service.put_encoded(url, settings, self.encoder).await
}
pub async fn delete_settings(&self) -> (Value, StatusCode) {
let url = format!("/indexes/{}/settings", urlencode(self.uid.as_ref()));
self.service.delete(url).await

View File

@ -16,6 +16,7 @@ pub use server::{default_settings, Server};
pub struct Value(pub serde_json::Value);
impl Value {
#[track_caller]
pub fn uid(&self) -> u64 {
if let Some(uid) = self["uid"].as_u64() {
uid


@ -1237,8 +1237,8 @@ async fn error_add_documents_missing_document_id() {
}
#[actix_rt::test]
#[ignore] // TODO: Fix in another PR: this does not provoke any error.
async fn error_document_field_limit_reached() {
#[should_panic]
async fn error_document_field_limit_reached_in_one_document() {
let server = Server::new().await;
let index = server.index("test");
@ -1246,22 +1246,241 @@ async fn error_document_field_limit_reached() {
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "wow");
for i in 0..65535 {
for i in 0..(u16::MAX as usize + 1) {
let key = i.to_string();
big_object.insert(key, "I am a text!");
}
let documents = json!([big_object]);
let (_response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202");
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"500 Internal Server Error");
index.wait_task(0).await;
let (response, code) = index.get_task(0).await;
snapshot!(code, @"200");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
// Documents without a primary key are not accepted.
snapshot!(json_string!(response, { ".duration" => "[duration]", ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]" }),
@"");
snapshot!(response,
@r###"
{
"uid": 1,
"indexUid": "test",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
}
#[actix_rt::test]
async fn error_document_field_limit_reached_over_multiple_documents() {
let server = Server::new().await;
let index = server.index("test");
index.create(Some("id")).await;
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "wow");
for i in 0..(u16::MAX / 2) {
let key = i.to_string();
big_object.insert(key, "I am a text!");
}
let documents = json!([big_object]);
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202 Accepted");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
snapshot!(response,
@r###"
{
"uid": 1,
"indexUid": "test",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "waw");
for i in (u16::MAX as usize / 2)..(u16::MAX as usize + 1) {
let key = i.to_string();
big_object.insert(key, "I am a text!");
}
let documents = json!([big_object]);
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202 Accepted");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
snapshot!(response,
@r###"
{
"uid": 2,
"indexUid": "test",
"status": "failed",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 0
},
"error": {
"message": "A document cannot contain more than 65,535 fields.",
"code": "max_fields_limit_exceeded",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#max_fields_limit_exceeded"
},
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
}
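
The test above suggests that the field limit is counted over the distinct field names accumulated across documents in the index, not only within a single document. A minimal sketch of that bookkeeping, assuming a hypothetical `FieldIds` map that hands out `u16` ids and rejects a document whose new fields would push the total past a configurable `max_fields` (none of these names come from the diff):

```rust
use std::collections::HashMap;

/// Hypothetical index-wide map from field name to a compact numeric id.
struct FieldIds {
    ids: HashMap<String, u16>,
    max_fields: usize,
}

impl FieldIds {
    fn new(max_fields: usize) -> Self {
        // Ids are `u16`, so at most 65_536 distinct fields can ever be numbered.
        Self { ids: HashMap::new(), max_fields: max_fields.min(u16::MAX as usize + 1) }
    }

    fn len(&self) -> usize {
        self.ids.len()
    }

    /// Registers every field of a document, or returns an error and leaves
    /// the map untouched when the index-wide limit would be exceeded.
    fn register_document(&mut self, fields: &[&str]) -> Result<(), String> {
        let mut new_fields: Vec<&str> =
            fields.iter().copied().filter(|f| !self.ids.contains_key(*f)).collect();
        new_fields.sort_unstable();
        new_fields.dedup();
        if self.ids.len() + new_fields.len() > self.max_fields {
            return Err(format!("cannot register more than {} fields", self.max_fields));
        }
        for name in new_fields {
            // The length is strictly below `max_fields` here, so it fits in a `u16`.
            let id = self.ids.len() as u16;
            self.ids.insert(name.to_owned(), id);
        }
        Ok(())
    }
}
```

In this sketch, two documents that each stay under the limit can still fail together once their distinct field names add up past `max_fields`, which mirrors the `succeeded` then `failed` task sequence in the snapshots.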
#[actix_rt::test]
async fn error_document_field_limit_reached_in_one_nested_document() {
let server = Server::new().await;
let index = server.index("test");
index.create(Some("id")).await;
let mut nested = std::collections::HashMap::new();
for i in 0..(u16::MAX as usize + 1) {
let key = i.to_string();
nested.insert(key, "I am a text!");
}
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "wow");
let documents = json!([big_object]);
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202 Accepted");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
// Documents without a primary key are not accepted.
snapshot!(response,
@r###"
{
"uid": 1,
"indexUid": "test",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
}
#[actix_rt::test]
async fn error_document_field_limit_reached_over_multiple_documents_with_nested_fields() {
let server = Server::new().await;
let index = server.index("test");
index.create(Some("id")).await;
let mut nested = std::collections::HashMap::new();
for i in 0..(u16::MAX / 2) {
let key = i.to_string();
nested.insert(key, "I am a text!");
}
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "wow");
let documents = json!([big_object]);
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202 Accepted");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
snapshot!(response,
@r###"
{
"uid": 1,
"indexUid": "test",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
let mut nested = std::collections::HashMap::new();
for i in 0..(u16::MAX / 2) {
let key = i.to_string();
nested.insert(key, "I am a text!");
}
let mut big_object = std::collections::HashMap::new();
big_object.insert("id".to_owned(), "wow");
let documents = json!([big_object]);
let (response, code) = index.update_documents(documents, Some("id")).await;
snapshot!(code, @"202 Accepted");
let response = index.wait_task(response.uid()).await;
snapshot!(code, @"202 Accepted");
snapshot!(response,
@r###"
{
"uid": 2,
"indexUid": "test",
"status": "succeeded",
"type": "documentAdditionOrUpdate",
"canceledBy": null,
"details": {
"receivedDocuments": 1,
"indexedDocuments": 1
},
"error": null,
"duration": "[duration]",
"enqueuedAt": "[date]",
"startedAt": "[date]",
"finishedAt": "[date]"
}
"###);
}
#[actix_rt::test]


@ -77,7 +77,8 @@ async fn import_dump_v1_movie_raw() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -238,7 +239,8 @@ async fn import_dump_v1_movie_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -385,7 +387,8 @@ async fn import_dump_v1_rubygems_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -518,7 +521,8 @@ async fn import_dump_v2_movie_raw() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -663,7 +667,8 @@ async fn import_dump_v2_movie_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -807,7 +812,8 @@ async fn import_dump_v2_rubygems_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -940,7 +946,8 @@ async fn import_dump_v3_movie_raw() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1085,7 +1092,8 @@ async fn import_dump_v3_movie_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1229,7 +1237,8 @@ async fn import_dump_v3_rubygems_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1362,7 +1371,8 @@ async fn import_dump_v4_movie_raw() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1507,7 +1517,8 @@ async fn import_dump_v4_movie_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1651,7 +1662,8 @@ async fn import_dump_v4_rubygems_with_settings() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###
);
@ -1895,7 +1907,8 @@ async fn import_dump_v6_containing_experimental_features() {
},
"pagination": {
"maxTotalHits": 1000
}
},
"searchCutoffMs": null
}
"###);


@ -123,6 +123,28 @@ async fn simple_facet_search_with_max_values() {
assert_eq!(dbg!(response)["facetHits"].as_array().unwrap().len(), 1);
}
#[actix_rt::test]
async fn simple_facet_search_by_count_with_max_values() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index
.update_settings_faceting(
json!({ "maxValuesPerFacet": 1, "sortFacetValuesBy": { "*": "count" } }),
)
.await;
index.update_settings_filterable_attributes(json!(["genres"])).await;
index.add_documents(documents, None).await;
index.wait_task(2).await;
let (response, code) =
index.facet_search(json!({"facetName": "genres", "facetQuery": "a"})).await;
assert_eq!(code, 200, "{}", response);
assert_eq!(dbg!(response)["facetHits"].as_array().unwrap().len(), 1);
}
#[actix_rt::test]
async fn non_filterable_facet_search_error() {
let server = Server::new().await;
@ -157,3 +179,24 @@ async fn facet_search_dont_support_words() {
assert_eq!(code, 200, "{}", response);
assert_eq!(response["facetHits"].as_array().unwrap().len(), 0);
}
#[actix_rt::test]
async fn simple_facet_search_with_sort_by_count() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.update_settings_faceting(json!({ "sortFacetValuesBy": { "*": "count" } })).await;
index.update_settings_filterable_attributes(json!(["genres"])).await;
index.add_documents(documents, None).await;
index.wait_task(2).await;
let (response, code) =
index.facet_search(json!({"facetName": "genres", "facetQuery": "a"})).await;
assert_eq!(code, 200, "{}", response);
let hits = response["facetHits"].as_array().unwrap();
assert_eq!(hits.len(), 2);
assert_eq!(hits[0], json!({ "value": "Action", "count": 3 }));
assert_eq!(hits[1], json!({ "value": "Adventure", "count": 2 }));
}
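
The behaviour asserted by the two facet search tests (prefix match on the facet query, ordering by descending count, truncation to `maxValuesPerFacet`) can be sketched as follows. `facet_hits` is a hypothetical helper written for illustration, with a simple case-insensitive prefix match standing in for the engine's real matching:

```rust
/// Hypothetical helper: returns the facet values matching `query`,
/// ordered by count (descending) or alphabetically, capped at `max_values`.
fn facet_hits(
    values: &[(&str, u64)],
    query: &str,
    sort_by_count: bool,
    max_values: usize,
) -> Vec<(String, u64)> {
    let query = query.to_lowercase();
    let mut hits: Vec<(String, u64)> = values
        .iter()
        .filter(|(value, _)| value.to_lowercase().starts_with(&query))
        .map(|(value, count)| (value.to_string(), *count))
        .collect();
    if sort_by_count {
        // Highest count first; ties are broken alphabetically to stay deterministic.
        hits.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    } else {
        hits.sort_by(|a, b| a.0.cmp(&b.0));
    }
    hits.truncate(max_values);
    hits
}
```

With the counts used in the test (`Action`: 3, `Adventure`: 2) and the query `a`, the sketch returns `Action` before `Adventure`, and only `Action` when `max_values` is 1.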


@ -77,14 +77,57 @@ async fn simple_search() {
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]}},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]}},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]}}]"###);
snapshot!(response["semanticHitCount"], @"0");
let (response, code) = index
.search_post(
json!({"q": "Captain", "vector": [1.0, 1.0], "hybrid": {"semanticRatio": 0.8}}),
json!({"q": "Captain", "vector": [1.0, 1.0], "hybrid": {"semanticRatio": 0.5}, "showRankingScore": true}),
)
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_semanticScore":0.99029034},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_semanticScore":0.97434163},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_semanticScore":0.9472136}]"###);
snapshot!(response["hits"], @r###"[{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.996969696969697},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.996969696969697},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":0.9472135901451112}]"###);
snapshot!(response["semanticHitCount"], @"1");
let (response, code) = index
.search_post(
json!({"q": "Captain", "vector": [1.0, 1.0], "hybrid": {"semanticRatio": 0.8}, "showRankingScore": true}),
)
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.990290343761444},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.974341630935669},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":0.9472135901451112}]"###);
snapshot!(response["semanticHitCount"], @"3");
}
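
The `_rankingScore` values in the semantic-heavy snapshots above are numerically consistent with mapping the cosine similarity between the query vector and the document vector into [0, 1] as `(1 + cos) / 2`. That is an inference from the snapshot numbers only, not a statement about how the engine computes the score. The function names below are hypothetical, and the sketch uses `f64` where the snapshots appear to come from `f32` arithmetic:

```rust
/// Cosine similarity of two vectors of equal length.
/// Returns NaN if either vector has a zero norm.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

/// Maps the cosine similarity from [-1, 1] into [0, 1].
fn semantic_score(query: &[f64], document: &[f64]) -> f64 {
    (1.0 + cosine_similarity(query, document)) / 2.0
}
```

For the query vector `[1.0, 1.0]`, this gives roughly 0.9903 for `[2.0, 3.0]`, 0.9743 for `[1.0, 2.0]`, and 0.9472 for `[1.0, 3.0]`, matching the order and values of the `semanticRatio: 0.8` snapshot up to floating-point precision.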
#[actix_rt::test]
async fn distribution_shift() {
let server = Server::new().await;
let index = index_with_documents(&server, &SIMPLE_SEARCH_DOCUMENTS).await;
let search = json!({"q": "Captain", "vector": [1.0, 1.0], "showRankingScore": true, "hybrid": {"semanticRatio": 1.0}});
let (response, code) = index.search_post(search.clone()).await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.990290343761444},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.974341630935669},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":0.9472135901451112}]"###);
let (response, code) = index
.update_settings(json!({
"embedders": {
"default": {
"distribution": {
"mean": 0.998,
"sigma": 0.01
}
}
}
}))
.await;
snapshot!(code, @"202 Accepted");
let response = server.wait_task(response.uid()).await;
snapshot!(response["details"], @r###"{"embedders":{"default":{"distribution":{"mean":0.998,"sigma":0.01}}}}"###);
let (response, code) = index.search_post(search).await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.19161224365234375},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":1.1920928955078125e-7},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.1920928955078125e-7}]"###);
}
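
The effect of the `distribution` setting in `distribution_shift` can be reproduced with an affine rescaling of the score followed by a clamp. The sketch below is an assumption-laden reconstruction: the target mean of 0.5 and target sigma of 0.4 are not stated anywhere in this diff and were chosen because they reproduce the snapshot values, and `shift_score` is a hypothetical name.

```rust
/// Rescales `score` so that a distribution with the given `mean` and `sigma`
/// is mapped onto one centred on 0.5 with a sigma of 0.4 (assumed targets),
/// then clamps the result into the (0, 1] interval.
fn shift_score(score: f64, mean: f64, sigma: f64) -> f64 {
    const TARGET_MEAN: f64 = 0.5;
    const TARGET_SIGMA: f64 = 0.4;
    // Affine map a * x + b with a = target_sigma / sigma and b = target_mean - a * mean.
    let factor = TARGET_SIGMA / sigma;
    let offset = TARGET_MEAN - factor * mean;
    let shifted = factor * score + offset;
    if shifted <= 0.0 {
        // Keep the score strictly positive, as in the snapshot (1.19e-7).
        f32::EPSILON as f64
    } else if shifted > 1.0 {
        1.0
    } else {
        shifted
    }
}
```

With `mean: 0.998` and `sigma: 0.01`, the first hit's score of about 0.9903 becomes about 0.1916, and the two lower scores fall below zero and are clamped to `f32::EPSILON`, which is what the second snapshot shows.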
#[actix_rt::test]
@ -104,10 +147,12 @@ async fn highlighter() {
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_formatted":{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":["2.0","3.0"]}}},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_formatted":{"title":"Shazam!","desc":"a **BEGIN**Captain**END** **BEGIN**Marvel**END** ersatz","id":"1","_vectors":{"default":["1.0","3.0"]}}},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_formatted":{"title":"Captain Planet","desc":"He's not part of the **BEGIN**Marvel**END** Cinematic Universe","id":"2","_vectors":{"default":["1.0","2.0"]}}}]"###);
snapshot!(response["semanticHitCount"], @"0");
let (response, code) = index
.search_post(json!({"q": "Captain Marvel", "vector": [1.0, 1.0],
"hybrid": {"semanticRatio": 0.8},
"showRankingScore": true,
"attributesToHighlight": [
"desc"
],
@ -116,12 +161,14 @@ async fn highlighter() {
}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_formatted":{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":["2.0","3.0"]}},"_semanticScore":0.99029034},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_formatted":{"title":"Captain Planet","desc":"He's not part of the **BEGIN**Marvel**END** Cinematic Universe","id":"2","_vectors":{"default":["1.0","2.0"]}},"_semanticScore":0.97434163},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_formatted":{"title":"Shazam!","desc":"a **BEGIN**Captain**END** **BEGIN**Marvel**END** ersatz","id":"1","_vectors":{"default":["1.0","3.0"]}},"_semanticScore":0.9472136}]"###);
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_formatted":{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":["2.0","3.0"]}},"_rankingScore":0.990290343761444},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_formatted":{"title":"Captain Planet","desc":"He's not part of the **BEGIN**Marvel**END** Cinematic Universe","id":"2","_vectors":{"default":["1.0","2.0"]}},"_rankingScore":0.974341630935669},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_formatted":{"title":"Shazam!","desc":"a **BEGIN**Captain**END** **BEGIN**Marvel**END** ersatz","id":"1","_vectors":{"default":["1.0","3.0"]}},"_rankingScore":0.9472135901451112}]"###);
snapshot!(response["semanticHitCount"], @"3");
// no highlighting on full semantic
let (response, code) = index
.search_post(json!({"q": "Captain Marvel", "vector": [1.0, 1.0],
"hybrid": {"semanticRatio": 1.0},
"showRankingScore": true,
"attributesToHighlight": [
"desc"
],
@ -130,7 +177,8 @@ async fn highlighter() {
}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_formatted":{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":["2.0","3.0"]}},"_semanticScore":0.99029034},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_formatted":{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":["1.0","2.0"]}},"_semanticScore":0.97434163},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_formatted":{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":["1.0","3.0"]}}}]"###);
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_formatted":{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":["2.0","3.0"]}},"_rankingScore":0.990290343761444},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_formatted":{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":["1.0","2.0"]}},"_rankingScore":0.974341630935669},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_formatted":{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":["1.0","3.0"]}},"_rankingScore":0.9472135901451112}]"###);
snapshot!(response["semanticHitCount"], @"3");
}
#[actix_rt::test]
@ -217,5 +265,115 @@ async fn single_document() {
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"][0], @r###"{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.0,"_semanticScore":1.0}"###);
snapshot!(response["hits"][0], @r###"{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.0}"###);
snapshot!(response["semanticHitCount"], @"1");
}
#[actix_rt::test]
async fn query_combination() {
let server = Server::new().await;
let index = index_with_documents(&server, &SIMPLE_SEARCH_DOCUMENTS).await;
// search without query and vector, but with hybrid => still placeholder
let (response, code) = index
.search_post(json!({"hybrid": {"semanticRatio": 1.0}, "showRankingScore": true}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.0},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":1.0},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":1.0}]"###);
snapshot!(response["semanticHitCount"], @"null");
// same with a different semantic ratio
let (response, code) = index
.search_post(json!({"hybrid": {"semanticRatio": 0.76}, "showRankingScore": true}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.0},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":1.0},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":1.0}]"###);
snapshot!(response["semanticHitCount"], @"null");
// wrong vector dimensions
let (response, code) = index
.search_post(json!({"vector": [1.0, 0.0, 1.0], "hybrid": {"semanticRatio": 1.0}, "showRankingScore": true}))
.await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Invalid vector dimensions: expected: `2`, found: `3`.",
"code": "invalid_vector_dimensions",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_vector_dimensions"
}
"###);
// full vector
let (response, code) = index
.search_post(json!({"vector": [1.0, 0.0], "hybrid": {"semanticRatio": 1.0}, "showRankingScore": true}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.7773500680923462},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.7236068248748779},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":0.6581138968467712}]"###);
snapshot!(response["semanticHitCount"], @"3");
// full keyword, without a query
let (response, code) = index
.search_post(json!({"vector": [1.0, 0.0], "hybrid": {"semanticRatio": 0.0}, "showRankingScore": true}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":1.0},{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":1.0},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":1.0}]"###);
snapshot!(response["semanticHitCount"], @"null");
// query + vector, full keyword => keyword
let (response, code) = index
.search_post(json!({"q": "Captain", "vector": [1.0, 0.0], "hybrid": {"semanticRatio": 0.0}, "showRankingScore": true}))
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.996969696969697},{"title":"Captain Marvel","desc":"a Shazam ersatz","id":"3","_vectors":{"default":[2.0,3.0]},"_rankingScore":0.996969696969697},{"title":"Shazam!","desc":"a Captain Marvel ersatz","id":"1","_vectors":{"default":[1.0,3.0]},"_rankingScore":0.8848484848484849}]"###);
snapshot!(response["semanticHitCount"], @"null");
// query + vector, no hybrid keyword => error
let (response, code) = index
.search_post(json!({"q": "Captain", "vector": [1.0, 0.0], "showRankingScore": true}))
.await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Invalid request: missing `hybrid` parameter when both `q` and `vector` are present.",
"code": "missing_search_hybrid",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#missing_search_hybrid"
}
"###);
// full vector, without a vector => error
let (response, code) = index
.search_post(
json!({"q": "Captain", "hybrid": {"semanticRatio": 1.0}, "showRankingScore": true}),
)
.await;
snapshot!(code, @"400 Bad Request");
snapshot!(response, @r###"
{
"message": "Error while generating embeddings: user error: attempt to embed the following text in a configuration where embeddings must be user provided: \"Captain\"",
"code": "vector_embedding_error",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#vector_embedding_error"
}
"###);
// hybrid without a vector => full keyword
let (response, code) = index
.search_post(
json!({"q": "Planet", "hybrid": {"semanticRatio": 0.99}, "showRankingScore": true}),
)
.await;
snapshot!(code, @"200 OK");
snapshot!(response["hits"], @r###"[{"title":"Captain Planet","desc":"He's not part of the Marvel Cinematic Universe","id":"2","_vectors":{"default":[1.0,2.0]},"_rankingScore":0.9848484848484848}]"###);
snapshot!(response["semanticHitCount"], @"0");
}


@ -10,6 +10,7 @@ mod hybrid;
mod multi;
mod pagination;
mod restrict_searchable;
mod search_queue;
use once_cell::sync::Lazy;
@ -184,6 +185,110 @@ async fn phrase_search_with_stop_word() {
.await;
}
#[actix_rt::test]
async fn negative_phrase_search() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, None).await;
index.wait_task(0).await;
index
.search(json!({"q": "-\"train your dragon\"" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 4);
assert_eq!(hits[0]["id"], "287947");
assert_eq!(hits[1]["id"], "299537");
assert_eq!(hits[2]["id"], "522681");
assert_eq!(hits[3]["id"], "450465");
})
.await;
}
#[actix_rt::test]
async fn negative_word_search() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, None).await;
index.wait_task(0).await;
index
.search(json!({"q": "-escape" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 4);
assert_eq!(hits[0]["id"], "287947");
assert_eq!(hits[1]["id"], "299537");
assert_eq!(hits[2]["id"], "166428");
assert_eq!(hits[3]["id"], "450465");
})
.await;
// Everything that contains derivatives of escape but not escape itself: nothing
index
.search(json!({"q": "-escape escape" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 0);
})
.await;
}
#[actix_rt::test]
async fn non_negative_search() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, None).await;
index.wait_task(0).await;
index
.search(json!({"q": "- escape" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 1);
assert_eq!(hits[0]["id"], "522681");
})
.await;
index
.search(json!({"q": "- \"train your dragon\"" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 1);
assert_eq!(hits[0]["id"], "166428");
})
.await;
}
#[actix_rt::test]
async fn negative_special_cases_search() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, None).await;
index.wait_task(0).await;
index.update_settings(json!({"synonyms": { "escape": ["glass"] }})).await;
index.wait_task(1).await;
// There is a synonym for escape -> glass but we don't want "escape", only the derivatives: glass
index
.search(json!({"q": "-escape escape" }), |response, code| {
assert_eq!(code, 200, "{}", response);
let hits = response["hits"].as_array().unwrap();
assert_eq!(hits.len(), 1);
assert_eq!(hits[0]["id"], "450465");
})
.await;
}
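
The four tests above distinguish `-escape` (negative word), `-"train your dragon"` (negative phrase), and `- escape` (a lone dash followed by an ordinary word). A minimal sketch of that distinction, assuming a hypothetical whitespace-based parser (`Term` and `parse_query` are illustrative names, not the engine's tokenizer):

```rust
/// Hypothetical classification of query terms.
#[derive(Debug, PartialEq)]
enum Term {
    Word(String),
    Phrase(String),
    NegativeWord(String),
    NegativePhrase(String),
}

/// Splits a query into terms. A `-` negates a term only when it is
/// immediately followed by that term; a lone `-` is ignored.
fn parse_query(query: &str) -> Vec<Term> {
    let mut terms = Vec::new();
    let mut chars = query.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
            continue;
        }
        let mut negative = false;
        if c == '-' {
            chars.next();
            match chars.peek() {
                Some(next) if !next.is_whitespace() => negative = true,
                _ => continue, // lone dash: skip it
            }
        }
        if chars.peek() == Some(&'"') {
            chars.next(); // consume the opening quote
            // Collect up to the closing quote (which is consumed as well).
            let phrase: String = chars.by_ref().take_while(|&ch| ch != '"').collect();
            terms.push(if negative { Term::NegativePhrase(phrase) } else { Term::Phrase(phrase) });
        } else {
            let mut word = String::new();
            while let Some(&ch) = chars.peek() {
                if ch.is_whitespace() {
                    break;
                }
                word.push(ch);
                chars.next();
            }
            terms.push(if negative { Term::NegativeWord(word) } else { Term::Word(word) });
        }
    }
    terms
}
```

Under these assumptions, `-escape escape` parses into a negative word followed by a positive word, which is why the test expects no hits unless a synonym such as `glass` provides a match that does not contain `escape` itself.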
#[cfg(feature = "default")]
#[actix_rt::test]
async fn test_kanji_language_detection() {
@ -834,6 +939,94 @@ async fn test_score_details() {
.await;
}
#[actix_rt::test]
async fn test_degraded_score_details() {
let server = Server::new().await;
let index = server.index("test");
let documents = NESTED_DOCUMENTS.clone();
index.add_documents(json!(documents), None).await;
// We can't really use anything other than 0ms here; otherwise, the test will get flaky.
let (res, _code) = index.update_settings(json!({ "searchCutoffMs": 0 })).await;
index.wait_task(res.uid()).await;
index
.search(
json!({
"q": "b",
"attributesToRetrieve": ["doggos.name", "cattos"],
"showRankingScoreDetails": true,
}),
|response, code| {
meili_snap::snapshot!(code, @"200 OK");
meili_snap::snapshot!(meili_snap::json_string!(response, { ".processingTimeMs" => "[duration]" }), @r###"
{
"hits": [
{
"doggos": [
{
"name": "bobby"
},
{
"name": "buddy"
}
],
"cattos": "pésti",
"_rankingScoreDetails": {
"skipped": {
"order": 0
}
}
},
{
"doggos": [
{
"name": "gros bill"
}
],
"cattos": [
"simba",
"pestiféré"
],
"_rankingScoreDetails": {
"skipped": {
"order": 0
}
}
},
{
"doggos": [
{
"name": "turbo"
},
{
"name": "fast"
}
],
"cattos": [
"moumoute",
"gomez"
],
"_rankingScoreDetails": {
"skipped": {
"order": 0
}
}
}
],
"query": "b",
"processingTimeMs": "[duration]",
"limit": 20,
"offset": 0,
"estimatedTotalHits": 3
}
"###);
},
)
.await;
}
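
The `skipped` entries in the snapshot above are what one would expect if the search checks a deadline before applying each ranking rule and stops as soon as the cutoff is reached. A minimal sketch of that idea, assuming hypothetical `TimeBudget`, `RuleOutcome`, and `run_ranking_rules` names (the rule names in the usage note are placeholders as well):

```rust
use std::time::{Duration, Instant};

/// Hypothetical deadline tracker started when the search begins.
struct TimeBudget {
    started_at: Instant,
    budget: Duration,
}

impl TimeBudget {
    fn new(budget: Duration) -> Self {
        Self { started_at: Instant::now(), budget }
    }

    /// True once the elapsed time has reached the budget.
    /// A zero budget is therefore exceeded immediately.
    fn exceeded(&self) -> bool {
        self.started_at.elapsed() >= self.budget
    }
}

#[derive(Debug, PartialEq)]
enum RuleOutcome {
    /// The named ranking rule was applied.
    Applied(&'static str),
    /// The budget ran out: the remaining rules were skipped.
    Skipped,
}

/// Applies rules in order and records a single `Skipped` marker
/// as soon as the budget is exceeded.
fn run_ranking_rules(rules: &[&'static str], budget: &TimeBudget) -> Vec<RuleOutcome> {
    let mut outcomes = Vec::new();
    for &rule in rules {
        if budget.exceeded() {
            outcomes.push(RuleOutcome::Skipped);
            break;
        }
        outcomes.push(RuleOutcome::Applied(rule));
    }
    outcomes
}
```

With a budget of `Duration::ZERO`, every call yields a single `Skipped` outcome regardless of machine speed, which is the property the test comment relies on when it insists on `"searchCutoffMs": 0`.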
#[actix_rt::test]
async fn experimental_feature_vector_store() {
let server = Server::new().await;
@ -847,6 +1040,7 @@ async fn experimental_feature_vector_store() {
let (response, code) = index
.search_post(json!({
"vector": [1.0, 2.0, 3.0],
"showRankingScore": true
}))
.await;
meili_snap::snapshot!(code, @"400 Bad Request");
@ -889,6 +1083,7 @@ async fn experimental_feature_vector_store() {
let (response, code) = index
.search_post(json!({
"vector": [1.0, 2.0, 3.0],
"showRankingScore": true,
}))
.await;
@ -906,7 +1101,7 @@ async fn experimental_feature_vector_store() {
3
]
},
"_semanticScore": 1.0
"_rankingScore": 1.0
},
{
"title": "Captain Marvel",
@ -918,7 +1113,7 @@ async fn experimental_feature_vector_store() {
54
]
},
"_semanticScore": 0.9129112
"_rankingScore": 0.9129111766815186
},
{
"title": "Gläss",
@ -930,7 +1125,7 @@ async fn experimental_feature_vector_store() {
90
]
},
"_semanticScore": 0.8106413
"_rankingScore": 0.8106412887573242
},
{
"title": "How to Train Your Dragon: The Hidden World",
@ -942,7 +1137,7 @@ async fn experimental_feature_vector_store() {
32
]
},
"_semanticScore": 0.74120104
"_rankingScore": 0.7412010431289673
},
{
"title": "Escape Room",
@ -953,7 +1148,8 @@ async fn experimental_feature_vector_store() {
-23,
32
]
}
},
"_rankingScore": 0.6972063183784485
}
]
"###);


@ -0,0 +1,184 @@
use std::num::NonZeroUsize;
use std::sync::Arc;
use std::time::Duration;
use actix_web::ResponseError;
use meili_snap::snapshot;
use meilisearch::search_queue::SearchQueue;
#[actix_rt::test]
async fn search_queue_register() {
let queue = SearchQueue::new(4, NonZeroUsize::new(2).unwrap());
// First, use all the cores
let permit1 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
let _permit2 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
// If we free one spot we should be able to register one new search
drop(permit1);
let permit3 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
// And again
drop(permit3);
let _permit4 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
}
#[actix_rt::test]
async fn wait_till_cores_are_available() {
let queue = Arc::new(SearchQueue::new(4, NonZeroUsize::new(1).unwrap()));
// First, use all the cores
let permit1 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
let ret = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit()).await;
assert!(ret.is_err(), "The capacity is full, we should not get a permit");
let q = queue.clone();
let task = tokio::task::spawn(async move { q.try_get_search_permit().await });
// after dropping a permit the previous task should be able to finish
drop(permit1);
let _permit2 = tokio::time::timeout(Duration::from_secs(1), task)
.await
.expect("I should get a permit straight away")
.unwrap();
}
#[actix_rt::test]
async fn refuse_search_requests_when_queue_is_full() {
let queue = Arc::new(SearchQueue::new(1, NonZeroUsize::new(1).unwrap()));
// First, use the only available core
let _permit1 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
let q = queue.clone();
let permit2 = tokio::task::spawn(async move { q.try_get_search_permit().await });
// Here the queue is full. By registering two new search requests the permit 2 and 3 should be thrown out
let q = queue.clone();
let _permit3 = tokio::task::spawn(async move { q.try_get_search_permit().await });
let permit2 = tokio::time::timeout(Duration::from_secs(1), permit2)
.await
.expect("I should get a result straight away")
.unwrap(); // task should end successfully
let err = meilisearch_types::error::ResponseError::from(permit2.unwrap_err());
let http_response = err.error_response();
let mut headers: Vec<_> = http_response
.headers()
.iter()
.map(|(name, value)| (name.to_string(), value.to_str().unwrap().to_string()))
.collect();
headers.sort();
snapshot!(format!("{headers:?}"), @r###"[("content-type", "application/json"), ("retry-after", "10")]"###);
let err = serde_json::to_string_pretty(&err).unwrap();
snapshot!(err, @r###"
{
"message": "Too many search requests running at the same time: 1. Retry after 10s.",
"code": "too_many_search_requests",
"type": "system",
"link": "https://docs.meilisearch.com/errors#too_many_search_requests"
}
"###);
}
#[actix_rt::test]
async fn search_request_crashes_while_holding_permits() {
let queue = Arc::new(SearchQueue::new(1, NonZeroUsize::new(1).unwrap()));
let (send, recv) = tokio::sync::oneshot::channel();
// This first request takes a CPU
let q = queue.clone();
tokio::task::spawn(async move {
let _permit = q.try_get_search_permit().await.unwrap();
recv.await.unwrap();
panic!("oops an unexpected crash happened")
});
// This second request waits in the queue till the first request finishes
let q = queue.clone();
let task = tokio::task::spawn(async move {
let _permit = q.try_get_search_permit().await.unwrap();
});
// By sending something in the channel the request holding a CPU will panic and should lose its permit
send.send(()).unwrap();
// Then the second request should be able to proceed and finish correctly without panicking
tokio::time::timeout(Duration::from_secs(1), task)
.await
.expect("I should get a permit straight away")
.unwrap();
// I should even be able to take a second permit here
let _permit1 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
}
#[actix_rt::test]
async fn works_with_capacity_of_zero() {
let queue = Arc::new(SearchQueue::new(0, NonZeroUsize::new(1).unwrap()));
// First, use the only available core
let permit1 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
// then we should get an error if we try to register a second search request.
let permit2 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a result straight away");
let err = meilisearch_types::error::ResponseError::from(permit2.unwrap_err());
let http_response = err.error_response();
let mut headers: Vec<_> = http_response
.headers()
.iter()
.map(|(name, value)| (name.to_string(), value.to_str().unwrap().to_string()))
.collect();
headers.sort();
snapshot!(format!("{headers:?}"), @r###"[("content-type", "application/json"), ("retry-after", "10")]"###);
let err = serde_json::to_string_pretty(&err).unwrap();
snapshot!(err, @r###"
{
"message": "Too many search requests running at the same time: 0. Retry after 10s.",
"code": "too_many_search_requests",
"type": "system",
"link": "https://docs.meilisearch.com/errors#too_many_search_requests"
}
"###);
drop(permit1);
// After dropping the first permit we should be able to get a new permit
let _permit3 = tokio::time::timeout(Duration::from_secs(1), queue.try_get_search_permit())
.await
.expect("I should get a permit straight away")
.unwrap();
}

View File

@ -337,3 +337,31 @@ async fn settings_bad_pagination() {
}
"###);
}
#[actix_rt::test]
async fn settings_bad_search_cutoff_ms() {
let server = Server::new().await;
let index = server.index("test");
let (response, code) = index.update_settings(json!({ "searchCutoffMs": "doggo" })).await;
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Invalid value type at `.searchCutoffMs`: expected a positive integer, but found a string: `\"doggo\"`",
"code": "invalid_settings_search_cutoff_ms",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_settings_search_cutoff_ms"
}
"###);
let (response, code) = index.update_settings_search_cutoff_ms(json!("doggo")).await;
snapshot!(code, @"400 Bad Request");
snapshot!(json_string!(response), @r###"
{
"message": "Invalid value type: expected a positive integer, but found a string: `\"doggo\"`",
"code": "invalid_settings_search_cutoff_ms",
"type": "invalid_request",
"link": "https://docs.meilisearch.com/errors#invalid_settings_search_cutoff_ms"
}
"###);
}

View File

@ -35,6 +35,7 @@ static DEFAULT_SETTINGS_VALUES: Lazy<HashMap<&'static str, Value>> = Lazy::new(|
"maxTotalHits": json!(1000),
}),
);
map.insert("search_cutoff_ms", json!(null));
map
});
@ -49,12 +50,12 @@ async fn get_settings_unexisting_index() {
async fn get_settings() {
let server = Server::new().await;
let index = server.index("test");
index.create(None).await;
index.wait_task(0).await;
let (response, _code) = index.create(None).await;
index.wait_task(response.uid()).await;
let (response, code) = index.settings().await;
assert_eq!(code, 200);
let settings = response.as_object().unwrap();
assert_eq!(settings.keys().len(), 15);
assert_eq!(settings.keys().len(), 16);
assert_eq!(settings["displayedAttributes"], json!(["*"]));
assert_eq!(settings["searchableAttributes"], json!(["*"]));
assert_eq!(settings["filterableAttributes"], json!([]));
@ -84,6 +85,137 @@ async fn get_settings() {
})
);
assert_eq!(settings["proximityPrecision"], json!("byWord"));
assert_eq!(settings["searchCutoffMs"], json!(null));
}
#[actix_rt::test]
async fn secrets_are_hidden_in_settings() {
let server = Server::new().await;
let (response, code) = server.set_features(json!({"vectorStore": true})).await;
meili_snap::snapshot!(code, @"200 OK");
meili_snap::snapshot!(meili_snap::json_string!(response), @r###"
{
"vectorStore": true,
"metrics": false,
"logsRoute": false,
"exportPuffinReports": false
}
"###);
let index = server.index("test");
let (response, _code) = index.create(None).await;
index.wait_task(response.uid()).await;
let (response, code) = index
.update_settings(json!({
"embedders": {
"default": {
"source": "rest",
"url": "https://localhost:7777",
"apiKey": "My super secret value you will never guess"
}
}
}))
.await;
meili_snap::snapshot!(code, @"202 Accepted");
meili_snap::snapshot!(meili_snap::json_string!(response, { ".duration" => "[duration]", ".enqueuedAt" => "[date]", ".startedAt" => "[date]", ".finishedAt" => "[date]" }),
@r###"
{
"taskUid": 1,
"indexUid": "test",
"status": "enqueued",
"type": "settingsUpdate",
"enqueuedAt": "[date]"
}
"###);
let settings_update_uid = response.uid();
index.wait_task(settings_update_uid).await;
let (response, code) = index.settings().await;
meili_snap::snapshot!(code, @"200 OK");
meili_snap::snapshot!(meili_snap::json_string!(response), @r###"
{
"displayedAttributes": [
"*"
],
"searchableAttributes": [
"*"
],
"filterableAttributes": [],
"sortableAttributes": [],
"rankingRules": [
"words",
"typo",
"proximity",
"attribute",
"sort",
"exactness"
],
"stopWords": [],
"nonSeparatorTokens": [],
"separatorTokens": [],
"dictionary": [],
"synonyms": {},
"distinctAttribute": null,
"proximityPrecision": "byWord",
"typoTolerance": {
"enabled": true,
"minWordSizeForTypos": {
"oneTypo": 5,
"twoTypos": 9
},
"disableOnWords": [],
"disableOnAttributes": []
},
"faceting": {
"maxValuesPerFacet": 100,
"sortFacetValuesBy": {
"*": "alpha"
}
},
"pagination": {
"maxTotalHits": 1000
},
"embedders": {
"default": {
"source": "rest",
"apiKey": "My suXXXXXX...",
"documentTemplate": "{% for field in fields %} {{ field.name }}: {{ field.value }}\n{% endfor %}",
"url": "https://localhost:7777",
"query": null,
"inputField": [
"input"
],
"pathToEmbeddings": [
"data"
],
"embeddingObject": [
"embedding"
],
"inputType": "text"
}
},
"searchCutoffMs": null
}
"###);
let (response, code) = server.get_task(settings_update_uid).await;
meili_snap::snapshot!(code, @"200 OK");
meili_snap::snapshot!(meili_snap::json_string!(response["details"]), @r###"
{
"embedders": {
"default": {
"source": "rest",
"apiKey": "My suXXXXXX...",
"url": "https://localhost:7777"
}
}
}
"###);
}
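The snapshots above show the REST embedder's `apiKey` coming back as `"My suXXXXXX..."` in both the settings and the task details. The following is a minimal sketch of a masking function that reproduces that one snapshot. The name `mask_secret`, the five-character visible prefix, and the handling of short secrets are assumptions made for illustration, not the actual Meilisearch implementation.

```rust
/// Hypothetical helper: keeps a short visible prefix and hides the rest.
/// The prefix length (5) is inferred from the snapshot only.
pub fn mask_secret(secret: &str) -> String {
    const VISIBLE: usize = 5;
    // Assumption: secrets too short to show a prefix are hidden entirely.
    if secret.chars().count() <= VISIBLE {
        return "XXX...".to_string();
    }
    // Work on chars rather than bytes so multi-byte UTF-8 is never split.
    let prefix: String = secret.chars().take(VISIBLE).collect();
    format!("{prefix}XXXXXX...")
}
```

Masking at read time, rather than at storage time, would be consistent with the dump code in this diff, which asks for `SecretPolicy::RevealSecrets` so that the full key survives an export.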
#[actix_rt::test]
@ -285,7 +417,8 @@ test_setting_routes!(
ranking_rules put,
synonyms put,
pagination patch,
faceting patch
faceting patch,
search_cutoff_ms put
);
#[actix_rt::test]

View File

@ -7,7 +7,7 @@ use std::sync::Arc;
use actix_http::body::MessageBody;
use actix_web::dev::{ServiceFactory, ServiceResponse};
use actix_web::web::{Bytes, Data};
use actix_web::{post, App, HttpResponse, HttpServer};
use actix_web::{post, App, HttpRequest, HttpResponse, HttpServer};
use meili_snap::{json_string, snapshot};
use meilisearch::Opt;
use tokio::sync::mpsc;
@ -17,7 +17,17 @@ use crate::common::{default_settings, Server};
use crate::json;
#[post("/")]
async fn forward_body(sender: Data<mpsc::UnboundedSender<Vec<u8>>>, body: Bytes) -> HttpResponse {
async fn forward_body(
req: HttpRequest,
sender: Data<mpsc::UnboundedSender<Vec<u8>>>,
body: Bytes,
) -> HttpResponse {
let headers = req.headers();
assert_eq!(headers.get("content-type").unwrap(), "application/x-ndjson");
assert_eq!(headers.get("transfer-encoding").unwrap(), "chunked");
assert_eq!(headers.get("accept-encoding").unwrap(), "gzip");
assert_eq!(headers.get("content-encoding").unwrap(), "gzip");
let body = body.to_vec();
sender.send(body).unwrap();
HttpResponse::Ok().into()

View File

@ -291,7 +291,11 @@ fn export_a_dump(
}
// 4.2. Dump the settings
let settings = meilisearch_types::settings::settings(&index, &rtxn)?;
let settings = meilisearch_types::settings::settings(
&index,
&rtxn,
meilisearch_types::settings::SecretPolicy::RevealSecrets,
)?;
index_dumper.settings(&settings)?;
count += 1;
}

View File

@ -17,7 +17,7 @@ bincode = "1.3.3"
bstr = "1.9.0"
bytemuck = { version = "1.14.0", features = ["extern_crate_alloc"] }
byteorder = "1.5.0"
charabia = { version = "0.8.7", default-features = false }
charabia = { version = "0.8.8", default-features = false }
concat-arrays = "0.1.2"
crossbeam-channel = "0.5.11"
deserr = "0.6.1"
@ -70,27 +70,23 @@ itertools = "0.11.0"
# profiling
puffin = "0.16.0"
# logging
logging_timer = "1.1.0"
csv = "1.3.0"
candle-core = { git = "https://github.com/huggingface/candle.git", version = "0.3.1" }
candle-transformers = { git = "https://github.com/huggingface/candle.git", version = "0.3.1" }
candle-nn = { git = "https://github.com/huggingface/candle.git", version = "0.3.1" }
tokenizers = { git = "https://github.com/huggingface/tokenizers.git", tag = "v0.14.1", version = "0.14.1", default_features = false, features = ["onig"] }
candle-core = { version = "0.4.1" }
candle-transformers = { version = "0.4.1" }
candle-nn = { version = "0.4.1" }
tokenizers = { git = "https://github.com/huggingface/tokenizers.git", tag = "v0.15.2", version = "0.15.2", default_features = false, features = [
"onig",
] }
hf-hub = { git = "https://github.com/dureuill/hf-hub.git", branch = "rust_tls", default_features = false, features = [
"online",
] }
tokio = { version = "1.35.1", features = ["rt"] }
futures = "0.3.30"
reqwest = { version = "0.11.23", features = [
"rustls-tls",
"json",
], default-features = false }
tiktoken-rs = "0.5.8"
liquid = "0.26.4"
arroy = "0.2.0"
rand = "0.8.5"
tracing = "0.1.40"
ureq = { version = "2.9.6", features = ["json"] }
url = "2.5.0"
[dev-dependencies]
mimalloc = { version = "0.1.39", default-features = false }

View File

@ -6,7 +6,7 @@ use std::time::Instant;
use heed::EnvOpenOptions;
use milli::{
execute_search, filtered_universe, DefaultSearchLogger, GeoSortStrategy, Index, SearchContext,
SearchLogger, TermsMatchingStrategy,
SearchLogger, TermsMatchingStrategy, TimeBudget,
};
#[global_allocator]
@ -49,7 +49,7 @@ fn main() -> Result<(), Box<dyn Error>> {
let start = Instant::now();
let mut ctx = SearchContext::new(&index, &txn);
let universe = filtered_universe(&ctx, &None)?;
let universe = filtered_universe(ctx.index, ctx.txn, &None)?;
let docs = execute_search(
&mut ctx,
@ -65,6 +65,7 @@ fn main() -> Result<(), Box<dyn Error>> {
None,
&mut DefaultSearchLogger,
logger,
TimeBudget::max(),
)?;
if let Some((logger, dir)) = detailed_logger {
logger.finish(&mut ctx, Path::new(dir))?;

View File

@ -196,7 +196,7 @@ only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and undersco
InvalidPromptForEmbeddings(String, crate::prompt::error::NewPromptError),
#[error("Too many embedders in the configuration. Found {0}, but limited to 256.")]
TooManyEmbedders(usize),
#[error("Cannot find embedder with name {0}.")]
#[error("Cannot find embedder with name `{0}`.")]
InvalidEmbedder(String),
#[error("Too many vectors for document with id {0}: found {1}, but limited to 256.")]
TooManyVectors(String, usize),
@ -243,6 +243,8 @@ only composed of alphanumeric characters (a-z A-Z 0-9), hyphens (-) and undersco
},
#[error("`.embedders.{embedder_name}.dimensions`: `dimensions` cannot be zero")]
InvalidSettingsDimensions { embedder_name: String },
#[error("`.embedders.{embedder_name}.url`: could not parse `{url}`: {inner_error}")]
InvalidUrl { embedder_name: String, inner_error: url::ParseError, url: String },
}
impl From<crate::vector::Error> for Error {

View File

@ -20,13 +20,13 @@ use crate::heed_codec::facet::{
use crate::heed_codec::{
BEU16StrCodec, FstSetCodec, ScriptLanguageCodec, StrBEU16Codec, StrRefCodec,
};
use crate::order_by_map::OrderByMap;
use crate::proximity::ProximityPrecision;
use crate::vector::EmbeddingConfig;
use crate::{
default_criteria, CboRoaringBitmapCodec, Criterion, DocumentId, ExternalDocumentsIds,
FacetDistribution, FieldDistribution, FieldId, FieldIdWordCountCodec, GeoPoint, ObkvCodec,
OrderBy, Result, RoaringBitmapCodec, RoaringBitmapLenCodec, Search, U8StrStrCodec, BEU16,
BEU32, BEU64,
Result, RoaringBitmapCodec, RoaringBitmapLenCodec, Search, U8StrStrCodec, BEU16, BEU32, BEU64,
};
pub const DEFAULT_MIN_WORD_LEN_ONE_TYPO: u8 = 5;
@ -67,6 +67,7 @@ pub mod main_key {
pub const PAGINATION_MAX_TOTAL_HITS: &str = "pagination-max-total-hits";
pub const PROXIMITY_PRECISION: &str = "proximity-precision";
pub const EMBEDDING_CONFIGS: &str = "embedding_configs";
pub const SEARCH_CUTOFF: &str = "search_cutoff";
}
pub mod db_name {
@ -1115,7 +1116,7 @@ impl Index {
/* words prefixes fst */
/// Writes the FST which is the words prefixes dictionnary of the engine.
/// Writes the FST which is the words prefixes dictionary of the engine.
pub(crate) fn put_words_prefixes_fst<A: AsRef<[u8]>>(
&self,
wtxn: &mut RwTxn,
@ -1128,7 +1129,7 @@ impl Index {
)
}
/// Returns the FST which is the words prefixes dictionnary of the engine.
/// Returns the FST which is the words prefixes dictionary of the engine.
pub fn words_prefixes_fst<'t>(&self, rtxn: &'t RoTxn) -> Result<fst::Set<Cow<'t, [u8]>>> {
match self.main.remap_types::<Str, Bytes>().get(rtxn, main_key::WORDS_PREFIXES_FST_KEY)? {
Some(bytes) => Ok(fst::Set::new(bytes)?.map_data(Cow::Borrowed)?),
@ -1373,21 +1374,19 @@ impl Index {
self.main.remap_key_type::<Str>().delete(txn, main_key::MAX_VALUES_PER_FACET)
}
pub fn sort_facet_values_by(&self, txn: &RoTxn) -> heed::Result<HashMap<String, OrderBy>> {
let mut orders = self
pub fn sort_facet_values_by(&self, txn: &RoTxn) -> heed::Result<OrderByMap> {
let orders = self
.main
.remap_types::<Str, SerdeJson<HashMap<String, OrderBy>>>()
.remap_types::<Str, SerdeJson<OrderByMap>>()
.get(txn, main_key::SORT_FACET_VALUES_BY)?
.unwrap_or_default();
// Insert the default ordering if it is not already overwritten by the user.
orders.entry("*".to_string()).or_insert(OrderBy::Lexicographic);
Ok(orders)
}
pub(crate) fn put_sort_facet_values_by(
&self,
txn: &mut RwTxn,
val: &HashMap<String, OrderBy>,
val: &OrderByMap,
) -> heed::Result<()> {
self.main.remap_types::<Str, SerdeJson<_>>().put(txn, main_key::SORT_FACET_VALUES_BY, &val)
}
@ -1500,12 +1499,16 @@ impl Index {
.unwrap_or_default())
}
pub fn default_embedding_name(&self, rtxn: &RoTxn<'_>) -> Result<String> {
let configs = self.embedding_configs(rtxn)?;
Ok(match configs.as_slice() {
[(ref first_name, _)] => first_name.clone(),
_ => "default".to_owned(),
})
pub(crate) fn put_search_cutoff(&self, wtxn: &mut RwTxn<'_>, cutoff: u64) -> heed::Result<()> {
self.main.remap_types::<Str, BEU64>().put(wtxn, main_key::SEARCH_CUTOFF, &cutoff)
}
pub fn search_cutoff(&self, rtxn: &RoTxn<'_>) -> Result<Option<u64>> {
Ok(self.main.remap_types::<Str, BEU64>().get(rtxn, main_key::SEARCH_CUTOFF)?)
}
pub(crate) fn delete_search_cutoff(&self, wtxn: &mut RwTxn<'_>) -> heed::Result<bool> {
self.main.remap_key_type::<Str>().delete(wtxn, main_key::SEARCH_CUTOFF)
}
}
@ -2423,6 +2426,8 @@ pub(crate) mod tests {
candidates: _,
document_scores: _,
mut documents_ids,
degraded: _,
used_negative_operator: _,
} = search.execute().unwrap();
let primary_key_id = index.fields_ids_map(&rtxn).unwrap().id("primary_key").unwrap();
documents_ids.sort_unstable();

View File

@ -16,6 +16,7 @@ pub mod facet;
mod fields_ids_map;
pub mod heed_codec;
pub mod index;
pub mod order_by_map;
pub mod prompt;
pub mod proximity;
pub mod score_details;
@ -29,6 +30,7 @@ pub mod snapshot_tests;
use std::collections::{BTreeMap, HashMap};
use std::convert::{TryFrom, TryInto};
use std::fmt;
use std::hash::BuildHasherDefault;
use charabia::normalizer::{CharNormalizer, CompatibilityDecompositionNormalizer};
@ -56,10 +58,11 @@ pub use self::heed_codec::{
UncheckedU8StrStrCodec,
};
pub use self::index::Index;
pub use self::search::facet::{FacetValueHit, SearchForFacetValues};
pub use self::search::recommend::Recommend;
pub use self::search::{
FacetDistribution, FacetValueHit, Filter, FormatOptions, MatchBounds, MatcherBuilder,
MatchingWords, OrderBy, Search, SearchForFacetValues, SearchResult, TermsMatchingStrategy,
DEFAULT_VALUES_PER_FACET,
FacetDistribution, Filter, FormatOptions, MatchBounds, MatcherBuilder, MatchingWords, OrderBy,
Search, SearchResult, SemanticSearch, TermsMatchingStrategy, DEFAULT_VALUES_PER_FACET,
};
pub type Result<T> = std::result::Result<T, error::Error>;
@ -103,6 +106,73 @@ pub const MAX_WORD_LENGTH: usize = MAX_LMDB_KEY_LENGTH / 2;
pub const MAX_POSITION_PER_ATTRIBUTE: u32 = u16::MAX as u32 + 1;
#[derive(Clone)]
pub struct TimeBudget {
started_at: std::time::Instant,
budget: std::time::Duration,
/// When testing the time budget, ensuring we did more than one iteration of the bucket sort can be useful.
/// But to avoid being flaky, the only option is to add the ability to stop after a specific number of calls instead of a `Duration`.
#[cfg(test)]
stop_after: Option<(std::sync::Arc<std::sync::atomic::AtomicUsize>, usize)>,
}
impl fmt::Debug for TimeBudget {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("TimeBudget")
.field("started_at", &self.started_at)
.field("budget", &self.budget)
.field("left", &self.budget.saturating_sub(self.started_at.elapsed()))
.finish()
}
}
impl Default for TimeBudget {
fn default() -> Self {
Self::new(std::time::Duration::from_millis(150))
}
}
impl TimeBudget {
pub fn new(budget: std::time::Duration) -> Self {
Self {
started_at: std::time::Instant::now(),
budget,
#[cfg(test)]
stop_after: None,
}
}
pub fn max() -> Self {
Self::new(std::time::Duration::from_secs(u64::MAX))
}
#[cfg(test)]
pub fn with_stop_after(mut self, stop_after: usize) -> Self {
use std::sync::atomic::AtomicUsize;
use std::sync::Arc;
self.stop_after = Some((Arc::new(AtomicUsize::new(0)), stop_after));
self
}
pub fn exceeded(&self) -> bool {
#[cfg(test)]
if let Some((current, stop_after)) = &self.stop_after {
let current = current.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
if current >= *stop_after {
return true;
} else {
// if a number has been specified then we entirely ignore the time budget
return false;
}
}
self.started_at.elapsed() > self.budget
}
}
// Convert an absolute word position into a relative position.
// Return the field id of the attribute related to the absolute position
// and the relative position in the attribute.
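The `TimeBudget` added above is a start instant plus an allowed duration, checked through `exceeded()`. As a rough sketch of how a caller might use such a budget, the loop below stops early and reports a degraded result once the budget runs out. `Budget`, `left`, and `run_steps` are hypothetical names for this example and only mirror the shape of the type in the diff; the real bucket sort is not shown here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the `TimeBudget` type in the diff.
#[derive(Clone, Debug)]
pub struct Budget {
    started_at: Instant,
    budget: Duration,
}

impl Budget {
    pub fn new(budget: Duration) -> Self {
        Self { started_at: Instant::now(), budget }
    }

    /// True once more time has elapsed than the budget allows.
    pub fn exceeded(&self) -> bool {
        self.started_at.elapsed() > self.budget
    }

    /// Remaining time, clamped to zero so it never underflows.
    pub fn left(&self) -> Duration {
        self.budget.saturating_sub(self.started_at.elapsed())
    }
}

/// Runs up to `steps` units of work, checking the budget before each one.
/// Returns how many steps completed and whether the run was cut short.
pub fn run_steps(budget: &Budget, steps: usize) -> (usize, bool) {
    let mut done = 0;
    for _ in 0..steps {
        if budget.exceeded() {
            return (done, true);
        }
        done += 1;
    }
    (done, false)
}
```

The boolean plays the same role as the `degraded` field that appears in the search result destructuring earlier in this diff: the caller still gets a partial answer and is told it is partial.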

57
milli/src/order_by_map.rs Normal file
View File

@ -0,0 +1,57 @@
use std::collections::{hash_map, HashMap};
use std::iter::FromIterator;
use serde::{Deserialize, Deserializer, Serialize};
use crate::OrderBy;
#[derive(Serialize)]
pub struct OrderByMap(HashMap<String, OrderBy>);
impl OrderByMap {
pub fn get(&self, key: impl AsRef<str>) -> OrderBy {
self.0
.get(key.as_ref())
.copied()
.unwrap_or_else(|| self.0.get("*").copied().unwrap_or_default())
}
pub fn insert(&mut self, key: String, value: OrderBy) -> Option<OrderBy> {
self.0.insert(key, value)
}
}
impl Default for OrderByMap {
fn default() -> Self {
let mut map = HashMap::new();
map.insert("*".to_string(), OrderBy::Lexicographic);
OrderByMap(map)
}
}
impl FromIterator<(String, OrderBy)> for OrderByMap {
fn from_iter<T: IntoIterator<Item = (String, OrderBy)>>(iter: T) -> Self {
OrderByMap(iter.into_iter().collect())
}
}
impl IntoIterator for OrderByMap {
type Item = (String, OrderBy);
type IntoIter = hash_map::IntoIter<String, OrderBy>;
fn into_iter(self) -> Self::IntoIter {
self.0.into_iter()
}
}
impl<'de> Deserialize<'de> for OrderByMap {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: Deserializer<'de>,
{
let mut map = Deserialize::deserialize(deserializer).map(OrderByMap)?;
// Insert the default ordering if it is not already overwritten by the user.
map.0.entry("*".to_string()).or_insert(OrderBy::default());
Ok(map)
}
}
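The lookup in `OrderByMap::get` falls back twice: first to the `"*"` wildcard entry, then to the default ordering. Here is a std-only sketch of that behaviour, assuming a simplified `Order` enum and a `FallbackMap` wrapper. Both names are invented for this example, and serde support is left out.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `OrderBy` (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Order {
    Lexicographic,
    Count,
}

impl Default for Order {
    fn default() -> Self {
        Order::Lexicographic
    }
}

/// Hypothetical wrapper mirroring the lookup rule of `OrderByMap`.
pub struct FallbackMap(HashMap<String, Order>);

impl FallbackMap {
    pub fn from_pairs(pairs: &[(&str, Order)]) -> Self {
        FallbackMap(pairs.iter().map(|(k, v)| (k.to_string(), *v)).collect())
    }

    /// Exact key first, then the "*" wildcard, then the type's default.
    pub fn get(&self, key: &str) -> Order {
        self.0
            .get(key)
            .or_else(|| self.0.get("*"))
            .copied()
            .unwrap_or_default()
    }
}
```

Putting the fallback in the accessor means callers never handle a missing key, which matches how `sort_facet_values_by` in this diff no longer inserts the `"*"` entry by hand.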

View File

@ -17,6 +17,9 @@ pub enum ScoreDetails {
Sort(Sort),
Vector(Vector),
GeoSort(GeoSort),
/// Returned when we don't have the time to finish applying all the subsequent ranking-rules
Skipped,
}
#[derive(Clone, Copy)]
@ -50,6 +53,7 @@ impl ScoreDetails {
ScoreDetails::Sort(_) => None,
ScoreDetails::GeoSort(_) => None,
ScoreDetails::Vector(_) => None,
ScoreDetails::Skipped => Some(Rank { rank: 0, max_rank: 1 }),
}
}
@ -94,9 +98,10 @@ impl ScoreDetails {
ScoreDetails::ExactWords(e) => RankOrValue::Rank(e.rank()),
ScoreDetails::Sort(sort) => RankOrValue::Sort(sort),
ScoreDetails::GeoSort(geosort) => RankOrValue::GeoSort(geosort),
ScoreDetails::Vector(vector) => RankOrValue::Score(
vector.value_similarity.as_ref().map(|(_, s)| *s as f64).unwrap_or(0.0f64),
),
ScoreDetails::Vector(vector) => {
RankOrValue::Score(vector.similarity.as_ref().map(|s| *s as f64).unwrap_or(0.0f64))
}
ScoreDetails::Skipped => RankOrValue::Rank(Rank { rank: 0, max_rank: 1 }),
}
}
@ -244,16 +249,18 @@ impl ScoreDetails {
order += 1;
}
ScoreDetails::Vector(s) => {
let vector = format!("vectorSort({:?})", s.target_vector);
let value = s.value_similarity.as_ref().map(|(v, _)| v);
let similarity = s.value_similarity.as_ref().map(|(_, s)| s);
let similarity = s.similarity.as_ref();
let details = serde_json::json!({
"order": order,
"value": value,
"similarity": similarity,
});
details_map.insert(vector, details);
details_map.insert("vectorSort".into(), details);
order += 1;
}
ScoreDetails::Skipped => {
details_map
.insert("skipped".to_string(), serde_json::json!({ "order": order }));
order += 1;
}
}
@ -484,8 +491,7 @@ impl PartialOrd for GeoSort {
#[derive(Debug, Clone, PartialEq, PartialOrd)]
pub struct Vector {
pub target_vector: Vec<f32>,
pub value_similarity: Option<(Vec<f32>, f32)>,
pub similarity: Option<f32>,
}
impl GeoSort {

View File

@ -168,7 +168,7 @@ impl<'t, 'b, 'bitmap> FacetRangeSearch<'t, 'b, 'bitmap> {
}
// should we stop?
// We should if the the search range doesn't include any
// We should if the search range doesn't include any
// element from the previous key or its successors
let should_stop = {
match self.right {
@ -232,7 +232,7 @@ impl<'t, 'b, 'bitmap> FacetRangeSearch<'t, 'b, 'bitmap> {
}
// should we stop?
// We should if the the search range doesn't include any
// We should if the search range doesn't include any
// element from the previous key or its successors
let should_stop = {
match self.right {

View File

@ -6,15 +6,18 @@ use roaring::RoaringBitmap;
pub use self::facet_distribution::{FacetDistribution, OrderBy, DEFAULT_VALUES_PER_FACET};
pub use self::filter::{BadGeoError, Filter};
pub use self::search::{FacetValueHit, SearchForFacetValues};
use crate::heed_codec::facet::{FacetGroupKeyCodec, FacetGroupValueCodec, OrderedF64Codec};
use crate::heed_codec::BytesRefCodec;
use crate::{Index, Result};
mod facet_distribution;
mod facet_distribution_iter;
mod facet_range_search;
mod facet_sort_ascending;
mod facet_sort_descending;
mod filter;
mod search;
fn facet_extreme_value<'t>(
mut extreme_it: impl Iterator<Item = heed::Result<(RoaringBitmap, &'t [u8])>> + 't,

View File

@ -0,0 +1,332 @@
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;
use std::ops::ControlFlow;
use charabia::normalizer::NormalizerOption;
use charabia::Normalize;
use fst::automaton::{Automaton, Str};
use fst::{IntoStreamer, Streamer};
use roaring::RoaringBitmap;
use tracing::error;
use crate::error::UserError;
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupValue};
use crate::search::build_dfa;
use crate::{DocumentId, FieldId, OrderBy, Result, Search};
/// The maximum number of values per facet returned by the facet search route.
const DEFAULT_MAX_NUMBER_OF_VALUES_PER_FACET: usize = 100;
pub struct SearchForFacetValues<'a> {
query: Option<String>,
facet: String,
search_query: Search<'a>,
max_values: usize,
is_hybrid: bool,
}
impl<'a> SearchForFacetValues<'a> {
pub fn new(
facet: String,
search_query: Search<'a>,
is_hybrid: bool,
) -> SearchForFacetValues<'a> {
SearchForFacetValues {
query: None,
facet,
search_query,
max_values: DEFAULT_MAX_NUMBER_OF_VALUES_PER_FACET,
is_hybrid,
}
}
pub fn query(&mut self, query: impl Into<String>) -> &mut Self {
self.query = Some(query.into());
self
}
pub fn max_values(&mut self, max: usize) -> &mut Self {
self.max_values = max;
self
}
fn one_original_value_of(
&self,
field_id: FieldId,
facet_str: &str,
any_docid: DocumentId,
) -> Result<Option<String>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let key: (FieldId, _, &str) = (field_id, any_docid, facet_str);
Ok(index.field_id_docid_facet_strings.get(rtxn, &key)?.map(|v| v.to_owned()))
}
pub fn execute(&self) -> Result<Vec<FacetValueHit>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let filterable_fields = index.filterable_fields(rtxn)?;
if !filterable_fields.contains(&self.facet) {
let (valid_fields, hidden_fields) =
index.remove_hidden_fields(rtxn, filterable_fields)?;
return Err(UserError::InvalidFacetSearchFacetName {
field: self.facet.clone(),
valid_fields,
hidden_fields,
}
.into());
}
let fields_ids_map = index.fields_ids_map(rtxn)?;
let fid = match fields_ids_map.id(&self.facet) {
Some(fid) => fid,
// we return an empty list of results when the attribute has been
// set as filterable but no document contains this field (yet).
None => return Ok(Vec::new()),
};
let fst = match self.search_query.index.facet_id_string_fst.get(rtxn, &fid)? {
Some(fst) => fst,
None => return Ok(Vec::new()),
};
let search_candidates = self.search_query.execute_for_candidates(
self.is_hybrid
|| self
.search_query
.semantic
.as_ref()
.and_then(|semantic| semantic.vector.as_ref())
.is_some(),
)?;
let mut results = match index.sort_facet_values_by(rtxn)?.get(&self.facet) {
OrderBy::Lexicographic => ValuesCollection::by_lexicographic(self.max_values),
OrderBy::Count => ValuesCollection::by_count(self.max_values),
};
match self.query.as_ref() {
Some(query) => {
let options = NormalizerOption { lossy: true, ..Default::default() };
let query = query.normalize(&options);
let query = query.as_ref();
let authorize_typos = self.search_query.index.authorize_typos(rtxn)?;
let field_authorizes_typos =
!self.search_query.index.exact_attributes_ids(rtxn)?.contains(&fid);
if authorize_typos && field_authorizes_typos {
let exact_words_fst = self.search_query.index.exact_words(rtxn)?;
if exact_words_fst.map_or(false, |fst| fst.contains(query)) {
if fst.contains(query) {
self.fetch_original_facets_using_normalized(
fid,
query,
query,
&search_candidates,
&mut results,
)?;
}
} else {
let one_typo = self.search_query.index.min_word_len_one_typo(rtxn)?;
let two_typos = self.search_query.index.min_word_len_two_typos(rtxn)?;
let is_prefix = true;
let automaton = if query.len() < one_typo as usize {
build_dfa(query, 0, is_prefix)
} else if query.len() < two_typos as usize {
build_dfa(query, 1, is_prefix)
} else {
build_dfa(query, 2, is_prefix)
};
let mut stream = fst.search(automaton).into_stream();
while let Some(facet_value) = stream.next() {
let value = std::str::from_utf8(facet_value)?;
if self
.fetch_original_facets_using_normalized(
fid,
value,
query,
&search_candidates,
&mut results,
)?
.is_break()
{
break;
}
}
}
} else {
let automaton = Str::new(query).starts_with();
let mut stream = fst.search(automaton).into_stream();
while let Some(facet_value) = stream.next() {
let value = std::str::from_utf8(facet_value)?;
if self
.fetch_original_facets_using_normalized(
fid,
value,
query,
&search_candidates,
&mut results,
)?
.is_break()
{
break;
}
}
}
}
None => {
let prefix = FacetGroupKey { field_id: fid, level: 0, left_bound: "" };
for result in index.facet_id_string_docids.prefix_iter(rtxn, &prefix)? {
let (FacetGroupKey { left_bound, .. }, FacetGroupValue { bitmap, .. }) =
result?;
let count = search_candidates.intersection_len(&bitmap);
if count != 0 {
let value = self
.one_original_value_of(fid, left_bound, bitmap.min().unwrap())?
.unwrap_or_else(|| left_bound.to_string());
if results.insert(FacetValueHit { value, count }).is_break() {
break;
}
}
}
}
}
Ok(results.into_sorted_vec())
}
fn fetch_original_facets_using_normalized(
&self,
fid: FieldId,
value: &str,
query: &str,
search_candidates: &RoaringBitmap,
results: &mut ValuesCollection,
) -> Result<ControlFlow<()>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let database = index.facet_id_normalized_string_strings;
let key = (fid, value);
let original_strings = match database.get(rtxn, &key)? {
Some(original_strings) => original_strings,
None => {
error!("the facet value is missing from the facet database: {key:?}");
return Ok(ControlFlow::Continue(()));
}
};
for original in original_strings {
let key = FacetGroupKey { field_id: fid, level: 0, left_bound: original.as_str() };
let docids = match index.facet_id_string_docids.get(rtxn, &key)? {
Some(FacetGroupValue { bitmap, .. }) => bitmap,
None => {
error!("the facet value is missing from the facet database: {key:?}");
return Ok(ControlFlow::Continue(()));
}
};
let count = search_candidates.intersection_len(&docids);
if count != 0 {
let value = self
.one_original_value_of(fid, &original, docids.min().unwrap())?
.unwrap_or_else(|| query.to_string());
if results.insert(FacetValueHit { value, count }).is_break() {
break;
}
}
}
Ok(ControlFlow::Continue(()))
}
}
#[derive(Debug, Clone, serde::Serialize, PartialEq)]
pub struct FacetValueHit {
/// The original facet value
pub value: String,
/// The number of documents associated to this facet
pub count: u64,
}
impl PartialOrd for FacetValueHit {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
impl Ord for FacetValueHit {
fn cmp(&self, other: &Self) -> Ordering {
self.count.cmp(&other.count).then_with(|| self.value.cmp(&other.value))
}
}
impl Eq for FacetValueHit {}
/// A wrapper type that collects the best facet values, either in
/// lexicographic order or by number of associated documents.
enum ValuesCollection {
/// Keeps the top values according to the lexicographic order.
Lexicographic { max: usize, content: Vec<FacetValueHit> },
/// Keeps the top values according to the number of values associated to them.
///
/// Note that it is a max heap and we need to move the smallest counts
/// at the top to be able to pop them when we reach the max_values limit.
Count { max: usize, content: BinaryHeap<Reverse<FacetValueHit>> },
}
impl ValuesCollection {
pub fn by_lexicographic(max: usize) -> Self {
ValuesCollection::Lexicographic { max, content: Vec::new() }
}
pub fn by_count(max: usize) -> Self {
ValuesCollection::Count { max, content: BinaryHeap::new() }
}
pub fn insert(&mut self, value: FacetValueHit) -> ControlFlow<()> {
match self {
ValuesCollection::Lexicographic { max, content } => {
if content.len() < *max {
content.push(value);
if content.len() < *max {
return ControlFlow::Continue(());
}
}
ControlFlow::Break(())
}
ValuesCollection::Count { max, content } => {
if content.len() == *max {
// Peeking gives us the worst value in the list as
// this is a max-heap and we reversed it.
let Some(mut peek) = content.peek_mut() else { return ControlFlow::Break(()) };
if peek.0.count <= value.count {
// Replace the current worst value in the heap
// with the new one we received that is better.
*peek = Reverse(value);
}
} else {
content.push(Reverse(value));
}
ControlFlow::Continue(())
}
}
}
/// Returns the facet values sorted either by descending count or in
/// lexicographic order of the value, depending on the variant.
pub fn into_sorted_vec(self) -> Vec<FacetValueHit> {
match self {
ValuesCollection::Lexicographic { content, .. } => content.into_iter().collect(),
ValuesCollection::Count { content, .. } => {
// Convert the heap into a vec of hits by removing the Reverse wrapper.
// Hits are already in the right order as they were reversed and they
// are output in ascending order.
content.into_sorted_vec().into_iter().map(|Reverse(hit)| hit).collect()
}
}
}
}
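
The `Count` variant above relies on wrapping hits in `Reverse` so that the max-heap exposes the smallest kept count at the top. A minimal, self-contained sketch of that pattern follows; `Hit` and `TopByCount` are hypothetical stand-ins for `FacetValueHit` and `ValuesCollection::Count`, and the replacement condition is simplified to a strict comparison rather than copied from the diff.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for `FacetValueHit`. The field order makes the
/// derived ordering compare `count` first, then `value`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Hit {
    count: u64,
    value: String,
}

/// Keeps at most `max` hits with the highest counts.
struct TopByCount {
    max: usize,
    heap: BinaryHeap<Reverse<Hit>>,
}

impl TopByCount {
    fn new(max: usize) -> Self {
        TopByCount { max, heap: BinaryHeap::new() }
    }

    fn insert(&mut self, hit: Hit) {
        if self.heap.len() < self.max {
            self.heap.push(Reverse(hit));
        } else if let Some(mut worst) = self.heap.peek_mut() {
            // `Reverse` turns the max-heap into a min-heap, so the peeked
            // element is the smallest hit currently kept.
            if worst.0 < hit {
                *worst = Reverse(hit);
            }
        }
    }

    /// Returns the kept hits, highest count first.
    fn into_sorted_vec(self) -> Vec<Hit> {
        // Ascending order of `Reverse<Hit>` is descending order of `Hit`.
        self.heap.into_sorted_vec().into_iter().map(|Reverse(hit)| hit).collect()
    }
}
```

Inserting counts 1, 5, 3, 2 with `max = 2` keeps the hits with counts 5 and 3, in that order.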


@ -4,12 +4,15 @@ use itertools::Itertools;
use roaring::RoaringBitmap;
use crate::score_details::{ScoreDetails, ScoreValue, ScoringStrategy};
use crate::search::SemanticSearch;
use crate::{MatchingWords, Result, Search, SearchResult};
struct ScoreWithRatioResult {
matching_words: MatchingWords,
candidates: RoaringBitmap,
document_scores: Vec<(u32, ScoreWithRatio)>,
degraded: bool,
used_negative_operator: bool,
}
type ScoreWithRatio = (Vec<ScoreDetails>, f32);
@ -49,8 +52,12 @@ fn compare_scores(
order => return order,
}
}
(Some(ScoreValue::Score(_)), Some(_)) => return Ordering::Greater,
(Some(_), Some(ScoreValue::Score(_))) => return Ordering::Less,
(Some(ScoreValue::Score(x)), Some(_)) => {
return if x == 0. { Ordering::Less } else { Ordering::Greater }
}
(Some(_), Some(ScoreValue::Score(x))) => {
return if x == 0. { Ordering::Greater } else { Ordering::Less }
}
// if we have this, we're bad
(Some(ScoreValue::GeoSort(_)), Some(ScoreValue::Sort(_)))
| (Some(ScoreValue::Sort(_)), Some(ScoreValue::GeoSort(_))) => {
@ -72,51 +79,82 @@ impl ScoreWithRatioResult {
matching_words: results.matching_words,
candidates: results.candidates,
document_scores,
degraded: results.degraded,
used_negative_operator: results.used_negative_operator,
}
}
fn merge(left: Self, right: Self, from: usize, length: usize) -> SearchResult {
let mut documents_ids =
Vec::with_capacity(left.document_scores.len() + right.document_scores.len());
let mut document_scores =
Vec::with_capacity(left.document_scores.len() + right.document_scores.len());
fn merge(
vector_results: Self,
keyword_results: Self,
from: usize,
length: usize,
) -> (SearchResult, u32) {
#[derive(Clone, Copy)]
enum ResultSource {
Semantic,
Keyword,
}
let mut semantic_hit_count = 0;
let mut documents_ids = Vec::with_capacity(
vector_results.document_scores.len() + keyword_results.document_scores.len(),
);
let mut document_scores = Vec::with_capacity(
vector_results.document_scores.len() + keyword_results.document_scores.len(),
);
let mut documents_seen = RoaringBitmap::new();
for (docid, (main_score, _sub_score)) in left
for ((docid, (main_score, _sub_score)), source) in vector_results
.document_scores
.into_iter()
.merge_by(right.document_scores.into_iter(), |(_, left), (_, right)| {
// the first value is the one with the greatest score
compare_scores(left, right).is_ge()
})
.zip(std::iter::repeat(ResultSource::Semantic))
.merge_by(
keyword_results
.document_scores
.into_iter()
.zip(std::iter::repeat(ResultSource::Keyword)),
|((_, left), _), ((_, right), _)| {
// the first value is the one with the greatest score
compare_scores(left, right).is_ge()
},
)
// remove documents we already saw
.filter(|(docid, _)| documents_seen.insert(*docid))
.filter(|((docid, _), _)| documents_seen.insert(*docid))
// start skipping **after** the filter
.skip(from)
// take **after** skipping
.take(length)
{
if let ResultSource::Semantic = source {
semantic_hit_count += 1;
}
documents_ids.push(docid);
// TODO: pass both scores to documents_score in some way?
document_scores.push(main_score);
}
SearchResult {
matching_words: right.matching_words,
candidates: left.candidates | right.candidates,
documents_ids,
document_scores,
}
(
SearchResult {
matching_words: keyword_results.matching_words,
candidates: vector_results.candidates | keyword_results.candidates,
documents_ids,
document_scores,
degraded: vector_results.degraded | keyword_results.degraded,
used_negative_operator: vector_results.used_negative_operator
| keyword_results.used_negative_operator,
},
semantic_hit_count,
)
}
}
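
The reworked `merge` tags every entry with its origin before merging, so that `semantic_hit_count` only counts hits that survive deduplication and pagination. The sketch below illustrates that flow with the standard library only; `merge_ranked`, `Source`, and the plain `f32` scores are assumptions made for illustration and replace `itertools::merge_by`, `RoaringBitmap`, and `compare_scores` from the diff.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Source {
    Semantic,
    Keyword,
}

/// Merges two lists of `(docid, score)` that are each sorted by descending
/// score, drops documents already seen, applies `from`/`length`, and counts
/// how many returned hits came from the semantic list.
fn merge_ranked(
    semantic: &[(u32, f32)],
    keyword: &[(u32, f32)],
    from: usize,
    length: usize,
) -> (Vec<u32>, u32) {
    let mut tagged = Vec::with_capacity(semantic.len() + keyword.len());
    let (mut i, mut j) = (0, 0);
    while i < semantic.len() || j < keyword.len() {
        // On ties the semantic entry goes first, like `is_ge()` in the diff.
        let take_semantic = match (semantic.get(i), keyword.get(j)) {
            (Some(s), Some(k)) => s.1 >= k.1,
            (Some(_), None) => true,
            _ => false,
        };
        if take_semantic {
            tagged.push((semantic[i].0, Source::Semantic));
            i += 1;
        } else {
            tagged.push((keyword[j].0, Source::Keyword));
            j += 1;
        }
    }

    let mut seen = HashSet::new();
    let mut semantic_hit_count = 0;
    let mut docids = Vec::new();
    for (docid, source) in tagged
        .into_iter()
        // Deduplicate before paginating, so `from` counts unique documents.
        .filter(|(docid, _)| seen.insert(*docid))
        .skip(from)
        .take(length)
    {
        if source == Source::Semantic {
            semantic_hit_count += 1;
        }
        docids.push(docid);
    }
    (docids, semantic_hit_count)
}
```

A document present in both lists is attributed to whichever list ranks it higher, which is why the count can be lower than the number of semantic results.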
impl<'a> Search<'a> {
pub fn execute_hybrid(&self, semantic_ratio: f32) -> Result<SearchResult> {
pub fn execute_hybrid(&self, semantic_ratio: f32) -> Result<(SearchResult, Option<u32>)> {
// TODO: find classier way to achieve that than to reset vector and query params
// create separate keyword and semantic searches
let mut search = Search {
query: self.query.clone(),
vector: self.vector.clone(),
filter: self.filter.clone(),
offset: 0,
limit: self.limit + self.offset,
@ -129,25 +167,43 @@ impl<'a> Search<'a> {
exhaustive_number_hits: self.exhaustive_number_hits,
rtxn: self.rtxn,
index: self.index,
distribution_shift: self.distribution_shift,
embedder_name: self.embedder_name.clone(),
semantic: self.semantic.clone(),
time_budget: self.time_budget.clone(),
};
let vector_query = search.vector.take();
let semantic = search.semantic.take();
let keyword_results = search.execute()?;
// skip semantic search if we don't have a vector query (placeholder search)
let Some(vector_query) = vector_query else {
return Ok(keyword_results);
};
// completely skip semantic search if the results of the keyword search are good enough
if self.results_good_enough(&keyword_results, semantic_ratio) {
return Ok(keyword_results);
return Ok((keyword_results, Some(0)));
}
search.vector = Some(vector_query);
search.query = None;
// no vector search against placeholder search
let Some(query) = search.query.take() else {
return Ok((keyword_results, Some(0)));
};
// no embedder, no semantic search
let Some(SemanticSearch { vector, embedder_name, embedder }) = semantic else {
return Ok((keyword_results, Some(0)));
};
let vector_query = match vector {
Some(vector_query) => vector_query,
None => {
// attempt to embed the query
match embedder.embed_one(query) {
Ok(embedding) => embedding,
Err(error) => {
tracing::error!(error=%error, "Embedding failed");
return Ok((keyword_results, Some(0)));
}
}
}
};
search.semantic =
Some(SemanticSearch { vector: Some(vector_query), embedder_name, embedder });
// TODO: would be better to have two distinct functions at this point
let vector_results = search.execute()?;
@ -155,10 +211,10 @@ impl<'a> Search<'a> {
let keyword_results = ScoreWithRatioResult::new(keyword_results, 1.0 - semantic_ratio);
let vector_results = ScoreWithRatioResult::new(vector_results, semantic_ratio);
let merge_results =
let (merge_results, semantic_hit_count) =
ScoreWithRatioResult::merge(vector_results, keyword_results, self.offset, self.limit);
assert!(merge_results.documents_ids.len() <= self.limit);
Ok(merge_results)
Ok((merge_results, Some(semantic_hit_count)))
}
fn results_good_enough(&self, keyword_results: &SearchResult, semantic_ratio: f32) -> bool {


@ -1,25 +1,18 @@
use std::fmt;
use std::ops::ControlFlow;
use std::sync::Arc;
use charabia::normalizer::NormalizerOption;
use charabia::Normalize;
use fst::automaton::{Automaton, Str};
use fst::{IntoStreamer, Streamer};
use levenshtein_automata::{LevenshteinAutomatonBuilder as LevBuilder, DFA};
use once_cell::sync::Lazy;
use roaring::bitmap::RoaringBitmap;
use tracing::error;
pub use self::facet::{FacetDistribution, Filter, OrderBy, DEFAULT_VALUES_PER_FACET};
pub use self::new::matches::{FormatOptions, MatchBounds, MatcherBuilder, MatchingWords};
use self::new::{execute_vector_search, PartialSearchResult};
use crate::error::UserError;
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupValue};
use crate::score_details::{ScoreDetails, ScoringStrategy};
use crate::vector::DistributionShift;
use crate::vector::Embedder;
use crate::{
execute_search, filtered_universe, AscDesc, DefaultSearchLogger, DocumentId, FieldId, Index,
Result, SearchContext,
execute_search, filtered_universe, AscDesc, DefaultSearchLogger, DocumentId, Index, Result,
SearchContext, TimeBudget,
};
// Building these factories is not free.
@ -27,17 +20,21 @@ static LEVDIST0: Lazy<LevBuilder> = Lazy::new(|| LevBuilder::new(0, true));
static LEVDIST1: Lazy<LevBuilder> = Lazy::new(|| LevBuilder::new(1, true));
static LEVDIST2: Lazy<LevBuilder> = Lazy::new(|| LevBuilder::new(2, true));
/// The maximum number of values per facet returned by the facet search route.
const DEFAULT_MAX_NUMBER_OF_VALUES_PER_FACET: usize = 100;
pub mod facet;
mod fst_utils;
pub mod hybrid;
pub mod new;
pub mod recommend;
#[derive(Debug, Clone)]
pub struct SemanticSearch {
vector: Option<Vec<f32>>,
embedder_name: String,
embedder: Arc<Embedder>,
}
pub struct Search<'a> {
query: Option<String>,
vector: Option<Vec<f32>>,
// this should be linked to the String in the query
filter: Option<Filter<'a>>,
offset: usize,
@ -49,18 +46,16 @@ pub struct Search<'a> {
scoring_strategy: ScoringStrategy,
words_limit: usize,
exhaustive_number_hits: bool,
/// TODO: Add semantic ratio or pass it directly to execute_hybrid()
rtxn: &'a heed::RoTxn<'a>,
index: &'a Index,
distribution_shift: Option<DistributionShift>,
embedder_name: Option<String>,
semantic: Option<SemanticSearch>,
time_budget: TimeBudget,
}
impl<'a> Search<'a> {
pub fn new(rtxn: &'a heed::RoTxn, index: &'a Index) -> Search<'a> {
Search {
query: None,
vector: None,
filter: None,
offset: 0,
limit: 20,
@ -73,8 +68,8 @@ impl<'a> Search<'a> {
words_limit: 10,
rtxn,
index,
distribution_shift: None,
embedder_name: None,
semantic: None,
time_budget: TimeBudget::max(),
}
}
@ -83,8 +78,13 @@ impl<'a> Search<'a> {
self
}
pub fn vector(&mut self, vector: Vec<f32>) -> &mut Search<'a> {
self.vector = Some(vector);
pub fn semantic(
&mut self,
embedder_name: String,
embedder: Arc<Embedder>,
vector: Option<Vec<f32>>,
) -> &mut Search<'a> {
self.semantic = Some(SemanticSearch { embedder_name, embedder, vector });
self
}
@ -141,48 +141,38 @@ impl<'a> Search<'a> {
self
}
pub fn distribution_shift(
&mut self,
distribution_shift: Option<DistributionShift>,
) -> &mut Search<'a> {
self.distribution_shift = distribution_shift;
self
}
pub fn embedder_name(&mut self, embedder_name: impl Into<String>) -> &mut Search<'a> {
self.embedder_name = Some(embedder_name.into());
pub fn time_budget(&mut self, time_budget: TimeBudget) -> &mut Search<'a> {
self.time_budget = time_budget;
self
}
pub fn execute_for_candidates(&self, has_vector_search: bool) -> Result<RoaringBitmap> {
if has_vector_search {
let ctx = SearchContext::new(self.index, self.rtxn);
filtered_universe(&ctx, &self.filter)
filtered_universe(ctx.index, ctx.txn, &self.filter)
} else {
Ok(self.execute()?.candidates)
}
}
pub fn execute(&self) -> Result<SearchResult> {
let embedder_name;
let embedder_name = match &self.embedder_name {
Some(embedder_name) => embedder_name,
None => {
embedder_name = self.index.default_embedding_name(self.rtxn)?;
&embedder_name
}
};
let mut ctx = SearchContext::new(self.index, self.rtxn);
if let Some(searchable_attributes) = self.searchable_attributes {
ctx.searchable_attributes(searchable_attributes)?;
}
let universe = filtered_universe(&ctx, &self.filter)?;
let PartialSearchResult { located_query_terms, candidates, documents_ids, document_scores } =
match self.vector.as_ref() {
Some(vector) => execute_vector_search(
let universe = filtered_universe(ctx.index, ctx.txn, &self.filter)?;
let PartialSearchResult {
located_query_terms,
candidates,
documents_ids,
document_scores,
degraded,
used_negative_operator,
} = match self.semantic.as_ref() {
Some(SemanticSearch { vector: Some(vector), embedder_name, embedder }) => {
execute_vector_search(
&mut ctx,
vector,
self.scoring_strategy,
@ -191,25 +181,28 @@ impl<'a> Search<'a> {
self.geo_strategy,
self.offset,
self.limit,
self.distribution_shift,
embedder_name,
)?,
None => execute_search(
&mut ctx,
self.query.as_deref(),
self.terms_matching_strategy,
self.scoring_strategy,
self.exhaustive_number_hits,
universe,
&self.sort_criteria,
self.geo_strategy,
self.offset,
self.limit,
Some(self.words_limit),
&mut DefaultSearchLogger,
&mut DefaultSearchLogger,
)?,
};
embedder,
self.time_budget.clone(),
)?
}
_ => execute_search(
&mut ctx,
self.query.as_deref(),
self.terms_matching_strategy,
self.scoring_strategy,
self.exhaustive_number_hits,
universe,
&self.sort_criteria,
self.geo_strategy,
self.offset,
self.limit,
Some(self.words_limit),
&mut DefaultSearchLogger,
&mut DefaultSearchLogger,
self.time_budget.clone(),
)?,
};
// consume context and located_query_terms to build MatchingWords.
let matching_words = match located_query_terms {
@ -217,7 +210,14 @@ impl<'a> Search<'a> {
None => MatchingWords::default(),
};
Ok(SearchResult { matching_words, candidates, document_scores, documents_ids })
Ok(SearchResult {
matching_words,
candidates,
document_scores,
documents_ids,
degraded,
used_negative_operator,
})
}
}
@ -225,7 +225,6 @@ impl fmt::Debug for Search<'_> {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let Search {
query,
vector: _,
filter,
offset,
limit,
@ -238,8 +237,8 @@ impl fmt::Debug for Search<'_> {
exhaustive_number_hits,
rtxn: _,
index: _,
distribution_shift,
embedder_name,
semantic,
time_budget,
} = self;
f.debug_struct("Search")
.field("query", query)
@ -253,8 +252,11 @@ impl fmt::Debug for Search<'_> {
.field("scoring_strategy", scoring_strategy)
.field("exhaustive_number_hits", exhaustive_number_hits)
.field("words_limit", words_limit)
.field("distribution_shift", distribution_shift)
.field("embedder_name", embedder_name)
.field(
"semantic.embedder_name",
&semantic.as_ref().map(|semantic| &semantic.embedder_name),
)
.field("time_budget", time_budget)
.finish()
}
}
@ -265,6 +267,8 @@ pub struct SearchResult {
pub candidates: RoaringBitmap,
pub documents_ids: Vec<DocumentId>,
pub document_scores: Vec<Vec<ScoreDetails>>,
pub degraded: bool,
pub used_negative_operator: bool,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
@ -302,240 +306,6 @@ pub fn build_dfa(word: &str, typos: u8, is_prefix: bool) -> DFA {
}
}
pub struct SearchForFacetValues<'a> {
query: Option<String>,
facet: String,
search_query: Search<'a>,
max_values: usize,
is_hybrid: bool,
}
impl<'a> SearchForFacetValues<'a> {
pub fn new(
facet: String,
search_query: Search<'a>,
is_hybrid: bool,
) -> SearchForFacetValues<'a> {
SearchForFacetValues {
query: None,
facet,
search_query,
max_values: DEFAULT_MAX_NUMBER_OF_VALUES_PER_FACET,
is_hybrid,
}
}
pub fn query(&mut self, query: impl Into<String>) -> &mut Self {
self.query = Some(query.into());
self
}
pub fn max_values(&mut self, max: usize) -> &mut Self {
self.max_values = max;
self
}
fn one_original_value_of(
&self,
field_id: FieldId,
facet_str: &str,
any_docid: DocumentId,
) -> Result<Option<String>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let key: (FieldId, _, &str) = (field_id, any_docid, facet_str);
Ok(index.field_id_docid_facet_strings.get(rtxn, &key)?.map(|v| v.to_owned()))
}
pub fn execute(&self) -> Result<Vec<FacetValueHit>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let filterable_fields = index.filterable_fields(rtxn)?;
if !filterable_fields.contains(&self.facet) {
let (valid_fields, hidden_fields) =
index.remove_hidden_fields(rtxn, filterable_fields)?;
return Err(UserError::InvalidFacetSearchFacetName {
field: self.facet.clone(),
valid_fields,
hidden_fields,
}
.into());
}
let fields_ids_map = index.fields_ids_map(rtxn)?;
let fid = match fields_ids_map.id(&self.facet) {
Some(fid) => fid,
// we return an empty list of results when the attribute has been
// set as filterable but no document contains this field (yet).
None => return Ok(Vec::new()),
};
let fst = match self.search_query.index.facet_id_string_fst.get(rtxn, &fid)? {
Some(fst) => fst,
None => return Ok(vec![]),
};
let search_candidates = self
.search_query
.execute_for_candidates(self.is_hybrid || self.search_query.vector.is_some())?;
match self.query.as_ref() {
Some(query) => {
let options = NormalizerOption { lossy: true, ..Default::default() };
let query = query.normalize(&options);
let query = query.as_ref();
let authorize_typos = self.search_query.index.authorize_typos(rtxn)?;
let field_authorizes_typos =
!self.search_query.index.exact_attributes_ids(rtxn)?.contains(&fid);
if authorize_typos && field_authorizes_typos {
let exact_words_fst = self.search_query.index.exact_words(rtxn)?;
if exact_words_fst.map_or(false, |fst| fst.contains(query)) {
let mut results = vec![];
if fst.contains(query) {
self.fetch_original_facets_using_normalized(
fid,
query,
query,
&search_candidates,
&mut results,
)?;
}
Ok(results)
} else {
let one_typo = self.search_query.index.min_word_len_one_typo(rtxn)?;
let two_typos = self.search_query.index.min_word_len_two_typos(rtxn)?;
let is_prefix = true;
let automaton = if query.len() < one_typo as usize {
build_dfa(query, 0, is_prefix)
} else if query.len() < two_typos as usize {
build_dfa(query, 1, is_prefix)
} else {
build_dfa(query, 2, is_prefix)
};
let mut stream = fst.search(automaton).into_stream();
let mut results = vec![];
while let Some(facet_value) = stream.next() {
let value = std::str::from_utf8(facet_value)?;
if self
.fetch_original_facets_using_normalized(
fid,
value,
query,
&search_candidates,
&mut results,
)?
.is_break()
{
break;
}
}
Ok(results)
}
} else {
let automaton = Str::new(query).starts_with();
let mut stream = fst.search(automaton).into_stream();
let mut results = vec![];
while let Some(facet_value) = stream.next() {
let value = std::str::from_utf8(facet_value)?;
if self
.fetch_original_facets_using_normalized(
fid,
value,
query,
&search_candidates,
&mut results,
)?
.is_break()
{
break;
}
}
Ok(results)
}
}
None => {
let mut results = vec![];
let prefix = FacetGroupKey { field_id: fid, level: 0, left_bound: "" };
for result in index.facet_id_string_docids.prefix_iter(rtxn, &prefix)? {
let (FacetGroupKey { left_bound, .. }, FacetGroupValue { bitmap, .. }) =
result?;
let count = search_candidates.intersection_len(&bitmap);
if count != 0 {
let value = self
.one_original_value_of(fid, left_bound, bitmap.min().unwrap())?
.unwrap_or_else(|| left_bound.to_string());
results.push(FacetValueHit { value, count });
}
if results.len() >= self.max_values {
break;
}
}
Ok(results)
}
}
}
fn fetch_original_facets_using_normalized(
&self,
fid: FieldId,
value: &str,
query: &str,
search_candidates: &RoaringBitmap,
results: &mut Vec<FacetValueHit>,
) -> Result<ControlFlow<()>> {
let index = self.search_query.index;
let rtxn = self.search_query.rtxn;
let database = index.facet_id_normalized_string_strings;
let key = (fid, value);
let original_strings = match database.get(rtxn, &key)? {
Some(original_strings) => original_strings,
None => {
error!("the facet value is missing from the facet database: {key:?}");
return Ok(ControlFlow::Continue(()));
}
};
for original in original_strings {
let key = FacetGroupKey { field_id: fid, level: 0, left_bound: original.as_str() };
let docids = match index.facet_id_string_docids.get(rtxn, &key)? {
Some(FacetGroupValue { bitmap, .. }) => bitmap,
None => {
error!("the facet value is missing from the facet database: {key:?}");
return Ok(ControlFlow::Continue(()));
}
};
let count = search_candidates.intersection_len(&docids);
if count != 0 {
let value = self
.one_original_value_of(fid, &original, docids.min().unwrap())?
.unwrap_or_else(|| query.to_string());
results.push(FacetValueHit { value, count });
}
if results.len() >= self.max_values {
return Ok(ControlFlow::Break(()));
}
}
Ok(ControlFlow::Continue(()))
}
}
#[derive(Debug, Clone, serde::Serialize, PartialEq)]
pub struct FacetValueHit {
/// The original facet value
pub value: String,
/// The number of documents associated to this facet
pub count: u64,
}
#[cfg(test)]
mod test {
#[allow(unused_imports)]


@ -5,17 +5,19 @@ use super::ranking_rules::{BoxRankingRule, RankingRuleQueryTrait};
use super::SearchContext;
use crate::score_details::{ScoreDetails, ScoringStrategy};
use crate::search::new::distinct::{apply_distinct_rule, distinct_single_docid, DistinctOutput};
use crate::Result;
use crate::{Result, TimeBudget};
pub struct BucketSortOutput {
pub docids: Vec<u32>,
pub scores: Vec<Vec<ScoreDetails>>,
pub all_candidates: RoaringBitmap,
pub degraded: bool,
}
// TODO: would probably be good to regroup some of these inside of a struct?
#[allow(clippy::too_many_arguments)]
#[logging_timer::time]
#[tracing::instrument(level = "trace", skip_all, target = "search::bucket_sort")]
pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
ctx: &mut SearchContext<'ctx>,
mut ranking_rules: Vec<BoxRankingRule<'ctx, Q>>,
@ -25,6 +27,7 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
length: usize,
scoring_strategy: ScoringStrategy,
logger: &mut dyn SearchLogger<Q>,
time_budget: TimeBudget,
) -> Result<BucketSortOutput> {
logger.initial_query(query);
logger.ranking_rules(&ranking_rules);
@ -41,6 +44,7 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
docids: vec![],
scores: vec![],
all_candidates: universe.clone(),
degraded: false,
});
}
if ranking_rules.is_empty() {
@ -74,6 +78,7 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
scores: vec![Default::default(); results.len()],
docids: results,
all_candidates,
degraded: false,
});
} else {
let docids: Vec<u32> = universe.iter().skip(from).take(length).collect();
@ -81,6 +86,7 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
scores: vec![Default::default(); docids.len()],
docids,
all_candidates: universe.clone(),
degraded: false,
});
};
}
@ -154,6 +160,28 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
}
while valid_docids.len() < length {
if time_budget.exceeded() {
loop {
let bucket = std::mem::take(&mut ranking_rule_universes[cur_ranking_rule_index]);
ranking_rule_scores.push(ScoreDetails::Skipped);
maybe_add_to_results!(bucket);
ranking_rule_scores.pop();
if cur_ranking_rule_index == 0 {
break;
}
back!();
}
return Ok(BucketSortOutput {
scores: valid_scores,
docids: valid_docids,
all_candidates,
degraded: true,
});
}
// The universe for this bucket is zero, so we don't need to sort
// anything, just go back to the parent ranking rule.
if ranking_rule_universes[cur_ranking_rule_index].is_empty()
@ -219,7 +247,12 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
)?;
}
Ok(BucketSortOutput { docids: valid_docids, scores: valid_scores, all_candidates })
Ok(BucketSortOutput {
docids: valid_docids,
scores: valid_scores,
all_candidates,
degraded: false,
})
}
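
The `time_budget.exceeded()` branch added to `bucket_sort` stops ranking once the budget runs out, flushes the remaining candidates unranked, and reports `degraded: true`. A reduced sketch of that control flow is shown below; `Budget` and `collect` are hypothetical names, a plain sort stands in for the ranking rules, and the real code additionally records `ScoreDetails::Skipped` and walks back through the ranking-rule universes.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `TimeBudget`; only an `exceeded` check is modelled.
struct Budget {
    started_at: Instant,
    limit: Duration,
}

impl Budget {
    fn new(limit: Duration) -> Self {
        Budget { started_at: Instant::now(), limit }
    }

    fn exceeded(&self) -> bool {
        self.started_at.elapsed() >= self.limit
    }
}

/// Ranks `buckets` one at a time until `length` ids are collected.
/// When the budget runs out, the remaining ids are appended unranked and
/// the result is flagged as degraded.
fn collect(buckets: Vec<Vec<u32>>, length: usize, budget: &Budget) -> (Vec<u32>, bool) {
    let mut out = Vec::new();
    let mut buckets = buckets.into_iter();
    while out.len() < length {
        if budget.exceeded() {
            // Out of time: flush what is left without ranking it.
            out.extend(buckets.flatten());
            out.truncate(length);
            return (out, true);
        }
        let Some(mut bucket) = buckets.next() else { break };
        // Stands in for the expensive ranking step.
        bucket.sort_unstable();
        out.extend(bucket);
        out.truncate(length);
    }
    (out, false)
}
```

With a zero budget the ids come back in their original bucket order and the flag is `true`; with a generous budget each bucket is sorted and the flag is `false`.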
/// Add the candidates to the results. Take `distinct`, `from`, `length`, and `cur_offset`


@ -240,6 +240,7 @@ pub(crate) mod tests {
use super::super::super::located_query_terms_from_tokens;
use super::*;
use crate::index::tests::TempIndex;
use crate::search::new::query_term::ExtractedTokens;
pub(crate) fn temp_index_with_documents() -> TempIndex {
let temp_index = TempIndex::new();
@ -261,7 +262,8 @@ pub(crate) mod tests {
let mut builder = TokenizerBuilder::default();
let tokenizer = builder.build();
let tokens = tokenizer.tokenize("split this world");
let query_terms = located_query_terms_from_tokens(&mut ctx, tokens, None).unwrap();
let ExtractedTokens { query_terms, .. } =
located_query_terms_from_tokens(&mut ctx, tokens, None).unwrap();
let matching_words = MatchingWords::new(ctx, query_terms);
assert_eq!(


@ -502,12 +502,12 @@ mod tests {
use super::*;
use crate::index::tests::TempIndex;
use crate::{execute_search, filtered_universe, SearchContext};
use crate::{execute_search, filtered_universe, SearchContext, TimeBudget};
impl<'a> MatcherBuilder<'a> {
fn new_test(rtxn: &'a heed::RoTxn, index: &'a TempIndex, query: &str) -> Self {
let mut ctx = SearchContext::new(index, rtxn);
let universe = filtered_universe(&ctx, &None).unwrap();
let universe = filtered_universe(ctx.index, ctx.txn, &None).unwrap();
let crate::search::PartialSearchResult { located_query_terms, .. } = execute_search(
&mut ctx,
Some(query),
@ -522,6 +522,7 @@ mod tests {
Some(10),
&mut crate::DefaultSearchLogger,
&mut crate::DefaultSearchLogger,
TimeBudget::max(),
)
.unwrap();


@ -33,7 +33,9 @@ use interner::{DedupInterner, Interner};
pub use logger::visual::VisualSearchLogger;
pub use logger::{DefaultSearchLogger, SearchLogger};
use query_graph::{QueryGraph, QueryNode};
use query_term::{located_query_terms_from_tokens, LocatedQueryTerm, Phrase, QueryTerm};
use query_term::{
located_query_terms_from_tokens, ExtractedTokens, LocatedQueryTerm, Phrase, QueryTerm,
};
use ranking_rules::{
BoxRankingRule, PlaceholderQuery, RankingRule, RankingRuleOutput, RankingRuleQueryTrait,
};
@ -50,9 +52,10 @@ use self::vector_sort::VectorSort;
use crate::error::FieldIdMapMissingEntry;
use crate::score_details::{ScoreDetails, ScoringStrategy};
use crate::search::new::distinct::apply_distinct_rule;
use crate::vector::DistributionShift;
use crate::vector::Embedder;
use crate::{
AscDesc, DocumentId, FieldId, Filter, Index, Member, Result, TermsMatchingStrategy, UserError,
AscDesc, DocumentId, FieldId, Filter, Index, Member, Result, TermsMatchingStrategy, TimeBudget,
UserError,
};
/// A structure used throughout the execution of a search query.
@ -191,7 +194,7 @@ fn resolve_maximally_reduced_query_graph(
Ok(docids)
}
#[logging_timer::time]
#[tracing::instrument(level = "trace", skip_all, target = "search")]
fn resolve_universe(
ctx: &mut SearchContext,
initial_universe: &RoaringBitmap,
@ -208,6 +211,35 @@ fn resolve_universe(
)
}
#[tracing::instrument(level = "trace", skip_all, target = "search")]
fn resolve_negative_words(
ctx: &mut SearchContext,
negative_words: &[Word],
) -> Result<RoaringBitmap> {
let mut negative_bitmap = RoaringBitmap::new();
for &word in negative_words {
if let Some(bitmap) = ctx.word_docids(word)? {
negative_bitmap |= bitmap;
}
}
Ok(negative_bitmap)
}
#[tracing::instrument(level = "trace", skip_all, target = "search")]
fn resolve_negative_phrases(
ctx: &mut SearchContext,
negative_phrases: &[LocatedQueryTerm],
) -> Result<RoaringBitmap> {
let mut negative_bitmap = RoaringBitmap::new();
for term in negative_phrases {
let query_term = ctx.term_interner.get(term.value);
if let Some(phrase) = query_term.original_phrase() {
negative_bitmap |= ctx.get_phrase_docids(phrase)?;
}
}
Ok(negative_bitmap)
}
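
Both helpers above build a union of document ids that `execute_search` then subtracts from the universe. The set arithmetic can be sketched with standard collections; `exclude_negative_words` and the `HashMap` lookup are assumptions for illustration, with `BTreeSet<u32>` standing in for `RoaringBitmap` and the map standing in for `ctx.word_docids`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Removes from `universe` every document that contains one of the
/// `negative_words`. Words missing from the index are ignored.
fn exclude_negative_words(
    universe: &BTreeSet<u32>,
    word_docids: &HashMap<&str, BTreeSet<u32>>,
    negative_words: &[&str],
) -> BTreeSet<u32> {
    // Union of the documents matched by any negative word.
    let mut excluded = BTreeSet::new();
    for word in negative_words {
        if let Some(docids) = word_docids.get(word) {
            excluded.extend(docids.iter().copied());
        }
    }
    // Equivalent of `universe -= ignored_documents` in the diff.
    universe.difference(&excluded).copied().collect()
}
```

A negative word that matches nothing leaves the universe unchanged, which mirrors the `if let Some(bitmap)` guard in `resolve_negative_words`.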
/// Return the list of initialised ranking rules to be used for a placeholder search.
fn get_ranking_rules_for_placeholder_search<'ctx>(
ctx: &SearchContext<'ctx>,
@ -266,8 +298,8 @@ fn get_ranking_rules_for_vector<'ctx>(
geo_strategy: geo_sort::Strategy,
limit_plus_offset: usize,
target: &[f32],
distribution_shift: Option<DistributionShift>,
embedder_name: &str,
embedder: &Embedder,
) -> Result<Vec<BoxRankingRule<'ctx, PlaceholderQuery>>> {
// query graph search
@ -293,8 +325,8 @@ fn get_ranking_rules_for_vector<'ctx>(
target.to_vec(),
vector_candidates,
limit_plus_offset,
distribution_shift,
embedder_name,
embedder,
)?;
ranking_rules.push(Box::new(vector_sort));
vector = true;
@ -498,11 +530,15 @@ fn resolve_sort_criteria<'ctx, Query: RankingRuleQueryTrait>(
Ok(())
}
pub fn filtered_universe(ctx: &SearchContext, filters: &Option<Filter>) -> Result<RoaringBitmap> {
pub fn filtered_universe(
index: &Index,
txn: &RoTxn<'_>,
filters: &Option<Filter>,
) -> Result<RoaringBitmap> {
Ok(if let Some(filters) = filters {
filters.evaluate(ctx.txn, ctx.index)?
filters.evaluate(txn, index)?
} else {
ctx.index.documents_ids(ctx.txn)?
index.documents_ids(txn)?
})
}
@ -516,8 +552,9 @@ pub fn execute_vector_search(
geo_strategy: geo_sort::Strategy,
from: usize,
length: usize,
distribution_shift: Option<DistributionShift>,
embedder_name: &str,
embedder: &Embedder,
time_budget: TimeBudget,
) -> Result<PartialSearchResult> {
check_sort_criteria(ctx, sort_criteria.as_ref())?;
@ -529,15 +566,15 @@ pub fn execute_vector_search(
geo_strategy,
from + length,
vector,
distribution_shift,
embedder_name,
embedder,
)?;
let mut placeholder_search_logger = logger::DefaultSearchLogger;
let placeholder_search_logger: &mut dyn SearchLogger<PlaceholderQuery> =
&mut placeholder_search_logger;
let BucketSortOutput { docids, scores, all_candidates } = bucket_sort(
let BucketSortOutput { docids, scores, all_candidates, degraded } = bucket_sort(
ctx,
ranking_rules,
&PlaceholderQuery,
@ -546,6 +583,7 @@ pub fn execute_vector_search(
length,
scoring_strategy,
placeholder_search_logger,
time_budget,
)?;
Ok(PartialSearchResult {
@ -553,11 +591,13 @@ pub fn execute_vector_search(
document_scores: scores,
documents_ids: docids,
located_query_terms: None,
degraded,
used_negative_operator: false,
})
}
#[allow(clippy::too_many_arguments)]
#[logging_timer::time]
#[tracing::instrument(level = "trace", skip_all, target = "search")]
pub fn execute_search(
ctx: &mut SearchContext,
query: Option<&str>,
@ -572,11 +612,16 @@ pub fn execute_search(
words_limit: Option<usize>,
placeholder_search_logger: &mut dyn SearchLogger<PlaceholderQuery>,
query_graph_logger: &mut dyn SearchLogger<QueryGraph>,
time_budget: TimeBudget,
) -> Result<PartialSearchResult> {
check_sort_criteria(ctx, sort_criteria.as_ref())?;
let mut used_negative_operator = false;
let mut located_query_terms = None;
let query_terms = if let Some(query) = query {
let span = tracing::trace_span!(target: "search::tokens", "tokenizer_builder");
let entered = span.enter();
// We make sure that the analyzer is aware of the stop words
// this ensures that the query builder is able to properly remove them.
let mut tokbuilder = TokenizerBuilder::new();
@ -605,9 +650,23 @@ pub fn execute_search(
}
let tokenizer = tokbuilder.build();
let tokens = tokenizer.tokenize(query);
drop(entered);
let span = tracing::trace_span!(target: "search::tokens", "tokenize");
let entered = span.enter();
let tokens = tokenizer.tokenize(query);
drop(entered);
let ExtractedTokens { query_terms, negative_words, negative_phrases } =
located_query_terms_from_tokens(ctx, tokens, words_limit)?;
used_negative_operator = !negative_words.is_empty() || !negative_phrases.is_empty();
let ignored_documents = resolve_negative_words(ctx, &negative_words)?;
let ignored_phrases = resolve_negative_phrases(ctx, &negative_phrases)?;
universe -= ignored_documents;
universe -= ignored_phrases;
let query_terms = located_query_terms_from_tokens(ctx, tokens, words_limit)?;
if query_terms.is_empty() {
// Do a placeholder search instead
None
@ -617,6 +676,7 @@ pub fn execute_search(
} else {
None
};
let bucket_sort_output = if let Some(query_terms) = query_terms {
let (graph, new_located_query_terms) = QueryGraph::from_query(ctx, &query_terms)?;
located_query_terms = Some(new_located_query_terms);
@ -640,6 +700,7 @@ pub fn execute_search(
length,
scoring_strategy,
query_graph_logger,
time_budget,
)?
} else {
let ranking_rules =
@ -653,10 +714,11 @@ pub fn execute_search(
length,
scoring_strategy,
placeholder_search_logger,
time_budget,
)?
};
let BucketSortOutput { docids, scores, mut all_candidates } = bucket_sort_output;
let BucketSortOutput { docids, scores, mut all_candidates, degraded } = bucket_sort_output;
let fields_ids_map = ctx.index.fields_ids_map(ctx.txn)?;
// The candidates is the universe unless the exhaustive number of hits
@ -674,6 +736,8 @@ pub fn execute_search(
document_scores: scores,
documents_ids: docids,
located_query_terms,
degraded,
used_negative_operator,
})
}
@ -734,4 +798,7 @@ pub struct PartialSearchResult {
pub candidates: RoaringBitmap,
pub documents_ids: Vec<DocumentId>,
pub document_scores: Vec<Vec<ScoreDetails>>,
pub degraded: bool,
pub used_negative_operator: bool,
}
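The hunks above thread a `time_budget` into `bucket_sort` and carry the resulting `degraded` flag out through `PartialSearchResult`. As a rough illustration of that idea only, the sketch below uses a hypothetical `IterationBudget` that counts loop iterations and a hypothetical `sort_with_budget` function; neither name exists in the diff, and the real `bucket_sort` is considerably more involved.

```rust
/// Hypothetical stand-in for a time budget: it only counts loop iterations.
struct IterationBudget {
    remaining: usize,
}

impl IterationBudget {
    /// Returns `true` once the budget is spent, otherwise consumes one unit.
    fn exceeded(&mut self) -> bool {
        if self.remaining == 0 {
            true
        } else {
            self.remaining -= 1;
            false
        }
    }
}

/// Hypothetical output type: the ranked ids plus the `degraded` marker.
struct SortOutput {
    docids: Vec<u32>,
    degraded: bool,
}

/// Sorts each bucket in turn. When the budget runs out, the remaining
/// buckets are appended unsorted and the output is marked as degraded.
fn sort_with_budget(buckets: &[Vec<u32>], mut budget: IterationBudget) -> SortOutput {
    let mut docids = Vec::new();
    let mut degraded = false;
    for (i, bucket) in buckets.iter().enumerate() {
        if budget.exceeded() {
            degraded = true;
            for rest in &buckets[i..] {
                docids.extend(rest.iter().copied());
            }
            break;
        }
        let mut sorted = bucket.clone();
        sorted.sort_unstable();
        docids.extend(sorted);
    }
    SortOutput { docids, degraded }
}
```

With two buckets and a budget of one iteration, only the first bucket is sorted and the caller sees `degraded == true`. With a budget of two, both buckets are sorted and the flag stays `false`.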

View File

@ -9,7 +9,9 @@ use std::ops::RangeInclusive;
use either::Either;
pub use ntypo_subset::NTypoTermSubset;
pub use parse_query::{located_query_terms_from_tokens, make_ngram, number_of_typos_allowed};
pub use parse_query::{
located_query_terms_from_tokens, make_ngram, number_of_typos_allowed, ExtractedTokens,
};
pub use phrase::Phrase;
use super::interner::{DedupInterner, Interned};
@ -478,6 +480,11 @@ impl QueryTerm {
pub fn original_word(&self, ctx: &SearchContext) -> String {
ctx.word_interner.get(self.original).clone()
}
pub fn original_phrase(&self) -> Option<Interned<Phrase>> {
self.zero_typo.phrase
}
pub fn all_computed_derivations(&self) -> (Vec<Interned<String>>, Vec<Interned<Phrase>>) {
let mut words = BTreeSet::new();
let mut phrases = BTreeSet::new();

View File

@ -6,20 +6,37 @@ use charabia::{SeparatorKind, TokenKind};
use super::compute_derivations::partially_initialized_term_from_word;
use super::{LocatedQueryTerm, ZeroTypoTerm};
use crate::search::new::query_term::{Lazy, Phrase, QueryTerm};
use crate::search::new::Word;
use crate::{Result, SearchContext, MAX_WORD_LENGTH};
#[derive(Clone)]
/// Extraction of the content of a query.
pub struct ExtractedTokens {
/// The terms to search for in the database.
pub query_terms: Vec<LocatedQueryTerm>,
/// The words that must not appear in the results.
pub negative_words: Vec<Word>,
/// The phrases that must not appear in the results.
pub negative_phrases: Vec<LocatedQueryTerm>,
}
/// Convert the tokenised search query into a list of located query terms.
#[logging_timer::time]
#[tracing::instrument(level = "trace", skip_all, target = "search::query")]
pub fn located_query_terms_from_tokens(
ctx: &mut SearchContext,
query: NormalizedTokenIter,
words_limit: Option<usize>,
) -> Result<Vec<LocatedQueryTerm>> {
) -> Result<ExtractedTokens> {
let nbr_typos = number_of_typos_allowed(ctx)?;
let mut located_terms = Vec::new();
let mut query_terms = Vec::new();
let mut negative_phrase = false;
let mut phrase: Option<PhraseBuilder> = None;
let mut encountered_whitespace = true;
let mut negative_next_token = false;
let mut negative_words = Vec::new();
let mut negative_phrases = Vec::new();
let parts_limit = words_limit.unwrap_or(usize::MAX);
@ -31,9 +48,10 @@ pub fn located_query_terms_from_tokens(
if token.lemma().is_empty() {
continue;
}
// early return if word limit is exceeded
if located_terms.len() >= parts_limit {
return Ok(located_terms);
if query_terms.len() >= parts_limit {
return Ok(ExtractedTokens { query_terms, negative_words, negative_phrases });
}
match token.kind {
@ -46,6 +64,11 @@ pub fn located_query_terms_from_tokens(
// 3. if the word is the last token of the query we push it as a prefix word.
if let Some(phrase) = &mut phrase {
phrase.push_word(ctx, &token, position)
} else if negative_next_token {
let word = token.lemma().to_string();
let word = Word::Original(ctx.word_interner.insert(word));
negative_words.push(word);
negative_next_token = false;
} else if peekable.peek().is_some() {
match token.kind {
TokenKind::Word => {
@ -61,9 +84,9 @@ pub fn located_query_terms_from_tokens(
value: ctx.term_interner.push(term),
positions: position..=position,
};
located_terms.push(located_term);
query_terms.push(located_term);
}
TokenKind::StopWord | TokenKind::Separator(_) | TokenKind::Unknown => {}
TokenKind::StopWord | TokenKind::Separator(_) | TokenKind::Unknown => (),
}
} else {
let word = token.lemma();
@ -78,7 +101,7 @@ pub fn located_query_terms_from_tokens(
value: ctx.term_interner.push(term),
positions: position..=position,
};
located_terms.push(located_term);
query_terms.push(located_term);
}
}
TokenKind::Separator(separator_kind) => {
@ -94,7 +117,14 @@ pub fn located_query_terms_from_tokens(
let phrase = if separator_kind == SeparatorKind::Hard {
if let Some(phrase) = phrase {
if let Some(located_query_term) = phrase.build(ctx) {
located_terms.push(located_query_term)
// as we are evaluating a negative operator we put the phrase
// in the negative one *but* we don't reset the negative operator
// as we are immediately starting a new negative phrase.
if negative_phrase {
negative_phrases.push(located_query_term);
} else {
query_terms.push(located_query_term);
}
}
Some(PhraseBuilder::empty())
} else {
@ -115,26 +145,49 @@ pub fn located_query_terms_from_tokens(
// Per the check above, quote_count > 0
quote_count -= 1;
if let Some(located_query_term) = phrase.build(ctx) {
located_terms.push(located_query_term)
// we were evaluating a negative operator so we
// put the phrase in the negative phrases
if negative_phrase {
negative_phrases.push(located_query_term);
negative_phrase = false;
} else {
query_terms.push(located_query_term);
}
}
}
// Start new phrase if the token ends with an opening quote
(quote_count % 2 == 1).then_some(PhraseBuilder::empty())
if quote_count % 2 == 1 {
negative_phrase = negative_next_token;
Some(PhraseBuilder::empty())
} else {
None
}
};
negative_next_token =
phrase.is_none() && token.lemma() == "-" && encountered_whitespace;
}
_ => (),
}
encountered_whitespace =
token.lemma().chars().last().filter(|c| c.is_whitespace()).is_some();
}
// If a quote is never closed, we consider all of the end of the query as a phrase.
if let Some(phrase) = phrase.take() {
if let Some(located_query_term) = phrase.build(ctx) {
located_terms.push(located_query_term);
// put the phrase in the negative set if we are evaluating a negative operator.
if negative_phrase {
negative_phrases.push(located_query_term);
} else {
query_terms.push(located_query_term);
}
}
}
Ok(located_terms)
Ok(ExtractedTokens { query_terms, negative_words, negative_phrases })
}
pub fn number_of_typos_allowed<'ctx>(
@ -315,8 +368,10 @@ mod tests {
let rtxn = index.read_txn()?;
let mut ctx = SearchContext::new(&index, &rtxn);
// panics with `attempt to add with overflow` before <https://github.com/meilisearch/meilisearch/issues/3785>
let located_query_terms = located_query_terms_from_tokens(&mut ctx, tokens, None)?;
assert!(located_query_terms.is_empty());
let ExtractedTokens { query_terms, .. } =
located_query_terms_from_tokens(&mut ctx, tokens, None)?;
assert!(query_terms.is_empty());
Ok(())
}
}
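The parser changes above split a query into positive terms, negative words and negative phrases, and a `-` only acts as an operator when it follows whitespace. The sketch below shows that splitting on plain strings. It is a simplification under stated assumptions: the `Extracted` struct and `extract` function are hypothetical, it works on characters rather than charabia tokens, and it ignores stop words, hard separators, the word limit and interning.

```rust
/// Hypothetical, string-based counterpart of `ExtractedTokens`.
#[derive(Debug, Default, PartialEq)]
struct Extracted {
    query_terms: Vec<String>,
    negative_words: Vec<String>,
    negative_phrases: Vec<String>,
}

fn extract(query: &str) -> Extracted {
    let mut out = Extracted::default();
    let mut chars = query.chars().peekable();
    // A `-` is an operator only at the start of the query or after whitespace.
    let mut after_whitespace = true;
    while let Some(c) = chars.next() {
        if c.is_whitespace() {
            after_whitespace = true;
            continue;
        }
        let negative = c == '-' && after_whitespace;
        // Skip the operator itself and look at the character that follows it.
        let first = if negative { chars.next() } else { Some(c) };
        after_whitespace = false;
        match first {
            None => break,
            // A dangling `-` followed by whitespace negates nothing.
            Some(ch) if ch.is_whitespace() => {
                after_whitespace = true;
            }
            Some('"') => {
                // Read up to the closing quote; an unclosed quote runs to the end.
                let phrase: String = chars.by_ref().take_while(|&ch| ch != '"').collect();
                if phrase.trim().is_empty() {
                    continue;
                }
                if negative {
                    out.negative_phrases.push(phrase)
                } else {
                    out.query_terms.push(phrase)
                }
            }
            Some(ch) => {
                // Read a word up to the next whitespace or quote.
                let mut word = String::from(ch);
                while let Some(&next) = chars.peek() {
                    if next.is_whitespace() || next == '"' {
                        break;
                    }
                    word.push(next);
                    chars.next();
                }
                if negative {
                    out.negative_words.push(word)
                } else {
                    out.query_terms.push(word)
                }
            }
        }
    }
    out
}
```

For the query `hello -world -"bad phrase" well-known`, this yields the terms `hello` and `well-known`, the negative word `world` and the negative phrase `bad phrase`. The hyphen inside `well-known` is not treated as an operator because it does not follow whitespace.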

View File

@ -0,0 +1,429 @@
//! This module tests the search cutoff and ensures a few things:
//! 1. A basic test works and marks the search as degraded
//! 2. A test that ensures the filters are effectively applied even with a cutoff of 0
//! 3. A test that ensures the cutoff works well with the ranking scores
use std::time::Duration;
use big_s::S;
use maplit::hashset;
use meili_snap::snapshot;
use crate::index::tests::TempIndex;
use crate::score_details::{ScoreDetails, ScoringStrategy};
use crate::{Criterion, Filter, Search, TimeBudget};
fn create_index() -> TempIndex {
let index = TempIndex::new();
index
.update_settings(|s| {
s.set_primary_key("id".to_owned());
s.set_searchable_fields(vec!["text".to_owned()]);
s.set_filterable_fields(hashset! { S("id") });
s.set_criteria(vec![Criterion::Words, Criterion::Typo]);
})
.unwrap();
// Reverse the ID / insertion order so we can tell sorted results apart from results returned in insertion order.
index
.add_documents(documents!([
{
"id": 4,
"text": "hella puppo kefir",
},
{
"id": 3,
"text": "hella puppy kefir",
},
{
"id": 2,
"text": "hello",
},
{
"id": 1,
"text": "hello puppy",
},
{
"id": 0,
"text": "hello puppy kefir",
},
]))
.unwrap();
index
}
#[test]
fn basic_degraded_search() {
let index = create_index();
let rtxn = index.read_txn().unwrap();
let mut search = Search::new(&rtxn, &index);
search.query("hello puppy kefir");
search.limit(3);
search.time_budget(TimeBudget::new(Duration::from_millis(0)));
let result = search.execute().unwrap();
assert!(result.degraded);
}
#[test]
fn degraded_search_cannot_skip_filter() {
let index = create_index();
let rtxn = index.read_txn().unwrap();
let mut search = Search::new(&rtxn, &index);
search.query("hello puppy kefir");
search.limit(100);
search.time_budget(TimeBudget::new(Duration::from_millis(0)));
let filter_condition = Filter::from_str("id > 2").unwrap().unwrap();
search.filter(filter_condition);
let result = search.execute().unwrap();
assert!(result.degraded);
snapshot!(format!("{:?}\n{:?}", result.candidates, result.documents_ids), @r###"
RoaringBitmap<[0, 1]>
[0, 1]
"###);
}
#[test]
#[allow(clippy::format_collect)] // the test is already quite big
fn degraded_search_and_score_details() {
let index = create_index();
let rtxn = index.read_txn().unwrap();
let mut search = Search::new(&rtxn, &index);
search.query("hello puppy kefir");
search.limit(4);
search.scoring_strategy(ScoringStrategy::Detailed);
search.time_budget(TimeBudget::max());
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [4, 1, 0, 3]
Scores: 1.0000 0.9167 0.8333 0.6667
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 1,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 2,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 2,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 2,
},
),
],
]
"###);
// Do ONE loop iteration. Not much can be deduced, almost everyone matched the words first bucket.
search.time_budget(TimeBudget::max().with_stop_after(1));
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [0, 1, 4, 2]
Scores: 0.6667 0.6667 0.6667 0.0000
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Skipped,
],
]
"###);
// Do TWO loop iterations. The first document should be entirely sorted
search.time_budget(TimeBudget::max().with_stop_after(2));
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [4, 0, 1, 2]
Scores: 1.0000 0.6667 0.6667 0.0000
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Skipped,
],
]
"###);
// Do THREE loop iterations. The second document should be entirely sorted as well
search.time_budget(TimeBudget::max().with_stop_after(3));
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [4, 1, 0, 2]
Scores: 1.0000 0.9167 0.6667 0.0000
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 1,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Skipped,
],
[
Skipped,
],
]
"###);
// Do FOUR loop iterations. The third document should be entirely sorted as well
// The words bucket has still not progressed, thus the last document doesn't have any info yet.
search.time_budget(TimeBudget::max().with_stop_after(4));
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [4, 1, 0, 2]
Scores: 1.0000 0.9167 0.8333 0.0000
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 1,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 2,
max_typo_count: 3,
},
),
],
[
Skipped,
],
]
"###);
// After SIX loop iterations. The words ranking rule gave us a new bucket.
// Since we reached the limit we were able to early exit without checking the typo ranking rule.
search.time_budget(TimeBudget::max().with_stop_after(6));
let result = search.execute().unwrap();
snapshot!(format!("IDs: {:?}\nScores: {}\nScore Details:\n{:#?}", result.documents_ids, result.document_scores.iter().map(|scores| format!("{:.4} ", ScoreDetails::global_score(scores.iter()))).collect::<String>(), result.document_scores), @r###"
IDs: [4, 1, 0, 3]
Scores: 1.0000 0.9167 0.8333 0.3333
Score Details:
[
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 0,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 1,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 3,
max_matching_words: 3,
},
),
Typo(
Typo {
typo_count: 2,
max_typo_count: 3,
},
),
],
[
Words(
Words {
matching_words: 2,
max_matching_words: 3,
},
),
Skipped,
],
]
"###);
}
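The snapshot scores in this test file (1.0000, 0.9167, 0.8333, 0.6667, 0.3333, 0.0000) can be reproduced by hand. The sketch below is one way to do so, assuming each ranking rule contributes a `(rank, max_rank)` pair and that a `Skipped` rule counts as rank 0 out of 1. The `Rank` struct and the `words`, `typo`, `skipped` and `global_score` helpers are hypothetical; the diff does not show the real body of `ScoreDetails::global_score`, so the mappings here are inferred from the snapshots.

```rust
/// Hypothetical rank of a document for one ranking rule; higher is better.
#[derive(Clone, Copy)]
struct Rank {
    rank: u32,
    max_rank: u32,
}

/// Assumed mapping: the words rule ranks by the number of matching words.
fn words(matching_words: u32, max_matching_words: u32) -> Rank {
    Rank { rank: matching_words, max_rank: max_matching_words }
}

/// Assumed mapping: fewer typos give a higher rank, out of `max_typo_count + 1`.
fn typo(typo_count: u32, max_typo_count: u32) -> Rank {
    Rank { rank: max_typo_count - typo_count + 1, max_rank: max_typo_count + 1 }
}

/// Assumed mapping: a skipped rule contributes the lowest possible rank.
fn skipped() -> Rank {
    Rank { rank: 0, max_rank: 1 }
}

/// Combines the per-rule ranks lexicographically into a score in [0, 1]:
/// earlier rules dominate, later rules refine within the earlier rule's slot.
fn global_score(ranks: &[Rank]) -> f64 {
    let (mut rank, mut max_rank) = (1u32, 1u32);
    for r in ranks {
        rank = rank.saturating_sub(1) * r.max_rank + r.rank;
        max_rank *= r.max_rank;
    }
    rank as f64 / max_rank as f64
}
```

Under these assumptions, three matching words with one typo out of three give (3 - 1) * 4 + 3 = 11 out of 12, which formats as 0.9167, and three matching words followed by a skipped typo rule give 2 out of 3, which formats as 0.6667.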

View File

@ -1,5 +1,6 @@
pub mod attribute_fid;
pub mod attribute_position;
pub mod cutoff;
pub mod distinct;
pub mod exactness;
pub mod geo_sort;

View File

@ -5,7 +5,7 @@ The typo ranking rule should transform the query graph such that it only contain
the combinations of word derivations that it used to compute its bucket.
The proximity ranking rule should then look for proximities only between those specific derivations.
For example, given the the search query `beautiful summer` and the dataset:
For example, given the search query `beautiful summer` and the dataset:
```text
{ "id": 0, "text": "beautigul summer...... beautiful day in the summer" }
{ "id": 1, "text": "beautiful summer" }

View File

@ -5,14 +5,14 @@ use roaring::RoaringBitmap;
use super::ranking_rules::{RankingRule, RankingRuleOutput, RankingRuleQueryTrait};
use crate::score_details::{self, ScoreDetails};
use crate::vector::DistributionShift;
use crate::vector::{DistributionShift, Embedder};
use crate::{DocumentId, Result, SearchContext, SearchLogger};
pub struct VectorSort<Q: RankingRuleQueryTrait> {
query: Option<Q>,
target: Vec<f32>,
vector_candidates: RoaringBitmap,
cached_sorted_docids: std::vec::IntoIter<(DocumentId, f32, Vec<f32>)>,
cached_sorted_docids: std::vec::IntoIter<(DocumentId, f32)>,
limit: usize,
distribution_shift: Option<DistributionShift>,
embedder_index: u8,
@ -24,8 +24,8 @@ impl<Q: RankingRuleQueryTrait> VectorSort<Q> {
target: Vec<f32>,
vector_candidates: RoaringBitmap,
limit: usize,
distribution_shift: Option<DistributionShift>,
embedder_name: &str,
embedder: &Embedder,
) -> Result<Self> {
let embedder_index = ctx
.index
@ -39,7 +39,7 @@ impl<Q: RankingRuleQueryTrait> VectorSort<Q> {
vector_candidates,
cached_sorted_docids: Default::default(),
limit,
distribution_shift,
distribution_shift: embedder.distribution(),
embedder_index,
})
}
@ -70,14 +70,9 @@ impl<Q: RankingRuleQueryTrait> VectorSort<Q> {
for reader in readers.iter() {
let nns_by_vector =
reader.nns_by_vector(ctx.txn, target, self.limit, None, Some(vector_candidates))?;
let vectors: std::result::Result<Vec<_>, _> = nns_by_vector
.iter()
.map(|(docid, _)| reader.item_vector(ctx.txn, *docid).transpose().unwrap())
.collect();
let vectors = vectors?;
results.extend(nns_by_vector.into_iter().zip(vectors).map(|((x, y), z)| (x, y, z)));
results.extend(nns_by_vector.into_iter());
}
results.sort_unstable_by_key(|(_, distance, _)| OrderedFloat(*distance));
results.sort_unstable_by_key(|(_, distance)| OrderedFloat(*distance));
self.cached_sorted_docids = results.into_iter();
Ok(())
@ -118,14 +113,11 @@ impl<'ctx, Q: RankingRuleQueryTrait> RankingRule<'ctx, Q> for VectorSort<Q> {
return Ok(Some(RankingRuleOutput {
query,
candidates: universe.clone(),
score: ScoreDetails::Vector(score_details::Vector {
target_vector: self.target.clone(),
value_similarity: None,
}),
score: ScoreDetails::Vector(score_details::Vector { similarity: None }),
}));
}
for (docid, distance, vector) in self.cached_sorted_docids.by_ref() {
for (docid, distance) in self.cached_sorted_docids.by_ref() {
if vector_candidates.contains(docid) {
let score = 1.0 - distance;
let score = self
@ -135,10 +127,7 @@ impl<'ctx, Q: RankingRuleQueryTrait> RankingRule<'ctx, Q> for VectorSort<Q> {
return Ok(Some(RankingRuleOutput {
query,
candidates: RoaringBitmap::from_iter([docid]),
score: ScoreDetails::Vector(score_details::Vector {
target_vector: self.target.clone(),
value_similarity: Some((vector, score)),
}),
score: ScoreDetails::Vector(score_details::Vector { similarity: Some(score) }),
}));
}
}
@ -154,10 +143,7 @@ impl<'ctx, Q: RankingRuleQueryTrait> RankingRule<'ctx, Q> for VectorSort<Q> {
return Ok(Some(RankingRuleOutput {
query,
candidates: universe.clone(),
score: ScoreDetails::Vector(score_details::Vector {
target_vector: self.target.clone(),
value_similarity: None,
}),
score: ScoreDetails::Vector(score_details::Vector { similarity: None }),
}));
}

View File

@ -0,0 +1,108 @@
use std::sync::Arc;
use ordered_float::OrderedFloat;
use crate::score_details::{self, ScoreDetails};
use crate::vector::Embedder;
use crate::{filtered_universe, DocumentId, Filter, Index, Result, SearchResult};
pub struct Recommend<'a> {
id: DocumentId,
// this should be linked to the String in the query
filter: Option<Filter<'a>>,
offset: usize,
limit: usize,
rtxn: &'a heed::RoTxn<'a>,
index: &'a Index,
embedder_name: String,
embedder: Arc<Embedder>,
}
impl<'a> Recommend<'a> {
pub fn new(
id: DocumentId,
offset: usize,
limit: usize,
index: &'a Index,
rtxn: &'a heed::RoTxn<'a>,
embedder_name: String,
embedder: Arc<Embedder>,
) -> Self {
Self { id, filter: None, offset, limit, rtxn, index, embedder_name, embedder }
}
pub fn filter(&mut self, filter: Filter<'a>) -> &mut Self {
self.filter = Some(filter);
self
}
pub fn execute(&self) -> Result<SearchResult> {
let universe = filtered_universe(self.index, self.rtxn, &self.filter)?;
let embedder_index =
self.index
.embedder_category_id
.get(self.rtxn, &self.embedder_name)?
.ok_or_else(|| crate::UserError::InvalidEmbedder(self.embedder_name.to_owned()))?;
let writer_index = (embedder_index as u16) << 8;
let readers: std::result::Result<Vec<_>, _> = (0..=u8::MAX)
.map_while(|k| {
arroy::Reader::open(self.rtxn, writer_index | (k as u16), self.index.vector_arroy)
.map(Some)
.or_else(|e| match e {
arroy::Error::MissingMetadata => Ok(None),
e => Err(e),
})
.transpose()
})
.collect();
let readers = readers?;
let mut results = Vec::new();
for reader in readers.iter() {
let nns_by_item = reader.nns_by_item(
self.rtxn,
self.id,
self.limit + self.offset + 1,
None,
Some(&universe),
)?;
if let Some(mut nns_by_item) = nns_by_item {
results.append(&mut nns_by_item);
}
}
results.sort_unstable_by_key(|(_, distance)| OrderedFloat(*distance));
let mut documents_ids = Vec::with_capacity(self.limit);
let mut document_scores = Vec::with_capacity(self.limit);
// skip offset +1 to skip the target document that is normally returned
for (docid, distance) in results.into_iter().skip(self.offset + 1) {
documents_ids.push(docid);
let score = 1.0 - distance;
let score = self
.embedder
.distribution()
.map(|distribution| distribution.shift(score))
.unwrap_or(score);
let score = ScoreDetails::Vector(score_details::Vector { similarity: Some(score) });
document_scores.push(vec![score]);
}
Ok(SearchResult {
matching_words: Default::default(),
candidates: universe,
documents_ids,
document_scores,
degraded: false,
used_negative_operator: false,
})
}
}
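`Recommend::execute` above does two small pieces of arithmetic that are easy to check in isolation: it builds each arroy index id from the embedder index and a per-embedder counter, and it turns sorted `(docid, distance)` pairs into similarity scores while skipping the target document. The sketch below reproduces both with hypothetical helper names (`arroy_index`, `rank_neighbours`) on plain `u32`/`f32` values. It leaves out the distribution shift and uses `f32::total_cmp` in place of `OrderedFloat`.

```rust
/// Packs the embedder index in the high byte and the store counter `k`
/// in the low byte, as `writer_index | (k as u16)` does in the diff.
fn arroy_index(embedder_index: u8, k: u8) -> u16 {
    ((embedder_index as u16) << 8) | (k as u16)
}

/// Sorts neighbours by ascending distance, skips `offset + 1` entries
/// (the extra one being the target document itself, which is normally the
/// nearest neighbour), and converts each distance into `1.0 - distance`.
fn rank_neighbours(mut neighbours: Vec<(u32, f32)>, offset: usize) -> Vec<(u32, f32)> {
    neighbours.sort_unstable_by(|a, b| a.1.total_cmp(&b.1));
    neighbours
        .into_iter()
        .skip(offset + 1)
        .map(|(docid, distance)| (docid, 1.0 - distance))
        .collect()
}
```

Like the code in the diff, this sketch does not truncate the result to `limit` after merging; whether that is needed depends on how many readers contribute results.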

View File

@ -149,7 +149,7 @@ impl<'i> FacetsUpdate<'i> {
self.index.set_updated_at(wtxn, &OffsetDateTime::now_utc())?;
// See self::comparison_bench::benchmark_facet_indexing
if self.data_size >= (self.database.len(wtxn)? / 50) {
if self.data_size >= (self.database.len(wtxn)? / 500) {
let field_ids =
self.index.faceted_fields_ids(wtxn)?.iter().copied().collect::<Vec<_>>();
let bulk_update = FacetsUpdateBulk::new(
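The hunk above changes the divisor in the bulk-update threshold from 50 to 500, so the bulk path is now taken at one tenth of the previous relative update size. A minimal sketch of that decision follows. The `FacetUpdate` enum and `choose_facet_update` function are hypothetical, and the `Incremental` variant is an assumption because the `else` branch is not visible in this hunk.

```rust
/// Hypothetical label for the two update strategies.
#[derive(Debug, PartialEq)]
enum FacetUpdate {
    Bulk,
    Incremental,
}

/// Mirrors `data_size >= database_len / 500` from the diff.
/// Integer division means any database with fewer than 500 entries
/// yields a threshold of 0, so the bulk path is always taken there.
fn choose_facet_update(data_size: u64, database_len: u64) -> FacetUpdate {
    if data_size >= database_len / 500 {
        FacetUpdate::Bulk
    } else {
        FacetUpdate::Incremental
    }
}
```

For a database of 10,000 entries the threshold is now 20 where it used to be 200.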

View File

@ -339,6 +339,7 @@ pub fn extract_embeddings<R: io::Read + io::Seek>(
prompt_reader: grenad::Reader<R>,
indexer: GrenadParameters,
embedder: Arc<Embedder>,
request_threads: &rayon::ThreadPool,
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let n_chunks = embedder.chunk_count_hint(); // chunk level parallelism
@ -376,7 +377,10 @@ pub fn extract_embeddings<R: io::Read + io::Seek>(
if chunks.len() == chunks.capacity() {
let chunked_embeds = embedder
.embed_chunks(std::mem::replace(&mut chunks, Vec::with_capacity(n_chunks)))
.embed_chunks(
std::mem::replace(&mut chunks, Vec::with_capacity(n_chunks)),
request_threads,
)
.map_err(crate::vector::Error::from)
.map_err(crate::Error::from)?;
@ -394,7 +398,7 @@ pub fn extract_embeddings<R: io::Read + io::Seek>(
// send last chunk
if !chunks.is_empty() {
let chunked_embeds = embedder
.embed_chunks(std::mem::take(&mut chunks))
.embed_chunks(std::mem::take(&mut chunks), request_threads)
.map_err(crate::vector::Error::from)
.map_err(crate::Error::from)?;
for (docid, embeddings) in chunks_ids
@ -408,7 +412,7 @@ pub fn extract_embeddings<R: io::Read + io::Seek>(
if !current_chunk.is_empty() {
let embeds = embedder
.embed_chunks(vec![std::mem::take(&mut current_chunk)])
.embed_chunks(vec![std::mem::take(&mut current_chunk)], request_threads)
.map_err(crate::vector::Error::from)
.map_err(crate::Error::from)?;

View File

@ -210,8 +210,7 @@ fn run_extraction_task<FE, FS, M>(
let current_span = tracing::Span::current();
rayon::spawn(move || {
let child_span =
tracing::trace_span!(target: "", parent: &current_span, "extract_multiple_chunks");
let child_span = tracing::trace_span!(target: "indexing::extract::details", parent: &current_span, "extract_multiple_chunks");
let _entered = child_span.enter();
puffin::profile_scope!("extract_multiple_chunks", name);
match extract_fn(chunk, indexer) {
@ -239,6 +238,12 @@ fn send_original_documents_data(
let documents_chunk_cloned = original_documents_chunk.clone();
let lmdb_writer_sx_cloned = lmdb_writer_sx.clone();
let request_threads = rayon::ThreadPoolBuilder::new()
.num_threads(crate::vector::REQUEST_PARALLELISM)
.thread_name(|index| format!("embedding-request-{index}"))
.build()?;
rayon::spawn(move || {
for (name, (embedder, prompt)) in embedders {
let result = extract_vector_points(
@ -250,7 +255,12 @@ fn send_original_documents_data(
);
match result {
Ok(ExtractedVectorPoints { manual_vectors, remove_vectors, prompts }) => {
let embeddings = match extract_embeddings(prompts, indexer, embedder.clone()) {
let embeddings = match extract_embeddings(
prompts,
indexer,
embedder.clone(),
&request_threads,
) {
Ok(results) => Some(results),
Err(error) => {
let _ = lmdb_writer_sx_cloned.send(Err(error));

View File

@ -284,7 +284,7 @@ where
#[tracing::instrument(
level = "trace",
skip_all,
target = "profile::indexing::details",
target = "indexing::details",
name = "index_documents_raw"
)]
pub fn execute_raw(self, output: TransformOutput) -> Result<u64>
@ -2646,6 +2646,13 @@ mod tests {
api_key: Setting::NotSet,
dimensions: Setting::Set(3),
document_template: Setting::NotSet,
url: Setting::NotSet,
query: Setting::NotSet,
input_field: Setting::NotSet,
path_to_embeddings: Setting::NotSet,
embedding_object: Setting::NotSet,
input_type: Setting::NotSet,
distribution: Setting::NotSet,
}),
);
settings.set_embedder_settings(embedders);
@ -2665,7 +2672,16 @@ mod tests {
.unwrap();
let rtxn = index.read_txn().unwrap();
let res = index.search(&rtxn).vector([0.0, 1.0, 2.0].to_vec()).execute().unwrap();
let mut embedding_configs = index.embedding_configs(&rtxn).unwrap();
let (embedder_name, embedder) = embedding_configs.pop().unwrap();
let embedder =
std::sync::Arc::new(crate::vector::Embedder::new(embedder.embedder_options).unwrap());
assert_eq!("manual", embedder_name);
let res = index
.search(&rtxn)
.semantic(embedder_name, embedder, Some([0.0, 1.0, 2.0].to_vec()))
.execute()
.unwrap();
assert_eq!(res.documents_ids.len(), 3);
}

View File

@ -473,7 +473,7 @@ pub(crate) fn write_typed_chunk_into_index(
is_merged_database = true;
}
TypedChunk::FieldIdFacetIsEmptyDocids(_) => {
let span = tracing::trace_span!(target: "profile::indexing::write_db", "field_id_facet_is_empty_docids");
let span = tracing::trace_span!(target: "indexing::write_db", "field_id_facet_is_empty_docids");
let _entered = span.enter();
let mut builder = MergerBuilder::new(merge_deladd_cbo_roaring_bitmaps as MergeFn);

View File

@ -14,12 +14,13 @@ use super::IndexerConfig;
use crate::criterion::Criterion;
use crate::error::UserError;
use crate::index::{DEFAULT_MIN_WORD_LEN_ONE_TYPO, DEFAULT_MIN_WORD_LEN_TWO_TYPOS};
use crate::order_by_map::OrderByMap;
use crate::proximity::ProximityPrecision;
use crate::update::index_documents::IndexDocumentsMethod;
use crate::update::{IndexDocuments, UpdateIndexingStep};
use crate::vector::settings::{check_set, check_unset, EmbedderSource, EmbeddingSettings};
use crate::vector::{Embedder, EmbeddingConfig, EmbeddingConfigs};
use crate::{FieldsIdsMap, Index, OrderBy, Result};
use crate::{FieldsIdsMap, Index, Result};
#[derive(Debug, Clone, PartialEq, Eq, Copy)]
pub enum Setting<T> {
@ -145,10 +146,11 @@ pub struct Settings<'a, 't, 'i> {
/// Attributes on which typo tolerance is disabled.
exact_attributes: Setting<HashSet<String>>,
max_values_per_facet: Setting<usize>,
sort_facet_values_by: Setting<HashMap<String, OrderBy>>,
sort_facet_values_by: Setting<OrderByMap>,
pagination_max_total_hits: Setting<usize>,
proximity_precision: Setting<ProximityPrecision>,
embedder_settings: Setting<BTreeMap<String, Setting<EmbeddingSettings>>>,
search_cutoff: Setting<u64>,
}
impl<'a, 't, 'i> Settings<'a, 't, 'i> {
@ -182,6 +184,7 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
pagination_max_total_hits: Setting::NotSet,
proximity_precision: Setting::NotSet,
embedder_settings: Setting::NotSet,
search_cutoff: Setting::NotSet,
indexer_config,
}
}
@ -340,7 +343,7 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
self.max_values_per_facet = Setting::Reset;
}
pub fn set_sort_facet_values_by(&mut self, value: HashMap<String, OrderBy>) {
pub fn set_sort_facet_values_by(&mut self, value: OrderByMap) {
self.sort_facet_values_by = Setting::Set(value);
}
@ -372,6 +375,14 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
self.embedder_settings = Setting::Reset;
}
pub fn set_search_cutoff(&mut self, value: u64) {
self.search_cutoff = Setting::Set(value);
}
pub fn reset_search_cutoff(&mut self) {
self.search_cutoff = Setting::Reset;
}
#[tracing::instrument(
level = "trace"
skip(self, progress_callback, should_abort, old_fields_ids_map),
@ -965,7 +976,12 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
match joined {
// updated config
EitherOrBoth::Both((name, mut old), (_, new)) => {
changed |= old.apply(new);
changed |= EmbeddingSettings::apply_and_need_reindex(&mut old, new);
if changed {
tracing::debug!(embedder = name, "need reindex");
} else {
tracing::debug!(embedder = name, "skip reindex");
}
let new = validate_embedding_settings(old, &name)?;
new_configs.insert(name, new);
}
@ -1025,6 +1041,24 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
Ok(update)
}
fn update_search_cutoff(&mut self) -> Result<bool> {
let changed = match self.search_cutoff {
Setting::Set(new) => {
let old = self.index.search_cutoff(self.wtxn)?;
if old == Some(new) {
false
} else {
self.index.put_search_cutoff(self.wtxn, new)?;
true
}
}
Setting::Reset => self.index.delete_search_cutoff(self.wtxn)?,
Setting::NotSet => false,
};
Ok(changed)
}
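`update_search_cutoff` above follows the usual tri-state pattern for settings: `Set` writes only when the value differs, `Reset` deletes, and `NotSet` does nothing. The sketch below shows the same pattern against an in-memory `Option<u64>` instead of the index. The local `Setting` enum and the `apply_cutoff` function are stand-ins written for this example, and treating `Reset` as "changed only if a value was stored" is an assumption about what `delete_search_cutoff` returns.

```rust
/// Local stand-in for the tri-state `Setting<T>` used in the diff.
enum Setting<T> {
    Set(T),
    Reset,
    NotSet,
}

/// Applies `setting` to `stored` and reports whether the stored value changed.
fn apply_cutoff(stored: &mut Option<u64>, setting: Setting<u64>) -> bool {
    match setting {
        // Writing the value that is already stored is not a change.
        Setting::Set(new) if *stored == Some(new) => false,
        Setting::Set(new) => {
            *stored = Some(new);
            true
        }
        // Assumption: a reset counts as a change only if something was stored.
        Setting::Reset => stored.take().is_some(),
        Setting::NotSet => false,
    }
}
```

Setting the same cutoff twice reports a change only the first time, which keeps repeated identical settings updates cheap.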
pub fn execute<FP, FA>(mut self, progress_callback: FP, should_abort: FA) -> Result<()>
where
FP: Fn(UpdateIndexingStep) + Sync,
@ -1032,6 +1066,14 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
{
self.index.set_updated_at(self.wtxn, &OffsetDateTime::now_utc())?;
// Note: this MUST be before `update_sortable` so that we can get the old value to compare with the updated value afterwards
let existing_fields: HashSet<_> = self
.index
.field_distribution(self.wtxn)?
.into_iter()
.filter_map(|(field, count)| (count != 0).then_some(field))
.collect();
let old_faceted_fields = self.index.user_defined_faceted_fields(self.wtxn)?;
let old_fields_ids_map = self.index.fields_ids_map(self.wtxn)?;
@ -1048,12 +1090,7 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
self.update_sort_facet_values_by()?;
self.update_pagination_max_total_hits()?;
// If there is new faceted fields we indicate that we must reindex as we must
// index new fields as facets. It means that the distinct attribute,
// an Asc/Desc criterion or a filtered attribute as be added or removed.
let new_faceted_fields = self.index.user_defined_faceted_fields(self.wtxn)?;
let faceted_updated = old_faceted_fields != new_faceted_fields;
let faceted_updated = self.update_faceted(existing_fields, old_faceted_fields)?;
let stop_words_updated = self.update_stop_words()?;
let non_separator_tokens_updated = self.update_non_separator_tokens()?;
let separator_tokens_updated = self.update_separator_tokens()?;
@ -1070,6 +1107,9 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
// 3. Keep the old vectors but reattempt indexing on a prompt change: only actually changed prompt will need embedding + storage
let embedding_configs_updated = self.update_embedding_configs()?;
// never trigger re-indexing
self.update_search_cutoff()?;
if stop_words_updated
|| non_separator_tokens_updated
|| separator_tokens_updated
@ -1086,6 +1126,34 @@ impl<'a, 't, 'i> Settings<'a, 't, 'i> {
Ok(())
}
fn update_faceted(
&self,
existing_fields: HashSet<String>,
old_faceted_fields: HashSet<String>,
) -> Result<bool> {
if existing_fields.iter().any(|field| field.contains('.')) {
return Ok(true);
}
if old_faceted_fields.iter().any(|field| field.contains('.')) {
return Ok(true);
}
// If there are new faceted fields, we indicate that we must reindex, as we must
// index the new fields as facets. This means that the distinct attribute,
// an Asc/Desc criterion, or a filtered attribute has been added or removed.
let new_faceted_fields = self.index.user_defined_faceted_fields(self.wtxn)?;
if new_faceted_fields.iter().any(|field| field.contains('.')) {
return Ok(true);
}
let faceted_updated =
(&existing_fields - &old_faceted_fields) != (&existing_fields - &new_faceted_fields);
Ok(faceted_updated)
}
}
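Review note: the decision made by `update_faceted` can be sketched in isolation. The following is a minimal stand-alone version, assuming plain `HashSet<String>` inputs instead of reads from the index; `faceted_updated` and `set` are hypothetical names used only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper to build a set of field names for the examples.
fn set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Stand-alone sketch of the check: returns `true` when the change in
/// faceted fields should trigger a reindex.
fn faceted_updated(
    existing_fields: &HashSet<String>,
    old_faceted: &HashSet<String>,
    new_faceted: &HashSet<String>,
) -> bool {
    // As in the diff, any nested field (a name containing a dot) forces a reindex.
    let has_nested = |fields: &HashSet<String>| fields.iter().any(|f| f.contains('.'));
    if has_nested(existing_fields) || has_nested(old_faceted) || has_nested(new_faceted) {
        return true;
    }
    // Otherwise, only facet changes that touch fields present in documents matter:
    // compare the existing fields left unfaceted before and after the change.
    (existing_fields - old_faceted) != (existing_fields - new_faceted)
}
```

With this sketch, declaring a facet on a field that no document contains leaves both set differences equal, so no reindex is requested.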
fn validate_prompt(
@ -1100,6 +1168,13 @@ fn validate_prompt(
api_key,
dimensions,
document_template: Setting::Set(template),
url,
query,
input_field,
path_to_embeddings,
embedding_object,
input_type,
distribution,
}) => {
// validate
let template = crate::prompt::Prompt::new(template)
@ -1113,6 +1188,13 @@ fn validate_prompt(
api_key,
dimensions,
document_template: Setting::Set(template),
url,
query,
input_field,
path_to_embeddings,
embedding_object,
input_type,
distribution,
}))
}
new => Ok(new),
@ -1125,8 +1207,21 @@ pub fn validate_embedding_settings(
) -> Result<Setting<EmbeddingSettings>> {
let settings = validate_prompt(name, settings)?;
let Setting::Set(settings) = settings else { return Ok(settings) };
let EmbeddingSettings { source, model, revision, api_key, dimensions, document_template } =
settings;
let EmbeddingSettings {
source,
model,
revision,
api_key,
dimensions,
document_template,
url,
query,
input_field,
path_to_embeddings,
embedding_object,
input_type,
distribution,
} = settings;
if let Some(0) = dimensions.set() {
return Err(crate::error::UserError::InvalidSettingsDimensions {
@ -1135,6 +1230,14 @@ pub fn validate_embedding_settings(
.into());
}
if let Some(url) = url.as_ref().set() {
url::Url::parse(url).map_err(|error| crate::error::UserError::InvalidUrl {
embedder_name: name.to_owned(),
inner_error: error,
url: url.to_owned(),
})?;
}
let Some(inferred_source) = source.set() else {
return Ok(Setting::Set(EmbeddingSettings {
source,
@ -1143,11 +1246,36 @@ pub fn validate_embedding_settings(
api_key,
dimensions,
document_template,
url,
query,
input_field,
path_to_embeddings,
embedding_object,
input_type,
distribution,
}));
};
match inferred_source {
EmbedderSource::OpenAi => {
check_unset(&revision, "revision", inferred_source, name)?;
check_unset(&revision, EmbeddingSettings::REVISION, inferred_source, name)?;
check_unset(&url, EmbeddingSettings::URL, inferred_source, name)?;
check_unset(&query, EmbeddingSettings::QUERY, inferred_source, name)?;
check_unset(&input_field, EmbeddingSettings::INPUT_FIELD, inferred_source, name)?;
check_unset(
&path_to_embeddings,
EmbeddingSettings::PATH_TO_EMBEDDINGS,
inferred_source,
name,
)?;
check_unset(
&embedding_object,
EmbeddingSettings::EMBEDDING_OBJECT,
inferred_source,
name,
)?;
check_unset(&input_type, EmbeddingSettings::INPUT_TYPE, inferred_source, name)?;
if let Setting::Set(model) = &model {
let model = crate::vector::openai::EmbeddingModel::from_name(model.as_str())
.ok_or(crate::error::UserError::InvalidOpenAiModel {
@ -1178,16 +1306,82 @@ pub fn validate_embedding_settings(
}
}
}
EmbedderSource::Ollama => {
// Dimensions get inferred, only model name is required
check_unset(&dimensions, EmbeddingSettings::DIMENSIONS, inferred_source, name)?;
check_set(&model, EmbeddingSettings::MODEL, inferred_source, name)?;
check_unset(&revision, EmbeddingSettings::REVISION, inferred_source, name)?;
check_unset(&query, EmbeddingSettings::QUERY, inferred_source, name)?;
check_unset(&input_field, EmbeddingSettings::INPUT_FIELD, inferred_source, name)?;
check_unset(
&path_to_embeddings,
EmbeddingSettings::PATH_TO_EMBEDDINGS,
inferred_source,
name,
)?;
check_unset(
&embedding_object,
EmbeddingSettings::EMBEDDING_OBJECT,
inferred_source,
name,
)?;
check_unset(&input_type, EmbeddingSettings::INPUT_TYPE, inferred_source, name)?;
}
EmbedderSource::HuggingFace => {
check_unset(&api_key, "apiKey", inferred_source, name)?;
check_unset(&dimensions, "dimensions", inferred_source, name)?;
check_unset(&api_key, EmbeddingSettings::API_KEY, inferred_source, name)?;
check_unset(&dimensions, EmbeddingSettings::DIMENSIONS, inferred_source, name)?;
check_unset(&url, EmbeddingSettings::URL, inferred_source, name)?;
check_unset(&query, EmbeddingSettings::QUERY, inferred_source, name)?;
check_unset(&input_field, EmbeddingSettings::INPUT_FIELD, inferred_source, name)?;
check_unset(
&path_to_embeddings,
EmbeddingSettings::PATH_TO_EMBEDDINGS,
inferred_source,
name,
)?;
check_unset(
&embedding_object,
EmbeddingSettings::EMBEDDING_OBJECT,
inferred_source,
name,
)?;
check_unset(&input_type, EmbeddingSettings::INPUT_TYPE, inferred_source, name)?;
}
EmbedderSource::UserProvided => {
check_unset(&model, "model", inferred_source, name)?;
check_unset(&revision, "revision", inferred_source, name)?;
check_unset(&api_key, "apiKey", inferred_source, name)?;
check_unset(&document_template, "documentTemplate", inferred_source, name)?;
check_set(&dimensions, "dimensions", inferred_source, name)?;
check_unset(&model, EmbeddingSettings::MODEL, inferred_source, name)?;
check_unset(&revision, EmbeddingSettings::REVISION, inferred_source, name)?;
check_unset(&api_key, EmbeddingSettings::API_KEY, inferred_source, name)?;
check_unset(
&document_template,
EmbeddingSettings::DOCUMENT_TEMPLATE,
inferred_source,
name,
)?;
check_set(&dimensions, EmbeddingSettings::DIMENSIONS, inferred_source, name)?;
check_unset(&url, EmbeddingSettings::URL, inferred_source, name)?;
check_unset(&query, EmbeddingSettings::QUERY, inferred_source, name)?;
check_unset(&input_field, EmbeddingSettings::INPUT_FIELD, inferred_source, name)?;
check_unset(
&path_to_embeddings,
EmbeddingSettings::PATH_TO_EMBEDDINGS,
inferred_source,
name,
)?;
check_unset(
&embedding_object,
EmbeddingSettings::EMBEDDING_OBJECT,
inferred_source,
name,
)?;
check_unset(&input_type, EmbeddingSettings::INPUT_TYPE, inferred_source, name)?;
}
EmbedderSource::Rest => {
check_unset(&model, EmbeddingSettings::MODEL, inferred_source, name)?;
check_unset(&revision, EmbeddingSettings::REVISION, inferred_source, name)?;
check_set(&url, EmbeddingSettings::URL, inferred_source, name)?;
}
}
Ok(Setting::Set(EmbeddingSettings {
@ -1197,6 +1391,13 @@ pub fn validate_embedding_settings(
api_key,
dimensions,
document_template,
url,
query,
input_field,
path_to_embeddings,
embedding_object,
input_type,
distribution,
}))
}
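Review note: the per-source rules in `validate_embedding_settings` amount to a table of mandatory and forbidden fields. Below is a minimal sketch of that pattern, assuming a simplified two-variant `Setting<T>` and only two sources; the error messages and the `validate` function are hypothetical and do not reproduce milli's real `check_set`/`check_unset` signatures.

```rust
/// Simplified stand-in for milli's `Setting<T>` (only the variants visible in the diff).
#[derive(Debug, Clone, PartialEq)]
enum Setting<T> {
    Set(T),
    NotSet,
}

/// Reduced stand-in for `EmbedderSource`, limited to two sources.
#[derive(Debug, Clone, Copy, PartialEq)]
enum EmbedderSource {
    UserProvided,
    Rest,
}

/// The field must be provided for this source (hypothetical error message).
fn check_set<T>(field: &Setting<T>, name: &str, source: EmbedderSource) -> Result<(), String> {
    match field {
        Setting::Set(_) => Ok(()),
        Setting::NotSet => Err(format!("`{name}` is mandatory for source {source:?}")),
    }
}

/// The field must not be provided for this source (hypothetical error message).
fn check_unset<T>(field: &Setting<T>, name: &str, source: EmbedderSource) -> Result<(), String> {
    match field {
        Setting::NotSet => Ok(()),
        Setting::Set(_) => Err(format!("`{name}` is not available for source {source:?}")),
    }
}

/// Applies a subset of the rules shown in the diff for two of the sources.
fn validate(
    source: EmbedderSource,
    model: &Setting<String>,
    url: &Setting<String>,
    dimensions: &Setting<usize>,
) -> Result<(), String> {
    match source {
        EmbedderSource::UserProvided => {
            check_unset(model, "model", source)?;
            check_unset(url, "url", source)?;
            check_set(dimensions, "dimensions", source)?;
        }
        EmbedderSource::Rest => {
            check_unset(model, "model", source)?;
            check_set(url, "url", source)?;
        }
    }
    Ok(())
}
```

The `?` operator stops at the first violated rule, so the caller receives a single error naming the offending field.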
@ -2019,6 +2220,7 @@ mod tests {
pagination_max_total_hits,
proximity_precision,
embedder_settings,
search_cutoff,
} = settings;
assert!(matches!(searchable_fields, Setting::NotSet));
assert!(matches!(displayed_fields, Setting::NotSet));
@ -2042,6 +2244,7 @@ mod tests {
assert!(matches!(pagination_max_total_hits, Setting::NotSet));
assert!(matches!(proximity_precision, Setting::NotSet));
assert!(matches!(embedder_settings, Setting::NotSet));
assert!(matches!(search_cutoff, Setting::NotSet));
})
.unwrap();
}


@ -20,7 +20,7 @@ impl<'t, 'i> WordsPrefixesFst<'t, 'i> {
/// Set the number of words required to make a prefix be part of the words prefixes
/// database. If a word prefix is supposed to match more than this number of words in the
/// dictionnary, therefore this prefix is added to the words prefixes datastructures.
/// dictionary, therefore this prefix is added to the words prefixes datastructures.
///
/// Default value is 100. This value must be higher than 50 and will be clamped
/// to this bound otherwise.


@ -3,7 +3,6 @@ use std::path::PathBuf;
use hf_hub::api::sync::ApiError;
use crate::error::FaultSource;
use crate::vector::openai::OpenAiError;
#[derive(Debug, thiserror::Error)]
#[error("Error while generating embeddings: {inner}")]
@ -51,26 +50,36 @@ pub enum EmbedErrorKind {
TensorValue(candle_core::Error),
#[error("could not run model: {0}")]
ModelForward(candle_core::Error),
#[error("could not reach OpenAI: {0}")]
OpenAiNetwork(reqwest::Error),
#[error("unexpected response from OpenAI: {0}")]
OpenAiUnexpected(reqwest::Error),
#[error("could not authenticate against OpenAI: {0}")]
OpenAiAuth(OpenAiError),
#[error("sent too many requests to OpenAI: {0}")]
OpenAiTooManyRequests(OpenAiError),
#[error("received internal error from OpenAI: {0:?}")]
OpenAiInternalServerError(Option<OpenAiError>),
#[error("sent too many tokens in a request to OpenAI: {0}")]
OpenAiTooManyTokens(OpenAiError),
#[error("received unhandled HTTP status code {0} from OpenAI")]
OpenAiUnhandledStatusCode(u16),
#[error("attempt to embed the following text in a configuration where embeddings must be user provided: {0:?}")]
ManualEmbed(String),
#[error("could not initialize asynchronous runtime: {0}")]
OpenAiRuntimeInit(std::io::Error),
#[error("initializing web client for sending embedding requests failed: {0}")]
InitWebClient(reqwest::Error),
#[error("model not found. Meilisearch will not automatically download models from the Ollama library, please pull the model manually: {0:?}")]
OllamaModelNotFoundError(Option<String>),
#[error("error deserializing the response body as JSON: {0}")]
RestResponseDeserialization(std::io::Error),
#[error("component `{0}` not found in path `{1}` in response: `{2}`")]
RestResponseMissingEmbeddings(String, String, String),
#[error("unexpected format of the embedding response: {0}")]
RestResponseFormat(serde_json::Error),
#[error("expected a response containing {0} embeddings, got only {1}")]
RestResponseEmbeddingCount(usize, usize),
#[error("could not authenticate against embedding server: {0:?}")]
RestUnauthorized(Option<String>),
#[error("sent too many requests to embedding server: {0:?}")]
RestTooManyRequests(Option<String>),
#[error("sent a bad request to embedding server: {0:?}")]
RestBadRequest(Option<String>),
#[error("received internal error HTTP {0} from embedding server: {1:?}")]
RestInternalServerError(u16, Option<String>),
#[error("received HTTP {0} from embedding server: {1:?}")]
RestOtherStatusCode(u16, Option<String>),
#[error("could not reach embedding server: {0}")]
RestNetwork(ureq::Transport),
#[error("expected '{}' to be an object in query '{0}'", .1.join("."))]
RestNotAnObject(serde_json::Value, Vec<String>),
#[error("while embedding tokenized input, expected embeddings of dimension `{0}`, got embeddings of dimension `{1}`")]
OpenAiUnexpectedDimension(usize, usize),
#[error("no embedding was produced")]
MissingEmbedding,
}
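Review note: in the `#[error(...)]` attributes of `EmbedErrorKind`, `{0}` and `{1}` refer to the tuple fields of the variant by position. The sketch below shows the equivalent hand-written `Display` for two variants, assuming a reduced enum named `RestErrorSketch` (a hypothetical name) and no dependency on `thiserror`.

```rust
use std::fmt;

/// Reduced, hypothetical copy of two REST-related variants.
#[derive(Debug)]
enum RestErrorSketch {
    /// (expected count, received count)
    EmbeddingCount(usize, usize),
    /// (HTTP status code, optional response body)
    OtherStatusCode(u16, Option<String>),
}

impl fmt::Display for RestErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Field 0 and field 1 are both shown with `Display`.
            Self::EmbeddingCount(expected, got) => {
                write!(f, "expected a response containing {expected} embeddings, got only {got}")
            }
            // Field 0 is the status code; field 1 is shown with `Debug`,
            // because `Option<String>` does not implement `Display`.
            Self::OtherStatusCode(code, body) => {
                write!(f, "received HTTP {code} from embedding server: {body:?}")
            }
        }
    }
}
```

Writing the match by hand makes it easy to see which field each placeholder prints, which is worth checking for every two-field variant.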
impl EmbedError {
@ -90,44 +99,101 @@ impl EmbedError {
Self { kind: EmbedErrorKind::ModelForward(inner), fault: FaultSource::Runtime }
}
pub fn openai_network(inner: reqwest::Error) -> Self {
Self { kind: EmbedErrorKind::OpenAiNetwork(inner), fault: FaultSource::Runtime }
}
pub fn openai_unexpected(inner: reqwest::Error) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiUnexpected(inner), fault: FaultSource::Bug }
}
pub(crate) fn openai_auth_error(inner: OpenAiError) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiAuth(inner), fault: FaultSource::User }
}
pub(crate) fn openai_too_many_requests(inner: OpenAiError) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiTooManyRequests(inner), fault: FaultSource::Runtime }
}
pub(crate) fn openai_internal_server_error(inner: Option<OpenAiError>) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiInternalServerError(inner), fault: FaultSource::Runtime }
}
pub(crate) fn openai_too_many_tokens(inner: OpenAiError) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiTooManyTokens(inner), fault: FaultSource::Bug }
}
pub(crate) fn openai_unhandled_status_code(code: u16) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiUnhandledStatusCode(code), fault: FaultSource::Bug }
}
pub(crate) fn embed_on_manual_embedder(texts: String) -> EmbedError {
Self { kind: EmbedErrorKind::ManualEmbed(texts), fault: FaultSource::User }
}
pub(crate) fn openai_runtime_init(inner: std::io::Error) -> EmbedError {
Self { kind: EmbedErrorKind::OpenAiRuntimeInit(inner), fault: FaultSource::Runtime }
pub(crate) fn ollama_model_not_found(inner: Option<String>) -> EmbedError {
Self { kind: EmbedErrorKind::OllamaModelNotFoundError(inner), fault: FaultSource::User }
}
pub fn openai_initialize_web_client(inner: reqwest::Error) -> Self {
Self { kind: EmbedErrorKind::InitWebClient(inner), fault: FaultSource::Runtime }
pub(crate) fn rest_response_deserialization(error: std::io::Error) -> EmbedError {
Self {
kind: EmbedErrorKind::RestResponseDeserialization(error),
fault: FaultSource::Runtime,
}
}
pub(crate) fn rest_response_missing_embeddings<S: AsRef<str>>(
response: serde_json::Value,
component: &str,
response_field: &[S],
) -> EmbedError {
let response_field: Vec<&str> = response_field.iter().map(AsRef::as_ref).collect();
let response_field = response_field.join(".");
Self {
kind: EmbedErrorKind::RestResponseMissingEmbeddings(
component.to_owned(),
response_field,
serde_json::to_string_pretty(&response).unwrap_or_default(),
),
fault: FaultSource::Undecided,
}
}
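Review note: `rest_response_missing_embeddings` accepts any slice whose items can be viewed as `&str` and joins them into a dotted path for the error message. A minimal sketch of just that step, with `join_path` as a hypothetical helper name:

```rust
/// Joins path components with dots, accepting `&[&str]`, `&[String]`, and similar.
fn join_path<S: AsRef<str>>(path: &[S]) -> String {
    // Borrow every component as `&str` first, then join without copying each part twice.
    let parts: Vec<&str> = path.iter().map(AsRef::as_ref).collect();
    parts.join(".")
}
```

The `AsRef<str>` bound lets callers pass either owned or borrowed components without converting them first.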
pub(crate) fn rest_response_format(error: serde_json::Error) -> EmbedError {
Self { kind: EmbedErrorKind::RestResponseFormat(error), fault: FaultSource::Undecided }
}
pub(crate) fn rest_response_embedding_count(expected: usize, got: usize) -> EmbedError {
Self {
kind: EmbedErrorKind::RestResponseEmbeddingCount(expected, got),
fault: FaultSource::Runtime,
}
}
pub(crate) fn rest_unauthorized(error_response: Option<String>) -> EmbedError {
Self { kind: EmbedErrorKind::RestUnauthorized(error_response), fault: FaultSource::User }
}
pub(crate) fn rest_too_many_requests(error_response: Option<String>) -> EmbedError {
Self {
kind: EmbedErrorKind::RestTooManyRequests(error_response),
fault: FaultSource::Runtime,
}
}
pub(crate) fn rest_bad_request(error_response: Option<String>) -> EmbedError {
Self { kind: EmbedErrorKind::RestBadRequest(error_response), fault: FaultSource::User }
}
pub(crate) fn rest_internal_server_error(
code: u16,
error_response: Option<String>,
) -> EmbedError {
Self {
kind: EmbedErrorKind::RestInternalServerError(code, error_response),
fault: FaultSource::Runtime,
}
}
pub(crate) fn rest_other_status_code(code: u16, error_response: Option<String>) -> EmbedError {
Self {
kind: EmbedErrorKind::RestOtherStatusCode(code, error_response),
fault: FaultSource::Undecided,
}
}
pub(crate) fn rest_network(transport: ureq::Transport) -> EmbedError {
Self { kind: EmbedErrorKind::RestNetwork(transport), fault: FaultSource::Runtime }
}
pub(crate) fn rest_not_an_object(
query: serde_json::Value,
input_path: Vec<String>,
) -> EmbedError {
Self { kind: EmbedErrorKind::RestNotAnObject(query, input_path), fault: FaultSource::User }
}
pub(crate) fn openai_unexpected_dimension(expected: usize, got: usize) -> EmbedError {
Self {
kind: EmbedErrorKind::OpenAiUnexpectedDimension(expected, got),
fault: FaultSource::Runtime,
}
}
pub(crate) fn missing_embedding() -> EmbedError {
Self { kind: EmbedErrorKind::MissingEmbedding, fault: FaultSource::Undecided }
}
}
@ -188,16 +254,12 @@ impl NewEmbedderError {
Self { kind: NewEmbedderErrorKind::LoadModel(inner), fault: FaultSource::Runtime }
}
pub fn hf_could_not_determine_dimension(inner: EmbedError) -> NewEmbedderError {
pub fn could_not_determine_dimension(inner: EmbedError) -> NewEmbedderError {
Self {
kind: NewEmbedderErrorKind::CouldNotDetermineDimension(inner),
fault: FaultSource::Runtime,
}
}
pub fn openai_invalid_api_key_format(inner: reqwest::header::InvalidHeaderValue) -> Self {
Self { kind: NewEmbedderErrorKind::InvalidApiKeyFormat(inner), fault: FaultSource::User }
}
}
#[derive(Debug, thiserror::Error)]
@ -244,7 +306,4 @@ pub enum NewEmbedderErrorKind {
CouldNotDetermineDimension(EmbedError),
#[error("loading model failed: {0}")]
LoadModel(candle_core::Error),
// openai
#[error("The API key passed to Authorization error was in an invalid format: {0}")]
InvalidApiKeyFormat(reqwest::header::InvalidHeaderValue),
}


@ -33,6 +33,7 @@ enum WeightSource {
pub struct EmbedderOptions {
pub model: String,
pub revision: Option<String>,
pub distribution: Option<DistributionShift>,
}
impl EmbedderOptions {
@ -40,6 +41,7 @@ impl EmbedderOptions {
Self {
model: "BAAI/bge-base-en-v1.5".to_string(),
revision: Some("617ca489d9e86b49b8167676d8220688b99db36e".into()),
distribution: None,
}
}
}
@ -87,11 +89,11 @@ impl Embedder {
let config = api.get("config.json").map_err(NewEmbedderError::api_get)?;
let tokenizer = api.get("tokenizer.json").map_err(NewEmbedderError::api_get)?;
let (weights, source) = {
api.get("pytorch_model.bin")
.map(|filename| (filename, WeightSource::Pytorch))
api.get("model.safetensors")
.map(|filename| (filename, WeightSource::Safetensors))
.or_else(|_| {
api.get("model.safetensors")
.map(|filename| (filename, WeightSource::Safetensors))
api.get("pytorch_model.bin")
.map(|filename| (filename, WeightSource::Pytorch))
})
.map_err(NewEmbedderError::api_get)?
};
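Review note: this hunk swaps the lookup order so that `model.safetensors` is tried first and `pytorch_model.bin` is only a fallback. The sketch below reproduces the `map` / `or_else` chain, assuming a hypothetical `fetch` function that looks a file name up in a list instead of calling the Hugging Face Hub API.

```rust
#[derive(Debug, PartialEq)]
enum WeightSource {
    Safetensors,
    Pytorch,
}

/// Hypothetical stand-in for `api.get(name)`: succeeds if the file is listed.
fn fetch(available: &[&str], name: &str) -> Result<String, String> {
    if available.contains(&name) {
        Ok(name.to_string())
    } else {
        Err(format!("{name} not found"))
    }
}

/// Prefers safetensors weights and falls back to the PyTorch checkpoint.
fn pick_weights(available: &[&str]) -> Result<(String, WeightSource), String> {
    fetch(available, "model.safetensors")
        .map(|file| (file, WeightSource::Safetensors))
        // The fallback closure only runs when the first lookup failed.
        .or_else(|_| {
            fetch(available, "pytorch_model.bin").map(|file| (file, WeightSource::Pytorch))
        })
}
```

If both lookups fail, the error returned is the one from the fallback, which matches the shape of the chain in the diff.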
@ -131,7 +133,7 @@ impl Embedder {
let embeddings = this
.embed(vec!["test".into()])
.map_err(NewEmbedderError::hf_could_not_determine_dimension)?;
.map_err(NewEmbedderError::could_not_determine_dimension)?;
this.dimensions = embeddings.first().unwrap().dimension();
Ok(this)
@ -193,10 +195,15 @@ impl Embedder {
}
pub fn distribution(&self) -> Option<DistributionShift> {
if self.options.model == "BAAI/bge-base-en-v1.5" {
Some(DistributionShift { current_mean: 0.85, current_sigma: 0.1 })
} else {
None
}
self.options.distribution.or_else(|| {
if self.options.model == "BAAI/bge-base-en-v1.5" {
Some(DistributionShift {
current_mean: ordered_float::OrderedFloat(0.85),
current_sigma: ordered_float::OrderedFloat(0.1),
})
} else {
None
}
})
}
}
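Review note: the new `distribution()` gives precedence to a user-configured distribution and only falls back to the built-in values for `BAAI/bge-base-en-v1.5`. A minimal sketch of that precedence, assuming a plain `f32` struct named `Shift` (hypothetical) in place of `DistributionShift`:

```rust
/// Hypothetical simplified distribution, using plain `f32` fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Shift {
    mean: f32,
    sigma: f32,
}

/// Returns the configured distribution if any, else the model default if one is known.
fn effective_distribution(configured: Option<Shift>, model: &str) -> Option<Shift> {
    configured.or_else(|| {
        // Built-in values taken from the diff for this specific model.
        (model == "BAAI/bge-base-en-v1.5").then_some(Shift { mean: 0.85, sigma: 0.1 })
    })
}
```

`Option::or_else` evaluates the closure lazily, so the model name is only compared when nothing was configured.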


@ -1,19 +1,21 @@
use super::error::EmbedError;
use super::Embeddings;
use super::{DistributionShift, Embeddings};
#[derive(Debug, Clone, Copy)]
pub struct Embedder {
dimensions: usize,
distribution: Option<DistributionShift>,
}
#[derive(Debug, Clone, Hash, PartialEq, Eq, serde::Deserialize, serde::Serialize)]
pub struct EmbedderOptions {
pub dimensions: usize,
pub distribution: Option<DistributionShift>,
}
impl Embedder {
pub fn new(options: EmbedderOptions) -> Self {
Self { dimensions: options.dimensions }
Self { dimensions: options.dimensions, distribution: options.distribution }
}
pub fn embed(&self, mut texts: Vec<String>) -> Result<Vec<Embeddings<f32>>, EmbedError> {
@ -31,4 +33,8 @@ impl Embedder {
) -> Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
text_chunks.into_iter().map(|prompts| self.embed(prompts)).collect()
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.distribution
}
}


@ -1,6 +1,10 @@
use std::collections::HashMap;
use std::sync::Arc;
use deserr::{DeserializeError, Deserr};
use ordered_float::OrderedFloat;
use serde::{Deserialize, Serialize};
use self::error::{EmbedError, NewEmbedderError};
use crate::prompt::{Prompt, PromptData};
@ -10,50 +14,71 @@ pub mod manual;
pub mod openai;
pub mod settings;
pub mod ollama;
pub mod rest;
pub use self::error::Error;
pub type Embedding = Vec<f32>;
pub const REQUEST_PARALLELISM: usize = 40;
/// One or multiple embeddings stored consecutively in a flat vector.
pub struct Embeddings<F> {
data: Vec<F>,
dimension: usize,
}
impl<F> Embeddings<F> {
/// Declares an empty vector of embeddings of the specified dimensions.
pub fn new(dimension: usize) -> Self {
Self { data: Default::default(), dimension }
}
/// Declares a vector of embeddings containing a single element.
///
/// The dimension is inferred from the length of the passed embedding.
pub fn from_single_embedding(embedding: Vec<F>) -> Self {
Self { dimension: embedding.len(), data: embedding }
}
/// Declares a vector of embeddings from its components.
///
/// `data.len()` must be a multiple of `dimension`, otherwise an error is returned.
pub fn from_inner(data: Vec<F>, dimension: usize) -> Result<Self, Vec<F>> {
let mut this = Self::new(dimension);
this.append(data)?;
Ok(this)
}
/// Returns the number of embeddings in this vector of embeddings.
pub fn embedding_count(&self) -> usize {
self.data.len() / self.dimension
}
/// Dimension of a single embedding.
pub fn dimension(&self) -> usize {
self.dimension
}
/// Deconstructs self into the inner flat vector.
pub fn into_inner(self) -> Vec<F> {
self.data
}
/// A reference to the inner flat vector.
pub fn as_inner(&self) -> &[F] {
&self.data
}
/// Iterates over the embeddings contained in the flat vector.
pub fn iter(&self) -> impl Iterator<Item = &'_ [F]> + '_ {
self.data.as_slice().chunks_exact(self.dimension)
}
/// Push an embedding at the end of the embeddings.
///
/// If `embedding.len() != self.dimension`, then the push operation fails.
pub fn push(&mut self, mut embedding: Vec<F>) -> Result<(), Vec<F>> {
if embedding.len() != self.dimension {
return Err(embedding);
@ -62,6 +87,9 @@ impl<F> Embeddings<F> {
Ok(())
}
/// Append a flat vector of embeddings at the end of the embeddings.
///
/// If `embeddings.len() % self.dimension != 0`, then the append operation fails.
pub fn append(&mut self, mut embeddings: Vec<F>) -> Result<(), Vec<F>> {
if embeddings.len() % self.dimension != 0 {
return Err(embeddings);
@ -71,44 +99,68 @@ impl<F> Embeddings<F> {
}
}
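Review note: `Embeddings<F>` stores several embeddings back to back in one flat vector and recovers them by chunking on `dimension`. The sketch below illustrates that layout, assuming a simplified `f32`-only type named `FlatEmbeddings` (hypothetical). Unlike the code in the diff, it also rejects a zero dimension, which is an assumption added here to avoid a division by zero.

```rust
/// Hypothetical simplified version of `Embeddings<f32>`.
struct FlatEmbeddings {
    data: Vec<f32>,
    dimension: usize,
}

impl FlatEmbeddings {
    /// Fails, returning the input, unless `data.len()` is a multiple of `dimension`.
    fn from_inner(data: Vec<f32>, dimension: usize) -> Result<Self, Vec<f32>> {
        // Assumption of this sketch: a zero dimension is treated as an error.
        if dimension == 0 || data.len() % dimension != 0 {
            return Err(data);
        }
        Ok(Self { data, dimension })
    }

    /// Number of embeddings stored in the flat vector.
    fn embedding_count(&self) -> usize {
        self.data.len() / self.dimension
    }

    /// Iterates over the embeddings, one `dimension`-sized slice at a time.
    fn iter(&self) -> impl Iterator<Item = &[f32]> + '_ {
        self.data.chunks_exact(self.dimension)
    }
}
```

Returning the rejected vector in the `Err` case, as the diff does, lets the caller recover the data without cloning it.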
/// An embedder can be used to transform text into embeddings.
#[derive(Debug)]
pub enum Embedder {
/// An embedder based on running local models, fetched from the Hugging Face Hub.
HuggingFace(hf::Embedder),
/// An embedder based on making embedding queries against the OpenAI API.
OpenAi(openai::Embedder),
/// An embedder based on the user providing the embeddings in the documents and queries.
UserProvided(manual::Embedder),
/// An embedder based on making embedding queries against an <https://ollama.com> embedding server.
Ollama(ollama::Embedder),
/// An embedder based on making embedding queries against a generic JSON/REST embedding server.
Rest(rest::Embedder),
}
/// Configuration for an embedder.
#[derive(Debug, Clone, Default, serde::Deserialize, serde::Serialize)]
pub struct EmbeddingConfig {
/// Options of the embedder, specific to each kind of embedder
pub embedder_options: EmbedderOptions,
/// Document template
pub prompt: PromptData,
// TODO: add metrics and anything needed
}
/// Map of embedder configurations.
///
/// Each configuration is mapped to a name.
#[derive(Clone, Default)]
pub struct EmbeddingConfigs(HashMap<String, (Arc<Embedder>, Arc<Prompt>)>);
impl EmbeddingConfigs {
/// Create the map from its internal components.
pub fn new(data: HashMap<String, (Arc<Embedder>, Arc<Prompt>)>) -> Self {
Self(data)
}
/// Get an embedder configuration and template from its name.
pub fn get(&self, name: &str) -> Option<(Arc<Embedder>, Arc<Prompt>)> {
self.0.get(name).cloned()
}
/// Get the default embedder configuration, if any.
pub fn get_default(&self) -> Option<(Arc<Embedder>, Arc<Prompt>)> {
self.get_default_embedder_name().and_then(|default| self.get(&default))
self.get(self.get_default_embedder_name())
}
pub fn get_default_embedder_name(&self) -> Option<String> {
/// Get the name of the default embedder configuration.
///
/// The default embedder is determined as follows:
///
/// - If there is only one embedder, it is always the default.
/// - If there are multiple embedders and one of them is called `default`, then that one is the default embedder.
/// - In all other cases, there is no default embedder.
pub fn get_default_embedder_name(&self) -> &str {
let mut it = self.0.keys();
let first_name = it.next();
let second_name = it.next();
match (first_name, second_name) {
(None, _) => None,
(Some(first), None) => Some(first.to_owned()),
(Some(_), Some(_)) => Some("default".to_owned()),
(None, _) => "default",
(Some(first), None) => first,
(Some(_), Some(_)) => "default",
}
}
}
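Review note: the rewritten `get_default_embedder_name` now always returns a name, and the lookup in `get_default` decides whether an embedder with that name exists. A minimal sketch of the selection rule, assuming a generic `HashMap<String, V>` and a free function named `default_embedder_name` (hypothetical):

```rust
use std::collections::HashMap;

/// Returns the only key if the map has exactly one entry, and "default" otherwise.
fn default_embedder_name<V>(configs: &HashMap<String, V>) -> &str {
    let mut names = configs.keys();
    match (names.next(), names.next()) {
        // Exactly one embedder: it is the default, whatever its name.
        (Some(only), None) => only.as_str(),
        // Zero or several embedders: the default is the one called "default", if present.
        _ => "default",
    }
}
```

Because the function borrows from the map instead of allocating a `String`, callers can pass the result directly to a lookup such as `configs.get(name)`.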
@ -123,11 +175,14 @@ impl IntoIterator for EmbeddingConfigs {
}
}
/// Options of an embedder, specific to each kind of embedder.
#[derive(Debug, Clone, Hash, PartialEq, Eq, serde::Deserialize, serde::Serialize)]
pub enum EmbedderOptions {
HuggingFace(hf::EmbedderOptions),
OpenAi(openai::EmbedderOptions),
Ollama(ollama::EmbedderOptions),
UserProvided(manual::EmbedderOptions),
Rest(rest::EmbedderOptions),
}
impl Default for EmbedderOptions {
@ -137,91 +192,204 @@ impl Default for EmbedderOptions {
}
impl EmbedderOptions {
/// Default options for the Hugging Face embedder
pub fn huggingface() -> Self {
Self::HuggingFace(hf::EmbedderOptions::new())
}
/// Default options for the OpenAI embedder
pub fn openai(api_key: Option<String>) -> Self {
Self::OpenAi(openai::EmbedderOptions::with_default_model(api_key))
}
pub fn ollama(api_key: Option<String>, url: Option<String>) -> Self {
Self::Ollama(ollama::EmbedderOptions::with_default_model(api_key, url))
}
}
impl Embedder {
/// Spawns a new embedder built from its options.
pub fn new(options: EmbedderOptions) -> std::result::Result<Self, NewEmbedderError> {
Ok(match options {
EmbedderOptions::HuggingFace(options) => Self::HuggingFace(hf::Embedder::new(options)?),
EmbedderOptions::OpenAi(options) => Self::OpenAi(openai::Embedder::new(options)?),
EmbedderOptions::Ollama(options) => Self::Ollama(ollama::Embedder::new(options)?),
EmbedderOptions::UserProvided(options) => {
Self::UserProvided(manual::Embedder::new(options))
}
EmbedderOptions::Rest(options) => Self::Rest(rest::Embedder::new(options)?),
})
}
pub async fn embed(
/// Embed one or multiple texts.
///
/// Each text can be embedded as one or multiple embeddings.
pub fn embed(
&self,
texts: Vec<String>,
) -> std::result::Result<Vec<Embeddings<f32>>, EmbedError> {
match self {
Embedder::HuggingFace(embedder) => embedder.embed(texts),
Embedder::OpenAi(embedder) => {
let client = embedder.new_client()?;
embedder.embed(texts, &client).await
}
Embedder::OpenAi(embedder) => embedder.embed(texts),
Embedder::Ollama(embedder) => embedder.embed(texts),
Embedder::UserProvided(embedder) => embedder.embed(texts),
Embedder::Rest(embedder) => embedder.embed(texts),
}
}
/// # Panics
pub fn embed_one(&self, text: String) -> std::result::Result<Embedding, EmbedError> {
let mut embeddings = self.embed(vec![text])?;
let embeddings = embeddings.pop().ok_or_else(EmbedError::missing_embedding)?;
Ok(if embeddings.iter().nth(1).is_some() {
tracing::warn!("Ignoring embeddings past the first one in long search query");
embeddings.iter().next().unwrap().to_vec()
} else {
embeddings.into_inner()
})
}
/// Embed multiple chunks of texts.
///
/// - if called from an asynchronous context
/// Each chunk is composed of one or multiple texts.
pub fn embed_chunks(
&self,
text_chunks: Vec<Vec<String>>,
threads: &rayon::ThreadPool,
) -> std::result::Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
match self {
Embedder::HuggingFace(embedder) => embedder.embed_chunks(text_chunks),
Embedder::OpenAi(embedder) => embedder.embed_chunks(text_chunks),
Embedder::OpenAi(embedder) => embedder.embed_chunks(text_chunks, threads),
Embedder::Ollama(embedder) => embedder.embed_chunks(text_chunks, threads),
Embedder::UserProvided(embedder) => embedder.embed_chunks(text_chunks),
Embedder::Rest(embedder) => embedder.embed_chunks(text_chunks, threads),
}
}
/// Indicates the preferred number of chunks to pass to [`Self::embed_chunks`]
pub fn chunk_count_hint(&self) -> usize {
match self {
Embedder::HuggingFace(embedder) => embedder.chunk_count_hint(),
Embedder::OpenAi(embedder) => embedder.chunk_count_hint(),
Embedder::Ollama(embedder) => embedder.chunk_count_hint(),
Embedder::UserProvided(_) => 1,
Embedder::Rest(embedder) => embedder.chunk_count_hint(),
}
}
/// Indicates the preferred number of texts in a single chunk passed to [`Self::embed`]
pub fn prompt_count_in_chunk_hint(&self) -> usize {
match self {
Embedder::HuggingFace(embedder) => embedder.prompt_count_in_chunk_hint(),
Embedder::OpenAi(embedder) => embedder.prompt_count_in_chunk_hint(),
Embedder::Ollama(embedder) => embedder.prompt_count_in_chunk_hint(),
Embedder::UserProvided(_) => 1,
Embedder::Rest(embedder) => embedder.prompt_count_in_chunk_hint(),
}
}
/// Indicates the dimensions of a single embedding produced by the embedder.
pub fn dimensions(&self) -> usize {
match self {
Embedder::HuggingFace(embedder) => embedder.dimensions(),
Embedder::OpenAi(embedder) => embedder.dimensions(),
Embedder::Ollama(embedder) => embedder.dimensions(),
Embedder::UserProvided(embedder) => embedder.dimensions(),
Embedder::Rest(embedder) => embedder.dimensions(),
}
}
/// An optional distribution used to apply an affine transformation to the similarity score of a document.
pub fn distribution(&self) -> Option<DistributionShift> {
match self {
Embedder::HuggingFace(embedder) => embedder.distribution(),
Embedder::OpenAi(embedder) => embedder.distribution(),
Embedder::UserProvided(_embedder) => None,
Embedder::Ollama(embedder) => embedder.distribution(),
Embedder::UserProvided(embedder) => embedder.distribution(),
Embedder::Rest(embedder) => embedder.distribution(),
}
}
}
#[derive(Debug, Clone, Copy)]
/// Describes the mean and sigma of distribution of embedding similarity in the embedding space.
///
/// The intended use is to make the similarity score more comparable to the regular ranking score.
/// This makes it possible to correct effects where results are too "packed" around a certain value.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Deserialize, Serialize)]
#[serde(from = "DistributionShiftSerializable")]
#[serde(into = "DistributionShiftSerializable")]
pub struct DistributionShift {
pub current_mean: f32,
pub current_sigma: f32,
/// Value where the results are "packed".
///
/// Similarity scores are translated so that they are packed around 0.5 instead.
pub current_mean: OrderedFloat<f32>,
/// Standard deviation of a similarity score.
///
/// Set below 0.4 to make the results less packed around the mean, and above 0.4 to make them more packed.
pub current_sigma: OrderedFloat<f32>,
}
impl<E> Deserr<E> for DistributionShift
where
E: DeserializeError,
{
fn deserialize_from_value<V: deserr::IntoValue>(
value: deserr::Value<V>,
location: deserr::ValuePointerRef,
) -> Result<Self, E> {
let value = DistributionShiftSerializable::deserialize_from_value(value, location)?;
if value.mean < 0. || value.mean > 1. {
return Err(deserr::take_cf_content(E::error::<std::convert::Infallible>(
None,
deserr::ErrorKind::Unexpected {
msg: format!(
"the distribution mean must be in the range [0, 1], got {}",
value.mean
),
},
location,
)));
}
if value.sigma <= 0. || value.sigma > 1. {
return Err(deserr::take_cf_content(E::error::<std::convert::Infallible>(
None,
deserr::ErrorKind::Unexpected {
msg: format!(
"the distribution sigma must be in the range ]0, 1], got {}",
value.sigma
),
},
location,
)));
}
Ok(value.into())
}
}
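Review note: the `Deserr` implementation accepts a mean in `[0, 1]` and a sigma in `]0, 1]`. The same bounds can be sketched without `deserr`, assuming a free function named `validate_distribution` (hypothetical) that returns the error message as a `String`:

```rust
/// Checks the same bounds as the deserializer in the diff.
fn validate_distribution(mean: f32, sigma: f32) -> Result<(f32, f32), String> {
    // The mean may take both endpoints of [0, 1].
    if mean < 0.0 || mean > 1.0 {
        return Err(format!("the distribution mean must be in the range [0, 1], got {mean}"));
    }
    // Sigma excludes 0, since the shift later divides by it.
    if sigma <= 0.0 || sigma > 1.0 {
        return Err(format!("the distribution sigma must be in the range ]0, 1], got {sigma}"));
    }
    Ok((mean, sigma))
}
```

Rejecting `sigma == 0` at deserialization time keeps the later division in `shift` well defined.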
#[derive(Serialize, Deserialize, Deserr)]
#[serde(deny_unknown_fields)]
#[deserr(deny_unknown_fields)]
struct DistributionShiftSerializable {
mean: f32,
sigma: f32,
}
impl From<DistributionShift> for DistributionShiftSerializable {
fn from(
DistributionShift {
current_mean: OrderedFloat(current_mean),
current_sigma: OrderedFloat(current_sigma),
}: DistributionShift,
) -> Self {
Self { mean: current_mean, sigma: current_sigma }
}
}
impl From<DistributionShiftSerializable> for DistributionShift {
fn from(DistributionShiftSerializable { mean, sigma }: DistributionShiftSerializable) -> Self {
Self { current_mean: OrderedFloat(mean), current_sigma: OrderedFloat(sigma) }
}
}
impl DistributionShift {
@ -230,11 +398,13 @@ impl DistributionShift {
if sigma <= 0.0 {
None
} else {
Some(Self { current_mean: mean, current_sigma: sigma })
Some(Self { current_mean: OrderedFloat(mean), current_sigma: OrderedFloat(sigma) })
}
}
pub fn shift(&self, score: f32) -> f32 {
let current_mean = self.current_mean.0;
let current_sigma = self.current_sigma.0;
// <https://math.stackexchange.com/a/2894689>
// We're somewhat abusively mapping the distribution of distances to a gaussian.
// The parameters we're given is the mean and sigma of the native result distribution.
@ -244,9 +414,9 @@ impl DistributionShift {
let target_sigma = 0.4;
// a^2 sig1^2 = sig2^2 => a^2 = sig2^2 / sig1^2 => a = sig2 / sig1, assuming a, sig1, and sig2 positive.
let factor = target_sigma / self.current_sigma;
let factor = target_sigma / current_sigma;
// a*mu1 + b = mu2 => b = mu2 - a*mu1
let offset = target_mean - (factor * self.current_mean);
let offset = target_mean - (factor * current_mean);
let mut score = factor * score + offset;
@@ -262,6 +432,7 @@ impl DistributionShift {
}
}
/// Whether CUDA is supported in this version of Meilisearch.
pub const fn is_cuda_enabled() -> bool {
cfg!(feature = "cuda")
}
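
Note (illustrative, not part of the diff): the hunks above validate a distribution (mean in [0, 1], sigma in ]0, 1]) and then rescale a score with an affine map, `factor = target_sigma / current_sigma` and `offset = target_mean - factor * current_mean`. The sketch below restates that arithmetic on its own. The type `Shift`, the method `validated`, the explicit `target_mean`/`target_sigma` parameters, and the final clamp to [0, 1] are assumptions made for the example: the diff only shows `target_sigma = 0.4`, and the lines that define `target_mean` and post-process the score fall outside the visible hunks. Plain `f32` fields stand in for `OrderedFloat` so that no external crate is needed.

```rust
/// Hypothetical stand-in for the diff's `DistributionShift`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Shift {
    pub mean: f32,
    pub sigma: f32,
}

impl Shift {
    /// Applies the same bounds as the deserialization code in the diff:
    /// mean in [0, 1] and sigma in ]0, 1].
    pub fn validated(mean: f32, sigma: f32) -> Result<Self, String> {
        if !(0.0..=1.0).contains(&mean) {
            return Err(format!(
                "the distribution mean must be in the range [0, 1], got {mean}"
            ));
        }
        if sigma <= 0.0 || sigma > 1.0 {
            return Err(format!(
                "the distribution sigma must be in the range ]0, 1], got {sigma}"
            ));
        }
        Ok(Self { mean, sigma })
    }

    /// Maps `score` from the native distribution to the target one.
    /// a = sig2 / sig1 and b = mu2 - a * mu1, as in the diff's comments.
    /// The clamp to [0, 1] is an assumption of this sketch.
    pub fn shift(&self, score: f32, target_mean: f32, target_sigma: f32) -> f32 {
        let factor = target_sigma / self.sigma;
        let offset = target_mean - factor * self.mean;
        (factor * score + offset).clamp(0.0, 1.0)
    }
}
```

With a native distribution of mean 0.90 and sigma 0.08 (the values the diff lists for `TextEmbeddingAda002`), a score equal to the native mean lands on whatever target mean is passed in, and scores far from the mean saturate at the clamp bounds.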

milli/src/vector/ollama.rs (new file, 101 lines)

@@ -0,0 +1,101 @@
use rayon::iter::{IntoParallelIterator as _, ParallelIterator as _};
use super::error::{EmbedError, EmbedErrorKind, NewEmbedderError, NewEmbedderErrorKind};
use super::rest::{Embedder as RestEmbedder, EmbedderOptions as RestEmbedderOptions};
use super::{DistributionShift, Embeddings};
#[derive(Debug)]
pub struct Embedder {
rest_embedder: RestEmbedder,
}
#[derive(Debug, Clone, Hash, PartialEq, Eq, serde::Deserialize, serde::Serialize)]
pub struct EmbedderOptions {
pub embedding_model: String,
pub url: Option<String>,
pub api_key: Option<String>,
pub distribution: Option<DistributionShift>,
}
impl EmbedderOptions {
pub fn with_default_model(api_key: Option<String>, url: Option<String>) -> Self {
Self { embedding_model: "nomic-embed-text".into(), api_key, url, distribution: None }
}
}
impl Embedder {
pub fn new(options: EmbedderOptions) -> Result<Self, NewEmbedderError> {
let model = options.embedding_model.as_str();
let rest_embedder = match RestEmbedder::new(RestEmbedderOptions {
api_key: options.api_key,
dimensions: None,
distribution: options.distribution,
url: options.url.unwrap_or_else(get_ollama_path),
query: serde_json::json!({
"model": model,
}),
input_field: vec!["prompt".to_owned()],
path_to_embeddings: Default::default(),
embedding_object: vec!["embedding".to_owned()],
input_type: super::rest::InputType::Text,
}) {
Ok(embedder) => embedder,
Err(NewEmbedderError {
kind:
NewEmbedderErrorKind::CouldNotDetermineDimension(EmbedError {
kind: super::error::EmbedErrorKind::RestOtherStatusCode(404, error),
fault: _,
}),
fault: _,
}) => {
return Err(NewEmbedderError::could_not_determine_dimension(
EmbedError::ollama_model_not_found(error),
))
}
Err(error) => return Err(error),
};
Ok(Self { rest_embedder })
}
pub fn embed(&self, texts: Vec<String>) -> Result<Vec<Embeddings<f32>>, EmbedError> {
match self.rest_embedder.embed(texts) {
Ok(embeddings) => Ok(embeddings),
Err(EmbedError { kind: EmbedErrorKind::RestOtherStatusCode(404, error), fault: _ }) => {
Err(EmbedError::ollama_model_not_found(error))
}
Err(error) => Err(error),
}
}
pub fn embed_chunks(
&self,
text_chunks: Vec<Vec<String>>,
threads: &rayon::ThreadPool,
) -> Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
threads.install(move || {
text_chunks.into_par_iter().map(move |chunk| self.embed(chunk)).collect()
})
}
pub fn chunk_count_hint(&self) -> usize {
self.rest_embedder.chunk_count_hint()
}
pub fn prompt_count_in_chunk_hint(&self) -> usize {
self.rest_embedder.prompt_count_in_chunk_hint()
}
pub fn dimensions(&self) -> usize {
self.rest_embedder.dimensions()
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.rest_embedder.distribution()
}
}
fn get_ollama_path() -> String {
// Important: Hostname not enough, has to be entire path to embeddings endpoint
std::env::var("MEILI_OLLAMA_URL").unwrap_or("http://localhost:11434/api/embeddings".to_string())
}
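
Note (illustrative, not part of the diff): the Ollama embedder picks its endpoint in three steps: the `url` option if set, otherwise the `MEILI_OLLAMA_URL` environment variable, otherwise a localhost default that includes the full `/api/embeddings` path. The sketch below isolates that precedence. The function names and the choice to pass the environment value as a parameter are assumptions made so that the logic can be checked without touching the process environment.

```rust
/// Default endpoint, copied from the diff. The path is part of the URL:
/// a bare hostname is not enough.
const DEFAULT_OLLAMA_URL: &str = "http://localhost:11434/api/embeddings";

/// Hypothetical helper: `configured` plays the role of `EmbedderOptions::url`,
/// `env_value` the content of `MEILI_OLLAMA_URL` if it is set.
pub fn resolve_ollama_url(configured: Option<String>, env_value: Option<String>) -> String {
    configured
        .or(env_value)
        .unwrap_or_else(|| DEFAULT_OLLAMA_URL.to_string())
}

/// Same precedence, reading the real environment variable.
pub fn resolve_ollama_url_from_env(configured: Option<String>) -> String {
    resolve_ollama_url(configured, std::env::var("MEILI_OLLAMA_URL").ok())
}
```

Using `unwrap_or_else` here only defers allocating the default string until it is needed; the resulting URL is the same as with the diff's `unwrap_or`.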


@@ -1,23 +1,47 @@
use std::fmt::Display;
use reqwest::StatusCode;
use serde::{Deserialize, Serialize};
use ordered_float::OrderedFloat;
use rayon::iter::{IntoParallelIterator, ParallelIterator as _};
use super::error::{EmbedError, NewEmbedderError};
use super::{DistributionShift, Embedding, Embeddings};
#[derive(Debug)]
pub struct Embedder {
headers: reqwest::header::HeaderMap,
tokenizer: tiktoken_rs::CoreBPE,
options: EmbedderOptions,
}
use super::rest::{Embedder as RestEmbedder, EmbedderOptions as RestEmbedderOptions};
use super::{DistributionShift, Embeddings};
use crate::vector::error::EmbedErrorKind;
#[derive(Debug, Clone, Hash, PartialEq, Eq, serde::Deserialize, serde::Serialize)]
pub struct EmbedderOptions {
pub api_key: Option<String>,
pub embedding_model: EmbeddingModel,
pub dimensions: Option<usize>,
pub distribution: Option<DistributionShift>,
}
impl EmbedderOptions {
pub fn dimensions(&self) -> usize {
if self.embedding_model.supports_overriding_dimensions() {
self.dimensions.unwrap_or(self.embedding_model.default_dimensions())
} else {
self.embedding_model.default_dimensions()
}
}
pub fn query(&self) -> serde_json::Value {
let model = self.embedding_model.name();
let mut query = serde_json::json!({
"model": model,
});
if self.embedding_model.supports_overriding_dimensions() {
if let Some(dimensions) = self.dimensions {
query["dimensions"] = dimensions.into();
}
}
query
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.distribution.or(self.embedding_model.distribution())
}
}
#[derive(
@@ -92,15 +116,18 @@ impl EmbeddingModel {
fn distribution(&self) -> Option<DistributionShift> {
match self {
EmbeddingModel::TextEmbeddingAda002 => {
Some(DistributionShift { current_mean: 0.90, current_sigma: 0.08 })
}
EmbeddingModel::TextEmbedding3Large => {
Some(DistributionShift { current_mean: 0.70, current_sigma: 0.1 })
}
EmbeddingModel::TextEmbedding3Small => {
Some(DistributionShift { current_mean: 0.75, current_sigma: 0.1 })
}
EmbeddingModel::TextEmbeddingAda002 => Some(DistributionShift {
current_mean: OrderedFloat(0.90),
current_sigma: OrderedFloat(0.08),
}),
EmbeddingModel::TextEmbedding3Large => Some(DistributionShift {
current_mean: OrderedFloat(0.70),
current_sigma: OrderedFloat(0.1),
}),
EmbeddingModel::TextEmbedding3Small => Some(DistributionShift {
current_mean: OrderedFloat(0.75),
current_sigma: OrderedFloat(0.1),
}),
}
}
@@ -117,410 +144,123 @@ pub const OPENAI_EMBEDDINGS_URL: &str = "https://api.openai.com/v1/embeddings";
impl EmbedderOptions {
pub fn with_default_model(api_key: Option<String>) -> Self {
Self { api_key, embedding_model: Default::default(), dimensions: None }
Self { api_key, embedding_model: Default::default(), dimensions: None, distribution: None }
}
pub fn with_embedding_model(api_key: Option<String>, embedding_model: EmbeddingModel) -> Self {
Self { api_key, embedding_model, dimensions: None }
Self { api_key, embedding_model, dimensions: None, distribution: None }
}
}
impl Embedder {
pub fn new_client(&self) -> Result<reqwest::Client, EmbedError> {
reqwest::ClientBuilder::new()
.default_headers(self.headers.clone())
.build()
.map_err(EmbedError::openai_initialize_web_client)
}
pub fn new(options: EmbedderOptions) -> Result<Self, NewEmbedderError> {
let mut headers = reqwest::header::HeaderMap::new();
let mut inferred_api_key = Default::default();
let api_key = options.api_key.as_ref().unwrap_or_else(|| {
inferred_api_key = infer_api_key();
&inferred_api_key
});
headers.insert(
reqwest::header::AUTHORIZATION,
reqwest::header::HeaderValue::from_str(&format!("Bearer {}", api_key))
.map_err(NewEmbedderError::openai_invalid_api_key_format)?,
);
headers.insert(
reqwest::header::CONTENT_TYPE,
reqwest::header::HeaderValue::from_static("application/json"),
);
// looking at the code it is very unclear that this can actually fail.
let tokenizer = tiktoken_rs::cl100k_base().unwrap();
Ok(Self { options, headers, tokenizer })
}
pub async fn embed(
&self,
texts: Vec<String>,
client: &reqwest::Client,
) -> Result<Vec<Embeddings<f32>>, EmbedError> {
let mut tokenized = false;
for attempt in 0..7 {
let result = if tokenized {
self.try_embed_tokenized(&texts, client).await
} else {
self.try_embed(&texts, client).await
};
let retry_duration = match result {
Ok(embeddings) => return Ok(embeddings),
Err(retry) => {
tracing::warn!("Failed: {}", retry.error);
tokenized |= retry.must_tokenize();
retry.into_duration(attempt)
}
}?;
let retry_duration = retry_duration.min(std::time::Duration::from_secs(60)); // don't wait more than a minute
tracing::warn!(
"Attempt #{}, retrying after {}ms.",
attempt,
retry_duration.as_millis()
);
tokio::time::sleep(retry_duration).await;
}
let result = if tokenized {
self.try_embed_tokenized(&texts, client).await
} else {
self.try_embed(&texts, client).await
};
result.map_err(Retry::into_error)
}
async fn check_response(response: reqwest::Response) -> Result<reqwest::Response, Retry> {
if !response.status().is_success() {
match response.status() {
StatusCode::UNAUTHORIZED => {
let error_response: OpenAiErrorResponse = response
.json()
.await
.map_err(EmbedError::openai_unexpected)
.map_err(Retry::retry_later)?;
return Err(Retry::give_up(EmbedError::openai_auth_error(
error_response.error,
)));
}
StatusCode::TOO_MANY_REQUESTS => {
let error_response: OpenAiErrorResponse = response
.json()
.await
.map_err(EmbedError::openai_unexpected)
.map_err(Retry::retry_later)?;
return Err(Retry::rate_limited(EmbedError::openai_too_many_requests(
error_response.error,
)));
}
StatusCode::INTERNAL_SERVER_ERROR
| StatusCode::BAD_GATEWAY
| StatusCode::SERVICE_UNAVAILABLE => {
let error_response: Result<OpenAiErrorResponse, _> = response.json().await;
return Err(Retry::retry_later(EmbedError::openai_internal_server_error(
error_response.ok().map(|error_response| error_response.error),
)));
}
StatusCode::BAD_REQUEST => {
// Most probably, one text contained too many tokens
let error_response: OpenAiErrorResponse = response
.json()
.await
.map_err(EmbedError::openai_unexpected)
.map_err(Retry::retry_later)?;
tracing::warn!("OpenAI: received `BAD_REQUEST`. Input was maybe too long, retrying on tokenized version. For best performance, limit the size of your prompt.");
return Err(Retry::retry_tokenized(EmbedError::openai_too_many_tokens(
error_response.error,
)));
}
code => {
return Err(Retry::retry_later(EmbedError::openai_unhandled_status_code(
code.as_u16(),
)));
}
}
}
Ok(response)
}
async fn try_embed<S: AsRef<str> + serde::Serialize>(
&self,
texts: &[S],
client: &reqwest::Client,
) -> Result<Vec<Embeddings<f32>>, Retry> {
for text in texts {
tracing::trace!("Received prompt: {}", text.as_ref())
}
let request = OpenAiRequest {
model: self.options.embedding_model.name(),
input: texts,
dimensions: self.overriden_dimensions(),
};
let response = client
.post(OPENAI_EMBEDDINGS_URL)
.json(&request)
.send()
.await
.map_err(EmbedError::openai_network)
.map_err(Retry::retry_later)?;
let response = Self::check_response(response).await?;
let response: OpenAiResponse = response
.json()
.await
.map_err(EmbedError::openai_unexpected)
.map_err(Retry::retry_later)?;
tracing::trace!("response: {:?}", response.data);
Ok(response
.data
.into_iter()
.map(|data| Embeddings::from_single_embedding(data.embedding))
.collect())
}
async fn try_embed_tokenized(
&self,
text: &[String],
client: &reqwest::Client,
) -> Result<Vec<Embeddings<f32>>, Retry> {
pub const OVERLAP_SIZE: usize = 200;
let mut all_embeddings = Vec::with_capacity(text.len());
for text in text {
let max_token_count = self.options.embedding_model.max_token();
let encoded = self.tokenizer.encode_ordinary(text.as_str());
let len = encoded.len();
if len < max_token_count {
all_embeddings.append(&mut self.try_embed(&[text], client).await?);
continue;
}
let mut tokens = encoded.as_slice();
let mut embeddings_for_prompt = Embeddings::new(self.dimensions());
while tokens.len() > max_token_count {
let window = &tokens[..max_token_count];
embeddings_for_prompt.push(self.embed_tokens(window, client).await?).unwrap();
tokens = &tokens[max_token_count - OVERLAP_SIZE..];
}
// end of text
embeddings_for_prompt.push(self.embed_tokens(tokens, client).await?).unwrap();
all_embeddings.push(embeddings_for_prompt);
}
Ok(all_embeddings)
}
async fn embed_tokens(
&self,
tokens: &[usize],
client: &reqwest::Client,
) -> Result<Embedding, Retry> {
for attempt in 0..9 {
let duration = match self.try_embed_tokens(tokens, client).await {
Ok(embedding) => return Ok(embedding),
Err(retry) => retry.into_duration(attempt),
}
.map_err(Retry::retry_later)?;
tokio::time::sleep(duration).await;
}
self.try_embed_tokens(tokens, client)
.await
.map_err(|retry| Retry::give_up(retry.into_error()))
}
async fn try_embed_tokens(
&self,
tokens: &[usize],
client: &reqwest::Client,
) -> Result<Embedding, Retry> {
let request = OpenAiTokensRequest {
model: self.options.embedding_model.name(),
input: tokens,
dimensions: self.overriden_dimensions(),
};
let response = client
.post(OPENAI_EMBEDDINGS_URL)
.json(&request)
.send()
.await
.map_err(EmbedError::openai_network)
.map_err(Retry::retry_later)?;
let response = Self::check_response(response).await?;
let mut response: OpenAiResponse = response
.json()
.await
.map_err(EmbedError::openai_unexpected)
.map_err(Retry::retry_later)?;
Ok(response.data.pop().map(|data| data.embedding).unwrap_or_default())
}
pub fn embed_chunks(
&self,
text_chunks: Vec<Vec<String>>,
) -> Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
let rt = tokio::runtime::Builder::new_current_thread()
.enable_io()
.enable_time()
.build()
.map_err(EmbedError::openai_runtime_init)?;
let client = self.new_client()?;
rt.block_on(futures::future::try_join_all(
text_chunks.into_iter().map(|prompts| self.embed(prompts, &client)),
))
}
pub fn chunk_count_hint(&self) -> usize {
10
}
pub fn prompt_count_in_chunk_hint(&self) -> usize {
10
}
pub fn dimensions(&self) -> usize {
if self.options.embedding_model.supports_overriding_dimensions() {
self.options.dimensions.unwrap_or(self.options.embedding_model.default_dimensions())
} else {
self.options.embedding_model.default_dimensions()
}
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.options.embedding_model.distribution()
}
fn overriden_dimensions(&self) -> Option<usize> {
if self.options.embedding_model.supports_overriding_dimensions() {
self.options.dimensions
} else {
None
}
}
}
// retrying in case of failure
struct Retry {
error: EmbedError,
strategy: RetryStrategy,
}
enum RetryStrategy {
GiveUp,
Retry,
RetryTokenized,
RetryAfterRateLimit,
}
impl Retry {
fn give_up(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::GiveUp }
}
fn retry_later(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::Retry }
}
fn retry_tokenized(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::RetryTokenized }
}
fn rate_limited(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::RetryAfterRateLimit }
}
fn into_duration(self, attempt: u32) -> Result<tokio::time::Duration, EmbedError> {
match self.strategy {
RetryStrategy::GiveUp => Err(self.error),
RetryStrategy::Retry => Ok(tokio::time::Duration::from_millis((10u64).pow(attempt))),
RetryStrategy::RetryTokenized => Ok(tokio::time::Duration::from_millis(1)),
RetryStrategy::RetryAfterRateLimit => {
Ok(tokio::time::Duration::from_millis(100 + 10u64.pow(attempt)))
}
}
}
fn must_tokenize(&self) -> bool {
matches!(self.strategy, RetryStrategy::RetryTokenized)
}
fn into_error(self) -> EmbedError {
self.error
}
}
// openai api structs
#[derive(Debug, Serialize)]
struct OpenAiRequest<'a, S: AsRef<str> + serde::Serialize> {
model: &'a str,
input: &'a [S],
#[serde(skip_serializing_if = "Option::is_none")]
dimensions: Option<usize>,
}
#[derive(Debug, Serialize)]
struct OpenAiTokensRequest<'a> {
model: &'a str,
input: &'a [usize],
#[serde(skip_serializing_if = "Option::is_none")]
dimensions: Option<usize>,
}
#[derive(Debug, Deserialize)]
struct OpenAiResponse {
data: Vec<OpenAiEmbedding>,
}
#[derive(Debug, Deserialize)]
struct OpenAiErrorResponse {
error: OpenAiError,
}
#[derive(Debug, Deserialize)]
pub struct OpenAiError {
message: String,
// type: String,
code: Option<String>,
}
impl Display for OpenAiError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match &self.code {
Some(code) => write!(f, "{} ({})", self.message, code),
None => write!(f, "{}", self.message),
}
}
}
#[derive(Debug, Deserialize)]
struct OpenAiEmbedding {
embedding: Embedding,
// object: String,
// index: usize,
}
fn infer_api_key() -> String {
std::env::var("MEILI_OPENAI_API_KEY")
.or_else(|_| std::env::var("OPENAI_API_KEY"))
.unwrap_or_default()
}
#[derive(Debug)]
pub struct Embedder {
tokenizer: tiktoken_rs::CoreBPE,
rest_embedder: RestEmbedder,
options: EmbedderOptions,
}
impl Embedder {
pub fn new(options: EmbedderOptions) -> Result<Self, NewEmbedderError> {
let mut inferred_api_key = Default::default();
let api_key = options.api_key.as_ref().unwrap_or_else(|| {
inferred_api_key = infer_api_key();
&inferred_api_key
});
let rest_embedder = RestEmbedder::new(RestEmbedderOptions {
api_key: Some(api_key.clone()),
distribution: None,
dimensions: Some(options.dimensions()),
url: OPENAI_EMBEDDINGS_URL.to_owned(),
query: options.query(),
input_field: vec!["input".to_owned()],
input_type: crate::vector::rest::InputType::TextArray,
path_to_embeddings: vec!["data".to_owned()],
embedding_object: vec!["embedding".to_owned()],
})?;
// looking at the code it is very unclear that this can actually fail.
let tokenizer = tiktoken_rs::cl100k_base().unwrap();
Ok(Self { options, rest_embedder, tokenizer })
}
pub fn embed(&self, texts: Vec<String>) -> Result<Vec<Embeddings<f32>>, EmbedError> {
match self.rest_embedder.embed_ref(&texts) {
Ok(embeddings) => Ok(embeddings),
Err(EmbedError { kind: EmbedErrorKind::RestBadRequest(error), fault: _ }) => {
tracing::warn!(error=?error, "OpenAI: received `BAD_REQUEST`. Input was maybe too long, retrying on tokenized version. For best performance, limit the size of your document template.");
self.try_embed_tokenized(&texts)
}
Err(error) => Err(error),
}
}
fn try_embed_tokenized(&self, text: &[String]) -> Result<Vec<Embeddings<f32>>, EmbedError> {
pub const OVERLAP_SIZE: usize = 200;
let mut all_embeddings = Vec::with_capacity(text.len());
for text in text {
let max_token_count = self.options.embedding_model.max_token();
let encoded = self.tokenizer.encode_ordinary(text.as_str());
let len = encoded.len();
if len < max_token_count {
all_embeddings.append(&mut self.rest_embedder.embed_ref(&[text])?);
continue;
}
let mut tokens = encoded.as_slice();
let mut embeddings_for_prompt = Embeddings::new(self.dimensions());
while tokens.len() > max_token_count {
let window = &tokens[..max_token_count];
let embedding = self.rest_embedder.embed_tokens(window)?;
embeddings_for_prompt.append(embedding.into_inner()).map_err(|got| {
EmbedError::openai_unexpected_dimension(self.dimensions(), got.len())
})?;
tokens = &tokens[max_token_count - OVERLAP_SIZE..];
}
// end of text
let embedding = self.rest_embedder.embed_tokens(tokens)?;
embeddings_for_prompt.append(embedding.into_inner()).map_err(|got| {
EmbedError::openai_unexpected_dimension(self.dimensions(), got.len())
})?;
all_embeddings.push(embeddings_for_prompt);
}
Ok(all_embeddings)
}
pub fn embed_chunks(
&self,
text_chunks: Vec<Vec<String>>,
threads: &rayon::ThreadPool,
) -> Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
threads.install(move || {
text_chunks.into_par_iter().map(move |chunk| self.embed(chunk)).collect()
})
}
pub fn chunk_count_hint(&self) -> usize {
self.rest_embedder.chunk_count_hint()
}
pub fn prompt_count_in_chunk_hint(&self) -> usize {
self.rest_embedder.prompt_count_in_chunk_hint()
}
pub fn dimensions(&self) -> usize {
self.options.dimensions()
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.options.distribution()
}
}
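
Note (illustrative, not part of the diff): when a prompt is too long, `try_embed_tokenized` cuts the token sequence into windows of at most `max_token_count` tokens, each window starting `OVERLAP_SIZE` tokens before the end of the previous one, and embeds the remainder as a final window. The sketch below reproduces only the slicing. The function name `overlapping_windows` and the explicit `overlap < max_len` guard are assumptions of this example; the guard is there because the loop would not advance otherwise.

```rust
/// Splits `tokens` into windows of at most `max_len` items, where each window
/// after the first starts `overlap` items before the end of the previous one.
/// The last window holds whatever remains (at most `max_len` items).
pub fn overlapping_windows(tokens: &[usize], max_len: usize, overlap: usize) -> Vec<&[usize]> {
    // Assumption of this sketch: without this guard the loop below would never advance.
    assert!(overlap < max_len, "overlap must be smaller than the window length");

    let mut rest = tokens;
    let mut windows = Vec::new();
    while rest.len() > max_len {
        windows.push(&rest[..max_len]);
        // Step back by `overlap` so consecutive windows share some context.
        rest = &rest[max_len - overlap..];
    }
    // End of text: the remainder becomes the final window.
    windows.push(rest);
    windows
}
```

For ten tokens, a window length of 4 and an overlap of 1, this yields three windows covering tokens 0 to 3, 3 to 6 and 6 to 9. In the diff, each window is embedded separately and appended to the same `Embeddings` value for the prompt.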

milli/src/vector/rest.rs (new file, 373 lines)

@@ -0,0 +1,373 @@
use deserr::Deserr;
use rayon::iter::{IntoParallelIterator as _, ParallelIterator as _};
use serde::{Deserialize, Serialize};
use super::{
DistributionShift, EmbedError, Embedding, Embeddings, NewEmbedderError, REQUEST_PARALLELISM,
};
// retrying in case of failure
pub struct Retry {
pub error: EmbedError,
strategy: RetryStrategy,
}
pub enum RetryStrategy {
GiveUp,
Retry,
RetryTokenized,
RetryAfterRateLimit,
}
impl Retry {
pub fn give_up(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::GiveUp }
}
pub fn retry_later(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::Retry }
}
pub fn retry_tokenized(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::RetryTokenized }
}
pub fn rate_limited(error: EmbedError) -> Self {
Self { error, strategy: RetryStrategy::RetryAfterRateLimit }
}
pub fn into_duration(self, attempt: u32) -> Result<std::time::Duration, EmbedError> {
match self.strategy {
RetryStrategy::GiveUp => Err(self.error),
RetryStrategy::Retry => Ok(std::time::Duration::from_millis((10u64).pow(attempt))),
RetryStrategy::RetryTokenized => Ok(std::time::Duration::from_millis(1)),
RetryStrategy::RetryAfterRateLimit => {
Ok(std::time::Duration::from_millis(100 + 10u64.pow(attempt)))
}
}
}
pub fn must_tokenize(&self) -> bool {
matches!(self.strategy, RetryStrategy::RetryTokenized)
}
pub fn into_error(self) -> EmbedError {
self.error
}
}
#[derive(Debug)]
pub struct Embedder {
client: ureq::Agent,
options: EmbedderOptions,
bearer: Option<String>,
dimensions: usize,
}
#[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize)]
pub struct EmbedderOptions {
pub api_key: Option<String>,
pub distribution: Option<DistributionShift>,
pub dimensions: Option<usize>,
pub url: String,
pub query: serde_json::Value,
pub input_field: Vec<String>,
// path to the array of embeddings
pub path_to_embeddings: Vec<String>,
// shape of a single embedding
pub embedding_object: Vec<String>,
pub input_type: InputType,
}
impl Default for EmbedderOptions {
fn default() -> Self {
Self {
url: Default::default(),
query: Default::default(),
input_field: vec!["input".into()],
path_to_embeddings: vec!["data".into()],
embedding_object: vec!["embedding".into()],
input_type: InputType::Text,
api_key: None,
distribution: None,
dimensions: None,
}
}
}
impl std::hash::Hash for EmbedderOptions {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.api_key.hash(state);
self.distribution.hash(state);
self.dimensions.hash(state);
self.url.hash(state);
// skip hashing the query
// collisions in regular usage should be minimal,
// and the list is limited to 256 values anyway
self.input_field.hash(state);
self.path_to_embeddings.hash(state);
self.embedding_object.hash(state);
self.input_type.hash(state);
}
}
#[derive(Debug, Clone, Copy, Deserialize, Serialize, PartialEq, Eq, Hash, Deserr)]
#[serde(rename_all = "camelCase")]
#[deserr(rename_all = camelCase, deny_unknown_fields)]
pub enum InputType {
Text,
TextArray,
}
impl Embedder {
pub fn new(options: EmbedderOptions) -> Result<Self, NewEmbedderError> {
let bearer = options.api_key.as_deref().map(|api_key| format!("Bearer {api_key}"));
let client = ureq::AgentBuilder::new()
.max_idle_connections(REQUEST_PARALLELISM * 2)
.max_idle_connections_per_host(REQUEST_PARALLELISM * 2)
.build();
let dimensions = if let Some(dimensions) = options.dimensions {
dimensions
} else {
infer_dimensions(&client, &options, bearer.as_deref())?
};
Ok(Self { client, dimensions, options, bearer })
}
pub fn embed(&self, texts: Vec<String>) -> Result<Vec<Embeddings<f32>>, EmbedError> {
embed(&self.client, &self.options, self.bearer.as_deref(), texts.as_slice(), texts.len())
}
pub fn embed_ref<S>(&self, texts: &[S]) -> Result<Vec<Embeddings<f32>>, EmbedError>
where
S: AsRef<str> + Serialize,
{
embed(&self.client, &self.options, self.bearer.as_deref(), texts, texts.len())
}
pub fn embed_tokens(&self, tokens: &[usize]) -> Result<Embeddings<f32>, EmbedError> {
let mut embeddings = embed(&self.client, &self.options, self.bearer.as_deref(), tokens, 1)?;
// unwrap: guaranteed that embeddings.len() == 1, otherwise the previous line terminated in error
Ok(embeddings.pop().unwrap())
}
pub fn embed_chunks(
&self,
text_chunks: Vec<Vec<String>>,
threads: &rayon::ThreadPool,
) -> Result<Vec<Vec<Embeddings<f32>>>, EmbedError> {
threads.install(move || {
text_chunks.into_par_iter().map(move |chunk| self.embed(chunk)).collect()
})
}
pub fn chunk_count_hint(&self) -> usize {
super::REQUEST_PARALLELISM
}
pub fn prompt_count_in_chunk_hint(&self) -> usize {
match self.options.input_type {
InputType::Text => 1,
InputType::TextArray => 10,
}
}
pub fn dimensions(&self) -> usize {
self.dimensions
}
pub fn distribution(&self) -> Option<DistributionShift> {
self.options.distribution
}
}
fn infer_dimensions(
client: &ureq::Agent,
options: &EmbedderOptions,
bearer: Option<&str>,
) -> Result<usize, NewEmbedderError> {
let v = embed(client, options, bearer, ["test"].as_slice(), 1)
.map_err(NewEmbedderError::could_not_determine_dimension)?;
// unwrap: guaranteed that v.len() == 1, otherwise the previous line terminated in error
Ok(v.first().unwrap().dimension())
}
fn embed<S>(
client: &ureq::Agent,
options: &EmbedderOptions,
bearer: Option<&str>,
inputs: &[S],
expected_count: usize,
) -> Result<Vec<Embeddings<f32>>, EmbedError>
where
S: Serialize,
{
let request = client.post(&options.url);
let request =
if let Some(bearer) = bearer { request.set("Authorization", bearer) } else { request };
let request = request.set("Content-Type", "application/json");
let input_value = match options.input_type {
InputType::Text => serde_json::json!(inputs.first()),
InputType::TextArray => serde_json::json!(inputs),
};
let body = match options.input_field.as_slice() {
[] => {
// inject input in body
input_value
}
[input] => {
let mut body = options.query.clone();
body.as_object_mut()
.ok_or_else(|| {
EmbedError::rest_not_an_object(
options.query.clone(),
options.input_field.clone(),
)
})?
.insert(input.clone(), input_value);
body
}
[path @ .., input] => {
let mut body = options.query.clone();
let mut current_value = &mut body;
for component in path {
current_value = current_value
.as_object_mut()
.ok_or_else(|| {
EmbedError::rest_not_an_object(
options.query.clone(),
options.input_field.clone(),
)
})?
.entry(component.clone())
.or_insert(serde_json::json!({}));
}
current_value.as_object_mut().unwrap().insert(input.clone(), input_value);
body
}
};
for attempt in 0..7 {
let response = request.clone().send_json(&body);
let result = check_response(response);
let retry_duration = match result {
Ok(response) => return response_to_embedding(response, options, expected_count),
Err(retry) => {
tracing::warn!("Failed: {}", retry.error);
retry.into_duration(attempt)
}
}?;
let retry_duration = retry_duration.min(std::time::Duration::from_secs(60)); // don't wait more than a minute
tracing::warn!("Attempt #{}, retrying after {}ms.", attempt, retry_duration.as_millis());
std::thread::sleep(retry_duration);
}
let response = request.send_json(&body);
let result = check_response(response);
result
.map_err(Retry::into_error)
.and_then(|response| response_to_embedding(response, options, expected_count))
}
fn check_response(response: Result<ureq::Response, ureq::Error>) -> Result<ureq::Response, Retry> {
match response {
Ok(response) => Ok(response),
Err(ureq::Error::Status(code, response)) => {
let error_response: Option<String> = response.into_string().ok();
Err(match code {
401 => Retry::give_up(EmbedError::rest_unauthorized(error_response)),
429 => Retry::rate_limited(EmbedError::rest_too_many_requests(error_response)),
400 => Retry::give_up(EmbedError::rest_bad_request(error_response)),
500..=599 => {
Retry::retry_later(EmbedError::rest_internal_server_error(code, error_response))
}
402..=499 => {
Retry::give_up(EmbedError::rest_other_status_code(code, error_response))
}
_ => Retry::retry_later(EmbedError::rest_other_status_code(code, error_response)),
})
}
Err(ureq::Error::Transport(transport)) => {
Err(Retry::retry_later(EmbedError::rest_network(transport)))
}
}
}
fn response_to_embedding(
response: ureq::Response,
options: &EmbedderOptions,
expected_count: usize,
) -> Result<Vec<Embeddings<f32>>, EmbedError> {
let response: serde_json::Value =
response.into_json().map_err(EmbedError::rest_response_deserialization)?;
let mut current_value = &response;
for component in &options.path_to_embeddings {
let component = component.as_ref();
current_value = current_value.get(component).ok_or_else(|| {
EmbedError::rest_response_missing_embeddings(
response.clone(),
component,
&options.path_to_embeddings,
)
})?;
}
let embeddings = match options.input_type {
InputType::Text => {
for component in &options.embedding_object {
current_value = current_value.get(component).ok_or_else(|| {
EmbedError::rest_response_missing_embeddings(
response.clone(),
component,
&options.embedding_object,
)
})?;
}
let embeddings = current_value.to_owned();
let embeddings: Embedding =
serde_json::from_value(embeddings).map_err(EmbedError::rest_response_format)?;
vec![Embeddings::from_single_embedding(embeddings)]
}
InputType::TextArray => {
let empty = vec![];
let values = current_value.as_array().unwrap_or(&empty);
let mut embeddings: Vec<Embeddings<f32>> = Vec::with_capacity(expected_count);
for value in values {
let mut current_value = value;
for component in &options.embedding_object {
current_value = current_value.get(component).ok_or_else(|| {
EmbedError::rest_response_missing_embeddings(
response.clone(),
component,
&options.embedding_object,
)
})?;
}
let embedding = current_value.to_owned();
let embedding: Embedding =
serde_json::from_value(embedding).map_err(EmbedError::rest_response_format)?;
embeddings.push(Embeddings::from_single_embedding(embedding));
}
embeddings
}
};
if embeddings.len() != expected_count {
return Err(EmbedError::rest_response_embedding_count(expected_count, embeddings.len()));
}
Ok(embeddings)
}
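
Note (illustrative, not part of the diff): the REST embedder classifies HTTP failures in `check_response` and derives a wait time from the attempt number in `Retry::into_duration`, which the retry loop then caps at 60 seconds. The sketch below combines the two steps in a small, dependency-free form. `Strategy`, `classify` and `backoff` are names invented for this example, errors are reduced to bare status codes, and the `RetryTokenized` variant is left out because `check_response` in this file never produces it.

```rust
use std::time::Duration;

/// Hypothetical, reduced version of the diff's `RetryStrategy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    GiveUp,
    Retry,
    RetryAfterRateLimit,
}

/// Mirrors the status-code arms of `check_response` in the diff.
pub fn classify(status: u16) -> Strategy {
    match status {
        400 | 401 => Strategy::GiveUp,
        429 => Strategy::RetryAfterRateLimit,
        500..=599 => Strategy::Retry,
        // Remaining client errors are not worth retrying.
        402..=499 => Strategy::GiveUp,
        _ => Strategy::Retry,
    }
}

/// Wait time before the next attempt, or `None` when retrying is pointless.
/// Exponential in base 10 (1 ms, 10 ms, 100 ms, ...), capped at one minute.
pub fn backoff(strategy: Strategy, attempt: u32) -> Option<Duration> {
    let cap = Duration::from_secs(60);
    let wait = match strategy {
        Strategy::GiveUp => return None,
        Strategy::Retry => Duration::from_millis(10u64.pow(attempt)),
        Strategy::RetryAfterRateLimit => Duration::from_millis(100 + 10u64.pow(attempt)),
    };
    Some(wait.min(cap))
}
```

With seven retry attempts (0 through 6), the uncapped wait for the last one would be 10^6 ms, so the 60-second cap only comes into play from attempt 5 onward.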

Some files were not shown because too many files have changed in this diff