Commit Graph

224 Commits

Author SHA1 Message Date
9ac981d025 Remove some clippy type complexity warns by deboxing iters 2023-02-20 13:52:27 +01:00
74859ecd61 Add min and max facet stats 2023-02-20 13:52:27 +01:00
8ae441a4db Update usage of iterators 2023-02-20 13:52:27 +01:00
6fa877efb0 Fix attributes set candidates 2023-02-13 17:49:52 +01:00
fbec48f56e Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
20f05efb3c clippy: needless_lifetimes 2023-01-31 11:12:59 +01:00
3296cf7ae6 clippy: remove needless lifetimes 2023-01-31 09:32:40 +01:00
1d507c84b2 Fix the formatting 2023-01-17 18:25:55 +01:00
1b78231e18 Make clippy happy 2023-01-17 18:25:54 +01:00
02fd06ea0b Integrate deserr 2023-01-11 13:56:47 +01:00
c3f4835e8e Merge #733
733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec

# Pull Request

## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pr` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities (sketched in code after this list):

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pr` through the FST and query the regular proximity databases.

2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pr`.** This is done as follows:
   1. Only find the documents where `word` is in proximity to `pr` **exactly** (no derivations)
   2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8

3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.
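
As a reference, here is a minimal sketch of this case analysis. This is an editor's illustration in plain Rust, not the code of this PR: the booleans, the enum and the function names are hypothetical stand-ins for milli's internal checks.

```rust
/// Hypothetical sketch of the three cases described above (not milli's real API).
enum PrefixStrategy {
    /// Case 1: `pr` is not a common prefix. Expand it into all of its word
    /// derivations through the FST and query the regular proximity databases.
    ExpandDerivations,
    /// Case 2: `pr` is a common prefix but has no proximity-prefix entry.
    /// Match `word` in proximity to `pr` exactly (no derivations); for the
    /// other documents where both words coexist, assume a proximity >= 8.
    ExactPrefixOnly,
    /// Case 3: `pr` has an entry in the proximity prefix databases; query them.
    UsePrefixDatabases,
}

fn choose_strategy(in_prefix_cache: bool, in_proximity_prefix_db: bool) -> PrefixStrategy {
    if !in_prefix_cache {
        PrefixStrategy::ExpandDerivations
    } else if !in_proximity_prefix_db {
        PrefixStrategy::ExactPrefixOnly
    } else {
        PrefixStrategy::UsePrefixDatabases
    }
}
```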

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]
```

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query `faster pro`:
```json
[
    { "text": "I heard there is a faster but less relevant proximity criterion" }
    { "text": "I heard there is a faster proximity criterion" },
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. 

---

## Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.   

```
1. 10x 'a':
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42s       = bad, ~2x regression
	- dynamic: 1s ⟹ 870ms            = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12ms        = very good, ~20x speedup
	- dynamic: 21ms ⟹ 11ms           = good, ~2x speedup
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, ~300x speedup
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~3x speedup
	- dynamic: ~4ms ⟹ ~2ms           = no significant difference
6. 'i am getting o':
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1':
	- set-based: 3.36s ⟹ 120ms       = very good, ~30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, ~6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, ~30x speedup
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup
```

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 09:00:50 +00:00
b5df889dcb Apply review suggestions: simplify implementation of exactness criterion 2023-01-02 13:11:47 +01:00
8d36570958 Add explicit criterion impl strategy to proximity search tests 2023-01-02 10:37:01 +01:00
32c6062e65 Optimise exactness criterion
1. Cache some results between calls to next()
2. Compute the combinations of exact words more efficiently
2022-12-22 12:28:45 +01:00
f097aafa1c Add unit test for prefix handling by the proximity criterion 2022-12-22 12:08:00 +01:00
777b387dc4 Avoid a prefix-related worst-case scenario in the proximity criterion 2022-12-22 12:08:00 +01:00
339a4b0789 Make clippy happy 2022-12-21 12:49:34 +01:00
229405aeb9 Choose implementation strategy of criterion at runtime 2022-12-21 09:29:39 +01:00
5114686394 Merge #743
743: Fix finite pagination with placeholder search r=Kerollmops a=ManyTheFish

This bug is reproducible on real datasets and is hard to isolate in a simple test.

related to: https://github.com/meilisearch/meilisearch/issues/3200

poke `@curquiza` 

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-12-15 09:31:47 +00:00
3322018c06 Fix placeholder search 2022-12-14 20:09:47 +01:00
0276d5212a Merge #728
728: Add some integration tests on the sort criterion r=ManyTheFish a=loiclec

This is simply an integration test ensuring that the sort criterion works properly. 

However, only one version of the algorithm is tested here (the iterative one). To test the version that uses the facet DB, one has to manually set the `CANDIDATES_THRESHOLD` constant to `0`. I have done that and ensured that the test still succeeds. In the future, we will probably want an option to force which algorithm is used at runtime, for testing purposes.
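
For context, a minimal sketch of the threshold-based choice mentioned above. Only the constant name comes from the description; the value and the enum/function names are editor's assumptions, not milli's actual code.

```rust
// Illustrative only: the constant name is taken from the PR description,
// the value and the other names are assumptions.
const CANDIDATES_THRESHOLD: u64 = 1000;

enum SortAlgorithm {
    /// Iterate over the candidate documents directly (small candidate sets).
    Iterative,
    /// Use the facet database levels (large candidate sets).
    FacetDatabase,
}

fn choose_sort_algorithm(candidate_count: u64) -> SortAlgorithm {
    // Setting CANDIDATES_THRESHOLD to 0 forces the facet-DB version for any
    // non-empty candidate set, which is how the test above was exercised.
    if candidate_count <= CANDIDATES_THRESHOLD {
        SortAlgorithm::Iterative
    } else {
        SortAlgorithm::FacetDatabase
    }
}
```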


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-14 09:27:12 +00:00
2d8d0af1a6 Rename short name bc to ic for initial_candidates 2022-12-13 10:56:38 +01:00
80d34a4169 Fix typo in initial candidates computation 2022-12-12 19:02:48 +01:00
55724f2412 Introduce an initial candidates set that makes the difference between an exhaustive count and an estimation 2022-12-08 09:41:34 +01:00
f37c86e0b2 Add some integration tests on the sort criterion 2022-12-07 15:59:33 +01:00
70465aa5ce Execute cargo fmt 2022-11-04 08:59:58 +09:00
3009981d31 Fix clippy errors
Add clippy job

Add clippy job to CI
2022-11-04 08:58:14 +09:00
6add470805 Merge #659
659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley

## Related PR
This PR is for #673 

## What does this PR do?
- ~~add `Run Clippy` job to CI (rust.yml)~~
- apply `cargo clippy --fix` command
- fix some `cargo clippy` errors manually (but warnings still remain in tests)

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: unvalley <kirohi.code@gmail.com>
Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>
2022-11-03 15:24:38 +00:00
c965200010 Merge #664
664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2

# Pull Request

This is a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll be completing this within a few hours of opening it.

## Related issue
Fixes #661 and towards fixing meilisearch/meilisearch#2905

## What does this PR do?
- [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word (see the sketch after this list)
- [x] Update all other uses of phrase operation
- [x] Update `resolve_phrase`
- [x] Update `create_primitive_query`?
- [x] Add test
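
A minimal sketch of the representation change described in the first item above. This is an editor's illustration only: the struct, field and function names are hypothetical, not milli's exact definitions; the point is that `None` keeps the position of a removed stop word so the remaining words stay at their original offsets during phrase resolution.

```rust
// Hypothetical sketch; not milli's real types.
struct Phrase {
    // `None` marks a position that held a stop word.
    words: Vec<Option<String>>,
}

fn phrase_from_query(tokens: &[&str], is_stop_word: impl Fn(&str) -> bool) -> Phrase {
    Phrase {
        words: tokens
            .iter()
            .map(|w| if is_stop_word(w) { None } else { Some((*w).to_string()) })
            .collect(),
    }
}

fn main() {
    let stop_words = ["the", "of"];
    let phrase = phrase_from_query(
        &["the", "door", "of", "the", "house"],
        |w| stop_words.contains(&w),
    );
    // Prints: [None, Some("door"), None, None, Some("house")]
    println!("{:?}", phrase.words);
}
```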

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>
2022-10-29 13:42:52 +00:00
ecb88143f9 Run cargo fmt 2022-10-28 19:37:02 +05:30
03eb5d87c1 Only call plane_sweep on subgroups when 2 or more are present 2022-10-28 19:32:05 +05:30
d35afa0cf5 Change consecutive phrase search grouping logic
Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-10-26 23:10:48 +05:30
c7322f704c Fix cargo clippy errors
Don't apply clippy for tests for now

Fix clippy warnings of filter-parser package

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_arguments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clippy warnings in filter-parser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
811f156031 Execute cargo clippy --fix 2022-10-27 01:00:00 +09:00
af33d22f25 Consecutive is false when at least 1 stop word is surrounded by words 2022-10-26 19:09:45 +05:30
2aa11afb87 Fix panic when phrase contains only one stop word and nothing else 2022-10-26 19:09:42 +05:30
bb9ce3c5c5 Run cargo fmt 2022-10-26 19:09:03 +05:30
c8c666c6a6 Use resolve_phrase in exactness and typo criteria 2022-10-26 19:09:01 +05:30
3e190503e6 Search for closest non-stop words in proximity criteria 2022-10-26 19:08:34 +05:30
709ab3c14c Increment position even when it's a stop word in exactness criteria 2022-10-26 19:08:33 +05:30
ef13c6a5b6 Perform filter after enumerate to keep origin indices 2022-10-26 19:08:33 +05:30
62816dddde [WIP] Fix phrase search containing stop words
Fixes #661 and meilisearch/meilisearch#2905
2022-10-26 19:08:06 +05:30
54c0cf93fe Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 15:13:34 +02:00
a034a1e628 Move StrRefCodec and ByteSliceRefCodec to their own files 2022-10-26 13:47:46 +02:00
86d9f50b9c Fix bugs in incremental facet indexing with variable parameters
e.g. add one facet value incrementally with a group_size = X and then
add another one with group_size = Y

It is not actually possible to do so with the public API of milli,
but I wanted to make sure the algorithm worked well in those cases
anyway.

The bugs were found by fuzzing the code with fuzzcheck, which I've added
to milli as a conditional dev-dependency. But it can be removed later.
2022-10-26 13:47:04 +02:00
de52a9bf75 Improve documentation of some facet-related algorithms 2022-10-26 13:47:04 +02:00
9026867d17 Give same interface to bulk and incremental facet indexing types
+ cargo fmt, oops, sorry for the bad history :(
2022-10-26 13:47:04 +02:00
485a72306d Refactor facet-related codecs 2022-10-26 13:47:04 +02:00
3d145d7f48 Merge the two <facettype>_faceted_documents_ids methods into one 2022-10-26 13:47:04 +02:00
afdf87f6f7 Fix bugs in asc/desc criterion and facet indexing 2022-10-26 13:47:04 +02:00