Commit Graph

1268 Commits

Author SHA1 Message Date
Loïc Lecrenier
3794962330 Use an unstable algorithm for grenad::Sorter when possible 2022-09-13 14:49:53 +02:00
Kerollmops
d4d7c9d577 We avoid skipping errors in the indexing pipeline 2022-09-13 14:03:00 +02:00
Vincent Herlemont
8cd5200f48 Make charabia languages configurable 2022-09-08 12:21:43 +02:00
Vincent Herlemont
5e07ea79c2 Make charabia default feature optional 2022-09-07 20:54:31 +02:00
curquiza
077dcd2002 Update version for the next release (v0.33.3) in Cargo.toml files 2022-09-07 15:48:53 +00:00
Kerollmops
fe3973a51c Make sure that long words are correctly skipped 2022-09-07 15:03:32 +02:00
Kerollmops
c83c3cd796 Add a test to make sure that long words are correctly skipped 2022-09-07 14:12:36 +02:00
ManyTheFish
bf750e45a1 Fix word removal issue 2022-09-01 12:10:47 +02:00
ManyTheFish
a38608fe59 Add test mixing phrased and no-phrased words 2022-09-01 12:02:10 +02:00
ManyTheFish
97a04887a3 Update version for next release (v0.33.2) in Cargo.toml 2022-09-01 11:47:23 +02:00
bors[bot]
17d020e996 Merge #618
618: Update version for next release (v0.33.1) in Cargo.toml r=Kerollmops a=curquiza

No breaking for this release

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-08-31 10:43:45 +00:00
Clémentine Urquizar
c3363706c5 Update version for next release (v0.33.1) in Cargo.toml 2022-08-31 11:37:27 +02:00
Clément Renault
7f92116b51 Accept again integers as document ids 2022-08-31 10:56:39 +02:00
Irevoire
f6024b3269 Remove the artifacts of the past 2022-08-23 16:10:38 +02:00
bors[bot]
a79ff8a1a9 Merge #611
611: Upgrade charabia v0.6.0 r=curquiza a=ManyTheFish

# Pull Request

## What does this PR do?

- Update `log`
- Upgrade `charabia`

related to https://github.com/meilisearch/meilisearch/issues/2686


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-23 10:17:29 +00:00
Clémentine Urquizar
9ed7324995 Update version for next release (v0.33.0) 2022-08-23 11:47:48 +02:00
bors[bot]
18886dc6b7 Merge #598
598: Matching query terms policy r=Kerollmops a=ManyTheFish

## Summary

Implement several optional words strategy.

## Content

Replace `optional_words` boolean with an enum containing several term matching strategies:
```rust
pub enum TermsMatchingStrategy {
    // remove last word first
    Last,
    // remove first word first
    First,
    // remove more frequent word first
    Frequency,
    // remove smallest word first
    Size,
    // only one of the word is mandatory
    Any,
    // all words are mandatory
    All,
}
```

All strategies implemented during the prototype are kept, but only `Last` and `All` will be published by Meilisearch in the `v0.29.0` release.

## Related

spec: https://github.com/meilisearch/specifications/pull/173
prototype discussion: https://github.com/meilisearch/meilisearch/discussions/2639#discussioncomment-3447699


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-22 15:51:37 +00:00
ManyTheFish
5391e3842c replace optional_words by term_matching_strategy 2022-08-22 17:47:19 +02:00
ManyTheFish
ba5ca8a362 Upgrade charabia v0.6.0 2022-08-22 14:38:00 +02:00
Irevoire
e7624abe63 share heed between all sub-crates 2022-08-19 11:23:41 +02:00
ManyTheFish
993aa1321c Fix query tree building 2022-08-18 17:56:06 +02:00
ManyTheFish
bff9653050 Fix remove count 2022-08-18 17:36:30 +02:00
ManyTheFish
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
bors[bot]
afc10acd19 Merge #596
596: Filter operators: NOT + IN[..] r=irevoire a=loiclec

# Pull Request

## What does this PR do?
Implements the changes described in https://github.com/meilisearch/meilisearch/issues/2580
It is based on top of #556 

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-18 11:24:32 +00:00
Loïc Lecrenier
9b6602cba2 Avoid cloning FilterCondition in filter array parsing 2022-08-18 13:06:57 +02:00
Loïc Lecrenier
c51dcad51b Don't recompute filterable fields in evaluation of IN[] filter 2022-08-18 10:59:21 +02:00
Irevoire
4aae07d5f5 expose the size methods 2022-08-17 17:07:38 +02:00
Irevoire
e96b852107 bump heed 2022-08-17 17:05:50 +02:00
bors[bot]
087da5621a Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
bors[bot]
fb95e67a2a Merge #608
608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish

When we replaced or updated some documents, the indexing was skipping the replaced documents.

Related to https://github.com/meilisearch/meilisearch/issues/2672

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:38:10 +00:00
bors[bot]
e4a52e6e45 Merge #594
594: Fix(Search): Fix phrase search candidates computation r=Kerollmops a=ManyTheFish

This bug is an old bug but was hidden by the proximity criterion,
Phrase searches were always returning an empty candidates list when the proximity criterion is deactivated.

Before the fix, we were trying to find any words[n] near words[n]
instead of finding  any words[n] near words[n+1], for example:

for a phrase search '"Hello world"' we were searching for "hello" near "hello" first, instead of "hello" near "world".



Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:22:52 +00:00
ManyTheFish
8c3f1a9c39 Remove useless lifetime declaration 2022-08-17 15:20:43 +02:00
ManyTheFish
e9e2349ce6 Fix typo in comment 2022-08-17 15:09:48 +02:00
ManyTheFish
2668f841d1 Fix update indexing 2022-08-17 15:03:37 +02:00
ManyTheFish
7384650d85 Update test to showcase the bug 2022-08-17 15:03:08 +02:00
bors[bot]
39869be23b Merge #590
590: Optimise facets indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Fixes #589 

## Notes
I added documentation for the whole module which attempts to explain the shape of the databases and their purpose. However, I realise there is already some documentation about this, so I am not sure if we want to keep it.

## Benchmarks

We get a ~1.15x speed up on the geo_point benchmark.

```
group                                                                     indexing_main_57042355                  indexing_optimise-facets-indexation_5728619a
-----                                                                     ----------------------                  --------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1862.7±294.45µs        ? ?/sec    1.58      2.9±1.32ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.11      8.9±2.44ms        ? ?/sec     1.00      8.0±1.42ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.8±3.32ms        ? ?/sec     1.32     16.9±6.98ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.09     43.8±4.78ms        ? ?/sec     1.00     40.3±3.79ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.08   287.4±28.72ms        ? ?/sec     1.00    264.9±9.46ms        ? ?/sec
indexing/Indexing geo_point                                               1.14      61.2±0.39s        ? ?/sec     1.00      53.8±0.57s        ? ?/sec
indexing/Indexing movies in three batches                                 1.00      16.6±0.12s        ? ?/sec     1.00      16.5±0.10s        ? ?/sec
indexing/Indexing movies with default settings                            1.00      14.1±0.30s        ? ?/sec     1.00      14.0±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.9±0.50s        ? ?/sec     1.00      10.0±0.10s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.01       9.6±0.23s        ? ?/sec     1.00       9.5±0.06s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.07      66.3±0.55s        ? ?/sec     1.00      61.8±0.63s        ? ?/sec
indexing/Indexing songs with default settings                             1.03      58.8±0.82s        ? ?/sec     1.00      57.1±1.22s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      53.6±1.09s        ? ?/sec     1.01      54.0±0.58s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.02      58.0±1.29s        ? ?/sec     1.00      57.1±1.43s        ? ?/sec
indexing/Indexing wiki                                                    1.00   1064.1±21.20s        ? ?/sec     1.00   1068.0±20.49s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00    1182.5±9.62s        ? ?/sec     1.01   1191.2±10.96s        ? ?/sec
indexing/Reindexing geo_point                                             1.12      68.0±0.21s        ? ?/sec     1.00      60.5±0.82s        ? ?/sec
indexing/Reindexing movies with default settings                          1.01      14.1±0.21s        ? ?/sec     1.00      14.0±0.26s        ? ?/sec
indexing/Reindexing songs with default settings                           1.04      61.6±0.57s        ? ?/sec     1.00      59.2±0.87s        ? ?/sec
indexing/Reindexing wiki                                                  1.00   1734.0±11.38s        ? ?/sec     1.01   1746.6±22.48s        ? ?/sec
```


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 11:46:55 +00:00
Loïc Lecrenier
6cc975704d Add some documentation to facets.rs 2022-08-17 12:59:52 +02:00
Loïc Lecrenier
93252769af Apply review suggestions 2022-08-17 12:41:22 +02:00
Loïc Lecrenier
196f79115a Run cargo fmt 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
d10d78d520 Add integration tests for the IN filter 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
ca97cb0eda Implement the IN filter operator 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
cc7415bb31 Simplify FilterCondition code, made possible by the new NOT operator 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
44744d9e67 Implement the simplified NOT operator 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
01675771d5 Reimplement != filter to select all docids not selected by = 2022-08-17 12:28:33 +02:00
Loïc Lecrenier
258c3dd563 Make AND+OR filters n-ary (store a vector of subfilters instead of 2)
NOTE: The token_at_depth is method is a bit useless now, as the only
cases where there would be a toke at depth 1000 are the cases where
the parser already stack-overflowed earlier.

Example: (((((... (x=1) ...)))))
2022-08-17 12:28:33 +02:00
Loïc Lecrenier
39687908f1 Add documentation and comments to facets.rs 2022-08-17 12:26:49 +02:00
Loïc Lecrenier
8d4b21a005 Switch string facet levels indexation to new algo
Write the algorithm once for both numbers and strings
2022-08-17 12:26:49 +02:00
Loïc Lecrenier
cf0cd92ed4 Refactor Facets::execute to increase performance 2022-08-17 12:26:49 +02:00
bors[bot]
cd2635ccfc Merge #602
602: Use mimalloc as the default allocator r=Kerollmops a=loiclec

## What does this PR do?
Use mimalloc as the global allocator for milli's benchmarks on macOS.

## Why?
On Linux, we use jemalloc, which is a very fast allocator. But on macOS, we currently use the system allocator, which is very slow. In practice, this difference in allocator speed means that it is difficult to gain insight into milli's performance by running benchmarks locally on the Mac.

By using mimalloc, which is another excellent allocator, we reduce the speed difference between the two platforms.

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 10:26:13 +00:00
Loïc Lecrenier
78d9f0622d cargo fmt 2022-08-17 12:21:24 +02:00