Commit Graph

811 Commits

Author SHA1 Message Date
bors[bot]
087da5621a Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
bors[bot]
fb95e67a2a Merge #608
608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish

When we replaced or updated some documents, the indexing was skipping the replaced documents.

Related to https://github.com/meilisearch/meilisearch/issues/2672

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:38:10 +00:00
bors[bot]
e4a52e6e45 Merge #594
594: Fix(Search): Fix phrase search candidates computation r=Kerollmops a=ManyTheFish

This bug is an old bug but was hidden by the proximity criterion,
Phrase searches were always returning an empty candidates list when the proximity criterion is deactivated.

Before the fix, we were trying to find any words[n] near words[n]
instead of finding  any words[n] near words[n+1], for example:

for a phrase search '"Hello world"' we were searching for "hello" near "hello" first, instead of "hello" near "world".



Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:22:52 +00:00
ManyTheFish
8c3f1a9c39 Remove useless lifetime declaration 2022-08-17 15:20:43 +02:00
ManyTheFish
e9e2349ce6 Fix typo in comment 2022-08-17 15:09:48 +02:00
ManyTheFish
2668f841d1 Fix update indexing 2022-08-17 15:03:37 +02:00
ManyTheFish
7384650d85 Update test to showcase the bug 2022-08-17 15:03:08 +02:00
Loïc Lecrenier
6cc975704d Add some documentation to facets.rs 2022-08-17 12:59:52 +02:00
Loïc Lecrenier
93252769af Apply review suggestions 2022-08-17 12:41:22 +02:00
Loïc Lecrenier
39687908f1 Add documentation and comments to facets.rs 2022-08-17 12:26:49 +02:00
Loïc Lecrenier
8d4b21a005 Switch string facet levels indexation to new algo
Write the algorithm once for both numbers and strings
2022-08-17 12:26:49 +02:00
Loïc Lecrenier
cf0cd92ed4 Refactor Facets::execute to increase performance 2022-08-17 12:26:49 +02:00
Loïc Lecrenier
78d9f0622d cargo fmt 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
4f9edf13d7 Remove commented-out function 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
405555b401 Add some documentation to PrefixTrieNode 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
1bc4788e59 Remove cached Allocations struct from wpppd indexing 2022-08-17 12:18:22 +02:00
Loïc Lecrenier
ef75a77464 Fix undefined behaviour caused by reusing key from the database
New full snapshot:
---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5                a    1  [101, ]
5                a    2  [101, ]
5                am   1  [101, ]
5                b    4  [101, ]
5                be   4  [101, ]
am               a    3  [101, ]
amazing          a    1  [100, ]
amazing          a    2  [100, ]
amazing          a    3  [100, ]
amazing          an   1  [100, ]
amazing          an   2  [100, ]
amazing          b    2  [100, ]
amazing          be   2  [100, ]
an               a    1  [100, ]
an               a    2  [100, 202, ]
an               am   1  [100, ]
an               an   2  [100, ]
an               b    3  [100, ]
an               be   3  [100, ]
and              a    2  [100, ]
and              a    3  [100, ]
and              a    4  [100, ]
and              am   2  [100, ]
and              an   3  [100, ]
and              b    1  [100, ]
and              be   1  [100, ]
at               a    1  [100, 202, ]
at               a    2  [100, 101, ]
at               a    3  [100, ]
at               am   2  [100, 101, ]
at               an   1  [100, 202, ]
at               an   3  [100, ]
at               b    3  [101, ]
at               b    4  [100, ]
at               be   3  [101, ]
at               be   4  [100, ]
beautiful        a    2  [100, ]
beautiful        a    3  [100, ]
beautiful        a    4  [100, ]
beautiful        am   3  [100, ]
beautiful        an   2  [100, ]
beautiful        an   4  [100, ]
bell             a    2  [101, ]
bell             a    4  [101, ]
bell             am   4  [101, ]
extraordinary    a    2  [202, ]
extraordinary    a    3  [202, ]
extraordinary    an   2  [202, ]
house            a    3  [100, 202, ]
house            a    4  [100, 202, ]
house            am   4  [100, ]
house            an   3  [100, 202, ]
house            b    2  [100, ]
house            be   2  [100, ]
rings            a    1  [101, ]
rings            a    3  [101, ]
rings            am   3  [101, ]
rings            b    2  [101, ]
rings            be   2  [101, ]
the              a    3  [101, ]
the              b    1  [101, ]
the              be   1  [101, ]
2022-08-17 12:17:45 +02:00
Loïc Lecrenier
7309111433 Don't run block code in doc tests of word_pair_proximity_docids 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
f6f8f543e1 Run cargo fmt 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
34c991ea02 Add newlines in documentation of word_prefix_pair_proximity_docids 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
06f3fd8c6d Add more comments to WordPrefixPairProximityDocids::execute 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
474500362c Update wpppd snapshots
New snapshot (yes, it's wrong as well, it will get fixed later):

---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5                a    1  [101, ]
5                a    2  [101, ]
5                am   1  [101, ]
5                b    4  [101, ]
5                be   4  [101, ]
am               a    3  [101, ]
amazing          a    1  [100, ]
amazing          a    2  [100, ]
amazing          a    3  [100, ]
amazing          an   1  [100, ]
amazing          an   2  [100, ]
amazing          b    2  [100, ]
amazing          be   2  [100, ]
an               a    1  [100, ]
an               a    2  [100, 202, ]
an               am   1  [100, ]
an               b    3  [100, ]
an               be   3  [100, ]
and              a    2  [100, ]
and              a    3  [100, ]
and              a    4  [100, ]
and              b    1  [100, ]
and              be   1  [100, ]
                 d\0  0  [100, 202, ]
an               an   2  [100, ]
and              am   2  [100, ]
and              an   3  [100, ]
at               a    2  [100, 101, ]
at               a    3  [100, ]
at               am   2  [100, 101, ]
at               an   1  [100, 202, ]
at               an   3  [100, ]
at               b    3  [101, ]
at               b    4  [100, ]
at               be   3  [101, ]
at               be   4  [100, ]
beautiful        a    2  [100, ]
beautiful        a    3  [100, ]
beautiful        a    4  [100, ]
beautiful        am   3  [100, ]
beautiful        an   2  [100, ]
beautiful        an   4  [100, ]
bell             a    2  [101, ]
bell             a    4  [101, ]
bell             am   4  [101, ]
extraordinary    a    2  [202, ]
extraordinary    a    3  [202, ]
extraordinary    an   2  [202, ]
house            a    4  [100, 202, ]
house            a    4  [100, ]
house            am   4  [100, ]
house            an   3  [100, 202, ]
house            b    2  [100, ]
house            be   2  [100, ]
rings            a    1  [101, ]
rings            a    3  [101, ]
rings            am   3  [101, ]
rings            b    2  [101, ]
rings            be   2  [101, ]
the              a    3  [101, ]
the              b    1  [101, ]
the              be   1  [101, ]
2022-08-17 12:17:18 +02:00
Loïc Lecrenier
ea4a96761c Move content of readme for WordPrefixPairProximityDocids into the code 2022-08-17 12:05:37 +02:00
Loïc Lecrenier
220921628b Simplify and document WordPrefixPairProximityDocIds::execute 2022-08-17 11:59:19 +02:00
Loïc Lecrenier
044356d221 Optimise WordPrefixPairProximityDocIds merge operation 2022-08-17 11:59:18 +02:00
Loïc Lecrenier
d350114159 Add tests for WordPrefixPairProximityDocIds 2022-08-17 11:59:15 +02:00
Loïc Lecrenier
86807ca848 Refactor word prefix pair proximity indexation further 2022-08-17 11:59:13 +02:00
Loïc Lecrenier
306593144d Refactor word prefix pair proximity indexation 2022-08-17 11:59:00 +02:00
Loïc Lecrenier
dea00311b6 Add type annotations to remove compiler error 2022-08-16 09:19:30 +02:00
Loïc Lecrenier
6f49126223 Fix db_snap macro with inline parameter 2022-08-10 15:55:22 +02:00
Loïc Lecrenier
12920f2a4f Fix paths of snapshot tests 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
ce560fdcb5 Add documentation for db_snap! 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
748bb86b5b cargo fmt 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
051f24f674 Switch to snapshot tests for search/matches/mod.rs 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
d2e01528a6 Switch to snapshot tests for search/criteria/typo.rs 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
a9c7d82693 Switch to snapshot tests for search/criteria/attribute.rs 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
4bba2f41d7 Switch to snapshot tests for query_tree.rs 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
8ac24d3114 Cargo fmt + fix compiler warnings/error 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
6066256689 Add snapshot tests for indexing of word_prefix_pair_proximity_docids 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
3a734af159 Add snapshot tests for Facets::execute 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
b9907997e4 Remove old snapshot tests code 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
ef889ade5d Refactor snapshot tests 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
334098a7e0 Add index snapshot test helper function 2022-08-10 15:53:46 +02:00
ManyTheFish
b389be48a0 Factorize phrase computation 2022-08-08 10:37:31 +02:00
Loïc Lecrenier
58cb1c1bda Simplify unit tests in facet/filter.rs 2022-08-04 12:03:44 +02:00
Loïc Lecrenier
acff17fb88 Simplify indexing tests 2022-08-04 12:03:13 +02:00
bors[bot]
21284cf235 Merge #556
556: Add EXISTS filter r=loiclec a=loiclec

## What does this PR do?

Fixes issue [#2484](https://github.com/meilisearch/meilisearch/issues/2484) in the meilisearch repo.

It creates a `field EXISTS` filter which selects all documents containing the `field` key. 
For example, with the following documents:
```json
[{
	"id": 0,
	"colour": []
},
{
	"id": 1,
	"colour": ["blue", "green"]
},
{
	"id": 2,
	"colour": 145238
},
{
	"id": 3,
	"colour": null
},
{
	"id": 4,
	"colour": {
		"green": []
	}
},
{
	"id": 5,
	"colour": {}
},
{
	"id": 6
}]
```
Then the filter `colour EXISTS` selects the ids `[0, 1, 2, 3, 4, 5]`. The filter `colour NOT EXISTS` selects `[6]`.

## Details
There is a new database named `facet-id-exists-docids`. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.

To create this database, the indexing part of milli had to be adapted. The implementation there is basically copy/pasted from the code handling the `facet-id-f64-docids` database, with appropriate modifications in place.

There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:
```json
{
    "id": 0,
    "colour": [],
    "size": {}
}
```
would be flattened to:
```json
{
    "id": 0
}
```
prior to being given to the extraction pipeline.

This transformation would lose the information that is needed to populate the `facet-id-exists-docids` database. Therefore, I have also changed the implementation of the `flatten-serde-json` crate. Now, as it traverses the Json, it keeps track of which key was encountered. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as value. For example:
```json
{
    "id": 0,
    "colour": {
        "green": [],
        "blue": 1
    },
    "size": {}
} 
```
becomes
```json
{
    "id": 0,
    "colour": [],
    "colour.green": [],
    "colour.blue": 1,
    "size": []
} 
```


Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-08-04 09:46:06 +00:00
bors[bot]
50f6524ff2 Merge #579
579: Stop reindexing already indexed documents r=ManyTheFish a=irevoire

```
 % ./compare.sh indexing_stop-reindexing-unchanged-documents_cb5a1669.json indexing_main_eeba1960.json
group                                                                     indexing_main_eeba1960                 indexing_stop-reindexing-unchanged-documents_cb5a1669
-----                                                                     ----------------------                 -----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.03      2.0±0.22ms        ? ?/sec    1.00  1955.4±336.24µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.08     11.0±2.93ms        ? ?/sec    1.00     10.2±4.04ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     15.1±3.89ms        ? ?/sec    1.14     17.1±5.18ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.26    59.2±12.01ms        ? ?/sec    1.00     47.1±8.52ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.08   316.6±31.53ms        ? ?/sec    1.00   293.6±17.00ms        ? ?/sec
indexing/Indexing geo_point                                               1.01      60.9±0.31s        ? ?/sec    1.00      60.6±0.36s        ? ?/sec
indexing/Indexing movies in three batches                                 1.04      20.0±0.30s        ? ?/sec    1.00      19.2±0.25s        ? ?/sec
indexing/Indexing movies with default settings                            1.02      19.1±0.18s        ? ?/sec    1.00      18.7±0.24s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.02      26.2±0.29s        ? ?/sec    1.00      25.9±0.22s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.02      25.3±0.32s        ? ?/sec    1.00      24.7±0.26s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      66.7±0.41s        ? ?/sec    1.01      67.1±0.86s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      58.3±0.90s        ? ?/sec    1.01      58.8±1.32s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      54.5±1.43s        ? ?/sec    1.01      55.2±1.29s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      57.9±1.20s        ? ?/sec    1.01      58.4±0.93s        ? ?/sec
indexing/Indexing wiki                                                    1.00   1052.0±10.95s        ? ?/sec    1.02   1069.4±20.38s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00    1193.1±8.83s        ? ?/sec    1.00    1189.5±9.40s        ? ?/sec
indexing/Reindexing geo_point                                             3.22      67.5±0.73s        ? ?/sec    1.00      21.0±0.16s        ? ?/sec
indexing/Reindexing movies with default settings                          3.75      19.4±0.28s        ? ?/sec    1.00       5.2±0.05s        ? ?/sec
indexing/Reindexing songs with default settings                           8.90      61.4±0.91s        ? ?/sec    1.00       6.9±0.07s        ? ?/sec
indexing/Reindexing wiki                                                  1.00   1748.2±35.68s        ? ?/sec    1.00   1750.5±18.53s        ? ?/sec
```

tldr: We do not lose any performance on the normal indexing benchmark, but we get between 3 and 8 times faster on the reindexing benchmarks 👍 

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-08-04 08:10:37 +00:00
ManyTheFish
d6f9a60a32 fix: Remove whitespace trimming during document id validation
fix #592
2022-08-03 11:38:40 +02:00
Tamo
7fc35c5586 remove the useless prints 2022-08-02 10:31:22 +02:00