Commit Graph

1062 Commits

Author SHA1 Message Date
072b576514 Fix proximity value in keys of prefix_word_pair_proximity_docids 2022-10-18 10:37:34 +02:00
6c3a5d69e1 Update snapshots 2022-10-18 10:37:34 +02:00
a7de4f5b85 Don't add swapped word pairs to the word_pair_proximity_docids db 2022-10-18 10:37:34 +02:00
264a04922d Add prefix_word_pair_proximity database
Similar to the word_prefix_pair_proximity one but instead the keys are:
(proximity, prefix, word2)
2022-10-18 10:37:34 +02:00
1dbbd8694f Rename StrStrU8Codec to U8StrStrCodec and reorder its fields 2022-10-18 10:37:34 +02:00
bdeb47305e Change encoding of word_pair_proximity DB to (proximity, word1, word2)
Same for word_prefix_pair_proximity
2022-10-18 10:37:34 +02:00
81919a35a2 Update milli/src/search/criteria/initial.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-10-17 18:23:20 +02:00
516e838eb4 Update milli/src/search/criteria/initial.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-10-17 18:23:15 +02:00
fc03e53615 Add a test to check that we can abort an indexation 2022-10-17 17:28:03 +02:00
6603437cb1 Introduce an indexation abortion function when indexing documents 2022-10-17 17:28:03 +02:00
6f55e7844c Add some code comments 2022-10-17 14:41:57 +02:00
cf203b7fde Take filter in account when computing the pages candidates 2022-10-17 14:13:44 +02:00
d71bc1e69f Compute an exact count when using distinct 2022-10-17 14:13:44 +02:00
a396806343 Add settings to force milli to exhaustively compute the total number of hits 2022-10-17 14:13:44 +02:00
beb987d3d1 Fixing piles of clippy errors.
Most of these are calling clone when the struct supports Copy.

Many are using & and &mut on `self` when the function they are called
from already has an immutable or mutable borrow so this isn't needed.

I tried to stay away from actual changes or places where I'd have to
name fresh variables.
2022-10-13 22:02:54 +02:00
f30979d021 Merge #662
662: Enhance word splitting strategy r=ManyTheFish a=akki1306

# Pull Request

## Related issue
Fixes #648 

## What does this PR do?
- [split_best_frequency](55d889522b/milli/src/search/query_tree.rs (L282-L301)) to use frequency of word pairs near together with proximity value of 1 instead of considering the frequency of individual words. Word pairs having max frequency are considered.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Akshay Kulkarni <akshayk.gj@gmail.com>
2022-10-13 08:14:22 +00:00
85f3028317 remove underscore and introduce back word_documents_count 2022-10-13 13:21:59 +05:30
8195fc6141 revert removal of word_documents_count method 2022-10-13 13:14:27 +05:30
32f825d442 move default implementation of word_pair_frequency to TestContext 2022-10-13 12:57:50 +05:30
ff8b2d4422 formatting 2022-10-13 12:44:08 +05:30
6cb8b46900 use word_pair_frequency and remove word_documents_count 2022-10-13 12:43:11 +05:30
8c9245149e format file 2022-10-12 15:27:56 +05:30
63e79a9039 update comment 2022-10-12 13:36:48 +05:30
7f9680f0a0 Enhance word splitting strategy 2022-10-12 13:18:23 +05:30
6fbf5dac68 Simplify documents! macro to reduce compile times 2022-10-12 09:22:05 +02:00
762e320c35 Add proximity calculation for the same word 2022-10-07 12:59:12 +02:00
00c02d00f3 Add missing logging timer to extractors 2022-09-30 22:17:06 +05:30
15d478cf4d Merge #635
635: Use an unstable algorithm for `grenad::Sorter` when possible r=Kerollmops a=loiclec

# Pull Request
## What does this PR do?

Use an unstable algorithm to sort the internal vector used by `grenad::Sorter` whenever possible to speed up indexing.

In practice, every time the merge function creates a `RoaringBitmap`, we use an unstable sort. For every other merge function, such as `keep_first`, `keep_last`, etc., a stable sort is used.


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-09-14 12:00:52 +00:00
3794962330 Use an unstable algorithm for grenad::Sorter when possible 2022-09-13 14:49:53 +02:00
d4d7c9d577 We avoid skipping errors in the indexing pipeline 2022-09-13 14:03:00 +02:00
fe3973a51c Make sure that long words are correctly skipped 2022-09-07 15:03:32 +02:00
c83c3cd796 Add a test to make sure that long words are correctly skipped 2022-09-07 14:12:36 +02:00
bf750e45a1 Fix word removal issue 2022-09-01 12:10:47 +02:00
a38608fe59 Add test mixing phrased and no-phrased words 2022-09-01 12:02:10 +02:00
7f92116b51 Accept again integers as document ids 2022-08-31 10:56:39 +02:00
f6024b3269 Remove the artifacts of the past 2022-08-23 16:10:38 +02:00
5391e3842c replace optional_words by term_matching_strategy 2022-08-22 17:47:19 +02:00
993aa1321c Fix query tree building 2022-08-18 17:56:06 +02:00
bff9653050 Fix remove count 2022-08-18 17:36:30 +02:00
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
afc10acd19 Merge #596
596: Filter operators: NOT + IN[..] r=irevoire a=loiclec

# Pull Request

## What does this PR do?
Implements the changes described in https://github.com/meilisearch/meilisearch/issues/2580
It is based on top of #556 

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-18 11:24:32 +00:00
9b6602cba2 Avoid cloning FilterCondition in filter array parsing 2022-08-18 13:06:57 +02:00
c51dcad51b Don't recompute filterable fields in evaluation of IN[] filter 2022-08-18 10:59:21 +02:00
4aae07d5f5 expose the size methods 2022-08-17 17:07:38 +02:00
e96b852107 bump heed 2022-08-17 17:05:50 +02:00
087da5621a Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
fb95e67a2a Merge #608
608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish

When we replaced or updated some documents, the indexing was skipping the replaced documents.

Related to https://github.com/meilisearch/meilisearch/issues/2672

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:38:10 +00:00
e4a52e6e45 Merge #594
594: Fix(Search): Fix phrase search candidates computation r=Kerollmops a=ManyTheFish

This bug is an old bug but was hidden by the proximity criterion,
Phrase searches were always returning an empty candidates list when the proximity criterion is deactivated.

Before the fix, we were trying to find any words[n] near words[n]
instead of finding  any words[n] near words[n+1], for example:

for a phrase search '"Hello world"' we were searching for "hello" near "hello" first, instead of "hello" near "world".



Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:22:52 +00:00
8c3f1a9c39 Remove useless lifetime declaration 2022-08-17 15:20:43 +02:00
e9e2349ce6 Fix typo in comment 2022-08-17 15:09:48 +02:00