354 Commits

SHA1 Message Date
4f3ce6d9cd nested fields 2022-04-07 16:58:46 +02:00
b799f3326b rename merge_nothing to merge_ignore_values 2022-04-05 18:44:35 +02:00
201fea0fda limit extract_word_docids memory usage 2022-04-05 14:14:15 +02:00
b85cd4983e remove field_id_from_position 2022-04-05 09:50:34 +02:00
b7694c34f5 remove println 2022-04-04 21:00:07 +02:00
6cabd47c32 fix typo in comment 2022-04-04 20:59:20 +02:00
6b2c2509b2 fix bug in exact search 2022-04-04 20:54:03 +02:00
e8f06f6c06 extract exact_word_prefix_docids 2022-04-04 20:54:03 +02:00
ba0bb29cd8 refactor WordPrefixDocids to take dbs instead of indexes 2022-04-04 20:54:02 +02:00
c4c6e35352 query exact_word_docids in resolve_query_tree 2022-04-04 20:54:02 +02:00
8d46a5b0b5 extract exact word docids 2022-04-04 20:54:02 +02:00
0a77be4ec0 introduce exact_word_docids db 2022-04-04 20:54:02 +02:00
5f9f82757d refactor spawn_extraction_task 2022-04-04 20:54:02 +02:00
d5b8b5a2f8 Replace the ugly unwraps by clean if let Somes 2022-02-28 16:31:33 +01:00
8d26f3040c Remove a useless grenad file merging 2022-02-28 16:31:33 +01:00
04b1bbf932 Reintroduce appending sorted entries when possible 2022-02-24 14:50:45 +01:00
25123af3b8 Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids`, and `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs far fewer CPU cycles, an era where this update structure can use the previously computed set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked to. This structure specifically computes the list of word prefix pair proximities, which corresponds to a list of pairs of words associated with a proximity (the distance between both words) where the second word is not a word but a prefix, e.g. `s`, `se`, `a`. Each word prefix pair proximity is associated with the list of document ids that contain the word and prefix pair at the given proximity.
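
To make the shape of this data concrete, here is a minimal in-memory sketch. It is not the code of this PR: the type aliases and the `word_prefix_pair_proximity_docids` function are hypothetical names, and plain `BTreeMap`/`BTreeSet` collections stand in for the LMDB databases and the real document id sets.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (first word, second word, proximity) -> document ids.
/// Hypothetical stand-in for the word pair proximity database.
type WordPairProximityDocids = BTreeMap<(String, String, u8), BTreeSet<u32>>;

/// (first word, prefix, proximity) -> document ids.
/// Hypothetical stand-in for the word prefix pair proximity database.
type WordPrefixPairProximityDocids = BTreeMap<(String, String, u8), BTreeSet<u32>>;

/// For every word pair whose second word starts with one of the prefixes,
/// merge its document ids into the (first word, prefix, proximity) entry.
fn word_prefix_pair_proximity_docids(
    word_pairs: &WordPairProximityDocids,
    prefixes: &[&str],
) -> WordPrefixPairProximityDocids {
    let mut output = WordPrefixPairProximityDocids::new();
    for ((first, second, proximity), docids) in word_pairs {
        for &prefix in prefixes {
            if second.starts_with(prefix) {
                output
                    .entry((first.clone(), prefix.to_string(), *proximity))
                    .or_default()
                    .extend(docids.iter().copied());
            }
        }
    }
    output
}
```

With the prefixes `s`, `se`, `a`, the pairs `(hello, search, 1)` and `(hello, sea, 1)` would both contribute their document ids to the `(hello, se, 1)` entry.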

The performance issue that this struct brings comes from the fact that it starts its job from the beginning: it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.
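
The difference between the two strategies can be sketched as follows. This is a simplified illustration, not the implementation in this PR: it uses word prefix docids rather than pair proximities, the names `DocidsByKey`, `merge_batch` and `rebuild_from_scratch` are hypothetical, in-memory maps replace LMDB, and it assumes the set of prefixes does not change between batches (the real update also has to handle new and deleted prefixes).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// key (a word or a prefix) -> document ids. Hypothetical in-memory stand-in.
type DocidsByKey = BTreeMap<String, BTreeSet<u32>>;

/// Incremental approach: only the word docids of the newly indexed batch
/// are merged into the existing prefix entries.
fn merge_batch(prefix_docids: &mut DocidsByKey, word_docids: &DocidsByKey, prefixes: &[&str]) {
    for (word, docids) in word_docids {
        for &prefix in prefixes {
            if word.starts_with(prefix) {
                prefix_docids
                    .entry(prefix.to_string())
                    .or_default()
                    .extend(docids.iter().copied());
            }
        }
    }
}

/// Previous approach: clear everything, then recompute from every word
/// known to the index, even when only a few words changed.
fn rebuild_from_scratch(
    prefix_docids: &mut DocidsByKey,
    all_word_docids: &DocidsByKey,
    prefixes: &[&str],
) {
    prefix_docids.clear();
    merge_batch(prefix_docids, all_word_docids, prefixes);
}
```

Under these assumptions both functions end with the same content, but `merge_batch` only iterates over the words of the new batch, while `rebuild_from_scratch` iterates over every word of the index each time.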

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00
ff8d7a810d Change the behavior of the as_cloneable_grenad by taking a ref 2022-02-16 15:40:08 +01:00
f367cc2e75 Finally bump grenad to v0.4.1 2022-02-16 15:28:48 +01:00
d59bcea749 Revert "Revert "Change chunk size to 4MiB to fit more the end user usage"" 2022-02-02 17:01:13 +01:00
fb79c32430 Compute the new, common, and deleted prefix words fst once 2022-01-27 11:00:18 +01:00
51d1e64b23 Remove the now useless WriteMethod enum 2022-01-27 10:08:35 +01:00
e9c02173cf Rework the WordsPrefixPositionDocids update to compute a subset of the database 2022-01-27 10:08:35 +01:00
d59e559317 Fix the computation of the newly added and common prefix words 2022-01-27 10:08:34 +01:00
28692f65be Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
5404bc02dd Move the fst_stream_into_hashset method in the helper methods 2022-01-27 10:06:00 +01:00
822f67e9ad Bring the newly created word pair proximity docids 2022-01-27 10:06:00 +01:00
d28f18658e Retrieve the previous version of the words prefixes FST 2022-01-27 10:05:59 +01:00
fd177b63f8 Merge #423
423: Remove an unused file r=irevoire a=irevoire

This empty file is not included anywhere

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-19 14:18:05 +00:00
0c84a40298 document batch support
reusable transform

rework update api

add indexer config

fix tests

review changes

Co-authored-by: Clément Renault <clement@meilisearch.com>

fmt
2022-01-19 12:40:20 +01:00
98a365aaae store the geopoint in three dimensions 2021-12-14 12:21:24 +01:00
d671d6f0f1 remove an unused file 2021-12-13 19:27:34 +01:00
8970246bc4 Sort positions before iterating over them during word pair proximity extraction 2021-11-22 18:16:54 +01:00
6eb47ab792 remove update_id in UpdateBuilder 2021-11-16 13:07:04 +01:00
09b4281cff improve document addition returned meta
2021-11-10 14:08:36 +01:00
3599df77f0 Change some error messages 2021-10-27 19:33:01 +02:00
baddd80069 implement review suggestions 2021-10-25 18:29:12 +02:00
430e9b13d3 add csv builder tests 2021-10-25 10:26:43 +02:00
2e62925a6e fix tests 2021-10-25 10:26:42 +02:00
0f86d6b28f implement csv serialization 2021-10-25 10:26:42 +02:00
8d70b01714 optimize document deserialization 2021-10-25 10:26:42 +02:00
c7db4176f3 Merge #384
384: Replace memmap with memmap2 r=Kerollmops a=palfrey

[memmap is unmaintained](https://rustsec.org/advisories/RUSTSEC-2020-0077.html) and needs replacing. memmap2 is a drop-in replacement fork that's well maintained. Note that the version numbers got reset on fork, hence the lower values.

Co-authored-by: Tom Parker-Shemilt <palfrey@tevp.net>
2021-10-13 13:47:23 +00:00
6e3b869e6a Merge #388
388: fix primary key inference r=MarinPostma a=MarinPostma

The primary key was inferred from a hashtable index of the fields. For this reason the order in which the fields were iterated upon was not deterministic, and the primary key was chosen from the first field containing "id".

This fix sorts the index by field_id when inferring the primary key.
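
A minimal sketch of the idea, assuming the fields are stored as a `HashMap` from field name to field id. The `infer_primary_key` function and its signature are hypothetical and only illustrate the sorting step, not the actual code of this fix.

```rust
use std::collections::HashMap;

/// Pick the first field whose name contains "id", visiting the fields
/// in field_id order instead of the unspecified HashMap iteration order.
fn infer_primary_key(fields: &HashMap<String, u16>) -> Option<&str> {
    let mut sorted: Vec<(&String, &u16)> = fields.iter().collect();
    // Sorting by field id makes the choice deterministic.
    sorted.sort_by_key(|&(_, &field_id)| field_id);
    sorted
        .into_iter()
        .map(|(name, _)| name.as_str())
        .find(|name| name.contains("id"))
}
```

With the fields `title` (0), `id` (1) and `video_id` (2), this always returns `id`, whereas iterating the map directly could return either `id` or `video_id`.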


Co-authored-by: mpostma <postma.marin@protonmail.com>
2021-10-12 09:25:16 +00:00
86ead92ed5 infer primary key on sorted fields 2021-10-12 11:15:11 +02:00
9a266a531b test correct primary key inference 2021-10-12 11:08:53 +02:00
c5a6075484 Make max_position_per_attributes changeable 2021-10-12 10:10:50 +02:00
360c5ff3df Remove limit of 1000 position per attribute
Instead of using an arbitrary limit we encode the absolute position in a u32
using one strong u16 for the field id and a weak u16 for the relative position in the attribute.
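
A minimal sketch of such an encoding, assuming "strong" means the 16 high bits and "weak" the 16 low bits. The function names are illustrative and not taken from this commit.

```rust
/// Pack a field id (high 16 bits) and a relative position (low 16 bits)
/// into a single u32 absolute position.
fn absolute_from_relative_position(field_id: u16, relative: u16) -> u32 {
    ((field_id as u32) << 16) | (relative as u32)
}

/// Split an absolute position back into (field id, relative position).
fn relative_from_absolute_position(absolute: u32) -> (u16, u16) {
    ((absolute >> 16) as u16, (absolute & 0xffff) as u16)
}
```

Because the field id occupies the high bits, sorting absolute positions orders them by field first and by position inside the field second.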
2021-10-12 10:10:50 +02:00
2dfe24f067 memmap -> memmap2 2021-10-10 22:47:12 +01:00
3296bb243c Simplify word level position DB into a word position DB 2021-10-05 12:15:02 +02:00
26b5dad042 Revert "Change chunk size to 4MiB to fit more the end user usage" 2021-09-29 15:08:39 +02:00