Commit Graph

1633 Commits

Author SHA1 Message Date
Kerollmops
0353fbb5df Bump the tokenizer version to v0.2.4 2021-07-22 17:14:45 +02:00
Kerollmops
92c0a2cdc1 Add a test that triggers a panic when indexing zeroes 2021-07-22 17:14:44 +02:00
Kerollmops
aa02a7fdd8 Add a test to check that we indeed impact the relevancy 2021-07-22 17:04:38 +02:00
bors[bot]
77de82aaa4 Merge #254
254: Improve the facet string distribution speed r=Kerollmops a=Kerollmops

This pull request creates a data structure similar to the one we use for the faceted numbers, a tetratomic decision tree but this time for the facet strings. This PR also changes the facet distribution behavior by returning one of the original facet values, fixes #260.

This data structure defines bucket-like structures where documents ids are stored under their facet value and helps the search decide if it wants to move to a lower level under a given bucket or not, depending on if the current bucket contains interesting documents or not. The whole format, algorithm, and previous attempts are explained in the [`facet_string.rs` file](ec1cfdd42b/milli/src/search/facet/facet_string.rs).

Note that this data structure **could** be used to sort by string lexicographically, that hypothetically possible. We need more testing, in terms of performance and quality, as we will sort on lowercased versions of the facet values.

 - [x] Implement a faster and more precise way to fetch the facet distribution.
 - [x] Store and return the original facet string value. We currently return the lowercased version.

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-07-21 15:34:40 +00:00
Clément Renault
0227254a65 Return the original string values for the inverted facet index database 2021-07-21 16:59:39 +02:00
Kerollmops
03a01166ba Display the original facet string value from the linear facet database 2021-07-21 16:59:39 +02:00
Clément Renault
d23c250ad5 Fix a bound error in the facet string range construction 2021-07-21 16:59:39 +02:00
Clément Renault
081278dfd6 Use the facet string levels when computing the facet distribution 2021-07-21 16:59:39 +02:00
Clément Renault
5676b204dd Fix the facet string levels codecs 2021-07-21 16:59:38 +02:00
Kerollmops
8c86348119 Indexing the facet strings levels 2021-07-21 16:59:38 +02:00
Kerollmops
a7ae552ba7 Fix the FacetStringLevelZeroRange range when unbounded 2021-07-21 16:59:38 +02:00
Kerollmops
757b2b502a Remove the FacetValueStringCodec 2021-07-21 16:59:38 +02:00
Kerollmops
adfd4da24c Introduce the FacetStringIter iterator 2021-07-21 16:59:38 +02:00
Kerollmops
a79661c6dc Introduce a lot of facet string helper iterators 2021-07-21 16:59:38 +02:00
Kerollmops
851f979039 Describe the way we want to group the facet strings 2021-07-21 16:59:38 +02:00
Kerollmops
f858f64b1f Move the facet number iterators into their own module 2021-07-21 16:59:37 +02:00
Kerollmops
9f8095c069 Make sure that we don't keep a reference on the LMDB key when using put_current 2021-07-21 10:35:35 +02:00
bors[bot]
fa44e95c91 Merge #290
290: Add a $HOME to the CI r=Kerollmops a=irevoire

This should fix this issue:
https://github.com/meilisearch/milli/runs/3104228432?check_suite_focus=true

I think a real fix would be to fix the configuration of our github runner but I don't know how to do it.
@curquiza could probably help us on that once she's back from vacation 😄 

Co-authored-by: Tamo <tamo@meilisearch.com>
2021-07-20 07:32:46 +00:00
Tamo
0ab541627b add a $HOME var to the ci 2021-07-19 14:33:49 +02:00
bors[bot]
16698f714b Merge #287
287: Add benchmarks for indexing r=Kerollmops a=irevoire

closes #274 
I don't really know how much time this will take on our bench machine. I'm afraid the wiki dataset will take a really long time to bench (it takes 1h30 on my computer).

If you are ok with it, I would like to merge this first PR since it introduces a first set of benchmarks and see how much time it takes in reality on our setup.

Co-authored-by: Tamo <tamo@meilisearch.com>
2021-07-07 15:41:15 +00:00
Tamo
931021fe57 add benchmarks for indexing 2021-07-07 13:09:05 +02:00
bors[bot]
4c9531bdf3 Merge #285
285: Support documents with at most 65536 fields r=Kerollmops a=Kerollmops

Fixes #248.

In this PR I updated the `obkv` crate, it now supports arbitrary key length and therefore I was able to use an `u16` to represent the fields instead of a single byte. It was impressively easy to update the whole codebase 🍡 🍔

Co-authored-by: Kerollmops <clement@meilisearch.com>
2021-07-06 16:44:51 +00:00
Kerollmops
0a78107525 Fix the infos crate to make it read u16 field ids 2021-07-06 11:58:03 +02:00
Kerollmops
a9553af635 Add a test to check that we can index more that 256 fields 2021-07-06 11:58:03 +02:00
Kerollmops
838ed1cd32 Use an u16 field id instead of one byte 2021-07-06 11:58:03 +02:00
bors[bot]
cc54c41e30 Merge #283
283: Use the AlwaysFreePages flag when opening an index r=irevoire a=Kerollmops

We introduced a new flag in our fork of LMDB, this `AlwaysFreePages` flag forces LMDB to always free the single pages it uses before writing to the disk instead of keeping them in a linked list.

Declaring this flag reduces the memory print (leak) we have on memory after indexing a lot of documents.

Fixes #279.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2021-07-05 16:59:16 +00:00
bors[bot]
63db43cc7a Merge #284
284: [http-ui] Introduce the route `die` r=Kerollmops a=irevoire

This route just `exit` the process. This can come in handy when you run `http-ui` inside of another process (a profiler for example), and you don't want to kill everything

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Irevoire <tamo@meilisearch.com>
2021-07-05 15:47:53 +00:00
Irevoire
4562b278a8 remove a warning and add a log
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-07-05 17:46:02 +02:00
Tamo
a57e522a67 introduce a die route let the program exit itself alone 2021-07-05 17:38:10 +02:00
Kerollmops
91c5d0c042 Use the AlwaysFreePages flag when opening an index 2021-07-05 16:36:13 +02:00
bors[bot]
007fec21fc Merge #281
281: Bump to v0.7.2 r=ManyTheFish a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2021-07-05 09:00:26 +00:00
Kerollmops
a6b4069172 Bump to v0.7.2 2021-07-05 10:54:53 +02:00
bors[bot]
d7bc6a6999 Merge #280
280: Fix matching lenghth in matching_words r=Kerollmops a=ManyTheFish

related to https://github.com/meilisearch/MeiliSearch/issues/1441

Co-authored-by: many <maxime@meilisearch.com>
2021-07-01 18:50:46 +00:00
many
9f62149b94 Fix matching lenghth in matching_words 2021-07-01 19:03:28 +02:00
bors[bot]
f25f454bd4 Merge #275
275: Fix the benchmarks dependencies r=Kerollmops a=irevoire

Import exactly the same dependency as milli instead of a wildcard that can do anything

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Irevoire <irevoire@protonmail.ch>
2021-07-01 11:07:01 +00:00
bors[bot]
885f243afc Merge #276
276: Fix the fmt of the auto-generated file r=Kerollmops a=irevoire

The file generated by the `build.rs` file of the benchmark was badly formatted and that was causing an issue with the git pre-commit hook I wrote [earlier](https://github.com/meilisearch/milli/blob/main/script/pre-commit)

Co-authored-by: Tamo <tamo@meilisearch.com>
2021-07-01 10:24:36 +00:00
Irevoire
ec87bf3dd5 Update benchmarks/Cargo.toml
Co-authored-by: Clément Renault <renault.cle@gmail.com>
2021-07-01 11:45:05 +02:00
Tamo
ef965aa3f3 fix the fmt of the auto-generated file 2021-07-01 11:43:09 +02:00
Tamo
fc09d77e89 fix the benchmarks dependcies 2021-07-01 11:38:30 +02:00
bors[bot]
056180e6c8 Merge #273
273: Update tokenizer version to v0.2.3 r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2021-07-01 09:02:16 +00:00
Clémentine Urquizar
3c149d8a43 Update tokenizer version to v0.2.3 2021-06-30 18:41:35 +02:00
bors[bot]
b4dcdbf00d Merge #269 #271
269: Fix bug when inserting previously deleted documents r=Kerollmops a=Kerollmops

This PR fixes #268.

The issue was in the `ExternalDocumentsIds` implementation in the specific case that an external document id was in the soft map marked as deleted.

The bug was due to a wrong assumption on my side about how the FST unions were returning the `IndexedValue`s, I thought the values returned in an array were in the same order as the FSTs given to the `OpBuilder` but in fact, [the `IndexedValue`'s `index` field was here to indicate from which FST the values were coming from](https://docs.rs/fst/0.4.7/fst/map/struct.IndexedValue.html).

271: Remove the roaring operation functions warnings r=Kerollmops a=Kerollmops

In this PR we are just replacing the usages of the roaring operations function by the new operators. This removes a lot of warnings.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2021-06-30 12:34:55 +00:00
Kerollmops
32b7bd366f Remove the roaring operation functions warnings 2021-06-30 14:12:56 +02:00
bors[bot]
00e2845f0f Merge #270
270: Update milli version to v0.7.1 r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2021-06-30 12:12:24 +00:00
Kerollmops
c92ef54466 Add a test for when we insert a previously deleted document 2021-06-30 14:00:01 +02:00
Kerollmops
28782ff99d Fix ExternalDocumentsIds struct when inserting previously deleted ids 2021-06-30 14:00:01 +02:00
Clémentine Urquizar
b489515f4d Update milli version to v0.7.1 2021-06-30 13:52:46 +02:00
Kerollmops
54889813ce Implement some debug functions on the ExternalDocumentsIds struct 2021-06-30 11:29:41 +02:00
Kerollmops
4bce66d5ff Make the Index::delete_* method private 2021-06-30 10:07:31 +02:00
bors[bot]
66e6ea56b8 Merge #267
267: Highlighting r=Kerollmops a=irevoire

closes #262 
I basically rewrote a part of the damerau-levenshtein function we were using for the highlighting to accept at most two errors from the user and stop on the third mistake.
Also, now it supports utf-8, so it should fix our issue.

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Irevoire <irevoire@protonmail.ch>
2021-06-30 05:43:50 +00:00