Commit Graph

1585 Commits

Author SHA1 Message Date
bors[bot]
df518d8b0b Merge #459
459: Update heed link in cargo toml r=Kerollmops a=curquiza

Since grenad and heed have been moved to the meilisearch orga, this PR changes the link.
This is a minor change since GitHub handles automatically the redirection. This PR is only for consisitency.

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-03-01 19:47:14 +00:00
Clémentine Urquizar
d9ed9de2b0 Update heed link in cargo toml 2022-03-01 19:45:29 +01:00
bors[bot]
51cf44d6fd Merge #456
456: Remove useless grenad merging r=Kerollmops a=Kerollmops

This PR must be merged after #454.

This PR removes the part of code that was merging all of the grenad Readers merging that we don't need as the indexer should have merged them and, therefore, we should only have one final grenad Reader. We reduce the amount of CPU usage and memory pressure we were doing uselessly.

`@ManyTheFish` are you sure I can skip merging the `word_docids` database?

Here is the benchmark comparison with the previously merged PR #454:
```
group                                              indexing_reintroduce-appending-sorted-values_c05e42a8    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                              -----------------------------------------------------    -----------------------------------------------
indexing/Indexing movies with default settings     1.06      16.6±1.04s        ? ?/sec                      1.00      15.7±0.93s        ? ?/sec
indexing/Indexing songs with default settings      1.16      60.1±7.07s        ? ?/sec                      1.00      51.7±5.98s        ? ?/sec
indexing/Indexing songs without faceted numbers    1.06      55.4±6.14s        ? ?/sec                      1.00      52.2±4.13s        ? ?/sec
```

And the comparison with multi-batch indexing before #436, we can see that we gain time for benchmarks that index datasets in multiple batches but there is _so much_ variance that it's not clear.

```
group                                                             indexing_benchmark-multi-batch-indexing-before-speed-up_45f52620    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                                             ----------------------------------------------------------------    -----------------------------------------------
indexing/Indexing geo_point                                       1.07       6.6±0.08s        ? ?/sec                                 1.00       6.2±0.11s        ? ?/sec
indexing/Indexing songs in three batches with default settings    1.12      57.7±2.14s        ? ?/sec                                 1.00      51.5±3.80s        ? ?/sec
indexing/Indexing songs with default settings                     1.00      47.5±2.52s        ? ?/sec                                 1.09      51.7±5.98s        ? ?/sec
indexing/Indexing songs without any facets                        1.00      43.5±1.43s        ? ?/sec                                 1.12      48.8±3.73s        ? ?/sec
indexing/Indexing songs without faceted numbers                   1.00      47.1±2.23s        ? ?/sec                                 1.11      52.2±4.13s        ? ?/sec
indexing/Indexing wiki                                            1.00    917.3±30.01s        ? ?/sec                                 1.09    998.7±38.92s        ? ?/sec
indexing/Indexing wiki in three batches                           1.09   1091.2±32.73s        ? ?/sec                                 1.00    996.5±15.70s        ? ?/sec
```

What do you think `@irevoire?` Should we change the benchmarks to make them do more runs?

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-03-01 16:48:08 +00:00
Kerollmops
d5b8b5a2f8 Replace the ugly unwraps by clean if let Somes 2022-02-28 16:31:33 +01:00
Kerollmops
8d26f3040c Remove a useless grenad file merging 2022-02-28 16:31:33 +01:00
bors[bot]
21898ffc60 Merge #454
454: Reintroduce appending sorted entries when possible r=Kerollmops a=Kerollmops

This PR modifies the `sorter_into_lmdb_database` function to append values into the database instead of get-put-merging them, it should improve the indexation speed for when the database is empty.

```txt
group                                             indexing_main_25123af3                 indexing_reintroduce-appending-sorted-values_c05e42a8
-----                                             ----------------------                 -----------------------------------------------------
indexing/Indexing movies with default settings    1.07      17.8±0.99s        ? ?/sec    1.00      16.6±1.04s        ? ?/sec
indexing/Indexing songs with default settings     1.00      57.0±6.01s        ? ?/sec    1.05      60.1±7.07s        ? ?/sec
indexing/Indexing songs without any facets        1.10      51.8±5.36s        ? ?/sec    1.00      47.3±3.30s        ? ?/sec
```

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-02-28 14:55:37 +00:00
Clément Renault
04b1bbf932 Reintroduce appending sorted entries when possible 2022-02-24 14:50:45 +01:00
bors[bot]
382be56d36 Merge #453
453: Benchmark multi batch indexing r=Kerollmops a=Kerollmops

Hey `@irevoire,` could you please add the new benchmarks into influx?

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-02-24 12:33:13 +00:00
Clément Renault
acfc96525c Apply GitHub suggestions 2022-02-23 16:20:29 +01:00
Clément Renault
a820aa11e6 Add a new movies benchmark to test multi batch indexing 2022-02-23 16:20:29 +01:00
Kerollmops
8d2e3e4aba Add a new wiki benchmark to test multi batch indexing 2022-02-23 16:20:29 +01:00
Kerollmops
ab5247dc64 Add a new songs benchmark to test multi batch indexing 2022-02-23 16:20:28 +01:00
bors[bot]
acd9535588 Merge #455
455: Raise the GitHub CI timeout limit to 72h r=irevoire a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-23 14:33:31 +00:00
Kerollmops
19bfb2649b Raise the GitHub CI timeout limit to 72h 2022-02-23 15:27:51 +01:00
bors[bot]
25123af3b8 Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids` and, `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs much fewer CPU cycles, an era where this update structure can use the, previously computed, set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked for. This structure specifically computes the list of word prefix pair proximities, which correspond to a list of pairs of words associated with a proximity (the distance between both words) where the second word is not a word but a prefix e.g. `s`, `se`, `a`. This word prefix pair proximity is associated with the list of documents ids which contains the pair of words and prefix at the given proximity.

The origin of the performances issue that this struct brings is related to the fact that it starts its job from the beginning, it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00
Clément Renault
ff8d7a810d Change the behavior of the as_cloneable_grenad by taking a ref 2022-02-16 15:40:08 +01:00
Clément Renault
f367cc2e75 Finally bump grenad to v0.4.1 2022-02-16 15:28:48 +01:00
bors[bot]
f2984f66e6 Merge #452
452: bump milli r=curquiza a=irevoire



Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-02-16 13:49:14 +00:00
Irevoire
0defeb268c bump milli 2022-02-16 13:27:41 +01:00
bors[bot]
030064da25 Merge #451
451: Update LICENSE with Meili SAS name r=Kerollmops a=curquiza

Check with thomas, we must put the real name of the company

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-02-15 16:18:47 +00:00
Clémentine Urquizar - curqui
84035a27f5 Update LICENSE 2022-02-15 15:52:50 +01:00
bors[bot]
0885fcf973 Merge #450
450: Get rid of chrono in favor of time r=Kerollmops a=irevoire

We only use `chrono` as a wrapper around `time`, and since there has been an [open CVE on `chrono` for at least 3 months now](https://github.com/chronotope/chrono/pull/632) and the repo seems to be [struggling with maintenance](https://github.com/chronotope/chrono/pull/639), I think we should use `time` directly which is way more active and sufficient for our use case.

EDIT: Actually the CVE status has been known for more than 6 months: https://github.com/chronotope/chrono/issues/602

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-02-15 10:54:46 +00:00
Irevoire
48542ac8fd get rid of chrono in favor of time 2022-02-15 11:41:55 +01:00
bors[bot]
ea15ad6c34 Merge #447
447: Update version for the next release (v0.22.1) r=curquiza a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-02-07 17:44:09 +00:00
Clémentine Urquizar
d03b3ceb58 Update version for the next release (v0.22.1) 2022-02-07 18:39:29 +01:00
bors[bot]
5d58cb7449 Merge #442
442: fix phrase search r=curquiza a=MarinPostma

Run the exact match search on 7 words windows instead of only two. This makes false positive very very unlikely, and impossible on phrase query that are less than seven words.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-02-07 16:18:20 +00:00
bors[bot]
c5a996aa78 Merge #446
446: Update LICENSE r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-02-07 09:47:39 +00:00
Clémentine Urquizar - curqui
1279c38ac9 Update LICENSE 2022-02-05 18:29:11 +01:00
bors[bot]
267d14c28d Merge #445
445: allow null values in csv r=Kerollmops a=MarinPostma

This pr allows null values in csv:
- if the field is of type string, then an empty field is considered null (`,,`), anything other is turned into a string (i.e `, ,` is a single whitespace string)
- if the field is of type number, when the trimmed field is empty, we consider the value null (i.e `,,`, `, ,` are both null numbers) otherwise we try to parse the number.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-02-03 15:11:32 +00:00
ad hoc
bd2262ceea allow null values in csv 2022-02-03 16:03:01 +01:00
ad hoc
13de251047 rewrite word pair distance gathering 2022-02-03 15:57:20 +01:00
bors[bot]
fda4f229bb Merge #417
417: Change chunk size to 4MiB to fit more the end user usage r=Kerollmops a=ManyTheFish

Reverts meilisearch/milli#379

We made several indexing tests using different sizes of datasets (5 datasets from 9MiB to 100MiB) on several typologies of VMs (`XS: 1GiB RAM, 1 VCPU`, `S: 2GiB RAM, 2 VCPU`, `M: 4GiB RAM, 3 VCPU`, `L: 8GiB RAM, 4 VCPU`).
The result of these tests shows that the `4MiB` chunk size seems to be the best size compared to other chunk sizes (`2Mib`, `4MiB`, `8Mib`, `16Mib`,  `32Mib`, `64Mib`, `128Mib`).

below is the average time per chunk size:

![Capture d’écran 2021-09-27 à 14 27 50](https://user-images.githubusercontent.com/6482087/134909368-ef0bc45e-68d5-49d1-aaf9-91113b7c410f.png)

<details>
<summary>Detailled data</summary>
<br>

![Capture d’écran 2021-09-27 à 14 39 48](https://user-images.githubusercontent.com/6482087/134909952-a36b1457-bbbd-4a6c-bbe5-519e4b926b5a.png)
</br>
</details> 


Co-authored-by: Many <many@meilisearch.com>
2022-02-02 18:30:59 +00:00
bors[bot]
2468ebb76b Merge #444
444: Fix the parsing of ndjson requests to index more than the first line r=Kerollmops a=Kerollmops

This PR correctly uses the `BufRead` trait to read every line of the content instead of just the first one. This bug was only affecting the http-ui test crate.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-02 17:59:44 +00:00
Kerollmops
9142ba9dd4 Fix the parsing of ndjson requests to index more than the first line 2022-02-02 17:55:13 +01:00
Many
d59bcea749 Revert "Revert "Change chunk size to 4MiB to fit more the end user usage"" 2022-02-02 17:01:13 +01:00
mpostma
7541ab99cd review changes 2022-02-02 12:59:01 +01:00
mpostma
d0aabde502 optimize 2 typos case 2022-02-02 12:56:09 +01:00
mpostma
55e6cb9c7b typos on first letter counts as 2 2022-02-02 12:56:09 +01:00
mpostma
642c01d0dc set max typos on ngram to 1 2022-02-02 12:56:08 +01:00
ad hoc
d852dc0d2b fix phrase search 2022-02-01 20:21:33 +01:00
Kerollmops
fb79c32430 Compute the new, common and, deleted prefix words fst once 2022-01-27 11:00:18 +01:00
Clément Renault
51d1e64b23 Remove, now useless, the WriteMethod enum 2022-01-27 10:08:35 +01:00
Clément Renault
e9c02173cf Rework the WordsPrefixPositionDocids update to compute a subset of the database 2022-01-27 10:08:35 +01:00
Clément Renault
dbba5fd461 Create a function to simplify the word prefix pair proximity docids compute 2022-01-27 10:08:35 +01:00
Clément Renault
e760e02737 Fix the computation of the newly added and common prefix pair proximity words 2022-01-27 10:08:35 +01:00
Clément Renault
d59e559317 Fix the computation of the newly added and common prefix words 2022-01-27 10:08:34 +01:00
Clément Renault
2ec8542105 Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
Clément Renault
28692f65be Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
Clément Renault
5404bc02dd Move the fst_stream_into_hashset method in the helper methods 2022-01-27 10:06:00 +01:00
Clément Renault
c90fa95f93 Only compute the word prefix pairs on the created word pair proximities 2022-01-27 10:06:00 +01:00