Commit Graph

1552 Commits

Author SHA1 Message Date
8efac33b53 Merge #467
467: optimize prefix database r=Kerollmops a=MarinPostma

This pr introduces two optimizations that greatly improve the speed of computing prefix databases.

- The time that it takes to create the prefix FST has been divided by 5 by inverting the way we iterated over the words FST.
- We unconditionally and needlessly checked for documents to remove in  `word_prefix_pair`, which caused an iteration over the whole database.

Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-03-15 16:14:35 +00:00
d127c57f2d review edits 2022-03-15 17:12:48 +01:00
d633ac5b9d optimize word prefix pair 2022-03-15 16:37:22 +01:00
d68fe2b3c7 optimize word prefix fst 2022-03-15 16:36:48 +01:00
d87e8b63a9 Merge #465
465: Update dependencies r=ManyTheFish a=Kerollmops

This PR upgrade and updates this crate's dependencies but first, it removes three dependencies that we don't use anymore. I used [cargo udeps](https://github.com/est31/cargo-udeps) to upgrade them ⬆️

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-03-15 13:49:17 +00:00
0c5f4ed7de Apply suggestions
Co-authored-by: Many <many@meilisearch.com>
2022-03-15 14:18:29 +01:00
21ec334dcc Fix the compilation error of the dependency versions 2022-03-15 11:17:45 +01:00
63682c2c9a Upgrade the dependencies 2022-03-15 11:17:44 +01:00
288a879411 Remove three useless dependencies 2022-03-15 11:17:44 +01:00
712bf035a7 Merge #464
464: exporting heed to avoid having different versions of Heed in Meilisearch r=curquiza a=psvnlsaikumar

# Pull Request

## What does this PR do?
Fixes the issue in meilisearch https://github.com/meilisearch/meilisearch/issues/2210

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: psvnl sai kumar <psvnlsaikumar@gmail.com>
2022-03-15 09:51:56 +00:00
5e08fac729 fixes for rustfmt pass 2022-03-14 19:22:41 +05:30
92e2e09434 exporting heed to avoid having different versions of Heed in Meilisearch 2022-03-14 01:01:58 +05:30
290a29b5fb Merge #457
457: Avoid iterating on big databases when useless r=Kerollmops a=Kerollmops

This PR makes the prefix database updates to avoid iterating on big grenad files when it is unnecessary. We introduced this regression in #436 but it went unnoticed.

---

According to the following benchmark results, we take more time when we index documents in one run than before #436. It looks like it is probably due to the fact that, now, instead of computing the prefixes database by iterating on the LMDB we directly iterate on the grenad file. Those could be slower to iterate on and could be the slowdown cause.

I just pushed a commit that tests this branch with the new unreleased version of grenad where some work was done to speed up the iteration on grenad files. [The benchmarks for this last commit](https://github.com/meilisearch/milli/actions/runs/1927187408) are currently running. You can [see the diff](https://github.com/meilisearch/grenad/compare/v0.4.1...main) between the v0.4 and the unreleased v0.5 version of grenad.

```diff
  group                                                             indexing_benchmark-multi-batch-indexing-before-speed-up_45f52620    indexing_stop-iterating-on-big-grenad-files_ac8b85c4
  -----                                                             ----------------------------------------------------------------    ----------------------------------------------------
+ indexing/Indexing songs in three batches with default settings    1.12      57.7±2.14s        ? ?/sec                                 1.00      51.3±2.76s        ? ?/sec
- indexing/Indexing wiki                                            1.00    917.3±30.01s        ? ?/sec                                 1.10   1008.4±38.27s        ? ?/sec
+ indexing/Indexing wiki in three batches                           1.10   1091.2±32.73s        ? ?/sec                                 1.00    995.5±24.33s        ? ?/sec
```

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-03-09 16:46:34 +00:00
1ae13c1374 Avoid iterating on big databases when useless 2022-03-09 15:43:54 +01:00
a8d28e364d Merge #461
461: Add a new error message when the `valid_fields` is empty r=curquiza a=brunoocasali

I've created a test case to handle the new error formatting behavior, but I'm not sure if:

- this is the right place to add the test?
- this is the best way to test this behavior?

And I'm not sure also regarding the `match` implementation, is this something required? Or maybe just an `if` statement is ok as well?

I left the two messages literally without "reusing the prefix" in the implementation because I think this could help the "searchability" of the error in the future.

# Pull Request

## What does this PR do?
Fixes https://github.com/meilisearch/meilisearch/issues/2140

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [ ] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Bruno Casali <brunoocasali@gmail.com>
2022-03-08 09:55:58 +00:00
2ef5751795 Merge #463
463: Allow setting the primary-key in the cli r=irevoire a=irevoire



Co-authored-by: Tamo <tamo@meilisearch.com>
2022-03-07 14:11:40 +00:00
8bb45956d4 allow to set the primary key in the cli 2022-03-07 14:56:49 +01:00
3cbadf92b6 Merge #462
462: cli improvements r=Kerollmops a=MarinPostma

a few improvements:
- use bufreader to load documents, so the loading of the document doesn't appear on flamegraphs
- set default db path to current directory so the `-i` flag can be omitted.



Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-03-07 09:39:01 +00:00
db3a1905de default db path 2022-03-07 10:30:47 +01:00
6cf82ba993 bufread documents 2022-03-07 10:29:52 +01:00
66c6d5e1ef Add a new error message when the valid_fields is empty
> "Attribute `{}` is not sortable. This index doesn't have configured sortable attributes."
> "Attribute `{}` is not sortable. Available sortable attributes are: `{}`."

coexist in the error handling
2022-03-05 10:38:18 -03:00
df518d8b0b Merge #459
459: Update heed link in cargo toml r=Kerollmops a=curquiza

Since grenad and heed have been moved to the meilisearch orga, this PR changes the link.
This is a minor change since GitHub handles automatically the redirection. This PR is only for consisitency.

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-03-01 19:47:14 +00:00
d9ed9de2b0 Update heed link in cargo toml 2022-03-01 19:45:29 +01:00
51cf44d6fd Merge #456
456: Remove useless grenad merging r=Kerollmops a=Kerollmops

This PR must be merged after #454.

This PR removes the part of code that was merging all of the grenad Readers merging that we don't need as the indexer should have merged them and, therefore, we should only have one final grenad Reader. We reduce the amount of CPU usage and memory pressure we were doing uselessly.

`@ManyTheFish` are you sure I can skip merging the `word_docids` database?

Here is the benchmark comparison with the previously merged PR #454:
```
group                                              indexing_reintroduce-appending-sorted-values_c05e42a8    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                              -----------------------------------------------------    -----------------------------------------------
indexing/Indexing movies with default settings     1.06      16.6±1.04s        ? ?/sec                      1.00      15.7±0.93s        ? ?/sec
indexing/Indexing songs with default settings      1.16      60.1±7.07s        ? ?/sec                      1.00      51.7±5.98s        ? ?/sec
indexing/Indexing songs without faceted numbers    1.06      55.4±6.14s        ? ?/sec                      1.00      52.2±4.13s        ? ?/sec
```

And the comparison with multi-batch indexing before #436, we can see that we gain time for benchmarks that index datasets in multiple batches but there is _so much_ variance that it's not clear.

```
group                                                             indexing_benchmark-multi-batch-indexing-before-speed-up_45f52620    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                                             ----------------------------------------------------------------    -----------------------------------------------
indexing/Indexing geo_point                                       1.07       6.6±0.08s        ? ?/sec                                 1.00       6.2±0.11s        ? ?/sec
indexing/Indexing songs in three batches with default settings    1.12      57.7±2.14s        ? ?/sec                                 1.00      51.5±3.80s        ? ?/sec
indexing/Indexing songs with default settings                     1.00      47.5±2.52s        ? ?/sec                                 1.09      51.7±5.98s        ? ?/sec
indexing/Indexing songs without any facets                        1.00      43.5±1.43s        ? ?/sec                                 1.12      48.8±3.73s        ? ?/sec
indexing/Indexing songs without faceted numbers                   1.00      47.1±2.23s        ? ?/sec                                 1.11      52.2±4.13s        ? ?/sec
indexing/Indexing wiki                                            1.00    917.3±30.01s        ? ?/sec                                 1.09    998.7±38.92s        ? ?/sec
indexing/Indexing wiki in three batches                           1.09   1091.2±32.73s        ? ?/sec                                 1.00    996.5±15.70s        ? ?/sec
```

What do you think `@irevoire?` Should we change the benchmarks to make them do more runs?

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-03-01 16:48:08 +00:00
d5b8b5a2f8 Replace the ugly unwraps by clean if let Somes 2022-02-28 16:31:33 +01:00
8d26f3040c Remove a useless grenad file merging 2022-02-28 16:31:33 +01:00
21898ffc60 Merge #454
454: Reintroduce appending sorted entries when possible r=Kerollmops a=Kerollmops

This PR modifies the `sorter_into_lmdb_database` function to append values into the database instead of get-put-merging them, it should improve the indexation speed for when the database is empty.

```txt
group                                             indexing_main_25123af3                 indexing_reintroduce-appending-sorted-values_c05e42a8
-----                                             ----------------------                 -----------------------------------------------------
indexing/Indexing movies with default settings    1.07      17.8±0.99s        ? ?/sec    1.00      16.6±1.04s        ? ?/sec
indexing/Indexing songs with default settings     1.00      57.0±6.01s        ? ?/sec    1.05      60.1±7.07s        ? ?/sec
indexing/Indexing songs without any facets        1.10      51.8±5.36s        ? ?/sec    1.00      47.3±3.30s        ? ?/sec
```

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-02-28 14:55:37 +00:00
04b1bbf932 Reintroduce appending sorted entries when possible 2022-02-24 14:50:45 +01:00
382be56d36 Merge #453
453: Benchmark multi batch indexing r=Kerollmops a=Kerollmops

Hey `@irevoire,` could you please add the new benchmarks into influx?

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-02-24 12:33:13 +00:00
acfc96525c Apply GitHub suggestions 2022-02-23 16:20:29 +01:00
a820aa11e6 Add a new movies benchmark to test multi batch indexing 2022-02-23 16:20:29 +01:00
8d2e3e4aba Add a new wiki benchmark to test multi batch indexing 2022-02-23 16:20:29 +01:00
ab5247dc64 Add a new songs benchmark to test multi batch indexing 2022-02-23 16:20:28 +01:00
acd9535588 Merge #455
455: Raise the GitHub CI timeout limit to 72h r=irevoire a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-23 14:33:31 +00:00
19bfb2649b Raise the GitHub CI timeout limit to 72h 2022-02-23 15:27:51 +01:00
25123af3b8 Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids` and, `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs much fewer CPU cycles, an era where this update structure can use the, previously computed, set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked for. This structure specifically computes the list of word prefix pair proximities, which correspond to a list of pairs of words associated with a proximity (the distance between both words) where the second word is not a word but a prefix e.g. `s`, `se`, `a`. This word prefix pair proximity is associated with the list of documents ids which contains the pair of words and prefix at the given proximity.

The origin of the performances issue that this struct brings is related to the fact that it starts its job from the beginning, it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00
ff8d7a810d Change the behavior of the as_cloneable_grenad by taking a ref 2022-02-16 15:40:08 +01:00
f367cc2e75 Finally bump grenad to v0.4.1 2022-02-16 15:28:48 +01:00
f2984f66e6 Merge #452
452: bump milli r=curquiza a=irevoire



Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-02-16 13:49:14 +00:00
0defeb268c bump milli 2022-02-16 13:27:41 +01:00
030064da25 Merge #451
451: Update LICENSE with Meili SAS name r=Kerollmops a=curquiza

Check with thomas, we must put the real name of the company

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-02-15 16:18:47 +00:00
84035a27f5 Update LICENSE 2022-02-15 15:52:50 +01:00
0885fcf973 Merge #450
450: Get rid of chrono in favor of time r=Kerollmops a=irevoire

We only use `chrono` as a wrapper around `time`, and since there has been an [open CVE on `chrono` for at least 3 months now](https://github.com/chronotope/chrono/pull/632) and the repo seems to be [struggling with maintenance](https://github.com/chronotope/chrono/pull/639), I think we should use `time` directly which is way more active and sufficient for our use case.

EDIT: Actually the CVE status has been known for more than 6 months: https://github.com/chronotope/chrono/issues/602

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-02-15 10:54:46 +00:00
48542ac8fd get rid of chrono in favor of time 2022-02-15 11:41:55 +01:00
ea15ad6c34 Merge #447
447: Update version for the next release (v0.22.1) r=curquiza a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-02-07 17:44:09 +00:00
d03b3ceb58 Update version for the next release (v0.22.1) 2022-02-07 18:39:29 +01:00
5d58cb7449 Merge #442
442: fix phrase search r=curquiza a=MarinPostma

Run the exact match search on 7 words windows instead of only two. This makes false positive very very unlikely, and impossible on phrase query that are less than seven words.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-02-07 16:18:20 +00:00
c5a996aa78 Merge #446
446: Update LICENSE r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-02-07 09:47:39 +00:00
1279c38ac9 Update LICENSE 2022-02-05 18:29:11 +01:00
267d14c28d Merge #445
445: allow null values in csv r=Kerollmops a=MarinPostma

This pr allows null values in csv:
- if the field is of type string, then an empty field is considered null (`,,`), anything other is turned into a string (i.e `, ,` is a single whitespace string)
- if the field is of type number, when the trimmed field is empty, we consider the value null (i.e `,,`, `, ,` are both null numbers) otherwise we try to parse the number.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-02-03 15:11:32 +00:00