Commit Graph

807 Commits

Author SHA1 Message Date
211c8763b9 Make sure that we do not generate too long keys 2022-05-03 10:03:15 +02:00
7e47031bdc Add a test for long keys in LMDB 2022-05-03 10:03:13 +02:00
9db86aac51 Merge #518
518: Return facets even when there is no value associated to it r=Kerollmops a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2352 and should fix the issue when Meilisearch is up-to-date with this PR.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-04-28 09:04:36 +00:00
a4d343aade Add a test to check for the returned facet distribution 2022-04-26 18:12:58 +02:00
c2bd94c871 Merge #511
511: Update version in every workspace r=curquiza a=curquiza

Checked with `@Kerollmops` 

- Update the version into every workspace (the current version is v0.27.0, but I forgot to update it for the previous release)
- add `publish = false` except in `milli` workspace.


Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-26 16:06:47 +00:00
7d1c2d97bf Return facets even when there is no values associated to it 2022-04-26 17:59:53 +02:00
d388ea0f9d Merge #506
506: fix cargo warnings r=Kerollmops a=MarinPostma

fix cargo warnings


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-04-26 15:45:20 +00:00
5c29258e8e fix cargo warnings 2022-04-26 17:33:11 +02:00
2fdf520271 Merge #514
514: Stop flattening every field r=Kerollmops a=irevoire

When we need to flatten a document:
* The primary key contains a `.`.
* Some fields need to be flattened

Instead of flattening the whole object and thus creating a lot of allocations with the `serde_json_flatten_crate`, we instead generate a minimal sub-object containing only the fields that need to be flattened.
That should create fewer allocations and thus index faster.

---------

```
group                                                             indexing_main_e1e362fa                 indexing_stop-flattening-every-field_40d1bd6b
-----                                                             ----------------------                 ---------------------------------------------
indexing/Indexing geo_point                                       1.99      23.7±0.23s        ? ?/sec    1.00      11.9±0.21s        ? ?/sec
indexing/Indexing movies in three batches                         1.00      18.2±0.24s        ? ?/sec    1.01      18.3±0.29s        ? ?/sec
indexing/Indexing movies with default settings                    1.00      17.5±0.09s        ? ?/sec    1.01      17.7±0.26s        ? ?/sec
indexing/Indexing songs in three batches with default settings    1.00      64.8±0.47s        ? ?/sec    1.00      65.1±0.49s        ? ?/sec
indexing/Indexing songs with default settings                     1.00      54.9±0.99s        ? ?/sec    1.01      55.7±1.34s        ? ?/sec
indexing/Indexing songs without any facets                        1.00      50.6±0.62s        ? ?/sec    1.01      50.9±1.05s        ? ?/sec
indexing/Indexing songs without faceted numbers                   1.00      54.0±1.14s        ? ?/sec    1.01      54.7±1.13s        ? ?/sec
indexing/Indexing wiki                                            1.00     996.2±8.54s        ? ?/sec    1.02   1021.1±30.63s        ? ?/sec
indexing/Indexing wiki in three batches                           1.00    1136.8±9.72s        ? ?/sec    1.00    1138.6±6.59s        ? ?/sec
```

So basically everything slowed down a liiiiiittle bit except the dataset with a nested field which got twice faster

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-04-26 11:50:33 +00:00
f19d2dc548 Only flatten the required fields
apply review comments

Co-authored-by: Kerollmops <kero@meilisearch.com>
2022-04-26 12:33:46 +02:00
d138b3c704 Update version 2022-04-25 18:43:46 +02:00
fa6f495662 fix the indexing fuzzer 2022-04-25 18:32:06 +02:00
8010eca9c7 Merge #505
505: normalize exact words r=curquiza a=MarinPostma

Normalize the exact words, as specified in the specification.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-04-25 09:35:32 +00:00
2e0089d5ff normalize exact words 2022-04-21 15:38:40 +02:00
3a2451fcba add test normalize exact words 2022-04-21 13:52:09 +02:00
eb5830aa40 Add a test to make sure that long words are handled 2022-04-21 13:45:28 +02:00
8b14090927 fix min-word-len-for-typo not reset properly 2022-04-19 15:20:16 +02:00
ea4bb9402f Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance milli word-matcher making it handle match computing and cropping.

# Implementation

## Computing best matches for cropping

Before we were considering that the first match of the attribute was the best one, this was accurate when only one word was searched but was missing the target when more than one word was searched.

Now we are searching for the best matches interval to crop around, the chosen interval is the one:
1) that have the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`) where the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) that have the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`) where the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) that have the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words where the interval `split the world` has 3. So the interval `split the world` is considered better.

## Cropping around the best matches interval

Before we were cropping around the interval without checking the context.

Now we are cropping around words in the same context as matching words.
This means that we will keep words that are farther from the matching words but are in the same phrase, than words that are nearer but separated by a dot.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and  not like:
`Natalie risk her future. Split The World is a book …`


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
f1115e274f Use Copy impl of FormatOption instead of clonning 2022-04-19 10:35:50 +02:00
8d630a6f62 Update version for the next release (v0.26.1) 2022-04-14 11:44:06 +02:00
00f78d6b5a Apply code suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-04-14 11:14:08 +02:00
399fba16bb only flatten an object if it's nested 2022-04-14 11:14:08 +02:00
ee64f4a936 Use smartstring to store the external id in our hashmap
We need to store all the external id (primary key) in a hashmap
associated to their internal id during.
The smartstring remove heap allocation / memory usage and should
improve the cache locality.
2022-04-13 21:22:07 +02:00
dda28d7415 exclude excluded canditates from search result candidates 2022-04-13 12:10:35 +02:00
cd83014fff add test for disctinct nb hits 2022-04-13 12:10:35 +02:00
bbb6728d2f add distinct attributes to cli 2022-04-13 12:10:35 +02:00
5809d3ae0d Add first benchmarks on formatting 2022-04-12 16:31:58 +02:00
827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
011f8210ed Make compute_matches more rust idiomatic 2022-04-12 10:19:02 +02:00
a16de5de84 Symplify format and remove intermediate function 2022-04-08 11:20:41 +02:00
a769e09dfa Make token_crop_bounds more rust idiomatic 2022-04-07 20:15:14 +02:00
9ac2fd1c37 Merge #487
487: Update version (v0.26.0) r=Kerollmops a=curquiza

breaking because of #458 

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-07 17:10:24 +00:00
bab898ce86 move the flatten-serde-json crate inside of milli 2022-04-07 18:20:44 +02:00
c8ed1675a7 Add some documentation 2022-04-07 17:32:13 +02:00
b1905dfa24 Make split_best_frequency returns references instead of owned data 2022-04-07 17:05:44 +02:00
ab458d8840 fix tests after rebase 2022-04-07 17:00:00 +02:00
4f3ce6d9cd nested fields 2022-04-07 16:58:46 +02:00
ee1d627803 Update version (v0.26.0) 2022-04-07 15:56:10 +02:00
4ae7aea3b2 Merge #486
486: Update version (v0.25.0) r=curquiza a=curquiza

v0.25.0 will be released once #478 is merged

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-06 11:40:41 +00:00
b799f3326b rename merge_nothing to merge_ignore_values 2022-04-05 18:44:35 +02:00
fa7d3a37c0 Make some cleaning and add comments 2022-04-05 17:48:56 +02:00
3bb1e35ada Fix match count 2022-04-05 17:48:45 +02:00
56e0edd621 Put crop markers direclty around words 2022-04-05 17:41:32 +02:00
a93cd8c61c Fix prefix highlight with special chars 2022-04-05 17:41:32 +02:00
b3f0f39106 Make some cleaning 2022-04-05 17:41:32 +02:00
6dc345bc53 Test and Fix prefix highlight 2022-04-05 17:41:32 +02:00
bd30ee97b8 Keep separators at start of the croped string 2022-04-05 17:41:32 +02:00
29c5f76d7f Use new matcher in http-ui 2022-04-05 17:41:32 +02:00
734d0899d3 Publish Matcher 2022-04-05 17:41:32 +02:00
4428cb5909 Add some tests and fix some corner cases 2022-04-05 17:41:32 +02:00