Commit Graph

1210 Commits

Author SHA1 Message Date
tamo
a2bff68c1a remove the optional words for the typo criterion 2021-06-02 11:05:07 +02:00
tamo
aee49bb3cd add the proximity criterion 2021-06-02 11:05:07 +02:00
tamo
49e4cc3daf add the words criterion to the bench 2021-06-02 11:05:07 +02:00
tamo
15cce89a45 update the README with instructions to get the download the dataset 2021-06-02 11:05:07 +02:00
tamo
e425f70ef9 let criterion decide how much iteration it wants to do in 10s 2021-06-02 11:05:07 +02:00
tamo
4fdbfd6048 push a first version of the benchmark for the typo 2021-06-02 11:05:07 +02:00
bors[bot]
270da98c46 Merge #202
202: Add field id word count docids database r=Kerollmops a=LegendreM

This PR introduces a new database, `field_id_word_count_docids`, that maps the number of words in an attribute with a list of document ids. This relation is limited to attributes that contain less than 11 words.
This database is used by the exactness criterion to know if a document has an attribute that contains exactly the query without any additional word.

Fix #165 
Fix #196
Related to [specifications:#36](https://github.com/meilisearch/specifications/pull/36)

Co-authored-by: many <maxime@meilisearch.com>
Co-authored-by: Many <legendre.maxime.isn@gmail.com>
2021-06-01 16:09:48 +00:00
many
e857ca4d7d Fix PR comments 2021-06-01 18:06:46 +02:00
Many
ab2cf69e8d Update milli/src/update/delete_documents.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-06-01 17:04:10 +02:00
Many
8e6d1ff0dc Update milli/src/update/index_documents/store.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-06-01 17:04:02 +02:00
bors[bot]
168fe0aa28 Merge #206
206: Fix http-ui r=Kerollmops a=irevoire

I just noticed that `http-ui` was not compiling on `main`.
I'm not sure this is the best fix, but it works 👀

Co-authored-by: Tamo <irevoire@hotmail.fr>
2021-06-01 14:31:32 +00:00
Tamo
608c5bad24 fix http-ui 2021-06-01 16:24:46 +02:00
bors[bot]
7d36d664a7 Merge #203
203: Make the MatchingWords return the number of matching bytes r=Kerollmops a=LegendreM

Make the MatchingWords return the number of matching bytes using a custom Levenshtein algorithm.

Fix #138

Co-authored-by: many <maxime@meilisearch.com>
2021-06-01 12:00:33 +00:00
many
225ae6fd25 Resolve PR comments 2021-06-01 11:53:09 +02:00
bors[bot]
2f9f6a1f21 Merge #169
169: Optimize roaring codec r=Kerollmops a=MarinPostma

Optimize the `BoRoaringBitmapCodec` by preventing it from emiting useless error that caused allocation. On my flamegraph, the byte_decode function went from 4.13% to  1.70% (of transplant graph).

This may not be the greatest optimization ever, but hey, this was a low hanging fruit.

before:
![image](https://user-images.githubusercontent.com/28804882/116241125-17018880-a754-11eb-9f9d-a67418d100e1.png)
after:
![image](https://user-images.githubusercontent.com/28804882/116241167-21bc1d80-a754-11eb-9afc-d9d72727477c.png)



Co-authored-by: Marin Postma <postma.marin@protonmail.com>
2021-06-01 06:30:25 +00:00
Marin Postma
984dc7c1ed rewrite roaring codec without byteorder. 2021-05-31 22:15:39 +02:00
Marin Postma
1373637da1 optimize roaring codec 2021-05-31 22:15:35 +02:00
many
1df68d342a Make the MatchingWords return the number of matching bytes 2021-05-31 18:22:29 +02:00
many
b8e6db0feb Add database in infos crate 2021-05-31 16:29:27 +02:00
many
c701f8bf36 Use field id word count database in exactness criterion 2021-05-31 16:27:28 +02:00
many
4ddf008be2 add field id word count database 2021-05-31 16:27:28 +02:00
bors[bot]
2f5e61bacb Merge #184
184: Transfer numbers and strings facets into the appropriate facet databases r=Kerollmops a=Kerollmops

This pull request is related to https://github.com/meilisearch/milli/issues/152 and changes the layout of the facets values, numbers and strings are now in dedicated databases and the user no more needs to define the type of the fields. No more conversion between the two types is done, numbers (floats and integers converted to f64) go to the facet float database and strings go to the strings facet database.

There is one related issue that I found regarding CSVs, the values in a CSV are always considered to be strings, [meilisearch/specifications#28](d916b57d74/text/0028-indexing-csv.md) fixes this issue by allowing the user to define the fields types using `:` in the "CSV Formatting Rules" section.

All previous tests on facets have been modified to pass again and I have also done hand-driven tests with the 115m songs dataset. Everything seems to be good!

Fixes #192.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2021-05-31 13:32:58 +00:00
Kerollmops
1c0a5cd136 Resolve code modification suggestions 2021-05-31 15:22:50 +02:00
bors[bot]
76b9178b16 Merge #200
200: Fix plane sweep algorithm r=Kerollmops a=LegendreM

Fix plain sweep algorithm after creating some tests on proximity.

Co-authored-by: many <maxime@meilisearch.com>
2021-05-26 11:36:24 +00:00
many
a5e98cf46d Fix plane sweep algorithm 2021-05-25 18:21:55 +02:00
Kerollmops
5012cc3a32 Fix the http-ui crate to support split facet databases 2021-05-25 11:31:06 +02:00
Kerollmops
28bd9e183e Fix the infos crate to support split facet databases 2021-05-25 11:31:06 +02:00
Clément Renault
3a4a150ef0 Fix the tests and remaining warnings 2021-05-25 11:31:06 +02:00
Clément Renault
02c655ff1a Refine the facet distribution to use both databases 2021-05-25 11:30:00 +02:00
Clément Renault
79efded841 Refine the FacetCondition from_array constructor 2021-05-25 11:30:00 +02:00
Clément Renault
f7efde11d9 Refine the facet condition to use both facet databases 2021-05-25 11:30:00 +02:00
Clément Renault
e62b89a2ed Make the facet distinct work with the new split facets 2021-05-25 11:30:00 +02:00
Clément Renault
bd7b285bae Split the update side to use the number and the strings facet databases 2021-05-25 11:30:00 +02:00
Clément Renault
038e03a4e4 Use both facet databases in the FacetIter type 2021-05-25 11:30:00 +02:00
Clément Renault
597144b0b9 Use both number and string facet databases in the distinct system 2021-05-25 11:29:59 +02:00
Clément Renault
837c1041c7 Clear and delete the documents from the facet database 2021-05-25 11:28:36 +02:00
Clément Renault
a56c46b6f1 Explode the string and f64 facet databases into two 2021-05-25 11:28:36 +02:00
Clément Renault
df7a32e3d0 Move the creation date initialization into a function 2021-05-25 11:28:35 +02:00
bors[bot]
49bee2ebc5 Merge #190
190: Make bucket candidates optionals r=Kerollmops a=LegendreM

Before the bucket candidates were the result of the facet filters or result of the query tree.
They will now be only the result of the query tree, making the number of candidates more consistent between the same request with or without facet filters.

Fix some clippy warnings.

Fix #186 

Co-authored-by: many <maxime@meilisearch.com>
2021-05-24 11:19:32 +00:00
many
a3944a7083 Introduce a filtered_candidates field 2021-05-11 11:37:40 +02:00
many
efba662ca6 Fix clippy warnings in cirteria 2021-05-10 10:27:18 +02:00
many
e923d51b8f Make bucket candidates optionals 2021-05-10 10:27:04 +02:00
Marin Postma
eeb0c70ea2 meilisearch compatible primary key inference 2021-05-06 22:42:32 +02:00
Marin Postma
313c362461 early return on empty document addition 2021-05-06 18:14:16 +02:00
Many
c620626515 Merge pull request #188 from meilisearch/exactness-criterion
Exactness criterion
2021-05-06 17:56:21 +02:00
Many
44b6843de7 Fix pull request reviews
Update milli/src/fields_ids_map.rs
Update milli/src/search/criteria/exactness.rs
Update milli/src/search/criteria/mod.rs
2021-05-06 14:31:03 +02:00
many
c1ce4e4ca9 Introduce mocked ExactAttribute step in exactness criterion 2021-05-06 14:28:31 +02:00
many
a3f8686fbf Introduce exactness criterion 2021-05-06 14:28:30 +02:00
bors[bot]
25f75d4d03 Merge #189
189: Update version for the next release (v0.2.1) r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2021-05-05 15:28:56 +00:00
bors[bot]
7e63e32960 Merge #187
187: Fix fields distribution after documents merge r=Kerollmops a=shekhirin

Resolves https://github.com/meilisearch/milli/issues/174

The problem was with calculation of fields distribution before the merge in `output_from_sorter()`. So if you'd import two documents with the same primary key value, fields distribution will count it as two documents, while `output_from_sorter()` will merge these documents into one.

---

```console
➜ Downloads cat short_movies.json
[
{"id":"47474","title":"The Serpent's Egg","poster":"https://image.tmdb.org/t/p/w500/n7z0doFkXHcvo8QQWHLFnkEPXRU.jpg","overview":"The Serpent's Egg follows a week in the life of Abel Rosenberg, an out-of-work American circus acrobat living in poverty-stricken Berlin following Germany's defeat in World War I.","release_date":246844800,"genres":["Thriller","Drama","Mystery"]},
{"id":"47474","title":"The Serpent's Egg","poster":"https://image.tmdb.org/t/p/w500/n7z0doFkXHcvo8QQWHLFnkEPXRU.jpg","overview":"The Serpent's Egg follows a week in the life of Abel Rosenberg, an out-of-work American circus acrobat living in poverty-stricken Berlin following Germany's defeat in World War I.","release_date":246844800,"genres":["Thriller","Drama","Mystery"]}
]
➜ Downloads curl -X POST -H "Content-Type: text/json" --data-binary @short_movies.json 127.0.0.1:7700/indexes/movies/documents
{"updateId":0}
```

## Before
```console
➜ Downloads curl -s 127.0.0.1:7700/indexes/movies/stats | jq
{
  "numberOfDocuments": 1,
  "isIndexing": false,
  "fieldsDistribution": {
    "release_date": 2,
    "poster": 2,
    "title": 2,
    "overview": 2,
    "genres": 2,
    "id": 2
  }
}
```

## After
```console
➜ Downloads curl -s 127.0.0.1:7700/indexes/movies/stats | jq
{
  "numberOfDocuments": 1,
  "isIndexing": false,
  "fieldsDistribution": {
    "poster": 1,
    "release_date": 1,
    "title": 1,
    "genres": 1,
    "id": 1,
    "overview": 1
  }
}
```

Co-authored-by: Alexey Shekhirin <a.shekhirin@gmail.com>
2021-05-05 14:45:08 +00:00