Commit Graph

331 Commits

Author SHA1 Message Date
f6f8f543e1 Run cargo fmt 2022-08-17 12:17:18 +02:00
34c991ea02 Add newlines in documentation of word_prefix_pair_proximity_docids 2022-08-17 12:17:18 +02:00
06f3fd8c6d Add more comments to WordPrefixPairProximityDocids::execute 2022-08-17 12:17:18 +02:00
474500362c Update wpppd snapshots
New snapshot (yes, it's wrong as well, it will get fixed later):

---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5                a    1  [101, ]
5                a    2  [101, ]
5                am   1  [101, ]
5                b    4  [101, ]
5                be   4  [101, ]
am               a    3  [101, ]
amazing          a    1  [100, ]
amazing          a    2  [100, ]
amazing          a    3  [100, ]
amazing          an   1  [100, ]
amazing          an   2  [100, ]
amazing          b    2  [100, ]
amazing          be   2  [100, ]
an               a    1  [100, ]
an               a    2  [100, 202, ]
an               am   1  [100, ]
an               b    3  [100, ]
an               be   3  [100, ]
and              a    2  [100, ]
and              a    3  [100, ]
and              a    4  [100, ]
and              b    1  [100, ]
and              be   1  [100, ]
                 d\0  0  [100, 202, ]
an               an   2  [100, ]
and              am   2  [100, ]
and              an   3  [100, ]
at               a    2  [100, 101, ]
at               a    3  [100, ]
at               am   2  [100, 101, ]
at               an   1  [100, 202, ]
at               an   3  [100, ]
at               b    3  [101, ]
at               b    4  [100, ]
at               be   3  [101, ]
at               be   4  [100, ]
beautiful        a    2  [100, ]
beautiful        a    3  [100, ]
beautiful        a    4  [100, ]
beautiful        am   3  [100, ]
beautiful        an   2  [100, ]
beautiful        an   4  [100, ]
bell             a    2  [101, ]
bell             a    4  [101, ]
bell             am   4  [101, ]
extraordinary    a    2  [202, ]
extraordinary    a    3  [202, ]
extraordinary    an   2  [202, ]
house            a    4  [100, 202, ]
house            a    4  [100, ]
house            am   4  [100, ]
house            an   3  [100, 202, ]
house            b    2  [100, ]
house            be   2  [100, ]
rings            a    1  [101, ]
rings            a    3  [101, ]
rings            am   3  [101, ]
rings            b    2  [101, ]
rings            be   2  [101, ]
the              a    3  [101, ]
the              b    1  [101, ]
the              be   1  [101, ]
2022-08-17 12:17:18 +02:00
ea4a96761c Move content of readme for WordPrefixPairProximityDocids into the code 2022-08-17 12:05:37 +02:00
220921628b Simplify and document WordPrefixPairProximityDocIds::execute 2022-08-17 11:59:19 +02:00
044356d221 Optimise WordPrefixPairProximityDocIds merge operation 2022-08-17 11:59:18 +02:00
d350114159 Add tests for WordPrefixPairProximityDocIds 2022-08-17 11:59:15 +02:00
86807ca848 Refactor word prefix pair proximity indexation further 2022-08-17 11:59:13 +02:00
306593144d Refactor word prefix pair proximity indexation 2022-08-17 11:59:00 +02:00
12920f2a4f Fix paths of snapshot tests 2022-08-10 15:53:46 +02:00
8ac24d3114 Cargo fmt + fix compiler warnings/error 2022-08-10 15:53:46 +02:00
6066256689 Add snapshot tests for indexing of word_prefix_pair_proximity_docids 2022-08-10 15:53:46 +02:00
3a734af159 Add snapshot tests for Facets::execute 2022-08-10 15:53:46 +02:00
58cb1c1bda Simplify unit tests in facet/filter.rs 2022-08-04 12:03:44 +02:00
acff17fb88 Simplify indexing tests 2022-08-04 12:03:13 +02:00
21284cf235 Merge #556
556: Add EXISTS filter r=loiclec a=loiclec

## What does this PR do?

Fixes issue [#2484](https://github.com/meilisearch/meilisearch/issues/2484) in the meilisearch repo.

It creates a `field EXISTS` filter which selects all documents containing the `field` key. 
For example, with the following documents:
```json
[{
	"id": 0,
	"colour": []
},
{
	"id": 1,
	"colour": ["blue", "green"]
},
{
	"id": 2,
	"colour": 145238
},
{
	"id": 3,
	"colour": null
},
{
	"id": 4,
	"colour": {
		"green": []
	}
},
{
	"id": 5,
	"colour": {}
},
{
	"id": 6
}]
```
Then the filter `colour EXISTS` selects the ids `[0, 1, 2, 3, 4, 5]`. The filter `colour NOT EXISTS` selects `[6]`.
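
As a rough illustration of the rule above, here is a minimal, self-contained sketch. It is not milli's filter code: the `Value` enum, `sample_documents`, and `select` are hypothetical names made up for this example. The only thing it models is that `EXISTS` looks at whether the key is present, whatever its value (`null`, `[]` and `{}` all count).

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (hypothetical, for illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Value {
    Null,
    Number(i64),
    Text(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

type Document = BTreeMap<String, Value>;

/// Builds the seven documents from the example above (ids 0 to 6).
fn sample_documents() -> Vec<Document> {
    let colours = vec![
        Some(Value::Array(vec![])),
        Some(Value::Array(vec![Value::Text("blue".into()), Value::Text("green".into())])),
        Some(Value::Number(145238)),
        Some(Value::Null),
        Some(Value::Object(BTreeMap::from([("green".to_string(), Value::Array(vec![]))]))),
        Some(Value::Object(BTreeMap::new())),
        None, // document 6 has no `colour` key at all
    ];
    colours
        .into_iter()
        .enumerate()
        .map(|(id, colour)| {
            let mut doc = Document::new();
            doc.insert("id".to_string(), Value::Number(id as i64));
            if let Some(value) = colour {
                doc.insert("colour".to_string(), value);
            }
            doc
        })
        .collect()
}

/// Returns the ids of the documents where `field` is present (`must_exist == true`)
/// or absent (`must_exist == false`). The value stored under the key is ignored.
fn select(docs: &[Document], field: &str, must_exist: bool) -> Vec<i64> {
    docs.iter()
        .filter(|doc| doc.contains_key(field) == must_exist)
        .filter_map(|doc| match doc.get("id") {
            Some(Value::Number(id)) => Some(*id),
            _ => None,
        })
        .collect()
}

fn main() {
    let docs = sample_documents();
    println!("colour EXISTS     -> {:?}", select(&docs, "colour", true));
    println!("colour NOT EXISTS -> {:?}", select(&docs, "colour", false));
}
```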

## Details
There is a new database named `facet-id-exists-docids`. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.
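
To make the shape of that database concrete, here is a small sketch under stated assumptions. It keeps everything in memory, uses a `BTreeSet<u32>` as a stand-in for a roaring bitmap, and assumes `u16` field ids and `u32` document ids. The names `ExistsIndex`, `build_exists_index`, `exists` and `not_exists` are hypothetical and do not come from milli.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Assumed id widths, chosen for this sketch only.
type FieldId = u16;
type DocumentId = u32;
/// Stand-in for a bitmap of document ids.
type DocIds = BTreeSet<DocumentId>;
/// Field id -> ids of the documents in which that field exists.
type ExistsIndex = BTreeMap<FieldId, DocIds>;

/// Builds the index from `(document id, field ids present in the document)` pairs.
fn build_exists_index(docs: &[(DocumentId, Vec<FieldId>)]) -> ExistsIndex {
    let mut index = ExistsIndex::new();
    for (docid, fields) in docs {
        for field in fields {
            index.entry(*field).or_insert_with(DocIds::new).insert(*docid);
        }
    }
    index
}

/// `field EXISTS`: a single lookup in the index.
fn exists(index: &ExistsIndex, field: FieldId) -> DocIds {
    index.get(&field).cloned().unwrap_or_default()
}

/// `field NOT EXISTS`: every known document minus the ones where the field exists.
fn not_exists(index: &ExistsIndex, all_docids: &DocIds, field: FieldId) -> DocIds {
    let present = exists(index, field);
    all_docids.difference(&present).copied().collect()
}

fn main() {
    // Field 0 plays the role of `id`, field 1 the role of `colour`.
    let docs: Vec<(DocumentId, Vec<FieldId>)> =
        vec![(0, vec![0, 1]), (1, vec![0, 1]), (2, vec![0])];
    let all_docids: DocIds = docs.iter().map(|(docid, _)| *docid).collect();

    let index = build_exists_index(&docs);
    println!("EXISTS(1)     = {:?}", exists(&index, 1));
    println!("NOT EXISTS(1) = {:?}", not_exists(&index, &all_docids, 1));
}
```

Because the map is keyed by field id and kept sorted, answering `field EXISTS` is one lookup, and `NOT EXISTS` is a set difference against the set of all document ids.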

To create this database, the indexing part of milli had to be adapted. The implementation there is basically copy/pasted from the code handling the `facet-id-f64-docids` database, with appropriate modifications in place.

There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:
```json
{
    "id": 0,
    "colour": [],
    "size": {}
}
```
would be flattened to:
```json
{
    "id": 0
}
```
prior to being given to the extraction pipeline.

This transformation would lose the information that is needed to populate the `facet-id-exists-docids` database. Therefore, I have also changed the implementation of the `flatten-serde-json` crate. Now, as it traverses the JSON, it keeps track of every key it encounters. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as its value. For example:
```json
{
    "id": 0,
    "colour": {
        "green": [],
        "blue": 1
    },
    "size": {}
} 
```
becomes
```json
{
    "id": 0,
    "colour": [],
    "colour.green": [],
    "colour.blue": 1,
    "size": []
} 
```
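
The key-tracking idea can be sketched as follows. This is not the `flatten-serde-json` implementation: to stay dependency-free it uses a hypothetical `Json` enum instead of `serde_json`, and its handling of arrays is deliberately simplified. It only shows the two passes described above: flatten while recording every key seen, then re-add any recorded key that is missing with an empty array.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Minimal stand-in for a JSON value (hypothetical, so no external crate is needed).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Number(i64),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

type Flat = BTreeMap<String, Json>;

/// Appends `value` to the array stored under `key`, creating the array if needed.
fn push(out: &mut Flat, key: &str, value: Json) {
    let slot = out.entry(key.to_string()).or_insert_with(|| Json::Array(Vec::new()));
    match slot {
        Json::Array(items) => items.push(value),
        // A scalar is already stored under this key: turn it into an array.
        other => {
            let previous = std::mem::replace(other, Json::Null);
            *other = Json::Array(vec![previous, value]);
        }
    }
}

/// First pass: flattens `fields` into `out` and records every key in `seen`.
/// Empty arrays and empty objects write nothing here, which is how keys get lost.
fn visit_object(prefix: &str, fields: &[(String, Json)], out: &mut Flat, seen: &mut BTreeSet<String>) {
    for (name, child) in fields {
        let key = if prefix.is_empty() { name.clone() } else { format!("{}.{}", prefix, name) };
        seen.insert(key.clone());
        match child {
            Json::Object(inner) => visit_object(&key, inner, out, seen),
            Json::Array(items) => {
                for item in items {
                    match item {
                        Json::Object(inner) => visit_object(&key, inner, out, seen),
                        other => push(out, &key, other.clone()),
                    }
                }
            }
            scalar => {
                if out.contains_key(&key) {
                    push(out, &key, scalar.clone());
                } else {
                    out.insert(key, scalar.clone());
                }
            }
        }
    }
}

/// Flattens a document, then re-adds every key that was seen but produced no entry.
fn flatten(fields: &[(String, Json)]) -> Flat {
    let mut out = Flat::new();
    let mut seen = BTreeSet::new();
    visit_object("", fields, &mut out, &mut seen);
    // Second pass: missing keys come back with an empty array as value.
    for key in seen {
        out.entry(key).or_insert_with(|| Json::Array(Vec::new()));
    }
    out
}

/// The input document from the example above.
fn sample_document() -> Vec<(String, Json)> {
    vec![
        ("id".to_string(), Json::Number(0)),
        ("colour".to_string(), Json::Object(vec![
            ("green".to_string(), Json::Array(vec![])),
            ("blue".to_string(), Json::Number(1)),
        ])),
        ("size".to_string(), Json::Object(vec![])),
    ]
}

fn main() {
    for (key, value) in flatten(&sample_document()) {
        println!("{}: {:?}", key, value);
    }
}
```

On the sample document the sketch yields the same five keys as the flattened output shown above: `id`, `colour`, `colour.green`, `colour.blue` and `size`.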


Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-08-04 09:46:06 +00:00
50f6524ff2 Merge #579
579: Stop reindexing already indexed documents r=ManyTheFish a=irevoire

```
 % ./compare.sh indexing_stop-reindexing-unchanged-documents_cb5a1669.json indexing_main_eeba1960.json
group                                                                     indexing_main_eeba1960                 indexing_stop-reindexing-unchanged-documents_cb5a1669
-----                                                                     ----------------------                 -----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.03      2.0±0.22ms        ? ?/sec    1.00  1955.4±336.24µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.08     11.0±2.93ms        ? ?/sec    1.00     10.2±4.04ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     15.1±3.89ms        ? ?/sec    1.14     17.1±5.18ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.26    59.2±12.01ms        ? ?/sec    1.00     47.1±8.52ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.08   316.6±31.53ms        ? ?/sec    1.00   293.6±17.00ms        ? ?/sec
indexing/Indexing geo_point                                               1.01      60.9±0.31s        ? ?/sec    1.00      60.6±0.36s        ? ?/sec
indexing/Indexing movies in three batches                                 1.04      20.0±0.30s        ? ?/sec    1.00      19.2±0.25s        ? ?/sec
indexing/Indexing movies with default settings                            1.02      19.1±0.18s        ? ?/sec    1.00      18.7±0.24s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.02      26.2±0.29s        ? ?/sec    1.00      25.9±0.22s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.02      25.3±0.32s        ? ?/sec    1.00      24.7±0.26s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.00      66.7±0.41s        ? ?/sec    1.01      67.1±0.86s        ? ?/sec
indexing/Indexing songs with default settings                             1.00      58.3±0.90s        ? ?/sec    1.01      58.8±1.32s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      54.5±1.43s        ? ?/sec    1.01      55.2±1.29s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.00      57.9±1.20s        ? ?/sec    1.01      58.4±0.93s        ? ?/sec
indexing/Indexing wiki                                                    1.00   1052.0±10.95s        ? ?/sec    1.02   1069.4±20.38s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00    1193.1±8.83s        ? ?/sec    1.00    1189.5±9.40s        ? ?/sec
indexing/Reindexing geo_point                                             3.22      67.5±0.73s        ? ?/sec    1.00      21.0±0.16s        ? ?/sec
indexing/Reindexing movies with default settings                          3.75      19.4±0.28s        ? ?/sec    1.00       5.2±0.05s        ? ?/sec
indexing/Reindexing songs with default settings                           8.90      61.4±0.91s        ? ?/sec    1.00       6.9±0.07s        ? ?/sec
indexing/Reindexing wiki                                                  1.00   1748.2±35.68s        ? ?/sec    1.00   1750.5±18.53s        ? ?/sec
```

tldr: We do not lose any performance on the normal indexing benchmarks, and reindexing gets roughly 3 to 9 times faster on geo_point, movies, and songs (reindexing wiki is unchanged) 👍

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-08-04 08:10:37 +00:00
d6f9a60a32 fix: Remove whitespace trimming during document id validation
fix #592
2022-08-03 11:38:40 +02:00
7fc35c5586 remove the useless prints 2022-08-02 10:31:22 +02:00
f156d7dd3b Stop reindexing already indexed documents 2022-08-02 10:31:20 +02:00
07003704a8 Merge branch 'filter/field-exist' 2022-07-21 14:51:41 +02:00
1506683705 Avoid using too much memory when indexing facet-exists-docids 2022-07-19 14:42:35 +02:00
aed8c69bcb Refactor indexation of the "facet-id-exists-docids" database
The idea is to directly create a sorted and merged list of bitmaps
in the form of a BTreeMap<FieldId, RoaringBitmap> instead of creating
a grenad::Reader where the keys are field_id and the values are docids.

Then we send that BTreeMap to the thing that handles TypedChunks, which
inserts its content into the database.
2022-07-19 10:07:33 +02:00
1eb1e73bb3 Add integration tests for the EXISTS filter 2022-07-19 10:07:33 +02:00
80b962b4f4 Run cargo fmt 2022-07-19 10:07:33 +02:00
c17d616250 Refactor index_documents_check_exists_database tests 2022-07-19 10:07:33 +02:00
30bd4db0fc Simplify indexing task for facet_exists_docids database 2022-07-19 10:07:33 +02:00
392472f4bb Apply suggestions from code review
Co-authored-by: Tamo <tamo@meilisearch.com>
2022-07-19 10:07:33 +02:00
453d593ce8 Add a database containing the docids where each field exists 2022-07-19 10:07:33 +02:00
fc9f3f31e7 Change DocumentsBatchReader to access cursor and index at same time
Otherwise it is not possible to iterate over all documents while
using the fields index at the same time.
2022-07-18 16:08:14 +02:00
ab1571cdec Simplify Transform::read_documents, enabled by enriched documents reader 2022-07-18 12:45:47 +02:00
448114cc1c Fix the benchmarks with the new indexation API 2022-07-12 15:22:09 +02:00
25e768f31c Fix another issue with the nested primary key selector 2022-07-12 15:14:07 +02:00
192793ee38 Add some tests to check for the nested documents ids 2022-07-12 15:14:07 +02:00
dc61105554 Fix the nested document id fetching function 2022-07-12 15:14:06 +02:00
2eec290424 Check the validity of the latitude and longitude numbers 2022-07-12 15:14:06 +02:00
5d149d631f Remove tests for a function that no longer exists 2022-07-12 15:14:06 +02:00
0bbcc7b180 Expose the DocumentId struct to be sure to inject the generated ids 2022-07-12 15:14:06 +02:00
d1a4da9812 Generate a real UUIDv4 when ids are auto-generated 2022-07-12 15:14:06 +02:00
c8ebf0de47 Rename the validate function as an enriching function 2022-07-12 15:14:06 +02:00
905af2a2e9 Use the primary key and external id in the transform 2022-07-12 15:14:05 +02:00
742543091e Constify the default primary key name 2022-07-12 14:55:52 +02:00
5f1bfb73ee Extract the primary key name and make it accessible 2022-07-12 14:55:52 +02:00
6a0a0ae94f Make the Transform read from an EnrichedDocumentsBatchReader 2022-07-12 14:55:52 +02:00
8ebf5eed0d Make the nested primary key work 2022-07-12 14:55:52 +02:00
19eb3b4708 Make sure that we do not accept floats as document ids 2022-07-12 14:55:52 +02:00
2ceeb51c37 Support the auto-generated ids when validating documents 2022-07-12 14:55:51 +02:00
399eec5c01 Fix the indexation tests 2022-07-12 14:55:51 +02:00
fcfc4caf8c Move the Object type in the lib.rs file and use it everywhere 2022-07-12 14:55:51 +02:00