Commit Graph

106 Commits

Author SHA1 Message Date
89637bcaaf Use bumparaw-collections in Meilisearch/milli 2024-12-10 11:52:20 +01:00
07f42e8057 Do not index a field count when no word is counted 2024-12-09 15:45:12 +01:00
bd5110a2fe Fix clippy warnings 2024-12-05 16:13:07 +01:00
fa8b9acdf6 Ignore documents that didn't change in facets 2024-12-05 16:12:52 +01:00
2b74d1824b Ignore documents that didn't change any field in word pair proximity 2024-12-05 15:56:22 +01:00
c77b00d3ac Don't extract word docids when no searchable changed 2024-12-05 15:51:58 +01:00
cac355bfa7 Merge #5124
5124: Optimize Prefixes and Merges r=ManyTheFish a=Kerollmops

In this PR, we optimize reads from LMDB by reading the entries in lexicographic order, which makes better use of the OS memory-mapping cache:

 - Optimize the prefix generation for word position docids (`@manythefish`)
 - Optimize the parallel merging of the caches by sorting the entries before merging the caches (`@kerollmops`)
 
## Benchmarks on 1 CPU, 2 GB, gpo3 (5k IOPS)
 
Before, on the tag `meilisearch-v1.12.0-rc.3`:

```
word_position_docids:merge_and_send_docids: 988s
compute_word_fst: 23.3s
word_pair_proximity_docids:merge_and_send_docids: 428s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 76.3s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 429s
```

After, on this branch, which sorts the entries of all the `HashMap`s into a `Vec` before merging:

```
word_position_docids:merge_and_send_docids: 202s
compute_word_fst: 20.4s
word_pair_proximity_docids:merge_and_send_docids: 427s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 65.5s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 62.5s
```
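
The approach measured above, sorting the cache entries lexicographically before merging them, can be sketched roughly as follows. This is only an illustration and not the code from this PR: the `Cache` alias, the `merge_sorted` function, and the use of `BTreeSet<u32>` for document ids are all assumptions made for the example.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a per-thread cache: key bytes -> document ids.
type Cache = HashMap<Vec<u8>, BTreeSet<u32>>;

/// Merges several unordered caches and returns the entries sorted by key,
/// so that a consumer can visit an ordered key-value store sequentially.
fn merge_sorted(caches: Vec<Cache>) -> Vec<(Vec<u8>, BTreeSet<u32>)> {
    // Move every entry out of its HashMap into one flat Vec.
    let mut entries: Vec<(Vec<u8>, BTreeSet<u32>)> =
        caches.into_iter().flat_map(|cache| cache.into_iter()).collect();

    // Vec<u8> compares bytewise, which gives lexicographic key order.
    entries.sort_unstable_by(|(a, _), (b, _)| a.cmp(b));

    // Fold consecutive entries that share a key into a single entry.
    let mut merged: Vec<(Vec<u8>, BTreeSet<u32>)> = Vec::with_capacity(entries.len());
    for (key, docids) in entries {
        let same_key = merged.last().map_or(false, |(last_key, _)| *last_key == key);
        if same_key {
            if let Some((_, last_docids)) = merged.last_mut() {
                last_docids.extend(docids);
            }
        } else {
            merged.push((key, docids));
        }
    }
    merged
}

fn main() {
    let mut first = Cache::new();
    first.insert(b"world".to_vec(), BTreeSet::from([1, 2]));
    first.insert(b"hello".to_vec(), BTreeSet::from([1]));

    let mut second = Cache::new();
    second.insert(b"hello".to_vec(), BTreeSet::from([3]));

    for (key, docids) in merge_sorted(vec![first, second]) {
        println!("{} -> {:?}", String::from_utf8_lossy(&key), docids);
    }
}
```

Writing or reading the keys in this order means successive accesses tend to touch neighbouring pages of the memory-mapped file, which is the effect the PR description attributes the speed-up to.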

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-05 09:35:52 +00:00
52843123d4 Clean up and remove the non-sorted merge_caches function 2024-12-05 10:03:05 +01:00
5f896b1050 Fix geo when spilling 2024-12-04 17:51:12 +01:00
2e32d0474c Lexicographically sort all the maps to merge 2024-12-04 17:05:11 +01:00
cb99ac6f7e Consume vec instead of draining 2024-12-04 17:00:22 +01:00
be411435f5 Use the merge_caches_alt function in the docids merging 2024-12-04 16:37:29 +01:00
29ef164530 Introduce a new semi ordered merge function 2024-12-04 16:33:35 +01:00
db4eaf4d2d Rename serialize_into into serialize_into_writer 2024-12-02 10:03:27 +01:00
08d6413365 Fix result types 2024-11-27 14:32:42 +01:00
70802eb7c7 Fix most issues with the lifetimes 2024-11-27 14:32:42 +01:00
6ac5b3b136 Finish most of the channels types 2024-11-27 14:32:26 +01:00
8442db8101 Implement mostly all senders 2024-11-27 14:16:35 +01:00
221e547e86 Slight changes 2024-11-21 16:47:44 +01:00
61d0615253 Document the geo point extractor 2024-11-21 16:47:08 +01:00
5727e00374 Remove useless geo skipped 2024-11-21 16:47:08 +01:00
36962b943b First batch of PR comment 2024-11-21 16:38:11 +01:00
a38344acb3 Replace eprintlns by tracing 2024-11-20 15:29:51 +01:00
4d616f8794 Parse every attribute and filter before tokenization 2024-11-20 15:15:25 +01:00
fe5d50969a Fix field selector in extractors 2024-11-20 13:16:44 +01:00
56c7c5d5f0 Fix comments 2024-11-20 13:16:44 +01:00
2afa33011a Fix tokenize_document 2024-11-20 13:16:43 +01:00
f893b5153e Don't mark [""] as empty facet 2024-11-20 13:16:42 +01:00
ca779c21f9 facets: Handle boolean and skip empty strings 2024-11-20 13:16:42 +01:00
b1f8aec348 Fix index_documents_check_exists_database 2024-11-20 13:16:41 +01:00
ba7f091db3 Use tokenizer on numbers and booleans 2024-11-20 13:16:41 +01:00
8049df125b Add depth to facet extraction so that null inside an array doesn't mark the entire field as null 2024-11-20 13:16:40 +01:00
41dbdd2d18 Fix filtered_placeholder_search_should_not_return_deleted_documents and word_scale_set_and_reset 2024-11-19 16:08:25 +01:00
c782c09208 Move step to a dedicated mod and replace it with an enum 2024-11-18 18:22:13 +01:00
04c38220ca Move MostlySend, ThreadLocal, FullySend to their own commit 2024-11-18 16:43:05 +01:00
5f93651cef fixes 2024-11-18 16:23:11 +01:00
0a21d9bfb3 Fix double borrow of new fields id map 2024-11-18 15:56:01 +01:00
5b4c06c24c Plug the grenad max memory parameter 2024-11-18 11:28:04 +01:00
4ff2b3c2ee Fix test on locales 2024-11-14 15:45:04 +01:00
91c58cfa38 Fix positional databases 2024-11-14 11:40:12 +01:00
9e8367f1e6 Move the rayon thread pool outside the extract method 2024-11-14 10:40:32 +01:00
8e5b1a3ec1 Compute the field distribution and convert _geo into f64s 2024-11-13 17:44:05 +01:00
e627e182ce Fix facet strings 2024-11-13 17:43:02 +01:00
51b6293738 Add linear facet databases 2024-11-13 17:43:02 +01:00
b17896d899 Finalize the GeoExtractor 2024-11-13 17:43:02 +01:00
3b0cb5b487 Fix vector error messages 2024-11-12 23:26:16 +01:00
c4e9f761e9 Emit better error messages when parsing vectors 2024-11-12 22:49:22 +01:00
980921e078 Vector fixes 2024-11-12 16:31:22 +01:00
6094bb299a Fix user_provided vectors 2024-11-12 10:15:55 +01:00
1f5d801271 Fix crashes in facet search indexing 2024-11-07 17:22:30 +01:00