meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2025-10-23 03:56:28 +00:00

Author	SHA1	Message	Date
Loïc Lecrenier	072b576514	Fix proximity value in keys of prefix_word_pair_proximity_docids	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	6c3a5d69e1	Update snapshots	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	a7de4f5b85	Don't add swapped word pairs to the word_pair_proximity_docids db	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	264a04922d	Add prefix_word_pair_proximity database Similar to the word_prefix_pair_proximity one but instead the keys are: (proximity, prefix, word2)	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	1dbbd8694f	Rename StrStrU8Codec to U8StrStrCodec and reorder its fields	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	bdeb47305e	Change encoding of word_pair_proximity DB to (proximity, word1, word2) Same for word_prefix_pair_proximity	2022-10-18 10:37:34 +02:00
Many the fish	81919a35a2	Update milli/src/search/criteria/initial.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2022-10-17 18:23:20 +02:00
Many the fish	516e838eb4	Update milli/src/search/criteria/initial.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2022-10-17 18:23:15 +02:00
Clément Renault	fc03e53615	Add a test to check that we can abort an indexation	2022-10-17 17:28:03 +02:00
Kerollmops	6603437cb1	Introduce an indexation abortion function when indexing documents	2022-10-17 17:28:03 +02:00
ManyTheFish	6f55e7844c	Add some code comments	2022-10-17 14:41:57 +02:00
ManyTheFish	cf203b7fde	Take filter in account when computing the pages candidates	2022-10-17 14:13:44 +02:00
ManyTheFish	d71bc1e69f	Compute an exact count when using distinct	2022-10-17 14:13:44 +02:00
ManyTheFish	a396806343	Add settings to force milli to exhaustively compute the total number of hits	2022-10-17 14:13:44 +02:00
Ewan Higgs	beb987d3d1	Fixing piles of clippy errors. Most of these are calling clone when the struct supports Copy. Many are using & and &mut on `self` when the function they are called from already has an immutable or mutable borrow so this isn't needed. I tried to stay away from actual changes or places where I'd have to name fresh variables.	2022-10-13 22:02:54 +02:00
bors[bot]	f30979d021	Merge #662 662: Enhance word splitting strategy r=ManyTheFish a=akki1306 # Pull Request ## Related issue Fixes #648 ## What does this PR do? - [split_best_frequency](`55d889522b/milli/src/search/query_tree.rs (L282-L301)`) to use frequency of word pairs near together with proximity value of 1 instead of considering the frequency of individual words. Word pairs having max frequency are considered. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Akshay Kulkarni <akshayk.gj@gmail.com>	2022-10-13 08:14:22 +00:00
Akshay Kulkarni	85f3028317	remove underscore and introduce back word_documents_count	2022-10-13 13:21:59 +05:30
Akshay Kulkarni	8195fc6141	revert removal of word_documents_count method	2022-10-13 13:14:27 +05:30
Akshay Kulkarni	32f825d442	move default implementation of word_pair_frequency to TestContext	2022-10-13 12:57:50 +05:30
Akshay Kulkarni	ff8b2d4422	formatting	2022-10-13 12:44:08 +05:30
Akshay Kulkarni	6cb8b46900	use word_pair_frequency and remove word_documents_count	2022-10-13 12:43:11 +05:30
Akshay Kulkarni	8c9245149e	format file	2022-10-12 15:27:56 +05:30
Akshay Kulkarni	63e79a9039	update comment	2022-10-12 13:36:48 +05:30
Akshay Kulkarni	7f9680f0a0	Enhance word splitting strategy	2022-10-12 13:18:23 +05:30
Loïc Lecrenier	6fbf5dac68	Simplify documents! macro to reduce compile times	2022-10-12 09:22:05 +02:00
msvaljek	762e320c35	Add proximity calculation for the same word	2022-10-07 12:59:12 +02:00
vishalsodani	00c02d00f3	Add missing logging timer to extractors	2022-09-30 22:17:06 +05:30
bors[bot]	15d478cf4d	Merge #635 635: Use an unstable algorithm for `grenad::Sorter` when possible r=Kerollmops a=loiclec # Pull Request ## What does this PR do? Use an unstable algorithm to sort the internal vector used by `grenad::Sorter` whenever possible to speed up indexing. In practice, every time the merge function creates a `RoaringBitmap`, we use an unstable sort. For every other merge function, such as `keep_first`, `keep_last`, etc., a stable sort is used. Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-09-14 12:00:52 +00:00
Loïc Lecrenier	3794962330	Use an unstable algorithm for grenad::Sorter when possible	2022-09-13 14:49:53 +02:00
Kerollmops	d4d7c9d577	We avoid skipping errors in the indexing pipeline	2022-09-13 14:03:00 +02:00
Kerollmops	fe3973a51c	Make sure that long words are correctly skipped	2022-09-07 15:03:32 +02:00
Kerollmops	c83c3cd796	Add a test to make sure that long words are correctly skipped	2022-09-07 14:12:36 +02:00
ManyTheFish	bf750e45a1	Fix word removal issue	2022-09-01 12:10:47 +02:00
ManyTheFish	a38608fe59	Add test mixing phrased and no-phrased words	2022-09-01 12:02:10 +02:00
Clément Renault	7f92116b51	Accept again integers as document ids	2022-08-31 10:56:39 +02:00
Irevoire	f6024b3269	Remove the artifacts of the past	2022-08-23 16:10:38 +02:00
ManyTheFish	5391e3842c	replace optional_words by term_matching_strategy	2022-08-22 17:47:19 +02:00
ManyTheFish	993aa1321c	Fix query tree building	2022-08-18 17:56:06 +02:00
ManyTheFish	bff9653050	Fix remove count	2022-08-18 17:36:30 +02:00
ManyTheFish	9640976c79	Rename TermMatchingPolicies	2022-08-18 17:36:08 +02:00
bors[bot]	afc10acd19	Merge #596 596: Filter operators: NOT + IN[..] r=irevoire a=loiclec # Pull Request ## What does this PR do? Implements the changes described in https://github.com/meilisearch/meilisearch/issues/2580 It is based on top of #556 Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-08-18 11:24:32 +00:00
Loïc Lecrenier	9b6602cba2	Avoid cloning FilterCondition in filter array parsing	2022-08-18 13:06:57 +02:00
Loïc Lecrenier	c51dcad51b	Don't recompute filterable fields in evaluation of IN[] filter	2022-08-18 10:59:21 +02:00
Irevoire	4aae07d5f5	expose the size methods	2022-08-17 17:07:38 +02:00
Irevoire	e96b852107	bump heed	2022-08-17 17:05:50 +02:00
bors[bot]	087da5621a	Merge #587

587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations:

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist.

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant", "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```
Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not.

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results
We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%.
```
group                                                                   indexing_main_ce90fc62         indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                   ----------------------         ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-               1.00  1893.0±233.03µs ? ?/sec  1.01  1921.2±260.79µs ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-         1.05  9.4±3.51ms ? ?/sec       1.00  9.0±2.14ms ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-  1.22  18.3±11.42ms ? ?/sec     1.00  15.0±5.79ms ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-          1.00  41.4±4.20ms ? ?/sec      1.28  53.0±13.97ms ? ?/sec
indexing/-wiki-delete-searchable-                                       1.00  285.6±18.12ms ? ?/sec    1.03  293.1±16.09ms ? ?/sec
indexing/Indexing geo_point                                             1.03  60.8±0.45s ? ?/sec       1.00  58.8±0.68s ? ?/sec
indexing/Indexing movies in three batches                               1.14  16.5±0.30s ? ?/sec       1.00  14.5±0.24s ? ?/sec
indexing/Indexing movies with default settings                          1.11  13.7±0.07s ? ?/sec       1.00  12.3±0.28s ? ?/sec
indexing/Indexing nested movies with default settings                   1.10  10.6±0.11s ? ?/sec       1.00  9.6±0.15s ? ?/sec
indexing/Indexing nested movies without any facets                      1.11  9.4±0.15s ? ?/sec        1.00  8.5±0.10s ? ?/sec
indexing/Indexing songs in three batches with default settings          1.18  66.2±0.39s ? ?/sec       1.00  56.0±0.67s ? ?/sec
indexing/Indexing songs with default settings                           1.07  58.7±1.26s ? ?/sec       1.00  54.7±1.71s ? ?/sec
indexing/Indexing songs without any facets                              1.08  53.1±0.88s ? ?/sec       1.00  49.3±1.43s ? ?/sec
indexing/Indexing songs without faceted numbers                         1.08  57.7±1.33s ? ?/sec       1.00  53.3±0.98s ? ?/sec
indexing/Indexing wiki                                                  1.06  1051.1±21.46s ? ?/sec    1.00  989.6±24.55s ? ?/sec
indexing/Indexing wiki in three batches                                 1.20  1184.8±8.93s ? ?/sec     1.00  989.7±7.06s ? ?/sec
indexing/Reindexing geo_point                                           1.04  67.5±0.75s ? ?/sec       1.00  64.9±0.32s ? ?/sec
indexing/Reindexing movies with default settings                        1.12  13.9±0.17s ? ?/sec       1.00  12.4±0.13s ? ?/sec
indexing/Reindexing songs with default settings                         1.05  60.6±0.84s ? ?/sec       1.00  57.5±0.99s ? ?/sec
indexing/Reindexing wiki                                                1.07  1725.0±17.92s ? ?/sec    1.00  1611.4±9.90s ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-08-17 14:06:12 +00:00
bors[bot]	fb95e67a2a	Merge #608 608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish When we replaced or updated some documents, the indexing was skipping the replaced documents. Related to https://github.com/meilisearch/meilisearch/issues/2672 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-17 13:38:10 +00:00
bors[bot]	e4a52e6e45	Merge #594 594: Fix(Search): Fix phrase search candidates computation r=Kerollmops a=ManyTheFish This bug is an old bug but was hidden by the proximity criterion, Phrase searches were always returning an empty candidates list when the proximity criterion is deactivated. Before the fix, we were trying to find any words[n] near words[n] instead of finding any words[n] near words[n+1], for example: for a phrase search '"Hello world"' we were searching for "hello" near "hello" first, instead of "hello" near "world". Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-17 13:22:52 +00:00
ManyTheFish	8c3f1a9c39	Remove useless lifetime declaration	2022-08-17 15:20:43 +02:00
ManyTheFish	e9e2349ce6	Fix typo in comment	2022-08-17 15:09:48 +02:00
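
The prefix-lookup bug described in the message of merge #587 above can be reproduced in a few lines. This is a minimal sketch, not the code from milli: the function names `group_by_first_byte`, `find_group_buggy`, and `find_group_by_first_byte` are hypothetical, and the "fixed" lookup only shows one way to avoid the failing check (the PR itself says it switched to a prefix trie).

```rust
/// Group sorted prefixes by their first byte, e.g.
/// ["ant", "axo", "bor"] -> [["ant", "axo"], ["bor"]].
/// (Hypothetical helper, written for this illustration only.)
fn group_by_first_byte<'a>(prefixes: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for &prefix in prefixes {
        // Does the last group start with the same byte as this prefix?
        let same_group = groups
            .last()
            .map_or(false, |g| g[0].as_bytes().first() == prefix.as_bytes().first());
        if same_group {
            groups.last_mut().unwrap().push(prefix);
        } else {
            groups.push(vec![prefix]);
        }
    }
    groups
}

/// The check quoted in the PR description: it tests the word against the
/// *whole* first prefix of each group, so "axolotl" never matches "ant".
fn find_group_buggy<'a, 'g>(groups: &'g [Vec<&'a str>], word: &str) -> Option<&'g Vec<&'a str>> {
    groups.iter().find(|g| word.starts_with(g[0]))
}

/// One possible correction (an assumption, not the PR's actual fix):
/// select the group by comparing first bytes only.
fn find_group_by_first_byte<'a, 'g>(
    groups: &'g [Vec<&'a str>],
    word: &str,
) -> Option<&'g Vec<&'a str>> {
    groups
        .iter()
        .find(|g| g[0].as_bytes().first() == word.as_bytes().first())
}

fn main() {
    let groups = group_by_first_byte(&["ant", "axo", "bor"]);

    // The buggy lookup misses the group that contains "axo".
    assert!(find_group_buggy(&groups, "axolotl").is_none());

    // Selecting by first byte finds the group; filtering it yields the real prefixes.
    let group = find_group_by_first_byte(&groups, "axolotl").unwrap();
    let matching: Vec<&str> = group
        .iter()
        .copied()
        .filter(|p| "axolotl".starts_with(*p))
        .collect();
    assert_eq!(matching, vec!["axo"]);
}
```

Comparing bytes rather than chars keeps the sketch short; it is enough for the ASCII prefixes used in the PR's example.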

1062 Commits