Commit Graph

119 Commits

Author SHA1 Message Date
2c5c79d68e Update Tokenizer version to v0.2.1 2021-04-14 18:54:04 +02:00
dcb00b2e54 test a new implementation of the stop_words 2021-04-12 18:35:33 +02:00
da036dcc3e Revert "Integrate the stop_words in the querytree"
This reverts commit 12fb509d84.
We revert this commit because it's causing the bug #150.
The initial algorithm we implemented for the stop_words was:

1. remove the stop_words from the dataset
2. keep the stop_words in the query to see if we can generate new words by
   integrating typos or if the word was a prefix
=> This was causing the bug since, in the case of “The hobbit”, we were
   **always** looking for something starting with “t he” or “th e”
   instead of ignoring the word completely.

For now we are going to fix the bug by completely ignoring the
stop_words in the query.
This could cause another problem where someone mistypes a normal word and
ends up typing a stop_word.

For example, imagine someone searching for the song “Won't he do it”.
If that person misplaces one space and writes “Won' the do it”, then we
will lose a part of the request.

One fix would be to update our query tree to something like this:

---------------------
OR
  OR
    TOLERANT hobbit # the first option is to ignore the stop_word
    AND
      CONSECUTIVE   # the second option is to do as we are doing
        EXACT t     # currently
        EXACT he
      TOLERANT hobbit
---------------------

This would drastically increase the size of our query tree on requests
with a lot of stop_words. For example, think of “The Lord Of The Rings”.

For now, however, we decided to ignore this problem: we consider that
doing so doesn't reduce the relevancy of the search too much, while it
improves performance.
2021-04-12 18:35:33 +02:00
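The trade-off described in the revert message above can be sketched in a few lines of Rust. This is only an illustration, not the code from the commit: `strip_stop_words` and `max_alternatives` are hypothetical names, the query is split on whitespace rather than run through the real tokenizer, and the 2^n figure assumes that every stop word in the query could independently be either ignored or kept, which the commit message does not state.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: drop every stop word from a whitespace-split query.
/// This mirrors the "completely ignore the stop_words" fix; typos and
/// prefixes are deliberately not considered for stop words.
fn strip_stop_words<'a>(query: &'a str, stop_words: &BTreeSet<String>) -> Vec<&'a str> {
    query
        .split_whitespace()
        .filter(|word| !stop_words.contains(&word.to_lowercase()))
        .collect()
}

/// Hypothetical estimate of how many alternatives a query tree would hold
/// if each stop word could be either ignored or kept (assumption: 2^n
/// combinations for n stop words in the query).
fn max_alternatives(query: &str, stop_words: &BTreeSet<String>) -> u64 {
    let n = query
        .split_whitespace()
        .filter(|word| stop_words.contains(&word.to_lowercase()))
        .count() as u32;
    2u64.saturating_pow(n)
}
```

With the stop words "the" and "of", `strip_stop_words` turns “The hobbit” into the single word "hobbit", and turns the mistyped “Won' the do it” into "Won'", "do", "it", which is the lost part of the request the message warns about. Under the stated assumption, `max_alternatives` gives 2 for “The hobbit” and 8 for “The Lord Of The Rings”.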
84c1dda39d test(http): setting enum serialize/deserialize 2021-04-08 17:03:40 +03:00
dc636d190d refactor(http, update): introduce setting enum 2021-04-08 17:03:40 +03:00
0a4bde1f2f update the default ordering of the criterion 2021-04-01 19:45:31 +02:00
2658c5c545 feat(index): update fields distribution in clear & delete operations
fixes after review

bump the version of the tokenizer

implement a first version of the stop_words

The front must provide a BTreeSet containing the stop words
The stop_words are set to None if an empty Set is provided
add the stop-words in the http-ui interface

Use maplit in the test
and remove all the useless drop(rtxn) at the end of all tests

Integrate the stop_words in the querytree

remove the stop_words from the querytree except if it was a prefix or a typo

more fixes after review
2021-04-01 19:12:35 +03:00
27c7ab6e00 feat(index): store fields distribution in index 2021-04-01 18:35:19 +03:00
12fb509d84 Integrate the stop_words in the querytree
remove the stop_words from the querytree except if it was a prefix or a typo
2021-04-01 13:57:55 +02:00
a2f46029c7 implement a first version of the stop_words
The front must provide a BTreeSet containing the stop words
The stop_words are set to None if an empty Set is provided
add the stop-words in the http-ui interface

Use maplit in the test
and remove all the useless drop(rtxn) at the end of all tests
2021-04-01 13:57:55 +02:00
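The rule stated in the commit above (a BTreeSet comes in, and an empty set is stored as None) could look roughly like the following. The function name `normalize_stop_words` is hypothetical and is not taken from the repository; it only shows the empty-set-to-None mapping.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: map an empty set of stop words to `None`,
/// and keep a non-empty set as `Some(set)`.
fn normalize_stop_words(words: BTreeSet<String>) -> Option<BTreeSet<String>> {
    if words.is_empty() {
        None
    } else {
        Some(words)
    }
}
```

Storing None rather than an empty set lets the rest of the code treat "no stop words configured" as a single case.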
62a8f1d707 bump the version of the tokenizer 2021-04-01 13:49:22 +02:00
9205b640a4 feat(index): introduce fields_ids_distribution 2021-03-31 18:44:47 +03:00
2cb32edaa9 fix(criterion): compile asc/desc regex only once
use once_cell instead of lazy_static

reorder imports
2021-03-30 16:07:14 +03:00
1e3f05db8f use fixed number of candidates as a threshold 2021-03-30 11:57:10 +03:00
a776ec9718 fix division 2021-03-29 19:16:58 +03:00
522e79f2e0 feat(search, criteria): introduce a percentage threshold to the asc/desc 2021-03-29 19:08:31 +03:00
73dcdb27f6 select a specific release of the tokenizer instead of using the latest git commit 2021-03-25 15:00:18 +01:00
9c27183876 fix broken offset 2021-03-15 20:23:50 +01:00
f0210453a6 add updated at on put primary key 2021-03-15 14:05:48 +01:00
615fe095e1 update index updated at on index writes 2021-03-15 14:05:47 +01:00
80d0f9c49d methods to update index time metadata 2021-03-15 14:05:47 +01:00
d48008339e Introduce two new optional_words and authorize_typos Search options 2021-03-10 11:16:30 +01:00
54b97ed8e1 Update the fetcher comments 2021-03-10 10:56:26 +01:00
d301859bbd Introduce a special word_derivations function for Proximity 2021-03-10 10:42:53 +01:00
facfb4b615 Fix the bucket candidates 2021-03-10 10:42:53 +01:00
42fd7dea78 Remove the useless typo cache 2021-03-10 10:42:53 +01:00
62a70c300d Optimize words criterion 2021-03-10 10:42:53 +01:00
f51eb46c69 Use the RoaringBitmapLenCodec to retrieve the count of documents 2021-03-09 10:25:39 +01:00
d781a6164a Rewrite some code with idiomatic Rust 2021-03-08 16:27:52 +01:00
b18ec00a7a Add a logging_timer macro to the criterion next methods 2021-03-08 16:12:06 +01:00
82a0f678fb Introduce a cache on the docid_word_positions database method 2021-03-08 16:12:03 +01:00
5fcaedb880 Introduce a WordDerivationsCache struct 2021-03-08 16:00:53 +01:00
2606c92ef9 use plane sweep in proximity criterion 2021-03-08 15:58:39 +01:00
ae47bb3594 Introduce plane_sweep function in proximity criterion 2021-03-08 15:58:38 +01:00
636a9df177 Temporarily fix the tinytemplate doc hidden issue 2021-03-08 15:57:45 +01:00
3c76b3548d Rework the Asc/Desc criteria to be facet iterator based 2021-03-08 13:32:25 +01:00
a58d2b6137 Print the Asc/Desc criterion field name in the debug prints 2021-03-08 13:32:25 +01:00
e3095be85c Remove Debug use in Display impl 2021-03-08 12:09:09 +01:00
9e1eb25232 implement display for criterion
Update milli/src/criterion.rs

Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-03-08 11:00:30 +01:00
e5bb96bc3b Fix the searchable settings test 2021-03-06 12:48:41 +01:00
9b6b35d9b7 Clean up some comments 2021-03-03 18:19:10 +01:00
2cc4a467a6 Change the criterion output that cannot fail 2021-03-03 18:18:33 +01:00
1fc25148da Remove useless where clauses for the criteria 2021-03-03 18:09:19 +01:00
07784c8990 Tune the words prefixes threshold to compute for 1/1000 instead 2021-03-03 15:51:28 +01:00
f376c6a728 Make sure we retrieve the docid word positions 2021-03-03 15:45:03 +01:00
5c5e51095c Fix the Asc/Desc criteria to always return the QueryTree when available 2021-03-03 15:45:03 +01:00
cdaa96df63 optimize proximity criterion 2021-03-03 15:45:03 +01:00
246286f0eb take hard separator into account 2021-03-03 15:45:03 +01:00
6bf6b40495 Remove unused files 2021-03-03 15:45:03 +01:00
f118d7e067 build criteria from settings 2021-03-03 15:45:03 +01:00