Commit Graph

210 Commits

Author SHA1 Message Date
Clément Renault
6e1f4af833 wip: Create a tree from query but need to show synonyms 2020-01-07 18:24:13 +01:00
Clément Renault
856c5c4214 Fix group offset computing 2019-12-31 14:24:10 +01:00
Clément Renault
670e80c151 Use the cached postings lists in the query system 2019-12-31 13:32:36 +01:00
Clément Renault
eed07c724f Add more logging for postings lists fetching by word 2019-12-31 13:32:36 +01:00
Clément Renault
99d35fb940 Introduce a first version of a number of candidates reducer
It works by ignoring the postings lists associated with documents that the previous words did not return
2019-12-31 13:32:36 +01:00
Clément Renault
106b886873 Cache the prefix postings lists 2019-12-30 18:01:32 +01:00
Clément Renault
928876b553 Introduce the postings lists caching stores
Currently not used
2019-12-30 18:01:27 +01:00
Clément Renault
58836d89aa Rename the PrefixCache into PrefixDocumentsCache 2019-12-30 15:42:09 +01:00
Clément Renault
1a5a104f13 Display proximity evaluation number of calls 2019-12-30 15:42:09 +01:00
Clément Renault
064cfa4755 Add more debug, where are those 100ms 2019-12-30 15:42:08 +01:00
Clément Renault
ed6172aa94 Add a time measurement of the criterion loop 2019-12-30 15:42:08 +01:00
Clément Renault
8c140f6bcd Increase the disk usage limit 2019-12-30 15:42:08 +01:00
Clément Renault
1e1f0fcaf5 Introduce a basic cache system for first letters 2019-12-30 15:42:08 +01:00
Clément Renault
d21352a109 Change the time measurement of the FST 2019-12-30 15:42:08 +01:00
Clément Renault
4be11f961b Use an ugly trick to avoid cloning the FST 2019-12-30 15:42:07 +01:00
Clément Renault
1163f390b3 Restrict FST search to the first letter of the word 2019-12-30 15:42:07 +01:00
Clément Renault
691e2a3c1d Fix a blocking channel, appearing like a deadlock 2019-12-30 15:28:28 +01:00
Clément Renault
04bb49989f Add more debug timings 2019-12-20 14:18:48 +01:00
Clément Renault
d12ff15ee3 Set the indexes info in the create_index function 2019-12-19 10:38:56 +01:00
Clément Renault
40c0b14d1c Reintroduce searchable attributes and reordering 2019-12-13 14:38:25 +01:00
Clément Renault
a4dd033ccf Rename raw_matches into bare_matches 2019-12-13 14:38:25 +01:00
Clément Renault
48e8778881 Clean up the modules declarations 2019-12-13 14:38:25 +01:00
Clément Renault
4be23efe66 Remove the AttrCount type
Could probably be reintroduced later
2019-12-13 14:38:25 +01:00
Clément Renault
7d67750865 Reintroduce exactness for one word document field 2019-12-13 14:38:25 +01:00
Clément Renault
746e6e170c Make the test pass again 2019-12-13 14:38:24 +01:00
Clément Renault
d93e35cace Introduce ContextMut and Context structs 2019-12-13 14:38:24 +01:00
Clément Renault
d75339a271 Prefer summing the attribute 2019-12-13 14:38:24 +01:00
Clément Renault
86ee0cbd6e Introduce bucket_sort_with_distinct function 2019-12-13 14:38:24 +01:00
Clément Renault
248ccfc0d8 Update the criteria to the new ones 2019-12-13 14:38:24 +01:00
Clément Renault
ea148575cf Remove the raw_query functions 2019-12-13 14:38:23 +01:00
Clément Renault
efc2be0b7b Bump the sdset dependency to 0.3.6 2019-12-13 14:38:23 +01:00
Clément Renault
8d71112dcb Rewrite the phrase query postings lists
This simplified the multiword_rewrite_matches function a little bit.
2019-12-13 14:38:23 +01:00
Clément Renault
dd03a6256a Debug pre filtered number of documents 2019-12-13 14:38:23 +01:00
Clément Renault
9c03bb3428 First probably working phrase query doc filtering 2019-12-13 14:38:23 +01:00
Clément Renault
22b19c0d93 Fix the processed distance algorithm 2019-12-13 14:38:22 +01:00
Clément Renault
0f698d6bd9 Work in progress: Bad Typo detection
I have an issue where "speakers" is split into "speaker" and "s",
when I compute the distances for the Typo criterion,
it takes "s" into account and puts a distance of zero in the bucket 0
(the "speakers" bucket), therefore it reports any document matching "s"
without typos as best results.

I need to make sure to ignore "s" when its associated part "speaker"
doesn't even exist in the document and is not in the place
it should be ("speaker" followed by "s").

It is hard to accept that this will add much computation time to
the Typo criterion, as in the previous algorithm, where I computed
the real query/word indexes and removed the invalid ones
before sending the documents to the bucket sort.
2019-12-13 14:38:22 +01:00
Clément Renault
4e91b31b1f Make the Typo and Words work with synonyms 2019-12-13 14:38:22 +01:00
Clément Renault
f87c67fcad Improve the QueryEnhancer by doing a single lookup 2019-12-13 14:38:22 +01:00
Clément Renault
902625601a Work in progress: It seems like we support synonyms, split and concat words 2019-12-13 14:38:22 +01:00
Clément Renault
d17d4dc5ec Add more debug infos 2019-12-13 14:38:21 +01:00
Clément Renault
ef6a4db182 Before improving fields AttrCount
Removing the fields_count fetching halved the search time; we should look at lazily pulling them from the criteria that need them

ugly-test: Make the fields_count fetching lazy

Just before running the exactness criterion
2019-12-13 14:38:21 +01:00
Clément Renault
11f3d7782d Introduce the AttrCount type 2019-12-13 14:38:21 +01:00
Clément Renault
951f0bcb10 squash-me: Improve benchmarks naming 2019-12-13 14:17:40 +01:00
Clément Renault
d8ba405baf Add some criterion benchmarks to help measure improvements 2019-12-13 14:17:40 +01:00
Quentin de Quelen
3a4130f344 Allow to index files with null or boolean 2019-12-12 19:25:05 +01:00
Quentin de Quelen
88b3c05155 Stop words; Do not reindex all documents if there are no documents 2019-12-12 15:31:39 +01:00
Quentin de Quelen
a4f26e8e48 Rewrite the synonym endpoint 2019-12-12 12:47:02 +01:00
Clément Renault
dc1849d291 Bump heed to 0.6.1 2019-12-07 11:49:45 +01:00
Clément Renault
29fd54dcfa Allow users to send csv files from stdin in examples 2019-12-05 12:23:56 +01:00
Thomas Payet
51636402c2 Add debian package in CI 2019-12-04 18:02:30 +01:00