Commit Graph

940 Commits

Author SHA1 Message Date
Kerollmops
11c7fef80a Implement a memory dumper
It moves the in memory HashMaps used when indexing to a disk based MTBL file
2020-07-07 16:48:49 +02:00
Kerollmops
b12bfcb03b Reduce the deepness of the word position document ids
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4 First basic version using MTBL again 2020-07-07 11:32:33 +02:00
Kerollmops
45d0d7c3d4 Clean up the README 2020-07-06 17:38:22 +02:00
Kerollmops
adb1038b26 Add a jobs parameter to set the number of threads the indexer uses 2020-07-06 12:17:17 +02:00
Kerollmops
2a3b03138b Use heed 0.8.1 with the RwIter append method 2020-07-05 19:50:28 +02:00
Kerollmops
ec1023e790 Intersect document ids by inverse popularity of the words
This reduces the worst request we had which took 56s to now took 3s ("the best of the do").
2020-07-05 19:33:51 +02:00
Kerollmops
cd7e64b2b3 Allow users to set the arc cache size when indexing 2020-07-04 18:12:41 +02:00
Kerollmops
ac8353a64f Merge pre-computed word attribute documents ids 2020-07-04 17:02:27 +02:00
Kerollmops
fea7cac206 Display the time it took to compute the word attribute documents ids 2020-07-04 15:18:38 +02:00
Kerollmops
46ced5c828 Introduce the RwIter append heed API 2020-07-04 12:34:10 +02:00
Kerollmops
7e7440c431 Finalize the LMDB indexing design 2020-07-01 22:45:43 +02:00
Kerollmops
2ae3f40971 Make the indexer ignore certain words
This is a preparation for making the indexing fully parallel by making the
indexer only be aware of certain words for each threads to avoid postings lists
conflicts for each words
2020-07-01 17:49:46 +02:00
Kerollmops
a3ac2623d5 Introduce multiple functions to clean up the code 2020-07-01 17:24:55 +02:00
Kerollmops
ac5cc7ddad Introduce an Iterator yielding owned entries for the LruCache 2020-07-01 17:21:52 +02:00
Kerollmops
014a25697d Use only one ARC cache based on the words 2020-07-01 12:03:18 +02:00
Kerollmops
fc4013a43f Fix the ARC cache 2020-07-01 10:35:07 +02:00
Kerollmops
2fcae719ad Use another LRU impl which uses hashbrown 2020-06-29 22:26:06 +02:00
Kerollmops
f98b615bf3 Replace the LRU by an Arc cache 2020-06-29 20:48:57 +02:00
Kerollmops
07abebfc46 Introduce a (too big) LRU cache 2020-06-29 18:15:03 +02:00
Kerollmops
5f0088594b Index by writing directly into LMDB 2020-06-29 13:54:47 +02:00
Kerollmops
8453828a65 Update the README 2020-06-28 12:40:08 +02:00
Kerollmops
63cbeca64e Skip all derived words when too short 2020-06-28 12:13:12 +02:00
Kerollmops
736f0f7560 Use the proximity instead of the attributes when searching for <= 7 proximities 2020-06-28 12:13:12 +02:00
Kerollmops
fe3be8f18a Replace the HashMap by a Vec for attributes documents ids 2020-06-28 12:13:12 +02:00
Kerollmops
6a2834f2b0 Add a jobs parameter to set the number of threads the indexer uses 2020-06-28 12:13:10 +02:00
Kerollmops
7e16afbdce Ignore documents which are not part of the candidates when exploring with A* 2020-06-24 15:06:45 +02:00
Kerollmops
1c7a9a4132 Remove the found documents from the candidates list 2020-06-24 15:00:26 +02:00
Kerollmops
50169b9798 Compute the full list of ids we are willing to find by attribute 2020-06-24 14:48:04 +02:00
Kerollmops
374ec6773f Introduce a database to store all docids for a word and attribute 2020-06-22 19:24:20 +02:00
Kerollmops
a044cb6cc8 Clean up the warnings for prefix postings 2020-06-22 18:10:31 +02:00
Kerollmops
ba3e805981 Document the Index types and the internal LMDB databases 2020-06-22 18:09:22 +02:00
Kerollmops
2f0e1afd16 Introduce the roaring bitmap heed codec 2020-06-22 17:56:07 +02:00
Kerollmops
8148210860 Use the cache when retrieving the documents at the end 2020-06-21 12:25:19 +02:00
Kerollmops
1628a31efa Cache the unions of the derived words positions 2020-06-20 15:38:10 +02:00
Kerollmops
115e0142d9 Add a feature flags to enable the export of stats 2020-06-20 13:25:42 +02:00
Kerollmops
beb49b24f6 Skip looking at connections for proximity 0 2020-06-20 13:19:03 +02:00
Kerollmops
c84012d655 Accept queries from standard input when not given as argument 2020-06-20 12:01:15 +02:00
Kerollmops
d6705d5529 Introduce the criterion dependency to bench the engine 2020-06-19 18:32:25 +02:00
Kerollmops
55a8941922 Optimize things 2020-06-19 17:48:17 +02:00
Kerollmops
a3ca80d20d Ignore every proximities bigger or equal to 8 2020-06-18 15:42:46 +02:00
Kerollmops
3577de04b8 Reduce the number of KV lookups to the sucessfulls only 2020-06-16 12:58:29 +02:00
Kerollmops
e974e6b3c9 Acquire search intersections metrics 2020-06-16 12:10:23 +02:00
Kerollmops
8db16ff306 Add a cache to the contains_documents success function 2020-06-14 13:39:39 +02:00
Kerollmops
a8cda248b4 Introduce a customized A* algorithm.
This custom algo lazily compute the intersections between words, to avoid too much set operations and database reads
2020-06-14 12:51:57 +02:00
Kerollmops
69285b22d3 Check that an edges combination contains results 2020-06-13 11:16:02 +02:00
Kerollmops
b9cc6c10af Introduce a function to ignore useless paths 2020-06-13 00:17:43 +02:00
Kerollmops
d02c5cb023 Fix node skipping by computing the accumulated proximity 2020-06-12 14:08:46 +02:00
Kerollmops
37a48489da Reworked the best proximity algo a little bit 2020-06-12 12:53:08 +02:00
Kerollmops
302866ad73 Make the algo don't work with an astar 2020-06-11 17:43:06 +02:00