Commit Graph

186 Commits

Author SHA1 Message Date
efc156a4a4 Executing Lua works correctly 2024-07-10 16:27:36 +02:00
ba85959642 Support filtering the documents to edit with lua 2024-07-10 16:23:21 +02:00
1702b5cf44 Prepare for processing documents edition 2024-07-10 16:23:21 +02:00
3bc8f81abc user_provided => regenerate 2024-06-12 18:12:20 +02:00
d85ab23b82 rename all occurences of user_defined to user_provided for consistency 2024-06-06 11:39:29 +02:00
cc5dca8321 fix two bug and add a dump test 2024-06-06 11:39:29 +02:00
dc949ab46a Remove puffin usage 2024-05-27 15:59:14 +02:00
8a941c0241 Smaller review changes 2024-05-22 14:44:42 +02:00
9969f7a638 Add test on index-scheduler 2024-05-20 14:44:10 +02:00
02714ef5ed Add vectors from vector DB in dump 2024-05-20 10:36:18 +02:00
897d25780e update milli to latest version 2024-05-16 18:31:32 +02:00
ab43a8a949 chore: fix some typos in comments
Signed-off-by: writegr <wellweek@outlook.com>
2024-04-18 14:12:52 +08:00
f82d056072 Hide secrets in settings and task queue 2024-03-26 10:36:24 +01:00
066a7a3cde takes only one read transaction per thread 2024-02-26 10:43:04 +01:00
02e6c8a440 Add tracing to index-scheduler 2024-02-08 15:03:31 +01:00
1ccde9bf0b Merge #4316
4316: Autobatch the task deletions r=curquiza a=irevoire

# Pull Request

## Related issue
Fix part of https://github.com/meilisearch/meilisearch-support/issues/69
Fix #4315 

## What does this PR do?
- Autobatch the task deletions

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-01-15 17:54:50 +00:00
b4d7d80ad9 autobatch the task deletions 2024-01-11 14:58:07 +01:00
97bb1ff9e2 Move currently_updating_index to IndexMapper 2024-01-09 15:37:27 +01:00
ee54d3171e Check experimental feature at query time 2023-12-21 15:26:12 +01:00
9e1b458010 Merge branch 'main' into change-proximity-precision-settings 2023-12-18 09:08:47 +01:00
13c2c6c16b Small commit to add hybrid search and autoembedding 2023-12-14 16:07:48 +01:00
35e1981488 Remove proximityPrecision form the experimental feature 2023-12-14 15:52:42 +01:00
7e259cb0d2 Expose the --max-number-of-batched-tasks argument 2023-12-11 16:08:39 +01:00
1f4fc9c229 Make the feature experimental 2023-12-06 15:49:05 +01:00
ec9b52d608 Rename copy_to_path to copy_to_file 2023-11-28 14:32:30 +01:00
0dbf1a16ff Make clippy happy 2023-11-23 14:11:38 +01:00
0d4482625a Make the changes to use heed v0.20-alpha.6 2023-11-23 11:43:58 +01:00
7cb7e37ba8 Merge branch 'main' into tmp-release-v1.5.0 2023-11-21 16:30:46 +01:00
33b7c574ea Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.

## Why focus on reducing the writings in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of  `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️

## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-21 09:44:38 +00:00
5b57fbab08 makes the dump cancellable 2023-11-14 11:23:13 +01:00
a2d6dc8571 Fix typo, remove caching for the change of index 2023-11-13 10:44:36 +01:00
492fc086f0 cargo fmt 2023-11-12 21:53:11 +01:00
a2d0c73b41 Save the currently updating index so that the search can access it at all times 2023-11-10 10:52:03 +01:00
f8289cd974 Use it from delete-by-filter 2023-11-09 14:23:15 +01:00
ef6fa10f7a Remove IndexOperation::DocumentDeletion 2023-11-06 12:16:15 +01:00
cbaa54cafd Fix clippy issues 2023-11-06 11:19:31 +01:00
e507ef5932 Slow the logging down 2023-11-01 13:49:32 +01:00
dfab6293c9 Use an LMDB database to store the external documents ids 2023-10-30 11:41:23 +01:00
652ac3052d use new iterator in batch 2023-10-30 11:41:22 +01:00
c534a1b687 Stop using delete documents pipeline in batch runner 2023-10-30 11:41:22 +01:00
cf8dad1ca0 index_scheduler.features() is no longer fallible 2023-10-23 10:38:56 +02:00
055ca3935b Update index-scheduler/src/batch.rs
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-10-13 13:11:30 +02:00
f2a9e1ebbb Improve the debugging experience in the puffin reports 2023-10-13 13:11:30 +02:00
34fac115d5 fix clippy 2023-09-11 17:15:57 +02:00
9258e5b5bf Fix the stats of the documents deletion by filter
The issue was that the operation « DocumentDeletionByFilter » was not
declared as an index operation. That means the indexes stats were not
reprocessed after the application of the operation.
2023-09-11 14:04:10 +02:00
eef95de30e First iteration on exposing puffin profiling 2023-07-18 17:38:13 +02:00
13e9b4c2e5 Add dump support 2023-06-26 16:29:43 +02:00
530a3e2df3 fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
96da5130a4 fix the error code in case of not filterable attributes on the get / delete documents by filter routes 2023-05-16 13:56:18 +02:00
6df2ba93a9 remove one useless txn 2023-05-03 17:41:49 +02:00