Commit Graph

425 Commits

Author SHA1 Message Date
6f135457f8 update and fix the test 2023-11-28 15:55:48 +01:00
41f3a30b0b return a task view instead of a task 2023-11-28 15:08:13 +01:00
9090e0fe9d add a first working test with actixweb 2023-11-28 14:47:07 +01:00
9b1de777de Implement the webhook 2023-11-28 11:40:09 +01:00
5751f5c640 fix puffin in the index scheduler 2023-11-27 15:18:33 +01:00
7cb7e37ba8 Merge branch 'main' into tmp-release-v1.5.0 2023-11-21 16:30:46 +01:00
33b7c574ea Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.

## Why focus on reducing the writings in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of  `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️

## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-21 09:44:38 +00:00
5b57fbab08 makes the dump cancellable 2023-11-14 11:23:13 +01:00
a2d6dc8571 Fix typo, remove caching for the change of index 2023-11-13 10:44:36 +01:00
492fc086f0 cargo fmt 2023-11-12 21:53:11 +01:00
a2d0c73b41 Save the currently updating index so that the search can access it at all times 2023-11-10 10:52:03 +01:00
f8289cd974 Use it from delete-by-filter 2023-11-09 14:23:15 +01:00
ef6fa10f7a Remove IndexOperation::DocumentDeletion 2023-11-06 12:16:15 +01:00
cbaa54cafd Fix clippy issues 2023-11-06 11:19:31 +01:00
e507ef5932 Slow the logging down 2023-11-01 13:49:32 +01:00
13416ccbf7 Introduce a new meilitool to help the cloud team 2023-10-30 14:30:20 +01:00
dfab6293c9 Use an LMDB database to store the external documents ids 2023-10-30 11:41:23 +01:00
652ac3052d use new iterator in batch 2023-10-30 11:41:22 +01:00
c534a1b687 Stop using delete documents pipeline in batch runner 2023-10-30 11:41:22 +01:00
cf8dad1ca0 index_scheduler.features() is no longer fallible 2023-10-23 10:38:56 +02:00
dd619913da Use RwLock to never persist cli state to db 2023-10-19 12:45:57 -07:00
d8c649b3cd Return recoverable error if we fail to retrieve metrics state 2023-10-18 08:28:24 -07:00
12fc878640 Merge remote-tracking branch 'origin/main' into enable-metrics-http 2023-10-16 13:48:01 -07:00
689ec7c7ad Make the experimental route /metrics activable via HTTP 2023-10-13 22:12:54 +00:00
3655d4bdca Move the puffin file export logic into the run function 2023-10-13 13:11:30 +02:00
055ca3935b Update index-scheduler/src/batch.rs
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-10-13 13:11:30 +02:00
bf8fac6676 Fix the tests 2023-10-13 13:11:30 +02:00
f2a9e1ebbb Improve the debugging experience in the puffin reports 2023-10-13 13:11:30 +02:00
513e61e9a3 Remove the experimental CLI flag 2023-10-13 13:11:29 +02:00
90a626bf80 Use the runtime feature to enable puffin report exporting 2023-10-13 13:11:29 +02:00
0d4acf2daa Fix the metrics product URL 2023-10-13 13:11:29 +02:00
58db8d85ec Add the exportPuffinReports option to the runtime features route 2023-10-13 13:11:29 +02:00
656dadabea Expose an experimental flag to write the puffin reports to disk 2023-10-13 13:11:09 +02:00
34fac115d5 fix clippy 2023-09-11 17:15:57 +02:00
9258e5b5bf Fix the stats of the documents deletion by filter
The issue was that the operation « DocumentDeletionByFilter » was not
declared as an index operation. That means the indexes stats were not
reprocessed after the application of the operation.
2023-09-11 14:04:10 +02:00
e4e49e63d0 Merge #3993
3993: Bringing back changes from v1.3.1 to `main` r=irevoire a=curquiza



Co-authored-by: irevoire <irevoire@users.noreply.github.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-08-10 14:30:02 +00:00
fe819a9d80 fix the get stats method
It was not taking into account the processing tasks at all
2023-08-08 13:21:15 +02:00
b45c36cd71 Merge branch 'main' into tmp-release-v1.3.0 2023-08-01 15:05:17 +02:00
eef95de30e First iteration on exposing puffin profiling 2023-07-18 17:38:13 +02:00
22762808ab Fix the tests 2023-07-06 12:13:29 +02:00
86b834c9e4 Display the total number of tasks in the tasks route 2023-07-06 10:05:18 +02:00
aae099e330 Merge #3851
3851: Expose lastUpdate and isIndexing in /stats endpoint r=dureuill a=gentcys

# Pull Request

## Related issue
Fixes #3843

## What does this PR do?
- expose lastUpdate in `/stats` endpoint
- expose isIndex in `stats` endpoint
- add a method `is_task_processing` in index-scheduler/src/lib.rs.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Cong Chen <cong.chen@ocrlabs.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-07-03 13:41:04 +00:00
71500a4e15 Update tests 2023-07-03 11:20:43 +02:00
324d448236 Format let-else ❤️ 🎉 2023-07-03 10:20:28 +02:00
9859e65d2f fix tests 2023-07-01 09:32:50 +08:00
3bdf01bc1c Fix failed test 2023-06-30 17:39:23 +08:00
a5a31667b0 fix converse result of is_task_processing() 2023-06-30 11:28:18 +08:00
e3fc7112bc use RoaringBitmap::is_empty instead 2023-06-29 11:46:47 +08:00
816d7ed174 Update the Vector Store product feature link 2023-06-27 12:32:42 +02:00
13e9b4c2e5 Add dump support 2023-06-26 16:29:43 +02:00