Commit Graph

8753 Commits

Author SHA1 Message Date
6f135457f8 update and fix the test 2023-11-28 15:55:48 +01:00
41f3a30b0b return a task view instead of a task 2023-11-28 15:08:13 +01:00
9090e0fe9d add a first working test with actixweb 2023-11-28 14:47:07 +01:00
b3098b9d9a start writing a test with actix but it doesn't works 2023-11-28 14:01:44 +01:00
9b1de777de Implement the webhook 2023-11-28 11:40:09 +01:00
00f0711207 add the option 2023-11-27 15:22:44 +01:00
5751f5c640 fix puffin in the index scheduler 2023-11-27 15:18:33 +01:00
3d23b388bc Merge #4231
4231: Fixed payload limit setting being ignored for delete documents by batch r=Kerollmops a=Karribalu


# Pull Request

## Related issue
Fixes #4224

## What does this PR do?
- Added http_payload_size_limit to JsonConfig to allow deleting documents in batches with a payload size greater than 2MB, which is the default limit set in the JsonConfig crate.

## PR checklist
Please check if your PR fulfills the following requirements:
- [Y] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [Y] Have you read the contributing guidelines?
- [Y] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: karribalu <karri.balu123456@gmail.com>
2023-11-27 09:26:21 +00:00
85626cff8e Fixed payload limit setting being ignored for delete documents by batch route 2023-11-25 18:41:16 +00:00
b366acdae6 Merge #4220
4220: Bring back changes from v1.5.0 into main r=dureuill a=Kerollmops

This will bring the fixes from v1.5.0 into main. By [following this guide](https://github.com/meilisearch/engine-team/blob/main/resources/meilisearch-release.md#after-the-release) I decided to create a temporary branch to fix the git conflicts and merge into main afterward.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis.dureuil@xinra.net>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-22 07:46:22 +00:00
7cb7e37ba8 Merge branch 'main' into tmp-release-v1.5.0 2023-11-21 16:30:46 +01:00
33b7c574ea Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.

## Why focus on reducing the writings in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of  `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️

## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-21 09:44:38 +00:00
d3575fb028 Make into_del_add_obkv parameters more human readable 2023-11-20 16:10:39 +01:00
39cbb499c2 Small fixes 2023-11-20 10:20:39 +01:00
ebef6bc24d Simplify documents database writing 2023-11-20 10:14:57 +01:00
d59b7db8d0 remove unused code 2023-11-20 10:10:45 +01:00
263e825619 Fix typos in comments 2023-11-20 10:06:29 +01:00
69354a6144 Add the benchmarck name to the bot message 2023-11-15 13:56:54 +01:00
b0adc73ce6 Merge pull request #4207 from meilisearch/diff-indexing-prefix-databases
Diff indexing prefix databases
prototype-diff-indexing-1
2023-11-14 16:04:05 +01:00
2b5d9042d1 Merge #4208
4208: Makes the dump cancellable r=Kerollmops a=irevoire

# Pull Request

Make the dump tasks cancellable even when they have already started processing.

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4157


Co-authored-by: Tamo <tamo@meilisearch.com>
2023-11-14 13:31:45 +00:00
5b57fbab08 makes the dump cancellable 2023-11-14 11:23:13 +01:00
72d3fa4898 Merge #4203
4203: Extract external document docids from docs on deletion by filter r=Kerollmops a=dureuill

This fixes some of the performance regression observed on `diff-indexing` when doing delete-by-filter with a filter matching many documents.

To delete 19 768 771 documents (hackernews dataset, all documents matching `type = comment`), here are the observed time:

|branch (commit sha1sum)|time|speed-down factor (lower is better)|
|--|--|--|
|`main` (48865470d7)|1212.885536s (~20min)|x1.0 (baseline)|
|`diff-indexing` (523519fdbf)|5385.550543s (90min)|x4.44|
|**`diff-indexing-extract-primary-key`**(f8289cd974)|2582.323324s (43min) | x2.13|

So we're still suffering a speed-down of x2.13, but that's much better than x4.44.

---

Changes:

- Refactor the logic of PrimaryKey extraction to a struct
- Add a trait to abstract the extraction of field id from a name between `DocumentBatch` and `FieldIdMap`.
- Add `Index::external_id_of` to get the external ids of a bitmap of internal ids.
- Use this new method to add new Transform and Batch methods to remove documents that are known to be from the DB.
- Modify delete-by-filter to use the new method

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-13 13:02:10 +00:00
772964125d Factor removal of document from DB 2023-11-13 13:51:22 +01:00
378deb0bef Rename trait 2023-11-13 13:38:36 +01:00
1f36410541 Update tests 2023-11-13 13:36:39 +01:00
b11f85a635 Merge #4205
4205: Prevent search hang on the processing index r=Kerollmops a=dureuill

Fixes #4206, an issue originally [reported on Discord](https://discord.com/channels/1006923006964154428/1148983671026618579/1148983671026618579) where having parallel search requests on more indexes than the index cache capacity would cause search requests on the currently updating index to hang until the index is done updating.

## Test setup

- Create 20 empty indexes by sending settings to them
- repeatedly send placeholder search requests to each of the indexes in a loop
- Create another index and send a significant batch of documents to index.
- Attempt to perform a search request on that last index.
  - Before this PR, the search request hangs while the index update task is processing
  - After this PR, the search request respond immediately even while the index update task is processing

## Changes

- When getting the handle to an index for some potentially long running batches of tasks, save it in the index scheduler.
- Drop the handle from the index-scheduler when the task is done so that we don't leak indexes.
- When getting an index from outside the task queue processor, check if there is such an handle matching the requested index. If so, skip the cache entirely and clone the handle.

Co-authored-by: Louis Dureuil <louis.dureuil@xinra.net>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
v1.5.0 v1.5.0-rc.3
2023-11-13 10:36:01 +00:00
a2d6dc8571 Fix typo, remove caching for the change of index 2023-11-13 10:44:36 +01:00
ee1701157f Merge #4204
4204: Throw error when the vector search is sent with the wrong size r=Kerollmops a=dureuill

# Pull Request

## Related issue
Fixes #4201 


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-13 09:43:20 +00:00
8c649d8061 Throw error when the vector search is sent with the wrong size 2023-11-13 09:57:42 +01:00
492fc086f0 cargo fmt 2023-11-12 21:53:11 +01:00
a2d0c73b41 Save the currently updating index so that the search can access it at all times 2023-11-10 10:52:03 +01:00
264b10ec20 Fixup documentation 2023-11-09 16:23:20 +01:00
825257da76 Use more efficient method for deletion in benchmarks 2023-11-09 16:13:15 +01:00
f8289cd974 Use it from delete-by-filter 2023-11-09 14:23:15 +01:00
3053e01c05 Batch::remove_documents_from_db_no_batch 2023-11-09 14:23:02 +01:00
b11c2afac0 Index::external_id_of 2023-11-09 14:22:43 +01:00
9cef800b2a Enrich uses the new type 2023-11-09 14:22:05 +01:00
db2fb86b8b Extract PrimaryKey logic to a type 2023-11-09 14:19:16 +01:00
882ab9cc85 remove warnings 2023-11-09 11:35:33 +01:00
5a9c96e1db Compute word integer prefix cache 2023-11-09 11:34:26 +01:00
70ce40828c Compute word docids prefix cache 2023-11-08 17:01:00 +01:00
688266c83e Remove word pair proximity prefix cache and compute it at search time 2023-11-08 14:16:01 +01:00
6dab826908 Reactivate prefix databases 2023-11-08 13:58:01 +01:00
1e2fbc6a42 revert "REVERT ME: ignore prefix pair databases tests"
This reverts commit 1b2ea6cf19.
2023-11-08 11:50:52 +01:00
523519fdbf Merge pull request #4195 from meilisearch/diff-indexing-remove-from-batch
Remove `IndexOperation::DocumentDeletion`
2023-11-08 10:29:49 +01:00
ef6fa10f7a Remove IndexOperation::DocumentDeletion 2023-11-06 12:16:15 +01:00
620fee35f9 Fix benches prototype-diff-indexing-0 2023-11-06 11:56:46 +01:00
cbaa54cafd Fix clippy issues 2023-11-06 11:19:31 +01:00
1bccf2079e Correctly mark non-tests as non-tests 2023-11-06 11:03:56 +01:00
1b2ea6cf19 REVERT ME: ignore prefix pair databases tests 2023-11-06 10:46:22 +01:00