Commit Graph

315 Commits

Author SHA1 Message Date
130d2061bd Fix indexing of word_position_docid and fid 2023-04-06 17:50:39 +02:00
66ddee4390 Fix word_position_docids indexing 2023-04-06 17:50:39 +02:00
e58426109a Fix panics and issues in exactness graph ranking rule 2023-04-06 17:50:39 +02:00
996619b22a Increase position by 8 on hard separator when building query terms 2023-04-06 17:50:39 +02:00
d18ebe4f3a Remove more warnings 2023-03-23 09:41:18 +01:00
9b2653427d Split position DB into fid and relative position DB 2023-03-23 09:22:01 +01:00
bbecab8948 fix clippy 2023-02-21 10:18:44 +01:00
8aa808d51b Merge branch 'main' into enhance-language-detection 2023-02-20 18:14:34 +01:00
b08a49a16e Merge #3319 #3470
3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill

# Pull Request

## Related issue
Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760

## What does this PR do?

### User standpoint

- Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error.
- Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred.

### Implementation standpoint

- Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one
- `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field
- `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick.
- Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again.
- Add `IndexMapper::resize_index` to increase the size of an index.
- In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task.

## Testing the PR

The following diff can be applied to this branch to make testing the PR easier:

<details>


```diff
diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs
index 553ab45a..022b2f00 100644
--- a/index-scheduler/src/index_mapper.rs
+++ b/index-scheduler/src/index_mapper.rs
`@@` -228,13 +228,15 `@@` impl IndexMapper {
 
         drop(lock);
 
+        std:🧵:sleep_ms(2000);
+
         let current_size = index.map_size()?;
         let closing_event = index.prepare_for_closing();
-        log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         closing_event.wait();
 
-        log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         let index_path = self.base_path.join(uuid.to_string());
         let index = self.create_or_open_index(&index_path, None, 2 * current_size)?;
`@@` -268,8 +270,10 `@@` impl IndexMapper {
             match index {
                 Some(Available(index)) => break index,
                 Some(BeingResized(ref resize_operation)) => {
+                    log::error!("waiting for resize end");
                     // Deadlock: no lock taken while doing this operation.
                     resize_operation.wait();
+                    log::error!("trying our luck again!");
                     continue;
                 }
                 Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())),
diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs
index 11b17d05..242dc095 100644
--- a/index-scheduler/src/lib.rs
+++ b/index-scheduler/src/lib.rs
`@@` -908,6 +908,7 `@@` impl IndexScheduler {
     ///
     /// Returns the number of processed tasks.
     fn tick(&self) -> Result<TickOutcome> {
+        log::error!("ticking!");
         #[cfg(test)]
         {
             *self.run_loop_iteration.write().unwrap() += 1;
diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs
index 050c825a..63f312f6 100644
--- a/meilisearch/src/main.rs
+++ b/meilisearch/src/main.rs
`@@` -25,7 +25,7 `@@` fn setup(opt: &Opt) -> anyhow::Result<()> {
 
 #[actix_web::main]
 async fn main() -> anyhow::Result<()> {
-    let (opt, config_read_from) = Opt::try_build()?;
+    let (mut opt, config_read_from) = Opt::try_build()?;
 
     setup(&opt)?;
 
`@@` -56,6 +56,8 `@@` We generated a secure master key for you (you can safely copy this token):
         _ => (),
     }
 
+    opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap();
+
     let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?;
 
     #[cfg(all(not(debug_assertions), feature = "analytics"))]
```
</details>

Mainly, these debug changes do the following:

- Set the default index size to 1MiB so that index resizes are initially frequent
- Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos)
- Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests)

## Open questions

- Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB.

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


3470: Autobatch addition and deletion r=irevoire a=irevoire

This PR adds the capability to meilisearch to batch document addition and deletion together.

Fix https://github.com/meilisearch/meilisearch/issues/3440

--------------

Things to check before merging;

- [x] What happens if we delete multiple time the same documents -> add a test
- [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-02-20 15:00:19 +00:00
18796d6e6a Consider null as a valid geo object 2023-02-20 13:45:51 +01:00
895ab2906c apply review suggestions 2023-02-16 18:42:47 +01:00
74dcfe9676 Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform 2023-02-14 19:09:40 +01:00
1b1703a609 make a small optimization to merge obkvs a little bit faster 2023-02-14 18:32:41 +01:00
fb5e4957a6 fix and test the early exit in case a grenad ends with a deletion 2023-02-14 18:23:57 +01:00
8de3c9f737 Update milli/src/update/index_documents/transform.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-02-14 17:57:14 +01:00
43a19d0709 document the operation enum + the grenads 2023-02-14 17:55:26 +01:00
746b31c1ce makes clippy happy 2023-02-09 12:23:01 +01:00
93db755d57 add a test to ensure we handle correctly a deletion of multiple time the same document 2023-02-08 21:03:34 +01:00
93f130a400 fix all warnings 2023-02-08 20:57:35 +01:00
421a9cf05e provide a new method on the transform to remove documents 2023-02-08 16:06:09 +01:00
8f64fba1ce rewrite the current transform to handle a new byte specifying the kind of operation it's merging 2023-02-08 12:53:38 +01:00
fbec48f56e Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
7681be5367 Format code 2023-01-31 11:28:05 +01:00
50bc156257 Fix tests 2023-01-31 11:28:05 +01:00
d8207356f4 Skip script,language insertion if language is undetected 2023-01-31 11:28:05 +01:00
fd60a39f1c Format code 2023-01-31 11:28:05 +01:00
a27f329e3a Add tests for checking that detected script and language associated with document(s) were stored during indexing 2023-01-31 11:28:05 +01:00
d97fb6117e Extract and index data 2023-01-31 11:28:05 +01:00
cbf029f64c clippy: --fix 2023-01-31 11:12:59 +01:00
89675e5f15 clippy: Replace seek 0 by rewind 2023-01-31 09:32:40 +01:00
de3c4f1986 throw an error on unknown fields specified in the _geo field 2023-01-24 12:23:24 +01:00
f5ca421227 Superfluous test removed 2023-01-19 15:39:21 +01:00
a2cd7214f0 Fixes error message when lat/lng are unparseable 2023-01-19 10:10:26 +01:00
d1fc42b53a Use compatibility decomposition normalizer in facets 2023-01-18 15:02:13 +01:00
1b78231e18 Make clippy happy 2023-01-17 18:25:54 +01:00
6a10e85707 Merge #736
736: Update charabia r=curquiza a=ManyTheFish

Update Charabia to the last version.

> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-01-03 15:44:41 +00:00
4b166bea2b Add primary_key_inference test 2022-12-21 15:13:38 +01:00
5943100754 Fix existing tests 2022-12-21 15:13:38 +01:00
b24def3281 Add logging when inference took place.
Displays log message in the form:
```
[2022-12-21T09:19:42Z INFO  milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id'
```
2022-12-21 15:13:38 +01:00
402dcd6b2f Simplify primary key inference 2022-12-21 15:13:38 +01:00
13c95d25aa Remove uses of UserError::MissingPrimaryKey not related to inference 2022-12-21 15:13:36 +01:00
fc7618d49b Add DeletionStrategy 2022-12-19 09:49:58 +01:00
7f88c4ff2f Fix #1714 test 2022-12-15 18:22:28 +01:00
be3b00350c Apply review suggestions: naming and documentation 2022-12-13 10:15:22 +01:00
e3ee553dcc Remove soft deleted ids from ExternalDocumentIds during document import
If the document import replaces a document using hard deletion
2022-12-12 14:16:09 +01:00
303d740245 Prepare fix within facet range search
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
2022-12-07 14:38:10 +01:00
67d8cec209 Fix bug in handling of soft deleted documents when updating settings 2022-12-06 15:09:19 +01:00
f2cf981641 Add more tests and allow disabling of soft-deletion outside of tests
Also allow disabling soft-deletion in the IndexDocumentsConfig
2022-12-05 10:51:01 +01:00
8d0ace2d64 Avoid creating a MatchingWord for words that exceed the length limit 2022-11-28 10:20:13 +01:00
ac3baafbe8 Truncate facet values that are too long before indexing them 2022-11-17 11:29:42 +01:00