Commit Graph

8571 Commits

Author SHA1 Message Date
Loïc Lecrenier
83ab8cf4e5 Remove dbg!(..) expression in highlighter tests 2023-05-08 09:45:23 +02:00
Jakub Jirutka
8095f21999 Move comments above keys in config.toml
The current style is very unusual, confusing and breaks compatibility
with tools for parsing config files including comments. Everyone writes
comments above the items to which they refer (maybe except pythonists),
so let's stick to that.
2023-05-06 18:10:54 +02:00
ManyTheFish
cd2573fcc3 Fix prefix highlighting 2023-05-04 16:53:50 +02:00
meili-bors[bot]
9f7981df28 Merge #3687
3687: Allow to disable specialized tokenizations (again) r=Kerollmops a=jirutka

In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow meilisearch to be built without huge specialized tokenizations that took up 90% of the meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.


Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>
2023-05-04 14:48:01 +00:00
Jakub Jirutka
e615fa5ec6 Fix unused_imports warning in milli when japanese is not enabled 2023-05-04 15:46:11 +02:00
Jakub Jirutka
13f1277637 Allow to disable specialized tokenizations (again)
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai`
feature flags to allow meilisearch to be built without huge specialized
tokenizations that took up 90% of the meilisearch binary size.
Unfortunately, due to some recent changes, this doesn't work anymore.
The problem lies in excessive use of the `default` feature flag, which
infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier
and more future-proof to not declare `default` in `milli` and
`meilisearch-types`. I've renamed it to `all-tokenizers`, which also
makes it a bit clearer what it's about.
2023-05-04 15:45:40 +02:00
meili-bors[bot]
4919774f2e Merge #3570
3570: Get documents by filter r=irevoire a=dureuill

# Pull Request

## Related issue

Associated spec: https://github.com/meilisearch/specifications/pull/234

None really, this is more of an extension of #3477: since after this issue we'll be able to delete documents by filter, it makes sense to also be able to get documents by filter. 

## What does this PR do?

### User standpoint

- Add a new `filter` URL parameter to `GET /indexes/{:indexUid}/documents` and a new `POST /indexes/{:indexUid}/documents/fetch` route that accepts the same `offset`, `limit`, `fields` and `filter` parameters.
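
As a rough illustration of the two entry points described above, the following Python sketch builds (but does not send) both requests with the standard library only. The helper names, the default `offset`/`limit` values, and the assumption that `fields` travels as a JSON array in the POST body are illustrative and are not taken from this PR.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_get_request(base_url, index_uid, filter_expr, offset=0, limit=20):
    """GET variant: the filter is passed as a URL query parameter."""
    query = urlencode({"filter": filter_expr, "offset": offset, "limit": limit})
    return Request(f"{base_url}/indexes/{index_uid}/documents?{query}", method="GET")


def build_fetch_request(base_url, index_uid, filter_expr, offset=0, limit=20, fields=None):
    """POST variant: the same parameters are sent in a JSON body."""
    body = {"filter": filter_expr, "offset": offset, "limit": limit}
    if fields is not None:
        # Assumption: `fields` is a list of attribute names.
        body["fields"] = fields
    return Request(
        f"{base_url}/indexes/{index_uid}/documents/fetch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing either object to `urllib.request.urlopen` would send it to a running instance. The POST form avoids URL-encoding the filter expression, which may matter for long or deeply nested filters.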

### Implementation standpoint

-  Add a new `Index::iter_documents` method to iterate on a set of documents rather than return a vector of these documents.
- Rewrite the other `Index::*documents` methods to use the new `Index::iter_documents` method.

## Usage

<details>
<summary>
Sample request and response
</summary>

```
curl -X POST 'http://localhost:7700/indexes/index-1101/documents/fetch' -H 'Content-Type: application/json' --data-binary '{ "filter": "genres = Comedy", "limit": 3, "offset": 8000}' | jsonxf
```

```json
{
  "results": [
    {
      "id": 326126,
      "title": "Bad Exorcists",
      "overview": "A trio of awkward teens intend to win a horror festival by making their own movie, but wind up getting their actress possessed in the process.",
      "genres": [
        "Horror",
        "Comedy"
      ],
      "poster": "https://image.tmdb.org/t/p/w500/lwd65kPbjFacAw3QSXiwSsW6cFU.jpg",
      "release_date": 1425081600
    },
    {
      "id": 326215,
      "title": "Ooops! Noah is Gone...",
      "overview": "It's the end of the world. A flood is coming. Luckily for Dave and his son Finny, a couple of clumsy Nestrians, an Ark has been built to save all animals. But as it turns out, Nestrians aren't allowed. Sneaking on board with the involuntary help of Hazel and her daughter Leah, two Grymps, they think they're safe. Until the curious kids fall off the Ark. Now Finny and Leah struggle to survive the flood and hungry predators and attempt to reach the top of a mountain, while Dave and Hazel must put aside their differences, turn the Ark around and save their kids. It's definitely not going to be smooth sailing.",
      "genres": [
        "Animation",
        "Adventure",
        "Comedy",
        "Family"
      ],
      "poster": "https://image.tmdb.org/t/p/w500/gEJXHgpiKh89Vwjc4XUY5CIgUdB.jpg",
      "release_date": 1427328000
    },
    {
      "id": 326241,
      "title": "For Here or to Go?",
      "overview": "An aspiring Indian tech entrepreneur in the Silicon Valley finds himself unexpectedly battling the bizarre American immigration system to keep his dream alive or prepare to return home forever.",
      "genres": [
        "Drama",
        "Comedy"
      ],
      "poster": "https://image.tmdb.org/t/p/w500/ff8WaA7ItBgl36kdT232i0d0Fnq.jpg",
      "release_date": 1490918400
    }
  ],
  "offset": 8000,
  "limit": 3,
  "total": 9331
}
```

<img width="1348" alt="Capture d’écran 2023-03-08 à 10 09 04" src="https://user-images.githubusercontent.com/41078892/223670905-6932b79b-f9b8-4a41-b59e-be2171705b7d.png">



</details>

# Draft status

- [ ] Route naming: having one route be `GET /indexes/{:indexUid}/documents` and the other `POST /indexes/{:indexUid}/documents/fetch` is suboptimal (also, technically a breaking change for documents with `fetch` as uid?), but `POST /indexes/{:indexUid}/documents` is already used to insert documents.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-05-04 12:54:26 +00:00
Tamo
a3da680ce6 Update meilisearch/tests/documents/errors.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-04 14:51:17 +02:00
Tamo
11e394dba1 merge the document fetch and get error codes 2023-05-04 15:39:49 +02:00
Tamo
469d2f2a9c fix the fields field of the POST fetch document API 2023-05-04 15:34:09 +02:00
Tamo
ce6507d20c improve the test of the get document by filter 2023-05-04 15:34:09 +02:00
Tamo
b92da5d15a add a big test on the get document by filter of the get route 2023-05-04 15:34:09 +02:00
Tamo
ed3dfbe729 add error codes and tests 2023-05-04 15:34:08 +02:00
Louis Dureuil
441641397b Implement document get with filters 2023-05-04 15:32:34 +02:00
Louis Dureuil
a35d3fc708 Add Index::iter_documents 2023-05-04 15:31:54 +02:00
Louis Dureuil
745c1a2668 Make parse_filter pub 2023-05-04 15:31:53 +02:00
meili-bors[bot]
a95128df6b Merge #3550
3550: Delete documents by filter r=irevoire a=dureuill

# Prototype `prototype-delete-by-filter-0`

Usage:
A new route is available under `POST /indexes/{index_uid}/documents/delete` that allows you to delete your documents by filter.
The expected payload looks like this:
```json
{
  "filter": "doggo = bernese"
}
```
```

It'll then enqueue a task in your task queue that'll delete all the documents matching this filter once it's processed.
Here is an example of the associated details:
```json
  "details": {
    "deletedDocuments": 53,
    "originalFilter": "\"doggo = bernese\""
  }
```
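
In the details shown above, `originalFilter` looks like the filter serialized to JSON and then stored as a string, hence the escaped quotes. Assuming that reading is right (it is inferred from the example only, and the format may have changed in later commits), a client could build the payload and recover the filter roughly as follows. Both helper names are hypothetical.

```python
import json


def delete_by_filter_payload(filter_expr):
    """Serialize the request body; json.dumps never emits a trailing comma."""
    return json.dumps({"filter": filter_expr})


def decode_original_filter(details):
    """Assumption: `originalFilter` holds JSON text, so decode it a second time."""
    return json.loads(details["originalFilter"])


# The task details from the example, as they would arrive over the wire.
raw_details = r'{"deletedDocuments": 53, "originalFilter": "\"doggo = bernese\""}'
details = json.loads(raw_details)
original = decode_original_filter(details)
```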

----------


# Pull Request

## Related issue
Related to https://github.com/meilisearch/meilisearch/issues/3477

## What does this PR do?

### User standpoint

- Modifies the `/indexes/{:indexUid}/documents/delete-batch` route to accept either the existing array of documents ids, or a JSON object with a `filter` field representing a filter to apply. If that latter variant is used, any document matching the filter will be deleted.

### Implementation standpoint

- (processing time version) Adds a new BatchKind that is not autobatchable and that performs the delete by filter
- Reuse the `documentDeletion` task with a new `originalFilter` detail that replaces the `providedIds` detail.

## Example

<details>
<summary>Sample request, response and task result</summary>

Request:

```
curl \
  -X POST 'http://localhost:7700/indexes/index-10/documents/delete-batch' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "filter" : "mass = 600"}'
```

Response:

```json
{
  "taskUid": 3902,
  "indexUid": "index-10",
  "status": "enqueued",
  "type": "documentDeletion",
  "enqueuedAt": "2023-02-28T20:50:31.667502Z"
}
```

Task log:

```json
    {
      "uid": 3906,
      "indexUid": "index-12",
      "status": "succeeded",
      "type": "documentDeletion",
      "canceledBy": null,
      "details": {
        "deletedDocuments": 3,
        "originalFilter": "\"mass = 600\""
      },
      "error": null,
      "duration": "PT0.001819S",
      "enqueuedAt": "2023-03-07T08:57:20.11387Z",
      "startedAt": "2023-03-07T08:57:20.115895Z",
      "finishedAt": "2023-03-07T08:57:20.117714Z"
    }
```

</details>

## Draft status

- [ ] Error handling
- [ ] Analytics
- [ ] Do we want to reuse the `delete-batch` route in this way, or create a new route instead?
- [ ] Should the filter be applied at request time or when the deletion task is processed? 
  - The first commit in this PR applies the filter at request time, meaning that even if a document is modified in a way that no longer matches the filter in a later update, it will be deleted as long as the deletion task is processed after that update. 
  - The other commits in this PR apply the filter only when the asynchronous deletion task is processed, meaning that documents that match the filter at processing time are deleted even if they didn't match the filter at request time.
- [ ] If keeping the filter at request time, find a more elegant way to recover the user document ids from the internal document ids. The current way implemented in the first commit of this PR involves getting all the documents matching the filter, looking for the value of their primary key, and turning it into a string by copy-pasting routines found in milli...
- [ ] Security consideration, if any
- [ ] Fix the tests (but waiting until product questions are resolved)
- [ ] Add delete by filter specific tests
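
The difference between the two filter-evaluation strategies discussed in the checklist can be sketched with a toy in-memory store. This is only an illustration of the semantics: the function names and the dictionary-based store are hypothetical and unrelated to the actual index scheduler code.

```python
def matching_ids(store, field, value):
    """Ids of the documents whose `field` currently equals `value`."""
    return [doc_id for doc_id, doc in store.items() if doc.get(field) == value]


def enqueue_request_time(store, field, value):
    """Resolve the filter immediately; the task only deletes the recorded ids."""
    ids = matching_ids(store, field, value)

    def task():
        deleted = [doc_id for doc_id in ids if doc_id in store]
        for doc_id in deleted:
            del store[doc_id]
        return len(deleted)

    return task


def enqueue_processing_time(store, field, value):
    """Keep the filter itself; resolve it only when the task runs."""

    def task():
        ids = matching_ids(store, field, value)
        for doc_id in ids:
            del store[doc_id]
        return len(ids)

    return task


def demo(enqueue):
    store = {1: {"mass": 600}, 2: {"mass": 700}}
    task = enqueue(store, "mass", 600)
    # Updates processed between the request and the deletion task:
    store[1]["mass"] = 100  # no longer matches the filter
    store[2]["mass"] = 600  # now matches the filter
    task()
    return sorted(store)  # ids of the surviving documents
```

With request-time evaluation, document 1 is deleted even though it no longer matches, and document 2 survives. With processing-time evaluation the outcome is reversed.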



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-05-04 10:44:41 +00:00
meili-bors[bot]
e0537c3870 Merge #3720
3720: Change links of docs everywhere r=curquiza a=curquiza

Completely fixes #3668 

Co-authored-by: curquiza <clementine@meilisearch.com>
2023-05-04 10:07:41 +00:00
meili-bors[bot]
da220294f6 Merge #3639
3639: Add a dedicated error variant for planned failures in index scheduler tests r=Kerollmops a=Sufflope

# Pull Request

## Related issue
Fixes #3086

## What does this PR do?
- Add a dedicated test variant in test cfg to avoid reusing a misleading existing error

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Jean-Sébastien Bour <jean-sebastien@bour.name>
2023-05-04 09:33:57 +00:00
meili-bors[bot]
78e611f282 Merge #3693
3693: Implement the auto deletion of tasks r=dureuill a=irevoire

Fixes https://github.com/meilisearch/meilisearch/issues/3622

This PR should be the definite fix for #3622.

It adds a limit (1M) to the maximum number of tasks the task queue can hold.
Once the task queue reaches this limit (1M tasks in the queue, whatever their status), meilisearch will schedule a task deletion that tries to delete the oldest 100k tasks.
If meilisearch can't delete 100k tasks because some of them are not yet finished, it will delete as many tasks as possible.

Once the limit is reached, you're still able to register new tasks. The engine will only stop you from adding new tasks once [the other hard limit](https://github.com/meilisearch/meilisearch/pull/3659) of 10GiB of tasks is reached (that's between 5M and 15M of tasks depending on your workflow).

-------

Technically:
- We only try to schedule our task deletion when calling the tick function but before creating a new batch. This means we never enqueue a task we're not going to process ~right away.
- If our task deletion doesn't delete anything, we don't enqueue it and instead log a warning telling the user that the engine is not working properly.
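
The policy described above can be approximated in a few lines of Python. This is a simplified sketch, not the index scheduler implementation: the `Task` type, the function name, and the reading that "as many as possible" means "the finished tasks among the oldest 100k" are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_TASKS = 1_000_000   # soft limit on the number of tasks in the queue
DELETE_BATCH = 100_000  # how many of the oldest tasks a cleanup considers
FINISHED = {"succeeded", "failed", "canceled"}


@dataclass
class Task:
    uid: int
    status: str


def auto_delete(tasks, max_tasks=MAX_TASKS, batch=DELETE_BATCH):
    """Return (remaining_tasks, deleted_count) after one cleanup attempt."""
    if len(tasks) < max_tasks:
        return tasks, 0  # limit not reached, nothing to schedule
    # Assumption: a lower uid means an older task.
    oldest = sorted(tasks, key=lambda t: t.uid)[:batch]
    # Unfinished tasks cannot be deleted, so they are skipped.
    to_delete = {t.uid for t in oldest if t.status in FINISHED}
    remaining = [t for t in tasks if t.uid not in to_delete]
    return remaining, len(to_delete)
```

A returned count of zero while the queue is full corresponds to the case where the deletion is not enqueued and a warning is logged.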

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-04 08:30:22 +00:00
Louis Dureuil
d8381eb790 Fix originalFilter 2023-05-04 10:07:59 +02:00
Louis Dureuil
b212aef5db add one nanosecond to generated filter so as to generate a filter that would have matched the last task to delete 2023-05-04 09:56:48 +02:00
meili-bors[bot]
6bf66f35be Merge #3721
3721: Use new bors URL of our self hosted bors instance r=curquiza a=curquiza



Co-authored-by: curquiza <clementine@meilisearch.com>
2023-05-04 07:53:39 +00:00
Louis Dureuil
52ab114f6c Fix test on macOS: 50 tasks would result in the test consistently failing on a local macOS 2023-05-04 00:06:49 +02:00
Tamo
dcbfecf42c make the generated filter valid 2023-05-04 00:06:49 +02:00
Tamo
9ca6f59546 Update index-scheduler/src/lib.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-04 00:06:49 +02:00
Tamo
aa7537a11e make the autodeletion work with a fixed number of tasks and update the tests 2023-05-04 00:06:49 +02:00
Tamo
972bb2831c log when meilisearch needs to delete tasks 2023-05-04 00:06:49 +02:00
Tamo
f9ddd32545 implement the auto-deletion of tasks 2023-05-04 00:06:49 +02:00
Louis Dureuil
d5059520aa Fix typo 2023-05-03 22:27:03 +02:00
Louis Dureuil
1c3642c9b2 Fix deletion per filter analytics 2023-05-03 22:26:51 +02:00
Tamo
d2d2bacaf2 add a test on the complex filter 2023-05-03 20:07:08 +02:00
curquiza
30edba3497 Update links of the docs 2023-05-03 19:14:57 +02:00
Louis Dureuil
84e7bd9342 Fix test after rebase on filter additions 2023-05-03 17:51:28 +02:00
Louis Dureuil
2b74e4d116 Fix test 2023-05-03 17:41:50 +02:00
Tamo
b5fe0b2b07 fix the details 2023-05-03 17:41:50 +02:00
Tamo
0f0cd2d929 handle the array of array form of filter in the dumps 2023-05-03 17:41:50 +02:00
Tamo
fc8c1d118d fix the analytics 2023-05-03 17:41:50 +02:00
Tamo
0548ab9038 create and use the error code 2023-05-03 17:41:50 +02:00
Tamo
143acb9cdc update the tests 2023-05-03 17:41:49 +02:00
Tamo
4b92f1b269 wip 2023-05-03 17:41:49 +02:00
Tamo
c12a1cd956 test all the error messages 2023-05-03 17:41:49 +02:00
Tamo
8af8aa5a33 add a test 2023-05-03 17:41:49 +02:00
Tamo
6df2ba93a9 remove one useless txn 2023-05-03 17:41:49 +02:00
Louis Dureuil
3680a6bf1e extract impl to a function 2023-05-03 17:41:49 +02:00
Louis Dureuil
732c52093d Processing time without autobatching implementation 2023-05-03 17:41:48 +02:00
Louis Dureuil
05cc463fbc Draft implementation of filter support for /delete-by-batch route 2023-05-03 17:41:48 +02:00
meili-bors[bot]
1afde4fea5 Merge #3542
3542: Refactor of the search algorithms r=dureuill a=loiclec

This PR refactors a large part of the search logic (related to https://github.com/meilisearch/meilisearch/issues/3547)

- The "query tree" is replaced by a "query graph", which describes the different ways in which the search query can be interpreted and precomputes the word derivations for each query term. Example:

<img width="1162" alt="Screenshot 2023-02-27 at 10 26 50" src="https://user-images.githubusercontent.com/6040237/221525270-87917cc0-60d1-473f-847f-2c5a7de9e370.png">

- The control flow between the ~criterions~ ranking rules is managed in a single place instead of being independently implemented by each ranking rule.

- The set of document candidates is determined greedily from the beginning. It is often referred to as the "universe" in the code.

- The ranking rules  `proximity`, `attribute`, `typo`, and (maybe) `exactness` are or will be implemented using a K-shortest path graph algorithm. This minimises the number of database and bitmap operations we need to do to compute each ranking rule bucket. It also simplifies the code a lot since a lot of ranking rules will share a large part of their implementation.

- Pointers to database values are stored in a cache to avoid searching in the LMDB databases needlessly.

- The results of some roaring bitmap operations are also stored in a cache, although we'll need to measure the memory pressure this puts on the system and maybe deactivate this cache later on.

- Search requests can be visually logged and debugged in tests.
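
To illustrate what "control flow managed in a single place" can look like, here is a deliberately simplified Python sketch: each ranking rule only splits a candidate set into ordered buckets, and one driver function decides when to descend into the next rule. It is not the code from this PR; the names, the score dictionaries, and the tie-breaking by document id are all hypothetical.

```python
def rule_from_scores(scores):
    """Build a toy ranking rule: lower score is better, equal scores share a bucket."""

    def rule(candidates):
        buckets = {}
        for doc in candidates:
            buckets.setdefault(scores[doc], set()).add(doc)
        return [buckets[score] for score in sorted(buckets)]

    return rule


def bucket_sort(universe, ranking_rules, limit):
    """Order up to `limit` documents of `universe` by successive ranking rules."""
    results = []

    def recurse(candidates, depth):
        if len(results) >= limit or not candidates:
            return
        if depth == len(ranking_rules) or len(candidates) == 1:
            # No rule left to break ties: fall back to document id order.
            results.extend(sorted(candidates)[: limit - len(results)])
            return
        for bucket in ranking_rules[depth](candidates):
            recurse(bucket, depth + 1)
            if len(results) >= limit:
                return

    recurse(set(universe), 0)
    return results


typo = rule_from_scores({1: 0, 2: 1, 3: 0, 4: 1})
proximity = rule_from_scores({1: 2, 2: 1, 3: 1, 4: 3})
ranked = bucket_sort({1, 2, 3, 4}, [typo, proximity], limit=3)
```

Because the driver stops as soon as `limit` results are collected, later buckets are never expanded, which is the kind of saving a centralized control flow makes easy to express.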

TODO:
- [ ] Reintroduce search benchmarks
- [x] Implement `disableOnWords` and `disableOnAttributes` settings of typo tolerance
- [x] Implement "exhaustive number of hits"
- [x] Implement `attribute` ranking rule
   - [x] Indexing changes: split into `word_fid_docids` and `word_position_docids` (with bucketed position)
   - [x] Ranking rule implementations
- [ ] Implement `exactness` ranking rule
  - [x] Initial implementation
  - [ ] Correct implementation when followed by `Words`
- [ ] Implement `geosort` ranking rule
- [ ] Add tests
   - [x] Typo tolerance `disableOnWords`/`disableOnAttributes`
   - [ ] Geosort
   - [x] Exactness
   - [ ] Attribute/Position
   - [ ] Interactions between ranking rules:
     - [x] Typo/Proximity/Attribute not preceded by Words
     - [x] Exactness not preceded by Words
     - [x] Exactness -> Words (+ check universe correctness)
     - [x] Exactness -> Typo, etc.
     - [ ] Sort -> Words (performance tests)
     - [ ] Attribute/Position -> Typo
     - [ ] Attribute/Position -> Proximity
     - [x] Typo -> Exactness 
     - [x] Typo -> Proximity
     - [x] Proximity -> Typo
   - [x] Words 
   - [x] Typo
   - [x] Proximity
   - [x] Sort
   - [x] Ngrams
   - [x] Split words
   - [x] Ngram + Split Words
   - [x] Term matching strategy
   - [x] Distinct attribute
   - [x] Phrase Search
   - [x] Placeholder search
   - [x] Highlighter 
- [x] Limit the number of word derivations in a search query
- [x] Compute the initial universe correctly according to the terms matching strategy
- [x] Implement placeholder search
- [x] Get the list of ranking rules from the settings 
- [x] Implement `distinct`
- [x] Determine what to do when one of `attribute`, `proximity`, `typo`, or `exactness` is placed before `words`
- [x] Make sure the correct number of allowed typos is used for each word, including the prefix one
- [x] Make sure stop words are treated correctly (e.g. correct position in query graph), including in phrases
- [x] Support phrases correctly
- [x] Support synonyms
- [x] Support split words
- [x] Support combination of ngram + split-words (e.g. `whiteh orse` -> `"white horse"`)
- [x] Implement `typo` ranking rule
- [x] Implement `sort` ranking rule
- [x] Use existing `Search` interface to use the new search algorithms
- [x] Remove old code


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-03 13:42:51 +00:00
Louis Dureuil
f8f190cd40 Update exactness tests following charabia camelCase tokenization 2023-05-03 14:45:09 +02:00
Louis Dureuil
3a408e8287 Increase map size for tests following charabia camelCase tokenization 2023-05-03 14:44:48 +02:00