Commit Graph

8123 Commits

Author SHA1 Message Date
Clémentine U. - curqui
44f231d41e Update .github/workflows/test-suite.yml 2023-05-25 10:07:45 +02:00
TATHAGATA ROY
3c5d1c93de Added a cron test for disabled all-tokenization 2023-05-25 10:07:32 +02:00
meili-bors[bot]
087866d59f Merge #3775
3775: Last error code changes on the new get/delete documents routes r=dureuill a=irevoire

# Pull Request

## Related issue
Fixes #3774

## What does this PR do?
Following the specification: https://github.com/meilisearch/specifications/pull/236

1. Get rid of the `invalid_document_delete_filter` and always use the `invalid_document_filter`
2. Introduce a new `missing_document_filter` instead of returning `invalid_document_delete_filter` (that’s consistent with all the other routes that have a mandatory parameter)
3. Always return the `original_filter` in the details (potentially set to `null`) instead of hiding it if it wasn’t used


Co-authored-by: Tamo <tamo@meilisearch.com>
v1.2.0-rc.2
2023-05-24 10:07:41 +00:00
Tamo
9111f5176f get rid of the invalid document delete filter in favor of the invalid document filter 2023-05-24 11:53:16 +02:00
Tamo
b9dd092a62 make the details return null in the originalFilter field if no filter was provided + add a big test on the details 2023-05-24 11:48:22 +02:00
Tamo
ca99bc3188 implement the missing document filter error code when deleting documents 2023-05-24 11:29:20 +02:00
Tamo
57d53de402 Increase the number of buckets 2023-05-24 10:47:15 +02:00
meili-bors[bot]
2e49d6aec1 Merge #3768
3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec

This PR contains three changes:

## 1. Don't call the `words` ranking rule if the term matching strategy is `All`

This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph.

## 2. The `words` ranking rule is replaced by a graph-based ranking rule. 

This is for three reasons:

1. **performance**: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily.

2. **consistency**: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get.

3. **surfacing bugs**: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix.

## 3. Fix the `update_all_costs_before_nodes` function

It is a bit difficult to explain what was wrong, but I'll try. The bug happened when we had graphs like:
<img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7">
and we gave the node `is` as argument.

Then, we'd walk backwards from the node breadth-first. We'd update the costs of:
1. `sun`
2. `thesun`
3. `start`
4. `the`

which is an incorrect order. The correct order is:

1. `sun`
2. `thesun`
3. `the`
4. `start`

That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-23 13:28:08 +00:00
Louis Dureuil
51043f78f0 Remove trailing whitespace 2023-05-23 15:27:25 +02:00
Louis Dureuil
a490a11325 Add explanatory comment on the way we're recomputing costs 2023-05-23 15:24:24 +02:00
Tamo
002f42875f fix the fuzzer 2023-05-23 11:42:40 +02:00
Tamo
22213dc604 push the fuzzer 2023-05-23 09:14:26 +02:00
Tamo
602ad98cb8 improve the way we handle the fsts 2023-05-22 11:15:14 +02:00
Tamo
7f619ff0e4 get rids of the now unused soft_deletion_used parameter 2023-05-22 10:33:49 +02:00
Tamo
4391cba6ca fix the addition + deletion bug 2023-05-17 18:28:57 +02:00
Tamo
d7ddf4925e Revert "Disable autobatching of additions and deletions"
This reverts commit a94e78ffb0.
2023-05-17 14:25:50 +02:00
meili-bors[bot]
101f5a20d2 Merge #3757
3757: Adjust the cost of edges in the `position` ranking rule by bucketing positions more aggressively r=loiclec a=loiclec

This PR significantly improves the performance of the `position` ranking rule when:
1. a query contains many words
2. the `position` ranking rule needs to be called many times
3. the score of the documents according to `position` is high

These conditions greatly increase:
1. the number of edge traversals that are needed to find a valid path from the `start` node to the `end` node
2. the number of edges that need to be deleted from the graph, and therefore the number of times that we need to recompute all the possible costs from START to END

As a result, a majority of the search time is spent in `visit_condition`, `visit_node`, and `update_all_costs_before_node`. This is frustrating because it often happens when the "universe" given to the rule consists of only a handful of document ids.

By limiting the number of possible edges between two nodes from `20` to `10`, we:
1. reduce the number of possible costs from START to END
2. reduce the number of edges that will be deleted 
3. make it faster to update the costs after deleting an edge
4. reduce the number of buckets that need to be computed

In terms of relevancy, I don't think we lose or gain much. We still prefer terms that are in a lower positions, with decreasing precision as we go further. The previous choice of bucketing wasn't chosen in a principled way, and neither is this one. They both "feel" right to me.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
v1.2.0-rc.1
2023-05-17 11:43:59 +00:00
meili-bors[bot]
6ce1ce77e6 Merge #3738
3738: Add analytics on the get documents resource r=dureuill a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/3737
Related spec https://github.com/meilisearch/specifications/pull/234

## What does this PR do?
Add the analytics for the following routes:
- `GET` - `/indexes/:uid/documents`
- `GET` - `/indexes/:uid/documents/:doc_id`
- `POST` - `/indexes/:uid/documents/fetch`

These analytics are aggregated between two events:
- `Documents Fetched GET`
- `Documents Fetched POST`

That shares the same payload:
 Property name | Description | Example |
|---------------|-------------|---------|
| `requests.total_received` | Total number of request received in this batch | 325 |
| `per_document_id` | `false` | false |
| `per_filter` | `true` if `POST /indexes/:indexUid/documents/fetch` endpoint was used with a filter in this batch, otherwise `false` | false |
| `pagination.max_limit` | Highest value given for the `limit` parameter in this batch | 60 |
| `pagination.max_offset` | Highest value given for the `offset` parameter in this batch | 1000 |

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-05-16 19:37:41 +00:00
Loïc Lecrenier
ec8f685d84 Fix bug in cheapest path algorithm 2023-05-16 17:01:30 +02:00
Loïc Lecrenier
5758268866 Don't compute split_words for phrases 2023-05-16 17:01:18 +02:00
meili-bors[bot]
4d037e6693 Merge #3759
3759: Invalid error code when parsing filters r=dureuill a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/3753

## What does this PR do?
Fix the error code in case the error comes from the evaluate of the filter for the get, fetch and delete documents routes.


Co-authored-by: Tamo <tamo@meilisearch.com>
2023-05-16 12:55:06 +00:00
Tamo
96da5130a4 fix the error code in case of not filterable attributes on the get / delete documents by filter routes 2023-05-16 13:56:18 +02:00
Loïc Lecrenier
3e19702de6 Update snapshot tests 2023-05-16 12:22:46 +02:00
meili-bors[bot]
1e762d151f Merge #3755
3755: Re-add final dot r=curquiza a=ManyTheFish

I removed the final dot of the error message in my last PR, this one re-adds it.

related to https://github.com/meilisearch/meilisearch/pull/3749

> Oups 😬 

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-05-16 10:10:58 +00:00
Tamo
0b38f211ac test the new introduced route 2023-05-16 12:07:44 +02:00
Loïc Lecrenier
f6524a6858 Adjust costs of edges in position ranking rule
To ensure good performance
2023-05-16 11:28:56 +02:00
meili-bors[bot]
65ad8cce36 Merge #3741
3741: Add ngram support to the highlighter r=ManyTheFish a=loiclec

This PR fixes a bug introduced by the search refactor, where ngrams were not highlighted. 

The solution was to add the ngrams to the vector of `LocatedQueryTerm` that is given to the `MatchingWords` structure.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-16 09:03:31 +00:00
ManyTheFish
42650f82e8 Re-add final dot 2023-05-16 10:57:26 +02:00
Loïc Lecrenier
a37da36766 Implement words as a graph-based ranking rule and fix some bugs 2023-05-16 10:42:11 +02:00
Loïc Lecrenier
85d96d35a8 Highlight ngram matches as well 2023-05-16 10:39:36 +02:00
meili-bors[bot]
bf66e97b48 Merge #3749
3749: Fix back: sort error message r=ManyTheFish a=ManyTheFish

This PR reintroduces the error message modified in https://github.com/meilisearch/milli/pull/375.
However, this added double-quotes around `sort` in the message. I don't think another message contains double-quotes, so I have added a separate commit replacing the double-quotes with back-ticks, which seems more consistent with the other error messages, this last change can be reverted easily.

## Detailed changes
#### v1.2-rc0
```
The sort ranking rule must be specified in the ranking rules settings to use the sort parameter at search time.
```
#### [Reintroduce fix (previous and expected behavior)](23d1c86825)
```
You must specify where "sort" is listed in the rankingRules setting to use the sort parameter at search time
```
#### [Replace double-quotes with back-ticks (my suggestion)](4d691d071a)
```
You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time
```

## Related

Fixes #3722

## Reviewers

- technical review: `@irevoire`
- to validate the replacement: `@macraig`

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-05-15 14:55:51 +00:00
meili-bors[bot]
a7ea5ec748 Merge #3651
3651: Use the writemap flag to reduce the memory usage r=irevoire a=Kerollmops

This draft PR is showing some stats about the memory usage of Meilisearch when [the LMDB `MDB_WRITEMAP` flag](3947014aed/libraries/liblmdb/lmdb.h (L573-L581)) is enabled and when it is not. As you can see there is a reduction of about 50% of the memory usage pick. The dataset used was [the Wikipedia one](https://www.notion.so/meilisearch/Wikipedia-8b1486e4b17547c5bda485d2d97767a0) with the first 30 000 first CSV documents without settings. This PR depends on https://github.com/meilisearch/heed/pull/168.

I just [opened a discussion](https://github.com/meilisearch/product/discussions/652) for people to understand the tradeoffs and give their feedback.

- [x] Create an experiment flag `--experimental-reduce-indexing-memory-usage`.
- [x] Add it to the config file.
- [x] Explain the tradeoff and copy/link the LMDB documentation in the help message.
- [x] Add analytics about the experimental flag.
- [x] Document that this flag cannot be used on Windows, ~~or hide it~~.

<details>
  <summary>The command I used to run the tests</summary>

#### Sign the binary to be able to use Instruments / xcrun
```sh
codesign -s - -f --entitlements ~/ent.plist target/release/meilisearch
```

where `ent.plist` contains:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>com.apple.security.get-task-allow</key>
        <true/>
    </dict>
</plist>
```

#### Run Meilisearch in measure-mode
```sh
xcrun xctrace record --template 'Allocations' --launch -- target/release/meilisearch --max-indexing-memory 0MiB
```

#### Send the wiki dataset available on notion.so / Public
```sh
for f in 0.csv 15000.csv; do echo sending $f; xh 'localhost:7700/indexes/wiki/documents' 'content-type:text/csv' `@$f;` done
```

#### Wait for the task to finish
```sh
watch --color xh --pretty all 'localhost:7700/tasks?statuses=processing'
```
</details>

Keep in mind that I tested that with the Instruments Apple tools on an iMac 5k 2019. More benchmarks must be done, especially on the indexation speed, as the flag is told to slow down writing into databases bigger that the amount of memory.

On the left Meilisearch is running without the flag. On the right, it is running with the flag.

<p align="center">
<img align="left" width="45%" alt="Instrument showing the memory usage of Meilisearch without the MDB_WRITEMAP flag" src="https://user-images.githubusercontent.com/3610253/234299524-7607f1df-6fc1-45d3-bd3d-4f9388002857.png">
<img align="right" width="45%" alt="Instrument showing the memory usage of Meilisearch with the MDB_WRITEMAP flag" src="https://user-images.githubusercontent.com/3610253/234299534-6cc3ae58-8bd9-426c-aa79-4c78f9e88b94.png">
</p>

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-05-15 14:10:07 +00:00
Kerollmops
dc7ba77e57 Add the option in the config file 2023-05-15 16:07:43 +02:00
Clément Renault
13f870e993 Fix typos and documentation issues 2023-05-15 15:11:45 +02:00
Kerollmops
1a79fd0c3c Use the new heed v0.12.6 2023-05-15 11:42:30 +02:00
Kerollmops
f759ec7fad Expose a flag to enable the MDB_WRITEMAP flag 2023-05-15 11:38:43 +02:00
ManyTheFish
4d691d071a Change double-quotes by back-ticks in sort error message 2023-05-15 11:10:36 +02:00
ManyTheFish
23d1c86825 Re-introduce the sort error message fix 2023-05-15 11:07:23 +02:00
Kerollmops
c4a40e7110 Use the writemap flag to reduce the memory usage 2023-05-15 10:15:33 +02:00
meili-bors[bot]
e01980c6f4 Merge #3739
3739: fix: update `payload_too_large` error message to include human readable maximum acceptable payload size r=Kerollmops a=cymruu


# Pull Request

## Related issue
Fixes #3736 

## What does this PR do?
- update `payload_too_large` error message as requested in ticket

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Filip Bachul <filipbachul@gmail.com>
2023-05-11 09:37:19 +00:00
Filip Bachul
25209a3590 introduce remaining field in Payload 2023-05-10 20:55:18 +02:00
Filip Bachul
3064ea6495 fix: update payload_too_large error message to include human readable maximum acceptable payload size 2023-05-10 18:16:59 +02:00
Tamo
46ec8a97e9 rename the analytics according to the spec 2023-05-10 14:28:30 +02:00
Tamo
c42a65a297 Update meilisearch/src/analytics/segment_analytics.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-10 14:28:30 +02:00
Tamo
d08f8690d2 add analytics on the get documents resource 2023-05-10 14:28:30 +02:00
meili-bors[bot]
ad5f25d880 Merge #3742
3742: Compute split words derivations of terms that don't accept typos r=ManyTheFish a=loiclec

Allows looking for the split-word derivation for short words in the user's query (like `the -> "t he"` or `door -> do or`) as well as for 3grams.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-10 12:12:52 +00:00
Loïc Lecrenier
4d352a21ac Compute split words derivations of terms that don't accept typos 2023-05-10 13:31:19 +02:00
meili-bors[bot]
918ce1dd67 Merge #3731
3731: Move comments above keys in config.toml r=curquiza a=jirutka

The current style is very unusual, confusing and breaks compatibility with tools for parsing config files including comments. Everyone writes comments above the items to which they refer (maybe except pythonists), so let's stick to that.


Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>
2023-05-09 09:24:36 +00:00
meili-bors[bot]
4a4210c116 Merge #3734
3734: Update version for the next release (v1.2.0) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
v1.2.0-rc.0
2023-05-09 07:35:48 +00:00
curquiza
3533d4f2bb Update version for the next release (v1.2.0) in Cargo.toml 2023-05-08 17:52:33 +00:00