Commit Graph

1742 Commits

Author SHA1 Message Date
Kerollmops
c2a402f3ae Implement an ugly deletion of values in the HNSW 2023-06-27 12:32:39 +02:00
Kerollmops
436a10bef4 Replace the euclidean with a dot product 2023-06-27 12:32:39 +02:00
Kerollmops
8debf6fe81 Use a basic euclidean distance function 2023-06-27 12:32:39 +02:00
Kerollmops
c79e82c62a Move back to the hnsw crate
This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.
2023-06-27 12:32:39 +02:00
Kerollmops
aca305bb77 Log more to make sure we insert vectors in the hgg data-structure 2023-06-27 12:32:38 +02:00
Kerollmops
5816008139 Introduce an optimized version of the euclidean distance function 2023-06-27 12:32:38 +02:00
Kerollmops
268a9ef416 Move to the hgg crate 2023-06-27 12:32:38 +02:00
Clément Renault
642b0f3a1b Expose a new vector field on the search route 2023-06-27 12:32:38 +02:00
Clément Renault
4571e512d2 Store the vectors in an HNSW in LMDB 2023-06-27 12:32:38 +02:00
Clément Renault
7ac2f1489d Extract the vectors from the documents 2023-06-27 12:32:37 +02:00
Clément Renault
34349faeae Create a new _vector extractor 2023-06-27 12:32:37 +02:00
meili-bors[bot]
2d34005965 Merge #3821
3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #3771 

## What does this PR do?

### User standpoint

<details>
<summary>Request ranking score</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScore": true, 
  "limit": 10,
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>


<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScore": 0.6657594086021505
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScore": 0.6654905913978495
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Batman",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Begins",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Returns",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Forever",
      "_rankingScore": 0.11553030303030302
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 12,
  "limit": 10,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>



- If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
  - The `sort` and `geosort` ranking rules do not influence the `_rankingScore`.

<details>
<summary>Request detailed ranking scores</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScoreDetails": true, 
  "limit": 5, 
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>

<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8064516129032258,
          "score": 0.8064516129032258
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.7419354838709677,
          "score": 0.7419354838709677
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Angel and the Badman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 1,
          "maxMatchingWords": 4,
          "score": 0.25
        },
        "typo": {
          "order": 1,
          "typoCount": 0,
          "maxTypoCount": 1,
          "score": 1.0
        },
        "proximity": {
          "order": 2,
          "score": 1.0
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8181818181818182,
          "score": 0.8181818181818182
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.3333333333333333
        }
      }
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 9,
  "limit": 5,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>

- If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields:
  - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule)
  - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule.
  - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document.
- If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute **not** part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`.  

- Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). 

### Implementation standpoint

- Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check.
- Fix the id reported by the sort ranking rule (very minor)
- Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words.
- When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score.
- add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset
- the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0.
- modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query
- Add a new `milli::score_details` module containing all the types that are involved in score computation.
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`.
- Expose the scores in the REST API.
- Add very light analytics for scoring.
- Update the search tests to add the expected scores.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-26 09:32:43 +00:00
meili-bors[bot]
040b5a5b6f Merge #3842
3842: fix some typos r=dureuill a=cuishuang

# Pull Request

## Related issue
Fixes #<issue_number>

## What does this PR do?
- fix some typos

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: cui fliter <imcusg@gmail.com>
2023-06-22 18:01:10 +00:00
cui fliter
530a3e2df3 fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
Louis Dureuil
d26e9a96ec Add score details to new search tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
49c8bc4de6 Fix tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
da833eb095 Expose the scores and detailed scores in the API 2023-06-22 12:39:14 +02:00
Louis Dureuil
701d44bd91 Store the scores for each bucket
Remove optimization where ranking rules are not executed on buckets of a single document
when the score needs to be computed
2023-06-22 12:39:14 +02:00
Louis Dureuil
c621a250a7 Score for graph based ranking rules
Count phrases in matchingWords and maxMatchingWords
2023-06-22 12:39:14 +02:00
Louis Dureuil
8939e85f60 Add rank_to_score for graph based ranking rules 2023-06-22 12:39:14 +02:00
Louis Dureuil
fa41d2489e Score for sort 2023-06-22 12:39:14 +02:00
Louis Dureuil
59c5b992c2 Score for geosort 2023-06-22 12:39:14 +02:00
Louis Dureuil
2ea8194c18 Score for exact_attributes 2023-06-22 12:39:14 +02:00
Louis Dureuil
421df64602 RankingRuleOutput now contains a Score 2023-06-22 12:39:14 +02:00
Louis Dureuil
c0fca6f884 Add score_details 2023-06-22 12:39:14 +02:00
Louis Dureuil
f050634b1e add virtual conditions to fid and position to always have the max cost 2023-06-20 10:07:18 +02:00
Louis Dureuil
becf1f066a Change how the cost of removing words is computed 2023-06-20 09:45:43 +02:00
Louis Dureuil
701d299369 Remove out-of-date comment 2023-06-20 09:45:42 +02:00
Louis Dureuil
a20e4d447c Position now takes into account the distance to the position of the word in the query
it used to be based on the distance to the position 0
2023-06-20 09:45:42 +02:00
Louis Dureuil
af57c3c577 Proximity costs 0 for documents that are perfectly matching 2023-06-20 09:45:42 +02:00
Louis Dureuil
0c40ef6911 Fix sort id 2023-06-20 09:45:42 +02:00
meili-bors[bot]
45636d315c Merge #3670
3670: Fix addition deletion bug r=irevoire a=irevoire

The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed.

It fixes https://github.com/meilisearch/meilisearch/issues/3440.

### What was happening?

The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents.
Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible.
The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced:
1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB.
2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand
3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform.

### Other things introduced in this  PR

Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug.
It's not perfect, but it's easy to improve in the future.
It'll also run for as long as possible on every merge on the main branch.

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-19 09:09:30 +00:00
meili-bors[bot]
cb9d78fc7f Merge #3835
3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec

In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776



Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-15 15:30:24 +00:00
Louis Dureuil
e0c4682758 Fix tests 2023-06-14 13:30:52 +02:00
Louis Dureuil
d9b4b39922 Add trailing pipe to the snapshots so it doesn't end with trailing whitespace 2023-06-14 13:30:52 +02:00
Loïc Lecrenier
2da86b31a6 Remove comments and add documentation 2023-06-14 12:39:42 +02:00
Louis Dureuil
a2a3b8c973 Fix offset difference between query and indexing for hard separators 2023-06-08 12:07:12 +02:00
Louis Dureuil
9f37b61666 DB BREAKING: raise limit of word count from 10 to 30. 2023-06-08 12:07:12 +02:00
Louis Dureuil
c15c076da9 DB BREAKING: Count the number of words in field_id_word_count_docids 2023-06-08 12:07:11 +02:00
Loïc Lecrenier
8628a0c856 Remove docid_word_positions_db + fix deletion bug
That would happen when a word was deleted from all exact attributes
but not all regular attributes.
2023-06-07 10:52:50 +02:00
Clémentine U. - curqui
f3e2f79290 Merge branch 'main' into tmp-release-v1.2.0 2023-06-05 18:36:28 +02:00
Kerollmops
da04edff8c Better use deserialize_unchecked_from to reduce the deserialization time 2023-05-30 14:58:30 +02:00
Tamo
23a5b45ebf drop the old fuzz file 2023-05-29 14:02:37 +02:00
Tamo
6c6387d05e move the fuzzer to its own crate 2023-05-29 12:27:39 +02:00
Louis Dureuil
1dfc4038ab Add test that fails before PR and passes now 2023-05-29 11:58:26 +02:00
Louis Dureuil
73198179f1 Consistently use wrapping add to avoid overflow in debug when query starts with a separator 2023-05-29 11:54:12 +02:00
meili-bors[bot]
2e49d6aec1 Merge #3768
3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec

This PR contains three changes:

## 1. Don't call the `words` ranking rule if the term matching strategy is `All`

This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph.

## 2. The `words` ranking rule is replaced by a graph-based ranking rule. 

This is for three reasons:

1. **performance**: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily.

2. **consistency**: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get.

3. **surfacing bugs**: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix.

## 3. Fix the `update_all_costs_before_nodes` function

It is a bit difficult to explain what was wrong, but I'll try. The bug happened when we had graphs like:
<img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7">
and we gave the node `is` as argument.

Then, we'd walk backwards from the node breadth-first. We'd update the costs of:
1. `sun`
2. `thesun`
3. `start`
4. `the`

which is an incorrect order. The correct order is:

1. `sun`
2. `thesun`
3. `the`
4. `start`

That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-23 13:28:08 +00:00
Louis Dureuil
51043f78f0 Remove trailing whitespace 2023-05-23 15:27:25 +02:00
Louis Dureuil
a490a11325 Add explanatory comment on the way we're recomputing costs 2023-05-23 15:24:24 +02:00
Tamo
002f42875f fix the fuzzer 2023-05-23 11:42:40 +02:00