Commit Graph

8267 Commits

Author SHA1 Message Date
ManyTheFish
114f878205 Rename restrictSearchableAttributes into attributesToSearchOn 2023-06-26 14:55:57 +02:00
ManyTheFish
42709ea9a5 Fix clippy warnings 2023-06-26 14:55:57 +02:00
ManyTheFish
993b0d012c Remove proximity_ranking_rule_order test, fixing this test would force us to create a fid_word_pair_proximity_docids and a fid_word_prefix_pair_proximity_docids databases which may multiply the keys of word_pair_proximity_docids and word_prefix_pair_proximity_docids by the number of attributes in the searchable_attributes list 2023-06-26 14:55:57 +02:00
ManyTheFish
fb8fa07169 Restrict field ids in search context 2023-06-26 14:55:57 +02:00
ManyTheFish
0ccf1e2e40 Allow the search cache to store owned values 2023-06-26 14:55:57 +02:00
ManyTheFish
9680e1e41f Introduce a BytesDecodeOwned trait in heed_codecs 2023-06-26 14:55:14 +02:00
ManyTheFish
a61ca4066e Add tests 2023-06-26 14:55:14 +02:00
ManyTheFish
461b5118bd Add API search setting 2023-06-26 14:55:14 +02:00
Tamo
a3716c5678 add the new parameter to the search builder of milli 2023-06-26 14:55:14 +02:00
meili-bors[bot]
2d34005965 Merge #3821
3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #3771 

## What does this PR do?

### User standpoint

<details>
<summary>Request ranking score</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScore": true, 
  "limit": 10,
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>


<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScore": 0.6657594086021505
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScore": 0.6654905913978495
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Batman",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Begins",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Returns",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Forever",
      "_rankingScore": 0.11553030303030302
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 12,
  "limit": 10,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>



- If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
  - The `sort` and `geosort` ranking rules do not influence the `_rankingScore`.

<details>
<summary>Request detailed ranking scores</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScoreDetails": true, 
  "limit": 5, 
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>

<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8064516129032258,
          "score": 0.8064516129032258
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.7419354838709677,
          "score": 0.7419354838709677
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Angel and the Badman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 1,
          "maxMatchingWords": 4,
          "score": 0.25
        },
        "typo": {
          "order": 1,
          "typoCount": 0,
          "maxTypoCount": 1,
          "score": 1.0
        },
        "proximity": {
          "order": 2,
          "score": 1.0
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8181818181818182,
          "score": 0.8181818181818182
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.3333333333333333
        }
      }
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 9,
  "limit": 5,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>

- If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields:
  - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule)
  - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule.
  - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document.
- If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute **not** part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`.  

- Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). 

### Implementation standpoint

- Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check.
- Fix the id reported by the sort ranking rule (very minor)
- Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words.
- When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score.
- add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset
- the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0.
- modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query
- Add a new `milli::score_details` module containing all the types that are involved in score computation.
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`.
- Expose the scores in the REST API.
- Add very light analytics for scoring.
- Update the search tests to add the expected scores.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-26 09:32:43 +00:00
Louis Dureuil
62eefcda6e Change and add links to the Cloud 2023-06-26 09:17:15 +02:00
0xflotus
85a24775c5 Update README.md 2023-06-23 12:25:53 +02:00
0xflotus
6b0e9b9a7f Update README.md 2023-06-23 12:20:43 +02:00
0xflotus
b18c57ea7f docs: fixed some broken links
Some of the links in the README file were broken.
2023-06-23 12:18:43 +02:00
meili-bors[bot]
040b5a5b6f Merge #3842
3842: fix some typos r=dureuill a=cuishuang

# Pull Request

## Related issue
Fixes #<issue_number>

## What does this PR do?
- fix some typos

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: cui fliter <imcusg@gmail.com>
2023-06-22 18:01:10 +00:00
cui fliter
530a3e2df3 fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
Louis Dureuil
11d32ad192 Add very light analytics for scoring 2023-06-22 12:39:14 +02:00
Louis Dureuil
d26e9a96ec Add score details to new search tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
49c8bc4de6 Fix tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
da833eb095 Expose the scores and detailed scores in the API 2023-06-22 12:39:14 +02:00
Louis Dureuil
701d44bd91 Store the scores for each bucket
Remove optimization where ranking rules are not executed on buckets of a single document
when the score needs to be computed
2023-06-22 12:39:14 +02:00
Louis Dureuil
c621a250a7 Score for graph based ranking rules
Count phrases in matchingWords and maxMatchingWords
2023-06-22 12:39:14 +02:00
Louis Dureuil
8939e85f60 Add rank_to_score for graph based ranking rules 2023-06-22 12:39:14 +02:00
Louis Dureuil
fa41d2489e Score for sort 2023-06-22 12:39:14 +02:00
Louis Dureuil
59c5b992c2 Score for geosort 2023-06-22 12:39:14 +02:00
Louis Dureuil
2ea8194c18 Score for exact_attributes 2023-06-22 12:39:14 +02:00
Louis Dureuil
421df64602 RankingRuleOutput now contains a Score 2023-06-22 12:39:14 +02:00
Louis Dureuil
c0fca6f884 Add score_details 2023-06-22 12:39:14 +02:00
Filip Bachul
9015a8e8d9 Merge branch 'main' into cymruu/payload-unit-test 2023-06-21 09:26:50 +02:00
meili-bors[bot]
28404d56b7 Merge #3799
3799: Fix error messages in `check-release.sh` r=curquiza a=vvv

- `check_tag`: Report file name correctly. Use named variables.
- Introduce `read_version` helper function. Simplify the implementation.
- Show meaningful error message if `GITHUB_REF` is not set or its format is incorrect.

Co-authored-by: Valeriy V. Vorotyntsev <valery.vv@gmail.com>
2023-06-20 13:35:33 +00:00
meili-bors[bot]
262c1f2baf Merge #3844
3844: Fix SDK CI (again) r=curquiza a=curquiza

Following this PR: https://github.com/meilisearch/meilisearch/pull/3813

Sorry `@Kerollmops,` here is (I hope) the latest fix 🙏 I made tests last time that were not sufficient. I really did a lot this time. I hope I have not missed anything.



Co-authored-by: curquiza <clementine@meilisearch.com>
2023-06-20 13:01:07 +00:00
Valeriy V. Vorotyntsev
cfed349aa3 Fix error messages in check-release.sh
- `check_tag`: Report file name correctly. Use named variables.
- Introduce `read_version` helper function. Simplify the implementation.
- Show meaningful error message if `GITHUB_REF` is not set or its format
  is incorrect.
2023-06-20 13:58:09 +03:00
Louis Dureuil
f050634b1e add virtual conditions to fid and position to always have the max cost 2023-06-20 10:07:18 +02:00
Louis Dureuil
becf1f066a Change how the cost of removing words is computed 2023-06-20 09:45:43 +02:00
Louis Dureuil
701d299369 Remove out-of-date comment 2023-06-20 09:45:42 +02:00
Louis Dureuil
a20e4d447c Position now takes into account the distance to the position of the word in the query
it used to be based on the distance to the position 0
2023-06-20 09:45:42 +02:00
Louis Dureuil
af57c3c577 Proximity costs 0 for documents that are perfectly matching 2023-06-20 09:45:42 +02:00
Louis Dureuil
0c40ef6911 Fix sort id 2023-06-20 09:45:42 +02:00
curquiza
bbc9f68ff5 Use the input from the previous job instead of the workflow dispatch 2023-06-19 18:49:15 +02:00
meili-bors[bot]
45636d315c Merge #3670
3670: Fix addition deletion bug r=irevoire a=irevoire

The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed.

It fixes https://github.com/meilisearch/meilisearch/issues/3440.

### What was happening?

The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents.
Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible.
The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced:
1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB.
2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand
3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform.

### Other things introduced in this  PR

Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug.
It's not perfect, but it's easy to improve in the future.
It'll also run for as long as possible on every merge on the main branch.

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-19 09:09:30 +00:00
meili-bors[bot]
cb9d78fc7f Merge #3835
3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec

In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776



Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-15 15:30:24 +00:00
meili-bors[bot]
01d2ee5cc1 Merge #3836
3836: Remove trailing whitespace in snapshots r=dureuill a=dureuill

# Pull Request

## Related issue

No issue, maintenance

## What does this PR do?
- Remove trailing whitespace in snapshots by adding a trailing `|` at the end of lines that would previously end with fixed-width integers
- This allows contributors whose editor is configured to remove trailing whitespace not to modify the tests when changing an unrelated part of the file containing the tests


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-14 13:00:52 +00:00
Louis Dureuil
e0c4682758 Fix tests 2023-06-14 13:30:52 +02:00
Louis Dureuil
d9b4b39922 Add trailing pipe to the snapshots so it doesn't end with trailing whitespace 2023-06-14 13:30:52 +02:00
Loïc Lecrenier
2da86b31a6 Remove comments and add documentation 2023-06-14 12:39:42 +02:00
Loïc Lecrenier
4e81445d42 Stop the fuzzer after an hour 2023-06-12 15:30:51 +02:00
meili-bors[bot]
4829348d6e Merge #3813
3813: Fix SDK CI for scheduled jobs r=curquiza a=curquiza

The SDK CI does not run for the scheduled job (`cron`) every day, and only works for manual triggers.

I added a job to define the Docker image we use depending on the event: `worflow_dispatch` = manual triggering, or `scheduled` = cron jobs

Co-authored-by: curquiza <clementine@meilisearch.com>
2023-06-12 08:41:03 +00:00
meili-bors[bot]
047d22fcb1 Merge #3824
3824: Changes the way words are counted in the word count DB r=ManyTheFish a=dureuill

# Pull Request

## Related issue

Fixes https://github.com/meilisearch/meilisearch/issues/3823

## What does this PR do?

- Apply offset when parsing query that is consistent with the indexing

### DB breaking changes

- Count the number of words in `field_id_word_count_docids`
- raise limit of word count for storing the entry in the DB from 10 to 30

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-08 13:26:05 +00:00
Louis Dureuil
a2a3b8c973 Fix offset difference between query and indexing for hard separators 2023-06-08 12:07:12 +02:00
Louis Dureuil
9f37b61666 DB BREAKING: raise limit of word count from 10 to 30. 2023-06-08 12:07:12 +02:00