Commit Graph

1822 Commits

Author SHA1 Message Date
Loïc Lecrenier
345c99d5bd Introduce the words ranking rule working with the new search structures 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
89d696c1e3 Introduce the proximity ranking rule as a graph-based ranking rule 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
c645853529 Introduce a generic graph-based ranking rule 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
a70ab8b072 Introduce a function to find the K shortest paths in a graph 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
48aae76b15 Introduce a function to find the docids of a set of paths in a graph 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
23bf572dea Introduce cache structures used with ranking rule graphs 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
864f6410ed Introduce a structure to represent a set of graph paths efficiently 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
c9bf6bb2fa Introduce a structure to implement ranking rules with graph algorithms 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
46249ea901 Implement a function to find a QueryGraph's docids 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
ce0d1e0e13 Introduce a common way to manage the coordination between ranking rules 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
5065d8b0c1 Introduce a DatabaseCache to memorize the addresses of LMDB values 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
a83007c013 Introduce structure to represent search queries as graphs 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
79e0a6dd4e Introduce a new search module, eventually meant to replace the old one
The code here does not compile, because I am merely splitting one giant
commit into smaller ones where each commit explains a single file.
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
2d88089129 Remove unused term matching strategies 2023-03-20 09:41:55 +01:00
Loïc Lecrenier
6c659dc12f Use MiMalloc in milli tests 2023-03-20 09:41:37 +01:00
Clément Renault
cf34d1c95f Fix a test that forgot to match a Null value 2023-03-15 17:17:19 +01:00
Clément Renault
1a9c58a7ab Fix a bug with the new flattening rules 2023-03-15 16:56:44 +01:00
Clément Renault
64571c8288 Improve the testing of the filters 2023-03-15 14:57:17 +01:00
Clément Renault
ea016d97af Implementing an IS EMPTY filter 2023-03-15 14:12:34 +01:00
Clément Renault
fa2ea4a379 Update the test to accept the new IS syntax 2023-03-14 10:31:27 +01:00
Tamo
0f33a65468 makes kero happy 2023-03-13 16:51:11 +01:00
bors[bot]
fb1260ee88 Merge #3568 #3569
3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563 

Main change
- use the `ubuntu-18.04` container instead of the native `ubuntu-18.04` runner of GitHub Actions; this required installing Docker in the container.

Small additional changes
- remove the useless `fail-fast` setting and the unused or irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- remove a useless step in the job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context
Some Languages are harder to detect than others. A misdetection leads to bad tokenization, which can make some words or even whole documents unsearchable. Japanese is the main Language affected: it can be detected as Chinese, which is tokenized in a completely different way.

A [first iteration was implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347), but it is not sufficient to make Japanese work. That first implementation detects the Language during indexing to avoid bad detections during the search.
Unfortunately, some documents (the shorter ones) can be wrongly detected as Chinese. These documents are then badly tokenized, and because Chinese was detected during indexing, it can also be detected during the search.

For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing. During the search, the query `東京` is then detected as Japanese, because only Japanese documents were detected during indexing, even though v1.0.2 would have detected it as Chinese.
However, suppose the dataset contains at least one document with a field made only of Kanji, such as:
_A document with only 1 field containing only Kanjis:_
```json
{
 "id":4,
 "name": "東京特許許可局"
}
```
_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_
```json
{
 "id":105,
 "name": "東京特許許可局",
 "desc": "日経平均株価は26日 に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面 は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}
```

Then, in both cases, the field `name` is detected as Chinese during indexing, which allows the search to detect Chinese in queries. Therefore, the query `東京` is detected as Chinese, and Meilisearch retrieves only the last two documents.

## Technical Approach

The current PR partially fixes these issues by:
1) Adding a check for potential misdetections and rerunning the extraction of the document, forcing tokenization with the main Languages detected in it:
 >  1) run a first extraction, allowing the tokenizer to detect any Language in any Script
 >  2) generate a distribution of tokens by Script and Language (`script_language`)
 >  3) if, for a given Script, the share of tokens attributed to one of the Languages is under the threshold, rerun the extraction while forbidding the tokenizer to detect these marginal Languages
 >  4) the tokenizer then falls back on the other available Languages to tokenize the text. For example, if Chinese was marginally detected compared to Japanese on the CJ script, the second extraction forces Japanese tokenization for CJ text in the document. Text in another script, such as Latin, is not affected by this restriction.

2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents
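
The indexing-side check in step 1 can be sketched roughly as follows. This is a minimal illustration and not the code of this PR: the function names, the `(script, language)` token representation, and the token counts are assumptions made for the example, and the 10% default is the indexing threshold stated under Limits.

```python
from collections import Counter, defaultdict

# Assumed default: share of a Script's tokens under which a Language is marginal.
MARGINAL_TOKEN_SHARE = 0.10


def marginal_languages(tokens, threshold=MARGINAL_TOKEN_SHARE):
    """Return {script: languages whose token share in that script is under `threshold`}.

    `tokens` is an iterable of (script, language) pairs, one per token, as a
    first unrestricted extraction pass might produce them (hypothetical shape).
    """
    per_script = defaultdict(Counter)
    for script, language in tokens:
        per_script[script][language] += 1

    marginal = {}
    for script, counts in per_script.items():
        total = sum(counts.values())
        low = {lang for lang, n in counts.items() if n / total < threshold}
        if low:
            marginal[script] = low
    return marginal


def allowed_languages(tokens, threshold=MARGINAL_TOKEN_SHARE):
    """Return {script: languages the second extraction pass may still detect}."""
    per_script = defaultdict(set)
    for script, language in tokens:
        per_script[script].add(language)
    banned = marginal_languages(tokens, threshold)
    return {s: langs - banned.get(s, set()) for s, langs in per_script.items()}


# A document shaped like id 105: a short field misdetected as Chinese next to
# a long Japanese field, plus a little Latin text (token counts are made up).
tokens = (
    [("CJ", "Chinese")] * 3
    + [("CJ", "Japanese")] * 97
    + [("Latin", "English")] * 5
)
print(marginal_languages(tokens))
print(allowed_languages(tokens))
```

If `marginal_languages` returns anything for a document, the extraction would be rerun with the tokenizer restricted to `allowed_languages`; Scripts without a marginal Language keep their full list.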

## Limits
This PR introduces 2 arbitrary thresholds:
1) during indexing, a Language is considered misdetected if the number of tokens detected in this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script, "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.
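
The search-side threshold can be illustrated in the same spirit. This is only a sketch: `search_allow_list` and its inputs are hypothetical names, and it assumes the share is computed over all documents of the index, as the sentence above suggests.

```python
# Assumed default: a Language detected in fewer than 5% of documents is marginal.
MARGINAL_DOC_SHARE = 0.05


def search_allow_list(docs_per_language, total_docs, threshold=MARGINAL_DOC_SHARE):
    """Return the Languages the tokenizer may still detect at search time.

    `docs_per_language` maps a Language to the number of documents in which it
    was detected during indexing (hypothetical input shape).
    """
    if total_docs == 0:
        return set()
    return {
        lang
        for lang, count in docs_per_language.items()
        if count / total_docs >= threshold
    }


# 2 documents out of 100 detected as Chinese: Chinese is filtered out, so a
# Kanji-only query can only be detected as Japanese.
print(search_allow_list({"Japanese": 98, "Chinese": 2}, total_docs=100))
# 20 out of 100: Chinese stays allowed and can still be picked for the query.
print(search_allow_list({"Japanese": 80, "Chinese": 20}, total_docs=100))
```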

This PR only partially fixes these issues:
-  the query `東京` now finds Japanese documents if less than 5% of documents are detected as Chinese.
-  the document with the id `105`, which contains the Japanese field `desc` and the misdetected field `name`, is now entirely detected and tokenized as Japanese, and is found with the query `東京`.
-  the document with the id `4` no longer breaks the search Language detection, but it is still detected as a Chinese document and can't be found during the search.

## Related issue
Fixes #3565

## Possible future enhancements
- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2023-03-09 15:34:35 +00:00
ManyTheFish
2f8eb4f54a last PR fixes 2023-03-09 15:34:36 +01:00
Clément Renault
175e8a8495 Fix a diacritic issue 2023-03-09 14:57:47 +01:00
Clément Renault
df48ac8803 Add one more test for the NULL operator 2023-03-09 13:53:37 +01:00
Clément Renault
ff86073288 Add a snapshot for the NULL facet database 2023-03-09 13:32:27 +01:00
Clément Renault
0ad53784e7 Create a new struct to reduce the type complexity 2023-03-09 13:21:21 +01:00
Clément Renault
e064c52544 Rename an internal facet deletion method 2023-03-09 13:08:02 +01:00
Clément Renault
e106b16148 Fix a typo in a variable
Co-authored-by: Louis Dureuil <louis@meilisearch.com>

2023-03-09 13:08:02 +01:00
Tamo
eddefb0e0f refactor the error type of the milli::document thing
silence a warning
2023-03-09 13:03:14 +01:00
ManyTheFish
5deea631ea fix clippy too many arguments 2023-03-09 11:19:13 +01:00
Tamo
c5f22be6e1 add boolean support for csv documents 2023-03-09 11:12:49 +01:00
ManyTheFish
b4b859ec8c Fix typos 2023-03-09 10:58:35 +01:00
Clément Renault
b1d61f5a02 Add more tests for the NULL filter 2023-03-09 10:04:27 +01:00
Clément Renault
7dc04747fd Make clippy happy 2023-03-08 17:37:08 +01:00
Clément Renault
7c0cd7172d Introduce the NULL and NOT value NULL operator 2023-03-08 17:14:34 +01:00
Clément Renault
43ff236df8 Write the NULL facet values in the database 2023-03-08 16:49:53 +01:00
Clément Renault
19ab4d1a15 Classify the NULL fields values in the facet extractor 2023-03-08 16:49:31 +01:00
Clément Renault
9287858997 Introduce a new facet_id_is_null_docids database in the index 2023-03-08 16:14:00 +01:00
ManyTheFish
24c0775c67 Change indexing threshold 2023-03-08 12:36:04 +01:00
ManyTheFish
3092cf0448 Fix clippy errors 2023-03-08 10:53:42 +01:00
ManyTheFish
37d4551e8e Add a threshold filtering the Languages allowed to be detected at search time 2023-03-07 19:38:01 +01:00
ManyTheFish
da48506f15 Rerun extraction when language detection might have failed 2023-03-07 18:35:26 +01:00
bors[bot]
4f1ccbc495 Merge #3525
3525: Fix phrase search containing stop words r=ManyTheFish a=ManyTheFish

# Summary
A search with a phrase containing only stop words was returning an HTTP 500 error.
This PR drops phrases that contain only stop words before the search starts, so a query with a phrase containing only stop words now behaves like a placeholder search.
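
As a rough illustration of that filtering (a sketch under assumptions, not the milli implementation: the function name, the list-of-words phrase representation, and the lowercase comparison are choices made for this example):

```python
def drop_stop_word_phrases(phrases, stop_words):
    """Keep only the phrases that contain at least one word that is not a stop word."""
    return [
        phrase
        for phrase in phrases
        if any(word.lower() not in stop_words for word in phrase)
    ]


stop_words = {"the", "of", "a"}
phrases = [["the", "of"], ["lord", "of", "the", "rings"]]
kept = drop_stop_word_phrases(phrases, stop_words)
print(kept)
```

When every phrase is dropped and nothing else remains in the query, there is nothing left to match, which corresponds to the placeholder-search behavior described above.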

fixes https://github.com/meilisearch/meilisearch/issues/3521

related v1.0.2 PR on milli: https://github.com/meilisearch/milli/pull/779



Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-03-02 10:55:37 +00:00
ManyTheFish
37489fd495 Return an internal error when a matching word is invalid 2023-03-01 19:05:16 +01:00
Louis Dureuil
5822764be9 Skip computing index budget in tests 2023-02-23 11:23:39 +01:00
bors[bot]
ac5a1e4c4b Merge #3423
3423: Add min and max facet stats r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #3426

## What does this PR do?

### User standpoint

- When using a `facets` parameter in search, the facets that have numeric values are displayed in a new section of the response called `facetStats` that contains, per facet, the numeric min and max value of the hits returned by the search.

<details>
<summary>
Sample request/response
</summary>

```json
❯ curl \
  -X POST 'http://localhost:7700/indexes/meteorites/search?facets=mass' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "q": "LL6", "facets":["mass", "recclass"], "limit": 5 }' | jsonxf
{
  "hits": [
    {
      "name": "Niger (LL6)",
      "id": "16975",
      "nametype": "Valid",
      "recclass": "LL6",
      "mass": 3.3,
      "fall": "Fell"
    },
    {
      "name": "Appley Bridge",
      "id": "2318",
      "nametype": "Valid",
      "recclass": "LL6",
      "mass": 15000,
      "fall": "Fell",
      "_geo": {
        "lat": 53.58333,
        "lng": -2.71667
      }
    },
    {
      "name": "Athens",
      "id": "4885",
      "nametype": "Valid",
      "recclass": "LL6",
      "mass": 265,
      "fall": "Fell",
      "_geo": {
        "lat": 34.75,
        "lng": -87.0
      }
    },
    {
      "name": "Bandong",
      "id": "4935",
      "nametype": "Valid",
      "recclass": "LL6",
      "mass": 11500,
      "fall": "Fell",
      "_geo": {
        "lat": -6.91667,
        "lng": 107.6
      }
    },
    {
      "name": "Benguerir",
      "id": "30443",
      "nametype": "Valid",
      "recclass": "LL6",
      "mass": 25000,
      "fall": "Fell",
      "_geo": {
        "lat": 32.25,
        "lng": -8.15
      }
    }
  ],
  "query": "LL6",
  "processingTimeMs": 15,
  "limit": 5,
  "offset": 0,
  "estimatedTotalHits": 42,
  "facetDistribution": {
    "mass": {
      "110000": 1,
      "11500": 1,
      "1161": 1,
      "12000": 1,
      "1215.5": 1,
      "127000": 1,
      "15000": 1,
      "1676": 1,
      "1700": 1,
      "1710.5": 1,
      "18000": 1,
      "19000": 1,
      "220000": 1,
      "2220": 1,
      "22300": 1,
      "25000": 2,
      "265": 1,
      "271000": 1,
      "2840": 1,
      "3.3": 1,
      "3000": 1,
      "303": 1,
      "32000": 1,
      "34000": 1,
      "36.1": 1,
      "45000": 1,
      "460": 1,
      "478": 1,
      "483": 1,
      "5500": 2,
      "600": 1,
      "6000": 1,
      "67.8": 1,
      "678": 1,
      "680.5": 1,
      "6930": 1,
      "8": 1,
      "8300": 1,
      "840": 1,
      "8400": 1
    },
    "recclass": {
      "L/LL6": 3,
      "LL6": 39
    }
  },
  "facetStats": {
    "mass": {
      "min": 3.3,
      "max": 271000.0
    }
  }
}
```

</details>
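
For readers who want the gist of `facetStats` without expanding the sample, here is a minimal sketch of the computation. It is not the engine's code: `facet_stats` is a hypothetical helper, and it assumes the stats are taken over every matching document rather than only the returned page, which is what the sample suggests (the `max` of 271000.0 appears in `facetDistribution` but not among the 5 hits shown).

```python
from numbers import Real


def facet_stats(documents, facets):
    """Return {facet: {"min": ..., "max": ...}} for facets that have numeric values."""
    stats = {}
    for facet in facets:
        values = [
            doc[facet]
            for doc in documents
            # Booleans are numbers in Python; exclude them explicitly.
            if isinstance(doc.get(facet), Real) and not isinstance(doc.get(facet), bool)
        ]
        if values:
            stats[facet] = {"min": min(values), "max": max(values)}
    return stats


matching = [
    {"name": "Niger (LL6)", "recclass": "LL6", "mass": 3.3},
    {"name": "Appley Bridge", "recclass": "LL6", "mass": 15000},
    {"name": "Benguerir", "recclass": "LL6", "mass": 25000},
]
print(facet_stats(matching, ["mass", "recclass"]))
```

Non-numeric facets such as `recclass` produce no entry, matching the sample response, where only `mass` appears under `facetStats`.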



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-22 13:06:43 +00:00
ManyTheFish
900bae3d9d keep phrases that have at least one word 2023-02-21 18:16:51 +01:00
ManyTheFish
28b7d73d4a Remove an inefficient part of a test on milli 2023-02-21 18:16:51 +01:00
bors[bot]
39407885c2 Merge #3347
3347: Enhance language detection r=irevoire a=ManyTheFish

## Summary

Some completely unrelated Languages can share the same characters. In Meilisearch, we detect the Language using `whatlang`, which works well on large texts but fails on small search queries, leading to bad segmentation and normalization of the query.

This PR stores the Languages detected during indexing in order to reduce the list of Languages that can be detected during the search.

## Detail

- Create a 19th database mapping each detected (script, Language) pair to the documents in which it was detected
- Fill the newly created database during indexing
- Create an allow-list with this database and pass it to Charabia
- Add a test ensuring that a Japanese request containing kanjis only is detected as Japanese and not Chinese
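
The shape of that database and of the allow-list derived from it can be sketched as below. Everything here is illustrative: the in-memory dictionaries stand in for the real database, `toy_detect` is a deliberately naive stand-in for the actual detector, and none of these names come from milli or Charabia.

```python
from collections import defaultdict


def toy_detect(text):
    """Naive stand-in detector (hypothetical): kana means Japanese, ASCII letters mean English."""
    pairs = set()
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # Hiragana and Katakana blocks
        pairs.add(("CJ", "Japanese"))
    if any(ch.isascii() and ch.isalpha() for ch in text):
        pairs.add(("Latin", "English"))
    return pairs


def index_script_language(documents, detect):
    """Map each detected (script, language) pair to the ids of the documents where it was seen."""
    database = defaultdict(set)
    for doc_id, text in documents.items():
        for script, language in detect(text):
            database[(script, language)].add(doc_id)
    return dict(database)


def allow_list(database):
    """Return {script: languages detected in at least one document}."""
    allowed = defaultdict(set)
    for (script, language), docids in database.items():
        if docids:
            allowed[script].add(language)
    return dict(allowed)


documents = {1: "東京スカパラダイスオーケストラ", 2: "Tokyo Ska Paradise Orchestra"}
database = index_script_language(documents, toy_detect)
print(database)
print(allow_list(database))
```

With such an allow-list, a Kanji-only query on the CJ script can only be resolved to Japanese, since no document was indexed as Chinese.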

## Related issues
Fixes #2403
Fixes #3513

Co-authored-by: f3r10 <frledesma@outlook.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2023-02-21 10:52:13 +00:00