Commit Graph

7567 Commits

Author SHA1 Message Date
c2f4b6ced0 Test: await for the deletion task to complete before trying to add another task 2023-04-13 18:22:42 +02:00
1e6cbcaf12 Update test comment
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-04-13 17:27:12 +02:00
066c6bd875 test task db full now checks that a task can be successfully added after deleting tasks 2023-04-13 17:20:06 +02:00
fd583501d7 Use non_free_pages_size instead of real_disk_size to check task db space taken 2023-04-13 17:07:44 +02:00
bff4bde0ce Merge #3672
3672: Update version for the next release (v1.1.1) in Cargo.toml r=dureuill a=meili-bot

โš ๏ธ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2023-04-13 13:34:29 +00:00
cd45d21d6e Update version for the next release (v1.1.1) in Cargo.toml 2023-04-13 13:25:10 +00:00
f9960be115 Merge #3659
3659: stops receiving tasks once the task queue is full r=Kerollmops a=irevoire

Give 20GiB to the task queue + once 50% of the task queue is used, it blocks itself and only receives task deletion requests to ensure we never get in a state where we canโ€™t do anything.

Also, create a new error message when we reach this case:
```
Meilisearch cannot receive write operations because the size limit of the tasks database has been reached. Please delete tasks to continue performing write operations.
```

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-04-13 09:11:12 +00:00
b3f60ee805 try to fix the ci 2023-04-13 10:18:58 +02:00
b4fabce36d update the error message + update the task db size to 20GiB with a limit at 50% 2023-04-12 18:54:11 +02:00
9350a7b017 improve the test and try to understand the issue happening on windows 2023-04-12 18:54:11 +02:00
be69ab320d stops receiving tasks once the task queue is full 2023-04-12 18:54:11 +02:00
d59d75c9cd Merge #3667
3667: Disable autobatching of additions and deletions r=irevoire a=dureuill

# Pull Request

## Related issue
Fixes #3664

## What does this PR do?
- Modifies the autobatcher to not batch document additions and deletions, as a workaround to the DB corruption in #3664 



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-04-12 16:51:13 +00:00
a94e78ffb0 Disable autobatching of additions and deletions prototype-disable-autobatch-addition-deletion-0 2023-04-12 10:53:00 +02:00
950f73b8bb Merge #3623
3623: Update mini-dashboard to version v0.2.7 r=curquiza a=bidoubiwa

## Changes

* Retrieve the API Key from the url parameters (#416) `@qdequele`

## ๐Ÿ› Bug Fixes

* Fix show more button not displaying all fields (#419) `@bidoubiwa`

Thanks again to `@bidoubiwa,`     and `@qdequele!` ๐ŸŽ‰


Co-authored-by: Charlotte Vermandel <charlottevermandel@gmail.com>
v1.1.0-rc.3 v1.1.0
2023-03-30 08:31:29 +00:00
e7153e0a97 Update mini-dashboard to version V0.2.7 2023-03-29 14:49:39 +02:00
37a24a4a05 Merge #3621
3621: Fix facet normalization r=Kerollmops a=ManyTheFish

# Pull Request

Make sure the facet normalization is the same between indexing and search.

## Related issue
Fixes #3599



Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-03-29 12:47:20 +00:00
6592746337 Fix other unrelated tests 2023-03-29 14:36:17 +02:00
efea1e5837 Fix facet normalization 2023-03-29 12:02:24 +02:00
b744f33530 Add test 2023-03-29 12:01:52 +02:00
d4f54fc55e Merge #3617
3617: update the geoBoundingBox feature r=dureuill a=irevoire

Closing #3616
Implementing this change in the spec:โ€ฏ38a715c072


Now instead of using the (top_left, bottom_right) corners of the bounding box, itโ€™s using the (top_right, bottom_left) corners.

Co-authored-by: Tamo <tamo@meilisearch.com>
v1.1.0-rc.2
2023-03-29 07:01:17 +00:00
a50b058557 update the geoBoundingBox feature
Now instead of using the (top_left, bottom_right) corners of the bounding box it s using the (top_right, bottom_left) corners.
2023-03-28 18:26:18 +02:00
514b60f8c8 Merge #3597
3597: ensure that the task queue is correctly imported r=irevoire a=irevoire

## Related issue
Fixes #3596

I updated all the dump's integration tests to ensure that we're effectively able to query the tasks

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-03-21 17:31:26 +00:00
a2b151e877 ensure that the task queue is correctly imported
reduce the size of the snapshots file
2023-03-21 14:41:46 +01:00
fb1260ee88 Merge #3568 #3569
3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563 

Main change
- add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container.

Small additional changes
- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context
Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization.

A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search.
Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing.

For instance, a Japanese document `{"id": 1, "name": "ๆฑไบฌใ‚นใ‚ซใƒ‘ใƒฉใƒ€ใ‚คใ‚นใ‚ชใƒผใ‚ฑใ‚นใƒˆใƒฉ"}` is detected as Japanese during indexing, during the search the query `ๆฑไบฌ` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese.
However if in the dataset there is at least one document containing a field with only Kanjis like:
_A document with only 1 field containing only Kanjis:_
```json
{
 "id":4,
 "name": "ๆฑไบฌ็‰น่จฑ่จฑๅฏๅฑ€"
}
```
_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_
```json
{
 "id":105,
 "name": "ๆฑไบฌ็‰น่จฑ่จฑๅฏๅฑ€",
 "desc": "ๆ—ฅ็ตŒๅนณๅ‡ๆ ชไพกใฏ26ๆ—ฅ ใซ็ด„8ใ‚ซๆœˆใถใ‚Šใซ2ไธ‡4000ๅ††ใฎๅฟƒ็†็š„ใช็ฏ€็›ฎใ‚’ไธŠๅ›žใฃใŸใ€‚ๆ ช้ซ˜ใ‚’ๆ”ฏใˆใ‚‹ๆๆ–™ใฎใฒใจใคใฏใ€่‡ชๆฐ‘ๅ…š็ท่ฃ้ธใง3้ธใ‚’ๆฑบใ‚ใŸๅฎ‰ๅ€ๆ™‹ไธ‰้ฆ–็›ธใฎ็ตŒๆธˆๆ”ฟ็ญ–ใธใฎๆœŸๅพ…ใ ใ€‚ๆฉๆตใŒ่ฆ‹่พผใพใ‚Œใ‚‹ใจใ•ใ‚Œใ‚‹ไบบๆใ‚ตใƒผใƒ“ใ‚นใ‚„ๅปบ่จญๆ ชใฎไธ€่ง’ใŒ่ฒทใ‚ใ‚Œใฆใ„ใ‚‹ใ€‚ใŸใ ๆ€ๆƒ‘ใŒๅ…ˆ่กŒใ—ใฆ่ณ‡้‡‘ใŒ้›†ใพใฃใฆใ„ใ‚‹้ข ใฏๅฆใ‚ใชใ„ใ€‚ๅฎŸ้š›ใซๆ”ฟ็ญ–ๅŠนๆžœใ‚’ๅ–ใ‚Š่พผใ‚€ไผๆฅญใฏใฉใ“ใ‹ใ€ใชใŠๆœช็Ÿฅๆ•ฐใ ใ€‚"
}
```

Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore,  the query `ๆฑไบฌ` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch.

## Technical Approach

The current PR partially fixes these issues by:
1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it.
 >  1) run a first extraction allowing the tokenizer to detect any Language in any Script
 >  2) generate a distribution of tokens by Script and Languages (`script_language`)
 >  3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages
 >  4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. however, the text on another script like Latin will not be impacted by this restriction.

2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents

## Limits
This PR introduces 2 arbitrary thresholds:
1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.

This PR only partially fixes these issues:
- โœ… the query `ๆฑไบฌ` now find Japanese documents if less than 5% of documents are detected as Chinese.
- โœ… the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `ๆฑไบฌ`.
- โŒ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search.

## Related issue
Fixes #3565

## Possible future enhancements
- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
v1.1.0-rc.1
2023-03-09 15:34:35 +00:00
48a51e5cd6 Merge #3577
3577: Avoid fetching an LMDB value with an empty string r=ManyTheFish a=Kerollmops

# Pull Request

## Related issue
Fixes #3574 

## What does this PR do?
This PR fixes a bug where an empty key fetches an entry in the database. LMDB throws an error if an empty or too-long key is used to fetch an entry. This empty string seems to have been generated by the Charabia tokenizer.

Co-authored-by: Clรฉment Renault <clement@meilisearch.com>
2023-03-09 14:35:25 +00:00
2f8eb4f54a last PR fixes 2023-03-09 15:34:36 +01:00
dea101e3d9 Update meilisearch/src/routes/indexes/mod.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-03-09 15:17:03 +01:00
175e8a8495 Fix a diacritic issue 2023-03-09 14:57:47 +01:00
6da54d0cb6 Add a test to fix a diacritic issue 2023-03-09 14:57:38 +01:00
667bb87e35 Merge #3541
3541: Add cache on the indexes stats r=dureuill a=irevoire

Fix https://github.com/meilisearch/meilisearch/issues/3540

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-03-09 13:32:52 +00:00
dff2715ef3 Try removing needless collect 2023-03-09 11:28:10 +01:00
5deea631ea fix clippy too many arguments 2023-03-09 11:19:13 +01:00
b4b859ec8c Fix typos 2023-03-09 10:58:35 +01:00
b99ef3d336 Update CI to still use ubuntu-18 2023-03-08 17:11:36 +01:00
7e2fd82e41 Use Language allow list in the highlighter prototype-better-language-detection-0 2023-03-08 12:44:16 +01:00
24c0775c67 Change indexing threshold 2023-03-08 12:36:04 +01:00
3092cf0448 Fix clippy errors 2023-03-08 10:53:42 +01:00
37d4551e8e Add a threshold filtering the Languages allowed to be detected at search time 2023-03-07 19:38:01 +01:00
da48506f15 Rerun extraction when language detection might have failed 2023-03-07 18:35:26 +01:00
2f5b9fbbd8 Restore contribution of the index sizes to the db size
- the index size now contributes to the db size even if the index is not authorized
2023-03-07 14:05:27 +01:00
7faa9a22f6 Pass IndexStat by ref in store_stats_of 2023-03-07 14:00:54 +01:00
370d88f626 Merge #3561
3561: Fix the snapshots permissions on unix system r=irevoire a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/3507

The snapshot permissions were wrong after the v0.30 and the huge refacto of the index scheduler.
Fix this issue + add a test on the permissions on unix

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-03-07 08:51:38 +00:00
d34faa8f9c put back the sleep as it was and fix the from 2023-03-06 18:09:09 +01:00
e5d0bef6d8 update a comment 2023-03-06 17:04:24 +01:00
76288fad72 Fix snapshots 2023-03-06 16:57:31 +01:00
076a3d371c Eagerly compute stats as fallback to the cache.
- Refactor all around to avoid spawning indexes more times than necessary
2023-03-06 16:57:31 +01:00
3bbf760542 update most snapshots 2023-03-06 16:57:31 +01:00
fd5c48941a Add cache on the indexes stats 2023-03-06 16:57:31 +01:00
e704728ee7 fix the snapshots permissions on unix system 2023-03-06 16:28:40 +01:00
c0ede6d152 Merge #3562
3562: Update version for the next release (v1.1.0) in Cargo.toml r=curquiza a=meili-bot

โš ๏ธ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
v1.1.0-rc.0
2023-03-06 13:54:16 +00:00