Commit Graph

10043 Commits

Author SHA1 Message Date
bors[bot]
4c516c00da Merge #426
426: Fix search highlight for non-unicode chars r=ManyTheFish a=Samyak2

# Pull Request

## What does this PR do?
Fixes https://github.com/meilisearch/MeiliSearch/issues/1480
<!-- Please link the issue you're trying to fix with this PR, if none then please create an issue first. -->

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

## Changes

The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme clusters to highlight.

In essence, the `matching_bytes` function now returns the number of matching grapheme clusters instead of bytes.

Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme clusters from tokens
- `<mark>` tag is put around only the matched part
    - before this change, the entire word was highlighted even if only a part of it matched

## Questions

Since `matching_bytes` does not return number of bytes but grapheme clusters, should it be renamed to something like `matching_chars` or `matching_graphemes`? Will this break the API?

Thank you very much `@ManyTheFish` for helping 😄 

Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
2022-01-17 13:39:00 +00:00
Tamo
d1ac40ea14 fix(filter): Fix two bugs.
- Stop lowercasing the field when looking in the field id map
- When a field id does not exist it means there is currently zero
  documents containing this field thus we returns an empty RoaringBitmap
  instead of throwing an internal error
2022-01-17 13:51:46 +01:00
bors[bot]
15bbde1022 Merge #432
432: Fuzzer r=Kerollmops a=irevoire

Provide a first way of fuzzing the indexing part of milli.
It depends on [cargo-fuzz](https://rust-fuzz.github.io/book/cargo-fuzz.html)

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-17 12:50:26 +00:00
Samyak S Sarnayak
c0313f3026 Use chars for highlight instead of graphemes
Tokenizer v0.2.7 uses chars instead of graphemes for matching bytes.
`unicode-segmentation` dependency isn't needed anymore.

Also, oxidised the highlight code :)

Co-authored-by: many <maxime@meilisearch.com>
2022-01-17 13:15:31 +05:30
Samyak S Sarnayak
2d7607734e Run cargo fmt on matching_words.rs 2022-01-17 13:04:33 +05:30
Samyak S Sarnayak
5ab505be33 Fix highlight by replacing num_graphemes_from_bytes
num_graphemes_from_bytes has been renamed in the tokenizer to
num_chars_from_bytes.

Highlight now works correctly!
2022-01-17 13:02:55 +05:30
Samyak S Sarnayak
c10f58b7bd Update tokenizer to v0.2.7 2022-01-17 13:02:00 +05:30
Samyak S Sarnayak
e752bd06f7 Fix matching_words tests to compile successfully
The tests still fail due to a bug in https://github.com/meilisearch/tokenizer/pull/59
2022-01-17 11:37:45 +05:30
Samyak S Sarnayak
30247d70cd Fix search highlight for non-unicode chars
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme
  clusters to highlight.

In essence, the `matching_bytes` function returns the number of matching
grapheme clusters instead of bytes. Should this function be renamed
then?

Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme
  clusters from tokens
- `<mark>` tag is put around only the matched part
    - before this change, the entire word was highlighted even if only a
      part of it matched
2022-01-17 11:37:44 +05:30
Tamo
0605c0ac68 apply review comments 2022-01-13 18:51:08 +01:00
mpostma
0515c6e844 bug(http): fix task duration v0.25.1 2022-01-13 16:41:07 +01:00
Irevoire
38176181ac fix(dump): Fix the import of dump from the v24 and before 2022-01-13 16:40:58 +01:00
Tamo
b22c80106f add some settings to the fuzzed milli and use the published version of arbitrary json 2022-01-13 15:35:24 +01:00
bors[bot]
a7e634bd4f Merge #2074
2074: fix(dump): Fix the import of dumps when there is no data.ms r=irevoire a=irevoire



Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-01-13 13:47:03 +00:00
bors[bot]
78a381a30b Merge #2076
2076: fix(dump): Fix the import of dump from the v24 and before r=ManyTheFish a=irevoire

Same as https://github.com/meilisearch/MeiliSearch/pull/2073 but on main this time

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-01-13 13:09:23 +00:00
Irevoire
343bce6a29 fix(dump): Fix the import of dump from the v24 and before 2022-01-13 13:23:57 +01:00
mpostma
d263f762bf feat(http): accept empty document additions
wip
2022-01-13 12:46:56 +01:00
Irevoire
dfaeb19566 fix(dump): Fix the import of dumps when there is no data.ms 2022-01-13 12:30:58 +01:00
Tamo
c94952e25d update the readme + dependencies 2022-01-12 18:30:11 +01:00
Tamo
e1053989c0 add a fuzzer on milli 2022-01-12 17:57:54 +01:00
bors[bot]
010dcc3e80 Merge #2066
2066: bug(http): fix task duration r=MarinPostma a=MarinPostma

`@gmourier` found that the duration in the task view was not computed correctly, this pr fixes it.

`@curquiza,` I let you decide if we need to make a hotfix out of this or wait for the next release. This is not breaking.


Co-authored-by: mpostma <postma.marin@protonmail.com>
2022-01-12 14:50:58 +00:00
bors[bot]
d0aa5f747c Merge #2067
2067: chore(all): fix rust edition r=irevoire a=MarinPostma

I hadn't correctly set the rust edition in my previous pr, and cargo was returning a warning. This time I followed this guide: https://doc.rust-lang.org/edition-guide/editions/transitioning-an-existing-project-to-a-new-edition.html


Co-authored-by: mpostma <postma.marin@protonmail.com>
2022-01-12 13:32:42 +00:00
mpostma
f6d53e03f1 chore(http): migrate from structopt to clap3 2022-01-12 14:07:19 +01:00
mpostma
3ecebd15ee chore(all): fix rust edition 2022-01-12 11:14:50 +01:00
mpostma
db83e39a7f bug(http): fix task duration 2022-01-11 18:01:25 +01:00
bors[bot]
5d48f72ade Merge #2065
2065: MeiliSearch v0.25.0: `stable` -> `main` r=curquiza a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: many <maxime@meilisearch.com>
Co-authored-by: Marin Postma <postma.marin@protonmail.com>
Co-authored-by: Maxime Legendre <maximelegendre@MacBook-Pro-de-Maxime.local>
Co-authored-by: Maxime Legendre <maximelegendre@mbp-de-maxime.home>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-01-11 16:30:22 +00:00
bors[bot]
1818026a84 Merge #2057
2057: fix(dump): Uncompress the dump IN the data.ms  r=irevoire a=irevoire

When loading a dump with docker, we had two problems.
After creating a tempdirectory, uncompressing and re-indexing the dump:
1. We try to `move` the new “data.ms” onto the currently present
   one. The problem is that if the `data.ms` is a mount point because
   that's what peoples do with docker usually. We can't override
   a mount point, and thus we were throwing an error.
2. The tempdir is created in `/tmp`, which is usually quite small AND may not
   be on the same partition as the `data.ms`. This means when we tried to move
   the dump over the `data.ms`, it was also failing because we can't move data
   between two partitions.
------------------
1 was fixed by deleting the *content* of the `data.ms` and moving the *content*
of the tempdir *inside* the `data.ms`. If someone tries to create volumes inside
the `data.ms` that's his problem, not ours.
2 was fixed by creating the tempdir *inside* of the `data.ms`. If a user mounted
its `data.ms` on a large partition, there is no reason he could not load a big
dump because his `/tmp` was too small. This solves the issue; now the dump is
extracted and indexed on the same partition the `data.ms` will lay.

fix #1833

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-10 17:57:16 +00:00
bors[bot]
0ad7d38eec Merge #2061
2061: Update dashboard for v0.25.0 r=curquiza a=mdubus



Co-authored-by: Morgane Dubus <30866152+mdubus@users.noreply.github.com>
v0.25.0 v0.25.0rc5
2022-01-10 16:29:31 +00:00
Morgane Dubus
b17ad5c2be Update with latest release of the dashboard 2022-01-10 17:10:09 +01:00
bors[bot]
559e019de1 Merge #424
424: Store the geopoint in three dimensions r=Kerollmops a=irevoire

Related to this issue: https://github.com/meilisearch/MeiliSearch/issues/1872

Fix the whole computation of distance for any “geo” operations (sort or filter). Now when you sort points they are returned to you in the right order.
And when you filter on a specific radius you only get points included in the radius.

This PR changes the way we store the geo points in the RTree.
Instead of considering the latitude and longitude as orthogonal coordinates, we convert them to real orthogonal coordinates projected on a sphere with a radius of 1.
This is the conversion formulae.
![image](https://user-images.githubusercontent.com/7032172/145990456-eefe840a-384f-4486-848b-81d0036814ec.png)
Which, in rust, translate to this function:
```rust
pub fn lat_lng_to_xyz(coord: &[f64; 2]) -> [f64; 3] {
    let [lat, lng] = coord.map(|f| f.to_radians());
    let x = lat.cos() * lng.cos();
    let y = lat.cos() * lng.sin();
    let z = lat.sin();

    [x, y, z]
}
```

Storing the points on a sphere is easier / faster to compute than storing the point on an approximation of the real earth shape.
But when we need to compute the distance between two points we still need to use the haversine distance which works with latitude and longitude.
So, to do the fewest search-time computation possible I'm now associating every point with its `DocId` and its lat/lng.

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-10 15:23:43 +00:00
bors[bot]
1824b3c07b Merge #2060
2060: chore(all) set rust edition to 2021 r=MarinPostma a=MarinPostma

set the rust edition for the project to 2021

this make the MSRV to v1.56

#2058


Co-authored-by: Marin Postma <postma.marin@protonmail.com>
2022-01-10 15:04:14 +00:00
bors[bot]
660eac50b2 Merge #427
427: Handle escaped characters in filters r=Kerollmops a=irevoire



Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-10 15:01:23 +00:00
Tamo
92804f6f45 apply clippy suggestions 2022-01-10 15:59:04 +01:00
Tamo
0fcde35a20 Update filter-parser/src/value.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-01-10 15:53:44 +01:00
Tamo
3c7ea1d298 Apply code suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-01-10 15:19:21 +01:00
Tamo
c9c7da3626 fix(dump): Uncompress the dump IN the data.ms
When loading a dump with docker, we had two problems.
After creating a tempdirectory, uncompressing and re-indexing the dump:
1. We try to `move` the new “data.ms” onto the currently present
   one. The problem is that if the `data.ms` is a mount point because
   that's what peoples do with docker usually. We can't override
   a mount point, and thus we were throwing an error.
2. The tempdir is created in `/tmp`, which is usually quite small AND may not
   be on the same partition as the `data.ms`. This means when we tried to move
   the dump over the `data.ms`, it was also failing because we can't move data
   between two partitions.
==============
1 was fixed by deleting the *content* of the `data.ms` and moving the *content*
of the tempdir *inside* the `data.ms`. If someone tries to create volumes inside
the `data.ms` that's his problem, not ours.
2 was fixed by creating the tempdir *inside* of the `data.ms`. If a user mounted
its `data.ms` on a large partition, there is no reason he could not load a big
dump because his `/tmp` was too small. This solves the issue; now the dump is
extracted and indexed on the same partition the `data.ms` will lay.

fix #1833
2022-01-10 14:56:03 +01:00
Morgane Dubus
030a90523d Update dashboard for v0.25.0 2022-01-10 10:50:57 +01:00
bors[bot]
56d223a51d Merge #2059
2059: change indexed doc count on error r=irevoire a=MarinPostma

change `indexed_documents` and `deleted_documents` to return 0 instead of null when empty when the task has failed.

close #2053


Co-authored-by: Marin Postma <postma.marin@protonmail.com>
v0.25.0rc4
2022-01-06 15:55:50 +00:00
Marin Postma
f558ff826a feat(http): task view indexed and deleted documents return 0 instead of null 2022-01-06 14:55:02 +01:00
Marin Postma
5fb4ed60e7 chore(all) set rust edition to 2021 2022-01-06 13:30:45 +01:00
bors[bot]
0d2a358cc2 Merge #2056
2056: Allow any header for CORS r=curquiza a=curquiza

Bug fix: trigger a CORS error when trying to send the `User-Agent` header via the browser

`@bidoubiwa` thanks for the bug report!

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-01-05 16:45:51 +00:00
Clémentine Urquizar
595250c93e Allow any header for CORS 2022-01-05 15:38:47 +01:00
bors[bot]
c636988935 Merge #2055
2055: fix(dump): Fix the loading of dump with empty indexes r=irevoire a=irevoire



Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-05 14:31:53 +00:00
Tamo
eea483c470 fix(dump): Fix the loading of dump with empty indexes 2022-01-05 15:08:21 +01:00
bors[bot]
d53c61a6d0 Merge #2054
2054: Bug(auth): Wrap key list in results r=irevoire a=ManyTheFish

fix #2052

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-01-04 15:44:55 +00:00
bors[bot]
74594be234 Merge #429
429: Benchmark CIs: not use a default label to call the GH runner r=irevoire a=curquiza

Since we now have multiple self-hosted github runners, we need to differentiate them calling them in the CI. The `self-hosted` label is the default one, so we need to use the unique and appropriate one for the benchmark machine

<img width="925" alt="Capture d’écran 2022-01-04 à 15 42 18" src="https://user-images.githubusercontent.com/20380692/148079840-49cd7878-5912-46ff-8ab8-bf646777f782.png">


Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-01-04 15:41:08 +00:00
Clémentine Urquizar
3d99686f7a Change self-hosted label by benchmarks 2022-01-04 16:01:01 +01:00
ManyTheFish
c0d4f71a34 Bug(auth): Wrap key list in results 2022-01-04 14:10:30 +01:00
bors[bot]
c039562723 Merge #428
428: Reintroduce the gitignore for the fuzzer r=Kerollmops a=irevoire

Reintroduce the gitignore in the fuzz directory

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-04 12:09:06 +00:00
Tamo
9bdcd42b9b reintroduce the gitignore for the fuzzer 2022-01-04 13:07:32 +01:00