Commit Graph

184 Commits

Author SHA1 Message Date
Clémentine Urquizar
31f749b5d8 Update version for next release (v0.30.0) 2022-06-20 12:09:57 +02:00
Tamo
676187ba43 bump milli version 2022-06-09 16:53:32 +02:00
Kerollmops
56ee9cc21f Bump the version to 0.29.2 2022-06-08 16:00:06 +02:00
Clémentine Urquizar
478dbfa45a Update version for next release (v0.29.1) 2022-06-07 18:59:33 +02:00
bors[bot]
05ae6dbfa4 Merge #541
541: Update version for next release (v0.29.0) r=ManyTheFish a=curquiza

Need to update the version since #540 was merged and is breaking

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-06-02 16:53:28 +00:00
ManyTheFish
a5c790bf4b Update http-ui 2022-06-02 18:15:36 +02:00
Clémentine Urquizar
6ce1c6487a Update version for next release (v0.29.0) 2022-06-02 18:07:55 +02:00
ManyTheFish
4dd3675d2b Update http-ui 2022-06-02 16:59:11 +02:00
Clémentine Urquizar
c19c17eddb Update version to v0.28.1 2022-06-01 18:31:02 +02:00
ManyTheFish
895f5d8a26 Bump milli version 2022-05-18 10:37:12 +02:00
Clémentine Urquizar
d138b3c704 Update version 2022-04-25 18:43:46 +02:00
bors[bot]
ea4bb9402f Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance the milli word matcher so that it handles match computation and cropping.

# Implementation

## Computing best matches for cropping

Previously, we assumed that the first match in the attribute was the best one. This was accurate when only one word was searched, but it missed the target when more than one word was searched.

Now we search for the best interval of matches to crop around. The chosen interval is the one that:
1) has the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`), whereas the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) has the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`), whereas the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) has the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words, whereas the interval `split the world` has 3. So the interval `split the world` is considered better.
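
The three criteria above can be read as a lexicographic comparison of a score tuple. The following is a minimal sketch under that assumption, not the code from this PR: the names `Match`, `find_matches`, `interval_score`, and `best_interval` are hypothetical, matching is exact and whitespace-based, and only the widest window starting at each match is considered.

```rust
use std::collections::HashSet;

/// A matched word: its position in the text and the index of the
/// query word it matched. (Hypothetical type for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Match {
    word_position: usize,
    query_index: usize,
}

/// Finds exact, case-sensitive matches of the query words in a
/// whitespace-tokenized text, in increasing word position.
fn find_matches(query: &[&str], text: &str) -> Vec<Match> {
    text.split_whitespace()
        .enumerate()
        .filter_map(|(word_position, word)| {
            query
                .iter()
                .position(|q| *q == word)
                .map(|query_index| Match { word_position, query_index })
        })
        .collect()
}

/// Scores an interval as (unique matches, negated distance, ordered matches).
/// Tuples compare lexicographically, so a higher score means a better interval.
fn interval_score(matches: &[Match]) -> (i64, i64, i64) {
    let unique: HashSet<usize> = matches.iter().map(|m| m.query_index).collect();
    let mut distance = 0i64;
    // The first match counts as ordered; each following match counts
    // when it is the next query word after the previous match.
    let mut ordered = if matches.is_empty() { 0 } else { 1 };
    for pair in matches.windows(2) {
        distance += (pair[1].word_position - pair[0].word_position) as i64;
        if pair[1].query_index == pair[0].query_index + 1 {
            ordered += 1;
        }
    }
    (unique.len() as i64, -distance, ordered)
}

/// Returns the best-scoring window of matches spanning fewer than
/// `crop_size` words. On ties, the earliest window wins.
fn best_interval(matches: &[Match], crop_size: usize) -> Option<&[Match]> {
    let mut best: Option<(&[Match], (i64, i64, i64))> = None;
    for start in 0..matches.len() {
        let mut end = start;
        while end + 1 < matches.len()
            && matches[end + 1].word_position - matches[start].word_position < crop_size
        {
            end += 1;
        }
        let window = &matches[start..=end];
        let score = interval_score(window);
        if best.map_or(true, |(_, best_score)| score > best_score) {
            best = Some((window, score));
        }
    }
    best.map(|(window, _)| window)
}
```

With the query `split the world`, this sketch scores `split the world` above `split of the world` (smaller distance) and above `the world split` (more ordered matches), in line with the examples above.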

## Cropping around the best matches interval

Previously, we cropped around the interval without checking the context.

Now we crop around words that are in the same context as the matching words.
This means that we prefer to keep words that are farther from the matching words but in the same sentence over words that are nearer but separated by a period.

> For instance, for the matching word `Split`, the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and not like:
`Natalie risk her future. Split The World is a book …`
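
A minimal sketch of context-aware cropping, assuming sentences end at words that finish with a period and that the crop size is counted in words. The function names `crop_in_context` and `crop` are hypothetical, and this sketch always tries to grow to the right first, which is a simplification rather than the behaviour of the PR. It does not add the `…` markers.

```rust
/// Returns the word range `[start, end)` to keep around `match_index`.
/// `match_index` must be a valid index into `words`.
fn crop_in_context(words: &[&str], match_index: usize, crop_size: usize) -> (usize, usize) {
    // Assumption of this sketch: a word ending with '.' closes its sentence.
    let ends_sentence = |i: usize| words[i].ends_with('.');
    let (mut start, mut end) = (match_index, match_index + 1);

    while end - start < crop_size && (start > 0 || end < words.len()) {
        let can_grow_right = end < words.len();
        let can_grow_left = start > 0;
        // The next word on the right is in the same sentence when the
        // current last word does not close the sentence.
        let right_in_sentence = can_grow_right && !ends_sentence(end - 1);
        // The previous word is in the same sentence when it does not
        // itself close a sentence.
        let left_in_sentence = can_grow_left && !ends_sentence(start - 1);

        if right_in_sentence {
            end += 1;
        } else if left_in_sentence {
            start -= 1;
        } else if can_grow_right {
            // The sentence is exhausted: fall back to neighbouring sentences.
            end += 1;
        } else {
            start -= 1;
        }
    }
    (start, end)
}

/// Crops `text` to at most `crop_size` words around the first exact
/// occurrence of `matching_word`, or returns `None` if it is absent.
fn crop(text: &str, matching_word: &str, crop_size: usize) -> Option<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let match_index = words.iter().position(|w| *w == matching_word)?;
    let (start, end) = crop_in_context(&words, match_index, crop_size);
    Some(words[start..end].join(" "))
}
```

For the example text above, `crop(text, "Split", 10)` keeps `Split The World is a book written by Emily Henry.` and leaves out the preceding sentence.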


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
Clémentine Urquizar
8d630a6f62 Update version for the next release (v0.26.1) 2022-04-14 11:44:06 +02:00
ManyTheFish
827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
bors[bot]
9ac2fd1c37 Merge #487
487: Update version (v0.26.0) r=Kerollmops a=curquiza

breaking because of #458 

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-07 17:10:24 +00:00
Irevoire
4f3ce6d9cd nested fields 2022-04-07 16:58:46 +02:00
Clémentine Urquizar
ee1d627803 Update version (v0.26.0) 2022-04-07 15:56:10 +02:00
ManyTheFish
29c5f76d7f Use new matcher in http-ui 2022-04-05 17:41:32 +02:00
Clémentine Urquizar
9eec44dd98 Update version (v0.25.0) 2022-04-05 12:06:42 +02:00
Clémentine Urquizar
ddf78a735b Update version (v0.24.1) 2022-03-24 16:39:45 +01:00
Kerollmops
08a06b49f0 Bump version to 0.23.1 2022-03-15 15:50:28 +01:00
Kerollmops
21ec334dcc Fix the compilation error of the dependency versions 2022-03-15 11:17:45 +01:00
Kerollmops
63682c2c9a Upgrade the dependencies 2022-03-15 11:17:44 +01:00
Kerollmops
288a879411 Remove three useless dependencies 2022-03-15 11:17:44 +01:00
Clémentine Urquizar
d9ed9de2b0 Update heed link in cargo toml 2022-03-01 19:45:29 +01:00
Irevoire
0defeb268c bump milli 2022-02-16 13:27:41 +01:00
Clémentine Urquizar
d03b3ceb58 Update version for the next release (v0.22.1) 2022-02-07 18:39:29 +01:00
Kerollmops
9142ba9dd4 Fix the parsing of ndjson requests to index more than the first line 2022-02-02 17:55:13 +01:00
bors[bot]
9f2ff71581 Merge #434
434: bump milli to v0.22.0 r=curquiza a=irevoire

This is breaking because of this PR:
98a365aaae

Should we do a special branch to only release the [patch](https://github.com/meilisearch/milli/pull/433) for https://github.com/meilisearch/MeiliSearch/issues/2082 (which is non-breaking)?

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-01-24 17:31:20 +00:00
Marin Postma
0c84a40298 document batch support
reusable transform

rework update api

add indexer config

fix tests

review changes

Co-authored-by: Clément Renault <clement@meilisearch.com>

fmt
2022-01-19 12:40:20 +01:00
Tamo
367f403693 bump milli 2022-01-17 16:41:34 +01:00
Samyak S Sarnayak
c0313f3026 Use chars for highlight instead of graphemes
Tokenizer v0.2.7 uses chars instead of graphemes for matching bytes.
`unicode-segmentation` dependency isn't needed anymore.

Also, oxidised the highlight code :)
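
A minimal sketch of char-based highlighting using only the standard library. It assumes the matcher reports how many bytes of a word matched; the `highlight` function and its signature are hypothetical and are not the tokenizer's API.

```rust
/// Wraps the matched prefix of `word` in `<mark>` tags.
/// `matching_bytes` is the length of the match in bytes; it is converted
/// to a number of chars so that a multi-byte char is never split.
fn highlight(word: &str, matching_bytes: usize) -> String {
    // Count the chars that start within the first `matching_bytes` bytes.
    let matching_chars = word
        .char_indices()
        .take_while(|(byte_offset, _)| *byte_offset < matching_bytes)
        .count();

    let matched: String = word.chars().take(matching_chars).collect();
    let rest: String = word.chars().skip(matching_chars).collect();

    if matched.is_empty() {
        rest
    } else {
        format!("<mark>{}</mark>{}", matched, rest)
    }
}
```

Counting chars avoids the panic that slicing with `&word[..matching_bytes]` would cause when the byte offset falls inside a multi-byte char.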

Co-authored-by: many <maxime@meilisearch.com>
2022-01-17 13:15:31 +05:30
Samyak S Sarnayak
c10f58b7bd Update tokenizer to v0.2.7 2022-01-17 13:02:00 +05:30
Samyak S Sarnayak
30247d70cd Fix search highlight for non-unicode chars
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme
  clusters to highlight.

In essence, the `matching_bytes` function returns the number of matching
grapheme clusters instead of bytes. Should this function be renamed
then?

Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme
  clusters from tokens
- `<mark>` tag is put around only the matched part
    - before this change, the entire word was highlighted even if only a
      part of it matched
2022-01-17 11:37:44 +05:30
Clément Renault
1c6c89f345 Fix the binaries that use the new optional filters 2021-12-09 11:57:53 +01:00
many
1b3923b5ce Update all packages to 0.21.0 2021-11-29 12:17:59 +01:00
many
64ef5869d7 Update tokenizer v0.2.6 2021-11-18 16:56:05 +01:00
Marin Postma
6eb47ab792 remove update_id in UpdateBuilder 2021-11-16 13:07:04 +01:00
Tamo
6831c23449 merge with main 2021-11-06 16:34:30 +01:00
Tamo
07a5ffb04c update http-ui 2021-11-04 15:52:22 +01:00
many
743ed9f57f Bump milli version 2021-11-04 14:04:21 +01:00
many
702589104d Update version for the next release (v0.20.1) 2021-11-03 14:20:01 +01:00
Clémentine Urquizar
056ff13c4d Update version for the next release (v0.20.0) 2021-10-28 14:52:57 +02:00
bors[bot]
d7943fe225 Merge #402
402: Optimize document transform r=MarinPostma a=MarinPostma

This PR optimizes the transform of document additions into the obkv format. Instead of accepting any serializable object, we now treat JSON and CSV specifically:
- For JSON, we build a serde `Visitor` that transforms the JSON straight into obkv without an intermediate representation.
- For CSV, we write the lines directly into the obkv, applying other optimizations as well.
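
To illustrate the CSV case, here is a minimal sketch of writing records straight into a key-value byte buffer with no intermediate document type. Everything in it is an assumption made for illustration: the byte layout (`u16` field id, `u32` length, value bytes) is hypothetical and is not the real obkv format, the function names are invented, and fields are split naively on commas without any quoting support.

```rust
/// Encodes one CSV record into `buffer` as `(field id, length, bytes)`
/// entries. Field ids are the column indexes, so keys are written in order.
/// Hypothetical layout, not the real obkv encoding.
fn encode_record(line: &str, buffer: &mut Vec<u8>) {
    buffer.clear();
    for (field_id, value) in line.split(',').enumerate() {
        buffer.extend_from_slice(&(field_id as u16).to_be_bytes());
        buffer.extend_from_slice(&(value.len() as u32).to_be_bytes());
        buffer.extend_from_slice(value.as_bytes());
    }
}

/// Decodes a buffer written by `encode_record` back into entries.
fn decode_record(mut bytes: &[u8]) -> Vec<(u16, String)> {
    let mut entries = Vec::new();
    while bytes.len() >= 6 {
        let field_id = u16::from_be_bytes([bytes[0], bytes[1]]);
        let len = u32::from_be_bytes([bytes[2], bytes[3], bytes[4], bytes[5]]) as usize;
        let value = String::from_utf8_lossy(&bytes[6..6 + len]).into_owned();
        entries.push((field_id, value));
        bytes = &bytes[6 + len..];
    }
    entries
}

/// Reads the header to get the field names, then encodes every
/// following non-empty line, reusing a single scratch buffer.
fn transform_csv(csv: &str) -> (Vec<String>, Vec<Vec<u8>>) {
    let mut lines = csv.lines();
    let fields: Vec<String> = lines
        .next()
        .map(|header| header.split(',').map(str::to_string).collect())
        .unwrap_or_default();

    let mut buffer = Vec::new();
    let mut documents = Vec::new();
    for line in lines.filter(|line| !line.is_empty()) {
        encode_record(line, &mut buffer);
        documents.push(buffer.clone());
    }
    (fields, documents)
}
```

The point of the sketch is only that each line goes from text to encoded bytes in one pass. The JSON path described above relies on serde and is not shown here.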

Co-authored-by: marin postma <postma.marin@protonmail.com>
2021-10-26 09:55:28 +00:00
marin postma
baddd80069 implement review suggestions 2021-10-25 18:29:12 +02:00
Clémentine Urquizar
679fe18b17 Update version for the next release (v0.19.0) 2021-10-25 11:52:17 +02:00
Tamo
e25ca9776f start updating the exposed function to makes other modules happy 2021-10-22 17:23:22 +02:00
Clémentine Urquizar
f8fe9316c0 Update version for the next release (v0.18.1) 2021-10-21 11:56:14 +02:00
Clémentine Urquizar
2209acbfe2 Update version for the next release (v0.18.2) 2021-10-18 13:45:48 +02:00
bors[bot]
c7db4176f3 Merge #384
384: Replace memmap with memmap2 r=Kerollmops a=palfrey

[memmap is unmaintained](https://rustsec.org/advisories/RUSTSEC-2020-0077.html) and needs replacing. memmap2 is a drop-in replacement fork that's well maintained. Note that the version numbers got reset on fork, hence the lower values.

Co-authored-by: Tom Parker-Shemilt <palfrey@tevp.net>
2021-10-13 13:47:23 +00:00