Commit Graph

2249 Commits

Author SHA1 Message Date
Louis Dureuil
87576cf26c Perform cargo check on the release artifacts 2022-11-15 10:25:02 +01:00
Louis Dureuil
6dc6a5d874 Force using vendored version of LMDB
- don't use lmdb master3 branch anymore
2022-11-14 17:17:51 +01:00
bors[bot]
e75829aded Merge #694
694: Update version for the next release (v0.36.0) in Cargo.toml files r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one before merging.

Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>
2022-11-09 11:22:24 +00:00
Kerollmops
d00d2aab3f Update version for the next release (v0.36.0) in Cargo.toml files 2022-11-09 11:03:09 +00:00
bors[bot]
f46a8ab2e2 Merge #693
693: use the lmdb-master.3 branch r=Kerollmops a=irevoire

After investigating https://github.com/meilisearch/meilisearch/issues/3017, we found out that it was due to lmdb and that, without any code change on our side, bumping using the lmdb-master-3 branch fix our issues.

But, we’re not really confident about what changed between the `mdb.master` and `mdb.master3` branches; thus this is a temporary change, and we hope we’ll be able to move to the new version of heed asap (either before the end of the pre-release or for the next release).

--------

The bug is hard to reproduce; I can reproduce it 100% of the time on my archlinux personal computer. But on a scaleway archlinux bare-metal machine, it doesn’t reproduce. It’s flaky on our test suite, but `@loiclec` was able to write a minimal test that reproduces it every time on macOS.
Basically, what happens is when there are multiple threads opening databases in a different directory at the same time.
If there are 10 or more threads running at the same time, lmdb starts throwing the `Invalid argument (os error 22)` error for no reason, we believe.
I would like to submit an issue to lmdb, but I don’t really have the time to write a test in C without heed currently.

`@hyc,` if you want to take a look at it, here is the repo that reproduces the issue on macOS: https://github.com/irevoire/heed-bug

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-11-09 09:42:38 +00:00
bors[bot]
c3b75bbe5d Merge #691
691: Update version for the next release (v0.35.1) in Cargo.toml files r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one before merging.

Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>
2022-11-08 15:31:50 +00:00
Irevoire
c7711daca3 use the lmdb-master.3 branch 2022-11-08 16:28:01 +01:00
bors[bot]
f18a4581f1 Merge #692
692: Update CONTRIBUTING.md r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-11-08 14:51:30 +00:00
Clémentine Urquizar - curqui
8ce8bbcdfc Update CONTRIBUTING.md 2022-11-08 15:49:45 +01:00
Kerollmops
bd12989610 Update version for the next release (v0.35.1) in Cargo.toml files 2022-11-08 14:31:39 +00:00
bors[bot]
24a298a83c Merge #690
690: Fix soft deleted bug settings r=ManyTheFish a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-11-08 13:45:10 +00:00
bors[bot]
d85cd9bf1a Merge #689
689: Handle non-finite floats consistently in filters r=irevoire a=dureuill

# Pull Request

## Related issue

Related meilisearch/meilisearch#3000

## What does this PR do?

### User

- Filters using `field = inf`, (or `infinite`, `NaN`) now match the value as a string rather than returning an internal error.
- Filters using `field < inf` (or other comparison operators) now return an invalid_filter error rather than returning an internal error, much like when using `field < aaa`.

### Implementation

- Add new `NonFiniteFloat` error variants to the filter-parser errors
- Add `Token::parse_as_finite_float` that can fail both when the string is not a float and when the float is not finite
- Refactor `Filter::inner_evaluate` to always use `parse_as_finite_float` instead of just `parse`
- Add corresponding tests

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2022-11-08 13:24:38 +00:00
Kerollmops
37b3c5c323 Fix transform to use all_documents and ignore soft_deleted documents 2022-11-08 14:23:16 +01:00
Kerollmops
1b1ad1923b Add a test to check that we take care of soft deleted documents 2022-11-08 14:23:14 +01:00
Louis Dureuil
a836b8e703 tests: Tests filter with non-finite floats 2022-11-08 13:56:55 +01:00
Louis Dureuil
3328560788 fix: allow filters on = inf, = NaN, return InvalidFilter for < inf, < NaN
Fixes meilisearch/meilisearch#3000
2022-11-08 13:27:15 +01:00
bors[bot]
cf76ec7b37 Merge #673
673: Add clippy job r=ManyTheFish a=unvalley

# Pull Request

## Related issue
Fixes #231 

## What does this PR do?
- fix some clippy errors remain
- add clippy job to CI (I set `nightly` as toolchain)

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: unvalley <kirohi.code@gmail.com>
2022-11-08 09:43:26 +00:00
unvalley
abf1cf9cd5 Fix clippy errors 2022-11-04 09:27:46 +09:00
unvalley
b09676779d Use nightly for clippy and remove conflict mistake 2022-11-04 09:13:01 +09:00
unvalley
70465aa5ce Execute cargo fmt 2022-11-04 08:59:58 +09:00
unvalley
3009981d31 Fix clippy errors
Add clippy job

Add clippy job to CI
2022-11-04 08:58:14 +09:00
unvalley
401e956128 Add clippy job
Add clippy job to CI
2022-11-04 08:58:12 +09:00
bors[bot]
6add470805 Merge #659
659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley

## Related PR
This PR is for #673 

## What does this PR do?
- ~~add `Run Clippy` job to CI (rust.yml)~~
- apply `cargo clippy --fix` command
- fix some `cargo clippy` error manually (but warnings still remain on tests)

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: unvalley <kirohi.code@gmail.com>
Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>
2022-11-03 15:24:38 +00:00
unvalley
13175f2339 refactor: match for filterCondition 2022-11-03 17:34:33 +09:00
bors[bot]
1a1ad8a792 Merge #679
679: Bump Swatinem/rust-cache from 2.0.0 to 2.0.1 r=curquiza a=dependabot[bot]

Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.0 to 2.0.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/releases">Swatinem/rust-cache's releases</a>.</em></p>
<blockquote>
<h2>v2.0.1</h2>
<ul>
<li>Primarily just updating dependencies to fix GitHub deprecation notices.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md">Swatinem/rust-cache's changelog</a>.</em></p>
<blockquote>
<h2>2.0.1</h2>
<ul>
<li>Primarily just updating dependencies to fix GitHub deprecation notices.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="22c9328bcb"><code>22c9328</code></a> 2.0.1</li>
<li><a href="d4d463bd9b"><code>d4d463b</code></a> bump deps and rebuild</li>
<li><a href="c4652c677c"><code>c4652c6</code></a> Update <code>`@actions/core</code>` (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/83">#83</a>)</li>
<li><a href="76686c56f2"><code>76686c5</code></a> docs: Fix github workflows directory (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/79">#79</a>)</li>
<li><a href="1b43d2f2c3"><code>1b43d2f</code></a> remove outdated versioning note</li>
<li><a href="20b9201e8a"><code>20b9201</code></a> bump cargo hash</li>
<li><a href="0d72e5f9a0"><code>0d72e5f</code></a> revert explicit dir close</li>
<li><a href="86531941c2"><code>8653194</code></a> Merge branch 'master' of <a href="https://github.com/Swatinem/rust-cache">https://github.com/Swatinem/rust-cache</a></li>
<li><a href="be4be3720d"><code>be4be37</code></a> explicitly close dir handles, add more logging, cleanups</li>
<li><a href="213334cd98"><code>213334c</code></a> cargo update</li>
<li>Additional commits viewable in <a href="https://github.com/Swatinem/rust-cache/compare/v2.0.0...v2.0.1">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Swatinem/rust-cache&package-manager=github_actions&previous-version=2.0.0&new-version=2.0.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

You can trigger a rebase of this PR by commenting ``@dependabot` rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- ``@dependabot` rebase` will rebase this PR
- ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it
- ``@dependabot` merge` will merge this PR after your CI passes on it
- ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it
- ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging
- ``@dependabot` reopen` will reopen this PR if it is closed
- ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)


</details>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-02 12:38:05 +00:00
dependabot[bot]
4492605a78 Bump Swatinem/rust-cache from 2.0.0 to 2.0.1
Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.0 to 2.0.1.
- [Release notes](https://github.com/Swatinem/rust-cache/releases)
- [Changelog](https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Swatinem/rust-cache/compare/v2.0.0...v2.0.1)

---
updated-dependencies:
- dependency-name: Swatinem/rust-cache
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-01 10:19:45 +00:00
bors[bot]
fe5a0219e1 Merge #677
677: run the tests in all workspaces r=curquiza a=irevoire

With #676 I noticed the tests were not running in any of our sub crates.
Most of our sub crates didn't includes any tests though.
But the filter-parser did and we're lucky we never broke these one without noticing 😁 

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-10-31 18:05:04 +00:00
Irevoire
5ff066c3e7 run the tests in all workspaces 2022-10-31 18:38:48 +01:00
bors[bot]
6770eb2a87 Merge #676
676: chore: added `IN`,`NOT IN` to `invalid_filter` msg r=irevoire a=Pranav-yadav

# Pull Request

## Related issue
`Fixes` https://github.com/meilisearch/meilisearch/issues/3004

## What does this PR do?
- Improves correct error msg in response

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Pranav Yadav <Pranavyadav3912@gmail.com>
2022-10-31 17:29:24 +00:00
unvalley
0d43ddbd85 Update filter-parser/src/lib.rs
Co-authored-by: Tamo <irevoire@protonmail.ch>
2022-11-01 01:32:54 +09:00
Pranav Yadav
3950ec8d3c chore: update tests for invalid_filter msg 2022-10-31 15:41:49 +00:00
Pranav Yadav
3b35ebda50 chore: added IN,NOT IN to invalid_filter msg 2022-10-31 15:01:14 +00:00
bors[bot]
4bcfd14a45 Merge #675
675: Deleted empty files r=Kerollmops a=SKVKPandey

# Pull Request

## Related issue
Fixes #674

## What does this PR do?
Delete empty files:
- `milli/src/heed_codec/facet/facet_string_level_zero_value_codec.rs`
- `milli/src/heed_codec/facet/facet_string_zero_bounds_value_codec.rs`

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Shashank Kashyap <50551759+SKVKPandey@users.noreply.github.com>
2022-10-30 07:09:30 +00:00
Shashank Kashyap
a07f0a4a43 Delete facet_string_zero_bounds_value_codec.rs 2022-10-30 08:59:04 +05:30
Shashank Kashyap
2dec6e86e9 Delete facet_string_level_zero_value_codec.rs 2022-10-30 08:58:36 +05:30
bors[bot]
c965200010 Merge #664
664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2

# Pull Request

This a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll be completing this in a few hours from opening this.

## Related issue
Fixes #661 and towards fixing meilisearch/meilisearch#2905

## What does this PR do?
- [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word
- [x] Update all other uses of phrase operation
- [x] Update `resolve_phrase`
- [x] Update `create_primitive_query`?
- [x] Add test

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>
2022-10-29 13:42:52 +00:00
unvalley
d55f0e2e53 Execute cargo fmt 2022-10-28 23:42:23 +09:00
unvalley
d53a80b408 Fix clippy error 2022-10-28 23:41:35 +09:00
Samyak Sarnayak
ecb88143f9 Run cargo fmt 2022-10-28 19:37:02 +05:30
Samyak Sarnayak
03eb5d87c1 Only call plane_sweep on subgroups when 2 or more are present 2022-10-28 19:32:05 +05:30
unvalley
a1d7ed1258 fix clippy error and remove clippy job from ci
Remove clippy job

Fix clippy error type_complexity

Restore ambiguous change
2022-10-28 22:33:50 +09:00
unvalley
f3c0b05ae8 Fix rust fmt 2022-10-28 09:32:31 +09:00
unvalley
f4ec1abb9b Fix all clippy error after conflicts 2022-10-27 23:58:13 +09:00
Samyak S Sarnayak
d35afa0cf5 Change consecutive phrase search grouping logic
Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-10-26 23:10:48 +05:30
Samyak S Sarnayak
752d031010 Update phrase search to use new execute method 2022-10-26 23:07:20 +05:30
unvalley
c7322f704c Fix cargo clippy errors
Dont apply clippy for tests for now

Fix clippy warnings of filter-parser package

parent 8352febd646ec4bcf56a44161e5c4dce0e55111f
author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900
committer unvalley <kirohi.code@gmail.com> 1666791316 +0900

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_argments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clipy warnings in filter-paprser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
unvalley
811f156031 Execute cargo clippy --fix 2022-10-27 01:00:00 +09:00
unvalley
d8fed1f7a9 Add clippy job
Add Run Clippy to bors.toml
2022-10-27 01:00:00 +09:00
bors[bot]
2e539249cb Merge #619
619: Refactor the Facets databases to enable incremental indexing r=curquiza a=loiclec

# Pull Request

## What does this PR do?
Party fixes https://github.com/meilisearch/milli/issues/605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and https://github.com/meilisearch/meilisearch/issues/2820 . Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases were not updated following the deletion of documents

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where the first line is the key of the database, consisting of :
- the field id
- the level height
- the left and right bound of the group 

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:
```                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit, it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`., then we want to add/modify the elements highlighted below in pink:
<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:
<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:
<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:
<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value
TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.  

## Bulk insertion
While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch 

The algorithm for bulk insertion is the same as the previous one.

## Choosing between incremental and bulk insertion
On my computer, I measured that is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.

## Fuzz-testing

**Potentially controversial:**
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession, each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. 

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). 

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`

I have tried to test them with some snapshot tests but more testing could still be done. I don't *think* that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in the level 0 of `facet_id_string_docids `. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, it may be that the string value returned in the field distribution does not appear in any of the candidates. For example,
```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```
Facet distribution for the `colour` field among the candidates `[1]`:
```
{ "RED": 1 }
```
Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetKey<T>` and we often work with `FacetKey<&[u8]>` and `FacetKey<&str>`, then we need to have codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), I have also created two new codecs:
* `ByteSliceRef` is a codec with a `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with a `EItem = DItem = &str`

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-10-26 15:04:53 +00:00
Samyak S Sarnayak
488d31ecdf Run cargo fmt 2022-10-26 19:09:45 +05:30