Compare commits

...

15 Commits

Author SHA1 Message Date
1032d82643 Merge #4140
4140: Fix bug where search with distinct attribute has wrong totalHits and totalPages r=dureuill a=vivek-26

# Pull Request

## Related issue
Fixes #4130

## What does this PR do?
This PR - 
- Fixes the bug where search with distinct attribute and pagination (no ranking) returns incorrect values for `totalHits` and `totalPages`.
- Add/update unit and integration tests.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
2023-10-19 14:00:53 +00:00
18bbadf645 Add explanatory comment 2023-10-19 15:58:25 +02:00
aa2cd52797 add/update tests when search with distinct attribute & pagination with no ranking 2023-10-19 16:50:14 +05:30
460e61b853 compute all candidates correctly when skipping 2023-10-19 16:48:45 +05:30
ec90946bdf Merge #4137
4137: Update version for the next release (v1.4.2) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2023-10-19 08:52:54 +00:00
58864a4dfa Update version for the next release (v1.4.2) in Cargo.toml 2023-10-19 08:50:11 +00:00
f343ef5f2f Merge #4108
4108: Fix bug where search with distinct attribute and no ranking, returns offset+limit hits r=curquiza a=vivek-26

# Pull Request

## Related issue
Fixes #4078 

## What does this PR do?
This PR - 
- Fixes bug where search with distinct attribute and no ranking, returns offset+limit hits.
- Adds unit and integration tests.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
2023-10-12 07:51:29 +00:00
67a678cfb6 Merge #4089
4089: Use a bufreader and bufwriter everytime there is a grenad<file> r=curquiza a=irevoire

# Pull Request
Wrap all the files we give to a grenad in a `BufReader` or `BufWriter`.

The dump import I tried in the issue went from 2h to 10 minutes on my machine.

I also ran a bunch of benchmarks on my machine, and we're faster by a few seconds everywhere but nothing huge.

-----

The one thing I’m afraid about is if we used to get the inner file in a grenad and then do a read right after without a seek at the beginning of the file or a reopen.
Since we now use a bufreader our read would return the bytes one buffer later and probably completely corrupt what we were supposed to read.

From what I see, it looks like it works, but I may have missed something, I don't know much about this part of the codebase.

This issue should not arise on the bufwriter, though, because if we're not able to write the content of the buffer I ensured that the `into_inner` of the bufwriter should return an internal error.

## Related issue
Fixes #4087


Co-authored-by: Tamo <tamo@meilisearch.com>
2023-10-11 14:27:00 +00:00
d1331d8abf add integration test for distinct search with no ranking 2023-10-11 19:12:56 +05:30
19ba129165 add unit test for distinct search with no ranking 2023-10-11 19:02:27 +05:30
d4da06ff47 fix bug where distinct search with no ranking returns offset+limit hits 2023-10-11 19:02:16 +05:30
c0f2724c2d get rids of the new introduced error code in favor of an io::Error 2023-10-10 15:12:23 +02:00
d772073dfa use a bufreader everytime there is a grenad<file> 2023-10-10 15:00:30 +02:00
8fe8ddea79 Merge #4112
4112: Update version for the next release (v1.4.1) in Cargo.toml r=curquiza a=meili-bot

⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.

Co-authored-by: curquiza <curquiza@users.noreply.github.com>
2023-10-10 09:05:10 +00:00
8a95bf28e5 Update version for the next release (v1.4.1) in Cargo.toml 2023-10-10 09:01:45 +00:00
27 changed files with 435 additions and 101 deletions

28
Cargo.lock generated
View File

@ -468,7 +468,7 @@ checksum = "8c3c1a368f70d6cf7302d78f8f7093da241fb8e8807c05cc9e51a125895a6d5b"
[[package]]
name = "benchmarks"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"anyhow",
"bytes",
@ -1206,7 +1206,7 @@ dependencies = [
[[package]]
name = "dump"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"anyhow",
"big_s",
@ -1417,7 +1417,7 @@ dependencies = [
[[package]]
name = "file-store"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"faux",
"tempfile",
@ -1439,7 +1439,7 @@ dependencies = [
[[package]]
name = "filter-parser"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"insta",
"nom",
@ -1459,7 +1459,7 @@ dependencies = [
[[package]]
name = "flatten-serde-json"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"criterion",
"serde_json",
@ -1577,7 +1577,7 @@ dependencies = [
[[package]]
name = "fuzzers"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"arbitrary",
"clap",
@ -1891,7 +1891,7 @@ dependencies = [
[[package]]
name = "index-scheduler"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"anyhow",
"big_s",
@ -2088,7 +2088,7 @@ dependencies = [
[[package]]
name = "json-depth-checker"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"criterion",
"serde_json",
@ -2500,7 +2500,7 @@ checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771"
[[package]]
name = "meili-snap"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"insta",
"md5",
@ -2509,7 +2509,7 @@ dependencies = [
[[package]]
name = "meilisearch"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"actix-cors",
"actix-http",
@ -2600,7 +2600,7 @@ dependencies = [
[[package]]
name = "meilisearch-auth"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"base64 0.21.2",
"enum-iterator",
@ -2619,7 +2619,7 @@ dependencies = [
[[package]]
name = "meilisearch-types"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"actix-web",
"anyhow",
@ -2673,7 +2673,7 @@ dependencies = [
[[package]]
name = "milli"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"big_s",
"bimap",
@ -2995,7 +2995,7 @@ checksum = "9b2a4787296e9989611394c33f193f676704af1686e70b8f8033ab5ba9a35a94"
[[package]]
name = "permissive-json-pointer"
version = "1.4.0"
version = "1.4.2"
dependencies = [
"big_s",
"serde_json",

View File

@ -18,7 +18,7 @@ members = [
]
[workspace.package]
version = "1.4.0"
version = "1.4.2"
authors = ["Quentin de Quelen <quentin@dequelen.me>", "Clément Renault <clement@meilisearch.com>"]
description = "Meilisearch HTTP server"
homepage = "https://meilisearch.com"

View File

@ -0,0 +1,241 @@
use meili_snap::snapshot;
use once_cell::sync::Lazy;
use serde_json::{json, Value};
use crate::common::Server;
pub(self) static DOCUMENTS: Lazy<Value> = Lazy::new(|| {
json!([
{
"id": 1,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Brown"
},
{
"id": 2,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Black"
},
{
"id": 3,
"description": "Leather Jacket",
"brand": "Lee Jeans",
"product_id": "123456",
"color": "Blue"
},
{
"id": 4,
"description": "T-Shirt",
"brand": "Nike",
"product_id": "789012",
"color": "Red"
},
{
"id": 5,
"description": "T-Shirt",
"brand": "Nike",
"product_id": "789012",
"color": "Blue"
},
{
"id": 6,
"description": "Running Shoes",
"brand": "Adidas",
"product_id": "456789",
"color": "Black"
},
{
"id": 7,
"description": "Running Shoes",
"brand": "Adidas",
"product_id": "456789",
"color": "White"
},
{
"id": 8,
"description": "Hoodie",
"brand": "Puma",
"product_id": "987654",
"color": "Gray"
},
{
"id": 9,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Green"
},
{
"id": 10,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Red"
},
{
"id": 11,
"description": "Sweater",
"brand": "Gap",
"product_id": "234567",
"color": "Blue"
},
{
"id": 12,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Indigo"
},
{
"id": 13,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Black"
},
{
"id": 14,
"description": "Jeans",
"brand": "Levi's",
"product_id": "345678",
"color": "Stone Wash"
}
])
});
pub(self) static DOCUMENT_PRIMARY_KEY: &str = "id";
pub(self) static DOCUMENT_DISTINCT_KEY: &str = "product_id";
/// testing: https://github.com/meilisearch/meilisearch/issues/4078
#[actix_rt::test]
async fn distinct_search_with_offset_no_ranking() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, Some(DOCUMENT_PRIMARY_KEY)).await;
index.update_distinct_attribute(json!(DOCUMENT_DISTINCT_KEY)).await;
index.wait_task(1).await;
fn get_hits(response: &Value) -> Vec<&str> {
let hits_array = response["hits"].as_array().unwrap();
hits_array.iter().map(|h| h[DOCUMENT_DISTINCT_KEY].as_str().unwrap()).collect::<Vec<_>>()
}
let (response, code) = index.search_post(json!({"offset": 0, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["123456", "789012"]"#);
snapshot!(response["estimatedTotalHits"] , @"11");
let (response, code) = index.search_post(json!({"offset": 2, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["456789", "987654"]"#);
snapshot!(response["estimatedTotalHits"], @"10");
let (response, code) = index.search_post(json!({"offset": 4, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["234567", "345678"]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 5, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"1");
snapshot!(format!("{:?}", hits), @r#"["345678"]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 6, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["estimatedTotalHits"], @"6");
let (response, code) = index.search_post(json!({"offset": 7, "limit": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["estimatedTotalHits"], @"6");
}
/// testing: https://github.com/meilisearch/meilisearch/issues/4130
#[actix_rt::test]
async fn distinct_search_with_pagination_no_ranking() {
let server = Server::new().await;
let index = server.index("test");
let documents = DOCUMENTS.clone();
index.add_documents(documents, Some(DOCUMENT_PRIMARY_KEY)).await;
index.update_distinct_attribute(json!(DOCUMENT_DISTINCT_KEY)).await;
index.wait_task(1).await;
fn get_hits(response: &Value) -> Vec<&str> {
let hits_array = response["hits"].as_array().unwrap();
hits_array.iter().map(|h| h[DOCUMENT_DISTINCT_KEY].as_str().unwrap()).collect::<Vec<_>>()
}
let (response, code) = index.search_post(json!({"page": 0, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["page"], @"0");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 1, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["123456", "789012"]"#);
snapshot!(response["page"], @"1");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 2, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["456789", "987654"]"#);
snapshot!(response["page"], @"2");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 3, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"2");
snapshot!(format!("{:?}", hits), @r#"["234567", "345678"]"#);
snapshot!(response["page"], @"3");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 4, "hitsPerPage": 2})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"0");
snapshot!(format!("{:?}", hits), @r#"[]"#);
snapshot!(response["page"], @"4");
snapshot!(response["totalPages"], @"3");
snapshot!(response["totalHits"], @"6");
let (response, code) = index.search_post(json!({"page": 2, "hitsPerPage": 3})).await;
let hits = get_hits(&response);
snapshot!(code, @"200 OK");
snapshot!(hits.len(), @"3");
snapshot!(format!("{:?}", hits), @r#"["987654", "234567", "345678"]"#);
snapshot!(response["page"], @"2");
snapshot!(response["totalPages"], @"2");
snapshot!(response["totalHits"], @"6");
}

View File

@ -1,6 +1,7 @@
// This modules contains all the test concerning search. Each particular feature of the search
// should be tested in its own module to isolate tests and keep the tests readable.
mod distinct;
mod errors;
mod facet_search;
mod formatted;

View File

@ -1,4 +1,5 @@
use std::fs::File;
use std::io::BufReader;
use std::{io, str};
use obkv::KvReader;
@ -19,14 +20,14 @@ use crate::FieldId;
pub struct EnrichedDocumentsBatchReader<R> {
documents: DocumentsBatchReader<R>,
primary_key: String,
external_ids: grenad::ReaderCursor<File>,
external_ids: grenad::ReaderCursor<BufReader<File>>,
}
impl<R: io::Read + io::Seek> EnrichedDocumentsBatchReader<R> {
pub fn new(
documents: DocumentsBatchReader<R>,
primary_key: String,
external_ids: grenad::Reader<File>,
external_ids: grenad::Reader<BufReader<File>>,
) -> Result<Self, Error> {
if documents.documents_count() as u64 == external_ids.len() {
Ok(EnrichedDocumentsBatchReader {
@ -75,7 +76,7 @@ pub struct EnrichedDocument<'a> {
pub struct EnrichedDocumentsBatchCursor<R> {
documents: DocumentsBatchCursor<R>,
primary_key: String,
external_ids: grenad::ReaderCursor<File>,
external_ids: grenad::ReaderCursor<BufReader<File>>,
}
impl<R> EnrichedDocumentsBatchCursor<R> {

View File

@ -53,11 +53,22 @@ pub fn bucket_sort<'ctx, Q: RankingRuleQueryTrait>(
if excluded.contains(docid) {
continue;
}
distinct_single_docid(ctx.index, ctx.txn, distinct_fid, docid, &mut excluded)?;
results.push(docid);
}
let mut all_candidates = universe - excluded;
all_candidates.extend(results.iter().copied());
// drain the results of the skipped elements
// this **must** be done **after** writing the entire results in `all_candidates` to ensure
// e.g. estimatedTotalHits is correct.
if results.len() >= from {
results.drain(..from);
} else {
results.clear();
}
return Ok(BucketSortOutput {
scores: vec![Default::default(); results.len()],
docids: results,

View File

@ -1,5 +1,6 @@
use std::borrow::Cow;
use std::fs::File;
use std::io::BufReader;
use grenad::CompressionType;
use heed::types::ByteSlice;
@ -30,7 +31,7 @@ pub struct FacetsUpdateBulk<'i> {
facet_type: FacetType,
field_ids: Vec<FieldId>,
// None if level 0 does not need to be updated
new_data: Option<grenad::Reader<File>>,
new_data: Option<grenad::Reader<BufReader<File>>>,
}
impl<'i> FacetsUpdateBulk<'i> {
@ -38,7 +39,7 @@ impl<'i> FacetsUpdateBulk<'i> {
index: &'i Index,
field_ids: Vec<FieldId>,
facet_type: FacetType,
new_data: grenad::Reader<File>,
new_data: grenad::Reader<BufReader<File>>,
group_size: u8,
min_level_size: u8,
) -> FacetsUpdateBulk<'i> {
@ -187,7 +188,7 @@ impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
&self,
field_id: FieldId,
txn: &RoTxn,
) -> Result<(Vec<grenad::Reader<File>>, RoaringBitmap)> {
) -> Result<(Vec<grenad::Reader<BufReader<File>>>, RoaringBitmap)> {
let mut all_docids = RoaringBitmap::new();
let subwriters = self.compute_higher_levels(txn, field_id, 32, &mut |bitmaps, _| {
for bitmap in bitmaps {
@ -259,7 +260,7 @@ impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
field_id: u16,
level: u8,
handle_group: &mut dyn FnMut(&[RoaringBitmap], &'t [u8]) -> Result<()>,
) -> Result<Vec<grenad::Reader<File>>> {
) -> Result<Vec<grenad::Reader<BufReader<File>>>> {
if level == 0 {
self.read_level_0(rtxn, field_id, handle_group)?;
// Level 0 is already in the database

View File

@ -1,5 +1,6 @@
use std::collections::HashMap;
use std::fs::File;
use std::io::BufReader;
use heed::types::{ByteSlice, DecodeIgnore};
use heed::{BytesDecode, Error, RoTxn, RwTxn};
@ -34,14 +35,14 @@ pub struct FacetsUpdateIncremental<'i> {
index: &'i Index,
inner: FacetsUpdateIncrementalInner,
facet_type: FacetType,
new_data: grenad::Reader<File>,
new_data: grenad::Reader<BufReader<File>>,
}
impl<'i> FacetsUpdateIncremental<'i> {
pub fn new(
index: &'i Index,
facet_type: FacetType,
new_data: grenad::Reader<File>,
new_data: grenad::Reader<BufReader<File>>,
group_size: u8,
min_level_size: u8,
max_group_size: u8,

View File

@ -78,6 +78,7 @@ pub const FACET_MIN_LEVEL_SIZE: u8 = 5;
use std::collections::BTreeSet;
use std::fs::File;
use std::io::BufReader;
use std::iter::FromIterator;
use charabia::normalizer::{Normalize, NormalizerOption};
@ -108,13 +109,17 @@ pub struct FacetsUpdate<'i> {
index: &'i Index,
database: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
facet_type: FacetType,
new_data: grenad::Reader<File>,
new_data: grenad::Reader<BufReader<File>>,
group_size: u8,
max_group_size: u8,
min_level_size: u8,
}
impl<'i> FacetsUpdate<'i> {
pub fn new(index: &'i Index, facet_type: FacetType, new_data: grenad::Reader<File>) -> Self {
pub fn new(
index: &'i Index,
facet_type: FacetType,
new_data: grenad::Reader<BufReader<File>>,
) -> Self {
let database = match facet_type {
FacetType::String => index
.facet_id_string_docids

View File

@ -1,4 +1,4 @@
use std::io::{Read, Seek};
use std::io::{BufWriter, Read, Seek};
use std::result::Result as StdResult;
use std::{fmt, iter};
@ -35,7 +35,7 @@ pub fn enrich_documents_batch<R: Read + Seek>(
let (mut cursor, mut documents_batch_index) = reader.into_cursor_and_fields_index();
let mut external_ids = tempfile::tempfile().map(grenad::Writer::new)?;
let mut external_ids = tempfile::tempfile().map(BufWriter::new).map(grenad::Writer::new)?;
let mut uuid_buffer = [0; uuid::fmt::Hyphenated::LENGTH];
// The primary key *field id* that has already been set for this index or the one

View File

@ -1,6 +1,7 @@
use std::collections::{HashMap, HashSet};
use std::convert::TryInto;
use std::fs::File;
use std::io::BufReader;
use std::{io, mem, str};
use charabia::{Language, Script, SeparatorKind, Token, TokenKind, Tokenizer, TokenizerBuilder};
@ -31,7 +32,7 @@ pub fn extract_docid_word_positions<R: io::Read + io::Seek>(
allowed_separators: Option<&[&str]>,
dictionary: Option<&[&str]>,
max_positions_per_attributes: Option<u32>,
) -> Result<(RoaringBitmap, grenad::Reader<File>, ScriptLanguageDocidsMap)> {
) -> Result<(RoaringBitmap, grenad::Reader<BufReader<File>>, ScriptLanguageDocidsMap)> {
puffin::profile_function!();
let max_positions_per_attributes = max_positions_per_attributes

View File

@ -1,5 +1,5 @@
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use heed::{BytesDecode, BytesEncode};
@ -19,7 +19,7 @@ use crate::Result;
pub fn extract_facet_number_docids<R: io::Read + io::Seek>(
docid_fid_facet_number: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,5 +1,5 @@
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use heed::BytesEncode;
@ -17,7 +17,7 @@ use crate::{FieldId, Result, MAX_FACET_VALUE_LENGTH};
pub fn extract_facet_string_docids<R: io::Read + io::Seek>(
docid_fid_facet_string: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,7 +1,7 @@
use std::collections::{BTreeMap, HashSet};
use std::convert::TryInto;
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use std::mem::size_of;
use heed::zerocopy::AsBytes;
@ -17,11 +17,11 @@ use crate::{CboRoaringBitmapCodec, DocumentId, FieldId, Result, BEU32, MAX_FACET
/// The extracted facet values stored in grenad files by type.
pub struct ExtractedFacetValues {
pub docid_fid_facet_numbers_chunk: grenad::Reader<File>,
pub docid_fid_facet_strings_chunk: grenad::Reader<File>,
pub fid_facet_is_null_docids_chunk: grenad::Reader<File>,
pub fid_facet_is_empty_docids_chunk: grenad::Reader<File>,
pub fid_facet_exists_docids_chunk: grenad::Reader<File>,
pub docid_fid_facet_numbers_chunk: grenad::Reader<BufReader<File>>,
pub docid_fid_facet_strings_chunk: grenad::Reader<BufReader<File>>,
pub fid_facet_is_null_docids_chunk: grenad::Reader<BufReader<File>>,
pub fid_facet_is_empty_docids_chunk: grenad::Reader<BufReader<File>>,
pub fid_facet_exists_docids_chunk: grenad::Reader<BufReader<File>>,
}
/// Extracts the facet values of each faceted field of each document.

View File

@ -1,6 +1,6 @@
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use grenad::Sorter;
@ -21,7 +21,7 @@ use crate::{relative_from_absolute_position, DocumentId, FieldId, Result};
pub fn extract_fid_word_count_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,5 +1,5 @@
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use concat_arrays::concat_arrays;
use serde_json::Value;
@ -18,7 +18,7 @@ pub fn extract_geo_points<R: io::Read + io::Seek>(
indexer: GrenadParameters,
primary_key_id: FieldId,
(lat_fid, lng_fid): (FieldId, FieldId),
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let mut writer = create_writer(

View File

@ -1,6 +1,6 @@
use std::convert::TryFrom;
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use bytemuck::cast_slice;
use serde_json::{from_slice, Value};
@ -18,7 +18,7 @@ pub fn extract_vector_points<R: io::Read + io::Seek>(
indexer: GrenadParameters,
primary_key_id: FieldId,
vectors_fid: FieldId,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let mut writer = create_writer(

View File

@ -1,6 +1,6 @@
use std::collections::HashSet;
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use std::iter::FromIterator;
use roaring::RoaringBitmap;
@ -26,7 +26,7 @@ pub fn extract_word_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
exact_attributes: &HashSet<FieldId>,
) -> Result<(grenad::Reader<File>, grenad::Reader<File>)> {
) -> Result<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,5 +1,5 @@
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use super::helpers::{
create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
@ -14,7 +14,7 @@ use crate::{relative_from_absolute_position, DocumentId, Result};
pub fn extract_word_fid_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,6 +1,7 @@
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};
use std::fs::File;
use std::io::BufReader;
use std::{cmp, io, mem, str, vec};
use super::helpers::{
@ -20,7 +21,7 @@ use crate::{DocumentId, Result};
pub fn extract_word_pair_proximity_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -1,5 +1,5 @@
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use super::helpers::{
create_sorter, merge_cbo_roaring_bitmaps, read_u32_ne_bytes, sorter_into_reader,
@ -17,7 +17,7 @@ use crate::{bucketed_position, relative_from_absolute_position, DocumentId, Resu
pub fn extract_word_position_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();

View File

@ -12,6 +12,7 @@ mod extract_word_position_docids;
use std::collections::HashSet;
use std::fs::File;
use std::io::BufReader;
use crossbeam_channel::Sender;
use log::debug;
@ -39,8 +40,8 @@ use crate::{FieldId, Result};
/// Send data in grenad file over provided Sender.
#[allow(clippy::too_many_arguments)]
pub(crate) fn data_from_obkv_documents(
original_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<File>>> + Send,
flattened_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<File>>> + Send,
original_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>> + Send,
flattened_obkv_chunks: impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>> + Send,
indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>,
searchable_fields: Option<HashSet<FieldId>>,
@ -152,7 +153,7 @@ pub(crate) fn data_from_obkv_documents(
});
}
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(),
indexer,
lmdb_writer_sx.clone(),
@ -162,7 +163,7 @@ pub(crate) fn data_from_obkv_documents(
"word-pair-proximity-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(),
indexer,
lmdb_writer_sx.clone(),
@ -172,7 +173,11 @@ pub(crate) fn data_from_obkv_documents(
"field-id-wordcount-docids",
);
spawn_extraction_task::<_, _, Vec<(grenad::Reader<File>, grenad::Reader<File>)>>(
spawn_extraction_task::<
_,
_,
Vec<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)>,
>(
docid_word_positions_chunks.clone(),
indexer,
lmdb_writer_sx.clone(),
@ -185,7 +190,7 @@ pub(crate) fn data_from_obkv_documents(
"word-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks.clone(),
indexer,
lmdb_writer_sx.clone(),
@ -194,7 +199,7 @@ pub(crate) fn data_from_obkv_documents(
TypedChunk::WordPositionDocids,
"word-position-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_word_positions_chunks,
indexer,
lmdb_writer_sx.clone(),
@ -204,7 +209,7 @@ pub(crate) fn data_from_obkv_documents(
"word-fid-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_fid_facet_strings_chunks,
indexer,
lmdb_writer_sx.clone(),
@ -214,7 +219,7 @@ pub(crate) fn data_from_obkv_documents(
"field-id-facet-string-docids",
);
spawn_extraction_task::<_, _, Vec<grenad::Reader<File>>>(
spawn_extraction_task::<_, _, Vec<grenad::Reader<BufReader<File>>>>(
docid_fid_facet_numbers_chunks,
indexer,
lmdb_writer_sx,
@ -269,7 +274,7 @@ fn spawn_extraction_task<FE, FS, M>(
/// Extract chunked data and send it into lmdb_writer_sx sender:
/// - documents
fn send_original_documents_data(
original_documents_chunk: Result<grenad::Reader<File>>,
original_documents_chunk: Result<grenad::Reader<BufReader<File>>>,
indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>,
vectors_field_id: Option<FieldId>,
@ -311,7 +316,7 @@ fn send_original_documents_data(
#[allow(clippy::too_many_arguments)]
#[allow(clippy::type_complexity)]
fn send_and_extract_flattened_documents_data(
flattened_documents_chunk: Result<grenad::Reader<File>>,
flattened_documents_chunk: Result<grenad::Reader<BufReader<File>>>,
indexer: GrenadParameters,
lmdb_writer_sx: Sender<Result<TypedChunk>>,
searchable_fields: &Option<HashSet<FieldId>>,
@ -328,7 +333,10 @@ fn send_and_extract_flattened_documents_data(
grenad::Reader<CursorClonableMmap>,
(
grenad::Reader<CursorClonableMmap>,
(grenad::Reader<File>, (grenad::Reader<File>, grenad::Reader<File>)),
(
grenad::Reader<BufReader<File>>,
(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>),
),
),
),
)> {

View File

@ -1,6 +1,6 @@
use std::borrow::Cow;
use std::fs::File;
use std::io::{self, Seek};
use std::io::{self, BufReader, BufWriter, Seek};
use std::time::Instant;
use grenad::{CompressionType, Sorter};
@ -17,13 +17,13 @@ pub fn create_writer<R: io::Write>(
typ: grenad::CompressionType,
level: Option<u32>,
file: R,
) -> grenad::Writer<R> {
) -> grenad::Writer<BufWriter<R>> {
let mut builder = grenad::Writer::builder();
builder.compression_type(typ);
if let Some(level) = level {
builder.compression_level(level);
}
builder.build(file)
builder.build(BufWriter::new(file))
}
pub fn create_sorter(
@ -53,7 +53,7 @@ pub fn create_sorter(
pub fn sorter_into_reader(
sorter: grenad::Sorter<MergeFn>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
) -> Result<grenad::Reader<BufReader<File>>> {
let mut writer = create_writer(
indexer.chunk_compression_type,
indexer.chunk_compression_level,
@ -64,16 +64,18 @@ pub fn sorter_into_reader(
writer_into_reader(writer)
}
pub fn writer_into_reader(writer: grenad::Writer<File>) -> Result<grenad::Reader<File>> {
let mut file = writer.into_inner()?;
pub fn writer_into_reader(
writer: grenad::Writer<BufWriter<File>>,
) -> Result<grenad::Reader<BufReader<File>>> {
let mut file = writer.into_inner()?.into_inner().map_err(|err| err.into_error())?;
file.rewind()?;
grenad::Reader::new(file).map_err(Into::into)
grenad::Reader::new(BufReader::new(file)).map_err(Into::into)
}
pub unsafe fn as_cloneable_grenad(
reader: &grenad::Reader<File>,
reader: &grenad::Reader<BufReader<File>>,
) -> Result<grenad::Reader<CursorClonableMmap>> {
let file = reader.get_ref();
let file = reader.get_ref().get_ref();
let mmap = memmap2::Mmap::map(file)?;
let cursor = io::Cursor::new(ClonableMmap::from(mmap));
let reader = grenad::Reader::new(cursor)?;
@ -89,8 +91,8 @@ where
fn merge(self, merge_fn: MergeFn, indexer: &GrenadParameters) -> Result<Self::Output>;
}
impl MergeableReader for Vec<grenad::Reader<File>> {
type Output = grenad::Reader<File>;
impl MergeableReader for Vec<grenad::Reader<BufReader<File>>> {
type Output = grenad::Reader<BufReader<File>>;
fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> {
let mut merger = MergerBuilder::new(merge_fn);
@ -99,8 +101,8 @@ impl MergeableReader for Vec<grenad::Reader<File>> {
}
}
impl MergeableReader for Vec<(grenad::Reader<File>, grenad::Reader<File>)> {
type Output = (grenad::Reader<File>, grenad::Reader<File>);
impl MergeableReader for Vec<(grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>)> {
type Output = (grenad::Reader<BufReader<File>>, grenad::Reader<BufReader<File>>);
fn merge(self, merge_fn: MergeFn, params: &GrenadParameters) -> Result<Self::Output> {
let mut m1 = MergerBuilder::new(merge_fn);
@ -125,7 +127,7 @@ impl<R: io::Read + io::Seek> MergerBuilder<R> {
Ok(())
}
fn finish(self, params: &GrenadParameters) -> Result<grenad::Reader<File>> {
fn finish(self, params: &GrenadParameters) -> Result<grenad::Reader<BufReader<File>>> {
let merger = self.0.build();
let mut writer = create_writer(
params.chunk_compression_type,
@ -176,7 +178,7 @@ pub fn grenad_obkv_into_chunks<R: io::Read + io::Seek>(
reader: grenad::Reader<R>,
indexer: GrenadParameters,
documents_chunk_size: usize,
) -> Result<impl Iterator<Item = Result<grenad::Reader<File>>>> {
) -> Result<impl Iterator<Item = Result<grenad::Reader<BufReader<File>>>>> {
let mut continue_reading = true;
let mut cursor = reader.into_cursor()?;

View File

@ -659,8 +659,10 @@ impl<'a, 'i> Transform<'a, 'i> {
new_documents_ids: self.new_documents_ids,
replaced_documents_ids: self.replaced_documents_ids,
documents_count: self.documents_count,
original_documents,
flattened_documents,
original_documents: original_documents.into_inner().map_err(|err| err.into_error())?,
flattened_documents: flattened_documents
.into_inner()
.map_err(|err| err.into_error())?,
})
}
@ -779,8 +781,10 @@ impl<'a, 'i> Transform<'a, 'i> {
new_documents_ids: documents_ids,
replaced_documents_ids: RoaringBitmap::default(),
documents_count,
original_documents,
flattened_documents,
original_documents: original_documents.into_inner().map_err(|err| err.into_error())?,
flattened_documents: flattened_documents
.into_inner()
.map_err(|err| err.into_error())?,
};
let new_facets = output.compute_real_facets(wtxn, self.index)?;

View File

@ -2,7 +2,7 @@ use std::borrow::Cow;
use std::collections::HashMap;
use std::convert::TryInto;
use std::fs::File;
use std::io;
use std::io::{self, BufReader};
use bytemuck::allocation::pod_collect_to_vec;
use charabia::{Language, Script};
@ -27,22 +27,22 @@ pub(crate) enum TypedChunk {
FieldIdDocidFacetStrings(grenad::Reader<CursorClonableMmap>),
FieldIdDocidFacetNumbers(grenad::Reader<CursorClonableMmap>),
Documents(grenad::Reader<CursorClonableMmap>),
FieldIdWordcountDocids(grenad::Reader<File>),
FieldIdWordcountDocids(grenad::Reader<BufReader<File>>),
NewDocumentsIds(RoaringBitmap),
WordDocids {
word_docids_reader: grenad::Reader<File>,
exact_word_docids_reader: grenad::Reader<File>,
word_docids_reader: grenad::Reader<BufReader<File>>,
exact_word_docids_reader: grenad::Reader<BufReader<File>>,
},
WordPositionDocids(grenad::Reader<File>),
WordFidDocids(grenad::Reader<File>),
WordPairProximityDocids(grenad::Reader<File>),
FieldIdFacetStringDocids(grenad::Reader<File>),
FieldIdFacetNumberDocids(grenad::Reader<File>),
FieldIdFacetExistsDocids(grenad::Reader<File>),
FieldIdFacetIsNullDocids(grenad::Reader<File>),
FieldIdFacetIsEmptyDocids(grenad::Reader<File>),
GeoPoints(grenad::Reader<File>),
VectorPoints(grenad::Reader<File>),
WordPositionDocids(grenad::Reader<BufReader<File>>),
WordFidDocids(grenad::Reader<BufReader<File>>),
WordPairProximityDocids(grenad::Reader<BufReader<File>>),
FieldIdFacetStringDocids(grenad::Reader<BufReader<File>>),
FieldIdFacetNumberDocids(grenad::Reader<BufReader<File>>),
FieldIdFacetExistsDocids(grenad::Reader<BufReader<File>>),
FieldIdFacetIsNullDocids(grenad::Reader<BufReader<File>>),
FieldIdFacetIsEmptyDocids(grenad::Reader<BufReader<File>>),
GeoPoints(grenad::Reader<BufReader<File>>),
VectorPoints(grenad::Reader<BufReader<File>>),
ScriptLanguageDocids(HashMap<(Script, Language), RoaringBitmap>),
}

View File

@ -1,6 +1,6 @@
use std::borrow::Cow;
use std::collections::HashSet;
use std::io::BufReader;
use std::io::{BufReader, BufWriter};
use grenad::CompressionType;
use heed::types::ByteSlice;
@ -119,9 +119,9 @@ pub fn insert_into_database(
pub fn write_into_lmdb_database_without_merging(
wtxn: &mut heed::RwTxn,
database: heed::PolyDatabase,
writer: grenad::Writer<std::fs::File>,
writer: grenad::Writer<BufWriter<std::fs::File>>,
) -> Result<()> {
let file = writer.into_inner()?;
let file = writer.into_inner()?.into_inner().map_err(|err| err.into_error())?;
let reader = grenad::Reader::new(BufReader::new(file))?;
if database.is_empty(wtxn)? {
let mut out_iter = database.iter_mut::<_, ByteSlice, ByteSlice>(wtxn)?;

View File

@ -8,7 +8,7 @@ use Criterion::*;
use crate::search::{self, EXTERNAL_DOCUMENTS_IDS};
macro_rules! test_distinct {
($func:ident, $distinct:ident, $exhaustive:ident, $limit:expr, $criteria:expr, $n_res:expr) => {
($func:ident, $distinct:ident, $exhaustive:ident, $limit:expr, $offset:expr, $criteria:expr, $n_res:expr) => {
#[test]
fn $func() {
let criteria = $criteria;
@ -27,6 +27,7 @@ macro_rules! test_distinct {
let mut search = Search::new(&rtxn, &index);
search.query(search::TEST_QUERY);
search.limit($limit);
search.offset($offset);
search.exhaustive_number_hits($exhaustive);
search.terms_matching_strategy(TermsMatchingStrategy::default());
@ -47,6 +48,7 @@ macro_rules! test_distinct {
Some(d.id)
}
})
.skip($offset)
.take($limit)
.collect();
@ -61,6 +63,7 @@ test_distinct!(
tag,
true,
1,
0,
vec![Words, Typo, Proximity, Attribute, Exactness],
3
);
@ -69,6 +72,7 @@ test_distinct!(
asc_desc_rank,
true,
1,
0,
vec![Words, Typo, Proximity, Attribute, Exactness],
7
);
@ -77,6 +81,7 @@ test_distinct!(
asc_desc_rank,
true,
0,
0,
vec![Desc(S("attribute_rank")), Desc(S("exactness_rank")), Exactness, Typo],
7
);
@ -86,6 +91,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Typo, Proximity, Attribute, Exactness],
3
);
@ -94,6 +100,7 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Typo, Proximity, Attribute, Exactness],
7
);
@ -102,6 +109,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words],
3
);
@ -110,6 +118,7 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words],
7
);
@ -118,6 +127,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Typo],
3
);
@ -126,6 +136,7 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Typo],
7
);
@ -134,6 +145,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Proximity],
3
);
@ -142,6 +154,7 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Proximity],
7
);
@ -150,6 +163,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Attribute],
3
);
@ -158,6 +172,7 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Attribute],
7
);
@ -166,6 +181,7 @@ test_distinct!(
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Exactness],
3
);
@ -174,6 +190,47 @@ test_distinct!(
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
0,
vec![Words, Exactness],
7
);
test_distinct!(
// testing: https://github.com/meilisearch/meilisearch/issues/4078
distinct_string_limit_and_offset,
tag,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
1,
vec![],
3
);
test_distinct!(
// testing: https://github.com/meilisearch/meilisearch/issues/4078
exhaustive_distinct_string_limit_and_offset,
tag,
true,
1,
2,
vec![],
3
);
test_distinct!(
// testing: https://github.com/meilisearch/meilisearch/issues/4078
distinct_number_limit_and_offset,
asc_desc_rank,
false,
EXTERNAL_DOCUMENTS_IDS.len(),
2,
vec![],
7
);
test_distinct!(
// testing: https://github.com/meilisearch/meilisearch/issues/4078
exhaustive_distinct_number_limit_and_offset,
asc_desc_rank,
true,
2,
4,
vec![],
7
);