Compare commits

...

29 Commits

Author SHA1 Message Date
Kerollmops
63b6229984 Log when an entry is too large to fit in the BBQueue 2024-12-05 11:24:16 +01:00
meili-bors[bot]
cac355bfa7 Merge #5124
5124: Optimize Prefixes and Merges r=ManyTheFish a=Kerollmops

In this PR, we plan to optimize the read of LMDB to use read the entries in lexicographic order and better use the memory-mapping OS cache:

 - Optimize the prefix generation for word position docids (`@manythefish)`
 - Optimize the parallel merging of the caches to sort entries before merging the caches (`@kerollmops)`
 
## Benchmarks on 1cpu 2gb gpo3 (5k IOps)
 
Before on the tag meilisearch-v1.12.0-rc.3.

```
word_position_docids:merge_and_send_docids: 988s
compute_word_fst: 23.3s
word_pair_proximity_docids:merge_and_send_docids: 428s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 76.3s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 429s
```

After sorting the whole `HashMap`s in a `Vec` on this branch.

```
word_position_docids:merge_and_send_docids: 202s
compute_word_fst: 20.4s
word_pair_proximity_docids:merge_and_send_docids: 427s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 65.5s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 62.5s
```

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-05 09:35:52 +00:00
Kerollmops
52843123d4 Clean up and remove the non-sorted merge_caches function 2024-12-05 10:03:05 +01:00
meili-bors[bot]
6298db5bea Merge #5113
5113: Fix the Minimum BBQueue channel threshold r=Kerollmops a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-12-05 09:01:02 +00:00
meili-bors[bot]
a003a0934a Merge #5121
Some checks failed
Test suite / Tests on ${{ matrix.os }} (macos-13) (push) Waiting to run
Test suite / Tests almost all features (push) Has been skipped
Test suite / Test disabled tokenization (push) Has been skipped
Test suite / Tests on ubuntu-20.04 (push) Failing after 11s
Test suite / Run tests in debug (push) Failing after 9s
Test suite / Tests on ${{ matrix.os }} (windows-2022) (push) Failing after 24s
Test suite / Run Rustfmt (push) Successful in 1m19s
Test suite / Run Clippy (push) Successful in 5m32s
5121: Make the tasks pulling timeout configurable r=dureuill a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-04 17:04:14 +00:00
Louis Dureuil
3a11e39c01 Force max_memory to a min of 100MiB 2024-12-04 17:53:30 +01:00
Louis Dureuil
5f896b1050 Fix geo when spilling 2024-12-04 17:51:12 +01:00
Kerollmops
d0c4e6da6b Make clippy happy 2024-12-04 17:39:10 +01:00
Kerollmops
2da5584bb5 Make the tasks pulling timeout configurable 2024-12-04 17:39:07 +01:00
meili-bors[bot]
b7eb802ae6 Merge #5120
5120: Add cross tasks r=Kerollmops a=ManyTheFish

Add 4 xtask bench workloads:
- `hackernews-add-new-documents`: adds new documents on a db already containing documents
- `hackernews-modify-facet-numbers`: modify filterable fields containing numbers of documents on a db already containing documents
- `hackernews-modify-facet-strings`: modify filterable fields containing strings of documents on a db already containing documents
- `hackernews-modify-searchables`: modify searchable fields of documents on a db already containing documents

Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-12-04 16:16:57 +00:00
Kerollmops
2e32d0474c Lexicographically sort all the map to merge 2024-12-04 17:05:11 +01:00
Kerollmops
cb99ac6f7e Consume vec instead of draining 2024-12-04 17:00:22 +01:00
Kerollmops
be411435f5 Use the merge_caches_alt function in the docids merging 2024-12-04 16:37:29 +01:00
Kerollmops
29ef164530 Introduce a new semi ordered merge function 2024-12-04 16:33:35 +01:00
ManyTheFish
739c52a3cd Replace HashSets by BTreeSets for the prefixes 2024-12-04 16:16:48 +01:00
ManyTheFish
8388698993 Fix dat hash 2024-12-04 15:09:10 +01:00
ManyTheFish
7458f0386c fix asset name 2024-12-04 14:44:57 +01:00
meili-bors[bot]
3ded069042 Merge #5122
5122: Yield the BBQueue writing loop r=ManyTheFish a=Kerollmops

We prefer yielding to let the writing thread do its job instead of spin looping.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-04 13:33:51 +00:00
Kerollmops
261d2ceb06 Yield the BBQueue writer instead of spin looping 2024-12-04 14:16:40 +01:00
ManyTheFish
1a17e2e572 fix formating 2024-12-04 13:57:06 +01:00
meili-bors[bot]
5b8cd68abe Merge #5110
5110: Increase margin on deletion of task r=dureuill a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/5077

## What does this PR do?
- Increase the margin we keep to enqueue task deletion

The issue was that we had not enough space on the reserved memory to write both the batch and the deletion task we just enqueued.
We could fix it only for this test as it’s not an issue in production where we have 10GiB of margin, but I thought it wasn’t a bad idea either to increase our margin a bit since we’re effectively writing more to lmdb.


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-12-04 12:54:48 +00:00
ManyTheFish
5ce9acb0b9 Add workloads 2024-12-04 12:19:19 +01:00
meili-bors[bot]
54341c2e80 Merge #5118
5118: Change the reserve and grant function to accept a closure r=ManyTheFish a=Kerollmops

This simplifies the usage of the grant and commits it at the right time, just after having written in it.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2024-12-04 10:12:39 +00:00
Kerollmops
96831ed9bb Send the WakeUp message if necessary in the reserve function 2024-12-04 11:03:01 +01:00
Kerollmops
0459b1a242 Change the reserve and grant function to accept a closure 2024-12-04 10:32:25 +01:00
Kerollmops
8ecb726683 Fix the minimun BBQueue channel threshold 2024-12-03 15:49:11 +01:00
meili-bors[bot]
297e72e262 Merge #5111
Some checks failed
Test suite / Tests on ${{ matrix.os }} (macos-13) (push) Waiting to run
Test suite / Tests on ${{ matrix.os }} (windows-2022) (push) Failing after 43s
Test suite / Tests on ubuntu-20.04 (push) Failing after 11s
Test suite / Tests almost all features (push) Has been skipped
Test suite / Test disabled tokenization (push) Has been skipped
Test suite / Run tests in debug (push) Failing after 9s
Test suite / Run Clippy (push) Successful in 7m18s
Test suite / Run Rustfmt (push) Successful in 1m32s
5111: Update BBQueue repo to point to the Meilisearch org r=curquiza a=Kerollmops

This PR updates the milli dependencies to make BBQueue point to the Meilisearch org repo.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-12-03 14:27:04 +00:00
Clément Renault
0ad2f57a92 Update bbqueue repo to point to the meilisearch org 2024-12-03 12:00:04 +01:00
Tamo
71d53f413f increase the margin allowed to delete task 2024-12-03 11:07:03 +01:00
17 changed files with 579 additions and 111 deletions

2
Cargo.lock generated
View File

@@ -492,7 +492,7 @@ checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6"
[[package]]
name = "bbqueue"
version = "0.5.1"
source = "git+https://github.com/kerollmops/bbqueue#cbb87cc707b5af415ef203bdaf2443e06ba0d6d4"
source = "git+https://github.com/meilisearch/bbqueue#cbb87cc707b5af415ef203bdaf2443e06ba0d6d4"
[[package]]
name = "benchmarks"

View File

@@ -1440,7 +1440,7 @@ impl IndexScheduler {
// if the task doesn't delete anything and 50% of the task queue is full, we must refuse to enqueue the incomming task
if !matches!(&kind, KindWithContent::TaskDeletion { tasks, .. } if !tasks.is_empty())
&& (self.env.non_free_pages_size()? * 100) / self.env.info().map_size as u64 > 50
&& (self.env.non_free_pages_size()? * 100) / self.env.info().map_size as u64 > 40
{
return Err(Error::NoSpaceLeftInTaskQueue);
}

View File

@@ -98,7 +98,7 @@ allocator-api2 = "0.2.18"
rustc-hash = "2.0.0"
uell = "0.1.0"
enum-iterator = "2.1.0"
bbqueue = { git = "https://github.com/kerollmops/bbqueue" }
bbqueue = { git = "https://github.com/meilisearch/bbqueue" }
flume = { version = "0.11.1", default-features = false }
[dev-dependencies]

View File

@@ -7,7 +7,7 @@ use std::num::NonZeroU16;
use std::ops::Range;
use std::time::Duration;
use bbqueue::framed::{FrameGrantR, FrameGrantW, FrameProducer};
use bbqueue::framed::{FrameGrantR, FrameProducer};
use bbqueue::BBBuffer;
use bytemuck::{checked, CheckedBitPattern, NoUninit};
use flume::{RecvTimeoutError, SendError};
@@ -454,14 +454,10 @@ impl<'b> ExtractorBbqueueSender<'b> {
}
// Spin loop to have a frame the size we requested.
let mut grant = reserve_grant(&mut producer, total_length, &self.sender)?;
payload_header.serialize_into(&mut grant);
// We only send a wake up message when the channel is empty
// so that we don't fill the channel with too many WakeUps.
if self.sender.is_empty() {
self.sender.send(ReceiverAction::WakeUp).unwrap();
}
reserve_and_write_grant(&mut producer, total_length, &self.sender, |grant| {
payload_header.serialize_into(grant);
Ok(())
})?;
Ok(())
}
@@ -484,6 +480,8 @@ impl<'b> ExtractorBbqueueSender<'b> {
let payload_header = EntryHeader::ArroySetVectors(arroy_set_vector);
let total_length = EntryHeader::total_set_vectors_size(embeddings.len(), dimensions);
if total_length > capacity {
tracing::trace!("We are spilling a large vector that is {total_length} bytes which is larger than the capacity of {capacity} bytes");
let mut value_file = tempfile::tempfile().map(BufWriter::new)?;
for embedding in embeddings {
let mut embedding_bytes = bytemuck::cast_slice(embedding);
@@ -500,24 +498,20 @@ impl<'b> ExtractorBbqueueSender<'b> {
}
// Spin loop to have a frame the size we requested.
let mut grant = reserve_grant(&mut producer, total_length, &self.sender)?;
reserve_and_write_grant(&mut producer, total_length, &self.sender, |grant| {
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
if dimensions != 0 {
let output_iter = remaining.chunks_exact_mut(dimensions * mem::size_of::<f32>());
for (embedding, output) in embeddings.iter().zip(output_iter) {
output.copy_from_slice(bytemuck::cast_slice(embedding));
if dimensions != 0 {
let output_iter = remaining.chunks_exact_mut(dimensions * mem::size_of::<f32>());
for (embedding, output) in embeddings.iter().zip(output_iter) {
output.copy_from_slice(bytemuck::cast_slice(embedding));
}
}
}
// We only send a wake up message when the channel is empty
// so that we don't fill the channel with too many WakeUps.
if self.sender.is_empty() {
self.sender.send(ReceiverAction::WakeUp).unwrap();
}
Ok(())
})?;
Ok(())
}
@@ -556,6 +550,8 @@ impl<'b> ExtractorBbqueueSender<'b> {
let payload_header = EntryHeader::DbOperation(operation);
let total_length = EntryHeader::total_key_value_size(key_length, value_length);
if total_length > capacity {
tracing::trace!("We are spilling a large entry that is {total_length} bytes which is larger than the capacity of {capacity} bytes");
let mut key_buffer = vec![0; key_length.get() as usize].into_boxed_slice();
let value_file = tempfile::tempfile()?;
value_file.set_len(value_length.try_into().unwrap())?;
@@ -575,19 +571,13 @@ impl<'b> ExtractorBbqueueSender<'b> {
}
// Spin loop to have a frame the size we requested.
let mut grant = reserve_grant(&mut producer, total_length, &self.sender)?;
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
let (key_buffer, value_buffer) = remaining.split_at_mut(key_length.get() as usize);
key_value_writer(key_buffer, value_buffer)?;
// We only send a wake up message when the channel is empty
// so that we don't fill the channel with too many WakeUps.
if self.sender.is_empty() {
self.sender.send(ReceiverAction::WakeUp).unwrap();
}
reserve_and_write_grant(&mut producer, total_length, &self.sender, |grant| {
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
let (key_buffer, value_buffer) = remaining.split_at_mut(key_length.get() as usize);
key_value_writer(key_buffer, value_buffer)
})?;
Ok(())
}
@@ -629,37 +619,44 @@ impl<'b> ExtractorBbqueueSender<'b> {
}
// Spin loop to have a frame the size we requested.
let mut grant = reserve_grant(&mut producer, total_length, &self.sender)?;
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
key_writer(remaining)?;
// We only send a wake up message when the channel is empty
// so that we don't fill the channel with too many WakeUps.
if self.sender.is_empty() {
self.sender.send(ReceiverAction::WakeUp).unwrap();
}
reserve_and_write_grant(&mut producer, total_length, &self.sender, |grant| {
let header_size = payload_header.header_size();
let (header_bytes, remaining) = grant.split_at_mut(header_size);
payload_header.serialize_into(header_bytes);
key_writer(remaining)
})?;
Ok(())
}
}
/// Try to reserve a frame grant of `total_length` by spin looping
/// on the BBQueue buffer and panics if the receiver has been disconnected.
fn reserve_grant<'b>(
producer: &mut FrameProducer<'b>,
/// Try to reserve a frame grant of `total_length` by spin
/// looping on the BBQueue buffer, panics if the receiver
/// has been disconnected or send a WakeUp message if necessary.
fn reserve_and_write_grant<F>(
producer: &mut FrameProducer,
total_length: usize,
sender: &flume::Sender<ReceiverAction>,
) -> crate::Result<FrameGrantW<'b>> {
f: F,
) -> crate::Result<()>
where
F: FnOnce(&mut [u8]) -> crate::Result<()>,
{
loop {
for _ in 0..10_000 {
match producer.grant(total_length) {
Ok(mut grant) => {
// We could commit only the used memory.
grant.to_commit(total_length);
return Ok(grant);
f(&mut grant)?;
grant.commit(total_length);
// We only send a wake up message when the channel is empty
// so that we don't fill the channel with too many WakeUps.
if sender.is_empty() {
sender.send(ReceiverAction::WakeUp).unwrap();
}
return Ok(());
}
Err(bbqueue::Error::InsufficientSize) => continue,
Err(e) => unreachable!("{e:?}"),
@@ -668,6 +665,11 @@ fn reserve_grant<'b>(
if sender.is_disconnected() {
return Err(Error::InternalError(InternalError::AbortedIndexation));
}
// We prefer to yield and allow the writing thread
// to do its job, especially beneficial when there
// is only one CPU core available.
std::thread::yield_now();
}
}

View File

@@ -466,12 +466,13 @@ pub fn transpose_and_freeze_caches<'a, 'extractor>(
Ok(bucket_caches)
}
/// Merges the caches that must be all associated to the same bucket.
/// Merges the caches that must be all associated to the same bucket
/// but make sure to sort the different buckets before performing the merges.
///
/// # Panics
///
/// - If the bucket IDs in these frozen caches are not exactly the same.
pub fn merge_caches<F>(frozen: Vec<FrozenCache>, mut f: F) -> Result<()>
pub fn merge_caches_sorted<F>(frozen: Vec<FrozenCache>, mut f: F) -> Result<()>
where
F: for<'a> FnMut(&'a [u8], DelAddRoaringBitmap) -> Result<()>,
{
@@ -543,12 +544,12 @@ where
// Then manage the content on the HashMap entries that weren't taken (mem::take).
while let Some(mut map) = maps.pop() {
for (key, bbbul) in map.iter_mut() {
// Make sure we don't try to work with entries already managed by the spilled
if bbbul.is_empty() {
continue;
}
// Make sure we don't try to work with entries already managed by the spilled
let mut ordered_entries: Vec<_> =
map.iter_mut().filter(|(_, bbbul)| !bbbul.is_empty()).collect();
ordered_entries.sort_unstable_by_key(|(key, _)| *key);
for (key, bbbul) in ordered_entries {
let mut output = DelAddRoaringBitmap::empty();
output.union_and_clear_bbbul(bbbul);

View File

@@ -1,6 +1,6 @@
use std::cell::RefCell;
use std::fs::File;
use std::io::{self, BufReader, BufWriter, ErrorKind, Read, Write as _};
use std::io::{self, BufReader, BufWriter, ErrorKind, Read, Seek as _, Write as _};
use std::{iter, mem, result};
use bumpalo::Bump;
@@ -97,30 +97,34 @@ pub struct FrozenGeoExtractorData<'extractor> {
impl<'extractor> FrozenGeoExtractorData<'extractor> {
pub fn iter_and_clear_removed(
&mut self,
) -> impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_ {
mem::take(&mut self.removed)
) -> io::Result<impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_> {
Ok(mem::take(&mut self.removed)
.iter()
.copied()
.map(Ok)
.chain(iterator_over_spilled_geopoints(&mut self.spilled_removed))
.chain(iterator_over_spilled_geopoints(&mut self.spilled_removed)?))
}
pub fn iter_and_clear_inserted(
&mut self,
) -> impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_ {
mem::take(&mut self.inserted)
) -> io::Result<impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_> {
Ok(mem::take(&mut self.inserted)
.iter()
.copied()
.map(Ok)
.chain(iterator_over_spilled_geopoints(&mut self.spilled_inserted))
.chain(iterator_over_spilled_geopoints(&mut self.spilled_inserted)?))
}
}
fn iterator_over_spilled_geopoints(
spilled: &mut Option<BufReader<File>>,
) -> impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_ {
) -> io::Result<impl IntoIterator<Item = io::Result<ExtractedGeoPoint>> + '_> {
let mut spilled = spilled.take();
iter::from_fn(move || match &mut spilled {
if let Some(spilled) = &mut spilled {
spilled.rewind()?;
}
Ok(iter::from_fn(move || match &mut spilled {
Some(file) => {
let geopoint_bytes = &mut [0u8; mem::size_of::<ExtractedGeoPoint>()];
match file.read_exact(geopoint_bytes) {
@@ -130,7 +134,7 @@ fn iterator_over_spilled_geopoints(
}
}
None => None,
})
}))
}
impl<'extractor> Extractor<'extractor> for GeoExtractor {
@@ -157,7 +161,9 @@ impl<'extractor> Extractor<'extractor> for GeoExtractor {
let mut data_ref = context.data.borrow_mut_or_yield();
for change in changes {
if max_memory.map_or(false, |mm| context.extractor_alloc.allocated_bytes() >= mm) {
if data_ref.spilled_removed.is_none()
&& max_memory.map_or(false, |mm| context.extractor_alloc.allocated_bytes() >= mm)
{
// We must spill as we allocated too much memory
data_ref.spilled_removed = tempfile::tempfile().map(BufWriter::new).map(Some)?;
data_ref.spilled_inserted = tempfile::tempfile().map(BufWriter::new).map(Some)?;

View File

@@ -6,7 +6,9 @@ mod searchable;
mod vectors;
use bumpalo::Bump;
pub use cache::{merge_caches, transpose_and_freeze_caches, BalancedCaches, DelAddRoaringBitmap};
pub use cache::{
merge_caches_sorted, transpose_and_freeze_caches, BalancedCaches, DelAddRoaringBitmap,
};
pub use documents::*;
pub use faceted::*;
pub use geo::*;

View File

@@ -86,9 +86,11 @@ where
(grenad_parameters, 2 * minimum_capacity), // 100 MiB by thread by default
|max_memory| {
// 2% of the indexing memory
let total_bbbuffer_capacity = (max_memory / 100 / 2).min(minimum_capacity);
let total_bbbuffer_capacity = (max_memory / 100 / 2).max(minimum_capacity);
let new_grenad_parameters = GrenadParameters {
max_memory: Some(max_memory - total_bbbuffer_capacity),
max_memory: Some(
max_memory.saturating_sub(total_bbbuffer_capacity).max(100 * 1024 * 1024),
),
..grenad_parameters
};
(new_grenad_parameters, total_bbbuffer_capacity)

View File

@@ -9,8 +9,8 @@ use roaring::RoaringBitmap;
use super::channel::*;
use super::extract::{
merge_caches, transpose_and_freeze_caches, BalancedCaches, DelAddRoaringBitmap, FacetKind,
GeoExtractorData,
merge_caches_sorted, transpose_and_freeze_caches, BalancedCaches, DelAddRoaringBitmap,
FacetKind, GeoExtractorData,
};
use crate::{CboRoaringBitmapCodec, FieldId, GeoPoint, Index, InternalError, Result};
@@ -34,7 +34,7 @@ where
}
let mut frozen = data.into_inner().freeze()?;
for result in frozen.iter_and_clear_removed() {
for result in frozen.iter_and_clear_removed()? {
let extracted_geo_point = result?;
let removed = rtree.remove(&GeoPoint::from(extracted_geo_point));
debug_assert!(removed.is_some());
@@ -42,7 +42,7 @@ where
debug_assert!(removed);
}
for result in frozen.iter_and_clear_inserted() {
for result in frozen.iter_and_clear_inserted()? {
let extracted_geo_point = result?;
rtree.insert(GeoPoint::from(extracted_geo_point));
let inserted = faceted.insert(extracted_geo_point.docid);
@@ -78,7 +78,7 @@ where
if must_stop_processing() {
return Err(InternalError::AbortedIndexation.into());
}
merge_caches(frozen, |key, DelAddRoaringBitmap { del, add }| {
merge_caches_sorted(frozen, |key, DelAddRoaringBitmap { del, add }| {
let current = database.get(&rtxn, key)?;
match merge_cbo_bitmaps(current, del, add)? {
Operation::Write(bitmap) => {
@@ -107,7 +107,7 @@ pub fn merge_and_send_facet_docids<'extractor>(
.map(|frozen| {
let mut facet_field_ids_delta = FacetFieldIdsDelta::default();
let rtxn = index.read_txn()?;
merge_caches(frozen, |key, DelAddRoaringBitmap { del, add }| {
merge_caches_sorted(frozen, |key, DelAddRoaringBitmap { del, add }| {
let current = database.get_cbo_roaring_bytes_value(&rtxn, key)?;
match merge_cbo_bitmaps(current, del, add)? {
Operation::Write(bitmap) => {

View File

@@ -1,4 +1,4 @@
use std::collections::HashSet;
use std::collections::BTreeSet;
use std::io::BufWriter;
use fst::{Set, SetBuilder, Streamer};
@@ -75,8 +75,8 @@ pub struct PrefixData {
#[derive(Debug)]
pub struct PrefixDelta {
pub modified: HashSet<Prefix>,
pub deleted: HashSet<Prefix>,
pub modified: BTreeSet<Prefix>,
pub deleted: BTreeSet<Prefix>,
}
struct PrefixFstBuilder {
@@ -86,7 +86,7 @@ struct PrefixFstBuilder {
prefix_fst_builders: Vec<SetBuilder<Vec<u8>>>,
current_prefix: Vec<Prefix>,
current_prefix_count: Vec<usize>,
modified_prefixes: HashSet<Prefix>,
modified_prefixes: BTreeSet<Prefix>,
current_prefix_is_modified: Vec<bool>,
}
@@ -110,7 +110,7 @@ impl PrefixFstBuilder {
prefix_fst_builders,
current_prefix: vec![Prefix::new(); max_prefix_length],
current_prefix_count: vec![0; max_prefix_length],
modified_prefixes: HashSet::new(),
modified_prefixes: BTreeSet::new(),
current_prefix_is_modified: vec![false; max_prefix_length],
})
}
@@ -180,7 +180,7 @@ impl PrefixFstBuilder {
let prefix_fst_mmap = unsafe { Mmap::map(&prefix_fst_file)? };
let new_prefix_fst = Set::new(&prefix_fst_mmap)?;
let old_prefix_fst = index.words_prefixes_fst(rtxn)?;
let mut deleted_prefixes = HashSet::new();
let mut deleted_prefixes = BTreeSet::new();
{
let mut deleted_prefixes_stream = old_prefix_fst.op().add(&new_prefix_fst).difference();
while let Some(prefix) = deleted_prefixes_stream.next() {

View File

@@ -1,5 +1,5 @@
use std::cell::RefCell;
use std::collections::HashSet;
use std::collections::BTreeSet;
use std::io::{BufReader, BufWriter, Read, Seek, Write};
use hashbrown::HashMap;
@@ -37,8 +37,8 @@ impl WordPrefixDocids {
fn execute(
self,
wtxn: &mut heed::RwTxn,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
) -> Result<()> {
delete_prefixes(wtxn, &self.prefix_database, prefix_to_delete)?;
self.recompute_modified_prefixes(wtxn, prefix_to_compute)
@@ -48,7 +48,7 @@ impl WordPrefixDocids {
fn recompute_modified_prefixes(
&self,
wtxn: &mut RwTxn,
prefixes: &HashSet<Prefix>,
prefixes: &BTreeSet<Prefix>,
) -> Result<()> {
// We fetch the docids associated to the newly added word prefix fst only.
// And collect the CboRoaringBitmaps pointers in an HashMap.
@@ -127,7 +127,7 @@ impl<'a, 'rtxn> FrozenPrefixBitmaps<'a, 'rtxn> {
pub fn from_prefixes(
database: Database<Bytes, CboRoaringBitmapCodec>,
rtxn: &'rtxn RoTxn,
prefixes: &'a HashSet<Prefix>,
prefixes: &'a BTreeSet<Prefix>,
) -> heed::Result<Self> {
let database = database.remap_data_type::<Bytes>();
@@ -173,8 +173,8 @@ impl WordPrefixIntegerDocids {
fn execute(
self,
wtxn: &mut heed::RwTxn,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
) -> Result<()> {
delete_prefixes(wtxn, &self.prefix_database, prefix_to_delete)?;
self.recompute_modified_prefixes(wtxn, prefix_to_compute)
@@ -184,7 +184,7 @@ impl WordPrefixIntegerDocids {
fn recompute_modified_prefixes(
&self,
wtxn: &mut RwTxn,
prefixes: &HashSet<Prefix>,
prefixes: &BTreeSet<Prefix>,
) -> Result<()> {
// We fetch the docids associated to the newly added word prefix fst only.
// And collect the CboRoaringBitmaps pointers in an HashMap.
@@ -262,7 +262,7 @@ impl<'a, 'rtxn> FrozenPrefixIntegerBitmaps<'a, 'rtxn> {
pub fn from_prefixes(
database: Database<Bytes, CboRoaringBitmapCodec>,
rtxn: &'rtxn RoTxn,
prefixes: &'a HashSet<Prefix>,
prefixes: &'a BTreeSet<Prefix>,
) -> heed::Result<Self> {
let database = database.remap_data_type::<Bytes>();
@@ -291,7 +291,7 @@ unsafe impl<'a, 'rtxn> Sync for FrozenPrefixIntegerBitmaps<'a, 'rtxn> {}
fn delete_prefixes(
wtxn: &mut RwTxn,
prefix_database: &Database<Bytes, CboRoaringBitmapCodec>,
prefixes: &HashSet<Prefix>,
prefixes: &BTreeSet<Prefix>,
) -> Result<()> {
// We remove all the entries that are no more required in this word prefix docids database.
for prefix in prefixes {
@@ -309,8 +309,8 @@ fn delete_prefixes(
pub fn compute_word_prefix_docids(
wtxn: &mut RwTxn,
index: &Index,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
grenad_parameters: GrenadParameters,
) -> Result<()> {
WordPrefixDocids::new(
@@ -325,8 +325,8 @@ pub fn compute_word_prefix_docids(
pub fn compute_exact_word_prefix_docids(
wtxn: &mut RwTxn,
index: &Index,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
grenad_parameters: GrenadParameters,
) -> Result<()> {
WordPrefixDocids::new(
@@ -341,8 +341,8 @@ pub fn compute_exact_word_prefix_docids(
pub fn compute_word_prefix_fid_docids(
wtxn: &mut RwTxn,
index: &Index,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
grenad_parameters: GrenadParameters,
) -> Result<()> {
WordPrefixIntegerDocids::new(
@@ -357,8 +357,8 @@ pub fn compute_word_prefix_fid_docids(
pub fn compute_word_prefix_position_docids(
wtxn: &mut RwTxn,
index: &Index,
prefix_to_compute: &HashSet<Prefix>,
prefix_to_delete: &HashSet<Prefix>,
prefix_to_compute: &BTreeSet<Prefix>,
prefix_to_delete: &BTreeSet<Prefix>,
grenad_parameters: GrenadParameters,
) -> Result<()> {
WordPrefixIntegerDocids::new(

View File

@@ -82,6 +82,10 @@ pub struct BenchDeriveArgs {
/// Reason for the benchmark invocation
#[arg(short, long)]
reason: Option<String>,
/// The maximum time in seconds we allow for fetching the task queue before timing out.
#[arg(long, default_value_t = 60)]
tasks_queue_timeout_secs: u64,
}
pub fn run(args: BenchDeriveArgs) -> anyhow::Result<()> {
@@ -127,7 +131,7 @@ pub fn run(args: BenchDeriveArgs) -> anyhow::Result<()> {
let meili_client = Client::new(
Some("http://127.0.0.1:7700".into()),
args.master_key.as_deref(),
Some(std::time::Duration::from_secs(60)),
Some(std::time::Duration::from_secs(args.tasks_queue_timeout_secs)),
)?;
// enter runtime

View File

@@ -16,6 +16,7 @@ struct ListFeaturesDeriveArgs {
#[command(author, version, about, long_about)]
#[command(name = "cargo xtask")]
#[command(bin_name = "cargo xtask")]
#[allow(clippy::large_enum_variant)] // please, that's enough...
enum Command {
ListFeatures(ListFeaturesDeriveArgs),
Bench(BenchDeriveArgs),

View File

@@ -0,0 +1,105 @@
{
"name": "hackernews.add_new_documents",
"run_count": 3,
"extra_cli_args": [],
"assets": {
"hackernews-01.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01.ndjson",
"sha256": "cd3627b86c064d865b6754848ed0e73ef1d8142752a25e5f0765c3a1296dd3ae"
},
"hackernews-02.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02.ndjson",
"sha256": "5d533b83bcf992201dace88b4d0c0be8b4df5225c6c4b763582d986844bcc23b"
},
"hackernews-03.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/03.ndjson",
"sha256": "f5f351a0d04a8a83643ace12cafa2b7ec8ca8cb7d46fd268e5126492a6c66f2a"
},
"hackernews-04.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/04.ndjson",
"sha256": "ac1915ee7ce53a6718548c255a6cc59969784b2570745dc5b739f714beda291a"
},
"hackernews-05.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/05.ndjson",
"sha256": "be31d5632602f798e62d1c10c83bdfda2b4deaa068477eacde05fdd247572b82"
}
},
"precommands": [
{
"route": "indexes/movies/settings",
"method": "PATCH",
"body": {
"inline": {
"displayedAttributes": [
"title",
"by",
"score",
"time",
"text"
],
"searchableAttributes": [
"title",
"text"
],
"filterableAttributes": [
"by",
"kids",
"parent"
],
"sortableAttributes": [
"score",
"time"
]
}
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-03.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-04.ndjson"
},
"synchronous": "WaitForTask"
}
],
"commands": [
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-05.ndjson"
},
"synchronous": "WaitForTask"
}
]
}

View File

@@ -0,0 +1,111 @@
{
"name": "hackernews.modify_facet_numbers",
"run_count": 3,
"extra_cli_args": [],
"assets": {
"hackernews-01.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01.ndjson",
"sha256": "cd3627b86c064d865b6754848ed0e73ef1d8142752a25e5f0765c3a1296dd3ae"
},
"hackernews-02.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02.ndjson",
"sha256": "5d533b83bcf992201dace88b4d0c0be8b4df5225c6c4b763582d986844bcc23b"
},
"hackernews-03.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/03.ndjson",
"sha256": "f5f351a0d04a8a83643ace12cafa2b7ec8ca8cb7d46fd268e5126492a6c66f2a"
},
"hackernews-04.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/04.ndjson",
"sha256": "ac1915ee7ce53a6718548c255a6cc59969784b2570745dc5b739f714beda291a"
},
"hackernews-05.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/05.ndjson",
"sha256": "be31d5632602f798e62d1c10c83bdfda2b4deaa068477eacde05fdd247572b82"
},
"hackernews-02-modified-filters.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02-modified-filters.ndjson",
"sha256": "7272cbfd41110d32d7fe168424a0000f07589bfe40f664652b34f4f20aaf3802"
}
},
"precommands": [
{
"route": "indexes/movies/settings",
"method": "PATCH",
"body": {
"inline": {
"displayedAttributes": [
"title",
"by",
"score",
"time",
"text"
],
"searchableAttributes": [
"title",
"text"
],
"filterableAttributes": [
"by",
"kids",
"parent"
],
"sortableAttributes": [
"score",
"time"
]
}
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-03.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-04.ndjson"
},
"synchronous": "WaitForTask"
}
],
"commands": [
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02-modified-filters.ndjson"
},
"synchronous": "WaitForTask"
}
]
}

View File

@@ -0,0 +1,111 @@
{
"name": "hackernews.modify_facet_strings",
"run_count": 3,
"extra_cli_args": [],
"assets": {
"hackernews-01.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01.ndjson",
"sha256": "cd3627b86c064d865b6754848ed0e73ef1d8142752a25e5f0765c3a1296dd3ae"
},
"hackernews-02.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02.ndjson",
"sha256": "5d533b83bcf992201dace88b4d0c0be8b4df5225c6c4b763582d986844bcc23b"
},
"hackernews-03.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/03.ndjson",
"sha256": "f5f351a0d04a8a83643ace12cafa2b7ec8ca8cb7d46fd268e5126492a6c66f2a"
},
"hackernews-04.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/04.ndjson",
"sha256": "ac1915ee7ce53a6718548c255a6cc59969784b2570745dc5b739f714beda291a"
},
"hackernews-05.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/05.ndjson",
"sha256": "be31d5632602f798e62d1c10c83bdfda2b4deaa068477eacde05fdd247572b82"
},
"hackernews-01-modified-filters.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01-modified-filters.ndjson",
"sha256": "b80c245ce1b1df80b9b38800f677f3bd11947ebc62716fb108269d50e796c35c"
}
},
"precommands": [
{
"route": "indexes/movies/settings",
"method": "PATCH",
"body": {
"inline": {
"displayedAttributes": [
"title",
"by",
"score",
"time",
"text"
],
"searchableAttributes": [
"title",
"text"
],
"filterableAttributes": [
"by",
"kids",
"parent"
],
"sortableAttributes": [
"score",
"time"
]
}
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-03.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-04.ndjson"
},
"synchronous": "WaitForTask"
}
],
"commands": [
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01-modified-filters.ndjson"
},
"synchronous": "WaitForTask"
}
]
}

View File

@@ -0,0 +1,123 @@
{
"name": "hackernews.modify_searchables",
"run_count": 3,
"extra_cli_args": [],
"assets": {
"hackernews-01.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01.ndjson",
"sha256": "cd3627b86c064d865b6754848ed0e73ef1d8142752a25e5f0765c3a1296dd3ae"
},
"hackernews-02.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02.ndjson",
"sha256": "5d533b83bcf992201dace88b4d0c0be8b4df5225c6c4b763582d986844bcc23b"
},
"hackernews-03.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/03.ndjson",
"sha256": "f5f351a0d04a8a83643ace12cafa2b7ec8ca8cb7d46fd268e5126492a6c66f2a"
},
"hackernews-04.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/04.ndjson",
"sha256": "ac1915ee7ce53a6718548c255a6cc59969784b2570745dc5b739f714beda291a"
},
"hackernews-05.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/05.ndjson",
"sha256": "be31d5632602f798e62d1c10c83bdfda2b4deaa068477eacde05fdd247572b82"
},
"hackernews-01-modified-searchables.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/01-modified-searchables.ndjson",
"sha256": "e5c08710c6af70031ac7212e0ba242c72ef29c8d4e1fce66c789544641452a7c"
},
"hackernews-02-modified-searchables.ndjson": {
"local_location": null,
"remote_location": "https://milli-benchmarks.fra1.digitaloceanspaces.com/bench/datasets/hackernews/modification/02-modified-searchables.ndjson",
"sha256": "098b029851117087b1e26ccb7ac408eda9bba54c3008213a2880d6fab607346e"
}
},
"precommands": [
{
"route": "indexes/movies/settings",
"method": "PATCH",
"body": {
"inline": {
"displayedAttributes": [
"title",
"by",
"score",
"time",
"text"
],
"searchableAttributes": [
"title",
"text"
],
"filterableAttributes": [
"by",
"kids",
"parent"
],
"sortableAttributes": [
"score",
"time"
]
}
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-03.ndjson"
},
"synchronous": "WaitForResponse"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-04.ndjson"
},
"synchronous": "WaitForTask"
}
],
"commands": [
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-01-modified-searchables.ndjson"
},
"synchronous": "WaitForTask"
},
{
"route": "indexes/movies/documents",
"method": "POST",
"body": {
"asset": "hackernews-02-modified-searchables.ndjson"
},
"synchronous": "WaitForTask"
}
]
}