Compare commits

...

13 Commits

Author SHA1 Message Date
9d5e3457e5 Fix clippy 2023-07-27 14:21:19 +02:00
04694071fe Fix the synonyms settings display 2023-07-27 14:12:23 +02:00
b0c1a9504a ensure the synonyms are updated when the tokenizer settings are changed 2023-07-26 09:33:42 +02:00
d57026cd96 Support synonyms sinergies 2023-07-25 15:01:42 +02:00
41c9e8856a Fix test 2023-07-25 10:55:37 +02:00
d4ff59fcf5 Fix clippy 2023-07-24 18:42:26 +02:00
9c485f8563 Make the search and the indexing work 2023-07-24 18:35:20 +02:00
d8d12d5979 Be able to set and reset settings 2023-07-24 17:00:18 +02:00
0597a97c84 Update tests 2023-07-20 11:15:10 +02:00
2dfbb6813a Merge #3913
3913: Expose a Puffin server to profile the indexing process r=Kerollmops a=Kerollmops

This PR exposes a Puffin HTTP server that reports the internal timings of indexing documents, deleting documents, and updating the settings of an index.

<img width="1752" alt="Capture d’écran 2023-07-10 à 18 44 58" src="https://github.com/meilisearch/meilisearch/assets/3610253/a3c7a6bf-db5b-42f4-8be1-c4e31c869843">

## To be done

 - [x] Move the puffin HTTP server under a feature flag.
 - [x] Use [the `puffin::set_scopes_on` function](https://docs.rs/puffin/latest/puffin/fn.set_scopes_on.html) to toggle it (driven directly by the feature flag, as sketched after this list).
     When this function is called with `false`, [a call to `profile_scope!` takes about 1-2 ns](https://docs.rs/puffin/latest/puffin/fn.set_scopes_on.html).
 - [x] Create a _PROFILING.md_ file explaining how to use it.
   - [x] Explain that merging scopes on the interface is not always useful.
 - [x] Add more info on the number of batched tasks (using the `puffin::profile_scope!` macro data).
    - I added more info, but this is ongoing work: we will keep adding details here and there as we need them.
 - [x] Clean up some scopes, and don't touch too much code to inject puffin.
    - I don't think the _index_documents/mod.rs_ functions become noticeably more complex with the scopes added.
 - [x] Think about what we consider a frame: one indexing operation or the whole program? And when should we end the frame?
   - What we consider a frame is one single `IndexScheduler::tick` execution.
   - We can change that later.
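
For reference, here is a condensed sketch of how the toggle and the frames fit together, using the names that appear in the diff below (the real code lives in the Meilisearch `main` function and in `IndexScheduler::tick`; the loop only stands in for the scheduler):

```rust
fn main() -> anyhow::Result<()> {
    // Start the Puffin HTTP server only when the `profile-with-puffin`
    // Cargo feature is enabled; keep the handle alive for the whole run.
    #[cfg(feature = "profile-with-puffin")]
    let _server = puffin_http::Server::new(&format!("0.0.0.0:{}", puffin_http::DEFAULT_PORT))?;

    // Scopes stay compiled in but are only recorded when this is `true`,
    // so the disabled path costs about 1-2 ns per `profile_scope!` call.
    puffin::set_scopes_on(cfg!(feature = "profile-with-puffin"));

    loop {
        // One profiling frame per scheduler tick.
        puffin::GlobalProfiler::lock().new_frame();

        // ... create and process the batch, inside scopes such as:
        puffin::profile_scope!("create_batch");
    }
}
```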

Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-07-19 09:44:01 +00:00
8f589a5cce Introduce a PROFILING.md tutorial to profile Meilisearch 2023-07-18 17:38:13 +02:00
0b8bbd8750 Toggle the puffin profiling with a feature flag 2023-07-18 17:38:13 +02:00
eef95de30e First iteration on exposing puffin profiling 2023-07-18 17:38:13 +02:00
51 changed files with 1891 additions and 172 deletions

39
Cargo.lock generated
View File

@ -1973,6 +1973,7 @@ dependencies = [
"meilisearch-types",
"nelson",
"page_size 0.5.0",
"puffin",
"roaring",
"serde",
"serde_json",
@ -2498,6 +2499,12 @@ dependencies = [
"syn 1.0.109",
]
[[package]]
name = "lz4_flex"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b8c72594ac26bfd34f2d99dfced2edfaddfe8a476e3ff2ca0eb293d925c4f83"
[[package]]
name = "manifest-dir-macros"
version = "0.1.17"
@ -2587,6 +2594,8 @@ dependencies = [
"pin-project-lite",
"platform-dirs",
"prometheus",
"puffin",
"puffin_http",
"rand",
"rayon",
"regex",
@ -2731,6 +2740,7 @@ dependencies = [
"obkv",
"once_cell",
"ordered-float",
"puffin",
"rand",
"rand_pcg",
"rayon",
@ -3256,6 +3266,35 @@ version = "2.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "106dd99e98437432fed6519dedecfade6a06a73bb7b2a1e019fdd2bee5778d94"
[[package]]
name = "puffin"
version = "0.16.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "76425abd4e1a0ad4bd6995dd974b52f414fca9974171df8e3708b3e660d05a21"
dependencies = [
"anyhow",
"bincode",
"byteorder",
"cfg-if",
"instant",
"lz4_flex",
"once_cell",
"parking_lot",
"serde",
]
[[package]]
name = "puffin_http"
version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "13bffc600c35913d282ae1e96a6ffcdf36dc7a7cdb9310e0ba15914d258c8193"
dependencies = [
"anyhow",
"crossbeam-channel",
"log",
"puffin",
]
[[package]]
name = "quote"
version = "1.0.28"

19
PROFILING.md Normal file
View File

@ -0,0 +1,19 @@
# Profiling Meilisearch
Search engine technologies are complex pieces of software that require thorough profiling tools. We chose to use [Puffin](https://github.com/EmbarkStudios/puffin), which the Rust gaming industry uses extensively. You can export and import the profiling reports using the top bar's _File_ menu options.
![An example profiling with Puffin viewer](assets/profiling-example.png)
## Profiling the Indexing Process
When you enable the `profile-with-puffin` feature of Meilisearch, a Puffin HTTP server starts alongside Meilisearch and listens on the default _0.0.0.0:8585_ address. This server records a "frame" every time the `IndexScheduler::tick` method executes.
Once your Meilisearch instance is running and awaiting new indexing operations, [install and run the `puffin_viewer` tool](https://github.com/EmbarkStudios/puffin/tree/main/puffin_viewer) to see the profiling results. We advise running the viewer with the `RUST_LOG=puffin_http::client=debug` environment variable to see the client trying to connect to your server.
One more tip about the Puffin viewer UI: be careful with the _Merge children with same ID_ option. It can hide the exact timings at which events were recorded, so turn it off if you see strange gaps in the flamegraph.
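The indexing code in this PR is instrumented with the `puffin` macros, so adding more detail to a report is a one-liner. A minimal sketch (the helper function below is hypothetical; only the two macros are real `puffin` APIs):

```rust
// Hypothetical helper, shown only to illustrate the macros used
// throughout the indexing pipeline.
fn merge_chunks(chunks: &[Vec<u8>]) -> std::io::Result<()> {
    // Record the whole function as a scope; the string is extra data
    // attached to the scope and visible in puffin_viewer.
    puffin::profile_function!(format!("{} chunks", chunks.len()));

    for _chunk in chunks {
        // A narrower child scope, nested under the function scope.
        puffin::profile_scope!("merge_one_chunk");
        // ... actual merging work ...
    }
    Ok(())
}
```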
## Profiling the Search Process
We still need to take the time to profile the search side of the engine with Puffin. That would mean profiling the filtering phase as well as query parsing, creation, and execution. We could even profile the Actix HTTP server.
The only issue we see is the framing system. Puffin requires a global frame-based profiling phase, which collides with Meilisearch's ability to accept and answer multiple requests on different threads simultaneously.

assets/profiling-example.png — binary image added (1.2 MiB), not shown

View File

@ -261,6 +261,9 @@ pub(crate) mod test {
sortable_attributes: Setting::Set(btreeset! { S("age") }),
ranking_rules: Setting::NotSet,
stop_words: Setting::NotSet,
non_separator_tokens: Setting::NotSet,
separator_tokens: Setting::NotSet,
dictionary: Setting::NotSet,
synonyms: Setting::NotSet,
distinct_attribute: Setting::NotSet,
typo_tolerance: Setting::NotSet,

View File

@ -340,6 +340,9 @@ impl<T> From<v5::Settings<T>> for v6::Settings<v6::Unchecked> {
}
},
stop_words: settings.stop_words.into(),
non_separator_tokens: v6::Setting::NotSet,
separator_tokens: v6::Setting::NotSet,
dictionary: v6::Setting::NotSet,
synonyms: settings.synonyms.into(),
distinct_attribute: settings.distinct_attribute.into(),
typo_tolerance: match settings.typo_tolerance {

View File

@ -22,6 +22,7 @@ log = "0.4.17"
meilisearch-auth = { path = "../meilisearch-auth" }
meilisearch-types = { path = "../meilisearch-types" }
page_size = "0.5.0"
puffin = "0.16.0"
roaring = { version = "0.10.1", features = ["serde"] }
serde = { version = "1.0.160", features = ["derive"] }
serde_json = { version = "1.0.95", features = ["preserve_order"] }

View File

@ -471,6 +471,8 @@ impl IndexScheduler {
#[cfg(test)]
self.maybe_fail(crate::tests::FailureLocation::InsideCreateBatch)?;
puffin::profile_function!();
let enqueued = &self.get_status(rtxn, Status::Enqueued)?;
let to_cancel = self.get_kind(rtxn, Kind::TaskCancelation)? & enqueued;
@ -575,6 +577,9 @@ impl IndexScheduler {
self.maybe_fail(crate::tests::FailureLocation::PanicInsideProcessBatch)?;
self.breakpoint(crate::Breakpoint::InsideProcessBatch);
}
puffin::profile_function!(format!("{:?}", batch));
match batch {
Batch::TaskCancelation { mut task, previous_started_at, previous_processing_tasks } => {
// 1. Retrieve the tasks that matched the query at enqueue-time.
@ -1111,6 +1116,8 @@ impl IndexScheduler {
index: &'i Index,
operation: IndexOperation,
) -> Result<Vec<Task>> {
puffin::profile_function!();
match operation {
IndexOperation::DocumentClear { mut tasks, .. } => {
let count = milli::update::ClearDocuments::new(index_wtxn, index).execute()?;

View File

@ -1032,6 +1032,8 @@ impl IndexScheduler {
self.breakpoint(Breakpoint::Start);
}
puffin::GlobalProfiler::lock().new_frame();
self.cleanup_task_queue()?;
let rtxn = self.env.read_txn().map_err(Error::HeedTransaction)?;

View File

@ -259,6 +259,9 @@ InvalidSettingsRankingRules , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSearchableAttributes , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSortableAttributes , InvalidRequest , BAD_REQUEST ;
InvalidSettingsStopWords , InvalidRequest , BAD_REQUEST ;
InvalidSettingsNonSeparatorTokens , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSeparatorTokens , InvalidRequest , BAD_REQUEST ;
InvalidSettingsDictionary , InvalidRequest , BAD_REQUEST ;
InvalidSettingsSynonyms , InvalidRequest , BAD_REQUEST ;
InvalidSettingsTypoTolerance , InvalidRequest , BAD_REQUEST ;
InvalidState , Internal , INTERNAL_SERVER_ERROR ;

View File

@ -171,6 +171,15 @@ pub struct Settings<T> {
#[deserr(default, error = DeserrJsonError<InvalidSettingsStopWords>)]
pub stop_words: Setting<BTreeSet<String>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsNonSeparatorTokens>)]
pub non_separator_tokens: Setting<BTreeSet<String>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsSeparatorTokens>)]
pub separator_tokens: Setting<BTreeSet<String>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsDictionary>)]
pub dictionary: Setting<BTreeSet<String>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
#[deserr(default, error = DeserrJsonError<InvalidSettingsSynonyms>)]
pub synonyms: Setting<BTreeMap<String, Vec<String>>>,
#[serde(default, skip_serializing_if = "Setting::is_not_set")]
@ -201,6 +210,9 @@ impl Settings<Checked> {
ranking_rules: Setting::Reset,
stop_words: Setting::Reset,
synonyms: Setting::Reset,
non_separator_tokens: Setting::Reset,
separator_tokens: Setting::Reset,
dictionary: Setting::Reset,
distinct_attribute: Setting::Reset,
typo_tolerance: Setting::Reset,
faceting: Setting::Reset,
@ -217,6 +229,9 @@ impl Settings<Checked> {
sortable_attributes,
ranking_rules,
stop_words,
non_separator_tokens,
separator_tokens,
dictionary,
synonyms,
distinct_attribute,
typo_tolerance,
@ -232,6 +247,9 @@ impl Settings<Checked> {
sortable_attributes,
ranking_rules,
stop_words,
non_separator_tokens,
separator_tokens,
dictionary,
synonyms,
distinct_attribute,
typo_tolerance,
@ -274,6 +292,9 @@ impl Settings<Unchecked> {
ranking_rules: self.ranking_rules,
stop_words: self.stop_words,
synonyms: self.synonyms,
non_separator_tokens: self.non_separator_tokens,
separator_tokens: self.separator_tokens,
dictionary: self.dictionary,
distinct_attribute: self.distinct_attribute,
typo_tolerance: self.typo_tolerance,
faceting: self.faceting,
@ -335,6 +356,28 @@ pub fn apply_settings_to_builder(
Setting::NotSet => (),
}
match settings.non_separator_tokens {
Setting::Set(ref non_separator_tokens) => {
builder.set_non_separator_tokens(non_separator_tokens.clone())
}
Setting::Reset => builder.reset_non_separator_tokens(),
Setting::NotSet => (),
}
match settings.separator_tokens {
Setting::Set(ref separator_tokens) => {
builder.set_separator_tokens(separator_tokens.clone())
}
Setting::Reset => builder.reset_separator_tokens(),
Setting::NotSet => (),
}
match settings.dictionary {
Setting::Set(ref dictionary) => builder.set_dictionary(dictionary.clone()),
Setting::Reset => builder.reset_dictionary(),
Setting::NotSet => (),
}
match settings.synonyms {
Setting::Set(ref synonyms) => builder.set_synonyms(synonyms.clone().into_iter().collect()),
Setting::Reset => builder.reset_synonyms(),
@ -459,15 +502,14 @@ pub fn settings(
})
.transpose()?
.unwrap_or_default();
let non_separator_tokens = index.non_separator_tokens(rtxn)?.unwrap_or_default();
let separator_tokens = index.separator_tokens(rtxn)?.unwrap_or_default();
let dictionary = index.dictionary(rtxn)?.unwrap_or_default();
let distinct_field = index.distinct_field(rtxn)?.map(String::from);
// in milli each word in the synonyms map were split on their separator. Since we lost
// this information we are going to put space between words.
let synonyms = index
.synonyms(rtxn)?
.iter()
.map(|(key, values)| (key.join(" "), values.iter().map(|value| value.join(" ")).collect()))
.collect();
let synonyms = index.user_defined_synonyms(rtxn)?;
let min_typo_word_len = MinWordSizeTyposSetting {
one_typo: Setting::Set(index.min_word_len_one_typo(rtxn)?),
@ -520,6 +562,9 @@ pub fn settings(
sortable_attributes: Setting::Set(sortable_attributes),
ranking_rules: Setting::Set(criteria.iter().map(|c| c.clone().into()).collect()),
stop_words: Setting::Set(stop_words),
non_separator_tokens: Setting::Set(non_separator_tokens),
separator_tokens: Setting::Set(separator_tokens),
dictionary: Setting::Set(dictionary),
distinct_attribute: match distinct_field {
Some(field) => Setting::Set(field),
None => Setting::Reset,
@ -642,6 +687,9 @@ pub(crate) mod test {
sortable_attributes: Setting::NotSet,
ranking_rules: Setting::NotSet,
stop_words: Setting::NotSet,
non_separator_tokens: Setting::NotSet,
separator_tokens: Setting::NotSet,
dictionary: Setting::NotSet,
synonyms: Setting::NotSet,
distinct_attribute: Setting::NotSet,
typo_tolerance: Setting::NotSet,
@ -663,6 +711,9 @@ pub(crate) mod test {
sortable_attributes: Setting::NotSet,
ranking_rules: Setting::NotSet,
stop_words: Setting::NotSet,
non_separator_tokens: Setting::NotSet,
separator_tokens: Setting::NotSet,
dictionary: Setting::NotSet,
synonyms: Setting::NotSet,
distinct_attribute: Setting::NotSet,
typo_tolerance: Setting::NotSet,

View File

@ -67,6 +67,8 @@ permissive-json-pointer = { path = "../permissive-json-pointer" }
pin-project-lite = "0.2.9"
platform-dirs = "0.3.0"
prometheus = { version = "0.13.3", features = ["process"] }
puffin = "0.16.0"
puffin_http = { version = "0.13.0", optional = true }
rand = "0.8.5"
rayon = "1.7.0"
regex = "1.7.3"
@ -133,6 +135,7 @@ zip = { version = "0.6.4", optional = true }
[features]
default = ["analytics", "meilisearch-types/all-tokenizations", "mini-dashboard"]
analytics = ["segment"]
profile-with-puffin = ["dep:puffin_http"]
mini-dashboard = [
"actix-web-static-files",
"static-files",

View File

@ -29,6 +29,10 @@ fn setup(opt: &Opt) -> anyhow::Result<()> {
async fn main() -> anyhow::Result<()> {
let (opt, config_read_from) = Opt::try_build()?;
#[cfg(feature = "profile-with-puffin")]
let _server = puffin_http::Server::new(&format!("0.0.0.0:{}", puffin_http::DEFAULT_PORT))?;
puffin::set_scopes_on(cfg!(feature = "profile-with-puffin"));
anyhow::ensure!(
!(cfg!(windows) && opt.experimental_reduce_indexing_memory_usage),
"The `experimental-reduce-indexing-memory-usage` flag is not supported on Windows"

View File

@ -309,6 +309,81 @@ make_setting_route!(
}
);
make_setting_route!(
"/non-separator-tokens",
put,
std::collections::BTreeSet<String>,
meilisearch_types::deserr::DeserrJsonError<
meilisearch_types::error::deserr_codes::InvalidSettingsNonSeparatorTokens,
>,
non_separator_tokens,
"nonSeparatorTokens",
analytics,
|non_separator_tokens: &Option<std::collections::BTreeSet<String>>, req: &HttpRequest| {
use serde_json::json;
analytics.publish(
"nonSeparatorTokens Updated".to_string(),
json!({
"non_separator_tokens": {
"total": non_separator_tokens.as_ref().map(|non_separator_tokens| non_separator_tokens.len()),
},
}),
Some(req),
);
}
);
make_setting_route!(
"/separator-tokens",
put,
std::collections::BTreeSet<String>,
meilisearch_types::deserr::DeserrJsonError<
meilisearch_types::error::deserr_codes::InvalidSettingsSeparatorTokens,
>,
separator_tokens,
"separatorTokens",
analytics,
|separator_tokens: &Option<std::collections::BTreeSet<String>>, req: &HttpRequest| {
use serde_json::json;
analytics.publish(
"separatorTokens Updated".to_string(),
json!({
"separator_tokens": {
"total": separator_tokens.as_ref().map(|separator_tokens| separator_tokens.len()),
},
}),
Some(req),
);
}
);
make_setting_route!(
"/dictionary",
put,
std::collections::BTreeSet<String>,
meilisearch_types::deserr::DeserrJsonError<
meilisearch_types::error::deserr_codes::InvalidSettingsDictionary,
>,
dictionary,
"dictionary",
analytics,
|dictionary: &Option<std::collections::BTreeSet<String>>, req: &HttpRequest| {
use serde_json::json;
analytics.publish(
"dictionary Updated".to_string(),
json!({
"dictionary": {
"total": dictionary.as_ref().map(|dictionary| dictionary.len()),
},
}),
Some(req),
);
}
);
make_setting_route!(
"/synonyms",
put,

View File

@ -491,6 +491,20 @@ pub fn perform_search(
tokenizer_builder.allow_list(&script_lang_map);
}
let separators = index.allowed_separators(&rtxn)?;
let separators: Option<Vec<_>> =
separators.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref separators) = separators {
tokenizer_builder.separators(separators);
}
let dictionary = index.dictionary(&rtxn)?;
let dictionary: Option<Vec<_>> =
dictionary.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref dictionary) = dictionary {
tokenizer_builder.words_dict(dictionary);
}
let mut formatter_builder = MatcherBuilder::new(matching_words, tokenizer_builder.build());
formatter_builder.crop_marker(query.crop_marker);
formatter_builder.highlight_prefix(query.highlight_pre_tag);

File diff suppressed because it is too large

View File

@ -16,6 +16,9 @@ static DEFAULT_SETTINGS_VALUES: Lazy<HashMap<&'static str, Value>> = Lazy::new(|
json!(["words", "typo", "proximity", "attribute", "sort", "exactness"]),
);
map.insert("stop_words", json!([]));
map.insert("non_separator_tokens", json!([]));
map.insert("separator_tokens", json!([]));
map.insert("dictionary", json!([]));
map.insert("synonyms", json!({}));
map.insert(
"faceting",
@ -51,7 +54,7 @@ async fn get_settings() {
let (response, code) = index.settings().await;
assert_eq!(code, 200);
let settings = response.as_object().unwrap();
assert_eq!(settings.keys().len(), 11);
assert_eq!(settings.keys().len(), 14);
assert_eq!(settings["displayedAttributes"], json!(["*"]));
assert_eq!(settings["searchableAttributes"], json!(["*"]));
assert_eq!(settings["filterableAttributes"], json!([]));
@ -62,6 +65,9 @@ async fn get_settings() {
json!(["words", "typo", "proximity", "attribute", "sort", "exactness"])
);
assert_eq!(settings["stopWords"], json!([]));
assert_eq!(settings["nonSeparatorTokens"], json!([]));
assert_eq!(settings["separatorTokens"], json!([]));
assert_eq!(settings["dictionary"], json!([]));
assert_eq!(
settings["faceting"],
json!({

View File

@ -1,3 +1,4 @@
mod distinct;
mod errors;
mod get_settings;
mod tokenizer_customization;

View File

@ -0,0 +1,467 @@
use meili_snap::{json_string, snapshot};
use serde_json::json;
use crate::common::Server;
#[actix_rt::test]
async fn set_and_reset() {
let server = Server::new().await;
let index = server.index("test");
let (_response, _code) = index
.update_settings(json!({
"nonSeparatorTokens": ["#", "&"],
"separatorTokens": ["&sep", "<br/>"],
"dictionary": ["J.R.R.", "J. R. R."],
}))
.await;
index.wait_task(0).await;
let (response, _) = index.settings().await;
snapshot!(json_string!(response["nonSeparatorTokens"]), @r###"
[
"#",
"&"
]
"###);
snapshot!(json_string!(response["separatorTokens"]), @r###"
[
"&sep",
"<br/>"
]
"###);
snapshot!(json_string!(response["dictionary"]), @r###"
[
"J. R. R.",
"J.R.R."
]
"###);
index
.update_settings(json!({
"nonSeparatorTokens": null,
"separatorTokens": null,
"dictionary": null,
}))
.await;
index.wait_task(1).await;
let (response, _) = index.settings().await;
snapshot!(json_string!(response["nonSeparatorTokens"]), @"[]");
snapshot!(json_string!(response["separatorTokens"]), @"[]");
snapshot!(json_string!(response["dictionary"]), @"[]");
}
#[actix_rt::test]
async fn set_and_search() {
let documents = json!([
{
"id": 1,
"content": "Mac & cheese",
},
{
"id": 2,
"content": "G#D#G#D#G#C#D#G#C#",
},
{
"id": 3,
"content": "Mac&sep&&sepcheese",
},
]);
let server = Server::new().await;
let index = server.index("test");
index.add_documents(documents, None).await;
index.wait_task(0).await;
let (_response, _code) = index
.update_settings(json!({
"nonSeparatorTokens": ["#", "&"],
"separatorTokens": ["<br/>", "&sep"],
"dictionary": ["#", "A#", "B#", "C#", "D#", "E#", "F#", "G#"],
}))
.await;
index.wait_task(1).await;
index
.search(json!({"q": "&", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 1,
"content": "Mac & cheese",
"_formatted": {
"id": "1",
"content": "Mac <em>&</em> cheese"
}
},
{
"id": 3,
"content": "Mac&sep&&sepcheese",
"_formatted": {
"id": "3",
"content": "Mac&sep<em>&</em>&sepcheese"
}
}
]
"###);
})
.await;
index
.search(
json!({"q": "Mac & cheese", "attributesToHighlight": ["content"]}),
|response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 1,
"content": "Mac & cheese",
"_formatted": {
"id": "1",
"content": "<em>Mac</em> <em>&</em> <em>cheese</em>"
}
},
{
"id": 3,
"content": "Mac&sep&&sepcheese",
"_formatted": {
"id": "3",
"content": "<em>Mac</em>&sep<em>&</em>&sep<em>cheese</em>"
}
}
]
"###);
},
)
.await;
index
.search(
json!({"q": "Mac&sep&&sepcheese", "attributesToHighlight": ["content"]}),
|response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 1,
"content": "Mac & cheese",
"_formatted": {
"id": "1",
"content": "<em>Mac</em> <em>&</em> <em>cheese</em>"
}
},
{
"id": 3,
"content": "Mac&sep&&sepcheese",
"_formatted": {
"id": "3",
"content": "<em>Mac</em>&sep<em>&</em>&sep<em>cheese</em>"
}
}
]
"###);
},
)
.await;
index
.search(json!({"q": "C#D#G", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 2,
"content": "G#D#G#D#G#C#D#G#C#",
"_formatted": {
"id": "2",
"content": "<em>G</em>#<em>D#</em><em>G</em>#<em>D#</em><em>G</em>#<em>C#</em><em>D#</em><em>G</em>#<em>C#</em>"
}
}
]
"###);
})
.await;
index
.search(json!({"q": "#", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @"[]");
})
.await;
}
#[actix_rt::test]
async fn advanced_synergies() {
let documents = json!([
{
"id": 1,
"content": "J.R.R. Tolkien",
},
{
"id": 2,
"content": "J. R. R. Tolkien",
},
{
"id": 3,
"content": "jrr Tolkien",
},
{
"id": 4,
"content": "J.K. Rowlings",
},
{
"id": 5,
"content": "J. K. Rowlings",
},
{
"id": 6,
"content": "jk Rowlings",
},
]);
let server = Server::new().await;
let index = server.index("test");
index.add_documents(documents, None).await;
index.wait_task(0).await;
let (_response, _code) = index
.update_settings(json!({
"dictionary": ["J.R.R.", "J. R. R."],
"synonyms": {
"J.R.R.": ["jrr", "J. R. R."],
"J. R. R.": ["jrr", "J.R.R."],
"jrr": ["J.R.R.", "J. R. R."],
"J.K.": ["jk", "J. K."],
"J. K.": ["jk", "J.K."],
"jk": ["J.K.", "J. K."],
}
}))
.await;
index.wait_task(1).await;
index
.search(json!({"q": "J.R.R.", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 1,
"content": "J.R.R. Tolkien",
"_formatted": {
"id": "1",
"content": "<em>J.R.R.</em> Tolkien"
}
},
{
"id": 2,
"content": "J. R. R. Tolkien",
"_formatted": {
"id": "2",
"content": "<em>J. R. R.</em> Tolkien"
}
},
{
"id": 3,
"content": "jrr Tolkien",
"_formatted": {
"id": "3",
"content": "<em>jrr</em> Tolkien"
}
}
]
"###);
})
.await;
index
.search(json!({"q": "jrr", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 3,
"content": "jrr Tolkien",
"_formatted": {
"id": "3",
"content": "<em>jrr</em> Tolkien"
}
},
{
"id": 1,
"content": "J.R.R. Tolkien",
"_formatted": {
"id": "1",
"content": "<em>J.R.R.</em> Tolkien"
}
},
{
"id": 2,
"content": "J. R. R. Tolkien",
"_formatted": {
"id": "2",
"content": "<em>J. R. R.</em> Tolkien"
}
}
]
"###);
})
.await;
index
.search(json!({"q": "J. R. R.", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 2,
"content": "J. R. R. Tolkien",
"_formatted": {
"id": "2",
"content": "<em>J. R. R.</em> Tolkien"
}
},
{
"id": 1,
"content": "J.R.R. Tolkien",
"_formatted": {
"id": "1",
"content": "<em>J.R.R.</em> Tolkien"
}
},
{
"id": 3,
"content": "jrr Tolkien",
"_formatted": {
"id": "3",
"content": "<em>jrr</em> Tolkien"
}
}
]
"###);
})
.await;
// Only update dictionary, the synonyms should be recomputed.
let (_response, _code) = index
.update_settings(json!({
"dictionary": ["J.R.R.", "J. R. R.", "J.K.", "J. K."],
}))
.await;
index.wait_task(2).await;
index
.search(json!({"q": "jk", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 6,
"content": "jk Rowlings",
"_formatted": {
"id": "6",
"content": "<em>jk</em> Rowlings"
}
},
{
"id": 4,
"content": "J.K. Rowlings",
"_formatted": {
"id": "4",
"content": "<em>J.K.</em> Rowlings"
}
},
{
"id": 5,
"content": "J. K. Rowlings",
"_formatted": {
"id": "5",
"content": "<em>J. K.</em> Rowlings"
}
}
]
"###);
})
.await;
index
.search(json!({"q": "J.K.", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 4,
"content": "J.K. Rowlings",
"_formatted": {
"id": "4",
"content": "<em>J.K.</em> Rowlings"
}
},
{
"id": 5,
"content": "J. K. Rowlings",
"_formatted": {
"id": "5",
"content": "<em>J. K.</em> Rowlings"
}
},
{
"id": 6,
"content": "jk Rowlings",
"_formatted": {
"id": "6",
"content": "<em>jk</em> Rowlings"
}
}
]
"###);
})
.await;
index
.search(json!({"q": "J. K.", "attributesToHighlight": ["content"]}), |response, code| {
snapshot!(code, @"200 OK");
snapshot!(json_string!(response["hits"]), @r###"
[
{
"id": 5,
"content": "J. K. Rowlings",
"_formatted": {
"id": "5",
"content": "<em>J. K.</em> Rowlings"
}
},
{
"id": 4,
"content": "J.K. Rowlings",
"_formatted": {
"id": "4",
"content": "<em>J.K.</em> Rowlings"
}
},
{
"id": 6,
"content": "jk Rowlings",
"_formatted": {
"id": "6",
"content": "<em>jk</em> Rowlings"
}
},
{
"id": 2,
"content": "J. R. R. Tolkien",
"_formatted": {
"id": "2",
"content": "<em>J. R.</em> R. Tolkien"
}
}
]
"###);
})
.await;
}

View File

@ -67,6 +67,9 @@ filter-parser = { path = "../filter-parser" }
# documents words self-join
itertools = "0.10.5"
# profiling
puffin = "0.16.0"
# logging
log = "0.4.17"
logging_timer = "1.1.0"

View File

@ -1,5 +1,5 @@
use std::borrow::Cow;
use std::collections::{HashMap, HashSet};
use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet};
use std::fs::File;
use std::mem::size_of;
use std::path::Path;
@ -60,8 +60,12 @@ pub mod main_key {
pub const USER_DEFINED_SEARCHABLE_FIELDS_KEY: &str = "user-defined-searchable-fields";
pub const SOFT_EXTERNAL_DOCUMENTS_IDS_KEY: &str = "soft-external-documents-ids";
pub const STOP_WORDS_KEY: &str = "stop-words";
pub const NON_SEPARATOR_TOKENS_KEY: &str = "non-separator-tokens";
pub const SEPARATOR_TOKENS_KEY: &str = "separator-tokens";
pub const DICTIONARY_KEY: &str = "dictionary";
pub const STRING_FACETED_DOCUMENTS_IDS_PREFIX: &str = "string-faceted-documents-ids";
pub const SYNONYMS_KEY: &str = "synonyms";
pub const USER_DEFINED_SYNONYMS_KEY: &str = "user-defined-synonyms";
pub const WORDS_FST_KEY: &str = "words-fst";
pub const WORDS_PREFIXES_FST_KEY: &str = "words-prefixes-fst";
pub const CREATED_AT_KEY: &str = "created-at";
@ -1048,18 +1052,116 @@ impl Index {
}
}
/* non separator tokens */
pub(crate) fn put_non_separator_tokens(
&self,
wtxn: &mut RwTxn,
set: &BTreeSet<String>,
) -> heed::Result<()> {
self.main.put::<_, Str, SerdeBincode<_>>(wtxn, main_key::NON_SEPARATOR_TOKENS_KEY, set)
}
pub(crate) fn delete_non_separator_tokens(&self, wtxn: &mut RwTxn) -> heed::Result<bool> {
self.main.delete::<_, Str>(wtxn, main_key::NON_SEPARATOR_TOKENS_KEY)
}
pub fn non_separator_tokens(&self, rtxn: &RoTxn) -> Result<Option<BTreeSet<String>>> {
Ok(self.main.get::<_, Str, SerdeBincode<BTreeSet<String>>>(
rtxn,
main_key::NON_SEPARATOR_TOKENS_KEY,
)?)
}
/* separator tokens */
pub(crate) fn put_separator_tokens(
&self,
wtxn: &mut RwTxn,
set: &BTreeSet<String>,
) -> heed::Result<()> {
self.main.put::<_, Str, SerdeBincode<_>>(wtxn, main_key::SEPARATOR_TOKENS_KEY, set)
}
pub(crate) fn delete_separator_tokens(&self, wtxn: &mut RwTxn) -> heed::Result<bool> {
self.main.delete::<_, Str>(wtxn, main_key::SEPARATOR_TOKENS_KEY)
}
pub fn separator_tokens(&self, rtxn: &RoTxn) -> Result<Option<BTreeSet<String>>> {
Ok(self
.main
.get::<_, Str, SerdeBincode<BTreeSet<String>>>(rtxn, main_key::SEPARATOR_TOKENS_KEY)?)
}
/* separators easing method */
pub fn allowed_separators(&self, rtxn: &RoTxn) -> Result<Option<BTreeSet<String>>> {
let default_separators =
charabia::separators::DEFAULT_SEPARATORS.iter().map(|s| s.to_string());
let mut separators: Option<BTreeSet<_>> = None;
if let Some(mut separator_tokens) = self.separator_tokens(rtxn)? {
separator_tokens.extend(default_separators.clone());
separators = Some(separator_tokens);
}
if let Some(non_separator_tokens) = self.non_separator_tokens(rtxn)? {
separators = separators
.or_else(|| Some(default_separators.collect()))
.map(|separators| &separators - &non_separator_tokens);
}
Ok(separators)
}
/* dictionary */
pub(crate) fn put_dictionary(
&self,
wtxn: &mut RwTxn,
set: &BTreeSet<String>,
) -> heed::Result<()> {
self.main.put::<_, Str, SerdeBincode<_>>(wtxn, main_key::DICTIONARY_KEY, set)
}
pub(crate) fn delete_dictionary(&self, wtxn: &mut RwTxn) -> heed::Result<bool> {
self.main.delete::<_, Str>(wtxn, main_key::DICTIONARY_KEY)
}
pub fn dictionary(&self, rtxn: &RoTxn) -> Result<Option<BTreeSet<String>>> {
Ok(self
.main
.get::<_, Str, SerdeBincode<BTreeSet<String>>>(rtxn, main_key::DICTIONARY_KEY)?)
}
/* synonyms */
pub(crate) fn put_synonyms(
&self,
wtxn: &mut RwTxn,
synonyms: &HashMap<Vec<String>, Vec<Vec<String>>>,
user_defined_synonyms: &BTreeMap<String, Vec<String>>,
) -> heed::Result<()> {
self.main.put::<_, Str, SerdeBincode<_>>(wtxn, main_key::SYNONYMS_KEY, synonyms)
self.main.put::<_, Str, SerdeBincode<_>>(wtxn, main_key::SYNONYMS_KEY, synonyms)?;
self.main.put::<_, Str, SerdeBincode<_>>(
wtxn,
main_key::USER_DEFINED_SYNONYMS_KEY,
user_defined_synonyms,
)
}
pub(crate) fn delete_synonyms(&self, wtxn: &mut RwTxn) -> heed::Result<bool> {
self.main.delete::<_, Str>(wtxn, main_key::SYNONYMS_KEY)
self.main.delete::<_, Str>(wtxn, main_key::SYNONYMS_KEY)?;
self.main.delete::<_, Str>(wtxn, main_key::USER_DEFINED_SYNONYMS_KEY)
}
pub fn user_defined_synonyms(
&self,
rtxn: &RoTxn,
) -> heed::Result<BTreeMap<String, Vec<String>>> {
Ok(self
.main
.get::<_, Str, SerdeBincode<_>>(rtxn, main_key::USER_DEFINED_SYNONYMS_KEY)?
.unwrap_or_default())
}
pub fn synonyms(&self, rtxn: &RoTxn) -> heed::Result<HashMap<Vec<String>, Vec<Vec<String>>>> {

View File

@ -479,6 +479,20 @@ pub fn execute_search(
tokbuilder.stop_words(stop_words);
}
let separators = ctx.index.allowed_separators(ctx.txn)?;
let separators: Option<Vec<_>> =
separators.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref separators) = separators {
tokbuilder.separators(separators);
}
let dictionary = ctx.index.dictionary(ctx.txn)?;
let dictionary: Option<Vec<_>> =
dictionary.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref dictionary) = dictionary {
tokbuilder.words_dict(dictionary);
}
let script_lang_map = ctx.index.script_language(ctx.txn)?;
if !script_lang_map.is_empty() {
tokbuilder.allow_list(&script_lang_map);

View File

@ -2,7 +2,7 @@ use std::io::Cursor;
use big_s::S;
use heed::EnvOpenOptions;
use maplit::{hashmap, hashset};
use maplit::{btreemap, hashset};
use crate::documents::{DocumentsBatchBuilder, DocumentsBatchReader};
use crate::update::{IndexDocuments, IndexDocumentsConfig, IndexerConfig, Settings};
@ -33,7 +33,7 @@ pub fn setup_search_index_with_criteria(criteria: &[Criterion]) -> Index {
S("tag"),
S("asc_desc_rank"),
});
builder.set_synonyms(hashmap! {
builder.set_synonyms(btreemap! {
S("hello") => vec![S("good morning")],
S("world") => vec![S("earth")],
S("america") => vec![S("the united states")],

View File

@ -15,7 +15,7 @@ they store fewer sprximities than the regular word sprximity DB.
*/
use std::collections::HashMap;
use std::collections::BTreeMap;
use crate::index::tests::TempIndex;
use crate::search::new::tests::collect_field_values;
@ -336,7 +336,7 @@ fn test_proximity_split_word() {
index
.update_settings(|s| {
let mut syns = HashMap::new();
let mut syns = BTreeMap::new();
syns.insert("xyz".to_owned(), vec!["sun flower".to_owned()]);
s.set_synonyms(syns);
})

View File

@ -18,7 +18,7 @@ if `words` doesn't exist before it.
14. Synonyms cost nothing according to the typo ranking rule
*/
use std::collections::HashMap;
use std::collections::BTreeMap;
use crate::index::tests::TempIndex;
use crate::search::new::tests::collect_field_values;
@ -591,7 +591,7 @@ fn test_typo_synonyms() {
.update_settings(|s| {
s.set_criteria(vec![Criterion::Typo]);
let mut synonyms = HashMap::new();
let mut synonyms = BTreeMap::new();
synonyms.insert("lackadaisical".to_owned(), vec!["lazy".to_owned()]);
synonyms.insert("fast brownish".to_owned(), vec!["quick brown".to_owned()]);

View File

@ -15,6 +15,8 @@ impl<'t, 'u, 'i> ClearDocuments<'t, 'u, 'i> {
}
pub fn execute(self) -> Result<u64> {
puffin::profile_function!();
self.index.set_updated_at(self.wtxn, &OffsetDateTime::now_utc())?;
let Index {
env: _env,

View File

@ -110,6 +110,8 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
Some(docid)
}
pub fn execute(self) -> Result<DocumentDeletionResult> {
puffin::profile_function!();
let DetailedDocumentDeletionResult { deleted_documents, remaining_documents } =
self.execute_inner()?;

View File

@ -31,6 +31,8 @@ pub fn enrich_documents_batch<R: Read + Seek>(
autogenerate_docids: bool,
reader: DocumentsBatchReader<R>,
) -> Result<StdResult<EnrichedDocumentsBatchReader<R>, UserError>> {
puffin::profile_function!();
let (mut cursor, mut documents_batch_index) = reader.into_cursor_and_fields_index();
let mut external_ids = tempfile::tempfile().map(grenad::Writer::new)?;

View File

@ -28,8 +28,12 @@ pub fn extract_docid_word_positions<R: io::Read + io::Seek>(
indexer: GrenadParameters,
searchable_fields: &Option<HashSet<FieldId>>,
stop_words: Option<&fst::Set<&[u8]>>,
allowed_separators: Option<&Vec<&str>>,
dictionary: Option<&Vec<&str>>,
max_positions_per_attributes: Option<u32>,
) -> Result<(RoaringBitmap, grenad::Reader<File>, ScriptLanguageDocidsMap)> {
puffin::profile_function!();
let max_positions_per_attributes = max_positions_per_attributes
.map_or(MAX_POSITION_PER_ATTRIBUTE, |max| max.min(MAX_POSITION_PER_ATTRIBUTE));
let max_memory = indexer.max_memory_by_thread();
@ -50,6 +54,14 @@ pub fn extract_docid_word_positions<R: io::Read + io::Seek>(
if let Some(stop_words) = stop_words {
tokenizer_builder.stop_words(stop_words);
}
if let Some(dictionary) = dictionary {
// let dictionary: Vec<_> = dictionary.iter().map(String::as_str).collect();
tokenizer_builder.words_dict(dictionary.as_slice());
}
if let Some(separators) = allowed_separators {
// let separators: Vec<_> = separators.iter().map(String::as_str).collect();
tokenizer_builder.separators(separators.as_slice());
}
let tokenizer = tokenizer_builder.build();
let mut cursor = obkv_documents.into_cursor()?;

View File

@ -20,6 +20,8 @@ pub fn extract_facet_number_docids<R: io::Read + io::Seek>(
docid_fid_facet_number: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut facet_number_docids_sorter = create_sorter(

View File

@ -18,6 +18,8 @@ pub fn extract_facet_string_docids<R: io::Read + io::Seek>(
docid_fid_facet_string: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut facet_string_docids_sorter = create_sorter(

View File

@ -34,6 +34,8 @@ pub fn extract_fid_docid_facet_values<R: io::Read + io::Seek>(
indexer: GrenadParameters,
faceted_fields: &HashSet<FieldId>,
) -> Result<ExtractedFacetValues> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut fid_docid_facet_numbers_sorter = create_sorter(

View File

@ -22,6 +22,8 @@ pub fn extract_fid_word_count_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut fid_word_count_docids_sorter = create_sorter(

View File

@ -19,6 +19,8 @@ pub fn extract_geo_points<R: io::Read + io::Seek>(
primary_key_id: FieldId,
(lat_fid, lng_fid): (FieldId, FieldId),
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let mut writer = create_writer(
indexer.chunk_compression_type,
indexer.chunk_compression_level,

View File

@ -19,6 +19,8 @@ pub fn extract_vector_points<R: io::Read + io::Seek>(
primary_key_id: FieldId,
vectors_fid: FieldId,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let mut writer = create_writer(
indexer.chunk_compression_type,
indexer.chunk_compression_level,

View File

@ -27,6 +27,8 @@ pub fn extract_word_docids<R: io::Read + io::Seek>(
indexer: GrenadParameters,
exact_attributes: &HashSet<FieldId>,
) -> Result<(grenad::Reader<File>, grenad::Reader<File>)> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut word_docids_sorter = create_sorter(

View File

@ -15,6 +15,8 @@ pub fn extract_word_fid_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut word_fid_docids_sorter = create_sorter(

View File

@ -21,6 +21,8 @@ pub fn extract_word_pair_proximity_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut word_pair_proximity_docids_sorter = create_sorter(

View File

@ -18,6 +18,8 @@ pub fn extract_word_position_docids<R: io::Read + io::Seek>(
docid_word_positions: grenad::Reader<R>,
indexer: GrenadParameters,
) -> Result<grenad::Reader<File>> {
puffin::profile_function!();
let max_memory = indexer.max_memory_by_thread();
let mut word_position_docids_sorter = create_sorter(

View File

@ -49,9 +49,13 @@ pub(crate) fn data_from_obkv_documents(
geo_fields_ids: Option<(FieldId, FieldId)>,
vectors_field_id: Option<FieldId>,
stop_words: Option<fst::Set<&[u8]>>,
allowed_separators: Option<Vec<&str>>,
dictionary: Option<Vec<&str>>,
max_positions_per_attributes: Option<u32>,
exact_attributes: HashSet<FieldId>,
) -> Result<()> {
puffin::profile_function!();
original_obkv_chunks
.par_bridge()
.map(|original_documents_chunk| {
@ -74,6 +78,8 @@ pub(crate) fn data_from_obkv_documents(
geo_fields_ids,
vectors_field_id,
&stop_words,
&allowed_separators,
&dictionary,
max_positions_per_attributes,
)
})
@ -238,11 +244,13 @@ fn spawn_extraction_task<FE, FS, M>(
M::Output: Send,
{
rayon::spawn(move || {
puffin::profile_scope!("extract_multiple_chunks", name);
let chunks: Result<M> =
chunks.into_par_iter().map(|chunk| extract_fn(chunk, indexer)).collect();
rayon::spawn(move || match chunks {
Ok(chunks) => {
debug!("merge {} database", name);
puffin::profile_scope!("merge_multiple_chunks", name);
let reader = chunks.merge(merge_fn, &indexer);
let _ = lmdb_writer_sx.send(reader.map(serialize_fn));
}
@ -285,6 +293,8 @@ fn send_and_extract_flattened_documents_data(
geo_fields_ids: Option<(FieldId, FieldId)>,
vectors_field_id: Option<FieldId>,
stop_words: &Option<fst::Set<&[u8]>>,
allowed_separators: &Option<Vec<&str>>,
dictionary: &Option<Vec<&str>>,
max_positions_per_attributes: Option<u32>,
) -> Result<(
grenad::Reader<CursorClonableMmap>,
@ -340,6 +350,8 @@ fn send_and_extract_flattened_documents_data(
indexer,
searchable_fields,
stop_words.as_ref(),
allowed_separators.as_ref(),
dictionary.as_ref(),
max_positions_per_attributes,
)?;

View File

@ -214,6 +214,7 @@ pub fn sorter_into_lmdb_database(
sorter: Sorter<MergeFn>,
merge: MergeFn,
) -> Result<()> {
puffin::profile_function!();
debug!("Writing MTBL sorter...");
let before = Instant::now();

View File

@ -137,6 +137,8 @@ where
mut self,
reader: DocumentsBatchReader<R>,
) -> Result<(Self, StdResult<u64, UserError>)> {
puffin::profile_function!();
// Early return when there is no document to add
if reader.is_empty() {
return Ok((self, Ok(0)));
@ -175,6 +177,8 @@ where
mut self,
to_delete: Vec<String>,
) -> Result<(Self, StdResult<u64, UserError>)> {
puffin::profile_function!();
// Early return when there is no document to add
if to_delete.is_empty() {
return Ok((self, Ok(0)));
@ -194,6 +198,8 @@ where
#[logging_timer::time("IndexDocuments::{}")]
pub fn execute(mut self) -> Result<DocumentAdditionResult> {
puffin::profile_function!();
if self.added_documents == 0 {
let number_of_documents = self.index.number_of_documents(self.wtxn)?;
return Ok(DocumentAdditionResult { indexed_documents: 0, number_of_documents });
@ -232,6 +238,8 @@ where
FP: Fn(UpdateIndexingStep) + Sync,
FA: Fn() -> bool + Sync,
{
puffin::profile_function!();
let TransformOutput {
primary_key,
fields_ids_map,
@ -308,6 +316,12 @@ where
let vectors_field_id = self.index.fields_ids_map(self.wtxn)?.id("_vectors");
let stop_words = self.index.stop_words(self.wtxn)?;
let separators = self.index.allowed_separators(self.wtxn)?;
let separators: Option<Vec<_>> =
separators.as_ref().map(|x| x.iter().map(String::as_str).collect());
let dictionary = self.index.dictionary(self.wtxn)?;
let dictionary: Option<Vec<_>> =
dictionary.as_ref().map(|x| x.iter().map(String::as_str).collect());
let exact_attributes = self.index.exact_attributes_ids(self.wtxn)?;
let pool_params = GrenadParameters {
@ -322,6 +336,7 @@ where
// Run extraction pipeline in parallel.
pool.install(|| {
puffin::profile_scope!("extract_and_send_grenad_chunks");
// split obkv file into several chunks
let original_chunk_iter =
grenad_obkv_into_chunks(original_documents, pool_params, documents_chunk_size);
@ -344,6 +359,8 @@ where
geo_fields_ids,
vectors_field_id,
stop_words,
separators,
dictionary,
max_positions_per_attributes,
exact_attributes,
)
@ -477,6 +494,8 @@ where
FP: Fn(UpdateIndexingStep) + Sync,
FA: Fn() -> bool + Sync,
{
puffin::profile_function!();
// Merged databases are already been indexed, we start from this count;
let mut databases_seen = MERGED_DATABASE_COUNT;
@ -511,26 +530,36 @@ where
return Err(Error::InternalError(InternalError::AbortedIndexation));
}
let current_prefix_fst = self.index.words_prefixes_fst(self.wtxn)?;
let current_prefix_fst;
let common_prefix_fst_words_tmp;
let common_prefix_fst_words: Vec<_>;
let new_prefix_fst_words;
let del_prefix_fst_words;
// We retrieve the common words between the previous and new prefix word fst.
let common_prefix_fst_words = fst_stream_into_vec(
previous_words_prefixes_fst.op().add(&current_prefix_fst).intersection(),
);
let common_prefix_fst_words: Vec<_> = common_prefix_fst_words
.as_slice()
.linear_group_by_key(|x| x.chars().next().unwrap())
.collect();
{
puffin::profile_scope!("compute_prefix_diffs");
// We retrieve the newly added words between the previous and new prefix word fst.
let new_prefix_fst_words = fst_stream_into_vec(
current_prefix_fst.op().add(&previous_words_prefixes_fst).difference(),
);
current_prefix_fst = self.index.words_prefixes_fst(self.wtxn)?;
// We compute the set of prefixes that are no more part of the prefix fst.
let del_prefix_fst_words = fst_stream_into_hashset(
previous_words_prefixes_fst.op().add(&current_prefix_fst).difference(),
);
// We retrieve the common words between the previous and new prefix word fst.
common_prefix_fst_words_tmp = fst_stream_into_vec(
previous_words_prefixes_fst.op().add(&current_prefix_fst).intersection(),
);
common_prefix_fst_words = common_prefix_fst_words_tmp
.as_slice()
.linear_group_by_key(|x| x.chars().next().unwrap())
.collect();
// We retrieve the newly added words between the previous and new prefix word fst.
new_prefix_fst_words = fst_stream_into_vec(
current_prefix_fst.op().add(&previous_words_prefixes_fst).difference(),
);
// We compute the set of prefixes that are no more part of the prefix fst.
del_prefix_fst_words = fst_stream_into_hashset(
previous_words_prefixes_fst.op().add(&current_prefix_fst).difference(),
);
}
databases_seen += 1;
(self.progress)(UpdateIndexingStep::MergeDataIntoFinalDatabase {
@ -668,6 +697,8 @@ fn execute_word_prefix_docids(
common_prefix_fst_words: &[&[String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
puffin::profile_function!();
let cursor = reader.into_cursor()?;
let mut builder = WordPrefixDocids::new(txn, word_docids_db, word_prefix_docids_db);
builder.chunk_compression_type = indexer_config.chunk_compression_type;

View File

@ -558,6 +558,8 @@ impl<'a, 'i> Transform<'a, 'i> {
where
F: Fn(UpdateIndexingStep) + Sync,
{
puffin::profile_function!();
let primary_key = self
.index
.primary_key(wtxn)?

View File

@ -49,6 +49,66 @@ pub(crate) enum TypedChunk {
ScriptLanguageDocids(HashMap<(Script, Language), RoaringBitmap>),
}
impl TypedChunk {
pub fn to_debug_string(&self) -> String {
match self {
TypedChunk::FieldIdDocidFacetStrings(grenad) => {
format!("FieldIdDocidFacetStrings {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdDocidFacetNumbers(grenad) => {
format!("FieldIdDocidFacetNumbers {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::Documents(grenad) => {
format!("Documents {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdWordcountDocids(grenad) => {
format!("FieldIdWordcountDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::NewDocumentsIds(grenad) => {
format!("NewDocumentsIds {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::WordDocids { word_docids_reader, exact_word_docids_reader } => format!(
"WordDocids {{ word_docids_reader: {}, exact_word_docids_reader: {} }}",
word_docids_reader.len(),
exact_word_docids_reader.len()
),
TypedChunk::WordPositionDocids(grenad) => {
format!("WordPositionDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::WordFidDocids(grenad) => {
format!("WordFidDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::WordPairProximityDocids(grenad) => {
format!("WordPairProximityDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdFacetStringDocids(grenad) => {
format!("FieldIdFacetStringDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdFacetNumberDocids(grenad) => {
format!("FieldIdFacetNumberDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdFacetExistsDocids(grenad) => {
format!("FieldIdFacetExistsDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdFacetIsNullDocids(grenad) => {
format!("FieldIdFacetIsNullDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::FieldIdFacetIsEmptyDocids(grenad) => {
format!("FieldIdFacetIsEmptyDocids {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::GeoPoints(grenad) => {
format!("GeoPoints {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::VectorPoints(grenad) => {
format!("VectorPoints {{ number_of_entries: {} }}", grenad.len())
}
TypedChunk::ScriptLanguageDocids(grenad) => {
format!("ScriptLanguageDocids {{ number_of_entries: {} }}", grenad.len())
}
}
}
}
/// Write typed chunk in the corresponding LMDB database of the provided index.
/// Return new documents seen.
pub(crate) fn write_typed_chunk_into_index(
@ -57,6 +117,8 @@ pub(crate) fn write_typed_chunk_into_index(
wtxn: &mut RwTxn,
index_is_empty: bool,
) -> Result<(RoaringBitmap, bool)> {
puffin::profile_function!(typed_chunk.to_debug_string());
let mut is_merged_database = false;
match typed_chunk {
TypedChunk::Documents(obkv_documents_iter) => {
@ -336,6 +398,8 @@ where
FS: for<'a> Fn(&'a [u8], &'a mut Vec<u8>) -> Result<&'a [u8]>,
FM: Fn(&[u8], &[u8], &mut Vec<u8>) -> Result<()>,
{
puffin::profile_function!(format!("number of entries: {}", data.len()));
let mut buffer = Vec::new();
let database = database.remap_types::<ByteSlice, ByteSlice>();
@ -378,6 +442,8 @@ where
FS: for<'a> Fn(&'a [u8], &'a mut Vec<u8>) -> Result<&'a [u8]>,
FM: Fn(&[u8], &[u8], &mut Vec<u8>) -> Result<()>,
{
puffin::profile_function!(format!("number of entries: {}", data.len()));
if !index_is_empty {
return write_entries_into_database(
data,

View File

@ -50,6 +50,8 @@ impl<'t, 'u, 'i> PrefixWordPairsProximityDocids<'t, 'u, 'i> {
common_prefix_fst_words: &[&'a [String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
puffin::profile_function!();
index_word_prefix_database(
self.wtxn,
self.index.word_pair_proximity_docids,

View File

@ -27,6 +27,8 @@ pub fn index_prefix_word_database(
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
) -> Result<()> {
puffin::profile_function!();
let max_proximity = max_proximity - 1;
debug!("Computing and writing the word prefix pair proximity docids into LMDB on disk...");

View File

@ -191,6 +191,7 @@ pub fn index_word_prefix_database(
chunk_compression_type: CompressionType,
chunk_compression_level: Option<u32>,
) -> Result<()> {
puffin::profile_function!();
debug!("Computing and writing the word prefix pair proximity docids into LMDB on disk...");
// Make a prefix trie from the common prefixes that are shorter than self.max_prefix_length

View File

@ -1,4 +1,4 @@
use std::collections::{BTreeSet, HashMap, HashSet};
use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet};
use std::result::Result as StdResult;
use charabia::{Normalize, Tokenizer, TokenizerBuilder};
@ -112,8 +112,11 @@ pub struct Settings<'a, 't, 'u, 'i> {
sortable_fields: Setting<HashSet<String>>,
criteria: Setting<Vec<Criterion>>,
stop_words: Setting<BTreeSet<String>>,
non_separator_tokens: Setting<BTreeSet<String>>,
separator_tokens: Setting<BTreeSet<String>>,
dictionary: Setting<BTreeSet<String>>,
distinct_field: Setting<String>,
synonyms: Setting<HashMap<String, Vec<String>>>,
synonyms: Setting<BTreeMap<String, Vec<String>>>,
primary_key: Setting<String>,
authorize_typos: Setting<bool>,
min_word_len_two_typos: Setting<u8>,
@ -141,6 +144,9 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
sortable_fields: Setting::NotSet,
criteria: Setting::NotSet,
stop_words: Setting::NotSet,
non_separator_tokens: Setting::NotSet,
separator_tokens: Setting::NotSet,
dictionary: Setting::NotSet,
distinct_field: Setting::NotSet,
synonyms: Setting::NotSet,
primary_key: Setting::NotSet,
@ -205,6 +211,39 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
if stop_words.is_empty() { Setting::Reset } else { Setting::Set(stop_words) }
}
pub fn reset_non_separator_tokens(&mut self) {
self.non_separator_tokens = Setting::Reset;
}
pub fn set_non_separator_tokens(&mut self, non_separator_tokens: BTreeSet<String>) {
self.non_separator_tokens = if non_separator_tokens.is_empty() {
Setting::Reset
} else {
Setting::Set(non_separator_tokens)
}
}
pub fn reset_separator_tokens(&mut self) {
self.separator_tokens = Setting::Reset;
}
pub fn set_separator_tokens(&mut self, separator_tokens: BTreeSet<String>) {
self.separator_tokens = if separator_tokens.is_empty() {
Setting::Reset
} else {
Setting::Set(separator_tokens)
}
}
pub fn reset_dictionary(&mut self) {
self.dictionary = Setting::Reset;
}
pub fn set_dictionary(&mut self, dictionary: BTreeSet<String>) {
self.dictionary =
if dictionary.is_empty() { Setting::Reset } else { Setting::Set(dictionary) }
}
pub fn reset_distinct_field(&mut self) {
self.distinct_field = Setting::Reset;
}
@ -217,7 +256,7 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
self.synonyms = Setting::Reset;
}
pub fn set_synonyms(&mut self, synonyms: HashMap<String, Vec<String>>) {
pub fn set_synonyms(&mut self, synonyms: BTreeMap<String, Vec<String>>) {
self.synonyms = if synonyms.is_empty() { Setting::Reset } else { Setting::Set(synonyms) }
}
@ -303,6 +342,8 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
FP: Fn(UpdateIndexingStep) + Sync,
FA: Fn() -> bool + Sync,
{
puffin::profile_function!();
let fields_ids_map = self.index.fields_ids_map(self.wtxn)?;
// if the settings are set before any document update, we don't need to do anything, and
// will set the primary key during the first document addition.
@ -449,9 +490,84 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
}
}
fn update_non_separator_tokens(&mut self) -> Result<bool> {
let changes = match self.non_separator_tokens {
Setting::Set(ref non_separator_tokens) => {
let current = self.index.non_separator_tokens(self.wtxn)?;
// Does the new list differ from the previous one?
if current.map_or(true, |current| &current != non_separator_tokens) {
self.index.put_non_separator_tokens(self.wtxn, non_separator_tokens)?;
true
} else {
false
}
}
Setting::Reset => self.index.delete_non_separator_tokens(self.wtxn)?,
Setting::NotSet => false,
};
// the synonyms must be updated if non separator tokens have been updated.
if changes && self.synonyms == Setting::NotSet {
self.synonyms = Setting::Set(self.index.user_defined_synonyms(self.wtxn)?);
}
Ok(changes)
}
fn update_separator_tokens(&mut self) -> Result<bool> {
let changes = match self.separator_tokens {
Setting::Set(ref separator_tokens) => {
let current = self.index.separator_tokens(self.wtxn)?;
// Does the new list differ from the previous one?
if current.map_or(true, |current| &current != separator_tokens) {
self.index.put_separator_tokens(self.wtxn, separator_tokens)?;
true
} else {
false
}
}
Setting::Reset => self.index.delete_separator_tokens(self.wtxn)?,
Setting::NotSet => false,
};
// the synonyms must be updated if separator tokens have been updated.
if changes && self.synonyms == Setting::NotSet {
self.synonyms = Setting::Set(self.index.user_defined_synonyms(self.wtxn)?);
}
Ok(changes)
}
fn update_dictionary(&mut self) -> Result<bool> {
let changes = match self.dictionary {
Setting::Set(ref dictionary) => {
let current = self.index.dictionary(self.wtxn)?;
// Does the new list differ from the previous one?
if current.map_or(true, |current| &current != dictionary) {
self.index.put_dictionary(self.wtxn, dictionary)?;
true
} else {
false
}
}
Setting::Reset => self.index.delete_dictionary(self.wtxn)?,
Setting::NotSet => false,
};
// the synonyms must be updated if dictionary has been updated.
if changes && self.synonyms == Setting::NotSet {
self.synonyms = Setting::Set(self.index.user_defined_synonyms(self.wtxn)?);
}
Ok(changes)
}
fn update_synonyms(&mut self) -> Result<bool> {
match self.synonyms {
Setting::Set(ref synonyms) => {
Setting::Set(ref user_synonyms) => {
fn normalize(tokenizer: &Tokenizer, text: &str) -> Vec<String> {
tokenizer
.tokenize(text)
@ -470,10 +586,25 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
if let Some(ref stop_words) = stop_words {
builder.stop_words(stop_words);
}
let separators = self.index.allowed_separators(self.wtxn)?;
let separators: Option<Vec<_>> =
separators.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref separators) = separators {
builder.separators(separators);
}
let dictionary = self.index.dictionary(self.wtxn)?;
let dictionary: Option<Vec<_>> =
dictionary.as_ref().map(|x| x.iter().map(String::as_str).collect());
if let Some(ref dictionary) = dictionary {
builder.words_dict(dictionary);
}
let tokenizer = builder.build();
let mut new_synonyms = HashMap::new();
for (word, synonyms) in synonyms {
for (word, synonyms) in user_synonyms {
// Normalize both the word and associated synonyms.
let normalized_word = normalize(&tokenizer, word);
let normalized_synonyms =
@ -494,7 +625,7 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
let old_synonyms = self.index.synonyms(self.wtxn)?;
if new_synonyms != old_synonyms {
self.index.put_synonyms(self.wtxn, &new_synonyms)?;
self.index.put_synonyms(self.wtxn, &new_synonyms, user_synonyms)?;
Ok(true)
} else {
Ok(false)
@ -754,11 +885,17 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
let faceted_updated = old_faceted_fields != new_faceted_fields;
let stop_words_updated = self.update_stop_words()?;
let non_separator_tokens_updated = self.update_non_separator_tokens()?;
let separator_tokens_updated = self.update_separator_tokens()?;
let dictionary_updated = self.update_dictionary()?;
let synonyms_updated = self.update_synonyms()?;
let searchable_updated = self.update_searchable()?;
let exact_attributes_updated = self.update_exact_attributes()?;
if stop_words_updated
|| non_separator_tokens_updated
|| separator_tokens_updated
|| dictionary_updated
|| faceted_updated
|| synonyms_updated
|| searchable_updated
@ -775,7 +912,7 @@ impl<'a, 't, 'u, 'i> Settings<'a, 't, 'u, 'i> {
mod tests {
use big_s::S;
use heed::types::ByteSlice;
use maplit::{btreeset, hashmap, hashset};
use maplit::{btreemap, btreeset, hashset};
use super::*;
use crate::error::Error;
@ -1241,7 +1378,7 @@ mod tests {
// In the same transaction provide some synonyms
index
.update_settings_using_wtxn(&mut wtxn, |settings| {
settings.set_synonyms(hashmap! {
settings.set_synonyms(btreemap! {
"blini".to_string() => vec!["crepes".to_string()],
"super like".to_string() => vec!["love".to_string()],
"puppies".to_string() => vec!["dogs".to_string(), "doggos".to_string()]
@ -1537,6 +1674,9 @@ mod tests {
sortable_fields,
criteria,
stop_words,
non_separator_tokens,
separator_tokens,
dictionary,
distinct_field,
synonyms,
primary_key,
@ -1555,6 +1695,9 @@ mod tests {
assert!(matches!(sortable_fields, Setting::NotSet));
assert!(matches!(criteria, Setting::NotSet));
assert!(matches!(stop_words, Setting::NotSet));
assert!(matches!(non_separator_tokens, Setting::NotSet));
assert!(matches!(separator_tokens, Setting::NotSet));
assert!(matches!(dictionary, Setting::NotSet));
assert!(matches!(distinct_field, Setting::NotSet));
assert!(matches!(synonyms, Setting::NotSet));
assert!(matches!(primary_key, Setting::NotSet));

View File

@ -45,6 +45,8 @@ impl<'t, 'u, 'i> WordPrefixDocids<'t, 'u, 'i> {
common_prefix_fst_words: &[&[String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
puffin::profile_function!();
// It is forbidden to keep a mutable reference into the database
// and write into it at the same time, therefore we write into another file.
let mut prefix_docids_sorter = create_sorter(

View File

@ -50,6 +50,7 @@ impl<'t, 'u, 'i> WordPrefixIntegerDocids<'t, 'u, 'i> {
common_prefix_fst_words: &[&[String]],
del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> Result<()> {
puffin::profile_function!();
debug!("Computing and writing the word levels integers docids into LMDB on disk...");
let mut prefix_integer_docids_sorter = create_sorter(

View File

@ -42,6 +42,8 @@ impl<'t, 'u, 'i> WordsPrefixesFst<'t, 'u, 'i> {
#[logging_timer::time("WordsPrefixesFst::{}")]
pub fn execute(self) -> Result<()> {
puffin::profile_function!();
let words_fst = self.index.words_fst(self.wtxn)?;
let mut current_prefix = vec![SmallString32::new(); self.max_prefix_length];

View File

@ -5,7 +5,7 @@ use std::io::Cursor;
use big_s::S;
use either::{Either, Left, Right};
use heed::EnvOpenOptions;
use maplit::{hashmap, hashset};
use maplit::{btreemap, hashset};
use milli::documents::{DocumentsBatchBuilder, DocumentsBatchReader};
use milli::update::{IndexDocuments, IndexDocumentsConfig, IndexerConfig, Settings};
use milli::{AscDesc, Criterion, DocumentId, Index, Member, Object, TermsMatchingStrategy};
@ -51,7 +51,7 @@ pub fn setup_search_index_with_criteria(criteria: &[Criterion]) -> Index {
S("tag"),
S("asc_desc_rank"),
});
builder.set_synonyms(hashmap! {
builder.set_synonyms(btreemap! {
S("hello") => vec![S("good morning")],
S("world") => vec![S("earth")],
S("america") => vec![S("the united states")],