Skip to content

Get count from split metadata on simple time range query #5758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tontinton
Copy link
Contributor

@tontinton tontinton commented Apr 19, 2025

I think I found a nice optimization here while starting to research the search query code. @rdettai

Count queries can return much faster by not downloading splits, and I couldn't think of a good reason to always download split files on time range queries, if the split time range is fully contained in the search request time range.

Split the PR into 2: #5759.

@tontinton tontinton changed the title Get count from split metadata on simple query with timerange Get count from split metadata on simple query with time range Apr 19, 2025
@tontinton tontinton changed the title Get count from split metadata on simple query with time range Get count from split metadata on simple time range query Apr 19, 2025
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization. I think it deserves a unit test. E.g the following test scenario:

  • The mock metastore returns 4 splits (1 outside the range, 1 overlapping the star, 1 overlapping the end, 1 overlapping the whole range and 1 within the range.
  • The mock search service expects only 1 call.

Comment on lines 757 to 759
let start_time = split.time_range.as_ref().map(|x| x.start()).copied();
let end_time = split.time_range.as_ref().map(|x| x.end()).copied();
if is_metadata_count_request(search_request, start_time, end_time) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, to make search_partial_hits_phase a bit more readable, I would pass the time_range directly to is_metadata_count_request

jobs_to_leaf_request(search_request, indexes_metas_for_leaf_search, client_jobs)?;
leaf_request_tasks.push(cluster_client.leaf_search(leaf_request, client.clone()));
}
leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, moving it to a separate line would help see a bit more clearly how errors are resolved.

Suggested change
leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?);
let executed_leaf_search_responses = try_join_all(leaf_request_tasks).await?;
leaf_search_responses.extend(executed_leaf_search_responses);

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from de8dde6 to f3ab102 Compare April 22, 2025 20:40
@tontinton
Copy link
Contributor Author

Nice optimization. I think it deserves a unit test. E.g the following test scenario:

* The mock metastore returns 4 splits (1 outside the range, 1 overlapping the star, 1 overlapping the end, 1 overlapping the whole range and 1 within the range.

* The mock search service expects only 1 call.

Will add tests when I have a bit more time.

@tontinton tontinton requested a review from rdettai April 29, 2025 17:38
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a test. Unfortunately I think it is flawed (so was my example of test earlier, I should have written "the mock search service expects to be called for 3 splits")

Comment on lines 5131 to 5139
mock_search.expect_leaf_search().times(1).returning(|_req| {
Ok(quickwit_proto::search::LeafSearchResponse {
num_hits: 1,
partial_hits: vec![mock_partial_hit("split_inside", 1, 1)],
failed_splits: Vec::new(),
num_attempted_splits: 1,
..Default::default()
})
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • asserting that _req contains the right splits is the important part of this test
  • the mock response should be all splits except "split_inside" which is the one we resolve at the root level 🙃

Comment on lines 5159 to 5160
assert_eq!(resp.num_hits, 1);
assert_eq!(resp.hits.len(), 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the response I would have expected here. num_hits should be 10 (num docs in spit inside) + whatever you decide leaf_search returns for the overlapping splits.

end_timestamp: Some(129_000),
index_id_patterns: vec!["test-index".to_string()],
query_ast: qast_json_helper("test", &["body"]),
max_hits: 10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your change is not expected to run if max_hits!=0

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from c535af6 to 89f26f1 Compare April 30, 2025 19:21
@tontinton
Copy link
Contributor Author

tontinton commented Apr 30, 2025

Thanks for adding a test. Unfortunately I think it is flawed (so was my example of test earlier, I should have written "the mock search service expects to be called for 3 splits")

You're totally right, I was supposed to be making a count query, fixed now.

@tontinton tontinton requested a review from rdettai April 30, 2025 19:22
@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from 89f26f1 to 926a2f3 Compare April 30, 2025 20:48
Comment on lines 667 to 670
if let Some(request_start_timestamp) = request.start_timestamp {
let Some(split_start_timestamp) = split_start_timestamp else {
return false;
};
if split_start_timestamp < request_start_timestamp {
return false;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I wonder whether this wouldn't be easier to read:

Suggested change
if let Some(request_start_timestamp) = request.start_timestamp {
let Some(split_start_timestamp) = split_start_timestamp else {
return false;
};
if split_start_timestamp < request_start_timestamp {
return false;
}
}
match (request.start_timestamp, split_start_timestamp) {
(Some(request_start), Some(split_start)) if split_start >= request_start => {}
(Some(_), _) => return false,
(None, _) => {}
}

Comment on lines 5154 to 5158
num_hits: req.leaf_requests[0]
.split_offsets
.iter()
.map(|s| s.num_docs)
.sum(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this test more powerful by setting a smaller number here.

Comment on lines 5184 to 5164
assert_eq!(resp.num_hits, 50);
assert_eq!(resp.hits.len(), 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is missing from the other tests, but would be nice to also assert resp.num_successful_splits here

@rdettai
Copy link
Collaborator

rdettai commented May 14, 2025

last minute thought, you could also add some tests in https://github.com/tontinton/quickwit/blob/926a2f33d7b35a5cee064adc457c4503f96dc725/quickwit/rest-api-tests/scenarii/qw_search_api/0001_ts_range.yaml to double check the limit conditions, e.g:

endpoint: simple/search
params:
  query: "*"
  start_timestamp: 1684993000
  end_timestamp: 1684993004
expected:
  num_hits: 3

should confirm that the upper bound exclusion condition is correct and stays that way.

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from 926a2f3 to 2a6a590 Compare May 14, 2025 18:49
@tontinton tontinton requested a review from rdettai May 14, 2025 18:49
@tontinton tontinton force-pushed the optimize-timestamp-range-count branch 3 times, most recently from 545e9d8 to 4f1812e Compare May 15, 2025 12:07
@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from 4f1812e to 8c4d360 Compare May 15, 2025 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants