Phrase and proximity queries are more expensive than simple match queries. Whereas a match query just has to look up terms in the inverted index, a match_phrase query has to calculate and compare the positions of multiple possibly repeated terms.
The Lucene nightly benchmarks show that a simple term query is about 10 times as fast as a phrase query, and about 20 times as fast as a proximity query (a phrase query with slop). And of course, this cost is paid at search time instead of at index time.
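To make the cost difference concrete, here is a minimal sketch of a positional inverted index in Python. It is an illustration only, not how Lucene implements phrase matching: the function names (build_positional_index, docs_with_all_terms, match_phrase) and the sample documents are hypothetical. Finding documents that contain the terms needs only a lookup and a set intersection, whereas the phrase check must also walk and compare term positions inside every candidate document.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def docs_with_all_terms(index, terms):
    """Term lookup only: intersect the doc IDs listed for each term."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def match_phrase(index, terms):
    """Exact phrase (no slop): terms must sit at consecutive positions."""
    hits = set()
    for doc_id in docs_with_all_terms(index, terms):
        # Extra work compared with a plain lookup: try every start position
        # of the first term and compare positions of the remaining terms.
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "the quick brown fox",
    2: "brown fox quick",
    3: "quick quick brown fox jumps",
}
index = build_positional_index(docs)
terms = ["quick", "brown", "fox"]
print(sorted(docs_with_all_terms(index, terms)))  # [1, 2, 3]
print(sorted(match_phrase(index, terms)))         # [1, 3]
```

Document 3 repeats the first term, so the position loop has to try more than one starting point. That is the kind of extra work that grows when terms are repeated many times.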
Tip
Usually the extra cost of phrase queries is not as scary as these numbers suggest. Really, the difference in performance is a testimony to just how fast a simple term query is. In certain pathological cases, phrase queries can be costly, but this is unusual. An example of a pathological case is DNA sequencing, where there are many identical terms repeated in many positions. Using higher slop values in this case results in a huge growth in the number of position calculations.
So what can we do to limit the performance cost of phrase and proximity queries? One useful approach is to reduce the total number of documents that need to be examined by the phrase query.
In the preceding section, we discussed using proximity queries just for relevance purposes, not to include or exclude results from the result set. A query may match millions of results, but chances are that our users are interested in only the first few pages of results.
A simple match query will already have ranked documents that contain all search terms near the top of the list. Really, we just want to rerank the top results to give an extra relevance bump to those documents that also match the phrase query.
The search API supports exactly this functionality via rescoring. The rescore phase allows you to apply a more expensive scoring algorithm (such as a phrase query) to just the top K results from each shard. These top results are then re-sorted according to their new scores.
The request looks like this:
GET /my_index/my_type/_search
{
  "query": {
    "match": { (1)
      "title": {
        "query": "quick brown fox",
        "minimum_should_match": "30%"
      }
    }
  },
  "rescore": {
    "window_size": 50, (2)
    "query": { (3)
      "rescore_query": {
        "match_phrase": {
          "title": {
            "query": "quick brown fox",
            "slop": 50
          }
        }
      }
    }
  }
}
(1) The match query decides which results will be included in the final result set and ranks results according to TF/IDF.
(2) The window_size is the number of top results to rescore, per shard.
(3) The only rescoring algorithm currently supported is another query, but there are plans to add more algorithms later.
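The rescore phase can be pictured with a small Python simulation of what happens on a single shard. This is a sketch under stated assumptions, not Elasticsearch's implementation: rescore_top_k and the sample scores are hypothetical, and the way the two scores are combined (a weighted sum, with both weights defaulting to 1) is modeled on the rescore API's query_weight and rescore_query_weight settings. Documents outside the window are assumed to keep their original score and to stay below the rescored window.

```python
def rescore_top_k(hits, rescore_fn, window_size,
                  query_weight=1.0, rescore_query_weight=1.0):
    """Rescore only the top `window_size` hits of one shard.

    hits: list of (doc_id, score) pairs produced by the cheap query.
    rescore_fn: returns the expensive query's score for a doc_id,
                or None when the document does not match it.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]

    rescored = []
    for doc_id, score in window:
        extra = rescore_fn(doc_id)  # the expensive part, run at most K times
        new_score = query_weight * score
        if extra is not None:
            new_score += rescore_query_weight * extra
        rescored.append((doc_id, new_score))

    # Re-sort the window by its new scores; the tail is left untouched.
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest


# Hypothetical scores: the match query ranks "a" first, but only "b"
# (inside the window) and "d" (outside it) also match the phrase query.
match_hits = [("a", 3.0), ("b", 2.5), ("c", 2.0), ("d", 1.0)]
phrase_scores = {"b": 1.0, "d": 5.0}

print(rescore_top_k(match_hits, phrase_scores.get, window_size=3))
# [('b', 3.5), ('a', 3.0), ('c', 2.0), ('d', 1.0)]
```

Note that "d" never benefits from its high phrase score because it falls outside the window. That is the trade-off of rescoring: the expensive query runs on only a few documents per shard, so the window must be large enough to cover the results users are likely to look at.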