Properly using Elasticsearch Script Filters
Elasticsearch comes with a vast number of filters and queries for
all sorts of things, including special strategies and combinations of those.
In some rare cases though, these are not enough. For these cases,
Elasticsearch ships with the script filter, allowing arbitrary
scripts to run on a document to decide whether it should be filtered or not.
Script filters come at a high cost and are impractical to run on larger datasets. However, given the right approach, that cost can be mitigated.
A word of warning
There aren't many reasons for using a script filter.
If you can avoid scripting by reworking your data into something that
is easier to work with using standard queries, do so. Laziness is no excuse!
Still, there are cases where a script is tolerable: for example, when the use case isn't critical and fits your data badly, or when the techniques below reduce the cost until it is negligible.
An example case
The example is a bit contrived, but I'd like to avoid making it too complex.
Consider documents of the following structure:
---
name: The Awesome Bay Hotel
vacancies:
  - start: 2014-09-05
    end: 2014-09-10
    room: double
  - start: 2014-09-03
    end: 2014-10-20
    room: single
---
name: The Not So Awesome Bay Hotel
vacancies:
  - start: 2014-09-11
    end: 2014-09-20
    room: double
  - start: 2014-09-12
    end: 2014-10-20
    room: single
...
We assume that vacancies is indexed as nested.
Now, consider a group of 3 people trying to find a hotel with enough free space for them. To keep things simple, they search for a single day; a longer stay makes the date arithmetic harder without making the example more instructive. The problem you run into is that standard Elasticsearch queries cannot express conditions that depend on complex relationships between multiple fields or multiple documents, such as a sum across several nested vacancies.
This can be easily expressed as a script, though, taking date and number_of_persons as parameters:
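// Parameters: date, number_of_persons
vacancies = _source['vacancies'];
free = 0;
foreach (vacancy : vacancies) {
  // Only count vacancies that cover the requested date.
  if (vacancy['start'] < date && vacancy['end'] > date) {
    if (vacancy['room'] == 'double') {
      free = free + 2;
    }
    if (vacancy['room'] == 'single') {
      free = free + 1;
    }
  }
  // Stop as soon as there is enough room for everyone.
  if (free >= number_of_persons) {
    return true;
  }
}
false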
Running this over even a mid-size dataset is already very costly, as the script has to load and parse the _source field of every document it is run on.
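To make the counting logic easier to check outside of Elasticsearch, here is a minimal Python sketch of the same rule, run against the two example hotels. The function name has_room_for and the BEDS table are made up for this illustration, and the dates are plain Python date objects rather than whatever the script sees in _source.

```python
from datetime import date

# Beds per room type, matching the numbers used in the script above.
BEDS = {"double": 2, "single": 1}


def has_room_for(vacancies, day, number_of_persons):
    """Return True if the vacancies covering `day` offer enough beds."""
    free = 0
    for vacancy in vacancies:
        # Strict comparisons, mirroring the script.
        if vacancy["start"] < day < vacancy["end"]:
            free += BEDS.get(vacancy["room"], 0)
        # Stop as soon as there is enough room for everyone.
        if free >= number_of_persons:
            return True
    return False


awesome = [
    {"start": date(2014, 9, 5), "end": date(2014, 9, 10), "room": "double"},
    {"start": date(2014, 9, 3), "end": date(2014, 10, 20), "room": "single"},
]
not_so_awesome = [
    {"start": date(2014, 9, 11), "end": date(2014, 9, 20), "room": "double"},
    {"start": date(2014, 9, 12), "end": date(2014, 10, 20), "room": "single"},
]

for name, vacancies in [("awesome", awesome), ("not so awesome", not_so_awesome)]:
    print(name, has_room_for(vacancies, date(2014, 9, 13), 3))
```

Note that the comparisons are strict, so a vacancy starting on the requested day itself is not counted.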
Understanding Elasticsearch query order
Let's have a look at a simple ES query:
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "my_field": "some_tag" } }
    }
  },
  "post_filter": { "term": { "my_field": "some_other_tag" } }
}
Queries always run before the post_filter. Inside of a filtered query, the
order is: the filter runs before the query, or, put the other way, the
query is only run on the documents passing the filter.
Finding a place for the script
We just established that scripting is a very costly operation, so it should
be run on a minimal set of documents. It follows that the script should
be run as a post_filter. The post_filter operates only on documents that
have already passed the filter in the filtered query and that also match
the query. We put the script in config/scripts/filter-vacancies.mvel
and use a post_filter.
post_filter:
  script:
    script: "filter-vacancies"
    params:
      number_of_persons: 3
      date: 2014-09-12T01:00:00+01:00
Still, the number of documents this filter runs on can be unacceptably high.
Setting boundaries
A good rule of thumb is that you'd be hard pressed to create something that runs slower than a script in ES. While we cannot answer the full question using standard Elasticsearch queries and filters, maybe we can answer subquestions.
For example, we can answer the following subquestion: does the hotel have any vacancies for this time, at all?
query:
  filtered:
    query: { match_all: {} }
    filter:
      nested:
        path: "vacancies"
        filter:
          bool:
            must:
              - range:
                  vacancies.start: { lt: "2014-09-12T01:00:00+01:00" }
              - range:
                  vacancies.end: { gt: "2014-09-12T01:00:00+01:00" }
This makes sure no hotels are ever considered that couldn't possibly match the more complex condition (does the hotel have vacancies for three?) later.
It also speeds up the query itself, since the query is only evaluated on documents that pass the filter.
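Putting the pieces together, the cheap nested filter and the script post_filter end up in one request body. The following Python sketch only assembles that body as a plain dictionary from the fragments shown so far; build_search_body is a hypothetical helper name, and sending the request is left to whatever client you use.

```python
import json


def build_search_body(date, number_of_persons):
    """Assemble the request body: cheap nested filter first, script last."""
    # Boundary filter: does the hotel have any vacancy covering the date?
    covers_date = {
        "nested": {
            "path": "vacancies",
            "filter": {
                "bool": {
                    "must": [
                        {"range": {"vacancies.start": {"lt": date}}},
                        {"range": {"vacancies.end": {"gt": date}}},
                    ]
                }
            },
        }
    }
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": covers_date,
            }
        },
        # The costly script only sees documents that survived the above.
        "post_filter": {
            "script": {
                "script": "filter-vacancies",
                "params": {
                    "number_of_persons": number_of_persons,
                    "date": date,
                },
            }
        },
    }


body = build_search_body("2014-09-12T01:00:00+01:00", 3)
print(json.dumps(body, indent=2))
```

Passing the same date to both the filter and the script keeps the two conditions from drifting apart.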
Tuning the query further
As our script is only run on documents that also match the query,
finding queries that exclude more documents can help. This can include
raising the min_score threshold or first cutting off documents that don't
match certain extended criteria (e.g. hotels that are rated badly, or that
are very expensive and would score low because of that).
In extreme cases, it might be worthwhile to run the query without a script, count the number of expected results and trim them aggressively if the result set is large.
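The count-then-trim idea could look roughly like the Python sketch below. It assumes the application has already run the query without the script and knows how many documents matched; attach_script, max_candidates and trimmed_min_score are hypothetical names, and the default values are placeholders rather than recommendations.

```python
import copy


def attach_script(body, script_filter, candidate_count,
                  max_candidates=10000, trimmed_min_score=0.5):
    """Return a copy of `body` with the script post_filter attached.

    If the script-free query matched more than `max_candidates` documents,
    set `min_score` first so that fewer documents reach the script.
    """
    result = copy.deepcopy(body)
    if candidate_count > max_candidates:
        result["min_score"] = trimmed_min_score
    result["post_filter"] = script_filter
    return result


# Hypothetical scoring query and script filter, for illustration only.
base = {"query": {"match": {"name": "bay"}}}
script = {"script": {"script": "filter-vacancies"}}

small = attach_script(base, script, candidate_count=200)
large = attach_script(base, script, candidate_count=50000)
print("min_score" in small, large["min_score"])
```

The extra round trip for the count only pays off when the script is expensive enough to outweigh it, so this is worth measuring before adopting it.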
Caching the filter
The final option is to cache the script filter. We recommend using a cache key for that, so that the app can purge the cache if necessary.
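As a rough sketch, a cached script filter with an explicit key might be built like this in Python. It assumes the _cache and _cache_key filter options of the Elasticsearch versions that still have the filtered query; check them against the version you run. The helper name and the key format are made up for this example.

```python
def cached_script_filter(date, number_of_persons):
    """Build a script filter that asks Elasticsearch to cache its result."""
    # Hypothetical key format: one cache entry per parameter combination.
    cache_key = "filter-vacancies:{}:{}".format(date, number_of_persons)
    return {
        "script": {
            "script": "filter-vacancies",
            "params": {"number_of_persons": number_of_persons, "date": date},
            "_cache": True,           # assumed option, see note above
            "_cache_key": cache_key,  # assumed option, see note above
        }
    }


cached = cached_script_filter("2014-09-12T01:00:00+01:00", 3)
print(cached["script"]["_cache_key"])
```

Caching only helps when the same script and parameters recur across requests; a filter whose date changes with every search would fill the cache without ever hitting it.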
Conclusion
Scripts should be avoided if possible, but they have their uses. Combining post-filters with techniques that aggressively trim down the candidate set before the script runs keeps the number of costly script evaluations to a minimum.