Properly using Elasticsearch Script Filters

Elasticsearch comes with a vast number of filters and queries for all sorts of things, including special strategies and combinations of those. In some rare cases though, these are not enough. For these cases, Elasticsearch ships with the script filter, which runs an arbitrary script on each document to decide whether it should be kept or filtered out.

Script filters come at a high cost and are impractical to run on larger datasets. However, given the right approach, this cost can be mitigated.

A word of warning

There aren't many reasons for using a script filter. If you can avoid scripting by reworking your data into something that is easier to work with using standard queries, do so. Laziness is no excuse!

Still, there are cases where a script is tolerable: for example, when the use case isn't critical and fits your data badly, or when the techniques below reduce the cost to a negligible level.

An example case

The example is a bit contrived, but I'd like to avoid making it too complex.

Consider documents of the following structure:

---
name: The Awesome Bay Hotel
vacancies:
  - start: 2014-09-05
    end: 2014-09-10
    room: double
  - start: 2014-09-03
    end: 2014-10-20
    room: single
---
name: The Not So Awesome Bay Hotel
vacancies:
  - start: 2014-09-11
    end: 2014-09-20
    room: double
  - start: 2014-09-12
    end: 2014-10-20
    room: single
...

We assume that vacancies is indexed as nested.

Now, consider a group of 3 people trying to find a hotel with enough free space for them. To keep things simple, they search for a single day. Longer stays would make the date arithmetic harder without making the example more instructive. You will run into the problem that there is no way in Elasticsearch to express conditions on complex relationships between multiple fields or multiple nested documents, such as "the vacancies on this day add up to at least three beds".

This is easy to express as a script, though, taking date and number_of_persons as parameters:

vacancies = _source['vacancies']
free = 0
foreach(vacancy : vacancies) {
  // Only count vacancies that cover the requested date
  if (vacancy['start'] < date && vacancy['end'] > date) {
    if (vacancy['room'] == 'double') {
      free = free + 2
    }
    if (vacancy['room'] == 'single') {
      free = free + 1
    }
  }
  // Stop as soon as there is enough room for the whole group
  if (free >= number_of_persons) {
    return true
  }
}
false

Running this over even a mid-size dataset is already very costly, as the script has to load and parse the _source of every document it runs on.
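To make the logic of the script easier to check outside of Elasticsearch, here is a minimal Python sketch of the same condition. It is a port for illustration only, not something Elasticsearch runs: the function name has_room_for and the BEDS table are my own, and it compares plain dates instead of the timestamp parameter used later.

```python
from datetime import date

# Beds contributed by each room type, mirroring the script above.
BEDS = {"double": 2, "single": 1}


def has_room_for(vacancies, day, number_of_persons):
    """Return True once the vacancies covering `day` offer enough beds."""
    free = 0
    for vacancy in vacancies:
        # Same strict comparisons as the script: start < day < end.
        if vacancy["start"] < day < vacancy["end"]:
            free += BEDS.get(vacancy["room"], 0)
        # Stop early as soon as the whole group fits.
        if free >= number_of_persons:
            return True
    return False


# The two example hotels from above.
awesome = [
    {"start": date(2014, 9, 5), "end": date(2014, 9, 10), "room": "double"},
    {"start": date(2014, 9, 3), "end": date(2014, 10, 20), "room": "single"},
]
not_so_awesome = [
    {"start": date(2014, 9, 11), "end": date(2014, 9, 20), "room": "double"},
    {"start": date(2014, 9, 12), "end": date(2014, 10, 20), "room": "single"},
]
```

For a group of three on 2014-09-13, only the second hotel qualifies: its double and its single room are both free on that day. Note that the comparisons are strict, so a vacancy starting exactly on the requested date does not count.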

Understanding Elasticsearch query order

Let's have a look at a simple ES query:

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "my_field": "some_tag" } }
    }
  },
  "post_filter": { "term": { "my_field": "some_other_tag" } }
}

Queries always run before the post_filter. Inside of a filtered query, the order is: the filter runs before the query, or - put the other way - the query is only run on the documents passing the filter.
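The order described here can be pictured with a small toy model in Python. This is only a sketch of the ordering, not of how Elasticsearch evaluates a request internally; run_search, the sample documents and the three predicates are all made up for the illustration.

```python
def run_search(docs, filter_fn, query_fn, post_filter_fn):
    """Toy model of the order: filter, then query, then post_filter.

    Returns the hits and how many documents each stage had to look at.
    """
    calls = {"filter": 0, "query": 0, "post_filter": 0}
    hits = []
    for doc in docs:
        calls["filter"] += 1
        if not filter_fn(doc):
            continue
        calls["query"] += 1
        if not query_fn(doc):
            continue
        calls["post_filter"] += 1
        if post_filter_fn(doc):
            hits.append(doc)
    return hits, calls


# Twelve made-up documents: every second one carries the tag,
# every third one matches the query text.
docs = [
    {
        "id": i,
        "tag": "some_tag" if i % 2 == 0 else "other",
        "text": "bay" if i % 3 == 0 else "hill",
    }
    for i in range(12)
]

hits, calls = run_search(
    docs,
    filter_fn=lambda d: d["tag"] == "some_tag",
    query_fn=lambda d: d["text"] == "bay",
    post_filter_fn=lambda d: d["id"] > 0,
)
```

In this model the filter sees all 12 documents, the query sees the 6 that pass the filter, and the post_filter only sees the 2 that also match the query.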

Finding a place for the script

We just established that scripting is a very costly operation, so it should be run on a minimal set of documents. It follows that the script should be run as a post_filter. The post_filter operates on the documents that have already been reduced by the filters in the filtered query and then further reduced to those that also match the query. We put the script in config/scripts/filter-vacancies.mvel and use a post_filter.

post_filter:
  script:
    script: "filter-vacancies"
    params:
      number_of_persons: 3
      date: 2014-09-12T01:00:00+01:00

Still, the number of documents this filter runs on can be unacceptably high.

Setting boundaries

A good rule of thumb is that you'd be hard pressed to create something running slower than a script in ES. While we cannot answer the full question using standard Elasticsearch queries and filters, maybe we can answer subquestions.

For example, we can answer the following subquestion: does the hotel have any vacancies for this time, at all?

query:
  filtered:
    query: { match_all: {}}
    filter:
      nested:
        path: "vacancies"
        filter:
          bool:
            must:
              - range:
                  vacancies.start: { lt: "2014-09-12T01:00:00+01:00"}
              - range:
                  vacancies.end: { gt: "2014-09-12T01:00:00+01:00"}

This makes sure no hotels are ever considered that couldn't possibly match the more complex condition (does the hotel have vacancies for three?) later.

It also speeds up the query itself, because the query only runs on documents that pass the filter.
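Putting the pieces together, the nested filter above and the script post_filter from the previous section end up in one request body. The following Python sketch only builds that body as a dictionary; the function name build_search_body is my own, and sending the request is left to whatever client you use.

```python
import json


def build_search_body(date, number_of_persons):
    """Build the request body: cheap nested filter first, script last."""
    # Cheap pre-check: is there any vacancy covering the date at all?
    any_vacancy = {
        "nested": {
            "path": "vacancies",
            "filter": {
                "bool": {
                    "must": [
                        {"range": {"vacancies.start": {"lt": date}}},
                        {"range": {"vacancies.end": {"gt": date}}},
                    ]
                }
            },
        }
    }
    # Expensive check: the script stored as config/scripts/filter-vacancies.mvel.
    enough_room = {
        "script": {
            "script": "filter-vacancies",
            "params": {"number_of_persons": number_of_persons, "date": date},
        }
    }
    return {
        "query": {"filtered": {"query": {"match_all": {}}, "filter": any_vacancy}},
        "post_filter": enough_room,
    }


body = build_search_body("2014-09-12T01:00:00+01:00", 3)
request_json = json.dumps(body, indent=2)
```

Both parts use the same date, so the script never sees a hotel that has no vacancy on that day.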

Tuning the query further

As our script is only run on documents that also match the query, finding queries that exclude more documents can help. This can include raising the min_score level or first cutting off documents that don't match certain extended criteria (e.g. hotels that are rated badly or are very expensive and would score low because of that).

In extreme cases, it might be worthwhile to run the query without a script, count the number of expected results and trim them aggressively if the result set is large.
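As a rough sketch of that count-first strategy in Python: the helper below takes the request body and a callable that returns the number of hits for a body without the script. Everything here is an assumption for illustration, including the name plan_search, the count_candidates callable (which you would back with a real count request) and the two threshold values.

```python
import copy


def plan_search(body, count_candidates, max_candidates=1000, strict_min_score=1.0):
    """Count the candidates without the script, then decide how hard to trim.

    count_candidates is a hypothetical callable: it receives a request body
    and returns the number of matching documents.
    """
    # Same request, but without the expensive script post_filter.
    without_script = {k: v for k, v in body.items() if k != "post_filter"}
    candidates = count_candidates(without_script)

    planned = copy.deepcopy(body)
    if candidates > max_candidates:
        # Too many documents would reach the script: trim by score first.
        planned["min_score"] = strict_min_score
    return planned


body = {
    "query": {"match": {"name": "bay hotel"}},
    "post_filter": {"script": {"script": "filter-vacancies"}},
}

# Stubbed counts standing in for real count requests.
small = plan_search(body, count_candidates=lambda b: 40)
large = plan_search(body, count_candidates=lambda b: 50000)
```

With the stubbed counts, the small result set is searched as-is, while the large one gets a min_score before the script runs. Sensible thresholds depend entirely on your data and scoring.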

Caching the filter

The final option is to cache the script filter. We recommend setting an explicit cache key for that, so that the app can purge the cache if necessary.
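A cached variant of the script filter could be built like this. The sketch assumes the _cache and _cache_key filter options of the Elasticsearch 1.x filter DSL, so check them against the documentation of the version you run; the function name and the key format are made up.

```python
def cached_script_filter(date, number_of_persons):
    """Build the script filter with caching enabled under an explicit key."""
    # Made-up key format: one cache entry per distinct parameter combination.
    cache_key = "vacancies:{}:{}".format(date, number_of_persons)
    return {
        "script": {
            "script": "filter-vacancies",
            "params": {"number_of_persons": number_of_persons, "date": date},
            # Assumed 1.x filter options; verify against your version.
            "_cache": True,
            "_cache_key": cache_key,
        }
    }


post_filter = cached_script_filter("2014-09-12T01:00:00+01:00", 3)
```

Caching only pays off when the same parameters come up repeatedly, so it can be worth normalizing the date (for example to the day) before building the key.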

Conclusion

Scripts should be avoided if possible, but they have their uses. Combining post-filters with techniques that aggressively trim down the candidate set before the script runs keeps you from paying for unnecessary script executions.