Introduction to Elasticsearch Aggregations

Elasticsearch Aggregations

We’ve already seen how we can search with Elasticsearch and get results in the form of hits. However, the _search endpoint offers more: we can also ask it for aggregations based on fields within the documents that the query matches. As an example, let’s start from this request:
Finding all movies in the Drama genre.

curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genres.original": "Drama"
        }
      }
    }
  }
}'

The above request helps us answer the question “What movies are there in the drama genre?”. But let’s say we were interested in the question “What directors have directed the movies in the drama genre?”. Theoretically, we could grab all of the results of the above request and look at their director fields.

Of course, that would result in duplicates, since a single director may have directed multiple drama movies. So we’d have to group the results by director name. And if we’re interested in finding out which director has directed the most dramas, we’d have to sort the groups by their size.

We could of course do this in code, but that would be inefficient, both because we’d have to write that code and because we’d have to fetch all of the results from Elasticsearch (we have just six movies in our index, but imagine doing that for IMDb). Luckily, this type of computation is easily done by asking Elasticsearch for an aggregation.
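To make the manual approach concrete, here is a minimal Python sketch of the grouping and sorting we’d have to do ourselves. The hits list is made-up sample data shaped like the documents the drama query might return (only the director field is included), and the function name is our own invention rather than anything Elasticsearch provides.

```python
from collections import Counter

# Hypothetical hits, shaped like the _source of the documents returned by
# the drama query. Only the director field matters for this sketch.
hits = [
    {"director": "Francis Ford Coppola"},
    {"director": "David Lean"},
    {"director": "Francis Ford Coppola"},
    {"director": "Robert Mulligan"},
    {"director": "Andrew Dominik"},
]


def directors_by_movie_count(hits):
    """Group hits by director and sort by number of movies, highest first."""
    counts = Counter(hit["director"] for hit in hits)
    # Sort by count descending, then by name so that ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = directors_by_movie_count(hits)
for director, count in ranking:
    print(f"{director}: {count}")
```

Note that this only works after every matching document has been transferred to the client, which is exactly the cost an aggregation avoids.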

There are many types of aggregations, but in this particular case, where we want to group by the exact value in a field, a terms aggregation is suitable. Below is the above request updated to include a terms aggregation for the director field. Note that, more specifically, it targets the director.original field, as we want to aggregate on the full director name and not on the individual words in it.

Searching for movies in the drama genre and asking Elasticsearch for a terms aggregation for the director.original field.

curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genres.original": "Drama"
        }
      }
    }
  },
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original"
      }
    }
  }
}'

The response from Elasticsearch to the above request is this:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": [
      // Hits here, omitted for brevity
    ]
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        },
        {
          "key": "David Lean",
          "doc_count": 1
        },
        {
          "key": "Robert Mulligan",
          "doc_count": 1
        }
      ]
    }
  }
}

The response now includes another top-level property, aggregations, which contains the result of the aggregation we asked for. Let’s take a closer look at both the search request and the response with a terms aggregation.
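As a rough illustration of how a client might read that property, the Python sketch below parses a trimmed copy of the response (only the aggregations part is kept) and walks the buckets of the directors aggregation. The helper name bucket_counts is hypothetical; the property names come from the response shown above.

```python
import json

# A trimmed copy of the response above: only the "aggregations" part is kept.
response_text = """
{
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "Francis Ford Coppola", "doc_count": 2},
        {"key": "Andrew Dominik", "doc_count": 1},
        {"key": "David Lean", "doc_count": 1},
        {"key": "Robert Mulligan", "doc_count": 1}
      ]
    }
  }
}
"""


def bucket_counts(response, aggregation_name):
    """Map each bucket key of a named aggregation to its doc_count."""
    buckets = response["aggregations"][aggregation_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


response = json.loads(response_text)
directors = bucket_counts(response, "directors")
for name, count in directors.items():
    print(f"{name}: {count}")
```

The name we gave the aggregation in the request ("directors") is the key under which its result appears, so the same helper would work for any other terms aggregation we add.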

Figure: A closer look at a search request with a terms aggregation in Sense.

Figure: A closer look at the response to a search request with a terms aggregation in Sense.

Size

As with search hits, the buckets returned by a terms aggregation (and by some other aggregation types) are limited to ten by default. We can override this default by adding a size parameter to the aggregation. Here’s an example:
Searching for all movies and aggregating the director.original field with size set to two.

curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "size": 0,
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original",
        "size": 2
      }
    }
  }
}'

The response from Elasticsearch to the above request.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 3,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        }
      ]
    }
  }
}

As we can see in the response, Elasticsearch respects the size parameter in the terms aggregation and returns only two buckets. The hits array is empty because we set the top-level size to zero, which is unrelated to the size inside the aggregation. Also note that the returned sum_other_doc_count property has the value three. This is the sum of the document counts of the buckets that weren’t included in the response: three of our six movies belong to directors that didn’t make it into the top two.
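The relationship between size, the returned buckets and sum_other_doc_count can be sketched in a few lines of Python. This is only a single-process emulation of the result shape, not how Elasticsearch computes it across shards. The function name is our own, the sixth director is a placeholder because the sample index’s remaining director isn’t shown in this section, and breaking ties by key is an assumption that happens to match the bucket order in the responses above.

```python
from collections import Counter


def terms_aggregation(values, size=10):
    """Emulate the shape of a terms aggregation result for a list of values."""
    counts = Counter(values)
    # Highest count first; ties are broken by key (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    returned, omitted = ordered[:size], ordered[size:]
    return {
        # Documents that fall into buckets left out of the response.
        "sum_other_doc_count": sum(count for _, count in omitted),
        "buckets": [{"key": key, "doc_count": count} for key, count in returned],
    }


# Illustrative values only: one director value per movie in a six-movie index.
# "Some Other Director" is a placeholder, not a value from the real index.
directors = [
    "Francis Ford Coppola",
    "Francis Ford Coppola",
    "Andrew Dominik",
    "David Lean",
    "Robert Mulligan",
    "Some Other Director",
]

result = terms_aggregation(directors, size=2)
print(result["buckets"])
print(result["sum_other_doc_count"])
```

With size set to two, the sketch keeps the two largest buckets and reports the three remaining documents in sum_other_doc_count, mirroring the response above.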

Unlike the top-level search request body, a terms aggregation doesn’t accept a from parameter. This means that it’s not possible to paginate the results of a terms aggregation the way we paginate hits.

Multiple aggregations in a single request

In the above examples we only asked for a single aggregation per request. However, we’re by no means limited to that. For instance, if we’d like to know the number of movies grouped by director as well as the number of movies grouped by decade, we can use the request below:

A search request with two aggregations.

curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "size": 0,
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original",
        "size": 2
      }
    },
    "decades": {
      "histogram": {
        "field": "year",
        "interval": 10
      }
    }
  }
}'

In the above request we’ve added a second aggregation named “decades”. This aggregation is of a different type than the “directors” aggregation: it’s a histogram aggregation. Histogram aggregations can be used to group documents by the value of a numeric or date field, in buckets of a specified interval. With an interval of 10 on the year field, each bucket covers one decade. The response from Elasticsearch looks like this:

Elasticsearch’s response to our request with two aggregations.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 3,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        }
      ]
    },
    "decades": {
      "buckets": [
        {
          "key": 1960,
          "doc_count": 2
        },
        {
          "key": 1970,
          "doc_count": 2
        },
        {
          "key": 2000,
          "doc_count": 2
        }
      ]
    }
  }
}
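To see where the keys 1960, 1970 and 2000 come from, here is a small Python sketch of the bucketing a histogram aggregation performs: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. The years below are illustrative values chosen to reproduce the counts in the response, not data read from the index, and the function name is our own. Empty buckets (such as 1980) are simply left out here, as they are in the response above.

```python
import math
from collections import Counter


def histogram_aggregation(values, interval):
    """Count values per bucket; a bucket key is the value rounded down
    to the nearest multiple of interval."""
    counts = Counter(math.floor(value / interval) * interval for value in values)
    # Buckets are listed in ascending key order; empty buckets are omitted.
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


# Illustrative release years only, one per movie in a six-movie index.
years = [1962, 1962, 1972, 1979, 2003, 2007]

decades = histogram_aggregation(years, interval=10)
for bucket in decades:
    print(bucket["key"], bucket["doc_count"])
```

Changing the interval changes the grouping without touching the data: an interval of 5 would split each decade into two buckets.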
Last updated: 03 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.