We’ve already seen how we can search with ElasticSearch and get results in the form of hits. However, the _search endpoint offers more. We can also ask it for aggregations based on fields within the documents that the query matches. As an example, let’s look at this request:
Finding all movies in the Drama genre.
curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genres.original": "Drama"
        }
      }
    }
  }
}'
The above request helps us answer the question “What movies are there in the genre drama?”. But let’s say we were interested in the question “What directors have directed the movies in the drama genre?”. Theoretically we could grab all of the results of the above request and look at their director fields.
Of course that would result in duplicates, a single director having directed multiple drama movies. So, we’d have to group by the director name. And, if we’re interested in finding out which director has directed the most dramas we’d have to sort by that.
We could of course do this in code, but that would be highly inefficient, both because we’d have to write that code and because we’d have to grab all of the results from ElasticSearch (while we have just six movies in our index, imagine doing that for IMDB). Luckily, this type of computation is easily done by asking ElasticSearch for an aggregation.
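To make the cost of the client-side approach concrete, here is a minimal Python sketch of the grouping and sorting we would have to write ourselves. The hits list is hand-written sample data standing in for results fetched from ElasticSearch, and the helper name count_by_director is our own, not part of any ElasticSearch client.

```python
from collections import Counter

# Hand-written stand-in for the documents returned by the drama query.
# In practice every matching document would have to be fetched first.
hits = [
    {"director": "Francis Ford Coppola"},
    {"director": "Andrew Dominik"},
    {"director": "Francis Ford Coppola"},
    {"director": "David Lean"},
    {"director": "Robert Mulligan"},
]


def count_by_director(documents):
    """Group documents by director and sort by number of movies, descending."""
    counts = Counter(doc["director"] for doc in documents)
    return counts.most_common()


director_counts = count_by_director(hits)
for name, count in director_counts:
    print(f"{name}: {count}")
```

Even in this tiny form, the approach requires transferring every matching document to the client before any counting can start.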
There are many types of aggregations, but in this particular case, where we want to group by the exact value in a field, a terms aggregation is suitable. Below is the above request updated to include a terms aggregation for the director field. Note that it is more specifically for the director.original field, as we want to aggregate based on the full director name, not the individual words.
Searching for movies in the drama genre and asking ElasticSearch for a terms aggregation for the director.original field.
curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genres.original": "Drama"
        }
      }
    }
  },
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original"
      }
    }
  }
}'
The response to the above request from ElasticSearch is this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": [
      //Hits here, omitted for brevity
    ]
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        },
        {
          "key": "David Lean",
          "doc_count": 1
        },
        {
          "key": "Robert Mulligan",
          "doc_count": 1
        }
      ]
    }
  }
}
The response now includes another top-level property, aggregations, which contains the result of the aggregation we asked for. Let’s inspect both the search request and response with a terms aggregation a bit closer.
[Figure: A closer look at a search request with a terms aggregation in Sense.]
[Figure: A closer look at the response to a search request with a terms aggregation in Sense.]
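For readers consuming the response in code rather than in Sense, here is a small Python sketch that pulls the buckets out of the aggregations property. The JSON is a trimmed copy of the response above, and the helper name bucket_counts is our own.

```python
import json

# Trimmed copy of the response shown above (individual hits omitted).
response_text = """
{
  "hits": {"total": 5, "max_score": 0, "hits": []},
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "Francis Ford Coppola", "doc_count": 2},
        {"key": "Andrew Dominik", "doc_count": 1},
        {"key": "David Lean", "doc_count": 1},
        {"key": "Robert Mulligan", "doc_count": 1}
      ]
    }
  }
}
"""


def bucket_counts(response, aggregation_name):
    """Return (key, doc_count) pairs for a named bucket aggregation."""
    buckets = response["aggregations"][aggregation_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


response = json.loads(response_text)
directors = bucket_counts(response, "directors")
for name, count in directors:
    print(f"{name}: {count}")
```

The name used to look up the result ("directors") is the name we chose for the aggregation in the request, not the name of the field.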
As with search results, the results of a terms aggregation (and some other aggregation types) are by default limited to ten items. We can override this default by adding a size parameter to the aggregation. Here’s an example:
Searching for all movies and aggregating the director.original field with size set to two.
curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "size": 0,
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original",
        "size": 2
      }
    }
  }
}'
The response from ElasticSearch to the above request.
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 3,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        }
      ]
    }
  }
}
As we can see in the response, ElasticSearch respects the size parameter in the terms aggregation and only returns two buckets. Also note that the returned sum_other_doc_count property has the value three. It is the sum of the document counts of all buckets that weren’t included in the response, so three of our six movies fall outside the two returned buckets. In this particular index each of those three movies has a different director, which means ElasticSearch has found a total of five unique values in the director.original field: two that are included in the response and three that aren’t.
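The relationship between size, the returned buckets and sum_other_doc_count can be sketched in a few lines of Python. This is only an in-memory imitation, not how ElasticSearch computes the result across shards. The tie-break on key is an assumption chosen to match the response above, and "Sixth Director" is a hypothetical placeholder for the one movie whose director this section doesn’t show.

```python
from collections import Counter


def terms_aggregation(values, size=10):
    """Imitate a terms aggregation over a list of field values."""
    counts = Counter(values)
    # Order by document count (descending), then by key for ties (assumed).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    returned = ordered[:size]
    left_out = ordered[size:]
    return {
        "buckets": [{"key": key, "doc_count": n} for key, n in returned],
        # Sum of the document counts of the buckets that were cut off.
        "sum_other_doc_count": sum(n for _, n in left_out),
    }


directors = [
    "Francis Ford Coppola",
    "Francis Ford Coppola",
    "Andrew Dominik",
    "David Lean",
    "Robert Mulligan",
    "Sixth Director",  # hypothetical placeholder, not from the index
]
result = terms_aggregation(directors, size=2)
print(result)
```

Note that sum_other_doc_count counts documents, not buckets. It equals the number of omitted directors here only because each of them has a single movie.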
Unlike the top-level search request body, aggregations can’t be given a from parameter. This means that it’s not possible to do pagination for aggregation results.
In the above examples we only asked for a single aggregation in the requests. However, we’re by no means limited to doing so. For instance, if we’d like to know both the number of movies grouped by director and grouped by decade, we can use the below request:
A search request with two aggregations.
curl -XPOST "https://localhost:9200/movies/movie/_search" -d'
{
  "size": 0,
  "aggregations": {
    "directors": {
      "terms": {
        "field": "director.original",
        "size": 2
      }
    },
    "decades": {
      "histogram": {
        "field": "year",
        "interval": 10
      }
    }
  }
}'
In the above request we’ve added a second aggregation named “decades”. This aggregation is of a different type than the “directors” aggregation. It’s a histogram aggregation. Histogram aggregations can be used to group fields with numeric or date values according to a specified interval. The response from ElasticSearch looks like this:
ElasticSearch’s response to our request with two aggregations.
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 6,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "directors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 3,
      "buckets": [
        {
          "key": "Francis Ford Coppola",
          "doc_count": 2
        },
        {
          "key": "Andrew Dominik",
          "doc_count": 1
        }
      ]
    },
    "decades": {
      "buckets": [
        {
          "key": 1960,
          "doc_count": 2
        },
        {
          "key": 1970,
          "doc_count": 2
        },
        {
          "key": 2000,
          "doc_count": 2
        }
      ]
    }
  }
}
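The keys in the “decades” buckets come from rounding each value down to the nearest multiple of the interval. Below is a minimal Python sketch of that rounding. The years are illustrative values chosen to reproduce the bucket counts above, since the actual years in the index aren’t shown in this section, and the function name histogram_aggregation is our own.

```python
import math
from collections import Counter


def histogram_aggregation(values, interval):
    """Imitate a histogram aggregation: round each value down to its bucket key."""
    keys = (math.floor(value / interval) * interval for value in values)
    counts = Counter(keys)
    # Only non-empty buckets are listed, as in the response above.
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


# Illustrative release years (assumed, not taken from the index).
years = [1962, 1962, 1972, 1979, 2003, 2007]
decades = histogram_aggregation(years, interval=10)
print(decades)
```

With an interval of 10, both 1972 and 1979 land in the bucket with key 1970, which is how a plain numeric year field ends up grouped by decade.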
Ravindra Savaram is a Technical Lead at Mindmajix.com.