Mindmajix

Retrieving Multiple Documents — Elasticsearch Multi Get

Multi get

While the bulk API enables us to create, update and delete multiple documents, it doesn’t support retrieving multiple documents at once. We can of course do that using requests to the _search endpoint, but if the only criterion for the documents is their IDs, Elasticsearch offers a more efficient and convenient way: the multi get API. Below is an example multi get request:

A request that retrieves two movie documents.

curl -XGET "http://localhost:9200/_mget" -d'
{
  "docs": [
    {
      "_index": "movies",
      "_type": "movie",
      "_id": "1"
    },
    {
      "_index": "movies",
      "_type": "movie",
      "_id": "2"
    }
  ]
}'

The response from Elasticsearch looks like this:

The response from Elasticsearch to the above _mget request.

{
  "docs": [
    {
      "_index": "movies",
      "_type": "movie",
      "_id": "1",
      "_version": 1,
      "found": true,
      "_source": {
        "title": "The Godfather",
        "director": "Francis Ford Coppola",
        "year": 1972,
        "genres": [
          "Crime",
          "Drama"
        ]
      }
    },
    {
      "_index": "movies",
      "_type": "movie",
      "_id": "2",
      "_version": 1,
      "found": true,
      "_source": {
        // Omitted for brevity
      }
    }
  ]
}
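Each entry in docs carries its own found flag, so a client typically checks the documents one by one rather than treating the response as all-or-nothing. Below is a minimal Python sketch of that check. The helper name split_mget_docs and the missing document with ID 99 are assumptions made for illustration; they are not part of the response above.

```python
import json

# Sample response shaped like the one above. The second entry is a
# hypothetical miss, added only to exercise the "found": false path.
response_text = """
{
  "docs": [
    {"_index": "movies", "_type": "movie", "_id": "1", "_version": 1,
     "found": true, "_source": {"title": "The Godfather", "year": 1972}},
    {"_index": "movies", "_type": "movie", "_id": "99", "found": false}
  ]
}
"""


def split_mget_docs(response):
    """Return (sources keyed by ID, list of IDs that were not found)."""
    found, missing = {}, []
    for doc in response["docs"]:
        if doc.get("found"):
            found[doc["_id"]] = doc["_source"]
        else:
            missing.append(doc["_id"])
    return found, missing


found, missing = split_mget_docs(json.loads(response_text))
print(found["1"]["title"])
print(missing)
```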

If we put the index name in the URL, we can omit the _index parameters from the body. The same goes for the type name and the _type parameter. And if we only want to retrieve documents of the same type, we can skip the docs parameter altogether and instead send a list of IDs:

Shorthand form of a _mget request.

curl -XGET "http://localhost:9200/movies/movie/_mget" -d'
{
  "ids": [ "1", "2" ]
}'

The multi get API also supports source filtering, returning only parts of the documents. For more about that and the multi get API in general, see the documentation.
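As a rough illustration of source filtering, the Python sketch below builds an _mget body in which each docs entry carries a _source list naming the fields to return. The helper name mget_body and the chosen fields are assumptions for this example, and the sketch only builds the body; it would be sent to the movies/movie/_mget URL in the same way as the requests above.

```python
import json


def mget_body(ids, fields=None):
    """Build an _mget body; fields, if given, limits _source per document."""
    docs = []
    for doc_id in ids:
        doc = {"_id": str(doc_id)}
        if fields is not None:
            # Only the listed fields are requested for this document.
            doc["_source"] = list(fields)
        docs.append(doc)
    return {"docs": docs}


body = mget_body(["1", "2"], fields=["title", "year"])
print(json.dumps(body, indent=2))
```

With fields left as None, the body requests whole documents, which matches the unfiltered examples earlier in this section.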

Delete by query

Sometimes we may need to delete documents that match certain criteria from an index. If we know the IDs of the documents, we can of course use the _bulk API, but if we don’t, another API comes in handy: the delete by query API.

Through this API we can delete all documents that match a query. The query is expressed using Elasticsearch’s query DSL, which we learned about in post three. Below is an example request, deleting all movies from 1962.

A delete by query request, deleting all movies with year == 1962.

curl -XDELETE "http://localhost:9200/movies/movie/_query" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "year": 1962
        }
      }
    }
  }
}'

The type in the URL is optional but the index is not. However, we can perform the operation over all indexes by using the special index name _all if we really want to.
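The URL rules can be sketched in a few lines of Python. The helper names below are hypothetical, the host is the one used in the examples above, and nothing is sent to a server; the sketch only shows how the index, the optional type and the _query endpoint fit together.

```python
import json

BASE_URL = "http://localhost:9200"  # same host as the examples above


def delete_by_query_url(index, doc_type=None):
    """Build the _query URL; the index is required, the type is optional."""
    if not index:
        raise ValueError("an index name (or _all) is required")
    parts = [BASE_URL, index]
    if doc_type:
        parts.append(doc_type)
    parts.append("_query")
    return "/".join(parts)


def term_filter_query(field, value):
    """Same constant_score/term structure as the request above."""
    return {"query": {"constant_score": {"filter": {"term": {field: value}}}}}


print(delete_by_query_url("movies", "movie"))
print(delete_by_query_url("movies"))
print(delete_by_query_url("_all"))
print(json.dumps(term_filter_query("year", 1962)))
```

The index argument deliberately has no default, so targeting every index requires passing _all explicitly.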

While it’s possible to delete everything in an index by using delete by query, it’s far more efficient to simply delete the index and re-create it.

Time to live

Let’s say that we’re indexing content from a content management system. In the system, content can have a date set after which it should no longer be considered published. If we’re lucky, there’s some event that we can intercept when content is unpublished, and when that happens we delete the corresponding document from our index. However, that’s not always the case.

This is one of many cases where documents in Elasticsearch have an expiration date and we’d like to tell Elasticsearch, at indexing time, that a document should be removed after a certain duration. Elasticsearch supports this by allowing us to specify a “time to live” for a document when indexing it. We do that by adding a ttl query string parameter to the URL. The value can be either a duration in milliseconds or a duration in text, such as “1w”. Below is an example, indexing a movie with a time to live:

Indexing a movie with a one-hour (60*60*1000 milliseconds) ttl.

curl -XPUT "http://localhost:9200/movies/movie/4?ttl=3600000" -d'
{
  "title": "Apocalypse Now",
  "director": "Francis Ford Coppola",
  "year": 1979,
  "genres": ["Drama", "War"]
}'

If we were to perform the above request and return an hour later, we’d expect the document to be gone from the index. That wouldn’t be the case, though, as the time to live functionality is disabled by default and needs to be activated on a per-index basis through mappings. Here’s how we enable it for the movies index:

Updating the movies index’s mappings to enable ttl.

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
  "_ttl": { "enabled": true }
}'

Apart from the enabled property in the above request, we can also send a parameter named default with a default ttl value. If we don’t, as in the request above, only documents where we specify ttl during indexing will have a ttl value. Anyhow, if we now, with ttl enabled in the mappings, index the movie with ttl again, it will automatically be deleted after the specified duration.
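To make the default parameter concrete, here is a small Python sketch that builds the mapping body with and without it. The helper name ttl_mapping and the value "1d" are assumptions chosen for illustration; the resulting body would be sent to the same _mapping URL as in the request above.

```python
import json


def ttl_mapping(default=None):
    """Build a _ttl mapping body; default is an optional ttl such as "1d"."""
    ttl = {"enabled": True}
    if default is not None:
        # Applied to documents indexed without an explicit ttl.
        ttl["default"] = default
    return {"_ttl": ttl}


print(json.dumps(ttl_mapping()))
print(json.dumps(ttl_mapping(default="1d")))
```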

The time to live functionality works by Elasticsearch regularly searching for documents that are due to expire, in indexes with ttl enabled, and deleting them. By default this is done once every 60 seconds. It’s possible to change this interval if needed. For more information about how to do that, and about ttl in general, see the documentation.
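One consequence of the periodic check is that a document can outlive its ttl by up to one interval. The Python sketch below is a simplified model of that timing, not a description of Elasticsearch internals: it assumes the check runs at exact multiples of the interval and removes every document whose expiry time has passed. The function name removal_time is hypothetical.

```python
PURGE_INTERVAL_S = 60  # default check interval mentioned above


def removal_time(indexed_at_s, ttl_s, interval_s=PURGE_INTERVAL_S):
    """Time in seconds at which the document is removed in this model.

    Assumes the check runs at 0, interval, 2 * interval, ... and deletes
    every document whose expiry time is at or before the check time.
    """
    expires_at = indexed_at_s + ttl_s
    # Round up to the first check at or after the expiry time.
    return ((expires_at + interval_s - 1) // interval_s) * interval_s


# Indexed 10 seconds after a check, with a one-hour ttl:
print(removal_time(10, 3600))  # 3660
```

In this model the document expires at 3610 seconds but stays in the index until the check at 3660 seconds.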

As the ttl functionality requires Elasticsearch to regularly perform queries, it’s not the most efficient approach if all we want to do is limit the size of the indexes in a cluster. When, for instance, storing only the last seven days of log data, it’s often better to use rolling indexes, such as one index per day, and delete whole indexes when the data in them is no longer needed.
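A rolling-index scheme needs a naming convention and a way to decide which indexes have aged out. The Python sketch below shows one possible approach. The logs- prefix, the date format and both helper names are assumptions for illustration only.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logs-"):
    """Hypothetical naming scheme: one index per day, e.g. logs-2014.03.01."""
    return prefix + day.strftime("%Y.%m.%d")


def expired_indexes(existing, today, keep_days=7, prefix="logs-"):
    """Return the index names that fall outside the retention window."""
    keep = {daily_index_name(today - timedelta(days=n), prefix)
            for n in range(keep_days)}
    return sorted(name for name in existing
                  if name.startswith(prefix) and name not in keep)


today = date(2014, 3, 10)
# Ten daily indexes, from today back to nine days ago.
existing = [daily_index_name(today - timedelta(days=n)) for n in range(10)]
print(expired_indexes(existing, today))
# ['logs-2014.03.01', 'logs-2014.03.02', 'logs-2014.03.03']
```

Each name returned would then be removed with a single index delete request, which avoids querying for individual documents.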

