Define Elasticsearch Custom Analyzer with Example

Custom Analyzers

An analyzer of type custom lets you combine a tokenizer with zero or more token filters and zero or more char filters. The custom analyzer accepts the logical/registered name of the tokenizer to use and a list of logical/registered names of token filters. The name of a custom analyzer must not start with “_”.

While Elasticsearch comes with a number of analyzers available out of the box, the real power comes from the ability to create your own custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.
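
To make this concrete, here is a minimal sketch of such a combination. The index name my_index and the analyzer name my_custom are made up for the example, while html_strip, standard, lowercase and stop are built-in components:

curl -XPUT "http://localhost:9200/my_index" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'

Any field mapped with this analyzer would first have HTML markup stripped, then be split into terms by the standard tokenizer, and finally have its terms lowercased and stopwords removed.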

Analyzers

Analyzers are composed of a single Tokenizer and zero or more TokenFilters. The tokenizer may be preceded by one or more CharFilters. The analysis module allows you to register analyzers under logical names, which can then be referenced either in mapping definitions or in certain APIs.

Elasticsearch comes with a number of prebuilt analyzers which are ready to use. Alternatively, you can combine the built-in character filters, tokenizers, and token filters to create custom analyzers.

An analyzer is what a field’s value passes through on its way into the index, and we define which analyzer, if any, should be used through mappings. In order for text-based search to work as expected, search terms are also passed through the same analyzer.
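
For instance, assuming the my_custom analyzer registered in the sketch above, a mapping could point a field at it like this (the index and type names are again just examples):

curl -XPUT "http://localhost:9200/my_index/_mapping/article" -d'
{
  "article": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_custom"
      }
    }
  }
}'

With this mapping, both values indexed into title and search terms matched against it are run through my_custom.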

An analyzer is composed of zero or more character filters, a single tokenizer, and zero or more token filters.

There are a number of predefined analyzers in ElasticSearch, such as the Standard analyzer, which is what’s used for strings by default, and the language analyzers that optimize for searching in a specific language.
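
If you want to see what a given analyzer does to a piece of text, the analyze API is a handy way to inspect the produced tokens. A quick sketch, using the same request style as the rest of this article:

curl -XGET "http://localhost:9200/_analyze?analyzer=standard" -d'The Quick Brown Foxes'

The Standard analyzer splits on word boundaries and lowercases, so the response lists the tokens the, quick, brown and foxes along with their positions and offsets.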

An example

Let’s create a new index with dynamic mappings that does this again, mapping strings as multi field with a .original field that isn’t analyzed:

curl -XPUT "http://localhost:9200/pages" -d'
 {
 "mappings": {
 "_default_": {
 "dynamic_templates": [
 {
 "strings": {
 "match_mapping_type": "string",
 "path_match": "*",
 "mapping": {
 "type": "string",
 "fields": {
 "original": {
 "type": "string",
 "index": "not_analyzed"
                 }
               }
             }
           }
         }
       ]
     }
   }
 }'

We’ve already seen this mapping in action, with movies, and that it’s working as expected. However, there is one problem with it. Let’s say we index a document with a string field that has a really long value:

curl -XPOST "http://localhost:9200/pages/article" -d'
 {
 "title": "With a very long text",
 "text": "This is a actually a really long text..."
 }'

It’s a bit tricky to illustrate a really long text as part of a request in a tutorial, but imagine that the text property’s value in the above request is considerably longer. The response from ElasticSearch would then be:

{
  "error": "IllegalArgumentException[Document contains at least one immense term in field=\"text.original\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[76, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original message: bytes can be at most 32766 in length; got 47287]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 47287]; ",
  "status": 500
}

ElasticSearch responds with an exception telling us that there’s a term in the text.original field that exceeds 32766 bytes and that it doesn’t like that. While this may seem uncooperative of ElasticSearch, it’s actually quite helpful. After all, having gigantic terms in our index would be problematic performance-wise, and it wouldn’t be efficient in terms of disk usage either.

And when would we really be interested in filtering on the full, exact values of really, really long strings? Probably never. However, we still need to be able to do so for shorter strings. One solution could be to skip dynamic mapping and map each individual string field explicitly, only adding the .original mapping for fields that we know won’t contain huge strings. Giving up the dynamic mapping and mapping all fields explicitly wouldn’t be much fun though.
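
For completeness, such an explicit mapping might look something like the sketch below, where only the title field gets a not_analyzed .original sub field and the potentially huge text field does not (the field names are borrowed from the article document above):

curl -XPUT "http://localhost:9200/pages" -d'
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "original": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "text": {
          "type": "string"
        }
      }
    }
  }
}'

This works, but every new string field has to be added to the mapping by hand, which is exactly the chore the dynamic template was meant to spare us.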

An alternative solution is to map the .original field as analyzed, but with an analyzer that indexes shorter strings as a single token and simply ignores longer strings. To do so, we need to create an analyzer that uses the keyword tokenizer, which produces a single token identical to the input value, and a length token filter that removes tokens longer than a given length. Below is a request that does this.

curl -XPUT "http://localhost:9200/pages" -d'
 {
 "settings": {
 "analysis": {
 "filter": {
 "short_original": {
 "type": "length",
 "max": 200
 }
 },
 "analyzer": {
 "original": {
 "tokenizer": "keyword",
 "filter": "short_original"
 }
 }
 }
 },
 "mappings": {
 "_default_": {
 "dynamic_templates": [
 {
 "strings": {
 "match_mapping_type": "string",
 "path_match": "*",
 "mapping": {
 "type": "string",
 "fields": {
 "original": {
 "type": "string",
 "index": "analyzed",
 "analyzer": "original"
                 }
               }
             }
           }
         }
       ]
     }
   }
 }'

Using these mappings, all string fields will be indexed using the Standard analyzer and also have a <field name>.original field. However, for strings longer than 200 characters, the .original field will be empty.
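
To convince ourselves that the original analyzer behaves as intended, we can run some text through it with the analyze API against the pages index:

curl -XGET "http://localhost:9200/pages/_analyze?analyzer=original" -d'A reasonably short string'

This returns a single token containing the whole input string, while the same request with a text longer than 200 characters returns no tokens at all, which is why the .original field simply ends up empty for long values.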

