Elasticsearch Custom Analyzer

 

Search tools keep growing more capable, so it pays to understand the features they offer. This article looks at custom analyzers in Elasticsearch. Analyzers are an essential part of relevance engineering, because they determine how text is broken into the terms that get indexed and searched.

What is Elasticsearch Analyzer?

An Elasticsearch analyzer is a combination of three lower-level building blocks: character filters, a tokenizer, and token filters. The built-in analyzers package these blocks into ready-made analyzers suited to different languages and types of text. The blocks can also be combined individually to build a custom Elasticsearch analyzer.

An Elasticsearch Analyzer comprises the following:

  • 0 or more CharFilters
  • 1 Tokenizer
  • 0 or more TokenFilters

A CharFilter is a pre-processing step that runs on the input text before it is passed to the Tokenizer component of an Analyzer. The Tokenizer is the component that splits the input into a stream of tokens.

TokenFilters accept the tokens produced by the Tokenizer and modify them, add new tokens, or delete tokens. Stemming, adding synonyms, and removing stop words are typical examples of what TokenFilters do. Elasticsearch ships with many Tokenizers and TokenFilters, and custom ones can be created as well.

The workflow of the whole analyzing part is shown in a simple diagram for easier understanding. This just explains what we have discussed earlier.
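
The same workflow can be sketched in a few lines of plain Python. This is only an illustration of the CharFilter, Tokenizer, TokenFilter sequence, not Elasticsearch code: the function names, the HTML-stripping regular expression, and the tiny stop word list are all assumptions made for the example.

```python
import re

def strip_html(text):
    # CharFilter stage: works on the raw string before tokenization.
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    # Tokenizer stage: splits the string into a stream of tokens.
    return text.split()

def lowercase_filter(tokens):
    # TokenFilter stage: modifies tokens.
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "in", "a"})):
    # TokenFilter stage: deletes tokens (hypothetical stop word list).
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    # 0 or more CharFilters, exactly 1 Tokenizer, 0 or more TokenFilters.
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>Playing</b> in the Rain",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)  # ['playing', 'rain']
```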

Different types of Elasticsearch Analyzers

Let us look at the various Elasticsearch analyzers in a little more detail. All of them are already defined in the analysis module under logical names, and these names can be referenced either in the APIs or in mapping definitions.

Standard Analyzer:

By default, the stopwords parameter of this analyzer is empty and the max_token_length setting is 255. Both parameters can be set to other values if needed.

Simple Analyzer:

The Simple Analyzer uses the lowercase tokenizer by default.

Whitespace Analyzer:

The Whitespace Analyzer uses the whitespace tokenizer by default.

Stop Analyzer:

For the Stop Analyzer, the stopwords and stopwords_path parameters can be configured. By default, stopwords is set to the English stop word list; stopwords_path can be used instead to point to a file containing the stop words.

Keyword Analyzer:

A Keyword Analyzer emits the entire input as a single token. It is typically used for values such as zip codes.

Pattern Analyzer:

A Pattern Analyzer is the one which deals with patterns and has settings like lowercase, flags, pattern and stopwords which can be configured.

Language Analyzer:

A Language Analyzer deals with the languages that could be used, like Hindi, English, etc.

Snowball Analyzer:

A Snowball Analyzer uses the standard tokenizer and a standard filter in conjunction with the snowball filter, stop filter, and the lowercase filter.

Custom Analyzer:

This option lets us create analyzers that meet our own requirements. A custom analyzer is built from one Tokenizer and an optional set of CharFilters and TokenFilters. The settings available for this kind of analyzer are tokenizer, char_filter, filter, and position_increment_gap.
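
As a rough sketch of how these four settings fit together, the following Python snippet builds the settings body for a custom analyzer as a dictionary and prints it as JSON. The analyzer name "my_custom_analyzer" is hypothetical, and the particular choice of html_strip, standard, lowercase, and asciifolding is only an assumption for the example; any built-in or custom components could take their place.

```python
import json

# "my_custom_analyzer" is a made-up name for this sketch.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    # 0 or more character filters
                    "char_filter": ["html_strip"],
                    # exactly 1 tokenizer
                    "tokenizer": "standard",
                    # 0 or more token filters, applied in list order
                    "filter": ["lowercase", "asciifolding"],
                    # position gap inserted between values of a multi-valued field
                    "position_increment_gap": 100,
                }
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```

The printed JSON could then be sent as the body of an index creation request.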

Elasticsearch Tokenizers

In this section, let us look at what a Tokenizer is and where it fits in. A Tokenizer is the component that generates tokens from text in Elasticsearch. How the tokens are created depends on the parameters passed to the Tokenizer. Elasticsearch offers plenty of Tokenizers, any of which can be used when creating our own Custom Analyzers.

Different types of Elasticsearch Tokenizers

Now that we know what an Elasticsearch Tokenizer is, let us go through the Tokenizers that are available out of the box:

Standard Tokenizer:

The Standard Tokenizer is the default tokenizer. It is grammar-based, and its max_token_length parameter can be configured.

Edge NGram Tokenizer:

The Edge NGram Tokenizer comes with parameters like the min_gram, token_chars and max_gram which can be configured.

Keyword Tokenizer:

The Keyword Tokenizer emits the entire input as a single token. Its buffer_size parameter can be configured.

Letter Tokenizer:

The Letter Tokenizer splits the text whenever it encounters a non-letter character, so each token is a run of consecutive letters.

Lowercase Tokenizer:

The Lowercase Tokenizer works in the same manner as that of the Letter Tokenizer, with the only change that the tokens which are generated out of this Tokenizer are further converted to lowercase.
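
The behavior of these two tokenizers can be approximated in plain Python. This is a simplified sketch, not the Elasticsearch implementation: the regular expression below is an assumption that treats a "letter" as any word character that is neither a digit nor an underscore.

```python
import re

def letter_tokenizer(text):
    # Each token is a maximal run of letters; anything else is a separator.
    return re.findall(r"[^\W\d_]+", text)

def lowercase_tokenizer(text):
    # Same splitting as the letter tokenizer, then lowercase every token.
    return [token.lower() for token in letter_tokenizer(text)]

print(letter_tokenizer("The 2 QUICK Brown-Foxes"))
# ['The', 'QUICK', 'Brown', 'Foxes']
print(lowercase_tokenizer("The 2 QUICK Brown-Foxes"))
# ['the', 'quick', 'brown', 'foxes']
```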

NGram Tokenizer:

The NGram Tokenizer comes with the configurable parameters min_gram, max_gram, and token_chars. The default values are 1 for min_gram and 2 for max_gram.
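
To make the difference between n-grams and edge n-grams concrete, here is a small plain-Python sketch. It assumes min_gram=1 and max_gram=2, works on a single word, and ignores token_chars entirely, so it only approximates what the Elasticsearch tokenizers produce.

```python
def ngrams(text, min_gram=1, max_gram=2):
    # Every substring of length min_gram..max_gram, at every start position.
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

def edge_ngrams(text, min_gram=1, max_gram=2):
    # Only the substrings anchored at the beginning of the text.
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]

print(ngrams("fox"))             # ['f', 'fo', 'o', 'ox', 'x']
print(edge_ngrams("fox"))        # ['f', 'fo']
print(edge_ngrams("fox", 1, 5))  # ['f', 'fo', 'fox']
```

Edge n-grams are the usual choice for prefix matching such as search-as-you-type, since every token starts at the beginning of the word.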

Whitespace Tokenizer:

The Whitespace Tokenizer splits the input string into tokens based on whitespace as the token separator.

Pattern Tokenizer:

The Pattern Tokenizer uses patterns, that is, regular expressions, as token separators. The parameters flags, group, and pattern can be configured with this tokenizer.
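
Splitting on a regular expression can be sketched in Python as follows. The sketch assumes the pattern is used purely as a separator and uses \W+ (one or more non-word characters) as the default pattern; the flags and group parameters are not modelled here.

```python
import re

def pattern_tokenizer(text, pattern=r"\W+"):
    # Split wherever the pattern matches and drop empty leftovers.
    return [token for token in re.split(pattern, text) if token]

print(pattern_tokenizer("The foo_bar_size's default is 5."))
# ['The', 'foo_bar_size', 's', 'default', 'is', '5']
print(pattern_tokenizer("comma,separated,values", pattern=","))
# ['comma', 'separated', 'values']
```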

UAX Email URL Tokenizer:

The UAX Email URL Tokenizer behaves exactly like the Standard Tokenizer, with the only difference that it treats email addresses and URLs as single tokens.

Path Hierarchy Tokenizer:

The Path Hierarchy Tokenizer generates all possible paths that are present in the provided input path. There are some parameters for this tokenizer like replacement, delimiter, buffer_size, skip, and reverse. The default values for these parameters are 1024 for buffer_size, false for reverse, 0 for skip and / for delimiter.
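
A simplified Python sketch shows what "all possible paths" means for an input such as /one/two/three. It models only the delimiter parameter with forward (non-reversed) output; replacement, buffer_size, skip, and reverse are left out, so treat it as an illustration rather than a faithful reimplementation.

```python
def path_hierarchy(path, delimiter="/"):
    # Emit every ancestor path, from the shortest up to the full input.
    if not path:
        return []
    tokens = []
    # Skip a leading delimiter so that it does not produce an empty token.
    start = len(delimiter) if path.startswith(delimiter) else 0
    index = path.find(delimiter, start)
    while index != -1:
        tokens.append(path[:index])
        index = path.find(delimiter, index + 1)
    tokens.append(path)
    return tokens

print(path_hierarchy("/one/two/three"))
# ['/one', '/one/two', '/one/two/three']
```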

Classic Tokenizer:

The Classic Tokenizer generally works on the grammar and generates the tokens based on the same. This works just like the Standard Tokenizer, with the same parameter max_token_length that can be set with this tokenizer.

Thai Tokenizer:

The Thai Tokenizer is used for the Thai language and uses the default Thai segmentation algorithm.

How to use Analyzers?

To use the Tokenizer and TokenFilters of your choice, you first create an Analyzer in your index settings and then reference it in your mapping. Suppose you want an Analyzer that splits the input with the standard tokenizer, lowercases the tokens, and then applies stemming. The following request creates an index with such an Analyzer:

curl -X POST http://127.0.0.1:9200/trymyindex/ -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "mycustomized_english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "mycustomized_lowercase_stemmed": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mycustomized_english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "mycustomized_lowercase_stemmed"
        }
      }
    }
  }
}'

This may look like a lot to digest at first, so let us go through the request part by part. At the top level there are two keys: “settings” and “mappings”. Put simply, index settings go under “settings”, and the mappings for all types go under “mappings”. The “settings” attribute can hold various kinds of settings, such as index settings, analysis settings, ready settings, replica settings, and cache settings.

The analysis JSON contains the definitions of both the filters and the analyzers. For our custom Analyzer, some of the filters have to be defined together with their mandatory options. In our case, the stemmer filter requires a language, hence the “mycustomized_english_stemmer” section:

{
  "mycustomized_english_stemmer": {
    "type": "stemmer",
    "name": "english"
  }
}

With the filter defined, we can define the analyzer that uses it. The analyzer is named “mycustomized_lowercase_stemmed”; it uses the standard tokenizer followed by a list of filters. The first filter, “lowercase”, is a built-in Elasticsearch filter that needs no extra configuration. The second is the custom filter defined above. This is what our Analyzer JSON looks like:

{
  "analyzer": {
    "mycustomized_lowercase_stemmed": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "mycustomized_english_stemmer"
      ]
    }
  }
}

One thing to keep in mind is the order of the filters in the list, because the tokens pass through the filters in exactly that order. Finally, the mappings section ties everything together by assigning the analyzer to a field, as shown below:

{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "mycustomized_lowercase_stemmed"
        }
      }
    }
  }
}
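
Returning to the point about filter order, the following plain-Python sketch shows how the same two filters can give different results depending on their order. It is only an illustration: the stop word list is hypothetical, and the sketch assumes the stop word check is case-sensitive.

```python
STOPWORDS = frozenset({"the", "in"})  # hypothetical stop word list

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stop_filter(tokens):
    # Case-sensitive check: "The" is not the same as "the".
    return [token for token in tokens if token not in STOPWORDS]

def apply_filters(tokens, filters):
    # Filters run in list order, each one receiving the previous output.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

tokens = ["The", "Rain"]
print(apply_filters(tokens, [lowercase_filter, stop_filter]))  # ['rain']
print(apply_filters(tokens, [stop_filter, lowercase_filter]))  # ['the', 'rain']
```

In the second ordering the stop filter sees "The" before it has been lowercased, so the stop word slips through.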

Querying using Analyzer

Let us now put the Analyzer we created above to actual use. To do so, we run a match query against the field. A match query automatically applies the Analyzer of the field to the query text before searching. As an example, suppose you index the following JSON:

{
  "text": "NITIN LIKES PLAYING IN THE RAIN"
}
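
Before running a search, it can help to look at the tokens the analyzer produces for this text. Elasticsearch exposes an _analyze endpoint for that purpose. The Python sketch below only builds such a request using the standard library; it assumes an Elasticsearch version that accepts the analyzer and text as a JSON body (older releases took them as query parameters), and the helper name build_analyze_request is made up for this example.

```python
import json
import urllib.request

def build_analyze_request(base_url, index, analyzer, text):
    # Hypothetical helper: prepares, but does not send, an _analyze request.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_analyze_request(
    "http://127.0.0.1:9200",
    "trymyindex",
    "mycustomized_lowercase_stemmed",
    "NITIN LIKES PLAYING IN THE RAIN",
)
print(request.get_method(), request.full_url)
# POST http://127.0.0.1:9200/trymyindex/_analyze

# To send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```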

The following query can be created to test our Analyzer:

curl -XGET 'http://127.0.0.1:9200/trymyindex/test/_search' -d '
{
  "query": {
    "match": {
      "text": "play"
    }
  }
}'

Running this query returns the document that we indexed earlier. Without the Custom Analyzer it would not have been returned: the lowercase and stemmer filters reduce both the indexed word “PLAYING” and the query term “play” to the same token, so the two match. The output looks like this:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5,
    "hits": [
      {
        "_index": "trymyindex",
        "_type": "test",
        "_id": "AUtgVQpqJh3uf--po6Ij",
        "_score": 0.5,
        "_source": {
          "text": "NITIN LIKES PLAYING IN THE RAIN"
        }
      }
    ]
  }
}
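
Why the query for "play" finds this document can be sketched in plain Python. The stemmer below is a toy that merely strips a couple of suffixes; it is an assumption made for illustration and is not the algorithm behind the "english" stemmer. The idea it demonstrates is the same, though: the indexed text and the query text go through the same analysis, and a match happens when they share a token.

```python
def toy_stem(token):
    # Toy suffix stripping, NOT the real "english" stemmer.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Whitespace split, lowercase, then stem.
    return [toy_stem(token.lower()) for token in text.split()]

indexed_tokens = analyze("NITIN LIKES PLAYING IN THE RAIN")
query_tokens = analyze("play")
print(indexed_tokens)  # ['nitin', 'like', 'play', 'in', 'the', 'rain']
print(set(query_tokens) & set(indexed_tokens))  # {'play'}

# Without stemming, the lowercased text contains "playing" but not "play".
unstemmed = [token.lower() for token in "NITIN LIKES PLAYING IN THE RAIN".split()]
print("play" in unstemmed)  # False
```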

Conclusion

In this article, we looked at what an Elasticsearch Analyzer is and at the different types of analyzers that are available. We also went through the list of tokenizers, saw how to define a Custom Analyzer, and used it in a query.

Last updated: 03 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
