With so many advanced tools at our disposal, there is always a need to understand and evaluate their features. Let us do exactly that for the Elasticsearch Custom Analyzer. To be precise, the analyzer is an important and essential tool in relevance engineering.
The following topics will be covered in this Elasticsearch Custom Analyzer article: the building blocks of an analyzer, the built-in analyzers and tokenizers, creating a custom analyzer, and putting it to use in queries.
An Elasticsearch analyzer is essentially a combination of three lower-level building blocks: Character Filters, Tokenizers, and Token Filters. The built-in analyzers package these blocks into analyzers suited to different languages and types of text input, and the blocks can also be combined individually to build a custom Elasticsearch analyzer.
An Elasticsearch Analyzer comprises the following:
A CharFilter is a pre-processing step that runs on the input data before it is passed to the Tokenizer component of an Analyzer. A Tokenizer is the component that splits the input data into a stream of tokens.
TokenFilters accept these tokens from the Tokenizer and operate on them: modifying them, adding more tokens, or deleting them. Stemming, adding synonyms, and removing stop words are typical examples of what TokenFilters do. Elasticsearch ships with a large number of Tokenizers and TokenFilters, and on top of that it is possible to create custom ones as well.
The workflow of the whole analysis process is a simple pipeline: CharFilters first, then the Tokenizer, then the TokenFilters, exactly as discussed above.
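To make the pipeline concrete, here is a rough sketch in plain Python (not Elasticsearch code): a crude tag-stripper stands in for a char filter such as html_strip, a word-boundary regex stands in for the tokenizer, and a lowercase step stands in for a token filter.

```python
import re

def html_strip_char_filter(text):
    # CharFilter: pre-processes the raw input before tokenization
    # (a crude tag-stripper standing in for Elasticsearch's html_strip).
    return re.sub(r"<[^>]*>", " ", text)

def word_tokenizer(text):
    # Tokenizer: splits the filtered text into a stream of tokens.
    return re.findall(r"\w+", text)

def lowercase_token_filter(tokens):
    # TokenFilter: transforms the tokens emitted by the tokenizer.
    return [t.lower() for t in tokens]

def analyze(text):
    # CharFilter -> Tokenizer -> TokenFilter, in that order.
    return lowercase_token_filter(word_tokenizer(html_strip_char_filter(text)))

print(analyze("<b>Hello</b> World"))  # ['hello', 'world']
```

The real components are richer, of course, but the ordering of the three stages is exactly what Elasticsearch applies.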
Let us discuss the various Elasticsearch analyzers in a little more detail. All of these analyzers are already defined in the analysis module with their logical names, and they can further be referenced either in the APIs or in mapping definitions.
The Standard Analyzer has an empty default value for the stopwords parameter and 255 as the default value for the max_token_length setting. If needed, these parameters can be set to values other than the defaults.
Simple Analyzer is the one which has the lowercase tokenizer configured by default.
A Whitespace Analyzer is the one which has the whitespace tokenizer configured by default.
For the Stop Analyzer, the stopwords and stopwords_path parameters can be configured. By default, stopwords is set to the English stop word list, while stopwords_path, if configured, points to a file containing the stop words.
A Keyword Analyzer emits an entire stream of input as a single token and is usually used for fields such as zip codes.
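The contrast with a word-splitting analyzer is easy to see in a small Python sketch (a conceptual approximation, not Elasticsearch code), using a zip-code-like value:

```python
import re

def keyword_analyze(text):
    # Keyword analyzer: the entire input becomes one single token.
    return [text]

def standard_like_analyze(text):
    # A rough stand-in for a word-splitting analyzer, for contrast.
    return [t.lower() for t in re.findall(r"\w+", text)]

zipcode = "10023-3124"
print(keyword_analyze(zipcode))        # ['10023-3124']
print(standard_like_analyze(zipcode))  # ['10023', '3124']
```

Keeping the whole value intact is why keyword analysis suits identifiers that must match exactly.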
A Pattern Analyzer is the one which deals with patterns and has settings like lowercase, flags, pattern and stopwords which can be configured.
A Language Analyzer deals with the languages that could be used, like Hindi, English, etc.
A Snowball Analyzer uses the standard tokenizer and a standard filter in conjunction with the snowball filter, stop filter, and the lowercase filter.
The Custom Analyzer is the option to create analyzers of our own choice that meet our requirements. These are created with a Tokenizer and an optional set of CharFilters and TokenFilters. In addition, settings like tokenizer, char_filter, filter, and position_increment_gap can be configured for this kind of analyzer.
In this section of the article, let us try to understand what a Tokenizer is and how it fits into this picture. A Tokenizer, in general, is the component that generates tokens from the text in Elasticsearch. Tokens are created based on the parameters that are passed to these Tokenizers. Elasticsearch has plenty of Tokenizer options available, which can further be used in creating our own Custom Analyzers.
With the understanding of what an Elasticsearch Tokenizer is from the section above, let us now discuss the various Elasticsearch Tokenizers available for ready use. The following is the list of Tokenizers that we are going to discuss in detail:
The Standard Tokenizer is the default tokenizer; it splits the input and generates tokens based on grammar rules. The max_token_length parameter can be set for this kind of tokenizer.
The Edge NGram Tokenizer comes with parameters like the min_gram, token_chars and max_gram which can be configured.
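Conceptually, edge n-grams are the prefixes of a token whose lengths fall between min_gram and max_gram. A minimal Python sketch (an approximation of the idea, not Elasticsearch code):

```python
def edge_ngrams(token, min_gram=1, max_gram=3):
    # Emit prefixes of the token whose length lies in [min_gram, max_gram].
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("search", min_gram=2, max_gram=4))  # ['se', 'sea', 'sear']
```

Edge n-grams are commonly used for search-as-you-type prefix matching.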
The Keyword Tokenizer emits the entire input as a single token and comes with a configurable buffer_size parameter.
The Letter Tokenizer emits whole words, splitting the input whenever it encounters a non-letter character.
The Lowercase Tokenizer works in the same manner as that of the Letter Tokenizer, with the only change that the tokens which are generated out of this Tokenizer are further converted to lowercase.
The NGram Tokenizer comes with configurable parameters like min_gram, token_chars, and max_gram. The default values for these parameters are 1 for min_gram and 2 for max_gram.
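Unlike edge n-grams, plain n-grams are taken from every position in the token, not just the start. A short Python sketch of the idea (not Elasticsearch code), using the defaults of min_gram=1 and max_gram=2:

```python
def ngrams(token, min_gram=1, max_gram=2):
    # Emit every substring whose length lies in [min_gram, max_gram],
    # starting from every position in the token.
    out = []
    for n in range(min_gram, max_gram + 1):
        out.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return out

print(ngrams("abc"))  # ['a', 'b', 'c', 'ab', 'bc']
```

This explosion of substrings is what makes n-gram indices useful for partial and fuzzy matching, at the cost of index size.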
The Whitespace Tokenizer splits the input string into tokens based on whitespace as the token separator.
The Pattern Tokenizer makes use of patterns, or regular expressions, as token separators. Parameters like flags, group, and pattern can be configured with this tokenizer.
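The key point is that the pattern describes the separator, not the token. A small Python sketch of that idea (a conceptual stand-in, not Elasticsearch code), using a non-word-character pattern similar to the tokenizer's default:

```python
import re

def pattern_tokenize(text, pattern=r"\W+"):
    # The regular expression is treated as a token *separator*;
    # empty strings from leading/trailing separators are dropped.
    return [t for t in re.split(pattern, text) if t]

print(pattern_tokenize("comma,separated,values"))  # ['comma', 'separated', 'values']
print(pattern_tokenize("a-b_c", pattern=r"-"))     # ['a', 'b_c']
```

Changing the pattern changes where the splits happen, which is what the pattern setting controls.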
The UAX Email URL Tokenizer behaves exactly like the Standard Tokenizer, with the only difference being that it treats emails and URLs as single tokens.
The Path Hierarchy Tokenizer generates all possible paths that are present in the provided input path. There are some parameters for this tokenizer like replacement, delimiter, buffer_size, skip, and reverse. The default values for these parameters are 1024 for buffer_size, false for reverse, 0 for skip and / for delimiter.
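The effect is easiest to see with an example: for an input path, the tokenizer emits the path itself plus every ancestor path. A minimal Python sketch of that behaviour (an approximation, not Elasticsearch code), using the default "/" delimiter:

```python
def path_hierarchy(path, delimiter="/"):
    # Emit every ancestor path plus the path itself,
    # e.g. /usr/local/bin -> /usr, /usr/local, /usr/local/bin.
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy("/usr/local/bin"))  # ['/usr', '/usr/local', '/usr/local/bin']
```

Indexing all the ancestors is what lets a query for a parent directory match documents filed under any of its children.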
The Classic Tokenizer is a grammar-based tokenizer, working much like the Standard Tokenizer, and it supports the same max_token_length parameter.
The Thai Tokenizer is used for the Thai language and uses the default Thai segmentation algorithm.
To use the Tokenizer and TokenFilter of your choice, an Analyzer needs to be created in your index settings; it is then referenced in your mapping. Consider a sample Analyzer that splits the provided input in a standardized manner, and on top of that applies stemming and a lowercase filter. Let us now take a look at one such sample Analyzer, shall we?
curl -X POST http://127.0.0.1:9200/trymyindex/ -d'
{
"settings": {
"analysis": {
"filter": {
"mycustomized_english_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"mycustomized_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"lowercase",
"mycustomized_english_stemmer"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "mycustomized_lowercase_stemmed"
}
}
}
}
}'
This might be a lot of detail to digest at first glance, so let us try to understand the above piece of code, or mapping, part by part. In the very first piece of the mapping, you see two specific keywords: “settings” and “mappings”. To put it simply, index settings go under “settings”, and the mappings for all the types are put under the “mappings” attribute. Under the “settings” attribute, various kinds of settings are possible, such as index settings, analysis settings, replica settings, and cache settings.
The analysis JSON includes both the analyzers and the filters being defined. For the customized Analyzer we are building here, we need to define the filters we use along with their mandatory options. In our case, the stemming filter requires a language to be specified, hence the “mycustomized_english_stemmer” section.
{
"mycustomized_english_stemmer": {
"type": "stemmer",
"name": "english"
}
}
With the filter defined, the analyzer that uses it can be defined in turn. The analyzer is named “mycustomized_lowercase_stemmed”; it uses the standard tokenizer together with a list of filters. In our example, we use the built-in Elasticsearch filter named “lowercase”, which doesn’t require any extra configuration, along with the customized filter we created above. This is how our Analyzer JSON looks:
{
"analyzer": {
"mycustomized_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"lowercase",
"mycustomized_english_stemmer"
]
}
}
}
One thing to keep in mind is the order of the filters we list, as this is also the order in which the tokens are processed. With all these details in place, we now move to the mappings section, where we put everything together, as shown below:
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "mycustomized_lowercase_stemmed"
}
}
}
}
}
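Conceptually, the whole “mycustomized_lowercase_stemmed” chain can be approximated in plain Python. The sketch below is only an illustration: the naive suffix-stripper is a crude stand-in for the real English stemmer filter, which uses a proper stemming algorithm.

```python
import re

def naive_english_stem(token):
    # Crude suffix-stripping stand-in for the real stemmer filter.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def custom_lowercase_stemmed(text):
    # Standard-ish tokenization, then the filters in declared order:
    # lowercase first, then the stemmer.
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    return [naive_english_stem(t) for t in tokens]

print(custom_lowercase_stemmed("NITIN LIKES PLAYING IN THE RAIN"))
# ['nitin', 'like', 'play', 'in', 'the', 'rain']
```

Note how “PLAYING” ends up indexed as “play”, which is exactly why a search for “play” will match this document later on.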
With the Analyzer created in the section above, let us now check how it can be put to actual use. To use it, we create a match query; match queries apply the field's analyzer automatically before querying. As an example, suppose you index the following JSON:
{
"text": "NITIN LIKES PLAYING IN THE RAIN"
}
The following query can be created to test our Analyzer:
curl -XGET 'http://127.0.0.1:9200/trymyindex/test/_search' -d '
{
"query": {
"match": {
"text": "play"
}
}
}'
Running the query above returns the document that was indexed earlier. It would not have been returned as a result had it not been for the Custom Analyzer we created. Take a look at the final output:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5,
"hits": [
{
"_index": "trymyindex",
"_type": "test",
"_id": "AUtgVQpqJh3uf--po6Ij",
"_score": 0.5,
"_source": {
"text": "nitin likes playing"
}
}
]
}
}
Conclusion
In this article, we covered the Elasticsearch Analyzer and the different types of built-in analyzers available. We also went through the list of tokenizers, the creation of a custom analyzer, and querying with it. Hopefully you now have a clear picture of how analysis works in Elasticsearch.
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.