Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way as we remove the difference between lowercase and uppercase.
The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as the same terms are produced at index time and at search time, the search will just work.
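If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming. Understemming is the failure to reduce words with the same meaning to the same root, so relevant documents are missed. Overstemming is the failure to keep words with different meanings apart, so irrelevant documents are returned.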
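To make the two failure modes concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for this illustration and is not the algorithm that Elasticsearch uses; the `toy_stem` function and its suffix list are assumptions made up for the example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the stemmer
# used by Elasticsearch's english analyzer.
SUFFIXES = ("ing", "es", "s")  # hypothetical rule list, checked in order


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# The same function is applied at index time and at search time.
indexed_terms = {toy_stem(w) for w in ["Foxes", "jumping"]}
print(toy_stem("fox") in indexed_terms)    # True: "fox" finds "Foxes"
print(toy_stem("jumps") in indexed_terms)  # True: "jumps" finds "jumping"

# Understemming: related words keep different stems.
print(toy_stem("mice"), toy_stem("mouse"))  # mice mouse

# Overstemming: unrelated words collapse to the same stem.
print(toy_stem("news"), toy_stem("new"))    # new new
```

Real stemmers use far larger rule sets or dictionaries, but they face the same trade-off: stripping more aggressively reduces understemming at the cost of more overstemming.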
Let’s return to the movies index and look at the search request below:
Searching for the word ‘assassinations’ in the title field.
POST movies/_search
{
  "query": {
    "query_string": {
      "fields": ["title"],
      "query": "assassinations"
    }
  }
}
The above request doesn’t produce any hits. That’s hardly surprising, as there’s no document in the index with the word “assassinations” in the title field. However, there is a document with the singular form, “assassination”. Sometimes we want words with similar meanings to be treated as equivalent in order to improve free text search. To achieve that, we need stemming: reducing words to their “root” form.
For our title field, we can do this by mapping it to be analyzed with the english analyzer. The english analyzer is one of many language analyzers that are predefined in Elasticsearch. These analyzers optimize for search in a given language by removing stop words (such as “and” and “or”) and by stemming. Below is an example request for creating the movies index with the title field mapped with the english analyzer:
Creating the movies index with the title field mapped to use the english analyzer.
PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
Given that we recreate the movies index with the above request and re-index the movies, we’ll now get a hit when searching for “assassinations” in the title field.
However, if we now try to search for the word “the”, we won’t get any hits. That’s because the English analyzer has treated that word as a stop word, effectively not indexing it as part of the title field. Also, we have made it impossible for ourselves to search in a way that doesn’t utilize stemming.
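The effect of stop-word removal can be sketched with a toy analyzer. Everything here is an assumption for illustration: the three-word stop list is only a small subset, the plural stripping is a crude stand-in for real stemming, and the title is hypothetical rather than taken from the movies index.

```python
# Toy stand-in for a language analyzer: lowercase, drop stop words,
# then strip a trailing "s". Illustrative only.
STOP_WORDS = {"the", "and", "or"}  # small illustrative subset


def toy_english_analyze(text: str) -> list[str]:
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Crude plural stripping stands in for real stemming.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]


def matches(query: str, indexed_terms: list[str]) -> bool:
    """A query matches if any of its analyzed terms was indexed."""
    return any(term in indexed_terms for term in toy_english_analyze(query))


# Hypothetical title, analyzed once at index time.
title_terms = toy_english_analyze("The Assassination Plot")
print(title_terms)                             # ['assassination', 'plot']

print(matches("assassinations", title_terms))  # True: plural is stemmed
print(matches("the", title_terms))             # False: "the" was never indexed
```

The query “the” is reduced to an empty list of terms, and the indexed title no longer contains that word either, so there is nothing left to match.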
While this may be fine in some cases, at other times we need the flexibility to search the title field sometimes with the english analyzer and sometimes with the standard analyzer. Also, in the example request for creating the movies index, we left out the dynamic template that mapped all strings so that they have an .original field.
In order to give us greater flexibility and to restore the .original field, we can use dynamic mappings to map all string fields to have both the .original field and a field that is indexed using the english analyzer. Here’s how we can create the movies index with such mappings:
Creating the movies index with a template for strings.
curl -XPUT "https://localhost:9200/movies" -d'
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "string",
              "fields": {
                "original": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "english": {
                  "type": "string",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      ]
    }
  }
}'
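Using the above mappings, every string field in documents in the movies index will be indexed three times, and we’ll be able to search each one under its own name as well as under its .original and .english sub-fields (for example title, title.original and title.english). In the index, the movie with ID 4, Apocalypse Now, will have a separate set of values in each of these three fields.
[Figure not recovered: the values indexed in these fields for Apocalypse Now.]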
If you want, you can use the term vectors API to verify this yourself, like this:
curl -XGET "https://localhost:9200/movies/movie/4/_termvector?fields=title*"
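With the multi-field mapping in place, choosing between stemmed and unstemmed search comes down to which field name the query targets. The sketch below only builds request bodies in the same shape as the earlier search request; it does not contact a cluster, and the `title_query` helper is a hypothetical name introduced for this example.

```python
import json


def title_query(text, variant=None):
    """Build a query_string body for title or one of its sub-fields.

    variant is None for the plain title field, or a sub-field name
    such as "english" or "original" (names taken from the mapping).
    """
    field = "title" if variant is None else f"title.{variant}"
    return {"query": {"query_string": {"fields": [field], "query": text}}}


# Stemmed search: targets the sub-field indexed with the english analyzer.
stemmed = title_query("assassinations", "english")

# Unstemmed search: targets the plain title field.
unstemmed = title_query("assassinations")

print(json.dumps(stemmed, indent=2))
print(json.dumps(unstemmed, indent=2))
```

Either body can be sent to movies/_search in the same way as the first request in this article.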
Yamuna Karumuri is a content writer at Mindmajix.com. Her passion lies in writing articles on IT platforms including Machine learning, PowerShell, DevOps, Data Science, Artificial Intelligence, Selenium, MSBI, and so on. You can connect with her via LinkedIn.