Learn Elasticsearch Stemming with Example

 

Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way as we remove the difference between lowercase and uppercase.

The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as the same terms are produced at index time and at search time, the search will just work.
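To make the index-time and search-time symmetry concrete, here is a minimal sketch. The `toy_stem` function is a hypothetical suffix stripper written only for this illustration; it is not the algorithm Elasticsearch uses, and its roots will differ from those of a real stemmer.

```python
def toy_stem(word: str) -> str:
    """Very rough suffix stripper, for illustration only (not a real stemmer)."""
    word = word.lower()
    # Try longer suffixes first so that "es" is checked before "s".
    for suffix in ("iness", "ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> list[str]:
    """Split on whitespace and stem every token."""
    return [toy_stem(token) for token in text.split()]


# The same analysis runs on the document and on the query.
indexed_terms = analyze("Foxes jumping")  # terms stored at index time
query_terms = analyze("fox jumps")        # terms looked up at search time

print(indexed_terms)  # ['fox', 'jump']
print(query_terms)    # ['fox', 'jump']
```

Because both sides pass through the same function, the query terms line up with the indexed terms even though neither text contains the other's exact words.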

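If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming. Understemming is the failure to reduce words with the same meaning to the same root, and overstemming is the failure to keep words with distinct meanings apart.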
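The two failure modes can be sketched with a pair of deliberately bad stemmers. Both functions below are hypothetical and exist only for this example: one is too timid, the other too eager.

```python
def light_stem(word: str) -> str:
    """Strips only a trailing "s": too timid for most inflections."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def aggressive_stem(word: str) -> str:
    """Keeps only the first five letters: far too eager."""
    return word[:5]


# Understemming: related words end up with different roots,
# so a search for one form misses the other.
print(light_stem("jumped"), light_stem("jumping"))  # jumped jumping

# Overstemming: unrelated words collapse to the same root,
# so a search for one also returns the other.
print(aggressive_stem("general"), aggressive_stem("generous"))  # gener gener
```

Real stemmers sit somewhere between these extremes, which is one reason several implementations exist.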


Elasticsearch Stemming

Let’s return to the movies index and look at the search request below:

Searching for the word ‘assassinations’ in the title field.

POST movies/_search
{
  "query": {
    "query_string": {
      "fields": ["title"],
      "query": "assassinations"
    }
  }
}

The above request doesn’t produce any hits. That’s hardly surprising, as there’s no document in the index with the word “assassinations” in the title field. However, there is a document with the singular form, “assassination”. Sometimes we want words with similar meanings to be treated as equivalent in order to improve free text search. To do so, we need to use stemming, which reduces words to their “root” form.
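The miss can be reproduced with a toy inverted index. Everything in this sketch is an assumption made for illustration: the sample titles are hypothetical, and the plural stripper is a crude stand-in for a real stemmer.

```python
from collections import defaultdict

# Hypothetical sample data; the real movies index is not reproduced here.
DOCS = {1: "The Assassination of a Senator", 4: "Apocalypse Now"}


def plain_analyze(text):
    """Lowercase and split on whitespace, with no stemming."""
    return text.lower().split()


def stemming_analyze(text):
    """Same as plain_analyze, plus a crude plural stripper (illustration only)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in plain_analyze(text)]


def search(query, analyze):
    """Index DOCS with `analyze`, then look up the analyzed query terms."""
    index = defaultdict(set)
    for doc_id, title in DOCS.items():
        for term in analyze(title):
            index[term].add(doc_id)
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


print(search("assassinations", plain_analyze))     # []
print(search("assassinations", stemming_analyze))  # [1]
```

Without stemming, the term "assassinations" simply does not exist in the index; with it, the plural query and the singular title reduce to the same term.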

For our title field, we can do this by mapping it to be analyzed with the english analyzer, one of many language analyzers that are predefined in Elasticsearch. These analyzers optimize for search in a given language by removing stop words (such as “and” and “or”) and by stemming. Below is an example request for creating the movies index with the title field mapped with the english analyzer:


Creating the movies index with the title field mapped to use the english analyzer.

PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
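The request above uses the mapping syntax of older Elasticsearch releases: a named mapping type (movie) and the string field type. As a sketch only, and assuming Elasticsearch 7 or later, where mapping types are gone and analyzed strings use the text type, the equivalent body could be built like this. The variable name is hypothetical.

```python
import json

# Assumption: Elasticsearch 7+ (no mapping type level, "text" instead of "string").
movies_index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english",
            }
        }
    }
}

# Print the request in the same console style used in this article.
print("PUT /movies")
print(json.dumps(movies_index_body, indent=2))
```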

If we recreate the movies index with the above request and re-index the movies, a search for “assassinations” in the title field will now return a hit.

However, if we now try to search for the word “the”, we won’t get any hits. That’s because the English analyzer has treated that word as a stop word, effectively not indexing it as part of the title field. Also, we have made it impossible for ourselves to search in a way that doesn’t utilize stemming.
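The effect of stop word removal can be sketched in a few lines. The stop list here is a small sample chosen for the example, not the full list used by the english analyzer, and the sample title is hypothetical.

```python
# A small sample of English stop words (assumption: not the complete list).
STOP_WORDS = {"a", "and", "of", "or", "the"}


def analyze_with_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


# Stop words never reach the index...
print(analyze_with_stop_words("The Assassination of a Senator"))
# ['assassination', 'senator']

# ...and a query made only of stop words analyzes to nothing at all.
print(analyze_with_stop_words("the"))  # []
```

A query that analyzes to an empty list has no terms left to look up, which is why searching for “the” finds nothing.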

While this may be fine in some cases, at other times we need the flexibility to search the title field sometimes with the english analyzer and sometimes with the standard analyzer. Also, in the example request for creating the movies index, we left out the dynamic template that mapped all strings so that they have a .original field.

To gain this flexibility and to restore the .original field, we can use dynamic mappings to give every string field both a .original field and a field that is indexed using the english analyzer. Here’s how we can create the movies index with such mappings:

Creating the movies index with a template for strings.

curl -XPUT "https://localhost:9200/movies" -d'
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "string",
              "fields": {
                "original": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "english": {
                  "type": "string",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      ]
    }
  }
}'
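The `_default_` mapping, the string type, and not_analyzed all belong to older Elasticsearch releases. As a hedged sketch, assuming Elasticsearch 7 or later, the same idea could be expressed with text for analyzed fields and keyword for the untouched original. The variable name is hypothetical.

```python
import json

# Assumption: Elasticsearch 7+, where dynamic_templates sits directly under
# "mappings", "text" replaces analyzed "string", and "keyword" replaces
# a "not_analyzed" string.
movies_template_body = {
    "mappings": {
        "dynamic_templates": [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text",  # default (standard) analyzer
                        "fields": {
                            "original": {"type": "keyword"},
                            "english": {"type": "text", "analyzer": "english"},
                        },
                    },
                }
            }
        ]
    }
}

print("PUT /movies")
print(json.dumps(movies_template_body, indent=2))
```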


Using the above mappings, every string field in documents in the movies index is indexed three times, and we can search each one through the field name itself, through .original, and through .english. In the index, the movie with ID 4, Apocalypse Now, will have the following values in these fields:

[Table with columns Field and Value; the rows are not preserved.]

If you want, you can use the term vectors API to verify this yourself, like this:

curl -XGET "https://localhost:9200/movies/movie/4/_termvector?fields=title*"
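With the three variants indexed, a search can choose its behavior simply by naming a different field. The helper below is a hypothetical convenience that builds the same query_string body used earlier in this article; it only constructs the request and does not send it.

```python
import json


def title_query(text, subfield=None):
    """Build a query_string search on title, or on one of its sub-fields."""
    field = "title" if subfield is None else f"title.{subfield}"
    return {"query": {"query_string": {"fields": [field], "query": text}}}


# Standard analysis: no stemming, so the plural form is looked up as-is.
standard_search = title_query("assassinations")

# English analysis: the query is stemmed the same way the field was indexed.
english_search = title_query("assassinations", "english")

for body in (standard_search, english_search):
    print("POST movies/_search")
    print(json.dumps(body, indent=2))
```

Of the two requests, the one against title.english is the one expected to match a title containing the singular “assassination”.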

Last updated: 03 Apr 2023
About Author

Yamuna Karumuri is a content writer at Mindmajix.com. Her passion lies in writing articles on IT platforms including Machine learning, PowerShell, DevOps, Data Science, Artificial Intelligence, Selenium, MSBI, and so on. You can connect with her via  LinkedIn.
