Mindmajix

Learn Elasticsearch Stemming with Example

Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way as we remove the difference between lowercase and uppercase.

The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as the same terms are produced at index time and at search time, the search will just work.

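If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming. Understemming is the failure to reduce words with the same meaning to the same root, which leaves relevant documents unmatched. Overstemming is the failure to keep words with distinct meanings separate, which returns irrelevant documents.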
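To make the two failure modes concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only: the suffix list and the `toy_stem` name are assumptions, and it is not the algorithm used by any Elasticsearch analyzer.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the algorithm
# behind Elasticsearch's language analyzers. The suffix list is an assumption.
SUFFIXES = ("ations", "ation", "ing", "es", "s")


def toy_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for pair in (("foxes", "fox"), ("jumped", "jumps"), ("caring", "cars")):
        print(pair, "->", tuple(toy_stem(w) for w in pair))
```

With these rules, foxes and fox both reduce to fox, and assassinations and assassination both reduce to assassin. The toy has no rule for "ed", so jumped and jumps end up as different terms (understemming), while caring and cars both collapse to car (overstemming).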

Elasticsearch Stemming

Let’s return to the movies index and look at the search request below:

Searching for the word ‘assassinations’ in the title field.

POST movies/_search
{
  "query": {
    "query_string": {
      "fields": ["title"],
      "query": "assassinations"
    }
  }
}

The above request doesn’t produce any hits. That’s hardly surprising as there’s no document in the index with the word “assassinations” in the title field. However, there is a document with the singular form of it, “assassination”. Sometimes we want words that have similar semantic interpretations to be considered as equivalent in order to improve free text search. To do so, we need to use stemming, reducing words to their “root” form.

To achieve this for our title field, we can map it to be analyzed with the english analyzer, which stems English text. The English analyzer is one of many language analyzers that are predefined in Elasticsearch. These analyzers optimize for search in a given language by removing stop words (such as “and” and “or”) and by stemming. Below is an example request for creating the movies index with the title field mapped with the English analyzer:

Creating the movies index with the title field mapped to use the english analyzer.

PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}

If we recreate the movies index with the above request and re-index the movies, we’ll now get a hit when searching for “assassinations” in the title field.

However, if we now try to search for the word “the”, we won’t get any hits. That’s because the English analyzer has treated that word as a stop word, effectively not indexing it as part of the title field. Also, we have made it impossible for ourselves to search in a way that doesn’t utilize stemming.
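One way to see what an analyzer does to a piece of text, including which stop words it drops and how it stems the rest, is to ask the cluster's analyze API. The sketch below only builds the request; the helper names are hypothetical, the address is the local one used elsewhere in this article, and it assumes an Elasticsearch version whose _analyze endpoint accepts a JSON body with "analyzer" and "text" keys.

```python
import json
import urllib.request

# Assumption: a local cluster at the address used in this article's curl examples.
ES_URL = "http://localhost:9200"


def build_analyze_request(text, analyzer="english", base_url=ES_URL):
    """Build (but do not send) a request asking how `analyzer` tokenizes `text`."""
    # Assumption: the _analyze endpoint accepts a JSON body with these two keys.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyzed_tokens(text, analyzer="english", base_url=ES_URL):
    """Send the request and return the token strings (needs a running cluster)."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

Calling analyzed_tokens("the assassinations") against a running cluster returns the terms that would actually be indexed for that text, and passing analyzer="standard" shows the unstemmed terms for comparison.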

While this may be fine in some cases, at other times we may need the flexibility to search the title field sometimes with the English analyzer and sometimes with the standard analyzer. Also, in the example request for creating the movies index, we left out the dynamic template that mapped all strings so that they have an .original field.

To gain that flexibility and to restore the .original field, we can use a dynamic template that maps all string fields to have both the .original field and a field that is indexed using the English analyzer. Here’s how we can create the movies index with such mappings:

Creating the movies index with a template for strings.

curl -XPUT "http://localhost:9200/movies" -d'
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "string",
              "fields": {
                "original": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "english": {
                  "type": "string",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      ]
    }
  }
}'

Using the above mappings, all string fields in documents in the movies index will be indexed three times and we’ll be able to search in them using <field name>, <field name>.original and <field name>.english. In the index, the movie with ID 4, Apocalypse Now, will have the following values in these fields:

[Image: elasticsearch stemming]

If you want, you can use the term vectors API to verify this yourself, like this: curl -XGET "http://localhost:9200/movies/movie/4/_termvector?fields=title*".
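With the three variants of each string field in place, choosing between stemmed and unstemmed search comes down to which field name the query targets. The following is a minimal sketch that builds the same query_string request body used earlier in this article for a chosen variant; the `build_search_body` helper and the variant labels are hypothetical names introduced for illustration.

```python
import json

# Sub-field suffixes created by the dynamic template; "" is the field itself.
VARIANTS = {"standard": "", "original": ".original", "english": ".english"}


def build_search_body(field, text, variant="standard"):
    """Return a query_string search body targeting one variant of `field`."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return {
        "query": {
            "query_string": {
                "fields": [field + VARIANTS[variant]],
                "query": text,
            }
        }
    }


if __name__ == "__main__":
    # Stemmed search on title.english, then default-analyzed search on title.
    print(json.dumps(build_search_body("title", "assassinations", "english"), indent=2))
    print(json.dumps(build_search_body("title", "the"), indent=2))
```

Either body can be sent to the movies/_search endpoint. Targeting title.english gives the stemmed, stop-word-filtered behavior, while targeting title keeps words such as “the” searchable.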

