1. Overview 

“Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.”

To quickly explain stemming in the context of Elasticsearch, let’s take an example. Suppose the following documents are indexed in a field called document_text of an Elasticsearch index:

  1. this is testing of our passion
  2. Site has been tested by QA team.
  3. All test cases run successfully.

When stemming is not set up on the document_text field (containing the documents above), an Elasticsearch query searching for the term “test” (essentially a search parameter of q=document_text:test) will return only the 3rd document. If stemming is set up on the document_text field, all or a subset of the 3 documents will be returned in the search result set. How many of these documents are returned with stemming enabled depends on the stemming algorithm being applied.
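
To make the difference concrete, here is a minimal Java sketch that mimics this behaviour outside Elasticsearch. The class name, the toyStem function and the simplified tokenization are assumptions made for this illustration only. toyStem is a crude suffix stripper, not the algorithm Elasticsearch applies, and real stemmers handle far more cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StemmingMatchSketch {

    // The three sample documents from the overview.
    static final List<String> DOCS = List.of(
            "this is testing of our passion",
            "Site has been tested by QA team.",
            "All test cases run successfully.");

    // Hypothetical toy stemmer: strips a few common suffixes.
    // This is NOT the algorithm Elasticsearch uses.
    static String toyStem(String token) {
        for (String suffix : new String[]{"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Lowercases the text and splits it on non-letters (a rough stand-in
    // for a tokenizer plus lowercase filter), optionally stemming each token.
    static List<String> analyze(String text, boolean stem) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(stem ? toyStem(raw) : raw);
        }
        return tokens;
    }

    // Returns the 1-based numbers of the documents containing the analyzed term.
    static List<Integer> search(String term, boolean stem) {
        String queryToken = analyze(term, stem).get(0);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < DOCS.size(); i++) {
            if (analyze(DOCS.get(i), stem).contains(queryToken)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("without stemming: " + search("test", false));
        System.out.println("with toy stemming: " + search("test", true));
    }
}
```

With this toy stemmer, the query for test matches only document 3 when stemming is off and all three documents when it is on, because testing and tested are both reduced to test at index time.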

2. Stemmer Token Filter

The stemmer token filter is used to enable stemming in analyzers. Stemming algorithms are available for many languages. Refer to the Stemmer Token Filter documentation for more details.

We need to specify the stemmer name along with type=stemmer in our filter definition. In the example below, we create an English stemmer called english_stemmer.

2.1 Example

PUT /stemminganalyzertest
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "englishstemmeranalyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "english_stemmer"]
                }
            },
            "filter" : {
                "english_stemmer" : {
                    "type" : "stemmer",
                    "name" : "english"
                }
            }
        }
    }
}

2.2 Test Analyzer

Now we will use the Analyze API to check our custom English stemming analyzer.

POST stemminganalyzertest/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text":     "This is testing of our passion"
}

Elasticsearch analyzes the given text with our custom analyzer and produces the tokens below. Note that testing is reduced to test.

{
  "tokens" : [
    {
      "token" : "thi",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "of",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "our",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "passion",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

4. Prevent Stemming

Elasticsearch provides a way to prevent stemming for specified terms.

4.1 Keyword Marker Token Filter

The keyword marker filter is used to protect terms from being modified by stemmers. It must be placed before any stemmer in the filter chain.

We can set the three settings below on the keyword marker token filter.

4.1.1 keywords

We can provide a list of words under this setting that will not be modified by the stemmer.

4.1.2 keywords_path

We can also set the path to a keyword file. The path should be either relative to the config folder or absolute.

4.1.3 ignore_case

Set to true to lower case all words first. Defaults to false.
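
The idea behind these settings can be sketched in plain Java, outside Elasticsearch. Everything here is an assumption made for illustration: the class name, the stemExcept helper and the toyStem suffix stripper are hypothetical, and the ignoreCase flag only approximates the spirit of ignore_case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class KeywordMarkerSketch {

    // Hypothetical toy stemmer for illustration only: strips a trailing "ing".
    static String toyStem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        return token;
    }

    // Tokens listed in `keywords` bypass the stemmer; all others are stemmed.
    // `ignoreCase` makes the keyword lookup case-insensitive.
    static List<String> stemExcept(List<String> tokens, Set<String> keywords, boolean ignoreCase) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            boolean isKeyword = ignoreCase
                    ? keywords.stream().anyMatch(k -> k.equalsIgnoreCase(token))
                    : keywords.contains(token);
            out.add(isKeyword ? token : toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("this", "is", "testing", "of", "our", "passion");
        // No protected words: "testing" is stemmed.
        System.out.println(stemExcept(tokens, Set.of(), false));
        // "testing" is protected and passes through unchanged.
        System.out.println(stemExcept(tokens, Set.of("testing"), false));
    }
}
```

The check happens before the stemmer sees the token, which is why the keyword marker has to come first in the filter chain.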

4.2 Example

PUT /stemminganalyzertest
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "englishstemmeranalyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "no_stem", "english_stemmer"]
                }
            },
            "filter" : {
                "english_stemmer" : {
                    "type" : "stemmer",
                    "name" : "english"
                },
                "no_stem" : {
                    "type" : "keyword_marker",
                    "keywords" : ["testing"]
                }
            }
        }
    }
}

4.3 Test Analyzer

POST stemminganalyzertest/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text":     "This is testing of our passion"
}

Elasticsearch analyzes the input text with our custom analyzer and does not run the stemmer on the term testing.

{
  "tokens" : [
    {
      "token" : "thi",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "testing",
      "start_offset" : 8,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "of",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "our",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "passion",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

5. Customize Stemming

Elasticsearch provides a way to customize the stemming behavior of a particular algorithm.

5.1 Stemmer Override Token Filter

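The stemmer override token filter is used to customize or override a stemming algorithm by applying a custom mapping. Terms rewritten by this filter are protected from later stemmers, so it must be placed before any stemming filter. We need to specify either rules or rules_path to provide the mapping.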
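
As a rough illustration of how such a mapping behaves, the following plain Java sketch parses rules of the form source=>target and applies them ahead of a stemmer. The class, the parseRules and apply helpers and the toyStem function are hypothetical names invented for this example, and the parser only handles one source word per rule. It is a sketch of the concept, not Elasticsearch's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StemmerOverrideSketch {

    // Parses rules written as "source=>target" into a lookup map.
    static Map<String, String> parseRules(List<String> rules) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String rule : rules) {
            String[] parts = rule.split("=>");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad rule: " + rule);
            }
            map.put(parts[0].trim(), parts[1].trim());
        }
        return map;
    }

    // Hypothetical toy stemmer for illustration only: strips a trailing "s".
    static String toyStem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Overridden tokens take their mapped value and skip the stemmer;
    // every other token goes through the stemmer as usual.
    static List<String> apply(List<String> tokens, Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String mapped = overrides.get(token);
            out.add(mapped != null ? mapped : toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rules = parseRules(List.of("skies=>sky", "mice=>mouse", "feet=>foot"));
        System.out.println(apply(List.of("back", "on", "my", "feet", "again"), rules));
    }
}
```

Irregular plurals such as feet or mice are the typical use case, since suffix-based stemmers cannot derive foot or mouse on their own.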

5.2 Example

PUT /stemmingoverrideanalyzer
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "englishstemmeranalyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "custom_stem", "english_stemmer"]
                }
            },
            "filter" : {
                "english_stemmer" : {
                    "type" : "stemmer",
                    "name" : "english"
                },
                "custom_stem" : {
                    "type" : "stemmer_override",
                    "rules" : [
                        "skies=>sky",
                        "mice=>mouse",
                        "feet=>foot"
                    ]
                }
            }
        }
    }
}

5.3 Test

POST stemmingoverrideanalyzer/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text":     "back on my feet again"
}

As you can see, feet is indexed as foot, as defined by our custom rule.

{
  "tokens" : [
    {
      "token" : "back",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "on",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "my",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "foot",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
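    },
    {
      "token" : "again",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 4
    }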
  ]
}

6. Transport Client Java Example

6.1 XContentBuilder

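// Imports assumed for the Elasticsearch 6.x transport client used in this example.
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

// Creates the index with an English stemmer filter and an analyzer that uses it.
public void createSettingsWithEnglishStemAnalyzer() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    request.settings(Settings.builder()
            // General index-level settings.
            .put("index.max_inner_result_window", 250)
            .put("index.write.wait_for_active_shards", 1)
            .put("index.query.default_field", "paragraph")
            .put("index.number_of_shards", 3)
            .put("index.number_of_replicas", 2)
            // Analysis settings built with XContentBuilder and loaded as JSON.
            .loadFromSource(Strings.toString(jsonBuilder()
                    .startObject()
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("english_stemmer")
                                    .field("type", "stemmer")
                                    .field("name", "english")
                                .endObject()
                            .endObject()
                            .startObject("analyzer")
                                .startObject("EnglishStopWordAnalyzer")
                                    .field("tokenizer", "standard")
                                    .field("filter", new String[]{"lowercase", "english_stemmer"})
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()), XContentType.JSON)
    );
    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created");
    getSettingsWithAnalyzer();
}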

6.2 JSON Source

public void createSettingsWithAnalyzerJSONSource() throws ExecutionException, InterruptedException, IOException {
        CreateIndexRequest request = new CreateIndexRequest(indexName);


        String jsonSource = "{\n" +
                "        \"settings\":{\n" +
                "            \"index\": {\n" +
                "                \"analysis\": {\n" +
                "                    \"analyzer\": {\n" +
                "                        \"EnglishStopWordAnalyzer\": {\n" +
                "                            \"tokenizer\": \"standard\",\n" +
                "                            \"filter\" : [\n" +
                "                                \"lowercase\",\n" +
                "                                \"english_stemmer\"\n" +
                "                            ]\n" +
                "                        }\n" +
                "                    },\n" +
                "                    \"filter\" : {\n" +
                "                        \"english_stemmer\": {\n" +
                "                            \"type\": \"stemmer\",\n" +
                "                            \"name\": \"english\"\n" +
                "                        }\n" +
                "                    }\n" +
                "                }\n" +
                "            }\n" +
                "        }\n" +
                "    }";
        request.source(jsonSource,XContentType.JSON);
        CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
        System.out.println("Index : "+createIndexResponse.index() + " Created.");
        getSettingsWithAnalyzer();
    }

6.3 Output

Index : englishstemsnalyzer Created

***************Get Settings with Analyzers *********************
index.analysis.analyzer.EnglishStopWordAnalyzer.filter : [lowercase, english_stemmer]
index.analysis.analyzer.EnglishStopWordAnalyzer.tokenizer : standard
index.analysis.filter.english_stemmer.name : english
index.analysis.filter.english_stemmer.type : stemmer
index.creation_date : 1551635063876
index.max_inner_result_window : 250
index.number_of_replicas : 2
index.number_of_shards : 3
index.provided_name : stopwordanalyzertesttwo
index.query.default_field : paragraph
index.uuid : 0EabUPI6RaqYD-RUKP8XTQ
index.version.created : 6060099
index.write.wait_for_active_shards : 1

7. Conclusion

In this article, we discussed how to configure stemming in Elasticsearch. A filter of type stemmer applies a language-specific stemming algorithm to the token stream. We covered the stemmer token filter, the keyword marker token filter, and the stemmer override token filter. Lastly, we looked at Java client code examples for the same configuration.

8. Source Code

You can download the source code of the Elasticsearch Java client stemming configuration from our git repository.
