

1. Overview
“Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.”
To quickly explain stemming in the context of Elasticsearch, let’s take an example. Suppose the following documents are stored in a field called document_text in an Elasticsearch index:
- this is testing of our passion
- Site has been tested by QA team.
- All test cases run successfully.
When stemming is not set up on the document_text field (containing the documents above), an Elasticsearch query for the term “test” (essentially a search parameter of q=document_text:test) returns only the 3rd document. If stemming is set up on the document_text field, all or a subset of the 3 documents are returned in the result set. How many of them are returned with stemming enabled depends on the stemming algorithm being applied.
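The effect described above can be sketched in plain Java. This is a toy illustration only: the class name ToyStemmingDemo and its naive suffix-stripping stem method are hypothetical and are not the algorithm Elasticsearch applies. It only shows why matching on stems finds more documents than matching on exact words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy illustration: a naive suffix stripper, NOT the stemmer Elasticsearch uses.
final class ToyStemmingDemo {

    // Strips one of a few common English suffixes, keeping at least 3 characters.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Returns the documents that contain the term, comparing either
    // lowercased words (stemming = false) or their stems (stemming = true).
    static List<String> search(List<String> docs, String term, boolean stemming) {
        String wanted = stemming ? stem(term) : term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            for (String word : doc.split("[^\\p{L}\\p{N}]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                String candidate = stemming ? stem(word) : word.toLowerCase(Locale.ROOT);
                if (candidate.equals(wanted)) {
                    hits.add(doc);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "this is testing of our passion",
                "Site has been tested by QA team.",
                "All test cases run successfully.");
        System.out.println(search(docs, "test", false).size()); // prints 1
        System.out.println(search(docs, "test", true).size());  // prints 3
    }
}
```

With this toy stemmer all three documents match; a real stemming algorithm may reduce words differently, which is why the number of hits depends on the algorithm chosen.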
2. Stemmer Token Filter
The stemmer token filter is used to enable stemming in analyzers. Stemming algorithms are available for many languages. Refer to Stemmer Token Filter for more details.
We need to specify type=stemmer along with the stemmer name in our filter definition. In the example below, we create an English stemmer filter called english_stemmer.
2.1 Example
PUT /stemminganalyzertest
{
  "settings": {
    "analysis": {
      "analyzer": {
        "englishstemmeranalyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
2.2 Test Analyzer
Now we will use the Analyze API to check our custom English stemming analyzer.
POST stemminganalyzertest/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text": "This is testing of our passion"
}
Elasticsearch analyzes the given text with our custom analyzer and tokenizes it as shown below.
{
  "tokens" : [
    { "token" : "thi", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "test", "start_offset" : 8, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "of", "start_offset" : 16, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "our", "start_offset" : 19, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "passion", "start_offset" : 23, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 }
  ]
}
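Note that the stemmed token test still carries the offsets 8 to 15, which span seven characters. The offsets point into the original text, not into the stemmed token. A small sketch, using a hypothetical helper class TokenOffsetDemo, shows how the original word can be recovered from them:

```java
// Illustrative helper: maps token offsets from the Analyze API response
// back to the original surface form in the analyzed text.
final class TokenOffsetDemo {

    // start is inclusive and end is exclusive, like String.substring.
    static String surfaceForm(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "This is testing of our passion";
        System.out.println(surfaceForm(text, 8, 15)); // prints "testing"
        System.out.println(surfaceForm(text, 0, 4));  // prints "This"
    }
}
```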
4. Prevent Stemming
Elasticsearch provides a way to prevent stemming for specified terms.
4.1 Keyword Marker Token Filter
The keyword marker token filter protects terms from being modified by stemmers. It must be placed before any stemmer filter in the filter chain.
We can configure the following three settings on the keyword marker token filter.
4.1.1 keywords
We can provide a list of words under this setting; these words will not be modified by the stemmer.
4.1.2 keywords_path
We can also provide the path to a keywords file. The path must be either relative to the config folder or an absolute path.
4.1.3 ignore_case
Set to true to lower case all words first. Defaults to false.
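To make the role of these settings concrete, here is a simplified model of the filter order lowercase, keyword marker, stemmer. Everything in it is an assumption for illustration: KeywordMarkerDemo, its toy stem method, and the way ignore_case is modelled as a case-insensitive comparison are not Elasticsearch internals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of: lowercase -> keyword_marker -> stemmer.
final class KeywordMarkerDemo {

    // Toy suffix stripper standing in for the real stemmer.
    static String stem(String w) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Decides whether a token is protected; ignoreCase is modelled here
    // as a case-insensitive comparison against the keyword list.
    static boolean isProtected(String token, Set<String> keywords, boolean ignoreCase) {
        if (!ignoreCase) {
            return keywords.contains(token);
        }
        for (String keyword : keywords) {
            if (keyword.equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }

    static List<String> analyze(String text, Set<String> keywords, boolean ignoreCase) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);                       // lowercase filter
            boolean keep = isProtected(token, keywords, ignoreCase);            // keyword marker
            tokens.add(keep ? token : stem(token));                             // stemmer skips keywords
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This is testing of our passion";
        // prints [thi, is, testing, of, our, passion]
        System.out.println(analyze(text, Collections.singleton("testing"), false));
        // prints [thi, is, test, of, our, passion]
        System.out.println(analyze(text, Collections.<String>emptySet(), false));
    }
}
```

The sketch also shows why the order matters: the protection check has to happen before the stemmer sees the token, otherwise there is nothing left to protect.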
4.2 Example
This example reuses the index name from section 2.1. If that index still exists, delete it first, because creating an index that already exists fails.
PUT /stemminganalyzertest
{
  "settings": {
    "analysis": {
      "analyzer": {
        "englishstemmeranalyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "no_stem", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "no_stem": {
          "type": "keyword_marker",
          "keywords": ["testing"]
        }
      }
    }
  }
}
4.3 Test Analyzer
POST stemminganalyzertest/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text": "This is testing of our passion"
}
Elasticsearch analyzes the input text with our custom analyzer and does not run the stemmer on the term testing.
{
  "tokens" : [
    { "token" : "thi", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "testing", "start_offset" : 8, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "of", "start_offset" : 16, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "our", "start_offset" : 19, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "passion", "start_offset" : 23, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 }
  ]
}
5. Customize Stemming
Elasticsearch provides a way to customize the stemming behavior of a particular algorithm.
5.1 Stemmer Override Token Filter
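The stemmer override token filter is used to customize or override a stemming algorithm by applying a custom mapping. The mapped terms are then protected from being modified by stemmers, so this filter must be placed before any stemmer filter. We need to specify either rules or rules_path to provide the mapping.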
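The idea behind the mapping can be sketched as follows. The class StemmerOverrideDemo, its toy stem method, and the rule parser (which only handles a single source word per rule) are hypothetical simplifications, not the filter's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified model of: lowercase -> stemmer_override -> stemmer.
final class StemmerOverrideDemo {

    // Parses rules of the form "source=>target" into a lookup map.
    static Map<String, String> parseRules(List<String> rules) {
        Map<String, String> overrides = new HashMap<>();
        for (String rule : rules) {
            String[] parts = rule.split("=>");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Invalid rule: " + rule);
            }
            overrides.put(parts[0].trim(), parts[1].trim());
        }
        return overrides;
    }

    // Toy suffix stripper standing in for the real stemmer.
    static String stem(String w) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    static List<String> analyze(String text, Map<String, String> overrides) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            String mapped = overrides.get(token);
            // A mapped token is emitted as-is and never reaches the stemmer.
            tokens.add(mapped != null ? mapped : stem(token));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> rules =
                parseRules(Arrays.asList("skies=>sky", "mice=>mouse", "feet=>foot"));
        // prints [back, on, my, foot, again]
        System.out.println(analyze("back on my feet again", rules));
    }
}
```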
5.2 Example
PUT /stemmingoverrideanalyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "englishstemmeranalyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stem", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stem": {
          "type": "stemmer_override",
          "rules": ["skies=>sky", "mice=>mouse", "feet=>foot"]
        }
      }
    }
  }
}
5.3 Test
POST stemmingoverrideanalyzer/_analyze
{
  "analyzer": "englishstemmeranalyzer",
  "text": "back on my feet again"
}
As you can see, feet is indexed as foot by our custom analyzer.
{
  "tokens" : [
    { "token" : "back", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "on", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "my", "start_offset" : 8, "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "foot", "start_offset" : 11, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "again", "start_offset" : 16, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 4 }
  ]
}
6. Transport Client Java Example
6.1 XContentBuilder
// Imports assume the Elasticsearch 6.x transport client.
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

// client (the transport client) and indexName are fields of the enclosing class.
public void createSettingsWithEnglishStemAnalyzer() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    request.settings(Settings.builder()
            .put("index.max_inner_result_window", 250)
            .put("index.write.wait_for_active_shards", 1)
            .put("index.query.default_field", "paragraph")
            .put("index.number_of_shards", 3)
            .put("index.number_of_replicas", 2)
            // Build the analysis settings with XContentBuilder and load them as JSON.
            .loadFromSource(Strings.toString(jsonBuilder()
                    .startObject()
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("english_stemmer")
                                    .field("type", "stemmer")
                                    .field("name", "english")
                                .endObject()
                            .endObject()
                            .startObject("analyzer")
                                .startObject("EnglishStopWordAnalyzer")
                                    .field("tokenizer", "standard")
                                    .field("filter", new String[]{"lowercase", "english_stemmer"})
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()), XContentType.JSON)
    );
    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created");
    getSettingsWithAnalyzer();
}
6.2 JSON Source
public void createSettingsWithAnalyzerJSONSource() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    // The whole index definition is passed as a JSON string.
    String jsonSource = "{\n" +
            "  \"settings\": {\n" +
            "    \"index\": {\n" +
            "      \"analysis\": {\n" +
            "        \"analyzer\": {\n" +
            "          \"EnglishStopWordAnalyzer\": {\n" +
            "            \"tokenizer\": \"standard\",\n" +
            "            \"filter\": [\n" +
            "              \"lowercase\",\n" +
            "              \"english_stemmer\"\n" +
            "            ]\n" +
            "          }\n" +
            "        },\n" +
            "        \"filter\": {\n" +
            "          \"english_stemmer\": {\n" +
            "            \"type\": \"stemmer\",\n" +
            "            \"name\": \"english\"\n" +   // no trailing comma, which would be invalid JSON
            "          }\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
    request.source(jsonSource, XContentType.JSON);
    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created.");
    getSettingsWithAnalyzer();
}
6.3 Output
Index : englishstemsnalyzer Created
***************Get Settings with Analyzers *********************
index.analysis.analyzer.EnglishStopWordAnalyzer.filter : [lowercase, english_stemmer]
index.analysis.analyzer.EnglishStopWordAnalyzer.tokenizer : standard
index.analysis.filter.english_stemmer.name : english
index.analysis.filter.english_stemmer.type : stemmer
index.creation_date : 1551635063876
index.max_inner_result_window : 250
index.number_of_replicas : 2
index.number_of_shards : 3
index.provided_name : stopwordanalyzertesttwo
index.query.default_field : paragraph
index.uuid : 0EabUPI6RaqYD-RUKP8XTQ
index.version.created : 6060099
index.write.wait_for_active_shards : 1
7. Conclusion
In this article, we discussed how to configure stemming in Elasticsearch. A filter of type stemmer applies a language-specific stemming algorithm to the token stream. We covered the stemmer token filter, the keyword marker token filter, and the stemmer override token filter. Lastly, we walked through Elasticsearch Java client code examples for the same configuration.
8. References
Refer to the links below for more details:
- Index API
- JAVA API
- Spring Boot Elastic Search
- GSON-Parse_Large_Json_File
- Elastic-Search-Stemming-Guide
- Elastic-Search-Stemmer-token-filter
9. Source Code
You can download the source code of Elastic Search Java Client Configure Stemming from our git repository.