

1. Overview
Configuring stop words in Elasticsearch is easy. Most written text contains many function words, such as “this”, “that”, or “is”. These words matter to a person reading the content because they help it flow cohesively, but they are rarely important to someone searching the content of your documents or web tutorials.
Stop words are generally handled in one of two ways: either by ignoring them when they appear in the search query, or by removing them at indexing time.
Let’s look at a complete example of configuring stopwords, indexing with them, and how the result behaves.
2. Stop Token Filter
The stop token filter removes stop words from a token stream. It accepts the settings described below.
2.1 stopwords
This setting specifies a list of stopwords. Its default value is _english_.
{
  "filter": {
    "stopwordexample": {
      "type": "stop",
      "stopwords": [ "a", "an", "the" ]
    }
  }
}
2.2 stopwords_path
This setting specifies the path to a stopword file. The file must follow these guidelines:
- The file path is either absolute or relative to the config folder.
- The file must be UTF-8 encoded.
- Each stopword must be on its own line.
a
an
the
their
they
we
our
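The file format described above can be sketched in plain Java. This is only an illustration of the "UTF-8, one stopword per line" rule: the class name StopwordFileSketch and its methods are hypothetical, and a temporary file stands in for a real file under the Elasticsearch config folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Elasticsearch.
class StopwordFileSketch {

    // Writes the stopwords to a temporary file (one per line, UTF-8),
    // reads them back, and returns what was read.
    static List<String> roundTrip(List<String> stopwords) {
        try {
            Path file = Files.createTempFile("stopwords", ".txt");
            Files.write(file, stopwords, StandardCharsets.UTF_8);

            List<String> result = new ArrayList<>();
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {   // skip blank lines
                    result.add(word);
                }
            }
            Files.delete(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "an", "the", "their", "they", "we", "our");
        System.out.println(roundTrip(words)); // prints [a, an, the, their, they, we, our]
    }
}
```

A file written this way could then be referenced from the filter definition through the stopwords_path setting instead of an inline stopwords list.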
2.3 ignore_case
This setting controls case sensitivity. If it is set to true, stopwords are matched regardless of case. The default value is false.
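To make the effect of ignore_case concrete, here is a small plain-Java simulation. It is a sketch of the matching rule only, not Elasticsearch's implementation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation of the ignore_case setting.
class IgnoreCaseSketch {

    // Drops every token that matches a stopword. When ignoreCase is true,
    // both sides are lowercased before comparing.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String s : stopwords) {
            lookup.add(ignoreCase ? s.toLowerCase(Locale.ROOT) : s);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!lookup.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("a", "an", "the"));
        List<String> tokens = Arrays.asList("The", "quick", "fox");

        System.out.println(removeStopwords(tokens, stopwords, false)); // prints [The, quick, fox]
        System.out.println(removeStopwords(tokens, stopwords, true));  // prints [quick, fox]
    }
}
```

With the default (false), the capitalized token "The" does not match the stopword "the" and therefore survives.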
2.4 remove_trailing
remove_trailing is a special setting that decides whether the last term of a query is removed when it is a stopword. The default value is true.
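The following plain-Java sketch models remove_trailing in a simplified way: a stopword in the last position is kept when the flag is false and removed when it is true. The class name is hypothetical, and the real filter may apply further conditions that this model leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the remove_trailing setting.
class RemoveTrailingSketch {

    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> filter(List<String> tokens, boolean removeTrailing) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            boolean isLast = (i == tokens.size() - 1);
            boolean isStopword = STOPWORDS.contains(token);
            // A stopword is dropped unless it is the last token
            // and removeTrailing is false.
            if (isStopword && (removeTrailing || !isLast)) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("search", "for", "the");

        System.out.println(filter(query, true));  // prints [search, for]
        System.out.println(filter(query, false)); // prints [search, for, the]
    }
}
```

Keeping the trailing term can be useful for search-as-you-type input, where the last word may still be incomplete.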
3. Steps to Configure Stopwords
Now we will discuss the steps to configure stopwords in a custom analyzer.
3.1 Create a stop filter
The first step is to define a custom filter of type stop under analysis when creating the index. In the example below, we create a filter called stopwordexample with only three stopwords.
{
  "filter": {
    "stopwordexample": {
      "type": "stop",
      "stopwords": [ "a", "an", "the" ]
    }
  }
}
3.2 Create analyzer and set filters
The next step is to add the custom stop filter to our analyzer's filter chain.
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "englishAnalyzer": {
            "tokenizer": "standard",
            "filter": [ "lowercase", "stopwordexample" ]
          }
        },
        "filter": {
          "stopwordexample": {
            "type": "stop",
            "stopwords": [ "a", "an", "the" ]
          }
        }
      }
    }
  }
}
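Filters in the chain run in the order they are listed, so placing lowercase before stopwordexample matters when ignore_case keeps its default of false. The plain-Java sketch below simulates both orders; it is an illustration with hypothetical names, not Elasticsearch code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simulation of a two-filter analyzer chain.
class FilterOrderSketch {

    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Stand-in for the lowercase token filter.
    static List<String> lowercase(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Stand-in for a case-sensitive stop filter (ignore_case = false).
    static List<String> stop(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Example");

        // lowercase first, then stop: "The" becomes "the" and is removed.
        System.out.println(stop(lowercase(tokens))); // prints [example]

        // stop first, then lowercase: "The" is not matched and survives.
        System.out.println(lowercase(stop(tokens))); // prints [the, example]
    }
}
```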
3.3 Test
After creating the stop filter and the custom analyzer, it’s time to test them using the Elasticsearch Analyze API. Here is the command to check how our custom analyzer tokenizes input text.
POST stopwordexample/_analyze
{
  "analyzer": "englishAnalyzer",
  "text": "This is an example of the english analyzer"
}
If everything works, the analysis output looks like the one below. The stopwords “an” and “the” are removed during analysis, while “this”, “is”, and “of” remain because our custom list contains only “a”, “an”, and “the”.
{
  "tokens" : [
    { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "example", "start_offset" : 11, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "of", "start_offset" : 19, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "english", "start_offset" : 26, "end_offset" : 33, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "analyzer", "start_offset" : 34, "end_offset" : 42, "type" : "<ALPHANUM>", "position" : 7 }
  ]
}
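Note that positions 2 and 5 are missing from the output above: the removed stopwords still consume a position. The plain-Java sketch below reproduces the offsets and positions for this sentence. It assumes a simple word-matching regex as a stand-in for the standard tokenizer, which is adequate for this input only, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of tokenizer -> lowercase -> stop for one sentence.
class AnalyzeSketch {

    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text); // crude tokenizer stand-in
        int position = 0;
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(token)) {
                out.add(token + " [" + m.start() + "-" + m.end() + "] position=" + position);
            }
            position++; // a removed stopword still advances the position
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : analyze("This is an example of the english analyzer")) {
            System.out.println(line);
        }
        // prints:
        // this [0-4] position=0
        // is [5-7] position=1
        // example [11-18] position=3
        // of [19-21] position=4
        // english [26-33] position=6
        // analyzer [34-42] position=7
    }
}
```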
4. Transport Client Java Example
Now we will use the Elasticsearch Transport Client Java API to create custom filters, analyzers, and index settings. In this example we will discuss two ways to create index settings with custom filters and analyzers.
4.1 JSON Source
// client and indexName are fields of the enclosing class (not shown here).
public void createSettingsWithAnalyzerJSONSource() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);

    // Same settings as in section 3.2, passed as a raw JSON string.
    String jsonSource = "{\n" +
            "  \"settings\":{\n" +
            "    \"index\": {\n" +
            "      \"analysis\": {\n" +
            "        \"analyzer\": {\n" +
            "          \"englishAnalyzer\": {\n" +
            "            \"tokenizer\": \"standard\",\n" +
            "            \"filter\" : [\n" +
            "              \"lowercase\",\n" +
            "              \"stopwordexample\"\n" +
            "            ]\n" +
            "          }\n" +
            "        },\n" +
            "        \"filter\" : {\n" +
            "          \"stopwordexample\": {\n" +
            "            \"type\": \"stop\",\n" +
            "            \"stopwords\": [\"a\",\"an\",\"the\"]\n" +
            "          }\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
    request.source(jsonSource, XContentType.JSON);

    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created.");
    getSettingsWithAnalyzer();
}
4.2 XContentBuilder
// jsonBuilder() is XContentFactory.jsonBuilder(), statically imported.
public void createSettingsWithEnglishStopAnalyzer() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    request.settings(Settings.builder()
            .put("index.max_inner_result_window", 250)
            .put("index.write.wait_for_active_shards", 1)
            .put("index.query.default_field", "paragraph")
            .put("index.number_of_shards", 3)
            .put("index.number_of_replicas", 2)
            // Build the analysis section programmatically and load it as JSON.
            .loadFromSource(Strings.toString(jsonBuilder()
                    .startObject()
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("stopwordexample")
                                    .field("type", "stop")
                                    .field("stopwords", new String[]{"a", "an", "the"})
                                .endObject()
                            .endObject()
                            .startObject("analyzer")
                                .startObject("EnglishStopWordAnalyzer")
                                    .field("tokenizer", "standard")
                                    .field("filter", new String[]{"lowercase", "stopwordexample"})
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()), XContentType.JSON)
    );

    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created");
    getSettingsWithAnalyzer();
}
4.3 Output
***************Get Settings with Analyzers *********************
index.analysis.analyzer.englishAnalyzer.filter : [lowercase, stopwordexample]
index.analysis.analyzer.englishAnalyzer.tokenizer : standard
index.analysis.filter.stopwordexample.stopwords : [a, an, the]
index.analysis.filter.stopwordexample.type : stop
index.creation_date : 1551547349488
index.number_of_replicas : 1
index.number_of_shards : 5
index.provided_name : stopwordanalyzertest
index.uuid : g8jH50FORVCXA8PUf5nWFA
index.version.created : 6060099
5. Conclusion
In this article, we discussed how to configure stopwords in Elasticsearch. A filter of type stop removes stopwords from the token stream. We covered the steps for creating a custom stop filter, adding it to an analyzer, and testing the custom analyzer. Lastly, we walked through an Elasticsearch Java client code example that does the same.
6. References
Refer to the links below for more details:
7. Source Code
You can download the source code of the Elasticsearch Java client stopwords configuration example from our Git repository.