1. Overview

Configuring stop words in Elasticsearch is straightforward. Most written text contains many function words, such as “this”, “that”, or “is”. They matter to a person reading the content because they help it flow cohesively, but they are rarely important to someone searching the content of your documents or web tutorials.

Search engines generally handle these stop words in one of two ways: either they ignore them when they appear in the search query, or they remove them at indexing time.

Let’s look at a complete example of configuring and indexing with stopwords, and at how the analyzer behaves.

2. Stop Token Filter

The stop token filter removes stop words from a token stream. It supports the settings described below.

2.1 stopwords

This setting specifies the list of stopwords to remove. Its default value is _english_.

{
  "filter": {
    "stopwordexample": {
      "type": "stop",
      "stopwords": [
        "a",
        "an",
        "the"
      ]
    }
  }
}
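
To make the effect of this list concrete, here is a minimal plain-Java sketch of what removing stopwords from a token stream amounts to. It is a conceptual illustration only and does not use Elasticsearch; the class and method names (StopFilterSketch, removeStopwords) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch only: this is NOT Elasticsearch code.
class StopFilterSketch {

    // Returns the tokens that are not in the stopword set, preserving order.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("a", "an", "the"));
        List<String> tokens = Arrays.asList("this", "is", "an", "example");
        System.out.println(removeStopwords(tokens, stopwords)); // prints [this, is, example]
    }
}
```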

2.2 stopwords_path

This setting specifies the path to a stopwords file. The file must follow these guidelines:

The file path is either absolute or relative to the config folder.
The file must be UTF-8 encoded.
Each stopword must be on its own line.

a
an
the
their
they
we
our
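
As a rough illustration of the file format above, the following plain-Java sketch reads a UTF-8 file with one stopword per line into a set. It is not how Elasticsearch itself loads the file; the class name StopwordFileSketch is hypothetical, and trimming whitespace and skipping blank lines are assumptions of this sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch only: this is NOT Elasticsearch code.
class StopwordFileSketch {

    // Turns file lines into a set of stopwords.
    // Assumption of this sketch: surrounding whitespace and blank lines are ignored.
    static Set<String> parse(List<String> lines) {
        Set<String> stopwords = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Reads the file as UTF-8, one stopword per line.
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary stopword file, then load it back.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("a", "an", "the"), StandardCharsets.UTF_8);
        System.out.println(load(file)); // prints [a, an, the]
        Files.delete(file);
    }
}
```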

2.3 ignore_case

This setting controls case sensitivity. If it is set to true, stopwords are matched case-insensitively, so “The” is removed as well as “the”. The default value is false.
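
The difference can be sketched in plain Java as follows. This is a simplified model of the behaviour described above, not Elasticsearch code; IgnoreCaseSketch and its filter method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual sketch only: this is NOT Elasticsearch code.
class IgnoreCaseSketch {

    // Removes stopwords; when ignoreCase is true the comparison is case-insensitive.
    // Tokens that are kept retain their original case.
    static List<String> filter(List<String> tokens, Set<String> stopwords, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String stopword : stopwords) {
            lookup.add(ignoreCase ? stopword.toLowerCase(Locale.ROOT) : stopword);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!lookup.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "quick", "the", "fox");
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));
        System.out.println(filter(tokens, stopwords, false)); // prints [The, quick, fox]
        System.out.println(filter(tokens, stopwords, true));  // prints [quick, fox]
    }
}
```

In the analyzer built later in this article, a lowercase filter runs before the stop filter, so the default of false is enough there.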

2.4 remove_trailing

remove_trailing is a special setting that decides whether the last token of a stream (for example, the last term of a query) is removed when it is a stopword. The default value is true.
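
A small plain-Java model of this setting, written from the description above rather than from the Elasticsearch source, might look like this. RemoveTrailingSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual sketch only: this is NOT Elasticsearch code.
class RemoveTrailingSketch {

    // Removes stopwords; when removeTrailing is false, the last token is kept
    // even if it is a stopword.
    static List<String> filter(List<String> tokens, Set<String> stopwords, boolean removeTrailing) {
        List<String> kept = new ArrayList<>();
        int last = tokens.size() - 1;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            boolean isStopword = stopwords.contains(token);
            boolean protectedTrailing = (i == last) && !removeTrailing;
            if (!isStopword || protectedTrailing) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("a", "an", "the"));
        List<String> tokens = Arrays.asList("green", "a");
        System.out.println(filter(tokens, stopwords, true));  // prints [green]
        System.out.println(filter(tokens, stopwords, false)); // prints [green, a]
    }
}
```

Keeping the trailing stopword is mainly useful when the input may be a partially typed query: in the example, the trailing "a" could be the start of a longer word.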

3. Steps to Configure Stopwords

Now we will walk through the steps to configure stopwords in a custom analyzer.

3.1 Create a stop filter

The first step is to define a custom filter of type stop under analysis when creating the index. In the example below, we create a filter called stopwordexample with only three stopwords.

{
  "filter": {
    "stopwordexample": {
      "type": "stop",
      "stopwords": [
        "a",
        "an",
        "the"
      ]
    }
  }
}

3.2 Create analyzer and set filters

The next step is to add the custom stop filter to our analyzer chain.

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "englishAnalyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stopwordexample"
            ]
          }
        },
        "filter": {
          "stopwordexample": {
            "type": "stop",
            "stopwords": [
              "a",
              "an",
              "the"
            ]
          }
        }
      }
    }
  }
}
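
The order of the filters in this chain matters: because lowercase runs before stopwordexample, a capitalized "The" is lowercased first and then removed. The plain-Java sketch below models this with a whitespace split standing in for the standard tokenizer, which is a simplification. AnalyzerChainSketch and its analyze method are hypothetical names, and none of this is Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual sketch only: this is NOT Elasticsearch code.
class AnalyzerChainSketch {

    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // lowercaseFirst = true  models the chain [lowercase, stop]
    // lowercaseFirst = false models the chain [stop, lowercase]
    static List<String> analyze(String text, boolean lowercaseFirst) {
        List<String> out = new ArrayList<>();
        // Whitespace split is a stand-in for the standard tokenizer.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // The stop filter sees either the lowercased or the original token.
            String seenByStopFilter = lowercaseFirst ? token.toLowerCase(Locale.ROOT) : token;
            if (STOPWORDS.contains(seenByStopFilter)) {
                continue;
            }
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick fox", true));  // prints [quick, fox]
        System.out.println(analyze("The Quick fox", false)); // prints [the, quick, fox]
    }
}
```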

3.3 Test

After creating the stop filter and the custom analyzer, it’s time to test them using the Elasticsearch Analyze API. The following request shows how our custom analyzer tokenizes input text.

POST stopwordexample/_analyze
{
  "analyzer": "englishAnalyzer",
  "text":     "This is an example of the english analyzer"
}

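If everything works, we get the analysis output below. The stopwords “an” and “the” have been removed during analysis, which leaves gaps at positions 2 and 5. The words “this”, “is”, and “of” remain because they are not in our three-word stopword list.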

{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "example",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "of",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "english",
      "start_offset" : 26,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "analyzer",
      "start_offset" : 34,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
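
The positions and offsets in this response can be reproduced with a short plain-Java sketch: the position counter advances for every token, including removed ones, which is why positions 2 and 5 are missing. A regular expression stands in for the standard tokenizer here, which is an assumption that only holds for simple text like this sentence. PositionGapSketch is a hypothetical name and this is not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: this is NOT Elasticsearch code.
class PositionGapSketch {

    // Returns one line per surviving token: "token [start,end) position=N".
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> rows = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\w+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            String token = matcher.group().toLowerCase(Locale.ROOT);
            if (!stopwords.contains(token)) {
                rows.add(token + " [" + matcher.start() + "," + matcher.end() + ") position=" + position);
            }
            // The position advances even for removed stopwords, leaving a gap.
            position++;
        }
        return rows;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("a", "an", "the"));
        for (String row : analyze("This is an example of the english analyzer", stopwords)) {
            System.out.println(row);
        }
    }
}
```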

4. Transport Client Java Example

Now we will use the Elasticsearch Transport Client Java API to create index settings with custom filters and analyzers. In this example we will discuss two ways to do so.

4.1 JSON Source

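// Imports for the Elasticsearch 6.x transport client.
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.common.xcontent.XContentType;

// Assumes the enclosing class provides a connected transport `client` and an `indexName` field.
public void createSettingsWithAnalyzerJSONSource() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    // Index settings as raw JSON: a custom stop filter plus an analyzer that uses it.
    String jsonSource = "{\n" +
            "        \"settings\":{\n" +
            "            \"index\": {\n" +
            "                \"analysis\": {\n" +
            "                    \"analyzer\": {\n" +
            "                        \"englishAnalyzer\": {\n" +
            "                            \"tokenizer\": \"standard\",\n" +
            "                            \"filter\" : [\n" +
            "                                \"lowercase\",\n" +
            "                                \"stopwordexample\"\n" +
            "                            ]\n" +
            "                        }\n" +
            "                    },\n" +
            "                    \"filter\" : {\n" +
            "                        \"stopwordexample\": {\n" +
            "                            \"type\": \"stop\",\n" +
            "                            \"stopwords\": [\"a\",\"an\",\"the\"]\n" +
            "                        }\n" +
            "                    }\n" +
            "                }\n" +
            "            }\n" +
            "        }\n" +
            "    }";
    request.source(jsonSource, XContentType.JSON);
    // Send the request and block until the index has been created.
    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created.");
    // Print the stored settings (helper method from the article's source code).
    getSettingsWithAnalyzer();
}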

4.2 XContentBuilder

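// Imports for the Elasticsearch 6.x transport client.
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

// Assumes the enclosing class provides a connected transport `client` and an `indexName` field.
public void createSettingsWithEnglishStopAnalyzer() throws ExecutionException, InterruptedException, IOException {
    CreateIndexRequest request = new CreateIndexRequest(indexName);
    request.settings(Settings.builder()
            // Plain index-level settings.
            .put("index.max_inner_result_window", 250)
            .put("index.write.wait_for_active_shards", 1)
            .put("index.query.default_field", "paragraph")
            .put("index.number_of_shards", 3)
            .put("index.number_of_replicas", 2)
            // Analysis settings, built with XContentBuilder and loaded as JSON.
            .loadFromSource(Strings.toString(jsonBuilder()
                    .startObject()
                        .startObject("analysis")
                            .startObject("filter")
                                .startObject("stopwordexample")
                                    .field("type", "stop")
                                    .field("stopwords", new String[]{"a", "an", "the"})
                                .endObject()
                            .endObject()
                            .startObject("analyzer")
                                .startObject("EnglishStopWordAnalyzer")
                                    .field("tokenizer", "standard")
                                    .field("filter", new String[]{"lowercase", "stopwordexample"})
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()), XContentType.JSON)
    );
    // Send the request and block until the index has been created.
    CreateIndexResponse createIndexResponse = client.admin().indices().create(request).get();
    System.out.println("Index : " + createIndexResponse.index() + " Created");
    // Print the stored settings (helper method from the article's source code).
    getSettingsWithAnalyzer();
}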

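4.3 Output

The listing below matches the JSON source example (4.1): the analyzer is named englishAnalyzer, and the shard and replica counts are the defaults rather than the values set in 4.2.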

***************Get Settings with Analyzers *********************
index.analysis.analyzer.englishAnalyzer.filter : [lowercase, stopwordexample]
index.analysis.analyzer.englishAnalyzer.tokenizer : standard
index.analysis.filter.stopwordexample.stopwords : [a, an, the]
index.analysis.filter.stopwordexample.type : stop
index.creation_date : 1551547349488
index.number_of_replicas : 1
index.number_of_shards : 5
index.provided_name : stopwordanalyzertest
index.uuid : g8jH50FORVCXA8PUf5nWFA
index.version.created : 6060099
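
The dotted keys in this output are the flattened form of the nested analysis JSON used when creating the index. The plain-Java sketch below shows that flattening on a nested map. It is not the Elasticsearch Settings class; FlattenSettingsSketch, flatten, and sample are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: this is NOT the Elasticsearch Settings implementation.
class FlattenSettingsSketch {

    // Flattens nested maps into dotted keys, sorted alphabetically.
    static Map<String, Object> flatten(String prefix, Map<String, Object> nested) {
        Map<String, Object> flat = new TreeMap<>();
        for (Map.Entry<String, Object> entry : nested.entrySet()) {
            String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) entry.getValue();
                flat.putAll(flatten(key, child));
            } else {
                flat.put(key, entry.getValue());
            }
        }
        return flat;
    }

    // The analysis section from the article, as nested maps.
    static Map<String, Object> sample() {
        Map<String, Object> analyzer = new HashMap<>();
        analyzer.put("tokenizer", "standard");
        analyzer.put("filter", Arrays.asList("lowercase", "stopwordexample"));
        Map<String, Object> stop = new HashMap<>();
        stop.put("type", "stop");
        stop.put("stopwords", Arrays.asList("a", "an", "the"));
        Map<String, Object> analysis = new HashMap<>();
        analysis.put("analyzer", Collections.singletonMap("englishAnalyzer", analyzer));
        analysis.put("filter", Collections.singletonMap("stopwordexample", stop));
        return Collections.singletonMap("analysis", analysis);
    }

    public static void main(String[] args) {
        // Prints one "key : value" line per flattened setting.
        for (Map.Entry<String, Object> entry : flatten("index", sample()).entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```

Run as is, the sketch prints the four index.analysis lines in the same form as the output above.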

5. Conclusion

In this article, we discussed how to configure stopwords in Elasticsearch. A filter of type stop removes stopwords from the token stream. We covered the steps for creating a custom stop filter, adding it to an analyzer, and testing the custom analyzer. Lastly, we looked at Elasticsearch Java client code examples that do the same.

6. Source Code

You can download the source code of Elastic Search Java Client Configure Stopwords from our git repository.

Java Developer Zone

http://javadeveloperzone.com

Configure Stopwords in Elastic Search