Solr lets you configure stemming at index time as well as at query time.
In this post we will discuss what stemming is, how to set it up on a field, and how it behaves.
Basics of Stemming
“Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.”
To quickly explain stemming in the context of Solr, let's take an example. Suppose the following documents are indexed in a field called FULL_TEXT within your Solr core:
- this is testing of our passion
- Site has been tested by QA team.
- All test cases run successfully.
When stemming is not set up on the FULL_TEXT field, a Solr query for the term “test” (that is, a search parameter of q=FULL_TEXT:test) returns only the third document, because it is the only one containing the exact token “test”. If stemming is set up on the field, all or a subset of the three documents are returned. How many are returned depends on the stemming algorithm applied: with the Porter stemmer, for example, “testing” and “tested” both reduce to “test”, so all three documents match.
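The effect on matching can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_stem` is a hypothetical suffix stripper invented for this example, not the Porter algorithm and not anything Solr runs, and `search` simply compares token sets for the three sample documents above.

```python
import re

DOCS = [
    "this is testing of our passion",
    "Site has been tested by QA team.",
    "All test cases run successfully.",
]

# Toy rules for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stem):
    """Lowercase, split into word tokens, and optionally stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {toy_stem(t) for t in tokens} if stem else set(tokens)


def search(term, stem):
    """Return the 1-based positions of the documents that match the term."""
    query = analyze(term, stem)
    return [i for i, doc in enumerate(DOCS, start=1) if query & analyze(doc, stem)]


print(search("test", stem=False))  # [3]
print(search("test", stem=True))   # [1, 2, 3]
```

The same analysis is applied to both the documents and the query term, which mirrors why the stemming filter normally appears in both the index and the query analyzer.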
Step 1: Create a field type or change an existing one
We need to add a filter called PorterStemFilterFactory to our field type definition to enable stemming while indexing or searching. There are more filters available for stemming, which are discussed later in this post.
Field type configuration:
<fieldType name="text_gen_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
Step 2: Field configuration
Use the field type created above in the field definition:
<field name="FULL_TEXT" type="text_gen_stem" indexed="true" stored="true"/>
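That’s it. To verify the stemming configuration, open the Solr Admin UI and do the following (if you changed an existing field type, reload the core and reindex first):
- Select the Solr core name from the drop-down list.
- Click on Analysis.
- Select the field name created earlier (FULL_TEXT).
- Enter text such as “testing” in Field Value (Query) and click Analyse Values. The PorterStemFilter stage of the output should show the token reduced to “test”.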
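The same check can also be scripted against Solr's field analysis request handler instead of the Admin UI. The sketch below only builds the request URL and does not contact a server. The base URL `http://localhost:8983/solr` and the core name `mycore` are placeholder assumptions, `analysis_url` is a hypothetical helper, and it assumes the `/analysis/field` handler is available on your core.

```python
from urllib.parse import urlencode


def analysis_url(base_url, core, field, query_text):
    """Build a field-analysis URL that runs query_text through the field's query analyzer."""
    params = {
        "analysis.fieldname": field,   # field whose analyzer chain is used
        "analysis.query": query_text,  # text analyzed with the query-time analyzer
        "wt": "json",                  # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


# Host, port, and core name are assumptions; replace them with your own.
url = analysis_url("http://localhost:8983/solr", "mycore", "FULL_TEXT", "testing")
print(url)
```

Opening the printed URL in a browser or with an HTTP client should return the token stream after each analysis stage, much like the Analysis screen.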
Solr Stemming algorithm implementations:
Solr supports several stemming algorithms, some more aggressive than others: