Sitecore Solr Stop Words Made Easy

Web content-management systems like Sitecore give creators great flexibility in the type and amount of content they can create and make available to users. Such flexibility requires an equally flexible yet robust search capability to find that content. Sitecore can use Solr as its search backbone to help users sift through all of the content and find the exact piece they are looking for. Solr has many tricks up its sleeve when it comes to searching, one of which is figuring out what exactly to look for and what to ignore. Note that certain features can be set up in either Sitecore or Solr. In general, I recommend centralizing as many of the search parameters as possible in the Solr configuration to simplify management and optimize efficiency. (Solr is very fast at indexing and searching.)

We usually search for nouns and verbs rather than for articles and prepositions. This makes our searches much more relevant. Let’s take an example: “Drinking a latte at Cafe tous le jours.” When searching for this statement, we would enter keywords like “drinking latte” and expect the result to pop up. We would not usually enter “a” or “at” as search keywords. Solr understands that and lets us remove such words from the text during indexing. This not only makes the lookup faster, but it also reduces the size of the index. Such common words are known as stop words. We can remove them by using Solr’s StopFilterFactory.

In Solr, any document or query can be sent through a series of tokenizers and filters collectively known as analyzers. Tokenizers are used to break up the text into tokens, and filters are used to remove, change, or swap the tokens. StopFilterFactory is a filter provided by Solr that removes stop words from documents and queries.

Analyzers are defined as part of a field type in the Solr schema.
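As a rough illustration, a field type with separate index and query analyzers might look like the sketch below. The field type name "text_stop_example" is hypothetical, and the choice of solr.StandardTokenizerFactory as the tokenizer is an assumption rather than something this article prescribes.

```xml
<!-- Hypothetical field type; the name "text_stop_example" is illustrative only -->
<fieldType name="text_stop_example" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied to the field's content when a document is indexed -->
  <analyzer type="index">
    <!-- Assumed tokenizer: splits the text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Filters are listed here, in the order they should run -->
  </analyzer>
  <!-- Applied to the search terms when a query is executed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Filters are listed here, in the order they should run -->
  </analyzer>
</fieldType>
```

Each analyzer has exactly one tokenizer followed by any number of filters, and the filters run in the order they are listed.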

It is essential to remove the same stop words from both indexed documents and queries, which is why the field type has both index and query analyzers. Solr processes the content of the field through the appropriate tokenizers and filters before writing the document to disk or executing a search.

The stop-word filter is defined the same way in both the index and query analyzers:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />

Let’s take our example through the index analyzer. First, the tokenizer splits the sentence into tokens:

“Drinking a latte at Cafe tous le jours”
[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]

Then our stop-word filter removes commonly used English words that add little value when searching or storing documents:

[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]
[“Drinking”, “latte”, “Cafe”, “tous”, “le”, “jours”]

After this, the document is ready to be indexed. An incoming search query goes through a similar transformation, apart from one additional step where a synonym filter broadens the query before it is executed.
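A query analyzer with that extra synonym step could be sketched as follows. The article does not name the synonym filter, so the use of solr.SynonymGraphFilterFactory and the file name synonyms.txt are assumptions here, as is the tokenizer.

```xml
<analyzer type="query">
  <!-- Assumed tokenizer, matching the one used at index time -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Same stop-word filter as in the index analyzer -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
  <!-- Extra query-time step (assumed class and file name): expand terms with their synonyms -->
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
```

Placing the stop-word filter before the synonym filter means stop words are dropped before any synonym expansion happens, which mirrors the order described above.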

If you have been following along and have a keen eye, you might have noticed that the stop-word filter removed only the English stop words and not the French ones. That’s because we pointed it at “stopwords_en.txt”, which is a list of stop words for English only. Solr provides language-specific stop-word lists out of the box in the conf/lang/ subdirectory, covering many languages, some of which are:

Arabic Bulgarian Catalan Czech Danish German
Greek English Spanish Basque Farsi Finnish
French Galician Hindi Hungarian Armenian Indonesian
Italian Japanese Latvian Dutch Norwegian Portuguese
Russian Swedish Thai Turkish
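
To also strip French stop words from mixed-language content like our example, one option is to chain a second stop-word filter after the English one. This is a minimal sketch that assumes lang/stopwords_fr.txt is present in the collection's conf directory and that, as in Solr's stock configuration, the file is in Snowball format; the tokenizer is likewise an assumption.

```xml
<analyzer type="index">
  <!-- Assumed tokenizer -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- English stop words, as used throughout this article -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
  <!-- French stop words (assumes the stock Snowball-format list is available) -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
</analyzer>
```

The same pair of filters would need to appear in the query analyzer so that documents and queries are stripped of the same words.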

With the amount of content that can be created, indexed, and searched in Sitecore, search efficiency can make or break the quality of the user experience. Stop words are one of the many tools that help you improve not only the quality of search but also its efficiency. Have fun with it, try it out, expand the list to your liking, and watch your collections perform better than ever before.
