Apr. 03, 2019

Karan Jeet Singh

|

3 min. read

Web content-management systems, like Sitecore, give creators endless flexibility in the type of content and amount of content they can create and make available to the users. Such flexibility requires an equally flexible yet robust search capability to find the content. Sitecore can use Solr as its backbone to help users parse through all of the content and find the exact searched piece. Solr has many tricks up its sleeve when it comes to searching; one of which is to figure out what exactly to look for and what to ignore. Note that certain features can be set up in either Sitecore or Solr. In general, I recommend centralizing as many of the search parameters as possible in the Solr configuration to simplify management and optimize efficiency. (Solr is very fast at indexing and searching).

We usually search for nouns and verbs rather than for articles. This makes our searches much more relevant. Let’s take an example – “Drinking a latte at Cafe tous le jours.” When performing a search for this statement, we would enter keywords like “drinking latte” and expect the result to pop up. We would not usually enter “a” or “at” as search keywords. Solr understands that and lets us remove such words from the text during indexing. This not only makes the lookup faster, but it also reduces the size of the index. Such common words are known as stop words. We can remove them by using


StopFilterFactory.

In Solr, any document or query can be sent through a series of tokenizers and filters collectively known as analyzers. Tokenizers are used to break up the text into tokens, and filters are used to remove, change, or swap the tokens. StopFilterFactory is a filter provided by Solr that removes stop words from documents and queries.

Analyzers are described when implementing a field type in the Solr schema, like so –

<filter class=”solr.StopFilterFactory” ignoreCase=”true”
words=”lang/stopwords_en.txt” />

Let’s take our example through the index analyzer. First, the tokenizer splits the sentence into tokens:

“Drinking a latte at Cafe tous le jours”
->
[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]

Then our stop-word filter removes commonly-used English words that provide little value to document searching and storing:

[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]
->
[“Drinking”, “latte”, “Cafe”, “tous”, “le”, “jours”]
After this, the document is ready to be indexed. An incoming search query goes through a similar transformation, apart from one additional step where a synonym filter broadens the query before it is executed. If you have been following along and have a keen eye, then you might have noticed that the stop-words filter only removed stop words for the English language and not for French. That’s because Solr provides language-specific stop words out of the box. We used “stopwords_en.txt”, which is a list of stop words for English. Solr provides custom stop-word lists for many languages in the conf/lang/ subdirectory, some of which are:
Arabic Bulgarian Catalan Czech Danish German
Greek English Spanish Basque Farsi Finnish
French Galic Hindi Hungarian Armenian Indonesian
Italian Japanese Latvian Dutch Norwegian Portuguese
Russian Swedish Thai Turkish

With the amount of content that can be created, indexed, and searched in Sitecore, search efficiency can make or break the quality of the user experience. Stop words are one of the many tools that help you improve not only the quality of search but also the efficiency of it. Have fun with it, try it out, expand the list to your liking, and watch your collections perform better than ever before.

By Karan Jeet Singh

Solutions Engineer

“…search should not only be for those organizations with massive search budgets.”

Get the Latest Content First