SearchStax named one of the “Top 20 Open Source Software Solutions for 2017”. Read more here.
Don't be a clown with your Solr relevance

The Clown Problem

We recently assisted a client in debugging counter-intuitive search results and improving their Apache Solr relevance.  The client’s search application focuses on retrieval of party-oriented listings such as services available for clowns, magicians, bounce houses etc.  Documents in the party collection include listings for clowns, magicians, face painters, games etc.  The document schema includes the following fields:

  • Name
  • Description
  • Category
  • Supplier

 

We leverage Solr’s Extended DisMax Query Parser to exercise fine control over the scoring of search results.  Two features of the parser we make heavy use of are text boosts and boost queries.  Text boosts weight the match score specified fields by a multiplicative factor between 0 and 10.  For example, supplying a text boost of name^10 will disproportionately weight the name field in a document over other fields.  Boost queries weight field+term combinations by multiplying field+term matches by a factor between 0 and 10.  For example, supplying a Boost query of category:Clown^10 will significantly boost documents that contain the term “Clown” in the category field.  We use the term “relevance model” to denote a particular set of text boost and boost query parameters. We leveraged these relevance models to optimize Solr relevance.

 

Our client received the following search results when querying “Krusty the Clown”1:

  1. Poopsie the Clown
  2. Rusty the Clown
  3. Escape from Grandma’s House
  4. Sir Widebottom
  5. Krusty the Clown

 

This result seems to indicate the query is not matching documents as expected.  In particular, the #3 result, “Escape from Grandma’s House” is not even a clown – it’s a game.  

Debugging Search Results

The essential tool for debugging search results is the “explain” mechanism of Solr which is triggered by adding &debug=true to search queries.  This produces detail scoring information for each document.  A sample of the debug output is shown below, with explanations.

<str name="55532a04-4caa-4c38-bd24-3b1795a1cd7c"> [1]
1.7509375 [2] = sum of [3]: 
  0.87546873 [4] = sum of [5] :
    0.0 = max of:
      0.0 = ConstantScore(description:the), product of:
        0.0 = boost
        1.0 = queryNorm
      0.0 = ConstantScore(name:the), product of:
        0.0 = boost
        1.0 = queryNorm
    0.87546873 [6] = max of [7]:
      0.87546873 = weight(Synonym(category:clown category:jester) in 0) [8] [SchemaSimilarity], result of:
        0.87546873 = score(doc=0,freq=1.0 = termFreq=1.0
), product of [9]:
          0.87546873 = idf [10], computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            2.0 [11] = docFreq [12]
            5.0 = docCount [13]
          1.0 = tfNorm [14], computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 [15] = termFreq [16]=1.0
            1.2 = parameter k1 [17]
            0.75 = parameter b [18]
            1.0 [19] = avgFieldLength [20]
            1.0 [21] = fieldLength [22]
      0.0 = ConstantScore(Synonym(description:clown description:jester)), product of:
        0.0 = boost [23]
        1.0 = queryNorm
      0.0 = ConstantScore(Synonym(name:clown name:jester)), product of:
        0.0 = boost
        1.0 = queryNorm
  0.87546873 [24] = weight(Synonym(category:clown [25] category:jester) in 0) [SchemaSimilarity], result of:
    0.87546873 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      0.87546873 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        2.0 = docFreq
        5.0 = docCount
      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        1.0 = avgFieldLength
        1.0 = fieldLength
</str>

Explanations:

  1. Id of the result item. Can discover ids by specifying &fl=”id,name” in the request parameters
  2. Total score for the item. Results ranks are based on score (higher scores have higher rank)
  3. “Text boost” match scores are added to “boost query” match scores. Text boost scores are simple comparison of the query terms against the indexed terms.  Boost query scores are a comparison of the indexed term against a relevance-model-defined term.
  4. Total Text boost score
  5. A score is generated for each term in the query (e.g., “Krusty”, “Clown”, etc). The scores for each term are added.
  6. Total term score
  7. A term’s score is the maximum score generated by comparing the term to each field
  8. Considering the match of clown against category:clown
  9. A term match for a given field is the product of idf and tfNorm
  10. idf is “inverse document frequency”. This favors matches of terms that are rare. E.g., “Zodiac” would a stronger match than “the” since “Zodiac” is a rare word compared to “the”.
  11. “clown” or “jester” appears twice in categories
  12. docFreq is the number of indexed documents that the term occurs in
  13. docCount is the total number of documents
  14. tfNorm is the normalized term frequency. The idea is that the more times a term occurs in a field, the better the match. The formula adjusts for a couple modifications of this basic idea. First, if a field is very long, it should have less weight that a field that is short. Second, an increase from 1 to 2 terms should have a bigger impact on the score than an increase from 99 to 100 terms.
  15. the term occurs once in this field
  16. The number of times a term occurs in the field
  17. Parameter for adjusting the sub-linearity of term frequency
  18. Parameter for adjusting effect of field length
  19. the average field length is 1.0
  20. Average length of the field across the entire index
  21. the field length of this document is 1.0
  22. length of this field
  23. boost multiplier for listing_desc. a boost of 0 means we ignore the field
  24. Boost query score
  25. We are boosting clowns. This results in a score identical to the ordinary match of the query term clown to category:clown

Experiments

To investigate the Krusty-on-bottom issue, we set up a small test environment with only the five listings shown above.  We then applied various relevance models to see the effect of the models on results and improve Solr relevance.

Experiment 1: No Model

The first experiment applied no relevance model, allowing Solr’s native scoring algorithm to take effect – effectively the default Solr relevance.  Querying for “Krusty the Clown” resulted in:

  1. Krusty the Clown
  2. Escape from Grandma’s House
  3. Poopsie the Clown
  4. Rusty the Clown
  5. Sir Widebottom

 

This result is encouraging since it demonstrates that it’s possible to surface Krusty to the top, as we would hope for.  However, somewhat surprisingly, Escape from Grandma’s House is retrieved as the #2 result.  It turns out that Escape from Grandma’s House is a strong match for Krusty the Clown since the supplier for Escape from Grandma’s House contains the word “Clown”.  Furthermore, the term “Clown” is rare in the supplier field, so Escape from Grandma’s House gets an additional boost due to the idf (inverse document frequency) factor.  This contrasts with presence of the term “Clown” in the category field of 2 other listings.  Since “Clown” is not a particularly rare category, it does not significantly boost listings that have “Clown” as a category.

Experiment 2: Zero Name and Desc

In the next experiment, we created a model that has zero-valued boosts for name and description, which is consistent with the client’s original Solr relevance model.  Applying this model result in:

  1. Escape from Grandma’s House
  2. Rusty the Clown
  3. Poopsie the Clown
  4. Sir Widebottom
  5. Krusty the Clown

 

This experiments dramatically illustrates the effect of zero-valued boosts – Krusty goes from rank #1 to rank #5.  What is happening is that zero-valued boosts cause the name and description to be completely ignored by the matching algorithm.  In a nutshell, a score is generated by matching these documents fields against the query, then that score is multiplied by 0 which effectively removes those fields from consideration in the scoring algorithm.

Experiment 3: Zero Name and Desc, Boosting Clowns

Although Experiment 2 demonstrates how Krusty can be moved down the ranking list, the results do not match the original “bad” results.  In our next experiment, we applied a boost query similar to the original Solr relevance model.  Specifically we boosted listings that have category matching Clown.  This resulted in the following result set:

  1. Rusty the Clown
  2. Poopsie the Clown
  3. Escape from Grandma’s House
  4. Sir Widebottom
  5. Krusty the Clown

 

This brings us closer the original results.  It might be a bit confusing why 2 clowns are at the top of the list, then a game, then 2 more clowns, and finally poor Krusty at the bottom.  It turns out that Rusty the Clown and Sir Widebottom have their category fields set to Clown, where the other clowns do not.  In fact, there is only one component that The Sir Widebottom and Krusty match: the word “the” in supplier.  In Krusty’s case, the supplier field is slightly longer, so Krusty is penalized relative to the Sir Widebottom due to the term frequency norm computation in the relevance algorithm.

Summary

In this article, we discussed how a seemingly confusing set of search results can be “debugged” to surface the core factors determining result orderings.  In debugging Solr relevance issues, it is important to take an incremental approach.  The first step should be to strip away all boosting-related parameters and observe the “natural” results.  The next step should be to tweak a single boost parameter and observe the effects, then the next parameter, and so on.  This method will not only produce better search results, but will produce greater insight into why search sometimes behaves in seemingly unexpected ways.

 

[1] Names of the Search Results have been changed from the originals to protect the innocent.

Get The Latest Content First!

* = required field