Feb. 09, 2018
Sameer Maggon
The Clown Problem
We recently assisted a client in debugging counter-intuitive search results and improving their Apache Solr relevance. The client’s search application focuses on retrieval of party-oriented listings for services such as clowns, magicians, and bounce houses. Documents in the party collection include listings for clowns, magicians, face painters, games, etc. The document schema includes the following fields:
- Name
- Description
- Category
- Supplier
We leverage Solr’s Extended DisMax Query Parser to exercise fine control over the scoring of search results. Two features of the parser we make heavy use of are text boosts and boost queries. Text boosts weight the match scores of specified fields by a multiplicative factor between 0 and 10. For example, supplying a text boost of name^10 will disproportionately weight the name field of a document over other fields. Boost queries weight field+term combinations, multiplying field+term matches by a factor between 0 and 10. For example, supplying a boost query of category:Clown^10 will significantly boost documents that contain the term “Clown” in the category field. We use the term “relevance model” to denote a particular set of text boost and boost query parameters. We leveraged these relevance models to optimize Solr relevance.
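As a rough sketch, a request combining both features might be built like this. The field names come from the schema above; the boost values and the fl list are illustrative:

```python
from urllib.parse import urlencode

# A sketch of an eDisMax request using both features. Field names come from
# the schema described above; boost values are illustrative, not the client's.
params = {
    "defType": "edismax",
    "q": "Krusty the Clown",
    "qf": "name^10 description category supplier",  # text boosts: name weighted 10x
    "bq": "category:Clown^10",                      # boost query on the category field
    "fl": "id,name,score",
}
query_string = urlencode(params)
```

The resulting query string can then be appended to the collection’s select endpoint.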
Our client received the following search results when querying “Krusty the Clown”[1]:
- Poopsie the Clown
- Rusty the Clown
- Escape from Grandma’s House
- Sir Widebottom
- Krusty the Clown
This result seems to indicate that the query is not matching documents as expected. In particular, the #3 result, “Escape from Grandma’s House”, is not even a clown – it’s a game.
Debugging Search Results
The essential tool for debugging search results is the “explain” mechanism of Solr, which is triggered by adding &debug=true to search queries. This produces detailed scoring information for each document. A sample of the debug output is shown below, with explanations.
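For example, a debug request can be assembled as follows. The host, port, and collection name ("party") are assumptions for illustration:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; "party" is an assumed collection name.
# debug=true turns on the explain output; fl=id,name returns the ids
# referenced in the explanations.
base = "http://localhost:8983/solr/party/select"
params = {"q": "Krusty the Clown", "debug": "true", "fl": "id,name"}
url = base + "?" + urlencode(params)
```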
[1]
1.7509375 [2] = sum of [3]:
0.87546873 [4] = sum of [5] :
0.0 = max of:
0.0 = ConstantScore(description:the), product of:
0.0 = boost
1.0 = queryNorm
0.0 = ConstantScore(name:the), product of:
0.0 = boost
1.0 = queryNorm
0.87546873 [6] = max of [7]:
0.87546873 = weight(Synonym(category:clown category:jester) in 0) [8] [SchemaSimilarity], result of:
0.87546873 = score(doc=0,freq=1.0 = termFreq=1.0
), product of [9]:
0.87546873 = idf [10], computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
2.0 [11] = docFreq [12]
5.0 = docCount [13]
1.0 = tfNorm [14], computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
1.0 [15] = termFreq [16]=1.0
1.2 = parameter k1 [17]
0.75 = parameter b [18]
1.0 [19] = avgFieldLength [20]
1.0 [21] = fieldLength [22]
0.0 = ConstantScore(Synonym(description:clown description:jester)), product of:
0.0 = boost [23]
1.0 = queryNorm
0.0 = ConstantScore(Synonym(name:clown name:jester)), product of:
0.0 = boost
1.0 = queryNorm
0.87546873 [24] = weight(Synonym(category:clown [25] category:jester) in 0) [SchemaSimilarity], result of:
0.87546873 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
0.87546873 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
2.0 = docFreq
5.0 = docCount
1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
1.0 = avgFieldLength
1.0 = fieldLength
Explanations:
1. Id of the result item. Ids can be displayed by specifying &fl=id,name in the request parameters.
2. Total score for the item. Result ranks are based on score (higher scores rank higher).
3. “Text boost” match scores are added to “boost query” match scores. Text boost scores are a straightforward comparison of the query terms against the indexed terms. Boost query scores compare the indexed terms against a relevance-model-defined term.
4. Total text boost score.
5. A score is generated for each term in the query (e.g., “Krusty”, “Clown”, etc.). The scores for the terms are added.
6. Total term score.
7. A term’s score is the maximum score generated by comparing the term to each field.
8. Considering the match of “clown” against category:clown.
9. A term match for a given field is the product of idf and tfNorm.
10. idf is “inverse document frequency”. This favors matches on rare terms. E.g., “Zodiac” would be a stronger match than “the”, since “Zodiac” is a rare word compared to “the”.
11. “clown” or “jester” appears in the category field of two documents.
12. docFreq is the number of indexed documents in which the term occurs.
13. docCount is the total number of documents.
14. tfNorm is the normalized term frequency. The idea is that the more times a term occurs in a field, the better the match. The formula adjusts this basic idea in two ways. First, a very long field should carry less weight than a short field. Second, an increase from 1 to 2 occurrences should have a bigger impact on the score than an increase from 99 to 100.
15. The term occurs once in this field.
16. The number of times the term occurs in the field.
17. Parameter for adjusting the sub-linearity of term frequency.
18. Parameter for adjusting the effect of field length.
19. The average field length is 1.0.
20. Average length of the field across the entire index.
21. The field length of this document is 1.0.
22. Length of this field.
23. Boost multiplier for listing_desc. A boost of 0 means the field is ignored.
24. Boost query score.
25. We are boosting clowns. This produces a score identical to the ordinary match of the query term “clown” against category:clown.
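The idf and tfNorm values in the explain output can be reproduced directly. Here is a minimal sketch of the two BM25 formulas shown above, plugged with the numbers from the output (docFreq=2, docCount=5, freq=1, fieldLength=avgFieldLength=1):

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

idf = bm25_idf(doc_freq=2, doc_count=5)                             # ≈ 0.8754687
tf_norm = bm25_tf_norm(freq=1, field_length=1, avg_field_length=1)  # = 1.0
score = idf * tf_norm                                               # ≈ 0.8754687
```

Multiplying the two factors recovers the 0.87546873 term score reported for the category:clown match.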
Experiments
To investigate the Krusty-on-bottom issue, we set up a small test environment with only the five listings shown above. We then applied various relevance models to see their effect on results and to improve Solr relevance.
Experiment 1: No Model
The first experiment applied no relevance model, allowing Solr’s native scoring algorithm to take effect – effectively the default Solr relevance. Querying for “Krusty the Clown” resulted in:
- Krusty the Clown
- Escape from Grandma’s House
- Poopsie the Clown
- Rusty the Clown
- Sir Widebottom
This result is encouraging since it demonstrates that it’s possible to surface Krusty to the top, as we would hope. However, somewhat surprisingly, Escape from Grandma’s House is retrieved as the #2 result. It turns out that Escape from Grandma’s House is a strong match for Krusty the Clown since the supplier for Escape from Grandma’s House contains the word “Clown”. Furthermore, the term “Clown” is rare in the supplier field, so Escape from Grandma’s House gets an additional boost due to the idf (inverse document frequency) factor. This contrasts with the presence of the term “Clown” in the category field of 2 other listings. Since “Clown” is not a particularly rare category, it does not significantly boost listings that have “Clown” as a category.
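The idf effect can be quantified with the formula from the explain output. Assuming, for illustration, that “Clown” occurs in the supplier field of only one of the five documents (the exact count is an assumption consistent with its rarity), while the explain output shows it occurs in the category field of two:

```python
import math

doc_count = 5  # five listings in the test environment
idf = lambda doc_freq: math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

idf_supplier = idf(1)  # "Clown" in supplier, assumed docFreq=1: ≈ 1.386
idf_category = idf(2)  # "Clown" in category, docFreq=2:         ≈ 0.875
```

The rarer supplier occurrence scores noticeably higher than the more common category occurrence.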
Experiment 2: Zero Name and Desc
In the next experiment, we created a model with zero-valued boosts for name and description, consistent with the client’s original Solr relevance model. Applying this model resulted in:
- Escape from Grandma’s House
- Rusty the Clown
- Poopsie the Clown
- Sir Widebottom
- Krusty the Clown
This experiment dramatically illustrates the effect of zero-valued boosts: Krusty goes from rank #1 to rank #5. What is happening is that zero-valued boosts cause the name and description fields to be completely ignored by the matching algorithm. In a nutshell, a score is generated by matching these fields against the query, and that score is then multiplied by 0, which effectively removes the fields from consideration in the scoring algorithm.
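Such a model might be sketched as follows; the field names come from the schema above, and the unboosted fields are assumptions about which fields remain in play:

```python
from urllib.parse import urlencode

# Zero-valued boosts on name and description: any match score against those
# fields is multiplied by 0, so they drop out of the ranking entirely.
params = {
    "defType": "edismax",
    "q": "Krusty the Clown",
    "qf": "name^0 description^0 category supplier",
}
query_string = urlencode(params)
```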
Experiment 3: Zero Name and Desc, Boosting Clowns
Although Experiment 2 demonstrates how Krusty can be moved down the ranking list, the results do not match the original “bad” results. In our next experiment, we applied a boost query similar to the original Solr relevance model. Specifically, we boosted listings whose category matches Clown. This resulted in the following result set:
- Rusty the Clown
- Poopsie the Clown
- Escape from Grandma’s House
- Sir Widebottom
- Krusty the Clown
This brings us closer to the original results. It might be a bit confusing why 2 clowns are at the top of the list, then a game, then 2 more clowns, and finally poor Krusty at the bottom. It turns out that Rusty the Clown and Sir Widebottom have their category fields set to Clown, whereas the other clowns do not. In fact, there is only one component on which Sir Widebottom and Krusty match: the word “the” in supplier. In Krusty’s case, the supplier field is slightly longer, so Krusty is penalized relative to Sir Widebottom by the term frequency norm computation in the relevance algorithm.
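The field-length penalty can be illustrated with the tfNorm formula from the explain output. The supplier field lengths below are hypothetical; only their relative sizes matter:

```python
# Lucene's BM25 tfNorm, as shown in the explain output.
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

# Hypothetical supplier field lengths: Krusty's supplier name is one word longer.
tf_widebottom = bm25_tf_norm(freq=1, field_length=3, avg_field_length=3)  # = 1.0
tf_krusty = bm25_tf_norm(freq=1, field_length=4, avg_field_length=3)      # = 0.88
```

With identical term frequency, the longer field scores lower, which is exactly the penalty that drops Krusty below Sir Widebottom.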
Summary
In this article, we discussed how a seemingly confusing set of search results can be “debugged” to surface the core factors determining result orderings. In debugging Solr relevance issues, it is important to take an incremental approach. The first step should be to strip away all boosting-related parameters and observe the “natural” results. The next step should be to tweak a single boost parameter and observe the effects, then the next parameter, and so on. This method will not only produce better search results, but will produce greater insight into why search sometimes behaves in seemingly unexpected ways.
[1] Names of the Search Results have been changed from the originals to protect the innocent.
