Problem:
A user types a query and we return results. Two parts we could improve there:
- Fixing the user’s queries
- Measuring and returning the search results
1. Fixing Bad Queries:
Get Relevance Feedback
- Show the user 10 results. Allow the user to mark the results as “Relevant” or “Not Relevant”
- Mathematically then: We take the query vector and add the vectors of the relevant documents and subtract the vectors for irrelevant documents.
- Users don’t click relevance buttons
Blind / PRF Pseudo-Relevance Feedback:
- What: Run the user’s query, take the top k documents, append them to the query and rerun that new query.
- Improves recall.
- Risks query drift
2. Evaluating:
What’re We Measuring?
- Effectiveness: how good are the results now?
- Efficiency: How fast is the system?
- Usability: How easy is the system for a user to interact with?
Unranked Results
- TS easy. We can do the classic Accuracy vs Precision vs Recall vs F1 vs Perplexity.
- Accuracy is useless.
- Precision and recall are at odds:
- If you retrieve more documents (high recall), you’ll likely retrieve more junk (low precision). If you’re very selective (high precision), you’ll miss a lot of good documents (low recall).
Ranked Results:
Mean Reciprocal Rank:
- If the first relevant document appeared in the 3rd position, the MRR is 1/3.
- Great for factual responses, with a single factual answer.
Normalised Discounted Cumulative Gain (nDCG):
- First, we attribute a score for how good a certain answer is. (Gain / ). E.g. 3 is perfect, 2 is good, 1 is fair, 0 is rubbish.
- We get the summed / Cumulative Gain (CG). (Problem:
{0,0,3}and{3,0,0}have the same CG, but one made you read through rubbish first…) - We penalise documents for appearing late in the list - shrinking the score slowly and smoothly as we go down the ranks. We divide by a log based on the rank position.
- Final Formula is:
- We also normalise it by dividing it by the Ideal DCG (the theoretical maximum achievable).
- Average Precision: We calculate precision every time a relevant document is found, and then average all of those scores.
- E.g: You have 3 relevant docs. Your system ranks them at positions , , and .
- Precision at rank :
- Precision at rank :
- Precision at rank :
- Relevant documents ranked early are heavily rewarded.