Evaluating Search Engines

Problem:

A user types a query and we return results. Two parts we could improve there:

Show the user 10 results. Allow the user to mark the results as “Relevant” or “Not Relevant”
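Mathematically: we take the query vector, add the vectors of the documents marked relevant, subtract the vectors of the documents marked not relevant, and rerun the search with the modified query (this is the Rocchio algorithm).
Drawback: in practice, users rarely click relevance buttons.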
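
The vector update described above can be sketched in a few lines. This is a minimal illustration, assuming dense vectors stored as plain Python lists; the function name `rocchio` and the weights `alpha`, `beta` and `gamma` (with commonly quoted default values) are assumptions for the example, not something these notes prescribe. It uses the centroid (average) of each group rather than a raw sum, which is the usual Rocchio formulation.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector towards relevant docs and away from irrelevant ones.

    query:      list of floats (the original query vector)
    relevant:   list of document vectors the user marked "Relevant"
    irrelevant: list of document vectors the user marked "Not Relevant"
    alpha, beta, gamma: illustrative weights (assumed defaults)
    """
    def centroid(vectors):
        # Average the vectors component by component; empty feedback -> zero vector.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_centroid = centroid(relevant)
    irr_centroid = centroid(irrelevant)
    return [
        alpha * q + beta * r - gamma * n
        for q, r, n in zip(query, rel_centroid, irr_centroid)
    ]


# Toy 2-dimensional example with all weights set to 1.
new_query = rocchio([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]],
                    alpha=1.0, beta=1.0, gamma=1.0)
print(new_query)
```

With equal weights, the toy query moves entirely onto the dimension of the relevant document; in practice `gamma` is usually kept smaller than `beta` so that negative feedback does not dominate.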

What: Run the user’s query, take the top k documents, append them to the query and rerun that new query.
Improves recall.
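Risks query drift: if the top k documents are off-topic, the expanded query moves further away from what the user meant.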
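
A rough sketch of this loop, assuming a toy scorer that just counts query-term occurrences; `search` and `pseudo_relevance_search` are hypothetical helper names, and a real system would use a proper ranking function and usually append only a few selected terms rather than every term of the top k documents.

```python
from collections import Counter


def search(query_terms, docs):
    """Toy ranker: score each document by how often the query terms occur in it."""
    scored = []
    for idx, doc in enumerate(docs):
        counts = Counter(doc.split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def pseudo_relevance_search(query_terms, docs, k=2):
    """Run the query, append the top-k documents' terms, and rerun."""
    first_pass = search(query_terms, docs)
    expanded = list(query_terms)
    for idx in first_pass[:k]:
        expanded.extend(docs[idx].split())
    return search(expanded, docs)


docs = ["jaguar car speed", "jaguar cat jungle", "fast car engine"]
print(search(["jaguar"], docs))                        # first pass
print(pseudo_relevance_search(["jaguar"], docs, k=1))  # after expansion
```

In this toy collection the expanded query also retrieves the third document (better recall), but only because the top hit was about cars. If the user meant the animal, that is exactly the query drift mentioned above.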

This is easy. We can use the classic metrics: Accuracy vs Precision vs Recall vs F1 vs Perplexity.
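Accuracy is useless here: almost every document in the collection is irrelevant to any given query, so a system that returns nothing still scores close to 100%.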
Precision and recall are at odds:
- If you retrieve more documents (high recall), you’ll likely retrieve more junk (low precision). If you’re very selective (high precision), you’ll miss a lot of good documents (low recall).
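
To make the trade-off concrete, here is a small sketch that computes precision, recall and F1 for a single query. It assumes results and judgments are given as collections of document ids; the function name `precision_recall_f1` is made up for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents we actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


relevant_ids = {2, 4, 6}
# A selective system versus one that returns far more documents.
print(precision_recall_f1([2], relevant_ids))
print(precision_recall_f1([1, 2, 3, 4, 5, 6, 7, 8], relevant_ids))
```

The first call has perfect precision but finds only one of the three relevant documents; the second finds all three but most of what it returns is junk.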

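For a single query, the Reciprocal Rank is $RR = \frac{1}{\text{rank of the first relevant document}}$. MRR is the mean of that value over all queries: $MRR = \frac{1}{|Q|} \sum_{q = 1}^{|Q|} \frac{1}{rank_{q}}$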
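
A short sketch of the computation, averaging the reciprocal rank over several queries. It assumes each query's results are given as a ranked list of 0/1 relevance flags, and that a query with no relevant result contributes 0 (a common convention, stated here as an assumption). The function names are illustrative.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over a non-empty list of queries."""
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


runs = [
    [0, 1, 0],  # first relevant result at rank 2 -> 1/2
    [1, 0, 0],  # first relevant result at rank 1 -> 1
    [0, 0, 0],  # nothing relevant               -> 0
]
print(mean_reciprocal_rank(runs))
```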

First, we attribute a score for how good a certain answer is. (Gain / $Rel$ ). E.g. 3 is perfect, 2 is good, 1 is fair, 0 is rubbish.
We get the summed / Cumulative Gain (CG). (Problem: {0,0,3} and {3,0,0} have the same CG, but one made you read through rubbish first…)
We penalise documents for appearing late in the list - shrinking the score slowly and smoothly as we go down the ranks. We divide by a log based on the rank position.
Final formula: $DCG_{k} = Rel_{1} + \sum_{i = 2}^{k} \frac{Rel_{i}}{\log_{2}(i)}$
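We also normalise it by dividing by the Ideal DCG (the DCG of the best possible ordering, i.e. the theoretical maximum achievable). The result is the Normalised DCG (NDCG), which lies between 0 and 1.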
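
The steps above can be sketched as follows, using the same discount as the formula (rank 1 is not discounted, later ranks are divided by $\log_2$ of the rank). One simplifying assumption: the ideal ordering is built by sorting the gains of the returned list itself, whereas a full evaluation would sort the gains of all judged documents for the query. The function names are illustrative.

```python
import math


def dcg(gains, k=None):
    """Discounted cumulative gain over the first k graded results."""
    if k is not None:
        gains = gains[:k]
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # Rank 1 is taken as-is; later ranks are divided by log2(rank).
        total += gain if rank == 1 else gain / math.log2(rank)
    return total


def ndcg(gains, k=None):
    """DCG divided by the DCG of the best possible ordering of these gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Same cumulative gain, very different ranking quality.
print(ndcg([3, 0, 0]))
print(ndcg([0, 0, 3]))
```

The two lists have identical CG, but the one that puts the perfect answer first gets the maximum score while the other is penalised for making the user read through rubbish first.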

Average Precision: We calculate precision every time a relevant document is found, and then average all of those scores.
- E.g: You have 3 relevant docs. Your system ranks them at positions $1$ , $3$ , and $7$ .
- Precision at rank $1$ : $\frac{1}{1} = 1.0$
- Precision at rank $3$ : $\frac{2}{3} \approx 0.67$
- Precision at rank $7$ : $\frac{3}{7} \approx 0.43$
- $A P = \frac{1.0 + 0.67 + 0.43}{3} = 0.70$
- Relevant documents ranked early are heavily rewarded.
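
A minimal sketch that reproduces the worked example, assuming the ranking is given as a list of 0/1 relevance flags. The optional `n_relevant` parameter is an assumption added for the case where some relevant documents were never retrieved; without it, the average is taken over the relevant documents that appear in the list.

```python
def average_precision(relevance, n_relevant=None):
    """Average of the precision values measured at each relevant result."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    denominator = n_relevant if n_relevant is not None else hits
    return sum(precisions) / denominator if denominator else 0.0


# Relevant documents at ranks 1, 3 and 7, as in the example above.
ranking = [1, 0, 1, 0, 0, 0, 1]
print(round(average_precision(ranking), 3))
```

This gives about 0.698 before rounding. Averaging this value over a set of queries gives Mean Average Precision (MAP).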