What:
A scoring mechanism for ranked information retrieval. It says:
A term is important to a document if it appears frequently within that document but rarely in the overall collection.
- Term Frequency (TF): Measures how often a term appears in a document. A higher TF suggests the document is about the term.
- Inverse Document Frequency (IDF): Measures how rare (and thus important) a term is across the entire collection.
- Terms that appear in many documents (“the”) get low IDF scores and vice versa.
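- $\text{IDF}(t) = \log\frac{N}{df_t}$, where $N$ is the total number of documents and $df_t$ is the number of documents containing the term $t$.
- $W_{t,d} = tf_{t,d} \times \log\frac{N}{df_t}$ (where $tf_{t,d}$ is the raw count of term $t$ in document $d$)
- A higher $W_{t,d}$ means a word that is rare across the collection but appears many times in the document.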
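The scoring above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy corpus, the whitespace tokenizer, the natural-log base, and the helper names `idf` and `tf_idf` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the apple computer is a computer",
    "the cat sat on the mat",
    "the apple fell from the tree",
]
tokenized = [doc.split() for doc in docs]  # naive whitespace tokenizer
N = len(tokenized)                         # total number of documents

# df[t]: number of documents that contain term t at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """log(N / df_t); returns 0.0 for unseen terms (an assumption of this sketch)."""
    if df[term] == 0:
        return 0.0
    return math.log(N / df[term])


def tf_idf(term, tokens):
    """Raw count of the term in the document, multiplied by its IDF."""
    tf = tokens.count(term)
    return tf * idf(term)


print(idf("the"))                         # "the" is in every document
print(tf_idf("computer", tokenized[0]))   # rare in the corpus, frequent in doc 0
print(tf_idf("apple", tokenized[0]))      # appears in two documents, once in doc 0
```

Because "the" occurs in every document, its IDF is log(3/3) = 0 and it contributes nothing to any score, while "computer" (one document, two occurrences) scores highest in the first document.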
Vector Space Search on TF-IDF
- Similar to Bag of Words embeddings, you initialise a zero-filled vector, where each entry corresponds to a word in the vocabulary.
- Instead of filling it with the term frequency, you fill it with the TF-IDF score.
- To search, you measure the cosine of the angle (cosine similarity) between the search query vector and each document vector.
- (If a document has “Apple computer” 1000 times, compared to one that has it only a handful of times, the tips of the vectors will be far apart but the angle between them will be almost identical)
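The search procedure above might look roughly like the following sketch. It assumes a whitespace tokenizer, a natural-log IDF, and hypothetical helper names (`build_index`, `vectorize`, `cosine`); the three documents are made up to show that repeating a phrase changes a vector's length but not its direction.

```python
import math
from collections import Counter


def build_index(docs):
    """Return the sorted vocabulary and an IDF value per vocabulary term."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(len(tokenized) / df[t]) for t in vocab}
    return vocab, idf


def vectorize(text, vocab, idf):
    """One entry per vocabulary word, filled with tf * idf (0 if absent)."""
    counts = Counter(text.lower().split())
    return [counts[t] * idf[t] for t in vocab]


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: same phrase repeated 10 and 1000 times, plus an unrelated one.
docs = [
    "apple computer " * 10,
    "apple computer " * 1000,
    "banana bread recipe",
]
vocab, idf = build_index(docs)
doc_vectors = [vectorize(d, vocab, idf) for d in docs]
query_vector = vectorize("apple computer", vocab, idf)

scores = [cosine(query_vector, v) for v in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print(scores)                                     # similarity of the query to each document
print(math.dist(doc_vectors[0], doc_vectors[1]))  # Euclidean distance between the two tips
```

The first two documents get essentially the same cosine score despite a large Euclidean distance between them, which is why cosine similarity is preferred over raw distance for ranking documents of different lengths.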