What:

A scoring mechanism for ranked information retrieval. It says:

A term is important to a document if it appears frequently within that document but rarely in the overall collection.

  • Term Frequency (TF): Measures how often a term appears in a document. A higher TF suggests the document is about the term.
  • Inverse Document Frequency (IDF): Measures how rare (and thus important) a term is across the entire collection.
    • Terms that appear in many documents (“the”) get low IDF scores; terms that appear in only a few documents get high IDF scores.

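$\mathrm{IDF}(t) = \log\left(\frac{N}{\mathrm{df}_t}\right)$, where $N$ is the total number of documents and $\mathrm{df}_t$ is the number of documents containing the term $t$.

  • $W_{t,d} = \mathrm{tf}_{t,d} \times \log\left(\frac{N}{\mathrm{df}_t}\right)$ (where $\mathrm{tf}_{t,d}$ is the raw count of term $t$ in document $d$)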

  • A higher W means the term is rare across the collection yet appears many times in this particular document.
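
The weighting above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy corpus, the whitespace tokenisation, and the names `idf` and `tf_idf` are all assumptions made for the example, and the natural logarithm is used (the base only rescales the scores).

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration); tokens are split on whitespace.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)  # total number of documents

# df[t]: number of documents containing term t (each document counted once).
df = Counter(t for doc in tokenised for t in set(doc))

def idf(term):
    # Assumes the term occurs in at least one document, so df[term] > 0.
    return math.log(N / df[term])

def tf_idf(term, doc_index):
    tf = tokenised[doc_index].count(term)  # raw count of the term in the document
    return tf * idf(term)

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, 0), 3))
```

In this corpus “the” appears in every document, so its IDF is log(1) = 0 and its weight vanishes no matter how often it occurs, while “mat”, which appears in only one document, gets the highest weight of the three.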

Vector Space Search on TF-IDF

  • Similar to Bag of Words embeddings, you initialise a zero-filled vector, where each entry corresponds to a word in the vocabulary.
  • Instead of filling it with the term frequency, you fill it with the TF-IDF score.
  • To search, you measure the cosine of the angle (cosine similarity) between the search query vector and each document vector, then rank documents by that score.
  • (If one document has “Apple computer” 10 times and another has it 1,000 times, the tips of the two vectors will be far apart, but they point in almost the same direction, so their angles to the query will be almost identical.)
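
A rough sketch of this search, under stated assumptions: the three toy documents, the helper names `vectorise` and `cosine`, and the choice to ignore query words outside the vocabulary are all illustrative. Two of the documents repeat “apple computer” a different number of times to show that vector length does not affect the cosine score.

```python
import math
from collections import Counter

# Toy corpus (an assumption): two documents differ only in how often they repeat the phrase.
docs = [
    "apple computer " * 10,
    "apple computer " * 1000,
    "banana bread recipe",
]
tokenised = [d.split() for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})
N = len(tokenised)
df = Counter(t for doc in tokenised for t in set(doc))

def vectorise(tokens):
    # One entry per vocabulary word, filled with tf * log(N / df).
    # Tokens outside the vocabulary are ignored.
    counts = Counter(tokens)
    return [counts[t] * math.log(N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_vecs = [vectorise(doc) for doc in tokenised]
query_vec = vectorise("apple computer".split())
scores = [cosine(query_vec, v) for v in doc_vecs]

# Rank document indices from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The first two documents score essentially the same against the query even though one vector is a hundred times longer, while the unrelated third document scores zero.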