TF-IDF (Term Frequency-Inverse Document Frequency)

What:

A scoring mechanism for ranked information retrieval. It says:

A term is important to a document if it appears frequently within that document but rarely in the rest of the collection.

Term Frequency (TF): Measures how often a term appears in a document. A higher TF suggests the document is about the term.
Inverse Document Frequency (IDF): Measures how rare (and thus important) a term is across the entire collection.
- Terms that appear in many documents (e.g. “the”) get low IDF scores; terms that appear in few documents get high ones.

IDF Formula: $idf_{t} = \log_{10}\left(\frac{N}{df_{t}}\right)$, where $N$ is the total number of documents and $df_{t}$ is the number of documents containing the term.
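
As a quick worked example (the collection size and document counts are made-up numbers, purely for illustration), take $N = 1000$ documents. A word found in every document gets $idf = \log_{10}\left(\frac{1000}{1000}\right) = 0$, while a word found in only 10 documents gets $idf = \log_{10}\left(\frac{1000}{10}\right) = 2$.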

TF Formula: $TF_{t,d} = \begin{cases} 1 + \log_{10}(tf_{t,d}) & \text{if } tf_{t,d} > 0 \\ 0 & \text{if } tf_{t,d} = 0 \end{cases}$

(Where $tf_{t,d}$ is the raw count of term $t$ in document $d$)

Combined Formula: $W_{t,d} = TF_{t,d} \times idf_{t}$

A higher W means the term is rare across the collection but appears many times in this document.
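
To make the formulas above concrete, here is a minimal sketch in plain Python. The function names (`tf_weight`, `idf_weight`, `tfidf_scores`), the whitespace tokeniser, and the three toy documents are all assumptions made for illustration, not part of any library.

```python
import math
from collections import Counter

def tf_weight(raw_count):
    """Log-scaled term frequency: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(raw_count) if raw_count > 0 else 0.0

def idf_weight(n_docs, doc_freq):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

def tfidf_scores(docs):
    """Return one {term: W} dict per document, where W = TF * IDF."""
    # Naive lowercase + whitespace tokeniser (an assumption for this sketch).
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # df_t: number of documents that contain term t at least once.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)  # raw counts tf_{t,d}
        scores.append({
            term: tf_weight(count) * idf_weight(n_docs, doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy collection (hypothetical data).
docs = [
    "the apple computer is the best apple product",
    "the cat sat on the mat",
    "the dog chased the cat",
]
scores = tfidf_scores(docs)
```

In this toy collection “the” appears in every document, so its IDF is $\log_{10}(3/3) = 0$ and its weight is 0 however often it occurs. “apple” appears twice in the first document and nowhere else, so it gets the highest weight there.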

Vector Space Search on TF-IDF

Similar to Bag of Words Embeddings, you initialise a zero-filled vector, where each entry corresponds to a word in the vocabulary.
Instead of filling it with the term frequency, you fill it with the TF-IDF score.
To search, you measure the cosine similarity (the cosine of the angle) between the search query vector and each document vector.
(If one document has “Apple computer” a few times and another has it 1k times, the tips of the vectors will be far apart but the angle between them will be almost identical)
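
The search procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names (`tfidf_vector`, `cosine`), the whitespace tokenisation, and the toy documents are hypothetical, and the query is weighted with the same TF-IDF scheme as the documents.

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, idf):
    """Zero-filled vector over the vocabulary, filled with TF-IDF scores."""
    counts = Counter(tokens)
    vec = [0.0] * len(vocab)
    for i, term in enumerate(vocab):
        tf = counts.get(term, 0)
        if tf > 0:
            vec[i] = (1 + math.log10(tf)) * idf[term]
    return vec

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection (hypothetical data): doc 0 repeats the phrase, doc 1 has it once.
docs = [
    "apple computer apple computer apple computer",
    "apple computer",
    "banana bread recipe",
]
tokenised = [d.split() for d in docs]
vocab = sorted({t for tokens in tokenised for t in tokens})
n_docs = len(docs)
# idf_t = log10(N / df_t), with df_t counted over the toy collection.
idf = {t: math.log10(n_docs / sum(t in tokens for tokens in tokenised)) for t in vocab}

doc_vectors = [tfidf_vector(tokens, vocab, idf) for tokens in tokenised]
query_vector = tfidf_vector("apple computer".split(), vocab, idf)

similarities = [cosine(query_vector, v) for v in doc_vectors]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(n_docs), key=lambda i: similarities[i], reverse=True)
```

Documents 0 and 1 have vectors of different lengths but the same direction, so both score a cosine similarity of about 1 against the query, while the unrelated document 2 scores 0.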
