Shingles

Problem:

The internet is full of duplicated content. If you search, you don’t want 10 identical results. So how do we check for duplicates?

Solution:

Well, for exact duplicates, we can just hash the file. But often, there are just tiny differences (logo, timestamp, etc.), and then the hashes no longer match.
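To see why a plain hash only catches exact copies, here's a minimal sketch. The choice of SHA-256, the `fingerprint` helper, and the sample page strings are all my assumptions for illustration, not something this note specifies.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the whole document; equal digests mean identical content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical pages: a and b are identical, c differs only in its timestamp.
page_a = "Breaking news: cats rule the internet. Updated 10:01"
page_b = "Breaking news: cats rule the internet. Updated 10:01"
page_c = "Breaking news: cats rule the internet. Updated 10:02"

print(fingerprint(page_a) == fingerprint(page_b))  # True: exact duplicate
print(fingerprint(page_a) == fingerprint(page_c))  # False: one character differs
```

A single changed character gives a completely different digest, so hashing alone can't tell "almost the same" from "totally different".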

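So, we split each document up into shingles: overlapping n-grams with n = 3. Then we compare the two sets of shingles using Jaccard Similarity (a.k.a. Intersection over Union): the number of shingles both documents share, divided by the number of distinct shingles across the two. If two documents share 80% of their shingles, then BOSH, they're near-duplicates.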
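The shingle-and-Jaccard step above can be sketched roughly like this. I'm assuming word-level 3-grams (the note doesn't say whether the n-grams are words or characters), and the `shingles` and `jaccard` helpers and the sample sentences are made up for illustration.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of overlapping word n-grams (shingles) in text.

    Assumption: shingles are built from lowercase, whitespace-split words.
    """
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(similarity)  # 5 shared shingles out of 9 distinct ones, i.e. 5/9
```

Note how swapping a single word changes two of the seven shingles in each of these short sentences, which drags the score down to 5/9. On a full web page, a changed timestamp touches a far smaller fraction of the shingles, so the score stays high.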

Slight Hiccup:

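Comparing every web page's shingles against EVERY OTHER page's shingles would take $O(N^{2})$ comparisons, each one over a full set of shingles. So we use MinHash, which squashes each page's shingle set down into a short, fixed-size signature. The fraction of positions where two signatures agree estimates the Jaccard similarity of the original sets.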
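Here's a minimal sketch of the MinHash idea, under a few assumptions of mine: 64 hash functions, each simulated by salting BLAKE2b with a seed, and shingles represented as plain strings. The names `NUM_HASHES`, `seeded_hash`, `minhash` and `estimate_jaccard` are hypothetical, and production implementations typically use cheaper hash families than this.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length; more hashes give a tighter estimate


def seeded_hash(shingle: str, seed: int) -> int:
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the smallest hash over the set.

    Assumes shingle_set is non-empty (min() raises on an empty set).
    """
    return [min(seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two documents agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical shingle sets: 3 shared out of 5 distinct, so true Jaccard is 3/5.
set_a = {"the quick brown", "quick brown fox", "brown fox jumps", "fox jumps over"}
set_b = {"the quick brown", "quick brown fox", "brown fox jumps", "fox jumps high"}

sig_a, sig_b = minhash(set_a), minhash(set_b)
print(estimate_jaccard(sig_a, sig_b))  # approximates the true value of 3/5
```

The signature has the same length no matter how long the page is, so comparing two pages means comparing 64 integers instead of two full shingle sets.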
