Problem:
The internet is full of duplicated content. If you search, you don’t want 10 identical results. So how do we check for duplicates?
Solution:
Well, for exact duplicates, we can just hash the file and compare hashes. But often, there are just tiny differences (logo, timestamp, etc.), and a single changed byte gives a completely different hash.
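A minimal sketch of the exact-duplicate check, assuming the pages are already in memory as strings. The function name `fingerprint`, the choice of SHA-256, and the sample pages are illustrative assumptions, not something the approach prescribes.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of the text's UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical sample pages, for illustration only.
page_a = "Breaking news: cats love boxes."
page_b = "Breaking news: cats love boxes."
page_c = "Breaking news: cats love boxes. Updated 10:42"

print(fingerprint(page_a) == fingerprint(page_b))  # True: byte-for-byte identical
print(fingerprint(page_a) == fingerprint(page_c))  # False: the timestamp changes the hash
```

The second comparison is the catch: the pages are the same to a human, but the hashes tell you nothing about how close they are.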
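So, we split each document up into shingles: overlapping n-grams of length 3. Then we compare the two sets of shingles using Jaccard Similarity (a.k.a. Intersection over Union): the number of shingles the documents share, divided by the number of distinct shingles across both. If that comes out at 0.8 (80%) or more, then BOSH, near-duplicate.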
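Here is a rough sketch of shingling plus Jaccard similarity. It assumes word-level shingles (the n-grams could just as well be characters), and the helper names `shingles`, `jaccard`, and `is_near_duplicate` are made up for this example.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of overlapping word n-grams (assumption: word-level)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """True if the documents' shingle sets reach the similarity threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


# Hypothetical sample documents that differ only in the last word.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

print(jaccard(shingles(doc_a), shingles(doc_b)))  # 0.75 (6 shared / 8 distinct)
print(is_near_duplicate(doc_a, doc_b))            # False: just under 0.8
```

On documents this short, one changed word knocks out a big share of the shingles. On a full web page, a changed timestamp would barely move the score.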
Slight Hiccup:
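Comparing every web page’s shingles against EVERY OTHER page’s shingles would take O(n^2) comparisons for n pages, each one over a big set of shingles. So we use MinHash, which squashes each shingle set down to a short, fixed-size signature.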
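To make the "squashing" concrete, here is a small MinHash sketch. It is one common way to build the signature, not necessarily how any particular search engine does it. Assumptions: word-level shingles, 128 hash functions simulated by prefixing a seed before SHA-256, and the helper names `seeded_hash`, `minhash_signature`, and `estimate_jaccard` are invented for this example.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Return the set of overlapping word n-grams, joined into strings."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(seed: int, shingle: str) -> int:
    """One 'hash function' per seed: hash seed + shingle down to a 64-bit int."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(shingle_set: set, num_hashes: int = 128) -> list:
    """For each hash function, keep only the smallest hash over the whole set."""
    if not shingle_set:
        raise ValueError("cannot build a signature from an empty shingle set")
    return [min(seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical sample documents; their true Jaccard similarity is 0.75.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

print(len(sig_a))                      # 128, however long the document is
print(estimate_jaccard(sig_a, sig_a))  # 1.0 for identical documents
print(estimate_jaccard(sig_a, sig_b))  # an estimate that should land near 0.75
```

The trick is that, for any one hash function, the chance that two sets share the same minimum equals their Jaccard similarity. Comparing 128 numbers per pair is much cheaper than comparing full shingle sets, and more hash functions give a tighter estimate.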