Problem:

The internet is full of duplicated content. If you search, you don’t want 10 identical results. So how do we check for duplicates?

Solution:

Well, for exact duplicates, we can just hash each file and compare the hashes. But often there are tiny differences (logo, timestamp, etc.), and even a one-byte change gives a completely different hash.
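A quick sketch of the exact-duplicate check, assuming SHA-256 as the hash (the notes don't name one) and made-up page contents. The helper name `content_hash` is just for illustration.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that is identical only for byte-identical content."""
    # SHA-256 is an assumption; any stable content hash would do here.
    return hashlib.sha256(data).hexdigest()


# Hypothetical pages: b is an exact copy of a, c differs by one character.
page_a = b"<html>Hello world</html>"
page_b = b"<html>Hello world</html>"
page_c = b"<html>Hello world!</html>"

same_ab = content_hash(page_a) == content_hash(page_b)  # exact copy: hashes match
same_ac = content_hash(page_a) == content_hash(page_c)  # tiny edit: hashes differ
```

This catches byte-for-byte copies cheaply, but the one-character edit in `page_c` is enough to make it look like a totally different page.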

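So we split each document up into shingles: overlapping n-grams with n = 3. Then we compare the two shingle sets using Jaccard Similarity (a.k.a. Intersection Over Union): the number of shingles both documents share, divided by the number of distinct shingles across the two. If that score is 80% or more, then BOSH, we call them near-duplicates.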
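Here's a minimal sketch of shingling plus Jaccard. It assumes word-level shingles (the notes don't say whether the 3-grams are words or characters), and the function names and sample sentences are made up for illustration.

```python
def shingles(text: str, n: int = 3) -> set:
    """Split text into a set of overlapping word n-grams (word-level is an assumption)."""
    words = text.lower().split()
    if len(words) < n:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Apply the 80% rule from the notes to two documents."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

# Each document has 7 shingles; 6 are shared and 8 are distinct overall.
score = jaccard(shingles(doc_a), shingles(doc_b))  # 6 / 8 = 0.75
```

Note how harsh this is on short texts: changing one word out of nine drops the score to 0.75, just under the 80% bar. On a full web page, a changed timestamp touches only a handful of shingles out of hundreds, so the score stays high.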

Slight Hiccup:

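Comparing every web page’s shingles against EVERY OTHER page’s shingles would take O(n^2) comparisons for n pages, each one over a full shingle set. So we use MinHash, which squashes each page’s shingle set down into a short, fixed-size signature. The fraction of positions where two signatures match is an estimate of their Jaccard similarity.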
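A rough sketch of the MinHash idea, under some assumptions that aren't in the notes: 128 hash functions, each simulated by salting SHA-1 with a seed, and word-level 3-gram shingles. All names here are hypothetical.

```python
import hashlib

NUM_HASHES = 128  # signature length; an assumed value, more hashes = better estimate


def word_shingles(text: str, n: int = 3) -> set:
    """Overlapping word n-grams, joined into strings so they can be hashed."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def seeded_hash(seed: int, shingle: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set: set, num_hashes: int = NUM_HASHES) -> list:
    """For each hash function, keep only the smallest hash value over the set."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(seeded_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature(word_shingles("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(word_shingles("the quick brown fox jumps over the lazy cat"))
sig_c = minhash_signature(word_shingles("completely unrelated words appear in this other sentence"))

estimate_ab = estimate_jaccard(sig_a, sig_b)  # close to the true Jaccard of 0.75
estimate_ac = estimate_jaccard(sig_a, sig_c)  # no shared shingles, so about 0
```

The win is that every page, however long, becomes 128 numbers, so one comparison costs the same for a tweet as for a novel. The estimate is noisy, though: it gets tighter as `NUM_HASHES` grows.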