Goal
- Given a large number of documents, find “near-duplicate” pairs as quickly as possible.
Terminology
$$
\mathrm{sim}(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}
$$
- Jaccard Distance = 1 - Jaccard Similarity (the sim defined above)
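
As a quick illustration of the two definitions above, here is a minimal Python sketch that treats each document as a set of tokens. The function names `jaccard_similarity` and `jaccard_distance` are made up for this example, and returning 1.0 for two empty sets is an assumption (the formula itself is undefined there).

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """Return |c1 ∩ c2| / |c1 ∪ c2|."""
    union = c1 | c2
    if not union:
        # Assumption: two empty sets count as identical.
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """Return 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


a = {"the", "quick", "brown", "fox"}
b = {"the", "lazy", "brown", "dog"}

# 2 shared elements ("the", "brown") out of 6 distinct elements in total.
print(jaccard_similarity(a, b))
print(jaccard_distance(a, b))
```

Computing this for every pair of documents is quadratic in the number of documents, which is why the steps below (Shingling, Min-Hashing, Locality-Sensitive Hashing) are used to avoid comparing all pairs directly.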
Essential Steps for Similar Docs
- Shingling
- Min-Hashing
- Locality-Sensitive Hashing

🔗 Link to an English explanation of the whole process
1. Shingling (k-shingle or k-gram)