Beruflich Dokumente
Kultur Dokumente
signatures to compare,
and that is too much.
So that's where locality
sensitive hashing comes in.
We do some magic, which we'll explain
soon, that allows us to look at
a small subset of the possible pairs, and
test only those pairs with similarity.
By doing so, we get almost all
the pairs that are truly similar, and
the total time spent is
much less than quadratic.
So lets begin the story by finding
exactly what shingles are.
For any integer k,
a k-shingle is sometimes called a k-gram,
is a sequence of k consecutive
characters in the document.
The blanks that separate the words of
the document are normally
considered characters.
If the document involves tags such as
an HTML document then the tags may
also be considered characters or
they can be ignored.
So here's an example of a little document
consisting of the five characters abcab.
We'll use k equals two for
this little example, although in
practice you want to use a k that is
large enough that most sequences of k
characters do not appear in the document.
A k in the range five to ten is used,
generally used but the two shingles for
our little document are ab, which is that,
then bc, then ca, and then ab again.
Since we're constructing a set,
we include the repeated
two shingle ab only once,
and thus our document
abcab is represented by
the set of shingles ab, bc and ca.
We need to assure ourselves
that replacing a document by
its shingles still lets us detect pairs of
documents that are intuitively similar.
In fact, similarity of shingle
sets captures many of the kinds of
document changes that we would regard
as keeping the documents similar.
For example, if we're using k-shingles and
we change one word, only the k-shingles to
the left and right of the word as well as
shingles within the word can be effected.
And we can re-order entire paragraphs
without affecting any shingles except
the shingles that cross the boundaries
between the paragraph we moved and
the paragraphs just before and
after in both the new and old positions.
For example,
suppose we use k equals three, and
we correctly change the which
in the sentence to that,
the only shingles that can be affected
are the ones that begin at most two
characters before which and
end at most two characters after which.
Okay, these are g blank w, blank wh,
and so on, up to h blank c.
A total of seven shingles.
These are replaced by
six different shingles.
G blank t, t is it blank th,
and so on up to t blank c.
however, all shingles other than these
remain the same in the two sentences.
Because documents tend to consist
mostly of the 26 letters, and
we want to make sure that most
shingles do not appear in a document,
we are often forced to use a large
value of k, like k equals ten.
But the number of different strings of
length ten that will actually appear in
any document is much smaller
than 256 to the tenth power or
even 26 to the tenth power.
Thus, it common to compress
shingles to save space while still
preserving the property that most shingles
do not appear in a given document.
For example, we can hash strings of
length ten to 32 bits or four bytes, thus
saving 60% of the space that are needed
to shore, to store the shingle sets.
The result of hashing shingles
is often called a token.
Thus, we can construct for
a document the set of it's tokens.
We construct the shingle set and
then hash each shingle to get a token.
Since documents are much shorter
than two to the 32nd power byte,
we still can be sure that a document
is only a small fraction of
the possible tokens in it's sets.
There's a small chance of a collision
where two shingles hashed to
the same token, but that could make two
documents appear to have shingles in
common when in fact they
have different shingles.
But such an an occurrence
will be quite rare.
In what follows we'll continue to
refer to shingles sets when these sets
might consist of token
rather than the raw shingles