Đề tài 9 PDF

Journal of Digital Forensics,
Security and Law

Volume 11 | Number 2 Article 4
2016
Bytewise Approximate Matching: The Good, The

Bad, and The Unknown
Vikram S. Harichandran
University of New Haven
Frank Breitinger
Ibrahim Baggili
Follow this and additional works at: http://commons.erau.edu/jdfsl

Part of the Computer Law Commons, and the Information Security Commons
Recommended Citation
Harichandran, Vikram S.; Breitinger, Frank; and Baggili, Ibrahim (2016) "Bytewise Approximate Matching: The Good, The Bad, and
The Unknown," Journal of Digital Forensics, Security and Law: Vol. 11 : No. 2 , Article 4.
DOI: https://doi.org/10.15394/jdfsl.2016.1379
Available at: http://commons.erau.edu/jdfsl/vol11/iss2/4
This Article is brought to you for free and open access by the Journals at
Scholarly Commons. It has been accepted for inclusion in Journal of Digital
Forensics, Security and Law by an authorized administrator of Scholarly
Commons. For more information, please contact commons@erau.edu.
(c)ADFSL
Bytewise approximate matching: the good, the bad ... JDFSL V11N2
BYTEWISE APPROXIMATE MATCHING:

THE GOOD, THE BAD, AND THE
UNKNOWN
Vikram S. Harichandran, Frank Breitinger and Ibrahim Baggili
Cyber Forensics Research and Education Group (UNHcFREG)
Tagliatela College of Engineering
University of New Haven, West Haven CT, 06516, United States
e-Mail: {vhari2, FBreitinger, IBaggili}@newhaven.edu
ABSTRACT
Hash functions are established and well-known in digital forensics, where they are commonly
used for proving integrity and file identification (i.e., hash all files on a seized device and
compare the fingerprints against a reference database). However, with respect to the latter
operation, an active adversary can easily overcome this approach because traditional hashes
are designed to be sensitive to altering an input; output will significantly change if a single
bit is flipped. Therefore, researchers developed approximate matching, which is a rather new,
less prominent area but was conceived as a more robust counterpart to traditional hashing.
Since the conception of approximate matching, the community has constructed numerous
algorithms, extensions, and additional applications for this technology, and are still working
on novel concepts to improve the status quo. In this survey article, we conduct a high-level
review of the existing literature from a non-technical perspective and summarize the existing
body of knowledge in approximate matching, with special focus on bytewise algorithms. Our
contribution allows researchers and practitioners to receive an overview of the state of the
art of approximate matching so that they may understand the capabilities and challenges of
the field. Simply, we present the terminology, use cases, classification, requirements, testing
methods, algorithms, applications, and a list of primary and secondary literature.
Keywords: Approximate matching, Fuzzy hashing, Similarity hashing, Bytewise, Survey,

Review, ssdeep, sdhash, mrsh-v2.
1. INTRODUCTION an increase in processing power. Speaking in

numbers, while 80-200 GB HDDs, 2-4 GB of
It is no secret that the number of networked RAM memory, and dual core were the quasi
devices in the world continues to increase standards for machines in 2011, nowadays
alongside the complexity of cyber crimes, they are 512 GB SSDs, 8-16 GB RAM, and
size of storage devices, and amount of data. multicore architectures; (external) storage
We are beyond the point where investiga- devices may have several terabytes of stor-
tors can manually analyze all cases. These age. Furthermore, if privately owned com-
advances have also been complemented with puting resources are insufficient for a task,
© 2016 ADFSL Page 59

JDFSL V11N2 Bytewise approximate matching: the good, the bad ...
one may shift it to the cloud. In short, Table 1. Answers to the survey question:
practitioners need tools and techniques that Have you ever used approximate match-
are capable of automatically handling large ing/similarity hashing algorithms?
amounts of data since time in investigations
is of the essence. Answer in %
A common forensic process to support Yes, I use them on a regular basis. 12.50
practitioners is known file filtering, which Yes, a few times. 34.38
aims at reducing the amount of data an in- No, they are too slow for practical 7.29
vestigator has to manually examine. The use.
process is quite simple: (1) compute the No, they are unnecessary for my 31.25
hashes for all files on a target device (2) purposes.
compare the hashes to a reference database. No, I am unaware of what it is. 14.58
Based on the signatures in the database,
files are whitelisted (filtered out / known-
approximate matching. Accordingly, we ad-
good files, e.g., files of the operating system)
dress the following key points:
or blacklisted (filtered in / known-bad files,
e.g., known illicit content). This straightfor- Terminology, use cases, classification,
ward procedure is commonly implemented requirements, and testing.
using cryptographic hash functions like MD5
(Rivest, 1992) or an algorithm from the High-level description of existing algo-
SHA family (FIPS, 1995; Bertoni, Daemen, rithms including strengths and weak-
Peeters, & Assche, 2008). nesses.
While cryptographic hashes are well- Secondary literature that enhances / as-
established and tested, they have one down- sesses existing approaches.
side – they can only identify bitwise identi-
cal objects. This means changing a single New applications that employ approx-
bit of the input will result in a totally dif- imate matching, e.g., file carving and
ferent hash value. Subsequently, the com- data leakage prevention.
munity worked on a counterpart for (crypto-
graphic) hashing algorithms that allows sim- Current limitations and challenges, and
ilarity identification – approximate match- possible future trends.
ing. Although this is a practically useful Since it is low-level (is directly concerned
concept, a recent survey by Harichandran, with the structure of everything digital), and
Breitinger, Baggili, and Marrington (2016) may be the most impacting / implemented
with 99 participants showed that only 12% type due to its usage for automation, we fo-
of the forensic experts polled use this tech- cus on bytewise approximate matching.
nology on regular basis. Detailed results are
provided in Table 1. Differentiation from previous work.
When writing this article, there were three
Contribution. In this paper we aim to ad- articles similar to this survey. The first
dress the almost 15 % that have never heard was the SP 800-168 from the National Insti-
of approximate matching by providing them tute for Standards and Technology (NIST,
with a comprehensive literature survey, and Breitinger, Guttman, McCarrin, Roussev,
the 31 % (unnecessary for my purposes) by and White (2014)). While this article pro-
illustrating a multitude of applications for vides an overview of the terminology, uses
Page 60 © 2016 ADFSL

cases, and testing, it does not include any the size of their intersection over the size of
algorithm concepts, applications, or criti- their union (Jaccard, 1901, 1912): if A and
cal discussion. Moreover, a reader is not B are sets, then the TJaccard index J is de-
provided with a long list of references. fined as J(A, B) = |A S B| . It has been widely
|A B|
Martı́nez, Álvarez, and Encinas (2014) is a adopted as a method for quantifying similar-
purely technical paper and focused on the ity and is still used mainly within computer
full details of the algorithms and their imple- linguistics for plagiarism detection.
mentations. Thirdly, the dissertation from Nearly a century later, Broder (1997) pro-
Breitinger (2014) contained almost all of posed using the Jaccard index as part of
these topics but is extremely lengthy. Ergo, his algorithm for identifying similar docu-
our intention during writing was to make ments. Broder suggested a distinction be-
this publication the primary source for re- tween two commonly used types of similar-
searchers / practitioners to grasp a cur- ity: ‘roughly the same’ (resemblance) and
sory bird’s-eye view of bytewise approximate ‘roughly contained inside’ (containment).
matching. While he recommended using the Jaccard in-
We summarized the most important ele- dex for resemblance, he introduced a varia-
ments of these works in a condensed and tion to approximate containment which “in-
direct manner to increase the awareness of dicates that A is Troughly contained within
approximate matching. Extra texts are also B”: c(A, B) = |A|A|B| . Additionally, Broder
shared for each algorithm in Sec. 4. described the MinHash algorithm, an effi-
Structure. The remainder of this paper is cient method for estimating these similarities
organized as follows: Sec. 2 provides the his- (Broder, Charikar, Frieze, & Mitzenmacher,
torical background of approximate match- 1998).
ing. Concepts are outlined, including use On the other hand, Manber (1994) pre-
cases, types, requirements (this subsection sented sif, an implementation used to cor-
describes the core principles of algorithm de- relate text files. “Files are considered simi-
sign), and testing. Then, after traversing 8 lar if they have a significant number of com-
of the most popular algorithms in Sec. 4 we mon pieces, even if they are very different
mention newly explored prospects in Sec. 5. otherwise.” Due to the complexity of com-
Limitations and challenges precedes a brief paring strings directly, he utilized Rabin fin-
listing of future areas of research in Sec. 7. gerprinting to hash and compare substrings
(Rabin, 1981).
2. HISTORY A first step towards approximate match-
ing as we use it today was dcfldd1 by Har-
Many of the approximate matching algo- bour in 2002, which was an extension for
rithms designed to solve modern-day prob- the well-known disk dump tool dd. His
lems in digital forensics rely fundamentally tool divided the input into chunks of fixed
on the ability to represent objects as sets length and hashed each chunk. Therefore,
of features, thereby reducing the similarity Harbour’s approach is also called block-based
problem to the well-defined domain of set op- hashing. While this approach works per-
erations (Leskovec, Rajaraman, & Ullman, fectly for flipped bits, it has a limited ca-
2014). This approach has roots in the work pacity to detect similarity in strings where
of the early Swiss 20th century biologist Paul
Jaccard, who suggested expressing the simi- 1
http://dcfldd.sourceforge.net (last ac-
larity between two finite sets as the ratio of cessed Feb 4th , 2016).

the deletion or insertion of bits creates a drive image for matches of 4096-byte blocks
shift that changes the hashes of all blocks across a set of nearly one million blacklisted
that follow. Theoretically, the shift of even files stored in the custom-built database,
a single bit at the beginning of a file could hashdb.
cause nearly identical objects to appear to
have nothing in common, much like the naive
(traditional) file hashing approach. 3. CONCEPTS
Although this weakness can present a
While approximate matching (a.k.a. fuzzy
problem for file-to-file comparison, it may
hashing or similarity hashing) started to gain
be acceptable in some scenarios. For exam-
popularity in the field of digital forensics in
ple, if the goal is to determine which parts of
2006, it was not until 2014 that the Na-
a disk image might have been changed dur-
tional Institute for Standards and Technolo-
ing a cyber attack, Harbour’s technique re-
gies (NIST) developed standard definitions,
mains useful. Likewise, in the case of an an-
publishing Approximate Matching: Defini-
alyst scanning for blacklisted material across
tion and Terminology (NIST SP 800-168,
a drive or a collection of drives, the loss of
Breitinger, Guttman, et al. (2014)). Sub-
a few block matches may be a worthwhile
sections below briefly summarize this work’s
trade-off for gains in speed and simplicity,
principles.
particularly because a single block is often
The ‘Purpose and Scope’ section of the
sufficient evidence to demonstrate the pres-
NIST document defines approximate match-
ence of an artifact, or at least to warrant
ing as follows: “Approximate matching is
closer inspection.
a promising technology designed to identify
Several efforts have been made to further
similarities between two digital artifacts. It
leverage this technique for detecting similar
is used to find objects that resemble each
material by matching fixed length file frag-
other or to find objects that are contained
ments. Collange, Dandass, Daumas, and
in another object.”
Defour (2009) coined the term “hash-based
carving” to describe this method of scanning 3.1 Use cases
for blacklisted material, since it can be used
to extract content without aid from the file In investigative cases, approximate match-
system, provided the targets are known being is used to filter known-good or known-
forehand. bad files while using a reference approximate
Key (2013)’s File Block Hash Map matching hashed data set, either on static
Analysis (FBHMA) EnScript and Simson data or data in transit over a network. The
Garfinkel’s tool frag find (S. L. Garfinkel, primary use cases for approximate matching
2009) provided practical implementations are presented below:
that automated the process for forensic ex-
aminers, though searches were limited to a Similarity detection correlates related
few files at a time. S. Garfinkel, Nelson, documents, e.g., different versions of a
White, and Roussev (2010) described the Word document.
implementation and evaluation of frag find
in detail, noting a particular difficulty in Cross correlation correlates documents
storing and searching billions of hashes at that share a common object, e.g., a
practical speeds. S. Garfinkel and McCar- DOC and a PPT document including
rin (2014) later succeeded in scanning a the same image.

Embedded object detection identifies an the similarity of the content of a JPG

object inside a document, e.g., an image and a PNG image where the image file
inside a memory dump. types / byte streams are different, but
the picture is the same.
Fragment detection identifies the pres-
ence of traces/fragments of a known ar- Furthermore, there are 4 cardinal cate-
tifact, e.g., identify the presence of a file gories of algorithms (see Sec. 4 for the inner
in a network stream based on individual workings) (Martı́nez et al., 2014):
packets.
Context-Triggered Piecewise Hashing
A lecture at DFRWS USA 2015 decided (CTPH).
to break down uses into 6 categories instead,
from the perspective of the bytestreams Block-Based hashing (BBH).
matching: R is identical to T, R contains T,
R & T share identical substrings, R is simi- Statistically-Improbable Features
lar to T, R approximately contains T, and R (SIF).
& T share similar substrings (where R and Block-Based Rebuilding (BBR).
T are two sequences) (Ren & Cheng, 2015).
Notwithstanding, approximate matching
3.3 Requirements
may not be appropriate when used to
whitelist artifacts since such content can be There are multiple ways to interpret sub-
quite similar to benign content, e.g., an SSH string matching. For example, “ababa” and
server with a backdoor would look analogous “cdcdc” might be considered similar because
to a regular entry point (Baier, 2015). they both have five characters ranging over
two alternating values, or they might be
3.2 Types treated as dissimilar because they have no
Regardless of the use cases, approximate common characters. Thus, algorithms must
matching can be implemented at different define the lowest common denominator of
abstractions. Usually we distinguish be- its interpretation - a feature - including how
tween the following three abstraction cate- they are derived from an input. When two
gories: features are compared the outcome is binary,
match or no match.
Bytewise: matching operates on the byte A feature set refers to a set of distinct
level and uses only the byte sequences as features found in the entire bytestream of
input only (also known as fuzzy hashing a file or file fragment. An algorithm may
and similarity hashing). choose to only include some features in this
Syntactic: matching also works on the byte set and must outline the method/criteria for
level but may use internal structure in- inclusion. Feature sets are then used to pro-
formation, e.g., one may ignore the TCP duce a match/similarity score, representing
header information of a packet that is the amount of similarity between sets of tar-
parsed. get files rationally (an increasing monotonic
function).
Semantic: matching works on the content- Bytewise algorithms have two main func-
visual layer and therefore closely resem- tions. A feature extraction function iden-
bles human behavior (also called percep- tifies and extracts features from objects to
tual hashing and robust hashing), e.g., convert them to a compressed version for

comparison. Then, a similarity function per- ronment for bytewise approximate matching.
forms the comparison between these com- The first step was taken by Breitinger, Sti-
pressed versions to output a normalized vaktakis, and Baier (2013), called FRame-
match score. This comparison usually in- work to test Algorithms of Similarity Hash-
volves string formulas such as Hamming dis- ing (FRASH).
tance and Levenshtein distance; Martı́nez, It tested efficiency, sensitivity and robust-
Álvarez, Encinas, and Ávila (2015) and Li et ness, and precision and recall. This last cat-
al. (2015) have proposed new algorithms for egory can be divided further into synthetic
specific uses. Normalized scores may be cal- data vs. real world data. While synthetic
culated by weighing the number of matching data provides the perfect ground truth (fur-
features against the total number of features ther described below), it does not coincide
for both objects (for resemblance), or by ig- with the real world, and vise versa.
noring unmatched features in the container Synthetic data test results were published
object (if concerned with containment). (Breitinger, Stivaktakis, & Roussev, 2013)
In addition to the above, these traits must in addition to real world data (Breitinger
be satisfied to be considered a valid ap- & Roussev, 2014). The complete results
proximate matching algorithm, according to are too complex to be presented in this
NIST: article but can be found in chapter 6 in
Breitinger (2014). In the following subsec-
Compression: actual storage of features is tions we briefly summarize how approximate
usually implemented as a one-way hash matching algorithms can be evaluated and
known as a similarity digest, signature, FRASH’s results. The main findings were:
or fingerprint); length is shorter than
the original feature/input itself. sdhash and mrsh-v2 outperform other
algorithms.
Similarity preservation: similar inputs
should result in similar digests. mrsh-v2 is faster and shows better com-
pression than sdhash.
Self-evaluation: authors should state the
confidence level for the circumstances/- sdhash obtains slightly better precision
parameters used to produce the match and recall rates than mrsh-v2.
score and what the scale is (e.g., 0 = no Therefore, the final decision for selecting an
features matched, 1 = all exact match). algorithm depends on the use case.
Time complexity/runtime efficiency: Efficiency. As with cryptographic hash
speed should be stated via theoretical functions, compression and runtime ef-
complexity in O-notation as well as the ficiency are important, but approximate
runtime speed; for bytewise algorithms matching algorithms involve additional con-
it is preferable to know the isolated cerns; several do not output fixed length di-
speeds of the feature extraction and gests. Thus, researchers usually report com-
similarity functions. pression ratio, cr = digest length
input length
.
The community distinguishes between the
3.4 Testing bytewise following for runtime:
approximate matching Generation efficiency: time needed to
Testing algorithms is an important task, so process an input and output the simi-
researchers set out to create a test envi- larity digest.

Comparison efficiency: summarizes the White noise resistance: is a probability-

theoretical complexity (in O-notation) driven test that introduces (uniform)
to compare digests against an existing random changes into a byte sequence
data set / database; again, often stated (via insertion, deletion, & substitution);
in units of time for implementations for a viable scenario is source code where a
bytewise approximate matching. developer renamed a variable.
Space efficiency: calculated by dividing Precision and recall on synthetic data.

digest length by input length. Precision can be thought of as a measure
of false positives (possibility of counting ob-
Sensitivity and robustness. Sensitivity jects as similar that in actuality are not)
refers to the granularity at which an algo- while recall refers to the false negatives
rithm can detect similarity, i.e., how minute (omitting objects that should be tallied as
the feature is. At some threshold making a similar). These attributes are an indication
feature too fine causes almost all objects to of an algorithm’s reliability.
appear common, however, and therefore the In order to quantify them, the initial step
algorithm designer must strike a balanced is to analyze synthetic data. First, ran-
sensitivity to optimize utility and time ef- dom byte sequences (the FRASH paper used
ficiency. Linux /dev/urandom) are generated. Next,
Robustness is a metric of how effective an mutations are created through methods like
algorithm can be in the midst of noise and those mentioned in the previous subsection.
plain transformations such as fragmentation Finally, the comparison is executed and the
and insertion/deletion into the target byte results are analyzed.
sequence. Precision and recall on real world data.
As outlined by FRASH, these attributes Testing on real world data is a bit more com-
are tested by creating manual mutations of plex because there is no definition for simi-
the target fragments/files: larity and no ground truth (publicly avail-
able data sets for testing that involves ex-
Alignment robustness: inserts blocks of planations of what the expected similarity
various sizes at the beginning of an in- is between different files). To define the
put; this should simulate scenarios like ground truth, the community developed ap-
growing log files or emails. proximate longest common substring (aLCS)
which estimates the longest common sub-
Fragment detection: identifies the small- strings of two files. According to this, two in-
est fragment of a byte sequence that puts are declared as similar, if their aLCS is
still matches by cutting it; this feature sufficient (e.g., 1 % of the total input length,
is important for network traffic analysis or at least 2 KiB).
(see Sec. 5.2) and hash-based file carv-
ing (see Sec. 5.1). 3.5 Security
Most approximate matching algorithms cur-
Single-common block correlation: rently implement few-to-zero security fea-
analyzes the minimum amount of tures to guard against active, real-time at-
correlation between two files, e.g., two tacks. One staple is inherently built into ap-
word documents that share an common proximate matching algorithms: hash func-
paragraph. tions are one-way, even for similarity, and

therefore prevent reverse engineering the the overall amount of comparisons. For the
original input sequence of a fragment / file. final decision, the ssdeep comparison func-
The few other tolerances exhibited by the al- tion was used. As a result, they reduced the
gorithms are stated in their individual sub- comparison time of 195,186 files against a
sections under Sec. 4. database containing 8,334,077 records from
However, we posit that for most uses of 442 h to 13 min (boosted by a factor of about
approximate matching, security features are 2000), a ‘practical speed’.
not essential. As pointed out by Baier (2015) However, this approach works only for
these algorithms are most likely to be used Base64 and hence for none of the other ap-
for blacklisting. Why would an active ad- proaches like sdhash or mrsh-v2. There-
versary want to create files that match a fore, Breitinger, Baier, and White (2014)
blacklist of static (not in transit) data? Re- presented a concept that could speed up the
searchers must find an answer to how easy process via Bloom filter-based approaches.
it is to avoid matching files. Maybe in the They suggested using one single huge Bloom
future we should classify security for approx- filter to store all feature hashes, which re-
imate matching algorithms by the minimum sults in a complexity of ∼ O(1). Their ap-
amount of changes that are necessary be- proach overcomes the drawback of compar-
tween two files in order to produce a non- ing digests against digests but loses preci-
match. A question that needs fresh explo- sion. That is, it allows for only yes or no
ration, though, is what practices criminals decisions: yes means there is a similar file in
can use to bypass certain (types of) algo- the set; no equates to none of the files being
rithms, use cases, and applications; a rigor- similar above the chosen threshold. It does
ous analysis of this has not been performed not allow for the returning of the matched
partially due to missing standards / ground file(s).
truth. Consequently, the authors presented an
enhancement which simply uses multiple
3.6 Extending existing large Bloom filters to generate a tree struc-
concepts ture that results in a complexity of O(log(n))
(Breitinger, Rathgeb, & Baier, 2014). But
One of the major challenges that comes with these are only assumptions – while there is
approximate matching is related to the near- a working prototype for the first approach,
est neighbor problem, i.e., how to identify the latter concept only exists in theory.
the similarity digests that are similar to a
given one. More precisely, let’s assume a 3.7 Distinction from
database containing n entries. Most algo- locality-sensitive hashing
rithms require an ‘against-all’ comparison
which equals a complexity of O(n). (LSH)
Winter, Schneider, and Yannikos (2013) It’s critical to note that sometimes people
presented an approach to diminish this com- confuse Locality-Sensitive Hashing (LSH)
plexity for ssdeep named F2S2. Gener- (e.g., Rajaraman and Ullman (2012)) with
ally speaking, instead of storing the com- approximate matching. Therefore, we in-
plete Base64 encoded similarity digest in the cluded this section. LSH is a general mecha-
database, they stored n-grams using hash- nism for nearest neighbor search and data
tables. In order to lookup single digests they clustering where the performance strongly
first looked for the n-grams which reduced relies on the used hashing method. Two pop-

ular algorithms are MinHash (Broder, 1997) speaking, it is a modified version of the spam
and SimHash2 (Charikar, 2002). detection algorithm from Tridgell (2002–
This does not necessarily coincide with the 2009) generalized to cope with any digital
idea of approximate matching. Specifically, object.
while LSH aims at mapping similar objects In CTPH the approach is to identify trig-
into the same bucket, approximate matching ger points to divide a given input into
outputs a similarity digest that is compara- chunks/blocks. This breakup is performed
ble. using a rolling hash that slides through the
We would like to note here that the follow- input, adds bytes to the current context
ing section mainly focuses on bytewise ap- (think of it as a buffer), creates a pseudo-
proximate matching. random value, and removes them from the
context after a set number of bytes are com-
4. INTRODUCTION pleted. The context is then used as a trigger
– whenever a specified sequence is created
TO ALGORITHMS the current context is hashed by the non-
As previously mentioned, approximate cryptographic FNV-hash function (Fowler,
matching started to gain attention in 2006 Noll, & Vo, 1994–2012). To create the sim-
with the concept of context triggered piece- ilarity digest, the FNV-chunk-hashes are re-
wise hashing and its first implementation, duced to 6 bits, converted into a Base64 char-
ssdeep (Kornblum, 2006). In the following acter and concatenated; this is done contin-
years, new algorithms were proposed and uously as the trigger outputs FNV hashes.
published. At the time of this article, ssdeep was
We will introduce the eight known approx- still an active project with version 2.13 and
imate matching algorithms. While the first is freely available online3 . Over the years,
three algorithms are still extended and rele- several extensions and performance improve-
vant, the last four algorithms are less promis- ments have been published that mostly
ing from a digital forensics perspective for focus on the efficiency of the implemen-
various reasons, e.g., precision and recall tation (Chen & Wang, 2008; Seo, Lim,
rates, runtime efficiency and detection capa- Choi, Chang, & Lee, 2009; Breitinger &
bilities. The last algorithm (TLSH) is more Baier, 2012b). However, a security analy-
related to LSH than approximate matching sis conducted by Baier and Breitinger (2011)
and is included for completeness. showed that CTPH cannot withstand an ac-
This section is a high-level summary of the tive attack.
current algorithms. Throughout each sub-
section references are cited for deeper read- 4.2 sdhash
ing. Similarity digest hashing was published
four years later by Roussev, Richard, and
4.1 ssdeep
Marziale (2008); Roussev (2010) and is also
CTPH is the technique used by ssdeep and still active. The SIF algorithm extracts sta-
was presented by Kornblum (2006). Roughly tistically improbable features that are de-
2
Note, SimHash is a common term and is used termined by Shannon entropy (not the ones
several times literature. Accordingly, it is also used with the highest / lowest entropy but the
twice in this article. Besides this section it is also
3
used in Sec. 4.6 where it describes an approach from http://ssdeep.sourceforge.net (last ac-
Sadowski and Levin (2007). cessed Feb 4th , 2016).

ones that seem unique, Roussev (2009)). In (data compression). Contrary to expecta-
sdhash, a feature is a byte sequence of 64 tion, its type (see Sec. 3.2) is not epony-
bytes that is then compressed by hashing it mous, but rather BBR. The main difference
with SHA-1. Finally, the author developed a is that this approach utilizes an external ref-
way to insert the hashes into a Bloom filter4 erence point – the building blocks.
(Bloom, 1970). A set of 16 building blocks (random byte
The original version was extended several sequences) is used to optimize representation
times, now supporting GPU usage for cal- of a given file. In order to find this represen-
culation and a block-based hashing mode tation the algorithm calculates the Hamming
(Roussev, 2012). The current version (3.4) distance, which is time consuming and slow
is available online5 . for practical usage (e.g., it takes about two
A comparison between ssdeep and minutes to process a 10 MB file) (Breitinger
sdhash showed that the latter algorithm out- & Baier, 2012a).
performs its predecessor (Roussev, 2011). In
addition, a security analysis showed that 4.5 mvHash-B
sdhash is much more robust and difficult to Majority vote hashing, another BBR type,
overcome (Breitinger & Baier, 2012c). was published by Åstebøl (2012); Breitinger,
Åstebøl, Baier, and Busch (2013). It trans-
4.3 mrsh-v2 forms any byte-sequence into long runs of
This algorithm was published by Breitinger 0x00s and 0xFFs by considering the neigh-
and Baier (2013) and is a combination boring bytes of a specific byte. If the neigh-
of ssdeep and sdhash6 . Like the afore- borhood consists of mainly 1s, the byte is
mentioned implementations, mrsh-v2 is still set to 0xFF, otherwise to 0x00. Next, these
supported7 . The algorithm uses the fea- runs are encoded by Run Length Encod-
ture identification procedure from ssdeep, ing (RLE). Although this proceeding is very
then hashes the feature using the non- fast, it requires a specific configuration for
cryptographic FNV (Fowler et al., 1994– each file type.
2012) and proceeds like sdhash, con-
sequently overcoming the weaknesses of 4.6 SimHash
ssdeep and becoming faster than sdhash. SimHash was presented by Sadowski and
The precision and recall rates are slightly Levin (2007) and embodies the notion of
worse than sdhash. counting the occurrences of certain prede-
fined binary strings called “Tags” within an
4.4 bbHash input. In their BBR implementation, the au-
Building block hashing is a completely dif- thors used 16 8-bit Tags, i.e., a possible Tag
ferent approach and is based on the concept could have been 00110101. Subsequently,
of eigenfaces (biometrics) and de-duplication the tool parses an input bit by bit, searching
4
A Bloom filter is a space efficient data structure
for each Tag. The total number of matches
to represent a set. Bloom filters will not be discussed is stored in a sum table. A hash key is com-
in this article but more details can be found online. puted as a function of the sum table entries
5
http://sdhash.org (last accessed Feb 4th , that form linear combinations. Lastly, all
2016). information (including file name, path, and
6
It was also inspired by multi-resolution similar-
ity hashing (Roussev, III, & Marziale, 2007).
size) is stored in a database.
7
http://www.fbreitinger.de (last accessed To identify similarities, a second tool
Feb 4th , 2016). named SimFash is used to query the

database. The hash keys are used as a first value in relation to the quartile points. The
filter to identify all possible matches. Next, distance between two digest headers is deter-
the sum tables are compared and a match mined by the difference in file lengths and
is found if the distance is within a specified quartile ratios. Meanwhile, the bodies are
tolerance. contrasted via their approximate Hamming
The authors clearly state that “two files distance. Summing these together produces
are similar if only a small percentage of their the TLSH similarity score.
raw bit patterns are different. ... [Thus,] the According to the authors, the precision
focus of SimHash has been on resemblance and recall rates are robust across a range of
detection” (Sadowski & Levin, 2007). file types. Additional experiments (Oliver,
Forman, and Cheng (2014)) showed that
4.7 saHash TLSH can detect strings which have been
Another SIF type, saHash uses Levenshtein manipulated with adversarial intentions9 .
distance to derive similarity between two TLSH is also effective in detecting embed-
byte sequences. The output is a lower bound ded objects depending on the level of object
for the Levenshtein distance between two in- manipulation. Despite these advantages, it
puts. Akin to SimHash (Sec. 4.6), saHash is less powerful than sdhash and mrsh-v2 for
allows for the detection of only near dupli- cross correlation.
cates (up to several hundred Levensthein op-
erations). 5. APPLICATIONS
A unique characteristic of this approach is
its definition of similarity. While all other Originally, approximate matching was de-
approaches output a number between 0 and signed to support the digital investigation
1 (not a percentage value), saHash actu- process via the use cases stated in Sec.
ally returns the lower bound of Levenshtein 3.1; search for target file(s)/fragments or re-
operations (Ziroff, 2012; Breitinger, Ziroff, duce the volume of data needing investiga-
Lange, & Baier, 2014) to convert one file into tion. Recently, tools such as EnCase, X-
another. Ways Forensics, and Forensic Toolkit (FTK)
have incorporated similar object detection
4.8 TLSH technologies (Breitinger, 2014). Researchers
TLSH belongs to the category of locality- have now identified additional working ar-
sensitive hashes, published by Oliver, Cheng, eas where these techniques or tools can
and Chen (2013), and is open source8 . It have practical impact, e.g., for file carving
processes an input byte sequence using a (see Sec. 5.1), data leakage prevention (see
sliding window to populate an array of Sec. 5.2 and 5.4) and Iris recognition (see
bucket counts, and determines the quartile Sec. 5.5).
points of the bucket counts. A fixed length 5.1 Automatic data reduction
digest is constructed which consists of two
parts: (i) a header based on the quartile
and hash-based file
points, the length of the input, and a check- carving
sum; (ii) a body consisting of a sequence of As sifting through data has become cum-
bit pairs, which depends on each bucket’s bersome, pre-processing schemes have risen.
8 9
https://github.com/trendmicro/tlsh (last Tolerance of manipulation was one of the design
accessed Feb 4th , 2016). considerations for TLSH.

Extracting data in bulk is arguably the that this method works robustly on random
most sought after application of approximate data (true positive rate 99.6 %, true nega-
matching. One perspective that should be tive rate 100.0 %) having a throughput of 650
fruitful is hash-based file carving. Mbit/s on a regular workstation.
This alienated area of work was presented Regardless, they faced several unsolved
by S. Garfinkel and McCarrin (2014). In problems for real world data. One obstacle
their paper, the authors combined tech- was that many files share the same structural
niques from file carving and approximate information (e.g., file header information;
matching to search on “media for complete this is equivalent to the non-probative blocks
files and file fragments with sector hash- problem from the previous subsection) which
ing and hashdb.” Instead of focusing on led to false positive rates of around 10−5 –
the complete file and comparing it against too high for network traffic analysis.
a database, the authors use individual data
blocks. They utilized a special database 5.3 Malware
named hashdb (Allen, 2015) to obtain high
throughput. Innately, similarity hashing is ideal for
The evaluation proved their strategy grouping things together, but it was not un-
works, although they had to solve the prob- til 2015 that it was rigorously tested when
lem of non-probative blocks that emerged applied to malware clustering (Li et al.,
“from common data structures in office doc- 2015). Faruki, Laxmi, Bharmal, Gaur, and
uments and multimedia files.” To filter out Ganmoor (2015) developed AndroSimilar, a
such artifacts, the authors presented several syntactical detection algorithm for Android
‘tests’ that alleviated the problem. Malware that falls into the SIF category.
While Zhou, Zhou, Jiang, and Ning (2012)’s
5.2 Network traffic analysis DroidMoSS, a CTPH algorithm, was devel-
Gupta (2013); Breitinger and Baggili (2014) oped to also detect mobile malware, a com-
demonstrated preliminary results when us- parison between the two could not be per-
ing approximate matching on network traf- formed due to unavailable code.
fic for data leakage prevention. The ques- Polymorphic malware families are on the
tion was (since approximate matching can rise, often hosted on servers that automati-
be used for fragment detection) whether net- cally alter inconsequential segments of a file
work packets could be matched back to their (before sending across a network) to bypass
original files. cryptographic detection tactics (Security,
Design was similar to its traditional coun- 2013). An intriguing paper by Payer et al.
terpart: create a database of known-object (2014) expands on ways criminals can cir-
signatures (most likely files) and identify cumvent the use of similarity-based match-
these objects, but instead of analyzing a ing for spotting malicious binary code, which
hard drive the researchers used a network often diversifies itself during recompilation.
stream (single packets). This work illus- Thus, it is imperative that malware receives
trated approximate matching’s utility in more attention.
data leakage prevention, a formerly un-
touched application. 5.4 Data leakage prevention
Beginning with modifying the original
mrsh-v2 algorithm to handle the small size With respect to data leakage prevention, ap-
of 1460 bytes per packet, the authors showed proximate matching may also be utilized for

printer data inspection, e.g., MyDLP10 and off between compression and biometric
Symantec (2010) Data Leakage Prevention. performance.”
If a document is protected, the software can
discard the print (implemented by MyDLP). “Efficient identification: a compact
A similar experiment was also run by our re- alignment-free representation of iris-
search group11 which created a virtual secure codes enables a computationally effi-
printer that analyzed a sent document before cient biometric identification reducing
forwarding it to an actual printer. the overall response time of the system.”
Note, according to Comodo Group Inc.
(2013), these software solutions often call
their technology partial document match- 6. LIMITATIONS AND
ing, unstructured data matching, intelli- CHALLENGES
gent content matching, or statistical docu-
ment matching, all synonyms for approxi- Bytewise approximate matching has some
mate matching. intrinsic limitations. First and foremost, it
cannot pick up similarity at a higher level
5.5 Biometrics of abstraction, such as semantically. For in-
Biometrics is another independent do- stance, it cannot meaningfully match two
main employing approximate matching with image files that have the same semantic pic-
promising results (Rathgeb, Breitinger, & ture but are different file types / formats
Busch, 2013; Rathgeb, Breitinger, Busch, & due to different binary encoding. However,
Baier, 2013). In their work, the authors placing it in tandem with other approaches
demonstrated the feasibility of using tech- will still help; Neuner, Mulazzani, Schrit-
niques from approximate matching for bio- twieser, and Weippl (2015) include it as a
metric template protection, data compres- critical component of the digital forensic pro-
sion and efficient identification. According cess. Doubly crucial when it comes to appli-
to Breitinger (2014), there are three im- cations like malware that employ databases
provements: is lookup time. Winter et al. (2013) out-
lined a faster approach to conduct similarity
“Template protection: the successive searching using a database but this method
mapping of parts of a binary biomet- will not work effectively for all approximate
ric template to Bloom filters represents matching algorithms. We will not belabor
an irreversible transformation achieving the point since this was discussed in Sec. 3.6.
alignment-free protected biometric tem- Evidently, the first challenge to confront
plates.” is awareness and adoption of approximate
matching, a possible indication that more re-
“Biometric data compression: the pro- search needs to be conducted to understand
posed Bloom filter-based transforma- the needs for the community. As we outlined
tion can be parameterized to obtain a in the introduction, 15 % of professionals are
desired template size, operating a trade- unaware of approximate matching – an un-
acceptable number. Conversely, 7 % criticize
10
https://www.mydlp.com (last accessed Feb 4th , that algorithms are too slow for practical
2016).
11
The project was done by Kyle Anthony, a mem- use. On the other hand, inquiry needs to
ber of the UNH Cyber Forensics Research & Educa- be made into why 35 % have used approxi-
tion Group mate matching only a few times. Our hope

is that having the NIST SP 800-168 defini- together a test corpus (its usefulness was
tion, a technical classification of algorithms demonstrated in Rowe (2012)).
(Martı́nez et al., 2014), and this more univer-
sal outline will improve awareness and adop- 7. FUTURE FOCUS
tion.
Yet another major obstacle is the lack of Future research should, in accompaniment
a standard definition of similarity (Baier, to prior comments, pursue areas that branch
2015). As addressed by S. Garfinkel and Mc- out from digital forensics, even if completely
Carrin (2014), not all kinds of byte level sim- detached. Below we name some possible ap-
ilarity are equally valuable as there are some plications that could use enhancement, and
artifacts (e.g., structural information, head- areas that approximate matching may be
ers, footers, etc.) that are less important able to be enhanced by (this is not an ex-
or lead to false positives. Hence, we need haustive list):
a filtering mechanism to prioritize matches.
One possibility could be to extract the main Bioinformatics: This field already uses
elements (like text or images) and compare exact matching methods, albeit us-
those, meaning including a pre-processing ing bytewise or semantic approximate
step before the comparison. matching alongside today’s methods
could conceivably increase efficiency.
Aside from initial efforts to test approx-
imate matching algorithms, there are cur- Text mining: Identifying patterns in
rently no accepted standards and reliable structured data to gain high-quality, se-
testing frameworks. FRASH is not easy mantic value.
to implement, a deterrent to practitioners.
This is one reason this paper avoids giving Templates and layouts: Semantically
absolute comparisons between all the (types identify document layout and separate
of) algorithms. More bothersome is the lack content from template automatically.
of an accepted ground truth for real world
Deep / machine learning: Automated
data that would support implementation as-
forms of learning need to be able to
sessments like whether algorithms scale ef-
process information fast, store it effi-
fectively. The ground truth should embody
ciently space-wise, and be able to differ-
the four use cases (see Sec. 3.1) and algo-
entiate similarities and differences; se-
rithm types, even if different data sets are
mantic hashing is a known aspect of
required for each one. Once this is done,
deep learning and ergo might be able
the algorithms will be directly comparable
to strengthen approximate matching.
(e.g., embedded object detection: A better
than B better than C; speed: B faster than Source code governance: Manage
A faster than C). We posit that it is criti- shared code better, especially for open
cal for practitioner efficiency to know which source software.
algorithms solve which potential problems.
At the moment the National Software Ref- Spam filtering and anti-plagiarism:
erence Library (NSRL12 ) has built the most These have already been looked at but
prominent software database and is piecing might be behooved by deeper scrutiny.
12
http://www.nsrl.nist.gov (last accessed Feb Ultimately, approximate matching is an
th
4 , 2016). alienated domain and its increased adoption

will strongly support other domains. Re- Communications of the ACM , 13 ,

searchers looking to improve on the slow- 422–426.
ness of indexing and searching may also ben- Breitinger, F. (2014). On the utility of
efit to look into other domains such as com- bytewise approximate matching in
pression and programming. Once more peo- computer science with a special focus
ple are aware of approximate matching we on digital forensics investigations
might identify more fields where the tech- (Doctoral dissertation, Technical
nology would be relevant. University Darmstadt). Retrieved
from http://tuprints.ulb.tu
8. -darmstadt.de/4055/
Breitinger, F., Åstebøl, K. P., Baier, H., &
ACKNOWLEDGMENTS Busch, C. (2013, March). mvhash-b -
We would like to thank Michael McCarrin a new approach for similarity
for reviewing and editing some paragraphs, preserving hashing. In It security
especially those that are related to his re- incident management and it forensics
search. Additionally, we would like to thank (imf ), 2013 seventh international
Jonathan Olivier for providing us with the conference on (p. 33-44). doi:
summary for TLSH. 10.1109/IMF.2013.18
Breitinger, F., & Baggili, I. (2014,
September). File detection on network
REFERENCES traffic using approximate matching.
Allen, B. (2015). Hashdb. Journal of Digital Forensics, Security
https://github.com/simsong/hashdb. and Law (JDFSL), 9 (2), 23–36.
Åstebøl, K. P. (2012). mvhash - a new Retrieved from
approach for fuzzy hashing http://ojs.jdfsl.org/index.php/
(Unpublished master’s thesis). Gjøvik jdfsl/article/view/261
University College. Breitinger, F., & Baier, H. (2012a, May). A
Baier, H. (2015). Towards automated Fuzzy Hashing Approach based on
preprocessing of bulk data in digital Random Sequences and Hamming
forensic investigations using hash Distance. In 7th annual Conference
functions. it-Information Technology, on Digital Forensics, Security and
57 (6), 347–356. Law (ADFSL) (pp. 89–100).
Baier, H., & Breitinger, F. (2011, May). Breitinger, F., & Baier, H. (2012b).
Security Aspects of Piecewise Hashing Performance issues about
in Computer Forensics. IT Security context-triggered piecewise hashing.
Incident Management & IT Forensics In P. Gladyshev & M. Rogers (Eds.),
(IMF), 21–36. doi: Digital forensics and cyber crime
10.1109/IMF.2011.16 (Vol. 88, p. 141-155). Springer Berlin
Bertoni, G., Daemen, J., Peeters, M., & Heidelberg. Retrieved from
Assche, G. V. (2008, October). http://dx.doi.org/10.1007/
Keccak specifications (Tech. Rep.). 978-3-642-35515-8 12 doi:
STMicroelectronics and NXP 10.1007/978-3-642-35515-8 12
Semiconductors. Breitinger, F., & Baier, H. (2012c).
Bloom, B. H. (1970). Space/time trade-offs Properties of a similarity preserving
in hash coding with allowable errors. hash function and their realization in

sdhash. In Information security for Automated evaluation of approximate

south africa (issa), 2012 (p. 1-8). doi: matching algorithms on real data.
10.1109/ISSA.2012.6320445 Digital Investigation, 11, Supplement
Breitinger, F., & Baier, H. (2013). 1 (0), S10 - S17. Retrieved from
Similarity preserving hashing: http://www.sciencedirect.com/
Eligible properties and a new science/article/pii/
algorithm mrsh-v2. In M. Rogers & S1742287614000073 (Proceedings of
K. Seigfried-Spellar (Eds.), Digital the First Annual DFRWS Europe)
forensics and cyber crime (Vol. 114, doi: http://dx.doi.org/10.1016/
pp. 167–182). Springer Berlin j.diin.2014.03.002
Heidelberg. Retrieved from Breitinger, F., Stivaktakis, G., & Baier, H.
http://dx.doi.org/10.1007/ (2013, August). FRASH: A
978-3-642-39891-9 11 doi: framework to test algorithms of
10.1007/978-3-642-39891-9 11 similarity hashing. In 13th Digital
Breitinger, F., Baier, H., & White, D. Forensics Research Conference
(2014, May). On the database lookup (DFRWS’13). Monterey.
problem of approximate matching. Breitinger, F., Stivaktakis, G., & Roussev,
Digital Investigation, 11, Supplement V. (2013, Sept). Evaluating detection
1 (0), S1 - S9. Retrieved from error trade-offs for bytewise
http://www.sciencedirect.com/ approximate matching algorithms.
science/article/pii/ 5th ICST Conference on Digital
S1742287614000061 (Proceedings of Forensics & Cyber Crime (ICDF2C).
the First Annual DFRWS Europe) Breitinger, F., Ziroff, G., Lange, S., &
doi: http://dx.doi.org/10.1016/ Baier, H. (2014, January). sahash:
j.diin.2014.03.001 Similarity hashing based on
Breitinger, F., Guttman, B., McCarrin, M., levenshtein distance. In Tenth annual
Roussev, V., & White, D. (2014, ifip wg 11.9 international conference
May). Approximate matching: on digital forensics (ifip wg11.9).
Definition and terminology (Special Broder, A. Z. (1997). On the resemblance
Publication 800-168). National and containment of documents. In
Institute of Standards and Compression and complexity of
Technologies. Retrieved from sequences (sequences’97) (pp. 21–29).
http://dx.doi.org/10.6028/ IEEE Computer Society.
NIST.SP.800-168 Broder, A. Z., Charikar, M., Frieze, A. M.,
Breitinger, F., Rathgeb, C., & Baier, H. & Mitzenmacher, M. (1998).
(2014, September). An efficient Min-wise Independent Permutations.
similarity digests database lookup - A Journal of Computer and System
logarithmic divide & conquer Sciences, 60 , 327-336.
approach. Journal of Digital Charikar, M. S. (2002). Similarity
Forensics, Security and Law estimation techniques from rounding
(JDFSL), 9 (2), 155–166. Retrieved algorithms. In Proceedings of the
from thiry-fourth annual acm symposium
http://ojs.jdfsl.org/index.php/ on theory of computing (pp. 380–388).
jdfsl/article/view/276
Breitinger, F., & Roussev, V. (2014, May). Chen, L., & Wang, G. (2008). An efficient

piecewise hashing method for tech.groups.yahoo.com/group/

computer forensics. In Knowledge linux forensics/message/3063
discovery and data mining, 2008. Gupta, V. (2013). File fragment detection
wkdd 2008. first international on network traffic using similarity
workshop on (pp. 635–638). doi: hashing (Unpublished master’s
10.1109/WKDD.2008.80 thesis). Denmark Technical
Collange, S., Dandass, Y. S., Daumas, M., University.
& Defour, D. (2009). Using graphics Harichandran, V. S., Breitinger, F., Baggili,
processors for parallelizing hash-based I., & Marrington, A. (2016). A cyber
data carving. In System sciences, forensics needs analysis survey:
2009. hicss’09. 42nd hawaii Revisiting the domain’s needs a
international conference on (pp. decade later. Computers & Security,
1–10). 57 , 1–13.
Comodo Group Inc. (2013). PDM (Partial Jaccard, P. (1901). Distribution de la flore
Document Matching). https:// alpine dans le bassin des drouces et
www.mydlp.com/partial-document dans quelques regions voisines.
-matching-how-it-works/. Bulletin de la Société Vaudoise des
Faruki, P., Laxmi, V., Bharmal, A., Gaur, Sciences Naturelles, 37 , 241–272.
M., & Ganmoor, V. (2015). Jaccard, P. (1912, February). The
Androsimilar: Robust signature for distribution of the flora in the alpine
detecting variants of android zone. New Phytologist, 11 , 37–50.
malware. Journal of Information Key, S. (2013). File block hash map
Security and Applications, 22 , 66–80. analysis. Retrieved from
FIPS, P. (1995). 180-1. secure hash https://www.guidancesoftware
standard. National Institute of .com/appcentral/
Standards and Technology, 17 , 45. Kornblum, J. (2006, September).
Fowler, G., Noll, L. C., & Vo, P. Identifying almost identical files using
(1994–2012). Fnv hash. context triggered piecewise hashing.
http://www.isthe.com/chongo/ Digital Investigation, 3 , 91–97.
tech/comp/fnv/index.html. Retrieved from http://dx.doi.org/
Garfinkel, S., & McCarrin, M. (2014). 10.1016/j.diin.2006.06.015 doi:
Hash-based carving: Searching media 10.1016/j.diin.2006.06.015
for complete files and file fragments Leskovec, J., Rajaraman, A., & Ullman,
with sector hashing and hashdb. J. D. (2014). Mining of massive
digital investigation, 12 . datasets. Cambridge University Press.
Garfinkel, S., Nelson, A., White, D., &
Roussev, V. (2010). Using Li, Y., Sundaramurthy, S. C., Bardas,
purpose-built functions and block A. G., Ou, X., Caragea, D., Hu, X., &
hashes to enable small block and Jang, J. (2015). Experimental study
sub-file forensics. digital investigation, of fuzzy hashing in malware clustering
7 , S13–S23. analysis. In 8th workshop on cyber
Garfinkel, S. L. (2009, March). Announcing security experimentation and test
frag find: finding file fragments in (cset 15).
disk images using sector hashing. Manber, U. (1994). Finding similar files in
Retrieved from http:// a large file system. In Proceedings of

the usenix winter 1994 technical cancelable iris biometric templates

conference on usenix winter 1994 based on adaptive bloom filters. In
technical conference (pp. 1–10). Biometrics (icb), international
Martı́nez, V. G., Álvarez, F. H., & Encinas, conference on (p. 1-8). doi:
L. H. (2014). State of the art in 10.1109/ICB.2013.6612976
similarity preserving hashing Rathgeb, C., Breitinger, F., Busch, C., &
functions. Baier, H. (2013). On the application
worldcomp-proceedings.com. of bloom filters to iris biometrics.
Martı́nez, V. G., Álvarez, F. H., Encinas, IET Biometrics.
L. H., & Ávila, C. S. (2015). A new Ren, L., & Cheng, R. (2015, August).
edit distance for fuzzy hashing Bytewise approximate matching,
applications. In Proceedings of the searching and clustering. DFRWS
international conference on security USA.
and management (sam) (p. 326). Rivest, R. (1992). The MD5
Neuner, S., Mulazzani, M., Schrittwieser, Message-Digest Algorithm.
S., & Weippl, E. (2015). Gradually Roussev, V. (2009). Building a better
improving the forensic process. In similarity trap with statistically
Availability, reliability and security improbable features. In System
(ares), 2015 10th international sciences, 2009. hicss ’09. 42nd hawaii
conference on (pp. 404–410). international conference on (pp.
Oliver, J., Cheng, C., & Chen, Y. (2013). 1–10). doi: 10.1109/HICSS.2009.97
Tlsh–a locality sensitive hash. In 4th Roussev, V. (2010). Data fingerprinting
cybercrime and trustworthy computing with similarity digests. In K.-P. Chow
workshop (ctc) (pp. 7–13). & S. Shenoi (Eds.), Advances in
Oliver, J., Forman, S., & Cheng, C. (2014). digital forensics vi (Vol. 337, pp.
Using randomization to attack 207–226). Springer Berlin Heidelberg.
similarity digests. In Applications and Retrieved from http://dx.doi.org/
techniques in information security 10.1007/978-3-642-15506-2 15
(pp. 199–210). Springer. doi: 10.1007/978-3-642-15506-2\ 15
Payer, M., Crane, S., Larsen, P., Roussev, V. (2011, August). An evaluation
Brunthaler, S., Wartell, R., & Franz, of forensic similarity hashes. Digital
M. (2014). Similarity-based matching Investigation, 8 , 34–41. Retrieved
meets malware diversity. arXiv from http://dx.doi.org/10.1016/
preprint arXiv:1409.7760 . j.diin.2011.05.005 doi:
Rabin, M. O. (1981). Fingerprinting by 10.1016/j.diin.2011.05.005
random polynomials (Tech. Rep. No. Roussev, V. (2012). Managing
TR1581). Cambridge, Massachusetts: terabyte-scale investigations with
Center for Research in Computing similarity digests. In G. Peterson &
Technology, Harvard University. S. Shenoi (Eds.), Advances in digital
Rajaraman, A., & Ullman, J. D. (2012). forensics viii (Vol. 383, pp. 19–34).
Mining of massive datasets. Springer Berlin Heidelberg. Retrieved
Cambridge: Cambridge University from http://dx.doi.org/10.1007/
Press. 978-3-642-33962-2 2 doi:
Rathgeb, C., Breitinger, F., & Busch, C. 10.1007/978-3-642-33962-2 2
(2013, June). Alignment-free Roussev, V., III, G. G. R., & Marziale, L.

(2007, September). Multi-resolution (2013). F2s2: Fast forensic similarity

similarity hashing. Digital search through indexing piecewise
Investigation, 4 , 105–113. doi: hashsignatures. Digital Investigation,
10.1016/j.diin.2007.06.011 10 (4), 361–371. doi: http://
Roussev, V., Richard, I., Golden, & dx.doi.org/10.1016/j.diin.2013.08.003
Marziale, L. (2008). Class-aware
similarity hashing for data Zhou, W., Zhou, Y., Jiang, X., & Ning, P.
classification. In I. Ray & S. Shenoi (2012). Detecting repackaged
(Eds.), Advances in digital forensics iv smartphone applications in
(Vol. 285, pp. 101–113). Springer US. third-party android marketplaces. In
Retrieved from http://dx.doi.org/ Proceedings of the second acm
10.1007/978-0-387-84927-0 9 doi: conference on data and application
10.1007/978-0-387-84927-0 9 security and privacy (pp. 317–326).
Rowe, N. C. (2012). Testing the national Ziroff, G. (2012, February). Approaches to
software reference library. Digital similarity-preserving hashing,
Investigation, 9 , S131–S138. Bachelor’s thesis, Hochschule
Sadowski, C., & Levin, G. (2007, Darmstadt.
December). Simhash: Hash-based
similarity detection.
http://simhash.googlecode.com/svn/
trunk/paper/SimHashWithBib.pdf.
Security, H. N. (2013). The rise of
sophisticated malware. Retrieved from
http://www.net-security.org/
malware news.php?id=2543
Seo, K., Lim, K., Choi, J., Chang, K., &
Lee, S. (2009). Detecting similar files
based on hash and statistical analysis
for digital forensic investigation. In
Computer science and its
applications, 2009. csa ’09. 2nd
international conference on (p. 1-6).
doi: 10.1109/CSA.2009.5404198
Symantec. (2010). Machine learning sets
new standard for data loss prevention:
Describe, fingerprint, learn (Tech.
Rep.). Symantec Corporation.
Tridgell, A. (2002–2009). spamsum.
http://www.samba.org/ftp/
unpacked/junkcode/spamsum/.
Retrieved from
http://samba.org/ftp/unpacked/
junkcode/spamsum/README (last
accessed Feb 4th , 2016)
Winter, C., Schneider, M., & Yannikos, Y.


Đề tài 9 PDF

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Đề tài 9 PDF

Hochgeladen von

Copyright:

Verfügbare Formate

Journal of Digital Forensics,

Security and Law

Bytewise Approximate Matching: The Good, The

Follow this and additional works at: http://commons.erau.edu/jdfsl

BYTEWISE APPROXIMATE MATCHING:

Keywords: Approximate matching, Fuzzy hashing, Similarity hashing, Bytewise, Survey,

1. INTRODUCTION an increase in processing power. Speaking in

© 2016 ADFSL Page 59

Page 60 © 2016 ADFSL

© 2016 ADFSL Page 61

Page 62 © 2016 ADFSL

 Embedded object detection identifies an the similarity of the content of a JPG

© 2016 ADFSL Page 63

Page 64 © 2016 ADFSL

Comparison efficiency: summarizes the White noise resistance: is a probability-

Space efficiency: calculated by dividing Precision and recall on synthetic data.

© 2016 ADFSL Page 65

Page 66 © 2016 ADFSL

© 2016 ADFSL Page 67

Page 68 © 2016 ADFSL

© 2016 ADFSL Page 69

Page 70 © 2016 ADFSL

© 2016 ADFSL Page 71

Page 72 © 2016 ADFSL

will strongly support other domains. Re- Communications of the ACM , 13 ,

© 2016 ADFSL Page 73

sdhash. In Information security for Automated evaluation of approximate

Page 74 © 2016 ADFSL

piecewise hashing method for tech.groups.yahoo.com/group/

© 2016 ADFSL Page 75

the usenix winter 1994 technical cancelable iris biometric templates

Page 76 © 2016 ADFSL

(2007, September). Multi-resolution (2013). F2s2: Fast forensic similarity

© 2016 ADFSL Page 77

Page 78 © 2016 ADFSL

Das könnte Ihnen auch gefallen

Embedded object detection identifies an the similarity of the content of a JPG