2016
Frank Breitinger
University of New Haven
Ibrahim Baggili
University of New Haven
Recommended Citation
Harichandran, Vikram S.; Breitinger, Frank; and Baggili, Ibrahim (2016) "Bytewise Approximate Matching: The Good, The Bad, and The Unknown," Journal of Digital Forensics, Security and Law: Vol. 11, No. 2, Article 4.
DOI: https://doi.org/10.15394/jdfsl.2016.1379
Available at: http://commons.erau.edu/jdfsl/vol11/iss2/4
This Article is brought to you for free and open access by the Journals at
Scholarly Commons. It has been accepted for inclusion in Journal of Digital
Forensics, Security and Law by an authorized administrator of Scholarly
Commons. For more information, please contact commons@erau.edu.
(c)ADFSL
Bytewise approximate matching: the good, the bad ... JDFSL V11N2
ABSTRACT
Hash functions are established and well-known in digital forensics, where they are commonly
used for proving integrity and file identification (i.e., hash all files on a seized device and
compare the fingerprints against a reference database). However, with respect to the latter
operation, an active adversary can easily overcome this approach because traditional hashes
are designed to be sensitive to altering an input; output will significantly change if a single
bit is flipped. Therefore, researchers developed approximate matching, which is a rather new,
less prominent area but was conceived as a more robust counterpart to traditional hashing.
Since the conception of approximate matching, the community has constructed numerous
algorithms, extensions, and additional applications for this technology, and is still working
on novel concepts to improve the status quo. In this survey article, we conduct a high-level
review of the existing literature from a non-technical perspective and summarize the existing
body of knowledge in approximate matching, with special focus on bytewise algorithms. Our
contribution allows researchers and practitioners to receive an overview of the state of the
art of approximate matching so that they may understand the capabilities and challenges of
the field. Simply, we present the terminology, use cases, classification, requirements, testing
methods, algorithms, applications, and a list of primary and secondary literature.
one may shift it to the cloud. In short, practitioners need tools and techniques that are capable of automatically handling large amounts of data, since time in investigations is of the essence.

A common forensic process to support practitioners is known file filtering, which aims at reducing the amount of data an investigator has to manually examine. The process is quite simple: (1) compute the hashes for all files on a target device, and (2) compare the hashes to a reference database. Based on the signatures in the database, files are whitelisted (filtered out / known-good files, e.g., files of the operating system) or blacklisted (filtered in / known-bad files, e.g., known illicit content). This straightforward procedure is commonly implemented using cryptographic hash functions like MD5 (Rivest, 1992) or an algorithm from the SHA family (FIPS, 1995; Bertoni, Daemen, Peeters, & Assche, 2008).

While cryptographic hashes are well-established and tested, they have one downside: they can only identify bitwise identical objects. This means changing a single bit of the input will result in a totally different hash value. Subsequently, the community worked on a counterpart for (cryptographic) hashing algorithms that allows similarity identification: approximate matching. Although this is a practically useful concept, a recent survey by Harichandran, Breitinger, Baggili, and Marrington (2016) with 99 participants showed that only 12% of the forensic experts polled use this technology on a regular basis. Detailed results are provided in Table 1.

Table 1. Answers to the survey question: Have you ever used approximate matching/similarity hashing algorithms?

    Answer                                      in %
    Yes, I use them on a regular basis.        12.50
    Yes, a few times.                          34.38
    No, they are too slow for practical use.    7.29
    No, they are unnecessary for my purposes.  31.25
    No, I am unaware of what it is.            14.58

Contribution. In this paper we aim to address the almost 15% that have never heard of approximate matching by providing them with a comprehensive literature survey, and the 31% (unnecessary for my purposes) by illustrating a multitude of applications for approximate matching. Accordingly, we address the following key points:

- Terminology, use cases, classification, requirements, and testing.
- High-level description of existing algorithms including strengths and weaknesses.
- Secondary literature that enhances / assesses existing approaches.
- New applications that employ approximate matching, e.g., file carving and data leakage prevention.
- Current limitations and challenges, and possible future trends.

Since it is low-level (directly concerned with the structure of everything digital), and may be the most impactful / most widely implemented type due to its usage for automation, we focus on bytewise approximate matching.

Differentiation from previous work. When writing this article, there were three articles similar to this survey. The first was SP 800-168 from the National Institute of Standards and Technology (NIST; Breitinger, Guttman, McCarrin, Roussev, and White (2014)). While this article provides an overview of the terminology, use cases, and testing, it does not include any algorithm concepts, applications, or critical discussion. Moreover, a reader is not provided with a long list of references. Secondly, Martínez, Álvarez, and Encinas (2014) is a purely technical paper focused on the full details of the algorithms and their implementations. Thirdly, the dissertation by Breitinger (2014) contains almost all of these topics but is extremely lengthy. Ergo, our intention during writing was to make this publication the primary source for researchers / practitioners to grasp a cursory bird's-eye view of bytewise approximate matching. We summarized the most important elements of these works in a condensed and direct manner to increase awareness of approximate matching. Extra texts are also shared for each algorithm in Sec. 4.

Structure. The remainder of this paper is organized as follows: Sec. 2 provides the historical background of approximate matching. Concepts are outlined in Sec. 3, including use cases, types, requirements (this subsection describes the core principles of algorithm design), and testing. Then, after traversing 8 of the most popular algorithms in Sec. 4, we mention newly explored prospects in Sec. 5. Limitations and challenges (Sec. 6) precede a brief listing of future areas of research in Sec. 7.

2. HISTORY

Many of the approximate matching algorithms designed to solve modern-day problems in digital forensics rely fundamentally on the ability to represent objects as sets of features, thereby reducing the similarity problem to the well-defined domain of set operations (Leskovec, Rajaraman, & Ullman, 2014). This approach has roots in the work of the early 20th century Swiss biologist Paul Jaccard, who suggested expressing the similarity between two finite sets as the ratio of the size of their intersection over the size of their union (Jaccard, 1901, 1912): if A and B are sets, then the Jaccard index J is defined as J(A, B) = |A ∩ B| / |A ∪ B|. It has been widely adopted as a method for quantifying similarity and is still used, mainly within computational linguistics, for plagiarism detection.

Nearly a century later, Broder (1997) proposed using the Jaccard index as part of his algorithm for identifying similar documents. Broder suggested a distinction between two commonly used types of similarity: 'roughly the same' (resemblance) and 'roughly contained inside' (containment). While he recommended using the Jaccard index for resemblance, he introduced a variation to approximate containment which "indicates that A is roughly contained within B": c(A, B) = |A ∩ B| / |A|. Additionally, Broder described the MinHash algorithm, an efficient method for estimating these similarities (Broder, Charikar, Frieze, & Mitzenmacher, 1998).

On the other hand, Manber (1994) presented sif, an implementation used to correlate text files. "Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise." Due to the complexity of comparing strings directly, he utilized Rabin fingerprinting to hash and compare substrings (Rabin, 1981).

A first step towards approximate matching as we use it today was dcfldd (http://dcfldd.sourceforge.net, last accessed Feb 4th, 2016) by Harbour in 2002, which was an extension of the well-known disk dump tool dd. His tool divided the input into chunks of fixed length and hashed each chunk. Therefore, Harbour's approach is also called block-based hashing. While this approach works perfectly for flipped bits, it has a limited capacity to detect similarity in strings where
the deletion or insertion of bits creates a shift that changes the hashes of all blocks that follow. Theoretically, the shift of even a single bit at the beginning of a file could cause nearly identical objects to appear to have nothing in common, much like the naive (traditional) file hashing approach.

Although this weakness can present a problem for file-to-file comparison, it may be acceptable in some scenarios. For example, if the goal is to determine which parts of a disk image might have been changed during a cyber attack, Harbour's technique remains useful. Likewise, in the case of an analyst scanning for blacklisted material across a drive or a collection of drives, the loss of a few block matches may be a worthwhile trade-off for gains in speed and simplicity, particularly because a single block is often sufficient evidence to demonstrate the presence of an artifact, or at least to warrant closer inspection.

Several efforts have been made to further leverage this technique for detecting similar material by matching fixed-length file fragments. Collange, Dandass, Daumas, and Defour (2009) coined the term "hash-based carving" to describe this method of scanning for blacklisted material, since it can be used to extract content without aid from the file system, provided the targets are known beforehand. Key (2013)'s File Block Hash Map Analysis (FBHMA) EnScript and Simson Garfinkel's tool frag_find (S. L. Garfinkel, 2009) provided practical implementations that automated the process for forensic examiners, though searches were limited to a few files at a time. S. Garfinkel, Nelson, White, and Roussev (2010) described the implementation and evaluation of frag_find in detail, noting a particular difficulty in storing and searching billions of hashes at practical speeds. S. Garfinkel and McCarrin (2014) later succeeded in scanning a drive image for matches of 4096-byte blocks across a set of nearly one million blacklisted files stored in the custom-built database, hashdb.

3. CONCEPTS

While approximate matching (a.k.a. fuzzy hashing or similarity hashing) started to gain popularity in the field of digital forensics in 2006, it was not until 2014 that the National Institute of Standards and Technology (NIST) developed standard definitions, publishing Approximate Matching: Definition and Terminology (NIST SP 800-168, Breitinger, Guttman, et al. (2014)). The subsections below briefly summarize this work's principles.

The 'Purpose and Scope' section of the NIST document defines approximate matching as follows: "Approximate matching is a promising technology designed to identify similarities between two digital artifacts. It is used to find objects that resemble each other or to find objects that are contained in another object."

3.1 Use cases

In investigative cases, approximate matching is used to filter known-good or known-bad files against a reference data set of approximate matching hashes, either on static data or on data in transit over a network. The primary use cases for approximate matching are presented below:

Similarity detection correlates related documents, e.g., different versions of a Word document.

Cross correlation correlates documents that share a common object, e.g., a DOC and a PPT document including the same image.
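The alignment fragility of Harbour's fixed-length block hashing (Sec. 2) is easy to demonstrate. The following sketch is our own illustration, not code from dcfldd or hashdb; the block size, seed, and function names are arbitrary choices:

```python
import hashlib
import random

def block_hashes(data: bytes, block_size: int = 512) -> set[str]:
    # Hash every fixed-length block of the input (block-based hashing).
    return {hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

original = random.Random(42).randbytes(4096)   # 8 blocks of 512 bytes

flipped = bytearray(original)
flipped[0] ^= 0x01                             # flip one bit inside the first block
shifted = b"\x00" + original                   # insert a single byte at the front

# A flipped bit invalidates only the block containing it: 7 of 8 hashes survive.
print(len(block_hashes(original) & block_hashes(bytes(flipped))))  # 7

# An inserted byte shifts every block boundary: no hashes survive.
print(len(block_hashes(original) & block_hashes(shifted)))         # 0
```

A single leading byte therefore defeats block matching entirely, which is exactly why later algorithms let the content itself choose the block boundaries.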
comparison. Then, a similarity function performs the comparison between these compressed versions to output a normalized match score. This comparison usually involves string metrics such as Hamming distance and Levenshtein distance; Martínez, Álvarez, Encinas, and Ávila (2015) and Li et al. (2015) have proposed new algorithms for specific uses. Normalized scores may be calculated by weighing the number of matching features against the total number of features for both objects (for resemblance), or by ignoring unmatched features in the container object (if concerned with containment).

In addition to the above, these traits must be satisfied to be considered a valid approximate matching algorithm, according to NIST:

Compression: actual storage of features is usually implemented as a one-way hash (known as a similarity digest, signature, or fingerprint); its length is shorter than the original feature/input itself.

Similarity preservation: similar inputs should result in similar digests.

Self-evaluation: authors should state the confidence level for the circumstances/parameters used to produce the match score and what the scale is (e.g., 0 = no features matched, 1 = all features matched exactly).

Time complexity/runtime efficiency: speed should be stated via theoretical complexity in O-notation as well as the runtime speed; for bytewise algorithms it is preferable to know the isolated speeds of the feature extraction and similarity functions.

3.4 Testing bytewise approximate matching

Testing algorithms is an important task, so researchers set out to create a test environment for bytewise approximate matching. The first step was taken by Breitinger, Stivaktakis, and Baier (2013), called FRamework to test Algorithms of Similarity Hashing (FRASH). It tested efficiency, sensitivity and robustness, and precision and recall. This last category can be divided further into synthetic data vs. real-world data. While synthetic data provides a perfect ground truth (further described below), it does not coincide with the real world, and vice versa.

Synthetic data test results were published in Breitinger, Stivaktakis, and Roussev (2013), in addition to real-world data (Breitinger & Roussev, 2014). The complete results are too complex to be presented in this article but can be found in chapter 6 of Breitinger (2014). In the following subsections we briefly summarize how approximate matching algorithms can be evaluated, along with FRASH's results. The main findings were:

- sdhash and mrsh-v2 outperform other algorithms.
- mrsh-v2 is faster and shows better compression than sdhash.
- sdhash obtains slightly better precision and recall rates than mrsh-v2.

Therefore, the final decision for selecting an algorithm depends on the use case.

Efficiency. As with cryptographic hash functions, compression and runtime efficiency are important, but approximate matching algorithms involve additional concerns; several do not output fixed-length digests. Thus, researchers usually report the compression ratio, cr = digest length / input length.

The community distinguishes between the following for runtime:

Generation efficiency: time needed to process an input and output the similarity digest.
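The normalized match scores introduced at the beginning of this section (resemblance and containment) map directly onto the set formulas from Sec. 2. A minimal sketch over feature sets follows; the feature names are placeholders, not the output of any real algorithm:

```python
def resemblance(a: set[str], b: set[str]) -> float:
    # Jaccard index: matching features over the total features of both objects.
    return len(a & b) / len(a | b)

def containment(a: set[str], b: set[str]) -> float:
    # Broder's c(A, B) = |A ∩ B| / |A|: unmatched features of the container b are ignored.
    return len(a & b) / len(a)

doc      = {"f1", "f2", "f3", "f4"}   # features of a full document
fragment = {"f2", "f3"}               # features of a fragment of that document

print(resemblance(doc, fragment))     # 0.5 -> only half of the combined features match
print(containment(fragment, doc))     # 1.0 -> the fragment is fully contained
```

The asymmetry is the point: a small fragment scores poorly on resemblance but perfectly on containment, which is why the two use cases are kept separate.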
therefore prevent reverse engineering the original input sequence of a fragment / file. The few other tolerances exhibited by the algorithms are stated in their individual subsections under Sec. 4.

However, we posit that for most uses of approximate matching, security features are not essential. As pointed out by Baier (2015), these algorithms are most likely to be used for blacklisting. Why would an active adversary want to create files that match a blacklist of static (not in transit) data? Researchers must find an answer to how easy it is to avoid matching files. Maybe in the future we should classify security for approximate matching algorithms by the minimum amount of changes that are necessary between two files in order to produce a non-match. A question that needs fresh exploration, though, is what practices criminals can use to bypass certain (types of) algorithms, use cases, and applications; a rigorous analysis of this has not been performed, partially due to missing standards / ground truth.

3.6 Extending existing concepts

One of the major challenges that comes with approximate matching is related to the nearest neighbor problem, i.e., how to identify the similarity digests that are similar to a given one. More precisely, let's assume a database containing n entries. Most algorithms require an 'against-all' comparison, which equals a complexity of O(n).

Winter, Schneider, and Yannikos (2013) presented an approach named F2S2 to diminish this complexity for ssdeep. Generally speaking, instead of storing the complete Base64-encoded similarity digest in the database, they stored n-grams using hash tables. In order to look up single digests they first looked for the n-grams, which reduced the overall amount of comparisons. For the final decision, the ssdeep comparison function was used. As a result, they reduced the comparison time of 195,186 files against a database containing 8,334,077 records from 442 h to 13 min (a speedup by a factor of about 2000), a 'practical speed'.

However, this approach works only for Base64 and hence for none of the other approaches like sdhash or mrsh-v2. Therefore, Breitinger, Baier, and White (2014) presented a concept that could speed up the process via Bloom filter-based approaches. They suggested using one single huge Bloom filter to store all feature hashes, which results in a complexity of ~ O(1). Their approach overcomes the drawback of comparing digests against digests but loses precision. That is, it allows for only yes or no decisions: yes means there is a similar file in the set; no equates to none of the files being similar above the chosen threshold. It does not allow for the returning of the matched file(s).

Consequently, the authors presented an enhancement which simply uses multiple large Bloom filters to generate a tree structure, resulting in a complexity of O(log(n)) (Breitinger, Rathgeb, & Baier, 2014). But these are only assumptions – while there is a working prototype for the first approach, the latter concept only exists in theory.

3.7 Distinction from locality-sensitive hashing (LSH)

It is critical to note that people sometimes confuse locality-sensitive hashing (LSH) (e.g., Rajaraman and Ullman (2012)) with approximate matching; therefore, we included this section. LSH is a general mechanism for nearest neighbor search and data clustering whose performance strongly relies on the hashing method used. Two popular algorithms are MinHash (Broder, 1997) and SimHash (Charikar, 2002). (Note, SimHash is a common term and is used several times in the literature. Accordingly, it is also used twice in this article: besides this section, it appears in Sec. 4.6, where it describes an approach from Sadowski and Levin (2007).)

This does not necessarily coincide with the idea of approximate matching. Specifically, while LSH aims at mapping similar objects into the same bucket, approximate matching outputs a similarity digest that is comparable.

We would like to note here that the following section mainly focuses on bytewise approximate matching.

4. INTRODUCTION TO ALGORITHMS

As previously mentioned, approximate matching started to gain attention in 2006 with the concept of context triggered piecewise hashing (CTPH) and its first implementation, ssdeep (Kornblum, 2006). In the following years, new algorithms were proposed and published.

We will introduce the eight known approximate matching algorithms. While the first three algorithms are still extended and relevant, the last four algorithms are less promising from a digital forensics perspective for various reasons, e.g., precision and recall rates, runtime efficiency, and detection capabilities. The last algorithm (TLSH) is more related to LSH than approximate matching and is included for completeness.

This section is a high-level summary of the current algorithms. Throughout each subsection references are cited for deeper reading.

4.1 ssdeep

CTPH is the technique used by ssdeep and was presented by Kornblum (2006). Roughly speaking, it is a modified version of the spam detection algorithm from Tridgell (2002–2009) generalized to cope with any digital object.

In CTPH the approach is to identify trigger points to divide a given input into chunks/blocks. This breakup is performed using a rolling hash that slides through the input, adds bytes to the current context (think of it as a buffer), creates a pseudo-random value, and removes them from the context after a set number of bytes are completed. The context is then used as a trigger – whenever a specified sequence is created, the current context is hashed by the non-cryptographic FNV hash function (Fowler, Noll, & Vo, 1994–2012). To create the similarity digest, the FNV chunk hashes are reduced to 6 bits, converted into a Base64 character, and concatenated; this is done continuously as the trigger outputs FNV hashes.

At the time of this article, ssdeep was still an active project with version 2.13 and is freely available online (http://ssdeep.sourceforge.net, last accessed Feb 4th, 2016). Over the years, several extensions and performance improvements have been published that mostly focus on the efficiency of the implementation (Chen & Wang, 2008; Seo, Lim, Choi, Chang, & Lee, 2009; Breitinger & Baier, 2012b). However, a security analysis conducted by Baier and Breitinger (2011) showed that CTPH cannot withstand an active attack.

4.2 sdhash

Similarity digest hashing was published four years later by Roussev, Richard, and Marziale (2008); Roussev (2010) and is also still active. The SIF algorithm extracts statistically improbable features that are determined by Shannon entropy (not the
ones that seem unique, Roussev (2009)). In sdhash, a feature is a byte sequence of 64 bytes that is then compressed by hashing it with SHA-1. Finally, the author developed a way to insert the hashes into a Bloom filter (Bloom, 1970), a space-efficient data structure to represent a set; Bloom filters will not be discussed in this article, but more details can be found online.

The original version was extended several times, now supporting GPU usage for calculation and a block-based hashing mode (Roussev, 2012). The current version (3.4) is available online (http://sdhash.org, last accessed Feb 4th, 2016).

A comparison between ssdeep and sdhash showed that the latter algorithm outperforms its predecessor (Roussev, 2011). In addition, a security analysis showed that sdhash is much more robust and difficult to overcome (Breitinger & Baier, 2012c).

4.3 mrsh-v2

This algorithm was published by Breitinger and Baier (2013) and is a combination of ssdeep and sdhash (it was also inspired by multi-resolution similarity hashing; Roussev, III, & Marziale, 2007). Like the aforementioned implementations, mrsh-v2 is still supported (http://www.fbreitinger.de, last accessed Feb 4th, 2016). The algorithm uses the feature identification procedure from ssdeep, then hashes the feature using the non-cryptographic FNV (Fowler et al., 1994–2012) and proceeds like sdhash, consequently overcoming the weaknesses of ssdeep and becoming faster than sdhash. The precision and recall rates are slightly worse than sdhash's.

4.4 bbHash

Building block hashing is a completely different approach and is based on the concepts of eigenfaces (biometrics) and de-duplication (data compression). Contrary to expectation, its type (see Sec. 3.2) is not eponymous, but rather BBR. The main difference is that this approach utilizes an external reference point – the building blocks. A set of 16 building blocks (random byte sequences) is used to optimize the representation of a given file. In order to find this representation the algorithm calculates the Hamming distance, which is time consuming and slow for practical usage (e.g., it takes about two minutes to process a 10 MB file) (Breitinger & Baier, 2012a).

4.5 mvHash-B

Majority vote hashing, another BBR type, was published by Åstebøl (2012); Breitinger, Åstebøl, Baier, and Busch (2013). It transforms any byte sequence into long runs of 0x00s and 0xFFs by considering the neighboring bytes of a specific byte. If the neighborhood consists of mainly 1s, the byte is set to 0xFF, otherwise to 0x00. Next, these runs are encoded by run-length encoding (RLE). Although this procedure is very fast, it requires a specific configuration for each file type.

4.6 SimHash

SimHash was presented by Sadowski and Levin (2007) and embodies the notion of counting the occurrences of certain predefined binary strings called "Tags" within an input. In their BBR implementation, the authors used 16 8-bit Tags, i.e., a possible Tag could have been 00110101. Subsequently, the tool parses an input bit by bit, searching for each Tag. The total number of matches is stored in a sum table. A hash key is computed as a function of the sum table entries that form linear combinations. Lastly, all information (including file name, path, and size) is stored in a database.

To identify similarities, a second tool named SimFash is used to query the database. The hash keys are used as a first filter to identify all possible matches. Next, the sum tables are compared and a match is found if the distance is within a specified tolerance.

The authors clearly state that "two files are similar if only a small percentage of their raw bit patterns are different. ... [Thus,] the focus of SimHash has been on resemblance detection" (Sadowski & Levin, 2007).

4.7 saHash

Another SIF type, saHash uses Levenshtein distance to derive the similarity between two byte sequences. The output is a lower bound for the Levenshtein distance between the two inputs. Akin to SimHash (Sec. 4.6), saHash allows for the detection of only near duplicates (up to several hundred Levenshtein operations).

A unique characteristic of this approach is its definition of similarity. While all other approaches output a number between 0 and 1 (not a percentage value), saHash actually returns the lower bound of Levenshtein operations (Ziroff, 2012; Breitinger, Ziroff, Lange, & Baier, 2014) needed to convert one file into another.

4.8 TLSH

TLSH belongs to the category of locality-sensitive hashes, was published by Oliver, Cheng, and Chen (2013), and is open source (https://github.com/trendmicro/tlsh, last accessed Feb 4th, 2016). It processes an input byte sequence using a sliding window to populate an array of bucket counts, and determines the quartile points of the bucket counts. A fixed-length digest is constructed which consists of two parts: (i) a header based on the quartile points, the length of the input, and a checksum; (ii) a body consisting of a sequence of bit pairs, which depends on each bucket's value in relation to the quartile points. The distance between two digest headers is determined by the difference in file lengths and quartile ratios. Meanwhile, the bodies are contrasted via their approximate Hamming distance. Summing these together produces the TLSH similarity score.

According to the authors, the precision and recall rates are robust across a range of file types. Additional experiments (Oliver, Forman, and Cheng (2014)) showed that TLSH can detect strings which have been manipulated with adversarial intentions (tolerance of manipulation was one of the design considerations for TLSH). TLSH is also effective in detecting embedded objects, depending on the level of object manipulation. Despite these advantages, it is less powerful than sdhash and mrsh-v2 for cross correlation.

5. APPLICATIONS

Originally, approximate matching was designed to support the digital investigation process via the use cases stated in Sec. 3.1: search for target file(s)/fragments or reduce the volume of data needing investigation. Recently, tools such as EnCase, X-Ways Forensics, and Forensic Toolkit (FTK) have incorporated similar object detection technologies (Breitinger, 2014). Researchers have now identified additional working areas where these techniques or tools can have practical impact, e.g., file carving (see Sec. 5.1), data leakage prevention (see Sec. 5.2 and 5.4), and iris recognition (see Sec. 5.5).

5.1 Automatic data reduction and hash-based file carving

As sifting through data has become cumbersome, pre-processing schemes have risen.
Extracting data in bulk is arguably the most sought-after application of approximate matching. One perspective that should be fruitful is hash-based file carving.

This alienated area of work was presented by S. Garfinkel and McCarrin (2014). In their paper, the authors combined techniques from file carving and approximate matching to search on "media for complete files and file fragments with sector hashing and hashdb." Instead of focusing on the complete file and comparing it against a database, the authors use individual data blocks. They utilized a special database named hashdb (Allen, 2015) to obtain high throughput.

The evaluation proved their strategy works, although they had to solve the problem of non-probative blocks that emerged "from common data structures in office documents and multimedia files." To filter out such artifacts, the authors presented several 'tests' that alleviated the problem.

5.2 Network traffic analysis

Gupta (2013); Breitinger and Baggili (2014) demonstrated preliminary results when using approximate matching on network traffic for data leakage prevention. The question was (since approximate matching can be used for fragment detection) whether network packets could be matched back to their original files.

The design was similar to its traditional counterpart: create a database of known-object signatures (most likely files) and identify these objects, but instead of analyzing a hard drive the researchers used a network stream (single packets). This work illustrated approximate matching's utility in data leakage prevention, a formerly untouched application.

Beginning with modifying the original mrsh-v2 algorithm to handle the small size of 1460 bytes per packet, the authors showed that this method works robustly on random data (true positive rate 99.6%, true negative rate 100.0%), with a throughput of 650 Mbit/s on a regular workstation.

Regardless, they faced several unsolved problems for real-world data. One obstacle was that many files share the same structural information (e.g., file header information; this is equivalent to the non-probative blocks problem from the previous subsection), which led to false positive rates of around 10⁻⁵ – too high for network traffic analysis.

5.3 Malware

Innately, similarity hashing is ideal for grouping things together, but it was not until 2015 that it was rigorously tested when applied to malware clustering (Li et al., 2015). Faruki, Laxmi, Bharmal, Gaur, and Ganmoor (2015) developed AndroSimilar, a syntactical detection algorithm for Android malware that falls into the SIF category. While Zhou, Zhou, Jiang, and Ning (2012)'s DroidMoSS, a CTPH algorithm, was developed to also detect mobile malware, a comparison between the two could not be performed due to unavailable code.

Polymorphic malware families are on the rise, often hosted on servers that automatically alter inconsequential segments of a file (before sending it across a network) to bypass cryptographic detection tactics (Security, 2013). An intriguing paper by Payer et al. (2014) expands on ways criminals can circumvent the use of similarity-based matching for spotting malicious binary code, which often diversifies itself during recompilation. Thus, it is imperative that malware receives more attention.

5.4 Data leakage prevention

With respect to data leakage prevention, approximate matching may also be utilized for printer data inspection, e.g., MyDLP and Symantec (2010) Data Leakage Prevention. If a document is protected, the software can discard the print job (implemented by MyDLP). A similar experiment was also run by our research group, which created a virtual secure printer that analyzed a sent document before forwarding it to an actual printer.

Note, according to Comodo Group Inc. (2013), these software solutions often call their technology partial document matching, unstructured data matching, intelligent content matching, or statistical document matching – all synonyms for approximate matching.

5.5 Biometrics

Biometrics is another independent domain employing approximate matching with promising results (Rathgeb, Breitinger, & Busch, 2013; Rathgeb, Breitinger, Busch, & Baier, 2013). In their work, the authors

"off between compression and biometric performance."

"Efficient identification: a compact alignment-free representation of iris-codes enables a computationally efficient biometric identification reducing the overall response time of the system."

6. LIMITATIONS AND CHALLENGES

Bytewise approximate matching has some intrinsic limitations. First and foremost, it cannot pick up similarity at a higher level of abstraction, such as semantically. For instance, it cannot meaningfully match two image files that have the same semantic picture but are different file types / formats due to different binary encoding. However, placing it in tandem with other approaches
demonstrated the feasibility of using tech- will still help; Neuner, Mulazzani, Schrit-
niques from approximate matching for bio- twieser, and Weippl (2015) include it as a
metric template protection, data compres- critical component of the digital forensic pro-
sion and efficient identification. According cess. Doubly crucial when it comes to appli-
to Breitinger (2014), there are three im- cations like malware that employ databases
provements: is lookup time. Winter et al. (2013) out-
lined a faster approach to conduct similarity
“Template protection: the successive searching using a database but this method
mapping of parts of a binary biomet- will not work effectively for all approximate
ric template to Bloom filters represents matching algorithms. We will not belabor
an irreversible transformation achieving the point since this was discussed in Sec. 3.6.
alignment-free protected biometric tem- Evidently, the first challenge to confront
plates.” is awareness and adoption of approximate
matching, a possible indication that more re-
“Biometric data compression: the pro- search needs to be conducted to understand
posed Bloom filter-based transforma- the needs for the community. As we outlined
tion can be parameterized to obtain a in the introduction, 15 % of professionals are
desired template size, operating a trade- unaware of approximate matching – an un-
acceptable number. Conversely, 7 % criticize
10
https://www.mydlp.com (last accessed Feb 4th , that algorithms are too slow for practical
2016).
11
The project was done by Kyle Anthony, a mem- use. On the other hand, inquiry needs to
ber of the UNH Cyber Forensics Research & Educa- be made into why 35 % have used approxi-
tion Group mate matching only a few times. Our hope
is that having the NIST SP 800-168 defini- together a test corpus (its usefulness was
tion, a technical classification of algorithms demonstrated in Rowe (2012)).
(Martı́nez et al., 2014), and this more univer-
sal outline will improve awareness and adop- 7. FUTURE FOCUS
tion.
Yet another major obstacle is the lack of Future research should, in accompaniment
a standard definition of similarity (Baier, to prior comments, pursue areas that branch
2015). As addressed by S. Garfinkel and Mc- out from digital forensics, even if completely
Carrin (2014), not all kinds of byte level sim- detached. Below we name some possible ap-
ilarity are equally valuable as there are some plications that could use enhancement, and
artifacts (e.g., structural information, head- areas that approximate matching may be
ers, footers, etc.) that are less important able to be enhanced by (this is not an ex-
or lead to false positives. Hence, we need haustive list):
a filtering mechanism to prioritize matches.
One possibility could be to extract the main Bioinformatics: This field already uses
elements (like text or images) and compare exact matching methods, albeit us-
those, meaning including a pre-processing ing bytewise or semantic approximate
step before the comparison. matching alongside today’s methods
could conceivably increase efficiency.
Aside from initial efforts to test approx-
imate matching algorithms, there are cur- Text mining: Identifying patterns in
rently no accepted standards and reliable structured data to gain high-quality, se-
testing frameworks. FRASH is not easy mantic value.
to implement, a deterrent to practitioners.
This is one reason this paper avoids giving Templates and layouts: Semantically
absolute comparisons between all the (types identify document layout and separate
of) algorithms. More bothersome is the lack content from template automatically.
of an accepted ground truth for real world
Deep / machine learning: Automated
data that would support implementation as-
forms of learning need to be able to
sessments like whether algorithms scale ef-
process information fast, store it effi-
fectively. The ground truth should embody
ciently space-wise, and be able to differ-
the four use cases (see Sec. 3.1) and algo-
entiate similarities and differences; se-
rithm types, even if different data sets are
mantic hashing is a known aspect of
required for each one. Once this is done,
deep learning and ergo might be able
the algorithms will be directly comparable
to strengthen approximate matching.
(e.g., embedded object detection: A better
than B better than C; speed: B faster than Source code governance: Manage
A faster than C). We posit that it is criti- shared code better, especially for open
cal for practitioner efficiency to know which source software.
algorithms solve which potential problems.
At the moment the National Software Ref- Spam filtering and anti-plagiarism:
erence Library (NSRL12 ) has built the most These have already been looked at but
prominent software database and is piecing might be behooved by deeper scrutiny.
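All of the directions above presuppose the same core primitive: scoring how much raw byte content two inputs share. The following toy sketch illustrates that primitive only – it is deliberately none of the published algorithms (ssdeep, sdhash, mrsh-v2), and the 7-byte window size and Jaccard score are arbitrary illustrative choices. It compares two byte strings by the overlap of their n-byte windows:

```python
import hashlib

def features(data: bytes, n: int = 7) -> set:
    # Every n-byte window of the input is one feature.
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes, n: int = 7) -> float:
    # Jaccard overlap of the two feature sets, scaled to 0..100
    # (100 = identical feature sets, 0 = nothing in common).
    fa, fb = features(a, n), features(b, n)
    if not fa or not fb:
        return 0.0
    return 100.0 * len(fa & fb) / len(fa | fb)

original = b"The quick brown fox jumps over the lazy dog. " * 20
tampered = bytearray(original)
tampered[100] ^= 0x01  # flip a single bit

# The cryptographic digests no longer match at all ...
print(hashlib.sha256(original).hexdigest()
      == hashlib.sha256(bytes(tampered)).hexdigest())  # prints False
# ... while the bytewise similarity score stays high.
print(similarity(original, bytes(tampered)))
```

A single flipped bit disturbs only the handful of windows covering that byte, so the score remains close to 100, whereas the traditional hash changes completely – the robustness property the survey describes.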
12 http://www.nsrl.nist.gov (last accessed Feb 4th, 2016).

Ultimately, approximate matching is an alienated domain and its increased adoption