∏_i p(Y_i | X_i) = ∏_i E_{y_i, x_i}    (6.2)
The spellcheck assumes that words in the recovered text are all properly partitioned
by whitespace. As we have seen in the previous section, the space key is indeed well
separated from the other keys during clustering, but when a mistake is made it can cause
serious problems for the language model. One issue that arose during experiments
was that a misclassified space key could produce a string of characters longer than
any word in the dictionary, meaning no suggestion could be given for that particular
word. One option was to leave the word unchanged and assign it a probability of 1.
However, as we will see in the next chapter, we select training examples from those
words which are changed the least by this spellchecking stage, so leaving the word
unchanged would be problematic. Instead, an alternative approach was used: the string
was split into two words by changing its centremost character to a space.
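As a rough sketch of this fallback (illustrative Python with hypothetical names; the project's implementation was in Matlab), the centremost-character split might look like:

```python
def split_overlong(word, dictionary_max_len):
    """Fallback for recovered strings longer than any dictionary word:
    replace the centremost character with a space, yielding two words."""
    if len(word) <= dictionary_max_len:
        return [word]
    mid = len(word) // 2
    # Drop the (presumably misclassified) middle character entirely,
    # treating it as the lost space key.
    return [word[:mid], word[mid + 1:]]
```

Both halves can then be passed back through the spellchecker as ordinary words.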
6.2 Word Frequencies
Once a set of spelling suggestions and their corresponding probabilities had been gen-
erated for each word, a second-order HMM could be constructed to find the most
probable sequence of words (see figure 2.4). In their original attack, [35] stated that
they used an adaptation of the HMM outlined in [33]. Whilst this model was designed
for part-of-speech tagging, it presents a useful means of representing individual sen-
tences and also incorporates a new method of smoothing for dealing with the sparse
data problem.
Chapter 6. Language Model 42
In their model, they represented the sequence of words as observed
nodes and the sequence of corresponding word tags as hidden nodes. Since each hid-
den node in a second-order HMM is conditioned on the two previous hidden nodes,
they then prepended two new tag symbols, NONE and SOS, to the beginning of the
tag sequence to signify the start of the sentence. This allowed for a more sensible
probability distribution on the first two real tags rather than using unigram and bigram
statistics. To overcome the data sparsity problem, the state transition probabilities used
variable weighting. For a tag x and state sequence t_a t_b t_c, this probability was
given by

P(x = t_c | x_{-1} = t_b, x_{-2} = t_a) = k_3 (N_3 / C_2) + (1 - k_3) k_2 (N_2 / C_1) + (1 - k_3)(1 - k_2)(N_1 / C_0)    (6.3)

where

N_1 = number of times t_c occurs
N_2 = number of times the sequence t_b t_c occurs
N_3 = number of times the sequence t_a t_b t_c occurs
C_0 = total number of tags that appear
C_1 = number of times t_b occurs
C_2 = number of times the sequence t_a t_b occurs
k_2 = (log(N_2 + 1) + 1) / (log(N_2 + 1) + 2)
k_3 = (log(N_3 + 1) + 1) / (log(N_3 + 1) + 2)
The variables k_2 and k_3 effectively reweight the equation depending on the exis-
tence of bigram and trigram statistics in the training corpus. This model was imple-
mented for the project, but where the original model represented tags and words, these
were substituted with words and attempted classifications of words respectively. The
implementation of this method was performed in three stages. Firstly, the language
statistics required for such a model were gathered. Secondly, sentences were extracted
from the recovered text and the language statistics were filtered to reduce processing
time. Finally, a second-order Viterbi algorithm was used to find the most probable set
of words.
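Equation 6.3 translates directly into code. The project used Matlab; the following is an equivalent Python sketch (illustrative names, with zero-count guards added for unseen bigram and trigram contexts):

```python
import math

def smoothed_trigram_prob(n1, n2, n3, c0, c1, c2):
    """Variable-weighted trigram probability (equation 6.3).

    n1..n3 are the counts of t_c, t_b t_c and t_a t_b t_c;
    c0..c2 are the counts of all tags, t_b and t_a t_b."""
    k2 = (math.log(n2 + 1) + 1) / (math.log(n2 + 1) + 2)
    k3 = (math.log(n3 + 1) + 1) / (math.log(n3 + 1) + 2)
    return (k3 * (n3 / c2 if c2 else 0.0)
            + (1 - k3) * k2 * (n2 / c1 if c1 else 0.0)
            + (1 - k3) * (1 - k2) * (n1 / c0))
```

When no bigram or trigram statistics exist (N_2 = N_3 = 0), both weights fall to 1/2 and the estimate backs off towards the unigram term N_1/C_0.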
6.2.1 Collecting language statistics
To collect word frequency statistics, the Cambridge Statistical Language Modeling
Toolkit [7] was used. This toolkit provides the text2wngram utility, which generates an
ordered list of word n-tuples along with their number of occurrences for all non-zero
occurrences in a given input text. Before generating these statistics, the project's ref-
erence corpus (see section 3.1.3) was processed so that each sentence was prepended
with the string <NONE> <SOS> and all punctuation except apostrophes was re-
moved. Unigram, bigram and trigram statistics were then taken from this new text. A
first attempt at importing the trigram data into Matlab was to use a three-dimensional
sparse array, but unfortunately this proved too large to fit into memory. An alternative
approach was therefore implemented using four one-dimensional cell arrays, where
element p of the first three arrays stored the first, second and third words of the p-th
trigram and the last array stored the p-th trigram's total number of occurrences. Matlab
was able to generate this data structure but, as we'll discuss in section 6.3, it was a poor
choice in terms of performance. The unigram and bigram statistics were also included
in this same data structure by assigning the <NONE> symbol to the first or first and
second trigram words where appropriate.
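The four-parallel-array layout can be sketched as follows (illustrative Python lists standing in for the Matlab cell arrays; the linear-scan lookup is exactly the performance problem discussed in section 6.3):

```python
# Parallel lists mirroring the four one-dimensional cell arrays above.
w1, w2, w3, counts = [], [], [], []

def add_ngram(ngram, count):
    """Store an n-gram (n <= 3), left-padding with <NONE> so that
    unigrams and bigrams share the trigram data structure."""
    padded = ["<NONE>"] * (3 - len(ngram)) + list(ngram)
    w1.append(padded[0])
    w2.append(padded[1])
    w3.append(padded[2])
    counts.append(count)

def lookup(a, b, c):
    """Linear scan for the count of trigram (a, b, c) -- the slow part."""
    for p in range(len(counts)):
        if w1[p] == a and w2[p] == b and w3[p] == c:
            return counts[p]
    return 0
```

A hash map keyed on the word triple would avoid the scan, but the sketch above reflects the structure actually described.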
6.2.2 Extracting sentences
Once word statistics had been collected, the recovered text was split into sentences.
Since full stops have a low occurrence rate, they were often misclassified as other keys,
resulting in extremely long sentences. To reduce the computational complexity of
analyzing these long sentences, a maximum length of 100 characters was enforced by
splitting any sentence over this threshold. The sentences were then prepended with
two control symbols. If the sentence was the first half of a split sentence, or a whole
sentence, then it was prepended with <NONE> <SOS>, whereas if the sentence was
from the second half of a split it was prepended with <NONE> <NONE>. Since the
unigram frequencies had been added to the word statistics prefixed with two <NONE>
symbols, this effectively places a unigram prior on the first word of sentences that are
from the second half of a split.
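The splitting and control-symbol scheme can be sketched as follows (illustrative Python; the real implementation worked on the recovered keystroke stream in Matlab, and the exact split points may differ):

```python
def prepare_sentences(text, max_len=100):
    """Split recovered text into sentences capped at max_len characters
    and prepend the control symbols described above."""
    prepared = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        # Enforce the maximum sentence length by chunking.
        chunks = [sentence[i:i + max_len]
                  for i in range(0, len(sentence), max_len)]
        for j, chunk in enumerate(chunks):
            # First chunk (or a whole sentence) gets <NONE> <SOS>;
            # continuation chunks get <NONE> <NONE> (unigram prior).
            prefix = "<NONE> <SOS>" if j == 0 else "<NONE> <NONE>"
            prepared.append(prefix + " " + chunk)
    return prepared
```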
Each sentence was then processed in turn. The lists of spelling suggestions for
every word in the sentence were concatenated, with any duplicates removed, giving a
unique vocabulary of words that may occur in the sentence. Before each sentence was
processed by the Viterbi algorithm, a new set of word statistics was generated by ex-
tracting only those trigrams which contained words from this vocabulary and the control
symbols <NONE> and <SOS>. This greatly reduced the time required to search
for the correct statistic. Finally, an observation matrix was constructed whereby the
rows of the matrix represented each word position in the sentence and the columns
represented each word in the sentence vocabulary. The spelling suggestion
probabilities were then used to populate the matrix, with all other values set to 0. The
observation matrix, vocabulary, and filtered word statistics were then passed to the
Viterbi algorithm to find the most likely sequence of words for the sentence.
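The observation-matrix construction can be sketched as follows (illustrative Python; `suggestions` is assumed to be one `{word: probability}` dict per word position, as produced by the spellcheck stage):

```python
def build_observation_matrix(suggestions):
    """Rows are word positions in the sentence, columns are the words of
    the deduplicated sentence vocabulary; entries are suggestion
    probabilities, with everything else left at 0."""
    vocab = sorted({w for pos in suggestions for w in pos})
    col = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in suggestions]
    for i, pos in enumerate(suggestions):
        for word, prob in pos.items():
            matrix[i][col[word]] = prob
    return vocab, matrix
```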
6.2.3 Second-order Viterbi algorithm
To implement this part of the project, the method outlined in [12] for extending the
Viterbi algorithm to second order was used. They argued that for a state sequence X
and observation sequence Z, one can introduce a combined state sequence Y which
consists of pairs of states from X, where y_1 = x_1 and y_i = x_{i-1} x_i for 2 ≤ i ≤ K. This
then gives,
p(y_1) = p(x_1)
p(y_2 | y_1) = p(x_2 | x_1)
p(y_i | y_{i-1}) = p(x_i | x_{i-1}, x_{i-2})    for 3 ≤ i ≤ K
therefore, for a partial state sequence X_i = (x_1, x_2, ..., x_i) and a partial observation
sequence Z_i = (z_1, z_2, ..., z_i) with 3 ≤ i ≤ K, we have
P(X_i, Z_i) = P(X_{i-1}, Z_{i-1}) p(y_i | y_{i-1}) p(z_i | x_i)    (6.4)
This means that to find the maximum P(X_i, Z_i) for each y_i we must:

- store the maximum P(X_{i-1}, Z_{i-1}) for each y_{i-1}
- extend each of these probabilities to every y_i by using equation 6.4
- select the maximum P(X_i, Z_i) for each y_i
The algorithm simply repeats these three steps, incrementing i on each iteration, and
remembers the path taken. Once the last node is reached, the algorithm backtracks
through the recorded path to find the best sequence. The probability p(z_i | x_i) is provided
by the spellcheck suggestion probabilities and p(y_i | y_{i-1}) is calculated from the word
trigram statistics using equation 6.3.
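The recurrence above can be sketched in full (illustrative Python for sentences of at least two words; the project's implementation was in Matlab and drew its probabilities from the data structures of section 6.2.1, so all names here are assumptions):

```python
def second_order_viterbi(obs, vocab, init, pair, trans):
    """Second-order Viterbi over combined states y_i = (x_{i-1}, x_i),
    following the recurrence of equation 6.4.

    obs[i][w]        -> p(z_i | x_i = w) (spellcheck suggestion probs)
    init[w]          -> p(x_1 = w)
    pair[(v, w)]     -> p(x_2 = w | x_1 = v)
    trans[(u, v, w)] -> p(x_i = w | x_{i-1} = v, x_{i-2} = u)
    Missing dictionary entries are treated as probability 0."""
    K = len(obs)
    # Initialise over the combined state y_2 = (x_1, x_2).
    V = {(v, w): init.get(v, 0.0) * obs[0].get(v, 0.0)
                 * pair.get((v, w), 0.0) * obs[1].get(w, 0.0)
         for v in vocab for w in vocab}
    backpointers = []
    for i in range(2, K):
        newV, bp = {}, {}
        for v in vocab:
            for w in vocab:
                # Extend every stored maximum to the new combined state
                # and keep only the best predecessor u.
                best_p, best_u = 0.0, None
                for u in vocab:
                    p = (V[(u, v)] * trans.get((u, v, w), 0.0)
                         * obs[i].get(w, 0.0))
                    if p > best_p:
                        best_p, best_u = p, u
                newV[(v, w)], bp[(v, w)] = best_p, best_u
        V, backpointers = newV, backpointers + [bp]
    # Backtrack from the most probable final pair.
    v, w = max(V, key=V.get)
    seq = [v, w]
    for bp in reversed(backpointers):
        seq.insert(0, bp[(seq[0], seq[1])])
    return seq
```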
6.3 Model Performance
Despite filtering the word trigram statistics on the sentence vocabulary, the language
model still took much longer than the implementation described in [35]. They stated
that the time required to process their attack, using 3 iterations of the language model,
was approximately 30 minutes on a Pentium IV 3.0GHz desktop computer with
1GB RAM. However, for the maximum sentence size of 100 characters (approximately
16 words), the project's Viterbi implementation takes around 11 minutes to determine
the best sequence on an AMD Turion 2.0GHz laptop with 4GB RAM. This means
that for dataset A, the language model takes a total of 3 hours to complete a single
iteration, bringing the total attack time to just over 9 hours for 3 iterations.
In [35] they stated that the second-order Viterbi algorithm's complexity is O(TN^3),
where T is the length of the sentence and N is the number of possible values for each
hidden variable, but since they were only selecting the top M spelling suggestions, this
reduced the complexity to O(TM^3), with a value of M = 20 used in their experiments.
It would appear that the reason for the large time difference is the way the top M
suggestions are processed. The project's implementation (see algorithm 6.1) stores the
sentence's vocabulary in an array and then, at each position in the sentence, loops over
the vocabulary of words and checks whether each particular word is one of the possible
suggestions for that position. However, this method is extremely inefficient because,
although each check takes a fraction of a second to perform, there are three of these
loops nested inside each other. A more efficient method would have been to store the
spelling suggestions in a 2D array where each row represents the position in the
sentence and each column represents the M spelling suggestions. Then, rather than
searching over the entire vocabulary and rejecting those words that aren't suggestions,
only the actual spelling suggestions for that sentence position could be looped over.
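As a sketch, the proposed optimization replaces the three vocabulary scans with direct iteration over per-position suggestion lists, visiting M^3 candidates rather than N^3 (hypothetical Python; the suggestion lists here stand in for the proposed 2D array):

```python
# Hypothetical top-M suggestion lists, one per sentence position.
suggestions = [["the", "he"], ["cat", "cut"], ["sat", "sit"]]

def candidate_trigrams(suggestions, pos):
    """Enumerate only genuine suggestion trigrams ending at position pos,
    avoiding three nested scans over the whole sentence vocabulary."""
    return [(w1, w2, w3)
            for w3 in suggestions[pos]
            for w2 in suggestions[pos - 1]
            for w1 in suggestions[pos - 2]]
```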
Unfortunately, there was not enough time to implement and thoroughly test these
optimizations in the project, but whilst the language model takes a long time to execute,
the results it produces will be identical to those of an optimized version, so the accuracy
of the attack remains unaffected.
Algorithm 6.1 Pseudocode of the main loop in the Viterbi implementation
for pos = 3 to sentenceLength
    for word3 = 1 to vocabularyCount
        if (observationMatrix(pos, word3) == 0) then continue;
        for word2 = 1 to vocabularyCount
            if (observationMatrix(pos-1, word2) == 0) then continue;
            for word1 = 1 to vocabularyCount
                if (observationMatrix(pos-2, word1) == 0) then continue;
                # ...calculate probability and check if
                # it's the best path so far...
            end
        end
    end
end
Chapter 7
Supervised Classification
7.1 Training a classifier
Once the recovered text has been corrected by the language model, a set of labelled
training data can be extracted with which to train a supervised classifier. To deter-
mine which keys to assign to the training set, the method used in [35] was followed:
the corrected text was compared with the original recovered text, and any words that
had more than 75% of their characters left unchanged were used as training examples.
Each letter from these corrected words and its corresponding keystroke features were
then used to train a supervised classifier. In this project, a probabilistic neural network
was used for classification via Matlab's newpnn() function and, once it had been trained
with the newly created training set, it was used to classify the entire dataset. The
key feature of the [35] model is that once this classification has been performed, the
results are fed back into the language model, corrected, and a new training set
is generated with which to retrain the classifier. This retraining process is repeated
until no improvement in accuracy is obtained.
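The 75% selection rule can be sketched as follows (illustrative Python with hypothetical names; the project worked on keystroke indices and features rather than plain strings):

```python
def select_training_words(recovered, corrected, threshold=0.75):
    """Pick training words where more than `threshold` of the characters
    survived correction, per the selection rule described above."""
    selected = []
    for orig, fixed in zip(recovered, corrected):
        if len(orig) != len(fixed):
            continue  # e.g. words altered by the split fallback
        unchanged = sum(a == b for a, b in zip(orig, fixed))
        if unchanged / len(orig) > threshold:
            selected.append(fixed)
    return selected
```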
To generate the confusion matrix for the spellcheck module, each time a training
set is extracted the training words are compared with their original spelling in the
recovered text. This means that the next time a spellcheck is performed, the corrections
that needed to be made to the last set of training words will have greater probability
than random corrections.
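Accumulating these per-character correction counts can be sketched as follows (hypothetical Python representation of the confusion matrix described above, keyed on character pairs):

```python
def update_confusion(confusion, training_pairs):
    """Accumulate spellcheck confusion counts: confusion[(typed, correct)]
    counts how often character `typed` in the recovered text was corrected
    to `correct` in a training word."""
    for recovered_word, corrected_word in training_pairs:
        for typed, correct in zip(recovered_word, corrected_word):
            confusion[(typed, correct)] = confusion.get((typed, correct), 0) + 1
    return confusion
```

Normalizing each row of these counts then yields the per-character correction probabilities used by the next spellcheck pass.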
7.2 Overall attack performance
Figure 7.1 shows the results of the overall attack on datasets A and B. For the initial
text recovery stage, the votes of confidence classifier was used (section 5.3) and
then a further 2 iterations of the language model were performed. The maximum text
recovery rates were 81.47% for dataset A and 91.26% for dataset B.
Figure 7.1: Final results of attack model
As previously discussed in section 4.2, the keyboard used to record dataset B seems
to produce keystroke sounds which are more readily distinguishable, which is why
there is a large difference between the two sets of results. As we saw in section
3.3, properties of the keystroke audio recording, such as typing rate and background
noise, can also have a large impact on the performance of the attack. Unfortunately,
the dataset used by [35] is not publicly available and so a direct comparison between
models could not be performed.
Chapter 8
Conclusion
The acoustic eavesdropping attack outlined in [35] has been successfully reimple-
mented, with text recovery accuracies similar to those stated in the original paper.
A method for collecting accurate labelled datasets was developed and a novel error
function for determining thresholding values during keystroke extraction was demon-
strated. Some of the problems faced by unsupervised keystroke extraction methods
have also been explored, and it was found that a mixture of FFT and cepstrum features
out-performs either set of features individually. Finally, it was shown that running
the clustering algorithm over multiple iterations and taking a vote on the recovered
keystroke sequence greatly increases the accuracy of the results.
Since these types of acoustic attack are still in their infancy, there does not yet exist a
common dataset on which to test competing models. Experiments so far have recorded
a user copying a piece of sample text onto the computer, and so the use of a standard
English corpus has been a sufficient source of statistics. However, it has been shown
[30] that written text samples poorly reflect the actual keystroke statistics of user input.
Future work in this field would be greatly assisted by a large, publicly accessible
dataset on which to test competing attack models and which provides a more realistic
set of data to attack.
The malicious use of surveillance for the purpose of industrial espionage is a grow-
ing threat to businesses. The German government recently stated that German compa-
nies were losing around €50bn (£43bn) and 30,000 jobs to industrial espionage every
year [9]. Indeed, the United Kingdom's Security Service (MI5) currently lists es-
pionage alongside terrorism and weapons of mass destruction as one of the primary
threats against national security [21]. Clearly, the practical limitations of these types
of attack must be explored so that appropriate defenses can be identified.
Bibliography
[1] Simple python keylogger. http://sourceforge.net/projects/pykeylogger/, June
2010.
[2] Edwin A. Abbott. Flatland: A Romance of Many Dimensions. Seely & Co.,
1884.
[3] D. Asonov and R. Agrawal. Keyboard acoustic emanations. In Proceedings of the
2004 IEEE Symposium on Security and Privacy, pages 3–11, May 2004.
[4] Yigael Berger, Avishai Wool, and Arie Yeredor. Dictionary attacks using key-
board acoustic emanations. In CCS '06: Proceedings of the 13th ACM confer-
ence on Computer and communications security, pages 245–254, New York, NY,
USA, 2006. ACM.
[5] B. Bogert, M. Healy, and J. Tukey. The quefrency alanysis of time series for
echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking.
In Proc. Symp. on Time Series Analysis, pages 209–243, 1963.
[6] Leo Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, 1996.
[7] Philip Clarkson. Cambridge statistical language modeling toolkit. http://svr-
www.eng.cam.ac.uk/~prc14/toolkit.html, June 1999.
[8] Australian Broadcasting Commission. Science news corpus.
http://www.abc.net.au/, 2006.
[9] Kate Connolly. Germany accuses china of industrial espionage. The Guardian
newspaper, July 2009.
[10] Jean-François Dhem, François Koeune, Philippe-Alexandre Leroux, Patrick
Mestré, Jean-Jacques Quisquater, and Jean-Louis Willems. A practical imple-
mentation of the timing attack. In CARDIS '98: Proceedings of the Inter-
national Conference on Smart Card Research and Applications, pages 167–182,
London, UK, 2000. Springer-Verlag.
[11] Karine Gandolfi, Christophe Mourtel, and Francis Olivier. Electromagnetic anal-
ysis: Concrete results. In CHES '01: Proceedings of the Third International
Workshop on Cryptographic Hardware and Embedded Systems, pages 251–261,
London, UK, 2001. Springer-Verlag.
[12] Y. He. Extended Viterbi algorithm for second order hidden Markov process. In
Proc. 9th IEEE Conf. Pattern Recognition, pages 718–720, 1988.
[13] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An
introduction to natural language processing. Prentice Hall, 2000.
[14] Paul C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In CRYPTO '96: Proceedings of the 16th Annual Interna-
tional Cryptology Conference on Advances in Cryptology, pages 104–113, Lon-
don, UK, 1996. Springer-Verlag.
[15] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In
CRYPTO '99: Proceedings of the 19th Annual International Cryptology Confer-
ence on Advances in Cryptology, pages 388–397, London, UK, 1999. Springer-
Verlag.
[16] H.J. Landau. Sampling, data transmission, and the Nyquist rate. Proceedings of
the IEEE, 55(10):1701–1706, Oct. 1967.
[17] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduc-
tion to Information Retrieval. Cambridge University Press, 1 edition, July 2008.
[18] Mathworks. Specgram function. http://www.mathworks.com/access/helpdesk-
r13/help/toolbox/signal/specgram.html, 2010.
[19] Rita Mayer-Sommer. Smartly analyzing the simplicity and the power of simple
power analysis on smartcards. In CHES '00: Proceedings of the Second Inter-
national Workshop on Cryptographic Hardware and Embedded Systems, pages
78–92, London, UK, 2000. Springer-Verlag.
[20] Dominic Mazzoni. Audacity editing software. http://audacity.sourceforge.net/,
June 2010.
[21] UK Security Service (MI5). The threats. https://www.mi5.gov.uk/output/the-
threats.html, 2010.
[22] Mike Brookes, Department of Electrical & Electronic Engineering, Imperial
College. Voicebox: Speech processing toolbox for Matlab.
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 2010.
[23] Steven J. Murdoch. Hot or not: revealing hidden services by their clock skew. In
CCS '06: Proceedings of the 13th ACM conference on Computer and communi-
cations security, pages 27–36, New York, NY, USA, 2006. ACM.
[24] Kevin Murphy. Hidden Markov model toolbox for Matlab.
http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, June 2005.
[25] Nokia. Qt application framework. http://doc.qt.nokia.com/, 2010.
[26] Jean-Jacques Quisquater and David Samyde. Electromagnetic analysis (EMA):
Measures and counter-measures for smart cards. In E-SMART '01: Proceedings
of the International Conference on Research in Smart Cards, pages 200–210,
London, UK, 2001. Springer-Verlag.
[27] S. Huang, S. Shieh, and J. D. Tygar. Keyboard acoustic emanations revis-
ited. http://www.cs.berkeley.edu/~tygar/papers/Keyboard-Acoustic-Emanations-
Revisited/tiss.preprint.pdf, 2010.
[28] Adi Shamir and Eran Tromer. Acoustic cryptanalysis: On nosy people and noisy
machines. Eurocrypt 2004 rump session, May 2004.
[29] Dawn Xiaodong Song, David Wagner, and Xuqing Tian. Timing analysis of
keystrokes and timing attacks on SSH. In SSYM'01: Proceedings of the 10th
conference on USENIX Security Symposium, pages 25–25, Berkeley, CA, USA,
2001. USENIX Association.
[30] R. William Soukoreff and I. Scott MacKenzie. Input-based language modelling
in the design of high performance text input techniques. In Graphics Interface,
pages 89–96, 2003.
[31] William Soukoreff and Scott MacKenzie. KeyCapture.
http://www.dynamicnetservices.com/~will/academic/textinput/keycapture/,
July 2003.
[32] Fred A. Stahl. A homophonic cipher for computational cryptography. In AFIPS
'73: Proceedings of the June 4-8, 1973, national computer conference and expo-
sition, pages 565–568, New York, NY, USA, 1973. ACM.
[33] Scott M. Thede and Mary P. Harper. A second-order hidden Markov model for
part-of-speech tagging. In Proceedings of the 37th Annual Meeting of the ACL,
pages 175–182, 1999.
[34] Wikipedia. British and american keyboards. http://en.wikipedia.org/wiki/British-
and-American-keyboards, August 2010.
[35] Li Zhuang, Feng Zhou, and J. D. Tygar. Keyboard acoustic emanations revis-
ited. In CCS '05: Proceedings of the 12th ACM conference on Computer and
communications security, pages 373–382, New York, NY, USA, 2005. ACM.