Abstract
This report briefly covers part of the program specification, the tasks completed, and the methods
used. Estimated project progress is approximately 50%. The accomplished tasks comprise
two major parts: context-free spelling correction and a statistical part-of-speech (POS) tagger.
The POS tagger was constructed based on a hidden Markov model (HMM). For the spelling
correction, similarity between words was used as the decision factor. Performance data were
also collected and analyzed.
1 Tools
The program was written entirely in Python 2.6. Additional libraries were required; details
for downloading the necessary packages and an installation guide for Windows are given below.
• PyYAML
• NLTK
After installing all the packages, run Python IDLE (see Getting Started) and type the commands:
1.2 Getting Started
The simplest way to run Python is via IDLE, the Integrated Development Environment. It opens
a window, and you can enter commands at the >>> prompt. You can also open an editor
with File -> New Window and type in a program, then run it using Run -> Run Module. Save
your program to a file with a .py extension.
In Windows:
Start ->All Programs ->Python 2.6 ->IDLE
Check that the Python interpreter is listening by typing the following command at the prompt.
It should print Monty Python!
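In the report's Python 2.6, the check is the statement print "Monty Python!"; a minimal sketch of the same check in Python 3 syntax, for readers on newer interpreters, is:

```python
# Minimal interpreter check: confirm Python is listening at the >>> prompt.
# (The report targets Python 2.6, where the equivalent statement is:
#  print "Monty Python!")
message = "Monty Python!"
print(message)
```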
2 Program Specification
2.1 Requirement
• Spelling correction:
– Accuracy (baseline?)
– Handle morphology
– Integration with a parser to improve performance
– Real time operation
• POS tagger:
– Accuracy greater than 95%
– Real time operation
Figure 2: Spellchecker data flow diagram.
3 Spelling Correction
3.1 Method
• Compiled the lexicon using data from WordNet (about 90 000 words in the vocabulary)
• Used a stemmer to handle word morphology (not yet implemented).
• Used different features such as length, common characters, and transmute steps as
classifiers. Linear interpolation was employed to combine the classifiers.
• Collected performance data and carried out adjustments.
3.2 Features
Various features were used and compared against each other based on accuracy and execution time.
All feature values lie in the range [0, 1]; the larger the value, the better the candidate fits.
• fitm(word, can): counts the number of private letters in word and can, then weights the result.
• fitorder(word, can): level of disorder.
• transmuteI(word, can): steps required to transform word into can. The four basic (distance-1)
transform steps are delete, add, change, and swap (adjacent letters only).
• fitc(word, can): scans for matching substrings between word and can, sums up their lengths,
then weights the result.
• fitl(word, can): length difference.
• In some cases, less similar words, or words of the same similarity, are actually preferred
because of habit or typing errors caused by slips of the fingers.
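The interpolation step above can be sketched as follows; the feature names follow the report, but the distance implementation (optimal string alignment, covering delete, add, change, and adjacent swap) and the lambda weights are assumptions:

```python
# A minimal sketch of the feature-combination step. The normalization into
# [0, 1] and the weights lam_t, lam_l are illustrative assumptions; the
# report does not state its exact weighting.

def osa_distance(a, b):
    """Edit distance with delete, add, change, and adjacent-swap steps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # add
                          d[i - 1][j - 1] + cost)  # change
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(a)][len(b)]

def transmute_feature(word, can):
    """transmuteI mapped into [0, 1]: 1 means no transform steps needed."""
    longest = max(len(word), len(can), 1)
    return 1.0 - osa_distance(word, can) / longest

def fitl(word, can):
    """Length-difference feature in [0, 1]: 1 means equal lengths."""
    longest = max(len(word), len(can), 1)
    return 1.0 - abs(len(word) - len(can)) / longest

def score(word, can, lam_t=0.7, lam_l=0.3):
    """Linear interpolation of two features (weights are assumed values)."""
    return lam_t * transmute_feature(word, can) + lam_l * fitl(word, can)

def correct(word, lexicon):
    """Return the candidate with the highest interpolated score."""
    return max(lexicon, key=lambda can: score(word, can))
```

For example, correct("recieve", ["recipe", "deceive", "receive"]) prefers "receive", which is one adjacent-swap step away and matches in length.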
These limitations can be partially addressed by incorporating a parser. Another solution is
to use statistical information, which conveniently reflects natural usage. However, corpora of
spelling errors are not common, and insufficient data may worsen the problem.
4 POS Tagging
4.1 Method
• Developed a language model using a hidden Markov model (HMM). Linear interpolation of
trigram, bigram, and unigram probabilities was used to improve performance. Prefix/suffix and
regular-expression taggers would be called upon failure of the HMM model.
The solution can be found using linear interpolation of (1), (2), and (3):

$$\operatorname*{argmax}_{t_1,\ldots,t_n}\;\prod_{1\le i\le n} P(w_i\mid t_i)\left(\lambda_3\,P(t_i\mid t_{i-2},t_{i-1})+\lambda_2\,P(t_i\mid t_{i-1})+\lambda_1\,P(t_i)\right)\tag{4}$$
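The interpolated term in (4) can be computed directly from precounted n-gram probability tables; the table layout and the lambda values below are assumptions, not the report's estimates:

```python
# Sketch of the interpolated transition probability used in Eq. (4).
# The lambda weights (0.6, 0.3, 0.1) are illustrative assumptions.

def interp_tag_prob(t, prev1, prev2, trigram, bigram, unigram,
                    lambdas=(0.6, 0.3, 0.1)):
    """lambda3*P(t|prev2,prev1) + lambda2*P(t|prev1) + lambda1*P(t)."""
    l3, l2, l1 = lambdas
    return (l3 * trigram.get((prev2, prev1, t), 0.0)
            + l2 * bigram.get((prev1, t), 0.0)
            + l1 * unigram.get(t, 0.0))
```

Falling back to 0.0 for unseen n-grams is what makes the interpolation useful: the lower-order terms still contribute when the trigram was never observed.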
Table 1: Accuracies
The suffix tagger had slightly better accuracy and deviation than the prefix tagger, and an
interpolation of the two fell somewhere in between, so the suffix tagger was chosen.
It was surprising to achieve such decent accuracy just by knowing the ending or starting character.
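A single-character suffix tagger of the kind described can be sketched in a few lines; the training pairs and the fallback tag below are illustrative assumptions, not the report's data:

```python
from collections import Counter, defaultdict

# Sketch of a last-character suffix tagger: each final character is mapped
# to its most frequent tag in the training data.

def train_suffix_tagger(tagged_words):
    """Build a final-character -> most-frequent-tag table."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-1]][tag] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}

def tag_word(word, model, default="NN"):
    """Tag by final character, falling back to a default tag (assumed)."""
    return model.get(word[-1], default)
```

Trained on pairs like ("running", "VBG") and ("dogs", "NNS"), the model tags unseen words such as "jumping" or "birds" correctly from their last character alone.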
As for the HMM tagger, simply going from left to right and choosing the most likely tag at each
position yielded decent accuracy and speed. However, the outcome might not be very useful for
other tasks such as information extraction, due to misleading tags. Therefore, searching for the
most probable path is more desirable; the drawback is that it requires much more computational effort.
The tagger seemed to work brilliantly when the first word was tagged correctly. If not, the
error would propagate and quite often severely worsen the accuracy. Separators that are always
tagged correctly, such as commas or quotes, had the effect of attenuating this error
propagation.
The execution time was significantly longer. There were 170 possible tags, so the search space
could easily reach several hundred thousand states within just 10 or 15 words. Either a probability
threshold or a limited search range can be used to save memory and time, but this risks missing
better solutions. Either way, execution still took a significant amount of time, so a full-scale test
was not practical.
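The most-probable-path search with a limited search range can be sketched as a beam-pruned Viterbi decoder. For brevity this sketch uses a bigram HMM rather than the report's interpolated trigram model, and all probability tables are toy assumptions:

```python
import math

# Beam-limited Viterbi over a bigram HMM: at each position only the `beam`
# highest-scoring states are kept, trading optimality for speed, which is
# the pruning idea discussed above.

def viterbi_beam(words, tags, emit, trans, start, beam=3):
    """Return the most probable tag path, keeping `beam` states per step."""
    # chart[i] maps tag -> (log probability, backpointer to previous tag)
    chart = [{t: (math.log(start.get(t, 1e-12))
                  + math.log(emit.get((t, words[0]), 1e-12)), None)
              for t in tags}]
    chart[0] = dict(sorted(chart[0].items(),
                           key=lambda kv: kv[1][0], reverse=True)[:beam])
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            # Best predecessor among the surviving (pruned) states only.
            col[t] = max((chart[i - 1][p][0]
                          + math.log(trans.get((p, t), 1e-12))
                          + math.log(emit.get((t, words[i]), 1e-12)), p)
                         for p in chart[i - 1])
        chart.append(dict(sorted(col.items(),
                                 key=lambda kv: kv[1][0], reverse=True)[:beam]))
    # Backtrace from the best final state.
    tag = max(chart[-1], key=lambda t: chart[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = chart[i][tag][1]
        path.append(tag)
    return path[::-1]
```

With 170 tags, a small beam keeps each column tractable, but as noted above, a state pruned early may have been part of the globally best path.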
Testing on smaller scales gave accuracies ranging from 100% down to 25%. A crawler was
developed to detect poorly performing samples for analysis, but it should be used only
on held-out data rather than test data. In many cases, it turned out that the standard
solutions were filtered out because the search space was limited, not because they were less fit.
This indicates the reliability of the model.
Another possible solution is a genetic algorithm (GA), a well-known approach for dealing with
gigantic search spaces. However, a previous application of GA to a string-segmentation task
gave inconsistent results, so there is no plan to implement the algorithm at the moment,
though it remains a promising possibility.