
Natural Language Processing for Search Engine

Short Progress Report


Thai Binh Duong
School of EEE
Nanyang Technological University
Project Supervisor: Chan Chee Keong
Project number: A3024-101
November 28, 2010

Abstract
This report briefly covers part of the program specification, the tasks completed, and the
methods used. Estimated project progress is approximately 50%. The accomplished tasks
comprise two major parts: context-free spelling correction and a statistical part-of-speech
(POS) tagger. The POS tagger was constructed based on a hidden Markov model (HMM). For the
spelling correction, similarity between words was used as the decision factor. Performance
data were also collected and analyzed.

1 Tools
The program was written entirely in Python 2.6. Additional libraries are required. Details
for downloading the necessary packages and an installation guide for Windows are given below.

1.1 Download and Installation


Download size is about 17 MB. A full installation will require approximately 800 MB of free disk
space.
• Python 2.7

• PyYAML
• NLTK
After installing all packages, run the Python IDLE (see Getting Started), and type the commands:

>>> import nltk


>>> nltk.download()
A new window should open, showing the NLTK Downloader. Click on the File menu and
select Change Download Directory. For central installation, set this to C:\nltk_data. Next,
select the packages or collections you want to download.
If you did not install the data to one of the above central locations, you will need to set the
NLTK_DATA environment variable to specify the location of the data. Right click on My Computer,
select Properties > Advanced > Environment Variables > User Variables > New...
Test that the data has been installed as follows. (This assumes you downloaded the Brown
Corpus):
>>> from nltk.corpus import brown
>>> brown.words()[:50]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

1.2 Getting Started
The simplest way to run Python is via IDLE, the Integrated Development Environment. It opens
a window where you can enter commands at the >>> prompt. You can also open an editor
with File -> New Window, type in a program, and run it using Run -> Run Module. Save
your program to a file with a .py extension.
In Windows:
Start ->All Programs ->Python 2.6 ->IDLE
Check that the Python interpreter is listening by typing the following command at the prompt.
It should print Monty Python!

>>> print "Monty Python!"

2 Program Specification
2.1 Requirement
• Spelling correction:

– Accuracy (baseline to be determined)
– Handle morphology
– Integration with a parser to improve performance
– Real-time operation

• POS tagger:
– Accuracy greater than 95%
– Real-time operation

2.2 Data Flow Diagram

Figure 1: System data flow diagram.

Figure 2: Spellchecker data flow diagram.

Figure 3: POS tagger data flow diagram.

3 Spelling Correction
3.1 Method
• Compiled the lexicon using data from WordNet (about 90,000 words in the vocabulary)
• Used a stemmer to handle word morphology (unimplemented at the moment)
• Used different features such as length, common characters, and transmute steps as
classifiers. Linear interpolation was employed to combine the classifiers.
• Collected performance data and carried out adjustments.

3.2 Features
Various features were used and compared against each other based on accuracy and execution
time. All feature values lie in the range [0, 1]; the larger the value, the better the candidate matches.
• fitm(word, can): counts the number of private letters (letters occurring in only one of
word and can), then weights.

• fitorder(word, can): level of disorder.

• transmuteI(word, can): number of steps required to transform word into can. The four basic
(distance-1) transform steps are: delete, add, change, and swap (adjacent letters only).

• fitc(word, can): scans for matching substrings between word and can, sums up their lengths,
then weights.

• fitl(word, can): length difference.
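The exact weighting schemes are not given in this report; as an illustration, a minimal sketch of how the length and private-letter features, and their linear interpolation, might be implemented (function names follow the report; the normalizations and weights are assumptions):

```python
def fitl(word, can):
    # Length-difference feature, mapped into [0, 1] (1 = equal lengths).
    return 1.0 - abs(len(word) - len(can)) / float(max(len(word), len(can)))

def fitm(word, can):
    # Count 'private' letters (letters occurring in only one of the two
    # words) and map the count into [0, 1] (1 = no private letters).
    shared = set(word) & set(can)
    private = [c for c in word + can if c not in shared]
    return 1.0 - len(private) / float(len(word) + len(can))

def combined(word, can, weights=(0.5, 0.5)):
    # Linear interpolation of the feature scores, as in Section 3.1.
    return weights[0] * fitl(word, can) + weights[1] * fitm(word, can)
```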

3.3 Results and Discussion


Feature transmuteI would intuitively yield the best result, and this was confirmed by the
experiments.1 From observation, transmuteI alone yielded satisfactory results, but it
required the most computational effort. The other features could be used to limit the search
scope by filtering out unlikely candidates; fitc was chosen mainly because of its faster
execution time.
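The four distance-1 steps of transmuteI correspond to the Damerau-Levenshtein edit distance. A minimal dynamic-programming sketch (the actual implementation in the program may differ):

```python
def transmute_steps(word, can):
    # Edit distance with delete, add, change, and adjacent swap as the
    # four distance-1 transform steps (restricted Damerau-Levenshtein).
    m, n = len(word), len(can)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if word[i - 1] == can[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # add
                          d[i - 1][j - 1] + cost)  # change (or match)
            if (i > 1 and j > 1 and word[i - 1] == can[j - 2]
                    and word[i - 2] == can[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap adjacent
    return d[m][n]
```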
Currently there are several limitations:
• Word morphology (not yet implemented)
• Inability to detect valid but semantically wrong words such as piece, peace. . .

• In some cases, less similar words, or words of equal similarity, are actually preferred
because of typing habits or slips of the fingers.
These limitations can be partially addressed by integrating a parser. Another solution is to
use statistical information, which conveniently reflects natural usage trends. However,
corpora of spelling errors are not common, and insufficient data may worsen the problem.

4 POS Tagging
4.1 Method
• Developed a language model using a hidden Markov model (HMM). Linear interpolation of
trigram, bigram, and unigram models was used to improve performance. Prefix/suffix and
regular expression taggers are called upon failure of the HMM model.

• Trained the tagger using tagged corpora (Brown, Treebank).

• Assessment: the tagged corpus was divided into a training set, a held-out set, and a
testing set. The testing set was further divided into 10 subsets; the mean and standard
deviation of the accuracies were then calculated.

• Collected performance data and carried out adjustments.
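As an illustration of the suffix fallback tagger, a self-contained sketch (the program presumably uses NLTK's equivalent; the function names here are hypothetical):

```python
from collections import Counter, defaultdict

def train_suffix_tagger(tagged_words, suffix_len=3):
    # Map each word ending to its most frequent tag in the training data.
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-suffix_len:].lower()][tag] += 1
    return {suf: c.most_common(1)[0][0] for suf, c in counts.items()}

def tag_word(model, word, suffix_len=3, default='NN'):
    # Fall back to a default tag for unseen suffixes.
    return model.get(word[-suffix_len:].lower(), default)
```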

4.2 Mathematical Background


4.2.1 Language Model
The tagging problem can be defined for the trigram model as follows: given a sequence of
tags [t1, t2] (POS 1 and 2), find the probability of the sequence being followed by word w3
which has POS t3, i.e.
P(t3, w3 | t1, t2) = P(w3 | t1, t2, t3) × P(t3 | t1, t2) = P(w3 | t3) × P(t3 | t1, t2) (1)
The formula is derived by assuming that w3 depends only on t3.
Similarly, the bigram and unigram model can be defined:

P (t3 , w3 |t2 ) = P (w3 |t3 ) × P (t3 |t2 ) (2)

P (t3 , w3 ) = P (w3 |t3 ) × P (t3 ) (3)


1 All experiments will be revised to ensure reliability and are not covered in this report.

The solution can be found using linear interpolation of (1), (2), and (3):

argmax ∏(1≤i≤n) P(wi | ti) × (λ3 × P(ti | ti−2, ti−1) + λ2 × P(ti | ti−1) + λ1 × P(ti)) (4)

in which λ1, λ2, λ3 are the weights of the classifiers, λ1 + λ2 + λ3 = 1, and t−1 = t−2 = 'BOS'
(Beginning of Sentence). Logarithms are used to avoid numerical underflow as the products
become small.
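One per-word term of equation (4) might be computed in log space as follows (a sketch with hypothetical probability tables; the tiny floor value is an assumption to avoid log(0)):

```python
import math

def interpolated_log_score(w, t, t1, t2, emit, tri, bi, uni, lambdas):
    # One term of equation (4): log P(w|t) plus the log of the linearly
    # interpolated transition probability. Unseen events contribute 0
    # inside the interpolation; a small floor keeps log() defined.
    l1, l2, l3 = lambdas
    trans = (l3 * tri.get((t1, t2, t), 0.0)
             + l2 * bi.get((t2, t), 0.0)
             + l1 * uni.get(t, 0.0))
    return math.log(emit.get((w, t), 1e-12)) + math.log(max(trans, 1e-12))
```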

4.2.2 Estimating Lambda Value


For each trigram in the training data, compare the following values:

(C(t1, t2, t3) − 1) / (C(t1, t2) − 1),  (C(t2, t3) − 1) / (C(t2) − 1),  (C(t3) − 1) / (N − 1)

where N is the total number of tag tokens. Depending on which of them is the maximum,
increase the corresponding lambda (λ3, λ2, or λ1 respectively) by a certain amount. The
candidate amounts were: 1, C(t1, t2, t3), and C(t1, t2, t3) / C(t1, t2).
The reason for subtracting 1 is that the trigram under consideration is treated as an
observed event, so it must be removed from the counts. For this reason, trigrams seen
only once were skipped.
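The estimation procedure above can be sketched as follows (a simplification using the trigram count C(t1, t2, t3) as the increment, one of the candidate amounts mentioned above):

```python
def estimate_lambdas(tri_counts, bi_counts, uni_counts, total):
    # Deleted-interpolation-style lambda estimation: for each trigram seen
    # more than once, vote for the model order whose held-out relative
    # frequency (with the current trigram removed) is largest.
    def ratio(num, den):
        return (num - 1.0) / (den - 1.0) if den > 1 else 0.0
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), c in tri_counts.items():
        if c < 2:  # skip trigrams seen only once, as in the report
            continue
        f3 = ratio(c, bi_counts.get((t1, t2), 0))
        f2 = ratio(bi_counts.get((t2, t3), 0), uni_counts.get(t2, 0))
        f1 = ratio(uni_counts.get(t3, 0), total)
        best = max(f1, f2, f3)
        if best == f3:
            l3 += c
        elif best == f2:
            l2 += c
        else:
            l1 += c
    s = l1 + l2 + l3
    return (l1 / s, l2 / s, l3 / s) if s else (1.0 / 3, 1.0 / 3, 1.0 / 3)
```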

4.3 Results and Discussion

Tagger                               Mean (percent)   Standard deviation
Regular expression                   7.14             2.23
Suffix tagger                        28.56            6.38
Prefix tagger                        27.15            6.84
HMM tagger (most likely tags only)   85.68            4.42

Table 1: Accuracies

The suffix tagger had slightly better accuracy and deviation than the prefix tagger, and an
interpolation of the two fell somewhere in between, so the suffix tagger was chosen.
It was surprising to achieve such decent accuracy just from the first or last characters of a word.
As for the HMM tagger, simply going from left to right and choosing the most likely tags
yielded decent accuracy and speed. However, the outcomes might not be very useful for other
tasks such as information extraction, due to misleading tags; searching for the most probable
path is therefore more desirable. The drawback is that it requires much more computational effort.
The tagger worked well provided that the first word was tagged correctly. If not, the error
propagated along the sentence and quite often severely worsened the accuracy. Separators
such as commas or quotes, which are almost always tagged correctly, had the effect of
attenuating the error propagation.
The execution time was significantly longer. There were 170 possible tags, so the search
space could easily reach several hundred thousand states within just 10 or 15 words. Either
a probability threshold or a limited search range can be used to save memory and time, but
this risks missing better solutions. Either way, execution still took a significant time, so
a full-scale test was not practical.
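The limited-search-range idea can be sketched as a beam search over tag sequences (a simplification; the score argument stands for any per-step log-probability function such as a term of equation (4)):

```python
def beam_tag(words, tags, score, beam=5):
    # Left-to-right beam search: keep only the `beam` best partial tag
    # sequences at each step instead of expanding the full search space.
    paths = [((), 0.0)]  # (partial tag sequence, accumulated log score)
    for w in words:
        expanded = [(seq + (t,), s + score(w, t, seq))
                    for seq, s in paths for t in tags]
        expanded.sort(key=lambda p: p[1], reverse=True)
        paths = expanded[:beam]
    return paths[0][0]
```

A larger beam trades time and memory for a lower risk of pruning the best path.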
Testing on smaller scales gave accuracies ranging from 100% down to as low as 25%. A crawler
was developed to detect poor-performance samples for analysis, though it should be used only
on the held-out data rather than the testing data. In many cases, it turned out that the
standard solutions had been filtered out by the limited search space, not because they were
less fit; this indicates that the model itself is reliable.
Another possible solution is a genetic algorithm (GA), a well-known approach for dealing
with gigantic search spaces. However, a previous application of GA to a string segmentation
task gave inconsistent results, so there is no plan to implement the algorithm at the moment,
though it remains a promising possibility.
