
Tokenization & POS-Tagging

presented by:
Yajing Zhang
Saarland University
yazhang@coli.uni-sb.de
Outline

- Tokenization
  - Importance
  - Problems & solutions
- POS tagging
  - HMM tagger
  - TnT statistical tagger



Why Tokenization?

- Tokenization: the isolation of word-like units from a text.
- Tokens are the building blocks of other text processing.
- The accuracy of tokenization affects the results of higher-level processing, e.g. parsing.



Problems of tokenization

- Definition of a token
  - United States, AT&T, 3-year-old
- Ambiguity of punctuation as a sentence boundary
  - Prof. Dr. J.M.
- Ambiguity in numbers
  - 123,456.78



Some Solutions

- Using regular expressions to match numbers and abbreviations (see the sketch below)
  - ([0-9]+[,])*[0-9]+([.][0-9]+)?
  - [A-Z][bcdfghj-np-tvxz]+\.
- Using a corpus as a filter to identify abbreviations
- Using a lexical list (the most important abbreviations are listed)
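A minimal sketch of how the two patterns above could be combined in a regex-based tokenizer; the `tokenize` function and the fallback alternatives (ordinary words, single punctuation marks) are illustrative assumptions, not part of the slides.

```python
import re

# Patterns from the slide: numbers like 123,456.78 and simple
# consonant-only abbreviations like "Dr." or "Mr."
NUMBER = r"([0-9]+[,])*[0-9]+([.][0-9]+)?"
ABBREV = r"[A-Z][bcdfghj-np-tvxz]+\."

# The word and punctuation alternatives are assumptions added so the
# sketch also tokenizes ordinary text.
TOKEN = re.compile(ABBREV + "|" + NUMBER + r"|\w+|[^\w\s]")

def tokenize(text):
    """Return the non-overlapping matches as a token list."""
    return [m.group(0) for m in TOKEN.finditer(text)]

print(tokenize("Dr. Smith paid 123,456.78 dollars."))
# -> ['Dr.', 'Smith', 'paid', '123,456.78', 'dollars', '.']
```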



POS Tagging

- Labeling each word in a sentence with its appropriate part of speech
- Information sources in tagging:
  - Tags of other words in the context
  - The word itself
- Different approaches:
  - Rule-based tagger
  - Stochastic POS tagger
    - Simplest stochastic tagger
    - HMM tagger



Simplest Stochastic Tagger

- Each word is assigned its most frequent tag (the tag most frequently encountered in the training set); sketched below
- Problem: it may assign a valid tag to each word but produce an unacceptable tag sequence
  - "Time flies like an arrow" → NN VBZ VB DT NN
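A minimal sketch of this baseline, assuming a tiny hand-made training set; the data and the default tag for unseen words are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented toy training data: (word, tag) pairs.
training = [
    ("time", "NN"), ("time", "NN"),
    ("flies", "VBZ"), ("flies", "VBZ"), ("flies", "NNS"),
    ("like", "VB"), ("like", "VB"), ("like", "IN"),
    ("an", "DT"), ("arrow", "NN"),
]

# Count how often each tag occurs with each word.
tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Assign the tag most frequently seen with the word in training."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default  # unknown word: fall back to an assumed default tag

print([(w, most_frequent_tag(w)) for w in "time flies like an arrow".split()])
# -> [('time', 'NN'), ('flies', 'VBZ'), ('like', 'VB'), ('an', 'DT'), ('arrow', 'NN')]
```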



Markov Models (MM)

- In a Markov chain, the next element of the sequence depends only on the current element, not on the past elements
- X = (X1, …, XT) is a sequence of random variables, S = {s1, …, sN} is the state space
  a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i), \qquad a_{ij} \ge 0 \;\; \forall i, j \qquad \text{and} \qquad \sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i.



Example of Markov Models (MM)

[Figure omitted; cf. Manning & Schütze, 1999, page 319]



Hidden Markov Model

- In a (visible) MM, we know which state sequence the model passes through, so the state sequence is regarded as the output
- In an HMM, we do not know the state sequence, only some probabilistic function of it
- Markov models can be used wherever one wants to model the probability of a linear sequence of events
- An HMM can be trained from unannotated text



HMM Tagger

- Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time
- An HMM tagger uses states to represent POS tags and outputs (symbol emissions) to represent the words
- The tagging task is to find the most probable tag sequence for a sequence of words



Finding the most probable sequence

[Figure omitted; cf. Erhard Hinrichs & Sandra Kübler]


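A minimal Viterbi-style sketch of finding the most probable tag sequence under the HMM assumptions above; the toy tag set and the start, transition, and emission tables are invented for illustration, not taken from the referenced figure.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` (toy-sized, no logs)."""
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p.get(t, 0.0) * emit_p.get(t, {}).get(words[0], 0.0)
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (best[i - 1][p] * trans_p.get(p, {}).get(t, 0.0)
                 * emit_p.get(t, {}).get(words[i], 0.0), p)
                for p in tags)
            best[i][t], back[i][t] = prob, prev
    # Follow the back pointers from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Invented toy model.
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
trans_p = {"DT": {"NN": 0.9, "VBZ": 0.05, "DT": 0.05},
           "NN": {"VBZ": 0.6, "NN": 0.3, "DT": 0.1},
           "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1}}
emit_p = {"DT": {"the": 0.7},
          "NN": {"dog": 0.4, "barks": 0.1},
          "VBZ": {"barks": 0.3}}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> ['DT', 'NN', 'VBZ']
```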
HMM tagging – an example

[Figure omitted; cf. Erhard Hinrichs & Sandra Kübler]



HMM tagging – an example

[Figure omitted; cf. Erhard Hinrichs & Sandra Kübler]



Calculating the most likely sequence

[Figure omitted; green: transition probabilities, blue: emission probabilities]



Dealing with unknown words

- The simplest model: assume that unknown words can have any POS tag, or assign them the most frequent tag in the tagset
- In practice, morphological information such as the suffix is used as a hint



TnT (Trigrams’n’Tags)

- A statistical tagger based on Markov models: states represent tags and outputs represent words
- The best tag sequence t1 … tT is found by calculating:
  \arg\max_{t_1 \ldots t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)



Transition and emission probabilities

- Transition and output probabilities are estimated from a tagged corpus:

  Bigrams:  \hat{P}(t_3 \mid t_2) = \frac{f(t_2, t_3)}{f(t_2)}

  Trigrams: \hat{P}(t_3 \mid t_1, t_2) = \frac{f(t_1, t_2, t_3)}{f(t_1, t_2)}

  Lexical:  \hat{P}(w_3 \mid t_3) = \frac{f(w_3, t_3)}{f(t_3)}
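A minimal sketch of these maximum-likelihood estimates; the count tables stand in for frequencies collected from a tagged corpus and are invented here.

```python
from collections import Counter

# Invented counts standing in for frequencies from a tagged training corpus.
tag_unigrams = Counter({"DT": 100, "NN": 150, "VBZ": 50})
tag_bigrams  = Counter({("DT", "NN"): 90, ("NN", "VBZ"): 40})
tag_trigrams = Counter({("DT", "NN", "VBZ"): 35})
word_tag     = Counter({("dog", "NN"): 20})

def p_bigram(t3, t2):
    """P_hat(t3 | t2) = f(t2, t3) / f(t2)"""
    return tag_bigrams[(t2, t3)] / tag_unigrams[t2]

def p_trigram(t3, t1, t2):
    """P_hat(t3 | t1, t2) = f(t1, t2, t3) / f(t1, t2)"""
    return tag_trigrams[(t1, t2, t3)] / tag_bigrams[(t1, t2)]

def p_lexical(w3, t3):
    """P_hat(w3 | t3) = f(w3, t3) / f(t3)"""
    return word_tag[(w3, t3)] / tag_unigrams[t3]

print(p_bigram("NN", "DT"), p_trigram("VBZ", "DT", "NN"), p_lexical("dog", "NN"))
# -> 0.9  0.3888...  0.1333...
```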



Smoothing Technique

- Needed due to the sparse-data problem
  - Many trigram frequencies are likely to be zero in a limited corpus
  - Without smoothing, the complete probability of a sequence then becomes zero
- Smoothing by linear interpolation:
  P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)

  where \lambda_1 + \lambda_2 + \lambda_3 = 1



Other techniques

- Handling unknown words (see the sketch below)
  - The longest suffix (the final sequence of characters of a word) is used as a strong predictor for word classes
  - The probability of a tag t is estimated from the last m letters l_i of an n-letter word; m depends on the specific word
- Capitalization
  - Works better for English than for German
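A rough sketch of the suffix idea; the suffix statistics and the simple longest-match back-off are simplified assumptions, not the exact TnT suffix model.

```python
# Invented suffix statistics P(tag | suffix), as might be collected from the
# rare words of a training corpus.
suffix_tags = {
    "ing":  {"VBG": 0.7, "NN": 0.2, "JJ": 0.1},
    "ly":   {"RB": 0.8, "JJ": 0.2},
    "tion": {"NN": 0.95, "VB": 0.05},
}

def guess_tag(word, max_suffix_len=4, default="NN"):
    """Guess a tag for an unknown word from its longest known suffix."""
    for m in range(min(max_suffix_len, len(word)), 0, -1):
        suffix = word[-m:]
        if suffix in suffix_tags:
            dist = suffix_tags[suffix]
            return max(dist, key=dist.get)
    return default  # no known suffix: fall back to an assumed default tag

print(guess_tag("recalibrating"), guess_tag("suddenly"), guess_tag("frobnication"))
# -> VBG RB NN
```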



Evaluation

- Corpora:
  - German NEGRA corpus, around 355,000 tokens
  - WSJ (Wall Street Journal) in the Penn Treebank, around 1.2 million tokens
- 10-fold cross-validation
- The tagger assigns probabilities as well as tags to words; the probabilities can be used to rank different assignments



Results for German and English



POS Learning Curve for NEGRA



Learning Curve for Penn Treebank



Conclusion

- Good results for both the German and the English corpus
- The average accuracy TnT achieves is between 96% and 97%
- The accuracy for known tokens is significantly higher than for unknown tokens



References:

- What is a Word, What's a Sentence? (Grefenstette 1994)
- POS-Tagging and Partial Parsing (Abney 1996)
- TnT – A Statistical Part-of-Speech Tagger (Brants 2000)
- Foundations of Statistical Natural Language Processing (Manning & Schütze 1999)

