Study on Named Entity Recognition for Polish Based on Hidden Markov Models*

Michał Marcińczuk
michal.marcinczuk@pwr.wroc.pl

Maciej Piasecki
maciej.piasecki@pwr.wroc.pl

*Work financed by the European Union within the Innovative Economy Programme, project POIG.01.01.02-14-013/09

Our Task: To evaluate the accuracy of Named Entity Recognition for Polish based on Hidden Markov Models.
We evaluated the recognition of PERSON-type named entities, that is, linearly continuous expressions that refer to a person and are composed of a first name, second name, last name, maiden name or pseudonym.
Sample:
Positive patterns: <first name> <last name> <first name> <last name> <initial> <last name> <first name> <second name> <last name> <first name> <last name>-<maiden name> <first name> <last name> (<last name>) <first name> i <first name> <last name>
Negative patterns: <last name> & <last name> Company (company name); <first name> <last name> University (institution name); <first name> <last name> Square (location name)

Corpora:

Stock Exchange Reports: 1215 documents from GPWInfoStrefa.pl, consisting of 10 066 sentences and 282 418 tokens, with 654 PERSON annotations. The corpus is available at http://nlp.pwr.wroc.pl/gpw/download

Police Reports (Graliński et al., 2009): 11 statements produced by witnesses and suspects, consisting of 1 583 sentences and 29 569 tokens, with 555 PERSON annotations. The corpus was used for cross-domain evaluation.

Baseline:

Heuristic: matches a sequence of words in which each word starts with an upper-case letter.
Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) (Piskorski, 2004).
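The two baselines can be illustrated with a minimal sketch. It assumes that each baseline returns maximal runs of accepted tokens; the helper names and the two-entry gazetteer are hypothetical and stand in for the 63 555-entry dictionary.

```python
from typing import Callable, List, Sequence, Set, Tuple

Span = Tuple[int, int]  # token offsets, end exclusive


def maximal_runs(tokens: Sequence[str], accept: Callable[[str], bool]) -> List[Span]:
    """Return maximal runs of consecutive tokens accepted by `accept`."""
    spans: List[Span] = []
    start = None
    for i, tok in enumerate(tokens):
        if accept(tok):
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(tokens)))  # run reaching the end of the sentence
    return spans


def heuristic_baseline(tokens: Sequence[str]) -> List[Span]:
    """Heuristic: every token in the run starts with an upper-case letter."""
    return maximal_runs(tokens, lambda t: t[:1].isupper())


def gazetteer_baseline(tokens: Sequence[str], names: Set[str]) -> List[Span]:
    """Gazetteers: every token in the run is present in the name dictionary."""
    return maximal_runs(tokens, lambda t: t in names)


# Hypothetical toy input: the example sentence used elsewhere on the poster.
tokens = "Pan Jan Nowak został nominowany na stanowisko prezesa .".split()
toy_gazetteer = {"Jan", "Nowak"}
heuristic_spans = heuristic_baseline(tokens)
gazetteer_spans = gazetteer_baseline(tokens, toy_gazetteer)
```

On this sentence the heuristic also captures the title "Pan", which shows how capitalisation alone can produce boundary errors and false positives.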

HMM:

LingPipe (Alias-i, 2008) implementation of HMM:
7 hidden states for every annotation type,
3 additional states (BOS, EOS, middle token),
Witten-Bell smoothing,
first-best decoder based on the Viterbi algorithm,
rescoring based on the language model.
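As an illustration of the smoothing step, the sketch below shows a generic interpolated Witten-Bell estimate for emission probabilities with a uniform back-off distribution. It is not LingPipe's implementation; the class name, the uniform back-off and the vocabulary size are assumptions made for the example.

```python
from collections import Counter, defaultdict


class WittenBellEmissions:
    """Interpolated Witten-Bell estimate of P(word | state), uniform back-off."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size          # assumed closed vocabulary size
        self.counts = defaultdict(Counter)    # state -> word -> count

    def observe(self, state: str, word: str) -> None:
        self.counts[state][word] += 1

    def prob(self, state: str, word: str) -> float:
        seen = self.counts[state]
        n = sum(seen.values())   # number of tokens emitted by the state
        t = len(seen)            # number of distinct word types emitted by the state
        uniform = 1.0 / self.vocab_size
        if n == 0:
            return uniform       # nothing observed: fall back entirely
        lam = n / (n + t)        # weight of the maximum-likelihood estimate
        return lam * seen[word] / n + (1.0 - lam) * uniform


# Hypothetical toy counts for a single state.
model = WittenBellEmissions(vocab_size=4)
for w in ["Jan", "Jan", "Jan", "Nowak"]:
    model.observe("PER", w)
```

The more distinct word types a state has emitted relative to its token count, the more probability mass is reserved for unseen words.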

Pan Jan Nowak został nominowany na stanowisko prezesa. (Mr. Jan Nowak was nominated for the chairman position.)

Token         State
(start)       BOS
Pan           E-O-PER
Jan           B-PER
Nowak         E-PER
został        B-O-PER
nominowany    W-O
na            W-O
stanowisko    W-O
prezesa       W-O
.             W-O
(end)         EOS

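First-best decoding with the Viterbi algorithm can be sketched as follows. This is a minimal illustration on a toy two-state model (PER and O), not the LingPipe tag set; all probabilities and the log-score for unknown words are hypothetical values chosen for the example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit, unk=-20.0):
    """Return the single most probable state sequence, computed in log space."""
    if not observations:
        return []
    # best[t][s]: best log-score of any path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s].get(observations[0], unk) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s].get(observations[t], unk)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def logs(table):
    """Convert a dict of probabilities into log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Hypothetical toy parameters.
states = ["O", "PER"]
log_start = logs({"O": 0.8, "PER": 0.2})
log_trans = {"O": logs({"O": 0.8, "PER": 0.2}), "PER": logs({"O": 0.4, "PER": 0.6})}
log_emit = {
    "O": logs({"Pan": 0.3, "został": 0.3, "nominowany": 0.4}),
    "PER": logs({"Jan": 0.5, "Nowak": 0.5}),
}
decoded = viterbi(["Pan", "Jan", "Nowak", "został"], states, log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on long sentences, and the back-pointer table makes recovery of the best path linear in the sentence length.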
Web-based application used to annotate and browse the corpora.

http://nlp.pwr.wroc.pl/gpw
User demo: Corpus: poligon, User: test, Password: test

Errors:

We have analysed the errors produced by the HMM and divided them into 6 groups:

Incorrect proper name category: 187 false positives
Lowercase and non-alphabetic expressions: 99 false positives
Incorrect annotation boundaries: 35 partially matched annotations
Missing annotations: 18 missing annotations
Common words starting with an uppercase character: 6 false positives

Postprocessing:

To improve the precision of HMM we applied a simple filter that matches only sequences of words starting with an upper case character and optionally separated by a hyphen. The rule was written as the following regular expression:
WORD = "([A-Z])([a-z])*";
PATTERN = "/^WORD( WORD)+( - WORD)?( (WORD))?\$"
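The filter can be sketched in Python as follows. The sketch assumes that the inner parentheses of the pattern are meant as literal brackets, as in the positive pattern <first name> <last name> (<last name>), and that f1+ and f2+ differ only in the minimum number of words. The function names are hypothetical.

```python
import re

# Word definition as on the poster: ASCII letters only, so a word
# containing a Polish diacritic (e.g. "Michał") is rejected.
WORD = r"[A-Z][a-z]*"


def build_filter(min_tokens: int) -> "re.Pattern[str]":
    """Compile the filter for f1+ (min_tokens=1) or f2+ (min_tokens=2)."""
    if min_tokens not in (1, 2):
        raise ValueError("min_tokens must be 1 or 2")
    repeat = "*" if min_tokens == 1 else "+"
    # Assumption: the bracketed group matches literal parentheses.
    return re.compile(rf"^{WORD}( {WORD}){repeat}( - {WORD})?( \({WORD}\))?$")


def keep(annotation: str, min_tokens: int) -> bool:
    """Return True if an HMM annotation passes the post-processing filter."""
    return build_filter(min_tokens).fullmatch(annotation) is not None
```

For example, keep("Jan Nowak", 2) accepts the annotation, while keep("Jan", 2) rejects it and keep("Jan", 1) accepts it. A character class covering Polish letters would be needed before applying such a filter to real data.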

Results:
                 Precision    Recall     F1-measure

10-fold Cross Validation on the Stock Exchange Reports
Heuristic          0.85 %     41.74 %      1.67 %
Gazetteers         9.47 %     41.44 %     15.42 %
HMM               61.82 %     92.35 %     74.05 %
HMM + f1+         74.49 %     90.21 %     81.66 %
HMM + f2+         89.00 %     89.60 %     89.33 %

Cross-domain evaluation
HMM               28.04 %     47.74 %     35.33 %
HMM + f1+         63.61 %     47.67 %     54.43 %
HMM + f2+         83.49 %     32.79 %     47.09 %

HMM + f is the HMM recognizer combined with filter post-processing, where f1+ accepts a sequence of one or more tokens and f2+ accepts a sequence of two or more tokens.
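The measures reported above can be computed as in the sketch below. It assumes exact-match scoring, in which an annotation counts as correct only if both of its boundaries agree with the gold standard; the function names and the toy spans are hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def precision_recall_f1(gold, predicted):
    """Score predicted spans against gold spans using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1(precision, recall)


# Hypothetical toy spans (start, end) in token offsets.
gold_spans = {(1, 3), (5, 6)}
predicted_spans = {(1, 3), (0, 3)}
scores = precision_recall_f1(gold_spans, predicted_spans)
```

Applied to the Heuristic row, f1(0.85, 41.74) gives about 1.67, which agrees with the reported F1-measure.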

Model Granularity:
                 Precision    Recall     F1-measure

PERSON
HMM               61.82 %     92.35 %     74.05 %
HMM + f1+         74.49 %     90.21 %     81.66 %

FIRST NAME
HMM               76.05 %     98.54 %     85.84 %
HMM + f1+         89.13 %     97.22 %     93.00 %

LAST NAME
HMM               73.73 %     93.70 %     82.53 %
HMM + f1+         83.06 %     90.48 %     86.62 %

f2+ was biased towards the stock exchange corpus, so it was not considered in the evaluation of model granularity.

References:

[1] Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)
[2] Graliński, F., Jassem, K., Marcińczuk, M.: An Environment for Named Entity Recognition and Translation. In: Màrquez, L., Somers, H. (eds.) Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pp. 88-95. Barcelona, Spain (2009)
[3] Piskorski, J.: Extraction of Polish named entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313-316. ACL, Prague, Czech Republic (2004)