
Rich Set of Features for Proper Name Recognition in Polish Texts

Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał


Wrocław University of Technology, Wrocław, Poland

Introduction

Statistical recognition of named entities (NEs) in Polish is more difficult than in English, mainly because of less constrained word order and richer morphology. As a result, the number of word sequences corresponding to a multi-word NE is relatively high, and a complex statistical model is required to cover this diversity. The diversity can be reduced by combining raw observations with some kind of generalization, i.e. by introducing several levels of granularity into the text representation.

Experiments with HMM showed (cf. [1]) that HMM often makes wrong decisions because the important premises appear in the close right context, which is not available to HMM. This problem also seems to be more serious for languages with less constrained word order than for English. Sample errors are:

- siedziba w Nowym Sadzie w Republice Serbii (office in Nowy Sad in the Republic of Serbia): HMM recognised Republice Serbii as a country name, but not Nowym Sadzie as a city name. However, a more sophisticated model could recognise that Nowym Sadzie is a city name from the left and right context taken together;
- Elektroproizvodnja-ZPUE D.O.O. (D.O.O. is an abbreviation for limited liability company in Serbian): HMM recognised Elektroproizvodnja as a road name, while D.O.O. in the right context indicates a company name.

Our goal is to develop a general method for NE recognition (NER) for Polish. Because the resources for this task in Polish are limited and still under intensive development (e.g. [1, 5]), we have limited our scope to five categories of proper names: first names, surnames, and names of countries, cities and roads. We aim at processing non-literary texts (newspaper articles, reports, brochures, etc.).

Approach

We have defined a set of 34 features that are used to describe a word occurrence in the sequence. The features are:
1. Orthographic features: word form, morphological base form, n-character prefixes and suffixes, as well as selected patterns of characters (8 categories, e.g. all upper, digits, upper init, etc.).
2. Binary orthographic features indicating the presence of given characters in the word; based on the filtering rules from [1].


3. Wordnet-based features reducing the observation diversity: based on synonyms from plWordNet and hypernyms of the word within the distance of n.
4. Morphological features: the complete tag (according to the IPI PAN tagset [4]) as disambiguated by TaKIPI, and selected attributes: part of speech, case, gender, number.
5. Gazetteer-based features: one feature for every gazetteer. If a sequence of words is found in a gazetteer, the first word in the sequence is set as B and the others as I; O is assigned to words not covered by the entries.

CRF is a modern machine learning method applied successfully to labelling sequence data in many NLP tasks, e.g. in shallow parsing or NER. CRF-based NER models outperform HMM models because CRF can utilise additional context features encoding observations in a non-linear manner. CRF is able to analyse a much broader context than HMM-based methods and to utilise features encoding both preceding and following observations.

Our goal is to improve NER with respect to the problems identified earlier. We aim at examining a new set of features and their influence on NER. We do not address CRF learning algorithms or normalization factors in CRF; instead we apply parameters typically used for English [2] and a state-of-the-art stochastic gradient descent learning method [6].
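The gazetteer-based features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `gazetteer_bio`, the toy gazetteer, the longest-match-first strategy and the choice to match on base forms are all assumptions made for illustration, since the text only states how B, I and O are assigned.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def gazetteer_bio(tokens: Sequence[str],
                  entries: Iterable[Sequence[str]]) -> List[str]:
    """Label each token B, I or O against a single gazetteer."""
    entry_set: Set[Tuple[str, ...]] = {tuple(e) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Assumption: prefer the longest entry starting at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entry_set:
                matched = n
                break
        if matched:
            labels[i] = "B"                      # first word of the entry
            for j in range(i + 1, i + matched):
                labels[j] = "I"                  # remaining words of the entry
            i += matched
        else:
            i += 1                               # token stays O
    return labels


# Hypothetical toy gazetteer of city names (base forms).
cities = [("Nowy", "Sad"), ("Wrocław",)]
print(gazetteer_bio(["siedziba", "w", "Nowy", "Sad"], cities))
# prints ['O', 'O', 'B', 'I']
```

Running the function once per gazetteer would yield one such label column per gazetteer, matching the "one feature for every gazetteer" description.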

Experiments

In the experiments we used three corpora: a corpus of stock exchange reports (CSER), a corpus of police reports (CPR) and a corpus of economic news (CEN), see [1]. In the single-domain evaluation we followed 10-fold cross-validation on the revised CSER. Due to changes introduced in CSER we had to repeat the baseline experiments, see Table 1. The best configurations from [1] were applied. In 10-fold HMM all folds were used and HMM was combined with re-scoring based on heuristics and gazetteers. HMM + post is a cross-validation on folds 6-10 for HMM with re-scoring and rule-based post-processing. Table 1 shows that the correction of errors slightly improved the results. The baseline result for HMM is F1 = 89.75%.

Table 1. Baseline evaluation on CSER.

             CSER                       Revised CSER
             10-fold HMM   HMM + post   10-fold HMM   HMM + post
Precision    83.55%        85.28%       88.69%        89.84%
Recall       89.70%        88.56%       90.68%        89.66%
F1           86.52%        86.88%       89.67%        89.75%

The single-domain evaluation of CRF was performed on the revised CSER. CRF with the extended set of features and close context (previous, current and next token) obtained nearly the same level of recall as HMM but with a higher precision of 92.07%, achieving F1 = 90.75%, which is better than that of HMM. The highest results were obtained for wide context (3 preceding, current and 3 following tokens), i.e. F1 = 92.53% with a high precision of 95.20%. This confirms that the discriminative information appears in the wider context.

Table 2. Results of cross-validation of CRF on the CSER dataset.

             road      surname   first name   country   city      Total
All features for previous, current and next token + Filtering
Precision    93.33%    95.78%    94.67%       81.22%    92.53%    92.07%
Recall       84.15%    87.60%    81.38%       86.15%    95.14%    89.47%
F1           88.51%    91.51%    87.52%       83.61%    93.82%    90.75%
All features for wide context + Filtering
Precision    96.67%    97.85%    96.89%       89.67%    94.74%    95.20%
Recall       95.08%    87.88%    80.23%       82.68%    95.35%    90.00%
F1           95.87%    92.60%    87.77%       86.04%    95.04%    92.53%

To investigate the generality of the CRF model we evaluated it on cross-domain corpora. We trained the CRF model on CSER using the feature templates that achieved the best result in the cross-validation on CSER, and next we applied it to CEN and CPR. Due to changes in CSER, the experiments from [1] were repeated using the same best HMM configuration and used as a baseline: for CPR 67.49% of precision (P), 84.36% of recall (R) and 74.99% of F1; for CEN P = 54.83%, R = 76.95% and F1 = 64.03%.

Results of the cross-domain evaluation on CPR are presented in Table 3: F1 = 67.72% is lower by 7.27 percentage points than for HMM. However, the 92.88% precision of CRF is significantly better than that of HMM. Application of the wider context improved precision (by ca. 2 percentage points) but also reduced recall (by ca. 4 percentage points): the wider context tends to overtrain the model. Next, we tested the model on the other corpus, CEN. The results achieved on CEN are presented in Table 4: F1 increased by 8.32 percentage points. On both corpora CRF achieved very high overall precision. The worst results for CRF were achieved for the recognition of person names. The wider context improved precision at the cost of recall.
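The difference between the close context (previous, current and next token) and the wide context (3 preceding, current and 3 following tokens) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the function `window_features`, the feature names `orth` and `init_upper`, the `name[offset]` key format and the `pad` marker are hypothetical and do not reproduce the feature templates actually used in the experiments.

```python
from typing import Dict, Sequence


def window_features(rows: Sequence[Dict[str, str]], i: int,
                    radius: int) -> Dict[str, str]:
    """Copy the features of tokens at offsets -radius..+radius around i."""
    feats: Dict[str, str] = {}
    for offset in range(-radius, radius + 1):
        j = i + offset
        if 0 <= j < len(rows):
            for name, value in rows[j].items():
                feats[f"{name}[{offset}]"] = value
        else:
            # Hypothetical padding marker for positions outside the sentence.
            feats[f"pad[{offset}]"] = "1"
    return feats


words = ["siedziba", "w", "Nowym", "Sadzie", "w", "Republice", "Serbii"]
rows = [{"orth": w, "init_upper": str(w[:1].isupper())} for w in words]

close = window_features(rows, 2, radius=1)  # previous, current, next
wide = window_features(rows, 2, radius=3)   # 3 preceding, current, 3 following
print(close["orth[1]"])   # right-hand neighbour of "Nowym"
print(len(close), len(wide))
```

With the wide setting, the description of Nowym also carries observations about Republice three tokens to the right, which is the kind of right-context evidence an HMM cannot use.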

Table 3. Cross-domain evaluation on CPR.

             road      surname   first name   country   city      Total
All features for previous, current and next token
Precision    100.00%   93.06%    93.89%       100.00%   89.05%    92.88%
Recall       50.00%    48.91%    50.75%       81.48%    63.87%    53.29%
F1           66.67%    64.11%    65.89%       89.80%    74.39%    67.72%
All features for wide context
Precision    100.00%   92.82%    94.08%       100.00%   95.69%    94.48%
Recall       54.76%    44.04%    47.75%       81.48%    58.12%    49.40%
F1           70.77%    59.74%    63.35%       89.80%    72.31%    64.88%

Table 4. Cross-domain evaluation on CEN.

             road      surname   first name   country   city      Total
All features for current, next and previous token
Precision    71.43%    93.06%    96.57%       91.19%    79.91%    91.15%
Recall       16.13%    51.29%    58.98%       70.86%    55.71%    59.98%
F1           26.32%    66.13%    73.23%       79.75%    65.65%    72.35%
All features for wide context
Precision    62.50%    94.42%    97.05%       90.31%    80.87%    91.41%
Recall       16.13%    49.11%    57.06%       68.73%    50.84%    57.53%
F1           25.64%    64.61%    71.87%       78.06%    62.43%    70.62%
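The F1 values reported in the tables are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The paper does not spell this formula out, so the short Python check below is an assumption about how the scores were derived; the helper name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Total columns for CRF on the revised CSER, as reported in Table 2.
print(round(f1_score(92.07, 89.47), 2))  # close context, prints 90.75
print(round(f1_score(95.20, 90.00), 2))  # wide context, prints 92.53
```

The same check applied to the HMM baselines (for example P = 54.83% and R = 76.95% on CEN) gives the F1 values quoted in the text.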

Summary

In the paper we presented some limitations of HMM in the task of NE recognition, i.e. problems with encoding generalizations of linguistic information and with modelling contextual information from a two-sided context. To overcome these two limitations we applied CRF, a modern method for sequence labelling, to a rich set of features based on linguistic observations and used to reduce the observation diversity. In the single-domain cross-validation CRF outperformed HMM: CRF obtained an F-measure of 92.53%, while HMM obtained only 89.75%. In the cross-domain evaluation we trained the model on CSER and evaluated it on CPR and CEN. On both corpora we observed the same effect: precision increased but recall decreased. In the case of CEN the final result improved from 64.03% to 72.35%, but for CPR it decreased by 7.27 percentage points. The cross-domain evaluation has shown that CRF models are capable of fitting the training data very well. Unfortunately, CRF did not obtain high recall.

References
1. Marcińczuk, M., Piasecki, M.: Statistical Proper Name Recognition in Polish Economic Texts. To appear in Control and Cybernetics (2011)
2. Peng, F., McCallum, A.: Accurate Information Extraction from Research Papers Using Conditional Random Fields. In: HLT-NAACL, pp. 329-336 (2004)
3. Piskorski, J.: Extraction of Polish named entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004 (ELR, 2004), pp. 313-316, ACL, Prague, Czech Republic (2004)
4. Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version. Institute of Computer Science, Polish Academy of Sciences, Warsaw (2004)
5. Savary, A., Waszczuk, J., Przepiórkowski, A.: Towards the Annotation of Named Entities in the National Corpus of Polish. In: LREC 2010 proceedings (2010)
6. Vishwanathan, S., Schraudolph, N., Schmidt, M., Murphy, K.: Accelerated training of conditional random fields with stochastic gradient methods. In: Proceedings of ICML 06, pp. 969-976, ACM, New York, NY, USA (2006)