
Automatic Clustering of Part-of-speech for

Vocabulary Divided PLSA Language Model


A seminar report submitted to
JAWAHARLAL NEHRU TECHNOLOGY UNIVERSITY, ANANTAPUR

Name of student (Roll Number)

B. RAJESH KUMAR (07F61A0592)
Under the guidance of

C H SIVASHANKAR, M.Tech.
(Associate Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SIDDHARTH INSTITUTE OF ENGINEERING & TECHNOLOGY

(An ISO 9001:2000 Certified Institution)

(Approved by AICTE, New Delhi & Affiliated to JNTU, Anantapur)

Siddharth Nagar, Narayanavanam Road

Puttur-517 583
SEMINAR
ON
Automatic Clustering of Part-of-speech for
Vocabulary Divided PLSA Language Model
SUBMITTED
BY
B.RAJESH KUMAR

Abstract:
PLSA is one of the most powerful language models for adaptation to a target speech. The vocabulary divided PLSA language model (VD-PLSA) shows higher performance than the conventional PLSA model because it can be adapted to the target topic and the target speaking style individually. However, all of the vocabulary must be manually divided into three categories (topic, speaking style, and general). In this paper, an automatic method for clustering parts-of-speech (POS) is proposed for VD-PLSA.
Several corpora with different styles are prepared, and the distance between corpora in terms of each POS is calculated. A "general tendency score" and a "style tendency score" are then calculated for each POS based on these distances, and all of the POS are divided into three categories using the two scores and appropriate thresholds.
Experimental results show that the proposed method forms appropriate clusters, and VD-PLSA with the acquired categories gives the highest performance of all the compared models. We also applied VD-PLSA to a large vocabulary continuous speech recognition system. VD-PLSA improved the recognition accuracy for documents with a lower out-of-vocabulary ratio, while for other documents the accuracy was unchanged or decreased slightly.
1. Introduction

N-gram models are the most popular language models used in recent large vocabulary continuous speech recognition systems. In general, an n-gram is constructed from a huge amount of training data. If the training data consist of many topics and speaking styles, the n-gram acquires the average statistics of the training data. Such a model can be used as a general language model; however, it is well known that an n-gram adapted to the target speech shows higher performance than the general n-gram.
Many adaptation methods have been proposed, and they fall into two types: "static adaptation methods" and "dynamic adaptation methods". Static adaptation methods (e.g. [1], [2]) require a small amount of adaptation data in advance. Both a general corpus and the adaptation data are mixed together, and the adapted n-gram is calculated from the mixed corpus. This type of adaptation is effective; however, the adaptation data have to be prepared in advance, which means that new adaptation data must be collected and a new adapted n-gram must be trained whenever the target speech changes.
On the other hand, dynamic adaptation methods (e.g. [3], [4]) can be adapted to the target speech without re-training the n-gram. The mixture model [3] captures topics using a sentence-level mixture of n-gram probabilities. The mixture model is created in three steps. The first step clusters the training data using an automatic clustering technique. In the second step, individual n-grams are trained from each cluster. Finally, the mixture weights of the n-grams are estimated, and the probabilities from the n-grams are mixed using these weights. The mixture model can be adapted to the target speech by changing the mixture weights. The problem of the mixture model is that the clustering algorithm is not related to the mixing process, which means that the clustering result is not optimum from the point of view of perplexity or recognition performance.
PLSA [4] is one of the most powerful language models for adaptation to a target speech. It has many unigrams, called "latent models", and a prediction probability is calculated as the weighted sum of these unigrams. PLSA can also be adapted to the target speech by changing the weights with the EM algorithm using previously observed text. The unigram probability obtained from the PLSA is combined with the n-gram probability using the unigram rescaling technique [5].
The advantage of PLSA is that the clustering of the training text and the training of the model are performed jointly through the EM algorithm, which means that the clustering result is expected to be optimum from the perplexity point of view. In this paper, we focus on PLSA because of this advantage.
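To make the PLSA mechanism concrete, the following is a minimal Python sketch, not the authors' implementation: a PLSA-style unigram is formed as a weighted sum of latent-model unigrams and then combined with a trigram prediction by unigram rescaling. All arrays, weights and sizes are illustrative only.

```python
import numpy as np

def plsa_unigram(latent_unigrams, weights):
    """PLSA prediction: p(w | h) = sum over latent models z of P(z | h) * p(w | z)."""
    return weights @ latent_unigrams

def unigram_rescale(p_trigram, p_plsa, p_background):
    """Unigram rescaling: P(w | h, context) is proportional to (p_plsa(w | h) / p_bg(w)) * p_trigram(w | context)."""
    scores = (p_plsa / p_background) * p_trigram
    return scores / scores.sum()

# Toy example: three latent-model unigrams over a four-word vocabulary.
latent = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.10, 0.70, 0.10, 0.10],
                   [0.10, 0.10, 0.40, 0.40]])
weights = np.array([0.2, 0.5, 0.3])             # latent weights, adapted (e.g. by EM) to observed text
p_plsa = plsa_unigram(latent, weights)
p_trigram = np.array([0.25, 0.25, 0.25, 0.25])  # trigram prediction for the current word history
p_background = latent.mean(axis=0)              # static background unigram
print(unigram_rescale(p_trigram, p_plsa, p_background))
```

Adapting such a model to a target speech amounts to re-estimating `weights` on the previously observed text; the latent unigrams themselves stay fixed.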
Latent models are automatically acquired by the EM algorithm, and each latent model corresponds to a topic in the training data. If the training data consist of several speaking styles (such as formal, casual, dialog, and so on), the latent models also represent the speaking styles in the training data. For example, suppose there are three topics (A, B, and C) in the training data; topic A is spoken in a formal style (indicated as "f"), whereas topics B and C are spoken in a casual style ("c"). PLSA then makes three latent models: Af, Bc and Cc.
One of the biggest problems of PLSA is that topic and style are dealt with together when the latent models are estimated. If the target speech has topic B with a formal style (Bf), no latent model matches the target speech, even though both topic B and the formal style are included in the training data. In general, any topic can be combined with any speaking style. Adaptation would be more efficient if PLSA acquired topic-related models and style-related models individually, so that these models could be combined when recognizing the target speech.
In order to solve this problem, we have proposed the vocabulary divided PLSA language model (VD-PLSA) [6], [7]. In this method, it is assumed that all parts-of-speech (POS) are divided into three classes: a topic-related class, a style-related class, and a general class. For example, proper nouns are related to the topic, and auxiliary verbs are related to the speaking style. If a PLSA model is constructed using only topic-related words, each latent model corresponds only to a topic. In the same way, if a PLSA model is constructed using only style-related words, each model corresponds only to a speaking style. Combining these two PLSA models enables effective adaptation to any target speech on any topic with any speaking style.
The VD-PLSA showed higher performance than the conventional PLSA model. However, the VD-PLSA requires a manually-defined clustering of the POS. A large number of POS must be defined for optimum clustering, which is a difficult and time-consuming job. In this paper, we propose an automatic method for dividing the POS based on differences between probability distributions. This method yields a more appropriate clustering, and the VD-PLSA using the new clustering gives higher performance.

2. Overview of the VD-PLSA model

Figure 1 shows an overview of the VD-PLSA model. The system has three PLSA models corresponding to topic words, speaking-style words, and general words. In this paper, each PLSA model is called a "sub model". Each sub model is the same as a conventional PLSA model, except that it is built only from the vocabulary of its own category.

[Figure 1. Overview of the VD-PLSA model.]

A probability p(w_i | h_i) for a word w_i in a linguistic context h_i is calculated from the VD-PLSA model using Eq. (1):

p(w_i | h_i) = W(C(w_i) | h_i, w_{i-2}, w_{i-1}) \cdot P_{C(w_i)}(w_i | h_i, w_{i-2}, w_{i-1})    (1)

where the function C(w) returns the category (T: topic, S: style, G: general) that contains the word w, and the function P_x(w_i | h_i, w_{i-2}, w_{i-1}) denotes the word trigram combined with the sub model corresponding to category x using the unigram rescaling method [5]:

P_x(w_i | h_i, w_{i-2}, w_{i-1}) \propto \frac{p_x(w_i | h_i)}{p_x(w_i)} \cdot p(w_i | w_{i-2}, w_{i-1})    (2)

where p_x(w_i | h_i) is calculated by the PLSA model corresponding to category x.
The function W(x | h_i, w_{i-2}, w_{i-1}) denotes the appearance probability of category x after the word sequence w_{i-2}, w_{i-1} in context h_i. This probability is calculated from the unigram of category x in context h and the word trigram using unigram rescaling. The unigram p(x | h) is defined as Eq. (3):

p(x | h) = \frac{N_x(h)}{N(h)}    (3)

where N(h) denotes the number of words in context h, and N_x(h) denotes the number of words of category x in context h. The unigram rescaling is carried out using:

W(x | h_i, w_{i-2}, w_{i-1}) \propto \frac{p(x | h_i)}{\sum_{w \in x} p(w)} \cdot \sum_{w \in x} P_x(w | h_i, w_{i-2}, w_{i-1})    (4)

Note that the sub model corresponding to general words is not adapted to the context h, because the "general" model does not depend on h. Therefore, P_G(\cdot) can be calculated for any context h using:

P_G(w_i | h_i, w_{i-2}, w_{i-1}) = P_G(w_i | w_{i-2}, w_{i-1})    (5)
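As a rough, hypothetical illustration of how Eqs. (1)-(5) fit together, the Python sketch below rescales the topic and style sub models with Eq. (2), leaves the general sub model unadapted as in Eq. (5), and simply assumes that the category appearance probabilities W of Eq. (4) have already been estimated. All words, categories and numbers are invented for the example.

```python
# Toy vocabulary with one category per word: T = topic, S = style, G = general.
category = {"election": "T", "economy": "T", "um": "S", "please": "S", "the": "G", "of": "G"}

def rescaled_sub_model(p_plsa_wh, p_plsa_w, p_trigram):
    """Eq. (2): P_x(w | h, context) is proportional to (p_x(w | h) / p_x(w)) * p(w | context)."""
    raw = {w: (p_plsa_wh[w] / p_plsa_w[w]) * p_trigram[w] for w in p_plsa_wh}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}

# Illustrative adapted PLSA unigrams p_x(w | h), static unigrams p_x(w), and a trigram p(w | context).
p_plsa_wh = {"T": {"election": 0.8, "economy": 0.2}, "S": {"um": 0.6, "please": 0.4}}
p_plsa_w  = {"T": {"election": 0.5, "economy": 0.5}, "S": {"um": 0.5, "please": 0.5}}
p_trigram = {"election": 0.10, "economy": 0.15, "um": 0.05, "please": 0.05, "the": 0.40, "of": 0.25}

# Topic and style sub models are rescaled (Eq. (2)); the general sub model is not adapted (Eq. (5)).
Px = {x: rescaled_sub_model(p_plsa_wh[x], p_plsa_w[x], p_trigram) for x in ("T", "S")}
general_mass = p_trigram["the"] + p_trigram["of"]
Px["G"] = {w: p_trigram[w] / general_mass for w in ("the", "of")}

# Category appearance probabilities W(x | h, context) of Eq. (4), assumed here to be given.
W = {"T": 0.45, "S": 0.15, "G": 0.40}

# Eq. (1): p(w_i | h_i) = W(C(w_i) | h_i, context) * P_{C(w_i)}(w_i | h_i, context)
p = {w: W[category[w]] * Px[category[w]][w] for w in category}
print(p, "sum =", round(sum(p.values()), 3))
```

Because each sub model is normalised within its category and the category probabilities W sum to one, the combined probabilities also sum to one over the whole vocabulary.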
In this method, all POS are divided into the three categories by hand. In the experiments described in [7], 90 kinds of POS were defined and divided into the three categories. The categorized POS are shown in Table 1. The result seems reasonable; however, nobody knows whether it is the optimum clustering or not.

Table 1. Manually categorized POS (from [7])
  Topic:   Noun (general), Proper noun, Alphabet, Verb (independence), Noun as adverb, Noun (conjunctive)
  Style:   Pronoun, Number, Prefix, Conjunction, Adjective, Adverb, Adnoun, Filler, Exclamation
  General: Case-marking particle, Adnominal particle, Conjunctive particle, Charge particle

3. Automatic clustering of part-of-speech

3.1. Concept

The frequency of a topic-related word in a document does not change even if the speaking style changes; if the topic changes, however, the frequency also changes. In the same way, the frequency of style-related words depends only on the speaking style, and the frequency of general words does not change for any document. We therefore propose an automatic clustering method based on the relationship between the frequency of words and differences in the topic and/or style of documents.
We assume that all of the words included in a POS are assigned to the same category; in other words, no word overlaps between categories. For example, all of the words included in the proper-noun POS should be assigned to the same category, which may be the topic-related category. The frequency of each word included in the proper-noun POS should vary with the topic. In other words, the probability distribution of these words should depend on the topic, but not on the style. Let U_{Ts} be a unigram of proper-noun words calculated using a document with topic T and speaking style s. From the above discussion, the distance between U_{Af} and U_{Ac} is small, whereas the distance between U_{Af} and U_{Bf} is large. The new clustering method is based on the difference between the probability distributions of the words included in a POS for each document.
There are many possible POS definitions, and the number of words in a POS depends on the definition. For example, "noun" is one POS in a coarse definition; in a fine definition, however, it is divided into many POS, such as "proper noun", "numeral", "pronoun", and so on. In the same way, the POS "proper noun" can be divided into "person's name", "place", "food", etc. in a finer definition.
A finer definition is more appropriate for the proposed algorithm, because all of the words included in a POS should be assigned to the same category. However, a finer definition produces many POS, each of which contains only a few words. A unigram estimated from only a few words becomes similar to the unigrams calculated from documents of any topic and style, which is not suitable for the clustering. An appropriate size of POS is therefore needed to represent the topic or style dependency of words.

3.2. Distance between corpora with different styles

It is assumed that a text corpus consists of many documents, each of which has a single topic and a single style. Moreover, all documents in a corpus have the same style. For all words included in a POS M, the word unigram p(w) is calculated using all of the documents in the corpus. In the same way, a word unigram p_d(w) is calculated using only document d. The distance J(p_d, p) between the two unigrams is calculated by the following equation, which is the Jeffreys divergence:

J(p_d, p | M) = \sum_{w \in M} \{ p_d(w) - p(w) \} \log \frac{p_d(w)}{p(w)}    (6)

If the POS M relates to the topic, the distance becomes large, because p_d is specific to the topic of document d while p is the average distribution over many topics. On the other hand, if M relates to the style or general category, the distance becomes small. Topic-related POS can therefore be separated from the other categories using the average distance calculated over many documents in the corpus.
In order to separate style-related POS from general POS, we consider the distance between corpora. Several text corpora S_i are prepared, and it is assumed that each corpus has a different speaking style. The distance D(S_i, S_j | M) for a POS M is defined as follows:

D(S_i, S_j | M) = \frac{1}{N(S_i)} \sum_{d \in S_i} J(p_d, p^{(S_j)} | M)    (7)

where N(S_i) denotes the number of documents in the corpus S_i, and p^{(S_j)} denotes the unigram calculated from all of the documents in the corpus S_j.
If M relates to the style, D(S_i, S_j | M) becomes small only if i = j. On the other hand, if M relates to the general category, D(S_i, S_j | M) becomes small for any i and j. Moreover, D(S_i, S_j | M) is always large for any i and j if M is related to the topic. A POS M can thus be divided into the three categories using the distance D.
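A minimal Python sketch of Eqs. (6) and (7) is shown below. The add-one smoothing is our own addition to avoid zero probabilities (the paper does not say how zeros are handled), and the toy corpora and the POS vocabulary are purely illustrative.

```python
import math
from collections import Counter

def unigram(tokens, vocab):
    """Unigram over the words of one POS (the set `vocab`), with add-one smoothing."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def jeffreys(p_d, p, vocab):
    """Eq. (6): J(p_d, p | M) = sum over w in M of (p_d(w) - p(w)) * log(p_d(w) / p(w))."""
    return sum((p_d[w] - p[w]) * math.log(p_d[w] / p[w]) for w in vocab)

def corpus_distance(corpus_i, corpus_j, vocab):
    """Eq. (7): average Jeffreys divergence between each document of S_i and the unigram of the whole S_j."""
    p_j = unigram([t for doc in corpus_j for t in doc], vocab)
    return sum(jeffreys(unigram(doc, vocab), p_j, vocab) for doc in corpus_i) / len(corpus_i)

# Toy example: two small "corpora" of tokenised documents and one POS M = {"um", "please"}.
vocab_M = {"um", "please"}
S_formal = [["please", "note", "please"], ["please", "consider"]]
S_casual = [["um", "well", "um", "um"], ["um", "please", "um"]]
print(corpus_distance(S_formal, S_casual, vocab_M))   # cross-style distance: relatively large
print(corpus_distance(S_formal, S_formal, vocab_M))   # within-style distance: smaller
```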
3.3. General tendency score and style tendency score

In order to divide the POS into three categories automatically, two scores based on the distance D are proposed. First, the "general tendency score" g(M) is introduced to separate the general category from the others. If a POS M is related to the general category, the distance D(S_i, S_j | M) is small for any combination of i and j. The general tendency score g(M) is thus defined as follows:

g(M) = \left( \frac{1}{N^2} \sum_{i} \sum_{j} D(S_i, S_j | M) \right)^{-1}    (8)

where N denotes the number of corpora.
A "style tendency score" s(M) is also introduced. If a POS M is related to the style category, the distance D(S_i, S_j | M) is small only if i = j. This means that the difference between D(S_i, S_j | M) and D(S_j, S_j | M) is large for all i != j. On the other hand, if a POS M is not related to the style category, this difference becomes small, because either both D(S_i, S_j | M) and D(S_j, S_j | M) are large (M is related to the topic category) or both distances are small (M is related to the general category). The style tendency score s(M) is defined as follows:

s(M) = \sum_{i} \sum_{j} \{ D(S_i, S_j | M) - D(S_i, S_i | M) \}^2    (9)

Finally, an estimated category C(M) is determined using the two scores and appropriate thresholds \theta_s and \theta_g, which are given by hand:

C(M) = S if s(M) > \theta_s;  G if g(M) > \theta_g;  T otherwise    (10)
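Reading Eq. (8) as the reciprocal of the mean distance (a reading consistent with the rule "G if g(M) > \theta_g" and with the scores reported in Section 4.2), the two scores and the classification of Eq. (10) can be sketched as follows. The distance matrices are invented examples, and the default thresholds are the values given in Section 4.2.

```python
def general_tendency(D):
    """Eq. (8): reciprocal of the mean of D(S_i, S_j | M) over all corpus pairs."""
    n = len(D)
    mean = sum(D[i][j] for i in range(n) for j in range(n)) / (n * n)
    return 1.0 / mean

def style_tendency(D):
    """Eq. (9): sum of squared differences between D(S_i, S_j | M) and D(S_i, S_i | M)."""
    n = len(D)
    return sum((D[i][j] - D[i][i]) ** 2 for i in range(n) for j in range(n))

def classify(D, theta_s=6.0, theta_g=0.12):
    """Eq. (10): S if s(M) > theta_s, otherwise G if g(M) > theta_g, otherwise T."""
    if style_tendency(D) > theta_s:
        return "S"
    if general_tendency(D) > theta_g:
        return "G"
    return "T"

# Invented distance matrices D[i][j] for one POS M over three corpora with different styles.
D_style_like   = [[0.1, 2.0, 2.2], [2.1, 0.1, 2.0], [2.3, 2.2, 0.1]]            # small only on the diagonal
D_general_like = [[0.1, 0.1, 0.2], [0.1, 0.1, 0.1], [0.2, 0.1, 0.1]]            # small everywhere
D_topic_like   = [[10.0, 10.2, 10.1], [10.2, 10.0, 10.1], [10.1, 10.2, 10.0]]   # large everywhere
for name, D in [("style-like", D_style_like), ("general-like", D_general_like), ("topic-like", D_topic_like)]:
    print(name, "->", classify(D))   # style-like -> S, general-like -> G, topic-like -> T
```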
4. Evaluation experiments using perplexity

In order to investigate the effectiveness of the proposed method, several experiments were carried out. In this section, perplexity is used as the evaluation index.

4.1. Experimental conditions

Newspaper articles and the Corpus of Spontaneous Japanese (CSJ) [8], [9] were used as training data. Both corpora contain many documents on various topics; the newspaper articles are written in formal Japanese, whereas the CSJ corpus consists of transcripts of lectures, oral presentations, dialogs, and so on.
The evaluation data were also selected from the CSJ corpus; they are transcripts of lectures about "recent news". These data cover topics similar to the newspaper articles and have a speaking style similar to the CSJ corpus used as training data. This means that no training data had both a topic and a style similar to the evaluation data, even though some training data had a similar topic or a similar style.

Table 2. Experimental conditions
  Vocabulary: 30,000 words + unknown words
  Number of latent models: Topic: 100, Style: 50, General: 1
  Training data: Newspaper: 67,989 articles (26.9M words); CSJ: 2,580 articles (6.7M words)
  Evaluation data: CSJ: 152 articles

Other experimental conditions are shown in Table 2. Several parameters, such as the number of latent models and the annealing schedule used in the tempered EM algorithm [4], were determined according to preliminary experiments.

4.2. Analysis of acquired categories

In this experiment, 90 POS were defined, and both the "general tendency score" and the "style tendency score" were calculated for each POS. Figure 2 shows the relationship between the two scores; each cross symbol denotes a POS.

[Figure 2. Relationship between general and style tendency scores: each POS is plotted by its general tendency score against its style tendency score, with the general-class and style-class thresholds marked.]

It can be seen that the POS can be divided into three regions using thresholds (0.12 for the general tendency score and 6.0 for the style tendency score). The region with a higher general tendency score and a lower style tendency score indicates the general category, the region with a lower general tendency score and a higher style tendency score indicates the style category, and the region with lower general and style tendency scores indicates the topic category.
Tables 3 and 4 show the POS with higher/lower general tendency scores and style tendency scores, respectively. Table 5 shows the acquired clusters; several POS were categorized into a different cluster than in the manual clustering.

Table 3. POS with higher/lower general tendency score
  POS                              Score
  Particle (attributive)           5.59
  Case particle (general)          0.58
  Case particle (quotation)        0.34
  Particle (conjunction)           0.13
  Particle (adverb)                0.11
  Noun (base of auxiliary verb)    0.08
  General noun (dependence)        0.08
  ...
  Noun (family name)               0.032
  Noun (given name)                0.032
  Noun (location)                  0.032
  Noun (organization)              0.031
  Adjective (suffix)               0.030
  Particle (special)               0.030
  Noun (conjunctive)               0.029
  Prefix (with verb)               0.028
  Prefix (with adjective)          0.027

Table 4. POS with higher/lower style tendency score
  POS                              Score
  Noun (base of auxiliary verb)    16.25
  Filler                           14.33
  General noun (dependence)        13.18
  Noun (base of adjective verb)    12.70
  Pronoun                          12.41
  Impression words                 12.28
  Conjunction                      11.82
  ...
  Adjective (suffix)               0.51
  Prefix (with adjective)          0.46
  Noun (conjunctive)               0.45
  Particle (special)               0.33
  Alphabet                         0.26
  Particle (attributive)           0.25
  Noun (person's name)             0.21
  Prefix (with verb)               0.13
  Noun (quotation)                 0.015
  Interjection                     0.010

Table 5. Acquired clusters
  Topic (23,584 words): General noun, Proper noun, Noun (base of adjective verb), Number, Noun (special), Noun (suffix), Noun (conjunctive), Noun (quotation), Prefix, Adjective, Particle (special), Interjection
  Style (6,372 words): Pronoun, Filler, Prefix (with noun), Verb, Adverb, Noun (special suffix), Case particle (collocation), Particle (adverb), Impression words, Conjunction
  General (44 words): Particle (conjunctive), Case particle (general), Case particle (quotation), Particle (attributive)

Many noun-family POS take low scores for both the style tendency and the general tendency; however, some noun-family POS take higher scores (noun (base of auxiliary verb) and general noun (dependence)). Particle-family POS were also separated into higher and lower general tendency scores. This means that a fine definition of POS is needed to acquire appropriate categories.
On the other hand, "Interjection" and "Noun (quotation)" were categorized as topic-related words, although these POS should be categorized as style-related or general words. These POS contain only a small number of words, so reliable statistics could not be acquired.

4.3. Evaluation using perplexity

The VD-PLSA model was constructed based on the automatically-defined three categories. The number of latent models was set to 100, 50 and 1 for the topic category, style category and general category, respectively. The trigram, the conventional PLSA model, and the VD-PLSA model with the manually-defined three categories were also evaluated for comparison.

Table 6. Evaluation results for each language model
  Language model          Perplexity    Reduction rate
  Trigram                 143.5         ---
  Conventional PLSA       114.7         20.1%
  VD-PLSA (manual)        111.9         22.0%
  VD-PLSA (automatic)     109.4         23.8%

Table 6 shows the perplexity given by each language model and the reduction rate compared with the trigram. In this experiment, the VD-PLSA model gave higher performance than both the trigram and the conventional PLSA model, and the automatic determination of categories gave slightly higher performance than the manually-defined model.
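For reference, the perplexity and the reduction rate used in Table 6 can be computed as in the generic sketch below; this is not the evaluation script used in the experiments, and the log probabilities in the example are invented.

```python
import math

def perplexity(log_probs):
    """Test-set perplexity from per-word natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def reduction_rate(baseline_pp, model_pp):
    """Relative perplexity reduction against a baseline (the trigram in Table 6)."""
    return (baseline_pp - model_pp) / baseline_pp

# Example: per-word log probabilities of a tiny, invented test text.
print(round(perplexity([math.log(0.01), math.log(0.02), math.log(0.005)]), 1))   # 100.0
# Reduction rate of VD-PLSA (automatic) against the trigram, using the values from Table 6.
print(f"{reduction_rate(143.5, 109.4):.1%}")   # about 23.8%
```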
5. Speech recognition experiments

5.1. Rescoring method

We applied the VD-PLSA to a large vocabulary continuous speech recognition system. In this experiment, the proposed model was applied with a "rescoring method".
First, the input speech is recognized using a conventional n-gram, and the speech recognizer outputs the top-N recognition results for the input speech. The top-1 result, which is the same as the final result of the conventional speech recognizer, is used as adaptation data for the VD-PLSA. After the VD-PLSA has been adapted, a linguistic score is calculated with the VD-PLSA for each recognition result. Finally, a total score is calculated as a weighted sum of the acoustic and linguistic scores, and the recognition result with the highest total score is output as the final result.
A language model is usually used in the decoding process; however, using the VD-PLSA during decoding would require modifying the existing decoder. In order to investigate the effectiveness of the VD-PLSA quickly, the rescoring method was employed in these experiments, even though rescoring may give slightly lower performance than a modified decoder.
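A minimal sketch of this rescoring step is given below. The N-best list, the language-model weight, and the dummy language model are all hypothetical; in the actual system the linguistic score would come from the adapted VD-PLSA, and the weight would be tuned.

```python
def rescore(nbest, lm_score, lm_weight=10.0):
    """Pick the best hypothesis by a weighted sum of the decoder's acoustic score and a new LM score.
    `nbest` is a list of (hypothesis, acoustic_score); `lm_score` maps a hypothesis (a list of words)
    to its language-model log probability."""
    best_hyp, best_total = None, float("-inf")
    for hyp, acoustic in nbest:
        total = acoustic + lm_weight * lm_score(hyp)   # weighted sum of acoustic and linguistic scores
        if total > best_total:
            best_hyp, best_total = hyp, total
    return best_hyp, best_total

# Toy usage: a fake 3-best list and a dummy LM that simply prefers shorter hypotheses.
nbest = [(["this", "is", "a", "test"], -120.0),
         (["this", "is", "the", "test"], -121.5),
         (["this", "is", "test"], -126.0)]
dummy_lm = lambda hyp: -0.5 * len(hyp)
print(rescore(nbest, dummy_lm))
```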

5.2. Experimental conditions

Ten lectures were selected from the evaluation data used in Section 4. Details of the selected test data are shown in Table 7; in this table, "PP" is the perplexity calculated by the trigram and "OOV" is the out-of-vocabulary ratio.
Julius [10] was used as the decoder, and a speaker-independent HMM was used as the acoustic model. 500 recognition results were output for the rescoring step. If the best candidate were always selected in the rescoring step, the total recognition accuracy would be 68.42%.

Table 7. Details of the evaluation data
  ID       PP       OOV    Topic
  F0073    130.82   1.7%   Juvenile delinquency
  F0226    126.58   2.4%   Genetic operation
  F0404    107.43   1.4%   Aging society
  M0149    157.08   3.2%   Loan
  M0267    156.12   2.9%   Mobile phone
  M0557    179.47   1.4%   Olympic
  M0846    189.01   4.4%   Blast incident
  M0872    111.82   3.4%   Okinawa summit meeting
  M1250    121.24   1.7%   Rugby football
  M1564    152.30   2.7%   Soccer
  Average  143.19   2.5%

5.3. Recognition results

Table 8 shows the speech recognition results. From these results, the recognition accuracy was slightly improved by using the VD-PLSA. Some lectures (F0073, F0404, M0267 and M1250) showed larger improvements, while the accuracies of the other lectures were not improved or decreased slightly; in particular, the accuracies of M0149 and M0846 showed the largest decreases.
Table 8. Speech recognition accuracy
  ID       Trigram    VD-PLSA    Improvement
  F0073    68.21%     68.86%     +0.65
  F0226    77.22%     77.16%     -0.06
  F0404    78.24%     78.49%     +0.25
  M0149    58.77%     58.07%     -0.70
  M0267    60.73%     61.47%     +0.74
  M0557    58.75%     58.84%     +0.09
  M0846    63.82%     63.12%     -0.70
  M0872    65.69%     65.60%     -0.09
  M1250    65.44%     65.85%     +0.41
  M1564    67.26%     67.00%     -0.26
  Average  66.41%     66.45%     +0.04

From the comparison between Table 7 and Table 8, the lectures with a higher OOV rate (except M0267) showed larger decreases in accuracy. In general, most out-of-vocabulary words are proper nouns. If proper nouns cannot be recognized correctly, the VD-PLSA cannot be adapted to the topic sufficiently, because proper nouns are usually the keywords of a topic. Combining the VD-PLSA with an OOV recognition method could therefore improve the recognition accuracy more substantially.

6. Conclusion

In this paper, an automatic method for clustering parts-of-speech (POS) was proposed for the vocabulary divided PLSA language model.
Several corpora with different styles were prepared, and the distance between corpora in terms of each POS was calculated. This distance is defined as the average distance between the probability distribution calculated from a single document and that calculated from all documents in a corpus. After that, a "general tendency score" and a "style tendency score" were calculated for each POS based on the distances between corpora, and all of the POS were divided into three categories using the two scores and appropriate thresholds.
From the experimental results, the proposed clustering method formed appropriate clusters, and the VD-PLSA model with the acquired categories gave the highest performance of all the compared models.
We also applied the VD-PLSA to a large vocabulary continuous speech recognition system. The VD-PLSA improved the recognition accuracy for documents with a lower out-of-vocabulary ratio, while for other documents the accuracy was unchanged or decreased slightly.

References

[1] A. I. Rudnicky, "Language modeling with limited domain data," in Proc. ARPA Spoken Language Systems Technology Workshop, 1995, pp. 66–69.
[2] M. Federico, "Bayesian estimation methods for N-gram language model adaptation," in Proc. ICSLP, 1996, pp. 240–243.
[3] R. M. Iyer and M. Ostendorf, "Modeling long distance dependence in language: topic mixtures versus dynamic cache models," IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 30–39, 1999.
[4] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
[5] D. Gildea and T. Hofmann, "Topic-based language models using EM," in Proc. EUROSPEECH, 1999, pp. 2167–2170.
[6] A. Ito, N. Kuriyama, M. Suzuki, and S. Makino, "Evaluation of multiple PLSA adaptation based on separation of topic and style words," in Proc. WESPAC IX, 2006.
[7] N. Kuriyama, M. Suzuki, A. Ito, and S. Makino, "Topic and speech style adaptation using vocabulary divided PLSA language model," in Proc. 3rd Workshop of Yeungnam University and Tohoku University, 2006, pp. 16–18.
[8] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus of Japanese," in Proc. Second International Conference on Language Resources and Evaluation (LREC), 2000, pp. 947–952.
[9] K. Maekawa, "Corpus of spontaneous Japanese: Its design and evaluation," in Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 2003.
[10] A. Lee, T. Kawahara, and K. Shikano, "Julius — an open source real-time large vocabulary recognition engine," in Proc. EUROSPEECH, 2001, pp. 1691–1694.
