SEMINAR
ON
Automatic Clustering of Part-of-Speech for
Vocabulary Divided PLSA Language Model

SUBMITTED BY
B.RAJESH KUMAR
07F61A0592

Under the guidance of
C H SIVASHANKAR
(M.TECH)
(Associate Professor)

Puttur-517 583
Abstract:
PLSA is one of the most powerful language models for adaptation to a target
speech. The vocabulary divided PLSA language model (VD-PLSA) shows higher
performance than the conventional PLSA model because it can be adapted to the target
topic and the target speaking style individually. However, all of the vocabulary must be
manually divided into three categories (topic, speaking style, and general category). In
this paper, an automatic method for clustering parts-of-speech (POS) is proposed for
VD-PLSA.
Several corpora with different styles are prepared, and the distance between corpora
in terms of POS is calculated. The “general tendency score” and “style tendency score”
for each POS are calculated based on the distance between corpora. All of the POS are
divided into three categories using two scores and appropriate thresholds.
Experimental results showed that the proposed method formed appropriate clusters, and
VD-PLSA with the acquired categories outperformed all other models.
We applied VD-PLSA to a large vocabulary continuous speech recognition system.
VD-PLSA improved the recognition accuracy for documents with a lower out-of-
vocabulary ratio, while for other documents the accuracy was unchanged or slightly
degraded.
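The abstract's clustering step (two per-POS scores compared against thresholds to yield three categories) can be sketched as follows. This is a minimal illustration only: the score names, threshold values, and decision order are assumptions, since the exact definitions are not given in this excerpt.

```python
# Hedged sketch of the POS clustering idea from the abstract: each POS has a
# "general tendency score" and a "style tendency score", and two thresholds
# split all POS into general / style / topic categories. The decision order
# and all numbers below are illustrative assumptions.

def cluster_pos(scores, theta_general, theta_style):
    """Divide POS tags into three categories using two scores and thresholds.

    scores: dict mapping POS name -> (general_score, style_score)
    """
    categories = {}
    for pos, (general_score, style_score) in scores.items():
        if general_score >= theta_general:
            categories[pos] = "general"      # appears uniformly everywhere
        elif style_score >= theta_style:
            categories[pos] = "style"        # varies with speaking style
        else:
            categories[pos] = "topic"        # varies with topic
    return categories

# Toy scores matching the paper's examples: proper nouns are topic-related,
# auxiliary verbs are style-related.
scores = {
    "proper_noun": (0.10, 0.20),
    "auxiliary_verb": (0.30, 0.90),
    "particle": (0.95, 0.40),
}
categories = cluster_pos(scores, theta_general=0.8, theta_style=0.7)
```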
1. Introduction

The n-gram is the most popular language model used in recent large vocabulary
continuous speech recognition systems. In general, an n-gram is constructed from a
huge amount of training samples. If the training data consist of many topics and
speaking styles, the n-gram acquires the average statistics of the training data. It
can be used as a general language model; however, it is well known that an n-gram
adapted to the target speech shows higher performance than the general n-gram.

Many adaptation methods have been proposed, and they fall into two types. One is
"static" adaptation: methods of this type (e.g. [1], [2]) require a small amount of
adaptation data in advance. A general corpus and the adaptation data are mixed
together, and the adapted n-gram is calculated from the mixed corpus. This type of
adaptation is effective; however, the adaptation data have to be prepared in advance,
which means that new adaptation data must be prepared and a new adapted n-gram
trained again whenever the target speech changes.
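The static adaptation scheme described above (mix a general corpus with a small adaptation corpus, then re-estimate the n-gram from the mixture) can be sketched as below. The count-weighting scheme and the weight value are illustrative assumptions, not the specific method of [1] or [2].

```python
from collections import Counter

# Hedged sketch of "static" n-gram adaptation: merge counts from a general
# corpus and a small adaptation corpus, then re-estimate a bigram model from
# the mixture. Weighting the adaptation counts is one common choice; the
# weight value here is an arbitrary assumption for illustration.

def bigram_counts(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def adapted_bigram(general_tokens, adapt_tokens, adapt_weight=5.0):
    """Estimate bigram probabilities p(w2 | w1) from the mixed corpus."""
    counts = Counter()
    for bg, c in bigram_counts(general_tokens).items():
        counts[bg] += c
    for bg, c in bigram_counts(adapt_tokens).items():
        counts[bg] += adapt_weight * c  # emphasize the scarce adaptation data
    totals = Counter()
    for (w1, _), c in counts.items():
        totals[w1] += c
    # Maximum-likelihood estimate over the mixed counts (no smoothing).
    return {bg: c / totals[bg[0]] for bg, c in counts.items()}

general = "the cat sat on the mat".split()
adapt = "the model sat on the data".split()
probs = adapted_bigram(general, adapt)
```

The drawback noted in the text is visible here: `adapted_bigram` must be re-run from scratch with new `adapt_tokens` whenever the target speech changes.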
topics B and C are spoken with a casual style ("c"). PLSA makes three latent
models: Af, Bc, and Cc.

One of the biggest problems of PLSA is that topic and style are dealt with
together when the latent models are estimated. If the target speech has topic B
with a formal style (Bf), no latent model matches the target speech even though
both topic B and the formal style are included in the training data. In general,
any topic can be combined with any speaking style. Adaptation is done more
efficiently if PLSA acquires topic-related models and style-related models
individually, so that these models can be combined when recognizing the target
speech.
In order to solve this problem, we have proposed the vocabulary divided PLSA
language model (VD-PLSA) [6], [7]. In this method, it is assumed that all
parts-of-speech (POS) are divided into three classes: a topic-related class, a
style-related class, and a general class. For example, proper nouns are related to
the topic, and auxiliary verbs are related to the speaking style. If a PLSA model
is constructed using only topic-related words, each latent model corresponds only
to a topic. In the same way, if a PLSA model is constructed using only
style-related words, each model corresponds only to a speaking style. Combining
these two PLSA models enables effective adaptation to any target speech on any
topic with any speaking style.

VD-PLSA showed higher performance than the conventional PLSA model. However,
VD-PLSA requires a manually defined clustering of the POS. Many POS must be
defined to obtain an optimal clustering, which is a difficult and time-consuming
job. In this paper, we propose an automatic method for dividing the POS based on
differences between probabilistic distributions. This method yields a more
appropriate clustering, and VD-PLSA using the new clustering gives higher
performance.

2. Overview of the VD-PLSA model

Figure 1. Overview of the VD-PLSA model.

Figure 1 shows an overview of the VD-PLSA model. This system has three PLSA
models corresponding to topic, speaking style, and general words. In this paper,
each PLSA model is called a "sub model". Each sub model is the same as the
conventional PLSA model, except that it consists only of the related vocabulary.

The probability p(w_i | h_i) for word w_i at linguistic context h_i is calculated
from the VD-PLSA model using Eq. (1):

    p(w_i | h_i) = W(C(w_i) | h_i, w_{i-2}, w_{i-1})
                   · P_{C(w_i)}(w_i | h_i, w_{i-2}, w_{i-1})          (1)

where the function C(w) returns the category (T: topic, S: style, G: general)
containing the word w, and the function P_x(w_i | h_i, w_{i-2}, w_{i-1}) denotes
the word trigram combined with the sub model corresponding to category x using
the unigram rescaling method [5]:

    P_x(w_i | h_i, w_{i-2}, w_{i-1}) ∝ (p_x(w_i | h_i) / p_x(w_i))
                                       · p(w_i | w_{i-2}, w_{i-1})    (2)

where p_x(w_i | h_i) is calculated by the PLSA model corresponding to category x.

The function W(x | h_i, w_{i-2}, w_{i-1}) denotes the appearance probability of
category x after the word sequence w_{i-2}, w_{i-1} at context h_i. This
probability is calculated from the unigram of category x at context h and the
word trigram using unigram rescaling. The unigram p(x | h) is defined as Eq. (3).
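The combination in Eqs. (1) and (2) can be sketched numerically as follows. All the probability values below are toy stand-ins for what trained PLSA sub models and an n-gram would supply; the normalization over the vocabulary makes the proportionality in Eq. (2) concrete.

```python
# Hedged sketch of Eqs. (1)-(2): a category sub-model probability is obtained
# by unigram-rescaling a word trigram, then multiplied by the category
# appearance probability W. The distributions here are toy numbers, not
# outputs of real trained models.

def unigram_rescale(p_trigram, p_plsa_ctx, p_unigram, vocab_probs):
    """Eq. (2): P_x(w | h, w-2, w-1) ∝ (p_x(w|h) / p_x(w)) * p(w | w-2, w-1).

    vocab_probs lists (p_trigram, p_plsa_ctx, p_unigram) for every word in the
    vocabulary, so the proportionality can be normalized into a probability.
    """
    norm = sum((pc / pu) * pt for pt, pc, pu in vocab_probs)
    return (p_plsa_ctx / p_unigram) * p_trigram / norm

def vdplsa_prob(category_prob, rescaled_prob):
    """Eq. (1): p(w | h) = W(C(w) | h, w-2, w-1) * P_{C(w)}(w | h, w-2, w-1)."""
    return category_prob * rescaled_prob

# Toy two-word vocabulary: word 0 is boosted by its sub model relative to its
# unigram (topic-relevant in this context); word 1 is suppressed.
vocab = [
    (0.5, 0.4, 0.2),  # (trigram, sub-model p_x(w|h), unigram p_x(w))
    (0.5, 0.1, 0.8),
]
p0 = unigram_rescale(*vocab[0], vocab)
p1 = unigram_rescale(*vocab[1], vocab)
# With W(C(w0) | h, w-2, w-1) = 0.7, Eq. (1) gives the final word probability:
p_word0 = vdplsa_prob(0.7, p0)
```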