
Neurocomputing 277 (2018) 120–128


A weighted accent classification using multiple words


Muhammad Rizwan∗, David V. Anderson
Georgia Institute of Technology, Atlanta, GA, U.S.A.

Article history: Received 2 June 2016; Revised 11 January 2017; Accepted 13 January 2017; Available online 31 August 2017

MSC: 00-01; 99-00

Keywords: Extreme learning machines; Support vector machines; Accent classification

Abstract: Speech recognition systems exhibit performance degradation due to variability in speech caused by the accents or dialects of speakers. This can be overcome by correctly identifying the accent or dialect of the speaker and using accent or dialect information to adapt speech recognition systems. In this paper, we apply extreme learning machines (ELMs) and support vector machines (SVMs) to the problem of accent/dialect classification on the TIMIT dataset. We used Mel frequency cepstrum coefficients (MFCCs) and the normalized energy parameter along with their first and second derivatives as raw features for training ELMs and SVMs. A weighted accent classification algorithm is proposed that uses a novel architecture to classify North American accents into seven groups. Using this algorithm, we obtained a classification accuracy of 77.88% using ELMs, which, to our knowledge, is the best result reported for accent classification on the TIMIT dataset. We also compared the performance of ELMs with SVMs as classifiers for our weighted accent classification algorithm and with multi-class classification using ELMs or SVMs.

© 2017 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: mrizwan@gatech.edu (M. Rizwan).

1. Introduction

Speech signals intrinsically exhibit many variations, even in the absence of background noise. The three most prominent types of variations are due to acoustic effects, accent, and dialect. Acoustic variations are primarily related to inherited physical characteristics of the size and shape of the vocal tract. Two different people saying the same sentence produce different spectrograms. The variations due to accent result from the relative prominence of a particular syllable or a word in pronunciation determined by the regional or social background of the speaker [1]. Different accents effect a change in the order and number of phonemes used to construct each word of an utterance, i.e., phoneme deletion, insertion and substitution with respect to some reference accent. Dialect is defined as a regional variety of a language distinguished by pronunciation, grammar or vocabulary. Every individual develops a characteristic speaking style at an early age that depends heavily on his or her language environment as well as the region where the language is spoken [2,3]. We are using a database of read speech that is labeled by dialect region. However, since the speech is read, word and grammar choices are eliminated, leaving only accent variations even though the labels are dialect regions. Accordingly, in this paper we will use these terms interchangeably.

Lawson, et al. showed by their cross-accented experiments that phonemic models obtained from a different accent were 1.8 times less accurate in recognizing speech than those from a matched accent [4]. The performance of a speech recognizer can be further improved by adapting the system based on accent/dialect. Goronzy achieved a 37% reduction in word error rate (WER) by adapting a recognizer based on accent [5]. There has been little past research in the area of accent classification. In particular, most of the previous work in the field involves accent classification among non-native English speakers. Accent variation among native American speakers is more challenging and has not enjoyed the same amount of attention in speech community research.

Choueiter, et al. extended language identification techniques to a large-scale accent classification task [6]. They performed several experiments using heteroscedastic linear discriminant analysis (HLDA) and maximum mutual information (MMI) on the Foreign Accented English (FAE) dataset [7]. The FAE is composed of utterances spoken by native speakers of 23 languages. They found that acoustic-only methods are quite effective for accent classification, in contrast to typical language identification systems. Angkititrakul and Hansen used a phoneme-based model to design a text-independent automatic accent classification system [8]. They performed experiments capturing the spectral evolution information as potential accent sensitive cues. They generated subspace representations using principal component analysis (PCA) and linear discriminant analysis (LDA). They compared a spectral trajectory model framework with a traditional hidden-Markov-model (HMM) recognition framework using an accent sensitive word corpus.


System evaluation was performed using a corpus that represents five English speaker groups, consisting of native American English speakers and English speakers having Mandarin Chinese, French, Thai, and Turkish accents, for both male and female speakers. Macías-Guarasa used Gaussian mixture models (GMMs) and Bayes' classifiers for German versus Spanish accent classification [9]. Clopper, et al. did an extensive study of vowel variation among different regions of North America using acoustic measures of duration and first and second formant frequencies [10]. Hansen, et al. did an extensive analysis and modeling of speech under accents on the NATO N-4, TIMIT, and WSJ corpora [11]. They analyzed prosodic structure (formants, syllable rate, and sentence duration) and phoneme acoustic space, and did word-level modeling on large-vocabulary data. In their experiments, they found that using the most discriminating vowels from each group improves the accent detection rate.
In this paper, we propose an accent classification algorithm based on extreme learning machines (ELMs). ELMs are attractive for the accent classification task as they can be quickly trained and also provide better generalization capability for small amounts of training data [12,13]. We also compare our accent classification algorithm performance by using support vector machines as classifiers. The rest of the paper is organized as follows: the theory of extreme learning machines (ELMs) and support vector machines (SVMs) is presented in Sections 2 and 3, with a comparison between ELMs and SVMs in Section 4. In Section 5, we propose our accent classification algorithm. A description of the experiments performed, including the dataset, extraction of features, ELM training, and SVM training for the weighted accent classification algorithm, is presented in Section 6. Results are discussed in Section 7, and the conclusion, with suggestions for future work, is presented in Section 8.

Fig. 1. Extreme learning machines.

2. Extreme learning machines

Extreme learning machine (ELM) is a robust learning algorithm for single layer feed-forward neural networks (SLFNs) [14]. Currently, SLFNs mostly use gradient based methods for training neural networks. Gradient based methods often get trapped in local minima and, as a result, give suboptimal solutions. Genetic and evolutionary algorithms have also been used to overcome local minima problems, but they are computationally expensive [15].

In ELMs, input weights of the hidden layer neurons are randomly generated and output weights of the hidden layer neurons are learned analytically [14,16]. By learning weights analytically, there is a great performance speedup for training neural networks as compared with learning methods such as back-propagation [17]. Theoretically, it has been shown that universal approximation can be achieved by using ELMs [18,19]. ELMs can also be used for training multilayer perceptrons by using hierarchical frameworks [20].

Various other architectures for ELMs have been proposed. In incremental-ELM, hidden nodes are added incrementally and output weights are determined analytically [21]. In online sequential-ELM, training data is fed to the network in chunks [22]. Local receptive fields-ELM uses local structures and combinatorial nodes for incorporating translational invariance in the network [23]. ELMs can be used for both regression and multiclass classification problems directly [24].

ELMs transform the input data to the hidden layer via randomly initialized weighted connections. A single hidden layer network with M hidden nodes is shown in Fig. 1. The output function of the single layer network with M hidden neurons can be written as [12]:

f(x) = \sum_{i=1}^{M} \beta_i h_i(x) = h(x)\beta    (1)

where

h_i(x) = \sigma(w_i x + b_i)    (2)

and \sigma is a non-linear activation function given by:

\sigma(w_i x + b_i) = \frac{1}{1 + e^{-(w_i x + b_i)}}    (3)

Here \beta is the vector of weights between the M neurons in the hidden layer and the output layer:

\beta = [\beta_1, \ldots, \beta_M]^T    (4)

The goal of ELM is to minimize the training error as well as the norm of the output weights. It does not require any adjustments to the input weights of the neurons in the hidden layer [12,23,25,26]. ELMs minimize

\|\beta\|_p^{\sigma_1} + C\,\|H\beta - T\|_q^{\sigma_2}    (5)

where \sigma_1 > 0, \sigma_2 > 0, p, q = 0, 1, 2, \ldots, \infty, and H is the output matrix at the hidden layer given by:

H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_M(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_M(x_N) \end{bmatrix}    (6)

and T is the matrix of training data target values:

T = [t_1, \ldots, t_N]^T    (7)

Minimizing the training error and the norm of the weight vector results in good generalization capability of the network [21,27]. The optimal output weights are then computed as:

\beta = H^{\dagger} T    (8)

where H^{\dagger} is the Moore–Penrose generalized inverse of the matrix H [28]. Various methods can be used to calculate the inverse of the matrix H, such as orthogonal projection methods, iterative methods, and singular value decomposition [24,29]. A closed-form solution for calculating \beta is given by [23]:

\beta = H^T \left(\tfrac{I}{C} + H H^T\right)^{-1} T \ \text{if}\ N \le M, \qquad \beta = \left(\tfrac{I}{C} + H^T H\right)^{-1} H^T T \ \text{if}\ N > M    (9)

where C is a regularization parameter, I is the identity matrix, and H and T are as previously defined in Eqs. (6) and (7), respectively. The ELM classifier expression can be written as:

f(x) = h(x) H^T \left(\tfrac{I}{C} + H H^T\right)^{-1} T \ \text{if}\ N \le M, \qquad f(x) = h(x) \left(\tfrac{I}{C} + H^T H\right)^{-1} H^T T \ \text{if}\ N > M    (10)

where h(x) is the hidden layer output vector corresponding to the input samples x = x_1, x_2, \ldots, x_N, and \beta is the output weight vector between the hidden layer of M nodes and the output node. In fact, h(x) is a feature mapping from the input space of D dimensions to a random feature space (or ELM space) of M dimensions.

For our system, the training data consists of N distinct input-output pairs of words and their corresponding accent type given by:

Training Data = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}    (11)

where each (x_i, t_i) respectively represents an input word data vector and its corresponding accent label. Specifically, x_i \in \mathbb{R}^D is the vector of extracted speech signal features for a complete "word" (details of feature extraction are in Section 6.2), and t_i \in \mathbb{R}^V is the corresponding accent type. In our case we train ELMs to distinguish between two accent types (details of ELM training are in Section 6.3).
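To make the closed-form training concrete, the following is a minimal NumPy sketch of an ELM as described by Eqs. (1)–(10), using the regularized solution of Eq. (9) (N > M branch). It is an illustrative re-implementation under assumed function names and synthetic data, not the authors' code.

```python
# Minimal NumPy sketch of ELM training/prediction (Eqs. (1)-(10)).
# Illustrative only: function names and the toy data below are assumptions.
import numpy as np

def elm_fit(X, T, M=500, C=1.0, rng=None):
    """Randomly generate hidden-layer weights, then solve for beta in closed form."""
    rng = np.random.default_rng(rng)
    N, D = X.shape
    W = rng.standard_normal((D, M))          # random input weights w_i
    b = rng.standard_normal(M)               # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer outputs, Eqs. (2)-(3)
    # Regularized closed-form solution, Eq. (9) (N > M branch):
    beta = np.linalg.solve(np.eye(M) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # f(x) = h(x) beta, Eq. (1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 39))       # e.g. 39-dimensional MFCC-based features
    y = (X[:, 0] > 0).astype(int)            # toy two-class labels
    T = np.eye(2)[y]                         # one-hot targets
    W, b, beta = elm_fit(X, T, M=100, C=10.0, rng=1)
    pred = elm_predict(X, W, b, beta).argmax(axis=1)
    print("training accuracy:", (pred == y).mean())
```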

3. Support vector machines

Fig. 2. Support vector machine.

Support vector machines (SVMs) are based on the intuition of placing a hyperplane in such a way that it separates data classes with a large margin. An SVM is thus a maximum margin classifier. The margin in an SVM is the distance between the hyperplane and the data point closest to it [30–33]. When the data to be classified is not linearly separable (usually the case), a kernel function may be used to map the data from a given input space to a high dimensional space known as a kernel space. Using the kernel space may result in a better separability of data [34,35]. For a given training set comprising N data points with class labels t_i \in \{1, -1\}, the goal of the SVM classifier is to separate the data classes by finding an optimal hyperplane in the kernel space by solving a minimization problem with the inequality constraint given by Eq. (12):

\min_{w, \zeta_i} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \zeta_i \quad \text{subject to} \quad \zeta_i \ge 0, \; t_i [w^T \phi(x_i) + b] \ge 1 - \zeta_i    (12)

where i = 1, 2, \ldots, N, \phi(\cdot) is a mapping function, w is the normal to the optimal decision hyperplane, b is the bias term, C is the regularization parameter which determines the generalization capability of the SVM (i.e., the trade-off between margin and misclassification errors; the higher the value of C, the stricter the constraint and the lower the likelihood of over-fitting [36]), and \zeta_i is the slack variable. The above equation (Eq. (12)) is in non-convex form and, therefore, difficult to solve. The above optimization problem is transformed into its dual form with an equality constraint by using Lagrange multipliers. The Lagrange function is given by:

L(w, b, \zeta_i; \alpha_i, \lambda_i) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \left(t_i [w^T \phi(x_i) + b] - 1 + \zeta_i\right) - \sum_{i=1}^{N} \lambda_i \zeta_i    (13)

where \alpha_i \ge 0 and \lambda_i \ge 0. The solution can be obtained by solving the Lagrange function and calculating the partial derivatives of the Lagrange function with respect to w, b, and \zeta_i [37]:

\max_{\alpha_i, \lambda_i} \; \min_{w, b, \zeta_i} \; L(w, b, \zeta_i; \alpha_i, \lambda_i)    (14)

\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0, \quad \frac{\partial L}{\partial \zeta_i} = 0    (15)

w = \sum_{i=1}^{N} \alpha_i t_i \phi(x_i)    (16)

\sum_{i=1}^{N} \alpha_i t_i = 0    (17)

0 \le \alpha_i \le C    (18)

By using the above constraints, the SVM dual optimization problem can be written as:

\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} t^{(i)} t^{(j)} \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \alpha_i \ge 0, \; \sum_{i=1}^{m} \alpha_i t^{(i)} = 0, \; i = 1, \ldots, m    (19)

The SVM decision function in a kernel space is given by:

f(x) = \operatorname{sgn}\left(\sum_{i=1}^{M} \alpha_i t_i \, \kappa(x_i, x) + b\right)    (20)

where \kappa(\cdot) is a kernel function. Several kernel functions with their hyperparameters are summarized in Table 1.

Table 1
Kernel functions.

Kernel       | Function                        | Parameters
Linear       | x_i^T x_j                       | –
Polynomial   | (\gamma x_i^T x_j + r)^d        | \gamma, r, d
Radial basis | \exp(-\gamma \|x_i - x_j\|^2)   | \gamma
Sigmoid      | \tanh(\gamma x_i^T x_j + r)     | \gamma, r

When using SVMs, three choices must be made: the kernel type, the corresponding kernel parameters, and the regularization parameter [38]. The SVM computational complexity depends on the number of samples in the training data, and is independent of the kernel space dimension.
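As a reference sketch, the kernels of Table 1 and the decision rule of Eq. (20) can be written out directly in Python. The function names and default parameter values below are assumptions; in practice the multipliers α_i, the support vectors, and the bias b come from solving the dual problem of Eq. (19), e.g., with LIBSVM [56].

```python
# The kernel functions of Table 1 and the decision rule of Eq. (20).
# Illustrative sketch; parameter defaults are arbitrary placeholders.
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=0.01, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.01):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, gamma=0.01, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Eq. (20): sign of the kernel-weighted sum over the support vectors."""
    score = sum(a * t * kernel(sv, x)
                for a, t, sv in zip(alphas, labels, support_vectors))
    return np.sign(score + bias)
```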

4. Comparison between ELMs and SVMs

Both ELMs and SVMs converge to a single global optimum solution. ELMs optimize the sum of squared errors, while SVMs construct a hyperplane that maximizes the separation between the data classes [37,39,40].

Table 2
Comparison of ELMs and SVMs.

Characteristics            | ELMs                                                           | SVMs
Optimization               | Sum of squared errors                                          | Maximum margin classifier
Loss function              | Smooth                                                         | Not smooth
Feature transformation     | Random features                                                | Kernel functions (linear, polynomial, radial basis, sigmoid)
Hyperparameters            | Regularization parameter C; number of neurons in hidden layer; activation function | Regularization parameter C; kernel function parameters: polynomial (\gamma, d, r), radial basis (\gamma), sigmoid (\gamma, r)
Optimization constraints   | \alpha_i from the entire cube [0, C]^N                         | \alpha_i from the hyperplane \sum_{i=1}^{N} \alpha_i t_i = 0
Computational complexity   | Dependent on dimension of feature space                        | Independent of feature space dimension
Training time              | Less                                                           | More
Output function            | f(x) = h(x) H^T (I/C + H H^T)^{-1} T                           | f(x) = sgn(\sum_{i=1}^{M} \alpha_i t_i \kappa(x_i, x) + b)
Multi-class classification | Directly                                                       | Indirectly

4.1. Decision surface

ELMs and SVMs have the same dual optimization objective functions. In ELMs, optimal solutions are learned from the entire cube [0, C]^N, while in SVMs the optimal \alpha_i are learned from one hyperplane \sum_{i=1}^{N} \alpha_i t_i = 0 within the cube [0, C]^N, as shown in Figs. 3 and 4. This results in the SVM solution being sub-optimal [41].

Fig. 3. Extreme learning machine decision surface.

Fig. 4. Support vector machine decision surface.

4.2. Loss function

The training of a classifier depends on its loss function. The loss function has a significant impact on the training time of the classifier, as well as on the computational cost for the classification of new data [39]. ELM uses a quadratic loss function and minimizes the sum of square errors between the class labels and the network output. It not only penalizes wrong answers but also penalizes correct answers which are far from the decision boundary. The quadratic loss function is smooth and the resulting Karush-Kuhn-Tucker (KKT) system has a closed form solution [42]. This makes the training of ELM easy. The decision boundary of the ELM classifier is determined by using all samples present in the training data [43]. SVM constructs its hyperplane using a margin-based loss function. This loss function is not smooth, which results in an iterative solution for the KKT dual system. It penalizes not only answers which are incorrect, but also those that are correct but lie close to the decision boundary [44]. The SVM decision boundary is decided by only those samples from the training data for which the Lagrange multiplier is non-zero (i.e., the support vectors).

4.3. Feature transformation

ELM uses a random feature transformation, and the classifier can be trained by using primal or dual formulations [39]. SVM transforms data into a kernel space by using kernel functions. SVMs are always trained in the dual space [30,45].

4.4. Hyperparameters

ELM requires selection of the regularization parameter C and the number of neurons in the hidden layer as hyperparameters. The number of neurons in the hidden layer determines the dimensionality of the feature space. SVM requires hyperparameters depending on the kernel function, in addition to the regularization parameter C. In short, SVM requires more hyperparameters as compared with ELM [39].

4.5. Training and testing time

ELM training time can be estimated because ELM uses a closed form solution for calculating the weights. Let N be the number of training samples, D be the dimensionality of the input data, and M be the number of neurons in the hidden layer of the ELM. In order to calculate the weight matrix given by Eq. (9), we first need to calculate H. Calculating the H matrix requires O(NDM) operations. The weight matrix \beta requires O(NM^2 + M^3) operations [24,39]. The training and testing time of ELM are given by Eqs. (21) and (22) for the case when N \gg M and N \gg D:

ELM Training Time = O(NM^2)    (21)

ELM Testing Time = O(MD)    (22)


SVM training time estimation is difficult because of its iterative training procedure [39]. The SVM training time is related to the number of support vectors [46–48]. For S support vectors, the testing time of the SVM classifier is given by Eq. (23):

SVM Testing Time = O(SD)    (23)
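As a rough, purely illustrative calculation of these operation counts: the dimensions below use D = 39 (Section 6.2) and M = 1000 (the largest hidden-layer size considered in Section 6.3), while the training-set size N and support-vector count S are hypothetical placeholders, not values reported in the paper.

```python
# Back-of-the-envelope operation counts from Eqs. (21)-(23).
def elm_training_ops(N, M):      # Eq. (21): O(N * M^2)
    return N * M ** 2

def elm_testing_ops(M, D):       # Eq. (22): O(M * D) per sample
    return M * D

def svm_testing_ops(S, D):       # Eq. (23): O(S * D) per sample
    return S * D

print(elm_training_ops(N=10_000, M=1_000))   # ~1e10 multiply-accumulates (hypothetical N)
print(elm_testing_ops(M=1_000, D=39))        # 39,000 per test sample
print(svm_testing_ops(S=2_000, D=39))        # 78,000 per test sample for a hypothetical S
```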

5. Weighted accent classification algorithm

Our weighted accent classification algorithm uses either ELMs or SVMs for accent classification. The algorithm involves three stages. In the first stage, multiple ELMs (or SVMs) are trained using word samples from two accent classes at a time. This pair-wise classification helps to find good decision boundaries [49]. Hyperparameters of the ELMs (or SVMs) are learned by cross-validation. In the second stage, the outputs of the multiple ELMs (or SVMs) are combined to obtain a classification score. Finally, the classification score is encoded and the output accent class decision is made based on the highest encoded score. Fig. 5 shows the overall block diagram.

Fig. 5. Weighted accent classification algorithm.

5.1. ELMs (or SVMs) training

Each ELM (or SVM) is individually trained and optimized for a single pair-wise decision. For example, an ELM (or SVM) [D1, D2] is trained using word samples from speakers belonging to accents D1 (New England) and D2 (Northern). Similarly, an ELM (or SVM) [D1, D3] is trained using word samples from speakers belonging to dialects D1 (New England) and D3 (North Midland), and so on. For ELMs, the number of hidden layer neurons was varied during training. For SVMs, kernel parameters were learned using a grid search approach. The best hyperparameters were selected based on cross-validation (details in Section 6.3). Although ELMs (or SVMs) are capable of complex decision boundaries, in practice, developing a system that can reliably distinguish the seven accent classes is demanding, especially because there are many similarities between accents. The method shown in Fig. 5 utilizes only pair-wise classification to make it easier to find good decision boundaries. This will result in 21 ELMs, as shown in Fig. 5.

Let Z_A1, Z_A2, ..., Z_Aj be the training word samples from accent "D_A," and let "j" be the total number of speakers in that particular accent group. Similarly, Z_B1, Z_B2, ..., Z_Bk are the training word samples from accent "D_B," and "k" is the total number of speakers in accent group "D_B." Thus we have D_A ∈ {D1, D2, D3, ..., D7}, D_B ∈ {D1, D2, D3, ..., D7}, and D_A ≠ D_B. Figs. 6 and 7 show how ELMs or SVMs are trained in a pairwise manner for the weighted accent classification algorithm.

Fig. 6. Extreme learning machine – training.

Fig. 7. Support vector machine – training.

5.2. Classification score

Each ELM (or SVM) can receive all the samples from a typical word at once. All the word samples from a particular speaker are given as input to all 21 ELMs (or SVMs) at once, as shown in Fig. 5. Each of these ELMs (or SVMs), trained in a pair-wise manner, will classify the particular speaker's accent. For example, if the true class of the input word was D1, most of the pair-wise classifiers that were trained on D1 (i.e., (D1, D2), (D1, D3), ..., (D1, D7)) will correctly identify the class as D1. Those not trained on D1 (i.e., (D2, D3), (D2, D4), ..., (D6, D7)) will have effectively random outputs, choosing among the other classes with approximately equal probability. Thus, the class D1 will win the vote and the class will be correctly identified. In our case, no hard decision is made for a single word; the results are combined over the entire utterance using a weighting scheme as described in Section 5.3.
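Before turning to the weighting over words (Section 5.3), the following rough Python sketch illustrates the pair-wise voting architecture. The classifier interface and the count-to-score mapping are placeholders (the actual mapping is given in Fig. 8 and is not reproduced here), so this is an illustration of the structure rather than the authors' implementation.

```python
# Sketch of pair-wise voting (Section 5.2) and score combination over words.
from itertools import combinations
from collections import Counter

ACCENTS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
PAIRS = list(combinations(ACCENTS, 2))           # 21 accent pairs -> 21 classifiers

def count_to_score(count):
    # Placeholder for the count-to-score mapping of Fig. 8 (not reproduced here);
    # the maximum count any class can receive from the 21 pair-wise votes is 6.
    return count

def classify_word(word_features, pairwise_classifiers):
    """pairwise_classifiers: dict mapping (a, b) -> callable returning 'a' or 'b'."""
    votes = Counter(pairwise_classifiers[pair](word_features) for pair in PAIRS)
    return {accent: count_to_score(votes.get(accent, 0)) for accent in ACCENTS}

def classify_speaker(word_feature_list, pairwise_classifiers):
    """Sum the per-word scores over several words and pick the highest-scoring accent."""
    totals = Counter()
    for feats in word_feature_list:
        totals.update(classify_word(feats, pairwise_classifiers))
    return totals.most_common(1)[0][0]
```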

5.3. Accent decision

Classification results from multiple words are combined using a weighting scheme that improves overall performance. The output classes from each of the 21 ELMs (or SVMs) are tallied and a score is given to each class according to the number of times that class was selected. The maximum count that any class can have is 6, and the count-to-score mapping is given in Fig. 8. The overall dialect class is determined by the highest total score.

Fig. 8. Weighted score.

6. Experiment

6.1. Dataset

The dataset used in our experiment is TIMIT, a speech dataset developed by Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT) and considered one of the standard datasets in speech research [50,51]. The TIMIT dataset contains utterances from 630 speakers representing eight different dialect regions of the United States. The dialect regions are: New England (D1), Northern (D2), North-Midland (D3), South-Midland (D4), Southern (D5), New York City (D6), Western (D7), and Army Brat. In the TIMIT dataset, the term dialect is used for specifying these regions; to be consistent with the dataset, we use the word dialect here. These utterances are read, so there are no word and grammar variations. The only variation in the acoustic waveform is the accent variation. For each utterance the text, the signal sampled at 16 kHz, and hand-labeled segmentation at the word and phonetic level are provided. In our experiment, we used the first seven accent regions, as the Army Brat accent group comprises speakers who moved around often during their childhood. For each speaker we have ten utterances consisting of two accent sentences (SA), which are the same for each speaker, five phonetically compact sentences (SX), and three phonetically diverse sentences (SI). In our proposed method we use words from the "SA" sentences as these words are available for each speaker.

6.2. Feature extraction

The TIMIT dataset is provided with word label information. Using the word-label information, we extracted speech samples of words from the TIMIT dataset. These speech samples were normalized between -1 and 1. We extracted 12 Mel frequency cepstral coefficients (MFCCs) [52] and the normalized energy parameter using the Auditory Toolbox [53]. We used a Hamming window and a triangular filter bank for the MFCCs [54]. To incorporate temporal dependencies we used \Delta and \Delta\Delta coefficients. Delta (\Delta) coefficients are computed by the regression Eqs. (24) and (25):

\Delta_i = \frac{\sum_{n=1}^{M} n\,(c_{i+n} - c_{i-n})}{2 \sum_{n=1}^{M} n^2}    (24)

\Delta\Delta_i = \frac{\sum_{n=1}^{M} n\,(\Delta_{i+n} - \Delta_{i-n})}{2 \sum_{n=1}^{M} n^2}    (25)

For each word sample we have 39-dimensional feature vectors consisting of 13 static cepstral features, 13 \Delta cepstral features, and 13 \Delta\Delta cepstral features. The \Delta's improve the accent classification accuracy by adding temporal dependencies.
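A minimal sketch of the Δ/ΔΔ computation of Eqs. (24) and (25) is given below, assuming a frames-by-13 matrix of static MFCC-plus-energy coefficients has already been extracted; the regression window M = 2 and the edge padding are assumptions, not values stated in the paper.

```python
# Illustrative computation of delta and delta-delta features, Eqs. (24)-(25).
import numpy as np

def deltas(c, M=2):
    """Apply the regression of Eq. (24) along the time axis of c (frames x dims)."""
    denom = 2 * sum(n * n for n in range(1, M + 1))
    padded = np.pad(c, ((M, M), (0, 0)), mode="edge")   # repeat edge frames
    out = np.zeros_like(c, dtype=float)
    for n in range(1, M + 1):
        out += n * (padded[M + n:M + n + len(c)] - padded[M - n:M - n + len(c)])
    return out / denom

def word_features(static_coeffs):
    """Stack static, delta, and delta-delta coefficients into 39-dimensional frames."""
    d = deltas(static_coeffs)        # Eq. (24)
    dd = deltas(d)                   # Eq. (25)
    return np.hstack([static_coeffs, d, dd])
```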
6.3. ELMs and SVMs hyperparameters

During ELM training, the number of neurons in the hidden layer was varied from 100 to 1000 with an increment of 100, and the sigmoid was used as the non-linear activation function. The number of neurons in the hidden layer was learned using a trial and error procedure based on cross-validation. For SVM training, a grid search method was used to find the optimal SVM model parameters [55]. SVMs in the weighted accent classification algorithm were trained using the LIBSVM library [56]. We used linear, polynomial, RBF, and sigmoid kernels with d = {1, 2, ..., 15}, \gamma = {2^{-15}, 2^{-14}, ..., 2^{5}}, and C = {2^{-3}, 2^{-2}, ..., 2^{15}}.
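A hedged sketch of such a grid search is shown below using scikit-learn's SVC (the paper itself used LIBSVM [56] directly); the synthetic data and the thinned-out parameter grid are placeholders to keep the example small.

```python
# Sketch of the kernel/hyperparameter grid search described above.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 39))            # placeholder word-level features
y = rng.integers(0, 2, size=300)              # placeholder pair-wise accent labels

param_grid = [
    {"kernel": ["rbf", "sigmoid"],
     "gamma": [2.0**k for k in range(-15, 6, 4)],
     "C": [2.0**k for k in range(-3, 16, 4)]},
    {"kernel": ["poly"], "degree": [1, 3, 5],
     "gamma": [2.0**k for k in range(-15, 6, 4)],
     "C": [2.0**k for k in range(-3, 16, 4)]},
    {"kernel": ["linear"], "C": [2.0**k for k in range(-3, 16, 4)]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # cross-validated model selection
search.fit(X, y)
print(search.best_params_)
```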

7. Results

We compared the accuracy of the weighted accent classification algorithm using ELMs and SVMs as classifiers. We also compared the performance of different words and evaluated the improvement resulting from using multiple words from a particular speaker. Finally, we compared the relative performance of using ELMs and SVMs as classifiers in our accent classification algorithm in terms of training and testing time.

7.1. Comparison of different words

In our first experiment we used eleven different words: "dark," "like," "oily," "suit," "that," "wash," "year," "your," "carry," "water," and "greasy" to classify a speaker into one of the seven different accents. We selected words with three or more letters so that they can capture variability in terms of accents and are available for all speakers in the TIMIT dataset. We tested the weighted accent classification algorithm (Section 5) using ELMs and SVMs as classifiers with only one word at a time. We also compared the performance of our proposed weighted accent classification algorithm with multi-class classification. Figs. 9 and 10 show the comparison of our proposed weighted accent classification algorithm with multi-class classification using ELMs and SVMs as classifiers.

Fig. 9. Comparison of classification accuracy with different words using ELM as a classifier.

Fig. 10. Comparison of classification accuracy with different words using SVM as a classifier.

By using our proposed algorithm we get better results as compared with multi-class classification. Our proposed weighted accent classification algorithm with the ELM-based classifier performed best with the word "like," while the SVM-based classifier performed best with the word "carry." In this experiment we used only one word at a time from a speaker.

7.2. Classification accuracy and number of words

In this experiment, we compared the improvement in accent classification accuracy obtained by using multiple words from a given speaker. We varied the number of words from one to five for a particular speaker. Fig. 11 shows the comparison of our proposed weighted accent classification algorithm with multi-class classification using ELMs and SVMs as classifiers for multiple words. We used the top five words in terms of their performance as presented in Figs. 9 and 10. For weighted accent classification using ELMs we used the words "like," "greasy," "suit," "wash," and "water." For multi-class classification using ELMs we used the words "water," "that," "like," "wash," and "dark." Similarly, for weighted accent classification using SVMs we used the words "carry," "suit," "dark," "wash," and "water." For multi-class classification using SVMs we used the words "carry," "water," "like," "suit," and "your."

Fig. 11. Comparison of classification accuracy with number of words.

As we increase the number of words from one to five for a particular speaker, our weighted accent classification algorithm using ELMs and SVMs results in an improvement of accuracy from 49.04% to 77.88% and from 43.27% to 60.58%, respectively. In the case of multi-class classification using ELMs and SVMs, there is only a very slight improvement in accent classification accuracy (multi-class ELMs 29.81% to 35.58% and multi-class SVMs 28.85% to 30.77%).

7.3. Classification accuracy of each accent

Fig. 12 shows the accent classification accuracy as a function of true accent. In this experiment we used five words from each speaker. As shown in Fig. 12, accent D6 (New York City) shows the worst performance for both ELM and SVM classifiers. This is because speakers from accent region D6 (New York City) intermixed with speakers from accent regions D1 (New England) and D2 (Northern).

Fig. 12. Classification accuracy per different accents.

7.4. Comparison of ELMs and SVMs training and testing time

Using ELMs as classifiers for our accent classification algorithm gives better accent classification accuracy relative to SVMs. Fig. 13 shows the relative comparison of training and testing time using ELMs and SVMs as classifiers for the proposed weighted accent classification algorithm. ELMs take less time to train and operate, by more than a factor of 2, relative to SVMs.

Fig. 13. Comparison of ELMs and SVMs training and testing time.

7.5. Comparison of accent classification results

As discussed, accent classification is a challenging problem, and it becomes more challenging on read sentences because word selection and sentence structure are not part of the message as in spontaneous speech. Different researchers have tried various approaches on different datasets, and most of the work has been done on classifying accents of non-native speakers. In the empirical study of 23-way classification on the Foreign Accented English (FAE) dataset [6], an average accuracy of 32.7% was obtained. Macías-Guarasa used acoustic methods to classify a German and Spanish group in the FAE dataset [9].
By using GMMs and the Bayesian classifier, detection rates of 73% and 58.9%, respectively, were obtained. In text-independent automatic accent classification using phoneme-based models, average classification accuracies of 64.90% at the phone level and 75.18% at the word level for pairwise classification were obtained [8]. For a pool of four accents, the average classification accuracy rate was 37.57% at the phone level and 46.72% at the word level. In another study on the TIMIT dataset that used the most discriminating vowels, a detection rate of 42.52% was obtained [11]. Table 3 summarizes the comparison of accent classification results.

Table 3
Comparison of accent classification results.

Dataset   | Technique          | Accuracy (%)
FAE       | HLDA+MMI           | 32.70
FAE       | GMM                | 73.00
FAE       | Bayes'             | 58.90
CU-Accent | PCA+LDA            | 64.90
TIMIT     | Prosodic analysis  | 42.52
TIMIT     | ELM (proposed)     | 77.88
TIMIT     | SVM (proposed)     | 60.58

8. Conclusions and future work

In this paper, we proposed a weighted accent classification algorithm that uses a novel architecture for accent classification based on ELMs. The algorithm uses five words from a speaker to differentiate between different accents and is comprised of three stages. In the first stage, a given word from a test speaker is presented as the input to 21 ELMs, which are each trained to distinguish between two accents. In the second stage, the outputs of the multiple ELMs are combined to obtain a classification score for that word. Finally, the classification score is encoded and optionally combined with the scores from other words, and a decision about the accent class is based on the highest total score. Experiments were conducted on seven different accent groups from the TIMIT dataset. Our proposed technique classifies speakers into seven groups with an accuracy of 77.88% using five words from a given test speaker. To the authors' knowledge, this is the first attempt to use ELMs for accent classification. We also compared our weighted accent classification algorithm performance by using SVMs as classifiers and also with multiclass classification using ELMs or SVMs. In the future, we will investigate different words from various accent and dialect groups which have prominent variations.

References

[1] L.M. Arslan, J.H. Hansen, Language accent classification in American English, Speech Commun. 18 (4) (1996) 353–367.
[2] J.J. Humphries, Accent modelling and adaptation in automatic speech recognition, University of Cambridge, 1998 (Ph.D. thesis).
[3] R. Huang, J.H. Hansen, P. Angkititrakul, Dialect/accent classification using unrestricted audio, IEEE Trans. Audio Speech Lang. Process. 15 (2) (2007) 453–464.
[4] A.D. Lawson, D.M. Harris, J.J. Grieco, Effect of foreign accent on speech recognition in the NATO N-4 corpus, in: Proceedings of the Eighth European Conference on Speech Communication and Technology, 2003.
[5] S. Goronzy, Robust Adaptation to Non-Native Accents in Automatic Speech Recognition, vol. 2560, Springer Science & Business Media, 2002.
[6] G. Choueiter, G. Zweig, P. Nguyen, An empirical study of automatic accent classification, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 4265–4268.
[7] T. Lander, CSLU: Foreign Accented English Release 1.2, Linguistic Data Consortium, Philadelphia, 2007.
[8] P. Angkititrakul, J.H. Hansen, Advances in phone-based modeling for automatic accent classification, IEEE Trans. Audio Speech Lang. Process. 14 (2) (2006) 634–646.
[9] J. Macías-Guarasa, Acoustic adaptation and accent identification in the ICSI MR and FAE corpora, in: Proceedings of the ICSI Meeting Slides, 2003.
[10] C.G. Clopper, D.B. Pisoni, K. De Jong, Acoustic characteristics of the vowel systems of six regional varieties of American English, J. Acoust. Soc. Am. 118 (3) (2005) 1661–1676.
[11] J.H. Hansen, U.H. Yapanel, R. Huang, A. Ikeno, Dialect analysis and modeling for automatic classification, in: Proceedings of Interspeech, 2004.
[12] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[13] G.-B. Huang, E. Cambria, K.-A. Toh, B. Widrow, Z. Xu, New trends of learning in computational intelligence, IEEE Comput. Intell. Mag. 10 (2) (2015) 16–17.
[14] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, IEEE, 2004, pp. 985–990.
[15] G. Huang, S. Song, J.N. Gupta, C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE Trans. Cybern. 44 (12) (2014) 2405–2417.
[16] G.-B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2) (2011) 107–122.
[17] E. Cambria, N. Howard, Y. Xia, T.-S. Chua, Computational intelligence for big social data analysis, IEEE Comput. Intell. Mag. 11 (3) (2016) 8–9.
[18] G.-B. Huang, An insight into extreme learning machines: random neurons, random features and kernels, Cognit. Comput. 6 (3) (2014) 376–390.
[19] G.-B. Huang, What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle, Cognit. Comput. 7 (3) (2015) 263–278.
[20] J. Tang, C. Deng, G.-B. Huang, Extreme learning machine for multilayer perceptron, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2016) 809–821.
[21] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[22] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13) (2009) 3391–3395.
[23] G.-B. Huang, Z. Bai, L.L.C. Kasun, C.M. Vong, Local receptive fields based extreme learning machine, IEEE Comput. Intell. Mag. 10 (2) (2015) 18–29.
[24] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B Cybern. 42 (2) (2012) 513–529.
[25] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16) (2007) 3056–3062.
[26] W.F. Schmidt, M.A. Kraaijveld, R.P. Duin, Feedforward neural networks with random weights, in: Proceedings of the Eleventh IAPR International Conference on Pattern Recognition Methodology and Systems, vol. 2, IEEE, 1992, pp. 1–4.
[27] P.L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory 44 (2) (1998) 525–536.
[28] P. Lancaster, M. Tismenetsky, et al., The Theory of Matrices: With Applications, Elsevier, 1985.
[29] N.R. Draper, H. Smith, E. Pownell, Applied Regression Analysis, third ed., Wiley, New York, 1966.
[30] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[31] V. Vapnik, S.E. Golowich, A. Smola, Support vector method for function approximation, regression estimation, and signal processing, in: Proceedings of the Advances in Neural Information Processing Systems, vol. 9, Citeseer, 1996.
[32] S.P. Schölkopf, V. Vapnik, A. Smola, Improving the accuracy and speed of support vector machines, Adv. Neural Inf. Process. Syst. 9 (1997) 375–381.
[33] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[34] A. Aizerman, E.M. Braverman, L. Rozoner, Theoretical foundations of the potential function method in pattern recognition learning, Autom. Remote Control 25 (1964) 821–837.

[35] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013.
[36] B. Frénay, M. Verleysen, et al., Using SVMs with randomised feature spaces: an extreme learning approach, in: Proceedings of the European Symposium on Artificial Neural Networks, 2010.
[37] L. Zhang, D. Zhang, F. Tian, SVM and ELM: who wins? Object recognition with deep convolutional features from ImageNet, in: Proceedings of the Extreme Learning Machine, vol. 1, Springer, 2016, pp. 249–263.
[38] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1–3) (2002) 131–159.
[39] J. Chorowski, J. Wang, J.M. Zurada, Review and performance comparison of SVM- and ELM-based classifiers, Neurocomputing 128 (2014) 507–516.
[40] X. Liu, C. Gao, P. Li, A comparative analysis of support vector machines and extreme learning machines, Neural Netw. 33 (2012) 58–66.
[41] G.-B. Huang, Extreme learning machines – filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle?, http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Tutorial.pdf.
[42] B. Scholkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2001.
[43] G.-B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1) (2010) 155–163.
[44] J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[45] J.A. Suykens, T.V. Gestel, J. De Brabanter, Least Squares Support Vector Machines, fourth ed., World Scientific, 2002.
[46] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144–152.
[47] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Proceedings of the Advances in Kernel Methods, 1999, pp. 185–208.
[48] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[49] M. Rizwan, B.O. Odelowo, D.V. Anderson, Word based dialect classification using extreme learning machines, in: Proceedings of the IEEE International Joint Conference on Neural Networks, IEEE, 2016.
[50] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, NIST Speech Disc 1-1.1, NASA STI/Recon Technical Report No. 93, NASA, 1993, 27403.
[51] V. Zue, S. Seneff, J. Glass, Speech database development at MIT: TIMIT and beyond, Speech Commun. 9 (4) (1990) 351–356.
[52] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
[53] M. Slaney, Auditory Toolbox, Interval Research Corporation Tech. Rep. No. 1998-010, Interval Research Corporation, 1998.
[54] V. Tiwari, MFCC and its applications in speaker recognition, Int. J. Emerg. Technol. 1 (1) (2010) 19–22.
[55] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al., A practical guide to support vector classification, Technical report, Department of Computer Science, National Taiwan University.
[56] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.

Muhammad Rizwan received his B.E. degree from the National University of Sciences & Technology, Pakistan, and his M.S. degree from the Lahore University of Management & Sciences, Pakistan. Currently, he is a Ph.D. candidate in the School of Electrical and Computer Engineering, Georgia Institute of Technology (Georgia Tech), GA, USA. His research interests include deep neural networks, extreme learning machines, learning algorithms, adaptive systems, and unsupervised learning. He is a member of the IEEE, the IEEE Signal Processing Society, and the American Society for Engineering Education.

David V. Anderson is a Professor of Electrical and Computer Engineering at Georgia Tech. He received B.S. and M.S. degrees from Brigham Young University and a Ph.D. degree from the Georgia Institute of Technology (Georgia Tech) in 1993, 1994, and 1999, respectively. Dr. Anderson's research interests include audio and psycho-acoustics, signal processing in the context of human perception, and applications of machine learning to signal processing. Dr. Anderson was awarded the National Science Foundation CAREER Award for excellence as a young educator and researcher in 2004 and the Presidential Early Career Award for Scientists and Engineers in the same year. He has over 180 technical publications and 7 patents. Dr. Anderson is a senior member of the IEEE.
