Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article history: Received 2 June 2016; Revised 11 January 2017; Accepted 13 January 2017; Available online 31 August 2017

MSC: 00-01; 99-00

Keywords: Extreme learning machines; Support vector machines; Accent classification

Abstract

Speech recognition systems exhibit performance degradation due to variability in speech caused by the accents or dialects of speakers. This can be overcome by correctly identifying the accent or dialect of the speaker and using accent or dialect information to adapt speech recognition systems. In this paper, we apply extreme learning machines (ELMs) and support vector machines (SVMs) to the problem of accent/dialect classification on the TIMIT dataset. We used Mel frequency cepstrum coefficients (MFCCs) and the normalized energy parameter along with their first and second derivatives as raw features for training ELMs and SVMs. A weighted accent classification algorithm is proposed that uses a novel architecture to classify North American accents into seven groups. Using this algorithm, we obtained a classification accuracy of 77.88% using ELMs, which, to our knowledge, is the best result reported for accent classification on the TIMIT dataset. We also compared the performance of ELMs with SVMs as classifiers for our weighted accent classification algorithm and with multi-class classification using ELMs or SVMs.

© 2017 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2017.01.116
M. Rizwan, D.V. Anderson / Neurocomputing 277 (2018) 120–128 121
Table 2
Comparison of ELMs and SVMs.

ELMs and SVMs have the same dual optimization objective functions. In ELMs, optimal solutions are learned from the entire cube $[0, C]^N$, while in SVMs the optimal $\alpha_i$ is learned from one hyperplane $\sum_{i=1}^{N} \alpha_i t_i = 0$ within the cube $[0, C]^N$, as shown in Figs. 3 and 4. This results in the SVM solution being sub-optimal [41].

4.2. Loss function

The training of a classifier depends on the loss function. The loss function has a significant impact on the training time of the classifier, as well as on the computational cost for the classification of new data [39].

4.4. Hyperparameters

ELM training time can be estimated as it uses a closed-form solution for calculating weights. Let $N$ be the number of training samples, $D$ be the dimensionality of the input data, and $M$ be the number of neurons in the hidden layer of the ELM. In order to calculate the weight matrix given by Eq. (9), we first need to calculate $H$. Calculating the $H$ matrix requires $O(NDM)$ operations. The weight matrix $\beta$ requires $O(NM^2 + M^3)$ operations [24,39]. The training and testing times of the ELM are given by Eqs. (21) and (22) for the case when $N \gg M$ and $N \gg D$.

$\text{ELM Training Time} = O(NM^2)$   (21)

$\text{ELM Testing Time} = O(MD)$   (22)
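As a concrete illustration of this closed-form training step, the following Python sketch (our illustration, not the authors' code; the function names, seed, and regularization constant are assumptions) builds the hidden-layer matrix H with random fixed weights and solves for the output weights β by regularized least squares, matching the O(NDM) and O(NM² + M³) costs above.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, M=100, reg=1e-3):
    """X: (N, D) inputs, T: (N, K) one-hot targets, M: hidden neurons (all names assumed)."""
    N, D = X.shape
    W = rng.normal(size=(D, M))                 # random input weights, never trained
    b = rng.normal(size=M)                      # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # sigmoid hidden layer: O(N*D*M)
    # Regularized least squares: beta = (H^T H + reg*I)^-1 H^T T, costing O(N*M^2 + M^3)
    beta = np.linalg.solve(H.T @ H + reg * np.eye(M), H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

# Toy usage: two well-separated Gaussian clusters
X = np.vstack([rng.normal(0.0, 0.1, (50, 3)), rng.normal(2.0, 0.1, (50, 3))])
T = np.zeros((100, 2))
T[:50, 0] = 1.0
T[50:, 1] = 1.0
W, b, beta = elm_train(X, T, M=40)
labels = np.r_[np.zeros(50), np.ones(50)]
acc = float(np.mean(elm_predict(X, W, b, beta) == labels))
```

Because only β is learned and everything else is random and fixed, the single linear solve is the entire training procedure, which is why ELM training time is dominated by the O(NM²) term when N ≫ M.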
$$\Delta c_i = \frac{\sum_{n=1}^{M} n\,(c_{i+n} - c_{i-n})}{2\sum_{n=1}^{M} n^2} \quad (25)$$

For each word sample we have 39-dimensional feature vectors consisting of 13 static cepstral features, 13 $\Delta$ cepstral features, and 13 $\Delta$-$\Delta$ cepstral features. The $\Delta$'s improve the accent classification accuracy by adding temporal dependencies.
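Eq. (25) can be sketched in Python as follows (our sketch, not the authors' code; the edge-padding convention is an assumption, as the paper does not state how boundary frames are handled):

```python
import numpy as np

def delta(cepstra, M=2):
    """Delta coefficients per Eq. (25). cepstra: (num_frames, num_coeffs) array.
    Edge frames are padded by replication (assumed convention)."""
    padded = np.pad(cepstra, ((M, M), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, M + 1))
    out = np.zeros_like(cepstra, dtype=float)
    for i in range(cepstra.shape[0]):
        j = i + M  # index of frame i inside the padded array
        acc = np.zeros(cepstra.shape[1])
        for n in range(1, M + 1):
            acc += n * (padded[j + n] - padded[j - n])
        out[i] = acc / denom
    return out

# The 39-dimensional vector per frame: 13 static + 13 delta + 13 delta-delta
c13 = np.zeros((20, 13))                      # placeholder static cepstra
feat = np.hstack([c13, delta(c13), delta(delta(c13))])
```

On a linear ramp of cepstral values the interior deltas come out to exactly 1, which is a quick sanity check that the regression formula is implemented correctly.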
Fig. 8. Weighted score.
In our case, no hard decision is made for a single word, so the results are combined over the entire utterance using a weighting scheme as described below in Section 5.3.

5.3. Accent decision

Classification results from multiple words are combined using a weighting scheme that improves overall performance. The output classes from each of the 21 ELMs (or SVMs) are tallied and a score is given to each class according to the number of times that class was selected. The maximum count that any class can have is 6, and the count⇒score mapping is given in Fig. 8. The overall dialect class is determined by the highest total score.

6. Experiment

6.1. Dataset

The dataset used in our experiment is TIMIT, a speech dataset developed by Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT) and considered one of the standard datasets in speech research [50,51]. The TIMIT dataset contains utterances from 630 speakers representing eight different dialect regions of the United States. The dialect regions are: New England (D1), Northern (D2), North-Midland (D3), South-Midland (D4), Southern (D5), New York City (D6), Western (D7), and Army Brat. The TIMIT dataset uses the term dialect for specifying these regions; to be consistent with the dataset we use the word dialect here. These utterances are read, so there are no word or grammar variations; the only variation in the acoustic waveform is the accent variation. For each utterance the text, the signal sampled at 16 kHz, and hand-labeled segmentation at the word and phonetic level are provided. In our experiment, we used the first seven accent regions, as the Army Brat accent group comprises speakers who moved around often during their childhood. For each speaker we have ten utterances consisting of two accent sentences (SA), which are the same for each speaker, five phonetically compact sentences (SX), and three phonetically diverse sentences (SI). In our proposed method we use words from the "SA" sentences, as these words are available for each speaker.

The TIMIT dataset is provided with word label information. Using word-label information, we extracted speech samples of words from the TIMIT dataset. These speech samples were normalized between -1 and 1. We extracted 12 Mel Frequency Cepstral Coefficients (MFCCs) [52] and the normalized energy parameter using the Auditory Toolbox [53]. We used a Hamming window and a triangular filter bank for the MFCCs [54]. To incorporate temporal dependencies we used Δ and Δ-Δ coefficients. Delta (Δ) coefficients are computed as given in Eq. (25).

6.3. ELMs and SVMs hyperparameters

During ELM training, the number of neurons in the hidden layer was varied from 100 to 1000 in increments of 100, and the sigmoid was used as the non-linear activation function. The number of neurons in the hidden layer was selected using a trial-and-error procedure based on cross-validation. For SVM training, a grid search method was used to find the optimal SVM model parameters [55]. SVMs in the weighted accent classification algorithm were trained using the LIBSVM library [56]. We used linear, polynomial, RBF, and sigmoid kernels with d = {1, 2, ..., 15}, γ = {2^-15, 2^-14, ..., 2^5}, and C = {2^-3, 2^-2, ..., 2^15}.

7. Results

We compared the accuracy of the weighted accent classification algorithm using ELMs and SVMs as classifiers. We also compared the performance of different words and evaluated the improvement resulting from using multiple words from a particular speaker. Finally, we compared the relative performance of ELMs and SVMs as classifiers in our accent classification algorithm in terms of training and testing time.

7.1. Comparison of different words

In our first experiment we used eleven different words: "dark," "like," "oily," "suit," "that," "wash," "year," "your," "carry," "water," and "greasy" to classify a speaker into one of the seven different accents. We selected words with three or more letters so that they can capture variability in terms of accents and are available for all speakers in the TIMIT dataset. We tested the weighted accent classification algorithm (Section 5) using ELMs and SVMs as classifiers with only one word at a time. We also compared the performance of our proposed weighted accent classification algorithm with multi-class classification. Figs. 9 and 10 show the comparison of our proposed weighted accent classification algorithm with multi-class classification using ELMs and SVMs as classifiers.

Our proposed algorithm gives better results than multi-class classification. Our proposed weighted accent classification algorithm with an ELM-based classifier performed best with the word "like," while the SVM-based classifier performed best with the word "carry." In this experiment we used only one word at a time from a speaker.

In this experiment, we compared the improvement in accent classification accuracy obtained by using multiple words from a given speaker. We varied the number of words from one to five for a particular speaker. Fig. 11 shows the comparison of our proposed weighted accent classification algorithm with multi-class classification using ELMs and SVMs as classifiers for multiple words. We used the top five words in terms of their performance as presented in Figs. 9 and 10. For weighted accent classification using ELMs
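The tally-and-score decision stage of Section 5.3 can be sketched as follows (our illustration, not the authors' code; `COUNT_TO_SCORE` below is a placeholder monotone mapping standing in for the actual count⇒score mapping of Fig. 8, which is not reproduced here):

```python
from collections import Counter

# Assumed placeholder for the Fig. 8 count->score mapping: a class can be
# selected by at most 6 of the 21 pairwise classifiers (6 pairs involve it).
COUNT_TO_SCORE = {1: 1, 2: 2, 3: 4, 4: 6, 5: 8, 6: 10}

def word_scores(pairwise_outputs):
    """pairwise_outputs: 21 class labels, one from each pairwise classifier.
    Returns a score per class according to how often it was selected."""
    counts = Counter(pairwise_outputs)
    return {cls: COUNT_TO_SCORE[c] for cls, c in counts.items()}

def accent_decision(pairwise_outputs):
    """Single-word decision: the class with the highest score wins."""
    scores = word_scores(pairwise_outputs)
    return max(scores, key=scores.get)

def utterance_decision(per_word_outputs):
    """Multi-word decision: sum per-word scores over all words, pick the top class."""
    total = Counter()
    for outputs in per_word_outputs:
        for cls, s in word_scores(outputs).items():
            total[cls] += s
    return max(total, key=total.get)

# Toy usage: class "D1" wins all 6 of its pairwise contests
outputs = ["D1"] * 6 + ["D2"] * 5 + ["D3"] * 4 + ["D4"] * 3 + ["D5"] * 2 + ["D6"]
single = accent_decision(outputs)
overall = utterance_decision([outputs, outputs])
```

Summing scores across words rather than taking per-word hard decisions lets strong evidence from one word outweigh ties or noise in another, which is the point of the weighting scheme.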
FAE dataset [9]. By using GMMs and the Bayesian classifier, detection rates of 73% and 58.9%, respectively, were obtained. In text-independent automatic accent classification using phoneme-based models, average classification accuracies of 64.90% at the phone level and 75.18% at the word level for pairwise classification were obtained [8]. For a pool of four accents, the average classification accuracy was 37.57% at the phone level and 46.72% at the word level. In another study on the TIMIT dataset that used the most discriminating vowels, a detection rate of 42.52% was obtained [11]. Table 3 summarizes the comparison of accent classification results.

Table 3
Comparison of accent classification results.

Dataset     Technique           Accuracy (%)
FAE         HLDA+MMI            32.70
FAE         GMM                 73.00
FAE         Bayes'              58.90
CU-accent   PCA+LDA             64.90
TIMIT       Prosodic analysis   42.52
TIMIT       ELM (Proposed)      77.88
TIMIT       SVM (Proposed)      60.58

Fig. 13. Comparison of ELMs and SVMs training and testing time.

8. Conclusions and future work

In this paper, we proposed a weighted accent classification algorithm that uses a novel architecture for accent classification based on ELMs. The algorithm uses five words from a speaker to differentiate between different accents and is comprised of three stages. In the first stage, a given word from a test speaker is presented as the input to 21 ELMs, which are each trained to distinguish between two accents. In the second stage, the outputs of multiple ELMs are combined to obtain a classification score for that word. Finally, the classification score is encoded and optionally combined with the scores from other words, and a decision about an accent class is based on the highest total score. Experiments were conducted on seven different accent groups from the TIMIT dataset. Our proposed technique classifies speakers into seven groups with an accuracy of 77.88% using five words from a given test speaker. To the authors' knowledge, this is the first attempt to use ELMs for accent classification. We also compared our weighted accent classification algorithm performance by using SVMs as classifiers and also with multiclass classification using ELMs or SVMs. In the future, we will investigate different words

References

[1] L.M. Arslan, J.H. Hansen, Language accent classification in American English, Speech Commun. 18 (4) (1996) 353–367.
[2] J.J. Humphries, Accent modelling and adaptation in automatic speech recognition, Ph.D. thesis, University of Cambridge, 1998.
[3] R. Huang, J.H. Hansen, P. Angkititrakul, Dialect/accent classification using unrestricted audio, IEEE Trans. Audio Speech Lang. Process. 15 (2) (2007) 453–464.
[4] A.D. Lawson, D.M. Harris, J.J. Grieco, Effect of foreign accent on speech recognition in the NATO n-4 corpus, in: Proceedings of the Eighth European Conference on Speech Communication and Technology, 2003.
[5] S. Goronzy, Robust Adaptation to Non-Native Accents in Automatic Speech Recognition, 2560, Springer Science & Business Media, 2002.
[6] G. Choueiter, G. Zweig, P. Nguyen, An empirical study of automatic accent classification, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 4265–4268.
[7] T. Lander, CSLU: Foreign Accented English Release 1.2, Linguistic Data Consortium, Philadelphia, 2007.
[8] P. Angkititrakul, J.H. Hansen, Advances in phone-based modeling for automatic accent classification, IEEE Trans. Audio Speech Lang. Process. 14 (2) (2006) 634–646.
[9] J. Macías-Guarasa, Acoustic adaptation and accent identification in the ICSI MR and FAE corpora, in: Proceedings of the ICSI Meeting Slides, 2003.
[10] C.G. Clopper, D.B. Pisoni, K. De Jong, Acoustic characteristics of the vowel systems of six regional varieties of American English, J. Acoust. Soc. Am. 118 (3) (2005) 1661–1676.
[11] J.H. Hansen, U.H. Yapanel, R. Huang, A. Ikeno, Dialect analysis and modeling for automatic classification, in: Proceedings of Interspeech, 2004.
[12] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[13] G.-B. Huang, E. Cambria, K.-A. Toh, B. Widrow, Z. Xu, New trends of learning in computational intelligence, IEEE Comput. Intell. Mag. 10 (2) (2015) 16–17.
[14] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2, IEEE, 2004, pp. 985–990.
[15] G. Huang, S. Song, J.N. Gupta, C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE Trans. Cybern. 44 (12) (2014) 2405–2417.
[16] G.-B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2) (2011) 107–122.
[17] E. Cambria, N. Howard, Y. Xia, T.-S. Chua, Computational intelligence for big social data analysis, IEEE Comput. Intell. Mag. 11 (3) (2016) 8–9.
[18] G.-B. Huang, An insight into extreme learning machines: random neurons, random features and kernels, Cognit. Comput. 6 (3) (2014) 376–390.
[19] G.-B. Huang, What are extreme learning machines? Filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle, Cognit. Comput. 7 (3) (2015) 263–278.
[20] J. Tang, C. Deng, G.-B. Huang, Extreme learning machine for multilayer perceptron, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2016) 809–821.
[21] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[22] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13) (2009) 3391–3395.
[23] G.-B. Huang, Z. Bai, L.L.C. Kasun, C.M. Vong, Local receptive fields based extreme learning machine, IEEE Comput. Intell. Mag. 10 (2) (2015) 18–29.
[24] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B Cybern. 42 (2) (2012) 513–529.
[25] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16) (2007) 3056–3062.
[26] W.F. Schmidt, M.A. Kraaijveld, R.P. Duin, Feedforward neural networks with random weights, in: Proceedings of the Eleventh IAPR International Conference on Pattern Recognition Methodology and Systems, 2, IEEE, 1992, pp. 1–4.
[27] P.L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory 44 (2) (1998) 525–536.
[28] P. Lancaster, M. Tismenetsky, et al., The Theory of Matrices: With Applications, Elsevier, 1985.
[29] N.R. Draper, H. Smith, E. Pownell, Applied Regression Analysis, third ed., Wiley, New York, 1966.
[30] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[31] V. Vapnik, S.E. Golowich, A. Smola, Support vector method for function approximation, regression estimation, and signal processing, in: Proceedings of the Advances in Neural Information Processing Systems, 9, Citeseer, 1996.
[32] S.P. Schölkopf, V. Vapnik, A. Smola, Improving the accuracy and speed of support vector machines, Adv. Neural Inf. Process. Syst. 9 (1997) 375–381.
[33] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[34] A. Aizerman, E.M. Braverman, L. Rozoner, Theoretical foundations of the potential function method in pattern recognition learning, Autom. Remote Control 25 (1964) 821–837.
[35] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013.
[36] B. Frénay, M. Verleysen, et al., Using SVMs with randomised feature spaces: an extreme learning approach, in: Proceedings of the European Symposium on Artificial Neural Networks, 2010.
[37] L. Zhang, D. Zhang, F. Tian, SVM and ELM: who wins? Object recognition with deep convolutional features from ImageNet, in: Proceedings of the Extreme Learning Machine, 1, Springer, 2016, pp. 249–263.
[38] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1–3) (2002) 131–159.
[39] J. Chorowski, J. Wang, J.M. Zurada, Review and performance comparison of SVM- and ELM-based classifiers, Neurocomputing 128 (2014) 507–516.
[40] X. Liu, C. Gao, P. Li, A comparative analysis of support vector machines and extreme learning machines, Neural Netw. 33 (2012) 58–66.
[41] G.-B. Huang, Extreme learning machines – filling the gap between Frank Rosenblatt's dream and John von Neumann's puzzle?, http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Tutorial.pdf.
[42] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2001.
[43] G.-B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1) (2010) 155–163.
[44] J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[45] J.A. Suykens, T.V. Gestel, J. De Brabanter, Least Squares Support Vector Machines, fourth ed., World Scientific, 2002.
[46] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144–152.
[47] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Proceedings of the Advances in Kernel Methods, 1999, pp. 185–208.
[48] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[49] M. Rizwan, B.O. Odelowo, D.V. Anderson, Word based dialect classification using extreme learning machines, in: Proceedings of the IEEE International Joint Conference on Neural Networks, IEEE, 2016.
[50] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1–1.1, NASA STI/Recon Technical Report No. 93, NASA, (1993) 27403.
[51] V. Zue, S. Seneff, J. Glass, Speech database development at MIT: TIMIT and beyond, Speech Commun. 9 (4) (1990) 351–356.
[52] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
[53] M. Slaney, Auditory Toolbox, Interval Research Corporation Tech. Rep. No. 1998-010, Interval Research Corporation, 1998.
[54] V. Tiwari, MFCC and its applications in speaker recognition, Int. J. Emerg. Technol. 1 (1) (2010) 19–22.
[55] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al., A practical guide to support vector classification, Technical report, Department of Computer Science, National Taiwan University.
[56] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.

Muhammad Rizwan received his B.E. degree from National University of Sciences & Technology, Pakistan, and M.S. degree from Lahore University of Management & Sciences, Pakistan. Currently, he is a Ph.D. candidate in the School of Electrical and Computer Engineering, Georgia Institute of Technology (Georgia Tech), GA, USA. His research interests include deep neural networks, extreme learning machines, learning algorithms, adaptive systems, and unsupervised learning. He is a member of the IEEE, the IEEE Signal Processing Society, and the American Society for Engineering Education.

David V. Anderson is a Professor of Electrical and Computer Engineering at Georgia Tech. He received B.S. and M.S. degrees from Brigham Young University and a Ph.D. degree from Georgia Institute of Technology (Georgia Tech) in 1993, 1994, and 1999, respectively. Dr. Anderson's research interests include audio and psycho-acoustics, signal processing in the context of human perception, and applications of machine learning to signal processing. Dr. Anderson was awarded the National Science Foundation CAREER Award for excellence as a young educator and researcher in 2004 and the Presidential Early Career Award for Scientists and Engineers in the same year. He has over 180 technical publications and 7 patents. Dr. Anderson is a senior member of the IEEE.