
The 11th International Symposium on Communications & Information Technologies (ISCIT 2011)

Realization of Embedded Speech Recognition Module Based on STM32


Qinglin Qu School of Electrical and Information Engineering Anhui University of Science and Technology Huainan, Anhui, China E-mail:294025193@qq.com
Abstract—Speech recognition is key to realizing man-machine interface technology. To improve the accuracy of speech recognition and implement the module on an embedded system, an embedded speaker-independent isolated-word speech recognition system based on ARM is designed after analyzing speech recognition theory. The system uses the DTW algorithm, improved with a parallelogram constraint, to extract characteristic parameters and identify the results. To perform speech recognition independently, the system uses an STM32-series chip combined with external circuitry. The recognition rate in testing reaches 90%, which meets the real-time requirements of recognition.

Keywords—speech recognition; embedded system; STM32; DTW

Liangguang Li School of Electrical and Information Engineering Anhui University of Science and Technology Huainan, Anhui, China E-mail: lgli@aust.edu.cn

I. INTRODUCTION

It has long been a dream that humans could communicate with machines and have them understand what is said. Speech recognition is the technology that lets a machine translate voice signals into text or commands through recognition and understanding. It is a cross-disciplinary subject that draws on signal processing, pattern recognition, probability theory and information theory, the mechanisms of speech production and hearing, and artificial intelligence [1]. With the development of very-large-scale integration in recent years, embedded speech recognition systems have become a new direction in voice recognition.

II. PRINCIPLE OF SPEECH RECOGNITION

Speech recognition is a branch of pattern recognition that involves two processes: training and recognition. The first stage, training, is also known as the modeling stage: the system learns and summarizes human speech, and the learned knowledge is stored to establish a reference model. The second stage, recognition, is also known as the testing stage: the system matches the incoming voice signal against the reference models in the library and returns the nearest meaning or semantic result [2]. Typical speech recognition methods include dynamic time warping (DTW), vector quantization (VQ), hidden Markov models (HMM), artificial neural networks (ANN), and mixed pattern recognition techniques.

III. DESIGN OF SYSTEM

A. Hardware design of system
The system uses the STM32F103VET6 processor, produced by ST and based on the Cortex-M3 core, as its controller. With the help of the speech-input amplification and filter circuit, the storage circuit, the LCD, the keyboard, the audio DAC, and a PC linked via JLINK8 through the JTAG interface, the modeling and testing processes are completed. The system hardware diagram is shown in figure 1.

978-1-4577-1295-1/11/$26.00 ©2011 IEEE


Fig.1 Structure of system hardware

The STM32F103VET6 is a high-performance 32-bit processor based on the Cortex-M3 core [3] with an optimized level of power dissipation. It has three low-power modes: sleep mode, stop mode, and standby mode. Its maximum working frequency is 72 MHz, and its single-cycle multiplication and hardware division are very favorable for digital signal processing. It also includes 512 Kbytes of flash memory, 64 Kbytes of SRAM, a flexible static memory controller with 4 chip selects, an LCD parallel interface, a 12-channel DMA controller, 11 timers, and 112 fast I/O ports. The structure of the system is described below.

1) Control section
The control section is composed of the STM32F103VET6 and the keyboard. The STM32F103VET6 completes speech signal collection, feature extraction, training, and recognition. Speech signals are converted by the ADC and the results are stored in RAM; the system matches these results, processed by a series of algorithms, against the templates in the library to realize speech recognition. The keyboard mainly handles system reset and some simple control functions.

2) Input and output
This section comprises a microphone, the amplification and filter circuit, the LCD, the audio DAC, and the speaker. Speech signals are input from the microphone and, after amplification and filtering, enter the controller. The conversion voltage range of the STM32F103VET6 ADC is [4]: VREF- ≤ VIN ≤ VREF+, where VREF- = 0 V and VREF+ = 3.3 V here; therefore, the component parameters of the amplification and filter circuit were chosen so that its output stays between 0 V and 3.3 V. For output, the audio signals pass through the chip's I2S interface to the external audio DAC and then to the speaker. The processor's parallel LCD interface makes it convenient to display the results.

3) Storage part
The STM32F103VET6 supports Compact Flash, SRAM, PSRAM, NOR, and NAND memories, so it is convenient to store the sampled data in external memory.

B. Design of software
The system uses linear prediction coefficients (LPC) as characteristic vectors and Dynamic Time Warping (DTW) as the matching algorithm to complete voice recognition. Using the computing power of the STM32, speech signal sampling, endpoint detection, feature extraction, and recognition are all performed in the processor; with the help of the external amplification and filter circuit, the system can carry out speech recognition on its own.

1) Preprocessing
Preprocessing includes pre-emphasis, windowing, and endpoint detection. Because the 32-bit STM32F103VET6 provides single-cycle multiplication, hardware division, and a 12-bit A/D converter, the algorithms were adapted before preprocessing to improve operating speed. Pre-emphasis mainly boosts the high-frequency band of the signal to flatten its spectrum for analysis. The coefficient used in this experiment is 0.93 and the formula is [5]:

data[i] = original[i] - 0.93 * original[i-1]    (1)

Here original[] stores the sampled data and data[] is the pre-emphasis output. For convenience of calculation, 0.93 is shifted left by seven bits and its integer part, 119, is taken; the result is thus scaled up by 2^7 and the formula becomes:

data[i] = (original[i] << 7) - 119 * original[i-1]    (2)

Windowing preserves the short-time characteristics of the speech signal. The algorithm selects the Hamming window, whose formula is:

h(n) = 0.54 - 0.46 * cos(2*pi*n / N)    (3)

N is the number of sampling points in a frame, and the windowing is realized as data[n] *= h(n). Because the window function requires many cosine evaluations and each value is less than 1, h(n) is stored as an array whose values are shifted left by seven bits and truncated to integers (N being the length of one frame). After the treatments above, the sampled data have been scaled up by 2^14, i.e., shifted left by 14 bits in total, and the maximum possible value of the data is 2^26. To prevent overflow while calculating the covariance terms for the linear prediction coefficients, the data are shifted right by 10 bits. The covariance program is shown below; r[j] stores the covariance data and the data keep four effective (binary) digits:

for (j = 0; j <= p; j++) {
    r[j] = 0;
    for (i = 0; i < n - j; i++)
        r[j] += (data[i] * data[i + j]) >> 4;
}

According to the characteristics of Chinese pronunciation, we choose the short-term energy and zero-crossing rate of the speech signal as the characteristic parameters [6] and use the threshold method to judge the starting and ending points of the speech signal. The short-term energy of s(n) is defined as follows:

En = Σm [s(m) * w(n - m)]^2    (4)

where w(n - m) is the Hamming window h(n). After pre-emphasis, the processing result of signal "1" is shown in figure 2.

Fig.2 Processing result of signal 1 (waveform, short-term energy, and zero-crossing rate)

2) Extraction of feature vector
The system uses linear prediction coefficients (LPC) as the characteristic parameters and solves them with the Durbin algorithm. The cepstrum operation on the LPC exploits the minimum-phase characteristic of the vocal tract system function and has low computational complexity. In linear prediction analysis, the choice of the order p must be made carefully: a larger p causes more oscillation, which makes the inherent characteristics of the speech signal appear random. Taking 256 points of the speech signal "1" and calculating the 12th-order and 10th-order LPC coefficients separately, we obtain the results shown in figure 3.

Fig.3 (a) 12th-order LPC coefficients of speech signal "1"

Fig.3 (b) 10th-order LPC coefficients of speech signal "1"

From figure 3 it can be concluded that the oscillation of the 10th-order LPC coefficients is smaller than that of the 12th-order coefficients; the 10th order describes the signal better, so we choose p = 10 for the experiments.

3) Improved DTW algorithm for speech recognition
The simplest and most effective method for isolated-word speech recognition is the DTW (Dynamic Time Warping) algorithm. Based on dynamic programming (DP), it solves the template-matching problem for utterances of different lengths. It is a classic, early algorithm: under the same conditions, the recognition results of DTW and HMM are roughly the same, but HMM is much more complicated. The basic DTW algorithm looks for an optimal path that minimizes the accumulated distance of the points along it; this accumulated distance D is the distance between the two vector sequences under the optimal time alignment. The distance between two frames is evaluated as:

d(T[i], R[j]) = | R[j] - T[i] |    (5)

from which the accumulated distance D is obtained:

D(i, j) = d(i, j) + min[ D(i-1, j), D(i-1, j-1), D(i-1, j-2) ]    (6)

Analysis of the DTW algorithm [7] shows that the traditional algorithm needs to calculate all the d(T[i], R[j]). According to the characteristics of speech signals, we use a path-constrained (ADTW) algorithm in the actual application: a parallelogram limits the scope of the dynamic alignment, and d(T[i], R[j]) and D(i, j) are calculated only inside that region, which saves a lot of computation and storage space. At the same time, because the algorithm depends heavily on the endpoints, we use a dynamic starting point: the minimum among d(T[1], R[1]), d(T[1], R[2]) and d(T[2], R[1]) is taken as the starting point of the search, which increases the recognition rate.

Fig.4 Flow diagram of software process (initialization, receive voice signal, preprocessing, LPC extraction for each frame, template matching, output of control words)

IV. RESULTS OF THE EXPERIMENT

The system takes 200 utterances of the digits 0-9 as speech reference templates. According to the characteristics of the STM32, the speech signal sampling frequency is 10.4895 kHz. With the Hamming window applied, the length of one frame is 23.8 ms (250 sampling points) and the frame shift is 100 points. Extracting the linear prediction coefficients and performing recognition, we obtain the results shown in Table I. The table shows that the recognition rate of a single word increases with the number of templates and exceeds 90%, which meets the practical operating requirements of speech recognition.

TABLE I. RESULTS OF SPEECH RECOGNITION

Templates of a same single word | Recognition numbers | Recognition rate
10 | 5 | 91.5%
20 | 5 | 93.3%
30 | 5 | 95.8%

V. CONCLUSION

The system uses a low-power STM32-series chip; its hardware circuits and algorithms were analyzed, and the algorithms were improved in combination with the hardware to make them fast enough to meet the system's real-time requirements. The system has strong versatility: it can be applied to many embedded systems and has good prospects in voice control for home appliances, toys, PDAs, mobile phones, and other intelligent devices.

REFERENCES
[1] Yi Ke-chu, Tian Bin, Fu Qiang. Speech Signal Processing [M]. National Defense Industry Press, 2000.
[2] Zhao Li. Speech Signal Processing [M]. Beijing: Mechanical Industry Press, 2003.
[3] ST Corp. STM32F103xCDE Data Manual. March 2009.
[4] M. White and R. B. Neely, "Speech Recognition Experiments with Linear Prediction, Bandpass Filtering, and Dynamic Programming," IEEE Trans. Acoustics, Speech, and Signal Processing.
[5] WEN Han, HUANG Guo-shun. "A Research on Improving DTW in Speech Recognition" [J]. Microcomputer Information, 2010, 26(7): 195-197.
[6] S. J. Young, P. C. Woodland. "State Clustering in HMM-based Continuous Speech Recognition." Computer Speech and Language, 1994, 8(4): 369-384.
[7] Hermansky H. "Perceptual Linear Predictive (PLP) Analysis of Speech." J. Acoustical Soc. America, 1990, 87(4): 218.

