
TSINGHUA SCIENCE AND TECHNOLOGY  ISSN 1007-0214 15/17 pp95-99  Volume 16, Number 1, February 2011

English Speech Recognition System on Chip*


LIU Hong, QIAN Yanmin, LIU Jia**
Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract: An English speech recognition system was implemented on a chip, called a speech system-on-chip (SoC). The SoC included an application specific integrated circuit with a vector accelerator to improve performance. The recognition algorithm, a sub-word model based on the continuous density hidden Markov model, ran on a very inexpensive speech chip. The algorithm was a two-stage fixed-width beam-search baseline system with a variable beam-width pruning strategy and a frame-synchronous word-level pruning strategy that significantly reduced the recognition time. Tests show that this method reduces the recognition time nearly 6 fold and the memory size nearly 2 fold compared to the original system, with less than 1% accuracy degradation for a 600 word recognition task and a recognition accuracy of about 98%.

Key words: speaker-independent speech recognition; system-on-chip; mel-frequency cepstral coefficients (MFCC)

Introduction
Embedded speech recognition systems[1] are becoming more important with the rapid development of handheld portable devices. However, only a few products are yet available due to the high chip costs. This paper describes an inexpensive English speech recognition system based on a chip. The chip includes a 16-bit microcontroller with a 16-bit coprocessor, 32 KB of RAM, and 16-bit A/D and D/A converters. The recognition models are based on the sub-word continuous density hidden Markov model (CHMM)[2,3], with mel-frequency cepstrum coefficient (MFCC)[4] features.
Received: 2009-12-15; revised: 2010-06-10

* Supported by the National Natural Science Foundation of China and Microsoft Research Asia (No. 60776800), the National Natural Science Foundation of China and Research Grants Council (No. 60931160443), and the National High-Tech Research and Development (863) Program of China (Nos. 2006AA010101, 2007AA04Z223, 2008AA02Z414, and 2008AA040201)

** To whom correspondence should be addressed.


E-mail: liuj@tsinghua.edu.cn; Tel: 86-10-62781847

The recognition engine uses two-pass beam search algorithms[5], which greatly improve the implementation efficiency and lower the system cost. The system can be used in command control devices such as consumer electronics, handheld devices, and household appliances[6], so it has many applications. The recognition accuracy of the speech recognition SoC is more than 97%.

The hardware architecture of the speech recognition system-on-chip is designed for practical applications. All of the system's hardware is integrated into a single chip, which provides the best solution for performance, size, power consumption, cost, and reliability[7]. The speech recognition system-on-chip is composed of a general purpose microcontroller, a vector accelerator, a 16-bit ADC/DAC, an analog filter circuit, audio input and output amplifiers, and a communication interface. In addition, the chip also includes a power management module and a clock. The application specific integrated circuit (ASIC) computing power is much greater than that of a microprocessor control unit (MCU) because of the vector accelerator. Unlike a DSP, this ASIC integrates the ADC, DAC, audio amplifier, and power management modules while omitting unnecessary circuits to reduce the cost.



Figure 1 shows a photograph of the unpackaged chip core. The block diagram is shown in Fig. 2, with the details of each block described below.

Fig. 1 Photograph of the chip

Fig. 2 SoC block diagram

1 Chip Software Design

The speech recognition process is shown in Fig. 3. The speech signal from the microphone is pre-amplified, filtered by a low-pass filter, and then sampled by the ADC at an 8 kHz sampling frequency. The signal is then segmented into frames, which form a continuous speech frame sequence. This sequence is sent to the endpoint detection and feature extraction unit. After two-stage matching[8], the system outputs the recognition result, which can be sent to the control circuit or to the LCD display.
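As a concrete illustration of the framing step, the following is a minimal sketch in C. The 8 kHz sampling rate is from the text; the 25 ms frame length, 10 ms shift, and Hamming window are common front-end choices assumed here, not parameters given by the paper.

```c
#include <math.h>

#define SAMPLE_RATE 8000   /* 8 kHz ADC sampling rate (from the text)      */
#define FRAME_LEN   200    /* assumed 25 ms frame -> 200 samples at 8 kHz  */
#define FRAME_SHIFT 80     /* assumed 10 ms shift ->  80 samples at 8 kHz  */

/* Cut frame number frame_idx out of the sampled signal and apply a Hamming
 * window. Returns 0 on success, -1 if the frame runs past the buffer end. */
int get_frame(const short *pcm, int n_samples, int frame_idx, float *out)
{
    const float PI = 3.14159265f;
    int start = frame_idx * FRAME_SHIFT;

    if (start + FRAME_LEN > n_samples)
        return -1;
    for (int i = 0; i < FRAME_LEN; i++) {
        /* Hamming window tapers the frame edges before feature extraction */
        float w = 0.54f - 0.46f * cosf(2.0f * PI * i / (FRAME_LEN - 1));
        out[i] = w * (float)pcm[start + i];
    }
    return 0;
}
```

Overlapping shifts (80 < 200 samples) give the decoder a smoothly varying frame sequence rather than disjoint blocks.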
Fig. 3 Speech recognition system

1.1 Software level division

Hierarchical design[9] is often used for complex systems, regardless of the operating system or network structure. The lower levels provide basic services and low-level management. Each layer is encapsulated, so that an upper layer does not need to know how the layer below works, and a lower layer does not need to know what it is used for. This layered structure increases the system flexibility, with each module logically placed in the hierarchy, which improves the system's applicability and flexibility and enhances its reliability, robustness, and maintainability. The system software is divided into the driver layer, the universal module layer, the function module layer, and the scheduling layer, as shown in Fig. 4. The driver layer, which isolates the software from the hardware, includes all of the interrupt service routines and the peripheral drivers. The universal module layer includes a variety of basic operation modules which provide basic computing and operating services. The function module layer contains the various functional modules that form the core algorithms. The scheduling layer, which is the top layer, controls the super-loop that maintains the global data of the system; its core purpose is task scheduling.

Fig. 4 Software division
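To make the layer boundaries concrete, below is a hypothetical sketch of how they might appear as C interfaces. Every name is illustrative; the paper does not publish its firmware API.

```c
/* layers.h -- hypothetical layer boundaries; all names are illustrative. */

/* Driver layer: isolates software from hardware (ISRs, peripherals). */
int  adc_read_block(short *buf, int n);        /* fill buf with n samples */
void dac_write_block(const short *buf, int n); /* play back n samples     */

/* Universal module layer: basic computing and operating services. */
long vec_dot_q15(const short *a, const short *b, int n); /* accelerator   */

/* Function module layer: the core algorithm modules. */
int  mfcc_extract(const float *frame, float *feat);      /* one frame     */
int  search_step(const float *feat);                     /* one beam step */

/* Scheduling layer: the top layer; calls only downward, never upward. */
void scheduler_run(void);
```

Each layer calls only the layer directly below it, which is what allows an application to change without touching the driver or scheduling code.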

1.1.1 Driver layer
The driver layer program enables direct operation of the hardware. The program modules at this level generally correspond to the actual hardware modules, such as the memory, peripheral interfaces, and communication interfaces. These functions provide an interface between the hardware modules and the application program interface exposed to the upper level programs.

1.1.2 Service layer
The driver layer program provides basic support for the hardware, but no extended or enhanced functionality. The service layer provides a more powerful interface to the application level to further improve the system performance. Thus, the service layer improves use of the hardware features.

1.1.3 Scheduling layer
The various inputs select different sub-tasks, such as speech recognition, speech coding, and speech decoding, which all have different response times for different asynchronous events. The scheduling layer schedules these various tasks. The whole system is designed to provide good real-time performance, with the scheduling layer providing seamless connectivity between the system procedures so that applications run to completion without needing to consider how to schedule their execution (a minimal super-loop sketch is given after Section 1.1.4). The scheduling also facilitates control of the DSP power consumption.

1.1.4 Application layer
The application level programs use the driver layer, service layer, and scheduling procedures through the API interface functions, so that the user can focus on the task itself, for example, the English command word recognition engine. Each application depends on the application layer procedures, with most, if not all, changes made only in the application layer program based on the application needs, while changes to the driver, service, and scheduling layers are relatively small. The driver, service, and scheduling layer programs serve as the system kernel layers, while the application layer program serves as the user program.
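The scheduling layer described in Section 1.1.3 can be pictured as a super-loop that polls event flags set by interrupt service routines. The sketch below shows the idea; the task and flag names are assumptions for illustration, not the chip's actual firmware.

```c
#include <stdbool.h>

/* Event flags set by interrupt service routines in the driver layer. */
volatile bool frame_ready;     /* ADC ISR: a new speech frame is buffered */
volatile bool host_cmd_ready;  /* comm ISR: a command arrived on the link */

void task_feature_extract(void);   /* function module layer             */
void task_search_step(void);       /* one frame of beam search          */
void task_handle_command(void);    /* communication interface           */

/* Scheduling layer: an endless super-loop that polls the event flags and
 * dispatches the matching task; ISRs only set flags, so they stay short. */
void scheduler_run(void)
{
    for (;;) {
        if (frame_ready) {
            frame_ready = false;
            task_feature_extract();
            task_search_step();
        }
        if (host_cmd_ready) {
            host_cmd_ready = false;
            task_handle_command();
        }
        /* idle: the loop could drop into a low-power mode here, which is
         * one way the scheduler can control power consumption */
    }
}
```

Because tasks run to completion inside the loop, applications never schedule themselves, matching the description above.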

1.2 Two-pass beam search algorithm

The sub-word model in this system is based on the continuous density hidden Markov model (CHMM), with the output probability distribution of each state described by a Gaussian mixture model (GMM). The relationships between context phonemes are categorized as Monophone, Biphone, and Triphone[9]. More complex models have higher recognition rates, but they take much longer, so they are not practical even though their recognition rates can reach nearly 100%. Faster systems using very simple models do not give satisfactory results. Therefore, a two-pass search strategy was adopted to balance performance and speed, as shown in Fig. 5. In the first stage, the search uses approximate models, such as the Monophone model with one Gaussian mixture[9]. This fast match generates a list of n-best hypotheses for the second search. The second stage is a detailed match among the n-best hypotheses using the Triphone model with three Gaussian mixtures. To reduce the computations, the covariance matrices of the Gaussian mixture models are kept diagonal in both the fast match and detailed match stages. The output probability scores of all states are calculated and then matched one by one using the Viterbi method[10].
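To see why the diagonal covariance matters, consider the per-state output score under a diagonal-covariance GMM: the Mahalanobis term reduces to a sum of per-dimension squares. The sketch below assumes the constants are precomputed offline and uses the max approximation to the log-sum, a common embedded-decoder shortcut; neither detail is specified by the paper.

```c
#define DIM 22                /* feature dimension used in this system */

typedef struct {
    float log_weight;         /* log of the mixture weight                   */
    float mean[DIM];          /* component mean                              */
    float inv_var[DIM];       /* 1/sigma^2 per dimension (diagonal)          */
    float log_const;          /* precomputed -0.5*(DIM*log(2*pi)+log|Sigma|) */
} Gaussian;

/* Log-likelihood of one feature vector under an m-component diagonal GMM.
 * With a diagonal covariance the quadratic form costs only DIM
 * multiply-adds per component instead of DIM*DIM for a full matrix. */
float gmm_log_likelihood(const Gaussian *g, int m, const float *x)
{
    float best = -1e30f;
    for (int k = 0; k < m; k++) {
        float q = 0.0f;
        for (int d = 0; d < DIM; d++) {
            float diff = x[d] - g[k].mean[d];
            q += diff * diff * g[k].inv_var[d];
        }
        float s = g[k].log_weight + g[k].log_const - 0.5f * q;
        if (s > best)          /* max approximation to the log-sum */
            best = s;
    }
    return best;
}
```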

Fig. 5 Two-stage search structure
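The fast match must hand the second stage its list of n-best hypotheses. One simple way such a list might be maintained is an insertion into a small sorted array, sketched below; the structure and field names are illustrative, not the paper's implementation.

```c
#define NBEST 12    /* the evaluation settles on twelve first-stage candidates */

typedef struct {
    int   word_id;  /* vocabulary entry index */
    float score;    /* fast-match log score   */
} Hypothesis;

/* Insert a scored hypothesis into a descending-sorted n-best list.
 * list must have room for NBEST entries; *count tracks how many are used. */
void nbest_insert(Hypothesis *list, int *count, int word_id, float score)
{
    int n = *count;
    if (n == NBEST && score <= list[n - 1].score)
        return;                    /* worse than the current worst: drop */
    if (n < NBEST)
        n = ++(*count);
    int i = n - 1;
    while (i > 0 && list[i - 1].score < score) {
        list[i] = list[i - 1];     /* shift worse entries down */
        i--;
    }
    list[i].word_id = word_id;
    list[i].score   = score;
}
```

For a list this small, the O(NBEST) shift per insertion is cheaper on a 16-bit microcontroller than maintaining a heap.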

1.3 Front end feature extraction


Robust features must be used in embedded speech recognition systems. The mel-frequency cepstral coefficients have proved to be more robust in the presence of background noise than other feature parameters. Since the dynamic range of the MFCC features is limited, the MFCC can be computed with fixed-point algorithms and is therefore well suited to embedded systems. The MFCC parameters chosen here offer the best trade-off between performance and computational requirements. Generally speaking, increased feature vector dimensions provide more information, but they also increase the computational burden in the feature extraction and recognition stages[11,12].
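The feature vector described below uses first and second order differences of the cepstral coefficients. A sketch of the first-order (delta) computation follows; the regression over +/-2 neighbouring frames is a common formulation assumed here, as the paper does not give its exact window.

```c
#define N_CEP 12    /* cepstral coefficients per frame, as in this system */

/* First-order (delta) regression over +/-2 neighbouring frames:
 *   d[t] = sum_{k=1..2} k*(c[t+k]-c[t-k]) / (2*sum_{k=1..2} k^2)
 * Frame indices are clamped at the utterance edges. */
void compute_delta(const float cep[][N_CEP], int n_frames, float delta[][N_CEP])
{
    const float norm = 1.0f / 10.0f;   /* 2*(1^2 + 2^2) = 10 */
    for (int t = 0; t < n_frames; t++) {
        for (int j = 0; j < N_CEP; j++) {
            float acc = 0.0f;
            for (int k = 1; k <= 2; k++) {
                int lo = t - k < 0 ? 0 : t - k;              /* clamp start */
                int hi = t + k >= n_frames ? n_frames - 1 : t + k; /* end   */
                acc += (float)k * (cep[hi][j] - cep[lo][j]);
            }
            delta[t][j] = acc * norm;
        }
    }
}
```

The second-order differences can be obtained by applying the same regression to the first-order deltas.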



The final recognition results for different features are shown in Fig. 6. With four Gaussian mixtures, the first stage needed twelve candidates for a 99% recognition rate. With 34 features, the highest recognition rate was 99.22%, and with 22 features the recognition rate was 99.01%, which satisfies the system requirements. Therefore, a 22-dimension feature vector was chosen from the 12-dimension MFCC, 12-dimension difference MFCC, 12-dimension second difference MFCC, and the normalized energy and its first and second order differences.

Fig. 6 Different model recognition rates

2 Evaluation Results

The tests used 40 volunteer speakers and a test vocabulary[13,14] composed of personal names, place names, and stock names. The vocabulary had a total of 600 phrases, with each phrase comprising 2 to 4 English words spoken once by each speaker. The system recognition accuracy was tested by sampling input through the USB interface at 8 kHz. In this way, the recognition test conditions were almost the same as real conditions, so the statistical recognition accuracy could be properly assessed.

The first recognition stage used a relatively simple acoustic model to produce a multi-candidate recognition result with a high inclusion rate. Figure 7 shows the recognition rates for the multi-candidate results using the 600 phrase vocabulary. The first-stage model with one candidate obtained only a 93.7% recognition rate, while with six candidates the recognition rate reached 98%. Thereafter, the upward trend of the recognition rate slowed significantly as the number of candidates increased. In most cases, the correct result was included among the first four candidates. Therefore, the twelve candidates from the first stage were used for the second stage matching with a more complex acoustic model. The recognition rate then reached 99% in the second stage.

Fig. 7 Recognition rates for different candidates

The final system recognition rates for different vocabulary sizes are shown in Table 1. All the evaluations were performed in an office environment with a moderate noise level. Although the recognition rate decreased as the vocabulary size increased, the recognition accuracy was still 98% with 600 phrases. The recognition time for the 600 phrase vocabulary in a real environment was approximately 0.7 RTF (real time factor), i.e., recognizing an utterance took about 0.7 times its duration. Thus, this system can effectively handle a 600 phrase vocabulary.
Table 1 Speech recognition system performance

Test speech (20 males / 20 females)    Recognition accuracy (%)
150 phrase vocabulary                  99.2
300 phrase vocabulary                  98.8
600 phrase vocabulary                  98.1


The best features of this medium-vocabulary recognition SoC are its low clock frequency (48 MHz) and small system resource requirements (48 KB). For the same performance, the IBM StrongARM needs a 200 MHz clock and 2.2 MB of memory, while the Siemens ARM920T needs a 100 MHz clock and 402 KB of memory.

3 Conclusions and Future Work

This paper describes an English speech recognition system implemented on an SoC platform. The system uses an ASIC with a vector accelerator and speech recognition software developed for the ASIC architecture. Tests show that the system attains high recognition accuracy (more than 98%) with a short response time (0.7 RTF). This ASIC design provides a flexible, fast speech recognition solution for embedded applications with only 48 KB of RAM. Future work will improve the algorithm to reduce the recognition time and to increase the system robustness in noisy environments, with changes in the memory and scheduling.



The system can be used for Chinese or English on-chip recognition.

References
[1] Guo Bing, Shen Yan. SoC Technology and Its Application. Beijing: Tsinghua University Press, 2006. (in Chinese)
[2] Levy C, Linares G, Nocera P, et al. Reducing computation and memory cost for cellular phone embedded speech recognition system. In: Proceedings of the ICASSP. Montreal, Canada, 2004, 5: 309-312.
[3] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: Proceedings of the ICASSP. Hong Kong, China, 2003, 1: 200-203.
[4] Xu Haiyang, Fu Yan. Embedded Technology and Applications. Beijing: Machinery Industry Press, 2002. (in Chinese)
[5] Hoon C, Jeon P, Yun L, et al. Fast speech recognition to access a very large list of items on embedded devices. IEEE Trans. on Consumer Electronics, 2008, 54(2): 803-807.
[6] Yang Zhizuo, Liu Jia. An embedded system for speech recognition and compression. In: ISCIT 2005. Beijing, China, 2005: 653-656.
[7] Westall F. Review of speech technologies for telecommunications. Electronics & Communication Engineering Journal, 1997, 9(5): 197-207.

[8] Shi Yuanyuan, Liu Jia, Liu Rensheng. Single-chip speech recognition system based on 8051 microcontroller core. IEEE Trans. on Consumer Electronics, 2001, 47(1): 149-154.
[9] Yang Haijie, Yao Jing, Liu Jia. A novel speech recognition system-on-chip. In: International Conference on Audio, Language and Image Processing 2008 (ICALIP 2008). Shanghai, China, 2008: 166-174.
[10] Lee T, Ching P C, Chan L W, et al. Tone recognition of isolated Cantonese syllables. IEEE Trans. on Speech and Audio Processing, 1995, 3(3): 204-209.
[11] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: Proceedings of the ICASSP. Montreal, Canada, 2004: 200-203.
[12] Zhu Xuan, Wang Rui, Chen Yining. Acoustic model comparison for an embedded phoneme-based Mandarin name dialing system. In: Proceedings of International Symposium on Chinese Spoken Language Processing. Taipei, 2002: 83-86.
[13] Zhu Xuan, Chen Yining, Liu Jia, et al. Multi-pass decoding algorithm based on a speech recognition chip. Acta Electronica Sinica, 2004, 32(1): 150-153. (in Chinese)
[14] Demeechai T, Mäkeläinen K. Integration of tonal knowledge into phonetic HMMs for recognition of speech in tone languages. Signal Processing, 2000, 80: 2241-2247.
