Introduction
Received: 2009-12-15; revised: 2010-06-10

Embedded speech recognition systems[1] are becoming more important with the rapid development of handheld portable devices. However, only a few products are available so far due to high chip costs. This paper describes an inexpensive English speech recognition system based on a single chip. The chip includes a 16-bit microcontroller with a 16-bit coprocessor, 32 KB of RAM, and 16-bit A/D and D/A converters. The recognition models are sub-word continuous density hidden Markov models (CHMM)[2,3] with mel-frequency cepstrum coefficient (MFCC)[4] features. The recognition engine uses a two-pass beam search algorithm[5], which greatly improves the implementation efficiency and lowers the system cost. The system can be used in command-and-control devices such as consumer electronics, handheld devices, and household appliances[6]; thus it has many applications. The recognition accuracy of the speech recognition SoC is more than 97%.

The hardware architecture of the speech recognition system-on-chip is designed for practical applications. All of the system's hardware is integrated into a single chip, which offers the best combination of performance, size, power consumption, cost, and reliability[7]. The speech recognition system-on-chip is composed of a general-purpose microcontroller, a vector accelerator, a 16-bit ADC/DAC, an analog filter circuit, audio input and output amplifiers, and a communication interface. The chip also includes a power management module and a clock. The computing power of this application-specific integrated circuit (ASIC) is much greater than that of a microprocessor control unit (MCU) because of the vector accelerator. Unlike a DSP, this ASIC has integrated ADC, DAC, audio amplifier, and power management modules
without the unnecessary circuits, to reduce the cost. Figure 1 shows a photograph of the unpackaged chip core. The block diagram is shown in Fig. 2, with the details of each block described below.
The speech recognition process is shown in Fig. 3. The speech signal from the microphone is pre-amplified, filtered by a low-pass filter, and then sampled by the ADC at an 8 kHz sampling frequency. The signal is segmented into frames, which form a continuous speech frame sequence. This sequence is then sent to the endpoint detection and feature extraction unit. After two-stage matching[8], the system outputs the recognition result, which can be sent to a control circuit or to an LCD display.

Fig. 3

1.1 Software level division

Hierarchical design[9] is often used for complex systems, regardless of the operating system or network structure. The lower levels provide basic services and low-level management. Each layer is encapsulated, so that the upper layer does not need to know the underlying implementation and the lower layer does not need to know its purpose. This layered structure increases the system flexibility, with each module logically related in the hierarchy, to improve the system's applicability and flexibility and to enhance its reliability, robustness, and maintainability. The system software is divided into the driver layer, the universal module layer, the function module layer, and the scheduling layer, as shown in Fig. 4. The driver layer, which isolates the software from the hardware, includes all of the interrupt service routines and the peripheral drivers. The universal module layer includes a variety of basic operation modules which provide basic computing and operating services. The function module layer contains the various functional modules that form the core algorithms. The scheduling layer, which is the top layer, controls the ultra-loop of the global data maintenance system, whose core task is task scheduling.

Fig. 4 Software division
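The front end described above (8 kHz sampling, framing, endpoint detection) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame length, frame shift, and energy threshold are assumed typical values, and the paper does not state which endpoint detection method it uses; a simple energy-based detector is shown here.

```python
import numpy as np

def frame_signal(x, frame_len=240, frame_shift=80):
    """Split a sampled signal into overlapping frames.
    240 samples = 30 ms and 80 samples = 10 ms at 8 kHz
    (typical values; assumed, not stated in the paper)."""
    n = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i*frame_shift : i*frame_shift + frame_len]
                     for i in range(n)])

def detect_endpoints(frames, threshold_db=-30.0):
    """Energy-based endpoint detection: keep the span between the first
    and last frame whose energy is within threshold_db of the loudest
    frame; everything outside is treated as silence."""
    energy = 10 * np.log10(np.sum(frames**2, axis=1) + 1e-12)
    active = np.where(energy > energy.max() + threshold_db)[0]
    return (active[0], active[-1]) if active.size else (0, 0)

# synthetic utterance: 0.5 s silence, 1 s of a 1 kHz tone, 0.5 s silence
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(4000),
                      0.5 * np.sin(2 * np.pi * 1000 * t[:8000]),
                      np.zeros(4000)])
frames = frame_signal(sig)
start, end = detect_endpoints(frames)   # frame indices bracketing the tone
```

Only the frames between `start` and `end` would then be passed on to feature extraction, which is what keeps the recognition load proportional to the spoken utterance rather than the whole recording.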
1.1.1 Driver layer

The driver layer program enables direct operation of the hardware. The program modules at this level generally correspond to actual hardware modules, such as the memory, peripheral interfaces, and communication interfaces. These functions provide an interface between the hardware modules and the application program interface of the upper-level program.

1.1.2 Service layer

The driver layer program provides basic support for the hardware, but no extended or enhanced functionality. The service layer provides a more powerful interface to the application level to further improve the system performance. Thus, the service layer improves use of the hardware features.

1.1.3 Scheduling layer

The various inputs select different sub-tasks, such as speech recognition, speech coding, and speech decoding, which all have different response times for different asynchronous events. The scheduling layer schedules these various tasks. The whole system is designed to provide good real-time performance, with the scheduling layer providing seamless connectivity between the system procedures so that applications complete without needing to consider how to schedule their execution. The scheduling also facilitates control of the DSP power consumption.

1.1.4 Application layer

The application-level programs use the driver, service, and scheduling procedures through the provided API functions, so that the user can focus on the tasks, for example the English command word recognition engine. Each application thus depends on the application layer procedures, with most, if not all, changes made only in the application layer program according to the application's needs, while changes to the driver, service, and scheduling layers are relatively small. The driver, service, and scheduling layer programs serve as the system kernel layers, while the application layer program serves as the user program.

1.2 Two-pass beam search algorithm
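The scheduling layer's "ultra-loop" can be sketched as a simple cooperative scheduler: asynchronous events post sub-tasks to a queue, and a single main loop runs them to completion one at a time, as is typical on a small MCU without a preemptive OS. The task names below are illustrative assumptions, not APIs from the system.

```python
from collections import deque

class Scheduler:
    """Minimal cooperative 'ultra-loop' scheduler sketch: no preemption,
    each task runs to completion before the next one starts."""
    def __init__(self):
        self.queue = deque()

    def post(self, task, *args):
        """Called from event handlers (e.g. driver-layer ISRs) to
        request that a sub-task be run."""
        self.queue.append((task, args))

    def run(self):
        """The 'ultra-loop': drain the task queue in FIFO order."""
        results = []
        while self.queue:
            task, args = self.queue.popleft()
            results.append(task(*args))
        return results

# hypothetical sub-tasks selected by different inputs
def recognize(frame):
    return f"recognized:{frame}"

def encode(frame):
    return f"encoded:{frame}"

sched = Scheduler()
sched.post(recognize, "frame0")
sched.post(encode, "frame1")
out = sched.run()
```

Because the loop is idle whenever the queue is empty, this structure also makes it easy to drop the core into a low-power state between tasks, which matches the power-control role the text assigns to the scheduling layer.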
The sub-word models in this system are based on the continuous density hidden Markov model (CHMM), with the output probability distribution of each state described by a Gaussian mixture model (GMM). The relationships between context phonemes are categorized as Monophone, Biphone, and Triphone[9]. More complex models give higher recognition rates but take much longer, so they are not practical even though their recognition rates can reach nearly 100%. Faster systems using very simple models do not give satisfactory results. Therefore, a two-pass search strategy was adopted to balance performance and speed, as shown in Fig. 5. In the first stage, the search uses approximate models, such as the Monophone model with one Gaussian mixture[9]. This fast match generates a list of n-best hypotheses for the second stage. The second stage is a detailed match among the n-best hypotheses using the Triphone model with three Gaussian mixtures. To reduce the computational load, the covariance matrices of the Gaussian mixture models are kept diagonal in both the fast match and the detailed match stages. The output probability scores of all states are calculated and then matched one by one using the Viterbi method[10].

1.3
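The two-pass idea can be sketched as below. This is a simplified illustration under stated assumptions: whole-utterance likelihoods stand in for the Viterbi state-level alignment, the vocabulary entries and their model parameters are invented toy values, and diagonal covariances are used in both passes as the text describes.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Frame log-likelihoods under a diagonal-covariance Gaussian.
    Diagonal covariance keeps the cost O(d) per frame, which is the
    point of restricting the GMMs this way."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean)**2 / var,
                         axis=-1)

def log_gmm_diag(x, means, vars_, weights):
    """Frame log-likelihoods under a diagonal-covariance GMM
    (log-sum-exp over the mixture components)."""
    comp = np.stack([np.log(w) + log_gauss_diag(x, m, v)
                     for m, v, w in zip(means, vars_, weights)])
    mx = comp.max(axis=0)
    return mx + np.log(np.exp(comp - mx).sum(axis=0))

def two_pass_search(frames, fast_models, detailed_models, nbest=12):
    """Pass 1: rank every vocabulary entry with a cheap one-Gaussian
    model and keep the n-best. Pass 2: rescore only the shortlist with
    a richer 3-mixture GMM (standing in for the Triphone models)."""
    fast_scores = [log_gauss_diag(frames, m, v).sum()
                   for m, v in fast_models]
    shortlist = np.argsort(fast_scores)[::-1][:nbest]
    detailed = {i: log_gmm_diag(frames, *detailed_models[i]).sum()
                for i in shortlist}
    return max(detailed, key=detailed.get)

# toy 3-entry vocabulary in a 2-D feature space (illustrative numbers)
centers = [np.zeros(2), np.full(2, 5.0), np.full(2, 10.0)]
fast_models = [(c, np.ones(2)) for c in centers]
detailed_models = [([c - 0.5, c, c + 0.5], [np.ones(2)] * 3,
                    [1/3, 1/3, 1/3]) for c in centers]
frames = np.full((4, 2), 5.0)          # utterance matching entry 1
best = two_pass_search(frames, fast_models, detailed_models, nbest=2)
```

The cost saving comes from the second line of `two_pass_search`: the expensive model is evaluated on only `nbest` entries (twelve in the paper's configuration) instead of the whole vocabulary.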
Robust features must be used for embedded speech recognition systems. The mel-frequency cepstral coefficients (MFCC) have proved more robust in the presence of background noise than other feature parameters. Since the dynamic range of the MFCC features is limited, the MFCC can be computed with fixed-point algorithms and is therefore well suited to embedded systems. The MFCC parameters chosen here offer the best trade-off between performance and computational requirements. Generally speaking, higher feature vector dimensions provide more information, but they increase the computational burden in the feature extraction and recognition stages[11,12]. The final recognition results for different features
are shown in Fig. 6. With four Gaussian mixtures, the first stage needed twelve candidates for a 99% recognition rate. With 34 features the highest recognition rate was 99.22%, and with 22 features the recognition rate was 99.01%, which satisfies the system requirements. Therefore, a 22-dimension feature vector was chosen from the 12-dimension MFCC, the 12-dimension first-difference MFCC, the 12-dimension second-difference MFCC, and the normalized energy with its first- and second-order differences.
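A minimal MFCC front end can be sketched as follows. The frame sizes, filter count, and FFT length are assumed typical values for 8 kHz speech, not parameters taken from the paper, and a floating-point implementation is shown for clarity even though the chip would use fixed-point arithmetic.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def imel(m):
    return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=240, frame_shift=80,
         n_filters=24, n_ceps=12):
    """MFCC sketch: pre-emphasis, Hamming window, power spectrum,
    mel filterbank, log, DCT. Returns (n_frames, n_ceps)."""
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n = 1 + max(0, (len(x) - frame_len) // frame_shift)
    frames = np.stack([x[i*frame_shift : i*frame_shift + frame_len]
                       for i in range(n)])
    frames = frames * np.hamming(frame_len)
    nfft = 256
    power = np.abs(np.fft.rfft(frames, nfft))**2
    # triangular filters spaced evenly on the mel scale
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energies = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate the filterbank outputs; keep coeffs 1..n_ceps
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * k + 1)
                 / (2 * n_filters))
    return log_energies @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)  # 1 s test tone
feats = mfcc(sig)   # one 12-dimension MFCC vector per frame
```

The first- and second-difference features mentioned in the text are then just frame-to-frame differences of these vectors (and of the frame energy), which is how the 22-dimension vector is assembled from the candidate components.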
Fig. 6

Evaluation Results

The tests used 40 volunteer speakers and a test vocabulary[13,14] composed of personal names, place names, and stock names. The vocabulary had a total of 600 phrases, with each phrase comprising 2 to 4 English words, spoken once by each speaker. The system recognition accuracy was tested by sampling the input through the USB interface at 8 kHz. In this way, the test conditions were almost the same as real conditions, so the statistical recognition accuracy could be properly assessed.

The first recognition stage used a relatively simple acoustic model to produce a multi-candidate recognition result with a high recognition rate. Figure 7 shows the recognition rates for the multi-candidate results using the 600-phrase vocabulary. Although the first-stage model with one candidate achieved only a 93.7% recognition rate, with six candidates the recognition rate reached 98%. Thereafter, the upward trend of the recognition rate slowed significantly as the number of candidates increased. In most cases, the correct result was among the top four candidates. Therefore, twelve candidates from the first stage were passed to the second-stage matching with a more complex acoustic model.

Fig. 7

The final system recognition rates for different vocabulary sizes are shown in Table 1. All the evaluations were performed in an office environment with a moderate noise level. Although the recognition rate decreased as the vocabulary size increased, the recognition accuracy was still 98% with 600 phrases. The recognition time for the 600-phrase vocabulary in a real environment was approximately 0.7 RTF (real-time factor). Thus, this system can effectively handle a 600-phrase vocabulary.

Table 1 Speech recognition system performance

Test speech (20 males / 20 females)    Recognition accuracy (%)
150 phrase vocabulary                  99.2
300 phrase vocabulary                  98.8
600 phrase vocabulary                  98.1
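The real-time factor quoted above is simply the ratio of processing time to audio duration; the utterance and processing times below are hypothetical numbers chosen to match the paper's 0.7 figure.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration. RTF < 1 means the
    recognizer keeps up with (is faster than) real time."""
    return processing_seconds / audio_seconds

# e.g. a hypothetical 2.0 s utterance decoded in 1.4 s gives RTF 0.7
rtf = real_time_factor(1.4, 2.0)
```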
The best features of this medium-vocabulary recognition SoC are its low clock frequency (48 MHz) and small memory requirement (48 KB). For the same performance, the IBM system on a StrongARM needs a 200 MHz clock and 2.2 MB of memory, while the Siemens system on an ARM920T needs a 100 MHz clock and 402 KB of memory.
This paper describes an English speech recognition system implemented on an SoC platform. The system uses an ASIC with a vector accelerator, with the speech recognition software developed to exploit the ASIC architecture. Tests show that the system attains high recognition accuracy (more than 98%) with a short response time (0.7 RTF). This ASIC design provides a flexible, fast speech recognition solution for embedded applications with only 48 KB of RAM. Future work will improve the algorithm to reduce the recognition time and increase the system robustness in noisy environments, with changes in the memory and scheduling. The system can be used for Chinese or English on-chip recognition.

References
[1] Guo Bing, Shen Yan. SoC Technology and Its Application. Beijing: Tsinghua University Press, 2006. (in Chinese)
[2] Levy C, Linares G, Nocera P, et al. Reducing computation and memory cost for cellular phone embedded speech recognition system. In: Proceedings of ICASSP. Montreal, Canada, 2004, 5: 309-312.
[3] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: Proceedings of ICASSP. Hong Kong, China, 2003, 1: 200-203.
[4] Xu Haiyang, Fu Yan. Embedded Technology and Applications. Beijing: Machinery Industry Press, 2002. (in Chinese)
[5] Hoon C, Jeon P, Yun L, et al. Fast speech recognition to access a very large list of items on embedded devices. IEEE Trans. on Consumer Electronics, 2008, 54(2): 803-807.
[6] Yang Zhizuo, Liu Jia. An embedded system for speech recognition and compression. In: ISCIT 2005. Beijing, China, 2005: 653-656.
[7] Westall F. Review of speech technologies for telecommunications. Electronics & Communication Engineering Journal, 1997, 9(5): 197-207.
[8] Shi Yuanyuan, Liu Jia, Liu Rensheng. Single-chip speech recognition system based on 8051 microcontroller core. IEEE Trans. on Consumer Electronics, 2001, 47(1): 149-154.
[9] Yang Haijie, Yao Jing, Liu Jia. A novel speech recognition system-on-chip. In: International Conference on Audio, Language and Image Processing (ICALIP 2008). Shanghai, China, 2008: 166-174.
[10] Lee T, Ching P C, Chan L W, et al. Tone recognition of isolated Cantonese syllables. IEEE Trans. on Speech and Audio Processing, 1995, 3(3): 204-209.
[11] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: ICASSP 2004. Montreal, Canada, 2004: 200-203.
[12] Zhu Xuan, Wang Rui, Chen Yining. Acoustic model comparison for an embedded phoneme-based Mandarin name dialing system. In: Proceedings of the International Symposium on Chinese Spoken Language Processing. Taipei, 2002: 83-86.
[13] Zhu Xuan, Chen Yining, Liu Jia, et al. Multi-pass decoding algorithm based on a speech recognition chip. Acta Electronica Sinica, 2004, 32(1): 150-153.
[14] Demeechai T, Mäkeläinen K. Integration of tonal knowledge into phonetic HMMs for recognition of speech in tone languages. Signal Processing, 2000, 80: 2241-2247.