Contents
- What is the task?
- What are the main difficulties?
- How is it approached?
- Existing SR systems
- How good is it?
- CMU Sphinx: detailed discussion
Applications
- Automatic translation of speech from one language to another
- Automotive speech recognition (e.g., OnStar, Ford Sync)
- Court reporting (real-time speech writing)
- Hands-free computing: speech recognition in multimodal interfaces
- Home automation
- Interactive voice response
- Pronunciation evaluation in computer-aided language learning applications
- Robotics and military applications: better HCIs in cockpits
- Speech-to-text reporter (transcription of speech into text, video captioning, court reporting)
- Telematics (e.g., vehicle navigation systems)
- Video games, with Tom Clancy's EndWar and Lifeline as working examples
Challenges
- Portability: independence of computing platform
- Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, even a new language)
- Language modelling: a role for linguistics in improving the language models
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem
- Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
- Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace
[Figure: acoustic waveform, acoustic signal, speech recognition]
Available Technologies
Digitization
Converting analog signal into digital representation (PCM)
Signal processing
Separating speech from background noise
Phonetics
Variability in human speech
Phonology
Recognizing individual sound distinctions (similar phonemes). Phonology is the systematic use of sound to encode meaning in any spoken human language.
Digitization
- Analogue-to-digital conversion
- Sampling and quantizing
- Filtering to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
- E.g., high-frequency sounds are less informative, so they can be sampled using broader frequency bands
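The sampling, quantizing and energy-measurement steps can be sketched in a few lines. This is a minimal illustration, not how any particular recognizer does it: the 16 kHz sample rate, 16-bit depth, the pure sine "signal" and the function names `sample_and_quantize` and `energy_at` are all assumptions chosen for the example.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Sample a sine tone and quantize each sample to a signed integer (linear PCM)."""
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16-bit samples
    n_samples = round(duration_s * sample_rate)
    pcm = []
    for n in range(n_samples):
        t = n / sample_rate                               # sampling: discrete time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # analogue value in [-1, 1]
        pcm.append(round(amplitude * max_level))          # quantizing: nearest integer level
    return pcm

def energy_at(pcm, freq_hz, sample_rate=16000):
    """Energy of the signal at one frequency, via a single DFT-style correlation."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / sample_rate) for n, s in enumerate(pcm))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / sample_rate) for n, s in enumerate(pcm))
    return (re * re + im * im) / len(pcm) ** 2

# Hypothetical input: a 440 Hz tone lasting 50 ms.
pcm = sample_and_quantize(440.0, 0.05)
low = energy_at(pcm, 440.0)    # energy where the tone actually is
high = energy_at(pcm, 3000.0)  # energy at a frequency the tone does not contain
```

A real front end would measure energy in many bands at once (typically with an FFT and a filter bank) rather than one frequency at a time.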
Available algorithms
Statistical matching is an extended form of template matching. The two main algorithms are:
1> Hidden Markov Model (HMM)
2> Dynamic Time Warping (DTW)
Both acoustic modelling and language modelling are important parts of statistical SR systems.
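Of the two algorithms named on this slide, DTW is the simpler to sketch. The idea is to align two sequences that say the same thing at different speeds and report the cost of the best alignment. The version below is a minimal sketch that assumes one number per frame and absolute difference as the local cost; real systems compare feature vectors, and the function name `dtw_distance` is only illustrative.

```python
def dtw_distance(a, b):
    """Cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i items of a with the first j items of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

# The same "utterance" spoken slowly: the middle frame is held for longer.
template = [1, 2, 3]
slow_version = [1, 2, 2, 3]
distance = dtw_distance(template, slow_version)
```

Because the repeated frame can be absorbed by the warping, the stretched version matches the template at zero cost, which is exactly what plain frame-by-frame template matching cannot do.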
HMM continued
- Each phoneme is like a link in a chain, and the completed chain is a word.
- The chain branches off in different directions as the program matches sounds with the next most likely phoneme.
- The program assigns a probability score to each phoneme, based on its built-in dictionary and user training.
- This process is more complicated for phrases and sentences: the program must also work out where each word starts and stops.
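Finding the most probable chain of phonemes for a sequence of sounds is usually done with the Viterbi algorithm. The sketch below is a toy: the two phoneme states ("s", "iy"), the two acoustic labels ("hiss", "tone") and every probability are made-up values for illustration, not taken from any trained model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations, and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(row)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Hypothetical toy model (all numbers are assumptions).
states = ["s", "iy"]
start_p = {"s": 0.8, "iy": 0.2}
trans_p = {"s": {"s": 0.5, "iy": 0.5}, "iy": {"s": 0.1, "iy": 0.9}}
emit_p = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p)
```

Here the decoder stays in "s" for the two hiss-like frames and moves to "iy" for the two tone-like frames, i.e. it picks one branch of the chain out of all the possible ones.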
HMM continued
Break phonemes to form words
1> r eh k ao g n ay z s p iy ch "recognize speech"
2> r eh k ay n ay s b iy ch "wreck a nice beach"
potato is recognised as ->
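The two phoneme strings above sound almost the same, so the acoustic evidence alone cannot choose between them. This is where the language model helps: it scores how likely each word sequence is. The sketch below uses a bigram model; the probabilities in `BIGRAM` and the `UNSEEN` floor are invented for illustration and do not come from any real corpus.

```python
import math

# Made-up bigram probabilities P(word | previous word); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "recognize"): 0.002,
    ("recognize", "speech"): 0.05,
    ("<s>", "wreck"): 0.0005,
    ("wreck", "a"): 0.1,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}
UNSEEN = 1e-6  # assumed floor for word pairs missing from the table

def log_prob(words):
    """Log-probability of a word sequence under the toy bigram model."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total

candidates = [
    ["recognize", "speech"],
    ["wreck", "a", "nice", "beach"],
]
best = max(candidates, key=log_prob)
```

With these assumed numbers the model prefers "recognize speech"; a model trained on text about seaside holidays could reasonably prefer the other reading.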
Accuracy is often measured as word error rate (WER) and command success rate (CSR).
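WER is commonly defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, where the error counts come from a word-level edit distance. A minimal sketch, assuming whitespace tokenization and no text normalization (the function name `word_error_rate` is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

perfect = word_error_rate("recognize speech", "recognize speech")
bad = word_error_rate("recognize speech", "wreck a nice beach")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, as in the second call.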
CMU SPHINX
Includes a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain)
Sphinx
- Developed by Kai-Fu Lee
- Continuous-speech, speaker-independent
- Based on HMM and n-gram statistical models
- Superseded by other versions of Sphinx
Sphinx 2
Sphinx2 - Fast performance-oriented ASR
Sphinx 3
- Adopted continuous HMM representation
- Used for high-accuracy, non-real-time recognition
- Sphinx3 is under active development for almost real-time implementations and very high recognition accuracy
Sphinx 4
- Complete rewrite of the Sphinx engine in Java
- Provides a more flexible framework
- Current development goals include:
  - developing a new (acoustic model) trainer
  - implementing speaker adaptation
  - improving configuration management
  - creating a graph-based UI for graphical system design
Pocket Sphinx
A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor).
API features:
- Stable due to use of ADTs
- Fully re-entrant, multiple decoders per process
- Less code and memory required
- Supports linear interpolation of multiple models at runtime
Reference documentation for the API set: http://cmusphinx.sourceforge.net/api/pocketsphinx/
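Linear interpolation of models means mixing their probabilities with weights that sum to 1. The sketch below illustrates the arithmetic only and is not the PocketSphinx API: it assumes the models being mixed are bigram language models stored as plain dictionaries, and the two models, their probabilities and the weights are all invented for the example.

```python
def interpolated_prob(models, weights, history, word):
    """P(word | history) as a weighted sum over several models."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * m.get((history, word), 0.0) for m, w in zip(models, weights))

# Hypothetical models: a general-purpose one and a phone-command one.
general = {("call", "home"): 0.02, ("call", "it"): 0.10}
commands = {("call", "home"): 0.30}

p_home = interpolated_prob([general, commands], [0.7, 0.3], "call", "home")
p_it = interpolated_prob([general, commands], [0.7, 0.3], "call", "it")
```

Changing the weights at runtime shifts the recognizer towards one domain or the other without retraining either model.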
References
- www.howstuffworks.com/speech-recognition (contents and photo of pictorial view of steps involved)
- www.wikipedia.org/speech-recognition (contents and basic techniques for ASR)
- http://cmusphinx.sourceforge.net/wiki/ (tutorials)
- http://www.cs.cmu.edu/~archan/sphinxPresentation.html (working mechanism basics)
Thank You
ANY QUESTIONS ??