
CS399 Speech Recognition

-- Vishwaraj Anand 0901cs35

Contents
- What is the task?
- What are the main difficulties?
- How is it approached?
- Existing SR systems
- How good is it?
- CMU Sphinx: detailed discussion

What is speech recognition(SR)?


- Converting spoken words to text or another meaningful instruction
- Different from voice recognition, which only identifies the speaker (biometrics)
- Also termed automatic speech recognition (ASR), computer speech recognition (CSR), or speech-to-text (STT)

Different types of ASR:
- Speaker-dependent and speaker-independent
- Isolated, discontinuous and continuous
- Read and spontaneous speech
- Limited-vocabulary and open-speech

Applications

- Automatic translation of speech from one language to another
- Automotive speech recognition (e.g., OnStar, Ford Sync)
- Court reporting (real-time speech writing)
- Hands-free computing: speech recognition in multimodal interfaces
- Home automation
- Interactive voice response
- Pronunciation evaluation in computer-aided language learning applications
- Robotics and military applications: better HCIs in cockpits
- Speech-to-text reporting (transcription of speech into text, video captioning, court reporting)
- Telematics (e.g., vehicle navigation systems)
- Video games such as Tom Clancy's EndWar and Lifeline

Challenges
- Portability: independence of computing platform
- Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, even a new language)
- Language modelling: a role for linguistics in improving the language models
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem
- Prosody: stress, intonation and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
- Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace

What is the task?


Getting a computer to understand spoken language
By "understand" we might mean:
- react appropriately
- convert the input speech into another medium, e.g. text

Digitization => Acoustic analysis of speech => Linguistic interpretation in a language



How might computers do it?

[Slide diagram: acoustic waveform -> digitized acoustic signal -> recognized speech]

Three stages:
- Digitization
- Acoustic analysis of the speech signal
- Linguistic interpretation

Available Technologies
Digitization
Converting analog signal into digital representation (PCM)

Signal processing
Separating speech from background noise

Phonetics
Variability in human speech

Phonology
Recognizing individual sound distinctions (similar phonemes); phonology is the systematic use of sound to encode meaning in any spoken human language

Linguistic interpretation: syntax and pragmatics
Pragmatics bridges the explanatory gap between sentence meaning and the speaker's meaning

Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Filtering to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient; e.g., high-frequency sounds are less informative, so they can be sampled with a broader bandwidth
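The sampling and quantizing step above can be sketched in a few lines of Python. The 440 Hz test tone, the 16 kHz rate and the helper name `digitize` are illustrative choices, not part of the slides:

```python
import math

def digitize(signal_fn, duration_s, sample_rate_hz, bits=16):
    """Sample an analogue signal and quantize each sample to signed PCM."""
    n_samples = int(duration_s * sample_rate_hz)
    max_level = 2 ** (bits - 1) - 1          # e.g. 32767 for 16-bit PCM
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz               # sampling: discrete time steps
        amplitude = signal_fn(t)             # assumed to lie in [-1.0, 1.0]
        samples.append(int(round(amplitude * max_level)))  # quantizing
    return samples

def tone(t):
    # A 440 Hz sine wave standing in for the analogue input
    return math.sin(2 * math.pi * 440 * t)

# 16 kHz is a common sampling rate for speech
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=16000)
```

Real front ends would follow this with the filter-bank analysis mentioned above.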


Separating speech from background noise


Noise-cancelling microphones:
- two mics, one facing the speaker, the other facing away
- ambient noise is roughly the same for both mics, so it can be subtracted
Knowing which parts of the signal relate to speech:
- spectrograph analysis
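The two-mic idea can be sketched in a few lines; the integer signals below are invented, and real noise cancellation is of course more involved (adaptive filtering rather than plain subtraction):

```python
def cancel_noise(speech_mic, reference_mic):
    """Subtract the ambient-noise estimate (rear mic) from the front mic,
    sample by sample -- the noise is roughly the same in both channels."""
    return [s - r for s, r in zip(speech_mic, reference_mic)]

noise   = [3, -2, 5, 1]           # ambient noise picked up by both mics
speech  = [10, 0, -4, 7]          # the part only the front mic hears
front   = [s + n for s, n in zip(speech, noise)]
cleaned = cancel_noise(front, noise)
# cleaned recovers the speech-only samples
```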


Available algorithms
Statistical matching is an extended form of template matching. Two main algorithms:
1> Hidden Markov Models (HMM)
2> Dynamic Time Warping (DTW)
Both acoustic modelling and language modelling are important parts of statistical SR systems.

Hidden Markov Models(HMM)


Modern general-purpose SR systems are based on Hidden Markov Models: statistical models that output sequences of symbols, using probability to determine the most likely outcome. They are popular because the speech signal can be viewed as a short-time stationary signal on time scales of around 10 ms, and because HMMs can be trained automatically and are computationally feasible to use.
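A toy illustration of how an HMM picks the most likely hidden sequence is the Viterbi algorithm. The states, probabilities and acoustic symbols below are all invented for the sketch:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # First layer: start probability times emission probability
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            # Best predecessor for state s at this time step
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 V[-1][prev][1])
                for prev in states)
            layer[s] = (prob, path + [s])
        V.append(layer)
    return max(V[-1].values())

# Toy model: two hidden phoneme-like states emitting coarse acoustic labels
states  = ("s", "p")
start_p = {"s": 0.6, "p": 0.4}
trans_p = {"s": {"s": 0.7, "p": 0.3}, "p": {"s": 0.4, "p": 0.6}}
emit_p  = {"s": {"hiss": 0.9, "pop": 0.1}, "p": {"hiss": 0.2, "pop": 0.8}}
prob, path = viterbi(["hiss", "pop", "pop"], states, start_p, trans_p, emit_p)
```

Real recognizers use the same dynamic-programming idea over thousands of states with continuous acoustic features.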

HMM continued
Each phoneme is like a link in a chain, and the completed chain is a word. The chain branches off in different directions as the program matches sounds with the next most likely phoneme. The program then assigns a probability score to each phoneme, based on its built-in dictionary and user training. This process is more complicated for phrases and sentences -- the system must also guess where each word starts and stops.

HMMs for some words

[Slide figure: HMM state diagrams for example words]

HMM continued
Phonemes chain together to form words; acoustically similar phoneme strings can decode to very different word sequences:
1> r eh k ao g n ay z s p iy ch -> "recognize speech"
2> r eh k ay n ay s b iy ch -> "wreck a nice beach"
potato is recognised as -> [slide figure]
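The two decodings above are acoustically close, so the language model has to choose between them. A minimal sketch, with made-up bigram probabilities standing in for a model trained on real text:

```python
# Hypothetical bigram probabilities ("<s>" marks the sentence start)
bigram_p = {
    ("<s>", "recognize"): 0.004, ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.0001, ("wreck", "a"): 0.05,
    ("a", "nice"): 0.01, ("nice", "beach"): 0.02,
}

def sentence_probability(words, floor=1e-8):
    """Multiply bigram probabilities; unseen bigrams get a small floor value."""
    p = 1.0
    for prev, word in zip(["<s>"] + words, words):
        p *= bigram_p.get((prev, word), floor)
    return p

p1 = sentence_probability(["recognize", "speech"])
p2 = sentence_probability(["wreck", "a", "nice", "beach"])
# The language model favours "recognize speech" (p1 > p2)
```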


Dynamic Time Warping(DTW)


- Historically used in SR systems
- Measures similarity between two sequences that vary in time and/or speed
- The sequences are "warped" non-linearly to match each other and give a suitable output
- Often used in conjunction with HMMs to align word sequences for a proper match
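The non-linear warping can be sketched with the classic dynamic-programming recurrence. The sequences and the absolute-difference local cost below are invented for illustration:

```python
def dtw_distance(a, b):
    """Minimal cumulative distance when b is non-linearly warped onto a."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[len(a)][len(b)]

fast = [1, 2, 3, 2, 1]
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same shape, "spoken" more slowly
d = dtw_distance(fast, slow)             # warping absorbs the speed difference
```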


Performance and accuracy


ASRs can be slow due to data-intensive comparison and matching

Accuracy is often measured as word error rate (WER) and command success rate (CSR)
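WER is typically computed as word-level edit distance divided by the reference length. A sketch with made-up transcripts:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with standard edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # delete all remaining words
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # insert all remaining words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("recognize speech with sphinx",
                      "wreck a nice speech with sphinx")
```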


Existing ASR Systems


- Open source: CMU Sphinx, Julius, iARTOS, etc.
- Mac: Dragon Dictate, ViaVoice, iListen, etc.
- Smartphones: Google Voice, Vlingo, Tellme, etc.
- MS Windows: the Speech Application Programming Interface (SAPI) provides ASR and speech synthesis in Windows applications; SAPI is embedded in MS Office, MS Agent and MS Speech Server
-- we will discuss CMU Sphinx in detail

CMU SPHINX
Includes a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain)
The original Sphinx, developed by Kai-Fu Lee:
- continuous-speech, speaker-independent
- based on HMMs and n-gram statistical language models
- superseded by later versions of Sphinx


Sphinx 2
Sphinx2 is a fast, performance-oriented ASR:
- focuses on real-time recognition suitable for spoken-language applications
- used in dialog systems and language learning


Sphinx 3
- Adopted a continuous-HMM representation
- Used for high-accuracy, non-real-time recognition
- Sphinx3 is under active development for near-real-time implementation with very high recognition accuracy


Sphinx 4
A complete rewrite of the Sphinx engine in Java, providing a more flexible framework. Current development goals include:
- developing a new (acoustic model) trainer
- implementing speaker adaptation
- improving configuration management
- creating a graph-based UI for graphical system design

Pocket Sphinx
A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). API features:
- stable, due to the use of ADTs
- fully re-entrant: multiple decoders per process
- smaller code and memory footprint
- supports linear interpolation of multiple models at runtime
Reference documentation for the API: http://cmusphinx.sourceforge.net/api/pocketsphinx/


References
www.howstuffworks.com/speech-recognition -- contents and pictorial view of the steps involved
www.wikipedia.org/speech-recognition -- contents and basic techniques for ASR
http://cmusphinx.sourceforge.net/wiki/ -- tutorials
http://www.cs.cmu.edu/~archan/sphinxPresentation.html -- working-mechanism basics

Thank You

ANY QUESTIONS ??

