
CS399 Speech Recognition

-- Vishwaraj Anand 0901cs35

Contents
- What is the task?
- What are the main difficulties?
- How is it approached?
- Existing SR systems
- How good is it?
- CMU Sphinx: detailed discussion

What is speech recognition(SR)?


- Converting spoken words to text or another meaningful instruction
- Different from voice recognition, which only identifies the speaker (biometrics)
- Also termed automatic speech recognition (ASR), computer speech recognition (CSR), or speech-to-text (STT)

Different types of ASR:
- Speaker-dependent and speaker-independent
- Isolated, discontinuous and continuous
- Read and spontaneous speech
- Limited-vocabulary and open-speech

Applications

- Automatic translation of speech from one language to another
- Automotive speech recognition (e.g., OnStar, Ford Sync)
- Court reporting (real-time speech writing)
- Hands-free computing: speech recognition in multimodal interfaces
- Home automation
- Interactive voice response
- Pronunciation evaluation in computer-aided language learning applications
- Robotics and military applications: better HCIs in cockpits
- Speech-to-text reporting (transcription of speech into text, video captioning, court reporting)
- Telematics (e.g., vehicle navigation systems)
- Video games such as Tom Clancy's EndWar and Lifeline

Challenges
- Portability: independence of computing platform
- Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, even a new language)
- Language modelling: a role for linguistics in improving the language models
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem
- Prosody: stress, intonation and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
- Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace

What is the task?


Getting a computer to understand spoken language
By "understand" we might mean:
- react appropriately
- convert the input speech into another medium, e.g. text

Digitization => Acoustic analysis of speech => Linguistic interpretation in a language



How might computers do it?

[Slide diagram: acoustic waveform -> digitized acoustic signal -> recognized speech]

Three stages:
- Digitization
- Acoustic analysis of the speech signal
- Linguistic interpretation

Available Technologies
Digitization
Converting analog signal into digital representation (PCM)

Signal processing
Separating speech from background noise

Phonetics
Variability in human speech

Phonology
Recognizing individual sound distinctions (similar phonemes); phonology is the systematic use of sound to encode meaning in any spoken human language

Linguistic interpretation: syntax and pragmatics
Pragmatics bridges the explanatory gap between sentence meaning and the speaker's meaning

Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Filtering to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient; e.g., high-frequency sounds are less informative, so they can be sampled with a broader bandwidth
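The sampling and quantizing step above can be sketched in a few lines of Python. The 440 Hz test tone, the 16 kHz rate and the helper name `digitize` are illustrative choices, not part of the slides:

```python
import math

def digitize(signal_fn, duration_s, sample_rate_hz, bits=16):
    """Sample an analogue signal and quantize each sample to signed PCM."""
    n_samples = int(duration_s * sample_rate_hz)
    max_level = 2 ** (bits - 1) - 1          # e.g. 32767 for 16-bit PCM
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz               # sampling: discrete time steps
        amplitude = signal_fn(t)             # assumed to lie in [-1.0, 1.0]
        samples.append(int(round(amplitude * max_level)))  # quantizing
    return samples

def tone(t):
    # A 440 Hz sine wave standing in for the analogue input
    return math.sin(2 * math.pi * 440 * t)

# 16 kHz is a common sampling rate for speech
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=16000)
```

Real front ends would follow this with the filter-bank analysis mentioned above.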


Separating speech from background noise


Noise-cancelling microphones:
- two mics, one facing the speaker, the other facing away
- ambient noise is roughly the same for both mics, so it can be subtracted
Knowing which parts of the signal relate to speech:
- spectrograph analysis
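The two-mic idea can be sketched in a few lines; the integer signals below are invented, and real noise cancellation is of course more involved (adaptive filtering rather than plain subtraction):

```python
def cancel_noise(speech_mic, reference_mic):
    """Subtract the ambient-noise estimate (rear mic) from the front mic,
    sample by sample -- the noise is roughly the same in both channels."""
    return [s - r for s, r in zip(speech_mic, reference_mic)]

noise   = [3, -2, 5, 1]           # ambient noise picked up by both mics
speech  = [10, 0, -4, 7]          # the part only the front mic hears
front   = [s + n for s, n in zip(speech, noise)]
cleaned = cancel_noise(front, noise)
# cleaned recovers the speech-only samples
```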


Available algorithms
Statistical matching is an extended form of template matching. Two main algorithms:
1> Hidden Markov Models (HMM)
2> Dynamic Time Warping (DTW)
Both acoustic modelling and language modelling are important parts of statistical SR systems.

Hidden Markov Models(HMM)


Modern general-purpose SR systems are based on Hidden Markov Models: statistical models that output sequences of symbols, using probability to determine the most likely outcome. They are popular because the speech signal can be viewed as a short-time stationary signal on time scales of around 10 ms, and because HMMs can be trained automatically and are computationally feasible to use.
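A toy illustration of how an HMM picks the most likely hidden sequence is the Viterbi algorithm. The states, probabilities and acoustic symbols below are all invented for the sketch:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # First layer: start probability times emission probability
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            # Best predecessor for state s at this time step
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 V[-1][prev][1])
                for prev in states)
            layer[s] = (prob, path + [s])
        V.append(layer)
    return max(V[-1].values())

# Toy model: two hidden phoneme-like states emitting coarse acoustic labels
states  = ("s", "p")
start_p = {"s": 0.6, "p": 0.4}
trans_p = {"s": {"s": 0.7, "p": 0.3}, "p": {"s": 0.4, "p": 0.6}}
emit_p  = {"s": {"hiss": 0.9, "pop": 0.1}, "p": {"hiss": 0.2, "pop": 0.8}}
prob, path = viterbi(["hiss", "pop", "pop"], states, start_p, trans_p, emit_p)
```

Real recognizers use the same dynamic-programming idea over thousands of states with continuous acoustic features.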

HMM continued
Each phoneme is like a link in a chain, and the completed chain is a word. The chain branches off in different directions as the program matches sounds with the next most likely phoneme. The program then assigns a probability score to each phoneme, based on its built-in dictionary and user training. This process is more complicated for phrases and sentences -- the system must also guess where each word starts and stops.

HMMs for some words

[Slide figure: HMM state diagrams for example words]

HMM continued
Phonemes chain together to form words; acoustically similar phoneme strings can decode to very different word sequences:
1> r eh k ao g n ay z s p iy ch -> "recognize speech"
2> r eh k ay n ay s b iy ch -> "wreck a nice beach"
potato is recognised as -> [slide figure]
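The two decodings above are acoustically close, so the language model has to choose between them. A minimal sketch, with made-up bigram probabilities standing in for a model trained on real text:

```python
# Hypothetical bigram probabilities ("<s>" marks the sentence start)
bigram_p = {
    ("<s>", "recognize"): 0.004, ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.0001, ("wreck", "a"): 0.05,
    ("a", "nice"): 0.01, ("nice", "beach"): 0.02,
}

def sentence_probability(words, floor=1e-8):
    """Multiply bigram probabilities; unseen bigrams get a small floor value."""
    p = 1.0
    for prev, word in zip(["<s>"] + words, words):
        p *= bigram_p.get((prev, word), floor)
    return p

p1 = sentence_probability(["recognize", "speech"])
p2 = sentence_probability(["wreck", "a", "nice", "beach"])
# The language model favours "recognize speech" (p1 > p2)
```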


Dynamic Time Warping(DTW)


- Historically used in SR systems
- Measures similarity between two sequences that vary in time and/or speed
- The sequences are "warped" non-linearly to match each other and give a suitable output
- Often used in conjunction with HMMs to align word sequences for a proper match
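The non-linear warping can be sketched with the classic dynamic-programming recurrence. The sequences and the absolute-difference local cost below are invented for illustration:

```python
def dtw_distance(a, b):
    """Minimal cumulative distance when b is non-linearly warped onto a."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[len(a)][len(b)]

fast = [1, 2, 3, 2, 1]
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same shape, "spoken" more slowly
d = dtw_distance(fast, slow)             # warping absorbs the speed difference
```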


Performance and accuracy


ASRs can be slow due to data-intensive comparison and matching

Accuracy is often measured as word error rate (WER) and command success rate (CSR)
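WER is typically computed as word-level edit distance divided by the reference length. A sketch with made-up transcripts:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with standard edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # delete all remaining words
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # insert all remaining words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("recognize speech with sphinx",
                      "wreck a nice speech with sphinx")
```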


Existing ASR Systems


- Open source: CMU Sphinx, Julius, iARTOS, etc.
- Mac: Dragon Dictate, ViaVoice, iListen, etc.
- Smartphones: Google Voice, Vlingo, Tellme, etc.
- MS Windows: the Speech Application Programming Interface (SAPI) provides ASR and speech synthesis in Windows applications; SAPI is embedded in MS Office, MS Agent and MS Speech Server
-- we will discuss CMU Sphinx in detail

CMU SPHINX
Includes a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain)
The original Sphinx, developed by Kai-Fu Lee:
- continuous-speech, speaker-independent
- based on HMMs and n-gram statistical language models
- superseded by later versions of Sphinx


Sphinx 2
Sphinx2 is a fast, performance-oriented ASR:
- focuses on real-time recognition suitable for spoken-language applications
- used in dialog systems and language learning


Sphinx 3
- Adopted a continuous-HMM representation
- Used for high-accuracy, non-real-time recognition
- Sphinx3 is under active development for near-real-time implementation with very high recognition accuracy


Sphinx 4
A complete rewrite of the Sphinx engine in Java, providing a more flexible framework. Current development goals include:
- developing a new (acoustic model) trainer
- implementing speaker adaptation
- improving configuration management
- creating a graph-based UI for graphical system design

Pocket Sphinx
A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). API features:
- stable, due to the use of ADTs
- fully re-entrant: multiple decoders per process
- smaller code and memory footprint
- supports linear interpolation of multiple models at runtime
Reference documentation for the API: http://cmusphinx.sourceforge.net/api/pocketsphinx/


References
www.howstuffworks.com/speech-recognition -- contents and pictorial view of the steps involved
www.wikipedia.org/speech-recognition -- contents and basic techniques for ASR
http://cmusphinx.sourceforge.net/wiki/ -- tutorials
http://www.cs.cmu.edu/~archan/sphinxPresentation.html -- working-mechanism basics

Thank You

ANY QUESTIONS ??

