Topics to discuss
Fundamental definitions
What is speech?
Phonetics and Phonology
Speech Recognition
Speech Synthesis
Research areas in speech
Fundamental Definitions
Sound waves
A sound is simply a disturbance of air molecules, which radiates outward from its source, in waves of fluctuating air pressure, like ripples from a stone dropped in a pool. The structure of these sound waves distinguishes one sound from another. When sound waves hit our eardrums, nerve cells in the inner ear detect the structure of the vibrations, and they pass this information on to the brain.
Spectrograms
A spectrogram displays time on the x-axis and frequency on the y-axis, with the higher-amplitude frequency regions shown as darker areas.
[Figure: example spectrogram annotated with phone labels]
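As a rough illustration of how a spectrogram is computed, the sketch below cuts a signal into short overlapping frames and measures the magnitude at each frequency in every frame. The function name, the frame length, the hop size and the 1000 Hz test tone are all assumptions chosen for the example, and a naive DFT is used for clarity rather than speed.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop=32):
    """Return a list of frames; each frame holds one magnitude per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        mags = []
        # Naive DFT, keeping only the non-negative frequency bins.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Hypothetical test signal: a 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]
spec = spectrogram(tone)

# Bin k corresponds to the frequency k * SAMPLE_RATE / frame_len.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

Each inner list is one column of the picture (one time step), and a plotting tool would draw larger magnitudes as darker cells.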
SPEECH
What is it?
Linguistics
Physiology
Acoustics
[Figure: speech chain diagram. Linguistic level -> Physiological level -> Acoustic level -> Physiological level -> Linguistic level]
Linguistics
Units of language. What are they?
Words? Syllables? Sounds?
Phonemes.
Consider the words pig, dig and jig. The sounds p, d and j distinguish the three words from each other. We can compare all the words in a language and determine which sounds differentiate one word from another. These sounds constitute the phonemes of the language.
Now consider kop versus cap. What distinguishes the two words? The o and the a. The letters k and c have the same sound here: both belong to the same phoneme, /k/.
/s/
Physiology
This relates to how the sounds are produced through neural and muscular activity. We set air coming up from the lungs in motion using our vocal cords, and then we channel this air through the vocal tract using our tongue, lips, etc. We can classify the different sounds we make according to how we set the air in motion and how we channel the airstream through the vocal tract.
Acoustics
This describes the generation and transmission of the sounds. How air is set in motion. We generate sound waves. What do they look like?
PHONEME
The smallest unit in the sound system of a language that changes the meaning of a word when it is replaced by another such unit.
For example: The English words ram and ran form a minimal pair. They differ only in the sounds m and n, which shows the semantic contrast between the two distinct phonemes /m/ and /n/.
The distribution of phonemes is contrastive: one phoneme can occur where another does.
A phoneme is an abstract mental representation of a sound unit.
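The minimal-pair test can be sketched in a few lines: two words form a minimal pair when their phoneme sequences have the same length and differ in exactly one position. The toy lexicon and its informal phoneme symbols below are assumptions made for illustration, not a real pronunciation dictionary.

```python
def is_minimal_pair(a, b):
    """True if two phoneme sequences differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(1 for x, y in zip(a, b) if x != y) == 1

def find_minimal_pairs(lexicon):
    """Return all minimal pairs in a word -> phoneme-list mapping."""
    words = sorted(lexicon)
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if is_minimal_pair(lexicon[w1], lexicon[w2]):
                pairs.append((w1, w2))
    return pairs

# Hypothetical toy lexicon using informal phoneme symbols.
LEXICON = {
    "ram": ["r", "a", "m"],
    "ran": ["r", "a", "n"],
    "pig": ["p", "i", "g"],
    "dig": ["d", "i", "g"],
    "jig": ["j", "i", "g"],
}

pairs = find_minimal_pairs(LEXICON)
```

Here ram/ran is found as a pair, as are the three pairings of pig, dig and jig, while ram and pig are not paired because they differ in more than one position.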
ALLOPHONE
A contextual variant of a single phoneme in a particular phonetic environment. Allophones do not involve a semantic contrast.
Their distribution is mutually exclusive: one allophone cannot occur where another can.
Their occurrence is predicted (governed) by phonological rules.
For example: The p sounds in the English words pin and spin are acoustically different. The [p] in pin is produced with a puff of air following it (aspirated), whereas the [p] in spin is not.
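A phonological rule of this kind can be written as a small function from phonemes to phones. The sketch below uses a deliberately simplified rule, assumed only for illustration: /p/ is aspirated at the start of a word and plain everywhere else. The actual conditions in English are more detailed than this.

```python
def realize_p(phonemes):
    """Choose an allophone of /p/: aspirated word-initially, plain otherwise.

    Simplified rule assumed for illustration only.
    """
    phones = []
    for i, ph in enumerate(phonemes):
        if ph == "p":
            phones.append("pʰ" if i == 0 else "p")
        else:
            phones.append(ph)
    return phones

pin = realize_p(["p", "i", "n"])         # aspirated [pʰ]
spin = realize_p(["s", "p", "i", "n"])   # plain [p] after s
```

Because the environment alone decides which variant appears, swapping the two variants never produces a different word.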
Vowels
Vowels are sounds produced with no obstruction to the airstream as it passes through the vocal tract. Three main organs of speech are involved in changing the size of the air chamber:
the lips: rounded, spread
the lower jaw: lowered, raised
the tongue: raised, flattened, brought forward, etc.
Consonants
Consonants are articulated by restricting the airflow at some part of the vocal tract. The consonant that is produced is determined by three factors: place, manner and voice.
1) Place of articulation: bilabial, dental, alveolar, palatal, velar
2) Manner of articulation: stop (plosive), nasal, trill, tap (flap), fricative, affricate, lateral approximant
3) Voicing: voiced or voiceless
Places of articulation
Bilabial
Bilabial sounds are those sounds made by the articulation of the lips against each other. Examples of such sounds in English are the following: [b], [p], [m]
Dental
Dental sounds are those sounds made by the articulation of the tip of the tongue towards the back of the teeth. Dental stops are not present in Standard American English, but in some Chicano English dialects and certain Brooklyn dialects, the sounds [t] and [d] are pronounced with a dental articulation.
Alveolar
Alveolar sounds are those sounds made by the articulation of the tip of the tongue towards the alveolar ridge, the bony ridge behind the upper teeth. Examples of such sounds in English are the following: [n], [l]
Manner of articulation
Plosive/Stop
Plosive sounds are made by forming a complete obstruction to the flow of air through the mouth and nose.
pit bit
tip dip
cot got
Fricative
A fricative is a type of consonant that is formed by forcing air through a narrow gap so that a hissing sound is created. Air is forced between the tongue and the place of articulation for the particular sound.
f (as in far), sh (as in shut)
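The three-way description of consonants (place, manner, voicing) maps naturally onto a small lookup table. The sketch below is an illustrative assumption, not a complete inventory: it covers only a few English consonants, and the names Consonant, CONSONANTS and differing_features are made up for the example.

```python
from typing import NamedTuple

class Consonant(NamedTuple):
    place: str
    manner: str
    voiced: bool

# Partial, illustrative table of English consonants.
CONSONANTS = {
    "p": Consonant("bilabial", "plosive", False),
    "b": Consonant("bilabial", "plosive", True),
    "m": Consonant("bilabial", "nasal", True),
    "t": Consonant("alveolar", "plosive", False),
    "d": Consonant("alveolar", "plosive", True),
    "n": Consonant("alveolar", "nasal", True),
    "l": Consonant("alveolar", "lateral approximant", True),
    "k": Consonant("velar", "plosive", False),
    "g": Consonant("velar", "plosive", True),
}

def differing_features(a, b):
    """List the features (place, manner, voiced) on which two consonants differ."""
    fa, fb = CONSONANTS[a], CONSONANTS[b]
    return [name for name in Consonant._fields
            if getattr(fa, name) != getattr(fb, name)]

def select(**features):
    """Return the consonants that match all the given feature values."""
    return sorted(sym for sym, c in CONSONANTS.items()
                  if all(getattr(c, k) == v for k, v in features.items()))
```

For instance, differing_features("p", "b") reports only the voicing feature, which is the single contrast separating pit from bit, and select(place="bilabial") returns b, m and p.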
Syllable
A syllable is a structural unit of sound that consists of a sequence of consonants and vowels. It is hierarchically composed of three parts: the onset, and the nucleus and coda, which together form the rime.
syllable
  onset: str
  rime
    nucleus: eh
    coda: nx ths
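Splitting a single syllable into onset, nucleus and coda can be sketched as follows: everything before the vowel is the onset, the vowel is the nucleus, and everything after it is the coda. The vowel symbol set and the function name are assumptions for the example, and the function expects the phones of one syllable only.

```python
# Hypothetical set of vowel symbols; a real phone set would be larger.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw"}

def split_syllable(phones):
    """Split the phones of one syllable into onset, nucleus and coda."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("syllable has no vowel nucleus")
    first, last = vowel_positions[0], vowel_positions[-1]
    return {
        "onset": phones[:first],
        "nucleus": phones[first:last + 1],
        "coda": phones[last + 1:],
    }

parts = split_syllable(["s", "t", "r", "eh", "nx", "th", "s"])
rime = parts["nucleus"] + parts["coda"]
```

With this input the onset is s t r, the nucleus is eh and the coda is nx th s, so the rime is eh nx th s.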
Types of Parameterization
Linear Predictive Coding (LPC)
* Best for Speech Synthesis
Mel Frequency Cepstral Coefficients (MFCC)
Acoustic Model
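Goal:
Given acoustic data A = a1, a2, ..., ak
Find word sequence W = w1, w2, ..., wn
Such that P(W | A) is maximized
Bayes Rule:
P(W | A) = P(A | W) P(W) / P(A)
P(A | W): acoustic model (HMMs)
P(W): language model
Because P(A) does not depend on W, maximizing P(W | A) is the same as maximizing P(A | W) P(W).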
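By Bayes' rule, P(W | A) is proportional to P(A | W) P(W), so a recognizer can pick the word sequence with the largest product of acoustic-model and language-model probabilities. The sketch below does this over a fixed list of candidates using log probabilities. The candidate sentences and all the scores are invented for illustration, and a real system would search a far larger space with HMMs rather than a lookup table.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W that maximizes log P(A | W) + log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Hypothetical scores for a single utterance.
ACOUSTIC = {  # log P(A | W)
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
LANGUAGE = {  # log P(W)
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(ACOUSTIC), ACOUSTIC, LANGUAGE)
```

Here the acoustic model slightly prefers the second candidate, but the language model makes the first far more likely overall, so the first is chosen.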
Existing SR systems
Dragon NaturallySpeaking
IBM ViaVoice
1. Articulatory
Physical models based on a detailed description of the physiology of speech production and on the physics of sound generation in the vocal apparatus. Typical parameters are the positions and kinematics of the articulators. The sound radiated at the mouth is then computed according to the equations of physics.
2. Formant
A descriptive acoustic-phonetic approach to synthesis. Speech generation is not performed by solving the equations of physics in the vocal apparatus, but by modeling the main acoustic features of the speech signal.
3. Concatenative
Based on signal processing of natural speech databases. The segmental database is built to reflect the major phonological features of a language. For instance, its set of phonemes is described in terms of diphone units, representing the phoneme-to-phoneme junctures. Non-uniform units are also used (diphones, syllables, words, etc.). The synthesiser concatenates (coded) speech segments and performs some signal processing to smooth unit transitions.
Diphones are:
Made up of 2 phonemes
Incorporate transitional sound
Make for better sounding speech
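Turning a phoneme sequence into diphone labels amounts to pairing each phoneme with its successor. In the sketch below, the silence symbol "_" that pads the start and end of the word, the "a-b" label format and the function name are assumptions made for the example.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with the next one, padding the edges with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

diphones = to_diphones(["k", "ae", "t"])
```

A word of n phonemes yields n + 1 diphones under this padding, each one covering the transition between two neighbouring sounds.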
Fundamental Components
TTS System
words -> Text Pre-processing -> Prosody -> Concatenation
Text Pre-Processing
Input: string of characters (sentence)
Output: string of diphone symbols
Objective: perform sentence-level analysis
Punctuation marks
Pauses between words
MLDS
Diphone Dictionary
Number Converter
Acronym Converter
Word Segmenter
Purpose
Translate words to their diphone representations
Resource
Dictionary of words and their diphones
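The pre-processing stage can be sketched as a chain of small functions: a word segmenter, a number converter, an acronym converter and a dictionary lookup. Everything below is a hypothetical illustration rather than the system described here: numbers are simply read digit by digit, acronyms are spelled letter by letter, punctuation becomes a pause marker, and the diphone dictionary has a single made-up entry.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def segment(sentence):
    """Word segmenter: split into words, digit strings and punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+|[.,!?]", sentence)

def convert_number(token):
    """Number converter: read a digit string one digit at a time."""
    return [DIGITS[int(d)] for d in token]

def convert_acronym(token):
    """Acronym converter: spell an all-capitals token letter by letter."""
    return [c.lower() for c in token]

def preprocess(sentence):
    """Turn a sentence into a flat list of lowercase words and pause markers."""
    words = []
    for tok in segment(sentence):
        if tok.isdigit():
            words.extend(convert_number(tok))
        elif tok.isupper() and len(tok) > 1:
            words.extend(convert_acronym(tok))
        elif tok in ".,!?":
            words.append("<pause>")
        else:
            words.append(tok.lower())
    return words

# Hypothetical diphone dictionary with one entry.
DIPHONE_DICT = {"at": ["_-ae", "ae-t", "t-_"]}

def words_to_diphones(words):
    """Look up each word; unknown words are kept as visible markers."""
    out = []
    for w in words:
        out.extend(DIPHONE_DICT.get(w, [f"<unknown:{w}>"]))
    return out

words = preprocess("Call IBM at 42.")
```

For this input, words is call, i, b, m, at, four, two followed by a pause marker, and only "at" would be found in the toy dictionary.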
Prosody
[Figure: prosody flowchart with the stages MLDS, Diphone Retrieval and Acoustic Manipulation, and a "done" decision (yes/no)]
Concatenation
Diphone Database
Diphone Retrieval
Database of recorded diphones
Every diphone matched with txt file
Distinguished by type (CC, CV, VC, VV)
References to specific components within waveform
Vast array of signal processing tools
Built-in functions
Ease of debugging
GUI-capable
Concatenation
Diphones Words
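One simple way to smooth the transition when joining two recorded units is a linear crossfade over a short overlap. This is offered only as a sketch of the general idea: the function name, the overlap length and the linear weighting are assumptions, and the smoothing actually used by a given synthesiser may differ.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of a
    with the first `overlap` samples of b using linear weights."""
    if overlap <= 0:
        return list(a) + list(b)
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the segments")
    out = list(a[:-overlap])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        out.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    out.extend(b[overlap:])
    return out

# Hypothetical segments: a constant 1.0 followed by a constant 0.0.
joined = crossfade_concat([1.0] * 4, [0.0] * 4, overlap=2)
```

The result is shorter than the two inputs together by the overlap length, and the amplitude steps down gradually instead of jumping at the join.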
Speech synthesis: synthesis by rule, text-to-speech
Speech coding: wide/narrow-band, very-low-bit-rate
Robustness: noise/distortion
Human-machine interface: ergonomics, subjective/objective evaluation
Individuality: speaker recognition, speaker adaptation/normalization, voice conversion
Database
[Figure: Speech information processing tree, consisting of present and future speech information processing technologies supported by scientific and technological areas serving as the foundations of speech research.]