
Speech Processing

Topics to discuss:
Fundamental definitions
What is speech?
Phonetics and Phonology
Speech Recognition
Speech Synthesis
Research areas in speech

Fundamental Definitions

Sound waves
A sound is simply a disturbance of air molecules, which radiates outward from its source, in waves of fluctuating air pressure, like ripples from a stone dropped in a pool. The structure of these sound waves distinguishes one sound from another. When sound waves hit our eardrums, nerve cells in the inner ear detect the structure of the vibrations, and they pass this information on to the brain.

Frequency and amplitude of a wave


[Figure: a lower-amplitude, higher-frequency wave]

[Figure: a higher-amplitude, lower-frequency wave]

[Figure: 1 cycle of the wave (trough to trough, or peak to peak)]

Pitch and loudness


The frequency of a wave is heard as its pitch. The amplitude of a wave is heard as its loudness.
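These two relationships can be seen directly in a generated waveform. A minimal sketch using NumPy; the sample rate and tone frequencies are arbitrary illustrative choices:

```python
import numpy as np

def sine_wave(freq_hz, amplitude, duration_s=1.0, sample_rate=16000):
    """Generate a pure tone: freq_hz sets the pitch, amplitude the loudness."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

quiet_high = sine_wave(freq_hz=880, amplitude=0.2)  # higher pitch, quieter
loud_low   = sine_wave(freq_hz=220, amplitude=0.8)  # lower pitch, louder
```

Doubling `freq_hz` raises the perceived pitch by an octave; scaling `amplitude` changes only the loudness, not the pitch.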

Spectrograms
Display of time on the x-axis, frequency on the y-axis, and the higher-amplitude frequency regions shown as darker areas
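A spectrogram of this kind is computed with a short-time Fourier transform. A minimal sketch using SciPy's `signal.spectrogram`; the rising chirp is a made-up test signal, not speech:

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(fs) / fs                       # 1 second of audio
x = np.sin(2 * np.pi * (500 + 1000 * t) * t) # chirp: frequency rises over time

# freqs: y-axis, times: x-axis, Sxx: power per (frequency, time) cell
# (the higher-amplitude cells are the ones drawn darker)
freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)

peak = freqs[Sxx.argmax(axis=0)]             # dominant frequency in each frame
```

For the chirp, `peak` climbs from frame to frame, which is exactly the rising dark band one would see in the plotted spectrogram.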

Spectrogram of 'Are you working late, Nanny?'

[Spectrogram figure with the phonetic transcription aligned below the time axis]

SPEECH
What is it?
Linguistics
Physiology

Acoustics

The Speech Chain (Denes & Pinson)

Speaker → Listener:
Linguistic level → Physiological level → Acoustic level → Physiological level → Linguistic level

Linguistics
Units of language. What are they?
Words? Syllables? Sounds?

What are the individual sounds in language?

Phonemes.

How are they defined?

Consider the words pig, dig and jig: p, d and j distinguish the three words from each other. We can compare all the words in a language and determine those sounds that differentiate one word from another. These sounds constitute the phonemes of a language.

kop versus cap: what distinguishes the two words?
The o and the a. k and c have the same sound here; they both belong to the same phoneme /k/.

cede versus seed: c and s here represent the same phoneme /s/.
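The comparison described above (finding the sounds that differentiate one word from another) can be sketched as a small search. Here spelling stands in for phonetic transcription, which only works for examples like pig/dig/jig where letters map one-to-one onto sounds:

```python
def minimal_pairs(words):
    """Return pairs of same-length words that differ in exactly one position.
    Letters stand in for sounds; a real analysis uses phonetic transcriptions."""
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) == 1:
                pairs.append((w1, w2))
    return pairs
```

Each pair found this way is evidence that the differing sounds are distinct phonemes of the language.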

Physiology
This relates to how the sounds are produced through neural and muscular activity. We set air coming up from the lungs in motion using our vocal cords, and then channel this air through the vocal tract using our tongue, lips, etc. We can classify the different sounds we make according to how we set the air in motion and how we channel the airstream through the vocal tract.

Acoustics
This describes the generation and transmission of the sounds. How air is set in motion. We generate sound waves. What do they look like?

PHONETICS AND PHONOLOGY


Phonetics concerns itself with:
The study of the acoustic detail of speech sounds and how they are articulated.

Phonology concerns itself with:
How these speech sounds are used within languages; it deals with the mechanisms, rules and processes that underlie and govern these units of speech.

Phoneme, Phone, Allophone


Phoneme:

The smallest unit in the sound system of a language that brings in a contrast in meaning (of a word) when replaced by another such unit. It is an abstract mental representation of a sound unit.
For example: the sounds m and n in the English words ram and ran constitute a minimal pair that shows the semantic contrast between the two distinct phonemes /m/ and /n/. Their distribution is contrastive: one phoneme can occur where another does.

ALLOPHONE
A contextual variant of a single phoneme, in a particular phonetic environment. Allophones do not involve a semantic contrast; their distribution is mutually exclusive: one allophone cannot occur where another can. They are predicted/governed by phonological rules.
For example: the p sounds in the English words pin and spin are acoustically different. The [p] in pin is produced with a breath of air following it (aspirated), whereas the [p] in spin is not.

Vowels
Vowels are sounds produced with no obstruction to the airstream as it passes through the vocal tract. Three main organs of speech are involved in changing the size of the air chamber:
the lips - rounding, spreading
the lower jaw - lowered, raised
the tongue - raised, flattened, brought forward, etc.

Consonants
Consonants are articulated by restricting the airflow at some part of the vocal tract. The consonant that is produced is determined by three factors; place, manner and voice.

Consonants are characterized by three features:
1) Place of articulation - Bilabial, Dental, Alveolar, Palatal, Velar
2) Manner of articulation - Stop (Plosive), Nasal, Trill, Tap (Flap), Fricative, Affricate, Lateral approximant
3) Voicing - voiced/voiceless
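This three-way classification can be encoded as a small lookup table. A minimal sketch; the entries cover only a handful of English consonants:

```python
# (place, manner, voicing) for a few English consonants
CONSONANTS = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "m": ("bilabial", "nasal", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "d": ("alveolar", "plosive", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
    "k": ("velar", "plosive", "voiceless"),
    "g": ("velar", "plosive", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
}

def differ_only_in_voicing(c1, c2):
    """True when two consonants share place and manner but differ in voicing,
    e.g. the p/b, t/d and k/g pairs from the minimal-pair examples."""
    p1, m1, v1 = CONSONANTS[c1]
    p2, m2, v2 = CONSONANTS[c2]
    return p1 == p2 and m1 == m2 and v1 != v2
```

The voiceless/voiced pairs p/b, t/d and k/g differ only in the third feature, which is what makes pit/bit, tip/dip and cot/got minimal pairs.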

Places of articulation
Bilabial
Bilabial sounds are those sounds made by the articulation of the lips against each other. Examples of such sounds in English are: [b], [p], [m]

Dental
Dental sounds are those sounds made by the articulation of the tip of the tongue towards the back of the teeth. Such sounds are not present in Standard American English, but in some Chicano English dialects and certain Brooklyn dialects, the sounds [t] and [d] are pronounced with a dental articulation.

Alveolar
Alveolar sounds are those sounds made by the articulation of the tip of the tongue towards the alveolar ridge, the bony ridge behind the teeth. Examples of such sounds in English are: [n], [l]

Manner of articulation
Plosive/Stop
Plosive sounds are made by forming a complete obstruction to the flow of air through the mouth and nose; the explosion of air on release causes a sharp noise.
Voiceless - p, t, k; Voiced - b, d, g
Minimal pairs: pit/bit, tip/dip, cot/got

Fricative
A fricative is a type of consonant that is formed by forcing air through a narrow gap so that a hissing sound is created. Air is forced between the tongue and the place of articulation for the particular sound.
f (as in far), sh (as in shut)

Syllable
A syllable is a structural unit of sound that constitutes a sequence of consonants and vowels. It is hierarchically composed of three parts:

Onset - initial consonant or consonant cluster
Nucleus - the vowel
Coda - final consonant or consonant cluster

Syllable tree: syllable → onset + rime; rime → nucleus + coda
Example: onset = str, nucleus = eh, coda = nx ths
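The onset/nucleus/coda split can be sketched as a tiny parser. A rough sketch for single, spelled syllables; letters stand in for phonemes and 'aeiou' for the vowel sounds:

```python
VOWELS = set("aeiou")

def syllable_parts(syllable):
    """Split a single spelled syllable into (onset, nucleus, coda):
    leading consonants, the vowel run, and the trailing consonants."""
    s = syllable.lower()
    first = next((i for i, ch in enumerate(s) if ch in VOWELS), len(s))
    last = first
    while last < len(s) and s[last] in VOWELS:
        last += 1
    return s[:first], s[first:last], s[last:]
```

Applied to "strengths" this yields onset "str", nucleus "e" and coda "ngths", matching the tree above.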

Need for Parameterization


Time domain waveform - redundant information
Waveform domain - too much data to handle
Data reduction by parameterization
Parameter choice is application dependent
Applications: speech coding, speech synthesis, speech recognition, speaker verification, etc.

Types of Parameterization
Linear Predictive Coding (LPC)
* Best for Speech Synthesis

Mel Frequency Cepstral Coefficients (MFCC)
* Best for Speech Recognition
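The LPC analysis above can be sketched with the classic autocorrelation method and Levinson-Durbin recursion. A minimal NumPy sketch; production systems additionally apply pre-emphasis, windowing and frame-by-frame processing:

```python
import numpy as np

def lpc(frame, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] of A(z); the predictor is
    x_hat[n] = -sum_{k=1..p} a[k] * x[n-k]."""
    n = len(frame)
    # autocorrelation lags 0..order
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err               # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)         # residual prediction error
    return a
```

Fitting an order-2 model to a signal generated by a known second-order recursion recovers (approximately) the generating coefficients, which is why a few LPC coefficients per frame suffice as a compact parameterization.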

Acoustic Model
Goal:

Given acoustic data A = a1, a2, ..., ak
Find word sequence W = w1, w2, ..., wn
Such that P(W | A) is maximized

Bayes Rule:

P(W | A) = P(A | W) P(W) / P(A)

where P(A | W) is the acoustic model (HMMs), P(W) is the language model, and P(A) is a constant for a complete sentence.
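Since P(A) is constant for a given utterance, recognition reduces to maximizing P(A | W) P(W). A toy sketch of that argmax; the probabilities below are made-up numbers, not from any trained model:

```python
# P(A | W): how well each candidate word explains the acoustics (made up)
acoustic = {"dig": 0.30, "pig": 0.25, "jig": 0.05}
# P(W): prior probability of each word from a language model (made up)
language = {"dig": 0.20, "pig": 0.70, "jig": 0.10}

# choose the word maximizing the product P(A | W) * P(W)
best = max(acoustic, key=lambda w: acoustic[w] * language[w])
```

Note that "dig" scores highest acoustically, yet the language model prior tips the decision to "pig"; this interplay is exactly what the Bayes decomposition buys.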

Existing SR systems
Dragon NaturallySpeaking
IBM ViaVoice
PHILIPS FreeSpeech 2000
L & H (Lernout & Hauspie) Voice Xpress

A Text-to-Speech Synthesis System

What is a TTS System?


Definition: A system which takes as input a sequence of words and converts them to speech.
Applications: services for the hearing impaired, reading email aloud.
Commercial TTS Systems: Festival, Bell Labs TTS.

Types of speech synthesis

1. Articulatory
Physical models based on a detailed description of the physiology of speech production and on the physics of sound generation in the vocal apparatus. Typical parameters are the position and kinematics of the articulators. The sound radiated at the mouth is then computed according to the equations of physics.

2. Formant
A descriptive acoustic-phonetic approach to synthesis. Speech generation is not performed by solving the equations of physics in the vocal apparatus, but by modeling the main acoustic features of the speech signal.

3. Concatenative
Based on speech signal processing of natural speech databases. The segmental database is built to reflect the major phonological features of a language. For instance, its set of phonemes is described in terms of diphone units, representing the phoneme-to-phoneme junctures. Non-uniform units are also used (diphones, syllables, words, etc.). The synthesiser concatenates (coded) speech segments and performs some signal processing to smooth unit transitions.

Different TTS Systems


Phoneme-Based TTS System
Phonemes are:
The minimal distinctive phonetic units Relatively small in number (39 phonemes in English)

Disadvantage: Phonemes ignore transitional sound !!!

Different TTS Systems (contd)


Diphone-Based TTS System

Diphones are:
Made up of 2 phonemes Incorporate transitional sound Make for better sounding speech

Disadvantage: Over 1500 diphones in the English language !!!
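Converting a phoneme sequence into diphone units (the phoneme-to-phoneme junctures) can be sketched as below; the '#' silence marker and the hyphenated naming are illustrative conventions, not a fixed standard:

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone units, one per juncture.
    '#' marks silence at the utterance boundaries."""
    seq = ["#"] + list(phonemes) + ["#"]
    return [seq[i] + "-" + seq[i + 1] for i in range(len(seq) - 1)]
```

A word of n phonemes yields n+1 diphones, each capturing the transitional sound that single-phoneme units miss.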

Fundamental Components
TTS System
Pipeline: words → Text Pre-Processing → Prosody → Concatenation

Text Pre-Processing

Input
String of characters (sentence)

Output
String of diphone symbols

Objective
Perform sentence level analysis
Punctuation marks Pauses between words

Convert all input to corresponding diphones

Text Pre-Processing (Block Diagram)


Number Converter → Acronym Converter → Word Segmenter → Word-to-Diphone Translator (Phonetization) → MLDS

(The Word-to-Diphone Translator consults a Diphone Dictionary.)

Number Converter

Replace numerals with their textual versions: 100 → one hundred


Handle fractional and decimal numbers: 0.25 → point two five
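Such a converter can be sketched for small integers and decimals as below. A minimal sketch: it covers only 0-999, and it reads the integer part too, so "0.25" comes out as "zero point two five" rather than the bare "point two five" shown above:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(text):
    """Spell out an integer 0-999, or a decimal: digits after the point
    are read one by one, as a TTS front end does."""
    if "." in text:
        whole, frac = text.split(".")
        return (number_to_words(whole) + " point "
                + " ".join(ONES[int(d)] for d in frac))
    n = int(text)
    if n < 20:
        return ONES[n]
    if n < 100:
        t, o = divmod(n, 10)
        return TENS[t] + ("" if o == 0 else " " + ONES[o])
    h, rest = divmod(n, 100)
    return ONES[h] + " hundred" + ("" if rest == 0 else " " + number_to_words(str(rest)))
```

For example, `number_to_words("100")` gives "one hundred", matching the slide.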

Acronym Converter

Replace acronyms with their single-letter components: A.B.C. → ABC


Change abbreviations to full textual format: Mr. → Mister
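Both steps can be sketched with string replacement and a regular expression; the abbreviation list here is a tiny illustrative sample, and a real front end uses a much larger table:

```python
import re

# illustrative sample; a real system's abbreviation table is far larger
ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Dr.": "Doctor"}

def expand(text):
    """Expand known abbreviations, then strip dots from letter
    acronyms so A.B.C. becomes ABC."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # collapse dotted letter sequences like A.B.C. into ABC
    return re.sub(r"\b(?:[A-Z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
```

For example, `expand("Mr. Smith works at A.B.C.")` yields "Mister Smith works at ABC".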

Word Segmenter

Divide sentence into word segments


Special delimiter to separate segments (e.g. ||)

Segments can be:
A single word
An acronym
A numeral

Identify punctuation marks

Word To Diphone Converter (Phonetization)

Purpose
Translate words to their diphone representations

Resource
Dictionary of words and their diphones
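The lookup itself is a dictionary access. A minimal sketch; the entries below are made-up stand-ins, and real entries come from the system's pronunciation lexicon:

```python
# made-up sample entries; '#' marks silence at word boundaries
DIPHONE_DICT = {
    "hello": ["#-h", "h-e", "e-l", "l-o", "o-#"],
    "world": ["#-w", "w-er", "er-l", "l-d", "d-#"],
}

def phonetize(words):
    """Look up each word's diphone representation.
    Unknown words raise KeyError; a real system falls back on
    letter-to-sound rules instead."""
    return [DIPHONE_DICT[w.lower()] for w in words]
```

The resulting per-word diphone lists are what gets stored in the MLDS alongside the prosodic parameters.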

The Multi-Level Data Structure

Contains all necessary data for the next sub-system:


Word
Diphone representation
Prosodic parameters for each diphone (reflecting both word-level and sentence-level prosody)

Allows for modularization

Prosody
Block diagram: MLDS → Diphone Retrieval (consulting the Diphone Database) → Acoustic Manipulation → done? → if yes, Concatenation; if no, retrieve the next diphone.

Diphone Retrieval
Database of recorded diphones; every diphone is matched with a text file:
Distinguished by type (CC, CV, VC, VV)
References to specific components within the waveform

Store the diphone waveform and prosodic parameters in variables.

Acoustic Manipulation - MATLAB

Recognizes wave files (.WAV): load, play, write
Vast array of signal processing tools
Built-in functions
Ease of debugging
GUI-capable

Concatenation
Diphones → Words
Using PSOLA at the joining ends ensures a smooth transition.

Words → Sentence
Straight joining at the end points, due to the presence of pauses.
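PSOLA itself works on pitch periods; as a much simpler stand-in for the smoothing at diphone joins, the concatenation step can be sketched as a linear crossfade at each boundary (the overlap length is an arbitrary illustrative choice):

```python
import numpy as np

def concatenate(units, overlap=64):
    """Join waveform units with a linear crossfade at each boundary.
    A simple stand-in for PSOLA-style smoothing; real systems also
    align pitch periods before mixing."""
    out = units[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for u in units[1:]:
        u = u.astype(float)
        # mix the tail of the output with the head of the next unit
        mixed = out[-overlap:] * (1 - fade) + u[:overlap] * fade
        out = np.concatenate([out[:-overlap], mixed, u[overlap:]])
    return out
```

The crossfade removes the click that a straight sample-level join would produce at each diphone boundary.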


Speech recognition/understanding
Speaker-independent
Spontaneous speech

Speech synthesis
Synthesis by rule
Text-to-speech

Speech coding
Wide/narrow-band
Very-low-bit-rate

Robustness
Noise/distortion

Human-machine interface
Ergonomics
Subjective/objective evaluation

Individuality
Speaker recognition
Speaker adaptation/normalization
Voice conversion

Database

Feature extraction (dynamics)

Speech information processing tree, consisting of present and future speech information processing technologies supported by scientific and technological areas serving as the foundations of speech research.
