
Speech Processing

Topics to discuss:
Fundamental definitions
What is speech?
Phonetics and Phonology
Speech Recognition
Speech Synthesis
Research areas in speech

Fundamental Definitions

Sound waves
A sound is simply a disturbance of air molecules, which radiates outward from its source, in waves of fluctuating air pressure, like ripples from a stone dropped in a pool. The structure of these sound waves distinguishes one sound from another. When sound waves hit our eardrums, nerve cells in the inner ear detect the structure of the vibrations, and they pass this information on to the brain.

Frequency and amplitude of a wave


[Figure: a lower-amplitude, higher-frequency wave]

[Figure: a higher-amplitude, lower-frequency wave]

[Figure: 1 cycle of the wave (trough to trough, or peak to peak)]

Pitch and loudness


The frequency of a wave is heard as its pitch. The amplitude of a wave is heard as its loudness.
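These two relationships can be seen directly in a generated waveform. A minimal sketch using NumPy; the sample rate and tone frequencies are arbitrary illustrative choices:

```python
import numpy as np

def sine_wave(freq_hz, amplitude, duration_s=1.0, sample_rate=16000):
    """Generate a pure tone: freq_hz sets the pitch, amplitude the loudness."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

quiet_high = sine_wave(freq_hz=880, amplitude=0.2)  # higher pitch, quieter
loud_low   = sine_wave(freq_hz=220, amplitude=0.8)  # lower pitch, louder
```

Doubling `freq_hz` raises the perceived pitch by an octave; scaling `amplitude` changes only the loudness, not the pitch.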

Spectrograms
Display of time on the x-axis, frequency on the y-axis, and the higher-amplitude frequency regions shown as darker areas
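A spectrogram of this kind is computed with a short-time Fourier transform. A minimal sketch using SciPy's `signal.spectrogram`; the rising chirp is a made-up test signal, not speech:

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(fs) / fs                       # 1 second of audio
x = np.sin(2 * np.pi * (500 + 1000 * t) * t) # chirp: frequency rises over time

# freqs: y-axis, times: x-axis, Sxx: power per (frequency, time) cell
# (the higher-amplitude cells are the ones drawn darker)
freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)

peak = freqs[Sxx.argmax(axis=0)]             # dominant frequency in each frame
```

For the chirp, `peak` climbs from frame to frame, which is exactly the rising dark band one would see in the plotted spectrogram.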

Spectrogram of 'Are you working late, Nanny?'

[Spectrogram figure with the phonetic transcription aligned below the time axis]

SPEECH
What is it?
Linguistics
Physiology

Acoustics

The Speech Chain (Denes & Pinson)

Speaker → Listener:
Linguistic level → Physiological level → Acoustic level → Physiological level → Linguistic level

Linguistics
Units of language. What are they?
Words? Syllables? Sounds?

What are the individual sounds in language?

Phonemes.

How are they defined?

Consider the words pig, dig and jig: p, d and j distinguish the three words from each other. We can compare all the words in a language and determine those sounds that differentiate one word from another. These sounds constitute the phonemes of a language.

kop versus cap: what distinguishes the two words?
The o and the a. k and c have the same sound here; they both belong to the same phoneme /k/.

cede versus seed: c and s here represent the same phoneme /s/.
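The comparison described above (finding the sounds that differentiate one word from another) can be sketched as a small search. Here spelling stands in for phonetic transcription, which only works for examples like pig/dig/jig where letters map one-to-one onto sounds:

```python
def minimal_pairs(words):
    """Return pairs of same-length words that differ in exactly one position.
    Letters stand in for sounds; a real analysis uses phonetic transcriptions."""
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) == 1:
                pairs.append((w1, w2))
    return pairs
```

Each pair found this way is evidence that the differing sounds are distinct phonemes of the language.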

Physiology
This relates to how the sounds are produced through neural and muscular activity. We set air coming up from the lungs in motion using our vocal cords, and then channel this air through the vocal tract using our tongue, lips, etc. We can classify the different sounds we make according to how we set the air in motion and how we channel the airstream through the vocal tract.

Acoustics
This describes the generation and transmission of the sounds. How air is set in motion. We generate sound waves. What do they look like?

PHONETICS AND PHONOLOGY


Phonetics concerns itself with:
The study of the acoustic detail of speech sounds and how they are articulated.

Phonology concerns itself with:
How these speech sounds are used within languages; it deals with the mechanisms, rules and processes that underlie and govern these units of speech.

Phoneme, Phone, Allophone


Phoneme:

The smallest unit in the sound system of a language that brings in a contrast in meaning (of a word) when replaced by another such unit. It is an abstract mental representation of a sound unit.
For example: the sounds m and n in the English words ram and ran constitute a minimal pair that shows the semantic contrast between the two distinct phonemes /m/ and /n/. Their distribution is contrastive: one phoneme can occur where another does.

ALLOPHONE
A contextual variant of a single phoneme, in a particular phonetic environment. Allophones do not involve a semantic contrast; their distribution is mutually exclusive: one allophone cannot occur where another can. They are predicted/governed by phonological rules.
For example: the p sounds in the English words pin and spin are acoustically different. The [p] in pin is produced with a breath of air following it (aspirated), whereas the [p] in spin is not.

Vowels
Vowels are sounds produced with no obstruction to the airstream as it passes through the vocal tract. Three main organs of speech are involved in changing the size of the air chamber:
the lips - rounding, spreading
the lower jaw - lowered, raised
the tongue - raised, flattened, brought forward, etc.

Consonants
Consonants are articulated by restricting the airflow at some part of the vocal tract. The consonant that is produced is determined by three factors; place, manner and voice.

Consonants are characterized by three features:
1) Place of articulation - Bilabial, Dental, Alveolar, Palatal, Velar
2) Manner of articulation - Stop (Plosive), Nasal, Trill, Tap (Flap), Fricative, Affricate, Lateral approximant
3) Voicing - voiced/voiceless
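This three-way classification can be encoded as a small lookup table. A minimal sketch; the entries cover only a handful of English consonants:

```python
# (place, manner, voicing) for a few English consonants
CONSONANTS = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "m": ("bilabial", "nasal", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "d": ("alveolar", "plosive", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
    "k": ("velar", "plosive", "voiceless"),
    "g": ("velar", "plosive", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
}

def differ_only_in_voicing(c1, c2):
    """True when two consonants share place and manner but differ in voicing,
    e.g. the p/b, t/d and k/g pairs from the minimal-pair examples."""
    p1, m1, v1 = CONSONANTS[c1]
    p2, m2, v2 = CONSONANTS[c2]
    return p1 == p2 and m1 == m2 and v1 != v2
```

The voiceless/voiced pairs p/b, t/d and k/g differ only in the third feature, which is what makes pit/bit, tip/dip and cot/got minimal pairs.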

Places of articulation
Bilabial
Bilabial sounds are those sounds made by the articulation of the lips against each other. Examples of such sounds in English are: [b], [p], [m]

Dental
Dental sounds are those sounds made by the articulation of the tip of the tongue towards the back of the teeth. Such sounds are not present in Standard American English, but in some Chicano English dialects and certain Brooklyn dialects, the sounds [t] and [d] are pronounced with a dental articulation.

Alveolar
Alveolar sounds are those sounds made by the articulation of the tip of the tongue towards the alveolar ridge, the bony ridge behind the teeth. Examples of such sounds in English are: [n], [l]

Manner of articulation
Plosive/Stop
Plosive sounds are made by forming a complete obstruction to the flow of air through the mouth and nose; the explosion of air on release causes a sharp noise.
Voiceless - p, t, k; Voiced - b, d, g
Minimal pairs: pit/bit, tip/dip, cot/got

Fricative
A fricative is a type of consonant that is formed by forcing air through a narrow gap so that a hissing sound is created. Air is forced between the tongue and the place of articulation for the particular sound.
f (as in far), sh (as in shut)

Syllable
A syllable is a structural unit of sound that constitutes a sequence of consonants and vowels. It is hierarchically composed of three parts:

Onset - initial consonant or consonant cluster
Nucleus - the vowel
Coda - final consonant or consonant cluster

Syllable tree: syllable → onset + rime; rime → nucleus + coda
Example: onset = str, nucleus = eh, coda = nx ths
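The onset/nucleus/coda split can be sketched as a tiny parser. A rough sketch for single, spelled syllables; letters stand in for phonemes and 'aeiou' for the vowel sounds:

```python
VOWELS = set("aeiou")

def syllable_parts(syllable):
    """Split a single spelled syllable into (onset, nucleus, coda):
    leading consonants, the vowel run, and the trailing consonants."""
    s = syllable.lower()
    first = next((i for i, ch in enumerate(s) if ch in VOWELS), len(s))
    last = first
    while last < len(s) and s[last] in VOWELS:
        last += 1
    return s[:first], s[first:last], s[last:]
```

Applied to "strengths" this yields onset "str", nucleus "e" and coda "ngths", matching the tree above.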

Need for Parameterization


Time domain waveform - redundant information
Waveform domain - too much data to handle
Data reduction by parameterization
Parameter choice is application dependent
Applications: speech coding, speech synthesis, speech recognition, speaker verification, etc.

Types of Parameterization
Linear Predictive Coding (LPC)
* Best for Speech Synthesis

Mel Frequency Cepstral Coefficients (MFCC)
* Best for Speech Recognition
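The LPC analysis above can be sketched with the classic autocorrelation method and Levinson-Durbin recursion. A minimal NumPy sketch; production systems additionally apply pre-emphasis, windowing and frame-by-frame processing:

```python
import numpy as np

def lpc(frame, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] of A(z); the predictor is
    x_hat[n] = -sum_{k=1..p} a[k] * x[n-k]."""
    n = len(frame)
    # autocorrelation lags 0..order
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err               # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)         # residual prediction error
    return a
```

Fitting an order-2 model to a signal generated by a known second-order recursion recovers (approximately) the generating coefficients, which is why a few LPC coefficients per frame suffice as a compact parameterization.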

Acoustic Model
Goal:

Given acoustic data A = a1, a2, ..., ak
Find word sequence W = w1, w2, ..., wn
Such that P(W | A) is maximized

Bayes Rule:

P(W | A) = P(A | W) P(W) / P(A)

where P(A | W) is the acoustic model (HMMs), P(W) is the language model, and P(A) is a constant for a complete sentence.
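Since P(A) is constant for a given utterance, recognition reduces to maximizing P(A | W) P(W). A toy sketch of that argmax; the probabilities below are made-up numbers, not from any trained model:

```python
# P(A | W): how well each candidate word explains the acoustics (made up)
acoustic = {"dig": 0.30, "pig": 0.25, "jig": 0.05}
# P(W): prior probability of each word from a language model (made up)
language = {"dig": 0.20, "pig": 0.70, "jig": 0.10}

# choose the word maximizing the product P(A | W) * P(W)
best = max(acoustic, key=lambda w: acoustic[w] * language[w])
```

Note that "dig" scores highest acoustically, yet the language model prior tips the decision to "pig"; this interplay is exactly what the Bayes decomposition buys.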

Existing SR systems
Dragon NaturallySpeaking
IBM ViaVoice
PHILIPS FreeSpeech 2000
L & H (Lernout & Hauspie) Voice Xpress

A Text-to-Speech Synthesis System

What is a TTS System?


Definition: A system which takes as input a sequence of words and converts them to speech.
Applications: services for the hearing impaired, reading email aloud.
Commercial TTS Systems: Festival, Bell Labs TTS.

Types of speech synthesis

1. Articulatory
Physical models based on a detailed description of the physiology of speech production and on the physics of sound generation in the vocal apparatus. Typical parameters are the position and kinematics of the articulators. The sound radiated at the mouth is then computed according to the equations of physics.

2. Formant
A descriptive acoustic-phonetic approach to synthesis. Speech generation is not performed by solving the equations of physics in the vocal apparatus, but by modeling the main acoustic features of the speech signal.

3. Concatenative
Based on speech signal processing of natural speech databases. The segmental database is built to reflect the major phonological features of a language. For instance, its set of phonemes is described in terms of diphone units, representing the phoneme-to-phoneme junctures. Non-uniform units are also used (diphones, syllables, words, etc.). The synthesiser concatenates (coded) speech segments and performs some signal processing to smooth unit transitions.

Different TTS Systems


Phoneme-Based TTS System
Phonemes are:
The minimal distinctive phonetic units Relatively small in number (39 phonemes in English)

Disadvantage: Phonemes ignore transitional sound !!!

Different TTS Systems (contd)


Diphone-Based TTS System

Diphones are:
Made up of 2 phonemes Incorporate transitional sound Make for better sounding speech

Disadvantage: Over 1500 diphones in the English language !!!
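Converting a phoneme sequence into diphone units (the phoneme-to-phoneme junctures) can be sketched as below; the '#' silence marker and the hyphenated naming are illustrative conventions, not a fixed standard:

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone units, one per juncture.
    '#' marks silence at the utterance boundaries."""
    seq = ["#"] + list(phonemes) + ["#"]
    return [seq[i] + "-" + seq[i + 1] for i in range(len(seq) - 1)]
```

A word of n phonemes yields n+1 diphones, each capturing the transitional sound that single-phoneme units miss.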

Fundamental Components
TTS System
Pipeline: words → Text Pre-Processing → Prosody → Concatenation

Text Pre-Processing

Input
String of characters (sentence)

Output
String of diphone symbols

Objective
Perform sentence level analysis
Punctuation marks Pauses between words

Convert all input to corresponding diphones

Text Pre-Processing (Block Diagram)


Number Converter → Acronym Converter → Word Segmenter → Word-to-Diphone Translator (Phonetization) → MLDS

(The Word-to-Diphone Translator consults a Diphone Dictionary.)

Number Converter

Replace numerals with their textual versions: 100 → one hundred


Handle fractional and decimal numbers: 0.25 → point two five
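Such a converter can be sketched for small integers and decimals as below. A minimal sketch: it covers only 0-999, and it reads the integer part too, so "0.25" comes out as "zero point two five" rather than the bare "point two five" shown above:

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(text):
    """Spell out an integer 0-999, or a decimal: digits after the point
    are read one by one, as a TTS front end does."""
    if "." in text:
        whole, frac = text.split(".")
        return (number_to_words(whole) + " point "
                + " ".join(ONES[int(d)] for d in frac))
    n = int(text)
    if n < 20:
        return ONES[n]
    if n < 100:
        t, o = divmod(n, 10)
        return TENS[t] + ("" if o == 0 else " " + ONES[o])
    h, rest = divmod(n, 100)
    return ONES[h] + " hundred" + ("" if rest == 0 else " " + number_to_words(str(rest)))
```

For example, `number_to_words("100")` gives "one hundred", matching the slide.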

Acronym Converter

Replace acronyms with their single-letter components: A.B.C. → ABC


Change abbreviations to full textual format: Mr. → Mister
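Both steps can be sketched with string replacement and a regular expression; the abbreviation list here is a tiny illustrative sample, and a real front end uses a much larger table:

```python
import re

# illustrative sample; a real system's abbreviation table is far larger
ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Dr.": "Doctor"}

def expand(text):
    """Expand known abbreviations, then strip dots from letter
    acronyms so A.B.C. becomes ABC."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # collapse dotted letter sequences like A.B.C. into ABC
    return re.sub(r"\b(?:[A-Z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
```

For example, `expand("Mr. Smith works at A.B.C.")` yields "Mister Smith works at ABC".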

Word Segmenter

Divide sentence into word segments


Special delimiter to separate segments (e.g. ||)

Segments can be:
A single word
An acronym
A numeral

Identify punctuation marks

Word To Diphone Converter (Phonetization)

Purpose
Translate words to their diphone representations

Resource
Dictionary of words and their diphones
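The lookup itself is a dictionary access. A minimal sketch; the entries below are made-up stand-ins, and real entries come from the system's pronunciation lexicon:

```python
# made-up sample entries; '#' marks silence at word boundaries
DIPHONE_DICT = {
    "hello": ["#-h", "h-e", "e-l", "l-o", "o-#"],
    "world": ["#-w", "w-er", "er-l", "l-d", "d-#"],
}

def phonetize(words):
    """Look up each word's diphone representation.
    Unknown words raise KeyError; a real system falls back on
    letter-to-sound rules instead."""
    return [DIPHONE_DICT[w.lower()] for w in words]
```

The resulting per-word diphone lists are what gets stored in the MLDS alongside the prosodic parameters.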

The Multi-Level Data Structure

Contains all necessary data for the next sub-system:


Word
Diphone representation
Prosodic parameters for each diphone (reflecting both word-level and sentence-level prosody)

Allows for modularization

Prosody
Block diagram: MLDS → Diphone Retrieval (consulting the Diphone Database) → Acoustic Manipulation → done? → if yes, Concatenation; if no, retrieve the next diphone.

Diphone Retrieval
Database of recorded diphones; every diphone is matched with a text file:
Distinguished by type (CC, CV, VC, VV)
References to specific components within the waveform

Store the diphone waveform and prosodic parameters in variables.

Acoustic Manipulation - MATLAB

Recognizes wave files (.WAV): load, play, write
Vast array of signal processing tools
Built-in functions
Ease of debugging
GUI-capable

Concatenation
Diphones → Words
Using PSOLA at the joining ends ensures a smooth transition.

Words → Sentence
Straight joining at the end points, due to the presence of pauses.
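PSOLA itself works on pitch periods; as a much simpler stand-in for the smoothing at diphone joins, the concatenation step can be sketched as a linear crossfade at each boundary (the overlap length is an arbitrary illustrative choice):

```python
import numpy as np

def concatenate(units, overlap=64):
    """Join waveform units with a linear crossfade at each boundary.
    A simple stand-in for PSOLA-style smoothing; real systems also
    align pitch periods before mixing."""
    out = units[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for u in units[1:]:
        u = u.astype(float)
        # mix the tail of the output with the head of the next unit
        mixed = out[-overlap:] * (1 - fade) + u[:overlap] * fade
        out = np.concatenate([out[:-overlap], mixed, u[overlap:]])
    return out
```

The crossfade removes the click that a straight sample-level join would produce at each diphone boundary.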


Speech recognition/understanding
Speaker-independent
Spontaneous speech

Speech synthesis
Synthesis by rule
Text-to-speech

Speech coding
Wide/narrow-band
Very-low-bit-rate

Robustness
Noise/distortion

Human-machine interface
Ergonomics
Subjective/objective evaluation

Individuality
Speaker recognition
Speaker adaptation/normalization
Voice conversion

Database

Feature extraction (dynamics)

Speech information processing tree, consisting of present and future speech information processing technologies supported by scientific and technological areas serving as the foundations of speech research.
