
An Introduction to Speech Annotation


L.R. PREM KUMAR

Senior Research Assistant

Linguistic Data Consortium for Indian Languages Central Institute of Indian Languages, Mysore

Overview

- What is a Corpus?
- Guidelines for Annotation
- Speech Corpus and Types
- Why do we need speech corpus?
- Use of Speech Corpus
- Using Speech Corpus in NLP Application
- Recording of the data
- Storing LDC-IL Data in NIST Format
- The NIST Format
- Utility of Annotation
- How to Annotate Speech Corpus using Praat?
- LDC-IL Speech Corpora

SRM University 29-Jan-12 Copyright 2008 © LDC-IL, CIIL 2

What is a corpus?


'Corpus' means 'body' in Latin, and literally refers to the biological structures that constitute humans and other animals (Wikipedia). A corpus is a collection of spoken language stored on computer and used for language research and for writing dictionaries (Macmillan Dictionary 2002). It is a collection of written or spoken texts (Oxford Dictionary 2005). In other words, a corpus is a collection of linguistic data, either compiled as written texts or as transcriptions of recorded speech.


Speech Corpus


A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions in a format that can be used to create acoustic models (which can then be used with a speech recognition engine).

There are two types of speech corpora: read speech and spontaneous speech.

1. Read speech includes:

- Book excerpts
- Broadcast news
- Lists of words
- Sequences of numbers

2. Spontaneous speech includes:

- Dialogs: between two or more people (includes meetings)
- Narratives: a person telling a story
- Map tasks: one person explains a route on a map to another


Why do we need speech corpus?


To develop tools that facilitate the collection of high-quality speech data. To collect data that can be used for building speech recognition and speech synthesis, and to provide speech-to-speech translation from one language to another language spoken in India (including Indian English).


Use of Speech Corpus


- Speech recognition and speech synthesis
- Speech-to-speech translation for a pair of Indian languages
- Health care (medical transcription)
- Real-time voice recognition
- Multimodal interfaces to the computer in Indian languages
- E-mail readers over the telephone
- Readers for the visually disadvantaged
- Automatic translation, etc.


Using Speech Corpus in NLP Application


Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Speech synthesis is the artificial production of human speech; a computer system used for this purpose is called a speech synthesizer. It is what enables text-to-speech applications.

LDCIL Speech Corpora

Speech Dataset Collection


- Phonetically Balanced Vocabulary – 800
- Phonetically Balanced Sentences – 500
- Connected text created using the phonetically balanced vocabulary – 6
- Date formats – 2
- Command and control words – 250
- Proper nouns (400 place names and 400 person names) – 824
- Most frequent words – 1000
- Form and function words – 200
- News domain (news, editorial, essay; each text not less than 500 words) – 150


Number of Speakers


Data will be collected from a minimum of 450 speakers (225 male and 225 female) of each language. In addition, natural conversation data from various domains shall also be collected for Indian languages for research into spoken language.


Speech Corpora (Segmented & Warehoused)

S.No  Language    Speakers  Hours
1     Assamese    456       105:51:38
2     Bengali     472       138:18:47
3     Bodo        433       201:10:48
4     Dogri       154       111:32:11
5     Gujarati    450       156:23:04
6     Hindi       450       163:25:47
7     Kannada     492       143:28:54
8     Kashmiri    150       44:59:07
9     Konkani     455       195:14:47
10    Maithili    156       43:33:42
11    Malayalam   314       105:47:05
12    Manipuri    457       107:10:27
13    Marathi     306       168:13:50
14    Nepali      485       145:04:46
15    Oriya       462       165:30:05
16    Punjabi     468       110:48:26
17    Tamil       453       213:37:27
18    Telugu      156       50:51:36
19    Urdu        480       124:19:58

Speech Segmentation


Segmentation of data:

Collected speech data is in a continuous form and hence has to be segmented as per the various content types, i.e., text, sentences, words.

Segmentation tools:

WaveSurfer is the tool used for the segmentation of speech data.

Warehousing:

After segmenting the data according to the various content types, it has to be warehoused properly. The data has to be warehoused for each content type, using the metadata information.


Meta Data


- Investigator name
- Language
- Data sheet script
- Age group
- Sound file's format
- State
- District
- Mother tongue
- Place of elementary education
- Recording date
- Duration/length of recorded item (hh.mm.ss)
- Speaker's ID
- Dialect (region)
- Speaker's gender
- Recording environment
- Place
- Educational qualification
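The "warehoused for each content type, using the metadata" rule above can be sketched as deriving a storage path from a speaker's metadata record. The directory layout, field names, and ID scheme below are hypothetical illustrations, not the actual LDC-IL convention.

```python
# Hypothetical sketch: build a warehouse path from a metadata record.
# The layout (language/content_type/...) and every field name here are
# illustrative assumptions, not LDC-IL's real scheme.

def warehouse_path(meta: dict) -> str:
    """Return a relative path of the form language/content_type/speaker-file."""
    return "{language}/{content_type}/{speaker_id}_{gender}_{age_group}.wav".format(**meta)

meta = {
    "language": "Tamil",
    "content_type": "read_sentences",
    "speaker_id": "TAM0042",
    "gender": "F",
    "age_group": "16-25",
}
print(warehouse_path(meta))  # Tamil/read_sentences/TAM0042_F_16-25.wav
```

Encoding the metadata into the path like this keeps each content type in its own directory, which is all the warehousing step above requires.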


Speech Annotation


Annotation of data:

Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels. Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase levels.

Annotation tools:

Tools will be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.

Speech Segmentation


Speech data is in a continuous form and hence has to be segmented at sentence level using the WaveSurfer tool. Open the file in WaveSurfer: select the waveform and open the file. Each sentence should be segmented, but the duration of a segment should be no longer than 30 seconds. If a sentence is longer than 30 seconds, it should be split at the nearest pause before a full stop. The selection should then be saved in the required folder.
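The 30-second rule above can be sketched as a small pause-aware splitter. This is an illustrative sketch, not an LDC-IL tool: it assumes the audio has already been reduced to one RMS energy value per 10 ms frame, and the silence threshold is an arbitrary placeholder.

```python
# Sketch of the sentence-level segmentation rule: cut a stream of audio
# frames at pauses, never letting a segment exceed 30 seconds. Operates
# on per-frame RMS energies (one value per 10 ms frame, assumed); the
# threshold is illustrative, not LDC-IL's actual setting.

FRAME_MS = 10          # duration of one energy frame (assumed)
MAX_SEG_MS = 30_000    # no segment may exceed 30 seconds
SILENCE_LEVEL = 0.01   # energies below this count as a pause (assumed)

def split_points(energies, frame_ms=FRAME_MS):
    """Return frame indices at which to cut; when a segment would run
    past MAX_SEG_MS, the cut is placed at the most recent pause seen."""
    cuts, seg_start, last_pause = [], 0, None
    for i, e in enumerate(energies):
        if e < SILENCE_LEVEL:
            last_pause = i
        if (i - seg_start + 1) * frame_ms >= MAX_SEG_MS:
            cut = last_pause if last_pause is not None else i
            cuts.append(cut)
            seg_start, last_pause = cut, None
    return cuts
```

For a 40-second stream with one pause around the 25-second mark, the single cut lands at that pause rather than at the hard 30-second limit, mirroring the "nearest pause" rule.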

How to Annotate Speech Corpus using Praat?

What is speech annotation?


Annotation is about assigning different tags, such as background noise, background speech, vocal noise, echo, etc., to the segmented speech files. While annotating the files we should also keep in mind that the text should correspond to the speech. The term linguistic annotation covers any descriptive or analytic notation applied to raw language data. The added notations may include information of various kinds: multi-tier transcription of speech in terms of units such as acoustic-phonetic features, syllables, words, etc.; syntactic and semantic analysis; paralinguistic information (stress, speaking rate); and non-linguistic information (speaker's gender, age, voice quality, emotions, dialect; room acoustics, additive noise, channel effects).


Formation of LDCIL Guideline


There are various tools available for speech segmentation and annotation, such as CSL, EMU, Transcriber, Praat, etc. We are using the Praat software for the annotation of our speech data. Praat is a product of the Phonetic Sciences department of the University of Amsterdam [4] and is hence oriented toward acoustic-phonetic studies by phoneticians. It has multiple functionalities, including speech analysis/synthesis and manipulation, labeling and segmentation, and listening experiments. The guidelines for annotation of our data are adapted from CSLU, OGI, Mississippi University, the Switchboard guidelines, and LDC, UPenn.


Guidelines for Annotation


Open the stereo file in Praat and then create a text file. Open both files in Praat and then select the correct text that corresponds to the speech file. The data should be annotated as per the pronunciation: if a word is pronounced wrongly, it should be transcribed as it was pronounced. The following should be marked while annotating the text:
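The "create a text file" step produces a Praat TextGrid. As a sketch of what that annotation file contains, the helper below writes a minimal one-tier TextGrid from a list of labeled intervals. Praat normally creates and saves this file itself; the tier name "sentence" is an illustrative choice.

```python
# Minimal sketch of a Praat TextGrid (long text format) with a single
# interval tier. Times are in seconds; intervals must cover [0, xmax].

def make_textgrid(xmax, intervals, tier="sentence"):
    """intervals: list of (start, end, label) tuples covering [0, xmax]."""
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier}"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for n, (a, b, text) in enumerate(intervals, 1):
        lines += [
            f"        intervals [{n}]:",
            f"            xmin = {a}",
            f"            xmax = {b}",
            f'            text = "{text}"',
        ]
    return "\n".join(lines) + "\n"

tg = make_textgrid(2.5, [(0, 1.2, ".bn sil2"), (1.2, 2.5, "hello")])
```

Saving this string with a .TextGrid extension yields a file Praat can open alongside the audio, which is the pairing the guideline describes.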


1. Non Speech sounds should be marked


Human noise:

(a) Background speech (.bs)
(b) Vocal noise (.vn)

Non-human noise:

(c) Background noise (.bn)

1. Non Speech sounds should be marked

[Example screenshot: regions labeled Human Speech and Non-Human Speech]

2. Three different types of silences need to be marked


Annotation of silences: Any silence shorter than 50 ms NEED NOT be marked.

a) Short silence (possibly intra-word) (sil1): silences of around 50-150 ms

b) Medium silence (possibly inter-word) (sil2): silences of 150-300 ms

c) Long silence (possibly inter-phrase) (sil3): silences greater than 300 ms
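The three silence classes can be captured in a small helper. The millisecond boundaries follow the guideline above; treating the 50-150 ms and 150-300 ms ranges as inclusive of their upper bound is an assumption, since the guideline leaves the exact boundary values ambiguous.

```python
# Map a silence duration (in milliseconds) to its annotation tag,
# following the three-way classification in the guideline. Boundary
# inclusivity is an assumption.

def silence_tag(duration_ms):
    if duration_ms < 50:
        return None      # shorter than 50 ms: need not be marked
    if duration_ms <= 150:
        return "sil1"    # short, possibly intra-word
    if duration_ms <= 300:
        return "sil2"    # medium, possibly inter-word
    return "sil3"        # long, possibly inter-phrase

assert silence_tag(40) is None
assert silence_tag(100) == "sil1"
assert silence_tag(200) == "sil2"
assert silence_tag(450) == "sil3"
```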

2. Three different types of silences need to be marked

[Example screenshot]

3. Echo needs to be marked


A sound that is heard after it has been reflected off a surface such as a wall.

Annotation for echo: mark .ec at the beginning of the annotation.

3. Echo needs to be marked

[Example screenshot]

4. Multi-speaker data needs to be annotated


If a new speaker speaks at the foreground level, mark this as <text that is spoken>. A new annotation mark is to be defined for this (.sc).


5. Cut off speech and intended speech need to be marked


[mini]*ster - means that the speaker intended to speak minister but spoke mini in an unclear fashion and ster clearly.

*ster means that the speaker intended to speak minister but spoke ster.
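A sketch of how the cut-off notation above could be read mechanically. The grammar is inferred from the two examples ("[mini]*ster" and "*ster") and this parser is illustrative, not an official LDC-IL tool.

```python
import re

# Reader for the cut-off notation: "[mini]*ster" marks an unclear part
# in brackets and the clearly spoken remainder after "*"; "*ster" marks
# that only "ster" was spoken. Grammar inferred from the two examples.

CUTOFF = re.compile(r"^(?:\[(?P<unclear>[^\]]*)\])?\*(?P<clear>\S+)$")

def parse_cutoff(token):
    """Return (unclear_part_or_None, clearly_spoken_part), or None if
    the token does not use the cut-off notation."""
    m = CUTOFF.match(token)
    if not m:
        return None
    return (m.group("unclear"), m.group("clear"))

assert parse_cutoff("[mini]*ster") == ("mini", "ster")
assert parse_cutoff("*ster") == (None, "ster")
assert parse_cutoff("minister") is None
```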

5. Cut off speech and intended speech need to be marked

[Example screenshot]

6. Language Change


Language change, such as code mixing and code switching, needs to be marked as follows: [.lc-english <text>]
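The language-change tag can likewise be extracted mechanically. The pattern below is inferred from the single [.lc-english <text>] example, so it is an assumption that every switch is marked with a lowercase language name in exactly this shape.

```python
import re

# Extract code-switched stretches marked as "[.lc-<language> <text>]",
# e.g. "[.lc-english good morning]". Tag grammar inferred from the one
# example in the guideline; illustrative only.

LC = re.compile(r"\[\.lc-(?P<lang>[a-z]+)\s+(?P<text>[^\]]*)\]")

def language_switches(annotation):
    """Return a list of (language, text) pairs, one per marked switch."""
    return [(m.group("lang"), m.group("text")) for m in LC.finditer(annotation)]

s = "avan [.lc-english good morning] sonnan"
assert language_switches(s) == [("english", "good morning")]
```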

6. Language Change

[Example screenshot]

7. Annotation of speech disfluency


Only restarts/false starts need to be marked. For example, the speaker intends to say "bengaluru" but says "be bengaluru". Mark this as be-bengaluru.

7. Annotation of speech disfluency

[Example screenshot]

8. Number


Spell out all number sequences except in cases such as "123" or "101" where the numbers have a specific meaning. Transcribe years like 1983 as spoken: "nineteen eighty three". Do not use hyphens ("twenty eight", not "twenty-eight").

8. Number

[Example screenshot]

9. Mispronunciations


If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken. Utterances should be no longer than 30 seconds, so the annotator should find a long silence of around 500 ms and split the sentence appropriately. Keep a separate folder for the noisy data; for the time being it was suggested not to annotate it. The SNR measuring tool will give the percentage of the data which has to be annotated.

9. Mispronunciations

Original text:

தமிழ்நாட் த்தமிழர்களிடம்மட் மல்ல;உலகம் ம்உள்ளதமிழர்களிடம்மட் மல்ல;உலெகங்கும்உள்ளமனி தஉாிைம, மனிதேநயம்இவற்றின்மீ அக்கைறெகாண் ள்ளமக்களின்மத்தியி ம்இந்தக்ேகள்விஎ ந் நிற்கிற .

Annotated transcription:

.ec sil2 தமிழ்நாட் த்தமிழர்களில்மட் மல்லsil2 உலகம் ம்உள்ளதமிழர்களில்மட் மல்லஉலகெமங்கும்உள்ளமனிதஉாிைமமனிதேநயம்sil1 இவற்றின்கீழ் .vn sil2 அக்கைறெகாண் ள்ளமக்களின்மத்தியி ம்இந்தேகள்விஎ ந் நிற்கிற sil2 .vn

Some more points to be taken into account

Vocal noise followed by a silence: .vn silx (if the silence is more than 50 ms)

Vocal noise, silence, vocal noise, and then silence again (or vocal noise longer than 50 ms): .vn silx .vn silx

Please mark the silence in the case of background speech too: .bs silx

If background noise is followed by a silence, or if .bn is more than 50 ms: .bn silx

If there is background noise in any particular position, mark it within square brackets: [.bn … ]
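The rules above can be sketched as a small decision helper. The function and its interface are our own illustration; only the markers (.vn, .bs, .bn, silx) and the 50 ms threshold come from the guidelines.

```python
# A minimal sketch of the silence-marking rules. The event names and the
# function itself are illustrative assumptions; the markers and the 50 ms
# threshold come from the annotation guidelines.

SILENCE_THRESHOLD_MS = 50

MARKERS = {
    "vocal": ".vn",              # vocal noise
    "background_speech": ".bs",  # background speech
    "background_noise": ".bn",   # background noise
}

def mark(event: str, silence_ms: float) -> str:
    """Return the annotation for a noise event followed by a silence."""
    marker = MARKERS[event]
    if silence_ms > SILENCE_THRESHOLD_MS:
        return f"{marker} silx"  # silence is long enough to be marked
    return marker                # silence too short: mark the noise only

print(mark("vocal", 120))             # .vn silx
print(mark("background_noise", 30))   # .bn
```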

Recording of the data

Data should be recorded in stereo format. The wave files should be preserved in four different forms:

Left channel

Right channel

Converted to mono

Original stereo

NIST files are created for all the above wave files (NIST is the format in which the files are saved). For example, if a single stereo file is named S1_0001.wav, it will be stored as:

left microphone: S1_0001_left.nist

right microphone: S1_0001_right.nist

converted mono: S1_0001_mono.nist

original: S1_0001_stereo.nist
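The channel separation described above can be sketched with the Python standard library. This is an illustrative sketch, not an LDC-IL tool: it writes plain .wav output (the actual pipeline wraps the audio in NIST headers), assumes 16-bit stereo input, and the function name is our own.

```python
# Sketch: derive left-channel, right-channel and mono files from a 16-bit
# stereo wav, using only the Python standard library. Output is .wav here;
# the LDC-IL pipeline stores these with NIST headers instead.

import array
import wave

def split_stereo(path: str, stem: str) -> None:
    """Write <stem>_left.wav, <stem>_right.wav and <stem>_mono.wav."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        framerate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    left, right = samples[0::2], samples[1::2]   # interleaved L/R samples
    mono = array.array("h", [(l + r) // 2 for l, r in zip(left, right)])
    for name, data in (("left", left), ("right", right), ("mono", mono)):
        with wave.open(f"{stem}_{name}.wav", "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(framerate)
            out.writeframes(data.tobytes())
```

Called as `split_stereo("S1_0001.wav", "S1_0001")`, it produces the three derived files alongside the original stereo recording.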

Storing LDC-IL data in NIST format

LDC-IL has collected read speech consisting of sentences and words. The details are as follows:

1. Databases of 19 Indian languages have been collected.

2. A minimum of 450 speakers was used to collect the database for each language.

3. The environment of the recording is taken into consideration.

4. All recordings are done in stereo.

5. The age group of the speakers is recorded.

6. The sampling rate is either 44100 Hz or 48000 Hz, at 16 bits.

All the above information must go into the header of the NIST file. Items 4 and 6 are generated automatically by the PRAAT software. Labels have to be given for items 1, 2 and 5.

The NIST Format

As decided earlier, all recordings must be converted to the NIST format. Each line in the header is a triplet that gives various kinds of information about the waveform, namely the database name, speaker information, number of channels, environment, etc. E.g.:

NIST_1A
1024
database_id -s13 CIIL_PUN_READ
database_version -s3 1.0
recording_environment -s3 HOM
microphone -s5 INBDR
utterance_id -s8 sent_001
speaker_id -s9 STND_fad004
age -i 25
rec_type -s6 stereo
channel_count -i 1
sample_count -i 601083
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_coding -s3 pcm
sample_rate -i 44100
sample_min -i -32768
sample_max -i 32767
end_head
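A header like the one above can be generated with a short helper. This is a sketch of the triplet convention only, not an official tool: the field values are taken from the slide's example, and padding the header to its declared 1024-byte size with spaces is our assumption.

```python
# Sketch: render a NIST_1A header from (name, value) pairs. Integers get the
# "-i" type tag, strings get "-sN" with N the string length. Padding the
# header to the declared 1024 bytes with spaces is an assumption here.

def nist_header(fields: dict) -> bytes:
    """Render a 1024-byte NIST_1A header from field name/value pairs."""
    lines = ["NIST_1A", "1024"]
    for name, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{name} -i {value}")
        else:
            lines.append(f"{name} -s{len(value)} {value}")
    lines.append("end_head")
    header = "\n".join(lines) + "\n"
    return header.encode("ascii").ljust(1024)  # pad to the declared size

header = nist_header({
    "database_id": "CIIL_PUN_READ",
    "database_version": "1.0",
    "recording_environment": "HOM",
    "microphone": "INBDR",
    "utterance_id": "sent_001",
    "speaker_id": "STND_fad004",
    "age": 25,
    "sample_rate": 44100,
})
print(len(header))  # 1024
```

Note how the string lengths reproduce the slide's tags automatically, e.g. `CIIL_PUN_READ` is 13 characters, giving `database_id -s13 CIIL_PUN_READ`.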

1. Database ID

The following was decided for the database id. The database id will be a string of length 8 characters. The first 4 letters will correspond to the organization that collects the database. This will be followed by an underscore and the language id, which will consist of three characters.

For example, the Tamil database collected at CIIL will be given the following name:

CIIL_TAM

This will be included in the header using the following tag:

database_id -s13 CIIL_TAM_READ ; tag for read Tamil speech

Database version:

This will be included in the header using the following tag:

database_version -s3 1.0

2. Recording environment

The recording environment could be one of the following:

1. home (HOM)

2. public places (PUB)

3. office (OFF)

4. telephone (TEL)

E.g.: data recorded in a home should have the following entry in the NIST header:

recording_environment -s3 HOM

3. Microphones

For the collection of LDC-IL speech data, we have used the in-built digital recorder microphone (stereo) (INBDR). However, the following other types of microphones can also be used:

1. external low quality (LOWQ)

2. external high quality (noise cancelling) (HIGQ)

3. in-built cell phone (INBCP)

4. in-built landline (INBLL)

5. throat microphone (THROT)

6. bone microphone (BONE)

Examples:

Data recorded using a digital recorder with in-built microphone(s)

should have the following entry in the NIST header:

microphone -s5 INBDR

4. Utterance ID

Each utterance may be identified by type and number: <4 characters for type>_<3 digit utterance number>

The entry in the header would be:

utterance_id -s8 <word|phrs|sent|uttr>_<3 digit utterance number>

For example, the entry for the 5th word in the database will be:

utterance_id -s8 word_005

5. Speaker ID

Each speaker may be identified using 9 characters:

<4 characters for the region>_<m|f><4 characters to identify the speaker>

The entry for each speaker will be:

speaker_id -s9 <4 characters to identify region>_<m|f><4 character speakerid>

A female speaker from South Karnataka with speaker id ab0a (a 4-character lower-case alphanumeric string):

speaker_id -s9 STND_fab0a
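The utterance-id and speaker-id templates from the two sections above can be checked with a pair of regular expressions. These patterns are our reading of the templates (uppercase 4-character region code, m|f gender character, 4-character lower-case alphanumeric id), not an official LDC-IL validator.

```python
# Sketch: validate ids against the templates described above. The regular
# expressions encode our interpretation of the slide's patterns.

import re

# <word|phrs|sent|uttr>_<3 digit utterance number>, e.g. word_005
UTTERANCE_ID = re.compile(r"^(word|phrs|sent|uttr)_\d{3}$")

# <4-char region>_<m|f><4-char lower-case alphanumeric id>, e.g. STND_fab0a
SPEAKER_ID = re.compile(r"^[A-Z]{4}_[mf][a-z0-9]{4}$")

print(bool(UTTERANCE_ID.match("word_005")))   # True
print(bool(SPEAKER_ID.match("STND_fab0a")))   # True
```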

6. Age group of the speaker

The speaker should be more than 16 years old and not more than 60 years old.

Utility of Annotation

Annotated speech data is the raw material for the development of speech recognition and speech synthesis systems. An acoustic-phonetic study of the speech sounds of a language is essential for determining the parameters of speech synthesis systems following an articulatory or parametric approach.

LDCIL Tamil Team

Academic Faculties

S. Thennarasu, Sr. Lecturer

Prem Kumar, Sr. Research Assistant

R. Amudha, Junior Research Assistant

R. Prabagaran, Junior Resource Person

Technical Faculties

Mohamed Yoonus, Sr. Lecturer

Vadivel, Lecturer


Speech Annotation Demo


Tamil Academy, SRM University

All the Professors, Teachers, Staff and the Participants & LDCIL Team