pocketSphinx
Make subdirectory CMUSphinx
Windows install
Mine is d:\Stephans\CMUSphinx
From http://cmusphinx.sourceforge.net/wiki/download/, download snapshot of pocketsphinx,
sphinxbase, and sphinxtrain to your CMUSphinx directory
I have made Windows binaries. They are available from the class web page.
If you get binaries, you still need to get the full sphinxtrain file as well (so you will need to download two versions of
sphinxtrain)
First get and decompress complete version
Second, get executables. Put executables in SphinxTrain\bin\Release (you will need to make this directory)
This way the directory+file structure is the same as if you had compiled the files
Put binaries of sphinxbase in CMUSphinx/sphinxbase/bin/Release
Put binaries of pocketsphinx in CMUSphinx/pocketsphinx/bin/Release
To run on Android, you need to get the full version of pocketsphinx. But this only compiles on Linux. We will do this later
If you are not using the prebuilt Windows binaries, compiling requires MS Visual Studio 2010 (I used Visual Studio 2010 Ultimate)
If you are a student, you can get it for free from
http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=29950cc3-3670-e011-971f-0030487d8897&vsro=8&JSEnabled=1
Or find MSDNAA link at https://www.eecis.udel.edu/wiki/ececis-docs/index.php/FAQ/Applications
You will also need an eecis account. You can sign up for one.
The snapshots include .sln Visual Studio 2010 solution files (earlier versions will not work)
Open Visual Studio. File->Open->Project/Solution. Navigate to and select the .sln file. To build: Build->Build Solution
First download and build sphinxbase
Before building, switch to Release.
Select Build -> Configuration Manager: under Active solution configuration, change from Debug to Release
Then pocketsphinx and sphinxtrain
You need perl and python
For perl, get ActiveState's ActivePerl: http://www.activestate.com/activeperl/downloads
For python get v2.7.X
http://www.python.org/getit/
Once python is installed, add the directory to your path
Add path to sphinxtrain binaries
If you downloaded binaries, add the path to the directory where you put them
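As a quick sanity check of the path setup, here is a small Python sketch (a hypothetical helper, not part of CMU Sphinx; it assumes Windows-style ;-separated PATH entries, and the example directories are the ones used in this guide):

```python
import os

def missing_from_path(required_dirs, path=None):
    """Return the directories from required_dirs that are not on the PATH.

    Comparison is case-insensitive and ignores trailing slashes, as on
    Windows. path defaults to the PATH environment variable."""
    if path is None:
        path = os.environ.get("PATH", "")
    normalized = {p.strip().rstrip("\\/").lower() for p in path.split(";")}
    return [d for d in required_dirs
            if d.strip().rstrip("\\/").lower() not in normalized]

# Example directories from this guide (adjust to your install):
# missing_from_path([r"C:\Python27",
#                    r"d:\Stephans\CMUSphinx\SphinxTrain\bin\Release"])
```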
Running pocketsphinx
Note audio file in CMUSphinx\pocketsphinx\test\data\goforward.raw
Open terminal and
Change directory to d:\Stephans\CMUSphinx\pocketsphinx\bin\Release
Pocketsphinx_batch.exe should be there, unless compile failed
Make a file ctlFile.txt containing the name of the file we will decode
goforward
Make a file called argFile.txt with these contents (more about these later)
-hmm ../../model/hmm/en_US/hub4wsj_sc_8k
-lm ../../model/lm/en/turtle.DMP
-dict ../../model/lm/en/turtle.dic
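The two files can also be written with a short Python sketch (the contents are exactly the lines shown above; a text editor works just as well):

```python
def write_batch_files(ctl_path="ctlFile.txt", arg_path="argFile.txt"):
    """Write the control file and arguments file for pocketsphinx_batch."""
    with open(ctl_path, "w") as f:
        # Name of the file to decode: no path, no extension.
        f.write("goforward\n")
    with open(arg_path, "w") as f:
        f.write("-hmm ../../model/hmm/en_US/hub4wsj_sc_8k\n")
        f.write("-lm ../../model/lm/en/turtle.DMP\n")
        f.write("-dict ../../model/lm/en/turtle.dic\n")
```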
Move
CMUSphinx/sphinxbase/bin/Release/sphinxbase.dll
To
CMUSphinx/pocketsphinx/bin/Release
Move
CMUSphinx\pocketsphinx\test\data\goforward.raw
To
CMUSphinx\pocketsphinx\bin\Release\goforward.raw
run
pocketsphinx_batch.exe -argfile argFile.txt -cepdir ../../test/data -ctl ctlFile.txt -cepext .raw -adcin true -hyp out.txt
Note: the command line arguments must be in this order!!
Where
-argfile argFile.txt defines the name of the arguments file. These arguments are displayed on the screen when the program runs. You
can check if they match
-cepdir ../../test/data defines the path to the files to be processed
-cepdir must come before -ctl
-ctl ctlFile.txt defines the ctlFile, which contains the names of the files to process. These names cannot have the path or the extension
-cepext .raw defines the extension of the files in the ctlFile
-adcin true means that the files are audio files
-hyp out.txt defines the output file
More details on the parameters are http://manpages.ubuntu.com/manpages/lucid/man1/pocketsphinx_batch.1.html
After running, the outfile contains
go forward ten meters (goforward -26532)
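Each hypothesis line has the form words (fileid score); a small sketch (assuming exactly that layout) to split one apart:

```python
import re

def parse_hyp_line(line):
    """Split a hypothesis line like 'go forward ten meters (goforward -26532)'
    into (words, fileid, score)."""
    m = re.match(r"^(.*)\((\S+)\s+(-?\d+)\)\s*$", line)
    if m is None:
        raise ValueError("unexpected hypothesis line: %r" % line)
    return m.group(1).strip(), m.group(2), int(m.group(3))

# parse_hyp_line("go forward ten meters (goforward -26532)")
# -> ("go forward ten meters", "goforward", -26532)
```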
Make and decode a new audio file
Open windows sound recorder
Record go forward ten meters
Save as myGoForward.wma
Saves as .wma file
Get a WMA-to-WAV converter
Save as c:\pocketsphinx\test\data\myGoForward.wav
I use 4Musics MultiFormat Converter. Other converters should work
Change ctlFile.txt to
myGoForward
In terminal run
pocketsphinx_batch.exe -argfile argFile.txt -ctl ctlFile.txt -cepdir ./ -cepext .wav -adcin true -hyp out2.txt
Check that out2.txt says go forward ten meters
Make your own acoustic model and language model
We will go over what is going on later, but first, let's try the process.
Alternatively, you can read about what is going on first and then return to this section
Download data
http://www.speech.cs.cmu.edu/databases/an4/index.html
Get mswav version
Save it to your CMUSphinx directory
Decompress
Set up config file
From CMUSphinx\SphinxTrain\etc
Copy
feat.params
sphinx_train.cfg
To CMUSphinx\an4\etc
sphinx_train.cfg is the main configuration file
Open sphinx_train.cfg in an editor
Line 6: $CFG_DB_NAME = "an4";
Line 7: $CFG_BASE_DIR = "d:\\stephans\\CMUSphinx\\an4";
Line 8: $CFG_SPHINXTRAIN_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain";
Line 11: $CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\sphinxbase\\bin\\Release";
Line 13: $CFG_SCRIPT_DIR = "d:\\Stephans\\CMUSphinx\\SphinxTrain\\scripts";
Check out lines 19-21. These say where the wav files are and that we are using mswav, which is
what we downloaded
Line 232: $DEC_CFG_DB_NAME = 'an4';
Line 233: $DEC_CFG_BASE_DIR = 'd:\\Stephans\\CMUSphinx\\an4';
Line 234 does not seem to matter
Line 239: $DEC_CFG_BIN_DIR = "d:\\Stephans\\CMUSphinx\\pocketsphinx\\bin\\Release";
Save sphinx_train.cfg
Other changes
copy sphinxbase.dll from
CMUSphinx\sphinxbase\bin\Release
To
CMUSphinx\SphinxTrain\bin\Release
In CMUSphinx\an4\etc directory, copy or rename
an4.ug.lm.DMP to an4.lm.DMP
Open CMUSphinx\SphinxTrain\scripts\sphinxtrain.in in an editor
Line 3: sphinxpath="d:\\Stephans\\CMUSphinx"
In many places there is /lib/sphinxtrain. Change this to /SphinxTrain
Copy files
From CMUSphinx\pocketsphinx\bin\Release, copy
pocketsphinx_batch.exe and pocketsphinx.dll to
CMUSphinx\SphinxTrain\bin\Release
Try skipping this and setting line 243 of .cfg
check
Open a cmd prompt
Type path and make sure that the directories for python and SphinxTrain\bin\Release are there
Run training
Change to CMUSphinx\an4 directory
Run
python ..\SphinxTrain\scripts\sphinxtrain.in run
This will take a while (15 minutes)
Results from the test are a sentence error rate of 45% (nearly
half of the sentences had at least one error) and a
word error rate of 15.7% (15.7% of the words were
incorrectly recognized)
This can fail because python was not installed, or
because the path to python or to SphinxTrain was not set
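For reference, word error rate is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal sketch of that computation (plain Levenshtein distance, not SphinxTrain's own scoring code):

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return float(prev[len(hyp)]) / len(ref)

# E.g. one substituted word out of four gives a 25% WER:
# word_error_rate("go forward ten meters", "go forward two meters") -> 0.25
```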
Check log
Open an4.html
Check for errors
MODULE: 30 Training Context Dependent models
A few errors of type "Failed to align audio to transcript: final
state of the search is not reached" are acceptable
MODULE: 50 Training Context Dependent models
A few errors of type "Failed to align audio to transcript: final
state of the search is not reached" are acceptable
At the very end is the test decoding
Open log file
Note the parameters for running decoding, specifically,
where the HMM, dictionary, and language model files are
Test with your own voice sample
Record sample
Convert to .wav
Run pocketsphinx_batch
pocketsphinx_batch
-hmm d:\Stephans\CMUSphinx\an4/model_parameters/an4.cd_cont_200
-lw 10 -feat 1s_c_d_dd
-beam 1e-80 -wbeam 1e-40
-dict d:\Stephans\CMUSphinx\an4/etc/an4.dic
-lm d:\Stephans\CMUSphinx\an4/etc/an4.lm.DMP
-wip 0.2 -ctl d:\Stephans\CMUSphinx\an4/myTest/ctlFile.txt
-ctloffset 0
-ctlcount 130
-cepdir d:\Stephans\CMUSphinx\an4/myTest -cepext .wav
-hyp d:\Stephans\CMUSphinx\an4/myTest/results.txt
-agc none
-varnorm no
-cmn current
-adcin true
background
At a first approximation, words are sequences of sounds, where
each sound is a phone.
However, the exact pronunciation of a phone depends on the
phones before and after.
Diphones are two phones. Diphones are less impacted by the
phones that come before or after.
Triphones and quinphones are possible. The general name is
senone
While there are many phones, not every combination of phones is a
word. Thus, we should not simply recognize phones, but recognize
words as sequences of phones
Besides phones, there are fillers (e.g., breath, um). An utterance is a
sequence of words and fillers
Utterances are separated by pauses
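The "words are sequences of phones" view can be sketched as a simple dictionary lookup (the phone spellings below are illustrative ARPAbet-style guesses for this sketch, not taken from a real .dic file):

```python
# Illustrative word-to-phones mapping; a real system reads this from a
# phonetic dictionary (.dic) file.
DICT = {
    "GO": ["G", "OW"],
    "FORWARD": ["F", "AO", "R", "W", "ER", "D"],
}

def utterance_to_phones(words, dictionary=DICT):
    """Concatenate the phone sequences of the words in an utterance."""
    phones = []
    for w in words:
        phones.extend(dictionary[w.upper()])
    return phones
```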
models
Three types of models are used
acoustic model
Used to model the sound of a phone
Typically, an HMM is used
Each phone has an HMM
Mapping from HMMs to phones
Since the acoustic model is an HMM, in CMU Sphinx the HMM is
the same as the acoustic model
phonetic dictionary
Maps words to phones
In CMU Sphinx, .dic files are dictionary files
language model
Used to determine which sequences of words are allowed. For example, he
super run the sally is not allowed by the language model
Running with other models
Many acoustic and language models are available at
http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
Building Your Own Acoustic Model and
Language Model
Building your own models is time consuming
Acoustic models require
Lots of recordings of people saying words and sentences
Not that difficult to do
Accurate transcription of the recording
Time consuming
There are many acoustic models available online
It is possible to take an existing model and quickly adapt it to a particular speaker
Language Model
Different systems need different language models
A voice control for your TV needs to recognize only a few words like volume up and change channel
A voice driven email composer needs to recognize a different set of words
The performance of the recognizer is improved if your language model only considers the relevant
words.
You can take an existing language model and trim it to what you need, or make one from
scratch
Many models are available from http://www.ldc.upenn.edu/Catalog/index.jsp
example
To explore acoustic and language models, get the AN4
database
http://www.speech.cs.cmu.edu/databases/an4/index.html
Save it to your CMUSphinx directory
Decompress
Also, explore the PDA dataset
http://www.speech.cs.cmu.edu/databases/pda/index.html
This data is letters and numbers, e.g., A, B, 19
We can test this system by saying things like A, B, etc.
Acoustic model
The acoustic model is used to translate recorded sounds into
labeled phones,
e.g., recorded sound in file asc.wav is AH
Roughly speaking, acoustic models take the sound sample as
input and output a quality-of-fit score
asc.wav -> AH-Model-> -12
asc.wav -> AY-Model-> -14
AH-Model gives a better fit of the recorded sound
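Picking the best-fitting phone is just an argmax over the model scores. A tiny sketch, assuming larger (less negative) scores mean a better fit, as in the example above:

```python
def best_phone(scores):
    """Return the phone label whose model gives the best (largest) score.

    scores: dict mapping phone label -> quality-of-fit score."""
    return max(scores, key=scores.get)

# Matching the example above:
# best_phone({"AH": -12, "AY": -14}) -> "AH"
```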
Making an acoustic model is called training
Inputs to training are audio files and transcriptions
Challenge: Usually the audio file has many phones, not just one
E.g., from AN4 data set, an audio file contains a recording of the
words TWO SIX EIGHT FOUR FOUR ONE EIGHT
CMUSphinx\an4\wav\an4_clstk\fash\cen7-fash-b.wav
E.g. from PDA data set, an audio file might contain a recording of the
words: MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE
SAYS
CMUSphinx\PDA\PDAs\001\PDAs01_001_1.wav
Transcriptions
Approach one: the recording from the PDA set is transcribed as: M AA R JH AX N Z SIL HH IX S T AO R IX K AX L IY SIL ...
Two problems with approach one
If the word margins appears in other files, we need to enter the pronunciation of the word twice
There are two ways that people pronounce historically
HH IX S T AO R IX K AX L IY
HH IX S T AO R IX K L IY (this one actually says historicly, which is incorrect)
Two-stage transcriptions (results in many files)
Transcription file: gives the words spoken
This file contains one line for each file used in training
The line contains the text of the words spoken and the filename (without extension such as .wav)
The AN4 dataset includes the file an4_train_transcription and it includes the line: <s> TWO SIX EIGHT FOUR FOUR ONE EIGHT </s> (cen7-fash-b)
The PDA dataset includes the file PDAs.train_all.sent and it includes the line: MARGINS HISTORICALLY HAVE PEAKED BY MID YEAR HE SAYS (PDAs01_041)
Hmm, this is missing the <s> and </s>; I think the software requires them. To use the PDA data set, add
<s> and </s>
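Adding the <s> and </s> markers to every line of a PDA-style transcription file can be done with a short sketch (a hypothetical helper; it assumes each line ends with the (fileid) tag as shown above):

```python
def add_sentence_markers(line):
    """Wrap the word portion of 'WORDS (fileid)' in <s> ... </s>."""
    words, sep, fileid = line.rpartition("(")
    if not sep:
        raise ValueError("no (fileid) tag in line: %r" % line)
    return "<s> %s </s> (%s" % (words.strip(), fileid)
```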
Dictionary file
A mapping from words to phones (elementary spoken sounds)
Allows words to have multiple pronunciations
E.g., the AN4 dataset includes the file an4.dic and it includes the lines
ELEVEN IH L EH V AH N
ELEVEN(2) IY L EH V AH N
E IY
By combining the transcript file and dictionary file, the sounds in each recorded audio file can be
determined
However, it is a bit tricky to determine which part of the audio file corresponds to which sound.
This is a major challenge facing training
Recall, the overall goal of training is to find models for each sound. But to make the training process easier
for the users, we only provide recordings of words and sentences.
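The .dic format above (one pronunciation per line, with (2), (3), ... marking alternate pronunciations) can be read with a short sketch (a hypothetical parser, not SphinxTrain's own):

```python
import re

def parse_dic(lines):
    """Parse dictionary lines into {word: [pronunciation, ...]}.

    A line is 'WORD phone phone ...'; alternates look like 'WORD(2) ...'."""
    pronunciations = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word = re.sub(r"\(\d+\)$", "", parts[0])   # strip alternate marker
        pronunciations.setdefault(word, []).append(parts[1:])
    return pronunciations
```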
Files needed
Training
your_db_train.fileids - List of files used for training
E.g., AN4 includes an4_train.fileids
Format
path/filename (without extension!)
The path is relative to where the SphinxTrain program is executed
E.g., the paths in an4_train.fileids are relative to the AN4 etc directory, so SphinxTrain needs to be run from that
directory
your_db_train.transcription - Transcription for training (described on previous slide)
your_db.dic - Phonetic dictionary (described on previous slide)
your_db.filler - List of fillers and what they map to
Fillers are things like silence, breathing, um etc.
Fillers should also be used in the transcript
E.g., <s> TWO +UM+ SIX EIGHT FOUR FOUR ONE EIGHT </s>
Fillers use the + sign before and after
During training, models for fillers will be computed
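A fileids file is just one path-without-extension per line. A sketch (a hypothetical helper, not part of SphinxTrain) that builds one from a directory tree of .wav files:

```python
import os

def write_fileids(wav_root, out_path):
    """Write a .fileids file listing every .wav under wav_root,
    as relative paths with the extension stripped."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(wav_root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                rel = os.path.relpath(os.path.join(dirpath, name), wav_root)
                # fileids entries use forward slashes and no extension.
                entries.append(os.path.splitext(rel)[0].replace(os.sep, "/"))
    entries.sort()
    with open(out_path, "w") as f:
        f.write("\n".join(entries) + "\n")
    return entries
```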
Decoding is more complicated
Fillers are allowed to be added, but there is some penalty
the fillers are ignored when computing the probability of a sequence of words
E.g., the language model might tell us that go to bed is common, and go up bed is uncommon. If the
decoder detects go um to bed, it translates it to go to bed
For some reason, fillers are not used in the AN4 and PDA transcript files
<s>, </s>, and SIL (silence) are included
SMACK is listed in the PDA filler file, but not in the transcript
File format
</s> SIL
<s> SIL
<sil> SIL
++INHALE++ +INHALE+
Note that these lines are also changed if you use different models
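The filler file lines above (transcript token, then the model it maps to) parse trivially; a sketch:

```python
def parse_filler(lines):
    """Map each transcript filler token to the model name it trains."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:          # token and model name
            mapping[parts[0]] = parts[1]
    return mapping
```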
Build, run and test
Windows install
Requires Android NDK
Flex for windows: http://gnuwin32.sourceforge.net/packages/flex.htm
Bison for windows: http://gnuwin32.sourceforge.net/packages/bison.htm
Get CMUSphinx from here: ??
Note that this contains the
Follow directions from
http://cmusphinx.sourceforge.net/2011/05/building-pocketsphinx-on-android/
Or google: pocketSphinx android
Or:
But the order of libs at the end needs to be reversed
Only compiles on Linux, because it needs yacc
resources:
http://www.speech.cs.cmu.edu/sphinxman/
Voice activity detection
VAD is used to detect if anyone is speaking