Contents
1 Introduction
2 Function call
3 Colour maps
4 Frequency axis
4.1 Nonlinear frequency scaling
4.2 Frequency range and stepsize
5 Analysis bandwidth
6 Time Axis
7 Intensity scaling
8 Waveform and transcription
9 Output Arguments
10 MODE string options
11 MATLAB Code for figures
1 Introduction
This document describes the spgrambw function, which is part of the voicebox toolbox available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html [Bro97]. The examples use the sentence "Six plus three equals nine", for which a spectrogram is shown below including the speech waveform and a time-aligned phonetic annotation.
[Figure 1: MODE='pJcwat'. Spectrogram of the example sentence with the speech waveform and phonetic annotation above it. Axes: Time (s), Frequency (kHz); colourbar: Power/Decade (dB).]
2 Function call
The basic call to the function is:

    [T,F,B]=spgrambw(S,FS,MODE,BW,FMAX,DB,TINC,ANN)

where all but the first two input arguments are optional. The input arguments are:

    S      input speech waveform
    FS     sample rate of speech waveform
    MODE   text string specifying a large range of options
    BW     the bandwidth of the spectrogram. This argument determines the tradeoff between time and frequency resolution.
    FMAX   specifies the range and stepsize of the frequency axis (see section 4.2)
    DB     specifies the range of power spectral density displayed
    TINC   specifies the range and resolution of the time axis
    ANN    gives an optional annotation file containing words or phonemes.
If all you want to do is draw a spectrogram, then the function should be called without any output arguments. If output arguments are specified, then no spectrogram will be drawn unless the 'g' mode option is also given. The output arguments are:

    T      gives the time of each time-axis sample point
    F      gives the frequency of each frequency-axis sample point
    B      a 2-dimensional array giving the spectral density at each time-frequency point.
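The calling conventions above can be sketched as follows. This is a minimal illustration rather than code from the toolbox documentation: the chirp test signal, its sample rate and the choice of the 'pJc' options are assumptions made for the example, and the calls to spgrambw are guarded so that the script still runs when voicebox is not on the MATLAB path.

```matlab
% Minimal sketch: build a synthetic test signal, then plot and/or return a spectrogram.
fs = 16000;                          % sample rate in Hz (arbitrary choice)
t  = (0:2*fs-1)'/fs;                 % 2 seconds of sample times
s  = sin(2*pi*(500*t + 750*t.^2));   % chirp whose frequency rises from 500 Hz to 3500 Hz

if exist('spgrambw','file')                  % run only if voicebox is on the path
    spgrambw(s, fs, 'pJc');                  % no outputs: draws the spectrogram
    [T, F, B] = spgrambw(s, fs, 'pJc');      % outputs requested: nothing is drawn
    [T, F, B] = spgrambw(s, fs, 'pJcg');     % 'g' option: outputs and a plot
    disp(size(B));                           % rows correspond to T, columns to F
end
```

Note that the spectrogram array B is the third output, so a call that needs only B still has to name T and F.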
In the plots shown in this document, the title (above the spectrogram) shows the figure number (written {n} in the text), the value of the MODE argument and the value of any other arguments that are not null.
3 Colour maps
The default output is a monochrome spectrogram shown as {2}. Specifying the `j' mode option uses the jet colourmap instead which is colourful and intuitive {3}. However it does not reproduce accurately if viewed or printed in monochrome and so I normally use the `J' option instead which is less aggressive and converts accurately to monochrome {4}. Notice that I have also used the `c' option in each case in order to include a colourbar giving the intensity scale in decibels.
[Figures 2 to 4: 2: MODE='pc' (Monochrome); 3: MODE='pjc' (`j'=Jet); 4: MODE='pJc' (`J'=Thermal). Axes: Time (s), Frequency (kHz); colourbar: Power/Decade (dB).]
Adding the `i' option inverts the colour map so that dark areas now correspond to high intensity. For these examples, I have omitted the `c' option so the colourbar is missing.
[Figures 5 to 7: 5: MODE='pi'; 6: `ij'=Inverted Jet; 7: MODE='pJi' (`iJ'=Inverted Thermal). Axes: Time (s), Frequency (kHz).]
4 Frequency axis
4.1 Nonlinear frequency scaling
The default frequency axis is linear in Hz as seen in the examples above. Speech scientists usually prefer a nonlinear frequency scale in which high frequencies are compressed. There are several widely used frequency scales and these are plotted below (scaled to coincide at 1 kHz) [MG83, Ghi94, SVN37, Zwi61, ZT80]. The log scale {8} provides the most compression at high frequencies but it is more usual to use one of the physiological or psychoacoustical scales: Erb-rate {9}, Mel {10} or Bark {11}. The scale is selected by the MODE options `l', `e', `m' or `b'. In all cases, it is also possible to add the `f' option, which causes the frequency axis labels to be written in Hz as in {12}. In all the plots below, I have reduced the bandwidth to 80 Hz (see section 5) to give better frequency resolution.
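The idea of scaling each frequency scale to coincide at 1 kHz can be sketched without the toolbox for the linear, log and mel scales. The mel formula used here, 2595*log10(1+f/700), is a commonly quoted one and is an assumption of this sketch; it may differ in detail from the conversion used inside voicebox, and the Erb-rate and Bark scales are left out.

```matlab
% Linear, log and mel frequency scales, each normalised to 1 at 1 kHz.
f   = linspace(0, 6000, 200)';                 % frequency axis in Hz
mel = 2595*log10(1 + f/700);                   % common mel formula (assumed)
lg  = [NaN; log10(f(2:end))];                  % log10 is undefined at 0 Hz
ref = [1000, log10(1000), 2595*log10(1 + 1000/700)];   % each scale evaluated at 1 kHz
y   = [f, lg, mel] ./ repmat(ref, numel(f), 1);        % columns: lin, log, mel
compression = y(end, :);                       % relative scale values at 6 kHz
```

At 6 kHz the linear scale has grown by a factor of 6 relative to 1 kHz, the mel scale by roughly 2.5 and the log scale by roughly 1.26, which matches the statement that the log scale compresses high frequencies the most.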
[Figure: Frequency scales. Curves: lin, log, mel, bark, erb-rate. Axes: Frequency (kHz), Scale relative to 1 kHz.]
[Figures 8 to 12: 8: MODE='pJcl', BW=80, axis Frequency (log10Hz); 9: axis Frequency (Erb-rate); 10: MODE='pJcm', BW=80, axis Frequency (kMel); 11: axis Frequency (Bark). Horizontal axis: Time (s); colourbar: Power/Decade (dB).]
4.2 Frequency range and stepsize
By default the frequency axis encompasses the entire range from 0 Hz to the Nyquist frequency, which is often too large. FMAX=4000 {13} restricts the frequency range to a maximum of 4 kHz while FMAX=[2000 4000] sets the range to 2 kHz to 4 kHz {14}. Normally the frequency stepsize is 1/256 of the displayed range, but you can also specify the stepsize explicitly: FMAX=[2000 200 4000] goes from 2 kHz to 4 kHz in steps of 200 Hz {15}. If a nonlinear frequency scaling has been selected by the `l', `e', `m' or `b' options, then FMAX must be specified in scaled units unless the `h' option is given, in which case the values are in Hz as normal. Note that selecting a very small step size does not make the spectrogram any less blurry; the frequency resolution is determined by the analysis bandwidth, BW, described in section 5.
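As a rough illustration of the three forms of FMAX on a linear axis, the frequency sample points they imply can be written out directly. This is a sketch under stated assumptions: the Nyquist frequency is a made-up value, and the inclusion of both end points in each grid is assumed rather than taken from the toolbox.

```matlab
% Frequency sample points implied by the forms of FMAX (linear axis, values in Hz).
fnyq  = 11025;                         % hypothetical Nyquist frequency
nstep = 256;                           % default stepsize is 1/256 of the displayed range

f_default = (0:nstep)' * (fnyq/nstep);           % FMAX omitted: 0 Hz to Nyquist
f_max     = (0:nstep)' * (4000/nstep);           % FMAX=4000
f_range   = 2000 + (0:nstep)' * (2000/nstep);    % FMAX=[2000 4000]
f_step    = (2000:200:4000)';                    % FMAX=[2000 200 4000]
```

With FMAX=4000 the default step is 15.625 Hz, whereas the explicit 200 Hz step gives only 11 points between 2 kHz and 4 kHz; neither choice changes the frequency resolution, which is fixed by BW.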
[Figures 13 to 15: 13: MODE='hpJc', FMAX=4000 (0 to 4 kHz); 14: MODE='hpJc', FMAX=[2000 4000] (2 to 4 kHz); 15: MODE='hpJc', FMAX=[2000 200 4000]. Axes: Time (s), Frequency (kHz); colourbar: Power/Decade (dB).]
5 Analysis bandwidth
There is an unavoidable tradeoff between time resolution and frequency resolution that is often known as the uncertainty principle. The BW input parameter specifies the 6 dB bandwidth of the analysis, that is, the frequency separation at which two tones will definitely give distinct peaks. From the point of view of frequency resolution, it follows that the smaller BW the better. However, selecting a small value of BW means that rapid amplitude variations within any single frequency bin will be attenuated and, in particular, amplitude variations faster than about BW/2 will not be visible (see section 6).
[Figures 16 to 18: 16: BW=50 Hz; 18: BW=400 Hz. Spectrograms with phonetic annotation. Axes: Time (s), Frequency (kHz); colourbar in dB.]
In this speech example, which is by a female talker, the larynx frequency varies from 300 Hz down to 150 Hz. If BW is chosen to be below the fundamental frequency, e.g. BW=50 Hz in {16}, the harmonics of the larynx frequency are clearly visible as quasi-horizontal stripes; however, the time resolution is relatively poor. In a broadband spectrogram, in contrast, the bandwidth is chosen to be higher than the larynx frequency, e.g. BW=400 Hz in {18}, and the individual harmonics are no longer resolved. The time resolution is, however, much improved and it is possible to resolve the individual acoustic excitations arising from each larynx pulse; these are visible as vertical striations during the /aI/ phoneme of "nine" at a time of around 1.5 seconds. The default bandwidth is BW=200 Hz {17}, which is often too large to resolve the larynx frequency harmonics but which makes the vocal tract resonances, or formants, easy to see.
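The tradeoff can be made concrete by relating BW to the duration of the analysis window. The sketch below assumes a Hamming window, which this document does not state, and measures its 6 dB bandwidth numerically; the result, roughly 1.8 divided by the window duration, then gives the window length implied by each of the three bandwidths discussed above.

```matlab
% 6 dB bandwidth of a Hamming window, measured in units of 1/(window duration).
N   = 1024;                                   % window length in samples
n   = (0:N-1)';
w   = 0.54 - 0.46*cos(2*pi*n/(N-1));          % Hamming window (assumed window type)
pad = 256;                                    % zero-padding factor for a finely sampled spectrum
W   = abs(fft(w, pad*N));
W   = W / W(1);                               % normalise to unit gain at 0 Hz
k   = find(W < 0.5, 1);                       % first spectral sample more than 6 dB down
bw6 = 2*(k-1)/pad;                            % two-sided 6 dB bandwidth in bins

BW   = [50 200 400];                          % bandwidths used in {16}, {17} and {18}
Twin = bw6 ./ BW;                             % implied window durations in seconds
```

Under this assumption BW=50 Hz corresponds to a window of about 36 ms, long enough to span several larynx periods, while BW=400 Hz corresponds to under 5 ms, short enough to isolate individual larynx pulses.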
6 Time Axis
As discussed in section 5, the time resolution is determined by the BW parameter, and modulation frequencies above BW/2 are not shown in the spectrogram. For this reason, the default time-step is taken as 0.45/BW and, for small values of BW, this may give a blocky appearance {19}. To avoid this you can explicitly set a smaller time step using the TINC parameter as shown in {20}; note that although this results in a smoother appearance, it does not improve the time resolution, which is still determined by the BW parameter (see section 5).
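A small worked example shows why a narrow bandwidth looks blocky. It assumes that the default time-step is 0.45/BW and that modulation frequencies above BW/2 are not shown; the explicit 5 ms value of TINC is a hypothetical choice for illustration.

```matlab
% Default and explicit time steps for a narrow analysis bandwidth.
BW = 20;                                      % analysis bandwidth in Hz, as in {19}
tincDefault = 0.45/BW;                        % assumed default time step in seconds
fmodMax = BW/2;                               % assumed highest modulation frequency shown, in Hz
TINC = 0.005;                                 % hypothetical explicit time step in seconds
framesPerSecond = [1/tincDefault, 1/TINC];    % frame rates for the default and explicit steps
```

With BW=20 Hz the default step is 22.5 ms, or about 44 frames per second; setting TINC to 5 ms raises this to 200 frames per second, which smooths the picture without revealing modulation any faster than before.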
[Figures 19 to 21: 19: MODE='pJcwat', BW=20 (BW=20 Hz). Spectrograms with waveform and phonetic annotation. Axes: Time (s), Frequency (kHz); colourbar: Power/Decade (dB).]
You can restrict the display to a specific time interval by setting TINC to a vector giving the limits of the interval, which can also include the time-step if you want to specify that as well {21}. Notice in {21} that the waveform and annotations remain correctly aligned. The sample time of S(1) is assumed by default to be T1 = [...].
7 Intensity scaling
The default spectrogram shows the spectral density in units of power per Hz {22}. Because most speech energy is concentrated at low frequencies, this can make it difficult to see detail in the display at both low and high frequencies. To avoid this, you can use the `p' option to display power per decade instead: this option multiplies the power by a value proportional to the frequency and so emphasises high frequencies {23}. If you are using one of the nonlinear frequency scaling options described in section 4.1, you have a third option, selected by `P', which is to show power per bark/erb/... {24}.
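The effect of converting power per Hz into power per decade can be sketched for a flat spectrum. The constant of proportionality used here, ln(10), follows from d(log10 f) = df/(f ln 10) and is an assumption of the sketch; the text above only says that the multiplier is proportional to frequency.

```matlab
% Power per Hz converted to power per decade for a flat (white) spectrum.
f        = [100 1000 10000];          % frequencies in Hz
pHz      = ones(size(f));             % flat power spectral density, arbitrary units per Hz
pDecade  = pHz .* f * log(10);        % since d(log10 f) = df/(f*ln 10)
dBHz     = 10*log10(pHz);             % 0 dB at every frequency
dBDecade = 10*log10(pDecade);         % rises by 10 dB for each tenfold increase in f
```

A spectrum that falls at 10 dB per decade in power per Hz would therefore appear flat in power per decade, which is why the `p' option makes the weaker high-frequency detail easier to see.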
[Figures 22 to 24: 22: MODE='Jc' (Power/Hz); 23: MODE='pJc' (`p'=Power/Decade); 24: MODE='PJcbf', axis Bark-scaled frequency (Hz), colourbar Power/Bark (dB). Horizontal axis: Time (s).]
Normally, the display shows a range of 40 dB from the maximum power anywhere in the spectrogram {25}. You can change this to a different range by setting the DB parameter either to the desired range {26} or alternatively to the minimum and maximum powers to display: DB = [Pmin Pmax] {27}. This is especially useful if you want to have several spectrograms with identical displayed power ranges. Values outside the selected range will be set to either the minimum or maximum.
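The two forms of the DB argument can be illustrated on a small made-up array of dB values. This is a sketch of the clipping rule as described above, not the code used inside spgrambw; the array BdB and the variable names are hypothetical.

```matlab
% Display limits implied by the DB argument, applied to a made-up dB array.
BdB = [-35 -12 0; 3 -50 -20];              % hypothetical spectrogram values in dB

DB     = 40;                                % scalar form: range below the maximum
lim1   = [max(BdB(:)) - DB, max(BdB(:))];   % limits are [-37 3] for this array
shown1 = min(max(BdB, lim1(1)), lim1(2));   % values outside the limits are clipped

DB     = [-25 0];                           % vector form: explicit [Pmin Pmax]
shown2 = min(max(BdB, DB(1)), DB(2));
```

The scalar form ties the limits to the maximum of each spectrogram, so two recordings at different levels get different limits; the vector form fixes the limits and so keeps the colour scales of several spectrograms comparable.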
[Figures 25 to 27: 25: MODE='Jc'; 26: MODE='Jc', DB=60; 27: MODE='Jc', DB=[-25 0]. Axes: Time (s), Frequency (kHz); colourbar: Power/Hz (dB).]
8 Waveform and transcription
[...] their time intervals without any time markers {26}. If you want to display phonetic characters, you will need to install a non-unicode IPA font such as the SIL93 fonts (available for download from the Voicebox website). You can specify the font of each annotation entry by including a third column; each row of ANN is now of the form {..., `font'}. Example {27} uses the `SILDoulos IPA93' font and also includes the options `a', which centres the annotations in their time interval, and `t', which includes time markers.
[Figures 29 and 30: 29: MODE='Jc'; 30: MODE='Jcwat', with waveform and phonetic annotation. Axes: Time (s), Frequency (kHz); colourbar: Power/Hz (dB).]
9 Output Arguments
Specifying output arguments normally suppresses the spectrogram plot unless the `g' option is given. Note that, perhaps unexpectedly, the spectrogram array is the third output rather than the first. If you save the B output (with a linear frequency scale and without the `p' or `P' options), you can use it as the input to a subsequent call to spgrambw instead of a time-domain waveform. In this case FS=[FS T1 FINC F1] where FS is now the frame rate (each frame is one row of B), T1 is the time of the first row of B, FINC is the frequency increment and F1 is the frequency of the first column in B.
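The four-element FS vector can be derived from the T and F outputs of the earlier call. In the sketch below the T and F vectors are hypothetical stand-ins for saved outputs, and the final call is guarded so that it runs only when voicebox is on the path and a saved B array exists in the workspace.

```matlab
% Build the FS vector needed to pass a saved B array back into spgrambw.
T = 0.01 + 0.005*(0:99)';            % hypothetical frame times in seconds (one per row of B)
F = 50*(0:80);                       % hypothetical frequencies in Hz (one per column of B)

fsB = [1/(T(2)-T(1)), T(1), F(2)-F(1), F(1)];     % [frame rate, T1, FINC, F1]

if exist('spgrambw','file') && exist('B','var')   % needs voicebox and a saved B
    spgrambw(B, fsB, 'Jc');                       % replot from the saved array
end
```

For these values the frame rate is 200 frames per second, T1 is 0.01 s, FINC is 50 Hz and F1 is 0 Hz.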
10 MODE string options

    ...    clip the output B array to the limits specified by the "db" input
    e      erb scale
    f      label frequency axis in Hz rather than mel/bark/...
    g      draw a graph even if output arguments are present
    h      units of the FMAX input are in Hz instead of mel/bark/... In this case, the Fstep parameter is used only to determine the number of filters.
    ...    express the F output in Hz instead of mel/bark/...
    J      thermal colourmap that is linear in grayscale. Based on Oliver Woodford's real2rgb at http://www.mathworks.com/mat
    l      log10 Hz frequency scale
    m      mel scale
    p      calculate power per decade rather than power per Hz. This effectively increases the power level at high frequencies and so makes them more visible
    P      calculate power per erb/mel/... rather than power per Hz.
    t      add time markers with annotations
    w      draw the speech waveform above the spectrogram
11 MATLAB Code for figures

    % p is a cell array with one row per figure; its rows are not reproduced here.
    % Columns: 1 = figure number, 2 = figure width factor, 3 = MODE, 4 = BW,
    %          5 = FMAX, 6 = DB, 7 = TINC, 8 = annotation type (0, 1 or 2)
    args={'BW','FMAX','DB','TINC'};          % argument names used in the plot titles
    yfig=420;
    fn='../data/at05f0.sfs';                 % speech file in SFS format
    [sp,fs]=readsfs(fn,1,1);                 % speech signal
    [pt,fw]=readsfs(fn,5,2);                 % phonetic transcription
    ann=[mat2cell([cell2mat(pt(:,1)) cell2mat(pt(:,1:2))*[1;1]]/fw,ones(1,size(pt,1))) pt(:,3)];
    ipa=ann(:,[1 2 2]);
    ipa(:,2)={'s' 'I' 'k' 's' 'p' 'l' '' 's' 'T' 'r' 'i' 'i' 'k' 'w' ' ' 'z' 'n' 'aI' 'n' ' '};
    ipa(:,3)=repmat({'SILDoulos IPA93'},size(ipa,1),1);
    for i=1:size(p,1)
        set(gcf,'Position',[100 100 round(yfig*p{i,2}) yfig]);
        switch p{i,8}
            case 0
                spgrambw(sp,fs,p{i,3},p{i,4},p{i,5},p{i,6},p{i,7});
            case 1
                spgrambw(sp,fs,p{i,3},p{i,4},p{i,5},p{i,6},p{i,7},ann);
            case 2
                spgrambw(sp,fs,p{i,3},p{i,4},p{i,5},p{i,6},p{i,7},ipa);
        end
        ss=sprintf('%d: MODE=''%s''',p{i,1},p{i,3});
        for j=4:7
            if numel(p{i,j})==1
                ss=sprintf('%s, %s=%g',ss,args{j-3},p{i,j});
            elseif numel(p{i,j})>1
                ss=sprintf('%s, %s=[%s',ss,args{j-3},sprintf('%g ',p{i,j}));
                ss=[ss(1:end-1) ']'];
            end
        end
        title(ss);
        if emf, eval(sprintf('print -dmeta ...')); end   % print arguments elided
    end
    % now plot other graphs
    for i=201:201
        figure(i)
        switch i
            case 201
                fax=linspace(0,6000,200)';
                y=[fax [nan; log10(fax(2:end))] frq2mel(fax) frq2bark(fax) frq2erb(fax)];
                [v,iv]=min(abs(fax-1000));
                y=y./repmat(y(iv,:),length(fax),1);
                plot(fax/1000,y);
                set(gca,'ylim',[0 3]);
                xlabel('Frequency (kHz)');
                ylabel('Scale relative to 1 kHz');
                txt={2.8 2.7 'lin'; 5 1.1 'log'; 4.7 2.5 'mel'; 5.2 2.15 'bark'; 4.5 1.7 'erb-rate'};
                title('Frequency scales')
                for j=1:5
                    text(txt{j,1},txt{j,2},txt{j,3})
                end
                figbolden
        end
        if emf, eval(sprintf('print -dmeta ...')); end   % print arguments elided
        close(i);
    end
References
[Bro97] D. M. Brookes, "VOICEBOX: A speech processing toolbox for MATLAB," 1997. [Online]. Available: http://www.ee.imperial.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[Ghi94] O. Ghitza, "Auditory models and human performance in tasks related to speech coding and speech recognition," IEEE Trans. Speech Audio Process., vol. 2, pp. 115-132, Jan. 1994.
[MG83] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750-753, 1983.
[SVN37] S. S. Stevens, J. Volkman, and E. B. Newman, "A scale for the measurement of the psychological magnitude of pitch," J. Acoust. Soc. Am., vol. 8, pp. 185-190, 1937.
[ZT80] E. Zwicker and E. Terhardt, "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency," J. Acoust. Soc. Am., vol. 68, no. 5, pp. 1523-1525, Nov. 1980.
[Zwi61] E. Zwicker, "Subdivision of audible frequency range into critical bands," J. Acoust. Soc. Am., vol. 33, p. 248, 1961.