Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music

Matti Ryynanen and Anssi Klapuri
Department of Signal Processing
Tampere University of Technology
P.O. Box 553, FI-33101 Tampere, Finland
{matti.ryynanen, anssi.klapuri}@tut.fi
This article proposes a method for the automatic transcription of the melody, bass line, and chords in polyphonic pop music. The method uses a frame-wise pitch-salience estimator as a feature extraction front-end. For the melody and bass-line transcription, this is followed by acoustic modeling of note events and musicological modeling of note transitions. The acoustic models include a model for the target notes (i.e., melody or bass notes) and a background model. The musicological model involves key estimation and note bigrams that determine probabilities for transitions between target notes. A transcription of the melody or the bass line is obtained using Viterbi search via the target and the background note models. The performance of the melody and the bass-line transcription is evaluated using approximately 8.5 hours of realistic polyphonic music. The chord transcription maps the pitch salience estimates to a pitch-class representation and uses trained chord models and chord-transition probabilities to produce a transcription consisting of major and minor triads. For chords, the evaluation material consists of the first eight Beatles albums. The method is computationally efficient and allows causal implementation, so it can process streaming audio.

Transcription of music refers to the analysis of an acoustic music signal for producing a parametric representation of the signal. The representation may be a music score with a meticulous arrangement for each instrument or an approximate description of the melody and chords in the piece, for example. The latter type of transcription is commonly used in commercial songbooks of pop music and is usually sufficient for musicians or music hobbyists to play the piece. On the other hand, more detailed transcriptions are often employed in classical music to preserve the exact arrangement of the composer.

We propose a method for the automatic transcription of the melody, bass line, and chords in pop-music recordings. Conventionally, these tasks have been carried out by trained musicians who listen to a piece of music and write down the notes or chords by hand, which is time-consuming and requires musical training. A machine transcriber enables several applications. First, it provides an easy way of obtaining a description of a music recording, allowing musicians to play it. Second, the produced transcriptions may be used in music analysis, music information retrieval (MIR) from large music databases, content-based audio processing, and interactive music systems, for example.

A note is here defined by a discrete pitch, an onset time, and a duration. The melody of a piece is an organized sequence of consecutive notes and rests, usually performed by a lead singer or by a solo instrument. More informally, the melody is the part one often hums when listening to a piece. The bass line consists of notes in a lower pitch register and is usually played with a bass guitar, a double bass, or a bass synthesizer. A chord is a combination of notes that sound simultaneously or nearly simultaneously. In pop music, these concepts are usually rather unambiguous.

Figure 1 shows the waveform of an example music signal and two different representations of its melody, bass line, and chords. The middle panels show a piano-roll representation of the melody and the bass notes, respectively. Notes in this representation can be compactly saved in a MIDI file. The lowest panel represents the same notes and the chords in common musical notation, where the note onsets and durations are indicated by discrete symbols. The proposed method produces a piano-roll representation of the melody and the bass line together with chord labels. If desired, the note timings can be further quantized to obtain common music notation (Hainsworth 2006; Whiteley, Cemgil, and Godsill 2006).
Work on polyphonic music transcription dates back more than 30 years (Moorer 1977). Nowadays, the concept of automatic music transcription includes several topics, such as multi-pitch analysis, beat tracking and rhythm analysis, transcription of percussive instruments, instrument recognition, harmonic analysis and chord transcription, and music structure analysis. For an overview of the topics, see Klapuri and Davy (2006). Fundamental frequency (F0) tracking of the melody and bass lines in polyphonic music signals was first considered by Goto (2000, 2004). Later, either F0 tracking or note-level transcription of the melody has been considered by Paiva, Mendes, and Cardoso (2005); Ellis and Poliner (2006); Dressler (2006); and Ryynanen and Klapuri (2006), and bass-line transcription has been addressed by Hainsworth and Macleod (2001) and Ryynanen and Klapuri (2007). Poliner et al. (2007) reported results on a comparative evaluation of melody transcription methods. Automatic chord transcription from audio has been considered by Sheh and Ellis (2003), Bello and Pickens (2005), and Harte and Sandler (2005).

Figure 2 shows a block diagram of the proposed method. An audio signal is processed frame-wise with two feature extractors: a pitch-salience estimator and an accent estimator that indicates potential note onsets based on signal energy. These features are used to compute observation likelihoods for target notes (i.e., melody or bass notes), other instrument notes, and silence or noise segments, each of which is modeled using a Hidden Markov Model (HMM). (See Rabiner and Juang 1993 for an introduction.) The musicological model estimates the musical key based on the pitch-salience function and then chooses between-note transition probabilities modeled with a note bigram. The Viterbi algorithm (Forney 1973) is used to find the optimal path through the target note models to produce a transcription.
[Figure 2: block diagram of the proposed method. Input: audio. The blocks include the note models; the output consists of the melody and bass notes, the chord labels, and the key signature.]
To summarize, the method incorporates both low-level acoustic modeling and high-level musicological modeling, and it produces discrete pitch labels and the beginning and ending times for the transcribed notes simultaneously.

The chord-transcription method uses a 24-state HMM consisting of twelve states for both major and minor triads. The observation likelihoods are computed by mapping the pitch saliences into a pitch-class representation and comparing them with trained profiles for major and minor chords. Probabilities of between-chord transitions are estimated from training data, and Viterbi decoding is used to find a path through the chord models.

The rest of this article explains the proposed method in more detail and presents evaluation results with quantitative comparison with other methods.
Feature Extraction

The front-end of the method consists of two frame-wise feature extractors: a pitch-salience estimator and an accent estimator. Pitch salience s_t(τ) measures the strength of fundamental period τ in analysis frame t, and the accent a_t measures spectral change from frame t − 1 to frame t, in practice indicating potential note onsets. Input signals are sampled at a rate of fs = 44.1 kHz, and stereo signals are downmixed to mono before the feature extraction.

Pitch Salience Estimation

The salience, or strength, of each F0 candidate is calculated as a weighted sum of the amplitudes of its harmonic partials in a spectrally whitened signal frame. The calculations are similar to those of Klapuri (2006) and are briefly explained here.

Spectral whitening, or flattening, is first applied to suppress timbral information and thereby make the estimation more robust against variation in the sound sources. Given one frame of the input signal x(n), the discrete Fourier transform X(k) is calculated after Hamming-windowing and zero-padding the frame to twice its length. Then, a band-pass filterbank is simulated in the frequency domain. Center frequencies of the subbands are distributed uniformly on the critical-band scale, f_c = 229 (10^((0.5c + 1)/21.4) − 1), and each subband c = 1, ..., 60 has a triangular power response extending from f_{c−2} to f_{c+2}. The power σ_c^2 of the signal within each subband c is calculated by applying the triangular response in the frequency domain and adding the resulting power-spectrum values within the band. Then, bandwise compression coefficients γ_c = σ_c^(ν−1) are calculated, where ν = 0.33.
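As a rough illustration of the whitening step, the following sketch computes the subband center frequencies on the critical-band scale and applies the bandwise compression to one frame. It is a simplified reading of the description above (and of Klapuri 2006), not the published implementation; in particular, the way the coefficients are spread across frequency bins is an assumption.

import numpy as np

def whiten_frame(x, fs=44100.0, nu=0.33, n_bands=60):
    """Spectrally whiten one analysis frame as described in the text (sketch)."""
    n = len(x)
    X = np.fft.rfft(x * np.hamming(n), 2 * n)           # Hamming window, zero-pad to 2x length
    freqs = np.fft.rfftfreq(2 * n, d=1.0 / fs)
    power = np.abs(X) ** 2

    def f_center(c):                                     # critical-band center frequencies
        return 229.0 * (10.0 ** ((0.5 * c + 1.0) / 21.4) - 1.0)

    gamma = np.zeros_like(freqs)                         # per-bin compression coefficients
    for c in range(1, n_bands + 1):
        lo, mid, hi = f_center(c - 2), f_center(c), f_center(c + 2)
        w = np.interp(freqs, [lo, mid, hi], [0.0, 1.0, 0.0], left=0.0, right=0.0)
        sigma2 = np.sum(w * power)                       # subband power sigma_c^2
        g_c = (np.sqrt(sigma2) + 1e-12) ** (nu - 1.0)    # gamma_c = sigma_c^(nu - 1)
        gamma = np.maximum(gamma, g_c * w)               # assumption: spread by the triangle
    return gamma * np.abs(X)                             # whitened magnitude spectrum

The salience of an F0 candidate would then be obtained from this whitened spectrum as a weighted sum of the amplitudes at its harmonic positions.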
[Figure: piano-roll panels of melody notes, with pitch shown as MIDI note numbers.]
of possible pitches. The set consists of MIDI notes {44, ..., 84}, i.e., A-flat 2 to C6, for the melody, and of notes {26, ..., 59}, i.e., D1 to B3, for the bass line. The acoustic models and their parameters do not depend on the note pitch n. This has the advantage that only one set of HMM parameters must be trained for each of the three models. However, the observation vectors o_{n,t} are specific to each note. These are obtained from the extracted features by selecting the maximum-salience fundamental period τ_{n,t} in a ±1 semitone range around the note n in frame t:

τ_{n,t} = arg max_i s_t(i),   i ∈ {τ : |F(τ) − n| ≤ 1}    (5)

The observation vector o_{n,t} is then defined as

o_{n,t} = [ΔF, s_t(τ_{n,t}), Δs_t(τ_{n,t}), a_t]^T    (6)

where ΔF = F(τ_{n,t}) − n is the distance between the pitch of the detected salience peak and the nominal pitch of the note, and s_t(τ_{n,t}) and Δs_t(τ_{n,t}) are the salience and the differential salience of τ_{n,t}. The accent value a_t does not depend on pitch but is common to all notes. Notice that the pitch-dependent features are obtained directly from the salience function in Equation 5. This is advantageous compared with the often-used approach of deciding a set of possible pitches within a frame already at the feature-extraction stage; here, the final decision on the transcribed pitches is postponed to the probabilistic models, and the feature extraction becomes considerably simpler and computationally more efficient.
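Read operationally, Equations 5 and 6 say: for each note n and frame t, pick the strongest salience peak within one semitone of the note's nominal pitch and stack four numbers into the observation vector. The sketch below assumes the salience values of the current and previous frame are given over a period grid, together with a mapping F(τ) to fractional MIDI pitch; taking the differential salience as a simple first difference across frames is also an assumption.

import numpy as np

def observation_vector(salience_t, salience_prev, accent_t, pitch_of_period, n):
    """Assemble o_{n,t} = [dF, s_t(tau), ds_t(tau), a_t] for MIDI note n (Eqs. 5 and 6)."""
    in_range = np.abs(pitch_of_period - n) <= 1.0         # periods within +/- 1 semitone of n
    if not np.any(in_range):
        return None                                       # no salience support for this note
    idx = np.flatnonzero(in_range)
    tau = idx[np.argmax(salience_t[idx])]                 # Eq. 5: maximum-salience period
    dF = pitch_of_period[tau] - n                         # deviation from the nominal pitch
    ds = salience_t[tau] - salience_prev[tau]             # differential salience (assumption)
    return np.array([dF, salience_t[tau], ds, accent_t])  # Eq. 6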
Training the Acoustic Models

The acoustic models are trained using the Real World Computing (RWC) database, which includes realistic musical recordings with manual annotations of the melody, the bass line, and the other instruments (Goto et al. 2002, 2003). For the time region of each annotated note n, the observation vectors given by Equation 6 constitute a training sequence for either the target-notes or the other-notes model. The HMM parameters are then obtained using the Baum-Welch algorithm (Rabiner 1989), where the observation-likelihood distributions are modeled with Gaussian Mixture Models (GMMs). Prior to the training, the features are normalized to have zero mean and unit variance over the training set.

The noise-or-silence model requires training sequences as well. Therefore, we generate random note events at positions of the time-pitch plane where there are no sounding notes in the reference annotation. The durations of the generated notes are sampled from a normal distribution with mean and variance calculated from the annotated notes in the song.
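As a stand-in for this training stage (the published system is a separate C++ implementation), the per-note observation sequences can be pooled, normalized, and fed to an off-the-shelf Baum-Welch trainer such as the hmmlearn package; the three-state topology follows the text, while the number of mixture components and iterations below are illustrative choices.

import numpy as np
from hmmlearn.hmm import GMMHMM

def train_note_model(sequences, n_states=3, n_mix=4):
    """Fit a note HMM with GMM observation densities from observation sequences.

    sequences: list of arrays of shape (frames, 4), one per annotated note."""
    X = np.vstack(sequences)
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
    X = (X - mean) / std                                  # zero mean, unit variance
    lengths = [len(s) for s in sequences]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)     # Baum-Welch re-estimation
    model.fit(X, lengths)
    return model, mean, std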
As already mentioned, the sequence of target notes is extracted by classifying all possible note pitches either as a target note, a note from another instrument, or as noise or silence. However, by definition there can be only up to one target note sounding at each time. This constraint is implemented as follows.

First, we find a state sequence b_{n,1:T} which best explains the feature vectors o_{n,1:T} of note n using only the other-notes and the noise-or-silence model:

b_{n,1:T} = arg max_{q_{1:T}} { P(q_1) P(o_{n,1} | q_1) Π_{t=2}^{T} P(q_t | q_{t−1}) P(o_{n,t} | q_t) }    (7)

Here, q_t ∈ {i_oth, j_ns}, where i, j ∈ {1, 2, 3}, meaning that the sequence b_{n,1:T} may visit the states of both the other-notes and the noise-or-silence model. Because we have a combination of the two models, we must allow switching between them by defining nonzero probabilities for P(q_t = j_oth | q_{t−1} = i_ns), where j = 1 and i ∈ {1, 2, 3}, as well as for P(q_t = j_ns | q_{t−1} = i_oth), where i = 3 and j ∈ {1, 2, 3}. The state sequence is found with the Viterbi algorithm (Forney 1973). This procedure is repeated for all considered note pitches n ∈ N.

Now we have the most likely explanation for all notes at all times without using the target-notes model. We define the background probability for note n in frame t as

β_{n,t} = P(o_{n,t} | b_{n,t}) P(b_{n,t} | b_{n,t−1})    (8)

The use of the other-notes and the noise-or-silence model is illustrated in Figure 4, where the continuous arrowed line shows the state sequence b_{n,1:7} solved from Equation 7. The example shows the evaluation of the background probability β_{n,3} in frame t = 3 using Equation 8.
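Equations 7 and 8 amount to a standard Viterbi pass over the combined six-state background model (three other-notes states plus three noise-or-silence states), after which the background probability is read off the winning path. The sketch below is a generic log-domain Viterbi over given transition and observation-likelihood arrays, not the authors' code, and it treats the initial-state probability as the transition term for the first frame.

import numpy as np

def background_path_and_prob(init, trans, obs_like):
    """Decode b_{n,1:T} (Eq. 7) and compute the background probabilities (Eq. 8).

    init: (S,) initial state probabilities; trans: (S, S) transition matrix;
    obs_like: (T, S) observation likelihoods P(o_{n,t} | state) for one note n."""
    T, S = obs_like.shape
    log_trans = np.log(trans + 1e-300)
    delta = np.log(init + 1e-300) + np.log(obs_like[0] + 1e-300)
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans                 # score of each predecessor state
        psi[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(obs_like[t] + 1e-300)
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                        # backtrack the best state sequence
        path[t] = psi[t + 1, path[t + 1]]
    beta = np.empty(T)                                    # Eq. 8 along the decoded path
    beta[0] = init[path[0]] * obs_like[0, path[0]]
    for t in range(1, T):
        beta[t] = obs_like[t, path[t]] * trans[path[t - 1], path[t]]
    return path, beta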
Next, we try to find a path through time for which the target-notes model gives a high likelihood and which is not well explained by the other models. The state space of this target path is larger than in Equation 7, because the path can visit the internal states j_tgt of any note n. We denote the state of the target path by a variable r_t, which determines the note n and the internal state j_tgt at time t. More exactly, r_t = [r_t(1), r_t(2)]^T, where r_t(1) ∈ N ∪ {R} and r_t(2) ∈ {1, 2, 3}. Here R denotes a rest state, where no target notes are sounding. In this case, the value of r_t(2) is meaningless.

To find the best overall explanation for all notes at all times, let us first assume that the notes are independent of each other. In this case, the overall probability given by the background model in frame t is Z_t = Π_n β_{n,t}. When the target path state r_t visits a note n at time t, the overall probability
[Figure 6: trained chord profiles (major and minor, low and high register) as a function of note degree.]

[Figure 7: estimated chord-transition probabilities.]
the chord roots C, D-flat, and so forth, and c_t(2) ∈ {maj, min} denotes the chord type. We want to find a path c_{1:T} through the chord HMM. For the twelve major chord states c_t, c_t(2) = maj, the chord observation log-likelihoods L(c_t) are calculated in a manner analogous to the key estimation:

L(c_t) = log( Σ_{d=0}^{11} C_maj^lo(d) · PCP_t^lo(mod(d + c_t(1), 12)) )
       + log( Σ_{d=0}^{11} C_maj^hi(d) · PCP_t^hi(mod(d + c_t(1), 12)) )    (17)

The calculation is exactly similar for the minor triads L(c_t), c_t(2) = min, except that the major profiles are replaced with the minor profiles.
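In code, Equation 17 is simply a pair of dot products between the frame's low- and high-register pitch-class profiles (rotated to the candidate root) and the trained chord profiles. The sketch below assumes these are all available as 12-dimensional arrays; it restates the equation rather than reproducing the authors' implementation.

import numpy as np

def chord_log_likelihood(pcp_lo, pcp_hi, profile_lo, profile_hi, root):
    """Observation log-likelihood L(c_t) of Eq. 17 for a chord with the given root.

    pcp_lo, pcp_hi: 12-dim pitch-class profiles of frame t (low and high register);
    profile_lo, profile_hi: trained 12-dim chord profiles (major or minor);
    root: c_t(1) in {0, ..., 11}, with 0 = C, 1 = D-flat, and so on."""
    d = np.arange(12)
    rotated = (d + root) % 12                             # mod(d + c_t(1), 12)
    return (np.log(np.dot(profile_lo, pcp_lo[rotated]) + 1e-300)
            + np.log(np.dot(profile_hi, pcp_hi[rotated]) + 1e-300))

Evaluating this for all twelve roots with the major profiles, and again with the minor profiles, gives the 24 per-frame log-likelihoods of the chord HMM.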
We also train a chord-transition bigram P(c_t | c_{t−1}). The transitions are independent of the key, so that only the chord type and the distance between the chord roots, mod(c_t(1) − c_{t−1}(1), 12), matter. For example, a transition from A minor, c_{t−1} = [9, min]^T, to F major, c_t = [5, maj]^T, is counted as a transition from a minor chord to a major chord with distance mod(5 − 9, 12) = 8. Figure 7 illustrates the estimated chord-transition probabilities.
[Figure 8: reference chords and transcribed chords as a function of time (sec); the vertical axis shows the chord labels (C, D, E, G, A, B and their minor counterparts).]
The probability of staying in the same chord is a free parameter that controls the amount of chord changes.
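Because the bigram is key independent, it can be stored as a small table indexed by the previous chord type, the next chord type, and the root distance. The indexing sketch below reproduces the A-minor-to-F-major example from the text; the tuple layout is an assumption made for illustration.

def transition_index(prev_chord, next_chord):
    """Map a chord pair to its key-independent bigram cell.

    Chords are (root, chord_type) with root in {0, ..., 11} and
    chord_type 0 for major, 1 for minor."""
    prev_root, prev_type = prev_chord
    next_root, next_type = next_chord
    distance = (next_root - prev_root) % 12               # mod(c_t(1) - c_{t-1}(1), 12)
    return prev_type, next_type, distance

# A minor (root 9) to F major (root 5): a minor-to-major transition with distance 8.
print(transition_index((9, 1), (5, 0)))                   # -> (1, 0, 8)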
Now, we have defined the log-likelihoods L(c_t) for all chords in each frame t and the chord-transition bigram P(c_t | c_{t−1}). The chord transcription is then obtained by finding an optimal path ĉ_{1:T} through the chord states:

ĉ_{1:T} = arg max_{c_{1:T}} { L(c_1) + Σ_{t=2}^{T} [ L(c_t) + log P(c_t | c_{t−1}) ] }    (18)

which is again found using the Viterbi algorithm. The initial probabilities for the chord states are uniform and are therefore omitted. The method does not detect silent segments but produces a chord label in each frame. Figure 8 shows the chord transcription for "With a Little Help From My Friends" by the Beatles.

Results

The proposed melody-, bass-, and chord-transcription methods are quantitatively evaluated using the databases described subsequently. For all evaluations, a two-fold cross validation is used. With a C++ implementation running on a 3.2-GHz Pentium 4 processor, the entire method takes about 19 sec to process 180 seconds of stereo audio. The feature extraction takes about 12 sec, and the melody and bass-line transcription take about 3 sec each. The key estimation and chord transcription take less than 0.1 sec. In addition, the method allows a causal implementation to process streaming audio in a manner described in Ryynanen and Klapuri (2007).

For the development and evaluation of the melody and bass-line transcription, we use the Real World Computing popular music and genre databases (Goto et al. 2002, 2003). The databases include a MIDI file for each song, which contains a manual annotation of the melody, the bass, and the other instrument notes, collectively referred to as the reference notes. MIDI notes for drums, percussive instruments, and sound effects are excluded from the reference notes. Some songs in the databases were not used due to unreliable synchronization between the MIDI annotation and the audio recording. Also, some songs do not include the melody or the bass line. Consequently, we used 130 full acoustic recordings for melody transcription: 92 pop songs (the RWC popular database) and 38 songs with varying styles (the RWC genre database). For bass-line transcription, we used 84 songs from RWC popular and 43 songs from RWC genre, altogether 127 recordings. This gives approximately 8.7 and 8.5 hours of music for the evaluation of melody and bass-line transcription, respectively. There are reference notes outside the reasonable transcription note range for both the melody (<0.1 percent of the melody reference notes) and the bass lines (1.8 percent of the notes). These notes are not used in training but are counted as transcription errors in testing.

Table 1. Melody and bass-line transcription results (in percent): recall rate R, precision rate P, F-measure F, and mean overlap ratio.

Melody          R     P     F     Mean overlap
RWC popular    60.5  49.4  53.8  61.1
RWC genre      41.7  50.3  42.9  55.8
Total          55.0  49.6  50.6  59.6

Bass line       R     P     F     Mean overlap
RWC popular    57.7  57.5  56.3  61.9
RWC genre      35.3  57.5  39.3  57.6
Total          50.1  57.5  50.6  60.4
As already mentioned, the chord-transcription method is evaluated using the first eight Beatles albums with the chord annotations provided by Harte and colleagues. The albums include 110 songs with approximately 4.6 hours of music. The reference major and minor chords cover approximately 75 percent and 20 percent of the audio, respectively. Chords that are not recognized by our method and the no-chord segments cover about 3 percent and 1 percent of the audio.

Melody and Bass-Line Transcription Results

The performance of melody and bass-line transcription is evaluated by counting correctly and incorrectly transcribed notes. We use the recall rate R and the precision rate P defined by

R = #(correctly transcribed notes) / #(reference notes)
P = #(correctly transcribed notes) / #(transcribed notes)    (19)

A reference note is correctly transcribed by a note in the transcription if their MIDI note numbers are equal, the absolute difference between their onset times is less than 150 msec, and the transcribed note is not already associated with another reference note. We use the F-measure F = 2RP/(R + P) to give an overall measure of performance. The temporal overlap ratio of a correctly transcribed note with the associated reference note is measured by ρ = (min{E} − max{B}) / (max{E} − min{B}), where the sets B and E contain the beginning and ending times of the two notes, respectively. The mean overlap ratio is obtained by averaging the ρ values over the correctly transcribed notes. The recall rate, the precision rate, the F-measure, and the mean overlap ratio are calculated separately for each recording, and then the average over all the recordings is reported.
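The matching criterion and the metrics above translate directly into code. The sketch below associates each transcribed note with at most one unmatched reference note of equal pitch and onset deviation below 150 msec, then computes R, P, F, and the mean overlap ratio; the first-fit association order is an assumption, since the article does not specify how ties are resolved.

def evaluate_notes(reference, transcribed, onset_tol=0.150):
    """Note-level recall, precision, F-measure (Eq. 19), and mean overlap ratio.

    reference, transcribed: lists of (midi_pitch, onset_sec, offset_sec)."""
    matched = set()
    overlaps = []
    for pitch, on, off in transcribed:
        for j, (r_pitch, r_on, r_off) in enumerate(reference):
            if j in matched or r_pitch != pitch or abs(on - r_on) > onset_tol:
                continue
            matched.add(j)                                # each reference note matched once
            b, e = (on, r_on), (off, r_off)               # beginning and ending times
            overlaps.append((min(e) - max(b)) / (max(e) - min(b)))
            break
    correct = len(matched)
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(transcribed) if transcribed else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if correct else 0.0
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return recall, precision, f_measure, mean_overlap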
Table 1 shows the melody and bass-line transcription results. Both the melody and the bass-line transcription achieve over 50 percent average F-measure. The performance on pop songs is clearly better than on the songs from various genres. This was expected, since the melody and bass lines are usually more prominent in pop music than in other genres, such as heavy rock or dance music. In addition, the RWC popular database includes only vocal melodies, whereas the RWC genre database also includes melodies performed with other instruments. The musicological model plays an important role in the method: the total F-measures drop to 40 percent for both melody and bass-line transcription if the note bigrams are replaced with uniform distributions.

For comparison, Ellis and Poliner kindly provided the pitch tracks produced by their melody-transcription method (Ellis and Poliner 2006) for the recordings in the RWC databases. Their method decides the note pitch for the melody in each frame whenever the frame is judged to be voiced. Briefly, the pitch classification in each frame is conducted using a one-versus-all, linear-kernel support vector machine (SVM). The voiced-unvoiced decision is based on energy thresholding, and the pitch track is smoothed using HMM post-processing. Because the Ellis-Poliner method does not produce segmented note events but rather a pitch track, we compare the methods using the frame-level evaluation metrics adopted for melody-extraction evaluation in the Music Information Retrieval Evaluation eXchange (Poliner et al. 2007). For this, the RWC reference MIDI note values are sampled every 10 msec to obtain a frame-level reference. A similar conversion is made for the melody notes produced by our method.

Table 2 shows the results for the proposed method and for the Ellis-Poliner method on the RWC databases. The overall accuracy denotes the proportion of frames with either a correct pitch label or a correct unvoiced decision, where the pitch label is correct if the absolute difference between the transcription and the reference is less than half a semitone. The raw pitch accuracy denotes the proportion of correct pitch labels to voiced frames in the reference. Voicing detection (Vc det) measures the proportion of correct voicing in the transcription to voiced frames in the reference. Voicing false alarm (Vc FA) measures the proportion of frames that are labeled as voiced in the transcription but are unvoiced in the reference. The voicing d′ (Vc d′) combines the voicing detection and voicing false alarm rates to describe the system's ability to discriminate the voiced and unvoiced frames. High voicing detection and low voicing false alarm give good discriminability with a high voicing d′ value (Duda, Hart, and Stork 2001). According to the overall accuracy and the voicing d′, the proposed method outperforms the Ellis-Poliner method in this evaluation. Their method classifies most of the frames as voiced, resulting in a high voicing-detection rate but also high voicing false-alarm rates.
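The voicing d′ used here is the usual discriminability index from signal-detection theory: the difference between the normal-deviate (z-score) transforms of the voicing detection rate and the voicing false-alarm rate (Duda, Hart, and Stork 2001). A minimal computation, assuming SciPy for the inverse normal CDF:

from scipy.stats import norm

def voicing_d_prime(voicing_detection, voicing_false_alarm, eps=1e-6):
    """d' = z(detection rate) - z(false-alarm rate), with the rates clipped
    away from 0 and 1 so the inverse normal CDF stays finite."""
    hit = min(max(voicing_detection, eps), 1.0 - eps)
    fa = min(max(voicing_false_alarm, eps), 1.0 - eps)
    return norm.ppf(hit) - norm.ppf(fa)

# Example: 90 percent voicing detection with 30 percent false alarms.
print(voicing_d_prime(0.90, 0.30))                         # about 1.81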
Chord Transcription Results

The chord-transcription method is evaluated by comparing the transcribed chords with the reference chords frame by frame. For method comparison, Bello and Pickens kindly provided the outputs of their chord-transcription method (Bello and Pickens 2005) on the Beatles data. As a framework, their method
Conclusions

We proposed a method for the automatic transcription of melody, bass line, and chords in polyphonic music. The method consists of frame-wise pitch-salience estimation, acoustic modeling, and musicological modeling. The transcription accuracy was evaluated using several hours of realistic music, and direct comparisons to state-of-the-art methods were provided. Using quite straightforward time quantization, common musical notation such as that shown in Figure 1 can be produced. In addition, the statistical models can be easily retrained for different target materials. Future work includes using timbre and metrical analysis to improve melody and bass-line transcription, and a more detailed chord analysis method.

The transcription results are already useful for several applications. The proposed method has been integrated into a music-transcription tool with a graphical user interface and MIDI editing capabilities, and the melody transcription has been successfully applied in a query-by-humming system (Ryynanen and Klapuri 2008), for example. Only a few years ago, the authors considered the automatic transcription of commercial music recordings as a very difficult problem. However, rapid development of transcription methods and the latest results have demonstrated that feasible solutions are possible. We believe that the proposed method with melody,

References

Bello, J. P., and J. Pickens. 2005. "A Robust Mid-level Representation for Harmonic Content in Music Signals." Proceedings of the 6th International Conference on Music Information Retrieval. London: Queen Mary, University of London, pp. 304-311.

Dressler, K. 2006. "An Auditory Streaming Approach on Melody Extraction." MIREX Audio Melody Extraction Contest Abstracts, MIREX06 extended abstract. London: Queen Mary, University of London. Available online at www.music-ir.org/evaluation/MIREX/2006_abstracts/AME_dressler.pdf.

Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.

Ellis, D., and G. Poliner. 2006. "Classification-Based Melody Transcription." Machine Learning Journal 65(2-3):439-456.

Forney, G. D. 1973. "The Viterbi Algorithm." Proceedings of the IEEE 61(3):268-278.

Goto, M. 2000. "A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recordings." Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: Institute for Electrical and Electronics Engineers, pp. 757-760.

Goto, M. 2004. "A Real-Time Music-Scene-Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-World Audio Signals." Speech Communication 43(4):311-329.

Goto, M., et al. 2002. "RWC Music Database: Popular, Classical, and Jazz Music Databases." Proceedings of