Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music

Matti Ryynanen and Anssi Klapuri
Department of Signal Processing
Tampere University of Technology
P.O. Box 553, FI-33101 Tampere, Finland
{matti.ryynanen, anssi.klapuri}@tut.fi
This article proposes a method for the automatic transcription of the melody, bass line, and chords in polyphonic pop music. The method uses a frame-wise pitch-salience estimator as a feature extraction front-end. For the melody and bass-line transcription, this is followed by acoustic modeling of note events and musicological modeling of note transitions. The acoustic models include a model for the target notes (i.e., melody or bass notes) and a background model. The musicological model involves key estimation and note bigrams that determine probabilities for transitions between target notes. A transcription of the melody or the bass line is obtained using Viterbi search via the target and the background note models. The performance of the melody and the bass-line transcription is evaluated using approximately 8.5 hours of realistic polyphonic music. The chord transcription maps the pitch salience estimates to a pitch-class representation and uses trained chord models and chord-transition probabilities to produce a transcription consisting of major and minor triads. For chords, the evaluation material consists of the first eight Beatles albums. The method is computationally efficient and allows causal implementation, so it can process streaming audio.

Transcription of music refers to the analysis of an acoustic music signal for producing a parametric representation of the signal. The representation may be a music score with a meticulous arrangement for each instrument or an approximate description of the melody and chords in the piece, for example. The latter type of transcription is commonly used in commercial songbooks of pop music and is usually sufficient for musicians or music hobbyists to play the piece. On the other hand, more detailed transcriptions are often employed in classical music to preserve the exact arrangement of the composer.

We propose a method for the automatic transcription of the melody, bass line, and chords in pop-music recordings. Conventionally, these tasks have been carried out by trained musicians who listen to a piece of music and write down the notes or chords by hand, which is time-consuming and requires musical training. A machine transcriber enables several applications. First, it provides an easy way of obtaining a description of a music recording, allowing musicians to play it. Second, the produced transcriptions may be used in music analysis, music information retrieval (MIR) from large music databases, content-based audio processing, and interactive music systems, for example.

A note is here defined by a discrete pitch, an onset time, and a duration. The melody of a piece is an organized sequence of consecutive notes and rests, usually performed by a lead singer or by a solo instrument. More informally, the melody is the part one often hums when listening to a piece. The bass line consists of notes in a lower pitch register and is usually played with a bass guitar, a double bass, or a bass synthesizer. A chord is a combination of notes that sound simultaneously or nearly simultaneously. In pop music, these concepts are usually rather unambiguous.

Figure 1 shows the waveform of an example music signal and two different representations of its melody, bass line, and chords. The middle panels show a piano-roll representation of the melody and the bass notes, respectively. Notes in this representation can be compactly saved in a MIDI file. The lowest panel represents the same notes and the chords in common musical notation, where the note onsets and durations are indicated by discrete symbols. The proposed method produces a piano-roll representation of the melody and the bass line together with chord labels. If desired, the note timings can be further quantized to obtain common music notation (Hainsworth 2006; Whiteley, Cemgil, and Godsill 2006).
Work on polyphonic music transcription dates back more than 30 years (Moorer 1977). Nowadays, the concept of automatic music transcription includes several topics, such as multi-pitch analysis, beat tracking and rhythm analysis, transcription of percussive instruments, instrument recognition, harmonic analysis and chord transcription, and music structure analysis. For an overview of the topics, see Klapuri and Davy (2006). Fundamental frequency (F0) tracking of the melody and bass lines in polyphonic music signals was first considered by Goto (2000, 2004). Later, either F0 tracking or note-level transcription of the melody has been considered by Paiva, Mendes, and Cardoso (2005); Ellis and Poliner (2006); Dressler (2006); and Ryynanen and Klapuri (2006), and bass-line transcription has been addressed by Hainsworth and Macleod (2001) and Ryynanen and Klapuri (2007). Poliner et al. (2007) reported results on a comparative evaluation of melody transcription methods. Automatic chord transcription from audio has been considered by Sheh and Ellis (2003), Bello and Pickens (2005), and Harte and Sandler (2005).

Figure 2 shows a block diagram of the proposed method. An audio signal is processed frame-wise with two feature extractors: a pitch-salience estimator and an accent estimator that indicates potential note onsets based on signal energy. These features are used to compute observation likelihoods for target notes (i.e., melody or bass notes), other instrument notes, and silence or noise segments, each of which is modeled using a Hidden Markov Model (HMM). (See Rabiner and Juang 1993 for an introduction.) The musicological model estimates the musical key based on the pitch-salience function and then chooses between-note transition probabilities modeled with a note bigram. The Viterbi algorithm (Forney 1973) is used to find the optimal path through the target note models to produce a transcription.
[Figure 2: block diagram of the proposed method. Input: audio. The blocks include the note models; the output consists of the melody and bass notes, the chord labels, and the key signature.]
To summarize, the method incorporates both low-level acoustic modeling and high-level musicological modeling, and it produces discrete pitch labels and the beginning and ending times for the transcribed notes simultaneously.

The chord-transcription method uses a 24-state HMM consisting of twelve states for both major and minor triads. The observation likelihoods are computed by mapping the pitch saliences into a pitch-class representation and comparing them with trained profiles for major and minor chords. Probabilities of between-chord transitions are estimated from training data, and Viterbi decoding is used to find a path through the chord models.

The rest of this article explains the proposed method in more detail and presents evaluation results with quantitative comparison with other methods.
Feature Extraction

The front-end of the method consists of two frame-wise feature extractors: a pitch-salience estimator and an accent estimator. Pitch salience s_t(τ) measures the strength of fundamental period τ in analysis frame t, and the accent a_t measures spectral change from frame t − 1 to frame t, in practice indicating potential note onsets. Input signals are sampled at a rate of fs = 44.1 kHz, and stereo signals are downmixed to mono before the feature extraction.

Pitch Salience Estimation

The salience, or strength, of each F0 candidate is calculated as a weighted sum of the amplitudes of its harmonic partials in a spectrally whitened signal frame. The calculations are similar to those of Klapuri (2006) and are briefly explained here.

Spectral whitening, or flattening, is first applied to suppress timbral information and thereby make the estimation more robust against variation in the sound sources. Given one frame of the input signal x(n), the discrete Fourier transform X(k) is calculated after Hamming-windowing and zero-padding the frame to twice its length. Then, a band-pass filterbank is simulated in the frequency domain. Center frequencies of the subbands are distributed uniformly on the critical-band scale, f_c = 229 (10^((0.5c + 1)/21.4) − 1), and each subband c = 1, ..., 60 has a triangular power response extending from f_{c−2} to f_{c+2}. The power σ_c^2 of the signal within each subband c is calculated by applying the triangular response in the frequency domain and adding the resulting power-spectrum values within the band. Then, bandwise compression coefficients γ_c = σ_c^(ν−1) are calculated, where ν = 0.33.
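As a rough illustration of the whitening step, the following sketch computes the subband center frequencies on the critical-band scale and applies the bandwise compression to one frame. It is a simplified reading of the description above (and of Klapuri 2006), not the published implementation; in particular, the way the coefficients are spread across frequency bins is an assumption.

import numpy as np

def whiten_frame(x, fs=44100.0, nu=0.33, n_bands=60):
    """Spectrally whiten one analysis frame as described in the text (sketch)."""
    n = len(x)
    X = np.fft.rfft(x * np.hamming(n), 2 * n)           # Hamming window, zero-pad to 2x length
    freqs = np.fft.rfftfreq(2 * n, d=1.0 / fs)
    power = np.abs(X) ** 2

    def f_center(c):                                     # critical-band center frequencies
        return 229.0 * (10.0 ** ((0.5 * c + 1.0) / 21.4) - 1.0)

    gamma = np.zeros_like(freqs)                         # per-bin compression coefficients
    for c in range(1, n_bands + 1):
        lo, mid, hi = f_center(c - 2), f_center(c), f_center(c + 2)
        w = np.interp(freqs, [lo, mid, hi], [0.0, 1.0, 0.0], left=0.0, right=0.0)
        sigma2 = np.sum(w * power)                       # subband power sigma_c^2
        g_c = (np.sqrt(sigma2) + 1e-12) ** (nu - 1.0)    # gamma_c = sigma_c^(nu - 1)
        gamma = np.maximum(gamma, g_c * w)               # assumption: spread by the triangle
    return gamma * np.abs(X)                             # whitened magnitude spectrum

The salience of an F0 candidate would then be obtained from this whitened spectrum as a weighted sum of the amplitudes at its harmonic positions.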
[Figure: piano-roll panels of melody notes, with pitch shown as MIDI note numbers.]
of possible pitches. The set consists of MIDI notes {44, ..., 84}, i.e., A-flat 2 to C6, for the melody, and of notes {26, ..., 59}, i.e., D1 to B3, for the bass line. The acoustic models and their parameters do not depend on the note pitch n. This has the advantage that only one set of HMM parameters must be trained for each of the three models. However, the observation vectors o_{n,t} are specific to each note. These are obtained from the extracted features by selecting the maximum-salience fundamental period τ_{n,t} in a ±1 semitone range around the note n in frame t:

τ_{n,t} = arg max_i s_t(i),   i ∈ {τ : |F(τ) − n| ≤ 1}    (5)

The observation vector o_{n,t} is then defined as

o_{n,t} = [ΔF, s_t(τ_{n,t}), Δs_t(τ_{n,t}), a_t]^T    (6)

where ΔF = F(τ_{n,t}) − n is the distance between the pitch of the detected salience peak and the nominal pitch of the note, and s_t(τ_{n,t}) and Δs_t(τ_{n,t}) are the salience and the differential salience of τ_{n,t}. The accent value a_t does not depend on pitch but is common to all notes. Notice that the pitch-dependent features are obtained directly from the salience function in Equation 5. This is advantageous compared with the often-used approach of deciding a set of possible pitches within a frame already at the feature-extraction stage; here, the final decision on the transcribed pitches is postponed to the probabilistic models, and the feature extraction becomes considerably simpler and computationally more efficient.
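Read operationally, Equations 5 and 6 say: for each note n and frame t, pick the strongest salience peak within one semitone of the note's nominal pitch and stack four numbers into the observation vector. The sketch below assumes the salience values of the current and previous frame are given over a period grid, together with a mapping F(τ) to fractional MIDI pitch; taking the differential salience as a simple first difference across frames is also an assumption.

import numpy as np

def observation_vector(salience_t, salience_prev, accent_t, pitch_of_period, n):
    """Assemble o_{n,t} = [dF, s_t(tau), ds_t(tau), a_t] for MIDI note n (Eqs. 5 and 6)."""
    in_range = np.abs(pitch_of_period - n) <= 1.0         # periods within +/- 1 semitone of n
    if not np.any(in_range):
        return None                                       # no salience support for this note
    idx = np.flatnonzero(in_range)
    tau = idx[np.argmax(salience_t[idx])]                 # Eq. 5: maximum-salience period
    dF = pitch_of_period[tau] - n                         # deviation from the nominal pitch
    ds = salience_t[tau] - salience_prev[tau]             # differential salience (assumption)
    return np.array([dF, salience_t[tau], ds, accent_t])  # Eq. 6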
Training the Acoustic Models

The acoustic models are trained using the Real World Computing (RWC) database, which includes realistic musical recordings with manual annotations of the melody, the bass line, and the other instruments (Goto et al. 2002, 2003). For the time region of each annotated note n, the observation vectors given by Equation 6 constitute a training sequence for either the target-notes or the other-notes model. The HMM parameters are then obtained using the Baum-Welch algorithm (Rabiner 1989), where the observation-likelihood distributions are modeled with Gaussian Mixture Models (GMMs). Prior to the training, the features are normalized to have zero mean and unit variance over the training set.

The noise-or-silence model requires training sequences as well. Therefore, we generate random note events at positions of the time-pitch plane where there are no sounding notes in the reference annotation. The durations of the generated notes are sampled from a normal distribution with mean and variance calculated from the annotated notes in the song.
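As a stand-in for this training stage (the published system is a separate C++ implementation), the per-note observation sequences can be pooled, normalized, and fed to an off-the-shelf Baum-Welch trainer such as the hmmlearn package; the three-state topology follows the text, while the number of mixture components and iterations below are illustrative choices.

import numpy as np
from hmmlearn.hmm import GMMHMM

def train_note_model(sequences, n_states=3, n_mix=4):
    """Fit a note HMM with GMM observation densities from observation sequences.

    sequences: list of arrays of shape (frames, 4), one per annotated note."""
    X = np.vstack(sequences)
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
    X = (X - mean) / std                                  # zero mean, unit variance
    lengths = [len(s) for s in sequences]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)     # Baum-Welch re-estimation
    model.fit(X, lengths)
    return model, mean, std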
As already mentioned, the sequence of target notes is extracted by classifying all possible note pitches either as a target note, a note from another instrument, or as noise or silence. However, by definition there can be only up to one target note sounding at each time. This constraint is implemented as follows.

First, we find a state sequence b_{n,1:T} which best explains the feature vectors o_{n,1:T} of note n using only the other-notes and the noise-or-silence model:

b_{n,1:T} = arg max_{q_{1:T}} { P(q_1) P(o_{n,1} | q_1) Π_{t=2}^{T} P(q_t | q_{t−1}) P(o_{n,t} | q_t) }    (7)

Here, q_t ∈ {i_oth, j_ns}, where i, j ∈ {1, 2, 3}, meaning that the sequence b_{n,1:T} may visit the states of both the other-notes and the noise-or-silence model. Because we have a combination of the two models, we must allow switching between them by defining nonzero probabilities for P(q_t = j_oth | q_{t−1} = i_ns), where j = 1 and i ∈ {1, 2, 3}, as well as for P(q_t = j_ns | q_{t−1} = i_oth), where i = 3 and j ∈ {1, 2, 3}. The state sequence is found with the Viterbi algorithm (Forney 1973). This procedure is repeated for all considered note pitches n ∈ N.

Now we have the most likely explanation for all notes at all times without using the target-notes model. We define the background probability for note n in frame t as

β_{n,t} = P(o_{n,t} | b_{n,t}) P(b_{n,t} | b_{n,t−1})    (8)

The use of the other-notes and the noise-or-silence model is illustrated in Figure 4, where the continuous arrowed line shows the state sequence b_{n,1:7} solved from Equation 7. The example shows the evaluation of the background probability β_{n,3} in frame t = 3 using Equation 8.
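Equations 7 and 8 amount to a standard Viterbi pass over the combined six-state background model (three other-notes states plus three noise-or-silence states), after which the background probability is read off the winning path. The sketch below is a generic log-domain Viterbi over given transition and observation-likelihood arrays, not the authors' code, and it treats the initial-state probability as the transition term for the first frame.

import numpy as np

def background_path_and_prob(init, trans, obs_like):
    """Decode b_{n,1:T} (Eq. 7) and compute the background probabilities (Eq. 8).

    init: (S,) initial state probabilities; trans: (S, S) transition matrix;
    obs_like: (T, S) observation likelihoods P(o_{n,t} | state) for one note n."""
    T, S = obs_like.shape
    log_trans = np.log(trans + 1e-300)
    delta = np.log(init + 1e-300) + np.log(obs_like[0] + 1e-300)
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans                 # score of each predecessor state
        psi[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(obs_like[t] + 1e-300)
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                        # backtrack the best state sequence
        path[t] = psi[t + 1, path[t + 1]]
    beta = np.empty(T)                                    # Eq. 8 along the decoded path
    beta[0] = init[path[0]] * obs_like[0, path[0]]
    for t in range(1, T):
        beta[t] = obs_like[t, path[t]] * trans[path[t - 1], path[t]]
    return path, beta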
Next, we try to find a path through time for which the target-notes model gives a high likelihood and which is not well explained by the other models. The state space of this target path is larger than in Equation 7, because the path can visit the internal states j_tgt of any note n. We denote the state of the target path by a variable r_t, which determines the note n and the internal state j_tgt at time t. More exactly, r_t = [r_t(1), r_t(2)]^T, where r_t(1) ∈ N ∪ {R} and r_t(2) ∈ {1, 2, 3}. Here R denotes a rest state, where no target notes are sounding. In this case, the value of r_t(2) is meaningless.

To find the best overall explanation for all notes at all times, let us first assume that the notes are independent of each other. In this case, the overall probability given by the background model in frame t is Z_t = Π_n β_{n,t}. When the target path state r_t visits a note n at time t, the overall probability
[Figure 6: trained chord profiles (major and minor, low and high register) as a function of note degree.]

[Figure 7: estimated chord-transition probabilities.]
the chord roots C, D-flat, and so forth, and c_t(2) ∈ {maj, min} denotes the chord type. We want to find a path c_{1:T} through the chord HMM. For the twelve major chord states c_t, c_t(2) = maj, the chord observation log-likelihoods L(c_t) are calculated in a manner analogous to the key estimation:

L(c_t) = log( Σ_{d=0}^{11} C_maj^lo(d) · PCP_t^lo(mod(d + c_t(1), 12)) )
       + log( Σ_{d=0}^{11} C_maj^hi(d) · PCP_t^hi(mod(d + c_t(1), 12)) )    (17)

The calculation is exactly similar for the minor triads L(c_t), c_t(2) = min, except that the major profiles are replaced with the minor profiles.
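In code, Equation 17 is simply a pair of dot products between the frame's low- and high-register pitch-class profiles (rotated to the candidate root) and the trained chord profiles. The sketch below assumes these are all available as 12-dimensional arrays; it restates the equation rather than reproducing the authors' implementation.

import numpy as np

def chord_log_likelihood(pcp_lo, pcp_hi, profile_lo, profile_hi, root):
    """Observation log-likelihood L(c_t) of Eq. 17 for a chord with the given root.

    pcp_lo, pcp_hi: 12-dim pitch-class profiles of frame t (low and high register);
    profile_lo, profile_hi: trained 12-dim chord profiles (major or minor);
    root: c_t(1) in {0, ..., 11}, with 0 = C, 1 = D-flat, and so on."""
    d = np.arange(12)
    rotated = (d + root) % 12                             # mod(d + c_t(1), 12)
    return (np.log(np.dot(profile_lo, pcp_lo[rotated]) + 1e-300)
            + np.log(np.dot(profile_hi, pcp_hi[rotated]) + 1e-300))

Evaluating this for all twelve roots with the major profiles, and again with the minor profiles, gives the 24 per-frame log-likelihoods of the chord HMM.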
We also train a chord-transition bigram P(c_t | c_{t−1}). The transitions are independent of the key, so that only the chord type and the distance between the chord roots, mod(c_t(1) − c_{t−1}(1), 12), matter. For example, a transition from A minor, c_{t−1} = [9, min]^T, to F major, c_t = [5, maj]^T, is counted as a transition from a minor chord to a major chord with distance mod(5 − 9, 12) = 8. Figure 7 illustrates the estimated chord-transition probabilities.
[Figure 8: reference chords and transcribed chords as a function of time (sec); the vertical axis shows the chord labels (C, D, E, G, A, B and their minor counterparts).]
The probability of staying in the same chord is a free parameter that controls the amount of chord changes.
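Because the bigram is key independent, it can be stored as a small table indexed by the previous chord type, the next chord type, and the root distance. The indexing sketch below reproduces the A-minor-to-F-major example from the text; the tuple layout is an assumption made for illustration.

def transition_index(prev_chord, next_chord):
    """Map a chord pair to its key-independent bigram cell.

    Chords are (root, chord_type) with root in {0, ..., 11} and
    chord_type 0 for major, 1 for minor."""
    prev_root, prev_type = prev_chord
    next_root, next_type = next_chord
    distance = (next_root - prev_root) % 12               # mod(c_t(1) - c_{t-1}(1), 12)
    return prev_type, next_type, distance

# A minor (root 9) to F major (root 5): a minor-to-major transition with distance 8.
print(transition_index((9, 1), (5, 0)))                   # -> (1, 0, 8)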
Now, we have defined the log-likelihoods L(c_t) for all chords in each frame t and the chord-transition bigram P(c_t | c_{t−1}). The chord transcription is then obtained by finding an optimal path ĉ_{1:T} through the chord states:

ĉ_{1:T} = arg max_{c_{1:T}} { L(c_1) + Σ_{t=2}^{T} [ L(c_t) + log P(c_t | c_{t−1}) ] }    (18)

which is again found using the Viterbi algorithm. The initial probabilities for the chord states are uniform and are therefore omitted. The method does not detect silent segments but produces a chord label in each frame. Figure 8 shows the chord transcription for "With a Little Help From My Friends" by the Beatles.

Results

The proposed melody-, bass-, and chord-transcription methods are quantitatively evaluated using the databases described subsequently. For all evaluations, a two-fold cross validation is used. With a C++ implementation running on a 3.2-GHz Pentium 4 processor, the entire method takes about 19 sec to process 180 seconds of stereo audio. The feature extraction takes about 12 sec, and the melody and bass-line transcription take about 3 sec each. The key estimation and chord transcription take less than 0.1 sec. In addition, the method allows a causal implementation to process streaming audio in a manner described in Ryynanen and Klapuri (2007).

For the development and evaluation of the melody and bass-line transcription, we use the Real World Computing popular music and genre databases (Goto et al. 2002, 2003). The databases include a MIDI file for each song, which contains a manual annotation of the melody, the bass, and the other instrument notes, collectively referred to as the reference notes. MIDI notes for drums, percussive instruments, and sound effects are excluded from the reference notes. Some songs in the databases were not used due to unreliable synchronization between the MIDI annotation and the audio recording. Also, some songs do not include the melody or the bass line. Consequently, we used 130 full acoustic recordings for melody transcription: 92 pop songs (the RWC popular database) and 38 songs with varying styles (the RWC genre database). For bass-line transcription, we used 84 songs from RWC popular and 43 songs from RWC genre, altogether 127 recordings. This gives approximately 8.7 and 8.5 hours of music for the evaluation of melody and bass-line transcription, respectively. There are reference notes outside the reasonable transcription note range for both the melody (<0.1 percent of the melody reference notes) and the bass lines (1.8 percent of the notes). These notes are not used in training but are counted as transcription errors in testing.

Table 1. Melody and bass-line transcription results (in percent): recall rate R, precision rate P, F-measure F, and mean overlap ratio.

Melody          R     P     F     Mean overlap
RWC popular    60.5  49.4  53.8  61.1
RWC genre      41.7  50.3  42.9  55.8
Total          55.0  49.6  50.6  59.6

Bass line       R     P     F     Mean overlap
RWC popular    57.7  57.5  56.3  61.9
RWC genre      35.3  57.5  39.3  57.6
Total          50.1  57.5  50.6  60.4
As already mentioned, the chord-transcription method is evaluated using the first eight Beatles albums with the chord annotations provided by Harte and colleagues. The albums include 110 songs with approximately 4.6 hours of music. The reference major and minor chords cover approximately 75 percent and 20 percent of the audio, respectively. Chords that are not recognized by our method and the no-chord segments cover about 3 percent and 1 percent of the audio.

Melody and Bass-Line Transcription Results

The performance of melody and bass-line transcription is evaluated by counting correctly and incorrectly transcribed notes. We use the recall rate R and the precision rate P defined by

R = #(correctly transcribed notes) / #(reference notes)
P = #(correctly transcribed notes) / #(transcribed notes)    (19)

A reference note is correctly transcribed by a note in the transcription if their MIDI note numbers are equal, the absolute difference between their onset times is less than 150 msec, and the transcribed note is not already associated with another reference note. We use the F-measure F = 2RP/(R + P) to give an overall measure of performance. The temporal overlap ratio of a correctly transcribed note with the associated reference note is measured by ρ = (min{E} − max{B}) / (max{E} − min{B}), where the sets B and E contain the beginning and ending times of the two notes, respectively. The mean overlap ratio is obtained by averaging the ρ values over the correctly transcribed notes. The recall rate, the precision rate, the F-measure, and the mean overlap ratio are calculated separately for each recording, and then the average over all the recordings is reported.
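The matching criterion and the metrics above translate directly into code. The sketch below associates each transcribed note with at most one unmatched reference note of equal pitch and onset deviation below 150 msec, then computes R, P, F, and the mean overlap ratio; the first-fit association order is an assumption, since the article does not specify how ties are resolved.

def evaluate_notes(reference, transcribed, onset_tol=0.150):
    """Note-level recall, precision, F-measure (Eq. 19), and mean overlap ratio.

    reference, transcribed: lists of (midi_pitch, onset_sec, offset_sec)."""
    matched = set()
    overlaps = []
    for pitch, on, off in transcribed:
        for j, (r_pitch, r_on, r_off) in enumerate(reference):
            if j in matched or r_pitch != pitch or abs(on - r_on) > onset_tol:
                continue
            matched.add(j)                                # each reference note matched once
            b, e = (on, r_on), (off, r_off)               # beginning and ending times
            overlaps.append((min(e) - max(b)) / (max(e) - min(b)))
            break
    correct = len(matched)
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(transcribed) if transcribed else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if correct else 0.0
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return recall, precision, f_measure, mean_overlap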
Table 1 shows the melody and bass-line transcription results. Both the melody and the bass-line transcription achieve over 50 percent average F-measure. The performance on pop songs is clearly better than on the songs from various genres. This was expected, since the melody and bass lines are usually more prominent in pop music than in other genres, such as heavy rock or dance music. In addition, the RWC popular database includes only vocal melodies, whereas the RWC genre database also includes melodies performed with other instruments. The musicological model plays an important role in the method: the total F-measures drop to 40 percent for both melody and bass-line transcription if the note bigrams are replaced with uniform distributions.

For comparison, Ellis and Poliner kindly provided the pitch tracks produced by their melody-transcription method (Ellis and Poliner 2006) for the recordings in the RWC databases. Their method decides the note pitch for the melody in each frame whenever the frame is judged to be voiced. Briefly, the pitch classification in each frame is conducted using a one-versus-all, linear-kernel support vector machine (SVM). The voiced-unvoiced decision is based on energy thresholding, and the pitch track is smoothed using HMM post-processing. Because the Ellis-Poliner method does not produce segmented note events but rather a pitch track, we compare the methods using the frame-level evaluation metrics adopted for melody-extraction evaluation in the Music Information Retrieval Evaluation eXchange (Poliner et al. 2007). For this, the RWC reference MIDI note values are sampled every 10 msec to obtain a frame-level reference. A similar conversion is made for the melody notes produced by our method.

Table 2 shows the results for the proposed method and for the Ellis-Poliner method on the RWC databases. The overall accuracy denotes the proportion of frames with either a correct pitch label or a correct unvoiced decision, where the pitch label is correct if the absolute difference between the transcription and the reference is less than half a semitone. The raw pitch accuracy denotes the proportion of correct pitch labels to voiced frames in the reference. Voicing detection (Vc det) measures the proportion of correct voicing in the transcription to voiced frames in the reference. Voicing false alarm (Vc FA) measures the proportion of frames that are labeled as voiced in the transcription but are unvoiced in the reference. The voicing d′ (Vc d′) combines the voicing detection and voicing false alarm rates to describe the system's ability to discriminate the voiced and unvoiced frames. High voicing detection and low voicing false alarm give good discriminability with a high voicing d′ value (Duda, Hart, and Stork 2001). According to the overall accuracy and the voicing d′, the proposed method outperforms the Ellis-Poliner method in this evaluation. Their method classifies most of the frames as voiced, resulting in a high voicing-detection rate but also high voicing false-alarm rates.
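The voicing d′ used here is the usual discriminability index from signal-detection theory: the difference between the normal-deviate (z-score) transforms of the voicing detection rate and the voicing false-alarm rate (Duda, Hart, and Stork 2001). A minimal computation, assuming SciPy for the inverse normal CDF:

from scipy.stats import norm

def voicing_d_prime(voicing_detection, voicing_false_alarm, eps=1e-6):
    """d' = z(detection rate) - z(false-alarm rate), with the rates clipped
    away from 0 and 1 so the inverse normal CDF stays finite."""
    hit = min(max(voicing_detection, eps), 1.0 - eps)
    fa = min(max(voicing_false_alarm, eps), 1.0 - eps)
    return norm.ppf(hit) - norm.ppf(fa)

# Example: 90 percent voicing detection with 30 percent false alarms.
print(voicing_d_prime(0.90, 0.30))                         # about 1.81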
Chord Transcription Results

The chord-transcription method is evaluated by comparing the transcribed chords with the reference chords frame by frame. For method comparison, Bello and Pickens kindly provided the outputs of their chord-transcription method (Bello and Pickens 2005) on the Beatles data. As a framework, their method
Conclusions

We proposed a method for the automatic transcription of melody, bass line, and chords in polyphonic music. The method consists of frame-wise pitch-salience estimation, acoustic modeling, and musicological modeling. The transcription accuracy was evaluated using several hours of realistic music, and direct comparisons to state-of-the-art methods were provided. Using quite straightforward time quantization, common musical notation such as that shown in Figure 1 can be produced. In addition, the statistical models can be easily retrained for different target materials. Future work includes using timbre and metrical analysis to improve melody and bass-line transcription, and a more detailed chord analysis method.

The transcription results are already useful for several applications. The proposed method has been integrated into a music-transcription tool with a graphical user interface and MIDI editing capabilities, and the melody transcription has been successfully applied in a query-by-humming system (Ryynanen and Klapuri 2008), for example. Only a few years ago, the authors considered the automatic transcription of commercial music recordings as a very difficult problem. However, rapid development of transcription methods and the latest results have demonstrated that feasible solutions are possible. We believe that the proposed method with melody,

References

Bello, J. P., and J. Pickens. 2005. "A Robust Mid-level Representation for Harmonic Content in Music Signals." Proceedings of the 6th International Conference on Music Information Retrieval. London: Queen Mary, University of London, pp. 304-311.

Dressler, K. 2006. "An Auditory Streaming Approach on Melody Extraction." MIREX Audio Melody Extraction Contest Abstracts, MIREX06 extended abstract. London: Queen Mary, University of London. Available online at www.music-ir.org/evaluation/MIREX/2006_abstracts/AME_dressler.pdf.

Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.

Ellis, D., and G. Poliner. 2006. "Classification-Based Melody Transcription." Machine Learning Journal 65(2-3):439-456.

Forney, G. D. 1973. "The Viterbi Algorithm." Proceedings of the IEEE 61(3):268-278.

Goto, M. 2000. "A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recordings." Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: Institute for Electrical and Electronics Engineers, pp. 757-760.

Goto, M. 2004. "A Real-Time Music-Scene-Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-World Audio Signals." Speech Communication 43(4):311-329.

Goto, M., et al. 2002. "RWC Music Database: Popular, Classical, and Jazz Music Databases." Proceedings of