Proceedings of the 19th ISMIR Conference, Paris, France, September 23-27, 2018
plete AMT system, the note-level evaluation metrics are most applicable here, and readers interested in the frame-based evaluation, an accuracy metric, should refer to [28].

For the note-level metric, a note is defined by its pitch, onset time, and offset time. [1] defines two different precision, recall, and F-measures for note-level multi-pitch detection. For the first, they define true positives as those detected notes whose pitch lies within a quartertone of that of a ground truth note, and whose onset time is within 50 ms of the same ground truth note's onset time, regardless of offset time. Spuriously detected notes are each assigned a false positive, and ground truth notes which are not matched by a detected note are each assigned a false negative. The second metric they propose is identical, with the additional constraint that a detected note's offset time must be accurate to within a certain threshold for it to be considered a true positive. Both of these metrics are used by both the Music Information Retrieval Evaluation Exchange (MIREX) [25] and the mir_eval package [29].

For our purposes, we care mostly about onset time and pitch (to the nearest semitone), as these aspects are most directly relevant to the underlying musical score. Offset time, on the other hand, is applicable only as far as it relates to note value, and is discussed in Section 2.4. Our multi-pitch detection metric will therefore be based most closely on the first multi-pitch F-measure, which doesn't account for offset time.

2.2 Voice Separation

Voice separation refers to the separation of the notes of a piece of music into perceptual streams called voices. There is no standardised definition of what constitutes a voice, and a full discussion can be found in [3]. In this work, we restrict each voice to be monophonic. This aligns with the majority of work on voice separation, and is beneficial in AMT in that it allows simpler processing of monophonic data to occur in the later musical analysis steps.

There is no standardised metric for evaluating voice separation performance. [5] defines Average Voice Consistency (AVC), which returns an average of the percentage of notes in each voice which have been assigned to the correct voice. (A note is said to be assigned to the correct voice if its ground truth voice is the most common one for notes assigned to its voice.) This metric has a problem in that if a model assigns each note to a distinct voice, it achieves a perfect AVC of 100%. For acoustic input, [19, 30] use a similar metric, with the addition that spuriously detected notes automatically count as incorrect.

[17] defines two metrics: soundness, which measures the percentage of consecutive notes in an assigned voice which belong to the same ground truth voice; and completeness, which measures the percentage of consecutive notes in a ground truth voice which were assigned to the same voice. Finally, [12] defines a precision, recall, and F-measure evaluation, in which the problem of voice assignment is treated as a graph problem where each note is represented by a node, and two nodes are connected by an edge if and only if they are consecutive notes in an assigned voice. The values are calculated by counting the number of correct edges (true positives), spurious edges (false positives), and omitted edges (false negatives).

Each of these metrics would penalise an AMT system for any spurious notes, so for our proposed metric, we will need to use a modified version of one of them (or design a new metric) in order to enforce the principle of disjoint penalties.

2.3 Metrical Alignment

Metrical alignment is most often approached as one of three different tasks: downbeat tracking, beat tracking, or metrical structure detection. Downbeat tracking and beat tracking each involve identifying points in time, and thus can theoretically be evaluated using the same metrics, which are summarised by [8, 9]. F-measure [11] (which downbeat tracking work uses almost exclusively) is calculated by counting the number of (down)beats which lie within some window length (usually 70 ms) of an annotated (down)beat. [4] proposes a similar metric, where accuracy is instead calculated using a Gaussian window around each annotated beat. [13] proposes a binary metric which is 1 if the beats are correctly tracked for at least 25% of the piece, and 0 otherwise. P-score [18] is the proportion of tracked beats which correctly match an annotated beat, normalised by either the number of tracked beats or the number of annotated beats (whichever is greater). Finally, [15] proposes metrics based on the longest continuously tracked section of music. All of the above are used to some extent in beat tracking, and all are used by both mir_eval [29] and MIREX [21, 23]. In addition, evaluation is also often presented at twice and half the annotated beat length, to handle models which may track a beat at the wrong metrical level.

By comparison, the evaluation of metrical structure detection is far less sophisticated. Meter detection is the organisation of the beats of a given musical performance into a sequence of trees at the bar level, in which each node represents a single note value. The structure of each of these trees is directly related to the music's time signature, where the head of each tree splits into a number of nodes equal to the number of beats per bar, and each of these beat nodes splits into a number of nodes equal to the number of sub-beats per beat. Thus, each time signature uniquely describes a single metrical tree structure, as defined by the number of beats per bar and sub-beats per beat in that time signature. The most basic evaluation is to simply report the proportion of musical excerpts for which the model guesses the correct metrical structure and phase (such that each tree aligns correctly with a single bar). Another approach is to simply report the proportion of musical excerpts for which the model correctly classifies the meter as duple or triple [10]. Both of these metrics are simplistic, and fail to take into account some idea of partially correct metrical structure trees.

Two metrics have been used for metrical structure detection evaluation which contain within them an evaluation of beat tracking and downbeat tracking, making them ideal for an evaluation of a joint model. [31] proposes a metric
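The windowed beat-tracking F-measure described above can be sketched as follows. This is an illustrative simplification (greedy one-to-one matching within a 70 ms window), not the mir_eval implementation; the function name and matching strategy are assumptions.

```python
def beat_f_measure(detected, annotated, window=0.07):
    """F-measure for beat tracking: a detected beat is a true positive
    if it lies within `window` seconds of an unmatched annotated beat."""
    annotated = sorted(annotated)
    matched = [False] * len(annotated)
    tp = 0
    for beat in sorted(detected):
        for i, ann in enumerate(annotated):
            if not matched[i] and abs(beat - ann) <= window:
                matched[i] = True
                tp += 1
                break
    if tp == 0:
        return 0.0
    precision = tp / len(detected)   # spurious beats lower precision
    recall = tp / len(annotated)     # missed beats lower recall
    return 2 * precision * recall / (precision + recall)
```

For example, tracking only every other beat halves the recall while keeping precision at 1, giving an F-measure of 2/3; evaluating at twice or half the annotated beat length, as described above, is a way to credit such outputs at a different metrical level.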
which takes into account the level on the metrical tree at which each note lies, in order to capture some idea of partial correctness. However, since it is based on detected notes, it is not robust to errors in multi-pitch detection. [20] introduces an F-measure metric where each level of the detected metrical structure is assigned a true positive if it matches any level of the ground truth metrical structure (even if it is not the same level). A false positive is given for any level of the detected metrical structure which clashes with a metrical grouping in the ground truth, and a false negative for any metrical level in the ground truth which remains unmatched by a level of the detected metrical structure. As it is based solely on metrical groupings, rather than notes, it is robust to multi-pitch detection errors, and would not violate our principle of disjoint penalties. However, it was designed for use with metronomic input, and would therefore need to be adapted for our purposes of evaluating a complete AMT system on live performance data.

2.4 Note Value Detection

Note value detection (identifying a note as a quarter note, eighth note, dotted note, tied note, etc.) is not a widely researched problem; it relates to a combination of note offset time and metrical alignment. [26] describes two metrics for the task. One, error rate, is simply the percentage of notes whose transcribed value is incorrect. The other, scale error, also takes into account the magnitude of the error (relative to the metrical grid), in log space, such that errors from long notes do not dominate the calculation.

However, since the measured note values are reported relative to the underlying meter, they violate our property of disjoint penalties, and we must design a new measure of note value detection accuracy for our metric.

2.5 Harmonic Analysis

Harmonic analysis involves both key detection, a classification problem of identifying one of twelve tonic notes, each with two possible modes (major or minor; alternate mode detection has not been widely researched); and chord tracking, identifying a sequence of chords and times given an audio recording. The possible chords to identify range from simply identifying the correct root note, to determining major or minor, identifying seventh chords, and even identifying different chord inversions.

The standard key detection evaluation, used by both mir_eval [29] and MIREX [24], is to assign a score of 1.0 to the correct key, 0.5 to a key which is a perfect fifth too high, 0.3 to the relative major or minor of the correct key, 0.2 to the parallel major or minor of the correct key, and 0.0 otherwise.

The standard chord tracking evaluation is chord symbol recall (CSR), described by [16] and used by both MIREX [22] and mir_eval [29]. It is defined as the proportion of the input for which the annotated chord matches the ground truth chord. There can be varying levels of specificity for what exactly constitutes a match, since different sets of possible chords can be used, as described above.

2.6 Joint Metric

For the joint evaluation of AMT performance, [7] presents a system to transcribe MIDI input into a musical score (thus including errors from typesetting), and evaluates it using five human evaluators. The evaluators were asked to: "1) Rate the pitch notation with regard to the key signature and the spelling of notes. 2) Rate the rhythmic notation with regard to the time signature, bar lines, and rhythmic values. 3) Rate the notation with regard to stems, voicing, and placement of notes on staves," each on a scale of 1 to 10. The three questions roughly correspond with four of our sections above: 1) harmonic analysis; 2) metrical alignment and note value detection; and 3) voice separation.

[6] describes an automatic metric for the same task, similar to string edit distance, taking into account the ordering of 12 different aspects of a musical score: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment.

While this metric is a great step towards an automatic evaluation of AMT performance, it violates our principle of disjoint penalties. A single mistake in metrical alignment can manifest itself in the time signature, rest durations, note durations, and even additional notes (tied notes are counted as separate objects in the metric).

It appears that both of the above metrics measure something slightly different from what we want. They measure the readability of a score produced by an AMT system, while we really want a metric which measures the accuracy of the analysis performed by the AMT system, a slightly different task. To our knowledge, no metric exists which measures the accuracy of the analysis performed by a complete AMT system in the way we desire.

3. NEW METRIC

Our proposed metric, MV2H, draws from existing metrics where possible, though we take care to ensure that our principle of disjoint penalties is not violated. Essentially, we calculate a single score for each aspect of the transcription, and then combine them all into the final joint metric.

3.1 Multi-pitch Detection

For multi-pitch detection, we use an F-measure very similar to the one by [1] described above, counting a detected note as a true positive if its detected pitch (in semitones) is correct and its onset lies within 50 ms of the ground truth onset time. All other detected notes are false positives, and any unmatched ground truth notes are false negatives. Note offset time does not factor into our evaluation; rather, see Section 3.4 for a discussion of the related problem of note value detection.

3.2 Voice Separation

For voice separation, we use an F-measure very similar to [12], taking care not to violate our principle of disjoint penalties. Specifically, we don't want to penalise any model in voice separation for multi-pitch detection errors.
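The edge-based voice separation F-measure of [12] can be sketched as follows. This is an illustrative simplification on symbolic input: the representation of voices as ordered lists of note IDs is an assumption, and the disjoint-penalty restriction to correctly detected notes (needed for acoustic input) is omitted.

```python
def voice_f_measure(detected_voices, gt_voices):
    """Edge-based F-measure for voice separation: each voice is an
    ordered list of note IDs, and an edge links each pair of
    consecutive notes within a voice."""
    def edges(voices):
        return {(v[i], v[i + 1]) for v in voices for i in range(len(v) - 1)}
    det, gt = edges(detected_voices), edges(gt_voices)
    tp = len(det & gt)          # correct edges
    if tp == 0:
        return 0.0
    precision = tp / len(det)   # spurious edges lower precision
    recall = tp / len(gt)       # omitted edges lower recall
    return 2 * precision * recall / (precision + recall)
```

For instance, with ground truth voices [[1, 2, 3], [4, 5]] and detected voices [[1, 2], [3, 4, 5]], the edges (1, 2) and (4, 5) are correct while (2, 3) is omitted and (3, 4) is spurious, giving precision and recall of 2/3 each.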
Figure 1: An example transcription of the ground truth bar (left) is shown (right). The voice connection between the last two notes in the lower voice counts as a true positive, even though they are not consecutive in the ground truth.

Figure 2: An example transcription of the ground truth bar (left) is shown (right). Those notes which are assigned a note value score are coloured. Of those, the C (assuming treble clef) is assigned a score of 0.5, while the others are assigned a score of 1.
the voice separation F-measure. Of the coloured notes, the C would be assigned a score of around 0.5 (depending on exact timing), since its value duration is off by exactly half of the ground truth note's value duration. The others would receive scores of 1. Thus, the final note value score would be the average of 1, 1, 1, and 0.5, or about 0.875.

3.5 Harmonic Analysis

For harmonic analysis, we use the standard key detection and CSR metrics described above, as neither one violates our principle of disjoint penalties, since they are based on time rather than on notes or the metrical alignment. For now, we take the set of possible chords to include a major and minor version for each root note, but not sevenths or inversions, although the full collection of chords should be used for the final version of our metric.

To combine the two into a single harmonic analysis score, we take the arithmetic mean of the two values, since they are both on the range [0–1]. Models which only perform one of the above tasks may simply use that task's score as their harmonic analysis score.

3.6 Joint Metric

We now have five values to combine into a single number: the multi-pitch detection F-measure, the voice separation F-measure, the metrical F-measure, the note value detection accuracy score, and the harmonic analysis mean. All of these values are on the range [0–1], such that a value of 1 results from a perfect transcription in that aspect. We consider three different approaches for their combination: harmonic mean, geometric mean, and arithmetic mean.

Harmonic mean is most useful when there is potential for one of the values involved to be significantly larger than the others, and thus to dominate the overall result. F-measure, for example, is the harmonic mean of precision and recall, and is used so that models cannot receive a high F-measure simply by being tuned to have a very high recall or precision; rather, both recall and precision must be relatively high in order for their harmonic mean to also be high. This is not relevant in our case, as there is no way for a model to tune itself towards one very high score at the expense of the others, as is the case with some binary classification problems.

Geometric mean is most useful when the values involved are on different scales. Then, a given percent change in one of the values will result in the same change in mean as the same percent change in another of the values. This property is not necessary for us because all of our values lie on the same range.

Arithmetic mean is a simple calculation that weights each of the values involved equally. This property is desirable for us because, for a complete transcription, all five aspects of an analysis must be correct. Furthermore, due to our property of disjoint penalties, we have kept the five values involved disjoint, and a model must perform fairly well on all aspects in order for its overall score to be high.

Therefore, for the final joint metric, MV2H (for Multi-pitch detection, Voice separation, Metrical alignment, note Value detection, and Harmonic analysis), we take the arithmetic mean of the five previously calculated values. We also want the metric to be usable no matter what subset of analyses is performed, for example, for models which run on MIDI input and therefore do not perform multi-pitch detection. In these cases, we advise using our metric and simply taking the arithmetic mean of only those scores which correspond with the analyses performed. In future work, we will investigate whether a linear combination of the five values involved, perhaps weighting some more strongly than others, aligns more exactly with human judgements than the current arithmetic mean.

4. EXAMPLES

To illustrate the effectiveness and appropriateness of our metric, we present in Figure 3 two example transcriptions of the first four bars of Bach's Minuet in G, each exhibiting different errors. Figure 3a shows the ground truth transcription (where the chord progression is shown beneath the staff), and the example transcriptions are shown below it. We make two assumptions: (1) ground truth voices are separated by clef (plus the bottom two notes in the initial chord, which each belong to their own voice); and (2) the sub beats of each transcription align in time with the sub beats of the ground truth.

Figure 3b shows an example transcription which is good in general, with just a few mistakes, mostly related to the metrical alignment. First, for the multi-pitch detection F-measure, we can see that the transcription has 20 true positives, 3 false negatives (a G on the second beat of the first bar, a C on the second beat of the third bar, and the final G in the fourth bar), and 0 false positives, resulting in an F-measure of 0.93. For voice separation, this transcription is generally good, making a single bad assignment in the second bar, resulting in 3 false positives (the connections to and from the incorrect assignment, as well as the incorrect connection in the treble clef), 3 false negatives (the correct connections to and from the misclassified note, as well as the correct connection in the bass clef), and a voice separation F-measure of 0.83. Notice that the missed G in the upper voice of the treble clef in the first bar does not result in a penalty for voice assignment, due to our principle of disjoint penalties. For metrical alignment, we can see that this transcription is notated in 6/8 time, correctly grouping all sub beats (eighth notes) and bars, yielding 28 true positives, but incorrectly grouping three sub beats together into dotted quarter note beats, yielding 8 false positives and 12 false negatives. This results in a metrical F-measure of 0.74. For note value detection, 14 notes are counted: all of the bass clef notes and all of the eighth notes in the first bar, only the high D in the second bar, the low C and all of the eighth notes in the third bar, and the high G and the low B in the fourth bar. Notice that the initial high D isn't counted because the next note in its voice has not been detected. Similarly, neither the G on the second beat of the second bar nor any of the bass clef notes in the second bar are counted, due to voice separation errors. Of the 14 notes, 13 of them are assigned the correct note value (even the
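The component F-measures in the walkthrough above follow directly from the stated counts via the standard precision, recall, and F-measure formulas. A short sketch reproduces the multi-pitch and metrical scores for the first transcription:

```python
def f_measure(tp, fp, fn):
    """F-measure from true positive, false positive, and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Figure 3b: multi-pitch has 20 tp, 0 fp, 3 fn; meter has 28 tp, 8 fp, 12 fn.
print(round(f_measure(20, 0, 3), 2))   # 0.93
print(round(f_measure(28, 8, 12), 2))  # 0.74
```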
(Figure 3a: Ground truth)

Transcription   1      2
Multi-pitch     0.93   0.77
Voice           0.83   1.0
Meter           0.74   1.0
Note Value      0.96   1.0
Harmonic        1.0    0.5
MV2H            0.89   0.85
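The MV2H row of the table can be reproduced from the five component scores, since the joint metric is their arithmetic mean. The sketch below also takes the mean over only the scores that are present, as Section 3.6 advises for partial analyses; the dictionary keys are illustrative, not part of the metric's definition.

```python
def mv2h(scores):
    """MV2H: arithmetic mean of whichever component scores are present."""
    present = [s for s in scores.values() if s is not None]
    return sum(present) / len(present)

transcription_1 = {"multi_pitch": 0.93, "voice": 0.83, "meter": 0.74,
                   "note_value": 0.96, "harmonic": 1.0}
transcription_2 = {"multi_pitch": 0.77, "voice": 1.0, "meter": 1.0,
                   "note_value": 1.0, "harmonic": 0.5}
print(round(mv2h(transcription_1), 2), round(mv2h(transcription_2), 2))  # 0.89 0.85
```

A model which, say, performs no multi-pitch detection would set that entry to None, and the mean would be taken over the remaining four scores.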
[3] Emilios Cambouropoulos. Voice and Stream: Perceptual and Computational Modeling of Voice Separation. Music Perception, 26(1):75–94, September 2008.

[4] A. Taylan Cemgil, Bert Kappen, Peter Desain, and Henkjan Honing. On tempo tracking: Tempogram representation and Kalman filtering. Journal of New Music Research, 29(4):259–273, December 2000.

[5] Elaine Chew and Xiaodan Wu. Separating Voices in Polyphonic Music: A Contig Mapping Approach. In Computer Music Modeling and Retrieval, pages 1–20, 2004.

[6] Andrea Cogliati and Zhiyao Duan. A metric for music notation transcription accuracy. In ISMIR, pages 407–413, 2017.

[7] Andrea Cogliati, David Temperley, and Zhiyao Duan. Transcribing Human Piano Performances into Music Notation. In ISMIR, pages 758–764, 2016.

[8] Matthew E. P. Davies and Sebastian Böck. Evaluating the Evaluation Measures for Beat Tracking. In ISMIR, pages 637–642, 2014.

[9] Matthew E. P. Davies, Norberto Degara, and Mark D. Plumbley. Evaluation Methods for Musical Audio Beat Tracking Algorithms. Queen Mary University of London, Centre for Digital Music, Technical Report C4DM-TR-09-06, 2009.

[10] W. Bas De Haas and Anja Volk. Meter Detection in Symbolic Music Using Inner Metric Analysis. In ISMIR, pages 441–447, 2016.

[11] Simon Dixon. Evaluation of the Audio Beat Tracking System BeatRoot. Journal of New Music Research, 36(1):39–50, March 2007.

[12] Ben Duane and Bryan Pardo. Streaming from MIDI using constraint satisfaction optimization and sequence alignment. In Proceedings of the International Computer Music Conference, pages 1–8, 2009.

[18] M. F. McKinney, D. Moelants, Matthew E. P. Davies, and Anssi Klapuri. Evaluation of Audio Beat Tracking and Music Tempo Extraction Algorithms. Journal of New Music Research, 36(1):1–16, March 2007.

[19] Andrew McLeod, Rodrigo Schramm, Mark Steedman, and Emmanouil Benetos. Automatic transcription of polyphonic vocal music. Applied Sciences, 7(12), 2017.

[20] Andrew McLeod and Mark Steedman. Meter detection in symbolic music using a lexicalized PCFG. In Proceedings of the 14th Sound and Music Computing Conference, pages 373–379, 2017.

[21] MIREX. Audio beat tracking. http://www.music-ir.org/mirex/wiki/2017:Audio_Beat_Tracking, 2017. Accessed: 2017-07-18.

[22] MIREX. Audio chord estimation. http://www.music-ir.org/mirex/wiki/2017:Audio_Chord_Estimation, 2017. Accessed: 2017-07-18.

[23] MIREX. Audio downbeat estimation. http://www.music-ir.org/mirex/wiki/2017:Audio_Downbeat_Estimation, 2017. Accessed: 2017-07-18.

[24] MIREX. Audio key detection. http://www.music-ir.org/mirex/wiki/2017:Audio_Key_Detection, 2017. Accessed: 2017-07-18.

[25] MIREX. Multiple fundamental frequency estimation & tracking. http://www.music-ir.org/mirex/wiki/2017:Multiple_Fundamental_Frequency_Estimation_%26_Tracking, 2017. Accessed: 2017-07-18.

[26] Eita Nakamura, Kazuyoshi Yoshii, and Simon Dixon. Note value recognition for piano transcription using Markov random fields. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(9):1846–1858, September 2017.