ABSTRACT

We describe a system for the automatic generation of a 3D animation of a drummer playing along with a given piece of music. The input, consisting of a sound wave, is analysed to determine which drums are struck at what moments. The Standard MIDI File format is used to store the recognised notes. From this higher-level description of the music, the animation is generated. The system is implemented in Java and uses the Java3D API for visualisation.

Figure 1: An overview of the system (audio signal, percussion recognizer, MIDI events, animation generator, 3D animation)
1. INTRODUCTION
In this paper we describe preliminary results of our research on virtual musicians. The objective of this project is to generate animated virtual musicians that play along with a given piece of music. The input of this system consists of a sound wave, originating from e.g. a CD or a real-time recording.

There are many possible uses for an application like this, ranging from the automatic generation of music videos to interactive music performance systems where musicians play together in a virtual environment. In the last case, the real musicians could be located on different sites, and their virtual counterparts could be viewed in a virtual theatre by a world-wide audience. Additionally, our department is currently working on instructional agents that can teach music, for which the work we describe in this paper will be a good foundation.

For our first virtual musicians application, we have restricted ourselves to an animated drummer. However, the system is flexible enough to allow an easy extension to other instruments.

As figure 2 shows, the total task can be separated into two independent subtasks:

- An analysis of the sound signal and transcription of the percussion part. The system has to determine which drums are hit, at what moments in time. Concentrating on percussion sounds has certain advantages and disadvantages; this is further discussed in section 2.

- The creation of the movements of a 3D avatar playing on a drum kit. A more detailed explanation of this part is given in sections 3 and 4.

2. THE PERCUSSION RECOGNISER

This part of the system is responsible for the translation from a low-level description of the music (the sound wave) to an abstract, high-level description of all percussion sounds that are present in the signal. These recognised notes are stored as MIDI events.

Many attempts in the field of musical instrument recognition concentrate on pitched sounds [1]. As explained in [9], this is a rather different task than recognising percussive sounds, which have a sharp attack, short duration, and no clearly defined pitch. As shown in [9], individual, monophonic samples of drums and cymbals can be classified very well. In this approach, a few frames of the spectrum, measured from the onset of the sounds, were matched against a database of spectral templates.
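As a concrete illustration of this template-matching idea, classification of an onset spectrum against a database of spectral templates can be sketched in Java. This is a minimal sketch under assumptions of our own (toy 4-bin spectra, Euclidean distance, hypothetical names), not the actual classifier of [9]:

```java
// Hypothetical sketch: a block of spectral values measured at a sound's
// onset is compared against a database of spectral templates; the nearest
// template names the drum. Sizes and values are purely illustrative.
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateMatcher {

    /** Euclidean distance between two flattened blocks of spectral frames. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Return the label of the template closest to the onset spectrum. */
    static String classify(double[] onsetSpectrum, Map<String, double[]> templates) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : templates.entrySet()) {
            double d = distance(onsetSpectrum, e.getValue());
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy 4-bin "spectra": the bass drum template has low-frequency
        // energy, the hi-hat template has high-frequency energy.
        Map<String, double[]> templates = new LinkedHashMap<>();
        templates.put("BASS",  new double[]{1.0, 0.6, 0.1, 0.0});
        templates.put("HIHAT", new double[]{0.0, 0.1, 0.7, 1.0});

        double[] observed = {0.9, 0.5, 0.2, 0.1}; // close to the bass template
        System.out.println(classify(observed, templates)); // prints BASS
    }
}
```

The polyphony problem discussed next is exactly what such a single-nearest-template scheme cannot handle on its own.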
In our highly polyphonic, real-life situation, however, the input signal may contain many percussive sounds played simultaneously, and non-percussive instruments (such as guitar and vocals) may be mixed through the signal as well. Therefore, special techniques are needed to separate the percussive sounds from the other sounds. Other researchers have already tried to solve the same problem [13, 14]: Sillanpää et al. subtract harmonic components from the input signal to filter out non-percussive sounds. Furthermore, they stress the importance of top-down processing: using temporal predictions to recognise soft sounds that are partially masked by louder sounds [14]. Puckette's Pure Data program has an object called bonk that uses the difference between subsequent short-time spectra to determine whether a new attack has occurred.

We are still developing this part of the system; therefore we cannot yet present a final solution to this problem. We plan to solve the problem of polyphony by adding examples that consist of multiple sounds played together to the collection of spectral templates. For example: a bass drum, snare drum and hi-hat played together. For an off-line situation, where the complete input signal is already known, we plan to apply clustering methods to all fragments of the signal that contain a strong attack. This is based on our hypothesis that specific drum sounds will sound very similar throughout a piece of music. This is especially plausible for commercial recordings, and/or in the case that the music contains sampled drum sounds.

Figure 2: An overview of the system

3. BASIC ALGORITHMS

In this section, we describe how our system generates animations automatically. The various algorithms discussed here are kept rather simple on purpose, to maintain a clear view on the system as a whole. In section 4, more advanced techniques (that give better results) will be explained.

3.1. Overview of the system

A general overview of the animation generation is shown in figure 2. An abstract description of the animation (in this case, a list of time-stamped MIDI events) is transformed into a concrete animation. This lower-level description of the animation is defined in terms of key frames [4] that can directly be used by the graphical subsystem to animate objects in the scene.

Our implementation uses the Java3D engine for visualisation purposes [7]; the geometry of the 3D objects we have used has been created using the Virtual Reality Modeling Language (VRML, [15]).

3.2. Pre-calculated versus real-time animation

In our current off-line implementation, the piece of music to be played is completely known in advance as a list of MIDI events. Therefore, the entire animation can be computed before it is started. In a real-time situation, where the system has to respond to incoming MIDI events, this would not be possible. In that case, a short animation should be constructed and started immediately for each note that occurs in the input.

A great advantage of pre-calculating the entire animation is that the transitions between strokes will be much smoother: for each note we already know which drum will be struck next, and the arm can already start moving towards that drum.
3.3. Polyphony Issues

Monophonic instruments (such as the trumpet or the flute) are relatively easy to animate, because each possible sound corresponds to exactly one pose of all fingers, valves, etcetera, and only one pose can be active at each moment in time. Highly polyphonic instruments (such as the piano) are much more difficult, because there are many different ways (fingerings) to play the same piece of music, and a search method is needed to find a good solution [8]. The drum kit can be viewed as lying in between these two extreme examples: up to four sounds can be started simultaneously.

3.4.2. Other Parameters

Other parameters that are defined in the drum kit model:

- For each event type, a preferred hand: -1 (left) or 1 (right).

- For each event type, a parameter minTimeGap that determines how fast that particular event type can be played with one hand. This parameter will be explained in more detail in section 3.6.

3.5. MIDI Parsing
(Footnote: ...events, that is: multiple events on the same channel, with the same time stamp, the same note number and the same velocity. These extra events do not contain new information, nor do they increase the velocity, therefore we can discard them.)

3.7. Pose Creation

Figure 3: The graphical poser interface, applied to the left arm

A graphical user interface (GUI) is provided to create poses manually. Figure 3 shows a screenshot of the GUI applied to the left arm. A pose consists of a set of angles or translation values: one for each degree of freedom. With the horizontal sliders, the user can change these values.

...bal, but we speak of a ride bell (or cup bell) when the stick hits the small cup at the center of the cymbal (this gives a bell-like sound, hence the name).

2. For each event type, there is a preferred (default) hand that should be used if possible. A parameter defaultHand_eventType is specified for all event types. In our implementation, the SNARE and RIM events have the default hand set to left, while right is the default hand for all other events.

A parameter minTimeGap is defined, that determines how fast an event can be played with one hand. This parameter can have a different value for different event types, because the tendency to alternate hands varies from one drum type to another. For example, the hi-hat is usually played with the right hand; only in very demanding situations (fast rolls) will both hands be used. On the other hand, hand alternation on the high tom is much more common.
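A hedged Java sketch of how these per-event-type parameters can drive hand assignment: every event first gets its per-type default hand, and the middle of three same-hand strokes is flipped when two of them fall closer together than that type's minTimeGap. The event names and millisecond values below are illustrative stand-ins, not the paper's actual model values (except for the SNARE/RIM left-hand defaults mentioned above):

```java
// Hedged sketch of a drum kit model (-1 = left hand, 1 = right hand) and a
// two-phase hand assignment built on it: defaults first, then alternation.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HandAssignment {

    static final Map<String, Integer> PREFERRED_HAND = new HashMap<>();
    static final Map<String, Long> MIN_TIME_GAP = new HashMap<>();
    static {
        PREFERRED_HAND.put("SNARE", -1);     // left, as in the paper
        PREFERRED_HAND.put("RIM", -1);       // left
        PREFERRED_HAND.put("HIHAT", 1);      // right is the default elsewhere
        PREFERRED_HAND.put("HIGH_TOM", 1);
        MIN_TIME_GAP.put("SNARE", 150L);     // ms; illustrative values
        MIN_TIME_GAP.put("RIM", 150L);
        MIN_TIME_GAP.put("HIHAT", 120L);     // the hi-hat rarely alternates
        MIN_TIME_GAP.put("HIGH_TOM", 250L);  // the high tom alternates sooner
    }

    static class Event {
        final String type; final long time; int hand;
        Event(String type, long time) { this.type = type; this.time = time; }
    }

    static void assignHands(List<Event> events) {
        for (Event e : events) e.hand = PREFERRED_HAND.get(e.type);   // phase 1
        for (int i = 0; i + 2 < events.size(); i++) {                 // phase 2
            Event e1 = events.get(i), e2 = events.get(i + 1), e3 = events.get(i + 2);
            if (e1.hand == e2.hand && e2.hand == e3.hand
                    && (e2.time - e1.time <= MIN_TIME_GAP.get(e1.type)
                        || e3.time - e2.time <= MIN_TIME_GAP.get(e3.type))) {
                e2.hand = -e2.hand; // otherHand
            }
        }
    }

    public static void main(String[] args) {
        List<Event> roll = new ArrayList<>();
        roll.add(new Event("HIHAT", 0));     // a fast hi-hat roll, 100 ms apart
        roll.add(new Event("HIHAT", 100));
        roll.add(new Event("HIHAT", 200));
        assignHands(roll);
        // The middle stroke is moved to the other hand.
        System.out.println(roll.get(0).hand + " " + roll.get(1).hand + " " + roll.get(2).hand);
    }
}
```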
These principles are implemented in algorithm 3.1. It consists of two phases:

1. default hand assignment

2. hand alternation

Algorithm 3.1 A simple algorithm for event distribution

    iterate over all events e:
        hand(e) := preferredHand(type(e))
    iterate over all triplets of subsequent events (e1,e2,e3):
        if hand(e1)=hand(e2)=hand(e3)
           AND (Time(e2) - Time(e1) <= minTimeGap(type(e1))
                OR Time(e3) - Time(e2) <= minTimeGap(type(e3)))
        then
            hand(e2) := otherHand(hand(e2))

Figure 4: MID TOM UP    Figure 5: MID TOM DOWN

For each limb, two poses should be specified for each drum event type that it supports: the DOWN pose (the exact situation on contact) and the UP pose (the situation just before and just after the hitting moment). Examples of UP and DOWN poses are shown in figures 4 and 5.

Once a good position is achieved, it can be stored in the pre-defined list of poses. The entire list can be saved to disk, to preserve the information for a next session.

3.7.1. Motivation

We have chosen to set the poses manually through a GUI interface, instead of using motion capture [16] or inverse kinematics, for the following reasons:

Costs: Motion capture equipment is expensive, and requires a complete setup with a real drum kit that matches the 3D kit. If one would want to change something in the 3D drum kit (for example, moving a tom-tom) the whole capturing would have to be done all over again.

Simplicity: there are only a small number of poses, and they have to be set only once for a new drum kit configuration.

Flexibility: besides setting poses for the arms and legs, the interface can also be used for the hi-hat stand and pedal, the cymbal stands, the parts of the bass pedal, and for giving the snare, bass drum and tom-toms their position and orientation in the 3D scene.

...(that contains only drum events that should be played by that arm/leg) is parsed in the correct temporal order. For each abstract animation event e, a Stroke is added to the animation time line. A Stroke consists of three concrete animation events (i.e. key frames): (e_before, e_contact, e_after). The parameter delta is a constant that determines the time between the key frames within a stroke (100 ms is a useful value). See figure 6 for a graphical representation of a Stroke that will be used throughout this chapter.
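The Stroke just described can be sketched numerically: one abstract animation event becomes three key frames spaced by the constant delta. Class and field names here are assumptions, not the system's actual code:

```java
// Sketch of a Stroke: key frames before, contact and after, spaced by delta
// (100 ms, the value suggested in the text).
public class Stroke {

    static final long DELTA_MS = 100;

    final long beforeTime, contactTime, afterTime;

    Stroke(long eventTimeMs) {
        this.beforeTime  = eventTimeMs - DELTA_MS; // UP pose
        this.contactTime = eventTimeMs;            // DOWN pose
        this.afterTime   = eventTimeMs + DELTA_MS; // UP pose
    }

    public static void main(String[] args) {
        Stroke s = new Stroke(1000);
        System.out.println(s.beforeTime + " " + s.contactTime + " " + s.afterTime);
        // 900 1000 1100
    }
}
```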
3.7.2. Implementation
arm:

- the shoulder can rotate around its local X, Y and Z axes;

- the elbow can rotate around its local X and Y axes, to make the lower arm twist and the elbow bend, respectively;

- the wrist can rotate around its local X and Z axes.

hi-hat:

- the pedal can rotate around its local Z axis;

- the upper part (the stick to which the upper cymbal is attached) can be translated along the Y axis.

3.8. Key Frame Generation

Figure 6: A basic Stroke, consisting of key frames before, contact and after

If the time gap between subsequent animation events e1 and e2 is less than delta, their key frames will overlap, and special care has to be taken. We distinguish between two cases:

- If e1 and e2 are of the same event type (e.g. both are SNARE events), the last key frame of e1 and the first key frame of e2 are replaced by an interpolated key frame eNew: the less time between e1 and e2, the closer the new key frame will be to the DOWN key frame, as can be seen from figure 7.

[Figure: key frame space diagram, angle over time for events e1 and e2 of type(e1), with the UP pose marked]

- ...e2_after is shortened. A parameter a (0 < a < 1) determines the fraction of the time between the events that is used for moving the arm from e1_after to e2_before.

Figure 9: angle(t) [key frame space diagram for events e1, e2, e3 with the UP poses of type(e1) and type(e2), time fractions (1-a)/2, a, (1-a)/2]

3.8.2.1. Pedals
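The first overlap case above can be sketched numerically: for two same-type events closer together than delta, the after-frame of e1 and the before-frame of e2 are merged into one frame that lies the closer to the DOWN pose the smaller the gap is. The linear blend and scalar pose representation below are assumptions of the sketch:

```java
// Sketch of the same-event-type overlap case: the merged key frame eNew sits
// halfway in time and is interpolated towards DOWN as the gap shrinks.
public class OverlapMerge {

    static final double DELTA_MS = 100.0;
    static final double UP = 1.0, DOWN = 0.0;

    /** Returns {time, poseValue} of the merged key frame eNew. */
    static double[] merge(double t1Contact, double t2Contact) {
        double gap = t2Contact - t1Contact;
        // gap == 2*delta: full UP pose reached; gap == 0: stays at DOWN.
        double pose = DOWN + (UP - DOWN) * Math.min(1.0, gap / (2.0 * DELTA_MS));
        return new double[] { (t1Contact + t2Contact) / 2.0, pose };
    }

    public static void main(String[] args) {
        double[] eNew = merge(1000, 1100); // strokes 100 ms apart
        System.out.println(eNew[0] + " " + eNew[1]); // 1050.0 0.5
    }
}
```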
...tween ring and middle finger of the ...

...filtering as explained in section 4.2.

Our second algorithm, which solves these shortcomings, uses default hand assignments for all possible pairs of events. For example, we can define that whenever RIDE and HIHATOPEN are played together, the RIDE is played with the right hand and the HIHAT with the left. We should keep some flexibility, as these constraints do not have to be equally strong for all pairs: for example, SNARE+CRASH can be played as left-right just as easily as right-left.

The drum kit model is extended with a function pair(eventType, eventType), that returns a floating-point value in the range [-1..1]. The semantics of this value are as follows:

The improved hand assignment algorithm uses just the pair(a, b) function for simultaneous events. For events [e1, e2] with a time gap Δt greater than zero, the default hand values are taken into account as well.

For each event with index I in the event list, a hand assignment value is calculated twice: in the pair [event(I-1), event(I)] and in the pair [event(I), event(I+1)]. Afterwards, these two values are averaged to yield the final hand assignment value for event(I).

For a pair [e1, e2] the hand assignment values (hand(e1), hand(e2)) are calculated in the following way:

    Δt = Time(e2) - Time(e1)
    hand(e1) = α^Δt · pair(e1, e2) + (1 - α^Δt) · defaultHand(e1)
    hand(e2) = α^Δt · (-pair(e1, e2)) + (1 - α^Δt) · defaultHand(e2)

The decreasing exponential function α^Δt (0 < α < 1) ensures that the default hand values are taken more into account when there is more time between e1 and e2, at the same time lowering the influence of the pair-wise hand preference.

...methods consist of the following steps:

1. generate all possible solutions

2. assign a distance value to each solution (e.g. based on distances between drums, penalties for using a certain hand for a certain event type, etcetera)

3. take the solution with the lowest distance value.

Problems with this approach lie in the design of a good distance function, and in the large number of possible solutions (footnote 7: ...be distributed over the 2 hands in 2^n ways). We have not (yet) implemented a shortest-path algorithm in our system.

In a real drum kit, one can observe that some drums or cymbals are more elastic than others, i.e. the drum stick bounces more on one object than on another. Besides the object itself, the elasticity is also dependent on the way of playing: the stick will bounce back more on the hi-hat when it is played closed than when it is played open.

To simulate this phenomenon, we extend the drum kit model with an elasticity parameter el_eventType in the range [0..1] for each drum event type. The value of el_eventType determines how far the drum stick should bounce back to its initial position after contact. In this definition, 0 means no elasticity while 1 corresponds to maximum elasticity. The elasticity values are now used in the following way: for each stroke, the TR_after key frame is interpolated between the UP and the DOWN pose:

    TR_before = TR_UP
    TR_contact = TR_DOWN
    TR_after = TR_DOWN + el_eventType · (TR_UP - TR_DOWN)

From this, one can easily deduce that ...
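The two drum kit model extensions above, the pair-wise hand blend with the decaying weight α^Δt and the elasticity bounce-back of the after key frame, can be sketched numerically. The value of α and the reduction of a pose to a single scalar are assumptions of this sketch:

```java
// Numeric sketch of the blended hand assignment (pair-wise preference
// decaying with alpha^dt towards the default hands) and the elasticity
// interpolation of the after key frame.
public class DrumKitExtensions {

    static final double ALPHA = 0.99;               // 0 < alpha < 1, per ms (assumed)
    static final double TR_UP = 1.0, TR_DOWN = 0.0; // scalar stand-ins for poses

    /** hand value for e1 of a pair with time gap dt (ms). */
    static double handE1(double pair, double defaultHand, long dt) {
        double w = Math.pow(ALPHA, dt);
        return w * pair + (1 - w) * defaultHand;
    }

    /** hand value for e2: the pair preference enters with the opposite sign. */
    static double handE2(double pair, double defaultHand, long dt) {
        double w = Math.pow(ALPHA, dt);
        return w * (-pair) + (1 - w) * defaultHand;
    }

    /** {before, contact, after} key frame values for elasticity el in [0..1]. */
    static double[] strokeFrames(double el) {
        return new double[] { TR_UP, TR_DOWN, TR_DOWN + el * (TR_UP - TR_DOWN) };
    }

    public static void main(String[] args) {
        System.out.println(handE1(1.0, -1.0, 0));  // simultaneous: pair wins, 1.0
        System.out.println(handE2(1.0, -1.0, 0));  // opposite sign: -1.0
        System.out.println(strokeFrames(0.25)[2]); // low-elasticity bounce: 0.25
    }
}
```

With a large Δt the weight α^Δt vanishes and both hand values approach the per-event defaults, matching the intent stated above.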
4.3.2. Note Velocities

In the basic algorithm (see section 3.5), we did not take the velocities vel_event of the DrumEvents into account. It would of course be more convincing to use different animations for different velocities. Using different animations for different velocities will result in more natural behavior: the UP position should be closer to the drum surface for softer notes, and further away in the case of loud notes. The key frames [TR_before, TR_contact, TR_after] that make up a Stroke can therefore be defined as follows (see also figure 11):

    TR_before = TR_DOWN + vel_event · diff
    TR_contact = TR_DOWN
    TR_after = TR_DOWN + vel_event · el_eventType · diff
    diff = TR_UP - TR_DOWN

[Figure 11: strokes between TR_UP and TR_DOWN with different velocity and elasticity values: vel=1.0 el=0.5, vel=1.0 el=0.25, vel=0.5 el=1.0, vel=0.5 el=0.5]

...poses are defined for the neck joint, and for each beat note a Stroke is created. We have used the SNARE event on the left hand as an approximation of beat notes. Finding the real beat in a MIDI file is far from trivial, and many other researchers have addressed this problem [3, 2, 5, 6]. Our system could very well be integrated with an intelligent beat detector to create even better looking behaviour.

4.3.4. Key Frame Interpolation

After the basic key frames are set, the motion is fine-tuned by inserting extra key frames, applying a different interpolation script between certain key frame types (before/contact/after). These scripts can also be different for each joint. The example scripts shown in figure 12 create rather convincing results, because the stick moves slightly behind the hand, resulting in a whip-like motion. These interpolation scripts were derived by observing the motion of a human drummer.

[Figure 12 content: interpolation script tables for the elbow, wrist and stick joints, with transitions from before / from contact / from after to before / to contact / to after, built from the current and next key frame time stamps]
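The velocity- and elasticity-scaled key frames of section 4.3.2 can be checked with a small numeric sketch; as before, a pose is reduced to a single scalar per joint and the TR_UP/TR_DOWN values are illustrative:

```java
// Numeric sketch of the velocity-scaled key frames: softer notes keep the
// stick closer to the drum surface, and the bounce-back after contact is
// scaled by both velocity and elasticity.
public class VelocityKeyFrames {

    static final double TR_UP = 1.0, TR_DOWN = 0.0;

    /** Returns {before, contact, after} for one stroke; vel, el in [0..1]. */
    static double[] keyFrames(double vel, double el) {
        double diff = TR_UP - TR_DOWN;
        double before  = TR_DOWN + vel * diff;      // softer note: lower UP position
        double contact = TR_DOWN;
        double after   = TR_DOWN + vel * el * diff; // bounce scaled by elasticity
        return new double[] { before, contact, after };
    }

    public static void main(String[] args) {
        double[] kf = keyFrames(0.5, 0.5); // vel=0.5, el=0.5 as in figure 11
        System.out.println(kf[0] + " " + kf[1] + " " + kf[2]); // 0.5 0.0 0.25
    }
}
```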
Figure 12: example interpolation scripts for the elbow and the wrist and stick joints

4.3.3. Extra avatar animation

In this section, a number of extensions are discussed that animate parts of the avatar that were not animated at all in the basic system. This helps a great deal to make the avatar look alive.

4.3.3.1. The head

The head of the avatar is animated, to create the effect that the avatar follows his hands with his eyes. First, we create poses for the head: one for each event type that is supported by the hands. These poses rotate the head so that the eyes are pointed at the associated drum or cymbal. If we then use all events that are played by e.g. the right hand to create a key frame time line, the head appears to follow this hand.

4.3.3.2. The neck

The neck joint is used to make the avatar nod with his head on the beat: UP and DOWN ...

4.4. Implementation Notes

The Java3D API is used for the implementation, because it is platform-independent and supports a wide range of geometry file formats. Moreover, our virtual theatre [12] is currently being ported from VRML to Java3D. The SMF format (Standard MIDI File) is used as intermediate file format between the percussion recogniser and the animation generator. A great advantage of using the SMF is that it allows us to use MIDI files (which are widely available on the WWW) to test the animation generator independently of the percussion recogniser.

For the synchronisation of the animation and the sound, a separate thread is used, which looks up the current audio position and adjusts the start time of the animation accordingly.
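The synchronisation idea can be sketched as follows. The AudioClock interface is a hypothetical stand-in for whatever audio back end supplies the playback position; the 50 ms polling interval is an assumption:

```java
// Sketch of the synchronisation thread: periodically read the current audio
// position and shift the animation start time so that animation time and
// audio time agree.
public class SyncThread extends Thread {

    interface AudioClock { long positionMs(); } // hypothetical audio back end

    private final AudioClock audio;
    volatile long animationStartMs; // wall-clock time at which animation time 0 plays

    SyncThread(AudioClock audio, long initialStartMs) {
        this.audio = audio;
        this.animationStartMs = initialStartMs;
        setDaemon(true);
    }

    /** The start time that makes animation time equal to audio time now. */
    static long adjustedStart(long nowMs, long audioPositionMs) {
        return nowMs - audioPositionMs;
    }

    @Override public void run() {
        while (!isInterrupted()) {
            animationStartMs = adjustedStart(System.currentTimeMillis(), audio.positionMs());
            try { Thread.sleep(50); } catch (InterruptedException e) { return; }
        }
    }

    public static void main(String[] args) {
        // If it is wall-clock time 10000 ms and the audio is 2500 ms in,
        // the animation must behave as if it started at 7500 ms.
        System.out.println(adjustedStart(10_000, 2_500)); // 7500
    }
}
```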
5. CONCLUSION

We have chosen a GUI-based pose editor and script-based key frame interpolation. A screenshot is shown in figure 3. This proves to be a very flexible solution, since there are only a small number of poses, and they have to be set only once for a new drum kit configuration. The system could be extended with motion capturing, dynamics and inverse kinematics to create even more realistic behaviour, but at the cost of losing simplicity and flexibility. The interpolation scripts create natural motion, while the hand assignment algorithm ensures the arms will not cross. Motion capture would require the setup of the virtual drum kit to exactly match the setup of the real kit, so changes cannot easily be made.

The animation results can be viewed at our web site: http://wwwhome.cs.utwente.nl/kragtwij/science/

6. REFERENCES

... In Proceedings of the International Computer Music Conference, pages 171-174, Sept. 1995.

[7] The Java3D API. http://java.sun.com/products/java-media/3D/.

[8] J. Kim. Computer animation of pianists' hand. In Eurographics '99 Short Papers and Demos, pages 117-120, Milan, 1999.

[9] M. Kragtwijk. Recognition of percussive sounds using evolving fuzzy neural networks. Technical report, University of Otago, Dunedin, New Zealand, July 2000. Report of a practical assignment.

[10] T. Lokki, J. Hiipakka, R. Hänninen, T. Ilmonen, L. Savioja, and T. Takala. Real-time audiovisual rendering and contemporary audiovisual art. Organised Sound, 3(3):219-233, 1998.

[11] The General MIDI specification. http://www.midi.org/about-midi/gm/gm1sound.htm.