
Applied Acoustics 67 (2006) 1226–1242

www.elsevier.com/locate/apacoust

MobySound: A reference archive for studying
automatic recognition of marine mammal sounds
David K. Mellinger a,*, Christopher W. Clark b

a Cooperative Institute for Marine Resources Studies, Oregon State University, and
NOAA Pacific Marine Environmental Laboratory, 2030 SE Marine Science Drive, Newport, OR 97365, USA
b Bioacoustics Research Program, Cornell Laboratory of Ornithology,
159 Sapsucker Woods Road, Ithaca, NY 14850-1999, USA

Received 17 January 2006; received in revised form 7 June 2006; accepted 7 June 2006
Available online 14 August 2006

Abstract

A reference archive has been constructed to facilitate research on automatic recognition of marine
mammal sounds. The archive enables researchers to have access to recorded sounds from a variety of
marine species, sounds that can be very difficult to obtain in the field. The archive also lets researchers
use different sound-recognition methods on a common set of sounds, making it possible to compare
directly the effectiveness of the different methods. In recognizing sounds in a given recording,
the type and frequency of noise present has a strong effect on the difficulty of the recognition problem;
a measure of the amount of interference was devised, the "time-local, in-band signal-to-noise
ratio", and was applied to each sound in the archive. Current entries in the archive comprise low-frequency
sounds of large whales, and have about 14,000 vocalizations from eight species of baleen
whales. MobySound may be accessed at http://hmsc.oregonstate.edu/projects/MobySound/. Contributions
to the archive are welcomed.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Sound archive; Call recognition; SNR

* Corresponding author. Tel.: +1 541 867 0372; fax: +1 541 867 3907.
E-mail addresses: David.Mellinger@oregonstate.edu (D.K. Mellinger), cwc2@cornell.edu (C.W. Clark).

0003-682X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.apacoust.2006.06.002

1. Introduction

The automatic recognition of mammal sounds is a promising tool for acoustic investigations
of free-ranging animals in their natural habitats. Primary applications of acoustic
methods include surveys for documenting the relative abundances and seasonal distributions
of particular species [1–5] and studies of acoustic behavior, especially when augmented
with techniques for tracking animals in their native habitat [6–8]. Most
traditional, non-acoustic methods for conducting surveys or studying behavior involve
visual observation, but acoustic methods have certain advantages over visual ones. Some
environments, including most ocean habitats, are inaccessible or inhospitable to visual
observation, but are accommodating to acoustic methods. For cetaceans, sound is probably
the preferred mode of information transfer for distances greater than 100 m, and
therefore acoustic methods offer a significant improvement over visual methods by taking
advantage of the animal’s natural propensity to produce sounds. For animals in general,
except in rare cases, visual observation is possible only in daytime, while acoustic methods
can be used for nocturnal species and species that are active throughout the 24-h day. Conditions
may make visual observation difficult, as in the case of fog, high sea states, or ice
cover in polar regions. With automated recognition methods, extensive sound recordings
can be made and processed relatively quickly, while visual observation is usually very
labor-intensive and requires trained observers. For some species, particularly for marine
species that rely heavily on the acoustic modality, acoustic methods may very well offer
more appropriate insights into the biology than other, less direct methods. A researcher
using the acoustic modality probably perceives the environment closer to the way the animal
does, and perhaps can appreciate its behavior more directly.
Acoustic observation can be done by a person who has had appropriate training, but an
automated method has certain advantages. It is unbiased, or rather its bias is constant
rather than possibly changing from time to time and place to place. It can be used to process
large amounts of data; this is quite important in many types of field work, where thousands
of hours of sound may need to be analyzed rapidly. In real-time applications, an
automatic method can work on many data streams at the same time, allowing simultaneous
monitoring of sound data being collected from many different locations. Sounds
above or below the frequency range of human hearing can be processed, as can sounds
that change too quickly or too slowly for humans to detect easily. Some detection methods
may have better performance than human listeners; for example, they may be able to
detect sounds too faint, or too buried in noise, for humans to perceive.
Automatic recognition is also interesting as a problem in signal processing. The task is
particularly challenging for a number of reasons, including the high variability in animal
sounds at the individual, intra-specific, and inter-specific levels of analysis; the transient
nature of the signals involved; and the fact that ambient noise encountered during field
recording conditions is non-stationary and non-Gaussian.

2. Motivation for the archive

The MobySound marine mammal sound archive is a resource made available to aid
research on the detection and classification of marine mammal sounds. In particular, it
is intended to provide a common ground for training, testing, and comparing algorithms
for automatic recognition of large whale sounds. Although there are several sound
libraries and archives containing marine animal sounds (e.g., the Macaulay Library at the
Cornell Laboratory of Ornithology, the Wildlife Section of the British Library’s National
Sound Library, and the collection of the Borror Laboratory for Bioacoustics at Ohio State
University), none yet has the ancillary data needed to make them immediately useful for
automatic call recognition research. MobySound's creation was motivated by several
needs of researchers working on automatic call recognition, where recognition involves
the tasks of detection and classification.
First, there is a need for access to validated, representative sounds of various species. It
can be quite difficult and expensive to collect sounds from large whales, particularly pelagic
species, and many researchers have sounds from only a few species. Researchers sometimes
exchange sound recordings, but this process can be haphazard and incomplete.
MobySound provides a common resource containing sounds from a number of species
of marine mammals. With such a resource researchers can find out the types of species-specific
vocalizations (and other sounds) in the archive, learn something about the
variation present in the sounds, use the sounds as input data for training their automatic
recognition algorithms, and develop more general and robust detection and classification
methods.
Second, researchers need a common collection of data against which to compare the
performance of various algorithms. By having a common set of data to use as input to var-
ious methods, the output of the various analysis methods can be based on consistent trial
data, and the methods’ success rates can be directly compared. The need for common data
sets has long been recognized in the speech recognition community, where archives such as
TIMIT [9,10] are used for direct comparison of error rates in speech recognition. It is
hoped that having such an archive of marine mammal sounds will help spur the development
of increasingly accurate methods for automated sound recognition. An archive oriented
toward signal-processing sounds exists [11], but has almost no non-human biological
signals. Other compilations of marine animal sounds have been made for specific projects
[12,13], but none has multiple species recorded in multiple places and has been made
openly available.

3. Contents of the archive

MobySound consists of sound sets, each of which contains recordings and associated
information, or metadata. Recordings in a sound set contain sounds from one species, collected
using one recording configuration in or near one geographic area over one relatively
short time span. Each recording consists of a set of sound files representing a continuous
recording session. For example, one sound set in the collection consists of metadata and
sound files containing bowhead whale (Balaena mysticetus) vocalizations collected off Pt.
Barrow, Alaska during the migration of spring 1988 [14]. Within that sound set there are
multiple recordings from the various recording sessions in 1988, so each recording is
archived as a set of sound files from the same recording session. Recordings of bowhead
whales made using different recording equipment, or made in a different place, or in a different
year or season, would be in a separate sound set.
In addition to sound files, a sound set contains metadata associated with the recordings.
One such file, the README file, is included with each sound set in the archive. It is a textual
description of the sound set with information common to all recordings in the sound
set. The README file includes

 The scientific name of the species, and optionally the common name(s).
 The geographic location of recording.
 The depth of the hydrophone(s) used in the recording, when available.
 The sound-speed profile at the time of recording, when available.
 The start date and start time of recording. This may also be included in the names of the
sound files, in which case the file-naming convention should be described.
 The recording equipment used, including microphone/hydrophone, recorder make and
model, tape speed, filter equipment type, amplifier type and settings; and digitizer type,
bit depth, and other factors that may affect digitization. The recording system's frequency
response is also specified if it is non-flat in the frequency band of the sounds
of interest.
 The sampling rate of the recorded sounds (this is incorporated in the sound file headers,
but it should also be included here).
 The spectrogram parameters used when annotating the time–frequency (T–F) boxes
containing calls.
 Copyright information, if any.
 Labels for the columns in the individual annotation tables, as described below.
 Whether the annotation tables cover more than one individual.
 References to the call type in the literature, if desired, particularly references to any
publications on automatic recognition using this call type.
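
Put together, a README file following the list above might be laid out as in the following sketch. The field names and layout here are illustrative only; the archive's actual files may differ.

```
Species:                 <scientific name (common name)>
Location:                <geographic location of recording>
Hydrophone depth:        <depth in m, when available>
Sound-speed profile:     <when available>
Start date/time:         <or a description of the file-naming convention>
Equipment:               <hydrophone, recorder, filters, amplifier, digitizer, bit depth>
Frequency response:      <if non-flat in the band of the sounds of interest>
Sampling rate:           <Hz>
Spectrogram parameters:  <frame size, overlap, FFT size, window>
Copyright:               <if any>
IAT column labels:       <as used in the annotation tables>
Individuals:             <whether annotations distinguish individuals>
References:              <publications on this call type>
```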

Each sound file in MobySound typically contains from one to several hundred sounds
from one animal, as well as background noise and perhaps sounds from other animals of
the same or different species. Information is needed to delineate where in the sound file the
sounds of interest occur. But first, the issue had to be addressed of what sounds are "of
interest". Automatic call recognition can operate at many levels of specificity; the level
desired directly affects how animal sounds should be delineated. For a natural-sound
library, one typically aspires to detect all species of animals in a given recording, and
all sounds from a focal individual. When performing an acoustic survey, one may want
to detect sounds of a certain group defined by taxonomy (say, all balaenopterids) or by
frequency (say, all low-frequency marine mammals). A more common objective is to find
all sounds of a certain species – all blue whale sounds, for instance – or all sounds of a
certain type of vocalization, such as all blue whale "B" calls. At the most specific level,
one may wish to find all instances of a certain call type produced by a certain individual;
in this case, calls of conspecifics would be considered clutter.
For MobySound, the decision was made to delineate sounds at the level of individuals.
In many applications of automatic call recognition, it is desirable to know which sounds in
the recording came from the same individual, where sounds are attributed to the same
individual based on such attributes as the consistency of sound patterns (e.g., cadence
of song notes or phrases) and acoustic qualities, or the similarity of features in conjunction
with the animal’s location and movement. In most cases, attribution of sounds to the same
individual is not based on visual recognition of the individual in synchrony with its vocal
behavior. For instance, some automatic recognition algorithms use multiple successive
calls from an individual for recognition [5]; or in a population census, it may be desired
to count acoustically active individuals [15]. In MobySound, the sounds of each individual
are delineated separately and kept as a list of annotations. This list is stored in a file known
as an individual annotation table (IAT), with one IAT for each animal present in each
sound file. In cases where sounds from two, three, or more individual animals occur on a
single recording, there are two, three, or more IATs.
Determination of which sounds come from an individual animal is the responsibility of
the researcher submitting a sound for inclusion in MobySound. This determination can be
done in several ways: by loudness, as when only one individual of the species being
recorded is very close to the sensor (hydrophone or microphone), or when two or more
individuals are vocalizing but one is substantially and consistently louder (e.g. [16]); by
acoustically localizing the recorded animals so that each call can be reliably associated
with an individual; or by determining from other acoustic characteristics (frequency contour,
duration, timing, multipath effects, etc.) that calls are sufficiently similar such that
they may be reliably identified as originating from one individual [17]. In many recording
situations, it is impossible to determine which sounds come from which individual. For
such recordings in MobySound, a note is included in the README file that sounds cannot
be reliably associated with an individual, just a species.
The IAT includes start and end times and lower and upper frequency bounds for each
annotated sound. The time and frequency information is useful to recognition algorithms
as both training and testing data. The IAT also includes information about
the signal-to-noise ratio (SNR) of each sound. This is provided to help calibrate the success
rates of detection and classification algorithms. With a high-SNR recording of a
species’ vocalization, many different types of recognition algorithms will produce reliable
results. A stronger test of the ability of a recognition algorithm can be attained when
sounds are weak relative to background noise or interfering sounds. For this reason,
SNR measures are present for each sound in the archive; submissions to the archive
must include this information. A method for calculating SNR for transient sounds is
described below.
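
The exact on-disk format of an IAT is not specified in this section, but a table of this kind, one row per annotated sound with start/end times, frequency bounds, and SNR, might be represented and read as in the following sketch. The comma-separated layout, field names, and file naming are assumptions for illustration, not the archive's actual format.

```python
import csv
from dataclasses import dataclass

@dataclass
class Annotation:
    start_s: float   # start time of the sound, seconds into the file
    end_s: float     # end time, seconds
    low_hz: float    # lower frequency bound, Hz
    high_hz: float   # upper frequency bound, Hz
    snr_db: float    # time-local, in-band signal-to-noise ratio, dB

def read_iat(path):
    """Read a hypothetical comma-separated IAT file, one annotation per row."""
    with open(path, newline="") as f:
        return [Annotation(*(float(x) for x in row))
                for row in csv.reader(f) if row]
```

A detector's output could then be scored against such a list by matching detections to annotation boxes in time and frequency.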

4. Measurement of signal-to-noise ratio

The lower the signal-to-noise ratio of a sound, the more difficult it is, in general, to
detect and classify the sound. The SNR is classically measured for stationary signals by
determining the powers of the signal and the background noise, and dividing the signal
power by the noise power values. Animal sounds are transient, and they often vary rapidly
in amplitude and frequency, characteristics that present problems when measuring SNR.
Here we discuss some of the problems that arise in devising a consistent measure of SNR
(Section 4.1) and present the solutions used in MobySound (Sections 4.2 and 4.3).

4.1. Challenges in measuring SNR

The naive approach to measuring SNR is to measure the average power of a given ani-
mal sound, with as little background noise included as possible; measure the average noise
power in nearby time periods when the animal is not calling; subtract noise power from
sound-plus-noise; and divide the result by noise power.
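
For waveform data, that naive recipe can be sketched as follows. This is only an illustration of the naive approach, not MobySound's method, and it assumes the call and noise segment boundaries are already known.

```python
import numpy as np

def naive_snr_db(x, fs, call_start, call_end, noise_start, noise_end):
    """Naive SNR: average power in the call segment, minus average power in
    a nearby noise-only segment, divided by the noise power, in decibels.
    Times are in seconds; x is a 1-D waveform sampled at fs Hz.
    Returns -inf if the call segment is no louder than the noise."""
    call = x[int(call_start * fs):int(call_end * fs)]
    noise = x[int(noise_start * fs):int(noise_end * fs)]
    p_call_plus_noise = np.mean(call ** 2)   # signal + noise power
    p_noise = np.mean(noise ** 2)            # noise power alone
    p_signal = max(p_call_plus_noise - p_noise, 0.0)  # subtract noise power
    return 10.0 * np.log10(p_signal / p_noise)
```

The complications discussed next (band limits, transient noise, multipath) are exactly what this naive version ignores.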
Some complications arise with short-duration animal sounds. A single animal sound is
limited in frequency. Should noise in the whole frequency range recorded be measured, or
just noise in the frequency band containing the particular sound of interest? The usual, stationary
SNR measure for digital sounds includes all of the noise in the recording, i.e., in the
frequency range from 0 Hz to half the sampling frequency. However, most call-detection
methods operate in a limited frequency band, so a measure of noise limited to the frequency
band containing the sound of interest is more appropriate.
This problem is further complicated by the fact that animal sounds can vary widely in
frequency from one occurrence to the next. For example, bowhead whales may sing in the
range of 800–900 Hz for one note of a song, and not go above 250 Hz in the next note. For
measuring the noise power under such variable circumstances, what frequency range
should be used – the frequency range of the animal’s sounds that are close in time to where
the noise was measured, the frequency range of all of the animal’s sounds, or the whole
frequency range of the recording, from 0 to half the sampling frequency? Furthermore,
if it is decided to use the frequency band in which the species vocalizes, should this be
the frequency band of all of the species’s sounds, the band of just one song or kind of call,
or the band that occurs in a given recording? If it is to be the band containing all of the
species’s sounds, how can this band be determined, given that any one recording rarely has
all the sounds that a given species can produce?
Noise itself is often transient on animal recordings – wind noise, wave and ship noise,
ice sounds in polar regions, and other animal sounds are all fairly common transient sound
sources. This raises a question about time periods for measuring noise. Should noise power
be averaged over the entire length of a recording? Over an entire field season? Or should
some shorter period, such as a fixed interval of time around each animal sound, or the
quiet time between adjacent animal sounds, be used?
Echoes and other multipath effects raise further problems. In measuring the SNR of a
sound, should the energy present in the sound’s echoes be counted? For some detection
methods, such as correlation methods (e.g. [18,19]), this energy may help detection of
the sound, while for other methods such as spectrogram correlation [20], it may hinder
it. Multipath propagation can be an especially acute problem with the oceanic sounds
in this archive.
Some species’ sounds do not have well-defined limits in time or frequency or both. For
instance, the harbor seal vocalization in Fig. 1 fades into background noise in the upper
frequencies of the recording. For such sounds, it is difficult to determine what part of
the recording should be considered signal, and what part noise. The problem is complicated
by the fact that different choices of spectrogram parameters reveal different signal
features. For instance, long integration times (i.e., large FFTs) are good for revealing faint
constant-frequency tones in background noise, while short integration times (i.e., small
FFTs) reveal impulsive sounds better. How should spectrogram parameters be chosen?
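
The trade-off between long and short integration times can be seen directly by computing spectrograms of the same recording at two window lengths. The sketch below uses SciPy on a synthetic signal (a faint tone plus clicks in noise); the parameter values are illustrative, not taken from the archive.

```python
import numpy as np
from scipy import signal

fs = 12000                                  # sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(0)
tone = 0.05 * np.sin(2 * np.pi * 3000 * t)  # faint constant-frequency tone
clicks = np.zeros_like(t)
clicks[::1200] = 1.0                        # impulsive sounds, 10 per second
x = tone + clicks + 0.05 * rng.standard_normal(t.size)

# Long window: fine frequency resolution (~11.7 Hz bins), better for the tone.
f_long, t_long, S_long = signal.spectrogram(
    x, fs, window="hamming", nperseg=1024, noverlap=768)
# Short window: fine time resolution (93.75 Hz bins), better for the clicks.
f_short, t_short, S_short = signal.spectrogram(
    x, fs, window="hamming", nperseg=128, noverlap=96)
```

Plotting `S_long` and `S_short` side by side shows the tone standing out in the first and the clicks in the second, which is why MobySound records the spectrogram parameters used for each sound set.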

4.2. SNR measurement choices

Answers to all of the above problems were needed for creating MobySound. The
answers were arrived at by considering (1) the biological research driving automatic ani-
mal sound recognition, (2) the intended use of MobySound as an aid to automatic call
detection research, (3) the consequent desire for the SNR measure to reflect the difficulty
of detecting a call, and (4) the desire for a simple, understandable, consistent and repeat-
able method that works for as many species as possible. In MobySound, the SNR is deter-
mined separately for each sound recorded from the species of interest. For each sound,
SNR is computed by first measuring the average signal power in a time/frequency box
bracketing the sound. That is, signal power is measured within the time and frequency
bounds just large enough to contain each sound of interest. Noise power is measured
within a limited frequency band, a band that is fixed throughout an entire recording. The
frequency band for calculating noise is given by the minimum and maximum frequencies
in the set containing all of the individual animal's sounds in the recording. These limits are
used because automatic call recognition algorithms must typically analyze all sounds
occurring within this frequency range for the sounds of interest, but can usually ignore
sounds outside this frequency range. For reasons of simplicity and reproducibility, the frequency
minimum and maximum are measured separately for each recording. This choice
was made so that the noise frequency band, and hence SNR, would not change if at some
later time more recordings of the given species were added to MobySound. (Note that the
noise frequency band could still vary with the way in which a very long recording is cut
into separate, shorter sound files; this effect has been negligible for all recordings in MobySound
to date.) Noise power is computed by using the entire minimum/maximum frequency
band. The average power of a given animal sound is that of a time-frequency
box that just encompasses the sound.

Fig. 1. Spectrogram of a harbor seal (Phoca vitulina) ‘roar’ vocalization that illustrates the difficulty of defining
the time/frequency bounds of a vocalization. The upper parts of the roar vocalization fade into background noise,
making it difficult to define a precise upper frequency bound, and it is similarly difficult to define a precise
beginning and end to this vocalization. Such ambiguity is not at all uncommon in marine mammal sounds.
Spectrogram parameters: sampling rate 12 kHz, frame size 0.021 s (256 samples), 75% overlap, FFT size 512
samples, Hamming window, filter bandwidth 190 Hz.
When computing the SNR of a given call, the most appropriate periods of time in
which to measure noise are adjacent to the call. Therefore, for measurement of noise
power, MobySound uses the time periods between adjacent calls. As stated above, spectrogram
parameters may affect how much of a given vocalization is visible in a spectrogram,
and thus affect the measurement of SNR. Choice of the spectrogram parameters used for
annotating vocalizations is left to the researcher submitting sounds for inclusion in MobySound,
with the qualification that the choice should reveal important features of the animal
sounds of interest. However, the restriction is imposed that all sounds included in a
given sound set should be measured using the same spectrogram parameters. For measuring
signal power, the time and frequency bounds defining the extent of each animal sound
are determined by inspection of the sound's spectrogram. All of the animal's sound apparent
in the spectrogram should be included in the measurement of signal power. This is the
best rule that can be consistently applied to all sounds in the archive.

In some of the bioacoustic research for which MobySound was created, the most useful
output of an animal sound recognition system is the sequence of sounds from a single individual.
For MobySound annotations, sounds from only a single individual are considered
"signal", while sounds of other individuals of the same species are "noise". Thus, noise is
measured between adjacent calls of the same individual, rather than between adjacent calls
of individuals of the same species. This definition will match the actual operation of animal
sound recognition systems when research progresses in the direction of separating the
calls of each individual. Distinguishing individuals is encouraged but not required in
MobySound: for some recordings, sounds of individuals cannot be clearly distinguished,
and for such recordings, noise is measured between adjacent calls of the same species,
rather than between adjacent calls of the same individual. Note of this fact is included
in the README file as described above.
Multipath arrivals of a sound, including echoes, are considered noise. This choice was
made principally for simplicity and consistency; if all echoes were considered part of the
signal, it would be difficult to define when multipath arrivals end and background noise
begins, and it would thus be difficult to achieve a consistent measure of SNR. In MobySound,
when a given marine mammal sound arrives several times because of multipath
effects, signal power is measured for the loudest arrival only. Even this definition can be
difficult to apply, as multiple arrivals of a sound can overlap and interfere. Note that
the loudest arrival is not necessarily the first arrival, nor is it necessarily the direct-path
arrival [21].
Fig. 2 illustrates the measurement of signal power and noise power for an example
sound, a humpback whale song unit. In the figure, the rectangular solid-line box at center
delineates the area in which signal power for the unit of interest is measured. This box
includes some background noise. MobySound’s definition of SNR considers this noise

[Figure 2 appears here: spectrogram image; axes approximately 384–392 s and 500–1500 Hz.]

Fig. 2. Measuring the signal-to-noise ratio in a spectrogram of a humpback whale (Megaptera novaeangliae) song
unit. The unit being measured here is in the solid-outline box at center; all energy inside this box is considered
signal energy. Noise energy is measured in the two dashed-line boxes flanking the measured song unit. The
frequency bounds of the noise boxes are determined by the maximum and minimum frequencies of all units in the
song, including units that are outside the time period displayed here. These energy measures are used to calculate
the signal power and noise power and then the signal-to-noise ratio. Spectrogram parameters: sampling rate
4 kHz, frame size 256 samples, 75% overlap, FFT size 512 samples, Hamming window, filter bandwidth 63.4 Hz.

to be part of the signal principally because of a desire for simplicity: Other definitions of
SNR that attempt to include only the song unit's frequency contour are difficult to implement
in practice and raise further problems in defining exactly what should be included in
the signal. Thus, it was decided to use a relatively simple measure for signal power: the
average power in a rectangular box outlining the sound of interest. The larger dashed-line
boxes delineate the areas in which noise is measured; these "noise boxes" span a wider frequency
band than the calls shown here, since this band must be wide enough to contain
other calls not shown that occur at slightly different frequencies. This method, because
it increases the amount of noise in the measurement, tends to negatively bias the SNR.
Fig. 3 shows an example of two animals calling in the same time period, and illustrates
that sounds from the two are clearly differentiable. Although this example is perhaps easier
than others, it remains true for many recordings that one can distinguish the sounds of
different individuals by using consistencies in the frequency, contour shape, multipath
structure, peak intensity, overall intensity, or period of successive sounds.
Fig. 4 illustrates the measurement of signal power and noise power in an instance when
multipath propagation causes several arrivals of each vocalization, and when more than
one whale is present. This figure shows fin whale pulses from at least two different whales,
labeled A and B. Which whale produced which call was determined from timing relationships,
frequency characteristics, and loudness of all the pulses, including some pulses outside
the time period shown. There are also several multipath arrivals of each call of whale
A; the loudest arrival in each set of multipath arrivals is used for SNR measurement. In
the figure, the tall, thin, solid-line box delineates the time–frequency (T–F) region in which
signal power for the call of interest is measured. The larger dashed-line boxes delineate the
T–F bounds in which noise is measured; these "noise boxes" are defined using the previous
and following pulses from whale A while ignoring pulses from whale B. Including sounds
from other individuals as "noise" tends to negatively bias the SNR compared to methods
that include only stationary background noise, but excluding all other individuals from
noise measurement raises the much more difficult problem of deciding what are, and what
are not, the faint calls of other individuals.


Fig. 3. Example showing regularly-repeating pulses from two individual fin whales (Balaenoptera physalus), here
labeled A and B. Pulses from the two whales are clearly differentiated by consistencies in frequency, period,
intensity, and multipath arrival structure. In addition, the sequence of pulses from whale B pauses for
approximately two minutes, perhaps indicating a blow (breathing) cycle [39], while that from whale A does not
pause here. Spectrogram parameters: sampling rate 100 Hz, frame size 128 samples, 75% overlap, FFT size 512
samples, Hamming window, filter bandwidth 3.17 Hz.

[Figure 4 appears here: spectrogram image; axes approximately 80–160 s and 15–30 Hz.]

Fig. 4. Measuring the signal-to-noise ratio in a spectrogram of fin whale pulses when more than one whale is
vocalizing. Labels A and B underneath the pulses indicate which of the two whales present produced which pulse;
this determination was made by examining a longer sequence of pulses than is shown here. Unlabeled pulses, like
the pulses appearing a few seconds before each A pulse, are other multipath arrivals of the labeled pulses. The
pulse being measured here, from whale A, is in the solid-line box at center. The time bounds of the flanking noise
boxes (dashed boxes) are determined by preceding and succeeding pulses from that same whale; pulses from
whale B are ignored when measuring whale A. Spectrogram parameters: sampling rate 100 Hz, frame size 128
samples, 75% overlap, FFT size 256 samples, Hamming window, filter bandwidth 3.17 Hz.

4.3. SNR measurement procedure

The above choices can be reduced to the following procedure for calculating the SNR
of calls of an individual animal in a given recording. The result is called the "time-local,
in-band signal-to-noise ratio". First, choose the target sound type of interest (e.g., call, song
note, broadband pulse). Determine which of the target sounds in the recording come from
that individual using the methods described previously. Choose spectrogram parameters
appropriate to the sound of interest and make a quadratically-scaled spectrogram of the
recording in question. In the spectrogram, for each sound from the individual, outline a
time-frequency box surrounding the loudest multipath arrival; this step is usually accomplished
interactively by a person, so it is the labor-intensive part of this process. For each
sound, compute the sum Es of the signal in the sound’s time–frequency (T–F) box; most
spectrogram programs will provide this number once the T–F box is specified. We call Es
the energy here because it is the sum over time of what are normally called power spectra.
(Note: Neither ‘power spectrum’ nor ‘energy’ is a physically correct term [22, pp. 178 and
183–184], but ‘power spectrum’ is widely used to describe a second-order pressure or volt-
age spectrum, and ‘energy’ is a logical extension since energy is the integral or sum of
power over time.) Compute signal power
    Ps = Es / ts,                                    (1)
where ts is the duration (length in time) of the T–F box containing the sound. Determine
the noise frequency bounds by taking the minimum and maximum frequencies of all of the
measured sounds. For each sound, measure the energy in two T–F boxes (see Figs. 2 and
4): the energy En1 in the box containing the time between the end of the previous sound
and the start of the sound being analyzed, and the energy En2 in the other box containing
the time between the end of the sound being analyzed and the start of the next sound. In
frequency, both T–F boxes are bounded by the minimum–maximum frequency range just
described. Let tn1 and tn2 be the durations of the two boxes adjacent to the sound being
analyzed, respectively. Compute noise power as
    Pn = (En1 + En2) / (tn1 + tn2).                  (2)
Then the SNR is
    R = Ps / Pn                                      (3)
or in decibels,
    RdB = 10 log10(Ps / Pn).                         (4)
For the very first call in a sequence of sounds, the T–F box used to measure noise is as
follows: the frequency bounds are the same bounds as for all noise measurements; the end-
ing time is the start time of the first sound; and the duration is the median duration of all
of the inter-call intervals in the sequence of sounds. An analogous definition holds for the
noise T–F box at the end of the sound sequence.
Note: A different, seemingly more accurate, approach to calculating the SNR of a
sound is obtained by observing that some noise is present in the T–F box that the sound’s
energy is measured in, so that Es represents the energy of signal plus noise, and similarly
for Ps. To correct for this, the noise power is subtracted from the signal power:
    R = (Ps - Pn) / Pn = Ps / Pn - 1.                (5)
Unfortunately, this can cause problems because of the non-stationarity of the noise. In
measuring a sound and its surrounding noise, the noise in the area of background mea-
surement (between two adjacent sounds) may be louder than the signal-plus-noise present
during the sound. In other words, in Eq. (5), Pn can be larger than Ps, leading to a negative
R and complex-valued RdB, which would be meaningless. This problem is not merely
hypothetical, as it has been observed with some of the sounds initially placed into Moby-
Sound (see below).
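The failure mode can be seen with made-up numbers (a minimal illustration, not drawn from the archive itself):

```python
import math

# Hypothetical powers: a burst of transient noise makes the background
# measurement louder than the signal-plus-noise box itself.
ps = 1.0  # "signal" power, Eq. (1)
pn = 2.0  # noise power, Eq. (2)

r_simple = ps / pn            # Eq. (3): positive whenever ps, pn > 0
r_corrected = ps / pn - 1.0   # Eq. (5): negative here

print(10.0 * math.log10(r_simple))  # about -3.01 dB, well defined
print(r_corrected)                  # -0.5; its log10 would be undefined
```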
Because of these problems with highly transient noise, the definition of the SNR used
for MobySound is Eq. (4). This definition has been encoded as a MATLAB program
called snr.m which is available as part of MobySound. This program takes as input the
digital sound recording in question and the set of T–F boxes outlining the sounds, and
produces as output the SNR values for the sounds.
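The procedure can be sketched in Python. This is an illustrative re-implementation under stated assumptions, not the snr.m program distributed with MobySound: the spectrogram is assumed to be a precomputed grid of power values with uniform time and frequency bin sizes, the function and parameter names are our own, and the fallback for a sequence containing a single sound (which has no inter-call intervals) is our own choice.

```python
import math
from statistics import median

def band_energy(spec, dt, df, t0, t1, f0, f1):
    """Sum spectrogram power over a time-frequency box, times the bin
    duration dt -- the 'energy' in the loose sense used in the text.
    spec[time_bin][freq_bin] holds power; bins are dt s by df Hz."""
    e = 0.0
    for ti in range(int(t0 / dt), int(t1 / dt)):
        for fi in range(int(f0 / df), int(f1 / df)):
            e += spec[ti][fi] * dt
    return e

def time_local_snr_db(spec, dt, df, boxes):
    """Time-local, in-band SNR (dB), Eq. (4), for a time-sorted list of
    (start_s, end_s, low_hz, high_hz) boxes from one individual.
    Assumes the recording extends somewhat past the last sound."""
    # Noise band: min/max frequency over all of the measured sounds.
    f_lo = min(b[2] for b in boxes)
    f_hi = max(b[3] for b in boxes)
    # Median inter-call gap, used for the noise box flanking the first
    # sound (and, symmetrically, the last one).
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    gap = median(gaps) if gaps else dt  # single-sound fallback (assumption)
    snrs = []
    for i, (t0, t1, lo, hi) in enumerate(boxes):
        # Signal power: energy in the sound's own box over its duration.
        ps = band_energy(spec, dt, df, t0, t1, lo, hi) / (t1 - t0)
        # Noise boxes: the gap before and the gap after, in the common band.
        n_start = boxes[i - 1][1] if i > 0 else max(0.0, t0 - gap)
        n_end = boxes[i + 1][0] if i < len(boxes) - 1 else t1 + gap
        en = (band_energy(spec, dt, df, n_start, t0, f_lo, f_hi) +
              band_energy(spec, dt, df, t1, n_end, f_lo, f_hi))
        pn = en / ((t0 - n_start) + (n_end - t1))
        snrs.append(10.0 * math.log10(ps / pn))
    return snrs
```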

5. Format of files in a sound set

The above sections described the motivation for choosing sounds to be stored in Moby-
Sound, the manner of describing animal sounds in the sound recordings, and the method
for measuring signal-to-noise ratio. This section describes the basic content of the files in
MobySound. Each sound set includes a ‘‘README’’ text file; one or more sound files;
and for each sound file, one or more IATs.
The README file, named README or README.txt or something similar, is plain
text and has the contents described in Section 3. Plain text was chosen, despite its limita-
tions, to avoid complications about incompatible document formats on different computer
systems, and to make the information in the files accessible to all users as simply as possible.
Each sound set has one or more sound files. Each sound file is a single continuous
recording – a sequence of digital samples from one or more hydrophones. Each sound file
is a recording of one or more individual animals, and may contain one or more sounds,
where a sound is a continuous or nearly-continuous acoustic utterance. The ‘‘nearly-
continuous’’ qualification is included so that sounds that most likely function as single units
in communication, such as trills or rapid pulse trains, may be considered single events.
There are many formats in use today for storing digital recordings as computer files. In
order to reduce complexity for researchers using these sound files, the requirement has
been instituted for MobySound that sound files included in the archive must be in WAVE
(.wav) format. WAVE format was chosen because it is the sound file format used most
widely, accepted for input and output by almost all programs that manipulate sound.
WAVE files can have multiple channels, can store sound samples in a variety of number
formats (8-bit, 16-bit, U-law, etc.), and can hold sound recorded at virtually any sampling
rate.
For researchers who do not use WAVE file format, there are programs available on the
Internet for conversion of sound files from one format to another. One such program, Ish-
mael [23], can convert between binary format, AIFF, and other file formats used on most
computers today (Sun, PC-Windows, Apple, SGI, etc.).
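The header of a WAVE file can also be inspected directly with Python's standard-library wave module; the function below is an illustrative sketch, not part of MobySound (and note that the stdlib module handles common PCM sample formats, not necessarily every encoding a WAVE file may carry).

```python
import wave

def wave_info(path):
    """Return basic properties of a WAVE file -- channel count, sampling
    rate, sample width, and duration -- using only the standard library."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": w.getsampwidth(),
            "duration_s": w.getnframes() / rate,
        }
```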
Each sound file has one or more IATs associated with it. Table 1 shows examples of two
such IATs. Each table indicates where in the given sound file the sounds from an individ-
ual animal occur. In this table, each entry defines a single sound in terms of the annotation
of the sound event in the sound file (i.e., begin and end times and lower and upper frequen-
cies). IATs also contain the SNR of each sound, measured by the procedure described
above.

Table 1
Two individual annotation tables (IATs) for the two fin whale pulse sequences shown in Fig. 4, with ‘‘Whale A’’
and ‘‘Whale B’’ corresponding to the labels in Fig. 4
Start time (s) End time (s) Low frequency (Hz) High frequency (Hz) Signal-to-noise ratio (dB)
Whale A 84.3 85.2 18.2 33.0 9.7
123.3 124.2 18.1 31.8 11.1
161.4 162.3 17.6 32.1 13.4
Whale B 86.4 87.5 17.3 26.4 9.4
100.9 101.8 17.7 25.5 13.7
114.5 115.5 16.9 26.7 10.0
132.4 133.0 17.1 20.0 1.2
149.8 150.6 16.4 20.3 5.2
169.6 170.5 16.6 20.2 5.3
Each of these two tables is stored in a separate annotation file in the archive. Each row contains information
describing a single fin whale pulse. The two files in the archive containing these tables have purely numeric data,
and do not contain the labels and headings shown here (‘‘start time’’, ‘‘Whale A’’, etc.).

In recordings for which it is not possible to unambiguously determine which individual
made which sound, a single IAT is used, and SNRs are measured as if only one individual
were present. In this case, it is recommended that a note be included in the README file
that more than one individual may be present in the IAT.
The IATs are stored as plain-text files, with one line – one table row – per animal sound.
Each row contains five or more numeric values separated by spaces or tab marks. The first
five columns, in order, are (1) start time (in seconds) of the animal sound relative to the
start of the file, (2) end time, (3) lower frequency (in Hertz), (4) upper frequency, and
(5) signal-to-noise ratio (in dB). These values may be specified with any number of signif-
icant digits; two digits after the decimal point is typical for the entries currently in Moby-
Sound. Other columns may be appended by the contributor of the sound, if desired. This
file does not have any header lines describing the information, since that is done in the
README file for the sound set as a whole. The choice to have no header lines was made
to ensure that only numeric information is stored in the table, so that the file may be read
as easily as possible by computer programs.
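Because the tables are purely numeric, a reader is only a few lines in any language; the Python sketch below (the field names are our own, not part of the MobySound specification) parses one IAT into a list of dictionaries.

```python
def read_iat(path):
    """Parse an individual annotation table: one whitespace-separated
    numeric row per sound, with at least five columns (start s, end s,
    low Hz, high Hz, SNR dB); extra contributor columns are preserved."""
    sounds = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:        # tolerate stray blank lines
                continue
            v = [float(x) for x in fields]
            sounds.append({"start_s": v[0], "end_s": v[1],
                           "low_hz": v[2], "high_hz": v[3],
                           "snr_db": v[4], "extra": v[5:]})
    return sounds
```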

6. Current entries in MobySound

The current entries in MobySound are intended as only a starting point, and it is hoped
that more entries will be added by both the authors and by others working on marine
mammal sound detection and classification. The initial choice of sound recordings to
put into the archive was determined by the intended use of MobySound for detection
and classification. Thus, the basic aim is to include sounds that represent as much of
the naturally occurring acoustic variation of a given species as possible. To this end, we
intend to collect sounds from as wide a geographic region as possible. We also aim to
cover as much time variation as possible – diurnal variation, seasonal variation, and
year-to-year variation. To date, sounds entered into the archive have been low-frequency
vocalizations from mysticete (baleen) whales. Low-frequency sounds were chosen because
their sampling rate can be low, enabling a large volume of sound data to be stored in the
archive. Table 2 shows the species represented.
The bowhead whale sounds are all end-notes from the 1988 bowhead song and gener-
ally do not include other parts of this species’ complex songs or vocal repertoire. The
sounds were collected during the spring 1988 migration, off Pt. Barrow, Alaska [14]. Bow-
head songs change from year to year [24]. These bowhead sounds have been previously
used in some work on automatic recognition [20,25].

Table 2
Entries currently in the archive, along with the geographic region from which the sounds came and the number of
individually annotated sounds
Species Common name Geographic region Number of sounds
Balaena mysticetus Bowhead whale Northern Alaska 589
Balaenoptera musculus Blue whale North Atlantic 405
B. physalus Fin whale North Atlantic 3066
B. acutorostrata Minke whale North Atlantic 178
B. edeni Bryde’s whale Eastern tropical Pacific 7403
Eubalaena australis Southern right whale South Africa 67
Eubalaena japonica North Pacific right whale Bering Sea 38
Megaptera novaeangliae Humpback whale North Atlantic 2310

The blue, fin, and minke whale sounds are from 1993 and early 1994, and were recorded
from the North Atlantic Ocean [26]. Species identifications were done by comparison to
known sounds from visually identified whales [27–29]. Some of these sounds have been
used in sound-detection work [30]. Efforts are underway to add sounds of these species
from other regions of the world and other years.
The Bryde’s whale sounds were recorded within 12° of the equator in the eastern Pacific
[31]. They were recorded on autonomous hydrophones [32] and were identified as Bryde’s
whale sounds by their similarity to sounds visually associated with Bryde’s whales [33].
Humpback whale sounds were recorded during spring 1994 off the north shore of
Kauai, Hawaii [34]. Each ‘‘sound’’ is a unit of a male humpback’s song. Humpback songs
also change throughout a season, and therefore this collection represents only a very small
portion of the sounds of singing humpbacks.
The southern right whale sounds were recorded off South Africa in 1998. The North
Pacific right whale sounds were recorded in the southeastern Bering Sea in 1999. The
sounds from both of these species are the type of low-frequency moans described by
Thompson et al. [35].

7. Inclusion in Macaulay Library

MobySound sound sets will in the future be included in the Macaulay Library (ML;
formerly Library of Natural Sounds) at the Cornell Laboratory of Ornithology – specif-
ically, in its Marine Mammal Collection. Details are still being clarified, but essentially,
the recordings in MobySound will become part of the ML collection; the information
in the README files and individual annotation tables will become fields of the ML meta-
data database. This inclusion offers certain advantages. Most importantly, it makes the
recordings and metadata part of an archive that will persist in usable form into the fore-
seeable future, beyond the lifetime of anyone now living. Such persistence has become an
issue for a number of personal marine sound collections, which have started to suffer from
physical decay of recording media, obsolescence of media formats (e.g., it is difficult to find
players for some older types of tapes), loss of handwritten notes and other metadata, and
similar problems of age. Second, it takes advantage of the advanced data organization
available in ML – making metadata searchable, making advanced taxonomic organization
and description possible, making it possible to associate other behaviors with given animal
sounds, etc.

8. Submission of sounds for inclusion

Additions to MobySound are welcomed. To add a sound set for a new species to Moby-
Sound, first decide on a subset of your recordings that contains a representative sample of
the variety of sounds you have. Preferably, this sample would include sounds recorded from
many individuals, in several different locations, at several different times. Such variety is not
always possible, of course; include as much as is available. It is acceptable to include only
one sound type, since automatic detection is sometimes done on a single sound type.
Write a README file for your sounds that covers the topics mentioned in Section 3.
See the existing README files in MobySound for examples.
Using a sound editing program, trim your sound files so that they include mainly the ani-
mal sounds of interest, without an excessive amount of background sound at the beginning
and end. However, some automatic recognition algorithms use a background noise estima-
tion step (e.g. [5]), so it is necessary to leave some background sound before the first animal
sound of interest. A useful compromise between the needs of such algorithms and the desire
to reduce storage of unwanted sound is to leave an amount of background noise recording
equal to five to ten times the duration of a typical sound in the recording. For instance, the
most common sounds of fin whales last approximately 1 s [36], so it is appropriate to leave
at least 10 s of background noise at the beginning and end of each recording.
Measure the marine mammal sounds on your recordings. Using a spectrogram tool
with measurement capabilities such as Canary [37], Ishmael [23], or Raven [38], manually
choose values defining the start and end times and lower and upper frequencies of each
sound of each individual. Make up one file for each sequence of marine mammal sounds
on your recordings: Arrange it as a table with one sound per row, with each row having
the four entries start time, end time, low frequency, and high frequency, respectively. If
MATLAB is available, run the snr.m routine with the sound file as input; the program
appends a fifth column to your input file with the signal-to-noise ratio information. If
MATLAB is not used, use the procedure outlined in Section 4.3 to do this computation.
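The five-column table can be produced with any scripting language; for instance, the Python sketch below (a hypothetical helper, with two digits after the decimal point to match the precision typical of existing entries) writes one row per sound, appending the computed SNR as the fifth column.

```python
def write_annotation_table(path, boxes, snrs):
    """Write one annotation row per sound: start time, end time, low
    frequency, high frequency, and the matching SNR as a fifth column.
    boxes: list of (start_s, end_s, low_hz, high_hz); snrs: list of dB."""
    with open(path, "w") as f:
        for (t0, t1, f_lo, f_hi), snr in zip(boxes, snrs):
            f.write(f"{t0:.2f} {t1:.2f} {f_lo:.2f} {f_hi:.2f} {snr:.2f}\n")
```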
Finally, contact the first author of this paper about your wish to provide sounds.
Sounds can be transferred by FTP to the MobySound site, or other arrangements for
transfer may be made.

9. Access to MobySound

MobySound may be accessed at http://hmsc.oregonstate.edu/projects/MobySound/.

Acknowledgments

We thank Christopher Fox, Adam Frankel, Mark McDonald, and Leonie Hofmeyr-
Juritz for contributions to the archive, and Sara Heimlich, Sharon Nieukirk, Irene Guo,
Janet Doherty, Catherine Marzin, and Mehdi Ouni for their help delineating sounds in
the recordings currently in MobySound. This work was supported by Office of Naval Re-
search Grants N00014-93-1-0431 and N00014-03-1-0099, by NOPP award 45393-7649,
and by the David and Lucile Packard Foundation. This is PMEL contribution #2939.

References

[1] Clark CW, Charif RA, Mitchell SG, Colby J. Distribution and behavior of the bowhead whale, Balaena
mysticetus, based on analysis of acoustic data collected during the 1993 spring migration off Point Barrow,
Alaska. Sci Rep int Whaling Commission 1996;46:541–54.
[2] Taylor A, Watson G, Grigg G, McCallum H. Monitoring frog communities: an application of machine
learning. In: Innovative applications of Artificial Intelligence conference. Menlo Park (CA): AAAI Press;
1996. p. 1564–9.
[3] Barlow J, Taylor BL. Preliminary abundance of sperm whales in the northeastern temperate Pacific
estimated from a combined visual and acoustic survey. Document number SC/50/CAWS20, International
Whaling Commission, Cambridge; 1998.
[4] Stafford KM, Nieukirk SL, Fox CG. Low-frequency whale sounds recorded on hydrophones moored in the
eastern tropical Pacific. J Acoust Soc Am 1999;106:3687–98.
[5] Mellinger DK, Stafford KM, Fox CG. Seasonal occurrence of sperm whale (Physeter macrocephalus) sounds
in the Gulf of Alaska, 1999–2001. Mar Mammal Sci 2004;20:48–62.
[6] Watkins WA, Tyack P, Moore KE, Bird JE. The 20-Hz signals of finback whales (Balaenoptera physalus). J
Acoust Soc Am 1987;82:1901–12.

[7] Clark CW. Acoustic communication and behavior of the southern right whale. In: Payne RS, editor.
Behavior and communication of whales. Boulder: Westview Press; 1983. p. 163–98.
[8] Clark CW, Gagnon GJ. Low-frequency vocal behaviors of baleen whales in the North Atlantic: insights from
Integrated Undersea Surveillance System detections, locations, and tracking from 1992 to 1996. J
Underwater Acoust [to appear].
[9] DARPA. The DARPA TIMIT acoustic–phonetic continuous speech corpus (CD-ROM). Technical Report
PB91-505065, National Institute of Standards and Technology, Gaithersburg; 1990.
[10] Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren N. The DARPA TIMIT acoustic–
phonetic continuous speech corpus on CDROM (printed documentation). Gaithersburg (MD): National
Institute of Standards and Technology; 1991.
[11] Johnson DH, Shami PN. The signal processing information base. IEEE Signal Process Mag
1993;10(4):36–43.
[12] Desharnais F, Laurinolli MH, Schillinger DJ, Hay AE. A description of the workshop datasets. Can Acoust
2004;32:33–8.
[13] Watkins WA, Fristrup K, Daher MA, Howald T. SOUND database of marine animal vocalizations.
Technical Report 92-31, Woods Hole Oceanographic Institution, Woods Hole, MA; 1992. 52pp.
[14] Clark CW, Bower JL, Brown LM, Ellison WT. Received levels for bowhead whale sounds recorded off Point
Barrow, Alaska in spring 1988: preliminary results. Sci Rep int Whaling Commission 1992;42:500.
[15] George JC, Zeh J, Suydam R, Clark C. Abundance and population trend (1978–2001) of the western Arctic
bowhead whales surveyed near Barrow, Alaska. Mar Mammal Sci 2004;20:755–73.
[16] Mellinger DK, Clark CW. Blue whale (Balaenoptera musculus) sounds from the North Atlantic. J Acoust Soc
Am 2003;114:1108–19.
[17] Clark CW. The use of bowhead sound-tracks based on sound characteristics as an independent means of
determining tracking parameters. Sci Rep int Whaling Commission 1989;39:111–3.
[18] Leaper R, Chappell O, Gordon J. The development of practical techniques for surveying sperm whale
populations acoustically. Sci Rep int Whaling Commission 1992;42:549–60.
[19] Stafford KM, Fox CG, Clark DS. Long-range acoustic detection and localization of blue whale sounds in the
northeast Pacific Ocean. J Acoust Soc Am 1998;104:3616–25.
[20] Mellinger DK, Clark CW. Recognizing transient low-frequency whale sounds by spectrogram correlation. J
Acoust Soc Am 2000;107:3518–29.
[21] Premus V, Spiesberger JL. Can acoustic multipath explain finback (B. physalus) 20-Hz doublets in shallow
water? J Acoust Soc Am 1997;101:1127–38.
[22] National Research Council. Ocean noise and marine mammals. Washington: National Academy Press; 2003.
[23] Mellinger DK. Ishmael 1.0 user’s guide. Technical Report OAR-PMEL-120, National Oceanographic and
Atmospheric Administration, Seattle; 2001.
[24] Würsig B, Clark CW. Behavior. In: Burns J, Montague J, Cowles CJ, editors. The bowhead whale.
Lawrence: Allen Press; 1993. p. 157–93.
[25] Potter JR, Mellinger DK, Clark CW. Marine mammal sound discrimination using artificial neural networks.
J Acoust Soc Am 1994;96:1255–62.
[26] Nishimura CE, Conlon DM. IUSS dual use: monitoring whales and earthquakes using SOSUS. Mar Technol
Soc J 1994;27(4):13–21.
[27] Edds PL. Vocalizations of the blue whale, Balaenoptera musculus, in the St. Lawrence River. J Mammal
1982;63:345–7.
[28] Schevill WE, Watkins WA, Backus RH. The 20-cycle signals and Balaenoptera (fin whales). In: Tavolga WN,
editor. Marine bio-acoustics. New York: Pergamon Press; 1964. p. 147–52.
[29] Winn HE, Perkins PJ. Distribution and sounds of the minke whale, with a review of mysticete sounds.
Cetology 1976;19:1–12.
[30] Mellinger DK, Clark CW. Methods for automatic detection of mysticete sounds. Mar Freshwater Behav
Physiol 1997;29:163–81.
[31] Heimlich SL, Mellinger DK, Nieukirk SL, Fox CG. Types, distribution, and seasonal occurrence of sounds
attributed to Bryde’s whales (Balaenoptera edeni) recorded in the eastern tropical Pacific, 1999–2001. J
Acoust Soc Am 2005;118:1830–7.
[32] Fox CG, Matsumoto H, Lau T-KA. Monitoring Pacific Ocean seismicity from an autonomous hydrophone
array. J Geophys Res 2001;106(B3):4183–206.
[33] Oleson EM, Barlow J, Gordon J, Rankin S, Hildebrand JA. Low frequency sounds of Bryde’s whales. Mar
Mammal Sci 2003;19:407–19.

[34] Frankel AS, Clark CW, Herman LM, Gabriele CM. Spatial distribution, habitat utilization, and social
interactions of humpback whales, Megaptera novaeangliae, off Hawaii, determined using acoustic and visual
techniques. Can J Zool 1995;73:1134–46.
[35] Thompson TJ, Winn HE, Perkins PJ. Mysticete sounds. In: Winn HE, Olla BL, editors. Behavior of marine
mammals: current perspectives in research. New York: Plenum Press; 1979.
[36] Watkins WA, Schevill WE. Sound source location by arrival-times on a non-rigid three-dimensional
hydrophone array. Deep-Sea Res 1972;19:691–706.
[37] Charif RA, Mitchell SG, Clark CW. Canary 1.2 user’s manual. Technical Report, Cornell Laboratory of
Ornithology, Ithaca (NY); 1995.
[38] Charif RA, Clark CW, Fristrup KM. Raven 1.3 user’s manual. Technical Report, Cornell Laboratory of
Ornithology, Ithaca (NY); 2006.
[39] Patterson B, Hamilton GR. Repetitive 20 cycle per second biological hydroacoustic signals at Bermuda. In:
Tavolga WN, editor. Marine bio-acoustics. New York: Pergamon Press; 1964. p. 125–45.
