
Christopher Ariza

Department of Music
Towson University
8000 York Road
Towson, Maryland 21252 USA
ariza@flexatone.net

The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems

Computer Music Journal, 33:2, pp. 48–70, Summer 2009
© 2009 Massachusetts Institute of Technology.

Procedural or algorithmic approaches to generating music have been explored in the medium of software for over fifty years. Occasionally, researchers have attempted to evaluate the success of these generative music systems by measuring the perceived quality or style conformity of isolated musical outputs. These tests are often conducted in the form of comparisons between computer-aided output and non-computer-aided output. The model of the Turing Test (TT), Alan Turing's proposed Imitation Game (Turing 1950), has been submitted and employed as a framework for these comparisons. In this context, it is assumed that if machine output sounds like, or is preferred to, human output, the machine has succeeded. The nature of this success is rarely questioned, and is often interpreted as evidence of a successful generative music system. Such listener surveys, within necessary statistical and psychological constraints, may be pooled to gauge common responses to and interpretations of music; yet these surveys are not TTs. This article argues that Turing's well-known proposal cannot be applied to executing and evaluating listener surveys.

Whereas pre-computer generative music systems have been employed for centuries, the idea of testing the output of such systems appears to only have emerged since computer implementation. One of the earliest tests is reported in Hiller (1970, p. 92): describing the research of Havass (1964), Hiller reports that, at a conference in 1964, Havass conducted an experiment to determine if listeners could distinguish computer-generated and traditional melodies. Generative techniques derived from the fields of artificial intelligence (AI; for example, neural nets and various learning algorithms) and artificial life (e.g., genetic algorithms and cellular automata) may be associated with such tests due to explicit reference to biological systems. Yet, since only the output of the system is tested (that is, system and interface design are ignored), any generative technique can be employed. These tests may be associated with the broader historical context of human-versus-machine tests, as demonstrated in the American folk-tale of John Henry versus the steam hammer (Nelson 2006) or the more recent competition of Garry Kasparov versus Deep Blue (Hsu 2002).

Some tests attempt to avoid measures of subjective quality by measuring perceived conformity to known musical artifacts. These musical artifacts are often used to create the music being tested: they are the source of important generative parameters, data, or models. The design goals of a system provide context for these types of tests. Pearce, Meredith, and Wiggins (2002, p. 120) define four motivations for the development of generative music systems: (1) composer-designed tools for personal use, (2) tools designed for general compositional use, (3) theories of a musical style . . . implemented as computer programs, and (4) cognitive theories of the processes supporting compositional expertise . . . implemented as computer programs. Such motivational distinctions may be irrelevant if the system is used outside of the context of its creation; for this reason, system-use cases, rather than developer motivations, might offer alternative distinctions. The categories proposed by Pearce, Meredith, and Wiggins can be used to generalize about two larger use cases: systems used as creative tools for making original music (motivations 1 and 2, above), and systems that are designed to computationally model theories of musical style or cognition (motivations 3 and 4). These two larger categories will be referred to as creative tools and computational models. Although design motivation is not included in the seven descriptors of computer-aided algorithmic systems proposed in Ariza (2005), the idiom affinity descriptor is closely related: systems with singular idiom affinities are often computational models.

Explicitly testing the output of generative music systems is uncommon. As George Papadopoulos and
Geraint Wiggins (1999, p. 113) observe, research in
generative music systems demonstrates a lack of
experimental methodology. Furthermore, there is
usually no evaluation of the output by real experts.
Similarly, Pearce, Meredith, and Wiggins (2002,
p. 120), presumably describing all types of generative
music systems, state that researchers often fail to
adopt suitable methodologies for the development
and evaluation of composition programs and this, in
turn, has compromised the practical or theoretical
value of their research. In the case of creative tools,
the lack of empirical output evaluation is not a
shortcoming: creative, artistic practices, no matter
the medium, often lack rigorous experimental
methodologies. While system and interface design
benefit from rigorous evaluation, the relevance of
systematic output evaluationwhether conducted
by experts or otherwiseis questionable. The lack
of systematic evaluation of aesthetic artifacts in
general is traditionally accepted: evaluation is
more commonly found as aesthetic criticism, not
experimental methodology.
This article examines the concept of a musical
TT. A variety of tests, proposed and executed by
researchers in diverse fields, are examined, and it is
shown that musical TTs do not actually conform
to Turings model. Use of the TT in the evaluation
of generative music systems is superfluous and
potentially misleading; its evocation is an appeal to
a measure of some form of artificial thought, yet,
in the context of music, it provides no more than a
listener survey.
The term musical judgments will be used
to include a range of statements listeners might
make about music. Musical judgments may be
aesthetic judgments or well-reasoned and informed
interpretations; they may evaluate conformity to
a known style or perceived similarity to existing
works; they may also be statements of taste,
bias, or preference. Such judgments are often
subjective: they are statements about the experience
of hearing and interpreting music. Some musical
judgments may be objective; asserting or denying
this claim, however, requires a psychoacoustic and
philosophical inquiry beyond the scope of this study.

Stanley Cavell (2002, p. 88), while arguing that aesthetic judgments are conclusive and rational,
states that such judgments lack something: the
arguments that support them are not conclusive
in the way arguments in logic are, nor rational
in the way arguments in science are. Aesthetic
judgments, as described by Immanuel Kant in the
Critique of Judgment (1790), can be divided into
two categories: taste of sense, which concerns what
is merely pleasant or agreeable, and taste of
reflection, which concerns what is beautiful (Cavell
2002, p. 88). Musical judgments, in this framework,
include both the taste of sense and the taste of
reflection. Some critics are better than others at
distinguishing these tastes. Critics make musical
judgments to affirm aesthetic value.
Listener surveys rely on musical judgments.
Although the nature of these judgments can be
shaped in the selection of the listeners and the design
of the survey, there is an essential psychological
uncertainty in what informs individual musical
judgments. This uncertainty is heightened in the
context of an anonymous survey, where a listener
is neither accountable for nor required to defend
their judgments. Thus, a skepticism in regard to
the results of listener surveys, perhaps more than
surveys of other human experiences, is warranted.
This article asserts the particular nature of the TT
in order to discourage false associations of the TT
with listener surveys. This critique is not aimed at
individual researchers or research projects. The TT
is an attractive concept: it is not surprising that it
has found its way into discourse on generative music
systems. More practical methods of system evaluation, however, may do more to promote innovation.
Comparative analysis of system and interface design, or studies of user interaction and experience,
are examples. Such approaches to system evaluation,
while only offering limited insight into musical output, promote design diversity and provide valuable
research upon which others can build.

The Turing Test


In 1950, Alan Turing devised a method of answering
the question: "Can a machine think?" (Turing 1950, p. 433). His original model, called The Imitation Game, includes a human interrogator who, through
a text-based interface designed to remove the aural
qualities of speech and the visual appearance of the
subject, communicates with two agents. One agent
is human; the other, machine. The interrogator must
be aware that one of the agents is a machine. If the
interrogator, through discourse, cannot successfully
distinguish the human from the machine, then the
machine, in Turing's view, has achieved thinking.
Importantly, Turing does not define thinking or
intelligence (Copeland 2000, p. 522), and Turing does
not claim that passing this test proves thought or
intelligence (Harnad 2000, p. 427). As Steven Harnad
states, Turing's goal is an epistemic point, not an
ontic one, and heuristic rather than demonstrative
(Harnad 2000, p. 427).
Turing based his test on a party game in which
the interrogator attempts to distinguish the gender
of two concealed human agents. Turing's original
description of his test is incomplete, and has led to a
variety of interpretations. Some have suggested that
there are actually two tests, the Original Imitation
Game Test and the Standard Turing Test (Genova
1994; Sterrett 2000). Yet B. Jack Copeland, based on
analysis of additional commentary by Turing, states
that it seems unlikely that Turing saw himself as
describing different tests (Copeland 2000, p. 526).
Turing suggests that multiple tests must be
averaged with, presumably, multiple human agents
and interrogators. He predicted that by the year
2000, an average interrogator, after five minutes
of conversation, will make a correct identification
no more than 70 percent of the time (Turing 1950,
p. 442). Additional sources suggest that Turing
estimated it would take over 100 years before a
machine regularly passed the test (Copeland 2000,
p. 527). Anticipating complaints to such a test,
Turing (1950) responds to at least nine hypothetical
objections.
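Turing's criterion reduces to simple bookkeeping over repeated sessions. The following minimal sketch, with invented session outcomes rather than any historical data, illustrates how averaged trials might be tallied against the 70-percent figure.

```python
# Hypothetical session outcomes: True means the interrogator correctly
# identified the machine within the five-minute conversation.
sessions = [True, False, True, True, False, False, True, False, True, False]

correct_rate = sum(sessions) / len(sessions)

# Turing's 1950 prediction: an average interrogator makes a correct
# identification no more than 70 percent of the time.
print(f"Correct identification rate: {correct_rate:.0%}")
print("Within Turing's predicted bound" if correct_rate <= 0.70
      else "Above Turing's predicted bound")
```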
Under the heading of The Argument from
Consciousness, Turing responds to an argument
presented by Geoffrey Jefferson a year earlier (1949).
Jefferson states that it will not be sufficient for a
machine mind to just use words: It would have
to be able to create concepts and to find for itself
suitable words in which to express additions to

knowledge that it brought about (Jefferson 1949, p. 1110). Beyond just the ability to create concepts,
Jefferson goes on to suggest that such a machine
must have emotions and self awareness. Not until
a machine can write a sonnet or compose a concerto
because of thoughts and emotions felt, and not by
the chance fall of symbols, could we agree that
machine equals brain; that is, not only write it but
know that it had written it (Jefferson 1949, p. 1110).
Jefferson's quote is used by Turing to anticipate the objection of the other minds problem: the argument that it is impossible to prove that
any entity (including humans) has a mind; only
individuals can be certain that they have minds. As
Harnad states, the only way to know for sure that
a system has a mind is to be that system (2000,
p. 428). Turing avoids this problem by rejecting the
solipsist point of view (1950, p. 446) and affirming
that humans use conversation and discourse to
identify the presence of minds in other humans.
Turing takes a pragmatic approach: Instead of
arguing continually over this point it is usual to
have the polite convention that everyone thinks
(Turing 1950, p. 446). This is a natural-language-based
argument from analogy: You are sufficiently like
me in all other visible respects, so I can justifiably
infer (or assume) that you are like me in this invisible
one (Rapaport 2000, p. 469). In short, the TT works
by assuming that humans have minds and that
natural language is sufficient to represent mind; a
machine has a mind if a machine and a human are
indistinguishable in discourse.
An important consequence of the TT is that the
machine's internal mechanism, as well as its outward visual or aural presence, is irrelevant. Through
blind comparison, Turing hoped to surpass personal
and aesthetic bias. The test isolates convincing,
human-like conversation as the sole determinant of
thinking. Harnad calls this functional, rather than
structural, indistinguishability (2000, p. 429). As
Ray Kurzweil states, the insight in Turing's eponymous test is the ability of written human language
to represent human-level thinking (Kurzweil 2002).
This insight, as discussed subsequently, has been
debated.
Turing's goal was, at most, to provide an inductive test of thinking (Moor 1976, 2001, p. 82); at the least, he offered a philosophical conversation stopper (Dennett 1998, p. 4). Turing left significant details
of his test unspecified, including the number of
interrogators and agents, their qualifications, and
the duration and organization of the tests (Halpern
2006, p. 43). Nonetheless, the TT can be repeated
and averaged to arrive at an inductive claim.
French suggests that a simplified version without
comparison between two agents, involving only a
computer agent and an interrogator, is satisfactory:
It is generally agreed that this variation does
not change the essence of Turings operational
definition of intelligence (French 2000, p. 116).
While comparison between two agents might be
replaced with a single agent, the use of natural
language and interaction are always retained.
The machine agent of the TT may deceive the
interrogator: mathematical questions, for example,
may be answered with programmed mistakes
calculated to confuse the interrogator (Turing
1950, p. 448), or answers may be programmatically
delayed to simulate human calculation times. All
that is necessary is that the machine fool a human
a suitable percentage of the time. Peter Kugel
states that Turings proposal does not suggest that
computers will gain intelligence, but that they
will fake intelligence well enough to fool human
beings (Kugel 2002, p. 565).
An early example of a software system well suited
to the TT is Joseph Weizenbaum's ELIZA system
(Weizenbaum 1966). In rough analogue to a human
therapist, ELIZA attempts to communicate with
an interrogator in natural language. Although not
offering quantitative results such as those suggested
by Turing, Weizenbaum notes that some subjects
have been very hard to convince that ELIZA . . . is
not human and that this is a striking form of
Turing's test (Weizenbaum 1966, p. 42).
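The mechanism behind this effect is modest. The following is a minimal sketch in the spirit of Weizenbaum's keyword-and-reflection approach, not his original code; the rules and responses are invented for illustration.

```python
import re
import random

# Hypothetical, much-reduced rule set in the spirit of ELIZA's DOCTOR
# script; the actual system used a richer keyword-ranking mechanism.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}
RULES = [
    (r"i feel (.*)", ["Why do you feel {0}?", "How long have you felt {0}?"]),
    (r"i am (.*)",   ["Why are you {0}?", "Did you come to me because you are {0}?"]),
    (r"(.*)",        ["Please tell me more.", "Can you elaborate on that?"]),
]

def reflect(fragment: str) -> str:
    # Swap first- and second-person words so the fragment can be echoed back.
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def respond(statement: str) -> str:
    for pattern, responses in RULES:
        match = re.match(pattern, statement.lower().strip())
        if match:
            groups = [reflect(g) for g in match.groups()]
            return random.choice(responses).format(*groups)
    return "Please go on."

print(respond("I feel anxious about my work"))
```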
Humans may too easily associate humanity with
machines. Some have called this the Eliza Effect,
the susceptibility of people to read far more understanding than is warranted into strings of symbols, especially words, strung together by computers
(Hofstadter 1996, p. 155). Jefferson, nearly fifty
years earlier, similarly comments on what he saw
as a new and greater danger threatening: that of anthropomorphizing the machine (Jefferson 1949, p. 1110). Hofstadter further characterizes the Eliza Effect as a virus that constantly mutates, reappearing over and over again in AI in ever-fresh disguises,
and in subtler and subtler forms (Hofstadter 1996,
p. 158). As discussed subsequently, the Eliza Effect
may influence the evaluation of generative music
systems.
Describing a response similar to the Eliza Effect,
Halpern notes how surprise is often confused with
success: AI champions point out that the computer
has done something unexpected, and that because
it did so, we can hardly deny it was thinking . . . to
make this claim is simply to invoke the [Turing]
test without naming it (Halpern 2006, p. 51). This
surprise at learning that a computer has performed
some feat that . . . only humans could perform, as
the Eliza Effect, often leads observers to overestimation. Computer-generated art and music are often
used as examples of progress in AI (Kurzweil 1990),
in part because of this surprise factor.
The model provided by ELIZA has inspired a wide
range of systems specialized for natural language
communication. The Loebner Prize, established in
1990 by Hugh Loebner (1994), provides a forum
for conducting Turing-style tests. The prize will
award $100,000 to the developers of the first system
to pass the Loebner form of the TT. Ironically,
during the first Loebner Competition in 1991,
three judges mistook a human for a computer,
presumably because she knew so much about
her topic that she exceeded their expectations for
mere humans (Halpern 2006, p. 57). Significant
differences between the Loebner prize and the
TT have been documented (Shieber 1993; French
2000, p. 121; Zdenek 2001), and many have criticized
the prize, suggesting that more practical goals should
be promoted (Hofstadter 1996, p. 491; Dennett 1998,
p. 28).
Another formal Turing-style test has been established as part of a bet. Mitchell Kapor has predicted
that by 2029 no machine intelligence will have
passed the TT, a prediction challenged by Kurzweil
as part of a $20,000 Long Bets Foundation wager
(Long Bets Foundation 2002; Kurzweil 2002, 2005,
p. 295). As will be discussed herein, Kurzweil's conviction that the TT will be passed by the end of the 2020s (Kurzweil 2005, p. 25) has led him to describe a wide variety of human-machine competitions as TTs.
For over fifty years, there has been tremendous
discussion and criticism of the TT. As Halpern
notes, Turing (1950) is one of the most reprinted,
cited, quoted, misquoted, paraphrased, alluded to,
and generally referenced philosophical papers ever
published (Halpern 2006, p. 42).
A widely cited complaint, called the Chinese
Room Argument (CRA), is presented by John Searle
(1980). Searle's argument imagines a human translating Chinese simply through direct symbol manipulation and table look-up procedures. Searle argues
that, in this case, the human has knowledge only of
syntax, and cannot be seen to have any knowledge of
semantics. This human, with sufficient procedural
resources, could appear to communicate in Chinese
yet have no knowledge of Chinese. From this, Searle
claims that the ability to provide good answers to
human questions does not necessarily imply that
the provider of those answers is thinking; passing
the Test is no proof of active intelligence (Halpern
2006, p. 52). Although some authors have extended
the CRA (Wakefield 2003; Damper 2006), Searle's
argument has been frequently challenged (Harnad
2000, p. 437; Hauser 1993, 1997, 2001; Kurzweil
2005, p. 430; Rapaport 2000). Numerous additional
arguments questioning the value or applicability
of the TT have been proposed (Block 1981; French
1990; Crockett 1994).
The question of whether the TT or similar tests
are sufficient measures of machine thought will
be debated for the foreseeable future (French 2000;
Saygin, Cicekli, and Akman 2000). The answer to
this question is not relevant to the present study.
The TT, regardless of its potential for measuring
thought or intelligence, provides at least a symbolic
benchmark of one form of machine aptitude. More
relevant to this study, the TT has inspired other
forms of comparison tests between human and
machine output.

Alternative Tests

Related tests of machine aptitude have been proposed. Some of these tests attempt to measure human-like experience or creativity and might be
applied to musical activities and musical thoughts.
These tests are examined here to distinguish them
from the TT, discussed previously, and proposed
musical TTs, introduced subsequently.
The John Henry Test (JHT) is proposed here to
label a common approach to testing machine aptitude. A JHT is a competition between a human and
a machine in which there is a clearly defined winner
within a narrowly specialized (and not necessarily
intelligent) domain. A JHT requires quantitative
measures of success: a decisive, independently
verifiable conclusion results. A test of aesthetic
outputs cannot be a JHT. The Garry Kasparov versus
Deep Blue competition can be properly seen as a
JHT: it was a human-versus-machine contest, and
the conclusion of the competition was defined by
a clear (though contested) result. While the TT
might be seen as a type of JHT, the natural-language
interaction of the TT is free-form and not based on
a fixed domain or content. The results of the TT
are based on human-evaluated indistinguishability,
not an independently verifiable distinguishability.
Nonetheless, Bloomfield and Vurdubakis generalize
the TT as something like a JHT, describing such tests
as forms of socially situated performance geared
to the enactment and dramatization of a number
of occidental modernity's fundamental moral and
conceptual categories (Bloomfield and Vurdubakis
2003, p. 35). Although the TT is not a JHT, both
have been used to dramatize changes in machine
aptitude.
Steven Harnad places the TT within what he calls
the Turing Hierarchy. Harnad first extended the TT
into the Total Turing Test (TTT). The TTT requires
that Turings text-based interface be replaced by
full physical and sense-based interaction with a
robot: The candidate must be able to do, in the
real world of objects and people, everything that real
people can do, in a way that is indistinguishable (to
a person) from the way real people do it (Harnad
1991, p. 44). If applied to a generative music system,
the TTT would presumably require a robot to play a
musical instrument, or perform some other musical
task, in a physical manner indistinguishable from
a human. Eighteenth-century musical automata,
such as the flute player of Jacques de Vaucanson or the harpsichord player of the Jaquet-Droz family (Riskin 2003) provide early examples of such human-shaped music machines. More recent attempts, such
as Haile (Weinberg and Driscoll 2006) and the
Waseda Flutist Robot (Solis et al. 2006), explore
more sophisticated musical interactions with less
convincing human-like exteriors. Nick Collins
imagines another example: a musical TTT of sorts
executed within a blind orchestra audition. The
machine would employ a real or virtual instrument,
virtuoso skill, and the important conversational
analysis required to follow instructions and visual
analysis to read complex scores (Collins 2006,
p. 210). No past or present musical automata have
approached the comprehensive ability necessary to
pass a TTT.
Harnad proposes the TTT as a way of adding
semantics to a syntax-only device: the symbols
must be grounded directly and autonomously in
causal interactions with the objects, events and
states that they are about, and a pure symbol-cruncher does not have the wherewithal for that
(Harnad 2000, p. 438). Although this can be seen
as an attempt to avoid the problems raised by
Searle's CRA, some have suggested that the TTT is
unnecessary or misguided (Searle 1993).
Harnad extends the Turing Hierarchy in both
directions: beyond the TTT (or T3) are greater
degrees of human indistinguishability. The T4
requires internal microfunctional indistinguishability (Harnad 2000, p. 439), and the T5 requires
microphysical indistinguishability, real biological
molecules, physically identical to our own (Harnad
2000, p. 440).
More important for this discussion is what Harnad
places below the TT (or T2): level t1. The t in this
context stands for toy models, not Turing. Tests in
this form employ models for subtotal fragments
of our functional capacity (2000, p. 429). Harnad
emphasizes that the true TT is predicated on total
functional indistinguishability; anything less, in
the context of a TT, is a toy, and toys are ultimately
inadequate for the goals of Turing testing (2000,
p. 430). Harnad states that, as all current mind-modeling research remains at the t1 level (2000,
p. 430), we can assume that models of the musical
mind, or models of subtotal musical functionality independent of comprehensive cognitive models, are similarly toys, and inadequate for the TT. As will be developed herein, the t1 level provides a context for musical TTs. The term toy is used to emphasize the distance of these systems from Turing's goal of
thinking, not to suggest that such systems are for
children or are otherwise unsophisticated.
Bringsjord, Bello, and Ferrucci (2001) propose a
test based on Lady Lovelace's discussion of the limits of Charles Babbage's Analytic Engine (Lovelace 1842). Turing describes Lady Lovelace's objection in
his original presentation of the TT (1950, p. 454).
The Lovelace Test (LT) requires that the machine
be creative, where the term creative is used in a
highly restrictive sense. This creativity is evidenced
when the machine produces an artifact through a
procedure that cannot be explained by the creator (or
a creator-peer) of the machine. Specifically, where
H is the human architect, A is the artificial agent,
and o is the output, A has passed the LT when
H (or someone who knows what H knows, and has H's resources) cannot explain how A produced o by appeal to A's architecture, knowledge base,
and core functions (Bringsjord, Bello, and Ferrucci
2001, p. 12). This is explicitly a special epistemic
relationship (2001, p. 9). H is permitted time to
provide an explanation, and may investigate and
study the system in any way necessary, including
analyzing the learned or developed states of a
dynamic system within A. Knowledge contained
within an artificial neural network, for example,
might be explained through such an analysis (2001,
p. 19). Multi-agent systems, or other types of
emergent programming paradigms, may produce
surprising results: these results, however, can be
traced back to the system's architecture, knowledge
base, and core functions. It is easy to underestimate
the difficulty of passing the LT. The LT is designed
to suggest that the notion of creativity requires
autonomy and that there may simply not be a
way for a mere information-processing artifact to
pass LT (2001, p. 25).
Hofstadter offers a perspective similar to that
of the LT, noting that when programs cease to be
transparent to their creators, then the approach to
creativity has begun (1979, p. 673). Elsewhere, Hofstadter states that true creativity implies autonomy: a creative program must make its own decisions, must experiment and explore concepts, and must
gradually converge on a satisfactory solution through
a continual process in which suggestions coming
from one part of the system and judgments coming from another part are continually interleaved
(Hofstadter 1996, p. 411).
Margaret Boden divides the Lovelace objection
into four Lovelace questions: (1) can computational
ideas increase understanding of human creativity; (2)
can computers do things that appear creative; (3) can
a computer appear to recognize creativity; and (4)
can computers really be creative (1990, p. 6). Boden
answers the first three questions affirmatively. The
fourth question is the LT. Boden notes that, even
after satisfying all the scientific criteria for creative
intelligence (whatever those may be), answering
this question requires humans to make a moral
and political decision: this decision amounts to
dignifying the computer: allowing it a moral and
intellectual respect comparable with the respect we
feel for fellow human beings (1990, p. 11). This
respect relates to the issue of intentionality and
authorship, discussed subsequently.
The LT standard of creativity is significantly
higher than the computational creativity defined
by Wiggins, which includes all behavior exhibited
by natural and artificial systems, which would be
deemed creative if exhibited by humans (2006, p.
451). Wiggins, admitting this definition is intangible, does not offer a method or a test to determine
what is or is not deemed creative by humans. Humans may not agree on, or even regularly identify,
creative behavior exhibited by any agent, human or
machine. The apprehension of creativity, like the
identification of successful music, may be a largely
aesthetic problem. Cope recasts the influence of consensus by defining creativity as the initialization
of connections between two or more multifaceted
things, ideas, or phenomena hitherto not otherwise
considered actively connected (2005, p. 11). Here,
the problem of identifying consensus is shifted, not
removed: consensus is required to determine what
is already actively connected.
Independent of whether machines exhibit creativity, no contemporary generative music system is
likely to pass the LT. Generative systems may employ complex stochastic models, neural nets, models of artificial life, or any of a wide range of procedures;
such systems may also serendipitously produce
surprising and aesthetically satisfying outputs. Yet
the output of such systems, upon examination of
the system's architecture, can be explained. Cope, for example, executes an LT of sorts, asserting that an Experiments in Musical Intelligence (EMI) composition from 2003 was not produced creatively nor with creative processes: given enough time, I could reverse engineer this music and find all of its original sources in Bach's lute suites (2005,
p. 44). This is not a practical or aesthetic concern
of music making: as demonstrated by the history of
generative music systems, failing the LT or other
measures of machine intelligence has not limited
the computer-aided creation of music by humans.
Tests that are associated with the TT yet fundamentally alter its structure are the focus of this
article. An early example is provided by Hofstadter in Gödel, Escher, Bach (Hofstadter 1979). Hofstadter
calls this a little Turing test: readers are asked
to distinguish selections of human-written natural
language and computer-generated text, presented in
an intermingled list (Hofstadter 1979, p. 622). Hofstadter fails to articulate the significant deviations
from Turings model.
Similarly, Kostas Terzidis suggests that if an
algorithmically generated paper created by the Dada
Engine system (Bulhak 1996) was submitted to
a conference and accepted, it may have passed
Turing's classic test of computer intelligence
(Terzidis 2006, p. 22). As should be clear, simply
mistaking computer output for human output is not
passing a TT. Appropriately, the author of the Dada
Engine credits Hofstadter's little Turing test as a
source of inspiration (Bulhak 1996).
Another alteration of Turing's model is demonstrated by the Completely Automated Public
Turing test to tell Computers and Humans Apart,
or CAPTCHA. A CAPTCHA is a now-familiar test
given by a computer to distinguish if a user is
either a human or a machine. While superficially
related to the TT in that the CAPTCHA attempts
to distinguish humans and machines, it is not a
TT: there is no interaction, the medium is often
visual (based on the ability to distinguish distorted characters or images), and thinking is not (generally) tested. Moni Naor, in the first proposal for such
tests, fails to articulate these differences, simply
calling these automated Turing Tests (Naor 1996).
Luis von Ahn, Manuel Blum, Nicholas Hopper, and
John Langford, who coined the term CAPTCHA,
promulgate a similar misnomer and continue to
refer to these tests as forms of Automated Turing
Tests (von Ahn et al. 2003; von Ahn, Blum, and
Langford 2004).

Music as the Medium of the Turing Test

To test the output of generative music systems, and to avoid the problems of the TTT, the LT, and definitions of creativity, the TT might be altered by making aesthetic artifacts, music or other creative forms, the medium of the test. In the case of music, this means replacing the text-based medium, in whole or in part, with sound symbols or sound forms. Two models of such tests, amalgamated from diverse sources, are introduced herein. Although sometimes using the language and format of the TT, these tests fundamentally alter the role of the interrogator, recasting the interrogator as a critic. As such, these are not TTs but rather, after Harnad (2000), toy tests.

Musical Directive Toy Test (MDtT)

The interrogator, using a computer interface, sends a musical directive to two composer-agents. One of the composers is a machine, the other, a human. The musical directive could be style- or genre-dependent, or it could be abstract: something like "write sad music," "write a march," or "compose a musical game." The musical directive might also include music, such as melodic or rhythmic fragments upon which the composer-agent would build. The two composers both receive the directive and create music. After an appropriate amount of time (a human scale would be necessary), the completed music is returned to the interrogator in a format such as a score, synthetic digital audio, or digital audio of a recorded performance. A flexible MDtT would permit the interrogator to submit as many musical directives as desired. An MDtT could take the form of a real-time musical call and response or improvisation between interrogator and composer agents. The interrogator must attempt to distinguish the human from the machine. This test retains some aspects of interaction, yet it replaces natural language with music.

Musical Output Toy Test (MOtT)

In this test, two composer-agents provide a score, synthetic digital audio, or digital audio of a recorded performance to the interrogator. One of the composers is a machine, the other, a human. The provided music may be related in terms of style, instrumentation, or raw musical resources, but is not a newly composed response to a specific musical directive (as in the MDtT). Each agent might provide multiple musical works. Based only on these works, the interrogator must attempt to distinguish the human from the machine. This test maintains only the blind comparison of output from two sources; the interaction and discourse permitted in the TT are removed.
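Neither toy test specifies an administration procedure. The sketch below, with invented work identifiers and a placeholder judge, shows one plausible way an MOtT-style paired presentation could be randomized and scored; it also makes plain that the result is nothing more than a tally of listener judgments.

```python
import random

# Hypothetical pairs: each holds one human work and one machine work
# drawing on comparable musical resources.
pairs = [("human_work_1", "machine_work_1"),
         ("human_work_2", "machine_work_2")]

def run_trial(human, machine, judge):
    # Present the two works in random order as "A" and "B"; the judge
    # returns the label they believe identifies the machine.
    works = [human, machine]
    random.shuffle(works)
    guess = judge(list(zip(["A", "B"], works)))
    machine_label = "A" if works[0] == machine else "B"
    return guess == machine_label

def placeholder_judge(presentation):
    # Stands in for a human listener's musical judgment.
    return random.choice(["A", "B"])

correct = sum(run_trial(h, m, placeholder_judge) for h, m in pairs)
print(f"{correct} of {len(pairs)} machine works correctly identified")
```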

Comparison

The MDtT and MOtT, while employing the blind
indistinguishability test of the TT, remove the
critical component of natural language discourse.
The MDtT and MOtT are solely dependent on
the musical judgments of the interrogator. These
musical judgments cannot do what natural language
discourse can do to expose the agent's capacity
for thinking. Successful music, unlike natural
language, does not require a common, external
syntax; successful musical discourse, unlike natural
language, can employ unique, dynamic, or local
syntaxes.
Furthermore, the interrogator may overwhelmingly rely on subjective musical judgments. This
contrasts with the TT, which, while permitting
any form of discourse within the common syntax
of written natural language, is designed to remove subjective visual and aural evaluations through
blind comparison. Although it is uncommon for
humans to interact via text with a completely
unknown agent, it is quite common for humans
to evaluate music without any knowledge of its
author, source, or means of production. The TT
is a strikingly isolated and focused form of blind,
discourse-based evaluation. The MDtT or MOtT, by
using music in a conventional form of delivery, do
not offer a similarly isolated or blind form of evaluation. The interrogator in the TT, although certainly
influenced by subjective ideas of human thought,
attempts to reasonably distinguish between human
and machine based on linguistic constructions and
content. The critic of the MDtT or MOtT is unrestrained, free to employ a wide range or mixture of
musical judgments.
The MDtT is related to the Short Short Story
Game (S3G) proposed and demonstrated by Selmer Bringsjord and David Ferrucci (2000). In this game, a system (BRUTUS, in the case of Bringsjord and Ferrucci) is given a sentence. Based on this sentence,
the system must compose a short story designed to
be truly interesting (Bringsjord, Bello, and Ferrucci
2001, p. 13). The goal of this system is to compete
with human authors in a manner similar to the
MDtT. While Bringsjord, Bello, and Ferrucci (2001,
p. 13) acknowledge that BRUTUS produces some
rather interesting stories, they do not claim that
the system passes the LT. In their view, the system
is not creative. Its output is merely the result of
their input and design: two humans . . . spent
years figuring out how to formalize a generative
capacity sufficient to produce this and other stories
(Bringsjord, Bello, and Ferrucci 2001, p. 14).
The MDtT and MOtT, however, significantly
differ from the S3 G: as stated previously, successful
music, unlike successful stories, may have no
common syntax. The abstract nature of music,
particularly in the context of creative contemporary
practice, is such that there exists no comparable
expectation of grammar or form. Even in the limited
cases where strong expectations of musical grammar
or form exist, subverting these expectations may
be musically legitimate or aesthetically valuable.
Music, like some poetry, is not natural language.
Where the evaluation of the output of S3G may include formal, rational, and aesthetic criteria, the evaluation of output in a MDtT may be limited to
essentially aesthetic musical judgments.
Without the unifying context of a musical
directive, the MOtT, even more than the MDtT,
may result in unreliable or unrepresentative musical
judgments. These judgments may be influenced by
historical or cultural associations about musical
style, expectations of what a machine or a human
sounds like, or assumptions of what is possible
with current technology. These expectations can
vary greatly depending on the listeners surveyed.
Of course, similar notions are likely to be held by
an interrogator in the TT. Yet a responsible TT
interrogator has the option, even the obligation, to
balance such judgments with further discourse by
asking questions or demanding explanations. This
is not possible in the MDtT or MOtT.
Pearce, in describing a MOtT directed at identifying stylistic similarity, finds similar influences
on musical judgments, noting that potential interrogators might shift their attention to searching for
musical features expected to be generated by a computer model rather than concentrating on stylistic
features of the composition (2005, p. 185). While
Pearce notes that the Turing test methodology
fails to demonstrate which cognitive or stylistic
hypotheses embodied in the generative system influence the judgments of listeners (Pearce 2005,
p. 186), he still claims that such tests offer empirical, quantitative results which may be appraised
intersubjectively (p. 184).
Both the MDtT and the MOtT are surveys of
musical judgments, not determinants of thought
or intelligence. Where the TT requires discourse
between an interrogator and an agent, here discourse
is replaced by single-sided criticism. Even in the
musical discourse of an interactive MDtT, the interrogator, at the end of the discourse, must attempt
to distinguish human and machine with musical
judgments. While interrogators can generally agree
on what makes rational and coherent language,
and are likely to concur on what kind of language
offers evidence of thought, critics may not agree on
what makes aesthetically successful music, and are
likely to offer inconsistent or contradictory musical
judgments.

Tests similar to the MOtT have been proposed
and executed in other creative mediums. In all
cases, critical components of the TT, such as
interactivity or the use of natural language, are
removed. Hofstadter's little Turing test, described
earlier, removes interactivity but evaluates natural
language. The evaluation of aesthetic artifacts
pushes such tests even further away from Turings
model.
Related to a MOtT, Ray Kurzweil demonstrates a
test called A (Kind of) Turing Test (Kurzweil 1990,
p. 374). Kurzweil describes a narrower concept of
a Turing test, a domain-specific Turing test, or a
Turing test of believability, where the goal is for a
computer to successfully imitate a human within a
particular domain of human intelligence. Kurzweil
tests the output of his Kurzweil Cybernetic Poet
system with poems by human authors, and he
provides data based on a 28-poem comparison given
to 16 human judges. Kurzweil concludes that
this domain-specific Turing test has achieved some
level of success in tricking human judges in its
poetry-writing ability (Kurzweil 1990, p. 377).
The test proposed by Kurzweil bears only superficial similarity to the TT. Critical components of
the TT are stripped away without consideration:
the test is not interactive, making the interrogator
a judge, and the medium of natural language is
replaced by poetry. Kurzweil employs the association with the TT to suggest that these tests are part
of a trajectory toward completing the TT. Kurzweil
states that the era of computer success in a wide
range of domain-specific Turing tests is arriving
(Kurzweil 1990, p. 378), and considers success with
the Turing test of believability the first level in a
four-level progression toward widespread acceptance
of a computer passing a complete TT (pp. 415–416).
As examples of these narrow versions of the Turing
test of believability, Kurzweil offers diagnosing
illnesses, composing music, drawing original pictures, making financial judgments, playing chess
(p. 415). Most of these narrow TTs, such as diagnosing illnesses and making financial judgments,
are simply JHTs. As will be discussed subsequently,
Kurzweil applies this progression to music, stating
that music composed by computer is becoming
increasingly successful in passing the Turing test of believability (p. 378). The assumption that success in a MOtT foreshadows success in a TT ignores
the critical differences between music and thought
expressed in language.
MDtT and MOtT are often employed as a way of
arguing for the success of a generative music system.
An automatic connection between positive musical
judgments and system-design success, however,
should be questioned. Specific system outputs may
not be representative of the system, and the critic
is in a weak position to judge what is and is not
representative. A generative system may be badly
designed, difficult to use, unreliable, or incapable of
variety, yet produce aesthetically successful outputs.
Just as the aesthetic output of a human may indicate
little, if anything, about the author, the aesthetic
output of a system indicates nothing with certainty
about the systems design. As Pearce, Meredith,
and Wiggins (2002, p. 129) state, Evaluating the
music produced by the system reveals little about
its utility as a compositional tool.
That many systems have already passed MDtT or
MOtT, and that systems developed hundreds of years
ago might fare just as well, further questions what
success such tests grant. As discussed subsequently,
all documented musical TTs report a win for the machine. Even simple Markov-based systems, within
constrained evaluative contexts, have produced
output indistinguishable from human output (Hall
and Smith 1996). An 18th-century generative music
system, such as a dice-based music assembly system
(Gardner 1974; Hedges 1978), could have fooled an
average human a suitable percentage of the time.
The continued execution of such tests may do more
to investigate the limits of musical judgments than
the innovations of generative music systems.
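As a reminder of how little machinery such an 18th-century procedure requires, here is a minimal sketch of a dice-based assembly in the spirit of the systems surveyed by Gardner (1974) and Hedges (1978); the lookup table and measure names are invented placeholders, not historical data.

```python
import random

PHRASE_LENGTH = 4   # historical dice games typically assembled 16 measures

# Hypothetical lookup table: for each position in the phrase, the sum of
# two dice (2-12) selects one pre-composed measure.
TABLE = {
    position: {roll: f"measure_{position}_{roll}" for roll in range(2, 13)}
    for position in range(PHRASE_LENGTH)
}

def roll_two_dice():
    return random.randint(1, 6) + random.randint(1, 6)

def assemble_phrase():
    # One dice roll per position selects the measure to be copied out.
    return [TABLE[position][roll_two_dice()] for position in range(PHRASE_LENGTH)]

print(assemble_phrase())
```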
Finally, there is no necessary connection between
humanism, intelligence, or thinking and aesthetic
success. Thus the TT, designed to discern thinking,
is not automatically equipped to discern aesthetic
success. Although Wiggins and Smaill state that
music is undeniably an intelligent activity, they
admit that some kinds of musical activities seem
to be an almost subconscious and involuntary
response (Wiggins and Smaill 2000, p. 32). Music
is not necessarily an intelligent activity, and it
is certainly not a reliable test of intelligence.

Music may be perceived as an intelligent activity,
even when its genesis is the result of involuntary,
irrational, or algorithmic activities. Whereas music
may be a very human activity, the application of the
TT to music ignores that seemingly successful music
can come from sources with little or no thought.
Hofstadter, for example, describes his profound
sense of bewilderment or alarm (Hofstadter 2001,
p. 39) upon recognizing that convincing music
results from mechanisms thousands if not millions
of times simpler than the intricate biological
machinery that gives rise to a human soul (p. 79).
Soul, mind, thought, intelligence, and creativity are
common though weak determinants of aesthetically
successful music. Assuming the necessity of any of
these determinants when encountering the output
of a generative music system often contributes to a
musical Eliza Effect.
The problem of musical TTs is part of a larger
problem of TT derivatives. Dennett describes how
a failure to think imaginatively about the TT
has led many to underestimate its severity and to
confuse it with much less interesting proposals;
furthermore, there is a common misapplication of
the sort of testing exhibited by the Turing test that
often leads to drastic overestimation of the powers
of actually existing computer systems (Dennett
1998, p. 5). Musical TTs are a misapplication of the
TT that can lead to overestimation.

Discrimination Tests

Use of the TT, even if by name and rough analogy
alone, has significant implications. The history of
the TT and its association with projects in AI make
it a powerful concept in both the academic and
popular imaginations. Alternative blind comparison
tests not associated with the TT make very different
implicit claims than those branded as TTs. For these
reasons, it is important to identify discrimination
tests (DTs) as a type of listener survey that avoids
some of the faults of musical TTs. The DT is similar
to the MOtT. Although not free of the problems
of evaluating musical judgments, such tests, when
properly constrained, permit generalizing these
judgments between selected groups.

Computational models, as generative music
systems with significantly different goals than
those of creative tools, require particular evaluation
strategies. Pearce and Wiggins propose the DT,
where the generated music can be evaluated by
asking human subjects to distinguish compositions
taken from the data set from those generated by
the system (Pearce and Wiggins 2001, p. 25). If the
system-composed pieces cannot be distinguished
from the human-composed pieces, we can conclude
that the machine compositions are indistinguishable
from human composed pieces (p. 25).
Although Pearce and Wiggins note that the DT
bears a resemblance (Pearce and Wiggins 2001,
p. 25) to the TT, they are careful to note significant
differences. The DT is designed not to test machine thinking, but to determine the (non-)membership of
a machine composition in a set of human composed
pieces of music (p. 25). Further, they note that
the critical element of interaction is removed: in
our test the subjects are simply passive listeners:
there is no interaction with the machine (p. 25).
Pearce and Wiggins argue that both the TT and DT
are behavioral tests: the tests are used to decide
whether a behaviour may be included in a set . . . the
set of intelligent behaviours in the case of the TT
and the set of musical pieces in a particular style
in the case of the DT (p. 25).
In a specific case, Pearce and Wiggins use a
collection of musical examples to train a genetic-algorithm-based system. The DT is then used to
determine if the output of the system is distinguishable from the same training examples (Pearce and
Wiggins 2001, p. 25). The authors claim that the final machine compositions are evaluated objectively
within a closed system which provides no place for
subjective evaluation or aesthetic merit (p. 25). Evidence is not provided to affirm that human listeners,
whether experts or novices, can objectively evaluate musical similarity. Furthermore, while such a
test attempts to remove aesthetics from musical
judgments, the authors go on to claim that success
in the DT shows that there are absolutely no perceivable features that differentiate the human and
machine compositions, and that these features may
include such elusive notions as aesthetic quality
or perceivable creativity (p. 25). The authors thus claim that the aesthetic quality of the compositions
is indistinguishable. Further subverting their initial
claim of an objective DT, the authors note that perceived creativity is likely to be closely related to . . .
perceived aesthetic value and that this association
may have been considered by DT subjects (Pearce
and Wiggins 2001, p. 30). Even within the closed
system of the DT, aesthetic values are difficult to
remove from musical judgments.
Another example of a DT is provided by Hall
and Smith (1996). Here, the authors employ various
orders of Markov chains to generate melodies over
a fixed harmonic background in the style of blues.
Transition weightings are derived from computer
analysis of bodies of printed music (Hall and Smith
1996, p. 1163), although full analysis data is not
provided. The authors note that this is a very
primitive idea as to what constitutes a musical
phrase in blues (p. 1165).
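For readers unfamiliar with the technique, the following is a minimal sketch of a first-order Markov melody generator; the pitch names and transition weights are invented for illustration and are not the weightings Hall and Smith derived from printed blues melodies.

```python
import random

# Hypothetical first-order transition weights between a few pitches;
# Hall and Smith derived theirs from computer analysis of printed music.
TRANSITIONS = {
    "C": {"E": 0.4, "G": 0.4, "C": 0.2},
    "E": {"G": 0.5, "C": 0.3, "E": 0.2},
    "G": {"C": 0.6, "E": 0.4},
}

def next_pitch(current):
    pitches, weights = zip(*TRANSITIONS[current].items())
    return random.choices(pitches, weights=weights)[0]

def generate_melody(start="C", length=8):
    melody = [start]
    for _ in range(length - 1):
        melody.append(next_pitch(melody[-1]))
    return melody

print(generate_melody())
```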
To evaluate their output, the authors subjected
198 people to listening tests. Each participant
is given ten pairs of tunes; for each pair, one is
machine-generated and one is human-generated.
Hall and Smith make no reference to Turing, calling
this procedure a listening test (Hall and Smith
1996, p. 1165). The authors suggest that these tests,
owing to the binary result of each question, can
be viewed as a series of Bernoulli trials (p. 1166).
The authors claim that if the model successfully
captures the structure of blues melodies, then
listeners should have trouble distinguishing between
human composed tunes and computer tunes (p.
1166). The results show that people are unable to
reliably distinguish between blues tunes composed
by humans and those composed by the computer
model described (p. 1166). Although Hall and Smith
designed their test to avoid subjective issues . . .
such as quality, the participants were explicitly
asked to distinguish human from computer (p.
1167). As suggested previously, expectations of
human and machine performance may influence
musical judgments.
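Treating each paired judgment as a Bernoulli trial, as Hall and Smith suggest, makes the statistical question concrete. The sketch below uses invented response counts rather than their reported data, and shows how one might test whether correct identifications exceed chance.

```python
from math import comb

def one_sided_binomial_p(successes, trials, p=0.5):
    # Probability of observing at least `successes` correct identifications
    # if every response were a coin flip (chance level p).
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical counts: 112 correct identifications in 200 paired trials.
p_value = one_sided_binomial_p(112, 200)
print(f"p = {p_value:.3f}")  # a small p would suggest listeners beat chance
```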
In 1992 and 1993, Dahlig and Schaffrath, conducting a DT, found that aesthetic preferences led to
incorrect assumptions of human authorship (Dahlig
and Schaffrath 1997, p. 211). The study, called Kompost, employed two-phrase folk melodies, some of which were original while others were constructed by artificially recombining fragments of existing
melodies. Dahlig and Schaffrath collected responses
from individuals with diverse genders, national
origins, and musical backgrounds. Response forms
recorded, for each melody, both assumed origin
(human or computer) and a rank of personal preference. They observed that, uncorrelated to actual
origin, human attribution was most often given
to melodies with common stylistic features, such
as parallel rhythmic structures between phrases,
melodies ending on the first scale degree, and
melodies starting the second phrase on a note other
than the first scale degree (Dahlig and Schaffrath
1997, p. 216). Thus, unusual or disagreeable human
compositions were attributed to machines.
More recently, Collins (2008) employs spot the
difference listener surveys as a means of evaluating
the output of Infno, a generative music system specialized for synth-pop and electronic dance music.
Collins is careful to note that subjective biases are
hard to remove: any choice of testing procedure
can be problematised for the subjective and social
domain of music. Furthermore, the tests, as much
as anything else, revealed much about the individual subjectivities of the participants (Collins 2008).
This DT is employed to gather feedback as guidance
for future iterations of the software, with particular
attention given to the variation in output allowed
by the generative musical artifact. Similarly to the
results of Hall and Smith, no statistically significant results . . . were found to distinguish perception
of human and computer generation.
DTs often rely on musical judgments of musical
style: listeners are asked to distinguish works not
by quality, but by stylistic conformity. While some
specific attributes of broad musical styles, genre, or
idioms may appear with regularity, apprehension of
style is an interpretive musical judgment informed
by the consensus of critics. DTs that depend on
genre classifications are in a weak position to make
definitive claims about the evaluated music or the
generative music system.
John Zorn, as well as many other musicians,
rejects such classifications: they are used to
commodify and commercialize an artist's complex personal vision . . . [T]his terminology is not about understanding . . . it's about money (Zorn 2000, p. v). Pachet and Cazaly (2000), as part of a study of
music taxonomies, support this view, stating the
most important producers of music taxonomies
are probably music retailers. In a study of large
online music taxonomies, these authors illustrate
that there is little if any consensus on the terms,
structures, or meanings deployed.
As shown in Aucouturier and Pachet (2003),
recent efforts to automatically sort music into
discrete style classes based on signal or symbolic
representations have demonstrated limited success;
success is often a direct result of extremely narrow
conceptions of genre (p. 92). The study of Soltau
et al. (1998), after demonstrating an Explicit Time
Modeling with Neural Networks (ETM-NN) system
to classify music within four genres, shows that
for some genres humans are just as ill-equipped as
their system to classify music: human confusions
in this experiment are similar to confusions of
the ETM-NN system. Aucouturier and Pachet
erroneously call this listener survey a Turing Test
(Aucouturier and Pachet 2003, p. 88). Unable to
rigorously define genre, such approaches implement
systems to match the consensus interpretations
of critics. Aucouturier and Pachet, supporting this
view, state that music genre is an ill-defined notion,
that is not founded on any intrinsic property of the
music, but rather depends on cultural extrinsic
habits (p. 84). DTs often ignore the real diversity
and elusive nature of these extrinsic habits. The
danger of testing circular, ungrounded projections
(p. 83) is great.
As an alternative, Aucouturier and Pachet form
genre-like clusters based on extrinsic similarity:
specifically, they perform a co-occurrence analysis
based on cultural similarity from text documents
such as radio playlists and track listings of compilation CDs. This approach relies on the documented
interpretations of critics (e.g., DJs and editors): similarity is asserted if two works are placed together.
It is significant that such a measure of similarity
cannot be applied to the newly created output of
a generative music system. As such works lack
any cultural texts or criticism, listener surveys or
measures of intrinsic similarity may be the only
means of comparison.
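The co-occurrence idea itself is easy to state in code. The following minimal sketch uses invented playlists and a naive normalization, and is far simpler than the analysis Aucouturier and Pachet describe.

```python
from itertools import combinations
from collections import Counter

# Hypothetical playlists; each inner list stands in for a radio playlist
# or the track listing of a compilation CD.
playlists = [
    ["track_a", "track_b", "track_c"],
    ["track_a", "track_b"],
    ["track_b", "track_c", "track_d"],
]

# Count how often each pair of works appears in the same document.
co_occurrence = Counter()
for playlist in playlists:
    for pair in combinations(sorted(set(playlist)), 2):
        co_occurrence[pair] += 1

# A simple similarity: co-occurrence count normalized by total appearances.
appearances = Counter(track for playlist in playlists for track in set(playlist))

def similarity(a, b):
    pair = tuple(sorted((a, b)))
    return co_occurrence[pair] / min(appearances[a], appearances[b])

print(similarity("track_a", "track_b"))  # 1.0: always listed together
print(similarity("track_a", "track_d"))  # 0.0: never listed together
```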

Proposed and Executed Musical Turing Tests

Despite the problems described herein, musical TTs
have been discussed and executed as a means of
evaluating the success of various generative music
systems. The following examples, as a collection of
small case studies, illustrate diverse applications of
music to TTs. Although some DTs report the failure
of machines to produce indistinguishable results
(Pearce and Wiggins 2001), every executed musical
TT reports machine success.
After summarizing the TT, Alan Marsden suggests that a musical version of this test could be
proposed (Marsden 2000, p. 22). Marsden describes
an MOtT in which there are two rooms, each with a
composer and a means of distributing music to the
outside world. One of the composers is a machine.
The test is passed when observers cannot distinguish
which composer is a computer. Marsden offers this
example to state that, although a computer might
pass this test in practice, the computer could never
pass the test in principle (p. 23). The reason for
this, Marsden explains, is that originality is an
essential characteristic of music . . . computers are
digital automata, and so their behavior is always, in
principle at least, predictable and therefore cannot
be original (p. 23). Marsden does not question the
validity of the MOtT, suggesting that it might operate in both practice and in principle. Further,
Marsden suggests that deterministic systems are
incapable of originality. Yet humans, while lacking
proof of free will, may be both deterministic and
creative. Marsden's principled MOtT, in this context, is better seen as an LT, as a test of creativity. The standard of creativity set forth by Bringsjord, Bello, and Ferrucci (2001), however, is more successful than Marsden's problematic criterion of non-deterministic originality.
Curtis Roads (1984), in a section on the Turing
Test for Musical Intelligence, suggests using an
MDtT to measure the effectiveness of software-based music representations (p. 33). Although
conceding that there is no universal criterion for
determining an optimal music representation, he
offers criteria to determine system value such as
the usefulness in practice, the limits of structures
available for representation, and what kinds of musical tasks are easy to perform with it, and which
are difficult.
Roads, referencing a personal correspondence
on the subject of musical TTs, suggests that a
similar test could offer a validation of a computer-generated music theory and the representations
behind it (Roads 1984, p. 33). He imagines a test
in which we can ask the machine questions and
have it perform tasks that a human could [perform]
after hearing a piece, e.g., sing the melody, analyze
the form, trace its influences, compose something
in the same style, etc. If the system succeeds,
then the representation is clearly effective. The
last task, composing something in the same style,
approximates an MDtT.
Gareth Loy, considering the application of connectionism to generative music systems, speculates
that a musical Turing test might be easier for a
computer to pass someday, since music is a very
abstract artistic medium (Loy 1991, p. 370). Presumably imagining an MOtT, Loy notes that passing
such a test would prove no more than that a
reasonable facsimile of human musical functioning
could be constructed using computational means
(p. 370). This is a fair assessment. The question for
Loy is whether human musical cognition can be
represented computationally at all (p. 370). Loy,
like Marsden, does not directly question the validity
of the MOtT, but instead questions if musical cognition can be represented in a computational machine.
Loy recognizes that an MOtT can be passed by a
system without musical thought or cognition.
David Soldier, defining naughtmusik as the
set of nonart sounds and genuine music as
excluding naughtmusik and including Artists
with Intent, proposes a modified MOtT (Soldier
2002, p. 53). Here, naughtmusik might represent
machine output. Soldier describes a test where
one could play recordings of genuine music and
naughtmusik . . . if the human judges detect the
fakery [of naughtmusik], the strong definition of
genuine music can be confidently adopted. This
definition of genuine music requires that humans
can distinguish nonart sounds from sounds
produced by artists with intent. Testing music
made by untrained children versus professional
musicians, Soldier reports that results favoring genuine music produced by professionals were too low to reach a statistically significant level
(p. 54). In other words, participants in a listener
survey could not distinguish genuine from fake
music. Soldier goes on to argue, using examples
of music produced by hypothetical zombies and
the Thai Elephant Orchestra, that a definition of
genuine music cannot require that the musical
agents possess musical knowledge or intent (p. 55).
Soldier illustrates that the presence of aesthetic
intention or musical thought, either in elephants or
in something like a computer, cannot be identified
in DTs or MOtTs and cannot be used to distinguish
music from nonart sounds.
In a letter to the Computer Music Journal,
Erik Belgum describes an unaltered borrowing
of the original Turing test to investigate musical
intelligence (Belgum et al. 1988, p. 7) where a
musician, in a type of MDtT, alternately jams
with a human and machine. While stating that
all the same doubts can be expressed about
the musical Turing test as have been expressed
about the original Turing test, Belgum does not
question the legitimacy of his modified test, instead
suggesting that this MDtT seems to keep the basic
spirit of the test. As shown previously, the TT
employs natural language discourse to represent the
presence of thought; its spirit is not preserved in
either the MOtT or the MDtT. Written responses
to this letter, provided by Joel Chadabe, Emile
Tobenfeld, and Laurie Spiegel, do not question the
fundamental legitimacy of the proposed MDtT, but
they instead offer alternative definitions of musical
intelligence or different types of tests. Laurie Spiegel,
summarizing the inadequacy of aesthetic tests in
general, asks, what purpose would be satisfied by
creating qualitative criteria or quantitative metrics
for artificial musical intelligence, given the lack
of successful similar criteria for natural musical
intelligence, musicality, or even music per se?
(Belgum et al. 1988, p. 9). The lack of these criteria
and metrics is an important constraint on many
musical judgments.
Some have moved beyond thought experiments
to actual tests. In 2002, Sam Verhaert, as part
of a radio broadcast, organized a Music Turing
Test at the Sony Computer Science Laboratory in Paris (2002). The goal of this MOtT was to answer
the question, [C]an we make the distinction
between music played by a human and music
played by a machine? Two interrogators (Henkjan
Honing and Koen Schouten) were presented with
musical fragments, some performed by Albert van
Veenendaal, others performed by the Continuator
software system developed by François Pachet
(2002). Although test data is not provided, the
author claims the result was largely in favor of
the Continuator. Receiving more positive musical
judgments than the human, the software system is
deemed successful.
Hiraga et al. (2002) have proposed and executed
a series of music performance rendering tests in
which human performances of a fixed work are
compared with computer-rendered performances of
the same work. Executed as part of a project called
RENCON, these tests date from 2002 and have
continued at various workshops and conferences
since (Hiraga et al. 2004, p. 121). Whereas the
authors at times properly describe these tests as
listening comparisons, they go on to describe these
as a Turing test for musical performance and
competitions given in Turing Test style (p. 123).
The authors claim that this test determines by
listening whether system-rendered performance is
distinguishable from human performance (p. 123);
in addition to selecting their preferred performance,
participants are asked to rate performances by
humanlikeness (p. 123). As machine-rendered
performances have been selected over human-rendered performances, the authors state that more
than a few people agree that some performance
rendering systems generate music that rivals human
performances (p. 123). While listening comparisons
or DTs may offer a valuable method of evaluating
performance rendering, the association with Turing
is incorrect and unnecessary.
Music generated with David Copes EMI system
(1991, 1992, 1996, 2001) has been the subject of many
presentations of MOtTs. Only in passing does Cope
associate these tests with Turing. In Cope (1996), he
describes conducting comparison tests to gauge the
significance of compositional style signatures in
evaluating style membership. In his first test with
students, he presents phrases of Mozart, phrases of machine-composed music in the style of Mozart with signatures, and machine-composed music in
the style of Mozart without signatures. Although
noting that this study falls outside the framework
of scientifically valid research (1996, p. 82), he
suggests that his results show that the use of
signatures contributes to style recognition.
Cope (1996) removes signatures as a variable in
these tests; in this case, he compares the music
of Mozart with machine-composed music in the
style of Mozart. As part of a section of the 1992
Association for the Advancement of Artificial
Intelligence (AAAI) conference entitled Artificial
Intelligence and the Arts, a larger MOtT was
conducted. Cope reports that nearly 2000 visitors,
over three days, took part in a test that pitted
machine-composed examples with signatures in
the style of Mozart against actual Mozart (p. 82).
While again Cope states the test has absolutely
no scientific value, the results are summarized by
Cope as indicating that the audience was unable
to distinguish between machine-composed Mozart
and the real thing, that the machine-composed
music has some stylistic validity, and for the
layperson at least, real Mozart is hard to distinguish
from artificial Mozart (p. 82). While neither Cope
nor the conference publications call the 1992 AAAI
MOtT a TT, Cope (2000, p. 65) claims that Alice,
a system closely related to EMI, may succeed in
occasionally passing the spirit if not the letter of
Turing's test (2000, p. 65). As argued previously,
the spirit of the TT is not maintained in MOtTs.
Cope (2001) titles a similar test The Game, presenting the reader notated and recorded musical examples of computer-generated and human-composed
music. Here, he refers to computer-generated works
as virtual music. Cope notes that mixing weak
human-composed music with strong virtual music would simply fool listeners. His objective is
to determine whether listeners can truly tell the
difference between the two types of music (p.
20). If players can only distinguish human from
machine about 50% of the time, it is assumed that
the music examples are indistinguishable. Cope
relates The Game to the 1992 AAAI MOtT,
stating that results from previous tests with large
groups of listeners, such as 5000 [sic] in one test in 1992 . . . typically average between 40 and 60 percent
correct responses (Cope 2001, p. 21).
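
Whether a near-50-percent rate of correct identification really indicates indistinguishability depends on how many responses were collected. The sketch below is purely illustrative and is not part of Cope's procedure; the respondent counts are hypothetical. It uses an exact binomial calculation to ask how likely a given number of correct answers would be if every listener were simply guessing in a two-way choice.

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of k or more correct answers out of n under pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical two-choice listener surveys, each with 60 percent correct identifications.
print(p_at_least(60, 100))  # about 0.028: unlikely under pure guessing with 100 listeners
print(p_at_least(12, 20))   # about 0.25: consistent with guessing when only 20 listeners respond
```

On this reading, a 40-to-60-percent band is evidence of indistinguishability only when the survey is large enough for a departure from chance to have been detectable.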
Others have directly called these same EMI
demonstrations TTs. At a later AAAI Conference,
in 1998, the program chairs describe a musical
Turing test between the works of Bach and compositions by an AI program conducted by Cope
(Mostow and Rich 1998). Similarly, Bernard Greenberg characterizes fooling the audience with
EMI as a vernacular paraphrase of the Turing
test, the classic admissions exam for all putative
AI endeavors (Greenberg 2001, p. 222). Kurzweil
describes a musical Turing test administered by
Steve Larson where audience members, sampling
three compositions, attempt to distinguish EMI-generated music in the style of Bach from real Bach
(Kurzweil 1999, p. 160). Lastly, Jonathan Bedworth
and James Norwood, while noting that artists are
not immune to technology's implicit, and possibly inappropriate, categorisations and metaphors,
describe EMIs output as one of many examples
that show continued adherence to Turing Test-like
evaluations of machine capability (Bedworth and
Norwood 1999, p. 193). Bedworth and Norwood's
incorrect assumption, that evaluation of EMI output
relates to the TT, is exactly the kind of inappropriate categorization and metaphor they warn
against.
Although anecdotal, Patricio da Silva, in a study
of Cope's music, describes another MOtT. An
appendix, consisting of emails from a music theory
discussion list authored in 2002, discusses the
possibility of a musical TT. Contributors suggest
that the output of Copes EMI system could pass
a musical Turing test. A contributor describes
a presentation in which a MOtT was informally
conducted between performances of a Chopin
prelude and an EMI-composed prelude in the style of
Chopin (Silva 2003, p. 66). The audience, reportedly,
was unable to identify the original Chopin. A
contributor to this list notes that while the medium
of the test (sheet music) deviates from that of the
original TT, the MOtT preserves the spirit of the
Turing test (Silva 2003, p. 68). This statement, like
that of Belgum et al. (1988) and Cope (2000, p. 65),
overstates the superficial similarity of the MOtT
and the TT.

Electronic Telegraph and the BBC1 Tomorrow's World program sponsored a series of massive TTs,
similar to the Loebner competition but employing
thousands of interrogators. These tests were conducted as part of Megalab, a series of experiments in
Britain dating from 1993 and designed to provide
science week with a media profile by employing
experiments that could inspire large numbers of
people to take part, were easy to carry out, and
would help to push back the frontiers of science
(Anonymous 1998). The 1998 and 1999 Megalab
featured Loebner competition style TTs. During the
2000 successor, Live Lab (Anonymous 2000), the
classic Turing test . . . had been supplemented by
two new tests which required participants to distinguish between examples of human and computer
generated painting and music (Bloomfield and
Vurdubakis 2003, p. 31). The computer-generated art
was provided by Harold Cohens generative illustration system AARON; the computer-generated music
was provided by EMI (Bloomfield and Vurdubakis
2003). In the case of the computer-generated art,
Bloomfield and Vurdubakis state that the resulting
success was easily explained by the fact that it was
the only representational picture among abstract
human ones (p. 35); this explanation suggests that
expectations of machine aptitude influence the
results of such tests. Bloomfield and Vurdubakis
report that the computer-generated music decisively won in its category with 42% of the vote
(p. 33).
Musical TTs, in various other arrangements, have
been and will likely continue to be suggested in
academic and popular literature (e.g. Fowler 1994, p. 9; Triviño-Rodríguez and Morales-Bueno 2001, p. 73; Wassermann et al. 2003, p. 89; Midgette 2005; Lamb 2006). As Dennett states, cheapened
versions of the Turing test are everywhere in the air
(Dennett 1998, p. 20). It is clear that the execution
of such tests, under the heading of Turing, is
bound by many faults. Furthermore, such tests
are not serious investigations into the design of
the generative system, the aesthetic value of the
output, or the presence of intelligence, musical or
otherwise. In the tradition of the JHT, and as shown
by Live Lab, musical TTs may function best as
entertainment.

Machine Authorship and the Problem of Aesthetic Intention
In the context of musical TTs, an objection can be
raised that the computer-generated material is not
generated in whole by the computer. The computer
can be seen not as an autonomous author but as
a system that executes or reconfigures knowledge
imparted to it by its programmers. This is part of
the problem of machine creativity suggested by the
LT (Bringsjord, Bello, and Ferrucci 2001).
Fundamental questions of authorship are important when comparing the aesthetic output of
machines and humans. For the MOtT or MDtT
to actually test the machine system, the musical
outputs provided by the agents must be created by
the agents themselves. Just as the human agent presumably cannot plagiarize an output, the machine
agent cannot simply return a stored work previously
created by a human. Such a test would have little
value. Although Turing specifically condoned deception in the TT, such deception is problematic in the
context of testing aesthetic artifacts. This problem
further removes the MOtT and the MDtT from the
TT.
Pure machine authorship is impossible to imagine
without an autonomy sufficient to pass the LT.
In a manner similar to that of Wolfgang von
Kempelen's famous chess-playing machine (often
called The Turk), a purported automaton that
convinced countless observers in the eighteenth and
nineteenth century of the possibility of machine
autonomy (Sussman 1999; Standage 2002), there
may always be a human, or at least significant
human knowledge, hiding inside the creative
machine. Halpern supports this view, noting that
machine intelligence is really in the past: when a
machine does something intelligent, it is because
some extraordinarily brilliant person or persons,
sometime in the past, found a way to preserve some
fragment of intelligent action in the form of an
artifact (Halpern 2006, p. 54). Such a perspective is
applicable to many less intelligent but musically
useful generative music systems.
Cope, for example, states that when using his
Alice system he sees no reason to even assign
partial credit to the program: Alice processes a composer's database in specific and known ways,
acting only as a specialized calculator and assistant
(Cope 2000, p. 252). Describing the EMI system,
Cope notes that the hand of the composer is not
absent from the finished product of computer-assisted composition (Cope 1991, p. 236), and
that all works produced with EMI are attributed
to David Cope with Experiments in Musical
Intelligence (Cope 2001, p. 340). The MOtTs of
EMI output described herein, therefore, tested not
EMI, but Cope with EMI.
A case might be imagined where a machine is
somehow completely responsible for an aesthetic artifact. Such machine authorship would presumably
require what Cohen describes as autonomy. Describing his generative illustration system AARON,
Cohen (2002) suggests that, with such autonomy, the
system, not him, would be the author: if AARON
ever does achieve the kind of autonomy I want it
to have, it will go on to eternity producing original
AARONs, not original Harold Cohens (2002, p. 64).
Cohen's views are similar to those of Hofstadter,
described previously. The idea of machines with
autonomy, intentions, or initiative is sometimes
associated with more exotic things such as Putnam-Gold Machines (Kugel 2002, p. 565) or Turing's Oracles (Turing 1939). Bringsjord, Bello, and Ferrucci,
however, argue that Oracles cannot pass the LT, and
thus do not offer autonomy (Bringsjord, Bello, and
Ferrucci 2001, pp. 20-24). As neither AARON nor
any known contemporary system has reached such
a level of autonomy, generative works will likely
continue to be seen as human works. If the role
of the system exceeds that of a conventional tool,
these works might be seen more as human-machine
collaborations; collaboration, as used here, does not
require machine autonomy. Contemporary MDtTs
and MOtTs are not machine-versus-human competitions: they are competitions among humans using
different tools and collaborators.
Machines, although presently lacking autonomy
and intention, can produce output that appears
intentional. As described previously, Soldier (2002)
argues that aesthetic intention (Carroll 1999, p. 163)
is not a criterion of creating, and thus authoring,
art. Soldier demonstrates that artists without intent
can create works that sound intentional; similarly, others have noted the limits of evaluating a work in the context of the author's intentions. Wimsatt and
Beardsley (1946) call this the intentional fallacy:
the design or intention of the author is neither
available nor desirable as a standard for judging the
success of a work (Wimsatt 1954, p. 3). Similarly,
Robert Zimmerman dismisses aesthetic intention as
a criterion of art due to the necessity of psychological
introspection: the decision whether an entity is an
aesthetic object would depend upon the results of
a lie-detector test given to creator and audience
(Zimmerman 1966, p. 182).
Some have, nonetheless, assumed that intention
distinguishes human art-making. Ian Cross, with
a model loosely related to the CRA, imagines two
computers in a box creating and apprehending
music. Cross states that what is happening within
the box has no element of human participation,
intervention or experience (Cross 1993, p. 167)
and thus is not music. Cross ignores that humans
participated in this hypothetical event by creating
the box and the systems within the box. This relates
to an objection to Searles CRA: the performance
of the man who understands no Chinese is only as
good as those who understood Chinese well enough
to create the lexicon in the first place, and thus
create the illusion of comprehension in the Chinese
Room (Halpern 2006, p. 54).
Based on his model, Cross claims that intention
and consensus are necessary preconditions for the
human experience of music (Cross 1993, p. 170).
For music to happen, there must be intent – intent
to produce music, or (less obviously) intent to hear
music (p. 167). The requirement of intentional
hearing is discussed subsequently; the requirement
of intentional production, however, can only be
argued by employing a highly constrained definition
of music, by committing the intentional fallacy, or
by ignoring demonstrations such as those provided
by Soldier (2002).
Roger Scruton states that music is the intentional object of an experience that only rational
beings can have, and only through the exercise of
imagination (Scruton 1999, p. 96). It is important to
note that the intention described here is in the experience of music, in imaginative hearing. This relates
to what Cross describes as the intent to hear music

(Cross 1993, p. 167). Perhaps something like a musical Eliza Effect enables humans to treat aesthetic
artifacts from machines as intentional objects; perhaps, in contrast to the CRA, musical syntax alone
can in some cases suffice for musical semantics.

Language Is the Medium of the Turing Test


The medium of the TT cannot be altered. If researchers desire evidence that a generative music
system is intelligent or capable of thought (musical or otherwise), the system must simply pass
the TT. The interrogator may ask human and machine agents, through a text-based medium, about
original pieces of music created by each agent.
Problems of originality, creativity, intention, and
authorship are irrelevant, and deception is again
permitted. Through conversation, through natural
language – if the interrogator cannot distinguish the machine – the machine has passed. A true musical
TT is simply a TT.
In the context of a comparison test, natural
language discourse is superior to an abstract aesthetic medium. An interrogator can ask questions
about a piece of prose: a statement can be rephrased,
explained, or put into other words. Even attempts at
deception can be disputed or argued. An interrogator cannot ask the same questions of an aesthetic
work; although an artist, if available, might offer
context or explanation, such explanation is not
required for aesthetic appreciation. The piece exists
independently, and generally cannot be rephrased or
reformulated to aid understanding. The interrogator
can only be a critic of such a medium.
Turing demonstrates using language, within a TT,
to discuss other mediums. Turing does not limit the
topics of conversation in his test, stating that anything can be claimed, but practical demonstration
cannot be demanded (Turing 1950, p. 435). Musical
TTs clearly violate this prohibition of practical
demonstration. Turing, however, provides examples
of an interrogator asking for practical demonstrations. As a sample question and response, Turing
offers the following: I have K at my K1, and no
other pieces. You have only K at K6 and R at R1.
It is your move. What do you play? (Turing 1950, p. 435). While the computer answers the end-game
with a mate, the computer could just as well discuss
the history of the game, or state that it does not play
chess. When asked to write a sonnet, the same agent
declines: Count me out on this one. I never could
write poetry (p. 434).
The computer agent, asked to play chess, could
alternatively be mischievous and play a non-winning move. As Halpern states, the Turing
end-game example introduces an assumption that
cannot automatically be allowed: namely, that the
computer plays to win (Halpern 2006, p. 46). The
MOtT and MDtT imply that aesthetic success, a
win, indicates system design success. However,
as made clear herein, Turings model does not
require the computer to play to win: self sabotage
or simple mischief is acceptable if explained in
rational discourse. MOtTs and MDtTs, if related
to the TT, should allow for new aesthetic concepts
and non-winning aesthetic moves: as Boden states,
even if a computer's notion of art is irrelevant
to us humans, these notions might broaden our
aesthetic horizons (Boden 1996).

Conclusion
As Dennett states of restricted text-based TTs, we
should resist all limitations and waterings-down of
the Turing test . . . they make the game too easy . . .
they lead us into the risk of overestimating the
actual comprehension of the system being tested
(Dennett 1998, p. 11). The MDtT and MOtT are
too easy. Music, as a medium remote from natural
language, is a poor vessel for Turing's Imitation
Game. Generative music systems gain nothing
from associating their output with the TT; worse,
overestimation may devalue the real creativity in
the design and interface of these systems.
Iannis Xenakis, considering the history of
computer-aided algorithmic composition systems,
asked: What is the musical quality of these attempts? He answers bluntly: The results from
the point of view of aesthetics are meager . . . hope
of an extraordinary aesthetic success based on
extraordinary technology is a cruel deceit (Xenakis
1985, p. 175). Xenakis here distinguishes the system (the technology) from its aesthetic artifacts. This
distinction suggests that aesthetic success or failure
is dependent on humans and independent of any
technology. Until machines achieve autonomy, it
is likely that humans will continue to form, shape,
and manipulate machine output to satisfy their
own aesthetic demands, taking personal, human
responsibility for machine output.
Simon Holland, after Peña and Parshall (1987)
and Cook (1994), describes open-ended domains
such as music composition as problem seeking
rather than problem solving: there are in general no clear goals, no criteria for testing correct
answers, and no comprehensive set of well-defined
methods (Holland 2000, p. 240). If used as creative
tools, generative music systems, as systems within
problem-seeking domains, likewise have no criteria
for testing correct answers. In the development and
presentation of these systems, comparative analysis
of system and interface design, or studies of user
interaction and experiences, offer greater potential
for the development of practical tools.
Computational models with clearly articulated
goals may continue to pass DTs; properly constrained, such tests may show that musical judgments cannot discern sets of musical artifacts
produced by humans and machines. While this may
demonstrate technological innovation in the modeling of historical musical artifacts, such technologies
may also offer aesthetic innovation if redeployed as
creative systems. In this use-case, the clear goals
and testing criteria evaporate. Within the practical
use-case of creative music-making, any system
becomes a problem-seeking domain.

Acknowledgments
I am grateful for the commentary this article has
received over the many stages of its development.
Thanks to Elizabeth Hoffman and Paul Berg for
discussing some of the initial ideas presented here.
Thanks to Nick Collins for research assistance and
comments on important themes. Thanks to the
anonymous reviewers and the editors for valuable
suggestions.


References
Anonymous. 1998. A Mega Success, Thanks to You. Telegraph 26 March.
Anonymous. 2000. Are You Smarter Than a Robot. Telegraph 9 March.
Ariza, C. 2005. Navigating the Landscape of Computer-Aided Algorithmic Composition Systems: A Definition, Seven Descriptors, and a Lexicon of Systems and Research. Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 765-772.
Aucouturier, J., and F. Pachet. 2003. Representing Musical Genre: A State of the Art. Journal of New Music Research 32(1):83-93.
Bedworth, J., and J. Norwood. 1999. The Turing Test is Dead. Proceedings of the 3rd Conference on Creativity & Cognition. New York: Association for Computing Machinery, pp. 193-194.
Belgum, E., et al. 1988. A Turing Test for Musical Intelligence? Computer Music Journal 12(4):7-9.
Block, N. 1981. Psychologism and Behaviorism. Philosophical Review 90(1):5-43.
Bloomfield, B. P., and T. Vurdubakis. 2003. Imitation Games: Turing, Menard, Van Meegeren. Ethics and Information Technology 5(1):27-38.
Boden, M. 1990. The Creative Mind: Myths and Mechanisms. New York: Routledge.
Boden, M. A. 1996. Artificial Genius. Discover 17:104-107.
Bringsjord, S., P. Bello, and D. Ferrucci. 2001. Creativity, the Turing Test, and the (Better) Lovelace Test. Minds and Machines 11:3-27.
Bringsjord, S., and D. Ferrucci. 2000. Artificial Intelligence and Literary Creativity: Inside the Mind of BRUTUS, a Storytelling Machine. Mahwah, New Jersey: Lawrence Erlbaum.
Bulhak, A. C. 1996. The Dada Engine. Available online at dev.null.org/dadaengine/manual-1.0/dada toc.html.
Carroll, N. 1999. Philosophy of Art: A Contemporary Introduction. New York: Routledge.
Cavell, S. 2002. Must We Mean What We Say?: A Book of Essays. Cambridge, UK: Cambridge University Press.
Cohen, H. 2002. A Self-Defining Game for One Player: On the Nature of Creativity and the Possibility of Creative Computer Programs. Leonardo 35(1):59-64.
Collins, N. 2006. Towards Autonomous Agents for Live Computer Music: Realtime Machine Listening and Interactive Music Systems. PhD dissertation, University of Cambridge.

Collins, N. 2008. Infno: Generating Synth Pop and Electronic Dance Music On Demand. Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association. Available online at www.informatics.sussex.ac.uk/users/nc81/research/infno.pdf.
Cook, J. 1994. Agent Reflection in an Intelligent Learning Environment Architecture for Composition. In M. Smith, A. Smaill and G. A. Wiggins, eds. Music Education: An Artificial Intelligence Approach. London: Springer Verlag, pp. 3-23.
Cope, D. 1991. Computers and Musical Style. Oxford: Oxford University Press.
Cope, D. 1992. Computer Modeling of Musical Intelligence in EMI. Computer Music Journal 16(2):69-83.
Cope, D. 1996. Experiments in Musical Intelligence. Madison, Wisconsin: A-R Editions.
Cope, D. 2000. The Algorithmic Composer. Madison, Wisconsin: A-R Editions.
Cope, D. 2001. Virtual Music: Computer Synthesis of Musical Style. Cambridge, Massachusetts: MIT Press.
Cope, D. 2005. Computer Models of Musical Creativity. Cambridge, Massachusetts: MIT Press.
Copeland, B. J. 2000. The Turing Test. Minds and Machines 10:519-539.
Crockett, L. J. 1994. The Turing Test and the Frame Problem: AI's Mistaken Understanding of Intelligence. Bristol, UK: Intellect.
Cross, I. 1993. The Chinese Music Box. Interface 22:165-172.
Dahlig, E., and H. Schaffrath. 1997. Judgments of Human and Machine Authorship in Real and Artificial Folksongs. Computing in Musicology 11:211-219.
Damper, R. I. 2006. The Logic of Searle's Chinese Room Argument. Minds and Machines 16(2):163-183.
Dennett, D. C. 1998. Brainchildren: Essays on Designing Minds. Cambridge, Massachusetts: MIT Press.
Fowler, J. W. 1994. Algorithmic Composition. Computer Music Journal 18(3):8-9.
French, R. 1990. Subcognition and the Limits of the Turing Test. Mind 99(393):53-65.
French, R. M. 2000. The Turing Test: The First 50 Years. Trends in Cognitive Sciences 4(3):115-122.
Gardner, M. 1974. Mathematical Games: The Arts as Combinatorial Mathematics, or, How to Compose Like Mozart with Dice. Scientific American 231(6):132-136.
Genova, J. 1994. Turing's Sexual Guessing Game. Social Epistemology 8:313-326.
Greenberg, B. 2001. Experiments in Musical Intelligence and Bach. In D. Cope, ed. Virtual Music: Computer Synthesis of Musical Style. Cambridge, Massachusetts: MIT Press, pp. 221-236.
Hall, M., and L. Smith. 1996. A Computer Model of Blues Music and Its Evaluation. Journal of the Acoustical Society of America 100(2):1163-1167.
Halpern, M. 2006. The Trouble with the Turing Test. The New Atlantis 11:42-63.
Harnad, S. 1991. Other Bodies, Other Minds: A Machine Incarnation of an Old Philosophical Problem. Minds and Machines 1:43-54.
Harnad, S. 2000. Minds, Machines and Turing. Journal of Logic, Language and Information 9(4):425-445.
Hauser, L. 1993. Searle's Chinese Box: The Chinese Room Argument and Artificial Intelligence. PhD dissertation, Michigan State University.
Hauser, L. 1997. Searle's Chinese Box: Debunking the Chinese Room Argument. Minds and Machines 7:199-226.
Hauser, L. 2001. Look Who's Moving the Goal Posts Now. Minds and Machines 11:41-51.
Havass, M. 1964. A Simulation of Music Composition. Synthetically Composed Folkmusic. In F. Kiefer, ed. Computational Linguistics. Budapest: Computing Centre of the Hungarian Academy of Sciences 3:107-128.
Hedges, S. A. 1978. Dice Music in the Eighteenth Century. Music and Letters 59(2):180-187.
Hiller, L. 1970. Music Composed with Computers: An Historical Survey. In H. B. Lincoln, ed. The Computer and Music. Ithaca, New York: Cornell University Press, pp. 42-96.
Hiraga, R., et al. 2002. Rencon: Toward a New Evaluation Method for Performance Rendering Systems. Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 357-360.
Hiraga, R., et al. 2004. Rencon 2004: Turing Test for Musical Expression. Proceedings of the 2004 Conference on New Interfaces for Musical Expression. New York: Association for Computing Machinery, pp. 120-123.
Hofstadter, D. R. 1979. Gödel, Escher, Bach: An Eternal Golden Braid. New York: Vintage.
Hofstadter, D. R. 1996. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. New York: Basic Books.
Hofstadter, D. R. 2001. Staring Emmy Straight in the Eye – and Doing My Best Not to Flinch. In D. Cope, ed. Virtual Music: Computer Synthesis of Musical Style. Cambridge, Massachusetts: MIT Press, pp. 33-82.

Holland, S. 2000. Artificial Intelligence in Music Education: A Critical Review. In E. R. Miranda, ed. Readings in Music and Artificial Intelligence. Amsterdam: Harwood Academic Publishers, pp. 239-274.
Hsu, F. 2002. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton, New Jersey: Princeton University Press.
Jefferson, G. 1949. The Mind of Mechanical Man. British Medical Journal 1:1105-1110.
Kant, I. 1790. Kritik der Urteilskraft [Critique of Judgment]. Berlin: Lagarde and Friederich.
Kugel, P. 2002. Computers Can't Be Intelligent (. . . and Turing Said So). Minds and Machines 12(4):563-579.
Kurzweil, R. 1990. The Age of Intelligent Machines. Cambridge, Massachusetts: MIT Press.
Kurzweil, R. 1999. The Age of Spiritual Machines. New York: Penguin Books.
Kurzweil, R. 2002. A Wager on the Turing Test: Why I Think I Will Win. Available online at www.kurzweilai.net/articles/art0374.html?printable=1.
Kurzweil, R. 2005. The Singularity is Near. New York: Penguin Books.
Lamb, G. M. 2006. Robo-Music Gives Musicians the Jitters. The Christian Science Monitor, December 14.
Loebner, H. 1994. In Response. Communications of the ACM 37(6):79-82.
Long Bets Foundation. 2002. By 2029 No Computer – or Machine Intelligence – Will Have Passed the Turing Test. Available online at www.longbets.org/1.
Lovelace, A. 1842. Translator's Notes to an Article on Babbage's Analytical Engine. In R. Taylor, ed. Scientific Memoirs: Selected from the Transactions of Foreign Academies of Science and Learned Societies, and from Foreign Journals. London: printed by Richard and John E. Taylor, 3:691-731.
Loy, D. G. 1991. Connectionism and Musiconomy. Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 364-374.
Marsden, A. 2000. Music, Intelligence and Artificiality. In E. R. Miranda, ed. Readings in Music and Artificial Intelligence. Amsterdam: Harwood Academic Publishers, pp. 15-28.
Midgette, A. 2005. Play It Again, Vladimir (via Computer). New York Times, 5 June.
Moor, J. H. 1976. An Analysis of the Turing Test. Philosophical Studies 30:249-257.
Moor, J. H. 2001. The Status and Future of the Turing Test. Minds and Machines 11:77-93.

Mostow, J., and C. Rich. 1998. The Fifteenth National Conference on Artificial Intelligence. Available online at www.aaai.org/Conferences/AAAI/aaai98.php.
Naor, M. 1996. Verification of a Human in the Loop or Identification via the Turing Test. Unpublished manuscript.
Nelson, S. R. 2006. Steel Drivin' Man: John Henry: The Untold Story of an American Legend. Oxford: Oxford University Press.
Pachet, F. 2002. The Continuator: Musical Interaction with Style. Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 211-218.
Pachet, F., and D. Cazaly. 2000. A Taxonomy of Musical Genres. Actes du congrès RIAO [Recherche d'Information Assistée par Ordinateur] 2000, Content-Based Multimedia Information Access. Paris: Centre des Hautes Études Internationales d'Informatique Documentaire, pp. 1238-1246.
Papadopoulos, G., and G. Wiggins. 1999. AI Methods for Algorithmic Composition: A Survey, a Critical View and Future Prospects. In Proceedings of the AISB'99 Symposium on Musical Creativity. Brighton, UK: SSAISB, pp. 110-117.
Pearce, M. T. 2005. The Construction and Evaluation of Statistical Models of Melodic Structure in Music Perception and Composition. PhD dissertation, Department of Computing, City University London.
Pearce, M., D. Meredith, and G. Wiggins. 2002. Motivations and Methodologies for Automation of the Compositional Process. Musicae Scientiae 6(2):119-147.
Pearce, M., and G. Wiggins. 2001. Towards a Framework for the Evaluation of Machine Compositions. Proceedings of the AISB'01 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences. Brighton, UK: SSAISB, pp. 22-32.
Peña, W. M., and S. A. Parshall. 1987. Problem Seeking: An Architectural Programming Primer, 3rd ed. Washington, DC: AIA Press.
Rapaport, W. J. 2000. How to Pass a Turing Test. Journal of Logic, Language, and Information 9:467-490.
Riskin, J. 2003. The Defecating Duck, or, the Ambiguous Origins of Artificial Life. Critical Inquiry 29(4):599-633.
Roads, C. 1984. An Overview of Music Representations. Musical Grammars and Computer Analysis. Firenze: Leo S. Olschki, pp. 7-37.
Saygin, A. P., I. Cicekli, and V. Akman. 2000. Turing Test: 50 Years Later. Minds and Machines 10(4):463-518.

Scruton, R. 1999. The Aesthetics of Music. Oxford: Oxford University Press.
Searle, J. R. 1980. Minds, Brains, and Programs. Behavioral and Brain Sciences 3(3):417-457.
Searle, J. R. 1993. The Failures of Computationalism. Think 2:12-78.
Shieber, S. M. 1993. Lessons from a Restricted Turing Test. Communications of the ACM 37(6):70-78.
Silva, P. 2003. David Cope and Experiments in Musical Intelligence. Los Angeles, California: Spectrum Press.
Soldier, D. 2002. Eine Kleine Naughtmusik: How Nefarious Nonartists Cleverly Imitate Music. Leonardo Music Journal 12:53-58.
Solis, J., et al. 2006. The Waseda Flutist Robot WF-4RII in Comparison with a Professional Flutist. Computer Music Journal 30(4):12-27.
Soltau, H., et al. 1998. Recognition of Music Types. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2:1137-1140.
Sony Computer Science Laboratory. 2002. A Musical Turing Test. Sony Computer Science Laboratory. Available online at www.csl.sony.fr/pachet/Continuator/VPRO/VPRO.htm.
Standage, T. 2002. The Turk. New York: Walker and Company.
Sterrett, S. G. 2000. Turing's Two Tests for Intelligence. Minds and Machines 10:541-559.
Sussman, M. 1999. Performing the Intelligent Machine: Deception and Enchantment in the Life of the Automaton Chess Player. The Drama Review 43(3):81-96.
Terzidis, K. 2006. Algorithmic Architecture. Oxford: Architectural Press.
Triviño-Rodríguez, J. L., and R. Morales-Bueno. 2001. Using Multiattribute Prediction Suffix Graphs to Predict and Generate Music. Computer Music Journal 25(3):62-79.
Turing, A. M. 1939. Systems of Logic Defined by Ordinals. Proceedings of the London Mathematical Society 45:161-228.
Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59:433-460.
von Ahn, L., et al. 2003. CAPTCHA: Using Hard AI Problems for Security. Advances in Cryptology – Eurocrypt 2003. Santa Barbara, California: International Association for Cryptologic Research, pp. 294-311.
von Ahn, L., M. Blum, and J. Langford. 2004. Telling Humans and Computers Apart Automatically: How Lazy Cryptographers do AI. Communications of the ACM 47(2):56-60.

Wakefield, J. C. 2003. The Chinese Room Argument Reconsidered: Essentialism, Indeterminacy, and Strong AI. Minds and Machines 13:285-319.
Wassermann, K. C., et al. 2003. Live Soundscape Composition Based on Synthetic Emotions. IEEE MultiMedia 10(4):82-90.
Weinberg, G., and S. Driscoll. 2006. Toward Robotic Musicianship. Computer Music Journal 30(4):28-45.
Weizenbaum, J. 1966. ELIZA – A Computer Program For the Study of Natural Language Communication Between Man And Machine. Communications of the ACM 9(1):36-45.
Wiggins, G. A. 2006. A Preliminary Framework for Description, Analysis and Comparison of Creative Systems. Knowledge-Based Systems 19:449-458.
Wiggins, G., and A. Smaill. 2000. Musical Knowledge: What Can Artificial Intelligence Bring to the Musician. In E. R. Miranda, ed. Readings in Music and Artificial Intelligence. Amsterdam: Harwood Academic Publishers, pp. 29-46.
Wimsatt, W. K. 1954. The Verbal Icon: Studies in the Meaning of Poetry. Louisville, Kentucky: University Press of Kentucky.
Wimsatt, W. K., and M. C. Beardsley. 1946. The Intentional Fallacy. Sewanee Review 54:468-488.
Xenakis, I. 1985. Music Composition Treks. In C. Roads, ed. Composers and the Computer. Los Altos, California: William Kaufmann.
Zdenek, S. 2001. Passing Loebner's Turing Test: A Case of Conflicting Discourse Functions. Minds and Machines 11:53-76.
Zimmerman, R. L. 1966. Can Anything Be an Aesthetic Object. The Journal of Aesthetics and Art Criticism 25(2):177-186.
Zorn, J. 2000. Preface. In J. Zorn, ed. Arcana: Musicians on Music. New York: Granary, pp. v-vi.
