
COGNITIVE SCIENCE 1, 265-295 (1977)

A Simulation of Visual Imagery*

STEPHEN M. KOSSLYN
AND
STEVEN P. SHWARTZ
The Johns Hopkins University

This paper describes an operational computer simulation of visual mental imagery in humans. The structure of the simulation was motivated by results of experiments on how people represent information in, and access information from, visual images. The simulation includes a "surface representation," which is spatial and quasi-pictorial, and an underlying "deep representation," which contains "perceptual" information encoding appearance plus "propositional" information describing facts about an object. The simulation embodies a theory of how surface images are generated from deep representations, and how surface images are processed when one accesses information embedded in them. The simulation also offers an account of various sorts of imagery transformations.

Our simulation is a model of how people represent information in, and later retrieve information from, visual mental images. That is, when retrieving some fact, people sometimes report generating an image and inspecting it; for example, if asked what shape a beagle's ears are, one may recall this information by mentally picturing the dog's head, and "looking" at the ears. We wanted to understand how such processes operate, and felt that a simulation approach was useful for a number of reasons: First, it allows explicit modeling of a system of complex processes. Second, it provides a "sufficiency proof," a demonstration that one's model is in fact adequate to account for some range of data, if the program runs as claimed. Third, we felt that by attempting to motivate construction of the model with actual data about human cognition, we would be confronted by important empirical issues that may otherwise have been overlooked; we also felt that such an approach would lead us to collect a set of data that would build, that would accumulate to form a "big picture" (cf. Newell, 1973; see also Kosslyn, in press).

*The present work was supported by NSF Grant BNS 76-16987 awarded to the first author. Requests for reprints and/or copies of the running program should be addressed to Stephen M. Kosslyn, 1236 William James Hall, Harvard University, 33 Kirkland St., Cambridge, MA 02138. The present description of the program was written in January, 1977; it is likely to become obsolete as our work progresses. Our simulation also serves as a question-answerer, where information may be looked up directly in, or deduced from, propositional files (when possible); this paper will not discuss this aspect of the simulation (the deficiencies of which are glaring at present).
Plan of the Paper. The paper is divided into three major sections. The first is a description of the kinds of data that motivated the basic choices made in constructing our model; in this section two major topics are considered, the significance of the spatial/pictorial properties of experienced images and the origins of such images. The second section is a description of the model itself; we will first consider the basic data structures of the model, and then the basic processes used in constructing, inspecting, and transforming images. In this section we will illustrate the operation of our simulation by tracing through a series of examples. Finally, the third, concluding section is a discussion of the special characteristics of our model and the usefulness of the model as a guide for empirical research.

1. MOTIVATING THE MODEL: BASIC FOUNDATIONS

1.1 The Functional Significance of the Experienced Image


How to start? The first step seemed crucial; if we got off on the wrong foot here, we might never be able to recover. We tried to consider what single issue was most important in dictating the basic architecture of a model, its basic structure. In our case, the answer seemed clear: The most important single consideration was whether or not the experienced, quasi-pictorial image is in fact functional. That is, images, as we experience them, could operate as surrogate percepts, being "looked at" by a mind's eye; or, they could simply be epiphenomenal, incidental concomitants of the truly functional processes, which are abstract and not at all percept-like. Depending upon how this issue is resolved, one either builds provisions for a spatial representation that is reprocessed to interface with some semantic "conceptual" store, or one simply utilizes a single system of abstract "propositions" or the like to account for all internal representation (as have Baylor, 1971; Farley, 1974; Moran, 1970; Pylyshyn, 1973; Simon, 1972, and others). This decision was not only deemed too important to be resolved by recourse to simple introspection, but has already provided an example par excellence of the inadequacy of simple introspection as a guiding force in model building (see Kosslyn & Pomerantz, 1977, as opposed to Pylyshyn, 1973).
We performed three kinds of experiments in an effort to demonstrate that the spatial/pictorial qualities of experienced images are in fact functional. If we could provide reasonable support for this claim, by collecting data consistent with it and inconsistent with a relatively straightforward image-as-epiphenomenon view, then we would proceed as if our model should account for our experiences (see Kosslyn, in press, for a more thorough treatment of the data).

Findings on Scanning Images. Experienced images seem to preserve spatial information in the same way that it is represented in our percepts. If this is in fact the case, then information about distance between points of an imaged object ought to be preserved in an image, and this information ought to be retrievable and capable of influencing real-time processing. We performed a number of experiments to investigate this claim. In the simplest experiment, people were asked to image a set of schematic faces one at a time. The faces were identical in all but two respects: The eyes could be either light or dark, and the distance between the mouth and eyes was varied. As soon as a person mentally imaged a given face, he was asked to "focus on the mouth." Following this, the word "light" or "dark" was presented, and the subject was to "glance up at the eyes" and "see" if they were the named color. If so, he depressed one button as quickly as possible; if not, he depressed another. It is interesting to note that response times increased as distance between the mouth and eyes increased, just as one would expect if in fact people scanned the images. In a further experiment, subjects were asked to image these faces either full size, half-size, or so large that only the mouth seemed "visible" in their image. Now, not only did response times increase with increased separation between the eyes and mouth, but more time generally was taken to scan increasingly larger images.
Although it seemed unlikely, we worried that perhaps subjects somehow fathomed the purposes of the experiments (although they denied having done so), and so we conducted a more elaborate experiment. In this experiment, subjects first learned to draw a map of a mythical island that had seven objects on it. The objects were placed such that there were 21 distinct distances separating all possible pairs of objects. After learning to draw the map, the subjects then were asked to image it, and to focus on a named location; each location was named equally often as an initial focus point. Following this, another name was presented; half of the time this was the name of another object. If so, the subject was to scan to that location and push one button; if it did not name an object on the map, he was to push another button. As expected, times increased linearly with actual distance to be scanned. Finally, a control group was tested in the task just described, but with one change in instructions: These people, after focusing on the initial location, were told that they did not have to scan to the second location, but could just respond "true" or "false" as quickly as possible. This group was tested in an effort to discover whether the obtained results were not due to scanning a spatial image, but instead were a consequence of some sort of underlying processing (of a nonobvious sort, since the number of objects falling between two locations was constant, and so too, presumably, the number of links in a graph structure). If so, then we expected distance to have an effect even when the image itself was not processed. It is interesting to note, however, that for this group there were absolutely no effects of the amount of separation between objects: Distance simply had no effect on response times. Thus, it seemed quite unlikely that the image merely was epiphenomenal in this task (see Kosslyn, 1976b; Kosslyn, Ball & Reiser, in preparation).

Findings on Detecting Parts of Images. If images really do act as stand-in percepts, and are able to be reprocessed by some sort of mind's eye (i.e., a set of interpretive procedures that classify spatial patterns in terms of semantic categories), then some of the same constraints that impede interpretation of percepts also ought to affect processing of mental images. In particular, we reasoned that parts of smaller images would be more difficult to "see." And in fact, if subjects are asked to image various animals at different subjective sizes, more time is required to "see" various parts (e.g., a goose's feet) if the animal is imaged relatively small. This was true when subjects manipulated size directly, and when size was manipulated indirectly. In this latter case, people imaged a "target" animal, like a rabbit, next to either an elephant or a fly, appropriately scaled for size. When next to the elephant, most people reported that the rabbit seemed subjectively smaller than when next to the fly. And in fact, more time was required to see a property, like ears, when the target animal was next to the elephant. We worried, however, that these results may have been an artifact of the animals used (e.g., perhaps people are simply more interested in elephants, and thus spend more time constructing good images of them, at the expense of constructing good images of the target animal). Thus, we performed another experiment, where new subjects now were asked to engage in the same task, but to image giant flies and tiny elephants. Now, more time was taken to evaluate a target animal when it was imaged next to the large fly than when it was imaged next to the tiny elephant (see Kosslyn, 1974, 1975).
A critic could argue that the above results do not really indicate that size per se is affecting processing. Instead, perhaps people simply activate more properties in a list when asked to image something large. After all, everyone knows that one can "see" more detail if an object is closer or larger. If so, then the reason it takes more time to evaluate a smaller image is that fewer of that object's properties had been activated prior to query, and thus it is more likely that one will have to engage in time-consuming memory search than when the animal is "imaged" large, and more properties are activated. Most list models of memory also posit that more highly associated (or frequently accessed) properties are higher on the list; this assumption explains why people generally are faster to verify that an object has some highly associated property (e.g., stripes, for a zebra) than a less associated one (legs; see Conrad, 1972; Smith, Shoben, & Rips, 1974). This assumption allowed us to distinguish between the "differential list activation" notion and our own claim, that smaller objects are more difficult to assess because the parts are smaller, and hence "harder to see." If the list notion were correct, then it should require less time to "see" a cat's claws than to detect its head, as claws are more strongly associated and hence more likely to be activated. If our notion is correct, in contrast, the head, because it is larger, should be more quickly detected. And in fact, when subjects were asked to image an animal and attempt to find some named part, larger parts, even though they were less highly associated, were affirmed more quickly than smaller but more highly associated parts. It is interesting to note that when imagery was not required, and subjects were simply asked to respond as quickly as possible, then association strength, not size, determined evaluation time (replicating previous findings). Thus, images do not seem to be simply another case of general list processing of the sort that might occur in comprehending and processing semantic information (see Kosslyn, 1976a).

Measuring the Visual Angle of the Mind's Eye. One of the defining features of the experienced image, we claim, is that it seems to represent spatial extent in a way analogous to percepts. If so, then it makes sense to ask how large an image can be. That is, if the structure in which images occur is spatial, then it may be meaningful to speak of it as having "edges." Perhaps images can only be some certain size before overflowing. We tested this idea in the following way: Subjects were asked to image some object (either named or shown) and to pretend that they were walking toward it, which reportedly made it seem to loom larger. If at some point the object seemed to overflow, was not "visible" in its entirety in the image, the subject was to "stop" in his "mental walk." At this point he was to indicate how far away the object seemed in his image (i.e., how far away he would be from the actual object if it appeared at that subjective size). We did a series of experiments using this basic methodology, and found that the larger the imaged object, the further was the apparent distance at the point of overflow, just as one would expect if the "image space" was of limited extent. When subjects imaged large animals and simply estimated distance in feet and inches, we sometimes found that distance did not seem to increase linearly, however, but increased more with larger animals (approximately a log function). For stimuli that did not differ in size so greatly (but still varied considerably), we found that a constant angle was occluded by objects, regardless of their size, at the point of overflow. We also found that this point was not absolute: more stringent definitions of what it meant for an image to "overflow" resulted in appreciably smaller estimates of the maximal angle (Kosslyn, 1976b; in preparation). The important point here, however, is that mental images do have spatial "boundaries," and because of this, cannot always be "inspected" in their entirety.

These three findings, on scanning images, detecting parts of images, and the spatial extent of images, converge in demonstrating that the spatial/pictorial qualities of experienced images can enter into actual processing.

1.2 The Origin of Images


We decided, then, that our model ought to incorporate a spatial surface image. This defined the second problem: the origins of such representations. If we had decided that experienced images are epiphenomenal, we would not have been led to consider this issue, but might have considered how inferences are made from sets of abstract propositions (cf. Pylyshyn, 1973).

There seemed to be two very basic alternatives regarding the present issue: images could be stored in toto, and simply retrieved, or some sort of constructive activity might underlie images. The first alternative was quickly eliminated: We found that subjects required more time to generate progressively larger images of objects (Kosslyn, 1975). If more parts are included in larger images (perhaps because it is easier to see where they belong), this result is not surprising. If images are simply "projected," however, this sort of finding seems counterintuitive. However, although unlikely, this result could just indicate that smaller images become brighter (more vivid) more quickly than larger ones. This latter account fails to explain, however, why it takes more time to image very detailed pictures than to image less elaborated ones (Kosslyn, in press; Kosslyn, Greenberg & Reiser, in preparation). The results of these experiments make no sense if images are simply replayed or retrieved as a single unit.
The above results raise yet another issue: Perhaps images are not really constructed, but are stored as integral units, which are simply activated a portion at a time. And the more complex the image, the longer it takes to activate it. In order to eliminate this possibility, we asked subjects to image scenes which included an array of letters. The interesting comparison was between alternative organizations of the same number of letters, which either formed 6 columns or were grouped to form 2 columns, each 3 letters wide. We found that when letters were grouped to form fewer perceptual units, people were able to generate an image of them more quickly, even though the same number of letters were present in both cases. This finding is inconsistent with the piecemeal retrieval hypothesis. These results do not, however, exclude another possibility: images could be stored integrally, but organized in the course of retrieval, during output. That is, the effects of the number of perceptual units could be due to a nonrandom piecemeal retrieval process. A common introspection seems to run counter to this claim, however: People do not forget arbitrary portions or segments, but lose entire interpreted units of an image (see Pylyshyn, 1973). This introspection is supported by the results of Bower and Glass' (1976) study of the organization of visual memories, and makes no sense if images are stored integrally.
The next issue, as we saw it, concerned what sorts of representations are recruited during the construction process. There seemed to be two basic possibilities: Images could be constructed solely on the basis of perceptual information, like fitting together the pieces of a puzzle. Or, images could be created by relating together perceptual and more abstract "propositional" representations. It seemed clear that the latter possibility could in fact occur; after all, most people claim to have no trouble imaging a novel scene that is described to them (e.g., Jimmy Carter putting a giant peanut to bed at night). In this case, one presumably retrieves memories of what the named objects look like, and uses the stated relational information to construct an image of these objects in the proper contexts. To buttress the claim that conceptual information is used in image construction, we performed a simple experiment: People were first shown a matrix of letters, and then were given one of two labels ("3 rows of 6" or "6 columns of 3") after the matrix was removed from sight. It is interesting to note that more time was later taken to image a matrix if it had been conceptualized in terms of more units, even though the description was given only after the matrix was physically removed. Thus, conceptual information can be used in image generation (see Kosslyn, in press; Weber, Kelley, & Little, 1972).

We now have enough data to begin constructing a model, so let us turn to the simulation proper. We will consider additional data as necessary to motivate particular aspects of the model.

2. THE MODEL

Most of the basic concepts underlying our data structures and processes were motivated by empirical findings, but not all the details. In a way, the current version of our program is a collection of hypotheses resting on a relatively firm foundation of findings. That is, although we have implemented procedures in particular ways, we are not married to the details; in some cases, we are currently conducting research in order to discriminate among a number of plausible alternative schemes (e.g., for rotation, in particular). Nonetheless, we feel it is important to formulate hypotheses that are plausible and sufficient to account for known findings. Thus, we have filled in some details prior to obtaining empirical support for underlying assumptions. We will not simply incorporate these assumptions and build upon them, however, but will first attempt to discover whether such assumptions are in fact justified in the face of actual properties of human cognition.

This section is composed of two major parts. In the first part we will review the major data structures of our simulation; in the second part, we will discuss the processes involved in generating, inspecting, and transforming images. Before turning to the specifics of the model, however, let us first discuss some aspects of the computer program itself.

Status of the Program. The program is written in Algol 60. It currently works with only two images, of a chair and a car. It is constructed, however, to be very general. The data underlying construction of a given image are not part of the program proper, but are read off of files stored on a disk (which corresponds to the system's long-term memory). Thus, the program will operate on files for any number of different objects. The program was written with an eye towards eventually interfacing it with a language understander/problem solver and perceptual device: Ultimately, the files our program accesses in constructing images could themselves be the output of a perceptual parser, and the inputs that direct it to perform certain tasks could be in part delivered from a more general "problem solving" set of programs. As it now stands, the user provides the relevant information directly or by setting up files on a disk (as mentioned above).

2.1 Data Structures


There are two major data structures in our model, one representing the experienced, quasi-pictorial "surface" image¹ that embodies spatial information, and one representing the abstract information used to generate such images.

Implementation of the Surface Image. In our program, we simulated the image itself by constructing a matrix wherein cells were filled in order to depict some object. This simple idea was clearly inadequate, however: The overflow experiments had shown that images gradually faded off toward the periphery. Thus, if nothing else, we should have some kind of fading activation toward the periphery of our matrix. The question then becomes where the activation stops relative to the matrix. That is, there could be inactivated portions "waiting in the wings," ready to be "moved" into the activated region. This issue was settled by two experiments: In one, people imaged the schematic faces so large subjectively that only the mouth was visible in their image. Upon hearing the word "light" or "dark," they were to "look" at the eyes of their image and decide whether the color word appropriately labeled them. These subjects were not told to scan to the eyes, but only to be sure to "see" them before responding. If the image was in fact preserved, distance between the eyes (which were not initially visible) and the mouth should affect retrieval times. This was true. Furthermore, overall times were greater than in a condition wherein all of the face was "visible" at one time in the image.
The same instructions were also used in an experiment with the map. Again, these subjects were asked to "zoom in" on the initially named focus point until it was so large subjectively that all else had overflowed. As in the faces experiment, these subjects were not told to scan to named objects that were on the map, but merely to "see" them in their image before responding. Again as before, times increased with increased distance between focus and target locations (see Kosslyn, 1976b; Kosslyn, Ball, & Reiser, in preparation).

These results, then, suggest that there is a nonactivated region in the image structure, wherein spatial images are constructed and waiting to be processed.

¹A surface image is quasi-pictorial in that: (1) it represents visual information by using a coordinate system to depict the appearance of the referent such that portions of the internal representation within the coordinate space bear the same spatial relations to each other as do the corresponding portions of the object(s) being represented; and, (2) the same sorts of information (e.g., about extent, brightness contrast, etc.) are available to interpretive procedures here as are available when percepts themselves are accessed. This follows if the same sorts of representations (accessible by the same sorts of interpretive procedures) that underlie the experience of seeing also underlie surface images.

Recall that there were no effects of distance in a control condition, wherein subjects imaged the map, and focused, but were not required to "see" the target location. Thus, we decided to represent a spatial "surface image" in a matrix, the central regions of which contained more cells that were activated (i.e., accessible to interpretive procedures, the "mind's eye," as will be discussed shortly), and the outermost regions of which contained cells that are not activated at all.
We simulated the surface display by using not one but two matrices, a "visual buffer" and an "activated partition buffer." It is important to realize that these two buffers were used to simulate a single spatial structure with properties of fading activation from the center and lateral inhibition (as will be discussed shortly). The theory itself posits only a single visual short-term memory structure with both of these properties. In the program, the "activated partition" is the only buffer accessible to interpretive procedures, and represents only a subset of the information in the visual buffer. Material placed in the visual buffer is always automatically mapped into the activated partition buffer. In order to simulate fading activation, fewer cells are mapped from the visual buffer into the activated partition with increasing distance from the center. As is schematized in Fig. 1, the outermost portions of the visual buffer are not mapped into the activated partition at all, and represent inactivated material "waiting in the wings" of the surface display. The activated partition buffer, then, corresponds to the activated region of the surface display structure.
In addition to having fading activation from the center, we also wanted the surface display structure to have only limited resolution. That is, when objects are depicted at smaller sizes, fewer details should be evident than when they are displayed at larger sizes. Furthermore, if the object begins at a small size and then is "expanded," details initially obscured now should become apparent. Subjects in our experiments commonly reported having to "zoom in" on initially small images in order to see details. These claims led us to hypothesize that parts of small images were obscured due to something like lateral inhibition, and that details become evident when size is expanded. The notion that something like lateral inhibition is operating here is, however, only an hypothesis at present, but one that allows us to account for obtained results. We simulated this property by mapping four cells of the visual buffer into a single cell of the activated partition buffer. Thus, points close together in the visual buffer are not differentiated in the activated partition, and contours are obscured; the obscured details will become apparent in the activated partition, however, if the image is expanded.

Finally, we used capital letters to mark cells in the activated partition buffer that contained more than one point; lower case letters indicate that a filled cell in the activated partition has not been overprinted. This notation may correspond to subjective "brightness" of an image, but we have not tested the implication that smaller images will tend to become denser and brighter than larger ones.
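Since the program itself is described here only in prose, a brief sketch may make the two-buffer scheme concrete. The Python fragment below is an illustrative reconstruction, not the original Algol 60 code; the function name and the all-or-none fade cutoff are simplifications introduced here (in the program the mapping thins gradually toward the periphery):

    import numpy as np

    def activate(visual_buffer, fade_radius):
        """Map the binary visual buffer into the coarser activated
        partition. Each 2x2 block (four cells) collapses into one
        partition cell, simulating lateral inhibition: a block holding
        one point prints as lower case 'a'; a block holding several
        prints as capital 'A' (overprinted, so contours run together).
        Blocks farther than fade_radius from the center are not mapped
        at all; that material "waits in the wings" of the display."""
        h, w = visual_buffer.shape
        part = np.full((h // 2, w // 2), ' ', dtype='<U1')
        cy, cx = (part.shape[0] - 1) / 2, (part.shape[1] - 1) / 2
        for i in range(part.shape[0]):
            for j in range(part.shape[1]):
                if ((i - cy) ** 2 + (j - cx) ** 2) ** 0.5 > fade_radius:
                    continue  # inactivated: outside the activated region
                n = visual_buffer[2 * i:2 * i + 2, 2 * j:2 * j + 2].sum()
                if n:
                    part[i, j] = 'A' if n > 1 else 'a'
        return part

    buffer = np.zeros((40, 80), dtype=int)
    buffer[18:22, 10:70] = 1  # a crude horizontal bar
    print('\n'.join(''.join(row) for row in activate(buffer, fade_radius=16)))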
[FIG. 1. Schematic representation of the basic data structures utilized in the simulation; large words with arrows symbolize processes that operate upon the structures. The diagram links the surface display (with activated, partially activated, and inactivated regions, operated upon by IMAGE, LOOKFOR, and the transformations) to propositional files such as car.prp, with entries such as "imagefile car.img," "hasa frontire," "optimalresolution," and numerically indexed definitions in files for the rear and front tires and wheelbases.]

Implementation of the Underlying "Deep" Representations. Our findings suggested that it would be useful to posit two sorts of representations underlying images, one perceptual (in the sense of representing "rote" appearances, how something looked) and one conceptual (in the sense of being abstract and discursive, describing a thing or aspects thereof). The first sort of representation was deemed necessary because people do seem to image appearances, and the second was included because conceptual information clearly interacts with memory of appearance in image construction. Because our surface representation occurred in a matrix of filled or unfilled cells, our underlying representation of appearances (spatial patterns) consisted of instructions specifying where cells should be filled. We needed a format that would allow people to preset the size of their images, since we had data that people could readily image objects at different sizes (Kosslyn, 1975). In addition, our intuition was that people could place images at different "locations" with relative ease. These concerns, plus considerations of parsimony, led us to adopt a polar coordinate format. That is, we specified the relative location of each point by an R,θ pair, indicating distance and angular orientation relative to an origin. This format allows one to vary size easily (by multiplying the R values), allows easy shifting of location (by moving the origin), and is very economical.
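To make the economy of this format concrete, here is a small Python sketch (the actual program is in Algol 60) of the projection from R,θ pairs into Cartesian cells of the surface matrix, with each adjustment a single arithmetic operation:

    import math

    # A toy "image file": R,theta pairs relative to an origin (four
    # corners of a square, purely for illustration).
    img_file = [(1.0, math.radians(a)) for a in (45, 135, 225, 315)]

    def project(points, size=1.0, origin=(0.0, 0.0), tilt=0.0):
        """R,theta pairs -> Cartesian cells of the surface matrix:
        multiplying R varies size, shifting the origin moves the
        image, and adding a constant to theta reorients it."""
        ox, oy = origin
        return [(round(ox + size * r * math.cos(t + tilt)),
                 round(oy + size * r * math.sin(t + tilt)))
                for r, t in points]

    print(project(img_file, size=10.0, origin=(40, 12)))  # large, off-center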
The other component of the underlying representation of an image was a list of "facts." This list included: (1) parts of the object ("HASA X"); (2) the location of that imaged object in a scene, or the location of a part of an object (as will be discussed). Locations were indicated by a relation and a "foundation part" (e.g., for "cushion," "LOCATION FLUSHON SEAT"); (3) the level of resolution needed to see the imaged part, indicated in terms of dot density in the surface matrix; and (4) a definition of the part. This definition was in terms of a set of procedures, which would be necessary to execute successfully in a specified sequence in order to find the object or part (as will be discussed). The purpose of each type of list entry will be developed as we discuss the operating characteristics of the program below.
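Since no complete propositional file is reproduced in the paper, the following is a hypothetical rendering of such a "facts" list, in Python notation, with every value invented for illustration:

    # Hypothetical contents of CUSHION.PRP, one entry per item above:
    cushion_prp = {
        "IMAGEFILE": "cushion.img",       # pointer to the R,theta appearance file
        "HASA": ["BUTTON"],               # (1) parts of the object
        "LOCATION": ("FLUSHON", "SEAT"),  # (2) relation + foundation part
        "OPTIMALRESOLUTION": 85.0,        # (3) dot density needed to "see" it
        "DEFINITION": [4, 9, 6, 11, 5],   # (4) ordered indices of FIND's tests
    }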

In order to model the constructive process, we were forced to assume that any given object may be represented in memory by more than just a single image and propositional file. We posited that all encoded objects have a "skeletal file," which represents a global shape (or the "central shape" to which all else is attached). This file represents information encoded upon an initial look; in addition, there may be subsidiary files representing "second looks." In this case, information about details may be represented by files subordinate to the skeletal file. The propositional lists associated with these files must note where on the skeleton the part belongs, and how it is attached. This structure was motivated by our finding that images of objects with more parts required more time to generate, even if the parts were created simply by cutting up the same object into more segments, and presenting the segments sequentially prior to imaging. The structure of our program is schematized in Fig. 1.
Finally, in addition to representations of rote appearances and lists of propositional knowledge, we also posit that one has representations of the meanings of various relations (like "left of," "flush on," "under," and so on). We represent such knowledge by procedures that serve to integrate parts together. Thus, "FLUSHON," for example, means that an added part (e.g., a cushion of a chair) should fit on another part (the seat) such that none overlaps or fails to cover.
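The paper does not list the bodies of these relation procedures, but a toy version may convey their flavor; everything below (the signature, the bounding-box convention with y increasing downward, and the centering rule) is a guess at the kind of computation involved:

    def flushon(foundation_box, part_width, part_height):
        """Toy FLUSHON: given the foundation part's bounding box
        (left, top, right, bottom) in surface-matrix coordinates,
        return the origin at which a part scaled to the foundation's
        width rests flush on its top edge, neither overhanging it
        nor leaving it uncovered."""
        left, top, right, bottom = foundation_box
        return ((left + right) / 2.0 - part_width / 2.0,  # centered
                top - part_height)                        # seated on top

    RELATIONS = {"FLUSHON": flushon}  # "UNDER," "LEFTOF," etc. added alike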
This exhausts the sorts of representations in our model. Let us now turn to the
actual operation of the program.

2.2 Operation of the Program


The data structures described above are processed in various ways, depending upon the task being performed. Our program makes use of three basic processes, all of which work together: First, there are procedures for mapping deep representations into a surface representation. Second, there are procedures for evaluating a surface image, for finding a pattern representing some part, or for evaluating the level of resolution of the image. Third, there are procedures for transforming an image: for adjusting size, for scanning, and for rotation.

2.2.1 Image Generation


The basic procedures used in generating images are called PICTURE, FIND, and PUT, all of which are coordinated by IMAGE. PICTURE converts a file of R,θ coordinates (representing the long-term memory of a rote visual appearance of an object) into a set of points in the surface matrix; in so doing, the size may be altered by adjusting the R values, the location may be set by shifting the placement of the origin (to which all R,θ values are relative) in the Cartesian space of the matrix, and the orientation may be altered by adjusting the θ values. (We did not initially plan on this last property of images, but it is clearly implied by our representation and we are currently investigating it.)
FIND (which is called by PUT) consists of a set of procedures that test for
various spatial configurations in the surface matrix (as will be illustrated). FIND
is an interface between a definition of a concept (e.g., for a car's rear tire) and a
spatial pattern corresponding to an instance of that concept. Upon locating a
pattern of points that corresponds to a part or object, FIND then passes back the
location in the surface matrix (specified in Cartesian coordinates). We make no
claims that the specific procedures incorporated in FIND actually reflect those
used by people; to solve this problem would be to solve much of the problem of
how people recognize patterns. Instead, we simply assert that people utilize
something like our procedures in some similar ways; the particulars of our
procedures were constructed simply for convenience.
Finally, PUT integrates a pattern corresponding to a part into a pattern already in the surface matrix. For example, PUT places a tire at the correct location on an image of a car's body. In order to integrate a part, PUT must first discover where the part belongs on the image and then must locate that "foundation part" via FIND. FIND allows PUT to calculate the appropriate size of the to-be-integrated part and allows it to specify where the part should be printed out; this information is passed to PICTURE, which actually generates the image of the part.
These three procedures are coordinated by a procedure called IMAGE, which
interfaces with the rest of the hypothesized cognitive system (i.e., a language
comprehender, problem-solving apparatus, etc.). In actuality, of course,
IMAGE interfaces with the user, who specifies which images to generate and
may indicate the size, orientation, location, and level of detail desired.
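In outline, the control structure just described might be paraphrased as follows. This is a Python sketch of the prose, not the Algol 60 source; the stub routines stand in for machinery described elsewhere in this section:

    # Stubs standing in for routines described elsewhere in this section.
    def read_prp(name): return {"IMAGEFILE": "car.img", "HASA": ["REARTIRE"]}
    def read_img(name): return [(1.0, 0.0)]                # R,theta pairs
    def picture(points, size, origin): print("PICTURE:", len(points), "points")
    def put(part): print("PUT:", part)                     # calls FIND + PICTURE

    def image(name, size=1.0, origin=(0, 0), elaborate=False):
        """IMAGE: coordinate PICTURE and PUT (which calls FIND)."""
        prp = read_prp(name + ".PRP")        # no such file: the object is novel
        points = read_img(prp["IMAGEFILE"])  # deep representation of appearance
        picture(points, size, origin)        # render the skeleton in the matrix
        if elaborate:                        # the default is the skeleton alone
            for part in prp.get("HASA", []): # e.g., HASA REARTIRE
                put(part)                    # find foundation part, scale, print

    image("CAR", elaborate=True)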

The best way to illustrate how the program works is to trace through some examples involving generation of an image of a car.

Generating a Skeletal Image. Figure 2 shows what the program did in generating two images of a car, a "skeletal image," without added details (top), and an elaborated image (bottom).² In order to generate the skeletal image, we first entered a command to image a car. The program immediately looked up a file listing facts about that object; if it could not find such a file, the object is novel and the program obviously cannot proceed further. The file located in generating this skeletal image was labeled "CAR.PRP"; the program assumes that files with a ".PRP" extension contain statements in a propositional format. One of the statements listed in the file may be the name of another file, with an indication that this file contains an encoding of the actual appearance of the object or part. Presumably, one can have numerous images of the same thing, each indexed according to date, place, etc. In the present case, only "IMAGEFILE CAR.IMG" was listed in the CAR.PRP file. Upon locating this, the IMAGE procedure then looked up this file, which contains a list of R,θ coordinates. After locating the file, PICTURE was called up, and the specified points were placed in the surface matrix.

²All of the following figures are externalized representations of the "activated partition buffer," which is operated upon by FIND.

[FIG. 2. Externalized surface representation of a car. The top illustration is a skeletal image and the bottom is a fully elaborated image. Different letters in the bottom illustration index recency of being "refreshed."]
In this case, PICTURE centered the image in the buffer, since no explicit location specifications were provided, and printed the image so it fell just into the most activated portion of the visual buffer, which is the default size used when no explicit size requirement is entered. One can specify a command like "IMAGE CAR LARGE AT 10,20," which will cause the R values to be adjusted and the origin (the center of the image) to be moved to the specified place. The default size was adopted because we found that people spontaneously seem to image most things at about the largest size before overflowing, as indicated by estimates of how far away imaged objects seem when they are constructed with no special instructions about size (i.e., when subjects are not asked to image them as large as possible). We decided to have images centered unless otherwise specified, for two reasons: First, people claimed that they centered their images unless otherwise instructed. Second, many of our empirical findings would not have occurred if people did not center images (e.g., larger images would not be inspected faster than smaller ones if one had to scan to overflowed portions; only by dint of being able to "see" numerous parts simultaneously should one be able to inspect a larger image more quickly than a smaller one).
We have yet to perform research indicating whether the "skeletal file" ought to contain a low-resolution global representation or a representation of just the most structurally central part (the body, in this case). Either option can be incorporated with no changes in the structure of the program. We have chosen to store only the body of the car in the skeletal file for expository purposes; this allows us to illustrate more clearly how "second looks" are integrated into the image. We also have yet to determine when only a skeletal image is generated (as opposed to a more fully fleshed-out version). We know that when people are asked to image all of the details, more time is in fact required when more detailed drawings are mentally pictured. We also have some pilot data, however, that seem to indicate that if people are not given explicit instructions to include all parts in their image, the number of details does not influence construction time. If people typically only construct a skeletal image (and perhaps wait for task demands before filling in further information), then this result makes sense (if in fact this preliminary finding is correct). In addition, we do have some slightly more compelling reasons for suspecting that people often do not fill out all details: Kosslyn (1976a) found that people did not "see" parts of already constructed images any more quickly than they constructed parts of images on the spot. If people had only constructed a skeleton when asked to image an object, and then had to insert the part when asked to "see" it (as opposed to simply detecting it on a fully constructed image), these results would make sense: in either case, whether an image was constructed ahead of time or not, the person would still have to generate an image of the probed part. On balance, then, we decided to have the program construct only the skeletal image unless otherwise instructed.

Generating an Elaborated Image. Although the default currently is to generate only the skeleton, one can specify generation of a fully elaborated image, as is illustrated at the bottom of Fig. 2. This is a much more imposing task than that discussed above, and it makes sense not to go through all of this unless one knows it will be of some use. In this case, the skeletal image is constructed first. Following this, the IMAGE procedure now looks to see whether the object has parts explicitly noted in the appropriately labeled file (CAR.PRP). In this case, HASA REARTIRE was located. The program now attempts to find the appropriately labeled propositional file (REARTIRE.PRP), and attempts to discover whether there is a statement that an image of this part is stored in memory. If, as here, it finds a notation that an image file exists ("IMAGEFILE TIRE.IMG"), this file is then looked up. (If no appropriate index is found, the program would report an error condition.)

Having found that the object has a part, and that there is a representation of the part's appearance, the program will now attempt to integrate the image of the part into the already generated skeletal image. In order to do so, the program must first look up the location of the rear tire; locations consist of two parts, a relation ("UNDER," in this case) and a foundation part ("REARWHEELBASE"). If either of these specifications is missing, PUT will not be able to place the part correctly. It often seems, however, that we can know what some part or thing looks like, but not know where it belongs; for example, one may know what a friend's ring looks like, but never notice which finger (or hand, for that matter) it belongs on. In a future version of the program we would like to build in "best guess" inference procedures for integrating parts when locations are not fully specified; at present, however, attempts to integrate a part will terminate if PUT cannot find a complete location specification.
PUT begins to integrate the tire into the image by first having the image "refreshed" (regenerated); as will be discussed, we hypothesize that parts fade with time, and must be regenerated if they are to be maintained. After regenerating the image, PUT calls FIND in order to locate the foundation part, "REARWHEELBASE," on the skeletal image. This is accomplished by finding the definition of REARWHEELBASE, and then executing these tests in the specified order. The definition is specified simply as an ordered list of numbers that index procedures, but could be stored as a list of concepts corresponding to what the procedures actually do. Our suspicion was, however, that natural language is too weak to easily draw the distinctions necessary to specify the different procedures humans actually use, and that such definitions are probably stored in some kind of abstract format. If all definitional procedures are satisfied, FIND has located the part, and the location is passed back to PUT.
Following this, the representation of the relation "UNDER" is looked up. We represented relations directly as procedures, but instead we could have represented them in terms of declarative specifications with interpretive functions that work over them (as far as we are concerned, these alternatives are so difficult to distinguish between that they may simply be "notational variants"). The definition for a relation may be very particular; that is, there may be many distinguishable sorts of "under," depending on what sorts of objects are involved. Some of these distinctions can be expressed in English with phrases (e.g., "tucked slightly in and under"), but many cannot. Thus, we have simply used single words as labels, but realize that something more abstract may in fact be required in order to provide an accurate model of human representation and processing of relations.
In any case, the relation expresses exactly how parts fit together, which allows PUT to construct a composite image. The first thing PUT does is to adjust the size scale of the part relative to the skeleton. That is, the skeleton may have been generated at any number of sizes, and the part must be matched to scale. This is done by assessing the size of the foundation part (the wheelbase in this case) and then adjusting to scale the R values in the deep representation (image file) of the part (the tire). That is, the extent of the foundation part along a single dimension (horizontally) is assessed in the image, and then the R values are adjusted until the sum of the maximum horizontal (within a tolerance) R values is the same size. Once so adjusted, the part is printed out at the correct location in the image (via PICTURE). This procedure may be repeated for any number of parts. On the bottom of Fig. 2, a front tire was also imaged. In this case, the same tire was simply placed in the front. We suspect that people often assume that all tires look alike (or all trees!), and are quite happy to save effort by encoding only a single exemplar, and using it in multiple contexts.
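The size-matching step may be clearer as code. In this sketch, "the sum of the maximum horizontal R values" is read as the span between the part's extreme leftward and rightward components; that reading, like the function itself, is an assumption rather than the program's actual routine:

    import math

    def scale_to_foundation(part_points, foundation_extent):
        """Rescale a part's R,theta deep representation so that its
        horizontal extent equals the foundation part's horizontal
        extent as measured in the surface image."""
        xs = [r * math.cos(t) for r, t in part_points]
        scale = foundation_extent / (max(xs) - min(xs))
        return [(r * scale, t) for r, t in part_points]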
Once an image is constructed, we hypothesize that it begins to fade, and that work is required to maintain it. We have evidence that more complex images (e.g., of 16-cell vs. 4-cell matrices) are more difficult to maintain, and that parts of more complex imaged scenes require more time to detect, supporting the notion that the image is generally more degraded as it becomes more complex (see Kosslyn, 1975). In our program, there is a scheme to recycle parts of the surface image. That is, we print out each part with a different letter; the most recent part is printed with "a," and the letters are reassigned such that more recently printed parts are printed with lower letters. The different letters in the car at the bottom of Fig. 2 reflect recency of being refreshed. Older parts are refreshed first; in our current implementation, we only regenerate when the program is asked to do something (e.g., find a part) with the image. So, the parts may fade altogether without being refreshed if the image is held too long before the program is asked to evaluate it or the like. This decision was motivated by the same sort of data that led us to consider a skeletal image as a default. That is, if people hold an image for a while, they may in fact let all details fade and only maintain the skeleton rather than retain a potentially useless load. We are currently testing this hypothesis, and thus pose it at present only as a reasonable possibility. Unless fading rates vary somewhat capriciously, however, our model commits us to a position that "image processing capacity" will be defined by the rate at which construction and fading proceed: If there are so many parts to be placed that the initially placed ones fade away by the time the last is generated, all parts will not be displayed simultaneously, and "processing capacity" will have been exceeded.
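The recency bookkeeping behind the letter notation can be sketched as follows. This is a reconstruction, not the program's code: the fade limit is an invented constant, and redraw is a stand-in for a PICTURE-style reprint:

    def redraw(part, label):
        print("reprinting", part, "as", repr(label))  # stand-in for PICTURE

    class SurfaceImage:
        FADE_LIMIT = 6  # invented: parts older than this have faded away

        def __init__(self):
            self.parts = []  # placement order, oldest first

        def place(self, part):
            self.parts.append(part)

        def refresh(self):
            """Run only when the program is asked to do something with
            the image; older parts are refreshed first, and the most
            recently printed part is labeled 'a'."""
            self.parts = self.parts[-self.FADE_LIMIT:]  # the rest have faded
            n = len(self.parts)
            for age, part in enumerate(self.parts):     # oldest first
                redraw(part, chr(ord('a') + (n - 1 - age)))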

Generating a Subjectively Small Image. We noted earlier that people generate subjectively smaller images more quickly than larger ones, but that larger ones are inspected more easily. We assume that this result is due to people placing fewer parts on smaller images; since they know that they will soon be asked to find some part, people may generate detailed images in this task initially (but we do not know that they bother to maintain the details if asked to hold the image). Basically, a skeletal image is first generated as before, but now with a smaller size factor. As is evident in Fig. 3, the resolution is too poor for the wheelbase to be clearly delineated; hence, the procedure checking for a gap is not satisfied, and PUT is not provided with the information necessary to place the tire. Thus, a smaller image will be constructed faster than a larger elaborated one, since fewer parts will be placed; as we shall see shortly, when a small image is inspected, more operations will be required to find a part (via "zooming in") than if the image had been constructed at a larger size.

[FIG. 3. Externalized representation of a car imaged at a small subjective size.]

Constructing an Overflowed Image. Images may also be constructed so large that they overflow the activated portion of the visual buffer, as is illustrated in Fig. 4. As is evident, overflow at the periphery is not all-or-none, but occurs gradually, as mandated by the data described earlier. As the image becomes larger, parts become more sharply delineated (as is clearly evinced by the door handles, which are not very discernible even in Fig. 2, but are clearly evident in Fig. 4). It is interesting to note that FIND will also fail to locate the wheelbases here, and will not allow PUT to place the tires; the program trace for the run that produced Fig. 4 is identical to that produced in generating Fig. 3, except for the size factor indicated. Thus, we would predict that this size image may be constructed faster than a medium sized one if enough foundation parts overflow; this prediction has not yet been tested.
[FIG. 4. Externalized surface representation of a car imaged at a very large subjective size. Small letters indicate that there is a direct mapping from the visual buffer to the activated partition (which is printed here); this indicates that lateral inhibition is no longer obscuring contours.]

2.2.2 Image Inspection


Much of the data on processing of mental images involves a person "looking" at an imaged object for some specified part or property. Our program simulates this process by using the procedures FIND, SCAN, EXPAND, and SHRINK, all of which are coordinated by LOOKFOR. FIND has already been described, and works here just as it does in image generation. SCAN shifts the points that define a surface image across the visual buffer, such that different portions of the image fall in the center of the matrix (which is posited to be most highly activated and "in focus"). EXPAND and SHRINK shift the points defining a surface image such that the size scale is altered. (The details of these last three procedures will be described shortly.)

LOOKFOR accepts commands to locate some part; it does this by calling up FIND, which looks up a procedural definition of the part and searches for it. Before executing FIND, however, LOOKFOR checks to see whether the image size scale is correct; if not, EXPAND or SHRINK is called to make adjustments. If FIND is not successful in locating the part, SCAN is utilized in a further effort to locate it.
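The dispatch logic can be summarized in a sketch. The stubs, the SCANDIRECTION key, and the tolerance below are assumptions standing in for machinery described in the surrounding text; the control flow follows the prose and the trace in Table 1:

    # Stubs standing in for procedures and state described in this section.
    def current_density(): return 87.2                 # dots per unit area
    def find(definition): return None                  # location, or None
    def scan(direction): print("SCAN", direction)
    def scan_to(location): print("SCANNING TO", location)
    def expand(): print("EXPAND")                      # "zoom in"
    def shrink(): print("SHRINK")                      # "pan out"

    def lookfor(part_prp, tolerance=5.0):
        """LOOKFOR: adjust size scale, FIND the part, and SCAN if the
        first search fails."""
        optimal = part_prp.get("OPTIMALRESOLUTION")    # else consult the
                                                       # foundation part's file
        if optimal and abs(current_density() - optimal) > tolerance:
            # Repeated until within tolerance; an over-dense image is
            # too small for the part to be resolved, so zoom in.
            (expand if current_density() > optimal else shrink)()
        location = find(part_prp["DEFINITION"])
        if location is None and "SCANDIRECTION" in part_prp:
            scan(part_prp["SCANDIRECTION"])            # e.g., "RIGHT"
            location = find(part_prp["DEFINITION"])
        if location is not None:
            scan_to(location)                          # center the found part
        return location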

Locating a Part of a Centered, Medium-Sized Image. Let us consider how the program operates when asked to find the rear tire of a car imaged centered and at a medium size. First, LOOKFOR attempts to locate an appropriately named file; if no such file can be found, the property is novel and the program cannot continue. In the present example, the file labeled "REARTIRE.PRP" was found. The program does not begin immediately to search for the part, however; we reasoned that people probably do not bother to scrutinize an image that is clearly too small or large, but first check the level of resolution of the image and adjust size if necessary.³ Our program simulates this by first attempting to look up the optimal resolution necessary to see the part. If this information is not stored with the file associated with the part (with REARTIRE, in this case), LOOKFOR then looks in the file associated with the foundation part (REARWHEELBASE). This course of action was adopted for two reasons: First, many parts may be different sizes on different things (like tires); second, one may sometimes look for parts that are not explicitly noted in a file, like the hood, but are only implicit in the surface image. It seemed to us that imagery would most often be used in memory retrieval when one did not have an explicit notation that some object had a sought property. For example, consider what seems to happen when you have to decide whether or not Volkswagen Beetles have ventwings (little triangular windows at the front of the door). It is unlikely that this information is explicitly noted in the propositional file associated with Volkswagen, nor may this information be deducible from a superordinate or the like. Thus, one may have to mentally image the car in order to retrieve this information. The definition of ventwing allows one to look up the fact that it is attached to the door, which will allow one to adjust the image to the correct size to "see" the ventwings.

³In the current simulation, size and resolution covary; larger images are more resolved, because more details may emerge as more dots are available to represent contours in the activated partition buffer. This relationship between size and level of resolution is not necessary, however; we could have kept the dot density constant as size was changed. As it now stands, larger images also will appear "closer," and smaller ones "further away," because of the relationship between size and resolution.
As the program now stands, the optimal level of resolution is stored directly. The resolution of an image is indexed in our model by dot density; more dense images are less resolved, as points run into one another and details are obscured (in our model this is indicated in part by the number of capital letters, which indicate overprinting). We will eventually implement a scheme for calculating optimal resolution from information about the relative size of a part and the material of which an object is composed: If dots represent places where light is deflected by local perturbations on the surface, then mirrors will have fewer dots than cardboard; hence, knowledge of both size and material will be necessary to calculate how dense dots ought to be in order to be optimally able to see some part. In any case, once optimal dot density is obtained, the current level of density is assessed and compared with the optimal; if the current density is not within some specified range, the program "zooms in" or "pans out" as appropriate (via EXPAND or SHRINK, as will be described shortly) until the dots are at the correct level of density.
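The density formula itself is not given in the paper; the sketch below is merely one index consistent with the description, counting overprinted (capital) cells double over the pattern's bounding area:

    def dot_density(partition):
        """Higher values mean points are running together (less resolved)."""
        filled = [(i, j) for i, row in enumerate(partition)
                  for j, ch in enumerate(row) if ch != ' ']
        rows = [i for i, _ in filled]
        cols = [j for _, j in filled]
        area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
        weight = sum(2 if partition[i][j].isupper() else 1 for i, j in filled)
        return 100.0 * weight / area

    print(dot_density(["aAa", " a "]))  # ~83.3 on this toy pattern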
Once it has been determined that the image is within the range of resolution necessary to "see" the sought part, FIND is used as it is in generating fully elaborated images. In addition, however, once the tire is found, the program "scans" to it, which is equivalent to centering the part in the visual buffer. (Thus, the inspected part is in the most sharply focused region of the visual buffer.) We decided to have the program scan to the object because of results of experiments in which people were asked first to mentally focus on a portion of an image, and then were asked whether they could "see" another portion; the further the separation between the two portions, the longer the response times. People reported scanning directly to the object, whereupon it became sharply in focus. In order to scan directly to it, however, people seemed first to locate it in the whole image field; the strongly linear relationship between detection time and distance scanned (see Kosslyn, 1976b; Kosslyn, Ball & Reiser, in preparation) indicates that random searching probably was not required. Figure 5 illustrates the results of finding the rear tire.

[FIG. 5. Results of scanning to the rear tire; note that the front end has partially overflowed.]

[FIG. 5. Results of scanning to the rear tire; note that the front end has
partially overflowed. (The dot-matrix printout of the surface display is not
reproducible here; its character alignment was lost in extraction.)]

Locating an Overflowed Part. In Fig. 5 the program is focused on the rear
tire, and the front tire is now in the partially activated zone of the surface display.
Table 1 is a verbatim set of tracings, printed out on-line as the program attempted
to find the front tire, which was not "visible" because it had overflowed the
activated region of the visual buffer. In this case, FIND cannot locate the
wheelbase because not enough of the wheelbase is activated to identify it (lines
31-35 in Table 1). Now, LOOKFOR sifts through the procedural definition of
the part, and finds an instruction to use a procedure that searches in a particular
spatial direction for the part; this information is used to direct SCAN to scan in
that direction. In our working program, SCAN moves the image such that the
material previously at the appropriate edge (i.e., the right) of the activated region
is shifted into the center, moving material previously inactivated into the
activated regions (lines 42-45). Thus, since the front tire is defined as being at
the front, which here is equated with "right," the rightmost portion of the image
is shifted to the center. (The process of scanning and inspecting should be
continuous, but we presently perform the scans in leaps in order to save time.)
Once the image is shifted, FIND again inspects the image and now is successful,
and LOOKFOR again calls SCAN to center the found part.
TABLE 1
Attempting to Find a Front Tire on a Fully-Detailed, Medium-Sized,
Off-Centered Image of a Car

*LOOKFOR FRONTIRE
LOOKING FOR PROPOSITIONAL FILE FOR FRONTIRE
FRONTIRE.PRP OPENED
CHECKING PROPOSITIONAL FILE FOR OPTIMAL RESOLUTION
OPTIMAL RESOLUTION NOT FOUND
CHECKING PROPOSITIONAL FILE FOR FOUNDATION PART
FOUNDATION PART FOUND: FRONTWHEELBASE
LOOKING FOR PROPOSITIONAL FILE FOR FRONTWHEELBASE
FRONTWHEELBASE.PRP OPENED
CHECKING PROPOSITIONAL FILE FOR OPTIMAL RESOLUTION
OPTIMAL RESOLUTION FOUND: 85.0
CHECKING SIZE OF IMAGE
CURRENT RESOLUTION = 87.2
LOOKING FOR PROPOSITIONAL FILE FOR FRONTIRE
FRONTIRE.PRP OPENED
CHECKING PROPOSITIONAL FILE FOR PROCEDURAL DEFINITION OF FRONTIRE
PROCEDURAL DEFINITION FOUND
REGENERATING IMAGE
BEGIN SEARCHING ACTIVATED PARTITION FOR FRONTIRE
SEARCHING FOR LOWEST POINT RIGHT
FOUND AT 34 0
FOLLOWING HORIZONTAL LEFT TO END
PROCEDURE FAILED
CAN'T FIND FRONTIRE
LOOKING FOR PROPOSITIONAL FILE FOR FRONTIRE
FRONTIRE.PRP OPENED
CHECKING PROPOSITIONAL FILE FOR DIRECTION TO SCAN
DIRECTION FOUND: SCAN RIGHT
SCANNING TO 25 0
LOOKING FOR PROPOSITIONAL FILE FOR FRONTIRE
FRONTIRE.PRP OPENED
CHECKING PROPOSITIONAL FILE FOR PROCEDURAL DEFINITION OF FRONTIRE
PROCEDURAL DEFINITION FOUND
REGENERATING IMAGE
BEGIN SEARCHING ACTIVATED PARTITION FOR FRONTIRE
SEARCHING FOR LOWEST POINT RIGHT
FOUND AT 10 0
FOLLOWING HORIZONTAL LEFT TO END
FOUND AT 6 0
STORE COORDINATES OF RIGHT ANCHOR POINT
SEARCHING FOR NEXT HORIZONTAL POINT LEFT
FOUND AT -1 0
STORE COORDINATES OF LEFT ANCHOR POINT
CHECKING FOR PART BELOW
FRONTIRE FOUND
SCANNING TO 6 0
The reader will note that parts (e.g., front tire) are currently defined in terms of
left and right; this clearly is inadequate, as the car obviously could face either
direction. We plan to overcome this deficit by having the program locate a
"landmark" (e.g., the hood ornament) which establishes whether the front is to
the left or right; once this is established, left and right will be assigned as
appropriate in the execution of procedures, which will be written in terms of
orientation-invariant relations like "front" and "back."
We presently execute SCAN only after FIND, but it may be that SCAN, like
EXPAND and SHRINK, ought to be called prior to attempting serious scrutiny,
especially if the optimal resolution entails having some portion overflowing
(e.g., as would occur when searching for a cat's claws, or the door handles of a
car). We are about to conduct research attempting to discover whether scanning
occurs simultaneously with size adjustment or whether the two operations occur
serially (if the former is true, we expect that effects of size and distance will not
be additive). If size adjustment and scanning occur in parallel, then our current
simulation is in need of revision.
2.2.3 Transforming Images
The data on image transformations seem to suggest that such transformations
occur more or less continuously; at the minimum, the image seems to go through
intermediate states in the process of being transformed (e.g., Cooper & Shepard,
1973).4 In our program, all image transformations involve shifting the locations
of points that delineate the object in the surface matrix. Furthermore, all shifting
is done sequentially, as we did not wish to posit that points could be stored in a
temporary buffer before being repositioned. Given that portions of an image are
transformed sequentially, we were forced to posit that there was a limitation on
how far individual points could be shifted at any one time; this limit is a function
of how far a given portion (i.e., set of points) can be moved before a noticeable
gap or deformation occurs in the image (and hence depends upon the size of the
image, among other factors). Thus, only relatively small changes can be made at
a time, requiring a series of relatively small transformations. Our scheme, then,
conforms to the data and to our intuitions that images pass through intermediate
points when transformed; there is no reason why a nonspatial model, like a
"semantic network," would require passing through spatially intermediate
states.
Although we have implemented our transformations using the portion-at-a-time
sequential shifting scheme described above, we also could have simulated a
mechanism that shifts all points simultaneously. In this scheme, there would be a
distribution around the distance each point (or portion) is moved, such that the
image becomes scrambled when transformed; further, the larger the transforma-
tion, the more the points would become jumbled. In order to avoid total disrup-
tion, a sequence of relatively small transformations would be used, each one
followed by a "clean up" operation realigning the points correctly. This latter
scheme is much more difficult to implement than the one we have currently
programmed, however, and thus we have hesitated to embark upon it until we
can formulate some way of empirically distinguishing between the two types of
mechanisms.

⁴At first glance one may think that there is no problem here; images just
rotate, expand, etc. However, an image is not like a picture: it is not a rigid
physical object that can actually be moved and that obeys the laws of physics. As
is apparent when one examines our model, the spatial qualities of image
representations do not necessarily entail continuous (or stepwise)
transformations; this is a real problem to be solved that is not explained away
simply by appeal to the existence of spatial images (cf. Pylyshyn, in press).

Size Adjustment. The transformations first mentioned above involve altering
apparent size, either by "zooming in" (EXPAND) or "panning out"
(SHRINK). These transformations work by first defining a sequence of "rings"
around the center of the surface matrix. For zooming in, the outer ring is moved
outward and then each ring towards the center is moved outward in succession.
For panning out, the rings are pulled toward the center, again one step at a time
starting at the innermost ring. The maximal step size of each cycle is set by the
starting size of the image; thus, rates of expansion, for example, may increase as
the size increases; this prediction has yet to be tested. As the apparent size of the
image is altered, previously obscured details may come to be mapped into more
than a few cells of the activated partition, and hence become more sharply
defined. That is, if the image is too small, various details will be obscured, as
only activated cells (represented in the activated partition buffer) are available for
inspection by the FIND procedures. All transformations cause points to move in
the visual buffer, which then are mapped into the activated partition; hence, as
images are expanded, previously obscured details become evident (i.e., accessi-
ble to the FIND procedures).
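A minimal sketch of the ring scheme (ours, in Python; the ring width and scale
factor are assumptions, and a list of polar points stands in for the surface
matrix):

def expand_by_rings(points, factor=1.2, ring_width=5.0):
    # Group (r, theta) points into concentric rings about the center.
    rings = {}
    for r, t in points:
        rings.setdefault(int(r // ring_width), []).append((r, t))
    # Move the outermost ring outward first, then each ring in toward
    # the center in succession; in the matrix version this ordering keeps
    # a ring from landing on cells that have not yet been vacated.
    expanded = []
    for index in sorted(rings, reverse=True):
        expanded.extend((r * factor, t) for r, t in rings[index])
    return expanded

small_image = [(3.0, 0.0), (8.0, 1.0), (14.0, 2.0)]
print(expand_by_rings(small_image))   # radii scaled to 3.6, 9.6, 16.8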

Scanning. Scanning in our model is treated as simply another kind of image
transformation. Instead of moving an activated region across the image, we
march the image across the most highly activated region of the surface matrix.
Our reasons for implementing scanning in this way were as follows: Our data
indicated that images gradually fade off toward the edges. This finding led us to
posit that there is only a limited amount of processing capacity available to
activate an image. Given this claim, it makes sense that activation would be
distributed unevenly throughout the surface display; why should the image be
degraded overall, when some portion could be relatively sharp (even though
other portions may be relatively degraded)? Given that some distribution of
capacity is likely, it is almost tautological to say that the "center" would be most
highly activated: as long as activation fades off symmetrically from the point of
highest activation, this point becomes, for all intents and purposes, the
"center." Whatever part is centered in the image, then, will be most sharply in
focus. Thus, to shift the point of focus over the image one could simply shift the
image such that different portions fall in the center of the internal display. In our
model, then, scanning consists of shifting the points delineating an image until
the sought configuration (i.e., part) is in the center of the most activated region of
the surface display.
If experienced surface images occur in the same neural structures that support
our experience of seeing an object during perception, then the above sort of
notion makes some sense: Many people claim to be able to scan 360 degrees
around them in a mental image of a room, but visual structures need to support
only the limited visual arc subtended by the eyes. Thus, one should "bump into
edges" if scanning images consisted of moving an activated region through the
representational structure. If one instead moves the image through the
representational structure, then it is perfectly natural that one could seem to
scan indefinitely along an arc (as long as new portions of the image were con-
tinually being constructed at the leading edge).
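In code, the idea might be sketched as follows (our Python illustration; the
centering criterion and the step bound are assumptions of ours):

MAX_STEP = 2.0   # assumed bound on any single shift of the points

def scan_to(points, target):
    # Shift every point, in bounded steps, until the target location
    # lies at the center (0, 0), i.e., the most activated region.
    tx, ty = target
    while abs(tx) > 1e-9 or abs(ty) > 1e-9:
        dx = max(-MAX_STEP, min(MAX_STEP, -tx))
        dy = max(-MAX_STEP, min(MAX_STEP, -ty))
        points = [(x + dx, y + dy) for x, y in points]
        tx, ty = tx + dx, ty + dy
    return points

image = [(6.0, 0.0), (8.0, 0.0)]           # front tire at (6, 0), as in Table 1
print(scan_to(image, target=(6.0, 0.0)))   # tire now centered at (0, 0)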

Rotation. Rotation is the last scheme we have implemented. This is our least
developed transformation, however, and we are currently in the midst of de-
veloping alternative schemes and distinguishing between them empirically. As it
now stands, portions of the image are shifted sequentially in a given direction; in
so doing, the first wedge moves in the desired direction, leaving a gap which is
filled in and so on until the image has inched all around to the first part. The
initially moved portion overlaps an unmoved part, and this unmoved part must be
shifted. We distinguish the points delineating the moved part by making them
"brighter" (as indexed by a different symbol). Furthermore, when a point
falls into a cell already occupied, the brightness is increased even more (rep-
resented by yet another symbol). Thus, the program knows not to move bright
dots, but that very bright dots consist of one dot to be moved and one to be left
alone.
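The bookkeeping can be sketched as follows (our hypothetical Python; the
symbols and the wedge representation are assumptions, with 'o' standing for a
"bright" moved dot and 'O' for a "very bright" collision):

import math

def rotate_wedge(occupied, wedge, angle_step):
    # Rotate one wedge of (r, theta) points; a moved dot is marked 'o'
    # ("bright"), and a dot landing on an occupied cell is marked 'O'
    # ("very bright"): one dot still to be moved, one to be left alone.
    moved = {}
    for r, theta in wedge:
        cell = (round(r, 1), round((theta + angle_step) % (2 * math.pi), 2))
        moved[cell] = 'O' if cell in occupied else 'o'
    return moved

occupied = {(1.0, 0.0): '.', (1.0, 0.79): '.'}   # two dots on a circle
print(rotate_wedge(occupied, wedge=[(1.0, 0.0)], angle_step=0.79))
# {(1.0, 0.79): 'O'}: the rotated dot collided with an unmoved one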
The present notion seems to predict, then, that more complex images should
be more difficult to rotate, if more complex images have more portions to be
shifted. In addition to simply having more material, images of complex objects
(e.g., a face) should be difficult to rotate because the parts should tend to become
scrambled, due to accumulated error in shifting portions (or points in parallel, as
discussed earlier). If after each iteration parts were realigned, this would allow
one to rotate images of relatively simple objects without horrible deformations;
more complex objects would still be more difficult to realign, and hence would
suffer more as rotation proceeds. These notions seem to make some sense to us,
but we have not yet begun to implement them, and will not do so until ongoing
research provides us with more direction.
In closing, it is worth noting that the transformation schemes, unlike other
aspects of the program, are not very efficient from the point of view of the
computer. Although this consideration is irrelevant vis-a-vis the model being
embodied, it is clear that these notions have much room for improvement.

3. CONCLUSIONS

We have described a simulation of human visual mental imagery which is
quite different from those proposed previously (e.g., Baylor, 1971; Farley, 1974;
Moran, 1973). The major distinguishing feature of our program is the inclusion
of a functional, quasi-pictorial surface representation. We motivated this deci-
sion by an appeal to various empirical findings. In addition to this source of
motivation, however, examination of our model suggests some rational reasons
why a surface representation makes sense: First, some forms of deductions or
inferences may be most efficiently performed using a memory of the perception
of the appearance itself, as opposed to a description of the appearance. For
example, it may be easiest to list which states would fall along a line drawn
between Boston and Washington, D.C. by reference to an image of a map of the
U.S. (see Kosslyn & Pomerantz, 1977). We recently discovered that Funt (1976)
has independently used a similar type of spatial representation in a model of
perceptual problem solving. Funt's WHISPER program makes inferences about
the stability of a "block world" scene by inspecting a two-dimensional array,
which corresponds to a diagram. He was able to show that such spatial represen-
tations can greatly reduce the amount of time taken to solve various problems.
Second, a medium for spatial representations allows one to compose numerous
perceptual memories into a common framework. It is not clear how information
in the CAR.IMG and TIRE.IMG files in our model could be integrated (given
differences in distance at time of encoding and so on) without translation into a
common framework. And it is not clear how conversion to a common size
standard, a prerequisite for integration, could occur without a spatial array.
Third, one reason one would have trouble integrating separate encodings of
appearance involves semantic interpretation: the tire must be adjusted to fit into
the wheelwell. How does one know which pairs of R,8 coordinates delineate a
wheelwell? This is especially difficult if information about the wheelwell's ap-
pearance is stored in more than one file (representing multiple looks). Even if
interpretive procedures can be written to accomplish this task by processing deep
representations directly, these procedures surely would be less efficient than
ours, given that definitions of parts will refer to spatial patterns. For example, in
our model it is trivial to determine whether a straight line is present in a surface
representation. If one attempts to discover whether a straight line is delineated
within a list of unordered R,θ pairs without constructing something like our
surface representation, however, one must pass through the entire list numerous
times. No list ordering will solve this problem, as no unidimensional ordering
can capture all of the information about spatial proximities inherent in a two-
dimensional space (let alone a three-dimensional one). The same problems will
exist, of course, if we simply listed Cartesian coordinates. Finally, it makes
sense to compose spatial information in a visual short-term buffer from the point
of view of storage efficiency: Storage of this information in an unreduced form in
long-term memory would take up much more room (cf. Funt, 1976). In our
model, the R,θ pairs underlying an image are stored in relatively small space
compared to the amount required to store an entire matrix, including empty cells.
Thus, a visual short-term store is probably not only efficient for integrating
appearances remembered from separate occasions and for classifying such in-
formation, but generating spatial images in consciousness may be the most
economical sort of representation over the long run, when many memories must
be stored in long-term memory.
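The efficiency contrast above can be illustrated with a toy comparison (ours, in
Python, not the program's code): detecting a horizontal run of dots takes one
pass over a surface array, whereas with an unordered list of R,θ pairs each
point's neighbor must be hunted for across the whole list (the set used below is
itself a concession; without such an index, every membership test is another
pass through the list):

import math

def has_run(array, length):
    # Surface representation: one left-to-right pass per row suffices.
    for row in array:
        run = 0
        for cell in row:
            run = run + 1 if cell else 0
            if run >= length:
                return True
    return False

def has_run_polar(pairs, length):
    # Unordered (r, theta) pairs: search for each right-hand neighbor.
    points = {(round(r * math.cos(t), 6), round(r * math.sin(t), 6))
              for r, t in pairs}
    return any(all((round(x + i, 6), y) in points for i in range(length))
               for x, y in points)

array = [[0, 1, 1, 1], [0, 0, 1, 0]]
print(has_run(array, 3))   # True, found in a single pass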
In a way, our project represents an exercise in the new field of "cognitive
science," a melding of psychology and artificial intelligence, with some inspira-
SIMULATING IMAGERY 293

tion from linguistics (and with a passing interest in neurophysiology). We use


psychological data to motivate a program, and the program forces us, in turn, to
consider phenomena we may not have otherwise pondered. The present theoreti-
cal enterprise has influenced our empirical work in two ways: First, in the course
of trying to motivate the program with data we have been directed to examine a
number of phenomena we may otherwise have overlooked. In addition to the
basic questions already broached, like how various transformations proceed and
what is contained in a skeletal file (a global representation vs. a central part), we
have been led to formulate numerous more particular questions. To mention just
a few, we have been led to ask: (1) whether two or more transformations can
operate simultaneously, or must they operate serially (and, if so, in what order)?;
(2) whether images are "refreshed" by regenerating them from underlying deep
representations, or are they somehow maintained at a surface level?; (3) whether
images and propositional information are retrieved simultaneously when one is
asked simply to retrieve some fact, or are they accessed serially (and, if so, in
what order)? (See Kosslyn, Murphy, Bemesderfer, & Feinstein, in press, for a
description of a project bearing on this last issue.)
Second, in addition to directing our attention to interesting issues, the model
itself suggests a number of predictions; for example: (1) The necessity for adjust-
ing a part's size prior to integrating it into an already-generated skeletal image
forced us to posit that some image information is accessible without generating a
surface image (i.e., information about extent); we did not anticipate this implica-
tion. (2) The R,θ implementation of long-term perceptual memory forced us to
consider the possibility that images could be generated at different angular orien-
tations; there was no reason why the θ values could not be multiplied through
prior to image generation. Although this is true for skeletal images, the model
also implies that elaborating such images should be difficult at nonstandard
orientations because the FIND procedures will encounter problems. Thus, im-
ages of complex objects should be fuzzy if generated at nonstandard orientations,
and should be more difficult to generate than images of simple, less-detailed
things. (3) Our notions about how images could be rotated predict that more
complex images should be more difficult to rotate. Although this prediction is
counter to Cooper's (1975) findings with nonsense shapes, the prediction seems
to follow so inexorably from our current conceptions that we are investigating it
in more detail. Other predictions from the model, in addition to interesting issues
raised, are described in Kosslyn (in press).
In this paper, we have concentrated on the actual job of constructing a pro-
gram; given this orientation, it seems fitting to offer a defense of the value of the
program itself. That is, a sensible critic might well ask why we should want to
construct the actual running program. Presumably, just designing the program
forces one to be explicit, sufficient, and so on. With complex theories, however,
one often cannot keep track of all the details without an actual implementation;
more than once we were surprised by the behavior of our program. For example:
(1) We initially made no provisions for scanning to the inactivated region of the
visual buffer (as in Table 1). Only when we asked the program to find the front
tire in the image illustrated in Fig. 5, and received an error message, were we
reminded of the problems due to overflow. (Some time was actually spent in
trying to find the bug when we first received an error message in this situation!)
(2) The finding that rate of expansion accelerated with increasing size was
unanticipated, but followed from our implementation of the EXPAND transfor-
mation. (3) In an earlier version of the program there was no attempt to simulate
lateral inhibition; expansion did not clarify parts of the image in this model,
however, and thus the model seemed in need of revision. Upon altering the
implementation as described herein (with the two matrices), we were surprised to
discover that resolution was no longer indexed by simple dot density: more dots
were now filled in as the image expanded. Thus, the previous technique of using
decreased dot density as a direct index of increased resolution failed to work; the
program faltered in assessing whether to zoom in or pan out. We solved this
problem by indicating where cells of the activated partition buffer were over-
printed (by using capital letters). Thus, as more dots, including those over-
printed, appear in a given area, resolution tends to decrease (because contours are
obscured). Overprinting is taken to correspond to increased density and bright-
ness, a claim which seems potentially testable.
In closing, then, we argue that the marriage of cognitive psychology and
artificial intelligence can only augur well for both camps. Treating humans as a
"sufficiency proof," as a guide for constructing artificially intelligent machines,
we believe greatly enhances the possibility of success in the AI enterprise.
Similarly, by attempting to build working models of cognitive processes, we
believe one is brought face to face with the real problems with one's notions, and
is forced to deal with issues in a relatively thorough manner. In addition, it may
turn out that a complex set of phenomena like those of the human mind simply
cannot be considered as isolated subprocesses, but must be considered all of a
piece; the simulation medium may turn out to be the only way of accomplishing
this task.

ACKNOWLEDGMENTS
We wish to thank Robert Abelson, Al Collins, John Anderson, and Susan Williams for useful
comments on the manuscript.

REFERENCES

Baylor, G. W. A treatise on the mind's eye. Ph.D. Dissertation, Carnegie-Mellon University, 1971.
Bower, G. H., & Glass, A. L. Structural units and the redintegrative power of picture fragments.
Journal of Experimental Psychology: Human Learning and Memory, 1976, 2, 456-466.
Conrad, C. Cognitive economy in semantic memory. Journal of Experimental Psychology, 1972,
92, 149-154.
Cooper, L. A. Mental rotation of random two-dimensional shapes. Cognitive Psychology, 1975, 7,
20-43.
Cooper, L. A., & Shepard, R. N. Chronometric studies of the rotation of mental images. In W. G.
Chase (Ed.), Visual information processing. New York: Academic Press, 1973.
Farley, A. M. VIPS: A visual imagery and perception system; the result of protocol analysis. Ph.D.
Dissertation, Carnegie-Mellon University, 1974.
Funt, B. V. WHISPER: A computer implementation using analogues in reasoning. Ph.D. Thesis,
University of British Columbia, 1976.
Kosslyn, S. M. Constructing visual images. Ph.D. Dissertation, Stanford University, 1974.
Kosslyn, S. M. Information representation in visual images. Cognitive Psychology, 1975, 7, 341-
370.
Kosslyn, S. M. Can imagery be distinguished from other forms of internal representation? Evidence
from studies of information retrieval time. Memory and Cognition, 1976, 4, 291-297. (a)
Kosslyn, S. M. Visual images preserve metric spatial information. Paper presented at the
Psychonomic Society Meetings, St. Louis, MO, 1976. (b)
Kosslyn, S. M. Imagery and internal representation. In E. Rosch & B. Lloyd (Eds.), Cognition and
categorization. Hillsdale, NJ: Lawrence Erlbaum Associates, 1977, in press.
Kosslyn, S. M., Ball, T. M., & Reiser, B. J. Visual images preserve metric spatial information.
Journal of Experimental Psychology: Human Perception and Performance, in press.
Kosslyn, S. M., Greenbarg, P. E., & Reiser, B. J. Generating visual images. In preparation.
Kosslyn, S. M., & Pomerantz, J. R. Imagery, propositions, and the form of internal representations.
Cognitive Psychology, 1977, 9, 52-76.
Kosslyn, S. M., Murphy, G. L., Bemesderfer, M. E., & Feinstein, K. J. Category and continuum
in mental comparisons. Journal of Experimental Psychology: General, in press.
Moran, T. P. The symbolic imagery hypothesis: A production system model. Ph.D. Dissertation,
Carnegie-Mellon University, 1973.
Newell, A. You can't play 20 questions with nature and win. In W. G. Chase (Ed.), Visual
information processing. New York: Academic Press, 1973.
Pylyshyn, Z. W. What the mind's eye tells the mind's brain: A critique of mental imagery.
Psychological Bulletin, 1973, 80, 1-24.
Pylyshyn, Z. W. The symbolic nature of mental representations. In S. Kaneff & J. F. O'Callaghan
(Eds.), Objectives and methodologies in artificial intelligence. New York: Academic Press, in
press.
Simon, H. A. What is visual imagery? An information processing interpretation. In L. W. Gregg
(Ed.), Cognition in learning and memory. New York: J. Wiley, 1972.
Smith, E. E., Shoben, E. J., & Rips, L. J. Structure and process in semantic memory: A featural
model for semantic decisions. Psychological Review, 1974, 81, 214-241.
Weber, R. J., Kelley, J., & Little, S. Is visual imagery sequencing under verbal control? Journal of
Experimental Psychology, 1972, 96, 354-362.
