Università degli Studi di Roma “La Sapienza”
Dottorato di Ricerca in Ingegneria Informatica
XV Ciclo - 2003
Massimo Romano
Acknowledgements
I wish to acknowledge the people who helped me most during my Ph.D. First of all, my deepest debt of gratitude goes to my advisor, Prof. Fiora Pirri, who constantly inspired and encouraged me. I express my special thanks for her fundamental support, her guidance, and our productive collaboration. It is mainly thanks to her that I could accomplish and finish this work.
I express my gratitude to Prof. Martin D. Levine and Prof. Murray Shanahan, who read a preliminary version of this thesis as external referees and contributed a number of suggestions and very useful comments. I thank them for their attention and their accurate reports.
I thank the members of my Ph.D. Committee who helped me in the accomplishment of this work. A special thanks goes to Prof. Antonio Chella and Prof. Daniele Nardi for their constant support.
I thank my colleagues Marco Pirrone and Alberto Finzi for their collaboration. I thank Massimiliano Cialente, Ivo Mentuccia and Katia Vona for their work with the mobile manipulator Mr. ArmHandOne. Many other people contributed to this work with
interesting discussions, suggestions and support, among them I would like to cite Prof.
Marco Schaerf.
I would like to thank all the people of the Dipartimento di Informatica e Sistemistica of the Università degli Studi di Roma “La Sapienza”, the members of the
Artificial Intelligence Group, and the members of the ALCOR laboratory.
A special thanks goes, at the end of the corridor, to my colleagues of the room
WC-229: Marco Benedetti, Andrea Calı́, Domenico Lembo, Marco Pirrone, Alberto
Finzi, Carlo Marchetti.
Last thanks, but surely the most important, to my love, Barbara. She has believed in me unconditionally and incessantly, showing all her love; thanks to her essential support I have reached this important result.
Infinite gratitude to my father and my mother, Dario and Maria.
Contents
Abstract i
Acknowledgements iii
1 Introduction 1
1.1 Research Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Works 9
2.1 Classification of Object Recognition Systems . . . . . . . . . . . . . . 9
Data Driven and Model Driven . . . . . . . . . . . . . . . . . . . . . . 9
View Centered and Object Centered . . . . . . . . . . . . . . . . . . . 10
2.2 ORS Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Conceptual Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Alignment Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Recognition By Component . . . . . . . . . . . . . . . . . . . . . . . . 13
Aspect Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Hierarchical Aspect Graph . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Preliminaries 17
3.1 Perception in the Situation Calculus . . . . . . . . . . . . . . . . . . . 17
A basic theory of action and perception . . . . . . . . . . . . . . . . . 19
High Level Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Symgeons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Bayes Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Burglary Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Computing Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 An Algebra of Figures 55
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 An algebra of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Axioms for the Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.1 Grouping Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 Figures Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.5 Connection Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5.1 Relation Between Primits . . . . . . . . . . . . . . . . . . . . . 68
6.5.2 Relation Between Boundits . . . . . . . . . . . . . . . . . . . . 70
6.5.3 Relations concerning Faces . . . . . . . . . . . . . . . . . . . . 72
7 A Bayes-aspect Network 77
7.1 SymGeons Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Hypotheses Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.3 From Aspects to SymGeon . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4 Portion of the code for Aspect Recovery . . . . . . . . . . . . . . . . . 81
8 Description 85
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Analyzing the image graph and using descriptions . . . . . . . . . . . 88
8.3 Hints on descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.4 Parsing the reference: the scene graph . . . . . . . . . . . . . . . . . . 94
9 Experimental Results 99
9.1 Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10 Conclusion 111
10.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.3 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Appendix A 123
Appendix B 127
Appendix C 131
List of Figures
1.1 Mr. ArmHandOne . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The reasoning process behind perception . . . . . . . . . . . . . . . . 4
3.1 Superquadrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Ellipsoid, Cuboid and Cylindroid . . . . . . . . . . . . . . . . . . . . . 22
3.3 The Hierarchy of Symgeons according to bending and tapering deformations . . 24
3.4 Bayes Network for the Burglary Example. . . . . . . . . . . . . . . . . 26
4.1 Biederman’s Geons. Image taken from the “Avian Visual Cognition” online book available at http://www.pigeon.psy.tufts.edu/avc/toc.htm . . 30
4.2 Primit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Viewpoints Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 A gaussian probability function with mean 0 and variance 0.5 and 0.8. 39
8.1 The graph of the scene in which a table and a chair have been singled out of
a set of symgeons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 The graph of the scene representing a table . . . . . . . . . . . . . . . . . 91
8.3 The graph of the scene representing a table . . . . . . . . . . . . . . . . . 96
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7 Example faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
10 Example faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
12 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
16 Example of aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
18 Example of aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
List of Tables
7.1 CPT for a node of the Bayes Network for SymGeon Recognition . . . 80
8.1 Relations between symgeons in the scene and their functional denotation. All relations are reflexive. . . . . . 95
Chapter 1
Introduction
In the first class we shall consider the functional use of an object. It is certainly the main source of information that would allow us to recognize it. In fact, whenever an object is a human artifact, it is completely characterized by the purpose for which it was made. It is not difficult to believe that if someone tells us that the observed object is used for drinking, we immediately recognize a cup (or a glass), whether it resembles one or not.
Furthermore, the human brain is able to recognize a glass, within a reasonable error margin, even if it has a meaningless shape or is broken into ten pieces, because its function has been anticipated.
Another important aspect that we consider is the scene context in which the object is immersed. If we are in the Sahara desert and we are thirsty, a jar can be used for drinking too. This means that a jar can fulfill the same function as a glass, because their logical shape is the same.
Finally, we shall consider the visual aspect of an object. Shapes, colors and textures are all sources of information that can be used to recognize an object, but their information content is, in our opinion, less accurate than that provided by the previously considered aspects.
Since it is not entirely clear how these aspects interact with one another, another important question is to understand how they can be combined in order to solve the recognition problem. The human brain certainly uses a complex procedure to accomplish this task, but all the existing Object Recognition Systems (ORS) basically follow a simple bottom-up procedure, where only visual and logical aspects are considered.
The figure on the left shows a typical framework of an ORS. The input of the system is a digital image, while the output is the recognized object and, in some cases, a specification of the object pose. Basically every ORS is characterized by the following steps.
First, it extracts from the input image an appropriate set of features, like edges (brightness discontinuities), corners (edge intersections), and regions (homogeneous image patches). The goal of feature extraction is to select, from a large amount of image data, only the information that is relevant to identify the object. For example, Ullman and Huttenlocher [54] use triples of corners (curvature discontinuities) to recognize polyhedral objects.
Although many techniques exist for feature extraction that are completely characterized and mathematically justified, their classification relies on the peculiar intuition of the individual researcher, which gives an informal account of the human reasoning process involved in this task. However, all of them use a knowledge base
(or objects database) where all the objects to be recognized are represented using
suitable models.
Therefore, the next step of the ORS is to group the extracted features into a suitable collection of data in order to reconstruct the topological information of the observed scene. The obtained collection is used to access the objects database and retrieve a set of candidate objects, i.e. all the objects conforming to the topological information. Such a procedure avoids a linear search through the whole database, because it eliminates a large number of the candidate objects contained in it. Obviously, if we can recover some really distinguishing clues, we may have only one model to test, and the search is not needed.
The final step of the recognition system is to decide which of the candidate objects
we’re looking at. To accomplish this task it must verify each of the candidates in
terms of how well it matches the image data. A score is finally assigned to each
object hypothesis, and the best scoring candidate is chosen as the interpretation of
the object.
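The three steps just described (feature extraction, candidate retrieval, verification with scoring) can be sketched as a toy matching loop. The feature sets, the two-entry database, and the 0.5 threshold below are invented purely for illustration and are not drawn from any particular ORS:

```python
def extract_features(image):
    """Toy feature extractor: collect the grid positions of edge marks."""
    return {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}

def match_score(model_features, image_features):
    """Verification score: fraction of model features supported by the image."""
    return len(model_features & image_features) / len(model_features)

def recognize(image, object_db, threshold=0.5):
    features = extract_features(image)
    # Score every candidate model against the image data and keep the best.
    scored = [(match_score(m, features), name) for name, m in object_db.items()]
    best_score, best_name = max(scored)
    return best_name if best_score >= threshold else None

# Toy database: each "model" is just a set of expected feature positions.
db = {"L-shape": {(0, 0), (1, 0), (2, 0), (2, 1)},
      "bar":     {(0, 0), (0, 1), (0, 2)}}
img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 0]]
print(recognize(img, db))  # -> L-shape
```

A real system would of course index the database by feature configuration rather than score every model, exactly to avoid the linear search discussed above.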
The main component of an ORS is the objects database, where a model for each object to be recognized is represented. A model is a symbolic description of an object which is general and concise, and the basic goal of an ORS is to transform the unstructured set of sensor data into a symbolic description which is consistent with the chosen representation and which supports an efficient model matching.
A Cognitive Vision System (CVS) is a system that uses visual information to build
a symbolic description of the domain, based on a logical representation approach. It
is basically the result of a combination of techniques from symbolic and subsymbolic
AI with computer vision techniques. Therefore the area of Cognitive Vision includes
many of the major issues in AI such as knowledge representation, reasoning, learning,
and so on.
In this thesis we present a new approach to support knowledge representation and reasoning in cognitive vision; in particular, our essential purpose can be summarized in the following two points:
i. the introduction of a specific vocabulary and language to talk about objects and
scene;
In this thesis we present the basic ideas governing a high level recognition system,
which has been tested on Mr. ArmHandOne, a tiny mobile manipulator endowed with
3. Recognition level: at this level recognition is concerned only with primitive com-
ponents of the image. Each shape in the image is classified as an aspect/view of
a symgeon, with a certain probability, depending on the traits of the shape itself.
The classification granting the existence of this or that symgeon is achieved by
a special Bayes net integrating hierarchical aspect graphs (see [27]), and the two
graphs are fused into an Aspect-Bayes net, that is, a Bayes-net in which causal
relations between nodes are enriched with compositional relations concerning
aspects.
4. Syntactic analyzer level: this level is concerned with the construction of a labeled
graph of the image. The labeled graph is defined in FOL. The syntactic image
analysis delivers a set of connected segments that we call boundits. This set
forms a graph that we call the syntactic graph.
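The construction of the syntactic graph can be illustrated with a toy sketch. The segment names, coordinates, and the shared-endpoint connectivity test below are our illustrative inventions, not the thesis's actual FOL definition:

```python
def build_syntactic_graph(boundits):
    """boundits: dict mapping a name to a segment (endpoint_a, endpoint_b).
    Two boundits are connected when they share an endpoint."""
    graph = {name: set() for name in boundits}
    for a, (p1, p2) in boundits.items():
        for b, (q1, q2) in boundits.items():
            if a != b and {p1, p2} & {q1, q2}:  # shared endpoint => edge
                graph[a].add(b)
    return graph

# Three toy segments: s1 and s2 meet at (1, 0); s3 is isolated.
segments = {"s1": ((0, 0), (1, 0)),
            "s2": ((1, 0), (1, 1)),
            "s3": ((5, 5), (6, 6))}
g = build_syntactic_graph(segments)
print(g["s1"])  # -> {'s2'}
```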
In summary, our approach is based on the following idea. It is not possible to separate the recognition process from knowledge of a specific environment. Each component of an environment is either known or unknown. An element is known if a symbolic representation for it is given, together with information about where it could be, what it is for, and where/how it will be after a specific action has taken place. An object is unknown if there might be a description available for it, but its presence in the environment is not acknowledged a priori. Clearly, for everything that is known it is important to track its movement caused by the execution of actions, and this can be achieved with enough accuracy (given the possibility of failures and noise in the sensors). For the unknown elements a crucial role is played by the following factors:
2. The ability to relate the unknown object to a more general category. This ability is essentially a classification of the unknown object requiring several considerations: what its components are, the way they are connected together, the object location (on the floor, on a table, etc.), what it is for, and so on.
1 Anchoring is the process of creating and maintaining the correspondence between symbols and
to the Art Gallery problem of computational geometry [68]. However, we can cope with this complexity if, instead of representing the multiple views of a complex object, we focus only on the views of simpler geometric objects like parametric geons. [27], in fact, uses the aspect graph exclusively to represent simple geometric primitives, because such primitives have straightforward representative views. Dickinson and Metaxas have further introduced an evolution of the aspect graph, the hierarchical aspect graph (HAG), in which both the features, at different levels of complexity, and the compositional relationships between them can be represented.
Following the above line of research on visual perception, we develop our formalization on the recent approach of [73], which introduces symgeons, an extension of the above-mentioned parametric geons. The advantage of symgeons over parametric geons is that, by dropping the symmetry property, which we do not need to preserve in our approach, they can be used as coarse descriptions also of asymmetrical objects (e.g. the well-known sneaker example that was Ullman's point against recognition by components [100]). On the other hand, symgeons have several views, and views are composed of faces which, in turn, are composed of simpler elements serving as coarse descriptions of the primitive geometric signs depicted in the image. This earlier compositional
construction is obtained using an Aspect-Bayes network, which plays a role similar
to the aspect graph, but here causal relationships are enriched with compositional
relationships.
As a matter of fact, we are trying to define a perceptual architecture that fully
exploits compositionality: from the analysis of the elementary signs in the image to
the analysis and description of an object structure, and finally to the interpretation
of a scene. To achieve this we construct a reasoning process that draws suitable hy-
potheses about each object occurring in the observed scene. Hypotheses are evaluated
according to their probability and the most probable one is added to the knowledge
base for interpreting the scene.
The work introduced here represents part of our effort to provide a control sys-
tem employed in the ASI (Italian Space Agency) PEGASO (PErcept Golog for Au-
tonomous agents in Space Station Operation) [37] and Marviss (Anthropomorphic
Hand and Virtual Reality for Robotic System on the International Space Station)
[47] projects.
Related Works
not specify an exact geometry. Furthermore, these techniques often assume that the
bounding contour of a region belongs to the object which is indeed problematic when
the object is occluded. Finally such techniques often require a manual segmentation
of an object into its meaningful parts.
The Model Driven approach to shape recovery uses models, that capture the exact
geometry of the object. The recognition algorithm acts according to the following
sequence.
First, simple 2D features, like corners or changes in curvature, extracted in the image are paired with similar features of a 3D model to obtain a set of possible correspondences. Then the system uses solid transformations, like rotation and translation, to verify the given hypothesis, bringing the model features into alignment with their corresponding image features. Finally, because the correspondence is weak, those features belonging to the chosen model must be compared to other image features. If there is enough agreement among the features, the object is recognized.
If the number of object models is
small and the exact object geometry is known, this approach is highly efficient be-
cause it requires the extraction of simple recoverable image features. Moreover it is
substantially insensitive to occlusion. However, for large databases, the complexity
makes the model search intractable. Besides, the approach is very sensitive to minor
changes in the shapes of the objects because it is based on the verification of local
features. For example, if the curvature of an object part changes, a new object model
must be added.
Therefore objects can be more easily recognized, since the same techniques used for
2D recognition can be used for 3D recognition without changes.
However this approach has two major limitations. Firstly, the definition of multiple views is not easy, and the user has to consider a priori all possible viewpoints in order to guarantee the completeness of the recognition process. Secondly, it requires storing many different views of the same object.
Neural Network
Probably the most obvious way to recognize the shape of an object is to collect a sufficiently ample set of object images (patterns) taken from different views. The recognition problem can then be reformulated as the problem of comparing this set of images with the image data. Many ORS have been proposed on the basis of this paradigm, where the matching problem is solved using some kind of neural network [69].
Basically, we can define a neural network as a network-like model of computation, designed to simulate the behaviour of biological neural networks (e.g. the human brain), whose basic computational unit is some kind of formal neuron (see figure on the right). A neuron receives input signals x1, ..., xn from either other neurons or the outside environment. It computes its output signal y by adding together the xj's, weighted by some internal weights wj, and applying some transfer function σ to the result:
y = σ( Σ_{j=1}^{n} w_j x_j )
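A minimal executable rendering of this formal neuron may help. The logistic sigmoid used as the transfer function σ is one common choice, assumed here purely for illustration:

```python
import math

def neuron(xs, ws, transfer=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Formal neuron: weighted sum of the inputs xs with weights ws,
    passed through a transfer function (a logistic sigmoid by default)."""
    return transfer(sum(w * x for w, x in zip(ws, xs)))

# With a zero net input the sigmoid yields exactly 0.5.
print(neuron([1.0, -1.0], [0.5, 0.5]))  # -> 0.5
```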
Conceptual Space
Conceptual spaces, introduced by Gärdenfors (see e.g. [42]), are metric spaces in
which entities are characterized by a number of quality dimensions which represent
some kind of qualities of the environment. Examples of such dimensions are color, pitch, volume, spatial coordinates, and so on. Some dimensions are closely related to the input data of the system; others may be characterized in more abstract terms.
A generic point in a CS is called a knoxel (the term is derived by analogy from
pixel). Knoxels are obtained from measurements of the external world performed on
the image acquired by the camera and processed applying low-level vision algorithms.
An important aspect of this theory is the possibility to define a metric function.
Following Gärdenfors, the distance between two knoxels calculated according to such
a metric function corresponds to a measure of the similarity between the entities
represented by the knoxels themselves.
In the domain of computer vision CS are used to model objects as follows: an
object is characterized by a set of attributes or qualities {q1 , q2 , ..., qn }. Each quality
qi takes values in a domain Qi . For example, the quality of volume could take values
in the domain of positive real numbers. Objects are identified with points in the CS
C = Q1 × Q2 × ... × Qn , and concepts are regions.
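The metric structure of a conceptual space can be sketched as follows; the quality dimensions, their values, and the equal weighting are invented for illustration:

```python
import math

def knoxel_distance(k1, k2, weights=None):
    """Weighted Euclidean distance between two knoxels, i.e. two points
    of the conceptual space C = Q1 x ... x Qn. Equal weights by default."""
    weights = weights or [1.0] * len(k1)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, k1, k2)))

# Toy knoxels over hypothetical dimensions (hue, volume, height): a smaller
# distance is read as a greater similarity between the represented entities.
cup    = (0.2, 0.3, 0.1)
mug    = (0.2, 0.4, 0.1)
bottle = (0.6, 0.5, 0.9)
print(knoxel_distance(cup, mug) < knoxel_distance(cup, bottle))  # -> True
```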
An interesting application of the notion of CS is presented in [15, 17]. Here a
framework for high-level representations in computer vision architectures is described.
Alignment Method
Generally speaking, ORS based on the alignment methodology perform the follow-
ing steps: given an acquired image representing an object view and starting from
a set of object models, the ORS performs a search in the space of solid geometric
transformations, trying to match the description of the image with a model in the
database.
More formally, in the alignment method we assume that a series of rigid transformations (e.g. translations, scale operations, rotations, stretching) could be applied either to the input image or to the model. In this way, the recognition problem can be seen as a search problem. In fact, the ORS has to find a particular model Mi and a combination of transformations Tij maximizing an overlap measure F between the model and the image V:
max_{i,j} F(V, Tij(Mi))
Ullman and Huttenlocher introduced this technique in [53]. Their approach is able to recognize polyhedral objects by identifying 2D point discontinuities in the image (where the object contour changes significantly) and matching these points with the vertices of a 3D model (see also [54]).
In particular, their system computes the transformation between each triplet of
points from the image, and each triplet of points from the target model. According
to such a transformation, all the other points from the target set are transformed. If
they match, the transformation receives a mark, and if the number of marks is above
a chosen threshold, the transformation is assumed to be the matching transformation
between the query and the target.
Such a search takes O(m^4 n^3 log n) time in the worst case (where m and n are the number of points in the model and image, respectively), and shortcuts have brought that down to O(mn^3). But the complexity of the search also depends on the number of models in the computer vision system's library.
Variations of these methods also work for geometric features other than points, such as segments, or points with normal vectors [2], and for transformations other than affine ones.
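The hypothesize-and-vote search described above can be sketched in miniature. For brevity the sketch is restricted to pure 2D translations (the full method searches rigid or affine transformations, and hypothesizes from triplets rather than single points); the point sets are invented toy data:

```python
def best_translation(model, image):
    """Hypothesize a translation from each model/image point pair, then
    verify it by counting how many translated model points hit the image."""
    best = (0, None)
    for m in model:
        for p in image:
            t = (p[0] - m[0], p[1] - m[1])         # hypothesized transform
            moved = {(x + t[0], y + t[1]) for x, y in model}
            votes = len(moved & image)             # verification step
            best = max(best, (votes, t))
    return best

model = {(0, 0), (1, 0), (0, 1)}
image = {(5, 5), (6, 5), (5, 6), (9, 9)}
votes, t = best_translation(model, image)
print(votes, t)  # -> 3 (5, 5)
```

The worst-case cost here is already quadratic in the point counts times the verification cost, which mirrors why the full affine search needs the shortcuts mentioned above.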
Recognition By Component
Recognition By Component (RBC) [10, 12, 65, 34] is an emerging approach to object
recognition and modelling in which the shape of an object is described in terms of
relatively few generic components, joined by spatial relationships.
An essential property of this technique is the universality of the primitives, i.e. the ability to describe any object in the domain by a combination of one or more primitive components. Therefore the standardization of the primitives (components and their relationships) is crucial.
In the literature we can find several kinds of volumetric primitives. Ferrie and Levine [34] use ellipsoids and cylindroids as basic object components, obtaining an extremely approximate representation.
Shapiro in [95] uses sticks, blobs and plates as primitives, in order to represent long and thin object parts, flat parts and volumetric parts respectively.
David Marr initially proposed to use cylinders as unique valid primitives [64] but,
in this case, a meaningful representation of an object requires a great number of
cylinders. Consequently the amount of information contained in the representation
is comparable to that contained in the original image, therefore the representation is
not concise.
Later, Marr and Nishihara [65] proposed generalized cylinders as primitives. These geometric shapes are realized by sliding a 2D figure along an axis of symmetry. The representation obtained using this set of primitives is more expressive and versatile, and moreover such primitives are as easy to extract as simple cylinders.
In the 1980’s, from the human vision community, Biederman in [10] proposed 36
qualitatively different volumetric primitives named geons (short for geometric icons).
They were described in terms of four qualitative attributes of generalized cylinders:
Symmetry, Size, Edge and Axis. Psychological experiments [11, 12] have provided support for the descriptive power of geon-based descriptions. Even though this model has been used by several researchers to describe 3D objects, most of the work on this subject has focused on the recovery of geon models from complete line drawings depicting perfect geon-like objects.
As an alternative to this idea, other authors focused their research on a representation using purely quantitative primitives. Pentland, for example, introduced in [75] an interesting parametric family of closed surfaces named superquadrics. A crucial characteristic of these structures is their ability to model a large number of objects, because the shape of a superquadric is based on parameter values (ε1, ε2) that can be suitably varied. On the other hand, this property makes the recognition problem more difficult.
Terzopoulos et al. [98] use generalized splines to create an active model of the specific object, which is suitably deformed to adapt to the object image. Such models react to external forces produced according to the image data. Following this idea, Metaxas and Terzopoulos [97] developed a deformable model based on superquadrics instead of generalized splines.
Kenong Wu and Martin Levine [104] combined the qualitative and quantitative approaches by introducing a new set of geometric primitives named parametric geons. Such a set consists of seven volumetric primitives derived from superquadrics by specifying the parameters ε1 and ε2 and applying tapering and bending deformations. The first three basic primitives are the ellipsoid, cylindroid and cuboid, respectively defined for (ε1 = 1, ε2 = 1), (ε1 = 0.1, ε2 = 1) and (ε1 = 0.1, ε2 = 0.1). The other four primitives, obtained by applying tapering and bending operations to the basic ones, are the tapered cylinder, tapered cuboid, curved cylinder and curved cuboid.
More information about superquadrics and parametric geons is given in Chapter 3.
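The role of the exponents ε1 and ε2 can be made concrete with the standard superquadric inside-outside function; the unit axis lengths a1 = a2 = a3 = 1 below are an assumption made for illustration:

```python
def superquadric_F(x, y, z, a1=1.0, a2=1.0, a3=1.0, eps1=1.0, eps2=1.0):
    """Superquadric inside-outside function with shape exponents
    (eps1, eps2): F < 1 inside the surface, F = 1 on it, F > 1 outside.
    eps1 = eps2 = 1 gives an ellipsoid; values near 0.1 approximate a
    cuboid, matching the parametric-geon settings described above."""
    xy = (abs(x / a1) ** (2 / eps2) + abs(y / a2) ** (2 / eps2)) ** (eps2 / eps1)
    return xy + abs(z / a3) ** (2 / eps1)

# For the ellipsoid case (eps1 = eps2 = 1) F reduces to x^2 + y^2 + z^2.
print(superquadric_F(1.0, 0.0, 0.0))       # -> 1.0 (on the unit sphere)
print(superquadric_F(0.5, 0.0, 0.0) < 1)   # -> True (inside)
```

Tapering and bending, which yield the remaining four parametric geons, are separate deformations applied on top of this base surface.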
Aspect Graph
The concept of Aspect Graph was initially described by Koenderink and van Doorn
[56] in 1979. They observed that for most views, a small change in the vantage point
of an object produces a small change in the shape of the projection of that object,
while for some views a large change is produced.
Starting from these considerations, they proposed a classification technique to
reduce the dimension of the image set based on a partitioning of the view space into
regions each one composed of qualitatively similar views.
In their approach, each object in the knowledge base is represented using a graph (named the aspect graph), where each node corresponds to a region of qualitatively similar views, while each edge represents a transition between adjacent regions (a visual event).
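As a toy illustration (the node names are our invention, not taken from Koenderink and van Doorn), an aspect graph can be stored as a plain adjacency structure:

```python
# Nodes are classes of qualitatively similar views of a cube-like object,
# labelled by how many faces are visible; edges are the visual events
# that move the viewpoint between adjacent regions of the view space.
aspect_graph = {
    "one-face":    {"two-faces"},
    "two-faces":   {"one-face", "three-faces"},
    "three-faces": {"two-faces"},
}

def visual_event_possible(g, a, b):
    """True if a single visual event can turn aspect a into aspect b."""
    return b in g.get(a, set())

print(visual_event_possible(aspect_graph, "one-face", "two-faces"))    # -> True
print(visual_event_possible(aspect_graph, "one-face", "three-faces"))  # -> False
```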
The success of such a representation depends on the ability to find the minimal set of significant views for each object, in order to reduce the dimension of the associated graph. For this reason, since these concepts were introduced, much effort
has been expended in analytically deriving the exact Aspect Graph [31, 82, 57], which
yields a complete description of the object’s view-sphere.
Analytical methods, however, must be limited to simple models, since they may be expensive if the object contains many features. For example, Plantinga and Dyer in [81] describe an algorithm to localize the most significant views for polyhedral objects, while Gigus and Malik in [44] solve the same problem for non-convex polyhedral objects, using an algorithm whose time complexity is Θ(n^6 log n) in the number of vertices.
In general, it is possible to demonstrate that the problem of finding the optimal set of views for a polyhedral object is NP-hard, by reducing it to the Art Gallery problem of computational geometry [68]. However, the exact aspect graph is generally too detailed to be useful in model matching with real data. Koenderink and van Doorn, in fact, recognized that many nodes in the aspect graph will probably correspond to unstable views, where an infinitesimal (i.e. suitably small for real applications) camera movement will change the topological properties of the view.
An interesting evolution of the Aspect Graph was introduced by Dickinson and Metaxas in [28, 27, 29].
1 The Art Gallery problem consists in determining the minimal number of observation points
needed to observe the whole interior of a polygon.
Chapter 3
Preliminaries
This chapter introduces some general preliminary notions that will be used in the
rest of the thesis. It describes the Situation Calculus, its language, its formalism,
and the way it has been extended in [77] to handle sensing and perception. Then a
deeper characterization is given of the SymGeons, our object recognition primitives,
which derive from the notion of parametric Geons. Finally, some fundamental notions
on Bayes networks are introduced. We have, in fact, used Bayes networks to extend
the concept of aspect graph, in order to handle hypothesis formation in SymGeon
recognition.
observable Fluents. By observable we mean the sensory system can observe the fact
or property or object the fluent is designating. For example On(block5, block3, s) is a
relation between block5 and block3 that holds in s and can be observed, as far as it
holds, by the sensory system. Related to the observable Fluents are the perceptibles,
that is, terms that denote what is perceived about facts, properties or objects. As
was done in [76], we consider only one sense, namely vision, identifying the
perceptibles with the data of the image. In fact, to account both for these data and
for the geometric description of objects, we shall assume (as in [76]) that the object
domain includes the reals R, relying on the standard interpretation of the
reals and their operations (addition, multiplication, etc.) and relations (<, ≤, etc.)
(see [86]). As for the relationships among elements of the image, and likewise among
elements of the scene, we shall discuss this topic in Chapter 6.
We do not, however, compare our work with spatial reasoning (see [1, 6, 5, 21, 22])
in this thesis, since those approaches are genuinely qualitative and, in fact, deal with
2D images and not with recognition by components, as we do in scaling up from 2D to
3D. In any case we refer to the vast literature on this topic, to which we shall come
back later (see e.g. [50, 7, 102]).
The alphabet of L_sitcalc is extended with function symbols of sort images (the
perceptibles) and of sort action (the sensing actions), the special relational fluents
Percept and Occluded, and the special functional fluent Scene. Relational fluents, like
Align, Mistaken, SePercept, Connected and so forth, are introduced by definitions. The
advantage of introducing concepts by definitions is that one can rely on the composi-
tional laws of logic to derive new properties on the basis of simpler ones.
As fluents account for physical events that are affected by control actions, per-
ceptibles account for the mental events – the agent's inner denotation of the physical
events – that are affected both by sensing and control actions. Formally, a percepti-
ble is a term of a general sort perceptible, which should be the result of the union of
many sorts accounting for the different data that sensing can interpret. We treat
only perceptibles of sort images; nevertheless we shall specifically refer to it as the
sort perceptible. A perceptible differs from a fluent also because it lacks the argument
of sort situation: isF : (action ∪ object)^n → images.
A perceptible is taken as an argument by the fluent Percept, which traces the sensory
experience of the agent, and is of the form:
The fluent symbol Scene(s) is of the form Scene : situation → images, and
it is used for the description of the perceptibles in the scene. So, for example,
isCube(block5) ⊂ Scene(s) is true in the situation s if there is a region in the
scene matching the set of points denoted by the perceptible isCube(block5), and this
set, after suitable transformations, can be aligned with the model of the cube.
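Following Reiter [86], the basic theory of actions referred to next is the union of the components described below:

```latex
\mathcal{D} \;=\; \Sigma \,\cup\, \mathcal{D}_{ss} \,\cup\, \mathcal{D}_{ap} \,\cup\, \mathcal{D}_{una} \,\cup\, \mathcal{D}_{S_0}
```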
Here DS0 is the initial database: the set of formulae that mention only the situation
term S0 and no other term of sort situation. Dap is the set of action precondition
axioms, one for each term of sort action; Dss is the set of successor state axioms,
one for each fluent; Duna is the set of unique name axioms for actions; and Σ is the
set of domain independent axioms specifying the properties of the domain of sort
situation. We refer again to [60, 86, 78] for details on each element in the above set.
A basic theory of action and perception is defined analogously: the set Dss in-
cludes, together with the successor state axioms for each fluent in the language, a
successor state axiom for the fluents Percept and Occluded and for the functional
fluent Scene, for each perceptible in the language. Dap includes action precondi-
tion axioms for sensing actions, and Duna comprises, together with the unique name
axioms for actions, unique name axioms for perceptibles. To these sets we shall add a
set Dda, the axioms describing the representation of objects in the scene, and a set
Dgd, which describes the class of primitive objects (from the Primits to the Symgeons,
through a whole hierarchy of primitives, see Chapter 6).
In accordance with the problem solving approach to perception [89, 90], causal
laws for perception have been introduced in [76]. Here, in fact, perception is inter-
preted as a mental event subject to causal laws, and therefore based on prior perceptions,
implying a perception-perception chain of causality. This is quite important to ac-
count for the correlations and causal laws among Percepts: perceived orientation is
the suitable context for perceived shape, perceived adjacency explains the perception
of the composition of an object, and so forth [90].
Following the same ideas as [84, 85], causal laws for perception can be formulated
as negative and positive contexts.

In the first item the sensing action sense(isCube(x), 1) is context free, in the sec-
ond item the action drop(x) is context sensitive, and in the third the sensing action
sense(isBlackboard(x), 1) is context sensitive.
In the last item, the context plays the role of deciding whether or not to accept
the answer from the vision system. When the vision system returns 1 to the
query sense(isBlackboard(x), 1), the context decides whether the perception should
hold or not.
Finally, in our approach, we distinguish between modeling sensing actions’ direct
effects (i.e. direct perception), and modeling the process of sensing data interpretation
and assimilation (selective perception, and meaningful perception).
Direct Perception. Direct perception is the process of sensing tout court. In fact,
it is formalized in the Situation Calculus simply by suitably adding successor state
axioms for each perceptible. As noted above, according to the problem solving approach
to perception [90] we define causal laws for perception and, following the same ideas
as [84, 85], we formulate them in terms of successor state axioms:

Here p is a perceptible, e.g. isF(~y), and pr is the outcome, pr ∈ {0, 1}. Analogously,
we introduce successor state axioms for the fluents Occluded and Scene, for each
perceptible in the language. Finally, Dap includes action precondition axioms for
sensing actions. For all perceptibles p:
For example:
The hypothesis is built using the sensing history σ(H), that is, the history of all the
sensing actions that lead to the derivation of a given Percept(p, pr, σ) or a given
sentence mentioning the fluent Percept.
In fact, by the consistency theorem of the Situation Calculus (see [78]), a basic
theory of actions is consistent provided the initial database DS0 is consistent; there-
fore we want to add a hypothesis to the initial database. Since we want to gather
the hypothesis from the perceptual process, we proceed as follows. Suppose D |= ϕ(σ),
where ϕ(σ) is a sentence mentioning Percept, and D ⊭ ψ(σ), with ψ(σ) a sentence in
which Percept(isF(~x), 1, σ) is transformed into F(~x, σ) and Percept(isF(~x), 0, σ) is
transformed into ¬F(~x, σ) (for details on the transformation see [76]); then we want
to add a hypothesis h such that D ∪ {h} |= ψ(σ). We construct such a hypothesis by
regressing ϕ(σ), thereby extracting the sensing history. Then we transform the sensing
history into observable fluents and use them to build a sentence uniform in S0. This
explanation-based (or abductive) process is called meaningful perception, because it
allows us to use the perception process to infer real properties of the world. In fact,
human beings do this all the time: if one sees that a door is open, one usually uses
this sensing to infer that the door is indeed open; obviously the outcome of sensing
is not taken as universal truth, for if the door were a glass door and one ended up
bumping into it, one would want to give up the hypothesis.
Therefore we consider meaningful perception as a non-monotonic inference process.
Of course meaningful perception is not always possible. In [76] Pirri and Finzi have
shown that meaningful perception is always possible when there is no misalignment
between perception and real world information. To account for the misalignment
between the inner state of the agent, which is transformed by perception (sensing
actions and the fluent Percept), and the real world, which is transformed by control
actions, Pirri and Finzi introduced for each perceptible isF and property F the
following definition¹:
Mistaken(isF(~x), s) ≡def
(¬F(~x, s) ∧ Percept(isF(~x), 1, s)) ∨ (F(~x, s) ∧ Percept(isF(~x), 0, s)).   (3.4)
A misalignment admits two interpretations: (a) sensing is correct, and the misalign-
ment between perception and observable fluents is not a mistake; (b) the agent has a
wrong perception of the real world. A brave approach would always consider sensing
correct, admitting the outcome of sensing actions to change the truth value of ob-
servable fluents (interpretation a). A cautious one would consider the truth value of
the observable fluents a measure of the correctness of perception (interpretation b).
The definition of meaningful perception given in [76] is based on this second approach.
1 Observe that we assume that in the language, for each observable fluent F, there is a
perceptible isF.
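Read operationally, definition (3.4) is just a disagreement test between the observable fluent and the Percept outcome; a minimal sketch (the function name is hypothetical, not part of the formalization):

```python
def mistaken(f_holds: bool, percept_outcome: int) -> bool:
    """Definition (3.4): the perceptible isF is Mistaken in s exactly when
    the observable fluent F and the Percept outcome disagree."""
    return (not f_holds and percept_outcome == 1) or \
           (f_holds and percept_outcome == 0)

# The four possible alignments between world and perception:
print(mistaken(True, 1))   # False: agreement
print(mistaken(False, 0))  # False: agreement
print(mistaken(True, 0))   # True: disagreement
print(mistaken(False, 1))  # True: disagreement
```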
3.2 Symgeons
The concept of symgeon, introduced in [73], belongs to a family of related concepts
that have been used in visual recognition to describe the shape of an object in terms of
relatively few generic components joined by spatial relationships. Symgeons (which
we can consider a simple generalization of the parametric geons introduced by Kenong
Wu and Martin Levine [104, 103]) have their origins in qualitative geons [10] and
in the computer graphics concept of superquadrics [3]. Biederman's original geons
are 36 volumetric component shapes described in terms of the following qualitative
attributes of generalized cylinders: symmetry, size, edge, and axis; each of these
properties can be suitably varied, thereby determining a unique geon.
Superquadrics were first introduced in computer graphics by Barr in his seminal
paper [3]: they are a family of shapes, depicted in Figure 3.1, including the superhyper-
boloid of one sheet, the superhyperboloid of two sheets, the superellipsoid and the
supertoroid. In the computer vision literature, it is common to refer to superellipsoids
by the more generic term superquadrics.
The parameters a1 , a2 and a3 are scaling factors along the three coordinate axes.
The first three basic primitives are ellipsoid, cylindroid and cuboid (see Figure 3.2),
respectively defined for:

• ε1 = 1, ε2 = 1:  (x/a1)^2 + (y/a2)^2 + (z/a3)^2 = 1  (ellipsoid)

• ε1 = 0.1, ε2 = 1:  ((x/a1)^2 + (y/a2)^2)^10 + (z/a3)^20 = 1  (cylindroid)

• ε1 = 0.1, ε2 = 0.1:  (x/a1)^20 + (y/a2)^20 + (z/a3)^20 = 1  (cuboid)
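As a sketch of how these special cases arise, the general superellipsoid inside-outside function (following Barr [3] and Wu and Levine [103]) can be evaluated directly; the function name and the unit scales in the example are illustrative assumptions:

```python
def superellipsoid_f(x, y, z, a1, a2, a3, e1, e2):
    """Superellipsoid inside-outside function:
    F = ((x/a1)^(2/e2) + (y/a2)^(2/e2))^(e2/e1) + (z/a3)^(2/e1);
    F < 1 inside, F = 1 on the surface, F > 1 outside."""
    w = abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)
    return w ** (e2 / e1) + abs(z / a3) ** (2.0 / e1)

# Unit-scale primitives: ellipsoid (e1 = e2 = 1), cylindroid (e1 = 0.1,
# e2 = 1) and cuboid (e1 = e2 = 0.1); the axis point (1, 0, 0) lies on
# all three surfaces.
for e1, e2 in [(1.0, 1.0), (0.1, 1.0), (0.1, 0.1)]:
    print(round(superellipsoid_f(1.0, 0.0, 0.0, 1, 1, 1, e1, e2), 6))  # 1.0
```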
The other four primitives, obtained by applying tapering and bending operations to
the basic ones, are tapered cylinder, tapered cuboid, curved cylinder and curved cuboid.
A tapering deformation is performed along the z axis with a linear rate with
respect to z; a point (x, y, z) of a primitive is transformed into a point (X, Y, Z) as
follows:

X = (Kx/a3 · z + 1) x
Y = (Ky/a3 · z + 1) y   (3.6)
Z = z

Here Kx and Ky are the tapering parameters along the x and y axes. To avoid invalid
tapering operations, as defined in [103], the constraints 0 ≤ Kx ≤ 1 and 0 ≤ Ky ≤ 1
are imposed.
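Equation 3.6 can be sketched as a point transform (the helper name is hypothetical; the equation itself is from the text):

```python
def taper(x, y, z, a3, kx, ky):
    """Linear tapering along z (equation 3.6); valid for 0 <= kx, ky <= 1."""
    return ((kx / a3) * z + 1.0) * x, ((ky / a3) * z + 1.0) * y, z

# At z = 0 the cross-section is unchanged; at z = a3 it is scaled by 1 + k.
print(taper(1.0, 2.0, 0.0, a3=1.0, kx=0.5, ky=0.5))  # (1.0, 2.0, 0.0)
print(taper(1.0, 2.0, 1.0, a3=1.0, kx=0.5, ky=0.5))  # (1.5, 3.0, 1.0)
```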
The bending operation adopted by Wu and Levine is easily defined by considering a
section of the primitive along the xz plane. It is depicted in the figure on the left,
taken from [103, 104], where the dark area delimits the original primitive, while the
thick line represents the curved primitive. The transformation is applied to the
primitive along the z axis in the positive x direction; O is the center of the bending
curvature and θ is the bending angle. Given a point (x, y, z) belonging to the original
primitive, the operation transforms it into the point (X, Y, Z) according to the
following equations:

X = κ^-1 − cos θ (κ^-1 − x)
Y = y   (3.7)
Z = (κ^-1 − x) sin θ
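Equation 3.7 admits a similar sketch; here the bending angle θ is passed explicitly, whereas in [103, 104] it is determined by the point's position along z (an assumption of this sketch, as is the function name):

```python
import math

def bend(x, y, kappa, theta):
    """Bending (equation 3.7): rotates a point of the xz-section about the
    centre of curvature O, at distance 1/kappa from the axis, through the
    bending angle theta."""
    r = 1.0 / kappa - x            # distance from the bending centre
    return 1.0 / kappa - math.cos(theta) * r, y, r * math.sin(theta)

# With theta = 0 no bending is applied and the point is unchanged.
X, Y, Z = bend(0.2, 0.5, kappa=0.5, theta=0.0)
print(round(X, 6), Y, round(Z, 6))  # 0.2 0.5 0.0
```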
Figure 3.3: The Hierarchy of Symgeons according to bending and tapering deforma-
tions
Petel and Holt [73] introduced a new set of primitives called SymGeons (Sym-
metrical Geometric Icons), extending the concept of the parametric geons of Wu and
Levine by considering the possibility of applying the tapering and bending transfor-
mations at the same time. In this way they eliminated the intrinsic symmetry of the
parametric geons, allowing one to model a larger number of asymmetrical objects.
Symgeons are depicted in Figure 3.3.

In the rest of the thesis we use the term G(a, ε, K, κ) to denote a generic symgeon,
where a = (a1, a2, a3) is the scale vector, ε = (ε1, ε2) is the squareness vector,
K = (Kx, Ky) is the tapering vector and κ is the bending parameter. To refer to the
coordinates of a symgeon in the scene we shall use a term pos, so the position of a
specific symgeon G(a, ε, K, κ), with γ = ⟨a, ε, K, κ⟩, will be denoted by g(pos, γ). We
shall also use the term NG(a, ε, K, κ) to denote the normal to the symgeon G. For
more details about these definitions we refer the reader to Wu and Levine [104]. A
classification of symgeons is given in Table 3.1 (SymGeon Classification).
ii. an oriented edge connects a pair of nodes (X, Y) if node X directly affects node Y;
Burglary Example
Bayes networks are modeling tools which can be applied to a variety of domains. In
this section we present an example taken from [74].
Suppose you are living in a house in California, and you are concerned
about burglaries. Therefore there is an alarm in the house which might
go off if a burglary occurs. There are also two neighbors who work at
home: Dr. Watson and Mrs. Gibbons. If the alarm goes off they might
phone you at your office. On the other hand, you know that your alarm also
sometimes goes off if there is an earthquake or tremor, but that earthquakes
and tremors will likely be reported on the radio.
3. G: True if Mrs. Gibbons phones you to tell you that an alarm has rung.
4. W: True if Dr. Watson phones you to tell you that an alarm has rung.
5. E: True if an earthquake has occurred.
6. R: True if there is a radio report of an earthquake.
Figure 3.4 shows the Bayes network for this example. The independence as-
sumption allows us to make the following observations:

• B does not depend on E. So if we know E, then we expect S to take on a
non-zero value; however, we do not then expect that B will become more likely
given that S has rung.

• If S takes on a non-zero value, we do expect that B and E both become more
likely. But if we later learn that E is false, we would expect that B becomes
even more likely (there is no longer another reason for S).

• If B occurs then we expect both W and G to become more likely, since S is
more likely, and that will make these two variables more likely to be true. On
the other hand, if B occurs we do not expect E to become more likely.
These yield the following product decomposition, in accordance with equation 3.8:

Prob(B ∧ E ∧ R ∧ S ∧ W ∧ G) =
Prob(B) · Prob(E) · Prob(R|E) · Prob(S|E ∧ B) · Prob(W|S) · Prob(G|S)
Once we have the net, we must parameterize it, defining a CPT for each vari-
able in the decomposition. This means that for each node in the Bayes network we
have a matrix of values: one probability for every different assignment of the node
given the different assignments of its parents. In particular, in the burglary example
all of the variables are propositional except for S, which has 3 possible values. We
report in Table 3.2 an example of CPT for S; here we use lower-case letters to denote
particular assignments of the variables:

Table 3.2: CPT for S
Parents   S = 0   S = 1   S = 2
e b       0.1     0.15    0.75
e ¬b      0.2     0.6     0.2
¬e b      0.15    0.25    0.6
¬e ¬b     0.9     0.1     0
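The product decomposition above can be sketched numerically. Only the CPT for S follows Table 3.2 (with rows indexed by the assignments of the parents E and B); every other number below is made up for illustration:

```python
from itertools import product

# Hypothetical parameterization of the burglary network: only the CPT for S
# follows Table 3.2; all the other numbers are made up for illustration.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R_given_E = {True: {True: 0.9, False: 0.1},
               False: {True: 0.001, False: 0.999}}
P_S_given_EB = {  # rows (e, b); columns S = 0, 1, 2
    (True, True): [0.1, 0.15, 0.75],
    (True, False): [0.2, 0.6, 0.2],
    (False, True): [0.15, 0.25, 0.6],
    (False, False): [0.9, 0.1, 0.0],
}
P_W_given_S = {0: 0.01, 1: 0.4, 2: 0.8}  # probability that Watson calls
P_G_given_S = {0: 0.01, 1: 0.3, 2: 0.7}  # probability that Gibbons calls

def joint(b, e, r, s, w, g):
    """Product decomposition:
    Prob(B,E,R,S,W,G) = Prob(B)Prob(E)Prob(R|E)Prob(S|E,B)Prob(W|S)Prob(G|S)."""
    p = P_B[b] * P_E[e] * P_R_given_E[e][r] * P_S_given_EB[(e, b)][s]
    p *= P_W_given_S[s] if w else 1.0 - P_W_given_S[s]
    p *= P_G_given_S[s] if g else 1.0 - P_G_given_S[s]
    return p

# Sanity check: the joint sums to one over all assignments.
total = sum(joint(b, e, r, s, w, g)
            for b, e, r, w, g in product([True, False], repeat=5)
            for s in range(3))
print(round(total, 10))  # 1.0
```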
Computing Probabilities
Besides reducing the number of probabilities we need to obtain and store, Bayes nets
also allow us to compute new probabilities more efficiently. A typical reasoning task
in a Bayes net is to compute the posterior probability of a variable given some
instantiation of a set of variables (the evidence nodes).
For example, we hear on the radio that there was an earthquake, and we want
to compute the new probability that the alarm rang: P(S = 0|e), P(S = 1|e),
P(S = 2|e). Or we want to compute the probability that Dr. Watson will call after
we hear the radio report: P(W = ⊤|e), P(W = ⊥|e).
In general, we want to compute the new probabilities of the different values of
a variable V given some set of assignments to another collection of
variables V1 = v1, V2 = v2, ..., Vk = vk. Such a computation can be performed using
the structure of the Bayes network and Bayes' rule.
Many sophisticated algorithms have been developed for probabilistic inference
in Bayes networks. However, for our purposes we will give only an
overview of a simple technique using variable elimination.
Let V1, V2, ..., Vk be the variables of a Bayes network. We want to compute the
probability of Vi given some set of values for some of the other variables Vj1, ..., Vjk
(an arbitrary set). For example, we want to compute the conditional probability
Prob(V3|V1 = a, V4 = b, V5 = c). To this end, we will compute the set of unconditional
probabilities Prob(V3, V1 = a, V4 = b, V5 = c), one for every possible value of V3.
Then we will normalize these unconditional probabilities so that they sum to one
over all values of V3; this will give us the set of conditional probabilities we want.
Now, to compute the unconditional probabilities, we first write them down as sums
of probabilities involving all of the variables in the network. In this case we rewrite
the probabilities Prob(V3, V1 = a, V4 = b, V5 = c) as the sum:
Finally we use the Bayes Network to decompose the probabilities inside the summa-
tion, and try to simplify them by moving various terms out of summations.
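The sum-out-and-normalize step can be sketched on a toy three-variable chain network (the structure and all numbers below are made up for illustration, not the burglary example):

```python
# Toy chain network V1 -> V2 -> V3 with made-up CPTs (boolean variables).
P_V1 = {True: 0.3, False: 0.7}
P_V2_given_V1 = {True: {True: 0.8, False: 0.2},
                 False: {True: 0.1, False: 0.9}}
P_V3_given_V2 = {True: {True: 0.6, False: 0.4},
                 False: {True: 0.05, False: 0.95}}

def joint(v1, v2, v3):
    """Chain decomposition: Prob(V1,V2,V3) = Prob(V1)Prob(V2|V1)Prob(V3|V2)."""
    return P_V1[v1] * P_V2_given_V1[v1][v2] * P_V3_given_V2[v2][v3]

def posterior_v3(v1_obs):
    """Prob(V3 | V1 = v1_obs): sum the joint over the hidden variable V2,
    then normalize over the values of V3."""
    unnorm = {v3: sum(joint(v1_obs, v2, v3) for v2 in (True, False))
              for v3 in (True, False)}
    z = sum(unnorm.values())
    return {v3: p / z for v3, p in unnorm.items()}

post = posterior_v3(True)
print(round(post[True], 6))  # 0.49
```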
Chapter 4
Motivation and Methodology
Figure 4.1: Biederman's Geons. Image taken from the "Avian Visual Cognition" on-line
book available at http://www.pigeon.psy.tufts.edu/avc/toc.htm
can be called primal access: the first contact of perceptual input from an isolated,
unanticipated object to a representation in memory.
Following this idea, RBC was developed to account for primal recognition of ob-
jects which does not utilize higher-level cognitive processes. Higher-level processing
may involve the use of shading, texture, or color in finer discriminations of objects.
Biederman notes that certain properties of visual features remain invariant to
perspective transformation through small angles. For example a straight edge appears
straight, while a curved edge appears curved, through a wide range of rotations of
the object, although the exact angle or curvature of that edge changes with rotation.
Starting from this consideration Biederman identified 36 geons (Figure 4.1), qual-
itatively derived using four attributes of generalized cylinders [13]:
The most important contribution of RBC is its proposal of a particular vo-
cabulary of components; in fact, since its introduction it has encouraged and inspired
much research in the computer vision community.
Edelman in [30] described some theoretical limitations of the approach. In partic-
ular, he emphasizes the necessity of introducing metric information for the components.
The most common AI approach to realizing such systems is based on representing
the world and the inference rules in a logical theory, using inference mechanisms
to reason about the world.
Since our framework is based on the definition of a logical reasoning system, our
principal task is its formalization. In general, the formalization process begins with
the definition of the language in which the facts can be represented. By translating
the facts into the language we obtain a set of sentences which is called the knowledge
base.
Then the reasoning process is translated into the language through the definition
of a set of axioms, which are generally non-trivial. The purpose of this step is to
ensure that if a fact A follows from the facts included in the knowledge base, then A
is a logical consequence of those facts.
Following this approach, the formalization of our vision based reasoning system
is achieved through the identification of a set of primitives, which constitute the
ontology of our language, and the introduction of a set of axioms both for the
connection and the definition of the primitives.
The Ontology
In philosophy, ontology is the branch that studies what things exist. In AI, it is
an explicit formal specification of how to represent the objects, concepts and other
entities that are assumed to exist in some area of interest, and the relationships that
hold among them.
Translating this into the recognition problem, the ontology of our reasoning system
is composed of the set of visual features used for the recognition task. In particular,
it is a set of multidimensional primitives, composed of:
NP ∪ NB ∪ NF ∪ NA ∪ NSP
1. For each pair of views v, v′, the corresponding 2D orthographic projections (op)
are such that op(g, v) ≠ op(g, v′).

2. For each pair of views v, v′, there is no mapping m such that m(v) = v′, i.e. the
set is minimal.
We shall show that, for each SymGeon, the set of its aspects represented in GS
comprises all its significant 2D orthographic projections, in which a linear transfor-
mation for smooth edges is defined. In this way the superellipsoid geometrical
properties, as well as the deformations, are qualitatively simulated via a compositional
construction exploiting the 2D projections.
This problem is addressed in many papers concerning the construction of the aspect
graph of a generic object [44, 31, 57]. For our purpose we use an orthographic model
to describe the viewpoint space, where each vantage point is represented by a point
on a sphere surrounding the object. In this way each viewpoint is defined by two
parameters, the longitude θ and the latitude φ (see Figure 4.3), and is denoted by Vθ,φ.

Figure 4.3: Viewpoints Space.

In this way, given a point (x0, y0, z0) in the object reference frame, we can compute
its projection on the image plane (u, v) in two steps:
i. a rotation to align the point to the z-axis, so that the vantage point is also
aligned;
which, after the orthographic projection into the image plane, yields:
u = x0 cos θ − z0 sin θ
(4.2)
v = x0 sin θ sin φ + y0 cos φ + z0 cos θ sin φ
The points (θ, φ, u, v) defined by the above equations, for all possible values of
θ and φ, represent the set of points in the viewpoint space occupied by the point
(x0, y0, z0). Our problem is to understand what happens to the image point (u, v)
as the viewpoint Vθ,φ is changed.
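Equation 4.2 can be sketched directly (the function name is hypothetical):

```python
import math

def project(x0, y0, z0, theta, phi):
    """Orthographic projection (equation 4.2) of an object-frame point
    onto the image plane for the vantage point V_{theta, phi}."""
    u = x0 * math.cos(theta) - z0 * math.sin(theta)
    v = (x0 * math.sin(theta) * math.sin(phi)
         + y0 * math.cos(phi)
         + z0 * math.cos(theta) * math.sin(phi))
    return u, v

# With theta = phi = 0 the image plane coincides with the xy plane.
print(project(1.0, 2.0, 3.0, 0.0, 0.0))  # (1.0, 2.0)
```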
• a = (a1, a2, a3) are the scale parameters along the x, y and z axes;

• ε = (ε1, ε2) are the squareness parameters along the longitude and latitude
directions;

• K = (Kx, Ky) are the tapering parameters along the x and y axes;
vi. ν : E → {r, c} is a function which assigns a label to each arc depending on the
kind of edges generating it.
We are now able to recognize two different aspects of a symgeon using a notion of
Topologically Equivalent Aspects, which is obtained as a graph isomorphism.

Definition 5 (Topological equivalence) Let G be a symgeon, and Vθ,φ and Vθ′,φ′
two view directions of G. Let AG and A′G be the two aspects of G defined according
to the two view directions. AG and A′G are topologically equivalent if they are
isomorphic.
NA = {AG | G ∈ S, ¬∃A′G . AG ≡ A′G}

Here S is the set of symgeons (see Table 3.1). The set of aspects is depicted in
Appendix B.
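Since the aspect graphs of single SymGeons are small, topological equivalence can be tested by brute-force graph isomorphism; the vertex-list/edge-set representation used here is an illustrative assumption:

```python
from itertools import permutations

def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test between two small undirected graphs,
    given as vertex lists plus edge lists (adequate for the small aspect
    graphs of single SymGeons)."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    eb = {frozenset(e) for e in edges_b}
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))  # candidate node correspondence
        if {frozenset((mapping[u], mapping[v])) for u, v in edges_a} == eb:
            return True
    return False

# Two square aspects drawn with different vertex names are topologically
# equivalent: both are 4-cycles.
print(isomorphic([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)],
                 ['a', 'b', 'c', 'd'],
                 [('a', 'c'), ('c', 'b'), ('b', 'd'), ('d', 'a')]))  # True
```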
As we did for the aspects, we need to find the minimal set of faces. To this end
we have to consider each aspect-level node and its topological characterization
through the aspect graph AG which is, by virtue of its properties, a planar graph.
Any planar graph enjoys the following property: its drawing divides the plane into
connected regions. Moreover, the number of such regions is independent of the
particular drawing. Exactly one of these regions is unbounded and is called the
exterior face. All other faces are bounded by the edges of the graph.
Extending this property to the graph AG, we can consider all its bounded regions
as candidate faces of NF (see Figure on the right). In the same way as we did for
the aspects, we have to introduce a definition of similarity between faces, in order
to generate the minimal number of elements.

To this end we can give a topological definition of a face using the notion of
subgraph. A face aspect (FA) of a face region is a subgraph of a symgeon aspect AG,
formed by the edges bounding the region. In particular, an FA is a cycle of the AG
which generates it.

By the above definition we can use the same topological equivalence introduced
before for the aspects. Some elements of the set of faces are depicted in Figure 6.7.
Finally, boundits represent boundary elements; as done above, starting from a face
f ∈ NF we introduce a constructive way to generate the boundary elements b.

Let F = ⟨vi1, vi2, ..., vin⟩ be a cycle representing an FA of an AG. This means
that for every pair of adjacent nodes vi and vj belonging to F, there exists an edge
ek = (vi, vj) ∈ AG. Therefore we can say that two edges ei and ej are connected in
F if ei = (vl, vk), ej = (vk, vm) and vl, vk, vm are adjacent nodes of F.

Using this notion, we can proceed starting from each face f and its structural
definition F, adding a boundary b for every pair of adjacent edges in F. As for the
face level, we can use the definition of topological equivalence given in Definitions
3-5 to minimize NB.

Following this procedure, NB would be composed only of the elements ll, eal and
eaea represented in Figure 6.4; but in order to simplify the axiomatization of our
reasoning system, we also introduce four other elements into the ontology.
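The constructive procedure just described can be sketched as follows; representing a boundit as the pair of edges sharing a node is an illustrative assumption:

```python
def boundits_from_face(cycle):
    """Generate one boundary element (boundit) for every pair of adjacent
    edges of a face cycle <v1, ..., vn>; each boundit is represented as the
    pair of edges sharing a node (an illustrative assumption)."""
    n = len(cycle)
    edges = [(cycle[i], cycle[(i + 1) % n]) for i in range(n)]
    return [(edges[i], edges[(i + 1) % n]) for i in range(n)]

# A triangular face region yields three boundits, one per corner.
corners = boundits_from_face(['v1', 'v2', 'v3'])
print(len(corners))  # 3
print(corners[0])    # (('v1', 'v2'), ('v2', 'v3'))
```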
Axiomatization
Since the image interpretation task that we consider involves information about prim-
itives arranged in space, representation of and reasoning about spatial relations are
important components of our framework. Therefore, the core of our visual reasoning
system is the Algebra of Figure (AF), described in detail in Chapter 6.
Techniques to represent spatial information have been studied in AI in the area of
spatial reasoning, where different logical frameworks have been introduced to formalize
image domain knowledge. Reiter and Mackworth in [87] introduced a framework based
on the definition of three sets of axioms, where an interpretation of an image is
defined to be a logical model of these axioms.
Other interesting results have been obtained for spatial reasoning in the context of
GIS (Geographic Information Systems), where different sets of relations are introduced
for regions in space [20, 8, 19].
Our Algebra of Figure is a multi-sorted algebra used both to define each basic
element in our ontology and to describe the objects that have to be recognized by
the system.

The sorts of the algebra are those needed to represent the classes of elements
defined in the ontology (Primits, Boundits, Faces, Aspects and SymGeons), plus a
sort named Scene Graph, which represents objects obtained by suitable composition
of SymGeons.

The elements of the algebra can be composed using four possible operators which
represent geometrical properties: Connection, Parallelism, Symmetry and
Orthogonality. For convenience, to this set we add an additional operator,
representing the Angularity property, which can be defined in terms of the previous
ones. They are represented in Table 4.1, together with their functional denotation,
which will become clearer in the next Chapter.

Table 4.1: Relations between elements and their functional denotation.
C = Connected: ⊕nC
P = Parallel: ⊕nP
S = Symmetric: ⊕nS
T = Orthogonal: ⊕nT
V = Angular: ⊕nV
Observe that all the relations used to define the operators are reflexive. This is an
important property: especially for objects, it allows us to give a definition which
is view independent, i.e. the composing relations are preserved under translation,
rotation or scaling.

The choice of the relations is inspired mainly by the work of Lowe [62, 83, 63],
which uses proximity, parallelism, symmetry, etc. to describe relations among 2D
features. Despite their simplicity, these five operators allow us to describe a large
number of complex objects.
The axioms of our algebra are divided into three groups:
1. Grouping Axioms: these axioms are used to build all the terms of the algebra
and, more importantly, to parse the graph used to represent the observed scene
(the scene graph, described in Chapter 1.2). Such a parsing procedure is the
keystone of our reasoning approach, because it is used to recognize the objects.

2. Connection Axioms: these axioms are used to capture, among all the possible
models, only those which are canonical, i.e. which represent valid connections
among elements. Besides this, connection axioms give a semantics to the
connection operators.
The scene graph, in turn, can be constructed because a set of SymGeons has been
recovered from the image. Each SymGeon, in turn, is recovered as a hypothesis and
then, if the hypothesis is reliable, the SymGeon is suitably localized in the scene.
In Section 4.2 we have described the primitive components of the image, i.e. the
primits. By suitably composing primits we form boundits, by suitably composing
boundits we form faces, and finally by composing faces we form aspects, where aspects
are views of a SymGeon from different vantage points. This composition process is
formulated in FOL, by explicit definitions, as previously introduced using the algebra
AF. However, due to image noise and the uncertainty of its structure, the presence
of a SymGeon in the scene is gathered using Bayesian inference: e.g. a given closed
region of the Syntactic Graph is a cylindroid with probability p.
In this section we describe the basic idea of constructing the Bayes-aspect network.
It is obtained suitably composing the structure of a HAG [27] together with causal
relations obtained by introducing the connection relations r ∈ R, between features
(primits, boundits, faces, aspects), defined with the algebra AF. A hierarchical aspect
graph representation identifies equivalent views and neighborhood relations generating
a graphical structure of views. Each node of the HAG represents an aspect of the 3D
type in the hierarchy. Each link represents some visual event where transitions occur
between two neighboring general views. As we mentioned in the preliminaries, since a
symgeon is obtained by suitable tapering and/or bending transformations applied to
a superellipsoid, a few aspects are needed to capture the change in the vantage point.
Dickinson and Metaxas have defined a three layered HAG, Gpg = (NA ∪ NF ∪ NB , E)
for parametric geons in which the set of nodes N is partitioned into three sets, where
NA is the set of aspects, NF is the set of faces, and NB is the set of boundits. The
HAG levels are naturally ordered: the nodes in NB contribute to the composition of the faces represented in NF which, in turn, contribute to the composition of the aspects in NA . The aspects, finally, are the significant projections of the parametric geons. In Chapter 2 we have given an overview of this approach.
In our case we have extended the notion of parametric geons to that of SymGeons.
We shall now give the methodology through which we construct the topology of
the Bayes-aspect network.
2. E = ECm ∪ ECa , where ECm is the set of composition links leading from a
feature node to a connection node, and ECa is the set of causal links leading
from a connection node to a feature node.
Table 4.2: CPT for feature nodes (left) and connection nodes (right).
Figure 4.4: A Gaussian probability function with mean 0 and variances 0.5 and 0.8.
According to such a definition, each feature node of the BAN is labelled with a feature element (primits, boundits, faces and aspects) composing our ontology, described in Section 4.2. Once all the nodes concerning the features used to represent the aspect that roots the structure have been determined, we need to introduce suitable connection nodes and the conditional probability tables (CPT) linking nodes of a lower feature level to the nodes of the upper level. For example, two boundary nodes cooperate in determining a face via a connection node, e.g. a symmetry node, labelled by ⊕S . Each connection node, since it is defined in terms of a distance as described in the next Chapter, is given a probability according to the distance of the features entering the connection.
Consider, for example, the quadrilateral face with curved boundaries f depicted in Figure 7.1, which is obtained by suitably composing two boundits eal and ial using
40 CHAPTER 4. MOTIVATION AND METHODOLOGY
P rob(eal(x, y)) = p ≡
∃p1 , p2 , p01 , p02 , α, α0 .x = primit(p1 , p2 , α) ∧ y = primit(p01 , p02 , α0 )∧
α = slope(p1 , p2 ) ∧ ∃C, rm , rM , β, φ, ∆φ.α0 = ell(C, rm , rM , β, φ, ∆φ)∧
d = x ⊕C y ∧ P rob(x) = 1 ∧ P rob(y) = 1 ∧ p = N0,σC (d)
(4.6)
Observe that in the case of primits, probabilities can be either 1 or 0: 1 means that the primit is found in the image syntactic graph by the syntactic analyzer, and 0 otherwise.
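The connection-node probability p = N0,σC(d) of equation (4.6) can be sketched as follows. The sketch assumes an unnormalized zero-mean Gaussian, so that p = 1 when the connection distance d is 0, consistent with root primits having probability 1; the value of sigma is a hypothetical tuning parameter.

```cpp
#include <cmath>

// Hypothetical connection-node probability: an unnormalized zero-mean
// Gaussian of the connection distance d, so that p = 1 when d = 0 and
// p decays towards 0 as the features drift apart. sigma plays the role
// of sigma_C in equation (4.6); its value is an assumption.
double connectionProbability(double d, double sigma) {
    return std::exp(-(d * d) / (2.0 * sigma * sigma));
}
```

With this choice, a perfect connection (d = 0) yields evidence of strength 1, while distant features contribute almost nothing to the upper-level node.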
To understand what H is and the role of sensing actions, consider the following simple
example. There is a table and a chair in a room and an observation of the scene is
performed, i.e. a shot of the scene is taken (we open our eyes and look into the room);
let us catch the instant before we make sense out of what there is in the room. Clearly,
at the very moment in which the image is taken no distinction among objects is made.
Therefore it is not a single sensing action like sense(isT able(x), v) that takes place,
but a scene/image acquisition.
4.4. COGNITIVE AND DESCRIPTION FRAMEWORK 41
From the image acquisition till the inference, leading to a hypothesis that there might be a table in the room, a complex process of revelation takes place: one bringing the shapeless and scattered components identified in the image to the surface of cognition2 , by giving a structure to these components. And there is a step in the structuring that reveals the meaning: “that’s a table”. In other words, the re-cognition process is a thread of revelations (the apperception) giving, attributing, meaning to the elements of the image. This is achieved by conjugating the construction of a tight
data structure (a graph of all the symgeons occurring in the scene together with their
topological relations), which is the hypothesis H, together with the meaning given
by a description and denoted by a sensing action like sense(isT able(x), v). Therefore
the sense(isT able(x), v) action has, indeed, the double meaning of giving sense to the
elements of the data structure and of bringing to the surface of cognition the existence
of an object, a table, in fact.
To understand what we mean let’s go through the example of the table. We might
simplify the successor state axiom in (3.3) as follows:
3. Processing step: the data structure is the image syntactic graph, i.e. the graph depicting all the segments (straight segments and arcs) traced in the image. Such a graph is described in Chapter 7.
Chapter 5

Syntactic Image Analysis
The framework of our vision reasoning system can be divided into two parts. The
first one concerns the recognition of the SymGeons in the scene, and the second one
regards the recognition of the objects.
In Chapters 5, 6 and 7 we describe the first recognition process. Such a recognition
is achieved in two steps, starting from 2D image data. The first step consists of an
image analysis whose purpose is to identify, in the acquired image, the basic elements
of our ontology (primits). It is described in this Chapter.
The second step concerns the compositional process that, starting from a set of
primits, identifies boundits, faces and aspects in turn, using probabilistic reasoning
and leading to the recognition of a specific SymGeon. It is described in Chapter 7.
44 CHAPTER 5. SYNTACTIC IMAGE ANALYSIS
[Figure 5.1: the parameters characterizing a straight primit (end points p1 , p2 ) and an elliptic primit (center C, radii rm and rM , foci F1 and F2 ), defined w.r.t. the image reference frame (u, v).]
Observe that processing(I, s) is a functional fluent denoting the interface with basic
visual processing. Therefore, it has to be distinguished both from a sensing action
and a control action.
In particular, a straight line and an arc of ellipse are characterized using the following parameters (see Figure 5.1), defined w.r.t. the image reference frame (u, v):
Implementation
The input of the algorithm is a binary image obtained from the acquired image and
elaborated using edge detection operators. The output is a list of segments which are
classified in two categories: rectilinear and curvilinear. To accomplish this task we
operate in two steps:
1. Grouping: the image points that compose each instance of the target curve
are grouped together;
5.1. IMAGE ANALYSIS 45
2. Model fitting: given a set of image points probably belonging to a single curve,
find the best curve interpolating the points.
To solve the grouping problem we use a simple edge following algorithm that
provides an ordered list of connected points belonging to the object boundaries (chain)
by scanning the binary image in a predefined order. The algorithm is sketched out
below:
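A minimal illustration of such an edge-following step is sketched below. It is an assumption-laden sketch, not the thesis's exact algorithm: it scans the binary image in raster order and, whenever it meets an unvisited edge pixel, greedily follows 8-connected unvisited edge pixels, emitting one ordered chain of boundary points per curve.

```cpp
#include <vector>
#include <utility>

using Chain = std::vector<std::pair<int,int>>;

// Minimal edge-following sketch: scan the binary image in raster order;
// when an unvisited edge pixel is found, greedily follow 8-connected
// unvisited edge pixels, producing an ordered chain of boundary points.
std::vector<Chain> followEdges(const std::vector<std::vector<int>>& img) {
    int h = img.size(), w = h ? img[0].size() : 0;
    std::vector<std::vector<bool>> seen(h, std::vector<bool>(w, false));
    std::vector<Chain> chains;
    for (int v = 0; v < h; ++v)
        for (int u = 0; u < w; ++u) {
            if (!img[v][u] || seen[v][u]) continue;
            Chain c;
            int cu = u, cv = v;
            bool moved = true;
            while (moved) {
                seen[cv][cu] = true;
                c.push_back({cu, cv});
                moved = false;
                for (int dv = -1; dv <= 1 && !moved; ++dv)
                    for (int du = -1; du <= 1 && !moved; ++du) {
                        int nu = cu + du, nv = cv + dv;
                        if (nu >= 0 && nu < w && nv >= 0 && nv < h &&
                            img[nv][nu] && !seen[nv][nu]) {
                            cu = nu; cv = nv; moved = true;
                        }
                    }
            }
            chains.push_back(c);
        }
    return chains;
}
```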
The model fitting problem, instead, is solved using a split & merge algorithm applied to each list of connected pixels. The two phases of the algorithm are described in the following.
Split phase This phase consists of a top-down algorithm where the list is recursively divided into sublists until each one can be approximated with a straight line or an arc of ellipse with an acceptable error.
Algorithm SPLIT
Merge phase Using only the split phase, the risk is to obtain an over-fitting of the list, i.e. a fitting of the list using more segments than necessary. Therefore the task of the merge phase is to analyze every pair of adjacent segments and fuse them into a unique segment if the error is acceptable.
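The split phase described above is essentially a recursive farthest-point subdivision (in the spirit of Ramer-Douglas-Peucker). A minimal sketch, restricted to straight-segment fitting for brevity (the thesis's version also tries an elliptic-arc fit at each step), could be:

```cpp
#include <vector>
#include <cmath>
#include <utility>

using Pt = std::pair<double,double>;

// Perpendicular distance from p to the line through a and b.
double pointLineDist(const Pt& p, const Pt& a, const Pt& b) {
    double dx = b.first - a.first, dy = b.second - a.second;
    double len = std::sqrt(dx*dx + dy*dy);
    if (len == 0) return std::hypot(p.first - a.first, p.second - a.second);
    return std::fabs(dy*(p.first - a.first) - dx*(p.second - a.second)) / len;
}

// SPLIT sketch (straight segments only, for brevity): recursively split
// the chain at the point farthest from the chord until each sublist can
// be approximated within the error threshold eps.
void split(const std::vector<Pt>& chain, int lo, int hi, double eps,
           std::vector<int>& breaks) {
    double worst = 0; int idx = -1;
    for (int i = lo + 1; i < hi; ++i) {
        double d = pointLineDist(chain[i], chain[lo], chain[hi]);
        if (d > worst) { worst = d; idx = i; }
    }
    if (worst > eps) {           // error too large: split at the worst point
        split(chain, lo, idx, eps, breaks);
        breaks.push_back(idx);
        split(chain, idx, hi, eps, breaks);
    }
}
```

The merge phase would then re-examine adjacent sublists and join them whenever a single fit keeps the error below the threshold.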
The technique used to fit a set of points with an arc of ellipse is described in [39]; in Appendix A we present a short overview.
The final step of the procedure is to determine the parameters characterizing
each kind of primit, as reported previously. The endpoints of the primits are easily identified by considering the first and the last point of the list from which the primit is obtained.
It is more complicated to determine the parameters C, am , aM , α, φ and ∆φ characterizing an elliptic primit. Starting from the coefficients of the generic conic Au² + Bv² + Cuv + Du + Ev + F = 0, we first verify that it is a real ellipse, through the evaluation of the following conditions:

          | A    C/2  D/2 |             | A    C/2 |             ∆
    ∆ =   | C/2  B    E/2 |  ≠ 0,       | C/2  B   |  > 0,     ───── < 0.
          | D/2  E/2  F   |                                    A + B
Center of Ellipse: since an ellipse is a central conic (like a hyperbola), the center of the conic (cu , cv ) is the intersection of its diameters, so it is the solution of the following system:

    2Au + Cv + D = 0
    Cu + 2Bv + E = 0
Therefore, the coordinates of the center of the ellipse are given by:

         | −D   C  |                 | 2A  −D |
         | −E   2B |                 | C   −E |
    cu = ───────────          cv =  ───────────
         | 2A   C  |                 | 2A   C |
         | C    2B |                 | C    2B |
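The 2x2 system above can be solved directly by Cramer's rule; a minimal sketch (function name and error handling are our own):

```cpp
#include <cmath>

// Center of a central conic A*u^2 + B*v^2 + C*u*v + D*u + E*v + F = 0,
// obtained by Cramer's rule from  2A*u + C*v + D = 0, C*u + 2B*v + E = 0.
// Returns false when the determinant vanishes (no unique center).
bool ellipseCenter(double A, double B, double C, double D, double E,
                   double& cu, double& cv) {
    double det = 2*A*2*B - C*C;              // |2A C; C 2B|
    if (std::fabs(det) < 1e-12) return false;
    cu = (-D*2*B - C*(-E)) / det;            // Cramer numerator for u
    cv = (2*A*(-E) - (-D)*C) / det;          // Cramer numerator for v
    return true;
}
```

For instance, the circle u² + v² − 4u − 6v + 12 = 0 has center (2, 3), which the sketch recovers from A = B = 1, C = 0, D = −4, E = −6.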
Major and Minor Axes: we need to transform equation (5.2) into the canonical form of the ellipse, whose equation is:

    au² + bv² + c = 0

The coefficients a, b and c are determined as the solution of the equation system:

    ab = AB − (C/2)²
    a + b = A + B
    c = ∆ / (AB − (C/2)²)

where ∆ is specified above. Let r1 = √(−c/a) and r2 = √(−c/b); the minor and major axes are:

    am = r1 , aM = r2    if r1 < r2
    am = r2 , aM = r1    otherwise.
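Since a + b and ab are both known, a and b are the roots of x² − (A + B)x + (AB − (C/2)²) = 0. A minimal sketch putting the whole computation together (function names are our own):

```cpp
#include <cmath>

// Semi-axes of the ellipse in canonical form a*u^2 + b*v^2 + c = 0:
// a and b are the roots of x^2 - (A+B)x + (AB - (C/2)^2) = 0, and
// c = Delta / (AB - (C/2)^2), with Delta the 3x3 conic determinant.
// Returns false if the coefficients do not describe a real ellipse.
bool ellipseAxes(double A, double B, double C, double D, double E, double F,
                 double& am, double& aM) {
    double q = A*B - (C/2)*(C/2);            // must be > 0 for an ellipse
    double Delta = A*(B*F - E*E/4) - (C/2)*(C/2*F - E/2*D/2)
                 + (D/2)*(C/2*E/2 - B*D/2);
    if (q <= 0 || Delta/(A + B) >= 0) return false;
    double disc = (A+B)*(A+B) - 4*q;         // discriminant for a, b
    if (disc < 0) return false;
    double a = ((A+B) + std::sqrt(disc)) / 2;
    double b = ((A+B) - std::sqrt(disc)) / 2;
    double c = Delta / q;
    double r1 = std::sqrt(-c / a), r2 = std::sqrt(-c / b);
    am = std::fmin(r1, r2);
    aM = std::fmax(r1, r2);
    return true;
}
```

As a check, u² + 4v² − 4 = 0 (semi-axes 2 and 1) yields am = 1 and aM = 2.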
- if r < 0:
φ = φ2 − α
∆φ = arcExtension(φ2 , φ1 )
In Section 5.2, we list three basic functions of the Syntactic Image Analyzer: arcExtension,
ellipseParameters and ellipsePhiDeltaPhi.
The output of the Syntactic Image Analysis is the image structure graph. In our
logic-based framework such a graph is defined through a list of terms, representing
the sequence of primits extracted from the image, having the following form:
primit( l1, point(P1X, P1Y), point(P2X, P2Y), S ).
primit( a2, point(P1X, P1Y), point(P2X, P2Y),
ell( point(CX, CY), Rm, RM, Alpha, Phi, DPhi) ).
Here the first argument is a unique identifier in the form “l”, for straight line, or “a”,
for arc of ellipse, followed by an ordering number. The other parameters are those
introduced above, to describe primitive traits.
In the sequence of figures depicted in Table 5.1, the resulting output of the above steps, applied to the image of an abstract picture, is shown. The associated list of primits is reported in Table 5.2.
From the analysis of such a list, it is possible to note that two considerable kinds of errors occur. The first one is an incorrect subdivision of a line into two primits, e.g. l1, l2 or l3, l4; the second one is a wrong classification of a segment, e.g. l16, l17, l31, l32. This behaviour is due to the choice of the threshold errors for the line and ellipse fitting: in the first case the threshold is too high, while in the second one it is too low. This contrast makes the implementation of the Syntactic Image Analysis difficult.
The problem can probably be solved through an initial preprocessing of the data, looking for the best threshold for the specific image. Another possibility, which best fits our idea of a vision reasoning system, is a close interaction between the high-level reasoning system and the low-level visual processing. In such a way, as described in [94], through a mechanism of feedback and expectation, the high-level hypotheses about the interpretation of sensor data directly affect the low-level vision processing which, in our particular case, adjusts its internal thresholds trying to extract those clues which allow such hypotheses to be confirmed.
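The feedback idea can be sketched as a simple control loop; the function name and the halving policy are hypothetical illustrations, only the idea of re-running the low-level fitting with revised thresholds comes from the text.

```cpp
#include <functional>

// Hypothetical feedback loop between high-level hypotheses and low-level
// fitting: if the current segmentation does not support the hypothesis,
// retry with a revised fitting threshold. The policy of halving the
// threshold is an illustration, not the thesis's method.
double adjustThreshold(const std::function<bool(double)>& confirmed,
                       double eps, int maxTries) {
    for (int i = 0; i < maxTries; ++i) {
        if (confirmed(eps)) return eps;   // hypothesis supported: keep eps
        eps /= 2.0;                       // otherwise tighten and retry
    }
    return eps;
}
```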
Arc Extension
//////////// Arc extension between two angles first and second ////////////
double arcExtension(double first, double second)
{
if ((first>0) && (second>0)) {
if (first<second)
return fabs(first-second);
else
return 2*PIGRECA-first+second;
} else {
if ((first<0) && (second<0)) {
if (first<second)
return fabs(first-second);
else
return fabs(first)+2*PIGRECA+second;
}
}
if (first<0)
return fabs(first)+second;
else {
return 2*PIGRECA-first-fabs(second);
}
}
Ellipse’s Parameters
/*********************** Ellipse Parameters **************************/
bool ellipseParameters(double A0, double Ax, double Ay,
double delta, j, i, q;
double det, detX, detY;
double r1, r2;
Computing φ and ∆φ
/////////////////// Computing phi and DeltaPhi ////////////////////
bool ellipsePhiDeltaPhi(ellisse *e,
double x1, double y1,
double x2, double y2,
double mx, double my) {
e->phi = phi1-e->alpha;
e->delta_phi = ampiezzaArco(phi1, phi2);
} else {
return false;
}
return true;
}
5.2. PORTIONS OF THE SIA CODE 53
Chapter 6

An Algebra of Figures
6.1 Introduction
In this chapter we present a formalization of the operations and primitives that we consider the elementary constituents of a scene representation. By a scene representation we intend, indeed, a disposition of human artifacts in a room, hallway, etc., thus excluding landscapes, animals, human beings and so on. Our formalization therefore takes into account the way a scene can first be decomposed, by denoting its basic patterns, and then recomposed by using rational laws concerning not only geometric properties, but also commonsense rules. Although these last rules cannot be general, they are crucial because they take into account not only the way human beings assemble artifacts, but also the way human beings observe artifacts.
This formalization will be used to introduce descriptions, which are the basic patterns of the reasoning process leading to the interpretation of a scene. Here, in fact, instead of talking about the image, in which patterns have no meaning associated with them, we shall talk about the scene, and the difference lies in the fact that each token of a scene has its own specific dimension and meaning.
A significant aspect of our formalization is that of denoting patterns by new terms, from the most simple to the most complex. These terms can be composed through operations which, in turn, can form new terms, according to the peculiar composition they denote. We have therefore introduced a body of new terminology which is not gratuitous, as it serves to introduce a suitable symbolic notation to model a scene. The structural relationships among terms, denoting scene patterns, and their underlying axiomatization, is what we call an algebra of figures. Although figures are terms of the algebra, we could not avoid a metric notion of distance to define them. However the crucial point is: shall we refer to the image plane (which is 2D) or shall we refer to the scene space (which is 3D)? It turns out that, as far as the process of composition is concerned, one can shift from the plane to the space very easily, since there is only a metric problem (concerning the definition of distance), which can be made independent from the definition of each shape.
56 CHAPTER 6. AN ALGEBRA OF FIGURE
For example, if we want to describe a table by saying that the legs are orthogonal to the surface, we apparently do not need the third dimension, unless we specify the SymGeons in the space instead of specifying them in the plane. Now, with the exclusion of Chapter 7, in which we face the problem of Object Recognition, in this chapter we are concerned only with the “representation” of SymGeons components; therefore we shall refer only to the image plane, and when we talk about a distance we consider the Euclidean distance between points, δ(p1 (u1 , v1 ), p2 (u2 , v2 )) = √((u2 − u1 )² + (v2 − v1 )²).
To indicate primits we shall use the lowercase Greek letter ρ; to denote boundits we shall use the lowercase Greek letter β; to denote faces we shall use the lowercase Greek letter φ; and to denote SymGeons we shall use the lowercase Greek letter γ. We shall use all of them with superscripts or subscripts.
Operations, Ω = {, ⊕P , ⊕E , ⊕F }
Here is the concatenation operator (sometimes referred to as a C-operator);
⊕P is the pointwise connection operator;
⊕E is the edgewise connection operator;
and ⊕F is the facewise connection operator.
The usual metric notions on a plane have to be interpreted in a perceptual context; therefore we somewhat relax the usual metric definitions (e.g. Birkhoff’s) and consider the following. A point (x, y) in the image plane is denoted with p, and δ is the Euclidean distance defined above:
Consider now the usual geometric notions of parallelism, symmetry, and orthogonality. We give some definitions that adapt them to the image plane.
Definition 8 Two primitive elements π1 and π2 , in the image plane, are parallel if
they do not intersect; i.e., for any p ∈ π1 , there is no p-region r (p) s.t. r (p) ⊂ π2 .
6.2. AN ALGEBRA OF FIGURES 57
1. For any p′-region r (p′) ⊂ π1 there is a p″-region r (p″) ⊂ π2 which is equidistant from both l1 and l2 .
Let R be a set of relations between primitive elements in the image plane (e.g. parallelism, symmetry, orthogonality, connection, overlapping, etc.). In particular,
let II indicate parallelism, S indicate symmetry, V angularity, and T orthogonality,
and let R ∈ {II, T, S, V}. A family of n-ary operators {{⊕iX }i≤n }X∈R can be derived,
as follows. Let n be the number bounding the set of primitive elements in the image
plane. For each i ≤ n, and for each X ∈ R, there is a set of n-ary operators {⊕iX }i≤n ;
in particular, when i = 1 and X is any of the above defined connection operators, i.e.
X ∈ {C, P, E, F }, then {⊕iX }i≤n ∈ Ω. Now Ω can be extended with {{⊕iX }i≤n }X∈R .
Relations, the set of F-relations is {=, ≺}, where = is defined within the first-order language and ≺ is a precedence relation.
2. If t, t0 ∈ T then t t0 ∈ T . ∈ Ω.
Nothing else is a term of F. In the following we will denote the terms of F with t with
superscripts or with τ , always with superscripts.
Given the above primitives and operators the two following elements can be de-
fined:
2. Scene Graph. The set SG of Scene graphs is, analogously, defined inductively
as follows:
All the above axioms are necessary to recognize a SymGeon in the image (plane). The above axioms, in fact, cannot be used to recognize an object; to this end we shall further introduce the axioms for descriptions. Observe, however, that the set of Descriptions can be empty, while the set of axioms specified above is necessary to process the aspect graph of an image.
Figure 6.2: (i) A graph, labeled by relations in R; (ii) a tree denoted by a rewriting of the term for the graph, in which some nodes are repeated.
t = (f1 ⊕F f2 ) (f1 ⊕F f3 )
then t has no principal node. By suitably rewriting the term t, we get a term t′ as follows

t′ = f1 ⊕²C (f2 f3 )

then t′ has a principal node.
We show now how the above defined algebra is useful to represent a graph. Con-
sider the graph depicted in Figure 6.2 (i), labeled by some relation in R. The term
denoting the graph is the following:
b ⊕V a b ⊕V c b ⊕P d c ⊕P a c ⊕C d d ⊕C a (6.1)
Proof. If t = t′ ⊕_X^n τ then t has already a principal node, therefore we can assume that t has the form t = t1 · · · tk . We show the claim by induction on the length |t| of t. If |t| = 1 then t is the only term, and the claim is proved. Suppose the claim is true for any k ≤ n, and let t = t1 · · · tk tk+1 . Each ti ∈ T⊕ . Since t denotes a connected graph, and the term tk+1 = g ⊕_X^{|τ|} τ , it follows that either g must appear in some of the tails in ti , 1 ≤ i ≤ k, or there is a term g′, appearing in τ , which is mentioned in some of the ti . If g is mentioned in the tail of ti then, by applying the connection rule introduced in (6.3.1), we can move tk+1 into the tail of such a ti and remove it, so obtaining a term t′ = t1 · · · t′i · · · tk = t. And the claim is proved.
Otherwise we look for the g′. By induction on the structure of g ⊕_X^{|τ|} τ , we show that we can transform this term into g′ ⊕_X^{|τ′|} τ′ = g ⊕_X^{|τ|} τ .
If τ = g′ then by reflexivity we get g′ ⊕X g.
If τ = g′ ⊕_Y^{|τ″|} τ″ then, by applying the connection rule, we can transform tk+1 = g ⊕X g′ ⊕_Y^{|τ″|} τ″ into the terms g ⊕X g′ g′ ⊕_Y^{|τ″|} τ″. By reflexivity we get g′ ⊕X g g′ ⊕_Y^{|τ″|} τ″. Finally, applying the first distributive law, we get g′ ⊕X ⊕Y g τ″; now, by applying the connection rule, using the term ti and g′, we can eliminate the term tk+1 .
If τ = t′1 · · · g′ · · · t′m , then by item (c) of Proposition 1 we get tk+1 = g ⊕X t′1 · · · g ⊕X g′ · · · g ⊕X t′m . This is equal to the term g ⊕X t′1 · · · g′ ⊕X g · · · g ⊕X t′m = g′ ⊕X g g ⊕_X^m t′1 · · · t′m , by (c) again; now, by applying the connection rule, we get the equal term t = g′ ⊕X (g ⊕_X^m t′1 · · · t′m ), and this last can be substituted in the g′ mentioned in ti , by the connection rule. And since we have moved the term t′, t′ = tk+1 , into the tail of such a ti , we can remove it, so obtaining a term t″ = t1 · · · t′i · · · tk = t. And the claim is proved.
Example 3 Let t be the term expressed in Equation (6.1), denoting the completely connected graph depicted in Figure 6.2 (i). Then t can be transformed into a term t′ having a principal node, as follows:
The above term can be seen as one denoting a tree, e.g. the one depicted in Figure 6.2 (ii). Observe that in such a tree some nodes are repeated because of the relations that have to be maintained.
known. The general term known means that they are explicitly defined by a set of axioms, namely the set FF of figure axioms. Observe that we have only one metric notion, which is the distance according to which a connection is accepted or refused. The other parameters (length, angles, etc.) are irrelevant to our axiomatization.
Observe that, although we have introduced sorts for each primitive element in the image, we shall introduce for each primitive element an explicit definition, as follows.
Primit As described in the previous chapter, the Syntactic Image Analyser (SIA) returns a graph of the image in which there are tiny traits or patterns which could all be considered arcs of ellipses of different dimensions. However, to simplify computations at the syntactic level, we shall partition the set of arcs into arcs of line and arcs of ellipse. Then we recompose both of them into a general notion of primit, standing for primitive pattern. The two kinds of primits are, thus:
ρ = primit(p1 , p2 , α) ≡ ρ ∈ SA ∧
  { [ α = slope(ρ) ∧ ∀ρ′, p′, α′. α′ = α → ¬(ρ′ = primit(p1 , p′, α′) ∨ ρ′ = primit(p′, p2 , α′)) ] ∨
    [ ∃C, am , aM , θ, φ, ∆φ. α = ell(C, am , aM , θ, φ, ∆φ) ∧
      ∀γ, ρ′, p′, φ′, ∆φ′. (φ = φ′ ∨ ∆φ = ∆φ′) → ¬((ρ′ = primit(p1 , p′, γ) ∨ ρ′ = primit(p′, p2 , γ)) ∧ γ = ell(C, am , aM , θ, φ′, ∆φ′)) ] }
(6.4)
Boundit Now, for the definition of a boundit (see Figure 6.4) we have to consider the two primits composing it; they could be:
[Figure 6.4: the basic boundits, e.g. ll, eal and ial, obtained by connecting two primits at a middle point pm .]
In the second and third case we have to take into account the cases in which the arc is convex or concave.
Furthermore, let the boundit be composed of the primits ρ1 = primit(p1 , pm , X) and ρ2 = primit(pm′ , p2 , Y ), let lx,y denote a line passing through two points px and py , and let aligned(p1 , p2 , p3 ) be the equation of the straight line passing through the three points p1 , p2 , and p3 . Let primit(p1 , pm , γ) be a primit, and p be a point. We can now define the two notions of convex and concave, of ρ w.r.t. p2 , as follows:
We are now ready to introduce a definition for the set of boundits as follows.
Let pwc(x) be a term denoting the distance between two terms whenever they are
pointwise connected, according to suitable metric conditions that will be specified
further, in the context of the Connection Axioms:
β ∈ boundit(p1 , pm , p2 ) ≡
  ∃pm′ , α, α′. primit(p1 , pm , α) ∧ primit(pm′ , p2 , α′) ∧
  β = primit(p1 , pm , α) ⊕P primit(pm′ , p2 , α′) ∧ pwc(β) ≤ ε.
(6.5)
6.4. FIGURES AXIOMS 65
Now, given a general definition of the set of boundits, we detail each one in the following axiom, defining a boundit as the pointwise connection between two primits:
primit(p1 , pm , X) ⊕P primit(pm′ , p2 , Y ) = β ≡
  {β = ll(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = slope(primit(pm′ , p2 , Y ))} ∨
  {β = eal(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = ial(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = eaea(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Concave(primit(p1 , pm , X), p2 ) ∧ Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = iaea(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(p1 , pm , X), p2 ) ∧ Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = iaia(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(p1 , pm , X), p2 ) ∧ Convex(primit(pm′ , p2 , Y ), p1 )}.
(6.6)
Observe that the boundits iaea, ial and eal have a symmetric definition (respectively eaia, lia, and lea), that can be obtained substituting ρ1 for ρ2 , and by a mirror transformation. Indeed, we need only the six boundits mentioned above, because:
ρ1 ⊕P ρ2 = ρ2 ⊕P ρ1
Faces As seen in the previous paragraph there are 6 basic boundits: given that there are 3 kinds of primits (straight, concave and convex), the set of boundits is 3² = 9 minus the 3 symmetric ones. Now, considering the set of faces, we have to consider the admissible connections between boundits (see Figure 6.5 and the connection axioms): i.e. the pointwise and edgewise connections.
To form the faces by pointwise connection, we can consider that each of the 6 basic boundits can connect to itself and the others; we thus get 36 combinations, from which we have to take away the 6 repetitions. Thus we get 30 possible faces. To this set of 30 faces we have to add the set of 3-sided faces obtained by edgewise connections. There are 12 possible ways of combining 6 boundits with a common edge. Therefore we get 42 faces.
66 CHAPTER 6. AN ALGEBRA OF FIGURE
Figure 6.5: Edgewise connection (on the left) and Pointwise connection (on the right)
Figure 6.6: Two faces obtained by both Pointwise and Edge Wise Connection
In general, a term of sort face can be said to belong to the set of faces as follows, where both pwc and ewc are terms which are specified in the connection axioms:
In order to state what composition properties are allowed, in the next section we shall introduce a formalization of the distance between primitive elements in the image; furthermore we shall make precise notions like symmetry and parallelism, given at the beginning of the chapter.
Elliptical Faces
Quadrilateral Faces
Trilateral Faces
from the graph, in order to state whether they can be combined or not by a pointwise connection, it is enough to verify a specific distance between their end points. Analogously for the boundits and the faces, and for the other operators.
Before introducing the axioms we give some useful definitions that will be used in
the sequel.
The minimal distance between two primits is the distance between the closest end points; it is denoted with minDist(ρ1 , ρ2 ) and defined as follows, where δ(p, p′) is the Euclidean distance between p and p′:
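The definition of minDist can be sketched as follows, representing each primit only by its two end points (a simplifying assumption):

```cpp
#include <cmath>
#include <algorithm>

struct Point { double u, v; };

// Euclidean distance between two image points.
double dist(const Point& a, const Point& b) {
    return std::hypot(a.u - b.u, a.v - b.v);
}

// minDist(rho1, rho2): the distance between the closest pair of end
// points of two primits; otherDist would analogously be the distance
// between the remaining pair of end points.
double minDist(const Point& p1, const Point& p2,
               const Point& q1, const Point& q2) {
    return std::min({dist(p1, q1), dist(p1, q2), dist(p2, q1), dist(p2, q2)});
}
```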
ewc(ρ1 ⊕E ρ2 ) = minDist(ρ1 , ρ2 ) + otherDist(ρ1 , ρ2 ) ≡ straight(ρ1 ) ∧ straight(ρ2 )
(6.11)
On the other hand, when the primits considered are eaea, eaia, iaia then the ⊕E
operator is allowed as follows:
ewc(ρ1 ⊕E ρ2 ) = d1 + d2 + dρ + dφ + dc + da ≡
  ∃C1 , C2 , am1 , am2 , aM1 , aM2 , θ1 , θ2 , φ1 , φ2 .
  elliptic(ρ1 , ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 )) ∧
  elliptic(ρ2 , ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 )) ∧
  d1 = minDist(ρ1 , ρ2 ) ∧
  d2 = otherDist(ρ1 , ρ2 ) ∧
  dc = δ(C1 , C2 ) ∧
  dρ = |ρ1 − ρ2 | ∧
  dφ = |φ1 + φ2 | + |∆φ1 − ∆φ2 | ∧
  da = |am1 − am2 | + |aM1 − aM2 |
(6.12)
The parallel relation between primits depends on the kind of primits involved, and
is defined as follows. Let ρ1 and ρ2 be two primits:
par(ρ1 ⊕II ρ2 ) = dρ + dφ + da ≡
  ρ1 = primit(p1 , p2 , X) ∧ ρ2 = primit(p3 , p4 , Y ) ∧
  {X = m1 ∧ Y = m2 ∧ d = |m1 | − |m2 |} ∨
  {∃C1 , C2 , am1 , am2 , aM1 , aM2 , θ1 , θ2 , φ1 , φ2 .
   X = ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 ) ∧
   Y = ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 ) ∧
   dρ = |ρ1 − ρ2 | ∧
   dφ = |(φ1 + ∆φ1 /2) − (φ2 + ∆φ2 /2)| ∧
   da = |am1 − am2 | + |aM1 − aM2 |}
(6.13)
To establish the symmetry between two primits ρ1 and ρ2 (see Figure 6.10), we also generalize the straight primits to arcs of ellipses:
symm(ρ1 ⊕S ρ2 ) = dc + dρ + dφ + da ≡
  ρ1 = primit(p1 , p2 , ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 )) ∧
  ρ2 = primit(p3 , p4 , ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 )) ∧
  d = |φ1 − π/2| + |φ2 − π/2| ∨
  dc = δ(C1 , C2 ) ∧
  dρ = |ρ1 − ρ2 | ∧
  dφ = |φ1 − φ2 | − |(φ1 − φ2 ) + (∆φ1 − ∆φ2 )| ∧
  da = |am1 − am2 | + |aM1 − aM2 |
(6.14)
Case a., b.
pwc(β1 ⊕P β2 ) = minPwc(β1 , β2 ) + otherPwc(β1 , β2 ). (6.18)
6.5. CONNECTION AXIOMS 71
Case c. ,d. , e.
pwc[(aa1 ⊕P ll2 ) ⊕P (aa2 ⊕P ll1 )] = pwc(aa1 ⊕P ll2 ) + pwc(aa2 ⊕P ll1 ) (6.19)
Case f., g.
pwc(aa ⊕P al) = minPwc(aa, al) + otherPwc(aa, al) (6.20)
Analogously, there are six cases to be taken into account for the Edge Wise Connection relation, as shown in Figure 6.12. The ewc relation between boundits is introduced in order to combine a pair of L-junctions forming a valid U-junction (see Figure 6.13).
As we argued before, we have to constrain the connection, in order to avoid strange connections.
Figure 6.16: Some examples of faces generated by PWC and/or EWC among boundits.
Figure 6.17: P W C, EW C and Sym relations between faces. The dashed lines represent either an l-primit or an a-primit.
between boundits. Relations between faces, as well as relations between aspects, are a straightforward extension of the relations between boundits.
To exemplify the set of axioms, we shall therefore introduce a generalized definition of Symmetry, over faces, boundits and primits:
Here ε is an error threshold, and ΨF (g1 , g2 , d), ΨB (g1 , g2 , d), and ΨP (g1 , g2 , d) are, respectively, defined as follows:
Here fa and fb denote faces. In the previous section we have further specialized
the connection operator into point wise, and edge wise connections. ΨB (g1 , g2 , d) is
defined as follows:
ΨB (g1 , g2 , d) ≡ {(g1 = ll1 (x1 , x2 , x′) ∧ g2 = ll2 (x3 , x4 , x″)) ∨
  (g1 = aa1 (x1 , x2 , x′) ∧ g2 = aa2 (x3 , x4 , x″)) ∨
  (g1 = al1 (x1 , x2 , x′) ∧ g2 = al2 (x3 , x4 , x″))} ∧
  d1 = sym(xi ⊕S xj ) ∧ (xi = minDist(x1 , x2 )) ∧ (xj = minDist(x3 , x4 )) ∧ xi ≠ xj ∧
  d2 = sym(xh ⊕S xk ) ∧ (xh = minDist(x1 , x2 )) ∧ (xk = minDist(x3 , x4 )) ∧ xh ≠ xk ∧
  d = d1 + d2
Here ll1 and ll2 denote straight boundits; aai (where aa is any of the terms {eaea, eaia, iaea, iaia}) denotes elliptic boundits; and finally ali ∈ {ial, eal} denotes boundits which are obtained by the connection of the two primits ai and lj . ΨP (g1 , g2 , d) is obtained as follows:
ΨP (g1 , g2 , d) ≡
  (g1 = l1 (p1 , p2 , m1 ) ∧ g2 = l2 (p3 , p4 , m2 ) ∧ d = |φ1 − π/2| + |φ2 − π/2|) ∨
  (g1 = a1 (p1 , p2 , e1 ) ∧ g2 = a2 (p3 , p4 , e2 ) ∧
   dc = δ(C1 , C2 ) ∧
   dα = |α1 − α2 | ∧
   dφ = |φ1 − φ2 | − |(φ1 − φ2 ) + (∆φ1 − ∆φ2 )| ∧
   dr = |rm1 − rm2 | + |rM1 − rM2 | ∧
   d = dc + dα + dφ + dr )
All these definitions have been suitably implemented in Eclipse-Prolog, since they are used for inference in the Bayes-aspect network, which is discussed in the next chapter.
Chapter 7

A Bayes-aspect Network

In this Chapter, we describe the second step of the SymGeon recognition process. It is a compositional process that, starting from the set of primits obtained through the Syntactic Image Analysis, identifies boundits, faces and aspects in turn, using probabilistic reasoning and leading to the recognition of a specific SymGeon.
Definition 15 (roots of BAN) Let NP be the set of primit nodes of the BAN,
NP = {p1 , p2 , . . . , pK }
For each node p ∈ NP , there is a function ||·|| taking as arguments the image syntactic
graph and the primit p itself, and returning the set Pp of primits of the same kind,
occurring in the image syntactic graph. Each node p ∈ NP is thus decorated with all
its instances Pp occurring in the image syntactic graph. The family of sets {Pp }p∈NP
is called the root nodes of the BAN.
78 CHAPTER 7. A BAYES-ASPECT NETWORK
The root nodes are then labeled with the positions of the segments forming each
primit. Therefore the output of the image syntactic graph instantiates the BAN by
creating the root nodes. The probability of a node p ∈ NP is given by the probability
of its root nodes, as follows: if the node has roots then its probability is 1, otherwise
it is 0.
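The decoration and probability assignment of the root nodes can be sketched as follows. This is a hypothetical Python rendering; the representation of the image syntactic graph as a list of (kind, data) pairs is an assumption made only for illustration.

```python
def init_root_nodes(primit_kinds, syntactic_graph):
    """Decorate each primit node of the BAN with its instances P_p found in
    the image syntactic graph (the role of ||.|| in Definition 15)."""
    roots = {}
    for kind in primit_kinds:
        roots[kind] = [data for (k, data) in syntactic_graph if k == kind]
    return roots

def root_probability(roots, kind):
    """A primit node has probability 1 if it has at least one root instance
    in the image, and 0 otherwise."""
    return 1.0 if roots.get(kind) else 0.0

# Toy graph: two line primits and one arc, no ellipse.
graph = [("line", (0, 0, 1, 0)), ("line", (1, 0, 1, 1)), ("arc", (0.5, 0.5))]
roots = init_root_nodes(["line", "arc", "ellipse"], graph)
print(root_probability(roots, "line"))     # → 1.0
print(root_probability(roots, "ellipse"))  # → 0.0
```

Once computed, these root instances play the role of the evidence on which all subsequent BAN queries are conditioned.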
Once the network is initialized, its root nodes constitute the network evidence.
Thus the inference consists in querying the BAN about the probability that, in the
image syntactic graph, a specific aspect of a SymGeon is mentioned, given the evidence.
The query is as follows:
∃p∃x1 · · · xn aspect(x1 , · · · , xn ) ∧ Prob(aspect(x1 · · · xn )|x1 · · · xn ) = p (7.1)
It is easy to see that the required inference is twofold:
1. The first inference is D |= ∃p∃x1 · · · xn aspect(x1 , · · · , xn ). This inference
requires constructing the terms x1 · · · xn such that each xi will be a term in T⊕
mentioning only primits, e.g. ti = p1 (~p) ⊕C ⊕S p2 (~p′ ) p3 (~p′′ ); here ~p, ~p′ and ~p′′
denote the sets of values (end points, middle point, slopes, etc.) characterizing
the primits. This can be achieved because each aspect is defined in terms of faces
and connections, and each face is defined in terms of boundits and connections.
2. The second inference is possible just in case the first returns the set of terms
defined by the primits. It is a classical diagnostic inference, requiring the com-
putation of the composed probabilities of the paths leading from the specific
aspect node to the evidence nodes constituted by the primits composing the
specified query.
Observe that in item 1) the theory D is defined in the Situation Calculus and
denotes the theory of perception [76]. D also includes the theory of actions and fail-
ures (see [36]) and, finally, the Algebra of Figures AC, together with the definition
of each feature, such as aspects, faces and boundits.
The BAN is thus formalized in D simply by the definition of each feature, given
in FOL. For example, in equation (6.4) we have already given the definition of a
primit. Boundits are defined using primits and connections, faces are defined using
boundits and connections and, finally, aspects are defined using faces, boundits and
connections. Observe that the topological structure simply determines the minimal
set of boundits, faces and aspects that can appear in the BAN.
However, this might not be enough to establish with sufficient accuracy the existence
of a SymGeon in the scene. For this reason we need to define a threshold λ, which
will depend on the circumstances in which the scene has been taken.
Now we shall construct the set of hypotheses H as follows. Each query to the BAN
returns the probability that a given aspect appears in the image syntactic graph. If
the probability is greater than λ, then the recovered aspect is recorded and all the
primits used to compose the aspect are marked. Starting from the consideration that
a primit cannot belong to more than two aspects, if the primit is already marked,
it is simply deleted from the root nodes. Otherwise nothing is done, and a new query
is performed.
The BAN is queried until all the root nodes are deleted, or the set of root nodes
cannot contribute to the formation of an aspect.
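The hypothesis-formation loop just described might be sketched as follows. This is a hypothetical Python rendering (the thesis implements the inference in Eclipse-Prolog); the shape of `query_ban` and of the returned triples is an assumption, and `query_ban` is assumed to enumerate each candidate aspect once and return None when no aspect can be formed.

```python
def form_hypotheses(query_ban, root_primits, lam):
    """Repeatedly query the BAN; accept an aspect when its probability
    exceeds the threshold lam, mark the primits it uses, and delete a
    primit from the roots the second time it is marked (a primit cannot
    belong to more than two aspects)."""
    hypotheses, marked = [], set()
    roots = set(root_primits)
    while roots:
        result = query_ban(roots)      # → (aspect, probability, primits used) or None
        if result is None:             # no aspect can be formed any more
            break
        aspect, prob, used = result
        if prob > lam:
            hypotheses.append((aspect, prob))
            for p in used:
                if p in marked:
                    roots.discard(p)   # already marked once: delete from roots
                else:
                    marked.add(p)
    return hypotheses

# Toy run: two candidate aspects over primits {1, 2, 3}.
answers = iter([("aspectA", 0.9, [1, 2]), ("aspectB", 0.8, [2, 3]), None])
hyps = form_hypotheses(lambda roots: next(answers), {1, 2, 3}, 0.7)
print(hyps)  # → [('aspectA', 0.9), ('aspectB', 0.8)]
```

In the toy run, primit 2 appears in both accepted aspects, so it is deleted from the roots after the second acceptance, as prescribed by the two-aspects-per-primit constraint.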
Definition 17 (Bayes Network for SymGeon Recognition (BNSG )) It is a DAG
⟨(NSG ∪ NA ), E⟩ where:
iii. To each node of NSG is associated a CPT in the form given in Table 7.1.
P(g | a1 , . . . , ah )   a1    a2    · · ·   ah
0.8                        T     F    · · ·    F
0.6                        F     T    · · ·    F
...                       ...   ...   · · ·   ...
0.4                        F     F    · · ·    T
Table 7.1: CPT for a node of the Bayes Network for SymGeon Recognition
Note that the particular structure of the CPT is a direct consequence of the
considerations made above.
The set H of recovered aspects is used to instantiate the root node of the BNSG
and to infer the hypotheses about the SymGeons occurring in the scene. This inference
process I(h), which assigns to a given recovered aspect h = aspect(~x) ∈ H, with
Ph = Prob(h), the SymGeon s, maximizes the probability returned by the BNSG :
I(h) = s ⇐⇒ ∀s′ ∈ NSG . s′ ≠ s → Prob(s|h) · Prob(h) > Prob(s′|h) · Prob(h)
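The inference I(h) amounts to a maximum a posteriori choice over the SymGeon nodes of the BNSG. A minimal Python sketch (all data shapes hypothetical; the factor Prob(h) is common to every s, so the argmax reduces to maximizing Prob(s|h)):

```python
def infer_symgeon(symgeon_nodes, prob_s_given_h, h, prob_h):
    """Return the SymGeon s in N_SG maximizing Prob(s|h) * Prob(h)."""
    return max(symgeon_nodes, key=lambda s: prob_s_given_h[(s, h)] * prob_h)

# Toy conditional probabilities Prob(s | aspectA) for three SymGeon nodes.
table = {("cuboid", "aspectA"): 0.7,
         ("cylindroid", "aspectA"): 0.2,
         ("spheroid", "aspectA"): 0.1}
print(infer_symgeon(["cuboid", "cylindroid", "spheroid"], table, "aspectA", 0.9))
# → cuboid
```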
ll(L1, L2, B) :-
L1 = primit(P1,P2,ALPHA1),
L2 = primit(P3,P4,ALPHA2),
primit(P1,P2,ALPHA1),
primit(P3,P4,ALPHA2),
B = boundit(L1,L2,ll),
pwc(B)<EPSILON.
lea(L, A, B) :-
L = primit(P1,P2,ALPHA1),
A = primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
primit(P1,P2,ALPHA1),
primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
external(L,A),
B = boundit(L,A,lea),
pwc(B)<EPSILON.
lia(L, A, B) :-
L = primit(P1,P2,ALPHA1),
A = primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
primit(P1,P2,ALPHA1),
primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
internal(L,A),
B = boundit(L,A,lia),
pwc(B)<EPSILON.
eaea(A1, A2, B) :-
A1 = primit(P1,PM1,P2,E1),
A2 = primit(P3,PM2,P4,E2),
primit(P1,PM1,P2,E1),
primit(P3,PM2,P4,E2),
external(A1,A2),
external(A2,A1),
B = boundit(A1,A2,eaea),
pwc(B)<EPSILON.
eaia(A1, A2, B) :-
A1 = primit(P1,PM1,P2,E1),
A2 = primit(P3,PM2,P4,E2),
primit(P1,PM1,P2,E1),
primit(P3,PM2,P4,E2),
external(A1,A2),
internal(A2,A1),
B = boundit(A1,A2,eaia),
pwc(B)<EPSILON.
................
%-----------------------------------------------------------------------------------------
% FACES DEFINITION
%-----------------------------------------------------------------------------------------
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, L4, ll),
boundit(L1, L2, ll),
boundit(L3, L4, ll),
F=face(B1, B2, llll),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, A1, lia),
B2=boundit(L2, A2, lea),
boundit(L1, A1, lia),
boundit(L2, A2, lea),
F=face(B1, B2, lialea),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, A1, lia),
B2=boundit(L2, A2, lia),
boundit(L1, A1, lia),
boundit(L2, A2, lia),
F=face(B1, B2, lialia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, A1, lia),
boundit(L1, L2, ll),
boundit(L3, A1, lia),
F=face(B1, B2, lllia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, A1, lea),
boundit(L1, L2, ll),
boundit(L3, A1, lea),
F=face(B1, B2, lllea),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, iaia),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, iaia),
boundit(A3, A4, iaia),
F=face(B1, B2, iaiaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaia),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, eaia),
boundit(A3, A4, iaia),
F=face(B1, B2, eaiaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaia),
B2=boundit(A3, A4, eaia),
boundit(A1, A2, eaia),
boundit(A3, A4, eaia),
F=face(B1, B2, eaiaeaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaea),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, eaea),
boundit(A3, A4, iaia),
F=face(B1, B2, eaeaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaea),
B2=boundit(A3, A4, eaea),
boundit(A1, A2, eaea),
boundit(A3, A4, eaea),
F=face(B1, B2, eaeaeaea),
pwc(F)<EPSILON.
................
Chapter 8
Description
We have, then, two things to compare: (1) a name, which is a simple sym-
bol, directly designating an individual which is its meaning, and having
this meaning in its own right, independently of the meaning of all other
words; (2) a description, which consists of several words, whose meanings
are already fixed, and from which results whatever is to be taken as the
“meaning” of the description. ([92])
8.1 Introduction
Recognition is the process of associating a designator, i.e. something endowed with
an unambiguous meaning, with an object: the word “table” in the phrase “this is a
table” designates unambiguously the object indicated.
In the process of recognition the role of vision is that of associating an object of
the world with an inner, i.e. subjective, experience constituting the connotation or
meaning or interpretation of the object seen.
The inner experience, in the learning phase of what human artifacts are, is crucial
although never revealed: a child who sees an airplane for the first time has quite a
strong experience. The experience by itself is useless for denoting the perception, and
the child, pointing at the airplane, asks "what is this?". She wants to know the word
that she can use for denoting both the object seen and her inner experience.
Thus, although the recognition of the airplane is independent of the language, the
language makes the inner experience repeatable and recordable: the denotation has
been learned as soon as the word “airplane” has been grasped.
The role of language, and therefore of terms, words, nouns, names, descriptions,
etc., is that of designating the object seen, by associating a denotation with the percep-
tive action: the term "airplane" denotes the object seen, but also the inner experience.
The limits of our world are the limits of our language.
We argue that there are three components of the recognition process: seeing,
experiencing, denoting. The crucial part is the abstraction, categorization and gen-
eralization process underlying the experience. The result of this process is the agent's
ability to establish a correspondence between the object seen and the concept or
category built through the experience.
Finally, denotation is the very last component of the recognition process, and it
plays the role of the agent's awareness, as it means the ability to assign a noun, a
label, or a designator to each object in each situation.
In this sense recognizing a table (or even a person, although here we treat the
recognition of nothing but artifacts) implies that the recognized object can be used in a
reasoning process through denotation. E.g. the agent, once it has recognized the table,
can hold that "this table is the same as the one in X's studio".
Therefore the final process, i.e. that of denoting, amounts to the assignment of a
term to an object; this assignment consists in designating an object with an
unambiguous term, and this designator is anchored to the object so as to denote it in
every possible situation.
Both Russell and Kripke [92, 58] consider a designator as a word which has the
same reference in every possible world in which it has any reference at all. In particular,
they consider proper names as the only possible rigid designators. Actually,
(human-made) objects bear abstract nouns, and the difference with proper names
does not seem to be so meaningful from the point of view of designating power.
An abstract noun designates a category of objects sharing both a function and a
general inner structure (the noun "table" denotes a category of objects); on the other
hand, a proper name designates a set of individuals sharing very few characteristics,
e.g. the proper name "Mario Rossi" designates some (actually many) individuals
possibly having nothing in common but being male and Italian.
In other words, there is nothing in the language that can play the role of a rigid
designator in the peculiar sense of Kripke (see [58]). However, a rigid designator could
be constructed by coding. The problem is the following: is there a way of coding each
object in the environment? Furthermore, would this coding be suitable enough to
recognize an object?
There are several possible ways of coding: for example, storing a set of vantage
points of an object in a database can be considered a way of coding it, since a set of
matrices can be seen as a single string of numbers.
Let us introduce a total function C : X → Y , defined for each object in the domain
of artifacts X; here Y is the codification set, which is a very large set, but still finite.
Let us call C(x) the code of object x, i.e. its rigid designator. How do we associate
a reference with each C(x)?
In other words, suppose that we tell the agent that its goal is: "go to the telephone,
on the table, and pick up the receiver". The agent then needs to establish a
correspondence between what it sees and C(telephone) and C(table). This is the
crucial point: without this association no recognition is possible. Suppose that the
set of strings {1201179....0i}i≤n , for some n not too large, denotes the set of vantage
points of the telephone, and that the association is obtained by a function allowing the
comparison of parts of the image with the whole set of codings of all the artifacts in the domain.
The alignment method (see e.g. [54, 101, 100]) falls within this codification perspective.
This method has an evident drawback in a dynamic domain: if the vantage point
changes (while approaching the object, or because the object has been moved, etc.)
and the current one is not coded, then the robot gets lost, because it cannot find the
telephone, or because it finds something that is not a telephone.
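The drawback can be made concrete with a small sketch of recognition-by-coding (a hypothetical Python illustration; the codes and objects are toy values, not taken from the thesis):

```python
def recognize_by_coding(codebook, observed_code):
    """Recognition as pure lookup: return the object whose stored set of
    vantage-point codes contains the observed code, or None.  The
    brittleness is explicit: an uncoded vantage point is simply not found."""
    for obj, codes in codebook.items():
        if observed_code in codes:
            return obj
    return None

# Toy codebook of vantage-point codes per artifact.
codebook = {"telephone": {"1201179", "1201180"}, "table": {"7700001"}}
print(recognize_by_coding(codebook, "1201179"))  # → telephone
print(recognize_by_coding(codebook, "1201999"))  # → None (unseen vantage point)
```

The second query illustrates exactly the failure mode discussed above: a vantage point that was never coded yields no reference at all.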
Analogously, the ambiguity cannot be resolved if the coding is concerned with the
initial position of the object. For, suppose that each object in a scene is associated
with a position in the initial situation; then the following would axiomatize each
object:
Here the ellipsis indicates a sentence specifying the conditions under which the action
a leads to a change in the position of the object whose code is a suitable substitution
for the variable w. Observe that the successor state axiom above ensures that the
change in position is recorded; therefore the object is anchored by the coding. The
drawbacks of this approach are twofold:
1. If the actions are non-deterministic then the object might fail to be anchored.
The problem with the above methodologies is the following: by requiring a specific
set of instances for each object, they do not allow for a suitable generalization and
abstraction.
Although we do not have a solution for representing the inner experience connot-
ing the action of seeing a specific object, we consider the possibility of introducing
descriptions as the simplest way of defining an inner concept associated with an
image.
As Russell points out in his Introduction to Mathematical Philosophy, a description
might be either definite (the so-and-so) or indefinite, i.e. ambiguous (a so-and-so). So, for
example, a description such as "the table in my studio is made of wood and squared"
is definite; while "a table is an artifact made of a support, at most 110 cm tall, holding
a flat surface usually made of some smooth hard material like wood, glass, marble, or
some plastic material" is an indefinite description.
The difference between definite and indefinite descriptions is ontologically relevant,
but both kinds of description are meaningful in recognition, as a way to mimic the
learning experience with a definition in which the salient aspects of the experience are
captured together with the denotation. In other words, a description amounts to an
abstract noun, e.g. table, defined in terms of its primitive components, the SymGeons,
recognized at an earlier stage.
In this chapter we shall describe the role of descriptions in capturing a suitable
generalization of an artifact, and the way these descriptions are matched against a
reference. In the next section we shall briefly introduce the analysis of the scene;
then in Section 8.3 we introduce the notion of description using a single
example; finally we discuss a parsing algorithm that is used to parse the image
graph.
1. Individuating all the SymGeons in the scene, and establishing them as hypotheses
by attributing to them a specific probability. One of the crucial problems here is to
avoid repetitions: we need to ensure that the same elements of the image are
not used for more than one SymGeon.
2. Establishing all the relationships holding between the recovered SymGeons, and
between the SymGeons and the image graph, just in terms of distances. These
relationships are covered by the relationships between and among aspects, e.g.
two aspects are parallel if, given their symmetry axes, the distance between the
two SymGeons is measured on the symmetry axis.
3. Defining accurate descriptions for each expected object, coherently with the suc-
cessor state axioms for perception.
Many of the above steps are only hinted at, as they are still under experi-
mentation. However, they deserve a suitable discussion, which will be given in the next
sections.
Figure 8.1: The graph of the scene in which a table and a chair have been singled out of a
set of symgeons.
With the above representation we can follow the problem solving approach to
perception [89, 90], in which perception is interpreted as a mental event subject to
causal laws. On the basis of the assumption that an agent experienced some prior
perceptions, and that a perception-perception chain of causality can be represented,
we define causal laws for perception. Obviously there is a frame problem to be solved
also for perception, and we adopt Reiter's solution (see [86] for a full and very
detailed description). A successor state axiom for the fluent Percept accounts for
the correlation and causal laws among percepts: perceived orientation is the
suitable context for perceived shape, perceived adjacency explains the perception of
the composition of an object, and so forth [90]. See the above Example 4.
For each fluent F (~x, s) in the language, which is observable, i.e. it can be sensed,
a perceptible isF (~x) is introduced into the language, together with a successor state
axiom of the kind:
Here ΨisF (~x, v, a, s) is a sentence describing what should be true both in terms of
other previous percepts and in terms of properties holding in the domain, to make
perception hold about the perceptible isF in the situation in which a sensing action
has taken place. Each sensing action, in turn, has a precondition axiom of the kind:
Observe that a successor state axiom like the one for Percept does not interfere
with the successor state axioms for fluents, which means that we have two threads of
transformations: an inner thread (the agent's inner world) traced by the history of
actions, through Percept plus the perceptibles isF , and an outer thread (the agent's
external world) traced by the history of actions, through the fluent F . These two
threads can be convergent or divergent. If they converge, what is perceived can be
added to the database, although the addition is non-monotonic, since it is added
as a hypothesis. If they diverge, nothing can be added and the agent records a
mistake. A mistake is not an inconsistency, which can never occur through sensing and
percept. This reasoning process is called meaningful perception. Inference, according
to meaningful perception, is done using regression, so if D is the theory of action and
perception, and
D |=
Percept(isTable(a), 1, s) ∧ Percept(isTelephone(c), 1, s) ∧
Percept(isOn(c, a), 1, s) ∧
s = [sense(isTable(a), 1), . . . , sense(isOn(c, a), 1), . . . ,
sense(isTelephone(c), 1)] ∧
∀p. p = isTable(a) ∨ p = isTelephone(c)
∨ p = isOn(c, a) → ¬Mistaken(p, s)
Calculus, that mentions a situation σ of the kind do(ak , . . . , a1 , S0), with [a1 , . . . , ak ]
actions, into an equivalent expression Ψ(S0 ) mentioning only S0 , the initial situation.
On the other hand if we had:
D |=
Percept(isTable(a), 1, s) ∧ Percept(isTelephone(c), 1, s) ∧
Percept(isOn(c, a), 1, s) ∧
s = [sense(isTable(a), 1), . . . , sense(isOn(c, a), 1), . . . ,
sense(isTelephone(c), 1)] ∧
¬On(c, a, s) → Mistaken(isOn(c, a), s)
then a mistake would be recorded and, in this case, nothing could be added to the
initial database, although no inconsistency is caused by the mistake.
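The comparison between the inner and outer threads can be sketched as follows (a hypothetical Python illustration of the convergence/divergence check; the actual inference is done by regression in the Situation Calculus, not by dictionary lookup):

```python
def meaningful_perception(percepts, world):
    """Compare the inner thread (what was perceived) with the outer thread
    (what holds in the world): convergent percepts become hypotheses,
    divergent ones are recorded as mistakes -- never as inconsistencies."""
    hypotheses, mistakes = [], []
    for p, perceived_value in percepts.items():
        if world.get(p, perceived_value) == perceived_value:
            hypotheses.append(p)   # threads converge: add (non-monotonically)
        else:
            mistakes.append(p)     # threads diverge: record a mistake
    return hypotheses, mistakes

# Toy example mirroring the formulas above: the table is really there,
# but the telephone is not on the table.
percepts = {"isTable(a)": True, "isOn(c, a)": True}
world = {"isTable(a)": True, "isOn(c, a)": False}
print(meaningful_perception(percepts, world))
# → (['isTable(a)'], ['isOn(c, a)'])
```

Note that a recorded mistake leaves the database untouched, reflecting the point above that a mistake never introduces an inconsistency.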
Now, the problem of perception consists essentially in managing the input data
obtained by sensors (e.g. the camera), processing them and suitably adding the results
to the theory of action and perception as hypotheses so that the following will hold:
To understand what H is, and the role of sensing actions, consider the following simple
example. There is a table and a chair in a room and an observation of the scene is
performed, i.e. a shot of the scene is taken (we open our eyes and look into the room);
let us stop at the instant before we make sense of what there is in the room. Clearly,
at the very moment in which the image is taken, no distinction among objects is made.
Therefore it is not a single sensing action like sense(isTable(x), v) that takes place,
but a scene/image acquisition.
From the image acquisition to the inference leading to the hypothesis that there
might be a table in the room, a complex process of revelation takes place: one
bringing the shapeless and scattered components identified in the image to the surface
of cognition1 , by giving a structure to these components. And there is a step in the
structuring that reveals the meaning: "that's a table". In other words, the re-cognition
process is a thread of revelations (the apperception) giving, i.e. attributing, meaning
to the elements of the image. This is achieved by conjugating the construction of a tight
data structure (a graph of all the SymGeons occurring in the scene together with their
topological relations), which is the hypothesis H, with the meaning given
by a description and denoted by a sensing action like sense(isTable(x), v). Therefore
the sense(isTable(x), v) action has, indeed, the double role of giving sense to the
elements of the data structure and of bringing to the surface of cognition the existence
of an object, a table in fact.
To understand what we mean, let us go through the example of the table. We might
simplify the successor state axiom in (8.1) as follows:
image. Each node of the syntactic graph represents a junction point between
segments and is labeled with its 2D position in the image. Each
edge of the syntactic graph represents a segment and is labeled by a primit, of
which we know the type and the specific information.
Therefore the term ref erence denotes the graph of the scene. When the inference
matches the term P oss(sense(isT able(x), 1), s), a description of the table is asked
for. A description should comply with the following conditions:
1. It has to be general enough to capture several possible shapes (at least the most
common) the specified object can be represented by.
2. It has to match several views of the object, including occlusion: e.g. the table
has just three legs, the fourth being occluded.
3. The description is a term t ∈ T⊕ having a recursive structure as follows: the
head of t is a symgeon and, recursively, the tail is decomposed into head and
tail, the head is the principal node.
In other words, the description gives sense to a term: it is the semantics of a term,
while its syntax is the structure of the object: e.g. the graph of the table (see 8.5)
is the syntactic structure of a table, while the meaning of the table is given through its
description (see the next axiom (8.4)). Matching is achieved by a parse function.
The parse function is a call to a parsing algorithm that analyzes the term while
transforming it by applying the rewriting rules introduced in the Algebra of Figures
(see Chapter 5).
Each object in the scene will be a term t (which should be represented with
a principal node; we shall deal with this in the next section); such a term t holds a
description, which is the prior knowledge of the agent about the artifact. For example,
if t is a table then a description of the table will be as follows:
The above successor state axiom tells us that a description is a fluent, obviously de-
pending on the situation, and the action is exactly that of matching the scene by
parsing (we shall introduce the parsing algorithm in the next section). In particular,
a table is made of a hard and flat top and a legged support, in which the legs can
be one or many, but they have to be of the same shape and material. In particular,
the properties concerning the predicates Flat, Hard and Legged can be, for example,
being of a given height, of a selected material, etc. Furthermore, here reference is
the scene graph, in which the table has to be found; the table is supposed to be a term
with a principal node, which is referred to as tableReference. Having delegated the
definition of the shape and structure of the top to the predicates Flat and Hard, and
having delegated the existence of the legs to Legged, the two sentences φ(l, γ) and
φ(l′ , γ ′ ) denote the possible SymGeons (e.g. cylindroid, cuboid, etc.) that represent
a leg, and in this last case also the possible number of legs. Observe, also, that the
value taken by the variable x is the anchoring of the perceptible isTable(x) to the
object, described by the term, through its principal node, which in this case is the
top of the table. In particular, the variable x might denote the position of the table,
and it will be associated with the term identifying the specific table. Therefore any
action on the table (e.g. moving the table, putting it upside down, etc.) will affect
that term, i.e. its position. If another table is looked for, this will be identified by a
new name/position since, for each perceptible isF (~x), the following will hold:
together with the unique-names axioms for perceptibles, analogous to the set of unique-
names axioms for actions. It is clear that building successor state axioms for descriptions is
rather hard; the goal will be that of compiling them. We do not yet know whether this will
ever be possible by learning.
C= Connected: ⊕C
P= Parallel: ⊕P
S= Symmetric: ⊕S
T= Orthogonal: ⊕T
V= Angular: ⊕V
Table 8.1: Relations between symgeons in the scene and their functional denotation.
All relations are reflexive.
For example, suppose that the graph referencing the table is the one given in
Figure 8.3.
Let G1 (~a, ~e, ~K, k) be the cylindroid representing the top of the table, for suitable
values of ~a, ~e, ~K and k, and let G2 (~a′ , ~e′ , ~K ′ , k ′ ) be the tapered cuboid representing
the legs of the table, for suitable values of ~a′ , ~e′ , ~K ′ and k ′ . Let γ1 = ⟨~a, ~e, ~K, k⟩ and
γ2 = ⟨~a′ , ~e′ , ~K ′ , k ′ ⟩. Finally, let a, b, c, d denote the position pos of each SymGeon; then
the term of the table in the scene is defined as follows:
This can be transformed into the following term, with a principal node.
[Figure: the term graph with principal node g1 (a), connected (⊕C ) to g2 (b), g2 (c)
and g2 (d), which are pairwise parallel (⊕P ).]
In order to apply the parsing algorithm we need to suitably rearrange the terms
of S. The following normal form theorem ensures that this rearrangement is
possible and that each term denoting an artifact can be devised in the scene graph.
S = τ1 . . . τk
Proof. First observe that by applying Proposition 2 to S we can get a graph G having
a principal node; therefore G is already in normal form. Now, on the term G we can
use the distributive laws (see 6.3.1, 5 and 6) so as to split the term G and obtain a
flatter term, still in normal form.
Let S be the scene graph, already in normal form; furthermore, assume that S is
indeed the term mentioned in a successor state axiom Scene(isF (~x), S, do(a, s)).
Clearly, when the parsing algorithm is invoked, it will be invoked with a given artifact,
e.g. isTable(x). Observe that in the description axiom, through the parsing action,
the term isTable(x) is described in terms of the algebra F; that is, there is a term
tableReference = top ⊕C ⊕T support saying to look inside the scene graph S (i.e.
reference) for a term of the form τ1 ⊕C ⊕T τ2 which has a principal node, namely
τ1 , and whose tail (namely τ2 ) is further specified in the description axiom. This last
term, namely τ1 ⊕C ⊕T τ2 , will be returned and further analyzed.
The parse algorithm, therefore, takes as input two terms t and t′ , in which the second
term has to be a term of F, and it outputs a term τ which is the structure of t in
terms of the algebra.
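As a rough illustration of this input/output behavior, a recursive matcher over terms represented as (head, operator, tail) tuples might look as follows. This is a hypothetical Python sketch: the tuple encoding, the wildcard "_", and the toy matches function are all assumptions made for illustration, and the real algorithm operates on terms of the algebra F with the rewriting rules of Chapter 5.

```python
def parse(term, pattern):
    """Look inside the scene term for a subterm matching the pattern
    (head, operator, tail); return the matching subterm or None."""
    if matches(term, pattern):
        return term
    if isinstance(term, tuple):                # descend into head and tail
        for sub in (term[0], term[2]):
            found = parse(sub, pattern)
            if found is not None:
                return found
    return None

def matches(term, pattern):
    """Toy structural match: '_' matches anything; tuples must agree on
    the connection operator and match componentwise."""
    if pattern == "_":
        return True
    if isinstance(pattern, tuple):
        return (isinstance(term, tuple)
                and term[1] == pattern[1]      # same connection operator
                and matches(term[0], pattern[0])
                and matches(term[2], pattern[2]))
    return term == pattern

# Scene: a top connected-orthogonally to a support made of two parallel legs.
scene = ("top", "CT", ("leg1", "P", "leg2"))
print(parse(scene, ("_", "CT", "_")))  # → ('top', 'CT', ('leg1', 'P', 'leg2'))
```

Here the returned subterm plays the role of the τ1 ⊕C ⊕T τ2 term above: the head ("top") is the principal node, and the tail can be analyzed further against the rest of the description.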
Example 5 Let S be the graph of the image as depicted in Figure 8.2, let isT able(a)
be the queried term, and top ⊕C ⊕T support be the term tableRef erence which has
Before introducing the algorithm parse(t, t1 , t2 ), let us specify the function matches(f, g),
that will be used by the algorithm, as follows:
with their relationships, have been identified (with a reasonable accuracy), then the
sensing action sense(isTable(x), v) is a query to such a data structure, defined by
the recognition system, asking whether all the elements which are expected to be the
components of the abstraction of the object (the table in this case) belong to the data
structure, and stand in the expected relationships with each other. If this is achieved
through the parsing algorithm, matching the term given by the description with the
scene graph, i.e. the reference, then the object is recognized and, through the predicate
Percept(isTable(p), 1, s), the object will be added to the initial database, decorated
with all the information recovered by the recognition process.
Chapter 9
Experimental Results
In this Chapter, we provide experimental results to demonstrate the various as-
pects of our vision reasoning system. As already mentioned in Section 1.2, we
have implemented the Syntactic Image Analyzer as a C++ module, and the rea-
soning system (Bayes-aspect network and feature description), used to recover all the
SymGeons in the image, as a Prolog program, through the Algebra of Figures described
in Chapter 6.
Other components of the framework, like the scene graph parser, have also been
implemented, but they cannot yet be integrated into the system because we are still
working on the ideas presented in Section 1.2 regarding the 3D reconstruction and
the scene graph recovery.
Other considerations about the experimentation concern the test images. As men-
tioned in Section 10.2, the analysis capabilities of the Syntactic Image
Analyzer are strictly related to the ability of the edge detector. Up to now, we have
used the Matrox Imaging Library (MIL) for the execution of basic image processing op-
erations, like acquisition, filtering, etc., and in particular for edge detection.
Unfortunately, to execute this operation MIL uses a simple technique based on
convolution and zero crossing [45]. This implies that the quality of the binary image
obtained using MIL's edge detector is influenced by environmental conditions:
illuminance, light direction, shadows, etc.
Since our goal is to test the ability of the reasoning system, we have preferred to
avoid compromising the results with the edge detector's inaccuracy, by using synthesized
images.
this platform has an Intel Pentium III (400 MHz) and Windows 2000 as operating
system2 .
However, some of the basic modules realized in C++ for the Syntactical Analyzer
have also been tested and used for the vision system of Mr. ArmHandOne [18] (see Figure
1.1), on the occasion of the Mobile Robot Competition and Exhibition of AAAI-2002 (28
July - 1 August 2002).
The robot is endowed with many sensors, thanks to which it can perceive the
world and know its internal state. They comprise two cameras mounted
2 In the future we are seriously considering moving our platforms to Linux.
Chapter 10
Conclusion
the network, a query to the BAN asks for the probability that a specific aspect of
a SymGeon occurs in the image (actually in the image syntactic graph), given the
evidence:
Prob(aspect(~t ) | ~t )
If the computed probability is greater than a given threshold λ, that depends on the
image acquisition, then the SymGeon, associated with such aspects, is added as an
hypothesis to the knowledge base.
Such a SymGeon will then be elaborated according to binocular reconstruction:
its 3D shape will be recovered, and so will its position. The
set of SymGeons obtained by querying the Bayes-aspect network contributes to
the formation of the scene graph. The scene graph is finally queried by single sensing
actions, whose role is to give meaning to the scene graph.
The formalization of perception in the Situation Calculus allows one to infer, via
the sensing actions, the existence of specific objects in the scene. Furthermore, because
of its symbolic description, an object is anchored to the data structure; therefore
its dynamics can be suitably traced.
10.2 Discussion
In this section we emphasize and discuss some weaknesses and strengths of our ap-
proach.
3D object recognition is a complex problem and indeed a hard test domain for our
system. Thanks to its generality, our framework can be used in a variety of different
domains. For example, remaining in the area of object recognition, if the underlying
approach is compositional, the system can be used with other primitives, since they
can be decomposed into primits.
Corridor detection is another possible domain, which we have tested in the maze
navigation problem of Mr ArmHandOne [18], introduced in Section 9.1. Descriptions
of different panel configurations are given to the system to recognize straight corridors
and left or right recesses, using suitable definitions in the Algebra of Figure. Some
screenshots of the application are depicted in Table 10.1. Note that an important
advantage of this domain is the possibility of discarding elliptic arcs, which simplifies
the reasoning process.
A considerable drawback of our approach is its dependency on the accuracy of the
edge detector. In our system we have used the Matrox Image Library (MIL) to perform
low-level image processing operations. This library uses a convolution operator for edge
detection, so the capabilities of our system are strictly dependent on environmental
conditions like illumination, light direction, shadows, etc.
However, this limitation can probably be overcome by using more sophisticated fil-
tering and edge detection operators, such as the Sobel filter or the Canny edge detector [32].
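As an illustration of the first alternative, a Sobel gradient-magnitude filter can be sketched in a few lines (a minimal NumPy transcription of the standard 3x3 Sobel kernels; a real system would call an optimized library routine, and the threshold below is chosen only for the example):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via the Sobel operator (a sketch; the loops make
    the convolution explicit at the cost of speed)."""
    img = np.asarray(img, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T                       # vertical-gradient kernel
    h, w = img.shape
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy)

img = np.zeros((6, 6)); img[:, 3:] = 1.0   # vertical step edge
mag = sobel_magnitude(img)
edges = mag > 2.0                          # illustrative threshold
```

Unlike plain zero crossing, the Sobel response is a graded magnitude, so the binarization threshold can be adapted to the illumination of the scene.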
Other limitations are due to the underlying assumptions of the approach. We
have assumed that the object of interest can be decomposed into parts. This is not
generally true, in particular for natural objects, so we have focused our
attention on a specific subset of artifacts.
By taking these limitations into consideration, a more powerful vision reasoning
system for object recognition can be developed.
Bibliography
[1] Asher, N., and Vieu, L. Toward a geometry of common sense: A semantics
and a complete axiomatization of mereotopology, 1995.
[2] Barequet, G., and Sharir, M. Partial surface matching by using directed
footprints. In Symposium on Computational Geometry (1996), pp. C–9–C–10.
[6] Bennett, B., Cohn, A., and Isli, A. Combining multiple representations
in a spatial reasoning system. In Proceedings of the 9th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach,
CA (1997), pp. 314–322.
[7] Bennett, B., Cohn, A., Wolter, F., and Zakharyaschev, M. Multi-
dimensional modal logic as a framework for spatio-temporal reasoning. Applied
Intelligence (2001). To appear.
[8] Bennett, B., Cohn, A. G., Torrini, P., and Hazarika, S. M. A founda-
tion for region-based qualitative geometry. In Proceedings of ECAI-2000 (2000),
W. Horn, Ed., pp. 204–208.
[9] Beymer, D., and Poggio, T. Face recognition from one example view. Tech.
Rep. AIM-1536, 1995.
[11] Biederman, I., and Cooper, E. Evidence for complete translational and
reflectional invariance in visual object priming. Perception 20 (1991), 585–593.
[14] Cederberg, R. Chain-link coding and segmentation for raster scan devices.
CGIP 10 (1979), 224–234.
[15] Chella, A., Frixione, M., and Gaglio, S. A cognitive architecture for
artificial vision. Artificial Intelligence 89, 1-2 (1997), 73–111.
[16] Chella, A., Frixione, M., and Gaglio, S. Understanding dynamic scenes.
AI 123, 1-2 (October 2000), 89–132.
[17] Chella, A., Frixione, M., and Gaglio, S. Conceptual spaces for computer
vision representations. Artificial Intelligence Review 16, 2 (2001), 137–152.
[18] Cialente, M., Finzi, A., Mentuccia, I., Pirri, F., Pirrone, M., Ro-
mano, M., Savelli, F., and Vona, K. The mr armhandone project. a mazes
roamer robot. In Proceedings of the The Mobile Robot Workshop AAAI-2002
(2002).
[20] Clementini, E., Felice, P. D., and van Oosterom, P. A small set of
formal topological relationships suitable for end-user interaction. In Advances
in Spatial Databases, Third International Symposium, SSD’93, Singapore, June
23-25, 1993, Proceedings (1993), D. J. Abel and B. C. Ooi, Eds., vol. 692 of
Lecture Notes in Computer Science, Springer, pp. 277–295.
[22] Cohn, A., and Varzi, A. Modes of connection. Spatial Information Theory-
Cognitive and Computational Foundations of Geographic Information Science,
Lecture Notes in Computer Science 1661, Springer, Berlin (1999), 299–314.
[25] Randell, D., Witkowski, M., and Shanahan, M. From images to bodies: Mod-
elling and exploiting spatial occlusion and motion parallax. In Proceedings
of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01)
(2001), pp. 57–63.
[26] Dickinson, S., Bergevin, R., Biederman, I., Eklundh, J., Munck-
Fairwood, R., Jain, A., and Pentland, A. Panel report: The potential of
geons for generic 3-d object recognition, 1997.
[43] Gigus, Z., Canny, J., and Seidel, R. Efficiently computing and representing
aspect graphs of polyhedral objects. T-PAMI 13 (1991), 542–551.
[44] Gigus, Z., and Malik, J. Computing the Aspect Graph for Line Drawings
of Polyhedral Objects. In Proc. IEEE 1988 Conf. on Computer Vision and
Pattern Recognition (1988), pp. 654–661.
[46] Grant, G., and Reid, A. An efficient algorithm for boundary tracing and
feature extraction. CGIP 17, 3 (November 1981), 225–237.
[49] Halı́ř, R., and Flusser, J. Numerically stable direct least squares fitting of
ellipses. In Proc. 6th International Conference in Central Europe on Computer
Graphics and Visualization. WSCG ’98 (Plzeň, Czech Republic, Feb. 1998),
CZ, pp. 125–132.
[51] Hopfield, J. Neural networks and physical systems with emergent collective
computational abilities. In Proceedings of the National Academy of Scientists
(1982), vol. 79, pp. 2554–2558.
[55] Jaklič, A., Leonardis, A., and Solina, F. Segmentation and Recovery of
Superquadrics, vol. 20 of Computational imaging and vision. Kluwer, Dordrecth,
2000. ISBN 0-7923-6601-8.
[56] Koenderink, J., and van Doorn, A. The internal representation of solid
shape with reference to vision. Biological Cybernetics 32 (1979), 211 – 216.
[57] Kriegman, D., and Ponce, J. Computing exact aspect graphs of curved
objects: Solids of revolution. WI3DS 89 (1990), 116–122.
[59] Lespérance, Y., Levesque, H., Marcu, D., Reiter, R., and Scherl,
R. B. A logical approach to high-level robot programming: A progress report.
Control of the Physical World by Intelligent Systems: Papers from the 1994
AAAI Fall Symposium (1994), 79–85.
[60] Levesque, H., Pirri, F., and Reiter, R. Foundations for the situation
calculus. Linköping Electronic Articles in Computer and Information
Science 3, 18 (1999), 159–178.
[61] Levesque, H., Reiter, R., Lesperance, Y., Lin, F., and Scherl, R.
Golog: A logic programming language for dynamic domains. Journal of Logic
Programming 31 (1997), 59–84.
[62] Lowe, D. Perceptual organization and visual recognition, vol. 23. Kluwer
Academic Publishers, Massachusetts, 1985.
[63] Lowe, D. Visual recognition from spatial correspondence and perceptual or-
ganization. IJCAI 85 (1985), 953–959.
[65] Marr, D., and Nishihara, H. Representation and recognition of the spatial
organization of three-dimensional shapes. In Proc. R. Soc. Lond. B, vol. 200
(1978), pp. 269–294.
[66] McCarthy, J., and Hayes, P. Some philosophical problems from the stand-
point of artificial intelligence. Machine Intelligence 4 (1969), 463–502.
[68] O’Rourke, J. Art gallery theorems and algorithms. Oxford Univ. Press (1987).
[70] Palm, G. On associative memory. Biological Cybernetics 36 (1980), 82–86.
[76] Pirri, F., and Finzi, A. An approach to perception in theory of actions: Part
i. ETAI 4 (1999), 19–61.
[78] Pirri, F., and Reiter, R. Some Contributions to the Metatheory of the
Situation Calculus. Journal of the ACM 3 (1999), 325–361.
[79] Pirri, F., and Reiter, R. Planning with natural actions in the situation
calculus. In Logic-Based Artificial Intelligence (2000), J. Minker, Ed., Kluwer.
[81] Plantinga, W., and Dyer, C. Visibility, occlusion, and the aspect graph.
International Journal Computer Vision 5(2) (1990), 137–160.
[82] Ponce, J., Petitjean, S., and Kriegman, D. Computing exact aspect
graphs of curved objects: Algebraic surfaces. ECCV 92 (1992), 599–614.
[83] Pope, A., and Lowe, D. Learning object recognition models from images. In
ICCV93 (1993), pp. 296–301.
[84] Reiter, R. The frame problem in the situation calculus: A simple solution
(sometimes) and a completeness result for goal regression. In Artificial Intel-
ligence and Mathematical Theory of Computation: Papers in Honor of John
McCarthy, V. L. (Ed.), Ed. Academic Press, 1991, pp. 359–380.
[87] Reiter, R., and Mackworth, A. A logical framework for depiction and
image interpretation. AI 41 (1989), 125–155.
[96] Stewman, J., and Bowyer, K. Direct construction of the perspective pro-
jection aspect graph of convex polyhedra. CVGIP 51 (1990), 20–37.
[97] Terzopoulos, D., and Metaxas, D. Dynamic 3d models with local and
global deformations: Deformable superquadrics. IEEE Trans. on PAMI 13(7)
32 (1991), 703–714.
[99] Tsotsos, J., Verghese, G., Dickinson, S., Jenkin, M., Jepson, A., Mil-
ios, E., Nuflot, F., Stevenson, S., Black, M., Metaxas, D., Culhane,
S., Ye, Y., and Mann, R. Playbot: A visually-guided robot for physically
disabled children. Image and Vision Computing 16, 4 (April 1998), 275–292.
[100] Ullman, S. High-level vision: Object recognition and visual cognition, 1996.
[104] Wu, K., and Levine, M. 3D object representation using parametric geons.
Tech. Rep. CIM-93-13, CIM, 1993.
[105] Wu, K., and Levine, M. Parametric geons: A discrete set of shapes with
parameterized attributes. In SPIE International Symposium on Optical En-
gineering and Photonics in Aerospace Sensing: Visual Information Processing
III (Orlando, FL, April 1994), vol. 2239, pp. 14–26.
Appendix A
Ellipse Fitting
An ellipse is a special case of a general conic, which can be described by an implicit
second-order polynomial:
$$F(\mathbf{x}) = \mathbf{x} \cdot \mathbf{a} = ax^2 + bxy + cy^2 + dx + ey + f = 0$$
where $\mathbf{a} = [a, b, c, d, e, f]^\top$ is the parameter vector formed by the ellipse's coefficients,
while $\mathbf{x} = [x^2, xy, y^2, x, y, 1]$.
The fitting of a general conic to a set of points $(x_i, y_i)$, $i = 1 \ldots N$, may be ap-
proached by minimizing the sum of the squared distances of each point to the conic
represented by the parameter vector $\mathbf{a}$:
$$\min_{\mathbf{a}} \sum_{i=1}^{N} [D(p_i, \mathbf{a})]^2 \qquad (3)$$
In general there are two main suitable distances $D$ for ellipse fitting: the Euclidean
distance and the algebraic distance. The Euclidean distance is only partially satisfac-
tory, because it requires the introduction of approximations that turn the problem into
a nonlinear minimization, solvable only numerically. The algebraic distance,
instead, differs from the true geometric distance between a curve and a point. In
this sense we start with an approximation; however, it is the only approximation we
introduce, since the algebraic distance turns the minimization into a linear
problem that we can solve in closed form and with no further approximations.
The minimization problem becomes:
$$\min_{\mathbf{a}} \sum_{i=1}^{N} F(x_i, y_i)^2 = \min_{\mathbf{a}} \sum_{i=1}^{N} [F_{\mathbf{a}}(\mathbf{x}_i)]^2 = \min_{\mathbf{a}} \sum_{i=1}^{N} [\mathbf{x}_i \cdot \mathbf{a}]^2 \qquad (4)$$
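A minimal sketch of this unconstrained least-squares solution (assuming NumPy): the minimizer of $\|D\mathbf{a}\|$ with $\|\mathbf{a}\| = 1$ is the right singular vector of the design matrix $D$ associated with the smallest singular value.

```python
import numpy as np

def fit_conic_lsq(x, y):
    """Unconstrained least-squares conic fit: argmin ||D a|| with ||a|| = 1
    is the last right singular vector of D. The result is a general conic,
    not necessarily an ellipse."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D)
    return Vt[-1]                 # parameter vector a = (a, b, c, d, e, f)

# Sample points on the unit circle x^2 + y^2 - 1 = 0 (a particular conic).
t = np.linspace(0.0, 2 * np.pi, 20, endpoint=False)
x, y = np.cos(t), np.sin(t)
a = fit_conic_lsq(x, y)           # proportional to (1, 0, 1, 0, 0, -1)
```

Nothing here forces the recovered conic to be an ellipse; that is exactly the motivation for the constrained formulation discussed in the remainder of this appendix.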
The problem can be solved directly by the standard least-squares approach, but the
result of such a fitting is a general conic, and not necessarily an ellipse. To ensure an
ellipse, the appropriate constraint ($b^2 - 4ac < 0$) has to be imposed. Such a system is
hard to solve in general; however, because $\alpha\mathbf{a}$ represents the same conic as $\mathbf{a}$ for any
$\alpha \neq 0$, we are free to arbitrarily scale the coefficients $\mathbf{a}$. Thanks to this property, under
a proper scale the inequality constraint mentioned above can be changed into an equality one:
$$4ac - b^2 = 1 \qquad (5)$$
Next, equation (9) can be solved by using generalized eigenvectors. There exist
up to six real solutions $(\lambda_j, \mathbf{a}_j)$, but because $\mathbf{a}^\top S \mathbf{a} = \lambda\, \mathbf{a}^\top C \mathbf{a} = \lambda$,
we are looking for the eigenvector $\mathbf{a}_k$ corresponding to the minimal positive eigenvalue
$\lambda_k$. Finally, after a proper scaling ensuring $\mathbf{a}_k^\top C \mathbf{a}_k = 1$, we get the solution of the
minimization problem, which represents the best-fit ellipse for the given set of points.
The approach that we have described has several drawbacks. First, the matrix $C$ is
singular, and the matrix $S$ is also singular if all data points lie exactly on an ellipse.
Moreover, the computation of the eigenvalues is numerically unstable and can produce
wrong results (such as infinite or complex numbers). Another problem of the algorithm is
the localization of the optimal solution of the fitting. Unfortunately, the assumption
that there exists one positive eigenvalue (corresponding to the optimal solution) is
not true. In fact, as noted above, in the ideal case when all data points lie exactly on
an ellipse, the eigenvalue is zero. Moreover, a numerical computation of the eigenvalues
can produce an optimal eigenvalue that is a small negative number. In such situations
the algorithm can produce a non-optimal or completely wrong solution.
To overcome such problems, a simplified approach is proposed in [49]. Thanks to
the special structure of the matrices $S$ and $C$, this approach allows us to easily compute
the eigenvalues.
First, the design matrix $D$ can be decomposed into its quadratic and linear parts,
$D = (D_1 | D_2)$, where:
$$D_1 = \begin{pmatrix} x_1^2 & x_1 y_1 & y_1^2 \\ \vdots & \vdots & \vdots \\ x_i^2 & x_i y_i & y_i^2 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N y_N & y_N^2 \end{pmatrix} \qquad D_2 = \begin{pmatrix} x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots \\ x_i & y_i & 1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & 1 \end{pmatrix} \qquad (11)$$
Similarly, the scatter matrix $S$ can be decomposed into blocks:
$$S = \begin{pmatrix} S_1 & S_2 \\ S_2^\top & S_3 \end{pmatrix} \qquad \text{where} \quad S_1 = D_1^\top D_1, \quad S_2 = D_1^\top D_2, \quad S_3 = D_2^\top D_2 \qquad (12)$$
Based on these decompositions, the first condition of equation (9) can be rewritten
as:
$$\begin{pmatrix} S_1 & S_2 \\ S_2^\top & S_3 \end{pmatrix} \cdot \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} = \lambda \cdot \begin{pmatrix} C_1 & 0 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} \qquad (15)$$
which is equivalent to the pair of equations:
$$S_1 \mathbf{a}_1 + S_2 \mathbf{a}_2 = \lambda C_1 \mathbf{a}_1 \qquad (16)$$
$$S_2^\top \mathbf{a}_1 + S_3 \mathbf{a}_2 = 0 \qquad (17)$$
The matrix $S_3$,
$$S_3 = D_2^\top D_2 = \begin{pmatrix} S_{x^2} & S_{xy} & S_x \\ S_{xy} & S_{y^2} & S_y \\ S_x & S_y & S_1 \end{pmatrix} \qquad (18)$$
is the scatter matrix of the line-fitting problem, i.e. the problem of fitting a set of
points with a straight line. Such a matrix is singular only if all the points lie on
a line. In that situation there is no real solution, but in all other cases the matrix
$S_3$ is regular.
From equation (17), $\mathbf{a}_2$ can be expressed as:
$$\mathbf{a}_2 = -S_3^{-1} S_2^\top \mathbf{a}_1 \qquad (19)$$
Substituting into (16) and multiplying by $C_1^{-1}$ yields:
$$C_1^{-1}(S_1 - S_2 S_3^{-1} S_2^\top)\,\mathbf{a}_1 = \lambda \mathbf{a}_1 \qquad (21)$$
The second condition of equation (9) can also be reformulated by using the de-
composition principle. Due to the special shape of the matrix $C$ we simply get:
$$\mathbf{a}_1^\top C_1 \mathbf{a}_1 = 1 \qquad (22)$$
Summarizing all the decomposition steps, the conditions in (9) can finally be expressed
as the following set of equations:
$$M \mathbf{a}_1 = \lambda \mathbf{a}_1, \qquad \mathbf{a}_1^\top C_1 \mathbf{a}_1 = 1, \qquad \mathbf{a}_2 = -S_3^{-1} S_2^\top \mathbf{a}_1, \qquad \mathbf{a} = \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} \qquad (23)$$
where
$$M = C_1^{-1}(S_1 - S_2 S_3^{-1} S_2^\top) \qquad (24)$$
Now we can return to the task of fitting an ellipse through a set of points. As we
said before, the task can be expressed as a constrained minimization problem (eq. 6)
whose optimal solution corresponds to the eigenvector $\mathbf{a}$ of equation (9) which yields
a minimal non-negative value $\lambda$. Equation (9) is equivalent to equation (23); thus it
is enough to find the appropriate eigenvector $\mathbf{a}_1$ of the matrix $M$.
Appendix B
SymGeon Aspects
[Figures: the SymGeon aspect models PR CYLINDROID 1, PR CUBOID 1, PR CUBOID 2, PR B CYL 1, PR B CYL 4, PR B CYL 5, ELLIPSOID, S A CYL 2, S A CUB, S A B CYL 2, S A B CYL 3, S A B CUB]
Appendix C
Distance Formalization
In this Appendix we detail all the definitions concerning edgewise connection and
pointwise connection, given w.r.t. their metric. We have used these definitions
for the formalization of the Algebra of Figure. For each figure we also provide its
composition by means of a composition graph.
Face Level
Coterminant / Edgewise Connection

Cotermination (fig. 1):

i. $d_1 = \min_{x,y}(distance(P_{e_x}, P'_{e_y}))$ where $x, y \in \{a, b\}$;

ii. $d_2 = distance(P_{e_{x'}}, P'_{e_{y'}})$ where $x', y' \in \{a, b\}$ and $x' \neq x$, $y' \neq y$;

iii. $d_3 = distance(P_m, P'_m)$;

iv. $d = \dfrac{d_1 + d_2}{d_3}$.
Figure 1: [diagram of the coterminant and edgewise-connection configurations, showing the points $P_{e_x}$, $P'_{e_y}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]
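As an illustration, the cotermination metric above can be transcribed directly into code (the 2-D setting, the segment endpoints, and the midpoints in the example are our own illustrative assumptions):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cotermination_distance(ends, ends2, m, m2):
    """d = (d1 + d2) / d3, a direct transcription of items i-iv:
    d1 is the closest endpoint pair, d2 the remaining pair, d3 the
    distance between the midpoints m and m2."""
    pairs = [(i, j) for i in range(2) for j in range(2)]
    i, j = min(pairs, key=lambda ij: dist(ends[ij[0]], ends2[ij[1]]))
    d1 = dist(ends[i], ends2[j])
    d2 = dist(ends[1 - i], ends2[1 - j])   # the other two endpoints
    d3 = dist(m, m2)
    return (d1 + d2) / d3

# Two segments meeting at the origin: d1 = 0, d2 = 2*sqrt(2), d3 = sqrt(2).
d = cotermination_distance([(0, 0), (2, 0)], [(0, 0), (0, 2)],
                           m=(1, 0), m2=(0, 1))
# → 2.0 for this geometry
```

Whether a given value of $d$ counts as cotermination depends on the threshold adopted by the recognition system; the geometry above is only an example.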
v. $d = d_1 + d_2 + F(d_3)$.
Coterminant / Edgewise Connection (Straight) / Edgewise Connection (Elliptic)

Figure 3: [diagram of the three configurations, showing the points $P_{e_c}$, $P_{e_r}$, $P'_{e_c}$, $P'_{e_r}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]
Cotermination (fig. 3):

i. $d_1 = distance(P_{e_c}, P'_{e_r})$;

ii. $d_2 = distance(P_{e_r}, P'_{e_c})$;

iii. $d_3 = distance(P_m, P'_m)$;

iv. $d = \dfrac{d_1 + d_2}{d_3}$.

vi. $d = d_1 + d_2 + F(d_3) + d_\phi$.
Coterminant / Symmetric

Figure 5: [diagram showing the points $P'_{x_1}$, $P'_{x_2}$, $P''_{y_1}$, $P''_{y_2}$, the centers $C'$, $C''$, the angles $\phi' + \Delta\phi'/2$, $\phi'' + \Delta\phi''/2$, and the distances $d_1$, $d_2$, $d_c$]
iii. $d = d_1 + d_2$.

Symmetric (fig. 5):

i. $d_c = distance(center\,C', center\,C'')$;

ii. $d_\alpha = |\alpha' - \alpha''|$;

iii. $d_r = |r'_m - r''_m| + |r'_M - r''_M|$;

iv. $d_\phi = \left|\left(\phi' + \frac{\Delta\phi'}{2}\right) - \left(\phi'' + \frac{\Delta\phi''}{2}\right)\right| - \pi$;

v. $d = d_c + d_\alpha + d_r + d_\phi$.
Edgewise Connection / Pointwise Connection

Figure 6: [diagram showing the points $P_{e_x}$, $P_m$, $P'_{e_c}$, $P'_{e_r}$, $P'_m$ and the distances $d$, $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
Edgewise Connection / Pointwise Connection

Figure 8: [diagram showing the points $P_{e_x}$, $P'_{e_{y'}}$, $P_m$ and the distances $d$, $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
Edgewise Connection / Pointwise Connection

Figure 9: [diagram showing the points $P_{e_c}$, $P_{e_r}$, $P_m$, $P'_{e_c}$, $P'_{e_r}$, $P'_m$ and the distances $d$, $d_1$, $d_2$]
Edgewise Connection / Pointwise Connection

Figure 11: [diagram showing the points $P_m$, $P_{e_x}$, $P'_{e_y}$, $P'_m$ and the distances $d_1$, $d_2$]

iv. $d = d_1 + d_2 + d_\phi$.
Coterminant / Pointwise Connection

Figure 13: [diagram showing the points $P_{e_x}$, $P'_{e_y}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]

iv. $d = d_1 + d_2 + d_\phi$.
Aspect Level

Edgewise Connection / Parallel Edge

v. $d = d_1 + d_2 + d_r + d_\phi$.

Parallel Edges:

i. $d = |m_{r_1} - m_{r_2}|$.
Figure 15: [diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]
Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.

[quadrilateral with vertices $P_1$, $P_2$, $P_3$, $P_4$ and edge midpoints $m_{12}$, $m_{23}$, $m_{34}$, $m_{41}$]
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.
Edgewise Connection / Non-Parallel Edge

v. $d = d_1 + d_2 + d_r + d_\phi$.

Figure 17: [diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]

$d = \dfrac{1}{|m_{r_1} - m_{r_2}|}$.
Contained:

i. $d_c = distance(center\,C, center\,C')$;

ii. $r = \dfrac{r_m - r'_m}{r_M - r'_M}$;

iii. $F(z) = \begin{cases} 0 & z > 0 \\ -kz & z \leq 0 \end{cases}$

iv. $d = d_c + F(r)$.
No Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

[quadrilateral with vertices $P_1$, $P_2$, $P_3$, $P_4$ and edge midpoints $m_{12}$, $m_{23}$, $m_{34}$, $m_{41}$]
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

i. $d_1 = |m_{12} - m_{34}|$;
Edgewise Connection / Edgewise Connection

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Edgewise Connection:

vi. $d = d_1 + d_2$.

Figure 19: [diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Figure 20: [diagram showing the points $P_{rc_1}$, $P_{cr_1}$, $P_{rc_2}$, $P_{cr_2}$, the midpoints $m_{r_1}$, $m_{r_2}$, the arcs $C_{12}$, $C_{21}$, and the distances $d_1$, $d_2$]
iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection (straight) (fig. 20):

i. $d_1 = \min_{x,y}(P_x, P_{rc_y})$ where $x \in \{1, 2, 3, 4\}$ and $y \in \{1, 2\}$;

ii. $d_2 = \min_u(P_u, P_{cr_y})$ where $u \in \{|x + 1|_4, |x - 1|_4\}$;

iii. $d = d_1 + d_2$.
Parallel Edge (elliptic) (fig. 20):

iv. $d = d_r + d_\phi + d_\alpha$.
Edgewise Connection / Edgewise Connection

v. $d = d_1 + d_2 + d_r + d_\phi$.

[diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]
Edgewise Connection / Edgewise Connection

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Edgewise Connection (elliptic) / Edgewise Connection (straight)

Edgewise Connection (elliptic) / Edgewise Connection (elliptic)

Figure 21: [diagram showing the distances $d_1$, $d_2$ for the two configurations]

iii. $d = d_1 + d_2$.
ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection / Edgewise Connection

Figure 22: [diagram showing the points $P_x$, $P_y$, $P_3$, $P_4$ and the arcs $C_{xy}$, $C_{23}$, $C_{34}$, $C_{41}$]

iv. $d_\alpha = |\alpha - \alpha_{xy}|$;

v. $d = d_1 + d_c + d_c + d_\phi + d_\alpha$.
Figure 23: [diagram showing the points $P_{rc_1}$, $P_{cr_1}$, $P_{rc_2}$, $P_{cr_2}$, the midpoints $m_{r_1}$, $m_{r_2}$, the arcs $C_{12}$, $C_{21}$, and the distances $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection

Figure 24: [diagram showing the distances $d_1$, $d_2$]

ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection

Figure 25: [diagram showing the points $P_{rr}$, $P_{cr}$, $P_{rc}$, the midpoints $m_{cr}$, $m_{rc}$, and the arcs $C$, $C'$]

i. $d_1 = distance(P_{cr}, C') + distance(P_{rc}, C')$;

iv. $d_\alpha = |\alpha - \alpha'|$;

v. $d = d_1 + d_c + d_c + d_\phi + d_\alpha$.