Università degli Studi di Roma “La Sapienza”
Dottorato di Ricerca in Ingegneria Informatica
XV Ciclo - 2003
Massimo Romano
Acknowledgements
I wish to acknowledge the people who helped me most during my Ph.D. First of all, my deepest debt of gratitude goes to my advisor, Prof. Fiora Pirri, who constantly inspired and encouraged me. I express my special thanks for her fundamental support, her guidance, and our productive collaboration. It is mainly thanks to her that I could accomplish and finish this work.
I express my gratitude to Prof. Martin D. Levine and Prof. Murray Shanahan, who read a preliminary version of this thesis as external referees and contributed a number of suggestions and very useful comments. I thank them for their attention and their accurate reports.
I thank the members of my Ph.D. Committee who helped me in the accomplishment of this work. A special thanks goes to Prof. Antonio Chella and Prof. Daniele Nardi for their constant support.
I thank my colleagues Marco Pirrone and Alberto Finzi for their collaboration. I thank Massimiliano Cialente, Ivo Mentuccia and Katia Vona for their work with the mobile manipulator Mr. ArmHandOne. Many other people contributed to this work with
interesting discussions, suggestions and support, among them I would like to cite Prof.
Marco Schaerf.
I would like to thank all the people of the Dipartimento di Informatica e Sistemistica of the Università degli Studi di Roma “La Sapienza”, the members of the
Artificial Intelligence Group, and the members of the ALCOR laboratory.
A special thanks goes, at the end of the corridor, to my colleagues of the room
WC-229: Marco Benedetti, Andrea Calı́, Domenico Lembo, Marco Pirrone, Alberto
Finzi, Carlo Marchetti.
Last thanks, but surely the most important, to my love, Barbara. She has believed in me unconditionally and incessantly, showing all her love; thanks to her essential support I have reached this important result.
Infinite gratitude to my father and my mother, Dario and Maria.
Contents
Abstract i
Acknowledgements iii
1 Introduction 1
1.1 Research Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Works 9
2.1 Classification of Object Recognition Systems . . . . . . . . . . . . . . 9
Data Driven and Model Driven . . . . . . . . . . . . . . . . . . . . . . 9
View Centered and Object Centered . . . . . . . . . . . . . . . . . . . 10
2.2 ORS Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Conceptual Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Alignment Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Recognition By Component . . . . . . . . . . . . . . . . . . . . . . . . 13
Aspect Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Hierarchical Aspect Graph . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Preliminaries 17
3.1 Perception in the Situation Calculus . . . . . . . . . . . . . . . . . . . 17
A basic theory of action and perception . . . . . . . . . . . . . . . . . 19
High Level Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Symgeons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Bayes Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Burglary Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Computing Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 An Algebra of Figures 55
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 An algebra of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Axioms for the Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.1 Grouping Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 Figures Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.5 Connection Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5.1 Relation Between Primits . . . . . . . . . . . . . . . . . . . . . 68
6.5.2 Relation Between Boundits . . . . . . . . . . . . . . . . . . . . 70
6.5.3 Relations concerning Faces . . . . . . . . . . . . . . . . . . . . 72
7 A Bayes-aspect Network 77
7.1 SymGeons Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Hypotheses Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.3 From Aspects to SymGeon . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4 Portion of the code for Aspect Recovery . . . . . . . . . . . . . . . . . 81
8 Description 85
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2 Analyzing the image graph and using descriptions . . . . . . . . . . . 88
8.3 Hints on descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.4 Parsing the reference: the scene graph . . . . . . . . . . . . . . . . . . 94
9 Experimental Results 99
9.1 Application Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10 Conclusion 111
10.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.3 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Appendix A 123
Appendix B 127
Appendix C 131
List of Figures
1.1 Mr. ArmHandOne . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The reasoning process behind perception . . . . . . . . . . . . . . . . 4
3.1 Superquadrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Ellipsoid, Cuboid and Cylindroid . . . . . . . . . . . . . . . . . . . . . 22
3.3 The Hierarchy of Symgeons according to bending and tapering deformations . . 24
3.4 Bayes Network for the Burglary Example. . . . . . . . . . . . . . . . . 26
4.1 Biederman’s Geons. Image taken from the “Avian Visual Cognition” online book available at http://www.pigeon.psy.tufts.edu/avc/toc.htm . . 30
4.2 Primit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Viewpoints Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 A gaussian probability function with mean 0 and variance 0.5 and 0.8. 39
8.1 The graph of the scene in which a table and a chair have been singled out of
a set of symgeons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 The graph of the scene representing a table . . . . . . . . . . . . . . . . . 91
8.3 The graph of the scene representing a table . . . . . . . . . . . . . . . . . 96
1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7 Example faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
10 Example faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
12 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14 Example of faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
16 Example of aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
18 Example of aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
List of Tables
7.1 CPT for a node of the Bayes Network for SymGeon Recognition . . . 80
8.1 Relations between symgeons in the scene and their functional denotation. All relations are reflexive. . . . . . 95
Chapter 1
Introduction
In the first class we shall consider the functional use of an object. It is certainly the main source of information that would allow us to recognize it. In fact, whenever an object is a human artifact, it is completely characterized by the purpose for which it was made. It is not difficult to believe that if someone tells us that the observed object is used for drinking, we immediately recognize a cup (or a glass), whether it resembles one or not.
Furthermore, the human brain is able to recognize a glass, within a reasonable error margin, even if it has a meaningless shape or is broken into ten pieces, because its function has been anticipated.
Another important aspect that we consider is the scene context in which the object is immersed. If we are in the Sahara desert and we are thirsty, a jar can be used for drinking too. This means that a jar can fulfill the same function as a glass, because their logical shape is the same.
Finally, we shall consider the visual aspect of an object. Shapes, colors and textures are all sources of information that can be used to recognize an object, but their information content is, in our opinion, less accurate than that provided by the previously considered aspects.
Since it is not entirely clear how these aspects interact with one another, another important question is to understand how they can be combined in order to solve the recognition problem. The human brain certainly uses a complex procedure to accomplish this task, but all the existing Object Recognition Systems (ORS) basically follow a simple bottom-up procedure, where only visual and logical aspects are considered.
The figure on the left shows a typical framework of an ORS. The input of the system is a digital image, while the output is the recognized object and, in some cases, a specification of the object pose. Basically every ORS is characterized by the following steps.
First, it extracts from the input image an appropriate set of features, like edges (brightness discontinuities), corners (edge intersections), and regions (homogeneous image patches). The goal of feature extraction is to select, from a large amount of image data, only the information that is relevant to identify the object. For example, Ullman and Huttenlocher [54] use triples of corners (curvature discontinuities) to recognize polyhedral objects.
Although many techniques exist for feature extraction that are completely characterized and mathematically justified, their classification relies on the peculiar intuition of the individual researcher, which gives an informal account of the human reasoning process involved in this task. However, all of them use a knowledge base
(or objects database) where all the objects to be recognized are represented using
suitable models.
Therefore, the next step of the ORS is to group the extracted features into a suitable collection of data in order to reconstruct the topological information of the observed scene. The obtained collection is used to access the objects database and retrieve a set of candidate objects, i.e. all the objects conforming to the topological information. Such a procedure avoids a linear search through the whole database, because it eliminates a large number of the candidate objects contained in it. Obviously, if we can recover some really distinguishing clues, we may have only one model to test, and the search is not needed.
The final step of the recognition system is to decide which of the candidate objects
we’re looking at. To accomplish this task it must verify each of the candidates in
terms of how well it matches the image data. A score is finally assigned to each
object hypothesis, and the best scoring candidate is chosen as the interpretation of
the object.
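The three steps just described (feature extraction, candidate retrieval, verification with scoring) can be sketched as a toy matching loop. The feature sets, the two-entry database, and the 0.5 threshold below are invented purely for illustration and are not drawn from any particular ORS:

```python
def extract_features(image):
    """Toy feature extractor: collect the grid positions of edge marks."""
    return {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}

def match_score(model_features, image_features):
    """Verification score: fraction of model features supported by the image."""
    return len(model_features & image_features) / len(model_features)

def recognize(image, object_db, threshold=0.5):
    features = extract_features(image)
    # Score every candidate model against the image data and keep the best.
    scored = [(match_score(m, features), name) for name, m in object_db.items()]
    best_score, best_name = max(scored)
    return best_name if best_score >= threshold else None

# Toy database: each "model" is just a set of expected feature positions.
db = {"L-shape": {(0, 0), (1, 0), (2, 0), (2, 1)},
      "bar":     {(0, 0), (0, 1), (0, 2)}}
img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 0]]
print(recognize(img, db))  # -> L-shape
```

A real system would of course index the database by feature configuration rather than score every model, exactly to avoid the linear search discussed above.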
The main component of an ORS is the objects database, where a model for each object to be recognized is represented. A model is a symbolic description of an object which is general and concise, and the basic goal of an ORS is to transform the unstructured set of sensor data into a symbolic description which is consistent with the chosen representation and which supports an efficient model matching.
A Cognitive Vision System (CVS) is a system that uses visual information to build
a symbolic description of the domain, based on a logical representation approach. It
is basically the result of a combination of techniques from symbolic and subsymbolic
AI with computer vision techniques. Therefore the area of Cognitive Vision includes
many of the major issues in AI such as knowledge representation, reasoning, learning,
and so on.
In this thesis we present a new approach to support knowledge representation and reasoning in cognitive vision; in particular, our essential purpose can be summarized in the following two points:
i. the introduction of a specific vocabulary and language to talk about objects and
scene;
In this thesis we present the basic ideas governing a high level recognition system,
which has been tested on Mr. ArmHandOne, a tiny mobile manipulator endowed with
3. Recognition level: at this level recognition is concerned only with primitive com-
ponents of the image. Each shape in the image is classified as an aspect/view of
a symgeon, with a certain probability, depending on the traits of the shape itself.
The classification granting the existence of this or that symgeon is achieved by
a special Bayes net integrating hierarchical aspect graphs (see [27]), and the two
graphs are fused into an Aspect-Bayes net, that is, a Bayes-net in which causal
relations between nodes are enriched with compositional relations concerning
aspects.
4. Syntactic analyzer level: this level is concerned with the construction of a labeled
graph of the image. The labeled graph is defined in FOL. The syntactic image
analysis delivers a set of connected segments that we call boundits. This set
forms a graph that we call the syntactic graph.
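The construction of the syntactic graph can be illustrated with a toy sketch. The segment names, coordinates, and the shared-endpoint connectivity test below are our illustrative inventions, not the thesis's actual FOL definition:

```python
def build_syntactic_graph(boundits):
    """boundits: dict mapping a name to a segment (endpoint_a, endpoint_b).
    Two boundits are connected when they share an endpoint."""
    graph = {name: set() for name in boundits}
    for a, (p1, p2) in boundits.items():
        for b, (q1, q2) in boundits.items():
            if a != b and {p1, p2} & {q1, q2}:  # shared endpoint => edge
                graph[a].add(b)
    return graph

# Three toy segments: s1 and s2 meet at (1, 0); s3 is isolated.
segments = {"s1": ((0, 0), (1, 0)),
            "s2": ((1, 0), (1, 1)),
            "s3": ((5, 5), (6, 6))}
g = build_syntactic_graph(segments)
print(g["s1"])  # -> {'s2'}
```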
In summary, our approach is based on the following idea. It is not possible to separate the recognition process from knowledge of a specific environment. Each component of an environment is either known or unknown. An element is known if a symbolic representation for it is given, together with information about where it could be, what it is for, and where/how it will be after a specific action has taken place. An object is unknown if there might be a description available for it, but its presence in the environment is not acknowledged a priori. Clearly, for everything that is known it is important to track its movement caused by the execution of actions, and this can be achieved with enough accuracy (given the possibility of failures and noise in the sensors). For the unknown elements a crucial role is played by the following factors:
2. The ability to relate the unknown object to a more general category. This ability is essentially a classification of the unknown object requiring several considerations: what its components are, the way they are connected together, the object location (on the floor, on a table, etc.), what it is for, and so on.
1 Anchoring is the process of creating and maintaining the correspondence between symbols and
to the Art Gallery problem of computational geometry [68]. However, we can cope with this complexity if, instead of representing the multiple views of a complex object, we focus only on the views of simpler geometric objects like parametric geons. [27], in fact, uses the aspect graph exclusively to represent simple geometric primitives, because such primitives have straightforward representative views. Dickinson and Metaxas have further introduced an evolution of the aspect graph, the hierarchical aspect graph (HAG), in which both the features, at different levels of complexity, and the compositional relationships between them can be represented.
Following the above line of research on visual perception, we develop our formalization on the recent approach of [73], which introduces symgeons, an extension of the above-mentioned parametric geons. The advantage of symgeons over parametric geons is that, by dropping the symmetry property, which we do not need to preserve in our approach, they can be used as coarse descriptions also of asymmetrical objects (e.g. the well-known sneaker example that was Ullman's point against recognition by components [100]). On the other hand, symgeons have several views, and views are composed of faces which, in turn, are composed of simpler elements serving as coarse descriptions of the primitive geometric signs depicted in the image. This earlier compositional
construction is obtained using an Aspect-Bayes network, which plays a role similar
to the aspect graph, but here causal relationships are enriched with compositional
relationships.
As a matter of fact, we are trying to define a perceptual architecture that fully
exploits compositionality: from the analysis of the elementary signs in the image to
the analysis and description of an object structure, and finally to the interpretation
of a scene. To achieve this we construct a reasoning process that draws suitable hy-
potheses about each object occurring in the observed scene. Hypotheses are evaluated
according to their probability and the most probable one is added to the knowledge
base for interpreting the scene.
The work introduced here represents part of our effort to provide a control sys-
tem employed in the ASI (Italian Space Agency) PEGASO (PErcept Golog for Au-
tonomous agents in Space Station Operation) [37] and Marviss (Anthropomorphic
Hand and Virtual Reality for Robotic System on the International Space Station)
[47] projects.
Related Works
not specify an exact geometry. Furthermore, these techniques often assume that the
bounding contour of a region belongs to the object which is indeed problematic when
the object is occluded. Finally such techniques often require a manual segmentation
of an object into its meaningful parts.
The Model Driven approach to shape recovery uses models, that capture the exact
geometry of the object. The recognition algorithm acts according to the following
sequence.
First, simple 2D features, like corners or changes in curvature, extracted in the image are paired with similar features of a 3D model to obtain a set of possible correspondences. Then the system uses solid transformations, like rotation and translation, to verify the given hypothesis, bringing the model features into alignment with their corresponding image features. Finally, because the correspondence is weak, those features belonging to the chosen model must be compared to other image features. If there is enough agreement among the features, the object is recognized.
If the number of object models is
small and the exact object geometry is known, this approach is highly efficient be-
cause it requires the extraction of simple recoverable image features. Moreover it is
substantially insensitive to occlusion. However, for large databases, the complexity
makes the model search intractable. Besides, the approach is very sensitive to minor
changes in the shapes of the objects because it is based on the verification of local
features. For example, if the curvature of an object part changes, a new object model
must be added.
Therefore objects can be more easily recognized, since the same techniques used for
2D recognition can be used for 3D recognition without changes.
However this approach has two major limitations. Firstly, the definition of multiple views is not easy, and the user has to consider a priori all possible viewpoints in order to guarantee the completeness of the recognition process. Secondly, it requires storing many different views of the same object.
Neural Network
Probably the most obvious way to recognize the shape of an object is to collect a sufficiently ample set of object images (patterns) taken from different views. The recognition problem can then be reformulated as the problem of comparing this set of images with the image data. Many ORS have been proposed on the basis of this paradigm, where the matching problem is solved using some kind of neural network [69].
Basically, we can define a neural network as a network-like model of computation, designed to simulate the behaviour of biological neural networks (e.g. the human brain), whose basic computational unit is some kind of formal neuron (see figure on the right). A neuron receives input signals x1, ..., xn from either other neurons or the outside environment. It computes its output signal y by adding together the xj's, weighted by some internal weights wj, and applying some transfer function σ to the result:
y = σ( Σ_{j=1}^{n} w_j x_j )
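A minimal executable rendering of this formal neuron may help. The logistic sigmoid used as the transfer function σ is one common choice, assumed here purely for illustration:

```python
import math

def neuron(xs, ws, transfer=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Formal neuron: weighted sum of the inputs xs with weights ws,
    passed through a transfer function (a logistic sigmoid by default)."""
    return transfer(sum(w * x for w, x in zip(ws, xs)))

# With a zero net input the sigmoid yields exactly 0.5.
print(neuron([1.0, -1.0], [0.5, 0.5]))  # -> 0.5
```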
Conceptual Space
Conceptual spaces, introduced by Gärdenfors (see e.g. [42]), are metric spaces in
which entities are characterized by a number of quality dimensions which represent
some kind of qualities of the environment. Examples of such dimensions are color, pitch, volume, spatial coordinates, and so on. Some dimensions are closely related to the input data of the system; others may be characterized in more abstract terms.
A generic point in a CS is called a knoxel (the term is derived by analogy from
pixel). Knoxels are obtained from measurements of the external world performed on
the image acquired by the camera and processed applying low-level vision algorithms.
An important aspect of this theory is the possibility to define a metric function.
Following Gärdenfors, the distance between two knoxels calculated according to such
a metric function corresponds to a measure of the similarity between the entities
represented by the knoxels themselves.
In the domain of computer vision CS are used to model objects as follows: an
object is characterized by a set of attributes or qualities {q1 , q2 , ..., qn }. Each quality
qi takes values in a domain Qi . For example, the quality of volume could take values
in the domain of positive real numbers. Objects are identified with points in the CS
C = Q1 × Q2 × ... × Qn , and concepts are regions.
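The metric structure of a conceptual space can be sketched as follows; the quality dimensions, their values, and the equal weighting are invented for illustration:

```python
import math

def knoxel_distance(k1, k2, weights=None):
    """Weighted Euclidean distance between two knoxels, i.e. two points
    of the conceptual space C = Q1 x ... x Qn. Equal weights by default."""
    weights = weights or [1.0] * len(k1)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, k1, k2)))

# Toy knoxels over hypothetical dimensions (hue, volume, height): a smaller
# distance is read as a greater similarity between the represented entities.
cup    = (0.2, 0.3, 0.1)
mug    = (0.2, 0.4, 0.1)
bottle = (0.6, 0.5, 0.9)
print(knoxel_distance(cup, mug) < knoxel_distance(cup, bottle))  # -> True
```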
An interesting application of the notion of CS is presented in [15, 17]. Here a
framework for high-level representations in computer vision architectures is described.
Alignment Method
Generally speaking, ORS based on the alignment methodology perform the follow-
ing steps: given an acquired image representing an object view and starting from
a set of object models, the ORS performs a search in the space of solid geometric
transformations, trying to match the description of the image with a model in the
database.
More formally, in the alignment method we assume that a series of rigid transformations (e.g. translations, scale operations, rotations, stretching) could be applied either to the input image or to the model. In this way, the recognition problem can be seen as a search problem. In fact, the ORS has to find a particular model Mi and a combination of transformations Tij maximizing an overlap measure F between the model and the image V:
max_{i,j} F(V, Tij(Mi))
Ullman and Huttenlocher introduced this technique in [53]. Their approach is able to recognize polyhedral objects by identifying 2D point discontinuities in the image (where the object contour changes significantly) and matching these points with the vertices of a 3D model (see also [54]).
In particular, their system computes the transformation between each triplet of
points from the image, and each triplet of points from the target model. According
to such a transformation, all the other points from the target set are transformed. If
they match, the transformation receives a mark, and if the number of marks is above
a chosen threshold, the transformation is assumed to be the matching transformation
between the query and the target.
Such a search takes O(m^4 n^3 log n) time in the worst case (where m and n are the number of points in the model and image, respectively), and shortcuts have brought that down to O(mn^3). But the complexity of the search also depends on the number of models in the computer vision system's library.
Variations of these methods also work for geometric features other than points, such as segments, or points with normal vectors [2], and for transformations other than affine ones.
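The hypothesize-and-vote search described above can be sketched in miniature. For brevity the sketch is restricted to pure 2D translations (the full method searches rigid or affine transformations, and hypothesizes from triplets rather than single points); the point sets are invented toy data:

```python
def best_translation(model, image):
    """Hypothesize a translation from each model/image point pair, then
    verify it by counting how many translated model points hit the image."""
    best = (0, None)
    for m in model:
        for p in image:
            t = (p[0] - m[0], p[1] - m[1])         # hypothesized transform
            moved = {(x + t[0], y + t[1]) for x, y in model}
            votes = len(moved & image)             # verification step
            best = max(best, (votes, t))
    return best

model = {(0, 0), (1, 0), (0, 1)}
image = {(5, 5), (6, 5), (5, 6), (9, 9)}
votes, t = best_translation(model, image)
print(votes, t)  # -> 3 (5, 5)
```

The worst-case cost here is already quadratic in the point counts times the verification cost, which mirrors why the full affine search needs the shortcuts mentioned above.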
Recognition By Component
Recognition By Component (RBC) [10, 12, 65, 34] is an emerging approach to object
recognition and modelling in which the shape of an object is described in terms of
relatively few generic components, joined by spatial relationships.
An essential property of this technique is the universality of the primitives, i.e. the ability to describe any object in the domain by a combination of one or more primitive components. Therefore the standardization of the primitives (components and their relationships) is crucial.
In the literature we can find several kinds of volumetric primitives. Ferrie and Levine [34] use ellipsoids and cylindroids as basic object components, obtaining an extremely approximate representation.
Shapiro in [95] uses sticks, blobs and plates as primitives, in order to represent long and thin object parts, flat parts and volumetric parts respectively.
David Marr initially proposed to use cylinders as unique valid primitives [64] but,
in this case, a meaningful representation of an object requires a great number of
cylinders. Consequently the amount of information contained in the representation
is comparable to that contained in the original image, therefore the representation is
not concise.
Later, Marr and Nishihara [65] proposed generalized cylinders as primitives. These geometric shapes are realized by sliding a 2D figure along an axis of symmetry. The representation obtained using this set of primitives is more expressive and versatile, and moreover such primitives are as easy to extract as simple cylinders.
In the 1980’s, from the human vision community, Biederman in [10] proposed 36
qualitatively different volumetric primitives named geons (short for geometric icons).
They were described in terms of four qualitative attributes of generalized cylinders:
Symmetry, Size, Edge and Axis. Psychological experiments [11, 12] have provided support for the descriptive power of geon-based descriptions. Even though this model has been used by several researchers to describe 3D objects, most of the work on this subject has focused on the recovery of geon models from complete line drawings depicting perfect geon-like objects.
As an alternative to this idea, other authors focused their research on a representation using purely quantitative primitives. Pentland, for example, introduced in [75] an interesting parametric family of closed surfaces named superquadrics. A crucial characteristic of these structures is their ability to model a large number of objects, because the shape of a superquadric is based on parameter values (ε1, ε2) that can be suitably varied. On the other hand, this property makes the recognition problem more difficult.
Terzopoulos et al. [98] use generalized splines to create an active model of the specific object, which is suitably deformed to adapt to the object image. Such models react to external forces produced according to the image data. Following this idea, Metaxas and Terzopoulos [97] developed a deformable model based on superquadrics instead of generalized splines.
Kenong Wu and Martin Levine [104] combined the qualitative and quantitative approaches by introducing a new set of geometric primitives named parametric geons. Such a set consists of seven volumetric primitives derived from superquadrics by specifying the parameters ε1 and ε2 and applying tapering and bending deformations. The first three basic primitives are the ellipsoid, cylindroid and cuboid, respectively defined for (ε1 = 1, ε2 = 1), (ε1 = 0.1, ε2 = 1) and (ε1 = 0.1, ε2 = 0.1). The other four primitives, obtained by applying tapering and bending operations to the basic ones, are the tapered cylinder, tapered cuboid, curved cylinder and curved cuboid.
More information about superquadrics and parametric geons is given in Chapter 3.
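The role of the exponents ε1 and ε2 can be made concrete with the standard superquadric inside-outside function; the unit axis lengths a1 = a2 = a3 = 1 below are an assumption made for illustration:

```python
def superquadric_F(x, y, z, a1=1.0, a2=1.0, a3=1.0, eps1=1.0, eps2=1.0):
    """Superquadric inside-outside function with shape exponents
    (eps1, eps2): F < 1 inside the surface, F = 1 on it, F > 1 outside.
    eps1 = eps2 = 1 gives an ellipsoid; values near 0.1 approximate a
    cuboid, matching the parametric-geon settings described above."""
    xy = (abs(x / a1) ** (2 / eps2) + abs(y / a2) ** (2 / eps2)) ** (eps2 / eps1)
    return xy + abs(z / a3) ** (2 / eps1)

# For the ellipsoid case (eps1 = eps2 = 1) F reduces to x^2 + y^2 + z^2.
print(superquadric_F(1.0, 0.0, 0.0))       # -> 1.0 (on the unit sphere)
print(superquadric_F(0.5, 0.0, 0.0) < 1)   # -> True (inside)
```

Tapering and bending, which yield the remaining four parametric geons, are separate deformations applied on top of this base surface.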
Aspect Graph
The concept of Aspect Graph was initially described by Koenderink and van Doorn
[56] in 1979. They observed that for most views, a small change in the vantage point
of an object produces a small change in the shape of the projection of that object,
while for some views a large change is produced.
Starting from these considerations, they proposed a classification technique to
reduce the dimension of the image set based on a partitioning of the view space into
regions each one composed of qualitatively similar views.
In their approach, each object in the knowledge base is represented using a graph (named the aspect graph), where each node corresponds to a region of qualitatively similar views, while each edge represents a transition between adjacent regions (a visual event).
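As a toy illustration (the node names are our invention, not taken from Koenderink and van Doorn), an aspect graph can be stored as a plain adjacency structure:

```python
# Nodes are classes of qualitatively similar views of a cube-like object,
# labelled by how many faces are visible; edges are the visual events
# that move the viewpoint between adjacent regions of the view space.
aspect_graph = {
    "one-face":    {"two-faces"},
    "two-faces":   {"one-face", "three-faces"},
    "three-faces": {"two-faces"},
}

def visual_event_possible(g, a, b):
    """True if a single visual event can turn aspect a into aspect b."""
    return b in g.get(a, set())

print(visual_event_possible(aspect_graph, "one-face", "two-faces"))    # -> True
print(visual_event_possible(aspect_graph, "one-face", "three-faces"))  # -> False
```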
The success of such a representation depends on the ability to find the minimal set of significant views for each object, in order to reduce the dimension of the associated graph. For this reason, since these concepts were introduced, much effort
has been expended in analytically deriving the exact Aspect Graph [31, 82, 57], which
yields a complete description of the object’s view-sphere.
Analytical methods, however, must be limited to simple models, since they may be expensive if the object contains many features. For example, Plantinga and Dyer in [81] describe an algorithm to localize the most significant views for polyhedral objects, while Gigus and Malik in [44] solve the same problem for non-convex polyhedral objects, using an algorithm whose time complexity is Θ(n^6 log n) in the number of vertices.
In general, it is possible to demonstrate that the problem of finding the optimal set of views for a polyhedral object is NP-hard, by reducing it to the Art Gallery problem of computational geometry [68]. However, the exact aspect graph is generally too detailed to be useful in model matching with real data. Koenderink and van Doorn, in fact, recognized that many nodes in the aspect graph will probably correspond to unstable views, where an infinitesimal (i.e. suitably small for real applications) camera movement will change the topological properties of the view.
An interesting evolution of the Aspect Graph was introduced by Dickinson and Metaxas in [28, 27, 29].
1 The Art Gallery problem consists in determining the minimal number of observation points
needed to observe the whole interior of a polygon.
Chapter 3
Preliminaries
This chapter introduces some general preliminary notions that will be used in the
rest of the thesis. It describes the Situation Calculus, its language, its formalism,
and the way it has been extended in [77] to handle sensing and perception. Then a
deeper characterization is given of the SymGeons, our object recognition primitives,
which derive from the notion of parametric Geons. Finally, some fundamental notions
on Bayes networks are introduced. We have, in fact, used Bayes networks to extend
the concept of aspect graph, in order to handle hypothesis formation in SymGeon
recognition.
observable Fluents. By observable we mean the sensory system can observe the fact
or property or object the fluent is designating. For example On(block5, block3, s) is a
relation between block5 and block3 that holds in s and can be observed, as far as it
holds, by the sensory system. Related to the observable Fluents are the perceptibles,
that is, terms that denote what is perceived about facts, properties or objects. As
was done in [76], we consider only one sense, namely vision, identifying the
perceptibles with the data of the image. In fact, to account both for these data and
for the geometric description of objects, we shall assume (as in [76]) that the object
domain includes the reals R, relying on the standard interpretation of the
reals and their operations (addition, multiplication, etc.) and relations (<, ≤, etc.)
(see [86]). As for the relationships among elements of the image, and likewise among
elements of the scene, we shall discuss this topic in Chapter 6.
We do not, however, compare our work with spatial reasoning (see [1, 6, 5, 21, 22])
in this thesis, since those approaches are genuinely qualitative and, in fact, deal with
2D images and not with recognition by components, as we do in scaling up from 2D to
3D. In any case we refer to the vast literature on this topic, to which we shall come
back later (see e.g. [50, 7, 102]).
The alphabet of L_sitcalc is extended with function symbols of sort images (the
perceptibles) and of sort action (the sensing actions), the special relational fluents
Percept and Occluded, and the special functional fluent Scene. Relational fluents, like
Align, Mistaken, SePercept, Connected and so forth, are introduced by definitions. The
advantage of introducing concepts by definitions is that one can rely on the composi-
tional laws of logic to derive new properties on the basis of simpler ones.
As fluents account for physical events that are affected by control actions, per-
ceptibles account for the mental events – the agent's inner denotation of the physical
events – that are affected both by sensing and control actions. Formally, a percepti-
ble is a term of a general sort perceptible, which should be the result of the union of
many sorts accounting for the different data that sensing can interpret. We treat
only perceptibles of sort images; nevertheless we shall specifically refer to it as the
sort perceptible. A perceptible differs from a fluent also because it lacks the argument
of sort situation: isF : (action ∪ object)^n → images.
A perceptible is taken as an argument by the fluent Percept, which traces the sensory
experience of the agent, and is of the form:
The fluent symbol Scene(s) is of the form Scene : situation → images, and
it is used for the description of the perceptibles in the scene. So, for example,
isCube(block5) ⊂ Scene(s) is true in the situation s if there is a region in the
scene matching the set of points denoted by the perceptible isCube(block5), and this
set, after suitable transformations, can be aligned with the model of the cube.
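Following Reiter [86], the basic theory of actions referred to next is the union of the components described below:

```latex
\mathcal{D} \;=\; \Sigma \,\cup\, \mathcal{D}_{ss} \,\cup\, \mathcal{D}_{ap} \,\cup\, \mathcal{D}_{una} \,\cup\, \mathcal{D}_{S_0}
```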
Here DS0 is the initial database: the set of formulae that mention only the situation
term S0 and no other term of sort situation. Dap is the set of action precondition
axioms, one for each term of sort action; Dss is the set of successor state axioms,
one for each fluent; Duna is the set of unique name axioms for actions; and Σ is the
set of domain independent axioms specifying the properties of the domain of sort
situation. We refer again to [60, 86, 78] for details on each element in the above set.
A basic theory of action and perception is defined analogously: the set Dss in-
cludes, together with the successor state axioms for each fluent in the language, a
successor state axiom for the fluents Percept and Occluded and for the functional
fluent Scene, for each perceptible in the language. Dap includes action precondi-
tion axioms for sensing actions, and Duna comprises, together with the unique name
axioms for actions, unique name axioms for perceptibles. To these sets we shall add a
set Dda, the axioms describing the representation of objects in the scene, and a set
Dgd, which describes the class of primitive objects (from the Primits to the Symgeons,
through a whole hierarchy of primitives, see Chapter 6).
In accordance with the problem solving approach to perception [89, 90], causal
laws for perception have been introduced in [76]. Here, in fact, perception is inter-
preted as a mental event subject to causal laws, and therefore based on prior perceptions,
implying a perception-perception chain of causality. This is quite important to ac-
count for the correlations and causal laws among Percepts: perceived orientation is
the suitable context for perceived shape, perceived adjacency explains the perception
of the composition of an object, and so forth [90].
Following the same ideas as [84, 85], causal laws for perception can be formulated
as negative and positive contexts.

In the first item the sensing action sense(isCube(x), 1) is context free, in the sec-
ond item the action drop(x) is context sensitive, and in the third the sensing action
sense(isBlackboard(x), 1) is context sensitive.
In the last item, the context plays the role of deciding whether or not to accept
the answer from the vision system. When the vision system returns 1 to the
query sense(isBlackboard(x), 1), the context decides whether the perception should
hold or not.
Finally, in our approach, we distinguish between modeling sensing actions’ direct
effects (i.e. direct perception), and modeling the process of sensing data interpretation
and assimilation (selective perception, and meaningful perception).
Direct Perception. Direct perception is the process of sensing tout court. In fact,
it is formalized in the Situation Calculus simply by suitably adding successor state
axioms for each perceptible. As noted above, according to the problem solving approach
to perception [90] we define causal laws for perception and, following the same ideas
as [84, 85], we formulate them in terms of successor state axioms:

Here p is a perceptible, e.g. isF(~y), and pr is the outcome, pr ∈ {0, 1}. Analogously,
we introduce successor state axioms for the fluents Occluded and Scene, for each
perceptible in the language. Finally, Dap includes action precondition axioms for
sensing actions. For all perceptibles p:
For example:
The hypothesis is built using the sensing history σ(H), that is, the history of all the
sensing actions that lead to the derivation of a given Percept(p, pr, σ) or a given
sentence mentioning the fluent Percept.
In fact, by the consistency theorem of the Situation Calculus (see [78]), a basic
theory of actions is consistent provided the initial database DS0 is consistent; there-
fore we want to add a hypothesis to the initial database. Since we want to gather
the hypothesis from the perceptual process, we proceed as follows. Suppose D |= ϕ(σ),
where ϕ(σ) is a sentence mentioning Percept, and D ⊭ ψ(σ), with ψ(σ) a sentence in
which Percept(isF(~x), 1, σ) is transformed into F(~x, σ) and Percept(isF(~x), 0, σ) is
transformed into ¬F(~x, σ) (for details on the transformation see [76]); then we want
to add a hypothesis h such that D ∪ {h} |= ψ(σ). We construct such a hypothesis by
regressing ϕ(σ), thereby extracting the sensing history. Then we transform the sensing
history into observable fluents and use them to build a sentence uniform in S0. This
explanation-based (or abductive) process is called meaningful perception, because it
allows us to use the perception process to infer real properties of the world. In fact,
human beings do this all the time: if one sees that a door is open, one usually uses
this sensing to infer that the door is indeed open; obviously the outcome of sensing
is not taken as universal truth, for if the door were a glass door and one ended up
bumping into it, one would want to give up the hypothesis.
Therefore we consider meaningful perception as a non-monotonic inference process.
Of course meaningful perception is not always possible. In [76] Pirri and Finzi have
shown that meaningful perception is always possible when there is no misalignment
between perception and real world information. To account for the misalignment
between the inner state of the agent, which is transformed by perception (sensing
actions and the fluent Percept), and the real world, which is transformed by control
actions, Pirri and Finzi introduced for each perceptible isF and property F the
following definition¹:
Mistaken(isF(~x), s) ≡def
(¬F(~x, s) ∧ Percept(isF(~x), 1, s)) ∨ (F(~x, s) ∧ Percept(isF(~x), 0, s)).   (3.4)
A misalignment admits two interpretations: (a) sensing is correct, and the misalign-
ment between perception and observable fluents is not a mistake; (b) the agent has a
wrong perception of the real world. A brave approach would always consider sensing
correct, admitting the outcome of sensing actions to change the truth value of ob-
servable fluents (interpretation a). A cautious one would consider the truth value of
the observable fluents a measure of the correctness of perception (interpretation b).
The definition of meaningful perception given in [76] is based on this second approach.
1 Observe that we assume that in the language, for each observable fluent F, there is a
perceptible isF.
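Read operationally, definition (3.4) is just a disagreement test between the observable fluent and the Percept outcome; a minimal sketch (the function name is hypothetical, not part of the formalization):

```python
def mistaken(f_holds: bool, percept_outcome: int) -> bool:
    """Definition (3.4): the perceptible isF is Mistaken in s exactly when
    the observable fluent F and the Percept outcome disagree."""
    return (not f_holds and percept_outcome == 1) or \
           (f_holds and percept_outcome == 0)

# The four possible alignments between world and perception:
print(mistaken(True, 1))   # False: agreement
print(mistaken(False, 0))  # False: agreement
print(mistaken(True, 0))   # True: disagreement
print(mistaken(False, 1))  # True: disagreement
```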
3.2 Symgeons
The concept of symgeon, introduced in [73], belongs to a family of related concepts
that have been used in visual recognition to describe the shape of an object in terms of
relatively few generic components joined by spatial relationships. Symgeons (which
we can consider a simple generalization of the parametric geons introduced by Kenong
Wu and Martin Levine [104, 103]) have their origins in qualitative geons [10] and
in the computer graphics concept of superquadrics [3]. Biederman's original geons
are 36 volumetric component shapes described in terms of the following qualitative
attributes of generalized cylinders: symmetry, size, edge, and axis; each of these
properties can be suitably varied, thereby determining a unique geon.
Superquadrics were first introduced in computer graphics by Barr in his seminal
paper [3]: they are a family of shapes, depicted in Figure 3.1, including the superhyper-
boloid of one sheet, the superhyperboloid of two sheets, the superellipsoid and the
supertoroid. In the computer vision literature, it is common to refer to superellipsoids
by the more generic term superquadrics.
The parameters a1 , a2 and a3 are scaling factors along the three coordinate axes.
The first three basic primitives are ellipsoid, cylindroid and cuboid (see Figure 3.2),
respectively defined for:

• ε1 = 1, ε2 = 1:  (x/a1)^2 + (y/a2)^2 + (z/a3)^2 = 1  (ellipsoid)

• ε1 = 0.1, ε2 = 1:  ((x/a1)^2 + (y/a2)^2)^10 + (z/a3)^20 = 1  (cylindroid)

• ε1 = 0.1, ε2 = 0.1:  (x/a1)^20 + (y/a2)^20 + (z/a3)^20 = 1  (cuboid)
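As a sketch of how these special cases arise, the general superellipsoid inside-outside function (following Barr [3] and Wu and Levine [103]) can be evaluated directly; the function name and the unit scales in the example are illustrative assumptions:

```python
def superellipsoid_f(x, y, z, a1, a2, a3, e1, e2):
    """Superellipsoid inside-outside function:
    F = ((x/a1)^(2/e2) + (y/a2)^(2/e2))^(e2/e1) + (z/a3)^(2/e1);
    F < 1 inside, F = 1 on the surface, F > 1 outside."""
    w = abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)
    return w ** (e2 / e1) + abs(z / a3) ** (2.0 / e1)

# Unit-scale primitives: ellipsoid (e1 = e2 = 1), cylindroid (e1 = 0.1,
# e2 = 1) and cuboid (e1 = e2 = 0.1); the axis point (1, 0, 0) lies on
# all three surfaces.
for e1, e2 in [(1.0, 1.0), (0.1, 1.0), (0.1, 0.1)]:
    print(round(superellipsoid_f(1.0, 0.0, 0.0, 1, 1, 1, e1, e2), 6))  # 1.0
```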
The other four primitives, obtained by applying tapering and bending operations to
the basic ones, are tapered cylinder, tapered cuboid, curved cylinder and curved cuboid.
A tapering deformation is performed along the z axis with a linear rate with
respect to z; a point (x, y, z) of a primitive is transformed into a point (X, Y, Z) as
follows:

X = (Kx/a3 · z + 1) x
Y = (Ky/a3 · z + 1) y   (3.6)
Z = z

Here Kx and Ky are the tapering parameters along the x and y axes. To avoid invalid
tapering operations, as defined in [103], the constraints 0 ≤ Kx ≤ 1 and 0 ≤ Ky ≤ 1
are imposed.
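Equation 3.6 can be sketched as a point transform (the helper name is hypothetical; the equation itself is from the text):

```python
def taper(x, y, z, a3, kx, ky):
    """Linear tapering along z (equation 3.6); valid for 0 <= kx, ky <= 1."""
    return ((kx / a3) * z + 1.0) * x, ((ky / a3) * z + 1.0) * y, z

# At z = 0 the cross-section is unchanged; at z = a3 it is scaled by 1 + k.
print(taper(1.0, 2.0, 0.0, a3=1.0, kx=0.5, ky=0.5))  # (1.0, 2.0, 0.0)
print(taper(1.0, 2.0, 1.0, a3=1.0, kx=0.5, ky=0.5))  # (1.5, 3.0, 1.0)
```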
The bending operation adopted by Wu and Levine is easily defined by considering a
section of the primitive along the xz plane. It is depicted in the figure on the left,
taken from [103, 104], where the dark area delimits the original primitive, while the
thick line represents the curved primitive. The transformation is applied to the
primitive along the z axis in the positive x direction; O is the center of the bending
curvature and θ is the bending angle. Given a point (x, y, z) belonging to the original
primitive, the operation transforms it into the point (X, Y, Z) according to the
following equations:

X = κ^-1 − cos θ (κ^-1 − x)
Y = y   (3.7)
Z = (κ^-1 − x) sin θ
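Equation 3.7 admits a similar sketch; here the bending angle θ is passed explicitly, whereas in [103, 104] it is determined by the point's position along z (an assumption of this sketch, as is the function name):

```python
import math

def bend(x, y, kappa, theta):
    """Bending (equation 3.7): rotates a point of the xz-section about the
    centre of curvature O, at distance 1/kappa from the axis, through the
    bending angle theta."""
    r = 1.0 / kappa - x            # distance from the bending centre
    return 1.0 / kappa - math.cos(theta) * r, y, r * math.sin(theta)

# With theta = 0 no bending is applied and the point is unchanged.
X, Y, Z = bend(0.2, 0.5, kappa=0.5, theta=0.0)
print(round(X, 6), Y, round(Z, 6))  # 0.2 0.5 0.0
```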
Figure 3.3: The Hierarchy of Symgeons according to bending and tapering deforma-
tions
Petel and Holt [73] introduced a new set of primitives called SymGeons (Sym-
metrical Geometric Icons), extending the concept of the parametric geons of Wu and
Levine by considering the possibility of applying the tapering and bending transfor-
mations at the same time. In this way they eliminated the intrinsic symmetry of the
parametric geons, allowing one to model a larger number of asymmetrical objects.
Symgeons are depicted in Figure 3.3.

In the rest of the thesis we use the term G(a, ε, K, κ) to denote a generic symgeon,
where a = (a1, a2, a3) is the scale vector, ε = (ε1, ε2) is the squareness vector,
K = (Kx, Ky) is the tapering vector and κ is the bending parameter. To refer to the
coordinates of a symgeon in the scene we shall use a term pos, so the position of a
specific symgeon G(a, ε, K, κ), with γ = ⟨a, ε, K, κ⟩, will be denoted by g(pos, γ). We
shall also use the term NG(a, ε, K, κ) to denote the normal to the symgeon G. For
more details about these definitions we refer the reader to Wu and Levine [104]. A
classification of symgeons is given in Table 3.1 (SymGeon Classification).
ii. an oriented edge connects a pair of nodes (X, Y) if node X directly affects node Y;
Burglary Example
Bayes networks are modeling tools which can be applied to a variety of domains. In
this section we present an example taken from [74].
Suppose you are living in a house in California, and you are concerned
about burglaries. Therefore there is an alarm in the house which might
go off if a burglary occurs. There are also two neighbors who work at
home: Dr. Watson and Mrs. Gibbons. If the alarm goes off they might
phone you at your office. On the other hand, you know that your alarm also
sometimes goes off if there is an earthquake or tremor, but that earthquakes
and tremors will likely be reported on the radio.
3. G: True if Mrs. Gibbons phones you to tell you that an alarm has rung.
4. W: True if Dr. Watson phones you to tell you that an alarm has rung.
5. E: True if an earthquake has occurred.
6. R: True if there is a radio report of an earthquake.
Figure 3.4 shows the Bayes network for this example. The independence as-
sumption allows us to make the following observations:

• B does not depend on E. So if we know E, then we expect S to take on a
non-zero value; however, we do not then expect that B will become more likely
given that S has rung.

• If S takes on a non-zero value, we do expect that B and E both become more
likely. But if we later learn that E is false, we would expect that B becomes
even more likely (there is no longer another reason for S).

• If B occurs then we expect both W and G to become more likely, since S is
more likely, and that will make these two variables more likely to be true. On
the other hand, if B occurs we do not expect E to become more likely.
These yield the following product decomposition, in accordance with equation 3.8:

Prob(B ∧ E ∧ R ∧ S ∧ W ∧ G) =
Prob(B) · Prob(E) · Prob(R|E) · Prob(S|E ∧ B) · Prob(W|S) · Prob(G|S)
Once we have the net, we must parameterize it, defining a CPT for each vari-
able in the decomposition. This means that for each node in the Bayes network we
have a matrix of values: one probability for every different assignment of the node
given the different assignments of its parents. In particular, in the burglary example
all of the variables are propositional except for S, which has 3 possible values. We
report in Table 3.2 an example of CPT for S; here we use lower-case letters to denote
particular assignments of the variables:

Table 3.2: CPT for S
Parents   S = 0   S = 1   S = 2
e b       0.1     0.15    0.75
e ¬b      0.2     0.6     0.2
¬e b      0.15    0.25    0.6
¬e ¬b     0.9     0.1     0
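The product decomposition above can be sketched numerically. Only the CPT for S follows Table 3.2 (with rows indexed by the assignments of the parents E and B); every other number below is made up for illustration:

```python
from itertools import product

# Hypothetical parameterization of the burglary network: only the CPT for S
# follows Table 3.2; all the other numbers are made up for illustration.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R_given_E = {True: {True: 0.9, False: 0.1},
               False: {True: 0.001, False: 0.999}}
P_S_given_EB = {  # rows (e, b); columns S = 0, 1, 2
    (True, True): [0.1, 0.15, 0.75],
    (True, False): [0.2, 0.6, 0.2],
    (False, True): [0.15, 0.25, 0.6],
    (False, False): [0.9, 0.1, 0.0],
}
P_W_given_S = {0: 0.01, 1: 0.4, 2: 0.8}  # probability that Watson calls
P_G_given_S = {0: 0.01, 1: 0.3, 2: 0.7}  # probability that Gibbons calls

def joint(b, e, r, s, w, g):
    """Product decomposition:
    Prob(B,E,R,S,W,G) = Prob(B)Prob(E)Prob(R|E)Prob(S|E,B)Prob(W|S)Prob(G|S)."""
    p = P_B[b] * P_E[e] * P_R_given_E[e][r] * P_S_given_EB[(e, b)][s]
    p *= P_W_given_S[s] if w else 1.0 - P_W_given_S[s]
    p *= P_G_given_S[s] if g else 1.0 - P_G_given_S[s]
    return p

# Sanity check: the joint sums to one over all assignments.
total = sum(joint(b, e, r, s, w, g)
            for b, e, r, w, g in product([True, False], repeat=5)
            for s in range(3))
print(round(total, 10))  # 1.0
```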
Computing Probabilities
Besides reducing the number of probabilities we need to obtain and store, Bayes nets
also allow us to compute new probabilities more efficiently. A typical reasoning task
in a Bayes net is to compute the posterior probability of a variable given some
instantiation of a set of variables (the evidence nodes).
For example, we hear on the radio that there was an earthquake, and we want
to compute the new probability that the alarm rang: P(S = 0|e), P(S = 1|e),
P(S = 2|e). Or we want to compute the probability that Dr. Watson will call after
we hear the radio report: P(W = ⊤|e), P(W = ⊥|e).
In general, we want to compute the new probabilities of the different values of
a variable V given some set of assignments to another collection of
variables V1 = v1, V2 = v2, ..., Vk = vk. Such a computation can be performed using
the structure of the Bayes network and Bayes' rule.
Many sophisticated algorithms have been developed for probabilistic inference
in Bayes networks. However, for our purposes we will give only an
overview of a simple technique using variable elimination.
Let V1, V2, ..., Vk be the variables of a Bayes network. We want to compute the
probability of Vi given some set of values for some of the other variables Vj1, ..., Vjk
(an arbitrary set). For example, we want to compute the conditional probability
Prob(V3|V1 = a, V4 = b, V5 = c). To this end, we will compute the set of unconditional
probabilities Prob(V3, V1 = a, V4 = b, V5 = c), one for every possible value of V3.
Then we will normalize these unconditional probabilities so that they sum to one
over all values of V3; this will give us the set of conditional probabilities we want.
Now, to compute the unconditional probabilities, we first write them down as sums
of probabilities involving all of the variables in the network. In this case we rewrite
the probabilities Prob(V3, V1 = a, V4 = b, V5 = c) as the sum:
Finally we use the Bayes Network to decompose the probabilities inside the summa-
tion, and try to simplify them by moving various terms out of summations.
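The sum-out-and-normalize step can be sketched on a toy three-variable chain network (the structure and all numbers below are made up for illustration, not the burglary example):

```python
# Toy chain network V1 -> V2 -> V3 with made-up CPTs (boolean variables).
P_V1 = {True: 0.3, False: 0.7}
P_V2_given_V1 = {True: {True: 0.8, False: 0.2},
                 False: {True: 0.1, False: 0.9}}
P_V3_given_V2 = {True: {True: 0.6, False: 0.4},
                 False: {True: 0.05, False: 0.95}}

def joint(v1, v2, v3):
    """Chain decomposition: Prob(V1,V2,V3) = Prob(V1)Prob(V2|V1)Prob(V3|V2)."""
    return P_V1[v1] * P_V2_given_V1[v1][v2] * P_V3_given_V2[v2][v3]

def posterior_v3(v1_obs):
    """Prob(V3 | V1 = v1_obs): sum the joint over the hidden variable V2,
    then normalize over the values of V3."""
    unnorm = {v3: sum(joint(v1_obs, v2, v3) for v2 in (True, False))
              for v3 in (True, False)}
    z = sum(unnorm.values())
    return {v3: p / z for v3, p in unnorm.items()}

post = posterior_v3(True)
print(round(post[True], 6))  # 0.49
```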
Chapter 4
Motivation and Methodology
Figure 4.1: Biederman's Geons. Image taken from the "Avian Visual Cognition" on-line
book available at http://www.pigeon.psy.tufts.edu/avc/toc.htm
can be called primal access: the first contact of perceptual input from an isolated,
unanticipated object to a representation in memory.
Following this idea, RBC was developed to account for primal recognition of ob-
jects which does not utilize higher-level cognitive processes. Higher-level processing
may involve the use of shading, texture, or color in finer discriminations of objects.
Biederman notes that certain properties of visual features remain invariant to
perspective transformation through small angles. For example a straight edge appears
straight, while a curved edge appears curved, through a wide range of rotations of
the object, although the exact angle or curvature of that edge changes with rotation.
Starting from this consideration Biederman identified 36 geons (Figure 4.1), qual-
itatively derived using four attributes of generalized cylinders [13]:
The most important contribution of RBC is its proposal of a particular vo-
cabulary of components; in fact, since its introduction it has encouraged and inspired
much research in the computer vision community.
Edelman in [30] described some theoretical limitations of the approach. In partic-
ular, he emphasizes the necessity of introducing metric information for the components.
The most common AI approach to realizing such systems is based on representing
the world and the inference rules in a logical theory, using inference mechanisms
to reason about the world.
Since our framework is based on the definition of a logical reasoning system, our
principal task is its formalization. In general, the formalization process begins with
the definition of the language in which the facts can be represented. By translating
the facts into the language we obtain a set of sentences which is called the knowledge
base.
Then the reasoning process is translated into the language through the definition
of a set of axioms, which are generally non-trivial. The purpose of this step is to
ensure that if a fact A follows from the facts included in the knowledge base, then A
is a logical consequence of those facts.
Following this approach, the formalization of our vision based reasoning system
is achieved through the identification of a set of primitives, which constitute the
ontology of our language, and the introduction of a set of axioms both for the
connection and the definition of the primitives.
The Ontology
In philosophy, ontology is the branch that studies what things exist. In AI, it is
an explicit formal specification of how to represent the objects, concepts and other
entities that are assumed to exist in some area of interest, and the relationships that
hold among them.
Translating this into the recognition problem, the ontology of our reasoning system
is composed of the set of visual features used for the recognition task. In particular,
it is a set of multidimensional primitives, composed of:
NP ∪ NB ∪ NF ∪ NA ∪ NSP
1. For each pair of views v, v′, the corresponding 2D orthographic projections (op)
are such that op(g, v) ≠ op(g, v′).

2. For each pair of views v, v′, there is no mapping m such that m(v) = v′, i.e. the
set is minimal.
We shall show that, for each SymGeon, the set of its aspects represented in GS
comprises all its significant 2D orthographic projections, in which a linear transfor-
mation for smooth edges is defined. In this way the superellipsoid geometrical
properties, as well as the deformations, are qualitatively simulated via a compositional
construction exploiting the 2D projections.
This problem is addressed in many papers concerning the construction of the aspect
graph of a generic object [44, 31, 57]. For our purpose we use an orthographic model
to describe the viewpoint space, where each vantage point is represented by a point
on a sphere surrounding the object. In this way each viewpoint is defined by two
parameters, the longitude θ and the latitude φ (see Figure 4.3), and is denoted by Vθ,φ.

Figure 4.3: Viewpoints Space.

In this way, given a point (x0, y0, z0) in the object reference frame, we can compute
its projection on the image plane (u, v) in two steps:
i. a rotation to align the point to the z-axis, so that the vantage point is also
aligned;
which, after the orthographic projection into the image plane, yields:
u = x0 cos θ − z0 sin θ
(4.2)
v = x0 sin θ sin φ + y0 cos φ + z0 cos θ sin φ
The points (θ, φ, u, v) defined by the above equations, for all possible values of
θ and φ, represent the set of points in the viewpoint space occupied by the point
(x0, y0, z0). Our problem is to understand what happens to the image point (u, v)
as the viewpoint Vθ,φ is changed.
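Equation 4.2 can be sketched directly (the function name is hypothetical):

```python
import math

def project(x0, y0, z0, theta, phi):
    """Orthographic projection (equation 4.2) of an object-frame point
    onto the image plane for the vantage point V_{theta, phi}."""
    u = x0 * math.cos(theta) - z0 * math.sin(theta)
    v = (x0 * math.sin(theta) * math.sin(phi)
         + y0 * math.cos(phi)
         + z0 * math.cos(theta) * math.sin(phi))
    return u, v

# With theta = phi = 0 the image plane coincides with the xy plane.
print(project(1.0, 2.0, 3.0, 0.0, 0.0))  # (1.0, 2.0)
```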
• a = (a1, a2, a3) are the scale parameters along the x, y and z axes;

• ε = (ε1, ε2) are the squareness parameters along the longitude and latitude
directions;

• K = (Kx, Ky) are the tapering parameters along the x and y axes;
vi. ν : E → {r, c} is a function which assigns a label to each arc depending on the
kind of edges generating it.
We are now able to recognize two different aspects of a symgeon using a notion of
Topologically Equivalent Aspects, which is obtained as a graph isomorphism.

Definition 5 (Topological equivalence) Let G be a symgeon, and Vθ,φ and Vθ′,φ′
two view directions of G. Let AG and A′G be the two aspects of G defined according
to the two view directions. AG and A′G are topologically equivalent if they are
isomorphic.
NA = {AG | G ∈ S, ¬∃A′G . AG ≡ A′G}

Here S is the set of symgeons (see Table 3.1). The set of aspects is depicted in
Appendix B.
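Since the aspect graphs of single SymGeons are small, topological equivalence can be tested by brute-force graph isomorphism; the vertex-list/edge-set representation used here is an illustrative assumption:

```python
from itertools import permutations

def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test between two small undirected graphs,
    given as vertex lists plus edge lists (adequate for the small aspect
    graphs of single SymGeons)."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    eb = {frozenset(e) for e in edges_b}
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))  # candidate node correspondence
        if {frozenset((mapping[u], mapping[v])) for u, v in edges_a} == eb:
            return True
    return False

# Two square aspects drawn with different vertex names are topologically
# equivalent: both are 4-cycles.
print(isomorphic([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)],
                 ['a', 'b', 'c', 'd'],
                 [('a', 'c'), ('c', 'b'), ('b', 'd'), ('d', 'a')]))  # True
```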
As we did for the aspects, we need to find the minimal set of faces. To this end
we have to consider each aspect-level node and its topological characterization
through the aspect graph AG which is, by virtue of its properties, a planar graph.
Any planar graph enjoys the following property: its drawing divides the plane into
connected regions. Moreover, the number of such regions is independent of the
particular drawing. Exactly one of these regions is unbounded and is called the
exterior face. All other faces are bounded by the edges of the graph.
Extending this property to the graph AG, we can consider all its bounded regions
as candidate faces of NF (see Figure on the right). In the same way as we did for
the aspects, we have to introduce a definition of similarity between faces, in order
to generate the minimal number of elements.

To this end we can give a topological definition of a face using the notion of
subgraph. A face aspect (FA) of a face region is a subgraph of a symgeon aspect AG,
formed by the edges bounding the region. In particular, an FA is a cycle of the AG
which generates it.

By the above definition we can use the same topological equivalence introduced
before for the aspects. Some elements of the set of faces are depicted in Figure 6.7.
Finally, boundits represent boundary elements; as done above, starting from a face
f ∈ NF we introduce a constructive way to generate the boundary elements b.

Let F = ⟨vi1, vi2, ..., vin⟩ be a cycle representing an FA of an AG. This means
that for every pair of adjacent nodes vi and vj belonging to F, there exists an edge
ek = (vi, vj) ∈ AG. Therefore we can say that two edges ei and ej are connected in
F if ei = (vl, vk), ej = (vk, vm) and vl, vk, vm are adjacent nodes of F.

Using this notion, we can proceed starting from each face f and its structural
definition F, adding a boundary b for every pair of adjacent edges in F. As for the
face level, we can use the definition of topological equivalence given in Definitions
3-5 to minimize NB.

Following this procedure, NB would be composed only of the elements ll, eal and
eaea represented in Figure 6.4; but in order to simplify the axiomatization of our
reasoning system, we also introduce four other elements into the ontology.
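The constructive procedure just described can be sketched as follows; representing a boundit as the pair of edges sharing a node is an illustrative assumption:

```python
def boundits_from_face(cycle):
    """Generate one boundary element (boundit) for every pair of adjacent
    edges of a face cycle <v1, ..., vn>; each boundit is represented as the
    pair of edges sharing a node (an illustrative assumption)."""
    n = len(cycle)
    edges = [(cycle[i], cycle[(i + 1) % n]) for i in range(n)]
    return [(edges[i], edges[(i + 1) % n]) for i in range(n)]

# A triangular face region yields three boundits, one per corner.
corners = boundits_from_face(['v1', 'v2', 'v3'])
print(len(corners))  # 3
print(corners[0])    # (('v1', 'v2'), ('v2', 'v3'))
```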
Axiomatization
Since the image interpretation task that we consider involves information about prim-
itives arranged in space, representation of and reasoning about spatial relations are
important components of our framework. Therefore, the core of our visual reasoning
system is the Algebra of Figure (AF), described in detail in Chapter 6.
Techniques to represent spatial information have been studied in AI in the area of
spatial reasoning, where different logical frameworks have been introduced to formalize
image domain knowledge. Reiter and Mackworth in [87] introduced a framework based
on the definition of three sets of axioms, where an interpretation of an image is
defined to be a logical model of these axioms.
Other interesting results have been obtained for spatial reasoning in the context of
GIS (Geographic Information Systems), where different sets of relations are introduced
for regions in space [20, 8, 19].
Our Algebra of Figure is a multi-sorted algebra used both to define each basic
element in our ontology and to describe the objects that have to be recognized by
the system.

The sorts of the algebra are those needed to represent the classes of elements
defined in the ontology (Primits, Boundits, Faces, Aspects and SymGeons), plus a
sort named Scene Graph, which represents objects obtained by suitable composition
of SymGeons.

The elements of the algebra can be composed using four possible operators which
represent geometrical properties: Connection, Parallelism, Symmetry and
Orthogonality. For convenience, to this set we add an additional operator,
representing the Angularity property, which can be defined in terms of the previous
ones. They are represented in Table 4.1, together with their functional denotation,
which will become clearer in the next Chapter.

Table 4.1: Relations between elements and their functional denotation.
C = Connected: ⊕nC
P = Parallel: ⊕nP
S = Symmetric: ⊕nS
T = Orthogonal: ⊕nT
V = Angular: ⊕nV
Observe that all the relations used to define the operators are reflexive. This is an
important property: especially for objects, it allows us to give a definition which
is view independent, i.e. the composing relations are preserved under translation,
rotation or scaling.

The choice of the relations is inspired mainly by the work of Lowe [62, 83, 63],
which uses proximity, parallelism, symmetry, etc. to describe relations among 2D
features. Despite their simplicity, these five operators allow us to describe a large
number of complex objects.
The axioms of our algebra are divided into three groups:
1. Grouping Axioms: these axioms are used to build all the terms of the algebra
and, more importantly, to parse the graph used to represent the observed scene
(the scene graph, described in Chapter 1.2). Such a parsing procedure is the
keystone of our reasoning approach, because it is used to recognize the objects.

2. Connection Axioms: these axioms are used to capture, among all the possible
models, only those which are canonical, i.e. which represent valid connections
among elements. Besides this, connection axioms give a semantics to the
connection operators.
The scene graph, in turn, can be constructed because a set of SymGeons has been
recovered from the image. Each SymGeon, in turn, is recovered as a hypothesis and
then, if the hypothesis is reliable, the SymGeon is suitably localized in the scene.
In Section 4.2 we have described the primitive components of the image, i.e. the
primits. By suitably composing primits we form boundits, by suitably composing
boundits we form faces, and finally by composing faces we form aspects, where aspects
are views of a SymGeon from different vantage points. This composition process is
formulated in FOL, by explicit definitions, as previously introduced using the algebra
AF. However, due to image noise and the uncertainty of its structure, the presence
of a SymGeon in the scene is gathered using Bayesian inference: e.g. a given closed
region of the Syntactic Graph is a cylindroid with probability p.
In this section we describe the basic idea of constructing the Bayes-aspect network.
It is obtained suitably composing the structure of a HAG [27] together with causal
relations obtained by introducing the connection relations r ∈ R, between features
(primits, boundits, faces, aspects), defined with the algebra AF. A hierarchical aspect
graph representation identifies equivalent views and neighborhood relations generating
a graphical structure of views. Each node of the HAG represents an aspect of the 3D
type in the hierarchy. Each link represents some visual event where transitions occur
between two neighboring general views. As we mentioned in the preliminaries, since a
symgeon is obtained by suitable tapering and/or bending transformations applied to
a superellipsoid, a few aspects are needed to capture the change in the vantage point.
Dickinson and Metaxas have defined a three layered HAG, Gpg = (NA ∪ NF ∪ NB , E)
for parametric geons in which the set of nodes N is partitioned into three sets, where
NA is the set of aspects, NF is the set of faces, and NB is the set of boundits. The
HAG levels are naturally ordered: the nodes in NB contribute to the composition of the faces represented in NF which, in turn, contribute to the composition of the aspects in NA . The aspects, finally, are the significant projections of the parametric geons. In Chapter 2 we have given an overview of this approach.
In our case we have extended the notion of parametric geons to that of SymGeons.
We shall now give the methodology through which we construct the topology of
the Bayes-aspect network.
2. E = ECm ∪ ECa , where ECm is the set of composition links leading from a
feature node to a connection node, and ECa is the set of causal links leading
from a connection node to a feature node.
Table 4.2: CPT for feature nodes (left) and connection nodes (right).
Figure 4.4: A Gaussian probability function with mean 0 and variances 0.5 and 0.8.
According to such a definition, each feature node of the BAN is labelled with a feature element (primits, boundits, faces and aspects) composing our ontology, described in Section 4.2. Once all the nodes concerning the features used to represent the aspect that roots the structure have been determined, we need to introduce suitable connection nodes and the conditional probability tables (CPT) linking nodes of a lower feature level to the nodes of the upper level. For example, two boundary nodes cooperate in determining a face via a connection node, e.g. a symmetry node, labelled by ⊕S . Each connection node, since it is defined in terms of a distance as described in the next Chapter, is given a probability according to the distance of the features entering the connection.
Consider, for example, the quadrilateral face with curved boundaries f depicted in Figure 7.1, which is obtained by suitably composing two boundits eal and ial using
40 CHAPTER 4. MOTIVATION AND METHODOLOGY
P rob(eal(x, y)) = p ≡
∃p1 , p2 , p01 , p02 , α, α0 .x = primit(p1 , p2 , α) ∧ y = primit(p01 , p02 , α0 )∧
α = slope(p1 , p2 ) ∧ ∃C, rm , rM , β, φ, ∆φ.α0 = ell(C, rm , rM , β, φ, ∆φ)∧
d = x ⊕C y ∧ P rob(x) = 1 ∧ P rob(y) = 1 ∧ p = N0,σC (d)
(4.6)
Observe that in the case of primits, probabilities can be either 1 or 0: 1 means that the primit is found in the image syntactic graph by the syntactic analyzer, and 0 otherwise.
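The connection-node probability p = N0,σC(d) of equation (4.6) can be sketched as follows. The sketch assumes an unnormalized zero-mean Gaussian, so that p = 1 when the connection distance d is 0, consistent with root primits having probability 1; the value of sigma is a hypothetical tuning parameter.

```cpp
#include <cmath>

// Hypothetical connection-node probability: an unnormalized zero-mean
// Gaussian of the connection distance d, so that p = 1 when d = 0 and
// p decays towards 0 as the features drift apart. sigma plays the role
// of sigma_C in equation (4.6); its value is an assumption.
double connectionProbability(double d, double sigma) {
    return std::exp(-(d * d) / (2.0 * sigma * sigma));
}
```

With this choice, a perfect connection (d = 0) yields evidence of strength 1, while distant features contribute almost nothing to the upper-level node.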
To understand what H is and the role of sensing actions, consider the following simple
example. There is a table and a chair in a room and an observation of the scene is
performed, i.e. a shot of the scene is taken (we open our eyes and look into the room);
let us catch the instant before we make sense out of what there is in the room. Clearly,
at the very moment in which the image is taken no distinction among objects is made.
Therefore it is not a single sensing action like sense(isT able(x), v) that takes place,
but a scene/image acquisition.
4.4. COGNITIVE AND DESCRIPTION FRAMEWORK 41
From the image acquisition till the inference, leading to a hypothesis that there might be a table in the room, a complex process of revelation takes place: one bringing the shapeless and scattered components identified in the image to the surface of cognition2 , by giving a structure to these components. And there is a step in the structuring that reveals the meaning: “that’s a table”. In other words, the re-cognition process is a thread of revelations (the apperception) giving, attributing, meaning to the elements of the image. This is achieved by conjugating the construction of a tight
data structure (a graph of all the symgeons occurring in the scene together with their
topological relations), which is the hypothesis H, together with the meaning given
by a description and denoted by a sensing action like sense(isT able(x), v). Therefore
the sense(isT able(x), v) action has, indeed, the double meaning of giving sense to the
elements of the data structure and of bringing to the surface of cognition the existence
of an object, a table, in fact.
To understand what we mean let’s go through the example of the table. We might
simplify the successor state axiom in (3.3) as follows:
3. Processing step: the data structure is the image syntactic graph, i.e. the graph depicting all the segments (straight segments and arcs) traced in the image. Such a graph is described in Chapter 7.
Chapter 5

Syntactic Image Analysis
The framework of our vision reasoning system can be divided into two parts. The
first one concerns the recognition of the SymGeons in the scene, and the second one
regards the recognition of the objects.
In Chapters 5, 6 and 7 we describe the first recognition process. Such a recognition
is achieved in two steps, starting from 2D image data. The first step consists of an
image analysis whose purpose is to identify, in the acquired image, the basic elements
of our ontology (primits). It is described in this Chapter.
The second step concerns the compositional process that, starting from a set of
primits, identifies boundits, faces and aspects in turn, using probabilistic reasoning
and leading to the recognition of a specific SymGeon. It is described in Chapter 7.
44 CHAPTER 5. SYNTACTIC IMAGE ANALYSIS
[Figure 5.1: the parameters characterizing a straight primit (end points p1 , p2 ) and an elliptic primit (center C, radii rm and rM , foci F1 and F2 ), defined w.r.t. the image reference frame (u, v).]
Observe that processing(I, s) is a functional fluent denoting the interface with basic
visual processing. Therefore, it has to be distinguished both from a sensing action
and a control action.
In particular, a straight line and an arc of ellipse are characterized using the following parameters (see Figure 5.1), defined w.r.t. the image reference frame (u, v):
Implementation
The input of the algorithm is a binary image obtained from the acquired image and
elaborated using edge detection operators. The output is a list of segments which are
classified in two categories: rectilinear and curvilinear. To accomplish this task we
operate in two steps:
1. Grouping: the image points that compose each instance of the target curve
are grouped together;
5.1. IMAGE ANALYSIS 45
2. Model fitting: given a set of image points probably belonging to a single curve,
find the best curve interpolating the points.
To solve the grouping problem we use a simple edge following algorithm that
provides an ordered list of connected points belonging to the object boundaries (chain)
by scanning the binary image in a predefined order. The algorithm is sketched out
below:
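A minimal illustration of such an edge-following step is sketched below. It is an assumption-laden sketch, not the thesis's exact algorithm: it scans the binary image in raster order and, whenever it meets an unvisited edge pixel, greedily follows 8-connected unvisited edge pixels, emitting one ordered chain of boundary points per curve.

```cpp
#include <vector>
#include <utility>

using Chain = std::vector<std::pair<int,int>>;

// Minimal edge-following sketch: scan the binary image in raster order;
// when an unvisited edge pixel is found, greedily follow 8-connected
// unvisited edge pixels, producing an ordered chain of boundary points.
std::vector<Chain> followEdges(const std::vector<std::vector<int>>& img) {
    int h = img.size(), w = h ? img[0].size() : 0;
    std::vector<std::vector<bool>> seen(h, std::vector<bool>(w, false));
    std::vector<Chain> chains;
    for (int v = 0; v < h; ++v)
        for (int u = 0; u < w; ++u) {
            if (!img[v][u] || seen[v][u]) continue;
            Chain c;
            int cu = u, cv = v;
            bool moved = true;
            while (moved) {
                seen[cv][cu] = true;
                c.push_back({cu, cv});
                moved = false;
                for (int dv = -1; dv <= 1 && !moved; ++dv)
                    for (int du = -1; du <= 1 && !moved; ++du) {
                        int nu = cu + du, nv = cv + dv;
                        if (nu >= 0 && nu < w && nv >= 0 && nv < h &&
                            img[nv][nu] && !seen[nv][nu]) {
                            cu = nu; cv = nv; moved = true;
                        }
                    }
            }
            chains.push_back(c);
        }
    return chains;
}
```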
The model fitting problem, instead, is solved using a split & merge algorithm applied to each list of connected pixels. The two phases of the algorithm are described in the following.
Split phase This phase consists of a top-down algorithm where the list is recursively divided into sublists until each one can be approximated with a straight line or an arc of ellipse with an acceptable error.
Algorithm SPLIT
Merge phase Using only the split phase, the risk is to obtain an over-fitting of the list, i.e. a fitting of the list using more segments than necessary. Therefore the task of the merge phase is to analyze every pair of adjacent segments and fuse them into a unique segment if the error is acceptable.
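The split phase described above is essentially a recursive farthest-point subdivision (in the spirit of Ramer-Douglas-Peucker). A minimal sketch, restricted to straight-segment fitting for brevity (the thesis's version also tries an elliptic-arc fit at each step), could be:

```cpp
#include <vector>
#include <cmath>
#include <utility>

using Pt = std::pair<double,double>;

// Perpendicular distance from p to the line through a and b.
double pointLineDist(const Pt& p, const Pt& a, const Pt& b) {
    double dx = b.first - a.first, dy = b.second - a.second;
    double len = std::sqrt(dx*dx + dy*dy);
    if (len == 0) return std::hypot(p.first - a.first, p.second - a.second);
    return std::fabs(dy*(p.first - a.first) - dx*(p.second - a.second)) / len;
}

// SPLIT sketch (straight segments only, for brevity): recursively split
// the chain at the point farthest from the chord until each sublist can
// be approximated within the error threshold eps.
void split(const std::vector<Pt>& chain, int lo, int hi, double eps,
           std::vector<int>& breaks) {
    double worst = 0; int idx = -1;
    for (int i = lo + 1; i < hi; ++i) {
        double d = pointLineDist(chain[i], chain[lo], chain[hi]);
        if (d > worst) { worst = d; idx = i; }
    }
    if (worst > eps) {           // error too large: split at the worst point
        split(chain, lo, idx, eps, breaks);
        breaks.push_back(idx);
        split(chain, idx, hi, eps, breaks);
    }
}
```

The merge phase would then re-examine adjacent sublists and join them whenever a single fit keeps the error below the threshold.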
The technique used to fit a set of points with an arc of ellipse is described in [39]; in Appendix A we present a short overview.
The final step of the procedure is to determine the parameters characterizing
each kind of primit, as reported previously. The endpoints of the primits are easily identified by considering the first and the last point of the list from which the primit is obtained.
It is more complicated to determine the parameters C, am , aM , α, φ and ∆φ characterizing an elliptic primit. Starting from the coefficients of the generic conic Au² + Bv² + Cuv + Du + Ev + F = 0, we first verify that it is a real ellipse, through the evaluation of the following conditions:

          | A    C/2  D/2 |             | A    C/2 |             ∆
    ∆ =   | C/2  B    E/2 |  ≠ 0,       | C/2  B   |  > 0,     ───── < 0.
          | D/2  E/2  F   |                                    A + B
Center of Ellipse: since an ellipse is a central conic (like a hyperbola), the center of the conic (cu , cv ) is the intersection of its diameters, so it is the solution of the following system:

    2Au + Cv + D = 0
    Cu + 2Bv + E = 0
Therefore, the coordinates of the center of the ellipse are given by:

         | −D   C  |                 | 2A  −D |
         | −E   2B |                 | C   −E |
    cu = ───────────          cv =  ───────────
         | 2A   C  |                 | 2A   C |
         | C    2B |                 | C    2B |
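The 2x2 system above can be solved directly by Cramer's rule; a minimal sketch (function name and error handling are our own):

```cpp
#include <cmath>

// Center of a central conic A*u^2 + B*v^2 + C*u*v + D*u + E*v + F = 0,
// obtained by Cramer's rule from  2A*u + C*v + D = 0, C*u + 2B*v + E = 0.
// Returns false when the determinant vanishes (no unique center).
bool ellipseCenter(double A, double B, double C, double D, double E,
                   double& cu, double& cv) {
    double det = 2*A*2*B - C*C;              // |2A C; C 2B|
    if (std::fabs(det) < 1e-12) return false;
    cu = (-D*2*B - C*(-E)) / det;            // Cramer numerator for u
    cv = (2*A*(-E) - (-D)*C) / det;          // Cramer numerator for v
    return true;
}
```

For instance, the circle u² + v² − 4u − 6v + 12 = 0 has center (2, 3), which the sketch recovers from A = B = 1, C = 0, D = −4, E = −6.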
Major and Minor Axes: we need to transform equation (5.2) into the canonical form of the ellipse, whose equation is:

    au² + bv² + c = 0

The coefficients a, b and c are determined as the solution of the equation system:

    ab = AB − (C/2)²
    a + b = A + B
    c = ∆ / (AB − (C/2)²)

where ∆ is specified above. Let r1 = √(−c/a) and r2 = √(−c/b); the minor and major axes are:

    am = r1 , aM = r2    if r1 < r2
    am = r2 , aM = r1    otherwise.
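Since a + b and ab are both known, a and b are the roots of x² − (A + B)x + (AB − (C/2)²) = 0. A minimal sketch putting the whole computation together (function names are our own):

```cpp
#include <cmath>

// Semi-axes of the ellipse in canonical form a*u^2 + b*v^2 + c = 0:
// a and b are the roots of x^2 - (A+B)x + (AB - (C/2)^2) = 0, and
// c = Delta / (AB - (C/2)^2), with Delta the 3x3 conic determinant.
// Returns false if the coefficients do not describe a real ellipse.
bool ellipseAxes(double A, double B, double C, double D, double E, double F,
                 double& am, double& aM) {
    double q = A*B - (C/2)*(C/2);            // must be > 0 for an ellipse
    double Delta = A*(B*F - E*E/4) - (C/2)*(C/2*F - E/2*D/2)
                 + (D/2)*(C/2*E/2 - B*D/2);
    if (q <= 0 || Delta/(A + B) >= 0) return false;
    double disc = (A+B)*(A+B) - 4*q;         // discriminant for a, b
    if (disc < 0) return false;
    double a = ((A+B) + std::sqrt(disc)) / 2;
    double b = ((A+B) - std::sqrt(disc)) / 2;
    double c = Delta / q;
    double r1 = std::sqrt(-c / a), r2 = std::sqrt(-c / b);
    am = std::fmin(r1, r2);
    aM = std::fmax(r1, r2);
    return true;
}
```

As a check, u² + 4v² − 4 = 0 (semi-axes 2 and 1) yields am = 1 and aM = 2.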
- if r < 0:
φ = φ2 − α
∆φ = arcExtension(φ2 , φ1 )
In Section 5.2, we list three basic functions of the Syntactic Image Analyzer: arcExtension,
ellipseParameters and ellipsePhiDeltaPhi.
The output of the Syntactic Image Analysis is the image structure graph. In our
logic-based framework such a graph is defined through a list of terms, representing
the sequence of primits extracted from the image, having the following form:
primit( l1, point(P1X, P1Y), point(P2X, P2Y), S ).
primit( a2, point(P1X, P1Y), point(P2X, P2Y),
ell( point(CX, CY), Rm, RM, Alpha, Phi, DPhi) ).
Here the first argument is a unique identifier in the form “l”, for straight line, or “a”,
for arc of ellipse, followed by an ordering number. The other parameters are those
introduced above, to describe primitive traits.
In the sequence of figures depicted in Table 5.1, the resulting output of the above steps, applied to the image of an abstract picture, is shown. The associated list of primits is reported in Table 5.2.
From the analysis of such a list, it is possible to note that two considerable kinds of errors occur. The first one is an incorrect subdivision of a line into two primits, e.g. l1, l2 or l3, l4; the second one is a wrong classification of a segment, e.g. l16, l17, l31, l32. This behaviour is due to the choice of the threshold errors for the line and ellipse fitting: in the first case the threshold is too high, while in the second one it is too low. This contrast makes the implementation of the Syntactic Image Analysis difficult.
The problem can probably be solved through an initial preprocessing of the data, looking for the best threshold for the specific image. Another possibility, which best fits our idea of a vision reasoning system, is a close interaction between the high-level reasoning system and the low-level visual processing. In such a way, as described in [94], through a mechanism of feedback and expectation, the high-level hypotheses about the interpretation of sensor data directly affect the low-level vision processing which, in our particular case, adjusts its internal thresholds trying to extract those clues which allow such hypotheses to be confirmed.
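The feedback idea can be sketched as a simple control loop; the function name and the halving policy are hypothetical illustrations, only the idea of re-running the low-level fitting with revised thresholds comes from the text.

```cpp
#include <functional>

// Hypothetical feedback loop between high-level hypotheses and low-level
// fitting: if the current segmentation does not support the hypothesis,
// retry with a revised fitting threshold. The policy of halving the
// threshold is an illustration, not the thesis's method.
double adjustThreshold(const std::function<bool(double)>& confirmed,
                       double eps, int maxTries) {
    for (int i = 0; i < maxTries; ++i) {
        if (confirmed(eps)) return eps;   // hypothesis supported: keep eps
        eps /= 2.0;                       // otherwise tighten and retry
    }
    return eps;
}
```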
Arc Extension
//////////// Arc extension between two angles first and second ////////////
double arcExtension(double first, double second)
{
if ((first>0) && (second>0)) {
if (first<second)
return fabs(first-second);
else
return 2*PIGRECA-first+second;
} else {
if ((first<0) && (second<0)) {
if (first<second)
return fabs(first-second);
else
return fabs(first)+2*PIGRECA+second;
}
}
if (first<0)
return fabs(first)+second;
else {
return 2*PIGRECA-first-fabs(second);
}
}
Ellipse’s Parameters
/*********************** Ellipse Parameters **************************/
bool ellipseParameters(double A0, double Ax, double Ay,
double delta, j, i, q;
double det, detX, detY;
double r1, r2;
Computing φ and ∆φ
/////////////////// Computing phi and DeltaPhi ////////////////////
bool ellipsePhiDeltaPhi(ellisse *e,
double x1, double y1,
double x2, double y2,
double mx, double my) {
e->phi = phi1-e->alpha;
e->delta_phi = ampiezzaArco(phi1, phi2);
} else {
return false;
}
return true;
}
5.2. PORTIONS OF THE SIA CODE 53
Chapter 6

An Algebra of Figures
6.1 Introduction
In this chapter we present a formalization of the operations and primitives that we consider the elementary constituents of a scene representation. By a scene representation we intend, indeed, a disposition of human artifacts in a room, hallway, etc., thus excluding landscapes, animals, human beings and so on. Our formalization therefore takes into account the way a scene can first be decomposed, by denoting its basic patterns, and then recomposed by using rational laws concerning not only geometric properties, but also commonsense rules. Although these last rules cannot be general, they are crucial because they take into account not only the way human beings assemble artifacts, but also the way human beings observe artifacts.
This formalization will be used to introduce descriptions, which are the basic patterns of the reasoning process leading to the interpretation of a scene. Here, in fact, instead of talking about the image, in which patterns have no meaning associated with them, we shall talk about the scene, and the difference lies in the fact that each token of a scene has its own specific dimension and meaning.
A significant aspect of our formalization is that of denoting patterns by new terms, from the most simple to the most complex. These terms can be composed through operations which, in turn, can form new terms, according to the peculiar composition they denote. We have therefore introduced a body of new terminology which is not gratuitous, as it serves to introduce a suitable symbolic notation to model a scene. The structural relationships among terms, denoting scene patterns, and their underlying axiomatization, is what we call an algebra of figures. Although figures are terms of the algebra, we could not avoid a metric notion of distance to define them. However the crucial point is: shall we refer to the image plane (which is 2D) or shall we refer to the scene space (which is 3D)? It turns out that, as far as the process of composition is concerned, one can shift from the plane to the space very easily, since there is only a metric problem (concerning the definition of distance), which can be made independent from the definition of each shape.
56 CHAPTER 6. AN ALGEBRA OF FIGURE
For example, if we want to describe a table by saying that the legs are orthogonal to the surface, we apparently do not need the third dimension, unless we specify the SymGeons in the space instead of specifying them in the plane. Now, with the exclusion of Chapter 7, in which we face the problem of Object Recognition, in this chapter we are concerned only with the “representation” of SymGeons components; therefore we shall refer only to the image plane, and when we talk about a distance we consider the Euclidean distance between points, δ(p1 (u1 , v1 ), p2 (u2 , v2 )) = √((u2 − u1 )² + (v2 − v1 )²).
To indicate primits we shall use the lowercase Greek letter ρ; to denote boundits we shall use the lowercase Greek letter β; to denote faces we shall use the lowercase Greek letter φ; and to denote SymGeons we shall use the lowercase Greek letter γ. We shall use all of them with superscripts or subscripts.
Operations, Ω = {, ⊕P , ⊕E , ⊕F }
Here is the concatenation operator (sometimes referred to as a C-operator);
⊕P is the pointwise connection operator;
⊕E is the edgewise connection operator;
and ⊕F is the facewise connection operator.
The usual metric notions on a plane have to be interpreted in a perceptual context; therefore we somewhat relax the usual metric definitions (e.g. Birkhoff’s) and consider the following. A point (x, y) in the image plane is denoted with p, and δ is the Euclidean distance defined above:
Consider now the usual geometric notions of parallelism, symmetry, and orthogonality. We give some definitions that adapt them to the image plane.
Definition 8 Two primitive elements π1 and π2 , in the image plane, are parallel if
they do not intersect; i.e., for any p ∈ π1 , there is no p-region r (p) s.t. r (p) ⊂ π2 .
6.2. AN ALGEBRA OF FIGURES 57
1. For any p′-region r (p′) ⊂ π1 there is a p″-region r (p″) ⊂ π2 which is equidistant from both l1 and l2 .
Let R be a set of relations between primitive elements in the image plane (e.g. parallelism, symmetry, orthogonality, connection, overlapping, etc.). In particular,
let II indicate parallelism, S indicate symmetry, V angularity, and T orthogonality,
and let R ∈ {II, T, S, V}. A family of n-ary operators {{⊕iX }i≤n }X∈R can be derived,
as follows. Let n be the number bounding the set of primitive elements in the image
plane. For each i ≤ n, and for each X ∈ R, there is a set of n-ary operators {⊕iX }i≤n ;
in particular, when i = 1 and X is any of the above defined connection operators, i.e.
X ∈ {C, P, E, F }, then {⊕iX }i≤n ∈ Ω. Now Ω can be extended with {{⊕iX }i≤n }X∈R .
Relations, the set of F-relations is {=, ≺}, where = is defined within the first-order language and ≺ is a precedence relation.
2. If t, t0 ∈ T then t t0 ∈ T . ∈ Ω.
Nothing else is a term of F. In the following we will denote the terms of F with t with
superscripts or with τ , always with superscripts.
Given the above primitives and operators the two following elements can be de-
fined:
2. Scene Graph. The set SG of Scene graphs is, analogously, defined inductively
as follows:
All the above axioms are necessary to recognize a SymGeon in the image (plane). The above axioms, in fact, cannot be used to recognize an object; to this end we shall further introduce the axioms for descriptions. Observe, however, that the set of Descriptions can be empty, while the set of axioms specified above is necessary to process the aspect graph of an image.
Figure 6.2: (i) A graph, labeled by relations in R; (ii) a tree denoted by a rewriting of the term for the graph, in which some nodes are repeated.
t = (f1 ⊕F f2 ) (f1 ⊕F f3 )
then t has no principal node. By suitably rewriting the term t, we get a term t′ as follows

t′ = f1 ⊕²C (f2 f3 )

then t′ has a principal node.
We show now how the above defined algebra is useful to represent a graph. Con-
sider the graph depicted in Figure 6.2 (i), labeled by some relation in R. The term
denoting the graph is the following:
b ⊕V a b ⊕V c b ⊕P d c ⊕P a c ⊕C d d ⊕C a (6.1)
Proof. If t = t′ ⊕_X^n τ then t has already a principal node, therefore we can assume that t has the form t = t1 · · · tk . We show the claim by induction on the length |t| of t. If |t| = 1 then t is the only term, and the claim is proved. Suppose the claim is true for any k ≤ n, and let t = t1 · · · tk tk+1 . Each ti ∈ T⊕ . Since t denotes a connected graph, and the term tk+1 = g ⊕_X^{|τ|} τ , it follows that either g must appear in some of the tails in ti , 1 ≤ i ≤ k, or there is a term g′, appearing in τ , which is mentioned in some of the ti . If g is mentioned in the tail of ti then, by applying the connection rule introduced in (6.3.1), we can move tk+1 into the tail of such a ti and remove it, so obtaining a term t′ = t1 · · · t′i · · · tk = t. And the claim is proved.
Otherwise we look for the g′. By induction on the structure of g ⊕_X^{|τ|} τ , we show that we can transform this term into g′ ⊕_X^{|τ′|} τ′ = g ⊕_X^{|τ|} τ .
If τ = g′ then by reflexivity we get g′ ⊕X g.
If τ = g′ ⊕_Y^{|τ″|} τ″ then, by applying the connection rule, we can transform tk+1 = g ⊕X g′ ⊕_Y^{|τ″|} τ″ into the terms g ⊕X g′ g′ ⊕_Y^{|τ″|} τ″. By reflexivity we get g′ ⊕X g g′ ⊕_Y^{|τ″|} τ″. Finally, applying the first distributive law, we get g′ ⊕X ⊕Y g τ″; now, by applying the connection rule, using the term ti and g′, we can eliminate the term tk+1 .
If τ = t′1 · · · g′ · · · t′m , then by item (c) of Proposition 1 we get tk+1 = g ⊕X t′1 · · · g ⊕X g′ · · · g ⊕X t′m . This is equal to the term g ⊕X t′1 · · · g′ ⊕X g · · · g ⊕X t′m = g′ ⊕X g g ⊕_X^m t′1 · · · t′m , by (c) again; now, by applying the connection rule, we get the equal term t = g′ ⊕X (g ⊕_X^m t′1 · · · t′m ), and this last can be substituted in the g′ mentioned in ti , by the connection rule. And since we have moved the term t′, t′ = tk+1 , into the tail of such a ti , we can remove it, so obtaining a term t″ = t1 · · · t′i · · · tk = t. And the claim is proved.
Example 3 Let t be the term expressed in Equation (6.1), denoting the completely connected graph depicted in Figure 6.2 (i). Then t can be transformed into a term t′ having a principal node, as follows:
The above term can be seen as one denoting a tree, e.g. the one depicted in Figure 6.2 (ii). Observe that in such a tree some nodes are repeated because of the relations that have to be maintained.
known. The general term known means that they are explicitly defined by a set of axioms, namely the set FF of figure axioms. Observe that we have only one metric notion, which is the distance according to which a connection is accepted or refused. The other parameters (length, angles, etc.) are irrelevant to our axiomatization.
Observe that, although we have introduced sorts for each primitive element in the image, we shall introduce for each primitive element an explicit definition, as follows.
Primit As described in the previous chapter, the Syntactic Image Analyser (SIA) returns a graph of the image in which there are tiny traits or patterns which could all be considered arcs of ellipses of different dimensions. However, to simplify computations at the syntactic level, we shall partition the set of arcs into arcs of line and arcs of ellipse. Then we recompose both of them into a general notion of primit, standing for primitive pattern. The two kinds of primits are, thus:
ρ = primit(p1 , p2 , α) ≡ ρ ∈ SA ∧
  { [ α = slope(ρ) ∧ ∀ρ′, p′, α′. α′ = α → ¬(ρ′ = primit(p1 , p′, α′) ∨ ρ′ = primit(p′, p2 , α′)) ] ∨
    [ ∃C, am , aM , θ, φ, ∆φ. α = ell(C, am , aM , θ, φ, ∆φ) ∧
      ∀γ, ρ′, p′, φ′, ∆φ′. (φ = φ′ ∨ ∆φ = ∆φ′) → ¬((ρ′ = primit(p1 , p′, γ) ∨ ρ′ = primit(p′, p2 , γ)) ∧ γ = ell(C, am , aM , θ, φ′, ∆φ′)) ] }
(6.4)
Boundit Now, for the definition of a boundit (see Figure 6.4) we have to consider the two primits composing it; they could be:
[Figure 6.4: the basic boundits, e.g. ll, eal and ial, obtained by connecting two primits at a middle point pm .]
In the second and third case we have to take into account the cases in which the arc is convex or concave.
Furthermore, let the boundit be composed of the primits ρ1 = primit(p1 , pm , X) and ρ2 = primit(pm′ , p2 , Y ), let lx,y denote a line passing through two points px and py , and let aligned(p1 , p2 , p3 ) be the equation of the straight line passing through the three points p1 , p2 , and p3 . Let primit(p1 , pm , γ) be a primit, and p be a point. We can now define the two notions of convex and concave, of ρ w.r.t. p2 , as follows:
We are now ready to introduce a definition for the set of boundits as follows.
Let pwc(x) be a term denoting the distance between two terms whenever they are
pointwise connected, according to suitable metric conditions that will be specified
further, in the context of the Connection Axioms:
β ∈ boundit(p1 , pm , p2 ) ≡
  ∃pm′ , α, α′. primit(p1 , pm , α) ∧ primit(pm′ , p2 , α′) ∧
  β = primit(p1 , pm , α) ⊕P primit(pm′ , p2 , α′) ∧ pwc(β) ≤ ε.
(6.5)
6.4. FIGURES AXIOMS 65
Now, given a general definition of the set of boundits, we detail each one in the following axiom, defining a boundit as the pointwise connection between two primits:
primit(p1 , pm , X) ⊕P primit(pm′ , p2 , Y ) = β ≡
  {β = ll(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = slope(primit(pm′ , p2 , Y ))} ∨
  {β = eal(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = ial(p1 , pm , p2 ) ∧
   X = slope(primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = eaea(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Concave(primit(p1 , pm , X), p2 ) ∧ Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = iaea(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(p1 , pm , X), p2 ) ∧ Concave(primit(pm′ , p2 , Y ), p1 )} ∨
  {β = iaia(p1 , pm , p2 ) ∧
   X = ell(η, primit(p1 , pm , X)) ∧ Y = ell(η, primit(pm′ , p2 , Y )) ∧
   Convex(primit(p1 , pm , X), p2 ) ∧ Convex(primit(pm′ , p2 , Y ), p1 )}.
(6.6)
Observe that the boundits iaea, ial and eal have a symmetric definition (respectively eaia, lia, and lea), that can be obtained substituting ρ1 for ρ2 , and by a mirror transformation. Indeed, we need only the six boundits mentioned above, because:
ρ1 ⊕P ρ2 = ρ2 ⊕P ρ1
Faces As seen in the previous paragraph there are 6 basic boundits: given that there are 3 kinds of primits (straight, concave and convex), the set of boundits is 3² = 9 minus the 3 symmetric ones. Now, considering the set of faces, we have to consider the admissible connections between boundits (see Figure 6.5 and the connection axioms): i.e. the pointwise and edgewise connections.
To form the faces by pointwise connection, we can consider that each of the 6 basic boundits can connect to itself and the others; we thus get 36 combinations, from which we have to take away the 6 repetitions. Thus we get 30 possible faces. To this set of 30 faces we have to add the set of 3-sided faces obtained by edgewise connections. There are 12 possible ways of combining 6 boundits with a common edge. Therefore we get 42 faces.
66 CHAPTER 6. AN ALGEBRA OF FIGURE
Figure 6.5: Edgewise connection (on the left) and Pointwise connection (on the right)
Figure 6.6: Two faces obtained by both Pointwise and Edge Wise Connection
In general, a term of sort face can be said to belong to the set of faces as follows, where both pwc and ewc are terms which are specified in the connection axioms:
In order to state what composition properties are allowed, in the next section we shall introduce a formalization of the distance between primitive elements in the image; furthermore we shall make precise notions like symmetry and parallelism, given at the beginning of the chapter.
Elliptical Faces
Quadrilateral Faces
Trilateral Faces
from the graph, in order to state whether they can be combined or not by a pointwise connection, it is enough to verify a specific distance between their end points. Analogously for the boundits and the faces, and for the other operators.
Before introducing the axioms we give some useful definitions that will be used in
the sequel.
The minimal distance between two primits is the distance between the closest end points; it is denoted with minDist(ρ1 , ρ2 ) and defined as follows, where δ(p, p′) is the Euclidean distance between p and p′:
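The definition of minDist can be sketched as follows, representing each primit only by its two end points (a simplifying assumption):

```cpp
#include <cmath>
#include <algorithm>

struct Point { double u, v; };

// Euclidean distance between two image points.
double dist(const Point& a, const Point& b) {
    return std::hypot(a.u - b.u, a.v - b.v);
}

// minDist(rho1, rho2): the distance between the closest pair of end
// points of two primits; otherDist would analogously be the distance
// between the remaining pair of end points.
double minDist(const Point& p1, const Point& p2,
               const Point& q1, const Point& q2) {
    return std::min({dist(p1, q1), dist(p1, q2), dist(p2, q1), dist(p2, q2)});
}
```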
ewc(ρ1 ⊕E ρ2 ) = minDist(ρ1 , ρ2 ) + otherDist(ρ1 , ρ2 ) ≡ straight(ρ1 ) ∧ straight(ρ2 )
(6.11)
On the other hand, when the primits considered are eaea, eaia, iaia then the ⊕E
operator is allowed as follows:
ewc(ρ1 ⊕E ρ2 ) = d1 + d2 + dρ + dφ + dc + da ≡
  ∃C1 , C2 , am1 , am2 , aM1 , aM2 , θ1 , θ2 , φ1 , φ2 .
  elliptic(ρ1 , ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 )) ∧
  elliptic(ρ2 , ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 )) ∧
  d1 = minDist(ρ1 , ρ2 ) ∧
  d2 = otherDist(ρ1 , ρ2 ) ∧
  dc = δ(C1 , C2 ) ∧
  dρ = |ρ1 − ρ2 | ∧
  dφ = |φ1 + φ2 | + |∆φ1 − ∆φ2 | ∧
  da = |am1 − am2 | + |aM1 − aM2 |
(6.12)
The parallel relation between primits depends on the kind of primits involved, and
is defined as follows. Let ρ1 and ρ2 be two primits:
par(ρ1 ⊕II ρ2 ) = dρ + dφ + da ≡
  ρ1 = primit(p1 , p2 , X) ∧ ρ2 = primit(p3 , p4 , Y ) ∧
  {X = m1 ∧ Y = m2 ∧ d = |m1 | − |m2 |} ∨
  {∃C1 , C2 , am1 , am2 , aM1 , aM2 , θ1 , θ2 , φ1 , φ2 .
   X = ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 ) ∧
   Y = ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 ) ∧
   dρ = |ρ1 − ρ2 | ∧
   dφ = |(φ1 + ∆φ1 /2) − (φ2 + ∆φ2 /2)| ∧
   da = |am1 − am2 | + |aM1 − aM2 |}
(6.13)
To establish the symmetry between two primits ρ1 and ρ2 (see Figure 6.10), we also generalize the straight primits to arcs of ellipses:
symm(ρ1 ⊕S ρ2 ) = dc + dρ + dφ + da ≡
  ρ1 = primit(p1 , p2 , ell(C1 , am1 , aM1 , θ1 , φ1 , ∆φ1 )) ∧
  ρ2 = primit(p3 , p4 , ell(C2 , am2 , aM2 , θ2 , φ2 , ∆φ2 )) ∧
  d = |φ1 − π/2| + |φ2 − π/2| ∨
  dc = δ(C1 , C2 ) ∧
  dρ = |ρ1 − ρ2 | ∧
  dφ = |φ1 − φ2 | − |(φ1 − φ2 ) + (∆φ1 − ∆φ2 )| ∧
  da = |am1 − am2 | + |aM1 − aM2 |
(6.14)
Case a., b.
pwc(β1 ⊕P β2 ) = minPwc(β1 , β2 ) + otherPwc(β1 , β2 ). (6.18)
6.5. CONNECTION AXIOMS 71
Case c. ,d. , e.
pwc[(aa1 ⊕P ll2 ) ⊕P (aa2 ⊕P ll1 )] = pwc(aa1 ⊕P ll2 ) + pwc(aa2 ⊕P ll1 ) (6.19)
Case f., g.
pwc(aa ⊕P al) = minPwc(aa, al) + otherPwc(aa, al) (6.20)
Analogously, there are six cases to be taken into account for the Edge Wise Connection relation, as shown in Figure 6.12. The ewc relation between boundits is introduced in order to combine a pair of L-junctions forming a valid U-junction (see Figure 6.13).
As we argued before, we have to constrain the connection, in order to avoid strange connections.
Figure 6.16: Some examples of faces generated by PWC and/or EWC among boundits.
Figure 6.17: P W C, EW C and Sym relations between faces. The dashed lines represent either an l-primit or an a-primit.
between boundits. Relations between faces, as well as relations between aspects, are a straightforward extension of the relations between boundits.
To exemplify the set of axioms, we shall therefore introduce a generalized definition of Symmetry, over faces, boundits and primits:
Here ε is an error threshold, and ΨF (g1 , g2 , d), ΨB (g1 , g2 , d), and ΨP (g1 , g2 , d) are, respectively, defined as follows:
Here fa and fb denote faces. In the previous section we have further specialized
the connection operator into point wise, and edge wise connections. ΨB (g1 , g2 , d) is
defined as follows:
ΨB (g1 , g2 , d) ≡ {(g1 = ll1 (x1 , x2 , x′) ∧ g2 = ll2 (x3 , x4 , x″)) ∨
  (g1 = aa1 (x1 , x2 , x′) ∧ g2 = aa2 (x3 , x4 , x″)) ∨
  (g1 = al1 (x1 , x2 , x′) ∧ g2 = al2 (x3 , x4 , x″))} ∧
  d1 = sym(xi ⊕S xj ) ∧ (xi = minDist(x1 , x2 )) ∧ (xj = minDist(x3 , x4 )) ∧ xi ≠ xj ∧
  d2 = sym(xh ⊕S xk ) ∧ (xh = minDist(x1 , x2 )) ∧ (xk = minDist(x3 , x4 )) ∧ xh ≠ xk ∧
  d = d1 + d2
Here ll1 and ll2 denote straight boundits; aai (where aa is any of the terms {eaea, eaia, iaea, iaia}) denotes elliptic boundits; and finally ali ∈ {ial, eal} denotes boundits which are obtained by the connection of the two primits ai and lj . ΨP (g1 , g2 , d) is obtained as follows:
ΨP (g1 , g2 , d) ≡
  (g1 = l1 (p1 , p2 , m1 ) ∧ g2 = l2 (p3 , p4 , m2 ) ∧ d = |φ1 − π/2| + |φ2 − π/2|) ∨
  (g1 = a1 (p1 , p2 , e1 ) ∧ g2 = a2 (p3 , p4 , e2 ) ∧
   dc = δ(C1 , C2 ) ∧
   dα = |α1 − α2 | ∧
   dφ = |φ1 − φ2 | − |(φ1 − φ2 ) + (∆φ1 − ∆φ2 )| ∧
   dr = |rm1 − rm2 | + |rM1 − rM2 | ∧
   d = dc + dα + dφ + dr )
All these definitions have been suitably implemented in Eclipse-Prolog, since they are used for inference in the Bayes-aspect network, which is discussed in the next chapter.
Chapter 7

A Bayes-aspect Network

In this Chapter, we describe the second step of the SymGeon recognition process. It is a compositional process that, starting from the set of primits obtained through the Syntactic Image Analysis, identifies boundits, faces and aspects in turn, using probabilistic reasoning and leading to the recognition of a specific SymGeon.
Definition 15 (roots of BAN) Let NP be the set of primit nodes of the BAN,
NP = {p1 , p2 , . . . , pK }
For each node p ∈ NP , there is a function ||·|| taking as arguments the image syntactic
graph and the primit p itself, and returning the set Pp of primits of the same kind,
occurring in the image syntactic graph. Each node p ∈ NP is thus decorated with all
its instances Pp occurring in the image syntactic graph. The family of sets {Pp }p∈NP
is called the root nodes of the BAN.
78 CHAPTER 7. A BAYES-ASPECT NETWORK
The root nodes are then labeled with the positions of the segments forming each
primit. Therefore the output of the image syntactic graph instantiates the BAN by
creating the root nodes. The probability of a node p ∈ NP is given by the probability
of its root nodes, as follows: if the node has roots then its probability is 1, otherwise
it is 0.
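The decoration and probability assignment of the root nodes can be sketched as follows. This is a hypothetical Python rendering; the representation of the image syntactic graph as a list of (kind, data) pairs is an assumption made only for illustration.

```python
def init_root_nodes(primit_kinds, syntactic_graph):
    """Decorate each primit node of the BAN with its instances P_p found in
    the image syntactic graph (the role of ||.|| in Definition 15)."""
    roots = {}
    for kind in primit_kinds:
        roots[kind] = [data for (k, data) in syntactic_graph if k == kind]
    return roots

def root_probability(roots, kind):
    """A primit node has probability 1 if it has at least one root instance
    in the image, and 0 otherwise."""
    return 1.0 if roots.get(kind) else 0.0

# Toy graph: two line primits and one arc, no ellipse.
graph = [("line", (0, 0, 1, 0)), ("line", (1, 0, 1, 1)), ("arc", (0.5, 0.5))]
roots = init_root_nodes(["line", "arc", "ellipse"], graph)
print(root_probability(roots, "line"))     # → 1.0
print(root_probability(roots, "ellipse"))  # → 0.0
```

Once computed, these root instances play the role of the evidence on which all subsequent BAN queries are conditioned.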
Once the network is initialized, its root nodes constitute the network evidence.
Thus the inference consists in querying the BAN about the probability that, in the
image syntactic graph, a specific aspect of a SymGeon is mentioned, given the evidence.
The query is as follows:
∃p∃x1 · · · xn aspect(x1 , · · · , xn ) ∧ Prob(aspect(x1 · · · xn )|x1 · · · xn ) = p (7.1)
It is easy to see that the required inference is twofold:
1. The first inference is D |= ∃p∃x1 · · · xn aspect(x1 , · · · , xn ). This inference
requires constructing the terms x1 · · · xn such that each xi will be a term in T⊕
mentioning only primits, e.g. ti = p1 (~p) ⊕C ⊕S p2 (~p′ ) p3 (~p′′ ); here ~p, ~p′ and ~p′′
denote the sets of values (end points, middle point, slopes, etc.) characterizing
the primits. This can be achieved because each aspect is defined in terms of faces
and connections, and each face is defined in terms of boundits and connections.
2. The second inference is possible just in case the first returns the set of terms
defined by the primits. It is a classical diagnostic inference, requiring the com-
putation of the composed probabilities of the paths leading from the specific
aspect node to the evidence nodes constituted by the primits composing the
specified query.
Observe that in item 1) the theory D is defined in the Situation Calculus and
denotes the theory of perception [76]. D also includes the theory of actions and fail-
ures (see [36]) and, finally, the Algebra of Figures AC, together with the definition
of each feature, such as aspects, faces and boundits.
The BAN is thus formalized in D simply by the definition of each feature, given
in FOL. For example, in equation (6.4) we have already given the definition of a
primit. Boundits are defined using primits and connections, faces are defined using
boundits and connections and, finally, aspects are defined using faces, boundits and
connections. Observe that the topological structure simply determines the minimal
set of boundits, faces and aspects that can appear in the BAN.
However, this might not be enough to establish with sufficient accuracy the existence
of a SymGeon in the scene. For this reason we need to define a threshold λ, which
will depend on the circumstances in which the scene has been taken.
Now we shall construct the set of hypotheses H as follows. Each query to the BAN
returns the probability that a given aspect appears in the image syntactic graph. If
the probability is greater than λ, then the recovered aspect is recorded and all the
primits used to compose the aspect are marked. Starting from the consideration that
a primit cannot belong to more than two aspects, if the primit is already marked,
it is simply deleted from the root nodes. Otherwise nothing is done, and a new query
is performed.
The BAN is queried until all the root nodes are deleted, or the set of root nodes
cannot contribute to the formation of an aspect.
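The hypothesis-formation loop just described might be sketched as follows. This is a hypothetical Python rendering (the thesis implements the inference in Eclipse-Prolog); the shape of `query_ban` and of the returned triples is an assumption, and `query_ban` is assumed to enumerate each candidate aspect once and return None when no aspect can be formed.

```python
def form_hypotheses(query_ban, root_primits, lam):
    """Repeatedly query the BAN; accept an aspect when its probability
    exceeds the threshold lam, mark the primits it uses, and delete a
    primit from the roots the second time it is marked (a primit cannot
    belong to more than two aspects)."""
    hypotheses, marked = [], set()
    roots = set(root_primits)
    while roots:
        result = query_ban(roots)      # → (aspect, probability, primits used) or None
        if result is None:             # no aspect can be formed any more
            break
        aspect, prob, used = result
        if prob > lam:
            hypotheses.append((aspect, prob))
            for p in used:
                if p in marked:
                    roots.discard(p)   # already marked once: delete from roots
                else:
                    marked.add(p)
    return hypotheses

# Toy run: two candidate aspects over primits {1, 2, 3}.
answers = iter([("aspectA", 0.9, [1, 2]), ("aspectB", 0.8, [2, 3]), None])
hyps = form_hypotheses(lambda roots: next(answers), {1, 2, 3}, 0.7)
print(hyps)  # → [('aspectA', 0.9), ('aspectB', 0.8)]
```

In the toy run, primit 2 appears in both accepted aspects, so it is deleted from the roots after the second acceptance, as prescribed by the two-aspects-per-primit constraint.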
Definition 17 (Bayes Network for SymGeon Recognition (BNSG )) It is a DAG
⟨(NSG ∪ NA ), E⟩ where:
iii. To each node of NSG is associated a CPT in the form given in Table 7.1.
P(g | a1 , . . . , ah )   a1    a2    · · ·   ah
0.8                        T     F    · · ·    F
0.6                        F     T    · · ·    F
...                       ...   ...   · · ·   ...
0.4                        F     F    · · ·    T
Table 7.1: CPT for a node of the Bayes Network for SymGeon Recognition
Note that the particular structure of the CPT is a direct consequence of the
considerations made above.
The set H of recovered aspects is used to instantiate the root node of the BNSG
and to infer the hypotheses about the SymGeons occurring in the scene. This inference
process I(h), which assigns to a given recovered aspect h = aspect(~x) ∈ H, with
Ph = Prob(h), the SymGeon s, maximizes the probability returned by the BNSG :
I(h) = s ⇐⇒ ∀s′ ∈ NSG . s′ ≠ s → Prob(s|h) · Prob(h) > Prob(s′|h) · Prob(h)
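The inference I(h) amounts to a maximum a posteriori choice over the SymGeon nodes of the BNSG. A minimal Python sketch (all data shapes hypothetical; the factor Prob(h) is common to every s, so the argmax reduces to maximizing Prob(s|h)):

```python
def infer_symgeon(symgeon_nodes, prob_s_given_h, h, prob_h):
    """Return the SymGeon s in N_SG maximizing Prob(s|h) * Prob(h)."""
    return max(symgeon_nodes, key=lambda s: prob_s_given_h[(s, h)] * prob_h)

# Toy conditional probabilities Prob(s | aspectA) for three SymGeon nodes.
table = {("cuboid", "aspectA"): 0.7,
         ("cylindroid", "aspectA"): 0.2,
         ("spheroid", "aspectA"): 0.1}
print(infer_symgeon(["cuboid", "cylindroid", "spheroid"], table, "aspectA", 0.9))
# → cuboid
```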
ll(L1, L2, B) :-
L1 = primit(P1,P2,ALPHA1),
L2 = primit(P3,P4,ALPHA2),
primit(P1,P2,ALPHA1),
primit(P3,P4,ALPHA2),
B = boundit(L1,L2,ll),
pwc(B)<EPSILON.
lea(L, A, B) :-
L = primit(P1,P2,ALPHA1),
A = primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
primit(P1,P2,ALPHA1),
primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
external(L,A),
B = boundit(L,A,lea),
pwc(B)<EPSILON.
lia(L, A, B) :-
L = primit(P1,P2,ALPHA1),
A = primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
primit(P1,P2,ALPHA1),
primit(P3,PM2,P4,ell(point(XC,YC),Rmin,Rmax,ALPHA,PHI,DELTA_PHI)),
internal(L,A),
B = boundit(L,A,lia),
pwc(B)<EPSILON.
eaea(A1, A2, B) :-
A1 = primit(P1,PM1,P2,E1),
A2 = primit(P3,PM2,P4,E2),
primit(P1,PM1,P2,E1),
primit(P3,PM2,P4,E2),
external(A1,A2),
external(A2,A1),
B = boundit(A1,A2,eaea),
pwc(B)<EPSILON.
eaia(A1, A2, B) :-
A1 = primit(P1,PM1,P2,E1),
A2 = primit(P3,PM2,P4,E2),
primit(P1,PM1,P2,E1),
primit(P3,PM2,P4,E2),
external(A1,A2),
internal(A2,A1),
B = boundit(A1,A2,eaia),
pwc(B)<EPSILON.
................
%-----------------------------------------------------------------------------------------
% FACES DEFINITION
%-----------------------------------------------------------------------------------------
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, L4, ll),
boundit(L1, L2, ll),
boundit(L3, L4, ll),
F=face(B1, B2, llll),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, A1, lia),
B2=boundit(L2, A2, lea),
boundit(L1, A1, lia),
boundit(L2, A2, lea),
F=face(B1, B2, lialea),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, A1, lia),
B2=boundit(L2, A2, lia),
boundit(L1, A1, lia),
boundit(L2, A2, lia),
F=face(B1, B2, lialia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, A1, lia),
boundit(L1, L2, ll),
boundit(L3, A1, lia),
F=face(B1, B2, lllia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(L1, L2, ll),
B2=boundit(L3, A1, lea),
boundit(L1, L2, ll),
boundit(L3, A1, lea),
F=face(B1, B2, lllea),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, iaia),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, iaia),
boundit(A3, A4, iaia),
F=face(B1, B2, iaiaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaia),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, eaia),
boundit(A3, A4, iaia),
F=face(B1, B2, eaiaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaia),
B2=boundit(A3, A4, eaia),
boundit(A1, A2, eaia),
boundit(A3, A4, eaia),
F=face(B1, B2, eaiaeaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaea),
B2=boundit(A3, A4, iaia),
boundit(A1, A2, eaea),
boundit(A3, A4, iaia),
F=face(B1, B2, eaeaiaia),
pwc(F)<EPSILON.
face(B1, B2, F) :-
B1=boundit(A1, A2, eaea),
B2=boundit(A3, A4, eaea),
boundit(A1, A2, eaea),
boundit(A3, A4, eaea),
F=face(B1, B2, eaeaeaea),
pwc(F)<EPSILON.
................
Chapter 8
Description
We have, then, two things to compare: (1) a name, which is a simple sym-
bol, directly designating an individual which is its meaning, and having
this meaning in its own right, independently of the meaning of all other
words; (2) a description, which consists of several words, whose meanings
are already fixed, and from which results whatever is to be taken as the
“meaning” of the description. ([92])
8.1 Introduction
Recognition is the process of associating a designator, i.e. something endowed with
an unambiguous meaning, with an object: the word “table” in the phrase “this is a
table” designates unambiguously the object indicated.
In the process of recognition the role of vision is that of associating an object of
the world with an inner, i.e. subjective, experience constituting the connotation or
meaning or interpretation of the object seen.
The inner experience, in the learning phase of what human artifacts are, is crucial
although never revealed: a child who sees an airplane for the first time has quite a
strong experience. The experience by itself is useless for denoting the perception, and
the child, pointing at the airplane, asks "what is this?". She wants to know the word
that she can use for denoting both the object seen and her inner experience.
Thus, although the recognition of the airplane is independent of the language, the
language makes the inner experience repeatable and recordable: the denotation has
been learned as soon as the word “airplane” has been grasped.
The role of language, and therefore of terms, words, nouns, names, descriptions,
etc., is that of designating the object seen, by associating a denotation with the percep-
tive action: the term "airplane" denotes the object seen, but also the inner experience.
The limits of our world are the limits of our language.
We argue that there are three components of the recognition process: seeing,
experiencing, denoting. The crucial part is the abstraction, categorization and gen-
eralization process underlying the experience. The result of this process is the agent's
ability to establish a correspondence between the object seen and the concept or
category built through the experience.
Finally, denotation is the very last component of the recognition process, and it
plays the role of the agent's awareness, as it means the ability to assign a noun, a
label, or a designator to each object in each situation.
In this sense recognizing a table (or even a person, although here we treat the
recognition of nothing but artifacts) implies that the recognized object can be used in a
reasoning process through denotation. E.g. the agent, once it has recognized the table,
can hold that "this table is the same as the one in X's studio".
Therefore the final process, i.e. that of denoting, amounts to the assignment of a
term to an object; this assignment consists in designating an object with an
unambiguous term, and this designator is anchored to the object so as to denote it in
every possible situation.
Both Russell and Kripke [92, 58] consider a designator as a word which has the
same reference in every possible world in which it has any reference at all. In particular,
they consider proper names as the only possible rigid designators. Actually,
(human-made) objects bear abstract nouns, and the difference with proper names
does not seem to be so meaningful from the point of view of designating power.
An abstract noun designates a category of objects sharing both a function and a
general inner structure (the noun "table" denotes a category of objects); on the other
hand, a proper name designates a set of individuals sharing very few characteristics,
e.g. the proper name "Mario Rossi" designates some (actually many) individuals
possibly having nothing in common but being male and Italian.
In other words, there is nothing in the language that can play the role of a rigid
designator in the peculiar sense of Kripke (see [58]). However, a rigid designator could
be constructed by coding. The problem is the following: is there a way of coding each
object in the environment? Furthermore, would this coding be suitable enough to
recognize an object?
There are several possible ways of coding: for example, storing a set of vantage
points of an object in a database can be considered a way of coding it, since a set of
matrices can be seen as a single string of numbers.
Let us introduce a total function C : X → Y , defined for each object in the domain
of artifacts X; here Y is the codification set, which is a very large set, but still finite.
Let us call C(x) the code of object x, i.e. its rigid designator. How do we associate
a reference with each C(x)?
In other words, suppose that we tell the agent that its goal is: "go to the telephone,
on the table, and pick up the receiver". The agent then needs to establish a
correspondence between what it sees and C(telephone) and C(table). This is the
crucial point: without this association no recognition is possible. Suppose that the
set of strings {1201179....0i}i≤n , for some n not too large, denotes the set of vantage
points of the telephone, and that the association is obtained by a function allowing the
comparison of parts of the image with the whole set of codings of all the artifacts in the domain.
The alignment method (see e.g. [54, 101, 100]) falls within this codification perspective.
This method has an evident drawback in a dynamic domain: if the vantage point
changes (while approaching the object, or because the object has been moved, etc.)
and the current one is not coded, then the robot gets lost, because it cannot find the
telephone, or because it finds something that is not a telephone.
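The drawback can be made concrete with a small sketch of recognition-by-coding (a hypothetical Python illustration; the codes and objects are toy values, not taken from the thesis):

```python
def recognize_by_coding(codebook, observed_code):
    """Recognition as pure lookup: return the object whose stored set of
    vantage-point codes contains the observed code, or None.  The
    brittleness is explicit: an uncoded vantage point is simply not found."""
    for obj, codes in codebook.items():
        if observed_code in codes:
            return obj
    return None

# Toy codebook of vantage-point codes per artifact.
codebook = {"telephone": {"1201179", "1201180"}, "table": {"7700001"}}
print(recognize_by_coding(codebook, "1201179"))  # → telephone
print(recognize_by_coding(codebook, "1201999"))  # → None (unseen vantage point)
```

The second query illustrates exactly the failure mode discussed above: a vantage point that was never coded yields no reference at all.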
Analogously, the ambiguity cannot be resolved if the coding is concerned with the
initial position of the object. For, suppose that each object in a scene is associated
with a position in the initial situation; then the following would axiomatize each
object:
Here the ellipsis indicates a sentence specifying the conditions under which the action
a leads to a change in the position of the object whose code is a suitable substitution
for the variable w. Observe that the successor state axiom above ensures that the
change in position is recorded; therefore the object is anchored by the coding. The
drawbacks of this approach are twofold:
1. If the actions are non-deterministic then the object might fail to be anchored.
The problem with the above methodologies is the following: by requiring a specific
set of instances for each object, they do not allow for a suitable generalization and
abstraction.
Although we do not have a solution for representing the inner experience connot-
ing the action of seeing a specific object, we consider the possibility of introducing
descriptions as the simplest way of defining an inner concept associated with an
image.
As Russell points out in his Introduction to Mathematical Philosophy, a description
might be either definite (the so-and-so) or indefinite, i.e. ambiguous (a so-and-so). So, for
example, a description such as "the table in my studio is made of wood and squared"
is definite; while "a table is an artifact made of a support, at most 110 cm tall, holding
a flat surface usually made of some smooth hard material like wood, glass, marble, or
some plastic material" is an indefinite description.
The difference between definite and indefinite descriptions is ontologically relevant,
but both kinds of description are meaningful in recognition, as a way to mimic the
learning experience with a definition in which the salient aspects of the experience are
captured together with the denotation. In other words, a description amounts to an
abstract noun, e.g. table, defined in terms of its primitive components, the SymGeons,
recognized at an earlier stage.
In this chapter we shall describe the role of descriptions in capturing a suitable
generalization of an artifact, and the way these descriptions are matched against a
reference. In the next section we shall briefly introduce the analysis of the scene;
then in Section 8.3 we introduce the notion of description using a single
example; finally we discuss a parsing algorithm that is used to parse the image
graph.
1. Individuating all the SymGeons in the scene, and establishing them as hypotheses
by attributing to them a specific probability. One of the crucial problems here is to
avoid repetitions: we need to ensure that the same elements of the image are
not used for more than one SymGeon.
2. Establishing all the relationships holding between the recovered SymGeons, and
between the SymGeons and the image graph, just in terms of distances. These
relationships are covered by the relationships between and among aspects, e.g.
two aspects are parallel if, given their symmetry axes, the distance between the
two SymGeons is measured on the symmetry axis.
3. Defining accurate descriptions for each expected object, coherently with the suc-
cessor state axioms for perception.
Many of the above steps are only hinted at, as they are still under experi-
mentation. However, they deserve a suitable discussion, which will be given in the next
sections.
Figure 8.1: The graph of the scene in which a table and a chair have been singled out of a
set of symgeons.
With the above representation we can follow the problem solving approach to
perception [89, 90], in which perception is interpreted as a mental event subject to
causal laws. On the basis of the assumption that an agent experienced some prior
perceptions, and that a perception-perception chain of causality can be represented,
we define causal laws for perception. Obviously there is a frame problem to be solved
also for perception, and we adopt Reiter's solution (see [86] for a full and very
detailed description). A successor state axiom for the fluent Percept accounts for
the correlation and causal laws among percepts: perceived orientation is the
suitable context for perceived shape, perceived adjacency explains the perception of
the composition of an object, and so forth [90]. See the above Example 4.
For each fluent F (~x, s) in the language, which is observable, i.e. it can be sensed,
a perceptible isF (~x) is introduced into the language, together with a successor state
axiom of the kind:
Here ΨisF (~x, v, a, s) is a sentence describing what should be true both in terms of
other previous percepts and in terms of properties holding in the domain, to make
perception hold about the perceptible isF in the situation in which a sensing action
has taken place. Each sensing action, in turn, has a precondition axiom of the kind:
Observe that a successor state axiom like the one for Percept does not interfere
with the successor state axioms for fluents, which means that we have two threads of
transformations: an inner thread (the agent's inner world) traced by the history of
actions, through Percept plus the perceptibles isF , and an outer thread (the agent's
external world) traced by the history of actions, through the fluent F . These two
threads can be convergent or divergent. If they converge, what is perceived can be
added to the database, although the addition is non-monotonic, since it is added
as a hypothesis. If they diverge, nothing can be added and the agent records a
mistake. A mistake is not an inconsistency, which can never occur through sensing and
percept. This reasoning process is called meaningful perception. Inference, according
to meaningful perception, is done using regression, so if D is the theory of action and
perception, and
D |=
Percept(isTable(a), 1, s) ∧ Percept(isTelephone(c), 1, s) ∧
Percept(isOn(c, a), 1, s) ∧
s = [sense(isTable(a), 1), . . . , sense(isOn(c, a), 1), . . . ,
sense(isTelephone(c), 1)] ∧
∀p. p = isTable(a) ∨ p = isTelephone(c)
∨ p = isOn(c, a) → ¬Mistaken(p, s)
Calculus, that mentions a situation σ of the kind do(ak , . . . , a1 , S0), with [a1 , . . . , ak ]
actions, into an equivalent expression Ψ(S0 ) mentioning only S0 , the initial situation.
On the other hand if we had:
D |=
Percept(isTable(a), 1, s) ∧ Percept(isTelephone(c), 1, s) ∧
Percept(isOn(c, a), 1, s) ∧
s = [sense(isTable(a), 1), . . . , sense(isOn(c, a), 1), . . . ,
sense(isTelephone(c), 1)] ∧
¬On(c, a, s) → Mistaken(isOn(c, a), s)
then a mistake would be recorded and, in this case, nothing could be added to the
initial database, although no inconsistency is caused by the mistake.
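The comparison between the inner and outer threads can be sketched as follows (a hypothetical Python illustration of the convergence/divergence check; the actual inference is done by regression in the Situation Calculus, not by dictionary lookup):

```python
def meaningful_perception(percepts, world):
    """Compare the inner thread (what was perceived) with the outer thread
    (what holds in the world): convergent percepts become hypotheses,
    divergent ones are recorded as mistakes -- never as inconsistencies."""
    hypotheses, mistakes = [], []
    for p, perceived_value in percepts.items():
        if world.get(p, perceived_value) == perceived_value:
            hypotheses.append(p)   # threads converge: add (non-monotonically)
        else:
            mistakes.append(p)     # threads diverge: record a mistake
    return hypotheses, mistakes

# Toy example mirroring the formulas above: the table is really there,
# but the telephone is not on the table.
percepts = {"isTable(a)": True, "isOn(c, a)": True}
world = {"isTable(a)": True, "isOn(c, a)": False}
print(meaningful_perception(percepts, world))
# → (['isTable(a)'], ['isOn(c, a)'])
```

Note that a recorded mistake leaves the database untouched, reflecting the point above that a mistake never introduces an inconsistency.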
Now, the problem of perception consists essentially in managing the input data
obtained by sensors (e.g. the camera), processing them and suitably adding the results
to the theory of action and perception as hypotheses so that the following will hold:
To understand what H is, and the role of sensing actions, consider the following simple
example. There is a table and a chair in a room and an observation of the scene is
performed, i.e. a shot of the scene is taken (we open our eyes and look into the room);
let us stop at the instant before we make sense of what there is in the room. Clearly,
at the very moment in which the image is taken, no distinction among objects is made.
Therefore it is not a single sensing action like sense(isTable(x), v) that takes place,
but a scene/image acquisition.
From the image acquisition to the inference leading to the hypothesis that there
might be a table in the room, a complex process of revelation takes place: one
bringing the shapeless and scattered components identified in the image to the surface
of cognition1 , by giving a structure to these components. And there is a step in the
structuring that reveals the meaning: "that's a table". In other words, the re-cognition
process is a thread of revelations (the apperception) giving, i.e. attributing, meaning
to the elements of the image. This is achieved by conjugating the construction of a tight
data structure (a graph of all the SymGeons occurring in the scene together with their
topological relations), which is the hypothesis H, with the meaning given
by a description and denoted by a sensing action like sense(isTable(x), v). Therefore
the sense(isTable(x), v) action has, indeed, the double role of giving sense to the
elements of the data structure and of bringing to the surface of cognition the existence
of an object, a table in fact.
To understand what we mean, let us go through the example of the table. We might
simplify the successor state axiom in (8.1) as follows:
image. Each node of the syntactic graph represents a junction point between
segments and is labeled with its 2D position in the image. Each
edge of the syntactic graph represents a segment and is labeled by a primit, of
which we know the type and the specific information.
Therefore the term ref erence denotes the graph of the scene. When the inference
matches the term P oss(sense(isT able(x), 1), s), a description of the table is asked
for. A description should comply with the following conditions:
1. It has to be general enough to capture several possible shapes (at least the most
common) the specified object can be represented by.
2. It has to match several views of the object, including occlusion: e.g. the table
has just three legs, the fourth being occluded.
3. The description is a term t ∈ T⊕ having a recursive structure as follows: the
head of t is a symgeon and, recursively, the tail is decomposed into head and
tail, the head is the principal node.
In other words, the description gives sense to a term: it is the semantics of a term,
while its syntax is the structure of the object: e.g. the graph of the table (see 8.5)
is the syntactic structure of a table, while the meaning of the table is given through its
description (see the next axiom (8.4)). Matching is achieved by a parse function.
The parse function is a call to a parsing algorithm that analyzes the term while
transforming it by applying the rewriting rules introduced in the Algebra of Figures
(see Chapter 5).
Each object in the scene will be a term t (which should be represented with
a principal node; we shall deal with this in the next section); such a term t holds a
description, which is the prior knowledge of the agent about the artifact. For example,
if t is a table then a description of the table will be as follows:
The above successor state axiom tells us that a description is a fluent, obviously de-
pending on the situation, and the action is exactly that of matching the scene by
parsing (we shall introduce the parsing algorithm in the next section). In particular,
a table is made of a hard and flat top and a legged support, in which the legs can
be one or many, but they have to be of the same shape and material. In particular,
the properties concerning the predicates Flat, Hard and Legged can be, for example,
being of a given height, of a selected material, etc. Furthermore, here reference is
the scene graph, in which the table has to be found; the table is supposed to be a term
with a principal node, which is referred to as tableReference. Having delegated the
definition of the shape and structure of the top to the predicates Flat and Hard, and
having delegated the existence of the legs to Legged, the two sentences φ(l, γ) and
φ(l′ , γ ′ ) denote the possible SymGeons (e.g. cylindroid, cuboid, etc.) that represent
a leg, and in this last case also the possible number of legs. Observe, also, that the
value taken by the variable x is the anchoring of the perceptible isTable(x) to the
object, described by the term, through its principal node, which in this case is the
top of the table. In particular, the variable x might denote the position of the table,
and it will be associated with the term identifying the specific table. Therefore any
action on the table (e.g. moving the table, putting it upside down, etc.) will affect
that term, i.e. its position. If another table is looked for, this will be identified by a
new name/position since, for each perceptible isF (~x), the following will hold:
together with the unique-names axioms for perceptibles, analogous to the set of unique-
names axioms for actions. It is clear that building successor state axioms for descriptions is
rather hard; the goal will be that of compiling them. We do not yet know whether this will
ever be possible by learning.
C= Connected: ⊕C
P= Parallel: ⊕P
S= Symmetric: ⊕S
T= Orthogonal: ⊕T
V= Angular: ⊕V
Table 8.1: Relations between symgeons in the scene and their functional denotation.
All relations are reflexive.
For example, suppose that the graph referencing the table is the one given in
Figure 8.3.
Let G1 (~a, ~e, ~K, k) be the cylindroid representing the top of the table, for suitable
values of ~a, ~e, ~K and k, and let G2 (~a′ , ~e′ , ~K ′ , k ′ ) be the tapered cuboid representing
the legs of the table, for suitable values of ~a′ , ~e′ , ~K ′ and k ′ . Let γ1 = ⟨~a, ~e, ~K, k⟩ and
γ2 = ⟨~a′ , ~e′ , ~K ′ , k ′ ⟩. Finally, let a, b, c, d denote the position pos of each SymGeon; then
the term of the table in the scene is defined as follows:
This can be transformed into the following term, with a principal node.
[Figure: the term graph with principal node g1 (a), connected (⊕C ) to g2 (b), g2 (c)
and g2 (d), which are pairwise parallel (⊕P ).]
In order to apply the parsing algorithm we need to suitably rearrange the terms
of S. The following normal form theorem ensures that this rearrangement is
possible and that each term denoting an artifact can be devised in the scene graph.
S = τ1 . . . τk
Proof. First observe that by applying Proposition 2 to S we can get a graph G having
a principal node; therefore G is already in normal form. Now, on the term G we can
use the distributive laws (see 6.3.1, 5 and 6) so as to split the term G and obtain a
flatter term, still in normal form.
Let S be the scene graph, already in normal form; furthermore, assume that S is
indeed the term mentioned in a successor state axiom Scene(isF (~x), S, do(a, s)).
Clearly, when the parsing algorithm is invoked, it will be invoked with a given artifact,
e.g. isTable(x). Observe that in the description axiom, through the parsing action,
the term isTable(x) is described in terms of the algebra F; that is, there is a term
tableReference = top ⊕C ⊕T support saying to look inside the scene graph S (i.e.
reference) for a term of the form τ1 ⊕C ⊕T τ2 which has a principal node, namely
τ1 , and whose tail (namely τ2 ) is further specified in the description axiom. This last
term, namely τ1 ⊕C ⊕T τ2 , will be returned and further analyzed.
The parse algorithm, therefore, takes as input two terms t and t′ , in which the second
term has to be a term of F, and it outputs a term τ which is the structure of t in
terms of the algebra.
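As a rough illustration of this input/output behavior, a recursive matcher over terms represented as (head, operator, tail) tuples might look as follows. This is a hypothetical Python sketch: the tuple encoding, the wildcard "_", and the toy matches function are all assumptions made for illustration, and the real algorithm operates on terms of the algebra F with the rewriting rules of Chapter 5.

```python
def parse(term, pattern):
    """Look inside the scene term for a subterm matching the pattern
    (head, operator, tail); return the matching subterm or None."""
    if matches(term, pattern):
        return term
    if isinstance(term, tuple):                # descend into head and tail
        for sub in (term[0], term[2]):
            found = parse(sub, pattern)
            if found is not None:
                return found
    return None

def matches(term, pattern):
    """Toy structural match: '_' matches anything; tuples must agree on
    the connection operator and match componentwise."""
    if pattern == "_":
        return True
    if isinstance(pattern, tuple):
        return (isinstance(term, tuple)
                and term[1] == pattern[1]      # same connection operator
                and matches(term[0], pattern[0])
                and matches(term[2], pattern[2]))
    return term == pattern

# Scene: a top connected-orthogonally to a support made of two parallel legs.
scene = ("top", "CT", ("leg1", "P", "leg2"))
print(parse(scene, ("_", "CT", "_")))  # → ('top', 'CT', ('leg1', 'P', 'leg2'))
```

Here the returned subterm plays the role of the τ1 ⊕C ⊕T τ2 term above: the head ("top") is the principal node, and the tail can be analyzed further against the rest of the description.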
Example 5 Let S be the graph of the image as depicted in Figure 8.2, let isT able(a)
be the queried term, and top ⊕C ⊕T support be the term tableRef erence which has
Before introducing the algorithm parse(t, t1 , t2 ), let us specify the function matches(f, g),
that will be used by the algorithm, as follows:
with their relationships, have been identified (with a reasonable accuracy), then the
sensing action sense(isTable(x), v) is a query to such a data structure, defined by
the recognition system, asking whether all the elements which are expected to be the
components of the abstraction of the object (the table in this case) belong to the data
structure, and stand in the expected relationships with each other. If this is achieved
through the parsing algorithm, matching the term given by the description with the
scene graph, i.e. the reference, then the object is recognized and, through the predicate
Percept(isTable(p), 1, s), the object will be added to the initial database, decorated
with all the information recovered by the recognition process.
Chapter 9
Experimental Results
In this Chapter, we provide experimental results to demonstrate the various as-
pects of our vision reasoning system. As already mentioned in Section 1.2, we
have implemented the Syntactic Image Analyzer as a C++ module, and the rea-
soning system (Bayes-aspect network and feature description), used to recover all the
SymGeons in the image, as a Prolog program, through the Algebra of Figures described
in Chapter 6.
Other components of the framework, like the scene graph parser, have also been
implemented, but they cannot yet be integrated into the system because we are still
working on the ideas presented in Section 1.2 regarding the 3D reconstruction and
the scene graph recovery.
Other considerations about the experimentation concern the test images. As men-
tioned in Section 10.2, the analysis capabilities of the Syntactic Image
Analyzer are strictly related to the ability of the edge detector. Up to now, we have
used the Matrox Imaging Library (MIL) for the execution of basic image processing op-
erations, like acquisition, filtering, etc., and in particular for edge detection.
Unfortunately, to execute this operation MIL uses a simple technique based on
convolution and zero crossing [45]. This implies that the quality of the binary image
obtained using MIL's edge detector is influenced by environmental conditions:
illuminance, light direction, shadows, etc.
Since our goal is to test the ability of the reasoning system, we have preferred to
avoid compromising the results with the edge detector's inaccuracy, by using synthesized
images.
this platform has an Intel Pentium III (400 MHz) and Windows 2000 as operating
system2 .
However, some of the basic modules realized in C++ for the Syntactical Analyzer
have also been tested and used for the vision system of Mr. ArmHandOne [18] (see Figure
1.1), on the occasion of the Mobile Robot Competition and Exhibition of AAAI-2002 (28
July - 1 August 2002).
The robot is endowed with many sensors, thanks to which it can perceive the
world and know its internal state. They comprise two cameras mounted
2 In the future we are seriously considering moving our platforms to Linux.
Chapter 10
Conclusion
the network, a query to the BAN asks for the probability that a specific aspect of
a SymGeon occurs in the image (actually in the image syntactic graph), given the
evidence:
Prob(aspect(~t ) | ~t )
If the computed probability is greater than a given threshold λ, that depends on the
image acquisition, then the SymGeon, associated with such aspects, is added as an
hypothesis to the knowledge base.
Such a SymGeon will then be elaborated according to binocular reconstruction:
its 3D shape will be recovered, and so will its position. The
set of SymGeons obtained by querying the Bayes-aspect network contributes to
the formation of the scene graph. The scene graph is finally queried by single sensing
actions, whose role is to give meaning to the scene graph.
The formalization of perception in the Situation Calculus allows one to infer, via
the sensing actions, the existence of specific objects in the scene. Furthermore, because
of its symbolic description, an object is anchored to the data structure; therefore
its dynamics can be suitably traced.
10.2 Discussion
In this section we emphasize and discuss some weaknesses and strengths of our ap-
proach.
3D object recognition is a complex problem and indeed a hard test domain for our
system. Thanks to its generality, our framework can be used in a variety of different
domains. For example, remaining in the area of object recognition, if the underlying
approach is compositional, the system can be used with other primitives, since they
can be decomposed into primits.
Corridor detection is another possible domain, which we have tested in the maze
navigation problem of Mr ArmHandOne [18], introduced in Section 9.1. Descriptions
of different panel configurations are given to the system to recognize straight corridors
and left or right recesses, using suitable definitions in the Algebra of Figure. Some
screenshots of the application are depicted in Table 10.1. Note that an important
advantage of this domain is the possibility of discarding elliptic arcs, which simplifies
the reasoning process.
A considerable drawback of our approach is its dependency on the accuracy of the
edge detector. In our system we have used the Matrox Image Library (MIL) to perform
low-level image processing operations. This library uses a convolution operator for edge
detection, so the capabilities of our system are strictly dependent on environmental
conditions like illumination, light direction, shadows, etc.
However, this limitation can probably be overcome by using more sophisticated fil-
tering and edge detection operators, such as the Sobel filter or the Canny edge detector [32].
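As an illustration of the first alternative, a Sobel gradient-magnitude filter can be sketched in a few lines (a minimal NumPy transcription of the standard 3x3 Sobel kernels; a real system would call an optimized library routine, and the threshold below is chosen only for the example):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via the Sobel operator (a sketch; the loops make
    the convolution explicit at the cost of speed)."""
    img = np.asarray(img, dtype=float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T                       # vertical-gradient kernel
    h, w = img.shape
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy)

img = np.zeros((6, 6)); img[:, 3:] = 1.0   # vertical step edge
mag = sobel_magnitude(img)
edges = mag > 2.0                          # illustrative threshold
```

Unlike plain zero crossing, the Sobel response is a graded magnitude, so the binarization threshold can be adapted to the illumination of the scene.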
Other limitations are due to the underlying assumptions of the approach. We
have assumed that the object of interest can be decomposed into parts. This is not
generally true, in particular for natural objects, so we have focused our
attention on a specific subset of artifacts.
By taking these limitations into consideration, a more powerful vision reasoning
system for object recognition can be developed.
Bibliography
[1] Asher, N., and Vieu, L. Toward a geometry of common sense: A semantics
and a complete axiomatization of mereotopology, 1995.
[2] Barequet, G., and Sharir, M. Partial surface matching by using directed
footprints. In Symposium on Computational Geometry (1996), pp. C–9–C–10.
[6] Bennett, B., Cohn, A., and Isli, A. Combining multiple representations
in a spatial reasoning system. In Proceedings of the 9th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach,
CA (1997), pp. 314–322.
[7] Bennett, B., Cohn, A., Wolter, F., and Zakharyaschev, M. Multi-
dimensional modal logic as a framework for spatio-temporal reasoning. Applied
Intelligence (2001). To appear.
[8] Bennett, B., Cohn, A. G., Torrini, P., and Hazarika, S. M. A founda-
tion for region-based qualitative geometry. In Proceedings of ECAI-2000 (2000),
W. Horn, Ed., pp. 204–208.
[9] Beymer, D., and Poggio, T. Face recognition from one example view. Tech.
Rep. AIM-1536, 1995.
[11] Biederman, I., and Cooper, E. Evidence for complete translational and
reflectional invariance in visual object priming. Perception 20 (1991), 585–593.
[14] Cederberg, R. Chain-link coding and segmentation for raster scan devices.
CGIP 10 (1979), 224–234.
[15] Chella, A., Frixione, M., and Gaglio, S. A cognitive architecture for
artificial vision. Artificial Intelligence 89, 1-2 (1997), 73–111.
[16] Chella, A., Frixione, M., and Gaglio, S. Understanding dynamic scenes.
AI 123, 1-2 (October 2000), 89–132.
[17] Chella, A., Frixione, M., and Gaglio, S. Conceptual spaces for computer
vision representations. Artificial Intelligence Review 16, 2 (2001), 137–152.
[18] Cialente, M., Finzi, A., Mentuccia, I., Pirri, F., Pirrone, M., Ro-
mano, M., Savelli, F., and Vona, K. The mr armhandone project. a mazes
roamer robot. In Proceedings of the The Mobile Robot Workshop AAAI-2002
(2002).
[20] Clementini, E., Felice, P. D., and van Oosterom, P. A small set of
formal topological relationships suitable for end-user interaction. In Advances
in Spatial Databases, Third International Symposium, SSD’93, Singapore, June
23-25, 1993, Proceedings (1993), D. J. Abel and B. C. Ooi, Eds., vol. 692 of
Lecture Notes in Computer Science, Springer, pp. 277–295.
[22] Cohn, A., and Varzi, A. Modes of connection. Spatial Information Theory-
Cognitive and Computational Foundations of Geographic Information Science,
Lecture Notes in Computer Science 1661, Springer, Berlin (1999), 299–314.
[25] Randell, D., Witkowski, M., and Shanahan, M. From images to bodies: Mod-
elling and exploiting spatial occlusion and motion parallax. In Proceedings
of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01)
(2001), pp. 57–63.
[26] Dickinson, S., Bergevin, R., Biederman, I., Eklundh, J., Munck-
Fairwood, R., Jain, A., and Pentland, A. Panel report: The potential of
geons for generic 3-d object recognition, 1997.
[43] Gigus, Z., Canny, J., and Seidel, R. Efficiently computing and representing
aspect graphs of polyhedral objects. T-PAMI 13 (1991), 542–551.
[44] Gigus, Z., and Malik, J. Computing the Aspect Graph for Line Drawings
of Polyhedral Objects. In Proc. IEEE 1988 Conf. on Computer Vision and
Pattern Recognition (1988), pp. 654–661.
[46] Grant, G., and Reid, A. An efficient algorithm for boundary tracing and
feature extraction. CGIP 17, 3 (November 1981), 225–237.
[49] Halı́ř, R., and Flusser, J. Numerically stable direct least squares fitting of
ellipses. In Proc. 6th International Conference in Central Europe on Computer
Graphics and Visualization. WSCG ’98 (Plzeň, Czech Republic, Feb. 1998),
CZ, pp. 125–132.
[51] Hopfield, J. Neural networks and physical systems with emergent collective
computational abilities. In Proceedings of the National Academy of Scientists
(1982), vol. 79, pp. 2554–2558.
[55] Jaklič, A., Leonardis, A., and Solina, F. Segmentation and Recovery of
Superquadrics, vol. 20 of Computational imaging and vision. Kluwer, Dordrecth,
2000. ISBN 0-7923-6601-8.
[56] Koenderink, J., and van Doorn, A. The internal representation of solid
shape with reference to vision. Biological Cybernetics 32 (1979), 211 – 216.
[57] Kriegman, D., and Ponce, J. Computing exact aspect graphs of curved
objects: Solids of revolution. WI3DS 89 (1990), 116–122.
[59] Lespérance, Y., Levesque, H., Marcu, D., Reiter, R., and Scherl,
R. B. A logical approach to high-level robot programming: A progress report.
Control of the Physical World by Intelligent Systems: Papers from the 1994
AAAI Fall Symposium (1994), 79–85.
[60] Levesque, H., Pirri, F., and Reiter, R. Foundations for the situation
calculus. Linköping Electronic Articles in Computer and Information
Science 3, 18 (1999), 159–178.
[61] Levesque, H., Reiter, R., Lesperance, Y., Lin, F., and Scherl, R.
Golog: A logic programming language for dynamic domains. Journal of Logic
Programming 31 (1997), 59–84.
[62] Lowe, D. Perceptual organization and visual recognition, vol. 23. Kluwer
Academic Publishers, Massachusetts, 1985.
[63] Lowe, D. Visual recognition from spatial correspondence and perceptual or-
ganization. IJCAI 85 (1985), 953–959.
[65] Marr, D., and Nishihara, H. Representation and recognition of the spatial
organization of three-dimensional shapes. In Proc. R. Soc. Lond. B, vol. 200
(1978), pp. 269–294.
[66] McCarthy, J., and Hayes, P. Some philosophical problems from the stand-
point of artificial intelligence. Machine Intelligence 4 (1969), 463–502.
[68] O’Rourke, J. Art gallery theorems and algorithms. Oxford Univ. Press (1987).
[70] Palm, G. On associative memory. Biological Cybernetics 36 (1980), 82–86.
[76] Pirri, F., and Finzi, A. An approach to perception in theory of actions: Part
i. ETAI 4 (1999), 19–61.
[78] Pirri, F., and Reiter, R. Some Contributions to the Metatheory of the
Situation Calculus. Journal of the ACM 3 (1999), 325–361.
[79] Pirri, F., and Reiter, R. Planning with natural actions in the situation
calculus. In Logic-Based Artificial Intelligence (2000), J. Minker, Ed., Kluwer.
[81] Plantinga, W., and Dyer, C. Visibility, occlusion, and the aspect graph.
International Journal Computer Vision 5(2) (1990), 137–160.
[82] Ponce, J., Petitjean, S., and Kriegman, D. Computing exact aspect
graphs of curved objects: Algebraic surfaces. ECCV 92 (1992), 599–614.
[83] Pope, A., and Lowe, D. Learning object recognition models from images. In
ICCV93 (1993), pp. 296–301.
[84] Reiter, R. The frame problem in the situation calculus: A simple solution
(sometimes) and a completeness result for goal regression. In Artificial Intel-
ligence and Mathematical Theory of Computation: Papers in Honor of John
McCarthy, V. L. (Ed.), Ed. Academic Press, 1991, pp. 359–380.
[87] Reiter, R., and Mackworth, A. A logical framework for depiction and
image interpretation. AI 41 (1989), 125–155.
[96] Stewman, J., and Bowyer, K. Direct construction of the perspective pro-
jection aspect graph of convex polyhedra. CVGIP 51 (1990), 20–37.
[97] Terzopoulos, D., and Metaxas, D. Dynamic 3d models with local and
global deformations: Deformable superquadrics. IEEE Trans. on PAMI 13(7)
32 (1991), 703–714.
[99] Tsotsos, J., Verghese, G., Dickinson, S., Jenkin, M., Jepson, A., Mil-
ios, E., Nuflot, F., Stevenson, S., Black, M., Metaxas, D., Culhane,
S., Ye, Y., and Mann, R. Playbot: A visually-guided robot for physically
disabled children. Image and Vision Computing 16, 4 (April 1998), 275–292.
[100] Ullman, S. High-level vision: Object recognition and visual cognition, 1996.
[104] Wu, K., and Levine, M. 3D object representation using parametric geons.
Tech. Rep. CIM-93-13, CIM, 1993.
[105] Wu, K., and Levine, M. Parametric geons: A discrete set of shapes with
parameterized attributes. In SPIE International Symposium on Optical En-
gineering and Photonics in Aerospace Sensing: Visual Information Processing
III (Orlando, FL, April 1994), vol. 2239, pp. 14–26.
Appendix A
Ellipse Fitting
An ellipse is a special case of a general conic, which can be described by an implicit
second-order polynomial:
$$F(\mathbf{x}) = \mathbf{x} \cdot \mathbf{a} = ax^2 + bxy + cy^2 + dx + ey + f = 0$$
where $\mathbf{a} = [a, b, c, d, e, f]^\top$ is the parameter vector formed by the ellipse's coefficients,
while $\mathbf{x} = [x^2, xy, y^2, x, y, 1]$.
The fitting of a general conic to a set of points $(x_i, y_i)$, $i = 1 \ldots N$, may be ap-
proached by minimizing the sum of the squared distances of each point to the conic
represented by the parameter vector $\mathbf{a}$:
$$\min_{\mathbf{a}} \sum_{i=1}^{N} [D(p_i, \mathbf{a})]^2 \qquad (3)$$
In general there are two main suitable distances $D$ for ellipse fitting: the Euclidean
distance and the algebraic distance. The Euclidean distance is only partially satisfac-
tory, because it requires the introduction of approximations that turn the problem into
a nonlinear minimization, solvable only numerically. The algebraic distance,
instead, differs from the true geometric distance between a curve and a point. In
this sense we start with an approximation; however, it is the only approximation we
introduce, since the algebraic distance turns the minimization into a linear
problem that we can solve in closed form and with no further approximations.
The minimization problem becomes:
$$\min_{\mathbf{a}} \sum_{i=1}^{N} F(x_i, y_i)^2 = \min_{\mathbf{a}} \sum_{i=1}^{N} [F_{\mathbf{a}}(\mathbf{x}_i)]^2 = \min_{\mathbf{a}} \sum_{i=1}^{N} [\mathbf{x}_i \cdot \mathbf{a}]^2 \qquad (4)$$
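A minimal sketch of this unconstrained least-squares solution (assuming NumPy): the minimizer of $\|D\mathbf{a}\|$ with $\|\mathbf{a}\| = 1$ is the right singular vector of the design matrix $D$ associated with the smallest singular value.

```python
import numpy as np

def fit_conic_lsq(x, y):
    """Unconstrained least-squares conic fit: argmin ||D a|| with ||a|| = 1
    is the last right singular vector of D. The result is a general conic,
    not necessarily an ellipse."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D)
    return Vt[-1]                 # parameter vector a = (a, b, c, d, e, f)

# Sample points on the unit circle x^2 + y^2 - 1 = 0 (a particular conic).
t = np.linspace(0.0, 2 * np.pi, 20, endpoint=False)
x, y = np.cos(t), np.sin(t)
a = fit_conic_lsq(x, y)           # proportional to (1, 0, 1, 0, 0, -1)
```

Nothing here forces the recovered conic to be an ellipse; that is exactly the motivation for the constrained formulation discussed in the remainder of this appendix.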
The problem can be solved directly by the standard least-squares approach, but the
result of such a fitting is a general conic, and not necessarily an ellipse. To ensure an
ellipse, the appropriate constraint ($b^2 - 4ac < 0$) has to be imposed. Such a system is
hard to solve in general; however, because $\alpha\mathbf{a}$ represents the same conic as $\mathbf{a}$ for any
$\alpha \neq 0$, we are free to arbitrarily scale the coefficients $\mathbf{a}$. Thanks to this property, under
a proper scale the inequality constraint mentioned above can be changed into an equality one:
$$4ac - b^2 = 1 \qquad (5)$$
Next, equation (9) can be solved by using generalized eigenvectors. There exist
up to six real solutions $(\lambda_j, \mathbf{a}_j)$, but because $\mathbf{a}^\top S \mathbf{a} = \lambda\, \mathbf{a}^\top C \mathbf{a} = \lambda$,
we are looking for the eigenvector $\mathbf{a}_k$ corresponding to the minimal positive eigenvalue
$\lambda_k$. Finally, after a proper scaling ensuring $\mathbf{a}_k^\top C \mathbf{a}_k = 1$, we get the solution of the
minimization problem, which represents the best-fit ellipse for the given set of points.
The approach that we have described has several drawbacks. First, the matrix $C$ is
singular, and the matrix $S$ is also singular if all data points lie exactly on an ellipse.
Moreover, the computation of the eigenvalues is numerically unstable and can produce
wrong results (such as infinite or complex numbers). Another problem of the algorithm is
the localization of the optimal solution of the fitting. Unfortunately, the assumption
that there exists one positive eigenvalue (corresponding to the optimal solution) is
not true. In fact, as noted above, in the ideal case when all data points lie exactly on
an ellipse, the eigenvalue is zero. Moreover, a numerical computation of the eigenvalues
can produce an optimal eigenvalue that is a small negative number. In such situations
the algorithm can produce a non-optimal or completely wrong solution.
To overcome such problems, a simplified approach is proposed in [49]. Thanks to
the special structure of the matrices $S$ and $C$, this approach allows us to easily compute
the eigenvalues.
First, the design matrix $D$ can be decomposed into its quadratic and linear parts,
$D = (D_1 | D_2)$, where:
$$D_1 = \begin{pmatrix} x_1^2 & x_1 y_1 & y_1^2 \\ \vdots & \vdots & \vdots \\ x_i^2 & x_i y_i & y_i^2 \\ \vdots & \vdots & \vdots \\ x_N^2 & x_N y_N & y_N^2 \end{pmatrix} \qquad D_2 = \begin{pmatrix} x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots \\ x_i & y_i & 1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & 1 \end{pmatrix} \qquad (11)$$
Similarly, the scatter matrix $S$ can be decomposed into blocks:
$$S = \begin{pmatrix} S_1 & S_2 \\ S_2^\top & S_3 \end{pmatrix} \qquad \text{where} \quad S_1 = D_1^\top D_1, \quad S_2 = D_1^\top D_2, \quad S_3 = D_2^\top D_2 \qquad (12)$$
Based on these decompositions, the first condition of equation (9) can be rewritten
as:
$$\begin{pmatrix} S_1 & S_2 \\ S_2^\top & S_3 \end{pmatrix} \cdot \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} = \lambda \cdot \begin{pmatrix} C_1 & 0 \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} \qquad (15)$$
which is equivalent to the pair of equations:
$$S_1 \mathbf{a}_1 + S_2 \mathbf{a}_2 = \lambda C_1 \mathbf{a}_1 \qquad (16)$$
$$S_2^\top \mathbf{a}_1 + S_3 \mathbf{a}_2 = 0 \qquad (17)$$
The matrix $S_3$,
$$S_3 = D_2^\top D_2 = \begin{pmatrix} S_{x^2} & S_{xy} & S_x \\ S_{xy} & S_{y^2} & S_y \\ S_x & S_y & S_1 \end{pmatrix} \qquad (18)$$
is the scatter matrix of the line-fitting problem, i.e. the problem of fitting a set of
points with a straight line. Such a matrix is singular only if all the points lie on
a line. In that situation there is no real solution, but in all other cases the matrix
$S_3$ is regular.
From equation (17), $\mathbf{a}_2$ can be expressed as:
$$\mathbf{a}_2 = -S_3^{-1} S_2^\top \mathbf{a}_1 \qquad (19)$$
Substituting into (16) and multiplying by $C_1^{-1}$ yields:
$$C_1^{-1}(S_1 - S_2 S_3^{-1} S_2^\top)\,\mathbf{a}_1 = \lambda \mathbf{a}_1 \qquad (21)$$
The second condition of equation (9) can also be reformulated by using the de-
composition principle. Due to the special shape of the matrix $C$ we simply get:
$$\mathbf{a}_1^\top C_1 \mathbf{a}_1 = 1 \qquad (22)$$
Summarizing all the decomposition steps, the conditions in (9) can finally be expressed
as the following set of equations:
$$M \mathbf{a}_1 = \lambda \mathbf{a}_1, \qquad \mathbf{a}_1^\top C_1 \mathbf{a}_1 = 1, \qquad \mathbf{a}_2 = -S_3^{-1} S_2^\top \mathbf{a}_1, \qquad \mathbf{a} = \begin{pmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \end{pmatrix} \qquad (23)$$
where
$$M = C_1^{-1}(S_1 - S_2 S_3^{-1} S_2^\top) \qquad (24)$$
Now we can return to the task of fitting an ellipse through a set of points. As we
said before, the task can be expressed as a constrained minimization problem (eq. 6)
whose optimal solution corresponds to the eigenvector $\mathbf{a}$ of equation (9) which yields
a minimal non-negative value $\lambda$. Equation (9) is equivalent to equation (23); thus it
is enough to find the appropriate eigenvector $\mathbf{a}_1$ of the matrix $M$.
Appendix B
SymGeon Aspects
[Figures: the SymGeon aspect models PR CYLINDROID 1, PR CUBOID 1, PR CUBOID 2, PR B CYL 1, PR B CYL 4, PR B CYL 5, ELLIPSOID, S A CYL 2, S A CUB, S A B CYL 2, S A B CYL 3, S A B CUB]
Appendix C
Distance Formalization
In this Appendix we detail all the definitions concerning edgewise connection and
pointwise connection, given w.r.t. their metric. We have used these definitions
for the formalization of the Algebra of Figure. For each figure we also provide its
composition by means of a composition graph.
Face Level
Coterminant / Edgewise Connection

Cotermination (fig. 1):

i. $d_1 = \min_{x,y}(distance(P_{e_x}, P'_{e_y}))$ where $x, y \in \{a, b\}$;

ii. $d_2 = distance(P_{e_{x'}}, P'_{e_{y'}})$ where $x', y' \in \{a, b\}$ and $x' \neq x$, $y' \neq y$;

iii. $d_3 = distance(P_m, P'_m)$;

iv. $d = \dfrac{d_1 + d_2}{d_3}$.
Figure 1: [diagram of the coterminant and edgewise-connection configurations, showing the points $P_{e_x}$, $P'_{e_y}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]
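As an illustration, the cotermination metric above can be transcribed directly into code (the 2-D setting, the segment endpoints, and the midpoints in the example are our own illustrative assumptions):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cotermination_distance(ends, ends2, m, m2):
    """d = (d1 + d2) / d3, a direct transcription of items i-iv:
    d1 is the closest endpoint pair, d2 the remaining pair, d3 the
    distance between the midpoints m and m2."""
    pairs = [(i, j) for i in range(2) for j in range(2)]
    i, j = min(pairs, key=lambda ij: dist(ends[ij[0]], ends2[ij[1]]))
    d1 = dist(ends[i], ends2[j])
    d2 = dist(ends[1 - i], ends2[1 - j])   # the other two endpoints
    d3 = dist(m, m2)
    return (d1 + d2) / d3

# Two segments meeting at the origin: d1 = 0, d2 = 2*sqrt(2), d3 = sqrt(2).
d = cotermination_distance([(0, 0), (2, 0)], [(0, 0), (0, 2)],
                           m=(1, 0), m2=(0, 1))
# → 2.0 for this geometry
```

Whether a given value of $d$ counts as cotermination depends on the threshold adopted by the recognition system; the geometry above is only an example.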
v. $d = d_1 + d_2 + F(d_3)$.
Coterminant / Edgewise Connection (Straight) / Edgewise Connection (Elliptic)

Figure 3: [diagram of the three configurations, showing the points $P_{e_c}$, $P_{e_r}$, $P'_{e_c}$, $P'_{e_r}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]
Cotermination (fig. 3):

i. $d_1 = distance(P_{e_c}, P'_{e_r})$;

ii. $d_2 = distance(P_{e_r}, P'_{e_c})$;

iii. $d_3 = distance(P_m, P'_m)$;

iv. $d = \dfrac{d_1 + d_2}{d_3}$.

vi. $d = d_1 + d_2 + F(d_3) + d_\phi$.
Coterminant / Symmetric

Figure 5: [diagram showing the points $P'_{x_1}$, $P'_{x_2}$, $P''_{y_1}$, $P''_{y_2}$, the centers $C'$, $C''$, the angles $\phi' + \Delta\phi'/2$, $\phi'' + \Delta\phi''/2$, and the distances $d_1$, $d_2$, $d_c$]
iii. $d = d_1 + d_2$.

Symmetric (fig. 5):

i. $d_c = distance(center\,C', center\,C'')$;

ii. $d_\alpha = |\alpha' - \alpha''|$;

iii. $d_r = |r'_m - r''_m| + |r'_M - r''_M|$;

iv. $d_\phi = \left|\left(\phi' + \frac{\Delta\phi'}{2}\right) - \left(\phi'' + \frac{\Delta\phi''}{2}\right)\right| - \pi$;

v. $d = d_c + d_\alpha + d_r + d_\phi$.
Edgewise Connection / Pointwise Connection

Figure 6: [diagram showing the points $P_{e_x}$, $P_m$, $P'_{e_c}$, $P'_{e_r}$, $P'_m$ and the distances $d$, $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
Edgewise Connection / Pointwise Connection

Figure 8: [diagram showing the points $P_{e_x}$, $P'_{e_{y'}}$, $P_m$ and the distances $d$, $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
Edgewise Connection / Pointwise Connection

Figure 9: [diagram showing the points $P_{e_c}$, $P_{e_r}$, $P_m$, $P'_{e_c}$, $P'_{e_r}$, $P'_m$ and the distances $d$, $d_1$, $d_2$]
Edgewise Connection / Pointwise Connection

Figure 11: [diagram showing the points $P_m$, $P_{e_x}$, $P'_{e_y}$, $P'_m$ and the distances $d_1$, $d_2$]

iv. $d = d_1 + d_2 + d_\phi$.
Coterminant / Pointwise Connection

Figure 13: [diagram showing the points $P_{e_x}$, $P'_{e_y}$, $P_m$, $P'_m$ and the distances $d_1$, $d_2$, $d_3$]

iv. $d = d_1 + d_2 + d_\phi$.
Aspect Level

Edgewise Connection / Parallel Edge

v. $d = d_1 + d_2 + d_r + d_\phi$.

Parallel Edges:

i. $d = |m_{r_1} - m_{r_2}|$.
Figure 15: [diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]
Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.

[quadrilateral with vertices $P_1$, $P_2$, $P_3$, $P_4$ and edge midpoints $m_{12}$, $m_{23}$, $m_{34}$, $m_{41}$]
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

iii. $d = d_1 + d_2$.
Edgewise Connection / Non-Parallel Edge

v. $d = d_1 + d_2 + d_r + d_\phi$.

Figure 17: [diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]

$d = \dfrac{1}{|m_{r_1} - m_{r_2}|}$.
Contained:

i. $d_c = distance(center\,C, center\,C')$;

ii. $r = \dfrac{r_m - r'_m}{r_M - r'_M}$;

iii. $F(z) = \begin{cases} 0 & z > 0 \\ -kz & z \leq 0 \end{cases}$

iv. $d = d_c + F(r)$.
No Parallel Edges:

i. $d_1 = |m_{12} - m_{34}|$;

[quadrilateral with vertices $P_1$, $P_2$, $P_3$, $P_4$ and edge midpoints $m_{12}$, $m_{23}$, $m_{34}$, $m_{41}$]
Edgewise Connection:

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]

i. $d_1 = |m_{12} - m_{34}|$;
Edgewise Connection / Edgewise Connection

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Edgewise Connection:

vi. $d = d_1 + d_2$.

Figure 19: [diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Figure 20: [diagram showing the points $P_{rc_1}$, $P_{cr_1}$, $P_{rc_2}$, $P_{cr_2}$, the midpoints $m_{r_1}$, $m_{r_2}$, the arcs $C_{12}$, $C_{21}$, and the distances $d_1$, $d_2$]
iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection (straight) (fig. 20):

i. $d_1 = \min_{x,y}(P_x, P_{rc_y})$ where $x \in \{1, 2, 3, 4\}$ and $y \in \{1, 2\}$;

ii. $d_2 = \min_u(P_u, P_{cr_y})$ where $u \in \{|x + 1|_4, |x - 1|_4\}$;

iii. $d = d_1 + d_2$.
Parallel Edge (elliptic) (fig. 20):

iv. $d = d_r + d_\phi + d_\alpha$.
Edgewise Connection / Edgewise Connection

v. $d = d_1 + d_2 + d_r + d_\phi$.

[diagram showing the points $P_{rc_x}$, $P_{cr_x}$, $P_{rc_y}$, $P_{cr_y}$ and the arcs $C$, $C_{xy}$, $C_{yx}$]
Edgewise Connection / Edgewise Connection

iii. $d = d_1 + d_2$.

[diagram showing the points $P_x$, $P'_y$, $P_u$, $P'_v$ and the distances $d_1$, $d_2$]
Edgewise Connection (elliptic) / Edgewise Connection (straight)

Edgewise Connection (elliptic) / Edgewise Connection (elliptic)

Figure 21: [diagram showing the distances $d_1$, $d_2$ for the two configurations]

iii. $d = d_1 + d_2$.
ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection / Edgewise Connection

Figure 22: [diagram showing the points $P_x$, $P_y$, $P_3$, $P_4$ and the arcs $C_{xy}$, $C_{23}$, $C_{34}$, $C_{41}$]

iv. $d_\alpha = |\alpha - \alpha_{xy}|$;

v. $d = d_1 + d_c + d_c + d_\phi + d_\alpha$.
Figure 23: [diagram showing the points $P_{rc_1}$, $P_{cr_1}$, $P_{rc_2}$, $P_{cr_2}$, the midpoints $m_{r_1}$, $m_{r_2}$, the arcs $C_{12}$, $C_{21}$, and the distances $d_1$, $d_2$]

iii. $d = d_1 + d_2$.
ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection

Figure 24: [diagram showing the distances $d_1$, $d_2$]

ii. $d_2 = \min_{x',y'}(distance(P_{v'x'}, P'_{w'y'}))$ where $v', w' \in \{rc, cr\}$, $x', y' \in \{1, 2\}$ and $v' \neq v$, $w' \neq w$;

iii. $d_c = distance(center\,C_{xx'}, center\,C_{yy'})$;

iv. $d_\phi = \left|\left(\phi + \frac{\Delta\phi}{2}\right) - \left(\phi' + \frac{\Delta\phi'}{2}\right)\right|$;

v. $d_\alpha = |\alpha_{xx'} - \alpha_{yy'}|$;

vi. $d = d_1 + d_2 + d_c + d_\phi + d_\alpha$.
Edgewise Connection

Figure 25: [diagram showing the points $P_{rr}$, $P_{cr}$, $P_{rc}$, the midpoints $m_{cr}$, $m_{rc}$, and the arcs $C$, $C'$]

i. $d_1 = distance(P_{cr}, C') + distance(P_{rc}, C')$;

iv. $d_\alpha = |\alpha - \alpha'|$;

v. $d = d_1 + d_c + d_c + d_\phi + d_\alpha$.