
Visual Affordance and Function Understanding: A Survey

MOHAMMED HASSANIN∗ , University of New South Wales Canberra, Australia


SALMAN KHAN, Inception Institute of Artificial Intelligence (IIAI), UAE
MURAT TAHTALI, University of New South Wales Canberra, Australia
Nowadays, robots are dominating the manufacturing, entertainment and healthcare industries. Robot vision aims to equip robots with the
capabilities to discover information, understand it and interact with the environment, which requires an agent to effectively understand
object affordances and functions in complex visual domains. In this literature survey, 'visual affordances' are first focused on, and current
state-of-the-art approaches for solving relevant problems, as well as open problems and research gaps, are summarized. Then, sub-problems,
such as affordance detection, categorization, segmentation and high-level affordance reasoning, are specifically discussed. Furthermore,
functional scene understanding and its prevalent descriptors used in the literature are covered. This survey also provides the necessary
background to the problem, sheds light on its significance and highlights the existing challenges for affordance and functionality learning.
CCS Concepts: • Computing methodologies → Vision for robotics;
Additional Key Words and Phrases: Affordance prediction, Functional scene understanding, Deep learning, Visual reasoning
ACM Reference Format:
Mohammed Hassanin, Salman Khan, and Murat Tahtali. 2020. Visual Affordance and Function Understanding: A Survey. ACM Comput. Surv.
, (August 2020), 28 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Affordance understanding is concerned with the possible set of actions an environment allows, an area of study that aims to
answer the question of how an object can be used by an agent. Ecological psychologist James Gibson was the first to introduce
the concept of affordances in 1966 [37]. Since then, this theory has been extensively used in the design of better and more
robust robotic systems capable of operating in complex and dynamic environments [53]. In contrast to affordances, which
are directly dependent on an actor, functionality understanding relates to identifying the possible set of tasks that can be
performed with an object. Therefore, an object’s function is a permanent property of it independent of the characteristics
of its user. Affordance and functional understanding not only enable humans or Artificial Intelligence (AI) agents to better
interact with the world but also provide valuable feedback to product designers who need to consider potential interactions
between users and products. As a result, this research topic is highly important for domestic robotics, content analysis and the
context-aware understanding of a scene.
Despite being an indispensable step towards the design of intelligent machines, affordance understanding is a complex and
highly integrated task. Firstly, to understand how an object can be used by an agent requires reasoning about what it is and
where it is located. Furthermore, it is necessary to know its geometry and pose, e.g., an inverted cup cannot offer a ‘pouring’
action. The object affordances are also dynamic and an intelligent agent should be able to consider both the prior knowledge
and past experiences, e.g., a vessel may be first 'graspable', then 'liftable' and finally 'pourable'. These challenges offer room for
novel ideas and innovative solutions for visual affordance and functionality understanding.
Affordance understanding has been reviewed from different perspectives in the literature. Bohg et al. [8] considered
data-driven grasping tasks, particularly for the manipulation of objects, while Yamanobe et al. [164] reviewed affordance
tasks for the same purpose. Min et al. [87] introduced a general overview of affordances and existing techniques but only those
relevant to developmental robots and related tasks, such as the formalization of affordances. To the best of our knowledge, this
survey is the first attempt to review the literature from the perspective of ‘visual’ affordance and functionality understanding
whereas other literature reviews cover affordance understanding from the aspects of robotics perception, sensory-motor
coordination or psychology and neuroscience [8, 60, 87, 91, 164]. However, affordance-based reasoning is equally important
for machine vision and visual scene understanding, as demonstrated by the increasing activity in this area (Figure 1).
∗ This is the corresponding author

Authors’ addresses: Mohammed Hassanin, University of New South Wales Canberra, 1 Northcott Dr, ACT, 2600, Australia, m.hassanin@unsw.edu.au; Salman
Khan, Inception Institute of Artificial Intelligence (IIAI), UAE; Murat Tahtali, University of New South Wales Canberra, ACT, Australia.

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the
United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes
only.
© 2020 Association for Computing Machinery.
0360-0300/2020/8-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn



Fig. 1. Growth in the number of papers on visual affordance in computer and robot vision literature in recent years (from 2014 to 2019).

Figure 2 shows the taxonomy of this survey and the scope that will be discussed in detail. It comprises two main branches:
affordance-based understanding and functionality understanding methods. The former mainly uses visual techniques, such
as detection and categorization, to perceive the possible affordances related to objects. Since affordance is the
perception of objects regardless of any agent or interaction in the environment, it is different from function understanding.
This affordance can be used as context for activity recognition applications, whereas learning higher-order affordances is
classified as affordance reasoning. In contrast, function understanding is based on affordance alongside some form of interaction
in the scene. Functionality might be perceived from an interaction, from a motion descriptor or from the shape of the instances. These three
tasks try to learn a descriptor from the scene, which can then be used to recognize the functionalities and usages of scene objects.
The rest of this survey is organized as follows: First, a comprehensive background to the area is provided along with the
definition of specific terms frequently used in the visual affordance literature in Sec. 2. Afterward, we provide the significance
and challenges in Secs. 3 and 4, respectively. We then cover the research in visual affordance understanding in Sec. 5 and
categorize the methods according to specific sub-problems such as affordance detection, categorization and semantic labeling.
Efforts to understand the functions of different objects and tools are summarized in Sec.6. The main computer vision datasets
with affordance and function annotations are listed in Sec. 7. Finally, the open research problems in this area are presented
and new research directions suggested in Sec. 8.

2 TERMINOLOGY AND BACKGROUND


The term affordance was coined by the psychologist Gibson to define the interactions between an actor and its environment
[37, 38]. In his own words:
Affordance: “The affordances of the environment are what it offers the animal, what it provides or furnishes,
either for good or ill. The word affordance implies the complementarity of the animal and the environment.”
−Gibson, 1979
He advocated that the target of computer vision should be to estimate possible interactions among a human, animal and
environment in a scene rather than merely detecting the contents of that scene [59]. It was claimed at the time that perception
is only a property of an agent, but Gibson argued that its meaning is inherited from the environment [37]. Meanwhile, many
researchers tried to determine the best definition of affordance. For example, Turvey [146] stated that "an affordance is a particular
kind of disposition, one whose complement is a dispositional property of an organism", and used the term 'effectivity' to denote
the dispositional property of an organism (or AI agent). Consequently, based on an agent's effectivity and its environment's
affordance properties, an action is realized. Stoffregen [138] sought to clarify the relationship between affordances and
actions, concluding that they are not identical in many aspects. In another study, Stoffregen criticized Turvey's definition
for neglecting the relationship between an animal and its environment, which caused it to be sub-optimal from the viewpoint


Fig. 2. Survey taxonomy shows the structure of methods that have been used to solve affordance issues.

of perception-action affordance [139]. Instead of agreeing with their predecessors on the definition of affordance, Sahin et
al. [121] introduced a new formalization, arguing that affordance has three main constituents, an agent, environment and
observer. This concept allowed affordances to cover every aspect of robotic control ranging from perception to planning, i.e.,
first perceiving an object, then applying behavior to it and finally generating the desired effect.
Next, we define several key terms frequently used in the affordance understanding literature along with a visual description
for each task (Figure 3).
−Visual Affordance: the term visual affordance means extracting information related to affordance from an image or a video.
Similar to other machine vision domains such as object and human activity recognition, it uses computer-vision techniques to
perceive the affordance characteristics in visual media. Further, affordances are considered as explicitly perceivable from the
scene, as opposed to functions that require high-level reasoning (e.g., about interactions between scene elements). It is related
to the possible usages of the visible objects. For example, grasping is an inherent feature of the cup itself regardless of the
presence of other objects.
−Affordance detection: similar to object detection, it localizes and labels the affordance for scene objects. Different from
conventional detection tasks, it targets only the salient objects that are most relevant for actions (instead of all objects in
the scene). The goal of detecting these affordances is to predict the next action or recognize the function of some objects,
therefore it selectively targets only significant objects (Figure 3 (f)).
−Affordance categorization: it means classifying the input images into a possible set of affordance categories. Often, this
step is used as a precursor to affordance detection to make the process of localization and recognition easier.
−Affordance reasoning: affordance reasoning refers to a deeper understanding of affordances that requires higher-order
contextual modeling. The main purpose of such reasoning is to infer hidden variables like the amount of liquid inside a bottle
or how much water can be poured into an empty bottle.
−Affordance semantic labeling: this task involves segmenting an image into a set of regions that are labeled with a
semantically meaningful affordance category. Remarkably, this task assigns a category label to each pixel in a region of
interest.
−Interactive affordances: in complex and dynamic environments, robots need to learn what can and cannot be done.
In other words, interactive affordance involves teaching the robot to learn the possible set of actions that can
be performed in a given environment. It addresses the possible effects that arise through object-agent actions and
involves three overlapping factors: object, action and effect, as shown in Figure 4 (b).
−Affordance-based activity recognition: objects bear the possible actions of an agent, which are represented as their affordances.
Thus, recognizing affordances is a crucial step towards complete activity recognition. This task aims to represent
activities in terms of affordances. Note that we refer to an atomic single-person operation as an 'action' and to actions performed
by multiple agents in a complex environment as an 'activity', e.g., a moving robot is performing an action while a group of
marching robots is performing an activity.
−Social affordances: social affordances are a type of affordance related to possible social actions with objects. The term social
implies human interaction in affordance understanding. These social affordances may be positive, such as sitting on an
empty chair, or negative (socially forbidden), such as opening a handbag lying on the next chair.


(a) Functionality understanding. (b) Affordance categorization. (c) Affordance segmentation. (d) Social affordances [19]. (e) Interactive affordances. (f) Affordance detection. (g) Affordance reasoning. (h) Affordance-based activity recognition.

Fig. 3. (a), (b) Detecting functional parts to understand scene usage, and affordance-based categorization. (c) Semantic labeling. (d), (e)
Recognizing activities and social situations, and learning interactive affordances, e.g., by observation. (f) Detecting affordance
objects and labels together using bounding boxes. (g), (h) Given the affordances, reasoning about them from visual inputs, such as how much
water is in a glass, and recognizing the whole activity in the scene.

−Functionality understanding: it transcends the traditional tasks of object detection and segmentation in visual scene
understanding and uses them to build an inference model that understands the function of objects. Functionality is a higher-level
affordance that emerges as a result of an interaction between objects and agents in the scene. For example, detecting the
electricity plug that is required to charge a laptop or mobile phone is an affordance, whereas 'pressing' is the functionality,
i.e., the outcome of an interaction between an agent and the power point, as shown in Figure 3 (a).
−Interaction-based functionality: since functionality is a higher-order affordance, it can be inferred from an
interaction. The presence of a human-object or object-to-object interaction gives rise to the functionality of the object.
For instance, the 'drinking' function is an outcome of the interaction between a glass, water and a human.
−Motion-based functionality: the motion of an object in the scene can be used as a functionality descriptor to infer its
usage. Each object has its own usage in its interaction with other objects, which leads to a characteristic motion descriptor. Since each
function occurs in a distinctive way, similar motions can indicate similar usages (e.g., 'striking' is performed by moving a hammer or a
stone in a similar, roughly perpendicular manner).
−Shape-based functionality: the shape of an object is crucial for inferring its usage. Supporting the concept that form
follows function, specific functions are performed with similarly shaped objects (e.g., any flat surface can be used
for 'sitting'), as illustrated in the sketch below.
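As a toy illustration of the 'form follows function' idea behind shape-based functionality, the hypothetical heuristic below (not a method from the literature surveyed here) flags large, near-horizontal patches of a height map as candidate 'place-on' or 'sittable' surfaces:

```python
import numpy as np

def supports_placing(height_map, slope_tol=0.05, min_area=0.3):
    """Rough shape-based cue: a surface is a candidate 'place-on'/'sittable'
    region if a large fraction of it is near-horizontal (small height gradients).
    Heights and pixel spacing are assumed to use the same unit."""
    gy, gx = np.gradient(height_map)
    flat = np.hypot(gx, gy) < slope_tol        # per-pixel slope below threshold
    return flat.mean() > min_area

yy, xx = np.indices((50, 50)) - 25.0
table_top = np.full((50, 50), 10.0) + np.random.normal(0, 0.01, (50, 50))
ramp = 0.2 * xx + 10.0                          # uniformly tilted surface
print(supports_placing(table_top))   # True: flat, horizontal top
print(supports_placing(ramp))        # False: sloped everywhere
```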

3 SIGNIFICANCE
Affordance understanding is a crucial task in computer vision and human-machine interaction. As affordance relates to
objects, actions and effects, addressing it benefits all these associated fields. The significance of affordance and functionality
understanding is summarized below.
−Anticipating and predicting future actions: because affordances represent the possible actions that can be performed on
or with objects, learning them helps to anticipate actions in a given environment, with examples in the literature where affordances
were used to predict future actions [56, 65, 66].
−Recognizing the agents’ activities: as the presence of an agent (e.g., a robot or human) in a scene offers the possibility of
it interacting with the surrounding environment, recognizing its activities is an indispensable task for developing a complete
understanding of the scene. These activities are highly dependent on the types of actions that are possible or more likely in a
given environment. Several studies have targeted the problem of affordance-based human activity recognition [28, 112, 155].


−Providing the valid functionality of objects: a conventional object recognition task does not convey knowledge of an object's
function and affordances, e.g., although an occupied or broken chair will still be recognized as a chair, a person cannot sit on it.
In other words, recognizing what functions an object offers is necessary for using it, particularly in the case of interactive
robots [39].
−Understanding social scene situations: in these situations, forbidden actions, such as a healthy person occupying a seat
reserved for disabled people, happen daily. However, they need to be learned for visual recognition, as learning social affordances
helps to understand a whole scene [19].
−Understanding the hidden values of the objects: recognizing an object’s category does not provide information about
its value or significance, for instance, the task of knocking a nail into a surface would normally be done with a hammer but, if
one is not available, learning affordances indicates that appropriate-sized stones can also be used for this task [181].
−Detailed scene understanding: in traditional classification tasks, different kinds of pots are labeled ‘pots’. However, the
usage of each pot could be different, e.g., some could store rice while others could be suitable for vegetables. Therefore,
categorizing objects according to their functionality is extremely important for understanding the details (or fine grains) of a
scene [39, 64].
−Affordance cues benefit object recognition: extracting them from a scene provides a model with contextual information
about objects that makes the task of recognition easier. Many researchers have used affordance values as contextual information
to classify objects [43, 102].

4 CHALLENGES
Affordance detection is the first step in the process of affordance understanding, and it involves several challenges,
such as scale, illumination, appearance and viewpoint variations. Because affordance detection overlaps with
classical object detection, it inherits all the conventional difficulties as well. Some of its specific challenges are explained in
detail below.
−Illumination Conditions: Illumination changes affect the final results of affordance understanding because they degrade
the quality of the image being processed. An indoor robot may become stuck if the lights are suddenly switched off; likewise,
outdoor robots may face the same situation if the lighting changes abruptly from sunny to cloudy, or gradually from day to night.
Since illumination affects the image at the pixel level, any change in illumination will change the final accuracy. Affordance
understanding systems should therefore be ready for unexpected degradation and must incorporate preventive measures,
such as image restoration methods, to alleviate the impact of poor illumination.
−Viewpoint and Scale Variations: Images acquired from different viewpoints can affect the performance of recognition
algorithms; therefore, the orientation of an object with respect to the camera affects the accuracy of a method. For robots,
robustness against viewpoint and pose changes is one of the main factors in the process of affordance understanding
[135]. Similarly, the scale (size) of an object is an important factor, especially with tools. For example, a fruit knife is
generally small in comparison to a meat knife. Affordance detection algorithms should, therefore, be scale-invariant to
generalize well to unseen examples [47].
−Single-Object-Multi Affordances (SOMA) and Multi-Object-Multi Affordances (MOMA):
The inherent nature of the affordance understanding problem is different from traditional problems that have a single label for
every instance or scene [35, 36]. For example, a kitchen knife has a grasping label on its handle and a cutting label on its blade.
Similarly, a cup of tea has a grasping label on its outside and a pouring label on its inside. Hence, visual affordance inherits all the
problems of multi-label learning paradigms, such as ranking, correlation, dependency and multi-label representation schemes.
In contrast to standard multi-label problems, affordance cases have multiple labels at the object level rather than at the complete scene level.
It is closer to multi-instance multi-label (MIML) problems if we consider objects as instances [161, 176, 177]. Intuitively,
MOMA inherits the MIML challenges as well as the single-object multiple-label difficulties; it has to address higher-order correlations
between instances, dependencies between them and the exponential number of possible complex interactions (see the sketch below).
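To make this multi-label setting concrete, one straightforward representation (a sketch, not a specific paper's encoding) is a per-pixel multi-hot mask with one binary channel per affordance, so that the same pixel can carry several labels, e.g., the inside of a cup being both 'containment' and 'pourable':

```python
import numpy as np

AFFORDANCES = ["grasp", "cut", "containment", "pourable"]
H, W = 4, 4
# One binary channel per affordance: a pixel may be active in several channels.
mask = np.zeros((len(AFFORDANCES), H, W), dtype=bool)

cup_inside = (slice(1, 3), slice(1, 3))                  # toy region: the cup's interior
mask[AFFORDANCES.index("containment")][cup_inside] = True
mask[AFFORDANCES.index("pourable")][cup_inside] = True   # same pixels, second label

labels_at_pixel = [a for i, a in enumerate(AFFORDANCES) if mask[i, 2, 2]]
print(labels_at_pixel)   # ['containment', 'pourable']
```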
−Multi-source features: The features required to perform affordance understanding are diverse in nature and should come
from complementary sources [41], such as the following:
(1) Visual cues such as color and texture.
(2) Physical properties such as weight and volume; for instance, a chair may appear movable from its visual features, but its
weight may not allow it to be moved.
(3) Material properties e.g., to assess the comfort of a chair.
−Inter-object and Human-Object dependency: The affordance of an object sometimes relies on other objects, particularly
for service robots. For example, for the task of preparing a cup of tea, the robot has to boil water, which depends on the
electricity plug, and then pour the hot water into a cup. This task becomes complicated if one of those objects is unseen or occluded.
The existence of a human in a scene makes it more complex in terms of actions, events, and affordance understanding, in


Fig. 4. (a) The overlap between affordances and various fields, (b) the relations between objects, actions and effects, and (c) the relation
between affordances and other visual paradigms.

particular because the affordance is then dependent on the attributes of the human. For example, putting fruit in the
fridge depends on the height of the human [46].
−Plausible affordances: Since affordances are crucial for robots to move around in a given scene, they have to be realistic
and plausible [160]. For instance, to pour water into a pot, the pot has to be empty, not turned upside-down, and of
sufficient capacity [95, 168]. All these factors should be considered for an agent to make a proper decision in its environment.

5 WHAT IS A VISUAL AFFORDANCE?


The field of scene understanding, which aims to enable computers to understand an environment and its contents, has
been studied from different perspectives, such as object recognition [73, 120], object detection [9, 31, 77, 113, 150], scene
classification [24, 170], indoor scene understanding [16, 44] and so on. However, although a great deal of research has been
conducted, understanding possible interactions and developing higher-level reasoning about scenes has been less investigated.
In other words, merely detecting the elements in a scene is not sufficient to make intelligent decisions without inferring
its complex interactions and dynamics, such as the possible actions that can be taken, e.g., to learn how to use a vacuum
cleaner, it is necessary to know which button to press and where the electric plug is, and to use a chair as a chair, it must be
determined whether it is ‘sittable’, ‘occupied’ or ‘broken’ [39]. In summary, maximizing the benefits of scene understanding
requires other associated factors, such as affordance detection and reasoning.
The concept of affordance was initially accepted strongly in the fields of perceptual, cognitive and environmental psychology.
Affordances are often used to study the interaction capabilities of objects because they have close relationships with different
types of environments, cut across various research areas and are linked to an object regardless of its location. In detail, affordances
differ according to the environment, for instance, a robot inside a kitchen should understand the affordances of tools [97],
CAD objects such as chairs [136], and the functionalities that can be achieved. Moreover, if a human is involved in a scene,
the additional aspects of human-robot interactions and action recognition should be addressed, as introduced in the fields
of computer vision, human-robot interactions and robotics [147]. Affordance is inter-related with different fields, such as
developmental learning, robot manipulation and psychology, as shown in Figure 4 (c).
Visual affordance is a branch of research that deals with affordance as a computer vision problem based on images and
videos, and uses machine learning methods, such as deep learning, to solve challenges. As it is closely related to various fields,
such as action recognition [64, 65], scene understanding, grasping, human-robot interaction [11, 131] and function recognition
[166, 167], it overlaps with important computer vision problems, such as image classification and action recognition, and is
affected by an identical set of challenges such as illumination changes, occlusions, objects’ poses and dynamic scenes [1] (see
Figure 4 (a), (b)). Also, the problem of affordance understanding is more complicated because, as well as all the difficulties
inherited from computer vision, it encounters additional challenges, e.g., every object may have different functionalities.


Fig. 5. Detection process through machine learning techniques.

5.1 Affordance Detection


Recent advances in robotics and computer vision have paved the way for autonomous agents to impact every aspect of our
lives. To enable service robots to perform tasks, they need to understand the environment, for example, a robot cannot use a
hammer without locating its handle. Therefore, detecting the affordance of a scene, tool or given image is mandatory for
effective interaction and manipulation (see Figure 5) which is, however, quite a challenging task, as explained below.
5.1.1 Feature-engineering approaches. An affordance detection task deals with labeling and localizing particular parts of
a tool that can perform certain actions and also involves reasoning about the tool’s functionality. Early work on affordance
detection sought to recognize 3D CAD objects, e.g., chairs based on their functions. Stark and Bowyer [136] built the Generic
Recognition using Form and Function (GRUFF) system to recognize different objects according to their functionalities rather
than their shapes using several functional primitives assigned to perform function-based recognition. Aldoma et al. [5]
proposed a visual-cue method for finding affordances in a scene depending on the pose of a particular object. Their approach
first recognizes an object in the scene, then estimates its 3D pose and finally learns the so-called 0-order affordances,
which are divided into hidden and unhidden ones (Table 1 presents those investigated).
Myers et al. [97] were the first to treat affordance detection in images as a traditional vision task from a geometric perspective. They detected
the affordances of tool parts based on their importance for robot vision and introduced an RGB-D dataset with ground-truth
annotations, using a Support Vector Machine (SVM) as the learning algorithm. They used a pixel-wise approach in which
hand-crafted geometric features are learned for each pixel. Although the idea behind using such a method in this sense
was novel, it was complicated by the fact that the same pixel could have different affordance labels. They used two methods
to train this model: firstly, the Superpixel-based Hierarchical Matching Pursuit (S-HMP) [7] to extract geometric features
(depth, surface normals, principal curvatures, shape index and curvedness) with an SVM as the main classifier; and secondly, a
Structured Random Forest (S-RF) [26] to infer the affordance labels based on the learned decision trees, particularly in
real time. They introduced the seven affordance labels shown in Table 1.
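To ground the kind of per-pixel, geometry-driven pipeline described above, the following is a minimal illustrative sketch (not the authors' S-HMP/S-RF implementation): it derives simple per-pixel features (depth plus approximate surface normals) from a depth map and trains a linear SVM to assign an affordance label to every pixel; all data here are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def depth_to_features(depth):
    """Per-pixel geometric features from a depth map (H x W, metres):
    the depth itself plus approximate surface-normal components
    derived from the depth gradients."""
    dzdy, dzdx = np.gradient(depth)
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return np.dstack([depth, normals]).reshape(-1, 4)     # (H*W, 4)

# Toy data: random depth maps with dense per-pixel affordance labels.
rng = np.random.default_rng(0)
depth_maps = rng.uniform(0.5, 2.0, size=(4, 32, 32))
labels = rng.integers(0, 3, size=(4, 32, 32))             # e.g. background / grasp / cut

X = np.vstack([depth_to_features(d) for d in depth_maps])
y = np.concatenate([l.ravel() for l in labels])

clf = LinearSVC().fit(X, y)                                # per-pixel classifier
pred = clf.predict(depth_to_features(depth_maps[0])).reshape(32, 32)
```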
Herman et al. [51] sought to introduce a new method that depended on physical and visual features, such as the material,
shape, size and weight, to learn affordance labels (Table 1). They collected their own data, which belonged to six categories:
balls, books, boxes, containers (mugs, bottles and pitchers), shoes and towels. Based on these features, they used SVM and k-NN
classifiers to test their method. Their work emphasized the concept that combining physical and visual attributes enhances
affordance understanding.
Grabner et al. [39] used the concept of functionality to recognize whether an object, e.g., a chair, was sittable or not, i.e.,
whether it provided a sitting affordance even if it could be used by another object. In order to detect whether an object enabled sitting,
or was 'sittable', they used a 3D model of a human skeleton and tested it sitting on 3D models of chairs. Through different human sitting poses
and interactions between an actor and an object, they detected not only the chair's affordances but also how to use it. However,
despite its benefits for detection and reasoning, the approach required additional cues to improve its performance.
Moldovan et al. [90] proposed a novel method to estimate object affordances in an occluded environment. They used the
relational-affordances concept to search for objects that can afford certain actions [89]. Additionally, they used Statistical
Relational Learning (SRL) [21, 22] methods to model probability distributions that encode the relationships between objects.
However, they generated these probabilities from an external knowledge base, i.e., web images. Hassan and Dharmaratne


Affordance Label | References | Description | Examples
rollable | [5, 51, 81] | whether the object is rollable or not | roads, trolley
containment | [5, 97, 100, 126, 128] | indicates the contain-ability of an object | pots
liquid-containment | [5] | indicates the liquid contain-ability of an object | glasses, cups, mugs
unstable | [5] | whether the object pose is stable after pushing | glass cups may break when pushed
stackable-onto | [5] | whether the object can be stacked | mugs, pots
sittable | [5, 61, 81, 119] | whether the object can be used to sit on or not | chairs, desks
grasp | [51, 61, 81, 97, 100, 143] | defines the location of manipulation of flat tools | hammer, cups
cut | [97, 100, 126, 128, 143] | indicates cutting | knife, penknife, key
scoop | [97] | indicates curved-surface tools | trowels, cookie scoop
pound | [97, 100] | indicates striking tools | hammerhead
support, place-on | [97, 100, 126] | indicates helpers or support for an agent | flat tools (turners, spatulas), place-on (tables, desks), agent support (walls)
wrap-grasp | [97, 100] | detects the location of grasping | the outside of a cup
pushable (forward, right, left) | [51, 61, 143] | whether the object is pushable | trolley, bike
liftable | [51, 61, 143] | whether the object can be lifted or not | liftable chairs
draggable, pushable backward | [51, 61] | whether the object can be dragged | desk, table
carryable | [51] | whether the object can be carried | light-weight pots, balls
traversable | [51] | whether the object can be traversed | road, grass
openable | [126, 143] | whether the object can be opened | fridge, room, microwave, book, box
pourable | [126] | whether the object is pourable | mug
holdable | [126] | whether the object can be held | the outside of a mug
display, observe | [81, 100] | refers to display sources | TV, monitor screen
engine | [100] | refers to tools' engine parts | drill engine
hit | [100, 128] | refers to tools used to strike other objects | racket head
obstruct | [81] | indicates the locations of obstructions | wall
break | [81] | indicates break-sensitive objects | glass cups
pinch-pull | [81] | indicates objects that are pulled with a pinch | knob
hook-pull | [81] | indicates objects that are pulled by hooking up | handle
tip-push | [81] | indicates objects that perform actions after pushing | electricity buttons
warmth | [81] | indicates warmth-providing objects | fireplaces
illumination | [81] | indicates light sources | lamps
dry | [81] | indicates objects that absorb water | towels
walk | [81, 119] | indicates places that allow walking | gardens
lyable | [119] | refers to free space that allows a person to lie down | bed
reachable | [119] | refers to reachable objects for picking up | bottle in the fridge
movable | [61] | refers to objects that can be moved around | small objects like balls, mugs
Table 1. Indoor affordance labels used to detect objects, as in studies [5, 51, 61, 81, 97, 100, 119, 126, 128, 143]


(a) AffordanceNet [25]. (b) Deep driving [14].

Fig. 6. (a) The deep learning architecture proposed by [25]: the RoI concept [49] is used to share weights with the main CNN layers,
VGG is the feature extractor, and a deconvolution layer refines the results for small objects. (b) Given an image from TORCS [162],
the CNN extracts 13 affordance indicators that are fused with the car speed to reason about the appropriate driving command.

[46] proposed an affordance detection method based on the object, the human, and the ambient environment. They used the
objects' attributes (physical, material, shape, etc.), human attributes (poses), and object-to-object relations to train their scheme. Local
features were extracted and used as inputs for classification based on Bayesian networks.
5.1.2 Feature-learning approaches. Inspired by [97], Nguyen et al. [99] built a model that extracts deep geometric features
using a Convolutional Neural Network (CNN) to detect affordances. They used an encoder-decoder architecture based on
deep CNNs with multi-modal features (horizontal disparity, height, and the angle between pixel normals and the inferred gravity direction)
[45, 98]. It was demonstrated that automatic feature learning performed on top of geometric features resulted in better
performance compared to [97]. Despite these significant enhancements, the dataset images are simple, i.e., they contain no occlusion
or clutter. Sawatzky et al. [125, 126] proposed a weakly supervised method to learn affordance detection using a deep CNN-based
expectation-maximization framework. This framework handles weakly-labeled data with image-level or
key-point-level annotations. They sought to address the problem of affordance segmentation, which needs special care because
every pixel may be assigned multiple labels. They learned deep features from the training data and used the human
pose to represent context, and they introduced affordance RGB-D datasets with rich contextual information; the annotated
labels are shown in Table 1. However, this approach required an additional step to update the parameters of the CNN and estimate
the masks needed for segmentation [125]; for this reason, they proposed an adaptive binarization threshold approach to
remove that step and enhance the results. Nguyen et al. [100] conducted another study to detect the affordance labels in
the images using a deep CNN architecture. Similar to Ye et al. [167], their work was inspired by popular CNN-based object
detectors [113, 150, 166] and, as a result, they treated affordance understanding as an object detection problem. This approach
starts with a candidate set of bounding-box proposals for objects, which are generated using [20, 114]. This detection stage is
followed by the atrous convolution technique [15] to extract deep features, which are finally fed to a Conditional Random
Field (CRF) model. The CRF, as a post-processing mechanism, provides substantial improvements [15, 173]. The authors
published a new RGB-D dataset called IIT-AFF (with nine affordance labels, as given in Table 1) to prove the efficacy of their
method. In contrast to the above-mentioned methods, Chen et al. [14] proposed a novel idea of utilizing affordances to reason
about autonomous driving actions. They trained a deep CNN on the KITTI dataset [34] and
twelve recorded hours of video games [162] (see Figure 6). The learned affordances were then tested to predict the correct driving action.
Finally, a detailed comparison of affordance detection techniques is presented in Table 2.
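As an illustration of the encoder-decoder formulation adopted by several of the feature-learning methods above, the following is a minimal PyTorch sketch (a generic toy network, not the architecture of [99], [100] or [25]) that maps an RGB image to a per-pixel score map over affordance classes:

```python
import torch
import torch.nn as nn

class AffordanceEncoderDecoder(nn.Module):
    """Toy encoder-decoder: downsample with strided convolutions,
    upsample with transposed convolutions, and predict one score
    map per affordance class at the input resolution."""
    def __init__(self, num_affordances: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_affordances, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))   # (B, num_affordances, H, W)

model = AffordanceEncoderDecoder()
image = torch.randn(1, 3, 128, 128)             # placeholder RGB input
target = torch.randint(0, 7, (1, 128, 128))     # placeholder per-pixel labels
loss = nn.CrossEntropyLoss()(model(image), target)
loss.backward()
```

In practice the works above use much deeper backbones (e.g., VGG), multi-modal inputs and post-processing such as CRFs; the sketch only shows the dense input-to-output shape of the problem.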

5.2 Affordance Categorization


The affordance categorization task aims to tag an image with the relevant set of affordance labels. To this end, a general
approach is to represent an image in the form of discriminative features and employ a classifier to assign affordance labels.
This task is relatively simple compared to affordance detection and segmentation, which also localize affordance categories
(see Figure 7).
Varadarajan and Vincze [154] proposed a hybrid parallel architecture of deep learning and suggestive activation (PDLSA) to
overcome problems of deep learning, such as uni-modality and serialization, in order to categorize objects based on
affordance features. The authors extracted the semantic features of affordances, as proposed in [153], as well as structural
and material features to enhance the recognition results. The Washington RGB-D dataset was used to test the efficacy of
their model. Sun et al. [140] employed object categorization as an intermediate step to infer affordances more accurately.
They developed a visual category-based affordance model, which encodes the relationships among visual features, learned
categories and affordances in probabilistic form. Such probabilistic modeling allows knowledge transfer and enhances accuracy,
especially when limited annotated data is available. The authors suggested seven object categories and six affordance labels as

Table 2. Comparison between affordance detection methods (Stark et al. [136], Aldoma et al. [5], Myers et al. [97], Herman et al. [51], Moldovan et al. [90], Nguyen et al. [99], Sawatzky et al. [126], Nguyen et al. [100], Do et al. [25] and Chen et al. [14]) in terms of object features (2D, 3D, multimodal), feature extraction (hand-crafted, feature learning), evaluation (benchmark, simulation, real robot), training (supervised, weakly supervised, unsupervised) and model (neural net, mathematical).

Fig. 7. Affordance classification process through machine learning techniques.

shown in Table 1. However, this study was devoted to indoor environments and did not cover all cases. In
[39], the authors categorized images based on functions to enhance the performance of detection. The concept of
bootstrapping, which uses past knowledge to accelerate the learning process, has been applied to obtain affordances [128, 149].
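Sun et al. [140], mentioned above, encode the relationships among visual features, learned categories and affordances in probabilistic form. To make one natural instance of such a formulation concrete, the toy sketch below (the numbers are invented, and this is not necessarily the exact factorization used in [140]) marginalizes affordance probabilities over object categories:

```python
import numpy as np

# Toy numbers (assumptions, not from [140]): 3 object categories, 2 affordances.
p_category_given_features = np.array([0.7, 0.2, 0.1])    # P(c | visual features x)
p_affordance_given_category = np.array([                 # P(a | c)
    [0.9, 0.1],   # category 0: mostly 'graspable'
    [0.5, 0.5],   # category 1
    [0.1, 0.9],   # category 2: mostly 'sittable'
])

# Marginalize over categories: P(a | x) = sum_c P(a | c) P(c | x)
p_affordance_given_features = p_category_given_features @ p_affordance_given_category
print(p_affordance_given_features)   # [0.74, 0.26]
```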
Schoeler et al. [128] proposed to infer any possible usage of a tool, even if that usage is normally accomplished by another, primary tool,
e.g., using a stone as a hammer or a helmet instead of a water cup. They divided tools into six functional
categories (contain, cut, hit, hook, poke, and sieve). Afterward, an ontology of tool functions was created to allow a deeper
understanding by exploring their usage in the absence of the main tool. They developed their main algorithm in three steps:
(1) part segmentation through the Constrained Planar Cuts (CPC) algorithm [127]; (2) extraction of part-based visual features via
a Signature of Histograms of Orientations (SHOT) descriptor [96, 144]; (3) encoding of the pose of individual parts with respect to each
other via a pose signature that models the alignment and attachment between parts. Although they introduced a
new idea that was thoroughly tested on a synthetic 3D dataset, their method gets confused when many tools can
perform the same action. Also, the selected tool for some actions may not be well-suited due to its size or shape.
Given basic affordances, Ugur et al. [149] proposed a bootstrapping method to learn complex affordances through
relational affordances, or so-called paired-object affordances. They evaluated their approach using a real robot on various
object shapes, such as boxes, spheres and cylinders. Additionally, they trained the robot to perform actions such as side-poke and
stack. Similarly, Fichtl et al. [128] addressed the problem of 'related affordances', which denotes cases where affordances

Table 3. Comparison between affordance classification methods (Varadarajan et al. [154], Sun et al. [140], Schoeler et al. [128], Ugur et al. [149], Fichtl et al. [128], Abelha et al. [2–4], Mar et al. [84], Mar et al. [85], Pieropan et al. [107] and Kjellström et al. [64]) in terms of input (2D, 3D, multimodal), features (handcrafted, feature learning), evaluation (benchmark, simulation, real robot), training (supervised, semi-supervised, self-supervised, unsupervised) and abstraction (neural, mathematical).

are related to each other, e.g., opening the kitchen door to fetch a glass from inside. They used pose and size as the visual features to
build their model. Abelha et al. [2–4] proposed methods to identify tools and their substitutes based on matching the superquadrics
[58] of these tools in point cloud data. Mar et al. [84, 85] used learning by exploration as the methodology to train the iCub
robot. In the latter work [85], they categorized the objects before using parallel Self-Organizing Maps (SOMs). Following their
earlier study [107], Pieropan et al. used the same method to learn affordances rather than functionalities [108]. Kjellström et al.
[64] proposed functional categorization to learn object affordances through human demonstration. They used a Conditional
Random Field (CRF) and a factorial CRF to train the model and infer an object's affordances and actions,
even when they are novel. Overall, Table 3 presents a detailed comparison of affordance classification methods.

5.3 Affordance-Semantic Labeling


The affordance semantic labeling task involves assigning pixel-level affordance category labels to relevant regions in an image.
This problem combines segmentation and detection tasks and is relatively more challenging. It requires local and global
contextual modeling for accurate pixel-level predictions. Affordance labeling is highly useful for precisely locating where
appropriate actions can be performed in a given scene. Figure 8 shows the main steps of the segmentation process.
Inspired by the semantic segmentation framework proposed by Eigen and Fergus [29] for generic object categories, Roy et
al. [119] proposed a multi-scale CNN method to segment semantically meaningful affordances. Given an RGB image,
their model predicts three types of information: (1) a depth map, (2) surface normals and (3) semantic segmentation labels.
These outputs are then merged in a CNN to predict the affordance maps. The experiments were
performed on the NYU Depth dataset, which consists of real-world indoor images [132] and was extended with
affordance ground-truth annotations. The authors suggested five affordance labels, as shown in Table 1. Although this method
introduced a new feature-encoding hierarchy based on intermediate semantic segmentation, it does not address the problem
of multi-label affordances, which conflicts with the concept of semantic segmentation: semantic
segmentation assigns a single label to all pixels belonging to an object, while each object usually has multiple affordance
labels, e.g., a knife has both cut and handle affordances. Likewise, Kim et al. [61] performed affordance segmentation by using
surface-geometry features (e.g., linearity, normals and occupancy) of RGB-D images; they suggested the six affordances
shown in Table 1. Luddecke and Wörgötter [81] proposed a new method to label affordances in RGB images using a refined
version of a residual CNN [50], inspired by the work of Pinheiro et al. [110]. As a major novelty, they developed a new
cost function to handle multiple affordances in the case of incomplete data. In a similar work, Roy and Todorovic [119] used the
concept of action maps, which predict the ability of users to perform actions at various locations [115, 123]. This results in pixel-wise
affordance segmentations given RGB images. They introduced two original concepts:
• Object Parts − These are used to detect relevant segments of an object, e.g., the surface of a table is important for
placement while the table legs are not useful for this purpose. As a consequence, they could only train on a dataset that supports object-part
annotations, i.e., ADE20K [174].


Fig. 8. Affordance-based segmentation process diagram.

Table 4. Comparison between affordance-semantic labeling methods (Roy et al. [119], Kim et al. [61], Luddecke et al. [81], Nguyen et al. [100], Do et al. [25] and Hassanin et al. [47]) in terms of input (2D, 3D), features (hand-crafted, feature learning), evaluation (benchmark, real robot), training (supervised, weakly supervised, unsupervised) and model (neural, mathematical).

• Transfer Table − A manually defined look-up table that maps object labels to affordance labels. In order to cover the
affordance parts, the authors suggested fifteen labels, as shown in Table 1; a toy illustration of such a mapping follows.
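The transfer-table idea can be pictured as a simple dictionary from object-part labels to affordance labels; the sketch below is illustrative only (the entries are hypothetical, drawn from the labels of Table 1, and are not the authors' actual mapping):

```python
# Hypothetical transfer table mapping object-part labels to affordance labels.
TRANSFER_TABLE = {
    "table/top":    ["place-on"],
    "table/leg":    [],                      # no affordance of interest
    "chair/seat":   ["sittable"],
    "knife/blade":  ["cut"],
    "knife/handle": ["grasp"],
    "cup/outside":  ["wrap-grasp"],
    "cup/inside":   ["containment", "pourable"],
}

def affordances_for(object_part_label: str):
    """Look up affordance labels for a detected object part."""
    return TRANSFER_TABLE.get(object_part_label, [])

print(affordances_for("knife/blade"))   # ['cut']
```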

Do et al. [25] followed the two previous studies from the same group [99, 100] with a new effort to build an end-to-end deep
learning architecture. Notably, the end-to-end learning concept has recently come to dominate recognition techniques,
leading to algorithms that can perform model training in a single framework [77, 113]. Further, such systems detect
objects and their affordances in a single stage instead of multiple isolated and disintegrated steps; hence, they reduce
training time and lead to better performance. Although they mainly followed the same strategy as [49, 72], as shown in Figure
6, they added new components such as deconvolutional layers and robust resizing strategies to handle the problem of multiple
affordance classes. They relied on the affordance labels of [100] and tested their approach on two datasets: the UMD dataset [97]
and IIT-AFF [100]. In [47], Hassanin et al. proposed a novel regression loss to improve the localization performance as well as to
address the variance in object sizes. Chu et al. [18] proposed a synthetic version of the UMD dataset [97] for manipulation
tasks; it includes synthetic input data with auto-generated annotations (labels, bounding boxes and affordance
masks). Most recently, Chu et al. [17] proposed a category-agnostic approach for affordance segmentation that is able to
generalize to unseen objects by exploiting instance masks. Also, they used KL-divergence to rank the learned affordances
according to the scene. Finally, Table 4 compares the recent methods in the literature.
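Evaluation protocols differ across the affordance segmentation works above (several use ranked or weighted measures); as a simple, commonly used reference point, the sketch below computes per-class intersection-over-union between a predicted and a ground-truth affordance label map:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Intersection-over-union per affordance class for one image.
    `pred` and `gt` are integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union else np.nan)   # NaN if class absent
    return np.array(ious)

pred = np.random.randint(0, 3, (64, 64))   # placeholder prediction
gt = np.random.randint(0, 3, (64, 64))     # placeholder ground truth
print(per_class_iou(pred, gt, num_classes=3))
```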


5.4 Interactive Affordances


In interactive environments, robots need to learn how to interact and which objects to interact with. Interactive affordance
involves training the robot to learn the possible set of actions that can be performed in a given environment. It includes
various ways of transferring experience to robots, such as learning by imitation, self-observation and self-interaction. In an early
work, Fitzpatrick et al. [32] proposed a method that allows a robot to learn how to segment objects through imitation.
The goal was to allow humanoid robots to understand through acting. Similarly, by imitating human actions on
objects, a robot can learn object affordances, e.g., whether a spherical shape is rollable or a cubic shape is slide-able. In addition
to predicting affordances, it could interpret others' actions. The intertwining of objects and actions, formally known
as motor actions [67], to learn object affordances or to segment objects was promising at that time. Montesano et al.
[93] sought to learn affordances through robot-environment interactions. Additionally, they used affordance labels such as
eatable, movable and graspable as sensing capabilities. Local features, e.g., color characteristics, were used to detect the shape
of an object, which was eventually used to detect these affordances. The robot then learned a model of grasping and used
affordances through imitation and self-observation. In the same context of motor actions, they proposed a probabilistic model
based on Markov Chain Monte Carlo sampling. Inspired by Montesano et al. [92], Lopes et al. [79] proposed a probabilistic
technique based on Bayesian networks to learn affordances; these learned affordances were thereafter used to recognize
the demonstrations of an agent and learn the given task. Based on the affordances (e.g., tappable or graspable) and through
self-observation, the robot could relate an action with its resulting effects. However, affordance learning was applied to
only single objects. Likewise, Ugur et al. [148] used self-interaction and self-observation to build their model. However, they
provided behavioral parameters to enhance accuracy. In contrast to previous methods [79, 93], the authors used unsupervised
clustering to segment grasping over 300 trials, whereas an SVM was used to learn the affordance labels. Varadarajan et al. [152, 153]
developed a dataset to build knowledge ontologies similar to MIT ConceptNet [48] and the KnowRob Semantic Map [142], but for
household RGB-D images. They presented various affordance features such as grasp, material and structural features. The authors
built affordance filtrations, which start by localizing the affordances; they then identified the entities related to an object,
named its affordance duals, by looking for all the entities that share the same affordance with that object. Moreover, they
used a semantic part segmentation algorithm [151] for segmentation, whereas they modified the Levenberg-Marquardt Algorithm
(LMA) [94] using particle swarm optimization (PSO) in order to recognize the objects. Moldovan et al. [88] extended the model of [80, 93] that
considers three related concepts, actions, objects and effects, as shown in Figure 4 (b). They used spatial relations, defined by
the distance between objects, and affordances to tackle the problem of multiple-object manipulation. They used
probabilistic programming with logical rules and probabilities to build their model. In contrast, Lopes et al. [80] used
object affordances as prior information to enhance gesture recognition and reduce ambiguities based on motor terms. All of
the above-mentioned methods are summarized in Table 5 to show what has been achieved and what still needs to be
considered.
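Several of the interactive approaches above (e.g., [79, 88, 93]) treat affordances as probabilistic relations among objects, actions and effects. The following minimal sketch (an illustrative toy, not any specific published model) estimates P(effect | object, action) from counted interaction trials:

```python
from collections import defaultdict

# Each trial is an (object_feature, action, observed_effect) triple.
# Toy data standing in for a robot's self-exploration log.
trials = [
    ("sphere", "push", "rolls"), ("sphere", "push", "rolls"),
    ("cube", "push", "slides"), ("cube", "push", "slides"),
    ("sphere", "grasp", "lifted"), ("cube", "grasp", "lifted"),
    ("cube", "push", "rolls"),          # occasional noisy outcome
]

counts = defaultdict(lambda: defaultdict(int))
for obj, action, effect in trials:
    counts[(obj, action)][effect] += 1

def p_effect(obj, action, effect):
    """Maximum-likelihood estimate of P(effect | object, action)."""
    total = sum(counts[(obj, action)].values())
    return counts[(obj, action)][effect] / total if total else 0.0

print(p_effect("sphere", "push", "rolls"))   # 1.0
print(p_effect("cube", "push", "rolls"))     # ~0.33
```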

Table 5. Comparison between interactive affordance methods (Fitzpatrick et al. [32], Lopes et al. [79], Montesano et al. [93], Ugur et al. [148], Varadarajan et al. [152, 153], Gupta et al. [43], Castellini et al. [10], Grabner et al. [39], Lopes et al. [80], Wang et al. [159], Sun et al. [141] and Thermos et al. [143]) in terms of input (2D, 3D), features (handcrafted, feature learning), evaluation (benchmark, real robot), training (supervised, semi-supervised, unsupervised) and model (neural, mathematical).


Fig. 9. Affordance-based activity recognition process.



5.5 Affordance-based activity recognition


Enabling seamless human-robot interactions is a crucial step towards achieving the ubiquitous use of personal robots. It
is a multidisciplinary research area that overlaps with robotics, human-computer interactions, cognitive science, Artificial
Intelligence (AI), action recognition and affordance prediction. To understand the interactions between a robot and its
environment, the affordances of objects should be predicted as they can be very useful cues for activity recognition, as shown
in Figure 9.
The close relationships among affordances, actions and humans have been investigated and, in this section, those between
affordances and actions are reviewed. Koppula et al. [65] learned human activities from RGB-D videos considering object
affordances, structuring their scheme as a graphical representation in which the nodes represent the sub-activities and
objects, and the edges the objects' affordances and the relationships between humans and objects. They learned their model
through a structural SVM algorithm. In later studies [66, 68], Koppula et al. presented a modified version of their previous work
[65] in which they used a CRF to represent their model. They merged the CRF structure with object affordances and sub-activities
to form a so-called Anticipatory Temporal CRF (ATCRF). Following these studies [65, 66], Jain et al. [56] merged a spatial-temporal
graph with a Recurrent Neural Network (RNN) to address affordance problems in graphical models. They covered different kinds of
spatio-temporal cases, such as motion modeling, human activity prediction and action anticipation. Qi et al. [112] represented
affordances, human actions and interacting objects in a Spatial-Temporal And-Or Graph (ST-AOG) to predict human activities in
RGB-D videos. They built their model in two main stages: video parsing and activity prediction. The parsing is done by segmentation
with a dynamic programming approach and subsequent label refinement using Gibbs sampling. Activity prediction relies on an Earley
parser [28] to predict sub-activities, and all the learned cues (parsed graph and sub-activities) are used to estimate the human activity.
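To picture the kind of graph these activity models build over a video, with nodes for sub-activities and objects and edges for affordances and human-object relations, here is a minimal hypothetical sketch using plain Python dictionaries (not the exact structures of [56, 65, 112]):

```python
# One temporal segment of an RGB-D video, represented as a small graph.
segment = {
    "nodes": {
        "human":  {"type": "sub-activity", "label": "reaching"},
        "cup":    {"type": "object", "affordance": "graspable"},
        "kettle": {"type": "object", "affordance": "pourable"},
    },
    "edges": [
        ("human", "cup", {"relation": "interacts-with"}),
        ("cup", "kettle", {"relation": "next-to"}),
    ],
}

# A video is a sequence of such segments; temporal edges link the same
# node across consecutive segments so a model (CRF, RNN, ...) can score
# label assignments jointly over space and time.
video = [segment]
next_segment = {**segment, "nodes": {**segment["nodes"],
                "human": {"type": "sub-activity", "label": "pouring"}}}
video.append(next_segment)
```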
Vu et al. [155] posited that various scenes belonging to the same category share similar functional features; they described
scenes in terms of functionalities to predict actions from static images. Dutta and Zielinska [27] presented a novel method to
predict the next action based on object affordances and human interaction. They represented the model as a spatio-temporal
probabilistic state automaton. The generated motion trajectory was used to build action heat maps, which led to inferring
the next actions. Shu et al. [131] proposed learning social affordances from human-to-human interactions. They represented
their model in a graphical scheme that has nodes as subevents/subgoals. They provided an RGB-D video dataset (HHOI) to
describe human-to-human interactions. Given an RGB-D video, Shu et al. [130] learned social affordance grammars and then
represented them as an ST-AOG to perform motion modeling. Gupta et al. [43] modeled affordances in 3D indoor images
to detect the workspace based on human poses, using upright, lay down, reach and sit as the human poses. Recently,
Wang et al. [159] used Variational Auto-Encoders (VAEs) [63] to build a model that predicts affordance poses: based on
their locations, the algorithm classifies a pose into one of 30 pose classes and thereafter uses a VAE to extract the deformation
of this pose. They showed that nonexistent poses can be predicted using ConvNets and VAEs. In addition, the authors
argued that deep learning-based efforts are scarce in this area due to the absence of a large dataset for learning
affordances; for this reason, they built a large-scale dataset from sitcoms to learn affordances. Sun et al. [141] presented a
method to learn affordances from objects' interactions. Given a video with labeled, sequenced frames, their approach produces
interactive motion models between pairs of objects and then represents them in a Bayesian network together with human actions. Ruiz
and Cuevas [105, 171] proposed using bisector surfaces and developed them to include rational weighting, provenance vectors,
and affordance key points. They aimed to study the affordance locations for objects over one another (e.g. man riding a bike,


Table 6. Comparison between affordance-based activity recognition methods. The compared methods [65, 66, 56, 112, 155, 27, 131, 130] are characterized along five dimensions: input (2D or 3D), features (handcrafted or learned), evaluation (benchmark or real robot), training (supervised or weakly supervised) and model (mathematical or neural).

Fig. 10. Affordance reasoning process.

placing a pot in the kitchen shelf or where the kids ride a bike in a flat), and simultaneously detect unseen views of the object.
Thermos et al. [143] presented a deep learning paradigm to investigate the problem of sensorimotor 3D object recognition.
They employed a deep convolutional architecture (the VGG-16 network [133]) to fuse multiple evidence sources to learn affordances. They used fifteen affordance types, as shown in Table 1. Some of these types describe complex affordances like
"squeeze” and continuous like "write”. Moreover, they introduced a new RGB-D dataset which has human-interaction and
affordance types to test their method. To sum up, Table 6 compares affordance-based activity recognition techniques.
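As a rough illustration of the kind of multi-stream fusion used in such sensorimotor approaches, the sketch below concatenates features from two VGG-16 streams (e.g., an object crop and an interaction crop) before an affordance classifier; the stream choice, layer sizes and the 15-way output are assumptions for illustration and do not reproduce the exact architecture of [143].

# Minimal sketch (PyTorch): two VGG-16 feature streams (e.g., object appearance
# and a hand/interaction crop) fused by concatenation before an affordance
# classifier. Sizes and the 15-way output are illustrative assumptions only.
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamAffordanceNet(nn.Module):
    def __init__(self, num_affordances=15):
        super().__init__()
        self.stream_a = models.vgg16().features   # appearance stream
        self.stream_b = models.vgg16().features   # interaction stream
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2 * 512, num_affordances)

    def forward(self, rgb, interaction):
        fa = self.pool(self.stream_a(rgb)).flatten(1)          # (N, 512)
        fb = self.pool(self.stream_b(interaction)).flatten(1)  # (N, 512)
        return self.classifier(torch.cat([fa, fb], dim=1))     # affordance logits

net = TwoStreamAffordanceNet()
logits = net(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 15])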

5.6 High-level Affordance Reasoning


Affordances can be used as a tool for reasoning about more complex properties of the objects and events in a scene. For example, they have been used to infer hidden properties such as the contents of a container, the intricate relationships between objects, and the answers to complex questions about a scene. In this section, we discuss research works which performed
high-level reasoning based on affordances (Figure 10).
Zhu et al. [178] proposed the first study to discuss the visual reasoning of affordances. They developed a knowledge base (KB) that represents the object along with other nodes describing attributes (visual, physical, and categorical) or affordances, and used it to infer affordance labels, human poses, or relative locations. They learned their model using a Markov Logic Network (MLN) [117], while zero-shot learning [118] was used to predict affordances for novel objects. However, their approach assumed that the affordance semantics and attributes are given in advance and used static models. Chao et al. [12] modeled affordance semantics as action-object pairs, either as connected verb-noun nodes in WordNet [86] or by encoding their plausibility in a matrix as in [175]. They used statistical methods such as


co-occurrences to infer the affordances and give the best description (verb) for an object (noun). Regarding reasoning about the contents of containers, Güler et al. introduced a visual-tactile method to infer what is inside a container [40]. A Kinect camera was used to capture the object and a deformation model was estimated before tactile signals were used for
reasoning. They used a three-fingered Schunk Dextrous Hand to grasp and squeeze the container to check whether it is full
or empty. PCA was used for feature extraction, whereas kNN (k-Nearest Neighbor) and SVM (Support Vector Machine) were used as classifiers. However, they assumed the availability of the 3D object model to perform the learning.
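A minimal sketch of such a PCA-plus-classifier pipeline is given below, using synthetic tactile readings as a placeholder for the real sensor data of [40]; the feature dimensions and labels are illustrative assumptions.

# Minimal sketch (scikit-learn): PCA feature reduction on tactile readings
# followed by kNN and SVM classifiers to decide "full" vs "empty".
# The synthetic data below is an assumption; it does not reproduce [40].
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))            # 200 squeezes x 60 tactile sensels
y = rng.integers(0, 2, size=200)          # 0 = empty, 1 = full (placeholder labels)

knn = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(PCA(n_components=10), SVC(kernel="rbf"))
for name, clf in [("kNN", knn), ("SVM", svm)]:
    clf.fit(X[:150], y[:150])
    print(name, "accuracy:", clf.score(X[150:], y[150:]))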
Yu et al. [168] proposed a physics-based method to reason about liquid containability affordances in two steps: 1) finding the best filling direction and 2) given that direction, transferring the liquid out to estimate the containability of an object. They used fluid dynamics in 3D space to simulate the human motion of liquid transfer to test their approach. Although this study tried to predict containability from visual images, it did not use the human factors in the scene, which would make the reasoning easier. Further, they did not relate material attributes to the reasoning; e.g., by their model any curved tool can contain liquid even if it is made of paper. To address this limitation, Wang et al. [156] developed their approach as an enhancement of [168]. Given a 3D scene and considering the compatibility among container, containee and pose, they learned their model to reason about the best pose and container for the task of transferring the liquid. They provided an RGB-D dataset and used an SVM to train the approach. The filling process is the same as in [168]: they voxelize the object and simulate filling inside its space. In the same context, Mottaghi et al. [95] developed
an approach to reason about the affordance of liquids inside a container (the volume, amount of liquid) and predict their
behaviors. They introduced the Containers Of liQuid contEnt (COQE) dataset of RGB images along with 3D CAD models. They
used deep learning in the form of a CNN and an RNN to learn the containability affordance based on contextual cues. Phillips et al. [106] introduced a method to detect, localize, segment and estimate the pose of transparent objects like glasses, and they provided an annotated dataset of transparent objects. Liang et al. studied the human cognition of containers to infer their
affordances (in terms of the numbers of objects they could contain) [75]. In a recent study, Liang et al. [74] inferred containment
affordances and relations in RGB-D videos over time. For example, the fridge contains the egg carton which contains the eggs.
They used human actions (move-in, move-out, no-change and paranormal-change) to draw containment graphs based on spatio-temporal relations; that is, the actions are used to detect containment relations between objects. They introduced an RGB-D video dataset and
they developed probabilistic dynamic programming to optimize the containment graphs. Zhu et al. [179] used physics-based
simulation to infer the forces and pressures for the different body parts while sitting on a chair. They predicted the object
affordances through human utilities while sitting, e.g., comfortable and lazy. Krunic et al. introduced a model that incorporates verbal information to link utterances and objects by inferring the context between words, actions, and their outcomes.
Zhu et al. [180] built a knowledge base (KB) to reason answers for image questions. They represented nodes in their model
as attributes, affordance labels, scene categories, or image features. More recently, Chuang et al. [19] presented a promising study that reasons about action-object affordances based on physical and social norms. They annotated
the ADE20k [174] dataset with affordance features, detailed explanations, and potential consequences for every object. For
instance, pouring water into a full cup is annotated as improper, with the explanation that the cup is full and the consequence that it will make a mess. They built the model using a Gated Graph Neural Network (GGNN), with a spatial version of the GGNN employed to reason about affordances. This study combined social norms, physical features, visual attributes, and situation parameters to infer the affordances and their relations, whether positive or negative. Li et al. [70] built a
3D model to learn the semantically plausible human poses within indoor environments. They built a 3D pose synthesizer
to learn the affordances from its interaction with the environment. In order to achieve this goal, they collected large-scale
datasets from TV shows. Most recently, Wang et al. [160] used physical principles to estimate scene affordance directly, as well as constraints on the human pose to guide interaction with the scene. To facilitate this task, they collected a 3D dataset, namely the Geometric Pose Affordance dataset, which includes samples of people interacting with various rich 3D environments. To summarize, Table 7 compares the methods used in the affordance reasoning literature.
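To illustrate the kind of gated graph propagation used for affordance reasoning in [19] and depicted in Figure 11, the sketch below performs a generic gated message-passing step over an object graph with a GRU update; the feature sizes and the random graph are illustrative assumptions and do not reproduce the spatial GGNN variant itself.

# Minimal sketch (PyTorch): one gated graph message-passing step in the spirit
# of a GGNN. Node states (e.g., ResNet features of detected objects) are updated
# from messages aggregated over the adjacency matrix with a GRU cell.
# Dimensions and the random graph are illustrative assumptions, not from [19].
import torch
import torch.nn as nn

class GatedGraphStep(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.message = nn.Linear(state_dim, state_dim)  # edge message transform
        self.update = nn.GRUCell(state_dim, state_dim)  # gated node update

    def forward(self, h, adj):
        # h: (N, D) node states, adj: (N, N) adjacency (row-normalized).
        m = adj @ self.message(h)        # aggregate messages from neighbours
        return self.update(m, h)         # gated update of every node state

nodes = torch.randn(5, 64)               # 5 objects with 64-d features
adj = torch.rand(5, 5)
adj = adj / adj.sum(dim=1, keepdim=True) # normalize rows
step = GatedGraphStep(64)
h = nodes
for _ in range(3):                       # a few propagation steps
    h = step(h, adj)
print(h.shape)                           # torch.Size([5, 64])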

6 FUNCTIONAL SCENE UNDERSTANDING


Zhu et al. [181] made a distinction between functionality and affordance, stating that affordance understanding depends mainly on detecting and classifying objects and learning their affordances, and then on obtaining a more detailed understanding and performing reasoning based on the properties of these affordances. Although understanding the real meaning of functionality is related to affordance
understanding, it specifically aims to identify the tasks that can be performed with an object. It reasons about an object’s
functions in the context of an agent (animal or robot). Also, some objects have both affordance and functional parts, e.g., a
hammer has a head which is suitable for striking objects (function) and a handle by which it can be grasped (affordance).
Some object parts have multiple affordances but a single function, e.g., as a hammer’s handle can be used to push or pull
objects, it has pushable and pullable affordances while its function is to support the hammer's head. In contrast, a hammer's claws have multiple functions and multiple affordances, i.e., they can function as a lever and also pull out nails from timber


Fig. 11. The deep learning architecture proposed by [19]. ResNet is used as the feature extraction layer and the resulting features are fed into a customized Spatial Gated Graph Neural Network (SGGNN) that represents visual affordances as a graph. This graph is then used to reason about the affordances, explanations and consequences.

Table 7. Comparison between affordance reasoning methods. The compared methods [178, 12, 168, 179, 156, 74, 95, 106, 180, 19] are characterized along five dimensions: input (2D, 3D or multimodal), features (handcrafted or learned), evaluation (benchmark or simulation), training (supervised or unsupervised) and model (mathematical or neural).

while having the affordances of hooking, grasping and pushing. Therefore, the relationship between affordance and function is many-to-many, as shown in Figure 12.
As scene understanding is a long-standing problem, it has been studied from many perspectives. However, functional scene understanding has not been thoroughly investigated, and although this problem is of high significance for robotics, only a few research efforts have focused on it. For example, a robot cannot clean the kitchen autonomously without understanding how to use taps or electricity plugs to run the vacuum. Functional scene understanding is therefore crucial for robots, particularly cognitive robots. As Figure 13 shows, categorizing images according to their purpose is in many cases more meaningful than categorizing them according to their appearance.

6.1 Motion-based Functionality


Gupta et al. [42] used spatial and functional features to understand human actions in static images and videos. They argued that functionality, which they captured through motion trajectories and human poses, can interpret the scene in terms of action recognition. They built a Bayesian Network model which fused functional and spatial data about the object, the action, and the object's effect for recognition. Oh et al. [101] detected the functionality of moving objects in order to understand the scene context. They claimed that appearance features are not enough for some objects in traffic scenes (e.g., parks and lanes) because they all look similar. Turek et al. [145] presented a novel method to categorize a
scene according to the motion of objects. The categories are function-based such as sidewalks and parking areas, with local
descriptors used to model an object’s properties and behavioral relationships, and temporal features to discriminate between


Fig. 12. The relation between functions and affordances is many to many.

Fig. 13. Categorization according to functionality. The left side recognizes objects based on their shapes, whereas the right side categorizes objects by their usage.

functional regions. Inspired by [145], Zen et al. used the same steps but additionally included the semantics of the moving
object to get a more informative descriptor [169]. Additionally, they tested their model on a traffic dataset [158]. Rhinehart
and Kitani [116] analyzed egocentric videos to extract functional descriptors to learn "Action Maps" (six actions) [123] which can tell a user how to perform activities in novel scenes. Pirk et al. [111] used scanners to track interacting parts and built a functionality descriptor by analyzing the motion of these parts as they interact with an object, e.g., an agent trying to pour water into a mug.
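As a simple illustration of motion-based functional region discovery, the sketch below clusters per-cell motion statistics of a scene into function-like regions; the three-dimensional descriptor and the number of clusters are illustrative assumptions rather than the actual descriptors of [145] or [169].

# Minimal sketch (scikit-learn): cluster per-cell motion statistics of a scene
# into function-like regions (e.g., road-like vs. sidewalk-like vs. parking-like).
# The 3-d descriptor (mean speed, heading variance, stop frequency) and k=3
# are illustrative assumptions, not the descriptors of [145] or [169].
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
H, W = 20, 30                                   # scene divided into 20x30 cells
descriptors = rng.random((H * W, 3))            # [mean speed, heading var, stop freq]
region_labels = KMeans(n_clusters=3, n_init=10).fit_predict(descriptors)
functional_map = region_labels.reshape(H, W)    # per-cell functional category
print(functional_map.shape, np.unique(functional_map))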

6.2 Interaction-based Functionality


Gupta et al. [43] modeled human-scene interaction with a functional descriptor in terms of 3D human poses to predict various human poses, which they called the human workspace. They used a 3D cuboidal model to fit objects and predicted subjective affordances. In [23], Delaitre et al. proposed functional descriptors to allow a better scene understanding.
The main idea is to extract functional descriptors from human-object interaction aiming to achieve better object recognition
results. They extended an existing dataset [33] to evaluate their proposed model. Similarly, Fouhey et al. used the idea of [42]
to predict affordance labels from videos by observing person-object interactions [33]. For instance, a knife has the "cut" function because many people use it for cutting. Additionally, they provided a video dataset for indoor scenes with function
labels. In [109], Pieropan et al. proposed using object-to-object spatio-temporal relationships to create a so-called "object context" along with a functional descriptor to predict human activities. As an example, the presence of a mug alone does not confirm that a drinking action could take place, but the presence of a water bottle beside it increases the likelihood of a drinking
action. They used the kitchen images of the dataset proposed in [107] and trained a probabilistic model, Conditional Random
Field (CRF), to classify functional classes into four types: tools, ingredients, support, and containers. Yao et al. represented


function understanding as a weakly supervised problem to discover all possible functionalities for each object [166]. They used
unsupervised clustering in an iterative manner to categorize human-object interactions, then they used these updates as input
for detection and pose estimation in order to understand the functionalities. The authors tested their model through a musical
instrument dataset [165] which contains images for human-object interactions. Given human-object interaction observations,
Stark et al. [137] proposed a learning mechanism to categorize grasping affordances as object functionalities (shown in Figure
13). Through observing the interaction with some object, the affordance cues are defined as a relation between the robot
hand and an object. The implicit shape model (ISM) [69] was the main algorithm used by [137] to categorize the objects. To
this end, these cues have been used to predict the grasping points of a 2D image. Pieropan et al. [107], applied functional
descriptors/cues, which have been learned from hand-object interaction, to understand human activities from RGB-D videos.
They represented the objects by their interaction with human hands as well as encoded these objects as strings through which
the string kernel [78] measured their similarity. Likewise, they used spatial location and temporal trajectory to estimate object
position relative to the hand. Hence, the estimated object position produced a so-called functional descriptor. Thereafter, they
fused the similarity measures with functional descriptors to recognize human activities. Mar et al. [84] proposed a method
to learn tool affordances based on their function and the way of interaction (grasping). To find functional descriptors, they
learned geometrical features through Self-Organizing Maps (SOM) and K-means. Hu et al. [55], focusing on other aspects of functionality usage in vision research, introduced the so-called Interaction CONtext (ICON) to describe the functionality of 3D objects through geometric features. The main idea was to define a contextual descriptor that describes the shape functionality in the presence of other objects using the Interaction Bisector Surface (IBS) [171]. In other words, they used object-to-object interaction to build a geometrical descriptor for functionality analysis and hence recognized the
correspondences between similar parts on various shapes of 3D images. They used the Trimble 3D Warehouse1 to test their
experiment. Savva et al. [124] built action maps for potential actions by scanning the geometry of captured 3D scenes, reconstructing depth meshes, and tracking human interactions to define the functionality descriptors, thereby predicting the
affordances of unseen objects. Zhu et al. [181] used a simulation engine to perform affordance and functionality analysis. They
differentiated between functional understanding that was defined as the right location to do a task on the target object; and
affordance understanding that was defined as the best location to grasp the object depending on the tool type. Additionally,
they fused physical features (e.g. forces, pressure, and volume), human pose from imagined action, affordance features, and
functional features together to understand and infer a tool's usage.
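To make the idea of an interaction-based functional descriptor concrete, the sketch below encodes the relative hand-object position over time as a string of discrete spatial relations, which can then be compared with a simple similarity measure as a stand-in for the string kernel of [78] used in [107]; the relation set and toy trajectories are illustrative assumptions.

# Minimal sketch: encode the hand-object relative position over time as a string
# of discrete spatial relations, a simple stand-in for the functional descriptors
# built from hand-object interaction in [107]. The relation set and the toy
# trajectories are illustrative assumptions.
import numpy as np

def relation(hand, obj, near=0.1):
    d = np.linalg.norm(hand - obj)
    if d < near:
        return "T"                              # touching / in hand
    return "A" if hand[1] > obj[1] else "B"     # above or below the object

def descriptor(hand_traj, obj_traj):
    return "".join(relation(h, o) for h, o in zip(hand_traj, obj_traj))

def similarity(s1, s2):
    # crude per-position agreement as a stand-in for a string kernel [78]
    n = min(len(s1), len(s2))
    return sum(a == b for a, b in zip(s1, s2)) / n

hand = np.cumsum(np.random.randn(20, 2) * 0.05, axis=0)  # toy hand trajectory
mug = np.zeros((20, 2))                                   # static object
print(descriptor(hand, mug), similarity("TTAAB", "TTABB"))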

6.3 Shape-based Functionality


Jain et al. [57] proposed a probabilistic model based on a tool’s geometric features related to its functionalities, such as those
based on its material and shape, along with probabilistic dependencies among its effects, actions and functional features. A
BN was used in their scheme because of its capability to handle probabilistic dependencies among/between nodes. Laga et
al. proposed a model that extracted the pair-wise semantics of shapes by combining structural and geometric features [68].
Additionally, they recognized the functionality of the shapes (e.g. graspable and container) using supervised learning methods.
Kim et al. used affordances as a priori information to predict the correspondence in human poses through which they predicted
the functional descriptors [62]. Apart from the above-mentioned studies that used abstract functionality such as "to pour" and
"to move", Lun et al. [82] designed a unified model to detect a human pose according to human-object affordances (leaing,
holding, sitting and treating) along with object parts. The functionality descriptor has been employed to recover mechanical
assemblies or parts from raw scans [76], using segmentation and joint optimization to learn the scheme. In a recent study, Hu et al. [54] proposed a method to analyze inter-object and intra-object relations, aiming to categorize objects
based on their functionalities. They used objects’ parts contexts, semantics, and functionalities to recognize their shapes.
Various methods with different ways of representation have been introduced to address functionality issues. Zhao and Zhu
proposed a stochastic method, Function-Geometry Appearance (FGA), to parse 3D scenes by combining features of the
functionality, geometry, and appearance [172]. They modeled the FGA through a top-down/bottom-up hierarchy and used MCMC to build the algorithm and infer the parse tree. The functional descriptor was composed of functional scene categories, groups, objects, and parts, and used AND-OR rules to understand the affordances and therefore the full scene. Table 8 summarizes and compares the most important properties of these methods.
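As a minimal example of supervised, geometry-based functionality classification in the spirit of the methods above, the sketch below derives a few simple shape features from a 3D point cloud and trains an SVM to separate, say, container-like from graspable-like shapes; the features, labels and synthetic clouds are illustrative assumptions.

# Minimal sketch: classify object functionality ("container" vs. "graspable")
# from simple geometric features of a 3D point cloud, in the spirit of the
# supervised, geometry-based methods above [57, 68]. Features and labels are
# illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def shape_features(points):
    extent = points.max(axis=0) - points.min(axis=0)        # bounding-box size
    hollowness = points[:, 2].std() / (extent[2] + 1e-6)    # crude concavity proxy
    return np.concatenate([extent, [hollowness]])

rng = np.random.default_rng(2)
clouds = [rng.random((500, 3)) * rng.random(3) for _ in range(40)]
X = np.stack([shape_features(c) for c in clouds])
y = rng.integers(0, 2, size=40)          # 0 = graspable, 1 = container (placeholder)
clf = SVC(kernel="rbf").fit(X[:30], y[:30])
print("held-out accuracy:", clf.score(X[30:], y[30:]))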

7 DATASETS
In this section, we investigate the available datasets provided with affordance annotations. As shown in Table 9, they range from RGB to RGB-D images and videos. For visual cues, many datasets have been proposed, such as UMD
and IIT-AFF [97, 100] to facilitate detecting affordance objects from the scene i.e., detecting objects that bear affordances or
functionalities from the input image. In other words, these datasets enable researchers to treat affordances as traditional
1 https://3dwarehouse.sketchup.com/


Table 8. Comparison between function-scene understanding methods. The compared methods [181, 129, 145, 166, 104, 52, 57, 6, 83, 163, 101, 169, 42, 23, 33, 116, 172, 62, 55, 68, 82, 124, 122, 137] are characterized along five dimensions: input (2D or 3D), features (handcrafted or learned), evaluation (benchmark, simulation or real robot), training (supervised, unsupervised or weakly supervised) and model abstraction (mathematical or neural).

image detection tasks like pedestrian detection or face detection. Since physical and material attributes are important for defining functionality and affordances, they are provided in [51, 74, 95, 174]. Regarding human activity recognition, the datasets in [10, 126, 131] provide annotations that contain human subjects. With respect to object parts, the dataset in [174] has part-level annotations.

8 OPEN PROBLEMS AND RESEARCH QUESTIONS


Lack of Consensus on Affordance Definition: Nearly five decades after the introduction of the affordance concept by Gibson,
no formal definition of affordances has been agreed upon by AI researchers and ecological psychologists. Gibson’s own
description of the concept in his seminal work “The Ecological Approach to Visual Perception” (1979) shows the complex
nature of the concept, e.g., he said “Affordance is equally a fact of the environment and a fact of behavior. It is both physical
and psychological, yet neither. An affordance points both ways, to the environment and to the observer" [59]. Previous efforts
have tried to describe affordance as the property of the environment, the mutual phenomenon between an agent and its
environment, or an observer’s perception of the relationships between an agent and its surroundings [13, 121]. However, there
does not exist a unified definition for affordances so far.
Function vs Affordance: The terms function and affordance are sometimes used interchangeably in the literature even though they are distinct. The relation between them is complicated: one object may have different affordances and functions, as in the case of a hammer, which has affordances (grasping, striking, dragging) and functions (driving nails, fitting parts, forging metal, and breaking apart objects). In contrast, some objects have affordances and functions with the same meaning, such as a knife, whose affordance (cut) and function (cut) coincide. However, the agreed concept (also emphasized in this survey) is that an affordance always relates to the object itself, whereas the function relates to its effect on another object. In other words, affordance relates to the possible actions whereas function relates to the effect. More effort has to be devoted to distinguishing between affordance and functionality in the right manner.
Affordance and Attributes: Recently, much research has been devoted to describing objects by their attributes [30, 71, 103, 157]. Along with the object label, semantically meaningful attributes are needed to understand the object


Reference | Year | Properties | Format | Affordances | Subjects | Samples | Categories
UMD [97] | 2015 | visual | RGB-D images | 17 (Table 1) | – | 30,000 | Indoor
[51] | 2011 | visual, physical and material | RGB-D images | 7 (Table 1) | – | 375 | Indoor
CAD120 [126] | 2016 | visual, human interactions | RGB-D videos | 35 (Table 1) | – | 3090 / 215 | Indoor
IIT-AFF [100] | 2017 | visual | RGB-D images | 9 (Table 1) | – | 8,835 | Tools
ADE-Affordance [174] | 2016 | visual, physical, social, object-action pairs, exceptions, explanations | RGB images | 7 (Table 1) | – | 10,000 | Indoor
[134] | 2016 | visual | RGB images | 8 (Table 1) | – | 10,360 | Indoor tools
Extended NYUv2 [119] | 2016 | visual | RGB images | 5 (Table 1) | – | 1449 | Indoor
CONTACT VMGdB [10] | 2011 | visual, human interactions | RGB-D videos | 20 (Table 1) | 5 | 5200 | Indoor
HHOI [131] | 2016 | visual, human interactions | RGB-D videos | 14 (Table 1) | 5 | – | Indoor
Binge Watching [159] | 2017 | visual, human poses | RGB-D | 30 poses | 30 | 11449 | Indoor
CERTH-SOR3D [143] | 2017 | visual, human interactions | RGB-D | 14 (Table 1) | – | 20,800 | Indoor, tools
COQE [95] | 2017 | visual, physical | RGB | 10 | – | 5000 | Containers
[74] | 2016 | visual, physical | RGB-D videos | 4 | – | 1326 | Containers
Tool & Tool-Use (TTU) [181] | 2015 | visual, physical, human demonstrations | RGB-D images | 10 (Table 1) | – | 452 | Tools

Table 9. The datasets that have ground-truth annotations for affordances and functionalities.


characteristics. For instance, a cup of tea may be described with attributes like glass, white, and having a handle. Describing the cup by its attributes will therefore help in deciding the best way of grasping it. Furthermore, the affordance understanding problem needs attributes to address certain difficulties; for example, getting fruit from the fridge requires prior knowledge about the height of both the fridge and the agent. Despite its significance, the use of attributes has not been addressed in the
context of visual affordances before.
Multi-class Labeling: Considering affordances and functions together gives rise to a rich set of labels for every object; it is therefore a multi-label problem by nature. For example, the cup in Figure 12 carries nine labels. If a more detailed analysis is required, these labels need to be ordered according to the object, the scene, and the situation. Remarkably, these labels should be identified in terms of the parts rather than the whole object.
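A minimal sketch of treating per-part affordance and function tags as a multi-label prediction problem is given below; the tag set, features and classifier are illustrative assumptions.

# Minimal sketch (scikit-learn): per-part multi-label prediction, treating each
# affordance/function tag (grasp, strike, pull, contain, ...) as an independent
# binary output. Part features and tags are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

tags = [["grasp"], ["grasp", "pull"], ["strike"], ["contain", "grasp"]]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)                        # parts x labels binary matrix
X = np.random.rand(len(tags), 16)                  # per-part feature vectors
clf = MultiOutputClassifier(LogisticRegression(max_iter=200)).fit(X, Y)
pred = clf.predict(np.random.rand(1, 16))
print(mlb.inverse_transform(pred))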
Deep Learning for Affordance Understanding: Deep learning now dominates the field of vision and has achieved remarkable improvements in this context. However, it has not received much attention in addressing the challenges particular to affordance and function understanding; as the comparison tables show, the number of feature-learning methods proposed in the literature is small. Another related factor is the size of datasets: modeling affordance algorithms with deep learning needs large annotated datasets that are currently unavailable.
Complex Affordances: The affordances of an object can be classified into basic and higher-order affordances [97, 167], e.g., pouring hot water into a cup is a basic affordance while making tea from this water is higher-order because it depends on the basic one. These higher-order affordances have not been explored in existing works.
Outdoor Affordances: Almost all of the previous studies focused on indoor scenes; only one method has been introduced to address outdoor affordances [14]. Despite the limited attention, this direction of research is highly valuable due to the relationship between affordances and silent actions, which is important in self-driving cars, autonomous driving and traffic monitoring applications.
Affordances for Developmental Robotics: Since developmental robotics and affordances are tightly related to each
other, visually understanding the environment can enhance the learning process. Developmental robots seek to enhance
robot understanding through environmental interactions while affordances are emergent environmental variables. Thus,
learning affordances visually would shorten the time required for interactive robots to build their knowledge accurately. Yet, the studies conducted in this paradigm are still insufficient, and more investigation is needed to establish strong baselines. In the
case of extrinsic motivation, visual affordances can be used as reward evaluators and indicators to measure the progress of
that learning scheme. For intrinsic motivations, visual affordances can be utilized to point out the most important features in
the environment which will increase the robots’ curiosity to learn more.
Visual Question Answering (VQA) & Learning-by-Asking (LBA): VQA deals with answering intelligent questions
about a visual element. Because affordances are constant environment variables that provide highly valuable information
about scene content, they can assist in developing a deep insight. Similar to affordance-based recognition, fusing affordances
and VQA will improve the accuracy of these answers. Unlike VQA, LBA seeks to understand the environment by asking
questions and requesting supervision. Assuming that affordances are a precursor for object interaction, merging affordances
and LBA would reduce the time an agent needs to build its knowledge.
We believe that the application of visual scene understanding algorithms including those for semantic/instance segmentation,
physics-based reasoning, and 3D volumetric analysis to affordance understanding will help in resolving several of the underlying challenges.

9 CONCLUSION
In this survey, visual affordance and functional scene understanding have been reviewed. We introduced a hierarchical taxonomy and covered the progress on each sub-task, e.g., classification, segmentation, and detection. We began with a formal definition of each sub-task and provided its significance and applications to motivate the reader. The paper succinctly compares the best approaches in every sub-category in tabular form to help researchers identify research gaps and open questions. The paper also covers the datasets proposed in the field and provides detailed comparisons between them. We also discussed the open problems and challenges in this field. Finally, we hope this survey will be helpful for researchers, particularly as it is the first survey to review visual affordance and function understanding.

REFERENCES
[1] S. Aarthi and S. Chitrakala. 2017. Scene understanding; A survey. In 2017 International Conference on Computer, Communication and Signal Processing
(ICCCSP). 1–4. https://doi.org/10.1109/ICCCSP.2017.7944094
[2] Paulo Abelha and Frank Guerin. 2017. Transfer of Tool Affordance and Manipulation Cues with 3D Vision Data. arXiv preprint arXiv:1710.04970 (2017).
[3] P. Abelha, F. Guerin, and M. Schoeler. 2016. A model-based approach to finding substitute tools in 3D vision data. In 2016 IEEE International Conference
on Robotics and Automation (ICRA). 2471–2478.
[4] Paulo Abelha Ferreira and Frank Guerin. 2017. Learning How a Tool Affords by Simulating 3D Models from the Web. In Proceedings of IEEE International
Conference on Intelligent Robots and Systems (IROS 2017). IEEE Press.


[5] Aitor Aldoma, Federico Tombari, and Markus Vincze. 2012. Supervised learning of hidden and non-hidden 0-order affordances and detection in real
scenes. In Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 1732–1739.
[6] Iman Awaad, Gerhard K Kraetzschmar, and Joachim Hertzberg. 2015. The role of functional affordances in socializing robots. International Journal of
Social Robotics 7, 4 (2015), 421–438.
[7] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. 2013. Unsupervised feature learning for RGB-D based object recognition. In Experimental Robotics. Springer,
387–402.
[8] J. Bohg, A. Morales, T. Asfour, and D. Kragic. 2014. Data-Driven Grasp Synthesis;A Survey. IEEE Transactions on Robotics 30, 2 (April 2014), 289–309.
https://doi.org/10.1109/TRO.2013.2289018
[9] Lubomir Bourdev and Jitendra Malik. 2009. Poselets: Body part detectors trained using 3d human pose annotations. In Computer Vision, 2009 IEEE 12th
International Conference on. IEEE, 1365–1372.
[10] C. Castellini, T. Tommasi, N. Noceti, F. Odone, and B. Caputo. 2011. Using Object Affordances to Improve Object Recognition. IEEE Transactions on
Autonomous Mental Development 3, 3 (Sept 2011), 207–215.
[11] Y. W. Chao, Z. Wang, R. Mihalcea, and J. Deng. 2015. Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 4259–4267. https://doi.org/10.1109/CVPR.2015.7299054
[12] Y. W. Chao, Z. Wang, R. Mihalcea, and J. Deng. 2015. Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). 4259–4267.
[13] Anthony Chemero. 2003. An Outline of a Theory of Affordances. ECOLOGICAL PSYCHOLOGY 15, 2 (2003), 181–195.
[14] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. 2015. DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. In 2015 IEEE
International Conference on Computer Vision (ICCV). 2722–2730. https://doi.org/10.1109/ICCV.2015.312
[15] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2015. Deeplab: Semantic image segmentation with deep
convolutional nets, atrous convolution, and fully connected crfs. In Proceedings of the International Conference on Learning Representations.
[16] Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2015. Indoor scene understanding with geometric and semantic contexts.
International Journal of Computer Vision 112, 2 (2015), 204–220.
[17] Fu-Jen Chu, Ruinian Xu, Landan Seguin, and Patricio A Vela. 2019. Toward affordance detection and ranking on novel objects for real-world robotic
manipulation. IEEE Robotics and Automation Letters 4, 4 (2019), 4070–4077.
[18] Fu-Jen Chu, Ruinian Xu, and Patricio A Vela. 2019. Learning affordance segmentation for real-world robotic manipulation via synthetic images. IEEE
Robotics and Automation Letters 4, 2 (2019), 1140–1147.
[19] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. 2017. Learning to Act Properly: Predicting and Explaining Affordances from Images.
arXiv preprint arXiv:1712.07576 (2017).
[20] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural
information processing systems. 379–387.
[21] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen H Muggleton. 2008. Probabilistic inductive logic programming. Vol. 4911. Springer.
[22] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. 2007. ProbLog: A probabilistic Prolog and its application in link discovery. (2007).
[23] Vincent Delaitre, David F Fouhey, Ivan Laptev, Josef Sivic, Abhinav Gupta, and Alexei A Efros. 2012. Scene semantics from long-term observation of
people. In European conference on computer vision. Springer, 284–298.
[24] Mandar Dixit, Si Chen, Dashan Gao, Nikhil Rasiwasia, and Nuno Vasconcelos. 2015. Scene Classification With Semantic Fisher Vectors. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Thanh-Toan Do, Anh Nguyen, Ian Reid, Darwin G Caldwell, and Nikos G Tsagarakis. 2017. Affordancenet: An end-to-end deep learning approach for
object affordance detection. arXiv preprint arXiv:1709.07326 (2017).
[26] Piotr Dollár and C Lawrence Zitnick. 2013. Structured forests for fast edge detection. In Computer Vision (ICCV), 2013 IEEE International Conference on.
IEEE, 1841–1848.
[27] Vibekananda Dutta and Teresa Zielinska. 2017. Action prediction based on physically grounded object affordances in human-object interactions. In
Robot Motion and Control (RoMoCo), 2017 11th International Workshop on. IEEE, 47–52.
[28] Jay Earley. 1970. An efficient context-free parsing algorithm. Commun. ACM 13, 2 (1970), 94–102.
[29] David Eigen and Rob Fergus. 2015. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In
Proceedings of the IEEE International Conference on Computer Vision. 2650–2658.
[30] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition.
[31] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models.
IEEE transactions on pattern analysis and machine intelligence 32, 9 (2010), 1627–1645.
[32] Paul Fitzpatrick, Giorgio Metta, Lorenzo Natale, Sajit Rao, and Giulio Sandini. 2003. Learning about objects through action-initial steps towards
artificial cognition. In Robotics and Automation, 2003. Proceedings. ICRA’03. IEEE International Conference on, Vol. 3. IEEE, 3140–3145.
[33] David F Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A Efros, Ivan Laptev, and Josef Sivic. 2014. People watching: Human actions as a cue for
single view geometry. International journal of computer vision 110, 3 (2014), 259–274.
[34] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics
Research 32, 11 (2013), 1231–1237.
[35] Eva Gibaja and Sebastián Ventura. 2014. Multi-label learning: a review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 4, 6 (2014), 411–444.
[36] Eva Gibaja and Sebastián Ventura. 2015. A tutorial on multilabel learning. ACM Computing Surveys (CSUR) 47, 3 (2015), 52.
[37] James Jerome Gibson. 1966. The senses considered as perceptual systems. (1966).
[38] James J Gibson. 1979. The theory of affordances. The people, place, and space reader (1979), 56–60.
[39] Helmut Grabner, Juergen Gall, and Luc Van Gool. 2011. What makes a chair a chair?. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on. IEEE, 1529–1536.
[40] Püren Güler, Yasemin Bekiroglu, Xavi Gratal, Karl Pauwels, and Danica Kragic. 2014. What’s in the container? Classifying object contents from vision
and touch. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE, 3961–3968.


[41] A. Gupta and L. S. Davis. 2007. Objects in Action: An Approach for Combining Action Understanding and Object Perception. In 2007 IEEE Conference
on Computer Vision and Pattern Recognition. 1–8.
[42] Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. 2009. Observing human-object interactions: Using spatial and functional compatibility for
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 10 (2009), 1775–1789.
[43] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. 2011. From 3D scene geometry to human workspace. In CVPR 2011. 1961–1968.
[44] Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2015. Indoor scene understanding with RGB-D images: Bottom-up segmentation,
object detection and semantic segmentation. International Journal of Computer Vision 112, 2 (2015), 133–149.
[45] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. 2014. Learning rich features from RGB-D images for object detection and segmentation.
In European Conference on Computer Vision. Springer, 345–360.
[46] Mahmudul Hassan and Anuja Dharmaratne. 2016. Attribute Based Affordance Detection from Human-Object Interaction Images. Springer International
Publishing, Cham, 220–232.
[47] Mohammed Hassanin, Salman Khan, and Murat Tahtali. 2019. A New Localization Objective for Accurate Fine-Grained Affordance Segmentation
Under High-Scale Variations. IEEE Access 8 (2019), 28123–28132.
[48] Catherine Havasi, Robert Speer, and Jason Alonso. 2007. ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge. In
Recent advances in natural language processing. John Benjamins Philadelphia, PA, 27–29.
[49] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. 2017. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition. 770–778.
[51] Tucker Hermans, James M Rehg, and Aaron Bobick. 2011. Affordance prediction via learned object attributes. In IEEE International Conference on
Robotics and Automation (ICRA): Workshop on Semantic Perception, Mapping, and Exploration. 181–184.
[52] L. Hinkle and E. Olson. 2013. Predicting object functionality using physical simulations. In 2013 IEEE/RSJ International Conference on Intelligent Robots
and Systems. 2784–2790.
[53] Thomas E Horton, Arpan Chakraborty, and Robert St Amant. 2012. Affordances for robots: a brief survey. Avant 3, 2 (2012).
[54] Ruizhen Hu, Oliver van Kaick, Bojian Wu, Hui Huang, Ariel Shamir, and Hao Zhang. 2016. Learning how objects function via co-analysis of interactions.
ACM Transactions on Graphics (TOG) 35, 4 (2016), 47.
[55] Ruizhen Hu, Chenyang Zhu, Oliver van Kaick, Ligang Liu, Ariel Shamir, and Hao Zhang. 2015. Interaction context (icon): Towards a geometric
functionality descriptor. ACM Transactions on Graphics (TOG) 34, 4 (2015), 83.
[56] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
[57] Raghvendra Jain and Tetsunari Inamura. 2013. Bayesian learning of tool affordances based on generalization of functional feature to estimate effects of
unseen tools. Artificial Life and Robotics 18, 1-2 (2013), 95–103.
[58] Ales Jaklic, Ales Leonardis, and Franc Solina. 2013. Segmentation and recovery of superquadrics. Vol. 20. Springer Science & Business Media.
[59] James J. Gibson. 1979. The ecological approach to visual perception. Dallas: Houghton Mifflin.
[60] Lorenzo Jamone, Emre Ugur, Angelo Cangelosi, Luciano Fadiga, Alexandre Bernardino, Justus Piater, and José Santos-Victor. 2016. Affordances in
psychology, neuroscience and robotics: a survey. IEEE Transactions on Cognitive and Developmental Systems (2016).
[61] David Inkyu Kim and Gaurav S Sukhatme. 2014. Semantic labeling of 3d point clouds with object affordance for robot manipulation. In Robotics and
Automation (ICRA), 2014 IEEE International Conference on. IEEE, 5578–5584.
[62] Vladimir G Kim, Siddhartha Chaudhuri, Leonidas Guibas, and Thomas Funkhouser. 2014. Shape2pose: Human-centric shape analysis. ACM Transactions
on Graphics (TOG) 33, 4 (2014), 120.
[63] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[64] Hedvig Kjellström, Javier Romero, and Danica Kragić. 2011. Visual object-action recognition: Inferring object affordances from human demonstration.
Computer Vision and Image Understanding 115, 1 (2011), 81–90.
[65] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. 2013. Learning human activities and object affordances from rgb-d videos. The International
Journal of Robotics Research 32, 8 (2013), 951–970.
[66] Hema S Koppula and Ashutosh Saxena. 2016. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on
pattern analysis and machine intelligence 38, 1 (2016), 14–29.
[67] Hideki Kozima, Cocoro Nakagawa, and Hiroyuki Yano. 2002. Emergence of imitation mediated by objects. (2002).
[68] Hamid Laga, Michela Mortara, and Michela Spagnuolo. 2013. Geometry and context for semantic correspondences and functionality recognition in
man-made 3D shapes. ACM Transactions on Graphics (TOG) 32, 5 (2013), 150.
[69] Bastian Leibe, Ales Leonardis, and Bernt Schiele. 2006. An implicit shape model for combined object categorization and segmentation. In Toward
category-level object recognition. Springer, 508–524.
[70] Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, and Jan Kautz. 2019. Putting humans in a scene: Learning affordance in 3d
indoor environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12368–12376.
[71] Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. 2016. Human attribute recognition by deep hierarchical contexts. In European Conference
on Computer Vision. Springer, 684–700.
[72] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. 2017. Fully Convolutional Instance-Aware Semantic Segmentation. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
[73] Ming Liang and Xiaolin Hu. 2015. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 3367–3375.
[74] Wei Liang, Yibiao Zhao, Yixin Zhu, and Song-Chun Zhu. [n. d.]. What Is Where: Inferring Containment Relations from Videos.
[75] Wei Liang, Yibiao Zhao, Yixin Zhu, and Song-Chun Zhu. 2015. Evaluating Human Cognition of Containing Relations with Physical Simulation.. In
Proceedings of the 37th Annual Meeting of the Cognitive Science Society.
[76] M. Lin, T. Shao, Y. Zheng, N. J. Mitra, and K. Zhou. 2018. Recovering Functional Mechanical Assemblies from Raw Scans. IEEE Transactions on
Visualization and Computer Graphics 24, 3 (March 2018), 1354–1367.


[77] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox
detector. In European conference on computer vision. Springer, 21–37.
[78] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine
Learning Research 2, Feb (2002), 419–444.
[79] M. Lopes, F. S. Melo, and L. Montesano. 2007. Affordance-based imitation learning in robots. In 2007 IEEE/RSJ International Conference on Intelligent
Robots and Systems. 1015–1021.
[80] M. Lopes and J. Santos-Victor. 2005. Visual learning by imitation with motor representations. IEEE Transactions on Systems, Man, and Cybernetics, Part
B (Cybernetics) 35, 3 (June 2005), 438–449.
[81] Timo Lüddecke and Florentin Wörgötter. 2017. Learning to Segment Affordances. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 769–776.
[82] Zhaoliang Lun, Evangelos Kalogerakis, Rui Wang, and Alla Sheffer. 2016. Functionality preserving shape style transfer. ACM Transactions on Graphics
(TOG) 35, 6 (2016), 209.
[83] M. Madry, D. Song, and D. Kragic. 2012. From object categories to grasp transfer using probabilistic reasoning. In 2012 IEEE International Conference on
Robotics and Automation. 1716–1723.
[84] T. Mar, V. Tikhanoff, G. Metta, and L. Natale. 2015. Multi-model approach based on 3D functional features for tool affordance learning in robotics. In
2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). 482–489.
[85] Tanis Mar, Vadim Tikhanoff, Giorgio Metta, and Lorenzo Natale. 2017. Self-supervised learning of tool affordances from 3D tool representation through
parallel SOM mapping. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 894–901.
[86] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
[87] Huaqing Min, Changan Yi, Ronghua Luo, Jinhui Zhu, and Sheng Bi. 2016. Affordance research in developmental robotics: A survey. IEEE Transactions
on Cognitive and Developmental Systems 8, 4 (2016), 237–255.
[88] Bogdan Moldovan, Plinio Moreno, Davide Nitti, José Santos-Victor, and Luc De Raedt. 2018. Relational affordances for multiple-object manipulation.
Autonomous Robots 42, 1 (2018), 19–44.
[89] Bogdan Moldovan, Plinio Moreno, Martijn van Otterlo, José Santos-Victor, and Luc De Raedt. 2012. Learning relational affordance models for robots in
multi-object manipulation tasks. In Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 4373–4378.
[90] B. Moldovan and L. De Raedt. 2014. Occluded object search by relational affordances. In 2014 IEEE International Conference on Robotics and Automation
(ICRA). 169–174. https://doi.org/10.1109/ICRA.2014.6906605
[91] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and José Santos-Victor. [n. d.]. Learning object affordances: from sensory–motor coordination
to imitation. ([n. d.]).
[92] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and Jose Santos-Victor. 2007. Modeling affordances using bayesian networks. In Intelligent
Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on. IEEE, 4102–4107.
[93] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor. 2008. Learning Object Affordances: From Sensory–Motor Coordination to Imitation. IEEE
Transactions on Robotics 24, 1 (Feb 2008), 15–26.
[94] Jorge J Moré. 1978. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis. Springer, 105–116.
[95] Roozbeh Mottaghi, Connor Schenck, Dieter Fox, and Ali Farhadi. 2017. See the Glass Half Full: Reasoning About Liquid Containers, Their Volume and
Content. In The IEEE International Conference on Computer Vision (ICCV).
[96] Wail Mustafa, Nicolas Pugeault, and Norbert Krüger. 2013. Multi-view object recognition using view-point invariant shape relations and appearance
information. In Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 4230–4237.
[97] A. Myers, C. L. Teo, C. Fermüller, and Y. Aloimonos. 2015. Affordance detection of tool parts from geometric features. In 2015 IEEE International
Conference on Robotics and Automation (ICRA). 1374–1381. https://doi.org/10.1109/ICRA.2015.7139369
[98] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In Proceedings of the 28th
international conference on machine learning (ICML-11). 689–696.
[99] A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis. 2016. Detecting object affordances with Convolutional Neural Networks. In 2016 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS). 2765–2770.
[100] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. 2017. Object-based affordances detection with convolutional neural
networks and dense conditional random fields. In International Conference on Intelligent Robots and Systems (IROS).
[101] Sangmin Oh, Anthony Hoogs, Matthew Turek, and Roderic Collins. 2010. Content-based retrieval of functional objects in video using scene context. In
European Conference on Computer Vision. Springer, 549–562.
[102] Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 503–510.
[103] Genevieve Patterson and James Hays. 2017. The SUN Attribute Database: Organizing Scenes by Affordances, Materials, and Layout. In Visual Attributes.
Springer, 269–297.
[104] Michael Pechuk, Octavian Soldea, and Ehud Rivlin. 2008. Learning function-based object classification from 3D imagery. Computer Vision and Image
Understanding 110, 2 (2008), 173–191.
[105] Martin Peternell. 2000. Geometric properties of bisector surfaces. Graphical Models 62, 3 (2000), 202–236.
[106] Cody J Phillips, Matthieu Lecce, and Kostas Daniilidis. 2016. Seeing Glassware: from Edge Detection to Pose Estimation and Shape Recovery. In Robotics: Science and Systems, Vol. 3.
[107] Alessandro Pieropan, Carl Henrik Ek, and Hedvig Kjellström. 2013. Functional object descriptors for human activity modeling. In Robotics and
Automation (ICRA), 2013 IEEE International Conference on. IEEE, 1282–1289.
[108] Alessandro Pieropan, Carl Henrik Ek, and Hedvig Kjellström. 2015. Functional descriptors for object affordances. In IEEE International Conference on
Intelligent Robots and Systems Workshop. IEEE.
[109] A. Pieropan, C. H. Ek, and H. Kjellström. 2014. Recognizing object affordances in terms of spatio-temporal object-object relationships. In 2014
IEEE-RAS International Conference on Humanoid Robots. 52–58.
[110] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. 2016. Learning to refine object segments. In European Conference on Computer
Vision. Springer, 75–91.


[111] Sören Pirk, Vojtech Krs, Kaimo Hu, Suren Deepak Rajasekaran, Hao Kang, Yusuke Yoshiyasu, Bedrich Benes, and Leonidas J Guibas. 2017. Understanding
and exploiting object interaction landscapes. ACM Transactions on Graphics (TOG) 36, 3 (2017), 31.
[112] Siyuan Qi, Siyuan Huang, Ping Wei, and Song-Chun Zhu. 2017. Predicting Human Activities Using Stochastic Grammar. In The IEEE International
Conference on Computer Vision (ICCV).
[113] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[114] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In
Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc.,
91–99. http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
[115] Nicholas Rhinehart and Kris M Kitani. 2016. Learning action maps of large environments via first-person vision. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 580–588.
[116] Nicholas Rhinehart and Kris M Kitani. 2016. Learning action maps of large environments via first-person vision. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 580–588.
[117] Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine learning 62, 1-2 (2006), 107–136.
[118] Marcus Rohrbach, Michael Stark, and Bernt Schiele. 2011. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer
Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 1641–1648.
[119] Anirban Roy and Sinisa Todorovic. 2016. A multi-scale cnn for affordance segmentation in rgb images. In European Conference on Computer Vision.
Springer, 186–201.
[120] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[121] Erol Şahin, Maya Çakmak, Mehmet R Doğar, Emre Uğur, and Göktürk Üçoluk. 2007. To afford or not to afford: A new formalization of affordances
toward affordance-based robot control. Adaptive Behavior 15, 4 (2007), 447–472.
[122] G. Saponaro, G. Salvi, and A. Bernardino. 2013. Robot anticipation of human intentions through continuous gesture recognition. In 2013 International
Conference on Collaboration Technologies and Systems (CTS). 218–225.
[123] Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. 2014. SceneGrok: Inferring action maps in 3D environments.
ACM transactions on graphics (TOG) 33, 6 (2014), 212.
[124] Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. 2014. SceneGrok: Inferring action maps in 3D environments.
ACM transactions on graphics (TOG) 33, 6 (2014), 212.
[125] Johann Sawatzky and Jurgen Gall. 2017. Adaptive Binarization for Weakly Supervised Affordance Segmentation. In The IEEE International Conference
on Computer Vision (ICCV).
[126] Johann Sawatzky, Abhilash Srikantha, and Juergen Gall. 2017. Weakly Supervised Affordance Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[127] Markus Schoeler, Jeremie Papon, and Florentin Wörgötter. 2015. Constrained planar cuts-object partitioning for point clouds. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 5207–5215.
[128] Markus Schoeler and Florentin Wörgötter. 2016. Bootstrapping the Semantics of Tools: Affordance analysis of real world objects on a per-part basis.
IEEE Transactions on Cognitive and Developmental Systems 8, 2 (2016), 84–98.
[129] Y. Shiraki, K. Nagata, N. Yamanobe, A. Nakamura, K. Harada, D. Sato, and D. N. Nenchev. 2014. Modeling of everyday objects for semantic grasp. In The
23rd IEEE International Symposium on Robot and Human Interactive Communication. 750–755.
[130] T. Shu, X. Gao, M. S. Ryoo, and S. C. Zhu. 2017. Learning social affordance grammar from videos: Transferring human interactions to human-robot
interactions. In 2017 IEEE International Conference on Robotics and Automation (ICRA). 1669–1676.
[131] Tianmin Shu, M. S. Ryoo, and Song-Chun Zhu. 2016. Learning Social Affordance for Human-robot Interaction. In Proceedings of the Twenty-Fifth
International Joint Conference on Artificial Intelligence (IJCAI’16). AAAI Press, 3454–3461. http://dl.acm.org/citation.cfm?id=3061053.3061104
[132] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European
Conference on Computer Vision. Springer, 746–760.
[133] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
(2014).
[134] Hyun Oh Song, Mario Fritz, Daniel Goehring, and Trevor Darrell. 2016. Learning to detect visual grasp affordance. IEEE Transactions on Automation
Science and Engineering 13, 2 (2016), 798–809.
[135] Hyun Oh Song, M. Fritz, C. Gu, and T. Darrell. 2011. Visual grasp affordances from appearance-based cues. In 2011 IEEE International Conference on
Computer Vision Workshops (ICCV Workshops). 998–1005.
[136] Louise Stark and Kevin Bowyer. 1994. Function-based generic recognition for multiple object categories. CVGIP: Image Understanding 59, 1 (1994),
1–21.
[137] Michael Stark, Philipp Lies, Michael Zillich, Jeremy Wyatt, and Bernt Schiele. 2008. Functional object class detection based on learned affordance cues.
Computer Vision Systems (2008), 435–444.
[138] Thomas A Stoffregen. 2000. Affordances and events. Ecological psychology 12, 1 (2000), 1–28.
[139] Thomas A Stoffregen. 2003. Affordances as properties of the animal-environment system. Ecological psychology 15, 2 (2003), 115–134.
[140] Jie Sun, Joshua L Moore, Aaron Bobick, and James M Rehg. 2010. Learning visual object categories for robot affordance prediction. The International
Journal of Robotics Research 29, 2-3 (2010), 174–197.
[141] Yu Sun, Shaogang Ren, and Yun Lin. 2014. Object–object interaction affordance learning. Robotics and Autonomous Systems 62, 4 (2014), 487–496.
[142] M. Tenorth, L. Kunze, D. Jain, and M. Beetz. 2010. KNOWROB-MAP - knowledge-linked semantic object maps. In 2010 10th IEEE-RAS International
Conference on Humanoid Robots. 430–435.
[143] Spyridon Thermos, Georgios Th. Papadopoulos, Petros Daras, and Gerasimos Potamianos. 2017. Deep Affordance-Grounded Sensorimotor Object
Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[144] Federico Tombari, Samuele Salti, and Luigi Di Stefano. 2010. Unique signatures of histograms for local surface description. In European Conference on
Computer Vision. Springer, 356–369.
[145] Matthew W Turek, Anthony Hoogs, and Roderic Collins. 2010. Unsupervised learning of functional categories in video scenes. In European Conference
on Computer Vision. Springer, 664–677.
[146] Michael T Turvey. 1992. Affordances and prospective control: An outline of the ontology. Ecological psychology 4, 3 (1992), 173–187.
[147] Emre Ugur, Mehmet R Dogar, Maya Cakmak, and Erol Sahin. 2007. The learning and use of traversability affordance using range images on a mobile
robot. In Robotics and Automation, 2007 IEEE International Conference on. IEEE, 1721–1726.
[148] E. Ugur, E. Oztop, and E. Sahin. 2011. Going beyond the perception of affordances: Learning how to actualize them through behavioral parameters. In
2011 IEEE International Conference on Robotics and Automation. 4768–4773.
[149] Emre Ugur, Sandor Szedmak, and Justus Piater. 2014. Bootstrapping paired-object affordance learning with learned single-affordance features. In
Development and Learning and Epigenetic Robotics (ICDL-Epirob), 2014 joint IEEE International Conferences on. IEEE, 476–481.
[150] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. 2013. Selective search for object recognition. International
Journal of Computer Vision 104, 2 (2013), 154–171.
[151] Karthik Mahesh Varadarajan and Markus Vincze. 2011. Object part segmentation and classification in range images for grasping. In Advanced Robotics
(ICAR), 2011 15th International Conference on. IEEE, 21–27.
[152] Karthik Mahesh Varadarajan and Markus Vincze. 2012. Afnet: The affordance network. In Asian Conference on Computer Vision. Springer, 512–523.
[153] Karthik Mahesh Varadarajan and Markus Vincze. 2012. Afrob: The affordance network ontology for robots. In Intelligent Robots and Systems (IROS),
2012 IEEE/RSJ International Conference on. IEEE, 1343–1350.
[154] Karthik Mahesh Varadarajan and Markus Vincze. 2013. Parallel deep learning with suggestive activation for object category recognition. In International
Conference on Computer Vision Systems. Springer, 354–363.
[155] Tuan-Hung Vu, Catherine Olsson, Ivan Laptev, Aude Oliva, and Josef Sivic. 2014. Predicting actions from static scenes. In European Conference on
Computer Vision. Springer, 421–436.
[156] Hanqing Wang, Wei Liang, and Lap-Fai Yu. 2017. Transferring objects: Joint inference of container and human pose. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2933–2941.
[157] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. 2017. Attribute Recognition by Joint Recurrent Learning of Context and Correlation. In The
IEEE International Conference on Computer Vision (ICCV).
[158] Meng Wang and Xiaogang Wang. 2011. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on. IEEE, 3401–3408.
[159] Xiaolong Wang, Rohit Girdhar, and Abhinav Gupta. 2017. Binge Watching: Scaling Affordance Learning From Sitcoms. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
[160] Zhe Wang, Liyan Chen, Shaurya Rathore, Daeyun Shin, and Charless Fowlkes. 2019. Geometric pose affordance: 3d human pose with scene constraints.
arXiv preprint arXiv:1905.07718 (2019).
[161] Jianxin Wu, Xiang Bai, Marco Loog, Fabio Roli, and Zhi-Hua Zhou. 2017. Multi-instance Learning in Pattern Recognition and Vision.
[162] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. 2014. TORCS, the open racing car
simulator. Software available at http://www.torcs.org (2014).
[163] Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. 2013. Inferring "Dark Matter" and "Dark Energy" from Videos. In The IEEE International Conference on
Computer Vision (ICCV).
[164] Natsuki Yamanobe, Weiwei Wan, Ixchel G Ramirez-Alpizar, Damien Petit, Tokuo Tsuji, Shuichi Akizuki, Manabu Hashimoto, Kazuyuki Nagata, and
Kensuke Harada. 2017. A brief review of affordance in robotic manipulation research. Advanced Robotics 31, 19-20 (2017), 1086–1101.
[165] Bangpeng Yao and Li Fei-Fei. 2010. Grouplet: A structured image representation for recognizing human and object interactions. In Computer Vision and
Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 9–16.
[166] B. Yao, J. Ma, and L. Fei-Fei. 2013. Discovering Object Functionality. In 2013 IEEE International Conference on Computer Vision. 2512–2519. https://doi.org/10.1109/ICCV.2013.312
[167] C. Ye, Y. Yang, R. Mao, C. Fermüller, and Y. Aloimonos. 2017. What can I do around here? Deep functional scene understanding for cognitive robots.
In 2017 IEEE International Conference on Robotics and Automation (ICRA). 4604–4611. https://doi.org/10.1109/ICRA.2017.7989535
[168] Lap-Fai Yu, Noah Duncan, and Sai-Kit Yeung. 2015. Fill and transfer: A simple physics-based approach for containability reasoning. In Proceedings of
the IEEE International Conference on Computer Vision. 711–719.
[169] Gloria Zen, Negar Rostamzadeh, Jacopo Staiano, Elisa Ricci, and Nicu Sebe. 2012. Enhanced semantic descriptors for functional scene categorization. In
Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 1985–1988.
[170] L. Zhang, X. Zhen, and L. Shao. 2014. Learning Object-to-Class Kernels for Scene Classification. IEEE Transactions on Image Processing 23, 8 (Aug 2014),
3241–3253. https://doi.org/10.1109/TIP.2014.2328894
[171] Xi Zhao, He Wang, and Taku Komura. 2014. Indexing 3d scenes using the interaction bisector surface. ACM Transactions on Graphics (TOG) 33, 3
(2014), 22.
[172] Yibiao Zhao and Song-Chun Zhu. 2013. Scene parsing by integrating function, geometry and appearance models. In Computer Vision and Pattern
Recognition (CVPR), 2013 IEEE Conference on. IEEE, 3119–3126.
[173] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015.
Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529–1537.
[174] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2016. Semantic understanding of scenes through the ADE20K
dataset. arXiv preprint arXiv:1608.05442 (2016).
[175] Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. 2012. Kernelized probabilistic matrix factorization: Exploiting graphs and side
information. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 403–414.
[176] Zhi-Hua Zhou and Min-Ling Zhang. 2007. Multi-instance multi-label learning with application to scene classification. In Advances in Neural Information
Processing Systems. 1609–1616.
[177] Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang, and Yu-Feng Li. 2012. Multi-instance multi-label learning. Artificial Intelligence 176, 1 (2012),
2291–2320.
[178] Yuke Zhu, Alireza Fathi, and Li Fei-Fei. 2014. Reasoning about object affordances in a knowledge base representation. In European Conference on
Computer Vision. Springer, 408–424.
[179] Yixin Zhu, Chenfanfu Jiang, Yibiao Zhao, Demetri Terzopoulos, and Song-Chun Zhu. 2016. Inferring forces and learning human utilities from videos.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3823–3833.
[180] Yuke Zhu, Ce Zhang, Christopher Ré, and Li Fei-Fei. 2015. Building a large-scale multimodal knowledge base system for answering visual queries.
arXiv preprint arXiv:1507.05670 (2015).
[181] Yixin Zhu, Yibiao Zhao, and Song-Chun Zhu. 2015. Understanding tools: Task-oriented object modeling, learning and recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2855–2864.