
Machine Learning Approaches

to Robotic Grasping

Supervised by Dr.-Ing. Mohsen Kaboli


Institute for Cognitive Systems

Submitted by Axel Fehr

Submitted on 16.07.2018
Abstract
Grasping an object or a tool is very often the first step of interacting with it. Grasping is therefore an essential skill for
robots, which usually carry out tasks that involve manipulating and interacting with objects. The ability
to grasp many different objects reliably even in unstructured environments would make robots more useful and versatile and
would enable them to carry out tasks that could not have been done by a robot before. A lot of research has been done to explore
different approaches to robotic grasping. In recent years, approaches based on machine learning have become more popular due
to recent progress in this field and an increased availability of training data. This work gives an overview of state-of-the-art
machine learning approaches to robotic grasping.
Contents
Introduction
    Background
    Contribution
Related Work
Summaries of Approaches
References

Introduction
Background
Humans use dozens of different objects every day. Many of these usages involve grasping an object (e.g. a cup, a pen or a door
handle). In humans, multiple senses like the sense of touch and sense of vision are involved in the grasping process. The sense
of vision is used to localize the object that should be grasped and to decide how it should be grasped (e.g. which fingers should
be involved and how they should be moved). The sense of touch is used to adjust an already performed grasp and to assess
whether it is stable.

Given that grasping is a universally important skill, a lot of research has been done to find out what methods
could enable robots to perform grasping as reliably as possible. There are major challenges that have to be addressed in the
field of robotic grasping. These challenges include object localization, deciding how to grasp an object (e.g. based on shape and
material properties), deciding how much force to apply during the grasp, assessing grasp stability and detecting if the grasped
object is slipping.

In the past, many hand-engineered approaches were designed to solve the mentioned challenges but reliable robotic grasping
still remains a challenging problem. In recent years, machine learning approaches that could enable a robot to learn how to
grasp objects rose in popularity. In such learning-based approaches, the robot is not equipped with hand-designed methods that
determine how to grasp an object. The core of learning-based approaches is a learning algorithm that learns from experience to
carry out whatever task it is given. One of the reasons why such learning-based approaches became more popular is that more
datasets related to robotic grasping became available. The idea is that these large amounts of data can be used to train learning
algorithms to execute reliable grasps based on sensor data (e.g. camera images). The long-term goal is to build systems that can
perform grasps even in chaotic and unstructured environments like the real world.

Contribution
In this work, twenty research papers related to robotic grasping were reviewed. The approach in each reviewed paper is
summarized and followed by a brief discussion of the strengths and weaknesses of the presented approach.

Most of the research papers included in this survey directly focus on the problem of executing grasping movements based on
sensory input like camera images or tactile measurements. However, some papers focus on subproblems of robotic grasping
such as recognizing material characteristics of objects with visual information or learning to distinguish objects based on
tactile data. There are also other papers that do not solely focus on grasping but also on grasping and manipulating an object to
carry out a given task (e.g. picking up a bag of rice and putting it into a bowl).

Related Work
In this section, the reviewed research papers are categorized. The reviewed approaches can be categorized based on input data,
learning methods and setups.

The approaches reviewed in this work are based on visual, tactile or multisensory input data (a fusion of visual and
tactile data). Vision-based approaches were designed to perform grasping or grasping-related tasks solely based on RGB
images or depth images. The tactile-based approaches solely relied on sensor readings from tactile sensors. The multisensory
approaches combined information from images with information from tactile sensor readings to carry out grasping-related
tasks.

Categories based on input data:


• Vision-based
• Tactile-based
• Multisensory

The used learning methods can be divided into supervised and unsupervised methods. Supervised learning methods like neural
networks and Support Vector Machines are trained with labeled data while unsupervised methods like autoencoders are trained
with unlabeled data. Some works used both supervised and unsupervised methods.

Categories based on learning methods:

• Supervised learning
• Unsupervised learning
Finally, the setups of the approaches can also be grouped into two categories. Some approaches were implemented on a real
robotic platform while other approaches were implemented on a virtual robot in a simulated environment. Some approaches
would fall into both categories because they involved training a learning algorithm on data obtained through simulation (e.g.
rendered images of objects) but also implementing the eventual system on a real robot.

Categories based on setups:


• Simulated environment
• Real-world environment

The following chapter contains summaries of various machine learning approaches to robotic grasping.

Summaries of Approaches
A Tactile-Based Framework for Active Object Learning
and Discrimination using Multimodal Robotic Skin1
Authors: Mohsen Kaboli, Di Feng, Kunpeng Yao, Pablo Lanillos, Gordon Cheng

Motivation
Robots that explore their environment with tactile feedback can extract useful information from their surroundings such as
material characteristics. This kind of information is very hard to collect via visual feedback. The following work describes a
tactile-based framework for autonomous workspace exploration and object recognition based on material properties.

Approach
The framework consists of three main components:

• A pre-touch strategy to explore the unknown workspace


• A method to learn physical properties of objects
• An algorithm based on active touch to discriminate objects

Figure 1. The proposed tactile-based framework

Sensor Types
A multi-modal artificial skin was used as a sensory platform. The skin provided the robot with a sense of touch and pre-touch.
The skin was made of multiple connected skin cells, each of which had the following sensors:

• One proximity sensor (for pre-touch)

• One three-axis accelerometer


• One temperature sensor
• Three normal-force sensors

Figure 2. Types of sensors per skin cell

A skin patch consisting of seven skin cells was attached to the end-effector of a 6-DoF UR10 industrial robot.

Preprocessing and Feature Extraction


Perception of Physical Properties
Three physical properties of an object were perceived by the robot:
• Stiffness (measured by pressing against an object)
• Texture (measured by sliding the skin along the surface)
• Thermal conductivity (measured with temperature sensors under static contact)
Active Pre-Touch for Workspace Exploration
The workspace was represented as a discretized 3D grid. The robot first started exploring the workspace from a fixed point. To
decide where to move the end-effector next, the probability of detecting an object was computed based on sensory measurements
from the proximity sensors for each of the neighboring locations. The location that maximized this probability was the one the
end-effector was moved to. This was done iteratively until the certainty about the workspace reached a set threshold.

After the probability of the presence of an object was calculated for each grid cell, a clustering algorithm determined the number
of objects in the workspace. A bounding box and a centroid were computed for each object.
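As an illustration, the exploration loop can be pictured as a greedy search over the occupancy grid. The following Python snippet is only a minimal sketch under simplifying assumptions: detect_prob is a hypothetical stand-in for the proximity-sensor model, the per-cell occupancy is treated as a Bernoulli variable, and the entropy threshold and step limit are illustrative rather than taken from the paper.

import numpy as np

def neighbors(cell, shape):
    # 6-connected neighbors of a cell in the discretized 3D grid
    x, y, z = cell
    for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        n = (x + dx, y + dy, z + dz)
        if all(0 <= c < s for c, s in zip(n, shape)):
            yield n

def workspace_entropy(p):
    # total Bernoulli entropy of the occupancy probabilities (0 = full certainty)
    q = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-(q * np.log(q) + (1 - q) * np.log(1 - q)).sum())

def explore(start, shape, detect_prob, threshold, max_steps=10000):
    # greedily move to the neighboring cell with the highest predicted
    # probability of containing an object, until the workspace is certain enough
    p = np.full(shape, 0.5)              # uninformed occupancy prior
    cell, visited = start, {start}
    for _ in range(max_steps):
        if workspace_entropy(p) <= threshold:
            break
        cands = [n for n in neighbors(cell, shape) if n not in visited]
        if not cands:                    # dead end: stop this simplified sketch
            break
        probs = [detect_prob(c) for c in cands]
        best = int(np.argmax(probs))
        cell = cands[best]
        p[cell] = probs[best]            # record the new measurement
        visited.add(cell)
    return p                             # occupancy map, ready for clustering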

Learning Method
Learning Physical Properties
This work treats tactile learning as a supervised learning problem with multiple classes, where each class represents an object.
A probabilistic classifier is used for each tactile property. Gaussian Process Classifiers (GPCs) are used to create observation
models of objects. New training samples are collected by deciding which object and which physical property to explore next.
The GPCs are iteratively updated after collecting more training data that are supposed to lead to the largest possible performance
improvement of the GPCs.

Object Discrimination
As few exploratory actions as possible should be taken. This can be achieved by deciding to take actions that provide the most
useful information. To determine the best next action, the similarity between one object and all other objects is calculated. The
action that is taken is the one that is the most likely to minimize the similarity score. A lower similarity score means the object
is easier to distinguish from others.

Results

Figure 3. Development of the workspace entropy with different strategies. The proposed method minimized the entropy the
fastest.

Figure 4. Recognition accuracy based on (a): stiffness, (b): surface texture, (c): thermal conductivity and (d): all three
properties.

Strengths and Weaknesses


Strengths:
• Proposed strategy outperformed uniform and random strategies
• High classification accuracy with few training examples
• Method is unrestricted by feature dimensions and therefore suitable for high-dimensional features
Weaknesses:
• Low spatial resolution of proximity sensors leads to difficulties when clustering objects that are very close

End-to-End Training of Deep Visuomotor Policies2
Authors: Sergey Levine, Chelsea Finn, Trevor Darrell, Pieter Abbeel

Motivation
Teaching robots to perform tasks based on raw image data is very challenging. Hand-engineered solutions were often utilized
in the past for perception and control in order to obtain low-dimensional representations of the robot’s observations and
actions. Convolutional neural networks (CNNs) can deal with high-dimensional visual input and can learn low-dimensional
representations, but they require large amounts of training data. Real-world training data for robotic tasks are usually
scarce, which makes it challenging to train deep neural networks on such data until
they develop well-performing control policies. In this work, Levine et al. investigated whether more effective policies can be
achieved by training the robot’s perception end-to-end together with the control policy instead of training these parts separately.

Approach
Two components were the basis of the proposed method:

• A supervised learning algorithm (implemented as a CNN)


• A trajectory-centric reinforcement learning algorithm
The first component was trained on raw image data and supervised by the second component. The second component provided
the first component with guiding trajectories that result from training on the full state of the system, which included the
3D-coordinates of objects in the scene as opposed to just the raw image data. Since the second component was provided
with the full state, it was substantially easier and faster to optimize. This made it possible to use the second component as a
supervisor for the first component, which generated trajectories solely based on raw image observations and not the full state of
the system. The first component was trained to match the trajectories generated by the second component, which is supervised
learning where the supervision is provided by the trajectory-centric algorithm. During training, the guidance of the second
component was used to obtain feedback for the trajectories generated by the deep network without executing the trajectories on
a physical system.
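The supervision step itself can be pictured as plain regression: the vision-based network is trained to reproduce, from images alone, the actions produced by the trajectory-centric controller that sees the full state. Below is a minimal PyTorch sketch; the network layers, output dimension and function names are illustrative placeholders, not the architecture from figure 5.

import torch
import torch.nn as nn

policy = nn.Sequential(                 # stand-in for the visuomotor CNN
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),
    nn.AdaptiveAvgPool2d(8), nn.Flatten(),
    nn.Linear(16 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 7),                   # e.g. 7 joint torques
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def supervised_step(images, teacher_actions):
    # regress the policy's output onto the guiding controller's actions
    loss = ((policy(images) - teacher_actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()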

Figure 5. Architecture of the visuomotor policy

Figure 6. Diagram of the proposed approach. p_i is a local time-varying linear-Gaussian controller for the initial state x_1^i,
τ is the trajectory and π_θ is the policy of the network in figure 5.

Experiments
To evaluate how well the visuomotor policy performed, four experiments were carried out:
• Hanging a coat hanger on a rack
• Inserting a block with a certain shape into a shape sorting cube
• Fitting the claw of a wooden hammer under the head of a nail
• Screwing a bottle cap onto a bottle

Results
The results indicate that training the perception and control system together lets the CNN learn more robust task-specific
features that would not have been learned if the CNN had been trained solely on visual tasks that do not involve control.

Figure 7. Success rates on 1) positions and grasps seen during training (second column from the left), 2) new target positions
not seen during training (third column) and 3) training positions with visual distractors (fourth column).

Strengths and Weaknesses


Strengths:

• Small amount of training data required
• Even extremely high-dimensional policies can be optimized
• More robust task-specific features were learned
Weaknesses:
• Full state has to be observable during training (especially difficult if environment includes moving objects)
• Due to the small amount of training data, distractors that correlate with task-relevant variables can inhibit generalization
(e.g. the image of the robot arm holding an object)

Deep Learning for Tactile Understanding From Visual
and Haptic Data3
Authors: Yang Gao, Lisa Anne Hendricks, Katherine J. Kuchenbecker, Trevor Darrell

Motivation
Since the appropriate way to manipulate or grasp an object also depends on its material properties, robots would gain
an advantage if they could estimate an object's haptic characteristics. This paper presents a way to combine visual and haptic data to
estimate the material properties of objects.

Approach
The approach was based on two different convolutional neural networks:
• A CNN for visual data
• A CNN for haptic data

(a) CNN for image data. (b) CNN for haptic data.

Figure 8. Structure of the CNNs used in the proposed approach. The weights for CNN (a) were transferred from a CNN based
on the GoogleNet architecture4 trained on a material recognition training set named Materials in Context Database5 . CNN (b)
performed temporal convolutions on the input haptic signal.

The output of the networks that processed the visual and haptic data were concatenated. The concatenated output was then used
by a classifier to output a prediction regarding the material properties of the respective object.
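As an illustration of this late-fusion structure, a minimal PyTorch sketch follows. The layer sizes and names are illustrative only: the paper's visual branch was GoogLeNet-based, and since an object can exhibit several of the 24 adjectives at once, the resulting logits would typically be trained with a per-adjective binary cross-entropy loss.

import torch
import torch.nn as nn

class MultisensoryNet(nn.Module):
    def __init__(self, n_adjectives=24):
        super().__init__()
        self.visual = nn.Sequential(          # embeds the RGB image
            nn.Conv2d(3, 8, 5), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(8 * 4 * 4, 64),
        )
        self.haptic = nn.Sequential(          # temporal (1D) convolutions over
            nn.Conv1d(32, 16, 9), nn.ReLU(),  # the 32 preprocessed haptic signals
            nn.AdaptiveAvgPool1d(4),
            nn.Flatten(), nn.Linear(16 * 4, 64),
        )
        self.classifier = nn.Linear(64 + 64, n_adjectives)

    def forward(self, image, haptics):
        # concatenate both embeddings, then classify the fused representation
        fused = torch.cat([self.visual(image), self.haptic(haptics)], dim=1)
        return self.classifier(fused)         # one logit per haptic adjective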

Figure 9. Structure of the multisensory model.

Experiments
The method was tested on the Penn Haptic Adjective Corpus 2 dataset, which contains haptic information and images of 53
household objects. There are eight images of each object taken from different perspectives. Along with haptic and visual data,
the presence or absence of 24 different haptic characteristics was provided. Each object in the dataset was explored with four
actions per trial:
• Squeeze
• Hold
• Slow slide
• Fast slide
Each of the listed actions was performed ten times on each of the household objects.

Figure 10. The 24 haptic characteristics in the used dataset

Sensor Types
Haptic data of each object were obtained with a pair of SynTouch BioTac tactile sensors. These sensors provided the following
signals:
• Low-frequency fluid pressure (sample rate: 100 Hz)
• High-frequency fluid vibrations (sample rate: 2200 Hz)
• Core temperature (sample rate: 100 Hz)
• Core temperature change (sample rate: 100 Hz)
• Electrode impedance from 19 electrodes (sample rate: 100 Hz)

Preprocessing
The following steps were taken to preprocess the haptic data:
• Normalization
• Downsampling of the high-frequency fluid vibrations to 100 Hz
• Each signal was set to a fixed length of 150 samples
• Four principal components of the electrode impedances were extracted via principal component analysis
The preprocessing resulted in (4 + 4) × 4 = 32 haptic signals as input for each trial (four electrode signals and the four other
haptic signals for each exploratory action).
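A minimal sketch of these haptic preprocessing steps for a single exploratory action is given below. The array shapes, the naive decimation by a factor of 22 (2200 Hz to 100 Hz) and the crop-or-pad strategy are assumptions based on the description above, not the paper's exact implementation.

import numpy as np

def preprocess_action(pressure, vibration, temp, dtemp, electrodes, target_len=150):
    # vibration is sampled at 2200 Hz; decimate by 22 to match the 100 Hz signals
    vibration = vibration[::22]
    n = min(len(pressure), len(vibration), len(temp), len(dtemp), electrodes.shape[1])
    signals = np.stack([pressure[:n], vibration[:n], temp[:n], dtemp[:n]])
    # keep the first 4 principal components of the 19 electrode channels
    centered = electrodes[:, :n] - electrodes[:, :n].mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    components = u[:, :4].T @ centered              # shape (4, n)
    x = np.concatenate([signals, components])       # 8 channels per action
    # normalize each channel, then crop or pad to the fixed length of 150
    x = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-8)
    if x.shape[1] >= target_len:
        x = x[:, :target_len]
    else:
        x = np.pad(x, ((0, 0), (0, target_len - x.shape[1])))
    return x    # the four exploratory actions together give the 32 input signals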

Regarding the image data, the mean values of the RGB images were subtracted from each image. Furthermore, the images
were resized to a smaller fixed size and only a central crop of each image was kept.

Since there were ten trials for each of the 53 objects, 530 training samples were available. To augment the haptic data,
the measurements from both BioTac sensors were treated as separate training samples. Additionally, the haptic data was
subsampled five times, each time from a different starting point. This resulted in 5,300 training samples.

Results
It could be concluded that the multimodal approach of combining haptic and visual information yields a better classification
performance than approaches relying on one or the other alone. This suggests that the haptic and visual signals are complementary
for the task.

Figure 11. Accuracies of different methods. The prefix of the name of a method (e.g. ”Haptic”) refers to the kind of data the
model was trained on. Methods A-F were SVMs. When there is a suffix (1, 8 or 10), it denotes the number of trials on a single
object or images of a single object that were combined for the classification.

Figure 12. A comparison of AUC scores of the best performing haptic and visual classifier for each material characteristic. It
can be seen that some characteristics are more reliably detected with haptic signals and some are more reliably detected with
visual signals. Combining both can lead to better results.

Strengths and Weaknesses


Strengths:
• The multimodal approach performed better than a haptic or visual approach alone
Weaknesses:
• The subsampling of the signals discards potentially important information
• The multimodal approach classifies some haptic adjectives less accurately than a pure haptic or visual approach

Tactile-based active object discrimination and target object search in an unknown workspace6
Authors: Mohsen Kaboli, Kunpeng Yao, Di Feng, Gordon Cheng

Motivation
Our sense of touch is invaluable when we interact with objects in our daily lives and therefore would benefit robots as well. A
sense of touch can provide information about the environment that would be very difficult to gather through other senses. There
have been many advances in the development of tactile sensors but less attention has been paid to how tactile information could
be processed. This poses a problem because the effectiveness of tactile systems is not solely based on the tactile sensors but
also on the methods that are used to deal with the provided sensory data.

Approach
In this work, Kaboli et al. present a tactile-based framework for active workspace exploration and active object recognition to
solve the mentioned problem. Their contribution can be summarized in four points:
• A strategy to autonomously explore an unknown workspace to estimate the location and orientation of objects in it
• A method to learn about objects and their physical properties and to construct observation models with as few training
samples as possible
• An approach to distinguish objects and to search for target objects
• A technique to explore the center of mass of rigid objects

Figure 13. The tactile-based framework had three components: (a) a method to explore the workspace, (b) an algorithm to
efficiently learn the physical characteristics of objects (texture, stiffness, center of mass) and (c) a method to recognize objects.
Part (c) consisted of two parts. (c-1) discriminated objects based on physical properties and (c-2) was a strategy to search target
objects in a workspace with unknown objects.

Sensor Types
In order to get tactile feedback, three OptoForce OMD-20-SE-40N 3D tactile sensors were attached to the fingertips of a robotic
gripper. The sensors measured the force vectors acting on a grasped item.

Figure 14. A picture of the experimental setup showing the positions of the tactile sensors on the fingertips.

Preprocessing and Feature Extraction


Perception of Physical Properties
Three physical properties of an object were perceived by the robot:
• Stiffness (measured by pressing the sensors against an object)
• Texture (measured by sliding the sensors along the surface)
• Center of mass (the current lifting position was estimated to be the center of mass if defined force and torque conditions
were met)

Active Touch for Unknown Workspace Exploration


The workspace was represented as a discretized 3D grid. From its starting position, the robot incrementally executed movements.
If contact with an object was detected, the presence of an object at the current position was recorded, which eventually resulted
in a point cloud. To decide which position the robot should move to next, the system considered the uncertainty of the workspace
in neighboring positions. The position with the highest uncertainty was the one the gripper was moved to. This was done until
the uncertainty about the workspace was below a set threshold. The point cloud was then clustered to localize objects.

Learning Method
Active Touch for Object Learning
The presented approach treated the learning of physical properties of objects as a multi-class classification problem. The
classification was performed by Gaussian Process Classifiers (GPCs), of which there was one for each physical property. To
start the learning process, a small set of training data was generated by measuring the stiffness, texture and center of mass once
for each object. Then, new training samples were collected to minimize the uncertainty about objects as quickly as possible.
The object and property that was sampled next was the object-property pair with the largest entropy. This was done iteratively
until the entropy of the observation models did not measurably change.
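The sampling rule can be sketched as picking the query with maximal predictive entropy. In the snippet below, gpcs is a hypothetical mapping from each physical property to a trained classifier exposing a predict_proba-style interface; all names are placeholders, not the paper's code.

import numpy as np

def predictive_entropy(probs):
    # Shannon entropy of a class-probability vector
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def next_query(gpcs, objects, properties):
    # explore the (object, property) pair whose observation model is least certain
    best, best_h = None, -1.0
    for obj in objects:
        for prop in properties:
            h = predictive_entropy(gpcs[prop].predict_proba(obj))
            if h > best_h:
                best, best_h = (obj, prop), h
    return best    # measure this property on this object next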
Active Touch for Object Recognition
The number of exploratory actions to discriminate objects should be kept as low as possible. To achieve this, the expected
uncertainty about an object after executing an action was calculated. The action that minimized this score was selected to be
executed next.

Results

Figure 15. A comparison of the proposed method for object localization with a uniform and a random strategy. It can be
seen that the proposed method reduced the uncertainty much faster.

Figure 16. A comparison of the classification accuracies of three different approaches: the proposed one, a uniform one and a
random one. The proposed method for object learning outperformed both the uniform and the random approach.

Figure 17. Development of the classification accuracies shown for all three physical characteristics individually: (a) stiffness,
(b) texture and (c) the center of mass.

Strengths and Weaknesses


Strengths:
• Proposed strategy outperforms uniform and random strategies
• High classification accuracy with few training examples
• Method does not suffer from the curse of dimensionality and is therefore suitable for high-dimensional features
Weaknesses:
• An underlying assumption was having rigid objects in the workspace, which is not always the case in the real world

ST-HMP: Unsupervised Spatio-Temporal Feature Learning for Tactile Data7
Authors: Marianna Madry, Liefeng Bo, Danica Kragic, Dieter Fox

Motivation
Tactile sensors provide rich feedback for robots during object manipulation. But extracting meaningful features from the raw
sensory data is challenging.

Approach
Madry et al. investigated the effectiveness of a method they call Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP) to
solve the described problem. This method was designed to extract features from a time series of raw sensor data. It is based
on unsupervised hierarchical feature learning. ST-HMP extracts features from frames of data and pools them over the time
dimension. This was done at several scales. In other words, ST-HMP combines spatial features from frames of data over time
in a hierarchical way, generating a single feature vector for a whole sequence.

At first, codebooks over a large collection of spatial data patches were learned with the method K-SVD. The underlying idea at
this step is to represent data as a sparse linear combination of codewords from a codebook.

With a given sequence of tactile data, each data pixel was represented by the sparse codes calculated based on the neighborhood
around it (e.g. a patch of 4 x 4 pixels). Feature vectors were then generated by using spatio-temporal max pooling in the
pyramid described in figure 18. The features were the max-pooled sparse codes.
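A minimal sketch of this pooling stage is given below, assuming the per-pixel sparse codes have already been computed; the pyramid levels are illustrative, and the cell sizes are assumed to be no larger than the data dimensions.

import numpy as np

def st_max_pool(codes, spatial_levels=(1, 2), temporal_levels=(1, 2)):
    # codes: array of sparse codes with shape (T, H, W, K) for T frames,
    # an H x W taxel grid and K codewords; features are max-pooled inside
    # every cell of a spatio-temporal pyramid and concatenated
    T, H, W, K = codes.shape
    features = []
    for st in temporal_levels:           # split the sequence into st chunks
        for ss in spatial_levels:        # split each frame into ss x ss cells
            for t in range(st):
                t0, t1 = T * t // st, T * (t + 1) // st
                for i in range(ss):
                    for j in range(ss):
                        h0, h1 = H * i // ss, H * (i + 1) // ss
                        w0, w1 = W * j // ss, W * (j + 1) // ss
                        cell = codes[t0:t1, h0:h1, w0:w1, :]
                        features.append(cell.max(axis=(0, 1, 2)))
    return np.concatenate(features)      # single descriptor for the whole sequence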

Figure 18. A visualization of a partition of data in a (a) spatial, (b) temporal and (c) spatio-temporal pyramid, which is a
spatio-temporal representation of the tactile data. The cells in which features are pooled are marked in green, blue and red.

Experiments
The described approach was tested on the following synthetic and real databases:
• Schunk Dextrous Synthetic database (SDS)
• Schunk Dextrous database with 5 real objects (SD-5)
• Schunk Dextrous database with 10 objects (SD-10)
• Schunk Parallel database with 10 objects (SPr-10)
• Schunk Parallel database with 7 objects (SPr-7)
• iCub database with 10 objects (iCub-10)
These databases contain tactile measurements recorded during grasps on different sets of objects (see figure 19) with different
grippers. Two of these databases (SDS and SD-5) were used for grasp stability assessment. Five of them (SD-5, SD-10, SPr-10,
SPr-7 and iCub-10) were used for evaluating object recognition.
The features that were generated by the proposed approach were fed into a Support Vector Machine to perform the classification
task.

Figure 19. Objects contained in the six databases used for experimental evaluation. Some of these objects are deformable.

Results

Figure 20. Performance of the different methods to assess grasp stability on the SDS database. MV-HMP refers to an
approach where the original HMP algorithm from previous work is used for each frame in a sequence of tactile data and
temporal information is added to the representation by using the class labels of each frame of tactile data to perform majority
voting in order to recognize a whole sequence.

Figure 21. Classification results for grasp stability assessment with different objects from the SD-5 database. A comparison
to other existing methods is made. The best-performing method is the method proposed in this work.

Figure 22. Object classification results and a comparison between the proposed approach and other existing approaches.

Strengths and Weaknesses
Strengths:
• ST-HMP achieved state-of-the-art performance in grasp stability assessment and object recognition at the time of the
publication (2014)
• Proposed approach is universal (performs well across different datasets)
• Manual design of features not necessary
• Method also works with deformable objects

Deep Learning for Detecting Robotic Grasps8
Authors: Ian Lenz, Honglak Lee, Ashutosh Saxena

Motivation
Image information in combination with depth information can be a useful basis to help robots detect good locations to place a
gripper at in order to grasp an object. This is challenging to do in real-time because evaluating a large number of potential
grasping locations is very time-consuming. The following work presents a method that can be used in real-time to detect robotic
grasps with an RGB-D camera.

Approach
This work focused on parallel-plate grippers and its contributions can be summarized in the following points:
• A deep learning algorithm to detect robotic grasps
• A new regularization method for multimodal inputs
• A two-step process to significantly reduce the computational costs of the detections

Figure 23. Illustration of the approach (from left to right). An RGB-D image is obtained from a depth camera on the robot
and possible grasps are searched in it with a neural network. A set of raw features (color, depth and surface normal) is then
extracted for all potential grasps, which forms the input to another deep network that evaluates each grasp. The grasp with the
best evaluation is chosen and executed. The red and the green lines in the images represent the gripper plates.

Figure 24. Visualization of the two-stage process that significantly reduces the computational cost of the proposed method. A
small deep network is used to comprehensively search for potential grasping positions and output a small set of grasping
positions (represented by rectangles) that are the most likely to work. Then, a larger deep network is used to find the best
grasping position among the best ones selected by the smaller network. Since it takes much longer to compute the output of the
larger network, it is only used to evaluate a small set of grasps that are pre-selected by the much smaller network.

The network architecture that was used for the two networks in figure 24 was a feedforward neural network with two hidden
layers. The smaller network had fewer neurons in the hidden layers and received a smaller number of input features.
The new regularization method was specifically developed for multimodal inputs. This technique regularizes the
number of modalities used for each feature in the hidden layers, discouraging the networks from learning weak correlations
between different modalities.

Sensor Types
Only RGB-D images collected with a Kinect sensor were used.

Preprocessing and Feature Extraction


Unsupervised Feature Learning
Prior to the training phase, the hidden layer weights were initialized by training a sparse auto-encoder to reconstruct inputs.
The first hidden layer was trained to reconstruct the input to the network (an image). The second hidden layer was trained to
reconstruct the output of the first hidden layer, which was the input to the second hidden layer.

Features
Grasps were represented by rectangles in the image plane (like in figure 23 and 24) with two parallel edges representing the
plates of the gripper.

Each rectangle is parametrized as follows:


• X and Y coordinates of the upper-left corner
• Width
• Height
• Orientation
The proposed algorithm used only local information obtained from an RGB-D sub-image in a rectangle representing a grasp.
The image was rotated so that the left and the right edge of it correspond to the gripper plates. It was then re-scaled to a fixed
size of 24 × 24 pixels. Each pixel had seven channels which resulted in 24 × 24 × 7 = 4032 input features.

Channels:
• Channel 1-3: image in the YUV color space
• Channel 4: depth
• Channel 5-7: X, Y, and Z components specifying surface normals computed based on the depth channel

Preprocessing
The data were preprocessed in two ways:
• The data was whitened
• Each rectangle was padded if necessary to preserve the aspect ratio when it was re-scaled

Experiments
The proposed method was first tested on an updated version of the Cornell grasping dataset, which contains more complex
objects than the original one. This dataset consists of 1035 images of 280 graspable objects. Each image contains positive and
negative grasping rectangles.

The small network contained 50 neurons in each hidden layer while the larger network contained 200 neurons in each hidden
layer. The larger network assessed the best 100 rectangles selected by the small network.

The following metrics determined if a predicted rectangle is considered to be a correct prediction:


• Point metric (Is the center point of the predicted rectangle close to the center of a ground-truth rectangle?)
• Rectangle metric (Is the orientation difference less than 30°?)
• Bounding box metric (Is the area of intersection between the predicted and the ground-truth rectangle more than 25% of
the area of these two rectangles combined?)
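For illustration, the bounding box criterion corresponds to an intersection-over-union test. Below is a simplified sketch for axis-aligned rectangles; the dataset's rectangles can additionally be rotated, which this ignores.

def iou(a, b):
    # intersection over union of two axis-aligned rectangles (x, y, width, height)
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# a prediction counts as correct under this criterion when
# iou(predicted, ground_truth) > 0.25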

The proposed method was tested in two experiments in real-time. The two experiments were performed with a Baxter Research
Robot and a PR2 robot. The Kinect sensor was used again for both experiments. In each of the two experiments, a single object
was placed within an area of fixed size on a table. Some example objects in the experiments can be seen in figure 25.

Figure 25. Example objects used in the real-time experiments.

Results

Figure 26. Grasp detection results on the used dataset based on different modalities or combinations of them.

Figure 27. Experimental results with the Baxter robot sorted by object category. ”Tr.” is the number of trials and ”Acc.” is the
accuracy.

Figure 28. Experimental results with the PR2 robot sorted again by object category. As in figure 27, ”Tr.” is the number of
trials and ”Acc.” is the accuracy.

Strengths and Weaknesses
Strengths:
• No feature engineering required
• Method outperforms previous methods with hand-engineered features by a large margin
• The two-pass system resulted in a reduction of rectangles the large network had to evaluate by a factor of 1000 while
increasing detection performance by 2.4%
• The proposed regularization method produced successful grasps in some cases where standard L1 regularization could
not produce successful grasps
• Objects with obvious handles (e.g. knives) can be grasped very consistently
• The method works on different platforms and can robustly detect grasps on a wide range of objects, including objects
never seen before
Weaknesses:
• The algorithm sometimes had problems with white objects because it was trained on a dataset only containing images
with a white background
• No tactile feedback involved (one object was slightly crushed by the PR2 robot during a grasping attempt)
• Kinect often fails to collect depth information from objects with glossy surfaces (e.g. glass or metal)

Learning Robot Tactile Sensing for Object Manipulation9
Authors: Yevgen Chebotar, Oliver Kroemer, Jan Peters

Motivation
Tactile sensing is an integral part of object manipulation. Being capable of perceiving and processing tactile feedback is an
important ability especially in unstructured environments.

Approach
Chebotar et al. did research on how a robot could learn to use tactile feedback when manipulating objects. The work consists of
the following parts:

• The adaptation of three pose estimation algorithms to in-hand object localization


– Probabilistic hierarchical object representation (PHOR)10
– Iterative closest point (ICP)11
– Voting scheme (described below)
• Dynamic motor primitives (DMPs) to encode movements from human demonstrations
• A learning algorithm to optimize parameters in the DMPs to generate desired trajectories

Pose Estimation
The pose estimation consisted of two phases:

1. The learning of the model of an object based on collected tactile data


2. Pose inference (estimation of the object pose with the learned model based on new observations)
The algorithms that were adapted for the pose estimation (PHOR, ICP and voting scheme) were originally developed to handle
visual data. The adapted algorithms in this work can be used for tactile images obtained from tactile data in matrix form during
the grasp of an object.

The voting scheme that was used for the pose estimation is appearance-based. In this scheme, an object model is represented
with an appearance set of tactile image patches and a pose set of relative transformations between the patches. When new tactile
data are obtained from an observation, the method computes descriptors of each patch in the new tactile image and searches for
the K nearest neighbors of each new patch in the appearance set. For each found patch, a weighted vote for the object pose was
computed based on the relative transformation in the pose set.
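The voting step can be sketched as a weighted nearest-neighbor average. Everything below is a simplification: descriptors are compared with a Euclidean distance, poses are reduced to a planar (x, y, θ) array and composed by simple addition, and all names are placeholders rather than the paper's implementation.

import numpy as np

def vote_for_pose(new_patches, appearance_set, pose_set, k=5):
    # new_patches: iterable of (descriptor, patch_pose) pairs from the new
    # tactile image; appearance_set: (N, D) stored patch descriptors;
    # pose_set: (N, 3) relative transformations associated with each patch
    votes, weights = [], []
    for desc, patch_pose in new_patches:
        d = np.linalg.norm(appearance_set - desc, axis=1)
        for idx in np.argsort(d)[:k]:             # K nearest stored patches
            w = 1.0 / (1e-6 + d[idx])             # closer match -> larger weight
            votes.append(patch_pose + pose_set[idx])  # apply the stored transform
            weights.append(w)
    votes, weights = np.array(votes), np.array(weights)
    return (weights[:, None] * votes).sum(0) / weights.sum()   # pose estimate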

Dynamic Motor Primitives


The trajectories from the human demonstrations were expressed with DMPs. An imitation learning process of the robot tried
to replicate these trajectories by learning weights in a forcing function of a DMP so that the resulting trajectory matches the
trajectory from the human demonstration as closely as possible.

Learning Method
The used imitation learning process to learn to replicate the trajectories from the human demonstrations was reinforcement
learning. The policy was defined as a Gaussian distribution of the feedback weights in the equation of the DMP. The reward for
the learning algorithm was computed based on the deviation of the tactile signals of the generated trajectory (generated by
the weights of the policy) from the tactile signals of the desired trajectory (the trajectory of the human demonstration). The
optimization of the policy was performed with episodic relative entropy policy search (REPS)12 . This optimization method
limits the loss of information during policy updates, which has the advantage that it results in better convergence behavior.

Sensor Types
Two low-resolution matrix analog pressure sensors by PlugAndWear.com were used in this work to obtain tactile data as a
matrix array (tactile images) like in figure 29. The sensitive area of such a sensor matrix measured 16 cm × 16 cm and
contained 64 sensor cells in eight rows and eight columns. Measurements were taken at a rate of 50 Hz. Two sensor
matrices were attached to a parallel gripper as shown in figure 30.

Figure 29. A tactile image of a screwdriver handle (on the left) obtained with the tactile sensor (on the right).

Figure 30. An image showing how the sensors were attached to a robotic gripper of a KUKA robotic arm.

Preprocessing and Feature Extraction


To reduce the number of weights that have to be learned, the dimensionality of the input (tactile images) was reduced in the
following ways:
• Principal component analysis was performed on tactile images collected during multiple demonstrations of the same task
• An action (e.g. scraping) was split into phases and only a single weight was learned for each phase
In order to recognize which phase a tactile measurement belongs to, the kernel similarity of tactile images acquired during the
execution of an action is clustered with spectral clustering. Figure 31 shows an example where the action consists of three
phases.

Figure 31. Left: Similarity matrix heat-map. Right: Result of spectral clustering of tactile images from a scraping task. The
three clusters correspond to the three phases of an action. Red cluster: movement towards the table. Yellow cluster: Scraping
along the table surface. Green cluster: moving back to the starting position.

Experiments
Chebotar et al. carried out two experiments:
• An in-hand localization experiment
• A tactile-based manipulation experiment

In-Hand Localization Experiment


The grasping capability of the proposed method was evaluated by letting the robot grasp different objects:
• A hammer
• A screwdriver
• A roll of tape
• A saw
• A spatula
The robot executed 40 grasps per object from random angles and positions. The tactile images that were collected during these
grasps were recorded together with the respective object poses in the gripper. The gripping force was the same in all experiments.
The ability to localize the objects was evaluated with leave-one-out cross-validation. So the model of an object was first
instantiated with 39 grasps while holding out one grasp for testing. This was done with all the pose estimation algorithms
mentioned above.

Tactile-Based Manipulation Experiment


The effectiveness of the method for object manipulation with tactile feedback was tested by letting the robot gently scrape across
a flat surface with a spatula. After the movement was learned from demonstration, two kinds of variations were introduced.
The first variation was elevating the surface by 5 cm with respect to where the robot expected the surface to be. The second
variation was placing a ramp on the table to change the orientation of the surface.

Results
In-Hand Localization Results

Figure 32. Mean absolute errors of the tested in-hand localization methods with leave-one-out cross-validation. ”Pos.”
denotes the position error in centimeters; the orientation error is given in degrees. The localization with the proposed voting
scheme performed best.

Figure 33. Mean absolute errors of the tested in-hand localization methods with leave-ten-out cross-validation. Here, tactile
measurements from 10 grasps were combined to estimate the object pose. The localization with the proposed voting scheme
performed best.

Tactile-Based Manipulation Results

Figure 34. Development of the mean reward values as policy updates are made. Top: scraping task on the elevated surface.
Bottom: scraping task on the ramp.

Strengths and Weaknesses


Strengths:

• The proposed method works with low-cost sensors


• The method can adapt to unexpected changes in the environment
Weaknesses:
• Grasping thin objects like the spatula sometimes caused two opposing tactile sensors to touch each other and to generate
noise
• The localization of objects much larger than the gripper is a problem because only a small object part can be observed at
a time (e.g. the pose estimation of objects with long straight handles like a hammer was often wrong by 180° since the
respective tactile images were similar)

Data-efficient Deep Reinforcement Learning for Dexterous Manipulation13
Authors: Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron,
Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, Martin Riedmiller

Motivation
Deep reinforcement learning has recently been used to solve various problems relating to continuous control. Applying
traditional hand-engineered methods to dexterous manipulation tasks in robotics is very difficult. An example for such a
manipulation task is grasping objects and stacking them on top of each other. The following work investigated the effectiveness
of a deep reinforcement learning algorithm with different proposed adaptations in solving the mentioned manipulation task.

Approach
The learning method this work was based on is the Deep Deterministic Policy Gradient algorithm14 (DDPG), which is a
model-free Q-learning-based method with experience replay.

To make DDPG more data-efficient and scalable, two extensions were implemented:
• Multiple mini-batch replay steps (instead of performing policy updates after each interaction step, a fixed number of
mini-batch updates were performed for each step in the environment; see the sketch after this list)
• Asynchronous DPG (combining multiple instances of an actor and a critic [i.e. ”workers”] that share network parameters
and either share or have independent experience replay buffers to allow parallelization)
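A sketch of the first extension is given below, under assumed env/agent interfaces; names such as agent.update_from_replay are placeholders, not the paper's code.

def run_episode(env, agent, replay_steps=10):
    # DDPG episode with several replay updates per environment step
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = agent.act(state)                    # actor output + exploration noise
        next_state, reward, done = env.step(action)
        agent.buffer.add(state, action, reward, next_state, done)
        for _ in range(replay_steps):                # more than one update per transition
            agent.update_from_replay()               # sampled mini-batch update
        state, total = next_state, total + reward
    return total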

To guide and speed up the learning process, two other methods were implemented:
• Composite rewards (instead of only getting reward at task completion, the system also receives reward for achieving
subgoals to make the reward less sparse)
• Instructive starting states (initializing episodes with states taken from anywhere along or close to desired trajectories)

Experiments
The task on which the proposed methods were tested is that of picking up a Lego brick with a simulated robot arm (with 9
DoF) and stacking it onto a second one in a simulated environment. The simulated robot arm that performed the task was a
close match to the Jaco robot arm from Kinova Robotics. The simulation environment was MuJoCo. The agent had 7.5 seconds
to complete the task. Successful completion of the task yielded a reward of one and zero otherwise.

Figure 35. Simulation rendering of the task in different stages. From left to right: starting state, reaching, grasping (the
StackInHand starting state) and stacking.

The following information was provided to the agent:

• Angles and angular velocities of the six joints and the three fingers of the gripper
• Position and orientation of the two Lego bricks
• Relative distances of the two bricks to the point where the gripper’s fingertips would meet if the fingers were closed

The action the agent could take was continuous and nine-dimensional (velocities of six arm joints and three finger joints).

Figure 36. In the experiments, the full task as well as two sub-tasks in isolation were considered.

Each of the adaptations mentioned in the previous section were tested in the described environment.

Figure 37. Composite rewards for subgoals in the experiment.

Regarding the instructive starting states, two alternative methods to generate them were used in the experiments:
1. Manually defined starting states
• Original starting state with both bricks on the table
• States where the first brick is already grasped by the gripper
2. Randomly initialized starting states along a desired trajectory obtained from a demonstrator

Results

Figure 38. Mean episode return as a function of state transitions in millions of DDPG with multiple mini-batch replay steps
(one worker) on the Grasp (left) and StackInHand (right) task. Number of mini-batch updates per transition: 1 (blue), 5 (green),
10 (red), 20 (yellow) and 40 (purple). Increasing the number of updates improves data-efficiency.

Figure 39. Mean episode return as a function of state transitions in millions of DDPG with multiple mini-batch replay steps
and asynchronicity (16 workers) on the Grasp (left) and StackInHand (right) task. The color-coding is the same as in figure 38.

Figure 40. Results for the four shaping reward functions in figure 37 in combination with different instructive starting states
(blue: both bricks on the table, green: manually defined starting states from figure 36, red: starting states initialized along
solution trajectories). The x-axes represent the total amount of state transitions of all workers combined in millions and the
y-axes represent the mean episode return. Policies with a mean reward over 100 reliably perform the full stacking task from
different starting states.

Figure 41. Success rates of the learned policies on the full task and its subtasks.

Strengths and Weaknesses
Strengths:
• Using multiple agents to collect experience (i.e. parallelization) speeds up learning
• The proposed adaptations of DDPG made it possible to learn robust policies within a number of transitions that is feasible
to collect with a real robot within a few days (or less if multiple robots are used)

Weaknesses:
• Transferring results from the simulation to the real world is a challenge
• The information about the environment that the robot received in the simulation likely requires additional instrumentation
in the real world (e.g. to determine the position of the two Lego bricks)

Deep Spatial Autoencoders for Visuomotor Learning15
Authors: Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, Pieter Abbeel

Motivation
Reinforcement learning is a powerful tool to teach robots new skills. It is a challenge however to find ways to compactly
represent the state of the environment based on high-dimensional sensory input such as camera images. This work focuses on
taking a step towards solving this issue by automating state-space construction by learning state representations from camera
images.

Approach
The approach of Finn et al. can be summarized in a six-step process:
1. Training a linear-Gaussian controller that learns the dynamics of the used robotic system without vision by executing
trajectories with the robot and training the controller to predict the next state based on the current joint configuration and
the current executed movement
2. Collecting an image dataset of the workspace (or using camera images collected with a camera mounted on the robot
during step 1)
3. Training a deep spatial autoencoder using the images from step 2 to obtain a feature encoder that turns RGB images of a
scene into a low-dimensional, meaningful representation
4. Filtering and pruning the features from step 3 and defining the full state with the newly obtained features
5. Defining the cost function using images of the target state (i.e. an image of what the workspace looks like if the task is
accomplished)
6. Training a new controller with vision with the cost function from step 5 and the new states from step 4 to accomplish the
task

Preprocessing and Feature Extraction


Deep Spatial Autoencoder
This section provides more information about step 3 of the approach. The autoencoder used in this work mapped RGB images
to a down-sampled grayscale image. The autoencoder was forced to encode spatial information such as the position of key
features in the image with the help of a spatial softmax activation function.
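The spatial softmax can be sketched as follows: each channel of the last convolutional layer is normalized into a probability map, and its expected pixel coordinates become the feature point. A minimal PyTorch version is shown below; the coordinate range and shapes are illustrative.

import torch

def spatial_softmax(feature_maps):
    # feature_maps: (batch, channels, height, width); each channel becomes a
    # probability map whose expected (x, y) position is the feature point,
    # so one channel encodes the 2D location of one visual feature
    n, c, h, w = feature_maps.shape
    probs = torch.softmax(feature_maps.reshape(n, c, h * w), dim=-1)
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    pos = torch.stack([grid_x.reshape(-1), grid_y.reshape(-1)], dim=-1)  # (h*w, 2)
    return probs @ pos        # (n, c, 2): expected image position per channel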

Figure 42. The architecture of the used autoencder. The first convolutional layer was inintialized with weights from another
neural network trained on the ImageNet dataset. The reconstruction part (not visualized here) consisted of fully-connected
layers of neurons.

Filtering and Feature Pruning


Step 4 of the proposed approach involved pruning and filtering the feature points obtained with the autoencoder. This was
intended to prevent the feature points from becoming meaningless when parts of the scene are occluded temporarily and to
prevent the autoencoder from modelling phenomena that lowered the reconstruction loss but are not important for the task (e.g. lighting).

The following filtering and pruning techniques were used:


• The activations of the spatial softmax were thresholded, meaning that a feature was regarded as present if the
corresponding activation exceeded a set threshold value.

• The positions of the feature points during a trial were filtered with a Kalman filter to have smooth predictions even when
some parts are temporarily not visible (e.g. because they are occluded).
• Features that did not help to represent the task-relevant objects in the scene were pruned from the state. The quality of
each feature was assessed by pruning it from the state and observing the resulting performance. The features that had the
least impact on performance when removed were pruned.

Experiments
Finn et al. evaluated the proposed method on four different tasks:
• Sliding a Lego block 30 cm across a table
• Balancing a small white bag in a spoon and dropping it into a bowl
• Using a spatula to scoop up a bag of rice and putting it into a bowl
• Hanging a loop of rope on a hook

Figure 43. Illustrations of the four experimental tasks. All tasks were carried out with a 7-DoF arm of a PR2 robot. Images
were collected with a consumer RGB camera and the controller ran at 20 Hz. The goal pose of the end-effector and an image of
the goal state were provided to the robot for each task.

There were approximately 50 trials for each task to train the autoencoders. Each trial consisted of 100 image frames and 5
seconds of interaction time resulting in a total of 5000 frames per task. The final vision-based controller was trained with
another 50-75 trials (10-15 minutes of total interaction time).

Results

Figure 44. Average distance to the goal in the Lego task and success rates in the three other tasks. Another controller was
trained with the robot’s configuration without visual input and the performance dropped significantly. This showed that vision
is a key part in achieving the given tasks.

Figure 45. Success rates in the loop hook task with different positions of the hook. Training the system solely with the robot’s
configuration (without vision) leads to a worse performance because the localization of the hook becomes impossible.

Figure 46. Comparisons of different variants of the proposed method and autoencoders from a prior work16 .

Strengths and Weaknesses


Strengths:
• Very sample-efficient (required interaction time is low)
• Can deal with high-dimensional visual input
• Feature points from the autoencoder carry information about the location of features, not just about their presence

Weaknesses:
• The absence of depth and haptic sensing could prove limiting in other object manipulation tasks (e.g. tasks involving
fragile objects)
• Structures or objects that occur multiple times in the scene are not handled gracefully because the representation from the
autoencoder contains exactly one output point per feature channel

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection17
Authors: Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen

Motivation
When humans manipulate objects in their daily lives, the manipulation involves feedback loops. Due to its complex nature,
incorporating continuous visual feedback in robots is challenging. Levine et al. presented a learning-based approach to visual
hand-eye coordination for robotic grasping to address this problem.

Approach
The approach of the authors was data-driven and based on visual feedback. A dataset of over 800,000 grasp attempts with many
different household objects was collected.

Each sample in the collected dataset consisted of the following:


• An image of a scene with objects
• The robot’s pose at the time the image was taken
• The sequence of poses (i.e. the trajectory) until an object was grasped
• A label that expresses whether the grasp was successful or not

Figure 47. The robotic manipulator that was used for the data-collection.

Figure 48. Illustration of multiple robotic manipulators from figure 47 collecting data.

The proposed method learned to control a robotic arm with a gripper to perform successful grasps of objects lying in front of it.
The most promising motor commands were continuously recomputed as the robot arm moved towards an object, allowing the
system to react to unforeseen changes in the environment.

The proposed method consisted of two components:


• A grasp success predictor (a CNN that assesses how likely it is that a given motor command results in a successful grasp)
• A continuous servoing mechanism that uses the CNN to continuously update the robot’s trajectory to maximize the
probability of a successful grasp

Figure 49. Images from the cameras of each robot during the training process. The images show that the system was trained
with different bin locations, lighting conditions, objects and camera poses relative to the robot.

Continuous Servoing
This section describes the second component in more detail. The servoing mechanism used the grasp prediction from the
first component, which received an image of the current scene together with a motor command, to choose motor commands
that most likely lead to a successful grasp. Instead of randomly sampling motor commands and choosing the one that is best
according to the grasp success predictor, a small optimization is performed on the motor command.

The used optimization method is the cross-entropy method (CEM). CEM samples a batch of values at each iteration, fits a
Gaussian distribution to a subset of these samples and samples a new batch of values from the resulting distribution again.
This was done for three iterations in this work to determine the best motor command. An advantage of this approach was
that constraints could be posed on the motor commands that were sampled (e.g. to prevent the robot reaching outside of the
workspace).
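A minimal sketch of that inner optimization is shown below, with score standing in for the grasp success predictor evaluated at the current image; the sample counts are illustrative, while the three iterations follow the paper.

import numpy as np

def cem_best_command(score, dim=5, n_samples=64, n_elite=6, iters=3):
    # cross-entropy method: sample motor commands, keep the elite fraction
    # under the grasp-success predictor, refit a Gaussian and resample
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = np.random.randn(n_samples, dim) * std + mean
        # constraints could be enforced here, e.g. by clipping to the workspace
        elite = samples[np.argsort([score(v) for v in samples])[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean    # the most promising motor command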

Figure 50. Pseudocode of the servoing mechanism. The best motor command is determined with the grasp success predictor
and CEM. The gripper is closed when the predictor estimates that the success probability with no movement is at least 90% of
the success probability with the best motor command v_t*. If the success probability is less than 50% when no movement is
executed in comparison to when v_t* is executed, the gripper is raised off the table since the gripper is then likely to be in a bad
configuration for a successful grasp.

Learning Method
This section describes the learning algorithm that was trained to estimate the success probability of a grasp based on the current
camera observation and a given motor command.

The learning algorithm is a convolutional neural network with the following inputs:
• The current camera image (with the robotic gripper in it)
• An additional image of the scene that was taken before (without the gripper in it) to give the robot an unoccluded view of
the scene
• After several layers: the motor command vector consisting of five values (the 3D translation vector and sine-cosine
encoding of the change in orientation)

Figure 51. Architecture of the described CNN.

Experiments
To evaluate the proposed grasping system, experiments were conducted with objects that were not seen during training.

There were two experimental protocols:


1. Grasping with replacement: Objects were put in a bin in front of the robot and it made 100 grasping attempts, putting
each grasped object back into the bin (this tests grasping objects in cluttered scenes)
2. Grasping without replacement: Same as in the previous protocol, just without putting back the grasped objects
The proposed method was also compared to the following three methods:
1. Open loop (same method but without continuous visual feedback during the grasp)
2. Random grasping
3. A hand-engineered grasping system

Results

Figure 52. Failure rates of different methods with and without replacement. The failure rates during the first 10, 20 and 30
grasp attempts are shown (averaged over four repetitions of the experiment).

Figure 53. Failure rates of the proposed methods for different dataset sizes. M stands for the number of images in the training
set and N stands for the total number of grasp attempts that was averaged over (the first 10, 20 and 30 grasp attempts were
again averaged over four repetitions of the experiment).

(a) Different grasps chosen for objects with similar appearance but different material properties.
(b) Examples of objects that are difficult to grasp due to their shape, weight or translucency.

Figure 54. Grasps for different challenging objects.

Strengths and Weaknesses


Strengths:
• No calibration of the camera pose relative to the robot required
• Method even works with heavy and translucent objects and objects with uncommon shapes
• Method can perceive the material properties of objects and adjust grasps accordingly

Weaknesses:
• Performance is based on a very large dataset that would take several robots and a few months to collect
• All grasp attempts were executed on flat surfaces and the proposed method would therefore probably not generalize well
to other different settings (e.g. shelves)

Learning Robot In-Hand Manipulation with Tactile Features18
Authors: Herke van Hoof, Tucker Hermans, Gerhard Neumann, Jan Peters

Motivation
Many objects have to be held in a certain configuration in order to be used (e.g. a screwdriver). But it is not always possible to
pick up objects in the configuration they have to be in. It is therefore important to have the ability to reposition tools or objects
as they are held in a hand or a gripper. This work focused on learning to manipulate unknown objects inside a robotic gripper.

Approach
Van Hoof et al. proposed a method that directly learned how to manipulate objects in a robotic gripper based on tactile
feedback without knowing the gripper’s dynamics or kinematics. The in-hand manipulation problem was formalized as a
Markov Decision Process (MDP) and an optimal policy was learned with reinforcement learning. The manipulation task that
this work focused on was rolling an object between two of the gripper’s fingers as shown in figure 55.

Figure 55. Illustration of the rolling task in this work. An object has to be rolled from an initial location (a) to a goal location
(c) while maintaining the pressure on the object.

Sensor Types
The robotic gripper used in this work was a three-fingered, underactuated ReFlex robot hand. Each finger had two links with
each link being driven by a single tendon. Furthermore, each finger was equipped with position sensors and nine MEMS
barometers to gather tactile feedback. Only two fingers were used in this work. Two of the fingers could be controlled by a
coupled joint.

Preprocessing and Feature Extraction


Since the underlying framework of the proposed method is reinforcement learning, the state and action spaces have to be defined.

State Space
The state representation was six-dimensional:
• Dimension 1-2: tactile measurements
• Dimension 3-4: proximal joint angles
• Dimension 5-6: distal joint angles
The sensors for the proximal joint angles encoded the angle directly, so no preprocessing was applied. The distal joint angles
were determined by the difference between the respective joint encoders and motor encoders. The tactile measurements were
encoded by applying a pre-defined non-linear function to the measurements of each of the two fingers used in the task. The
result was scaled between 0 and 1.

Action Space
The action space was defined by the joint velocities applied by the robot. The applied velocities could not exceed a pre-defined
maximum.

Learning Method
Reward Function
Keeping in mind that the goal was to roll an object between two fingers to a goal position, the reward function was designed
based on the following aspects (a sketch in code follows the list):
• Punishment for deviating from the desired pressure value
• Punishment for deviating from the desired joint configuration
• Exponentially increasing punishment for the magnitude of the applied velocities
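
A minimal sketch of such a reward function, assuming quadratic penalty terms and illustrative weights; the paper only states the qualitative design.

```python
import numpy as np

def reward(pressure, joints, velocities, target_pressure, target_joints,
           w_p=1.0, w_q=1.0, w_v=0.1):
    """Reward shaped from the three penalty terms listed above (sketch)."""
    pressure_cost = w_p * (pressure - target_pressure) ** 2
    config_cost = w_q * np.sum((joints - target_joints) ** 2)
    # Exponentially increasing punishment for large commanded velocities.
    velocity_cost = w_v * np.sum(np.exp(np.abs(velocities)) - 1.0)
    return -(pressure_cost + config_cost + velocity_cost)
```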

Learning Optimal Control Policies


The in-hand manipulation problem was formalized as an MDP. To solve the MDP, an adaptation of a method called Relative
Entropy Policy Search12 (REPS) was used. This policy search method bounds the magnitude of the policy updates and
therefore leads to smoother and safer updates. The adaptation of REPS used here (NREPS) is non-parametric and does not
require a parametric form of the value function or the policy. The policy is represented as a cost-sensitive Gaussian process.

Experiments
The experiments consisted of several trials. Each trial followed these steps:

1. Policy is initialized
2. Ten iterations of gathering data with the latest policy and updating the policy on it
3. Evaluating how well the resulting policy generalizes to other rollable objects
The objects used had different dimensions to ensure that the controller uses tactile sensing to adapt itself to an object.

Figure 56. The objects used for the experiment. The wooden objects on the left were used during training and the objects on
the right were used for evaluation.

The proposed method was compared to a hand-engineered feedback controller that is not based on learning and therefore has
a constant performance.

Results

Figure 57. Reward and standard deviation averaged across four learning trials. The dashed blue line represents the
hand-engineered approach. The proposed method (red line) performs better than the hand-engineered approach after several
trials of learning. Each iteration consisted of approximately 500 time steps (50 seconds of robot interaction time).

Figure 58. Average rewards and standard deviation obtained with the trained policies on four unseen objects. Except for the
thin vitamin container, the rewards obtained with the unknown objects are almost as high as the reward obtained with the
known objects (the wooden cylinders on the right).

Strengths and Weaknesses


Strengths:
• Method generalized well to previously unseen objects
• Not a lot of interaction time needed
• Method worked with an underactuated system
Weaknesses:
• Method only works with cylindrical objects
• Method does not work well with thin cylindrical objects like the vitamin container in figure 56
• The learned policy was more risk-seeking and had a higher reward variance

Supersizing Self-supervision: Learning to Grasp from
50K Tries and 700 Robot Hours19
Authors: Lerrel Pinto, Abhinav Gupta

Motivation
Most learning-based robotic grasping approaches are based on human-labeled datasets of images of potential grasping locations
on objects. But since an object can be grasped in multiple ways, labeling grasp locations on pictures manually is difficult.
Additionally, manually labeled datasets can induce human bias into the model. An alternative approach would be grasping
based on trial and error, but most trial-and-error grasping datasets are too small to train big machine learning models.

Approach
Pinto and Gupta made the following three contributions:

1. An automated way to collect grasping data with as little human intervention as possible so that collecting trial-and-error
grasping data on a large scale is possible
2. A novel formulation of a convolutional neural network (CNN) that finds suitable grasp configurations in a given image
3. A multi-stage learning approach to collect difficult samples the CNN can be trained on

Data Collection (Contribution 1)


The only human involvement in the data-collection process was switching on the robot and arbitrarily placing objects on a table
in front of it. The robot gathered data via trial-and-error in a clutter of objects by trying to grasp an object at a certain location
and recording whether the grasp was successful or not. With clutters of objects on the table, multiple random trials were executed.

A random trial was executed as follows:


• Region of interest sampling (identifying regions of interest with a Mixture of Gaussians (MOG) background subtraction
algorithm)
• Grasp configuration sampling (moving the robot arm above a given region of interest, choosing a random point in this
region and choosing a random angle at which the point will be approached)
• Grasp execution and annotation (grasping an object and raising it and assessing the success of the grasp based on force
sensor readings)

The resulting dataset contained 50,000 samples collected over 700 hours. Each sample consisted of an image patch with a
grasping location and a label showing whether the grasp at the given location was successful or not.

Figure 59. Overview of the data collection process.

Sensor Types
The proposed method and all experiments were carried out on a Baxter robot from Rethink Robotics. The used gripper was a
two-fingered parallel gripper.
The robot was equipped with these two sensors:
1. A Kinect V2 on the head of the robot
2. A camera attached to each of the robot’s end-effectors (both robot arms were used to collect data more quickly)

Learning Method
Problem Formulation (Contribution 2)
Given an image patch, the task was to find a successful grasp configuration in it. A grasp configuration consisted of the
position of the grasping point on the table (x and y coordinates) and a grasping angle ranging in 18 steps of 10◦ from 0◦ to
170◦. The prediction of the grasp configuration was done by a CNN, whose output consisted of 18 values corresponding to
graspability scores for each angle.

At test time, grasp locations were sampled from a new image and the respective image patches were fed into the CNN. For each
patch, the chosen grasping angle was the one with the best graspability score.
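
The test-time selection could be sketched as follows, assuming a cnn callable that returns the 18 per-angle scores and a hypothetical extract_patch helper; neither name comes from the paper.

```python
import numpy as np

def best_grasp(cnn, image, patch_centers, extract_patch):
    """Pick the (location, angle) pair with the highest graspability score."""
    best = None
    for (x, y) in patch_centers:
        scores = cnn(extract_patch(image, x, y))  # shape (18,), one per bin
        k = int(np.argmax(scores))
        if best is None or scores[k] > best[0]:
            best = (scores[k], x, y, k * 10)      # grasp angle in degrees
    return best  # (score, x, y, angle)
```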

Figure 60. Architecture of the used CNN. The first few layers were initialized with weights from AlexNet, a network trained
on the ImageNet dataset.

Training Approach (Contribution 3)


To increase the amount of data the network can be trained with, the image patches were rotated by a random angle and were
added to the original dataset (the corresponding angles in the labels of the image patches were adjusted accordingly).
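
The rotation augmentation might look like the sketch below; the uniform sampling over angle bins and the sign convention of the label shift are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(patch, angle_bin, n_bins=18, bin_deg=10):
    """Rotate a patch and shift its grasp-angle label accordingly (sketch)."""
    shift = np.random.randint(n_bins)
    rotated = rotate(patch, shift * bin_deg, reshape=False)
    # The labeled grasp angle rotates together with the image patch.
    new_bin = (angle_bin + shift) % n_bins
    return rotated, new_bin
```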

Staged Learning
The training process consisted of multiple trials. After being trained with data obtained from the random trials described above, the
updated network was used to collect more data in subsequent trials. This was done iteratively as the network was updated on
newly collected data.

Figure 61. Examples of patches the CNN was trained on.

Experiments
The proposed method was tested with previously seen and unseen objects arranged in a clutter. Additionally, other methods
were compared to the proposed approach.

The following methods were used for the comparison:

• A heuristic method that makes the predictions based on set rules (e.g. grasping at the smallest object width, not grasping
objects that are too thin)
• The k Nearest Neighbors algorithm with Histograms of Gradients (HoGs) as features
• A linear SVM with HoGs as features

Results

Figure 62. Performance of different approaches on the generated dataset.

Figure 63. Performance of the learning algorithm with different training set sizes.

Strengths and Weaknesses


Strengths:
• Proposed method generalized well to novel objects
• Method also works with objects in a clutter
• Proposed data collection strategy barely needs human intervention
• Generated dataset does not include manually labeled samples and therefore was not influenced by human bias
Weaknesses:
• The used learning algorithm needs a big dataset in order to perform well

Supervision via Competition: Robot Adversaries for Learning Tasks20
Authors: Lerrel Pinto, James Davidson, Abhinav Gupta

Motivation
Data-driven learning is becoming more common in the robotics domain to help robots acquire new skills such as grasping
objects. Most data-driven approaches require large amounts of data to perform well, which gave rise to self-supervised
approaches where sensors, rather than human supervision, are used to determine whether a robotic grasp was successful or not
so that robots can autonomously collect large amounts of training data. But these sensors do not differentiate between a grasp
that holds an object safely and a grasp that is barely good enough to keep the grasped object from slipping out of the gripper
because it was grasped at a bad location.

Approach
The idea presented in this paper is that training a robotic system to remove an object out of another robotic gripper, if it did not
grasp the object safely enough, leads to improved grasping performance since grasps at locations that do not hold an object
firmly in place are penalized. The proposed framework consisted of a protagonist (a robotic system learning to grasp objects)
and an adversary (another robotic system trying to remove grasped objects from the protagonist's gripper).

Figure 64. The adversary trying to snatch a grasped object from the protagonist.

Two adversarial frameworks were created:


• Shaking the grasped object (not done by a separate robot)
• Snatching the grasped object (done by a separate robot)

Figure 65. Left: Shaking of the grasped object. Center and right: Pulling and pushing of a separate robot arm to snatch the
grasped object.

Given an image patch, the protagonist (a neural network) predicted the success probability of a grasp at the patch center for 18
different angles (0◦ to 170◦ in 10◦ steps). The adversary was a neural network as well. Given the image patch of the location

where the protagonist grasped the object, the adversary predicted an action that maximized the likelihood of removing the
object from the protagonist's gripper. The possible predicted actions of the two types of adversaries are shown in figures 66
and 67.

Figure 66. The shaking adversary could perform 15 different actions (three different shaking directions with five different
configurations). The five different shaking configurations are shown here. The frequency and amplitude of the shaking were
constant for all shaking actions.

Figure 67. The snatching adversary could perform 36 different actions (choosing between nine grasping locations and
grasping angles). The 36 different possible grasps are shown here, where the green lines represent the two plates of the gripper.

Figure 68. Illustration of the framework with the shaking adversary.

Learning Method
The protagonist and two different adversaries were all convolutional neural networks that shared the same architecture. Both
adversaries were trained to remove an object in the protagonist’s gripper. The protagonist was trained to grasp an object in a
manner that makes it as hard as possible for the adversaries to shake it off or snatch it.

Figure 69. The network architecture of the protagonist and its adversaries. Note that the protagonist and the adversaries
shared the same architecture, but the output layer was different due to the different tasks each of the networks learned. The
output was scaled between 0 and 1 by sigmoid activation functions. The weights of the convolutional layers were initialized
with the weights of a network trained on the ImageNet dataset.

Experiments
The performance of the proposed method was tested with 10 different objects. The method was tested with the two types of
adversaries and without any adversaries. All experiments were run on a Baxter robot with parallel jaw grippers.

Figure 70. The 10 objects that were used for the experiment. Note that they have different properties (soft, hard, translucent,
opaque, light, heavy etc.).

The authors collected 9k grasp samples with the shaking adversary (2.4k successful grasps, out of which 0.5k were made
unsuccessful by the adversary). With the snatching adversary, 2k grasp samples were collected (with 0.7k successful grasps, out
of which 0.2k were made unsuccessful by the adversary). The baseline model was trained with 56k training examples without
any adversaries.

Results

Figure 71. Grasping success rates out of 10 tries with a low gripping force of 7 N (20% of the maximum gripping force). 128
grasping patches were sampled from the input image. The training with adversaries shows improvements of the grasping
performance.

Figure 72. Grasping success rates out of 10 tries with a high gripping force of 35 N (maximum gripping force) and an attached
rubber surface to get more grip. 1280 grasping patches (10 times more than in figure 71) were sampled from the input image. It
can be seen that the adversaries lead to a better performance even though the corresponding models were trained with less data.

Strengths and Weaknesses


Strengths:
• Training with adversaries forces a robot to learn robust grasps
• Method can differentiate between robust grasps and unstable grasps
• Training with adversaries is more data-efficient
Weaknesses:
• Method might not work well with delicate objects due to a lack of tactile feedback and high force exertion

Learning to Grasp without Seeing21
Authors: Adithyavairavan Murali, Yin Li, Dhiraj Gandhi, Abhinav Gupta

Motivation
In robots, most grasping approaches are vision-based. But it is hard to correct an already executed grasp based on vision, since
the gripper occludes parts of the grasped object. A sense of touch could play a vital role in such scenarios. In fact, humans are
able to grasp objects solely based on their sense of touch (e.g. fumbling in the dark for a phone on a nightstand). Effectively
utilizing haptic information could improve grasping performance in robots and help them adjust their grasps.

Approach
The approach presented here was solely based on haptic feedback without any assistance by vision. The authors created a big
grasping dataset to train deep neural networks to extract haptic features, estimate grasp stability and adjust the grasp. The
authors used a three-fingered robotic gripper for their implementation on a robot.

The presented approach consisted of two components:


1. A touch localization model (scans the workspace with its sense of touch, uses a particle filter to estimate the location of
an object and executes an initial grasp)
2. A re-grasping model (estimates grasp stability and learns to improve grasps based on tactile feedback and learned tactile
features)

The re-grasping model is used iteratively until the estimated grasp stability is high enough.

Figure 73. Grasping and regrasping framework.

Figure 74. Overview of the whole approach. The separate components are explained in subsequent sections.

Dataset
The collected dataset contains more than 2.8 million tactile samples from 7,800 grasps and 52 objects in a planar setting.

Every grasp interaction in the dataset lasted 3.5-4 seconds and was recorded with the following data:

• Four RGB frames (initial scene, before, during and after grasp execution) with a resolution of 1280 × 960
• Haptic measurements from force sensors mounted on the three fingers of the used gripper containing force magnitudes
and directions (sample rate: 100 Hz)
• Grasping actions and labels (position and orientation of the gripper during the initial grasp and the re-grasps)
• Material properties (a label corresponding to one of seven material characteristics)

Touch Localization and Initial Grasping (Component 1)


Component 1 estimated the location of an object lying in the workspace based on touch and executed an initial grasp. One
finger of the gripper was used to scan a fixed 2D plane in the workspace by moving around in straight lines. When the signal
from the force sensors exceeded a set threshold, contact with an object was detected. A particle filter was used to infer the
object's location in the scanned 2D plane. At the end of the scanning process, the centroid of the sampled particles was taken
as the estimate of the object's location.
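
A toy version of such a touch-based localization filter is sketched below; the particle count, noise scales and the simple proximity-based measurement model are assumptions.

```python
import numpy as np

def touch_localization(contacts, workspace, n_particles=1000, noise=0.005):
    """Particle filter over a 2D object location (illustrative sketch).

    contacts: (x, y) points where the force threshold was exceeded.
    workspace: ((x_min, x_max), (y_min, y_max)) bounds of the scanned plane.
    """
    (x_min, x_max), (y_min, y_max) = workspace
    particles = np.column_stack([
        np.random.uniform(x_min, x_max, n_particles),
        np.random.uniform(y_min, y_max, n_particles),
    ])
    for c in contacts:
        # Weight particles by proximity to the detected contact point.
        d2 = np.sum((particles - np.asarray(c)) ** 2, axis=1)
        w = np.exp(-d2 / (2 * 0.02 ** 2))  # assumed measurement noise scale
        w /= w.sum()
        idx = np.random.choice(n_particles, n_particles, p=w)  # resampling
        particles = particles[idx] + np.random.normal(0, noise, (n_particles, 2))
    return particles.mean(axis=0)  # centroid = estimated object location
```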

After the location estimation, the first grasp was performed. The grasp was executed at the given 2D location and the other
grasp parameters (the coordinate of the third dimension and the grasping angle) were randomly chosen.

Sensor Types
The approach was implemented on a research edition of a Fetch mobile manipulator equipped with a three-fingered adaptive
gripper from the company Robotiq.

The robot was equipped with two sensors for the data-collection process:

• A 3-axis Optoforce sensor on each finger for haptic measurements


• A PrimeSense Carmine 1.09 short-range RGB-D camera on the robot’s head to acquire visual data

Learning Method
This section provides more information about the second component. Given the object localization and the randomly selected
first grasp from the first component, the second component extracted features from the raw haptic data and used them to predict
grasp stability in order to judge whether and how to adjust the grasp.

Haptic Feature Learning
The raw haptic signals were a time series of low-dimensional vectors, where each vector corresponded to haptic measurements
at a single time step. An autoencoder was used to obtain a low-dimensional representation of a sequence of haptic data. Both
the encoder and decoder had an LSTM-based recurrent architecture.
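
A minimal PyTorch sketch of an LSTM encoder-decoder of this kind follows; the dimensions are assumptions, since the exact sizes are not reproduced here.

```python
import torch.nn as nn

class HapticAutoencoder(nn.Module):
    """LSTM encoder-decoder over haptic time series (illustrative sketch)."""
    def __init__(self, input_dim=6, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):                 # x: (batch, time, input_dim)
        _, (h, c) = self.encoder(x)       # h holds the latent haptic features
        # Decode from the latent state; feeding the input back in keeps
        # this sketch simple (a form of teacher forcing).
        y, _ = self.decoder(x, (h, c))
        return self.out(y), h.squeeze(0)  # reconstruction and features
```

Training would then minimize the reconstruction error between the decoded sequence and the raw haptic input.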

Figure 75. Architecture of the used autoencoder. The hidden state H contained the extracted latent features.

Learning to Re-grasp
The encoded haptic features from the autoencoder were used by another neural network to predict a corrective action that
improves the grasp stability.

Figure 76. Architecture of the neural network used for re-grasping. The output consisted of the suggested changes to the
grasp regarding the location (∆x, ∆y, ∆z) and orientation (∆θ). The output space was discretized into five bins for each control
dimension. The grasp adjustments were done iteratively until the predicted grasp stability exceeded a certain threshold.

The grasp stability was estimated by another feedforward neural network that predicted the grasp stability (the probability that
the given grasp is successful) based on the haptic encoding from the autoencoder. This network consisted of five fully-connected
layers with 512, 512, 256, 128, and 64 neurons respectively (with a sigmoid unit as the output layer).
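
With the layer sizes stated above, the stability estimator can be written down directly; only the input dimension (the size of the haptic encoding, assumed to be 64 here) is an assumption.

```python
import torch.nn as nn

# Five fully-connected layers as stated above, with a sigmoid output unit
# that yields the probability of a successful grasp.
stability_net = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
```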

Experiments
Separate experiments were conducted to test the quality of the learned haptic features and the tactile-based grasping framework.

Experiments with the learned haptic features:


• Material recognition (using the learned tactile features to classify materials into seven categories)
• Grasp stability estimation (using the learned features to predict whether the corresponding grasp will be successful)
Experiments with the tactile-based grasping framework:
• Re-grasping objects at a given location
• Searching for objects with unknown location and using the re-grasping framework to grasp them
• Combining a vision- and tactile-based approach

Figure 77. Objects from the training set and the two different test sets that were used in some experiments. Note the different
material characteristics and shapes of the objects.

Results

Figure 78. Accuracies of a deep network and an SVM trained with haptic features from different methods to classify
materials. The models were trained on 80% of the collected data and tested on the remaining 20%. The deep network trained
with the learned features from the autoencoder performed best.

Figure 79. Accuracies of a deep network and an SVM trained with the same features as in figure 78 to predict grasp stability.
The models were tested with 580 grasping trials on 20 unseen objects. Again, the deep network trained with the haptic features
from the proposed approach performed best.

Figure 80. Grasping success rates without re-grasping, with random re-grasping and with the proposed re-grasping method on
the two test sets shown in figure 77. The object location was known. The proposed approach performed best on all test sets.

Figure 81. Grasping accuracies with the proposed object localization and the proposed re-grasping method, the proposed
object localization without re-grasping, vision-based grasping and vision-based grasping in combination with the proposed
re-grasping method. Combining the vision-based and tactile-based approach yielded the best results.

Strengths and Weaknesses


Strengths:
• Provides a way to improve the grasp on an object
• Provides the possibility to grasp objects without relying on (or having) a sense of vision
Weaknesses:
• Some regions of the robotic fingers were not covered with tactile sensors and might have come into contact with objects
and pushed them around, which affected the grasping performance
• The initial grasp of the proposed method is random, which is suboptimal

Learning to Push by Grasping: Using Multiple Tasks
for Effective Learning22
Authors: Lerrel Pinto, Abhinav Gupta

Motivation
End-to-end learning approaches are becoming more commonly used in the field of robot control. A major drawback of these
methods is that they require huge amounts of data. This might be mitigated by training an end-to-end learning algorithm to
perform multiple related tasks instead of just a single specific task.

Approach
To answer the question of whether multi-task learning is better than learning a single task, Pinto and Gupta considered three
related manipulation tasks. A neural network was then trained to perform all of the considered tasks.

Considered tasks:
• Planar grasping (figure 82)
• Planar pushing (figure 83)
• Poking (figure 84)

For planar grasping, a dataset from previous work was used. This dataset contained image patches of objects, corresponding
grasps (defined by two coordinates and the orientation of the gripper) and labels that express whether the corresponding grasp
was successful or not. Given an image patch, the task was to output a grasp configuration (grasp center and discretized grasp
orientation) that would produce a successful grasp of an object in the image patch.

For the pushing task, another previously collected dataset was used. Each data-point in this dataset consisted of an image of an
object before it was pushed, the pushing action (the coordinates of where the pushing action started and where it ended) and an
image of the object after it was pushed. The task here was to predict the pushing action based on the two images showing the
scene before and after the push.

Yet another previously acquired dataset was used for the poking task. This dataset contained images of objects and the tactile
force (defined by two parameters) measured when the object was poked. The task was to predict the tactile measurements given
the image of the object.

Figure 82. Examples for successful and unsuccessful grasps from the dataset used to train the used learning algorithm on the
grasping task.

Figure 83. Four examples of push actions. In each example, the left image shows the object before the push, the image in the
middle shows the pushing action and the right image shows the object after the push.

Figure 84. Samples from the poking dataset. In each sample, the left image shows the object and the right image the
measured tactile response.

Learning Method
The input to the used neural network was an RGB image. Therefore, the images of all datasets had to be resized to have the
same dimensions.

Figure 85. Architecture of the shared network. The convolutional layers were shared by all task-specific parts of the network.
The features from the convolutional layers were then fed into separate network parts, each of which computed the output for a
particular task (grasping, pushing or poking).

During each iteration of the training process, a random batch of fixed size was obtained from each of the task-specific training
sets. The shared layers were updated with a version of gradient descent with respect to the accumulated loss over all three
tasks. The task-specific layers were only updated with respect to the loss of the corresponding task. So the shared layers were
trained on data from all datasets, while the unshared layers were only updated with data from the training set they computed
the output for.
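
One such update step might be sketched as follows, assuming a PyTorch-style setup in which the optimizer holds the parameters of the shared trunk and of all task heads.

```python
def multitask_step(shared, heads, batches, losses, optimizer):
    """One gradient step over the shared trunk and task heads (sketch).

    heads, batches and losses are dicts keyed by task name
    ('grasp', 'push', 'poke'); the names are illustrative.
    """
    optimizer.zero_grad()
    total = 0.0
    for task, (x, y) in batches.items():
        features = shared(x)          # shared layers see data from every task
        pred = heads[task](features)  # each head sees only its own task
        total = total + losses[task](pred, y)
    # Backpropagating the accumulated loss updates the shared layers with
    # gradients from all tasks, and each head only from its own loss.
    total.backward()
    optimizer.step()
```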

Experiments
Three experiments were conducted:

1. Performance comparison with the multi-task training (only pushing and grasping were considered) and single-task
training
2. Performance comparison with different training set ratios (only pushing and grasping were considered)
3. Measuring performance of the network when training it to perform all three tasks (grasping, pushing and poking)

Results

Figure 86. Results of the first experiment. Once the number of training samples crossed a certain threshold, training the
model with a training set where half of the data came from another related task led to a lower error.

Figure 87. Results of the second experiment. The size of the part of the training set that contained data from the other task
was varied and the resulting error was recorded. The error remained low even when a large percentage of the training data
came from the other task.

Figure 88. Performance of a model that was trained on all three tasks (with varying training set ratios). There were
improvements when the model was trained on the poking task as well.

Strengths and Weaknesses


Strengths:

• Multi-task learning can improve performance (even if total amount of training data is the same)
Weaknesses:

• Datasets of related tasks are not always available and could be difficult to acquire

Grasping Virtual Fish: A Step Towards Robotic Deep
Learning from Demonstration in Virtual Reality23
Authors: Jonatan S. Dyrstad, John Reidar Mathiassen

Motivation
Virtual reality is an intuitive environment for humans and would therefore be a good environment to teach robots by human
demonstration. This idea can be used to teach robots how to grasp difficult objects like fish, which is an unsolved but relevant
problem.

Approach
Dyrstad and Mathiassen implemented the mentioned idea to gather training examples from human demonstrations in a virtual
environment to train a neural network to grasp fish.

The approach of this paper consisted of three main components:


• A virtual reality (VR) interface to demonstrate to a robot how to grasp fish
• Domain randomization (generating many training examples from a low number of human demonstrations)
• A deep neural network that was trained with data from the second component

Figure 89. Illustration of the proposed approach. a) A person controlling a virtual robotic gripper provided a few examples on
how to grasp virtual fish lying in a box. b) Then, domain randomization generated a large dataset based on the few provided
grasping examples and c) a deep neural network was trained with the generated data, which included virtual 3D images of a
box with one or multiple fish and successful and unsuccessful grasp configurations. After training, d) the trained neural
network controlled a virtual gripper to grasp virtual fish.

Preprocessing
This section describes how the grasps from the human demonstrations were preprocessed with so-called domain randomization
to generate a large training set of grasps for the neural network.

The key idea here is that each of the grasps from the human demonstrations corresponded to a certain fish (the one that was
grasped). Apart from its 3D model, a fish was also defined by its position and orientation in the virtual world.

To generate more training data, the following parameters of the grasped fish from the human demonstration were randomly
changed:
• Position
• Orientation
• Amount of fish

For each of these changes, the corresponding successful grasp was computed by changing the grasp configuration from the
human demonstration accordingly (e.g. by shifting or rotating). The grasps that would lead to collisions were discarded. In

addition, the position and orientation of the virtual depth camera that observed the scene were also randomly changed.

The explained process was used to generate more positive grasping examples from the human demonstrations. Negative
examples were generated by randomly sampling sections of the scene with random grasps and labeling them as unsuccessful.
The whole domain randomization process generated 76,000 training samples from a few dozen human demonstrations.
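
A simplified planar version of this randomization for a single demonstration is sketched below; the real system worked with full 6-DoF poses and additionally varied the number of fish and the virtual camera pose, and the parameter ranges here are assumptions.

```python
import numpy as np

def randomize_grasp(fish_pose, grasp_pose, n=100, pos_range=0.1):
    """Generate perturbed (fish, grasp) pairs from one demonstration.

    Poses are (x, y, yaw) tuples in this planar sketch.
    """
    samples = []
    for _ in range(n):
        dx, dy = np.random.uniform(-pos_range, pos_range, 2)
        dyaw = np.random.uniform(-np.pi, np.pi)
        new_fish = (fish_pose[0] + dx, fish_pose[1] + dy, fish_pose[2] + dyaw)
        # Shift and rotate the demonstrated grasp together with the fish so
        # that it stays correct relative to the object.
        gx, gy = grasp_pose[0] - fish_pose[0], grasp_pose[1] - fish_pose[1]
        c, s = np.cos(dyaw), np.sin(dyaw)
        new_grasp = (new_fish[0] + c * gx - s * gy,
                     new_fish[1] + s * gx + c * gy,
                     grasp_pose[2] + dyaw)
        samples.append((new_fish, new_grasp))
    return samples  # collision checking of each grasp would follow here
```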

Learning Method
The generated 3D images and grasps were used to train a convolutional neural network. The input to this network was a
three-dimensional image of the virtual scene. The output consisted of a grasp configuration (position and orientation of the
gripper in the 3D image) and an estimate of how likely the output grasp is to be successful.

Figure 90. Architecture of the used convolutional neural network. The input was a depth image and the output was a
suggested gripper position and orientation together with the corresponding grasp success probability.

Experiments
The proposed method was tested in real-time. A grasping trial was considered to be unsuccessful if the gripper did not reach
the suggested grasping point because it collided with something in the environment (e.g. the box) or if the fish could not be
held when the gripper was closed. If multiple grasps were detected in an image, the one with the largest estimated success
probability was chosen.

Results

Figure 91. Grasping success rates with 100 grasp attempts and a varying number of fish in the box. The trained network could
handle different amounts of fish almost equally well, indicating that it can process cluttered scenarios with many overlaps well.

Strengths and Weaknesses


Strengths:
• A virtual environment can make it easier to demonstrate tasks to a robot
• The proposed domain randomization generated a big dataset from just a few human demonstrations
Weaknesses:

• Despite the use of special fish simulation software, accurate simulation of the material properties of fish is not guaranteed
• It is not certain that the simulated results can be transferred to a real robotic system

Fast Object Learning and Dual-arm Coordination for
Cluttered Stowing, Picking, and Packing24
Authors: Max Schwarz, Christian Lenz, Germán Martín García, Seongyong Koo, Arul Selvam
Periyasamy, Michael Schreiber, Sven Behnke

Motivation
Common warehouse tasks like picking up items or stowing them in boxes could be automated with robots. However, this is
very challenging to implement.

Approach
The authors of the paper implemented a way to automate warehouse tasks for the Amazon Robotics Challenge. The two tasks
that the robot had to be able to accomplish were stowing newly arrived items in a storage system (the stow task) and picking
specific items from the storage and placing them in a box (the pick task).

Figure 92. CAD model of the system. The system consisted of two 6-DoF Universal Robots UR5 robot arms. Each arm was
equipped with an end-effector with a suction finger and a 2-DoF finger. The suction was created by the vacuum cleaners.

Figure 93. Setup of the system from above. The storage boxes are in gray. Left: Setup during the stow task with a box in the
middle (red) from which items should be stowed in the storage. Right: Setup during the pick task with boxes (orange) where
the target items from the storage should be placed.

To analyze and pick objects lying in a box, an object perception pipeline consisting of the following components was used:
• Item models (Models were obtained by observing the item from multiple views in different orientations. See figure 94.)
• Semantic segmentation of scenes of item clutters with a neural network
• Scene synthesis (New scenes were generated with the item views from the first component to train the neural network
that performs semantic segmentation. See figure 95.)

• Heuristic grasp selection (Using the item contours obtained with semantic segmentation to select a grasp based on
pre-defined rules)
• Generation of a clutter graph to facilitate planning (see figure 96)
• Object pose estimation with a convolutional neural network
The heuristics based on which a grasp was selected depended on the weight of an item. For lightweight items, the location at
which an item was grasped with the suction gripper was the point with the largest distance to the item boundaries given by the
semantic segmentation network. For heavy items, the grasping point was the 2D center of mass.

Figure 94. Generated views of an item from multiple perspectives. An item was placed on a turntable and images were taken
while it was spun to observe the item at different orientations. The background was subtracted in each image.

Figure 95. Examples of scenes generated with the annotated background frame in the left column. Top: RGB image of the
scene. Bottom: Color-coded generated semantic segmentation ground truth. Item images from the turntable setup were
randomly added to the background to generate new scenes the semantic segmentation network was trained with.

Figure 96. The clutter graph of the example scene in figure 97. The bottom half is not shown and only the items on top of the
pile can be seen. The vertices contain the name of the detected object and the detection confidence. The clutter graph was a
representation of how the objects in the scene were arranged, which facilitates planning which object to pick.

Figure 97. Object perception example during the picking phase. Left: RGB image of the scene. Center: Segmentation output.
Right: Item contours with detection confidences, centers of mass (small points) and suction spots (large points).

During the picking task, the perception pipeline was triggered, which also identified the target items and their position in the
box. The number of items occluding the target items (obtained from the clutter graph) was considered to decide which target
items to pick first and which items have to be moved out of the way in order to reach an occluded target item.

During the stowing task, the perception pipeline detected the objects that were likely to be on top of the pile of items to be
stowed; these items were picked up and stowed first.

Learning Method
Semantic Segmentation
The neural network that performed the semantic segmentation was a specialized semantic segmentation network called
RefineNet25, which was proposed by Lin et al. The network received an image of the scene as input and produced the semantic
segmentation as output. Training examples can be seen in figure 95.

Object Pose Estimation


Since some objects are only graspable at specific grasp configurations, a 6D pose estimation module was developed, which
allowed specifying grasps relative to an item frame. The network predicted the pose of an object from a single view. The output
was the 3D object orientation relative to the camera frame in the form of a unit quaternion and the origin of the object frame in
2D pixel coordinates.

To generate training examples for the network, item views from the turntable setup were placed on randomly cropped storage
box scenes.

Figure 98. Architecture of the pose estimation network. The first layers are from RefineNet; their output was processed by
three convolutional layers and two fully-connected layers. The output was the rotation as a unit quaternion (qx, qy, qz and qw)
and the pixel coordinates of the origin of the object frame (x and y).

Experiments
Four experiments were conducted:
• The approach was tested in the Amazon Robotics Challenge 2017
• Evaluation of the semantic segmentation performance with the synthesized scenes
• Evaluation of the pose estimation performance
• Assessment of the speedup obtained by using two robot arms instead of one for the pick and stow tasks (in simulation)

Results

Figure 99. Timings and success rates for different tasks in the robotics challenge. The team reached second place in the
challenge for both the pick task and the stow task.

Figure 100. Results of the segmentation experiment. Left: Number of processed scenes during the training process depending
on the number of used GPUs. Right: Segmentation performance given by the intersection over union metric with different
training data (isolated scenes with only one object, synthesized scenes, data provided by Amazon and a combination of the
synthesized scenes and data from Amazon).

Figure 101. Pose estimation results with three different items.

Figure 102. Average run time for both tasks with one (red) and two (blue) robot arms.

Strengths and Weaknesses


Strengths:
• Perception pipeline can be quickly adapted to new objects
• System can coordinate two robotic arms sharing the same workspace and handle possible conflicts
Weaknesses:
• Using a second robot arm results in a speedup factor of only 1.2-1.3 and might not be worth the cost

Dex-Net 2.0: Deep Learning to Plan Robust Grasps with
Synthetic Point Clouds and Analytic Grasp Metrics26
Authors: Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu,
Juan Aparicio Ojea, Ken Goldberg

Motivation
Many robotic systems are equipped with a depth camera. It is however a big challenge to process depth images to estimate
what grasps would be successful in picking up an object that is in the field of view.

Approach
The approach of Mahler et al. had three components:
• Creation of a big dataset of grasps of virtual objects based on a previously collected dataset
• Training a convolutional neural network with the new dataset to estimate the success probability of a given grasp in a
given scene
• A grasp planning method that ranks grasps with the trained neural network to choose the one that is most likely to be
successful

Figure 103. Illustration of the proposed approach.

Dataset Generation
The dataset was generated with 1,500 3D models of objects from another dataset called Dex-Net 1.0 and was therefore
called Dexterity Network (Dex-Net) 2.0. Each model was labeled with the success or failure of up to 100 simulated parallel-jaw
grasps at different locations on the object lying on a table.

Each sample in the dataset contained:


• A 3D model of an object
• A configuration of a parallel-jaw grasp (the center in 3D coordinates and the orientation in the table plane)
• A binary label indicating the success or failure of the given grasp
• A rendered point cloud of the object in a stable pose, which simulated the observations of a depth camera

Figure 104. Illustration of the dataset generation process.

Preprocessing
The generated dataset was preprocessed in three ways (sketched in code after the list):
• Augmenting the data by rotating each image by 180◦ , since this would lead to equivalent results
• Adding image noise to each image
• Normalizing the input data by subtracting the mean and dividing by the standard deviation
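
These three steps might be sketched as follows; the noise standard deviation is an assumed value.

```python
import numpy as np

def preprocess(depth_images, noise_std=0.005):
    """Augment and normalize a batch of (N, H, W) depth images (sketch)."""
    rotated = np.rot90(depth_images, k=2, axes=(1, 2))  # 180-degree rotation
    data = np.concatenate([depth_images, rotated], axis=0)
    data = data + np.random.normal(0.0, noise_std, data.shape)  # image noise
    return (data - data.mean()) / data.std()  # zero mean, unit variance
```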

Learning Method
The generated data was used to train a convolutional neural network to evaluate how likely it is that a given grasp will be
successful. The input consisted of the gripper position in the third dimension (the z-axis; the x- and y-axes represented the
table surface) and a depth image centered at the grasp center in pixel coordinates. The image was aligned with the gripper
orientation, removing the need to learn rotational invariances.

Figure 105. Architecture of the used convolutional neural network. With the given input, the grasp quality (i.e. success
probability) was computed.

Experiments
The classification performance was evaluated on both real and other synthetic datasets.

Figure 106. Left: Experimental setup for the physical experiments. Top right: A set of eight objects with very difficult shapes.
Bottom right: 10 household objects with varying shapes and material properties.

Performance metrics:

• Success rate (percentage of grasps that were able to lift and hold an object after shaking)
• Precision (success rate of grasps with an estimated success probability of more than 50%)
• Robust grasp rate (percentage of executed grasps with an estimated quality of more than 50%)
• Planning time (the time it took to process an image and output a grasp)

To compare the proposed grasp classification approach, the following methods were used:
• Image-based Grasp Quality Metrics - IGQ (detecting object boundary via edge detection and comparing grasp candidates
based on the distance between the gripper center and the centroid of the object boundary)
• Point-Cloud Registration - REG (a state-of-the-art method for precomputed grasps)
• Support Vector Machine - SVM (with a Histogram of Oriented Gradients as features)
• Random Forest - RF (normalized raw images and gripper depths as input)

Results


Figure 107. ROC curve and validation accuracies of versions of the proposed method and other machine learning algorithms
on the synthetic adversarial set in figure 106.

Figure 108. Performance of the different grasp planning methods on the adversarial set of objects in figure 106. The
adversarial objects were seen during training.

Figure 109. Performance of the different grasp planning methods on a set of novel objects (the test set in figure 106).

Strengths and Weaknesses


Strengths:

• Computing time of the method is comparatively short


• High grasp success rates on both known and unknown objects
Weaknesses:

• Objects in the train and test set were chosen according to pre-defined criteria (weight and size), which would exclude
many objects that can be found in the real world
• Depth sensing was sometimes too inaccurate to measure thin parts of an object, sometimes leading to grasp failure

Deep Learning a Grasp Function for Grasping under
Gripper Pose Uncertainty27
Authors: Edward Johns, Stefan Leutenegger, Andrew J. Davison

Motivation
A commonly researched subject is that of computing a pose of a robotic gripper based on a given image of an object that has to
be grasped. However, it is difficult in practice to exactly align the robotic gripper with a given pose due to noisy measurements
from joint encoders, deformation of kinematic links or inaccurate calibration between the robot and the camera.

Approach
Instead of computing a single best grasp pose from an image, the method of Johns et al. first computed a score for every
possible grasp configuration. This was done by a convolutional neural network (CNN). The resulting grasp function was then
convolved with another function expressing the uncertainty about the gripper’s position to get a more robust estimate. The
training data for the CNN was generated in simulation.

Figure 110. Illustration of a grasp function being convolved with the gripper pose uncertainty to yield a robust grasp function.

Figure 111. Visualization of the setup. The task was to grasp a single, isolated object based on a depth image of the object
and to return an optimal gripper pose. The pose was defined by two image coordinates and the orientation angle.

Figure 112. Example of a grasp simulation. The used gripper type in this work was a parallel-jaw gripper. Grasps were
executed for every discrete pose at a constant height. If the grasped object could be lifted by 20 cm, the grasp was considered to
be successful. The 3D models were taken from an available database called ModelNet.

Grasp Execution
The grasp execution had the following steps (steps 2 to 4 are sketched in code after the list):
1. Processing a depth image of an object with the CNN to predict the quality of each grasp configuration (see figure 113)
2. Convolving the grasp function from step 1 with a Gaussian kernel in three dimensions (position and orientation), where
the covariance matrix represented the gripper pose uncertainty
3. Interpolating the resulting distribution over the pose space
4. Sending the gripper to the maximum (i.e. best pose) of the final distribution from step 3
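
Steps 2 to 4 are sketched below, substituting a separable Gaussian filter for the full 3D convolution and omitting the sub-pixel interpolation of step 3; the sigma values are assumptions, and the angle axis wraps around because grasp orientation is periodic.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def robust_grasp(grasp_scores, sigma_px=2.0, sigma_theta_bins=1.0):
    """Smooth a grasp function by the gripper pose uncertainty (sketch).

    grasp_scores: array of shape (H, W, n_angles) predicted by the CNN.
    """
    smoothed = gaussian_filter(
        grasp_scores,
        sigma=(sigma_px, sigma_px, sigma_theta_bins),
        mode=("nearest", "nearest", "wrap"),  # periodic in orientation
    )
    v, u, k = np.unravel_index(np.argmax(smoothed), smoothed.shape)
    return u, v, k  # best image position and orientation bin
```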

Learning Method
The mapping from depth images to grasp scores was done by a CNN. This task was turned into a multi-class classification
problem by letting the CNN predict which of six discretized grasp-score values (from 0 to 1.0 in steps of 0.2) the depth image
should be assigned to.

Figure 113. Visualization of the grasp function during training. (a): Ground-truth depth image. (b): Scaled image after adding
noise. (c): Image from (b) overlaid with the grasp function. Each line represents a grasp position and orientation, with the
thickness indicating the score for the pose. This was the level of granularity used in the work (8712 grasps per image). (d):
Same as in (c) but with a coarser distribution of grasp poses.

Figure 114. Architecture of the CNN trained to output the grasp function. Given a depth image of an object, the network
classified each grasp with one of six grasp scores.

Experiments
The CNN was trained on 1000 randomly selected objects from the ModelNet dataset as described above. Each training image
was randomly rotated and shifted to get 1000 new augmented images per original training image. The CNN was pre-trained to
classify objects in another set of 3D CAD object models.

With the described setup, two experiments were conducted:

• Grasping 1000 novel objects in simulation


• Grasping a set of 20 real objects with a Kinova MICO robot arm
Baseline methods:
• Centroid: The target gripper pose is the centroid of the image, with the orientation perpendicular to the dominant
direction calculated via PCA
• Best grasp: The target pose is the maximum of the learned grasp function
• Robust best grasp: The target pose is the maximum of the grasp function convolved with the pose uncertainty (proposed
approach)

Figure 115. Set of 20 real objects used for the second experiment. The objects have different shapes, weights and material
properties.

Results

Figure 116. Result of the experiment carried out in simulation. These are the grasp success rates of the different methods
with varying gripper position uncertainties. The gripper orientation uncertainty was set to 10◦ .

Figure 117. Results of the second experiment, which was carried out by a real robot. The table shows the grasp success rates
obtained by the three different methods with different gripper pose uncertainties, where σu,v corresponds to position uncertainty
in millimeters and σθ to orientation uncertainty in degrees.

Strengths and Weaknesses


Strengths:
• Improved grasping performance
• Proposed method can deal with relatively high pose uncertainty and can therefore be used with imprecise sensors
Weaknesses:

• Manual tuning of parameters is necessary to obtain simulations that are as realistic as possible
• Convolution of the pose uncertainty with the pose function increases computation time

Closing the Loop for Robotic Grasping: A Real-time,
Generative Grasp Synthesis Approach28
Authors: Douglas Morrison, Peter Corke, Jurgen Leitner

Motivation
To perform grasping in unstructured environments like the real world, a robot must be capable of computing grasps for many
different objects. But the real world is a dynamic environment, with changes in a robot's workspace, sensor noise and
inaccuracies in the robot's control. A control loop that allows a robot to analyze and correct grasps in real-time would help it
cope with these challenges.

Approach
The approach was a real-time, object-independent grasping method that can be used for closed-loop grasping. The core of the
proposed method was a convolutional neural network (CNN) that estimated the grasp quality at every pixel of a given depth
image. The whole approach dealt with the task of detecting and executing grasps on unknown objects lying on a planar surface.

Figure 118. Left: Grasping setup. A grasp was defined by its position (x, y, z) and orientation around the surface normal (Φ).
Right: In the depth image, the grasp pose was defined by its center in pixel coordinates (u, v), its orientation Φ̃ and the gripper
width w̃.

Figure 119. Illustration of the proposed real-time grasping method.

The neural network was trained on the Cornell Grasping Dataset, which contains 885 RGB-D images of real objects with more
than 5,000 human-labeled positive and almost 3,000 negative grasps.

Preprocessing and Feature Extraction
The training data was augmented by randomly cropping, zooming and rotating each image, resulting in 8,840 depth images
with more than 50,000 grasp examples.

Learning Method
The trained CNN converted a depth image into a so-called grasp map. Each element in this grasp map corresponded to a pixel
in the depth image. The grasp map was then used to compute the best possible grasp in the given image.

Each element in the grasp map contained the following parameters (a decoding sketch follows the list):


• Position of the grasp in pixel coordinates (given by the pixel in the depth image that the element corresponds to)
• Grasping width
• Grasping angle
• Grasp quality (probability of being a successful grasp)
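
Decoding the best grasp from such per-pixel maps reduces to an argmax over the quality map; a short sketch with assumed variable names:

```python
import numpy as np

def decode_grasp_map(quality, angle, width):
    """Pick the best grasp from per-pixel output maps (illustrative sketch).

    quality, angle and width are (H, W) arrays corresponding to the
    network outputs sketched in figure 120.
    """
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    return {
        "pixel": (u, v),              # grasp center in image coordinates
        "angle": float(angle[v, u]),  # grasp orientation at that pixel
        "width": float(width[v, u]),  # gripper opening width
        "quality": float(quality[v, u]),
    }
```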

Figure 120. Architecture of the CNN. Qθ , Wθ , and Φθ correspond to the grasp quality, grasp width and the grasp orientation
respectively.

Experiments
Four experiments were conducted:
1. Grasping single, static objects
2. Grasping single, moving objects
3. Grasping objects arranged in a clutter
4. Grasping objects with simulated kinematic errors

Figure 121. Two sets of real objects used for the first two experiments (grasping static and moving objects). Left: Set of eight
adversarial objects with difficult shapes. Right: Set of 12 household objects.

Results

Figure 122. Results from the first three grasping experiments. The performance of the proposed approach is shown on the
far right. Other works (denoted by the numbers at the top) were also compared on some object sets.

Figure 123. Comparison between the proposed approach as an open-loop method (choosing a grasp based on a single depth
image) and as a closed-loop method (re-evaluating and adapting a grasp based on newly observed depth images as the gripper
moves to the grasping location). Kinematic errors were simulated with cross-correlation between velocities in the x-, y- and
z-directions.

Strengths and Weaknesses


Strengths:
• Method can handle external changes during a grasp (e.g. moving objects)
• Method can deal with positioning errors of a robot
• Used CNN is very fast to execute (19 ms) and therefore suitable for closed-loop control
Weaknesses:

• Like many other depth cameras, the depth camera in this work was unable to return accurate depth measurements at short
distances
• No useful depth data on many black or reflective objects

References
1. Kaboli, M., Feng, D., Yao, K., Lanillos, P. & Cheng, G. A tactile-based framework for active object learning and
discrimination using multimodal robotic skin. IEEE Robotics Autom. Lett. 2, 2143–2150 (2017).
2. Levine, S., Finn, C., Darrell, T. & Abbeel, P. End-to-end training of deep visuomotor policies. The J. Mach. Learn. Res.
17, 1334–1373 (2016).
3. Gao, Y., Hendricks, L. A., Kuchenbecker, K. J. & Darrell, T. Deep learning for tactile understanding from visual and haptic
data. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, 536–543 (IEEE, 2016).
4. Szegedy, C. et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015).
5. Bell, S., Upchurch, P., Snavely, N. & Bala, K. Material recognition in the wild with the materials in context database
(supplemental material). In Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015).
6. Kaboli, M., Yao, K., Feng, D. & Cheng, G. Tactile-based active object discrimination and target object search in an
unknown workspace. Auton. Robots 1–30 (2018).
7. Madry, M., Bo, L., Kragic, D. & Fox, D. St-hmp: Unsupervised spatio-temporal feature learning for tactile data. In
Robotics and Automation (ICRA), 2014 IEEE International Conference on, 2262–2269 (IEEE, 2014).
8. Lenz, I., Lee, H. & Saxena, A. Deep learning for detecting robotic grasps. The Int. J. Robotics Res. 34, 705–724 (2015).
9. Chebotar, Y., Kroemer, O. & Peters, J. Learning robot tactile sensing for object manipulation. In Intelligent Robots and
Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, 3368–3375 (IEEE, 2014).
10. Detry, R., Pugeault, N. & Piater, J. H. A probabilistic framework for 3d visual object representation. IEEE Transactions on
Pattern Analysis Mach. Intell. 31, 1790–1803 (2009).
11. Chen, Y. & Medioni, G. Object modelling by registration of multiple range images. Image vision computing 10, 145–155
(1992).
12. Peters, J., Mülling, K. & Altun, Y. Relative entropy policy search. In AAAI, 1607–1612 (Atlanta, 2010).
13. Popov, I. et al. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073
(2017).
14. Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
15. Finn, C. et al. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE
International Conference on, 512–519 (IEEE, 2016).
16. Lange, S., Riedmiller, M. & Voigtlander, A. Autonomous reinforcement learning on raw visual input data in a real world
application. In Neural Networks (IJCNN), The 2012 International Joint Conference on, 1–8 (IEEE, 2012).
17. Levine, S., Pastor, P., Krizhevsky, A. & Quillen, D. Learning hand-eye coordination for robotic grasping with large-scale
data collection. In International Symposium on Experimental Robotics, 173–184 (Springer, 2016).
18. Van Hoof, H., Hermans, T., Neumann, G. & Peters, J. Learning robot in-hand manipulation with tactile features. In
Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, 121–127 (IEEE, 2015).
19. Pinto, L. & Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and
Automation (ICRA), 2016 IEEE International Conference on, 3406–3413 (IEEE, 2016).
20. Pinto, L., Davidson, J. & Gupta, A. Supervision via competition: Robot adversaries for learning tasks. In Robotics and
Automation (ICRA), 2017 IEEE International Conference on, 1601–1608 (IEEE, 2017).
21. Murali, A., Li, Y., Gandhi, D. & Gupta, A. Learning to grasp without seeing. arXiv preprint arXiv:1805.04201 (2018).
22. Pinto, L. & Gupta, A. Learning to push by grasping: Using multiple tasks for effective learning. In Robotics and
Automation (ICRA), 2017 IEEE International Conference on, 2161–2168 (IEEE, 2017).
23. Dyrstad, J. S. & Mathiassen, J. R. Grasping virtual fish: A step towards robotic deep learning from demonstration in virtual
reality. In Robotics and Biomimetics (ROBIO), 2017 IEEE International Conference on, 1181–1187 (IEEE, 2017).
24. Schwarz, M. et al. Fast object learning and dual-arm coordination for cluttered stowing, picking, and packing. In IEEE
International Conference on Robotics and Automation (ICRA) (2018).
25. Lin, G., Milan, A., Shen, C. & Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation.
In IEEE conference on computer vision and pattern recognition (CVPR), vol. 1, 3 (2017).

26. Mahler, J. et al. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.
arXiv preprint arXiv:1703.09312 (2017).
27. Johns, E., Leutenegger, S. & Davison, A. J. Deep learning a grasp function for grasping under gripper pose uncertainty. In
Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, 4461–4468 (IEEE, 2016).
28. Morrison, D., Corke, P. & Leitner, J. Closing the loop for robotic grasping: A real-time, generative grasp synthesis
approach. arXiv preprint arXiv:1804.05172 (2018).
