Sie sind auf Seite 1von 116

Simons Foundation 2019-02-06

Could Machines Learn


Like Humans?

Yann LeCun
Facebook AI Research
New York University
http://yann.lecun.com
AI today is mostly supervised learning
Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
Works well for:
Speech→words
Image→categories
Portrait→ name
Photo→caption
CAR
Text→topic
…. PLANE
1957 Perceptron & 1960 Adaline: Analog Computers
N
Linear threshold units y=sign( ∑ W i X i + b)
Perceptron: weights are motorized potentiometers i=1

Adaline: Weights are electrochemical “memistors”

https://youtu.be/X1G2g3SiCwU
Deep Learning
Traditional Machine Learning

Feature Trainable
Extractor Classifier

Hand engineered Trainable

Trainable
Deep Learning

Low-Level Mid-Level High-Level Trainable


Features Features Features Classifier
Multi-Layer Neural Nets

Multiple Layers of simple units ReLU ( x )= max ( x , 0)

Each units computes a weighted sum of its inputs


Weighted sum is passed through a non-linear function
The learning algorithm changes the weights

Ceci est une voiture

Weight
matrix
Hidden
Layer
1986-1996 Neural Net Hardware at Bell Labs, Holmdel
1986: 12x12 resistor array
Fixed resistor value
E-beam lithography
1988: 54x54 neural net
Programmable ternary weights
On-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights
320GOPS, convolver.
1992: ANNA: 64x64 net
ConvNet accelerator: 4GOPS
6-bit weights, 3-bit activations
Supervised Machine Learning = Function Optimization

Function with
adjustable parameters
Objective
Function Error

traffic light: -1
It's like walking in the mountains in a fog
and following the direction of steepest
descent to reach the village in the valley
But each sample gives us a noisy
estimate of the direction. So our path is
a bit random. ∂ L( W , X )
W i ←W i −η
∂Wi
Stochastic Gradient Descent (SGD)
Computing Gradients by Back-Propagation
C(X,Y,Θ) ●
A practical Application of Chain Rule
Cost

Backprop for the state gradients:
W

dC/dXi-1 = dC/dXi . dXi/dXi-1
dC/ Fn(Xn-1,Wn)
n ●
dC/dXi-1 = dC/dXi . dFi(Xi-1,Wi)/dXi-1
dWn dC/dXi Xi
Wi ●
Backprop for the weight gradients:
Fi(Xi-1,Wi)
dC/dWi ●
dC/dWi = dC/dXi . dXi/dWi
dC/dXi- Xi-1 ●
dC/dWi = dC/dXi . dFi(Xi-1,Wi)/dWi
1
F1(X0,W1)

Y (desired output)
X (input)
Convolutional Network Architecture [LeCun et al. NIPS 1989]

Filter Bank +non-linearity

Pooling
Filter Bank +non-linearity

Pooling

Filter Bank +non-linearity

Inspired by [Hubel & Wiesel 1962] &


[Fukushima 1982] (Neocognitron):
simple cells detect local features
complex cells “pool” the outputs of simple
cells within a retinotopic neighborhood.
Convolutional Network (LeNet5, vintage 1990)
Filters-tanh → pooling → filters-tanh → pooling → filters-tanh
ConvNets can recognize multiple objects
All layers are convolutional
Networks performs simultaneous segmentation and recognition
Check Reader (AT&T 1995)

Graph transformer network


trained to read check amounts.
Trained globally with Negative-
Log-Likelihood loss (MMI).
50% percent correct, 49% reject,
1% error (detectable later in the
process).
Fielded in 1996, used in many
banks in the US and Europe.
Processed an estimated 10% to
20% of all the checks written in
the US in the early 2000s.
[LeCun, Bottou, Bengio ICASSP1997]
[LeCun, Bottou, Bengio, Haffner 1998]
DARPA LAGR: Learning Applied to Ground Robots

`
...
100@25x121

CONVOLUTIONS (6x5)
...

20@30x125

MAX SUBSAMPLING (1x4)


...

20@30x484

CONVOLUTIONS (7x6)

3@36x484

YUV input
Semantic Segmentation with ConvNet for off-Road Driving
[Hadsell et al., J. of Field Robotics 2009]
[Sermanet et al., J. of Field Robotics 2009]

Input image Stereo Labels Classifier Output

Input image Stereo Labels Classifier Output


LAGR Video
Semantic Segmentation with ConvNets (33 categories)
Driving Cars with Convolutional Nets

MobilEye

NVIDIA
Deep Learning Today

History and State of the Art


Deep ConvNets for Object Recognition (on GPU)
AlexNet [Krizhevsky et al. NIPS 2012], OverFeat [Sermanet et al. 2013]
1 to 10 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.
Error Rate on ImageNet

Depth inflation

(Figure: Anirudh Koul)


Deep ConvNets (depth inflation)

VGG
[Simonyan 2013]

GoogLeNet
Szegedy 2014]

ResNet
[He et al. 2015]

DenseNet
[Huang et al 2017]
GOPS vs Accuracy on ImageNet vs #Parameters

[Canziani 2016]
ResNet50 and
ResNet 100 are used
routinely in
production.
Multilayer Architectures == Compositional Structure of Data
Natural is data is compositional => it is efficiently representable hierarchically

Low-Level Mid-Level High-Level Trainable


Feature Feature Feature Classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Future: weakly/self-supervised learning on massive datasets

Pretraining on 3.5b instagram images with 17k hashtags. Training/test on ImageNet

[Mahajan et al. CVPR 2018]


Progress in Computer Vision
[He 2017]
Mask R-CNN: instance segmentation
[He, Gkioxari, Dollar, Girshick
arXiv:1703.06870]
ConvNet produces an object
mask for each region of
interest
Combined ventral and dorsal
pathways
RetinaNet, feature pyramid network

One-pass object detection


[Lin et al. ArXiv:1708.02002]
Mask-RCNN Results on COCO dataset
Individual
objects are
segmented.
Mask-RCNN Results on COCO dataset
Individual
objects are
segmented.
Mask R-CNN Results on COCO test set
Mask R-CNN Results on COCO test set
Real-Time Pose Estimation on Mobile Devices

Maks R-CNN
running on
Caffe2Go
Detectron: open source vision in PyTorch
https://github.com/facebookresearch/maskrcnn-benchmark
DensePose: real-time body pose estimation

[Guler, Neverova, Kokkinos CVPR 2018] http://densepose.org


20 fps on a single GPU
3D ConvNet for Medical Image Analysis
Segmentation Femur from MR Images
[Deniz et al. Nature 2018]
3D ConvNet for Medical Image Analysis
FairSeq for Translation
[Gehring et al. ArXiv:1705.03122]
SING: Symbol to Instrument Neural Generator

[Defossez NeurIPS 18]

Original

SING

Nsynth
Unsupervised Translation [Lample 2018]
DrQA: open domain question answering
Question answering
systems that “knows” wikipedia
Applications of ConvNets
Self-driving cars, visual perception
Medical signal and image analysis
Radiology, dermatology, EEG/seizure prediction….
Bioinformatics/genomics
Speech recognition
Language translation
Image restoration/manipulation/style transfer
Robotics, manipulation
Physics
High-energy physics, astrophysics
New applications appear every day
E.g. environmental protection,….
Applications of Deep Learning
Medical image analysis [Mnih 2015]
Self-driving cars
Accessibility
Face recognition
Language translation
Virtual assistants*
Content Understanding for: [MobilEye]
Filtering
Selection/ranking
Search
Games
Security, anomaly detection
Diagnosis, prediction
Science!
[Geras 2017] [Esteva 2017]
Deep Learning on Graphs

Irregular graphs and non-euclidean data


ConvNets on Graphs (fixed and data-dependent)
Graphs can represent: Natural
language, social networks, chemistry,
physics, communication networks...

Review paper: “Geometric deep learning: going


beyond euclidean data”, MM Bronstein, J Bruna, Y
LeCun, A Szlam, P Vandergheynst, IEEE Signal
Processing Magazine 34 (4), 18-42, 2017
[ArXiv:1611.08097]
Spectral ConvNets / Graph ConvNets

Regular grid graph


Standard ConvNet
Fixed irregular graph
Spectral ConvNet
Dynamic irregular graph
Graph ConvNet

IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/
Open Source Projects from FAIR
PyTorch: deep learning framework http://pytorch.org
Many examples and tutorials. Used by many research groups.
FAISS: fast similarity search (C++/CUDA)
ParlAI: training environment for dialog systems (Python)
ELF: distributed reinforcement learning framework

ELF OpenGo: super-human go-playing engine


FastText: text classification, representation, embedding (C++)
FairSeq: neural machine translation with ConvNets, RNN...
Detectron / Mask-R-CNN: complete vision system
DensePose: real-time body pose tracking system
https://github.com/facebookresearch
What About (Deep)
Reinforcement Learning?
It works great …
…for games and virtual environments
Reinforcement Learning works fine for games

RL works well for games


Playing Atari games [Mnih 2013], Go
[Silver 2016, Tian 2018], Doom [Tian
2017], StarCraft (work in progress at
FAIR, DeepMind….)
RL requires too many trials.
RL often doesn’t really work in the
real world
FAIR open Source go player:
OpenGo
https://github.com/pytorch/elf
ELF OpenGo (open source go player from FAIR)
Pure RL requires many, many trials to learn a task
[Hessel ArXiv:1710.02298]

Median performance on 57
Atari games relative to
human performance
(100%=human)
Most methods require over
50 million frames to match
human performance (230
hours of play)
The best method
(combination) takes 18
million frames (83 hours).
Pure RL is hard to use in the real world
Pure RL requires too many
trials to learn anything
it’s OK in a game
it’s not OK in the real world
RL works in simple virtual
world that you can run faster
than real-time on many
machines in parallel.

Anything you do in the real world can kill you

You can’t run the real world faster than real time
What are we missing?
To get to “real” AI
1. Reasoning
2. Learning models of the world
3. Learning hierarchical representations of actions
What current deep learning methods enables
What we can have What we cannot have (yet)
Safer cars, autonomous cars Machines with common sense
Better medical image analysis Intelligent personal assistants
Personalized medicine “Smart” chatbots”
Adequate language translation Household robots
Useful but stupid chatbots Agile and dexterous robots
Information search, retrieval, filtering Artificial General Intelligence
Numerous applications in energy, (AGI)
finance, manufacturing,
environmental protection, commerce,
law, artistic creation, games,…..
Differentiable Programming:
Marrying Deep Learning
With Reasoning
Neural nets with dynamic, data-dependent structure,
A program whose gradient is generated
automatically.
Augmenting Neural Nets with a Memory Module
Recurrent networks cannot remember things for very long
The cortex only remember things for 20 seconds
We need a “hippocampus” (a separate memory module)
LSTM [Hochreiter 1997], registers
Memory networks [Weston et 2014] (FAIR), associative memory
Stacked-Augmented Recurrent Neural Net [Joulin & Mikolov 2014] (FAIR)
Neural Turing Machine [Graves 2014],
Differentiable Neural Computer [Graves 2016]

Recurrent net memory


Answering complex questions by running a program
[Johnson et al. ArXiv:1705.03633]
Inferring and executing programs for visual reasoning
https://research.fb.com/visual-reasoning-and-dialog-towards-natural-language-conversations-about-visual-data/
PyTorch: differentiable programming

Software 2.0:
The operations in a program are only partially specified
They are trainable parameterized modules.
The precise operations are learned from data, only the general structure
of the program is designed.
Dynamic computational graph
Automatic differentiation by recording a “tape” of operations and rolling it
backwards with the Jacobian of each operator.
Implemented in PyTorch1.0, Chainer…
Easy if the front-end language is dynamic and interpreted (e.g Python)
Not so easy if we want to run without a Python runtime...
How do Humans
and Animal Learn?
So quickly
Babies learn how the world works by observation
Largely by observation, with remarkably little interaction.

Photos courtesy of
Emmanuel Dupoux
Early Conceptual Acquisition in Infants [from Emmanuel Dupoux]

pointing
Social-
helping vs false perceptual
communicati beliefs
hindering
ve
Perception

Action face tracking rational, goal-


s directed actions
biological
motion
gravity, inertia
Physic stability, support conservation of
s momentum

Object permanence shape


constancy
Objects solidity, rigidity

natural kind categories Time (months)


roduction

0 1 2 3 4 5 6 7 8 9 10 11 12
13 14
prooto-imitation
crawling walking
emotional contagion
Prediction is the essence of Intelligence
We learn models of the world by predicting
The Future:
Self-Supervised Learning
With massive amounts of data
and very large networks
Self-Supervised Learning

Predict any part of the input from any Time →


other part.
Predict the future from the past.

Predict the future from the recent past.

Predict the past from the present.

Predict the top from the bottom.

Predict the occluded from the visible


Pretend there is a part of the input you ← Past Future →
Present
don’t know and predict that.
Three Types of Learning
Reinforcement Learning
The machine predicts a scalar reward given once in a
while.
weak feedback
Supervised Learning
The machine predicts a category or a few numbers for PLANE
each input
medium feedback CAR
Self-supervised Predictive Learning
The machine predicts any part of its input for any
observed part.
Predicts future frames in videos
A lot of feedback
How Much Information is the Machine Given during Learning?
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar reward given once in a
while.
A few bits for some samples

Supervised Learning (icing)


The machine predicts a category or a few numbers
for each input
Predicting human-supplied data
10→10,000 bits per sample

Self-Supervised Learning (cake génoise)


The machine predicts any part of its input for any
observed part.
Predicts future frames in videos
Millions of bits per sample
Self-Supervised Learning Fills in the Blanks

Our brains do this all the time

Filling in the visual field at the retinal blind spot


Filling in occluded images, missing segments in speech
Predicting the state of the world from partial (textual)
descriptions
Predicting the consequences of our actions
Predicting the sequence of actions leading to a result

Predicting any part of the past, present or future


percepts from whatever information is available.
Self-Supervised Learning: Filling in the Blanks
Self-Supervised Learning works well for text

Word2vec
[Mikolov 2013]

FastText
[Joulin 2016]

BERT
Bidirectional Encoder
Representations from
Transformers
[Devlin 2018]

Figure credit: Jay Alammar http://jalammar.github.io/illustrated-bert/


But it doesn’t really work for high-dim continuous signals

Video prediction:
Multiple futures are possible.
Training a system to make a single
prediction results in “blurry” results
the average of all the possible futures
The Next AI Revolution

THE REVOLUTION
WILL NOT BE SUPERVISED
(nor purely reinforced)
With thanks
To
Alyosha Efros
Could Self-Supervised Learning Lead to Common Sense?

Successfully learning to predict everything from everything else


would result in the accumulation of lots of background knowledge
about how the world works.
Perhaps this ability to learn and use the regularities of the world is
what we call common sense.
If I say “john picks up his briefcase and leaves the conference room”
You can infer a lot of facts about the scene.
John is a man, probably at work, he is
extending his arm, closing his hand around
the handle of his briefcase, standing up,
walking towards the door. He is not flying to
the door, not going through the wall….
Learning Predictive Models
of the World
Learning to predict, reason, and plan,
Learning Common Sense.
Classical model-based optimal control
Simulate the world (the plant) with an initial control sequence
Adjust the control sequence to optimize the objective through gradient descent
Backprop through time was invented by control theorists in the late 1950s
it’s sometimes called the adjoint state method
[Athans & Falb 1966, Bryson & Ho 1969]

Plant Plant Plant Plant


Simulator Simulator Simulator Simulator

Command Command Command Command

Objective Objective Objective Objective


Planning Requires Prediction
To plan ahead, we simulate the world
World

Agent Actions/
World Percepts
Simulator Outputs
Predicted Inferred
Action
Percepts World State Agent
Proposals
Actor
Agent State
Actor State
Critic Predicted Objective
Cost Cost
Training the Actor with Optimized Action Sequences

1. Find action sequence through optimization


2. Use sequence as target to train the actor
Over time we get a compact policy that requires no run-time optimization

Agent
World World World World
Simulator Simulator Simulator Simulator

Perception
Actor Actor Actor Actor

Critic Critic Critic Critic


Training a forward model

Forward model:
[s(t+1),r(t+1)] = f(s(t),a(t))
s(t): state
a(t): action
r(t): reward/cost

Observe a sequence of
states, actions, and
rewards (or costs).
Train the forward model
Learning Physics (PhysNet)
[Lerer, Gross, Fergus ICML 2016, arxiv:1603.01312]
ConvNet produces object masks that predict the trajectories of falling
blocks. Blurry predictions when uncertain
The Hard Part: Prediction Under Uncertainty
Invariant prediction: The training samples are merely representatives of a
whole set of possible outputs (e.g. a manifold of outputs).

Percepts

Hidden State
Of the World
Learning the “Data Manifold”: Energy-Based Approach

Learning an energy function (or contrast function) that takes


Low values on the data manifold
Higher values everywhere else
Y2

Y1
Energy Function for Data Manifold
Energy Function: Takes low value on data manifold, higher values everywhere else
Push down on the energy of desired outputs. Push up on everything else.
But how do we choose where to push up?

Implausible
futures Plausible futures
(high energy) (low energy)
Adversarial Training

Learning to predict under uncertainty


Predicting under Uncertainty: Adversarial Training
Invariant prediction: The training samples are merely representatives of a
whole set of possible outputs (e.g. a manifold of outputs).

Percepts

Hidden State
Of the World
Adversarial Training: the key to prediction under uncertainty?
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
Past: X
Dataset Y
T(X) Discriminator F: minimize
F(X,Y)
F(X,Y)

Actual future Y
X
Predicted future Past: X
X
Discriminator
Z Generator Y F(X,Y)
F: maximize
G(X,Z)
Adversarial Training: the key to prediction under uncertainty?
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
Past: X
Dataset Y
T(X) Discriminator F: minimize
F(X,Y)
F(X,Y)

Actual future Y
X
Predicted future Past: X
X
Discriminator
Z Generator Y F(X,Y)
F: maximize
G(X,Z) F(X,Y)
Adversarial Training: the key to prediction under uncertainty?
Generative Adversarial Networks (GAN) [Goodfellow et al. NIPS 2014],
Energy-Based GAN [Zhao, Mathieu, LeCun ICLR 2017 & arXiv:1609.03126]
Past: X
Dataset Y
T(X) Discriminator F: minimize
F(X,Y)
F(X,Y)

Actual future Y
X
Predicted future Past: X
X
Discriminator
Z Generator Y F(X,Y)
F: maximize
G(X,Z)
DCGAN: “reverse” ConvNet maps random vectors to images
DCGAN: adversarial training to generate images.
[Radford, Metz, Chintala 2015]
Input: random numbers; output: bedrooms.
Faces “invented” by a neural net (from NVIDIA)

From random numbers [Karras et al. ICLR 2018]


Adversarial Networks for Creation

[Sbai 2017]
Video Prediction with
Adversarial Training
[Mathieu, Couprie, LeCun ICLR 2016]
arXiv:1511:05440
Multi-Scale ConvNet for Video Prediction
4 to 8 frames input → ConvNet → 1 to 8 frames output
Multi-scale ConvNet, without pooling
If trained with least square: blurry output

Predictor (multiscale ConvNet Encoder-Decoder)


Predictive Unsupervised Learning
Our brains are “prediction machines”
Can we train machines to predict the future?
Some success with “adversarial training”
[Mathieu, Couprie, LeCun arXiv:1511:05440]
But we are far from a complete solution.
Video Prediction: predicting 5 frames
Video Prediction in
Semantic Segmentation Space
[Luc, Neverova, Couprie, Verbeek,
& LeCun ICCV 2017]
Temporal Predictions of Semantic Segmentations
Predictions a single future frame
CityScape dataset [Cordt et al. CVPR 2016]
Temporal Predictions of Semantic Segmentations
Prediction 9 frames ahead (0.5 seconds)
Auto-regressive model
Temporal Predictions of Semantic Segmentations
Prediction 9 frames ahead (0.5 seconds)
Auto-regressive model
Instance Segmentation Prediction

[Luc , ECCV 2018]


Instance Segmentation Prediction
Predicting Instance Segmentation Maps

[Luc, Couprie, LeCun, Verbeek ECCV 2018]


Mask R-CNN Feature Pyramid Network backbone
Trained for instance segmentation on COCO
Separate predictors for each feature level
Predictions
Long-term predictions (10 frames, 1.8 seconds)
Stochastic Latent-Variable
Forward Model:

[Henaff, Canziani, LeCun ICLR 2019]


[Henaff, Zhao, LeCun ArXiv:1711.04994]
[Henaff, Whitney, LeCun Arxiv:1705.07177]
Stochastic Forward Modeling: regularized latent variable model

Latent variable is
predicted from the target.
+
The latent variable is set
to zero half the time
during training (drop out) Noise
and corrupted with noise
The model predicts as
much as it can without
the latent var.
The latent var corrects
the residual error.
Application to Autonomous Driving
Overhead camera on
highway.
Vehicles are tracked
A “state” is a pixel
representation of a
rectangular window
centered around each
car.
Forward model is
trained to predict how
every car moves relative
to the central car.
steering and acceleration
are computed
Forward Model Architecture

Architecture: Encoder Decoder

expander

Latent variable predictor


Predictions
Cost optimized for Planning & Policy Learning

Differentiable cost function


Increases as car deviates from lane
Increases as car gets too close to other
cars nearby in a speed-dependent way
Uncertainty cost:
Increases when the costs from multiple
predictions (obtained through sampling
of drop-out) have high variance.
Prevents the system from exploring
unknown/unpredictable configurations
that may have low cost.
Approach 2: Learning a Policy with Stochastic Value Gradient

Feed initial state


Sample latent variable
sequences of length 20
Run the forward model
with these sequences
Backpropagate gradient of
cost to train a policy
network.
Iterate

No need for planning at


run time.
Adding an Uncertainty Cost (doesn’t work without it)
Estimates epistemic
uncertainty
Samples multiple drop-
puts in forward model
Computes variance of
predictions
(differentiably)
Train the policy network
to minimize the
lane&proximity cost plus
the uncertainty cost.
Avoids unpredictable
outcomes
Results

Blue car is being controled


White dot visualizes the action
Surrounding cars do not see the car being controled
Rate of Successful Runs as a fn of Rollout Length
MPUR: Model-Predictive with Uncertainty Regularization

Car gets squeezed &


hit in the back
because it’s
invisible to the
other cars

MPER: expert
regularization
Driving an Invisible Car in “Real” Traffic
Promising Areas for Research
Deep Learning on new domains (beyond multi-dimensional arrays)
Graphs, structured data...
Marrying deep learning and (logical) reasoning
Replacing symbols by vectors and logic by algebra
Self-supervised learning of world models
Dealing with uncertainty in high-dimensional continuous spaces
Learning hierarchical representations of control space
Instantiating complex/abstract action plans into simpler ones
Theory!
Compilers for differentiable programming.
Technology drives & motivates Science (and vice versa)
Science drives technology, but technology also drives science
Sciences are born from the study of technological artifacts
Telescope → optics
Steam engine → thermodynamics
Airplane → aerodynamics
Calculators → computer science
Telecommunication → information theory
What is the equivalent of thermodynamics for intelligence?
Are there underlying principles behind artificial and natural intelligence?
Are there simple principles behind learning?
Or is the brain a large collection of “hacks” produced by evolution?
Thank you

Das könnte Ihnen auch gefallen