• Labs:
Mihaela Găman (mp.gaman@gmail.com)
Antonio Bărbălău (antoniobarbalau@gmail.com)
Prerequisites
• Practical Machine Learning
– Classifiers, regressors, loss functions, normalization, MLE,
etc.
• Linear Algebra
– Matrix multiplication, eigenvalues, etc.
• Calculus
– Multi-variate gradients, hessians, jacobians, etc.
• Programming!
– Projects will require Python
– Libraries/Frameworks: Numpy, OpenCV, TensorFlow /
PyTorch
Grading System
Your final grade is based on 1 or 2 projects:
• Collaboration
– Each student must write their own code for the project(s)
• No tolerance for plagiarism
– Cheating is neither ethical nor in your best interest
– Don’t cheat. We will find out (code will be checked!)
Topics in Practical ML
• Basics of Statistical Learning
• Loss function, MLE, MAP, Bayesian estimation, bias-variance tradeoff,
overfitting, regularization, cross-validation
• Supervised Learning
• Nearest Neighbour, Naïve Bayes, Logistic Regression, Support Vector
Machines, Kernels, Neural Networks, Decision Trees
• Ensemble Methods
• Unsupervised Learning
• Clustering: k-means, Gaussian mixture models, EM
• Dimensionality reduction: PCA, SVD, LDA
• Perception
• Applications to Vision, Natural Language Processing
What is Machine Learning?
• “the acquisition of knowledge or skills
through experience, study, or by being
taught”
What is Machine Learning?
• [Arthur Samuel, 1959]
– Field of study that gives computers the ability to learn without being explicitly programmed
[Diagram: Data → Machine Learning → Understanding]
ML in a Nutshell
• Tens of thousands of machine learning
algorithms
– Hundreds new every year
• Supervised learning
– Training data includes desired outputs
• Unsupervised learning
– Training data does not include desired outputs
• Reinforcement learning
– Rewards from sequence of actions
Tasks
Supervised Learning
• Classification: x → y, y discrete
• Regression: x → y, y continuous
Unsupervised Learning
• Clustering: x → y, y a discrete cluster ID
• Dimensionality Reduction: x → y, y continuous (low-dimensional)
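A minimal sketch of the four task types in code (this uses scikit-learn purely for brevity; it is not one of the listed course libraries, and the data is random placeholder data):

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # inputs x
y_discrete = (X[:, 0] > 0).astype(int)           # discrete labels
y_continuous = X @ rng.normal(size=5)            # continuous targets

clf = LogisticRegression().fit(X, y_discrete)          # classification: y discrete
reg = LinearRegression().fit(X, y_continuous)          # regression: y continuous
ids = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # clustering: y = discrete cluster ID
X_2d = PCA(n_components=2).fit_transform(X)            # dimensionality reduction: continuous low-dim y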
Supervised Learning
• Classification: x → y, y discrete
• Vision: Image Classification
[Figure: input image x; candidate labels y: man, camel, carrot, car]
• NLP: Machine Translation
• Speech: Speech2Text
• AI: Turing Test (“Can machines think?”)
Supervised Learning
• Input: x (images, text, emails…)
• Data
– (x1,y1), (x2,y2), …, (xN,yN)
• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations
• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features
• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Traditional Machine Learning
• VISION: hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → “car”
• SPEECH: hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈd ē p\
• NLP: “This burrito place is yummy and fun!” → hand-crafted features (bag-of-words, fixed) → your favorite classifier (learned) → “+”
It’s an old paradigm
• The first learning machine: the Perceptron
– Built at Cornell in 1960
• The Perceptron was a linear classifier on top of a simple, fixed feature extractor
• The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching
• Designing a feature extractor requires considerable effort by experts

y = sign( ∑_{i=1}^{N} W_i F_i(X) + b )
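A minimal NumPy sketch of this rule, y = sign(∑ W_i F_i(X) + b), together with the classic perceptron update (the feature extractor F below is a placeholder assumption that just flattens the input):

import numpy as np

def F(X):
    # Placeholder hand-crafted feature extractor: flatten raw inputs.
    # (A real pipeline might compute SIFT/HOG features here instead.)
    return X.reshape(len(X), -1)

def perceptron_predict(X, W, b):
    # Linear classifier on top of fixed features: y = sign(F(X) @ W + b)
    return np.sign(F(X) @ W + b)

def perceptron_train(X, y, epochs=10, lr=1.0):
    # Classic perceptron rule: nudge W toward each misclassified example.
    feats = F(X)
    W, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(epochs):
        for f, t in zip(feats, y):            # targets t in {-1, +1}
            if np.sign(f @ W + b) != t:
                W, b = W + lr * t * f, b + lr * t
    return W, b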
Hierarchical Compositionality
• VISION: pixels → edge → texton → motif → part → object
• SPEECH: sample → spectral band → formant → motif → phone → word
• NLP: character → word → NP/VP/… → clause → sentence → story
Building a Complicated Function
Given a library of simple functions
Compose them into a complicated function
Building a Complicated Function
Given a library of simple functions
Idea 1: Linear Combinations (compose simple functions as a weighted sum)
Building a Complicated Function
Given a library of simple functions
Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…
Compose them into a complicated function (see the sketch below)
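A small NumPy sketch of the composition idea (the building blocks and sizes below are illustrative assumptions):

import numpy as np

def linear(W, b):
    # A simple function from the library: x -> W x + b
    return lambda x: W @ x + b

def relu(x):
    return np.maximum(x, 0.0)

def compose(*fns):
    # Build a complicated function by applying simple functions in sequence.
    def f(x):
        for g in fns:
            x = g(x)
        return x
    return f

rng = np.random.default_rng(0)
deep_f = compose(linear(rng.normal(size=(16, 8)), np.zeros(16)), relu,
                 linear(rng.normal(size=(4, 16)), np.zeros(4)))
out = deep_f(rng.normal(size=8))   # linear -> ReLU -> linear: a tiny "deep" function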
Deep Learning = Hierarchical Compositionality
[Figure: learned feature hierarchy, from low-level to high-level features, producing the label “car”]
Deep Learning = Hierarchical Compositionality
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Sparse DBNs [Lee et al. ICML ‘09]; figure courtesy of Quoc Le
The Mammalian Visual Cortex is Hierarchical
• The ventral (recognition) pathway in the visual cortex
• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations
• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features
• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Feature Engineering
[Figure: hand-crafted visual features, e.g. HoG and Textons]
• SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised) → classifier (supervised) → \ˈd ē p\
• NLP: “This burrito place …” → Parse Tree (fixed) → …
• “Shallow” models: hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)
• Deep models: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier (all stages learned)
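A hedged PyTorch sketch of this contrast (the layer sizes and the 10-class output are illustrative assumptions):

import torch.nn as nn

# "Shallow": a fixed, hand-crafted feature extractor (not shown) feeding one
# trainable linear classifier; only this last layer is learned.
shallow_classifier = nn.Linear(128, 10)

# "Deep": a stack of trainable feature transforms learned end-to-end.
deep_model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),   # trainable feature transform 1
    nn.Linear(256, 256), nn.ReLU(),   # trainable feature transform 2
    nn.Linear(256, 10),               # trainable classifier on top
)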
• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations
• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features
• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Distributed Representations: Toy Example
• Local vs Distributed
Distributed Representations: Toy Example
• Can we interpret each dimension?
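A toy NumPy illustration of local (one-hot) vs distributed codes (the concepts and attribute dimensions are made-up examples):

import numpy as np

concepts = ["cat", "dog", "truck"]

# Local code: one dedicated unit per concept; N concepts need N units,
# and no dimension is shared between concepts.
local = np.eye(len(concepts))            # cat=[1,0,0], dog=[0,1,0], truck=[0,0,1]

# Distributed code: each concept is a pattern over shared, reusable dimensions
# (illustrative attributes: [is_animal, has_wheels, is_small]).
distributed = np.array([[1.0, 0.0, 1.0],    # cat
                        [1.0, 0.0, 0.0],    # dog
                        [0.0, 1.0, 0.0]])   # truck

In this toy case each distributed dimension happens to be interpretable (is_animal, has_wheels), but interpretability of individual dimensions is not guaranteed in learned representations.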
Power of distributed representations!
[Figure: the same items under a Local code vs a Distributed code]
[Figure: example involving “Romania”]
[Figure: an Ideal Feature Extractor maps pixel space (Pixel 1, Pixel 2, …, Pixel n) onto factors such as View and Expression]
Distributed Representations
• Q: What objects are in the image? Where?
So what is Deep (Machine) Learning?
• (Hierarchical) Compositionality
– Cascade of non-linear transformations
– Multiple layers of representations
• End-to-End Learning
– Learning (goal-driven) representations
– Learning to extract features
• Distributed Representations
– No single neuron “encodes” everything
– Groups of neurons work together
Benefits of Deep / Representation Learning
• (Usually) Better Performance
– “Because gradient descent is better than you”
Yann LeCun
Problems with Deep Learning
• Problem#1: Training is a non-convex optimization problem
• Standard response #1
– “Yes, but all interesting learning problems are non-convex”
– For example, human learning
• Order matters → wave hands → non-convexity
• Standard response #2
– “Yes, but it often works!”
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– Pipeline systems have “oracle” performances at
each step
– In end-to-end systems, it’s hard to know why
things are not working
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
[Figure: a pipeline system vs an end-to-end system]
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– Pipeline systems have “oracle” performances at each step
– In end-to-end systems, it’s hard to know why things are not
working
• Standard response #1
– Tricks of the trade: visualize features, add losses at
different layers, pre-train to avoid degenerate
initializations…
– “We’re working on it”
• Standard response #2
– “Yes, but it often works!”
Problems with Deep Learning
• Problem#2: Hard to track down what’s failing
– There are methods to visualize the features, e.g.: GradCAM, GradCAM++
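A minimal Grad-CAM sketch in PyTorch using hooks (the choice of a torchvision ResNet-18 and its layer4 block as the target layer is an assumption for illustration):

import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
target_layer = model.layer4                       # assumed target conv block

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                   # dummy input image
scores = model(x)
scores[0, scores.argmax()].backward()             # gradient of the top class score

# Weight each activation map by its average gradient, sum, and apply ReLU.
w = grads["g"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear")  # upsample to image size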
Problems with Deep Learning
• Problem#3: Lack of easy reproducibility
– Direct consequence of stochasticity & non-convexity
• Standard response #1
– It’s getting much better
– Standard toolkits/libraries/frameworks now available
– TensorFlow, PyTorch, Caffe, Theano
• Standard response #2
– “Yes, but it often works!”
Yes it works, but how?
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Classification: 1,000 object classes; 1.4M / 50k / 100k images (train / val / test)
Detection: 200 object classes; 400k / 20k / 40k images (train / val / test)
[Figure: example image labeled “Dalmatian”]
http://image-net.org/challenges/LSVRC/{2010,…,2014}
Data Enabling Richer Models
• [Krizhevsky et al. NIPS12]
– 54 million parameters; 8 layers (5 convolutional, 3 fully-connected)
– Trained on 1.4M images in ImageNet
– Better Regularization (Dropout)
Input Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → 1k output units
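A simplified PyTorch stand-in for this pattern (not the exact Krizhevsky et al. architecture; the channel counts and the assumed 224×224 input are illustrative):

import torch.nn as nn

# [conv + non-linearity -> pooling] x 2, then a fully-connected MLP with
# dropout regularization and 1k output units (assumes 3x224x224 inputs).
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(192 * 12 * 12, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),            # 1k output units
)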
ImageNet Classification 2012
• [Krizhevsky et al. NIPS12]: 15.4% error
• Next best team: 26.2% error
• More data
– 10⁸ samples (compared to 10³ in the 1990s)
• Better algorithms/models/regularizers
– Dropout
– ReLU
– Batch-Normalization
– …
THE SPACE OF MACHINE LEARNING METHODS
[Figure: a map of methods (Perceptron, Neural Net, Convolutional Neural Net, Recurrent Neural Net, Boosting, SVM, Autoencoder, Deep (sparse/denoising) Autoencoder, Sparse Coding, GMM, Restricted BM, Deep Belief Net, BayesNP) arranged along SHALLOW vs DEEP and SUPERVISED vs UNSUPERVISED axes, with the PROBABILISTIC methods grouped together]
Disclaimer: showing only a subset of the known methods
Main types of deep architectures
• Feed-forward: Neural Nets, Conv Nets
• Feed-back: Hierarchical Sparse Coding, Deconv Nets
• Bi-directional
Focus of this class: feed-forward architectures
Main types of learning protocols
• Purely supervised
– Backprop + SGD
– Good when there is lots of labeled data
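A minimal sketch of this protocol in PyTorch, backprop + SGD on labeled data (the model and the random data below are placeholders):

import torch
import torch.nn as nn

X = torch.randn(256, 20)                 # inputs x1..xN (placeholder data)
y = torch.randint(0, 3, (256,))          # labels y1..yN (3 classes)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)          # forward pass + supervised loss
    loss.backward()                      # backprop: gradients w.r.t. parameters
    optimizer.step()                     # SGD parameter update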