
Data Science!

Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 1: Introduction to Data Science!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Program Organizer!

Professor, Vice-Provost EPFL!


Head of LTS4 at EPFL!
Apple ARTS award, Boelpaepe prize !
Co-founded two start ups!

Prof. Pierre Vandergheynst!


pierre.vandergheynst@epfl.ch !

!"#$%&'(&%))*+'

-'

Program Instructor!
- Prof. of Data Science at the Institute
of Data Science at NTU, Singapore!
- Publications in NIPS, ICML, JMLR!
- Teaches Master and PhD courses in
Data Science at EPFL!
- Trained at EPFL, UCLA!
- Consulting!

Dr. Xavier Bresson !


xavier.bresson@epfl.ch!

!"#$%&'(&%))*+'

.'

Teaching Assistants!

M. Kirell Benzi!
kirell.benzi@epfl.ch!
Data Scientist/Artist!

!"#$%&'(&%))*+'

M. Michal De"errard!
michael.de"errard@epfl.ch!
Data Scientist!


Data Scientist!

Source: Drew Conway '

Best job in the U.S. in 2015 [Forbes].!


Salary has jumped from $125,000 to $200,000+ [Glassdoor].!
McKinsey projects that by 2018, the U.S. alone may face a 50 percent to 60
percent gap between supply and requisite demand of deep analytic talent. !
!"#$%&'(&%))*+'

0'

In the News!

!"#$%&'(&%))*+'

1'

Data Science!
Q: What is Data Science? !

!"#$%&'(&%))*+'

2'

What is Data Science? - Short Answer !


Science of transforming raw data into meaningful
knowledge to provide smart decisions to real-world
problems.!

!"#$%&'(&%))*+'

3'

What is Data Science? - Long Answer!

Q: What are the fields of Data Science?!
Q: What are the applications?!
Q: What are the main challenges?!

Data Science is a multidisciplinary field: 1+1=3!

Fields:!
Computer Science: Scalable databases for storing and accessing data, e.g. cloud computing, Amazon EC2, Hadoop. Distributed and parallel frameworks for data processing, e.g. MapReduce, GraphLab.!
Mathematical Modeling: Design algorithms that transform data into knowledge, using linear algebra, optimization, graph theory, statistics.!
Data: Collection of massive amounts of data at increasing rate, e.g. social networks, sensor networks, mobile devices, biological networks, administrative and economics data. Issues of privacy, security, ownership.!
Domain Expertise: Sciences (e.g. economy, biology, physics, neuroscience, sociology), Government (e.g. healthcare, defense, education, transportation), Industry (e.g. e-commerce, telecommunications, finance).!

Applications:!
Personalized Services, e.g. healthcare (enhanced diagnostics), commerce (products).!
Knowledge Discovery, e.g. physics, genomics, social sciences.!
Intelligent Systems, e.g. autonomous cars, security, interactive tools for data organization and exploration.!

Major challenges: Multidisciplinary integration, large-scale databases, scalable computational infrastructures, design of math algorithms for massive datasets, trade-off between speed and accuracy for real-time decisions, interactive visualization tools.!

What is Data Science? - Medium Answer!

Q: What is big data?!
Q: Is AI new?!

Data Science = Big Data + Computational Infrastructure + Artificial Intelligence!
Big Data: 3rd industrial revolution.!
Computational Infrastructure: Cloud computing, GPU.!
Artificial Intelligence: Not new!!

A Brief History of Data Science!

Q: Did you hear about the 4th industrial revolution?!

[Timeline figure, approximately:]!
1958: Perceptron (Rosenblatt) - AI Hope.!
1959-1962: Visual primary cortex (Hubel-Wiesel).!
1962: Birth of Data Science, split from Statistics (Tukey).!
1975: Backpropagation (Werbos).!
1987: First NIPS.!
1989: First KDD; Neocognitron (Fukushima).!
1995: SVM/Kernel techniques (Vapnik).!
1997: RNN (Schmidhuber).!
1998: CNN (LeCun).!
1999: First NVIDIA GPU.!
2006: Auto-encoders (LeCun, Hinton, Bengio); first Amazon cloud center.!
2010: Kaggle platform.!
2012: Deep Learning revolution (Hinton, Ng) - AI resurgence. Breakthrough or new AI bubble?!
2014: Facebook AI center.!
2015: Data scientist 1st job in US; OpenAI center; Google AI TensorFlow, Facebook AI Torch.!

AI Winter [1966-2012]: kernel techniques, handcrafted features, graphical models.!
Big Data: volume doubles every 1.5 years. Hardware: GPU speed doubles every year.!
4th industrial revolution? Digital Intelligence.!

Data Science and Graph Science!


Networks/Graphs!
!! Graphs encode complex data structures. They are everywhere: WWW, Facebook, Amazon, etc.!
!! "Graphs are the most important discrete models in the world!" - G. Strang (MIT)!

[Figures: MNIST image network, social network, graph of a Google query ("California"), GTZAN music network, network of text documents (20newsgroups).]!

Why are Networks Important?!

!! Networks improve all data science tasks, for a small computational price!!
!! Essential data lie on networks:!
(1) Social networks (Facebook, Twitter)!
(2) Biological networks (genes, brain connectivity)!
(3) Communication networks (Internet, wireless, traffic)!

[Figure: social networks, brain structure, telecommunication networks - examples of graph/network-structured data.]!

567899:";"<)=$%+=%<=%<=*>&)%?;@'

!"#$%&'(&%))*+'

,0'

Outline of the Course!

1st day:!
Python: language for data science.!
Graph Science: data structure, pattern extraction.!
Unsupervised Clustering: k-means, graph cuts.!

2nd day:!
Supervised Classification: SVM.!
Feature Extraction: PCA, NMF, sparse coding.!
Recommender Systems: PageRank, collaborative filtering, content filtering.!

3rd day:!
Deep Learning: NNs, CNNs, RNNs.!
Data Visualization: manifold learning, t-SNE.!

Structure of the Course!

Introduction of main concepts/ideas, technical details.!
Coding illustration on real-world data.!
Please, ask questions!!
Please, share your own data science problem for discussion.!

Xavier Bresson

17

Goal of the Course!

!"#$%&'(&%))*+'

,3'

Questions?

Xavier Bresson

19

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 2: Introduction to Python!
Kirell Benzi and Michael De"errard!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Python!
!! Why Python?!

!"#$%&'(&%))*+'

-'

Python!
!! Why Python for Data Science?!

!"#$%&'(&%))*+'

.'

Computational Needs
Fast numerical mathematics: BLAS & LAPACK libraries
Easy bridging to data: data files, databases, scraping
Easy bridging to legacy code: C, matlab, Fortran
Easy results presentation: html / web & pdf reports
Rapid prototyping
Ideally the same framework for R&D and production
Cluster computing: multi-threads, MPI, OpenMP, Ipython Parallel
GPU computing: OpenCL, CUDA

Xavier Bresson

Python Pros, for Prototyping


Easy-to-learn while powerful
Elegant syntax (quick to write, easy to read)
High-level data structures: list, tuple, set, dict & containers
Multi-paradigm: procedural, object-oriented, functional
Dynamically typed
Automatic memory management (garbage collector)
Interpreted (JIT is coming)
Runs everywhere: Windows, Mac, Linux, Cloud
Large community
Extensive ecosystem of libraries
l easy to share & install packages via pip

Xavier Bresson

Python Pros, for Production


General purpose (unlike matlab / R / julia)
Encourage code reuse: modules and packages
Integrated documentation
Open-source
Many tools: unit & integration testing, documentation generation,
debugging, performance optimization

Xavier Bresson

Python Cons
Python 2 vs 3
Slow execution
l Specialized libraries: numpy, scipy
l Compilation: pypy, numba, jython
Need to run to catch errors

Xavier Bresson

Scientific Python
Libraries for everything !
Numerical analysis
l numpy: multidimensional arrays, data types, linear algebra
l scipy:
higher-level algorithms, e.g. optimization, interpolation, signal
processing, sparse matrices, decompositions
SciKits
l scikit-learn: machine learning
l scikit-image: image processing
Deep Learning: tensorflow, theano, keras
Statistics: pandas
Symbolic algebra: sympy
Visualization
l matplotlib: similar to MATLAB plots
l bokeh: interactive visualization
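
As a quick illustration, here is a minimal sketch (assuming numpy, scipy, scikit-learn and matplotlib are installed; the toy data are made up) that touches the main libraries listed above:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# numpy: generate a toy 2D dataset with two well-separated blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4])

# scipy: compute all pairwise Euclidean distances
D = cdist(X, X)
print('mean pairwise distance:', D.mean())

# scikit-learn: cluster the points with k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# matplotlib: plot the clustered points (MATLAB-like interface)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()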
Xavier Bresson

Data Storage
Flat files
l CSV: numpy / pandas
l Matlab: scipy
l JSON: std lib
l HDF5: h5py
Connectors for relational databases
l SQLite: std lib
l PostgreSQL: psycopg (DB API)
l MySQL: mysqlclient
l Oracle: cx_Oracle (DB API)
l Microsoft SQL Server: pypyodbc (DB API)
NoSQL data stores
l Redis: Redis-py
l MongoDB: PyMongo (MongoEngine)
l Hbase: HappyBase
l Cassandra: Datastax
Object-Relational Mapping (ORM)
l SQLAlchemy, Peewee, Pony
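
For instance, a minimal sketch (assuming pandas is installed; the file name data.csv and table name measurements are hypothetical) of moving a flat CSV file into SQLite with the standard-library connector:

import sqlite3
import pandas as pd

# Flat file: read a CSV into a pandas DataFrame ('data.csv' is a hypothetical file)
df = pd.read_csv('data.csv')

# Relational storage: SQLite ships with the Python standard library
conn = sqlite3.connect('example.db')

# pandas can write the DataFrame directly into an SQL table
df.to_sql('measurements', conn, if_exists='replace', index=False)

# ...and read it back with a plain SQL query
back = pd.read_sql_query('SELECT * FROM measurements LIMIT 5', conn)
print(back)
conn.close()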
Xavier Bresson

Jupyter
HTML-based notebook environment
Multiple kernels / languages: Python, matlab, R, Julia
Platform agnostic: Windows, Mac, Linux, Cloud
All-in-one reports: text, latex math, code, figures, results
Most adapted for prototyping / data exploration
l Convert to Python modules when mature for production
Cloud: github, nbviewer
Alternatively, scientific IDEs: Spyder, Rodeo
l Jupyter is itself becoming an HTML-based IDE !
Other IDEs: IDLE, PyCharm
Text editors: vim, emacs, atom, sublime text

Xavier Bresson

10

Install It Yourself
Windows: anaconda or python(x,y) or Enthought Canopy
Mac: anaconda or homebrew / macports / fink
Linux: package manager (apt-get, yum, pacman)
Use pip to install packages from PyPI or GitHub
Use pyvenv to work with virtual environments

Xavier Bresson

11

Live Session
1) Cloud IDE: nitrous.io
2) Notebook: Jupyter / IPython
3) Basics of Scientific Python: numpy, scipy, scikit-learn, matplotlib
4) Demo: data visualization by Kirell Benzi

Xavier Bresson

12

Questions?

Xavier Bresson

13

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 3: Graph Science!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

Graph/Network Science!
!! Definition of graph/network: mathematical models representing pairwise
relations between objects/data.!

[Figure: all pairwise relationships between data1, data2, data3.]!

Q: Why are they useful?!

Graphs offer a global view of data structure!
Extract global meaningful patterns, insights about data!
Increase performance, e.g. classification gains of 5-20% (later)!
Easy to use, slight increase of computational time!
Some tasks are defined only on networks, e.g. Google PageRank!

!"#$%&'(&%))*+'

.'

Graph Science = Graph Theory!

Q: When did it start?!
History of graph theory: Graphs have been studied since 1736, starting with
Leonhard Euler and the famous problem of the Seven Bridges of Königsberg:!
Q: Can we find a path through the city that crosses each bridge once and only once?!

[Figure: Königsberg city, simplification, graph representation. Source: Wikipedia.]!

A: Not possible. Such a path needs all vertices to have even degree.!

Graph theory offers many analysis tools to use networks for all kinds of
applications: from clustering to classification, visualization, recommendation,
deep learning, etc.!
Xavier Bresson

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

Class of Graphs/Networks!
!! Natural Graphs:!
(1) Social networks: Facebook, LinkedIn, Twitter!

Q: Cite a few networks? !

(2) Biological networks: Brain connectivity and functionality, Gene regulatory networks!
(3) Communication networks: Internet, Networking Devices!
(4) Transportation networks: Trains, Cars, Airplanes, Pedestrians !
(5) Power networks: Electricity, Water !

[Figure: examples of graphs/networks - Facebook, brain connectivity, Minnesota road network, US electrical network, telecommunication network.]!

!! Essential data lie on network structures, like medical, social, communication data.!
Terminology: "Natural graphs" means no graph construction is needed.!

Class of Graphs/Networks!
!! Constructed Graphs (from Data). Examples:!
(1) MNIST image network!
(2) GTZAN music network!
(3) 20NEWS text document network!
(4) 3D mesh points!

[Figures: MNIST image graph, GTZAN music graph, graph of text documents (20newsgroups), 3D mesh points.]!

!! Graph Construction: No universal recipe, but good common practices (later
discussed) and domain expertise knowledge.!

Q: How much time to construct a network from data?!
!! Computational time: May be time consuming, O(n^2) with n = #data. Ex (d=1K):!
n=1K: time < 1 sec!
n=100K: time ~ 1 min!
n=1M: time > 1 hour!

!! Approximate technique: FLANN is a library for performing fast approximate
nearest neighbor searches in high-dimensional spaces (e.g. with kd-trees).!
FLANN: http://www.cs.ubc.ca/research/flann!

Class of Graphs/Networks!
!! Mathematical/Simulated Graphs:!
(1) Erdos-Renyi graphs (1959)!
(2) Stochastic blockmodels [Faust-Wasserman 92]!
(3) Lancichinetti-Fortunato-Radicchi (LFR) graphs (2008)

!
Erdos-Renyi Network'
Source: Wikipedia.'

Q: Why using artificial math networks?!


!! Mathematical Models:!
!

Advantages: Precise control of your data analysis model (best performances, data
assumptions). No need to perform extensive experiments! (big issue with deep learning)!

! Limitations: Most data assumptions are too restrictive, and it may be hard to check if
your data follow the model assumptions.!

!"#$%&'(&%))*+'

3'

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

Basic Definitions!
!! Graphs: Fully defined by G=(V,E,W):!
!! V set of vertices, with |V|=n,!
!! E set of edges,!
!! W similarity matrix.!

[Figure: vertices i, j in V, edge e_ij in E with weight W_ij = 0.9.]!

!! Directed/undirected graphs.!
Note: In this workshop, we will mostly talk about undirected networks.!

Basic Definitions!
!! Vertex Degree:!
(1) Binary graphs (Wij in {0,1}): degree = #edges connected to a vertex.!
(2) Generic graphs (Wij in [0,1]): degree $d_i = \sum_{j \in V} W_{ij}$.!

!! Shortest path: A path on a graph with the smallest possible distance.!
Fast Algorithm: Dijkstra's algorithm (1956).!

Q: Do you know a popular application?!
A: Road navigation, e.g. Lausanne to Venezia.!

Full vs. Sparse Graphs!

!! Full/Complete Graphs: Each vertex is connected to all other vertices.!
$|E| = \frac{n(n-1)}{2} = O(n^2)$!

!! Sparse Graphs: Each vertex is connected to a few other vertices (k=10-50).!
$|E| = O(n)$!

Q: Full or sparse?!
!! A: Sparse networks are highly desirable for memory and computational efficiency.!
Ex: Internet, n = 4.73 billion pages (August 2016):!
$|E| = n^2 \approx 10^{18}$ if it was full.!
$|E| = k \cdot n \approx 10^{11}$ as it is (very) sparse.!
!! Good news: most natural/real-world networks (Facebook, brain, communication)
are sparse. Besides, sparsity implies structure.!

[Figure: full graph vs. sparse graph.]!

Adjacency/Similarity Matrix W!
Definition: Matrix W in G=(V,E,W) actually contains all information
about your network. There are two choices of W:!
(1) Binary W: Wij in {0,1}!
(2) Weighted W: Wij in [0,1] (commonly normalized to 1)!

(1) Binary W: $W_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$!

(2) Weighted W: $W_{ij} = \begin{cases} w_{ij} \in [0,1] & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$!
Xavier Bresson

13

Demo: Synthesize Social Networks!

!! Run lecture03_code01.ipynb!
Synthesize LFR social networks: Play with the mixing parameter mu:!
(1) mu small: communities well separated.!
(2) mu large: communities are mixed up.!

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

15

Curse of Dimensionality!
Q: What is the curse of dimensionality?!
A: In high dimensions, (Euclidean) distances between data are meaningless:
all data are close to each other!!
Result [Beyer98]: Suppose data are uniformly distributed in $\mathbb{R}^d$;
pick any data $x_i$, then we have:!

$$\lim_{d\to\infty} E\left[\frac{d_{\ell_2}^{\max}(x_i, V \setminus x_i) - d_{\ell_2}^{\min}(x_i, V \setminus x_i)}{d_{\ell_2}^{\min}(x_i, V \setminus x_i)}\right] \to 0$$

Loss of intuition: Gaussian distribution of data.!
In low-dim, most data are concentrated at the center.!
In high-dim, most data are concentrated at the surface.!

[Figure: 1-D Gaussian vs. 1,000,000-D Gaussian.]

Xavier Bresson

16
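
A small numpy experiment, as a sketch, illustrating this effect (the dimensions and sample size are illustrative choices):

import numpy as np

rng = np.random.RandomState(0)
n = 500
for d in [2, 10, 100, 1000]:
    X = rng.rand(n, d)                        # uniform data in [0,1]^d
    x = X[0]
    dist = np.linalg.norm(X[1:] - x, axis=1)  # Euclidean distances to one point
    ratio = (dist.max() - dist.min()) / dist.min()
    print(f'd={d:5d}  (max-min)/min = {ratio:.3f}')

As d grows, the relative gap between the farthest and the closest point shrinks toward 0.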

Blessing of Structure!
Q: What is the blessing of structure?!
Good news: The assumption that data are uniformly distributed is not true for real-world
data. Data always have some structure, in the sense that they belong to a
low-dimensional space called a manifold, where distances on this surface are meaningful!!

[Figure: uniform distribution of data (no structure, randomness) vs. non-uniform
distribution of data (structure).]

Xavier Bresson

17

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

18

Manifold Learning!
!! Big challenge: It is difficult to discover the structures hidden in the data because of:!
(1) High-dimensional data!
(2) Large-scale data!
A class of algorithms exists, called manifold learning techniques (later discussed).!

!! Some data have clear structures (some others less):!

[Figures: MNIST image graph, graph of text documents (20newsgroups).]!

From Manifolds to Graphs!


Manifold assumption: High-dim data are
sampled from a low-dim manifold.!
Ex: Let x be a movie, each movie is defined
by d features/attributes like genre, actors,
release year, origin country, etc such that x
in Rd. Then we can make the assumption
that all movies form a manifold in Rd.!

Assumption validity: It is a good working hypothesis for:!


(1) Several types of data (images, text documents, music, etc)!
(2) Most data science tasks (classification, visualization, recommendation, etc)!

Xavier Bresson

20

From Manifolds to Graphs!

!! Graphs = Manifold sampling: The manifold information is encoded by
neighborhood graphs (we never form the manifold explicitly):!

[Figure: smooth manifold -> (sampling) -> data points -> (graph construction) -> graph G=(V,E,W). [Belkin]]!

!! Neighborhood Graphs: k-NN graphs (most popular):!

$$W_{ij} = \begin{cases} e^{-\mathrm{dist}(x_i,x_j)^2/\sigma^2} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases}$$

where dist(xi,xj) is the distance between xi and xj (to be defined), sigma is the scale
parameter (value depends on data), and N_i^k is the k-nearest-neighbor neighborhood of data xi.!
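
A minimal sketch of this construction (assuming scikit-learn and scipy are available; k=10 and the global scale sigma are illustrative choices):

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

# Toy data: n points in d dimensions
rng = np.random.RandomState(0)
X = rng.randn(300, 10)

k = 10  # number of nearest neighbors (typical range 10-50)

# Sparse matrix of squared distances to the k nearest neighbors
D = kneighbors_graph(X, n_neighbors=k, mode='distance')
D.data = D.data ** 2

# Global scale sigma: mean distance to the k-th neighbor
sigma2 = np.mean(np.sqrt(D.max(axis=1).toarray())) ** 2

# Gaussian weights W_ij = exp(-dist^2 / sigma^2), zero outside the k-NN neighborhood
W = D.copy()
W.data = np.exp(-D.data / sigma2)

# Symmetrize, since k-NN relations are not symmetric in general
W = W.maximum(W.T)
print(W.shape, W.nnz)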

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

22

Graph Analysis=Spectral Graph Theory!


Q: How to use graphs? !
!! A: Given a data graph G=(V,E,W), use spectral graph theory (SGT) to:!
(1) Find meaningful patterns (multi-scale data structures)!
(2) Analyze data on top of the network (Facebook users with messages, videos, etc)!
(3) Design data science tasks (clustering, classification, recommendation, etc) !

!! Graph Laplacian Operator L: The most powerful tool in SGT to
analyze and process networks! What is it?!
(1) Heat diffusion operator (on graphs).!
(2) Basis functions of this operator are the well-known Fourier modes (on graphs)
(later discussed).!

Graph Laplacian Definitions!

Unnormalized/combinatorial graph Laplacian (historical):!
$$L_{un} = D - W \in \mathbb{R}^{n \times n}, \quad n = |V|$$
with D the degree matrix: $D = \mathrm{diag}(d_1, ..., d_n)$, $d_i = \sum_j W_{ij}$.!

Normalized Laplacian (most popular):!
$$L = D^{-1/2} L_{un} D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}$$
Nice math properties, s.a. robustness w.r.t. unbalanced sampling.!

Random Walk Laplacian (for Google PageRank):!
$$L = D^{-1} L_{un} = I_n - D^{-1} W$$

Note: All Laplacians are diffusion operators, but with different diffusion properties.!
Xavier Bresson

24
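
A minimal sketch of the three Laplacians (assuming scipy; W is any symmetric sparse adjacency matrix, e.g. the k-NN graph built earlier):

import numpy as np
import scipy.sparse as sp

def laplacians(W):
    """Return the combinatorial, normalized and random-walk Laplacians of W."""
    n = W.shape[0]
    d = np.asarray(W.sum(axis=1)).flatten()           # degrees d_i = sum_j W_ij
    D = sp.diags(d)
    L_un = D - W                                       # L_un = D - W
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_norm = sp.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt   # L = I - D^{-1/2} W D^{-1/2}
    d_inv = sp.diags(1.0 / np.maximum(d, 1e-12))
    L_rw = sp.eye(n) - d_inv @ W                       # L = I - D^{-1} W
    return L_un, L_norm, L_rw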

Graph Spectrum!
Motivation: Study the modes of variation of the graph system.!
Q: How? A: Eigenvalue Decomposition (EVD) of Laplacian L:!

$$L = U \Lambda U^T, \quad L u_k = \lambda_k u_k, \quad U = [u_1, ..., u_n], \quad \Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_n)$$
$$\langle u_k, u_{k'} \rangle = \begin{cases} 1 & \text{if } k = k' \\ 0 & \text{otherwise} \end{cases}, \qquad 0 = \lambda_{\min} = \lambda_1 \le ... \le \lambda_{\max}$$

Interpretation:!
(1) u_k: Fourier modes, i.e. vibration vectors of the graph.!
(2) lambda_k: Frequencies of the Fourier modes u_k, i.e. how much they vibrate.!

Xavier Bresson

25
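
A minimal sketch (assuming scipy) of computing the first Fourier modes of a small graph with a sparse eigensolver:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as sla

# Tiny example: normalized Laplacian of a 100-node ring graph
n = 100
rows = np.arange(n)
W = sp.csr_matrix((np.ones(n), (rows, (rows + 1) % n)), shape=(n, n))
W = W + W.T                                    # undirected ring
d = np.asarray(W.sum(axis=1)).flatten()
Dinv = sp.diags(1.0 / np.sqrt(d))
L = sp.eye(n) - Dinv @ W @ Dinv                # L = I - D^{-1/2} W D^{-1/2}

# First K Fourier modes: smallest eigenvalues of the symmetric Laplacian
lam, U = sla.eigsh(L, k=5, which='SM')
print(lam)                                     # lam[0] ~ 0; U[:, k] = k-th vibration mode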

Demo: Modes of Variations of the Graph System!


!! Run lecture03_code02.ipynb!

Graph = Meshes in graphics


(Google 3D shape recognition)!

Graph = Regular grids (JPEG image


compression, most used worldwide)!

Q: What is the main property of the first and last eigenvectors?!

!! A: First eigenvectors = smoothest modes of vibration of the graph.!
Last eigenvectors = highest frequencies of the graph.!

Neuroscience!
!! Goal: Find meaningful activation patterns in brain using Structural MRI
and Functional MRI. !
Time series !
at this location!

Dynamic activity !
of the brain"

!! Methodology: graph G -> Laplacian eigenvectors u_k -> dynamic activation patterns.!

[Figure: time series at a brain location, dynamic activity of the brain, connectivity of the
brain (fibers connecting regions), graph connectivity.]!

Q: How to construct the dynamic network?!

!! Results: (Re)discover dynamic patterns related to basic functional
tasks: vision, body motor, language, etc.!

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

28

How to Construct Graphs from Data?!


Three fundamental questions:!
(1) What type of graphs?!
(2) What distances between data?!
(3) What data features?!
Answers: Optimal graph construction is an open problem; it depends on the
data and the analysis tasks. However, there exist good practices, and domain
expertise is useful.!

Xavier Bresson

29

What Type of Graphs?!

Neighborhood graphs: k-NN graphs (there exist epsilon-graphs but they are dense)!
Parameters:!
(1) k = #nearest neighbors!
(2) sigma = scale parameter!
k-value: 10-50!
sigma-value: Two strategies:!
(1) Global scale: sigma = mean distance of all k-th neighbors:!
$$W_{ij} = \begin{cases} e^{-\mathrm{dist}(x_i,x_j)^2/\sigma^2} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases}$$
(2) Local scale [Zelnik-Perona04]: sigma_i = distance of the k-th neighbor of vertex i:!
$$W_{ij} = \begin{cases} e^{-\mathrm{dist}(x_i,x_j)^2/(\sigma_i \sigma_j)} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases}$$

Xavier Bresson

30

What Distances?!
Q: What distances do you know?!
(1) Euclidean distance:!
Good for low-dim data (d < 10)!
Good for high-dim data with clear structures (MNIST)!
$$d_{\ell_2}(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{m=1}^{d} |x_{i,m} - x_{j,m}|^2}$$

(2) Cosine distance:!
Good for high-dim and sparse data (text documents)!
$$d_{\cos}(x_i, x_j) = \cos^{-1}\left(\frac{\langle x_i, x_j \rangle}{\|x_i\|_2 \|x_j\|_2}\right) = |\theta_{ij}|$$

(3) Other distances: Kullback-Leibler (information theory), Wasserstein (earth
mover's distance, PDE theory), etc.!

Xavier Bresson

31

What Data Features?!

!! Three types of data features ($x_i \in \mathbb{R}^d$):!
(1) Natural features (e.g. movie features s.a. genre, actors, year, etc)!
(2) Hand-crafted features (e.g. SIFT in computer vision)!
(3) Learned features (PCA, NMF, sparse coding, deep learning)!
!! Bad approach: It is usually a bad idea to use raw data directly as features.!
!! Good approach: Transform raw data into a meaningful data representation by
extracting features z (later discussed) and use them for graph construction (trick
known as bag of words):!

$$W_{ij} = e^{-\mathrm{dist}(x_i,x_j)^2/\sigma^2} \quad \rightarrow \quad W_{ij} = e^{-\mathrm{dist}(z_i,z_j)^2/\sigma^2}$$

Data Pre-Processing!
Center data (along each dimension): zero-mean property (very common)!
$$x_i \leftarrow x_i - \mathrm{mean}(\{x_i\})$$

Normalize data variance (along each dimension): z-scoring property (w/ zero mean)!
$$x_i \leftarrow x_i / \mathrm{std}(\{x_i\}), \qquad \mathrm{std}(\{x_i\}) = \sqrt{\sum_j |x_j - \mathrm{mean}(\{x_i\})|^2}$$

Projection on the l2-sphere (along each data point):!
$$x_i \leftarrow x_i / \|x_i\|_2$$

Normalize max and min value:!
$$x_i \leftarrow x_i \in [0, 1]$$

Xavier Bresson

33
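
A minimal numpy sketch of these steps (X is an n-by-d data matrix; the per-dimension operations act on each column):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 5) * 10 + 3           # toy data: n=200 samples, d=5 features

# Center each dimension (zero mean per column)
X = X - X.mean(axis=0)

# Z-scoring: unit variance per column (assumes zero mean)
X = X / X.std(axis=0)

# Projection of each data point on the l2-sphere
X_sphere = X / np.linalg.norm(X, axis=1, keepdims=True)

# Min-max normalization to [0, 1] per column
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))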

Demo: Graph Construction and Pre-Processing!


!! Run lecture03_code03.ipynb!
Let us test: Pre-processing, Construct k-NN graphs, Visualize distances,
Visualize W, Test graph quality with clustering accuracy.!

!"#$%&'(&%))*+'

./'

Demo: Construct Network of Text Documents !


!! Run lecture03_code04.ipynb!

!"#$%&'(&%))*+'

.0'

Outline!
Graph Science and Graph Theory!
!

Class of Networks!
Basic Definitions!
Curse of Dimensionality and Structure!
Manifolds and Graphs!
Spectral Graph Theory!
Construct Graphs from Data!
Conclusion!

Xavier Bresson

36

Summary!
Graph is a superior representation of data: Data -> Graph G=(V,E,W)!
1st fundamental tool: Adjacency Matrix W!
(1) It reveals structures hidden in the data.!
(2) It allows us to visualize graphs.!
(3) It is used for analysis tasks (later discussed).!
2nd fundamental tool: Graph Laplacian Matrix L!
(1) It represents the modes of variation of the graph.!
(2) Used for image compression (JPEG), neuroscience, etc.!

Xavier Bresson

37

Pipeline of Graph Science!

!! Step 1: Data -> Graph!
High-dim raw data -> (Feature Extraction, with domain expertise) -> Data features
-> (Graph Construction, with good practices) -> Graph.!

!! Step 2: Graph -> Analysis!
Graph -> (Graph Analysis with Spectral Graph Theory) -> Identify patterns
(unsupervised learning, supervised learning) -> Make use of graphs:
recommendation, visualization.!

Questions?

Xavier Bresson

39

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 4: Unsupervised Learning!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

Unsupervised Learning!
Q: What does unsupervised mean?!
Unsupervised learning aims at designing algorithms that can find
patterns in datasets without the use of labels, i.e. prior information.!
There exist several unsupervised learning techniques:!
(1) Unsupervised data clustering (this lecture)!
(2) Graph partitioning (this lecture)!
(3) Data representation/feature extraction (Lecture 7)!

Q: What is the most popular unsupervised


clustering algorithm? !

Xavier Bresson

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

K-Means Algorithm!
Most popular clustering algorithm (among top 10 algorithms in data
science).!
Three types of K-Means techniques:!
(1) Standard/linear K-Means!
(2) Kernel K-Means Expectation-Maximization (EM) Approach!
(3) Kernel K-Means Spectral Approach!

Xavier Bresson

Standard/Linear K-Means!
!! Description: Given n data $x_i \in \mathbb{R}^d$, K-Means partitions the data into K
clusters S1,...,SK that minimize the least-squares objective:!

$$E(M, S) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - m_k\|_2^2$$

Means: $M = \{m_1, ..., m_K\}$ ($m_k$ = k-th mean)!
Clusters: $S = \{S_1, ..., S_K\}$ ($S_k$ = k-th cluster)!
The inner term is the distance between $x_i$ and its mean $m_k$.!

Algorithm [Lloyd57, Forgy65]!

!! Expectation-Maximization (EM) approach:!
Initialization: Random initial means.!
Iterate until convergence: l=0,1,...!
(1) Cluster update/expectation step (Voronoi cells):!
$$S_k^{l+1} = \{x_i : \|x_i - m_k^l\|_2^2 \le \|x_i - m_{k'}^l\|_2^2, \ \forall k' \ne k\}$$
(2) Mean update/maximization step:!
$$m_k^{l+1} = \frac{\sum_{x_i \in S_k^{l+1}} x_i}{|S_k^{l+1}|}$$
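
A minimal numpy sketch of Lloyd's EM iterations (random initialization; K and the number of iterations are illustrative choices):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd/EM iterations for K-means on data X (n x d)."""
    rng = np.random.RandomState(seed)
    means = X[rng.choice(len(X), K, replace=False)]      # random initial means
    for _ in range(n_iter):
        # Expectation step: assign each point to its closest mean
        dist2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Maximization step: recompute each mean as the centroid of its cluster
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return labels, means

# Usage on a toy dataset with two blobs
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])
labels, means = kmeans(X, K=2)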

Demo: Standard K-Means!


!! Run lecture04_code01.ipynb!

!"#$%&'(&%))*+'

3'

Properties of EM Algorithm!
Advantages:!
(1) Monotonic: $E^{l+1} \le E^l$ for all iterations.!


(2) Convergence is guaranteed!
(3) Speed/complexity: O(n.d.K.#iter), where

#iter=nb iterations to converge!

(4) Easy to implement!


(5) Several extensions: K-Medians,!
K-hyperplanes, other distances.!
(6) K-Means has relationships with !
other popular algorithms (PCA, NMF) !

Limitations:!
(1)Non-convex energy (NP-hard)!
Existence of local minimizers: some are good, some are bad.!
Good initialization is critical, or restart many times and pick the solution
with the lowest energy value.!
Xavier Bresson

Main Limitation!
!! Assumption: Standard K-Means
assumes data follow a Gaussian Mixture
Model (GMM), meaning that clusters
are linearly separable and spherical.!

!! Consequence: Standard K-Means does
not work for non-linearly separable data.!
Solution: Kernel K-Means.!

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

11

Kernel K-Means [Scholkopf-Smola-Muller98]!

Q: Do you know the kernel trick?!

!! Kernel trick: Map data to a higher-dimensional space where data can be
linearly separated: $x_i \rightarrow \phi(x_i)$.!

!! New objective: Weighted Kernel K-Means energy (with $\alpha_i$ the weight
contribution of data $x_i$):!
$$E(M, S) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \alpha_i \|\phi(x_i) - m_k\|_2^2$$

!! Mean update ($\partial E / \partial m_k = 0$):!
$$m_k = \frac{\sum_{x_i \in S_k} \alpha_i \phi(x_i)}{\sum_{x_i \in S_k} \alpha_i}$$

Cluster Update!
!! Value of the k-th cluster S_k:!
$$S_k^{l+1} = \{x_i : d(x_i, m_k^l) \le d(x_i, m_{k'}^l), \ \forall k' \ne k\}$$
With:!
$$d(x_i, m_k) = \|\phi(x_i) - m_k\|_2^2 = \langle \phi(x_i) - m_k, \phi(x_i) - m_k \rangle
= \underbrace{\langle \phi(x_i), \phi(x_i) \rangle}_{K(x_i,x_i)=K_{ii}} - 2\underbrace{\langle \phi(x_i), m_k \rangle}_{(I)_{ik}} + \underbrace{\langle m_k, m_k \rangle}_{(II)_{kk}}$$

Linear algebra, with the kernel matrix $K(x,y) = \langle \phi(x), \phi(y) \rangle$:!
$$(I)_{ik} = (K \Sigma F \Lambda)_{ik}, \qquad (II)_{kk} = (\Lambda F^T \Sigma K \Sigma F \Lambda)_{kk}$$
where $\Sigma = \mathrm{diag}(\alpha_1, ..., \alpha_n)$, $\Lambda = \mathrm{diag}(1/\sum_{i \in S_1} \alpha_i, ..., 1/\sum_{i \in S_K} \alpha_i)$,
and $F_{ik} = 1$ if $x_i \in S_k$, $0$ otherwise.!

Kernel K-Means EM Algorithm!

Initialization: Random initial means.!
Iterate until convergence: l=0,1,...!
(1) Cluster update/expectation step:!
Compute all distances in matrix form, $D^l_{ik} = d(x_i, m_k^l)$, using only the kernel
matrix K (the terms $\mathrm{diag}(K)$, $K\Sigma F\Lambda$ and $\mathrm{diag}(\Lambda F^T \Sigma K \Sigma F \Lambda)$ from the previous slide).!
Update clusters ($F_k$ is an implicit representation of cluster $S_k$):!
$$F^{l+1}_{ik} = \begin{cases} 1 & \text{if } k = \mathrm{argmin}_{k'} D^l_{ik'} \\ 0 & \text{otherwise} \end{cases}, \qquad S_k^{l+1} = \{x_i : F^{l+1}_{ik} = 1\}$$
(2) Mean update/maximization step: No explicit update; it is implicitly done in D.!

Demo: Kernel K-Means!


!! Run lecture04_code02.ipynb!

!"#$%&'(&%))*+'

,0'

Kernel Trick!
Q: Do we need to compute the kernel mapping phi?!
!! A: No, we never use the non-linear function phi explicitly! Its exact
expression is actually irrelevant; only the kernel matrix K is important.!
!! Why is this good?!
$$K(x_i, x_j) = (a\langle x_i, x_j \rangle + b)^c = \langle \phi(x_i), \phi(x_j) \rangle$$
$\langle x_i, x_j \rangle$: low-dim products, e.g. $x_i \in \mathbb{R}^{100}$.!
$\langle \phi(x_i), \phi(x_j) \rangle$: high-dim products, e.g. $\phi(x_i) \in \mathbb{R}^{1000}$ - time consuming!!

!! Popular kernels:!
(1) Gaussian kernels: $K_{ij} = e^{-\|x_i - x_j\|_2^2/\sigma^2}$!
(2) Polynomial kernels: $K_{ij} = (a\langle x_i, x_j \rangle + b)^c$!

Algorithm Properties!
Advantage: All computations are basically matrix computations
(linear algebra) Good news because most processors have an
architecture and libraries to perform very fast linear algebra calculus:!
(1) Intel Math Kernel Library (MKL) that includes Linear Algebra
Package (LAPACK) and Basic Linear Algebra Subprograms
(BLAS).!
(2) AMD Core Math Library (ACML) also includes LAPACK and
BLAS.!

Limitation: Still local minimizers, next approach will decrease the


number of bad local minimizers.!

Xavier Bresson

17

Kernel K-Means Spectral Approach!

!! Let us start again from the weighted kernel k-means objective:!
$$\min_{M,S} E(M, S) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \alpha_i \|\phi(x_i) - m_k\|_2^2$$

!! After some linear algebra, it can be rewritten as a trace optimization problem
under constraints ($\min_Y E \Leftrightarrow \max_Y -E$):!
$$\max_Y \mathrm{tr}(Y^T \Sigma^{1/2} K \Sigma^{1/2} Y) \ \text{ s.t. } \ Y^T Y = I_K \ \text{ and } \ Y \in S_{Ind}$$
with the (weighted) indicator of clusters:!
$$Y_{ik} = \begin{cases} \left(\frac{\alpha_i}{\sum_{j \in S_k} \alpha_j}\right)^{1/2} & \text{if } i \in S_k \\ 0 & \text{otherwise} \end{cases}$$

Spectral Relaxation!
Q: What does NP-hard mean?!
!! Optimizing the objective is an NP-hard problem (i.e. it would take forever)!!
$$\max_Y \mathrm{tr}(Y^T \Sigma^{1/2} K \Sigma^{1/2} Y) \ \text{ s.t. } \ Y^T Y = I_K \ \text{ and } \ Y \in S_{Ind}$$
This binary indicator constraint makes the problem NP-hard.!

!! Spectral relaxation: We drop the indicator constraint $Y \in S_{Ind}$:!
$$\max_Y \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K, \quad \text{with } A = \Sigma^{1/2} K \Sigma^{1/2}$$

!! Solution [Spectral Theorem]: top K eigenvectors of matrix A, given by EVD.!

Eigenvalue Decomposition (EVD)!

!! Suppose A is symmetric and positive semi-definite (PSD) (all kernel matrices
are symmetric and PSD by construction); then EVD gives:!
$$A y_k = \lambda_k y_k, \qquad \lambda_{\max} = \lambda_1 \ge ... \ge \lambda_n, \qquad
\langle y_k, y_{k'} \rangle = \begin{cases} 1 & \text{if } k = k' \\ 0 & \text{otherwise} \end{cases} \ (Y^T Y = I_K)$$

!! Top K eigenvalues/eigenvectors (the K largest $\lambda_k$) solve the relaxed problem
$\max_Y \mathrm{tr}(Y^T A Y)$ s.t. $Y^T Y = I_K$, since:!
$$\mathrm{tr}(Y^T A Y) = \sum_{k=1}^{K} y_k^T A y_k = \sum_{k=1}^{K} \lambda_k \langle y_k, y_k \rangle = \sum_{k=1}^{K} \lambda_k$$
which is maximized by taking the K largest eigenvalues.!

Kernel K-Means Spectral Algorithm!

Three steps:!
(1) Compute: $A = \Sigma^{1/2} K \Sigma^{1/2}$.!
(2) Solve EVD for the top K eigenvectors: $A y_k = \lambda_k y_k$, k=1,...,K.!
(3) Binarization step: Use the solution Y as embedding coordinates of X and
apply standard K-Means on Y.!

Note 1: Remember the original K-Means problem is NP-hard; we drop the
indicator constraint $Y \in S_{Ind}$ and obtain an approximate solution.!

Note 2: Spectral techniques (EVD) are vastly used in data analysis
because the theory is well understood. However, they do not scale well to big
data as EVD complexity is O(n^3) (there exist however stochastic EVD methods).!

Xavier Bresson

21

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

22

Balanced Cuts for Graph Partitioning!


Motivations:!
(1) Kernel techniques are sound, but they do not scale well to large datasets
because K is full: memory requirement is O(n2) and EVD is O(n3).!
Ex: n=1M: n^2 = 10^12 entries (memory), n^3 = 10^18 operations (EVD)!!
Solution: Sparse matrices, but how? Data graphs!!
(2) Natural graphs/networks: Partitioning graphs is a fundamental
problem for (i) Identifying connected groups of data (users on Facebook),!
(ii) Balanced graph cutting for distributed big graphs computing. !

Solution: Balanced graph cuts. They play a central role in:!


(1) Graph theory (define families of networks and their properties).!
(2) Applications (state-of-the-art for unsupervised clustering).!

Xavier Bresson

23

Data Clustering = Balanced Cuts!


!! Graph representation: Given a set of data V={x1,...,xn} in R^d, construct a
k-NN graph G=(V,E,W):!
$$V = \{x_1, ..., x_n\} \subset \mathbb{R}^d \ \xrightarrow{\text{k-NN graph construction}} \ G = (V, E, W)$$

!! Data clustering: Observe that data close on
graphs are similar, so finding clusters of data
can be done by cutting the graph at some
specific locations.!

Min Cut [Wu-Leahy93]!

!! Cut operator: Given a graph G, a cut partitions G into two sets S and Sc
with value:!
$$\mathrm{Cut}(S, S^c) = \sum_{i \in S, j \in S^c} W_{ij}$$

[Figure: three example cuts on a weighted graph.]!
Value of cut1: Cut(S,Sc) = 0.3 + 0.2 + 0.3 = 0.8!
Value of cut2: Cut(S,Sc) = 0.5 + 0.5 + 0.5 + 0.5 = 2.0!
Value of cut3: Cut(S,Sc) = 0.5!

!! Min cut is biased: It favors small sets of isolated data.!
Solution: Find clusters of similar sizes/volumes while minimizing the cut
operator:!
$$\min \mathrm{Cut} \ \text{ and } \ \max \mathrm{Vol} \ \Leftrightarrow \ \min \frac{\mathrm{Cut}}{\mathrm{Vol}} \quad \text{(balanced cuts!)}$$

Most Popular Balanced Cuts!

!! Cheeger Cut [Cheeger69] (most popular in graph theory):!
$$\min_S \frac{\mathrm{Cut}(S, S^c)}{\min(\mathrm{Vol}(S), \mathrm{Vol}(S^c))}$$

!! Normalized Cut [Shi-Malik00] (most popular in applications); partitioning by min edge cuts:!
$$\min_S \frac{\mathrm{Cut}(S, S^c)}{\mathrm{Vol}(S)} + \frac{\mathrm{Cut}(S, S^c)}{\mathrm{Vol}(S^c)}$$

!! Normalized Association; partitioning by max vertex matching:!
$$\max_S \frac{\mathrm{Assoc}(S, S)}{\mathrm{Vol}(S)} + \frac{\mathrm{Assoc}(S^c, S^c)}{\mathrm{Vol}(S^c)}$$

with:!
$$\mathrm{Cut}(S, S^c) = \sum_{i \in S, j \in S^c} W_{ij} \quad \text{(#connections between S and Sc)}$$
$$\mathrm{Vol}(S) = \sum_{i \in S} d_i, \ \text{ with } \ d_i = \sum_{j \in V} W_{ij}$$
$$\mathrm{Assoc}(S, S) = \sum_{i \in S, j \in S} W_{ij} \quad \text{(#connections between all vertices in S)}$$

Spectral Relaxation!
!! Issue: Solving balanced cut problems directly is intractable (NP-hard
combinatorial problems). We need to find the best possible approximation
(close to the original solution); the best approximate techniques are based on
spectral relaxation.!

!! Normalized Association (K clusters):!
$$\max_{S_k} \sum_{k=1}^{K} \frac{\mathrm{Assoc}(S_k, S_k)}{\mathrm{Vol}(S_k)} \qquad (1)$$

!! Reformulation: Rewrite the combinatorial problem as a continuous
optimization problem:!
$$(1) \ \Leftrightarrow \ \max_F \sum_{k=1}^{K} \frac{F_k^T W F_k}{F_k^T D F_k} \qquad (2)$$
with the change of variable $Y_k = \frac{D^{1/2} F_k}{\|D^{1/2} F_k\|_2}$:!
$$(2) \ \Leftrightarrow \ \max_Y \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K, \ Y \in S_{Ind}, \ \text{ and } \ A = D^{-1/2} W D^{-1/2}$$
$$Y_{ik} = \begin{cases} \left(\frac{D_{ii}}{\mathrm{Vol}(S_k)}\right)^{1/2} & \text{if } i \in S_k \\ 0 & \text{otherwise} \end{cases}$$

Spectral Relaxation!
Binary constraint $Y \in S_{Ind}$ makes the problem NP-hard, so we drop it:!
$$\max_Y \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K$$

Solution: Top K eigenvectors of A given by EVD.!

Q: Have we seen this before?!

Xavier Bresson

28
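
A minimal sketch of this spectral pipeline with scikit-learn (SpectralClustering builds the k-NN affinity graph, takes Laplacian eigenvectors and runs k-means internally; all parameters are illustrative):

import numpy as np
from sklearn.cluster import SpectralClustering

# Toy data: two concentric rings that linear k-means cannot separate
rng = np.random.RandomState(0)
t = rng.uniform(0, 2 * np.pi, 400)
r = np.r_[np.ones(200), 3 * np.ones(200)] + 0.1 * rng.randn(400)
X = np.c_[r * np.cos(t), r * np.sin(t)]

# Spectral clustering: k-NN affinity graph + normalized Laplacian eigenvectors + k-means
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           n_neighbors=10, assign_labels='kmeans', random_state=0)
labels = model.fit_predict(X)
print(np.bincount(labels))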

Relationships between Kernel K-Means and
Balanced Cuts [Bach-Jordan04, Dhillon-Guan-Kulis04]!

Kernel K-Means:!
$$\max_Y \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K, \ Y \in S_{Ind}, \ \text{ with } A = \Sigma^{1/2} K \Sigma^{1/2} \qquad (1)$$
Balanced Cuts:!
$$\max_Y \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K, \ Y \in S_{Ind}, \ \text{ with } A = D^{-1/2} W D^{-1/2} \qquad (2)$$
Equivalence: $(1) \Leftrightarrow (2)$ for $\Sigma = D^{-1}$ and $K = W$.!

Consequence: Balanced cut problems can also be solved by the EM approach!!
Graclus algorithm [Dhillon-Guan-Kulis04]: It does not require EVD, so it scales
up to large datasets (one of the best graph partitioning techniques).!
Either EM or Spectral approaches still compute approximate solutions. Can
we improve the quality of the clustering/partitioning solutions?!

Xavier Bresson

29

Demo!
!! Run lecture04_code03.ipynb!

!"#$%&'(&%))*+'

.5'

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

31

NCut Algorithm

[Yu-Shi04] !

Motivation: We must drop the binary constraint (Y in Sind, or Y in {0,1})
to compute a solution of the NP-hard problem. Spectral solutions do not
naturally satisfy this constraint, so let us try to enforce it.!

NCut: It is one of the best graph clustering algorithms!!
It has two steps:!
(1) Compute spectral solutions (as before).!
(2) Find the best solution that satisfies the binary constraint (new step).!

Xavier Bresson

32

Demo: NCut!
!! Run lecture04_code04.ipynb!

!"#$%&'(&%))*+'

..'

Technical Details!
Step 1 (solved by EVD):!
$$Y^\star = \mathrm{argmax}_Y \ \mathrm{tr}(Y^T A Y) \ \text{ s.t. } \ Y^T Y = I_K, \ \text{ with } A = D^{-1/2} W D^{-1/2}$$
Step 2 (solved by SVD):!
$$\min_{Z,R} \|Z - Y^\star R\|_F^2 \ \text{ s.t. } \ R^T R = I_K \ \text{ and } \ Z \in \{0, 1\}$$

[Figure: Step 1 gives Y*; Step 2 gives Z (a rotation of Y*).]!

Xavier Bresson

34

EVD and SVD!

Q: Do you know SVD?!
EVD (Eigenvalue Decomposition), for A symmetric and PSD:!
$$A = Y \Lambda Y^T, \qquad A y_k = \lambda_k y_k$$

SVD (Singular Value Decomposition): Generalization of EVD to
non-square and non-PSD matrices:!
$$A_{n \times m} = U_{n \times n} \, \Sigma_{n \times m} \, V^T_{m \times m}$$
U: left singular vectors, V: right singular vectors, Sigma: singular values, with:!
$$A v_k = \sigma_k u_k, \qquad u_k^T A = \sigma_k v_k^T, \qquad U^T U = I_n, \qquad V^T V = I_m$$

EVD and SVD are matrix factorization techniques: very common tools in
(linear) data science; many techniques boil down to EVD and SVD.!

Xavier Bresson

35
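
A minimal numpy illustration of both factorizations (random matrices; eigh is for symmetric matrices, svd for arbitrary ones):

import numpy as np

rng = np.random.RandomState(0)

# EVD of a symmetric PSD matrix A = B B^T
B = rng.randn(5, 5)
A = B @ B.T
lam, Y = np.linalg.eigh(A)                           # A = Y diag(lam) Y^T
print(np.allclose(A, Y @ np.diag(lam) @ Y.T))

# SVD of a rectangular matrix
M = rng.randn(6, 4)
U, s, Vt = np.linalg.svd(M, full_matrices=False)     # M = U diag(s) V^T
print(np.allclose(M, U @ np.diag(s) @ Vt))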

Example!
Ncut on noisy real-world networks WEBK4 and CITESEER: !

Xavier Bresson

36

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

37

PCut

[B-et.al.16]

State-of-The-Art !

!! Balanced Cuts are biased towards cluster outliers:!

!! Results:!

!"#$%&'(&%))*+'

.3'

Demo: PCut!
!! Run lecture04_code05.ipynb!

!"#$%&'(&%))*+'

.4'

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

40

Clustering/Partitioning with!
Unknown Number of Clusters!
Recall: Previous techniques assume the number K of clusters is known.!
If K is unknown, there exist two approaches:!
(1) Define a quality measure of clustering (domain expertise), use the
previous techniques with different K values, and pick the K with the best measure.!
(2) K is a variable of the clustering problem: Louvain Algorithm.!
Louvain Technique [Blondel-et.al.08]: Very popular in social sciences.
It is a greedy algorithm that optimizes the modularity objective:!
$$\max_f Q(f) = \frac{1}{2m} \sum_{ij} \left(W_{ij} - \frac{d_i d_j}{2m}\right) \delta(f_i, f_j),
\quad \delta(f_i, f_j) = \begin{cases} 1 & \text{if } f_i = f_j \\ 0 & \text{otherwise} \end{cases},
\quad 2m = \sum_{ij} W_{ij} = \mathrm{Vol}(V)$$
This is equivalent to!
$$\min_{S_k} \sum_{k=1}^{K} \mathrm{Cut}(S_k, S_k^c) - \frac{\mathrm{Vol}(S_k)\cdot\mathrm{Vol}(S_k^c)}{2m}
\quad \text{vs. the balanced cut} \quad \min_{S_k} \sum_{k=1}^{K} \frac{\mathrm{Cut}(S_k, S_k^c)}{\mathrm{Vol}(S_k)\mathrm{Vol}(S_k^c)}$$
Same effect as a balanced cut, but an extra parameter to select.!

Xavier Bresson

41

Greedy Algorithm!
Q: What is a greedy algorithm?!
Step 1: Energy optimization step!
Find communities by locally optimizing the modularity.!
Each node is first associated to its own community; then for each
node i, we assign i to the community of the neighbor that best
improves the modularity. The process is repeated until no changes occur.!

Step 2: Graph coarsening step!
Compute a new graph by merging each community into a super-vertex.!
From Step 1, a new graph is constructed by forming a new adjacency matrix:!
$$W^{new}_{kk'} = \sum_{i \in S_k, j \in S_{k'}} W_{ij}$$

Note: Greedy algorithm (see the sketch after this slide):!
(1) (Relatively) fast algorithm!
(2) No theoretical guarantee on the solution (local minimizer)!

Xavier Bresson

42
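
A minimal sketch of running Louvain on a toy graph (assuming networkx >= 2.8, which ships a Louvain implementation; the caveman graph is an illustrative choice):

import networkx as nx

# Toy graph with two obvious communities: two cliques of 10 nodes joined in a ring
G = nx.connected_caveman_graph(2, 10)

# Louvain: greedy modularity optimization (available in networkx >= 2.8)
communities = nx.community.louvain_communities(G, seed=0)
print([sorted(c) for c in communities])

# Modularity of the found partition
print(nx.community.modularity(G, communities))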

Demo: Louvain!
!! Run lecture04_code06.ipynb!

!"#$%&'(&%))*+'

/.'

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

44

Clustering/Partitioning !
Small-Scale Communities !
Motivation: How does Facebook target small communities of users for
advertisement?!
Goal: Identify small-scale clusters on networks.!

Nibble Algorithm: Leskovec, Lang, Dasgupta, Mahoney08, Spielman,


Teng08, Reid, Chung, Lang06.!

Xavier Bresson

45

Nibble Algorithm!
Core principle: It is a greedy algorithm that optimizes locally the
Cheeger Cut on graphs:!
Iterate until K clusters are found:!
Step 1: Pick a vertex s randomly on the graph.!
Step 2: Diffuse the Dirac function of the vertex s:!
$$f^{l+1} = L f^l, \quad \text{with } L = I_n - D^{-1/2} W D^{-1/2}$$
Step 3: Threshold f at the value tau that optimizes the Cheeger cut:!
$$\min_{S_\tau} E_{Cheeger}(S_\tau) = \frac{\mathrm{Cut}(S_\tau, S_\tau^c)}{\min(\mathrm{Vol}(S_\tau), \mathrm{Vol}(S_\tau^c))}$$

Xavier Bresson

46

Demo: Nibble!
!! Run lecture04_code07.ipynb!

!"#$%&'(&%))*+'

/2'

Outline!
Definition!
!

Linear K-Means!
Kernel K-Means!
Balanced Cuts!
NCut!
PCut!
Louvain Algorithm!
Nibble Algorithm!
Conclusion!

Xavier Bresson

48

Summary!
Unsupervised Clustering:!

K is known:!
Data (no graph), full matrix: K-Means*; with kernel construction: Kernel K-Means
((1) EM (Graclus*), (2) Spectral) - NP-hard, spectral relaxation, equivalence of solutions.!
Graph, sparse matrix (but graph construction may be needed): Balanced Cuts
(Cheeger, Normalized Cuts/Associations) - Spectral Graph Clustering.!
Linear relaxation: NCut* (loose relaxation of balanced cuts, medium-size clusters).!
Non-linear relaxation: PCut (tight relaxation); Nibble (small-scale clusters, greedy algorithm).!

K is unknown:!
Louvain Algorithm (greedy technique).!

Transductive Clustering!
Previous techniques are fully unsupervised, no prior information about
class labels is given. !
Transductive clustering: when class labels are available, they usually boost
the clustering results significantly, like 5-20%. However, it can be time
consuming to collect labeled data (trade-off).!
Note: Transductive clustering is different from semi-supervised classification.
Classification aims at learning a decision function for new data, while the clustering
objective is to classify the given data (no new data are considered).!

Xavier Bresson

50

Conclusion!
Unsupervised clustering is one of the most generic data analysis tasks.!
(1) It is applied when basically nothing is known about the data.!
(2) It is a Lego block that can be used for all kind of data analysis tasks.!

Unsupervised data representation or feature extraction will be


studied in Lecture 7: PCA, NMF, Sparse Coding.!

Xavier Bresson

51

Questions?

Xavier Bresson

52

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 5: Supervised Learning!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

Classes of Learning Techniques!


Q: Do you know the difference between unsupervised,!
supervised, and semi-supervised learning?!
Three classes of learning algorithms (for clustering, classification,
recommendation, visualization, feature extraction, etc):!
(1) Unsupervised learning: Algorithms that do not use any prior
information.!
(2) Supervised learning: Algorithms that only use labeled data.!
(3) Semi-Supervised learning: Algorithms that use both labeled and
unlabeled data. !
Labeled data are gold data but they are generally expensive to produce,
while unlabeled data are usually cheap to get. Ex: Image recognition in
Computer Vision, millions of images are available with Google, but they are
unlabeled. How much time to label 1,000 images? !
Previous lecture focused on unsupervised data clustering.!
This lecture focuses on supervised and semi-supervised data classification,
particularly SVM techniques. !
Xavier Bresson

Support Vector Machine (SVM)!


!! SVM is a very popular classification technique (among top 10 algorithms in
data science). SVM is used in deep learning as loss function (later).!
!! We will cover the supervised and semi-supervised binary SVM
classification techniques: Given a set of labeled data that belongs to two
classes, construct a classification function that outputs the class of any new data.!
$$V = \{x_i, \ell_i\}_{i=1}^n, \quad x_i \in \mathbb{R}^d \ \text{(data)}, \quad \ell_i \in \{-1, +1\} \ \text{(labels)}$$
$$f(x) = +1 \ \ \forall x \in C_1 \ (\text{label } \ell_i = +1), \qquad f(x) = -1 \ \ \forall x \in C_2 \ (\text{label } \ell_i = -1)$$

History of SVM Techniques!

[Figure: decision functions f(x)=+1 on C1, f(x)=-1 on C2 for each method.]!
Linear SVM - Supervised Learning [Vapnik-Chervonenkis63]!
Non-Linear/Kernel SVM - Supervised Learning [Boser, Guyon, Vapnik 92]!
Laplacian SVM - Semi-Supervised Learning [Belkin, Niyogi, Sindhwani 06]!

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

Linear SVM [Vapnik, Chervonenkis 63]!


!! Assumption: Training (and test) data are linearly separable, i.e. data can
be perfectly separated by a hyperplane: !

C1!
C2!

hyperplane!

!! Linear SVM: Given a training dataset V={xi,li}, design a classification


function that assigns any new data x to the class that is best consistent with V. !

$$f : x \in \mathbb{R}^d \rightarrow \{-1, +1\}$$
!! Class of (linear) solutions: Given the assumption, determine the hyperplane
that best separates the two classes. Any hyperplane is parameterized by two
variables (w,b), where w is the normal vector of the hyperplane and b is the offset value.!
Hyperplane equation: $\langle w, x \rangle + b = 0$!

Linear SVM Classifier!

!! Classification function:!
$$f(x) = \mathrm{sign}(\langle w, x \rangle + b) = \begin{cases} +1 & \text{if } x \in C_1 \ (\langle w, x \rangle + b > 0) \\ -1 & \text{if } x \in C_2 \ (\langle w, x \rangle + b < 0) \end{cases}$$
with $\langle w, x \rangle + b = 0$ on the separating hyperplane (normal vector w).!

How to Find (w,b)?!

!! SVM idea: Define the best hyperplane by maximizing the margin d
between the 2 classes.!

!! Relationship between w and d:!
$$\langle w, x_+ \rangle + b = +1, \qquad \langle w, x_- \rangle + b = -1
\ \Rightarrow \ \langle w, x_+ - x_- \rangle = 2$$
$$\vec{d} = x_+ - x_- = \frac{2}{\|w\|_2^2} w \ \Rightarrow \ d = \frac{2}{\|w\|_2}$$

Maximize the margin d: $\max d \ \Leftrightarrow \ \max \frac{2}{\|w\|_2} \ \Leftrightarrow \ \min \|w\|_2^2$,
i.e. minimize the weight w.!

Primal Optimization!

!! Variable w is called the primal variable.!
Optimization problem w.r.t. w: $\min_w \|w\|_2^2$ with constraints:!
$$f_i = \langle w, x_i \rangle + b \ \begin{cases} \ge +1 & \text{if } x_i \in C_1 \ (\ell_i = +1) \\ \le -1 & \text{if } x_i \in C_2 \ (\ell_i = -1) \end{cases}
\quad \Leftrightarrow \quad \ell_i \cdot f_i \ge 1 \ \ \forall i \in V$$

!! Summary: The SVM classifier f is given by the solution of the Quadratic
Programming (QP) problem (quadratic function over a convex set/polytope):!
$$\min_{w,b} \|w\|_2^2 \ \text{ s.t. } \ \ell_i \cdot f_i \ge 1 \ \forall i \in V, \qquad f_i = \langle w, x_i \rangle + b$$
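
A minimal sketch of fitting and using a linear SVM with scikit-learn, which solves this QP internally (the toy data and C value are illustrative):

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two Gaussian blobs with labels -1 / +1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
y = np.r_[-np.ones(50), np.ones(50)]

# Linear SVM classifier f(x) = sign(<w, x> + b)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

print('w =', clf.coef_[0], 'b =', clf.intercept_[0])
print('number of support vectors:', len(clf.support_vectors_))
print('prediction for (3, 3):', clf.predict([[3.0, 3.0]]))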

Support Vectors!
!! Definition: Data that are exactly located on the margin hyperplanes.!

!! Offset value b: b is defined by the support vectors:!
$$\ell_i \cdot (\langle w, x_i^{SP} \rangle + b) - 1 = 0 \ \ \forall x_i^{SP}
\ \Rightarrow \ b_i = \ell_i - \langle w, x_i^{SP} \rangle, \qquad b = E(\{b_i\})$$

Dual Problem!
Primal problem: $\min_{w,b} \|w\|_2^2 \ \text{ s.t. } \ \ell_i \cdot f_i \ge 1 \ \forall i \in V$!

Dual problem: The dual variable is $\alpha$.!
Motivation: Nice to work with data products $\langle x_i, x_j \rangle$ (for the kernel trick).!
After some computations, we obtain the QP problem:!
$$\min_{\alpha \ge 0} \ \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \ \text{ s.t. } \ \langle \alpha, \ell \rangle = 0$$
With:!
$$Q = L K L, \quad L = \mathrm{diag}(\ell_1, ..., \ell_n), \quad K_{ij} = \langle x_i, x_j \rangle \ \text{(linear kernel)}$$

Xavier Bresson

12

Optimization Algorithm!
Classification function:!
$$f(x) = \mathrm{sign}(\langle w^\star, x \rangle + b^\star) = \mathrm{sign}(\alpha^{\star T} L K(x) + b^\star)$$
Optimization scheme: The solution $\alpha^\star$ is computed by an iterative primal-dual scheme.!
Initialization: $\alpha^{l=0} = y^{l=0} = 0$, with step sizes $\tau = 1/\|Q\|$ and $\sigma = 1/\|L\|$.!
Iterate until convergence (l=0,1,2,...): a projected update of $\alpha^{l+1}$ (involving
$(\tau Q + I_n)^{-1}$ and the projection $P_{\ge 0}$ onto the constraint $\alpha \ge 0$), followed by an
update of the auxiliary variable $y^{l+1} = y^l + \sigma L \alpha^{l+1}$; finally $\alpha^\star = \alpha^{l=\infty}$.!

Xavier Bresson

13

Demo: Standard/Linear SVM!


!! Run lecture05_code01.ipynb!

!"#$%&'(&%))*+'

,/'

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

15

Soft-Margin SVM [Cortes-Vapnik 95]!

!! Motivation: Linear SVM supposes data are linearly separable, i.e. there exists
a hyperplane separating the data perfectly.!
Q: What happens in the presence of outliers?!

!! Soft SVM: Find a hyperplane that best separates the data (by maximizing
the margin) while allowing as few outliers as possible.!

Modeling Errors with Slack Variables!

!! Slack variables e_i: Measure the error of each data point x_i being an outlier.!

!! New optimization (trade-off between a large margin and small errors):!
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \ \text{ s.t. } \ \ell_i \cdot f_i \ge 1 - e_i, \ e_i \ge 0 \ \forall i \in V$$

Dual Problem!
!! Primal problem:!
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \ \text{ s.t. } \ \ell_i \cdot f_i \ge 1 - e_i, \ e_i \ge 0 \ \forall i \in V$$

!! Dual problem (after some computations):!
$$\min_{0 \le \alpha \le \lambda} \ \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \ \text{ s.t. } \ \langle \alpha, \ell \rangle = 0$$
Only this trivial modification (the upper bound on $\alpha$)!!
With:!
$$Q = L K L, \quad L = \mathrm{diag}(\ell_1, ..., \ell_n), \quad K_{ij} = \langle x_i, x_j \rangle$$

Demo: Soft-Margin SVM !


!! Run lecture05_code02.ipynb!

!"#$%&'(&%))*+'

,4'

Hinge Loss Function!

!! Primal problem:!
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} e_i \ \text{ s.t. } \ \ell_i \cdot f_i \ge 1 - e_i, \ e_i \ge 0 \ \forall i \in V$$
is equivalent to the unconstrained problem!
$$\min_{w,b} \|w\|_2^2 + \lambda \sum_{i=1}^{n} V_{hinge}(f_i, \ell_i), \qquad V_{hinge}(f_i, \ell_i) = \max(0, 1 - f_i \cdot \ell_i)$$

Q: Do you know the hinge function?!
The hinge is a popular SVM loss function.!

Several Other Loss Functions!

Quadratic/L2 loss: $V_{\ell_2}(f_i, \ell_i) = \begin{cases} (1 - f_i \ell_i)^2 & \text{if } f_i \ell_i \le 1 \\ 0 & \text{otherwise} \end{cases}$!

Elastic Net loss: $V_{ElasticNet}(f_i, \ell_i) = \begin{cases} (1 - f_i \ell_i)^2 + |1 - f_i \ell_i| & \text{if } f_i \ell_i \le 1 \\ 0 & \text{otherwise} \end{cases}$!

Huber loss: $V_{Huber}(f_i, \ell_i) = \begin{cases} \frac{1}{2} - f_i \ell_i & \text{if } f_i \ell_i \le 0 \\ \frac{1}{2}(1 - f_i \ell_i)^2 & \text{if } 0 < f_i \ell_i \le 1 \\ 0 & \text{otherwise} \end{cases}$!

Logistic loss (Facebook, Amazon): $V_{Logistic}(f_i, \ell_i) = e^{1 - f_i \ell_i}$!

Xavier Bresson

21

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

22

Kernel Techniques!
Very popular techniques (until deep learning).!

Reproducing Kernel Hilbert Space (RKHS): A space associated to
bounded, symmetric, PSD operators called kernels K that can reproduce any
smooth function f.!

Representer Theorem [Scholkopf, Herbrich, Smola 01]: Any continuous and
smooth function in a RKHS can be represented as a linear combination of the
kernel function K evaluated at the data points:!
$$f(x) = \sum_{i=1}^{n} a_i K(x, x_i) + b$$
Xavier Bresson

23

Interpretation of Representer Theorem!

Interpretation: The Representer Theorem is a powerful tool for function
interpolation in high-dim spaces:!
$$f(x) = \sum_{i=1}^{n} a_i K(x, x_i) + b, \qquad \text{with, e.g., } K(x, x_i) = e^{-\|x - x_i\|_2^2/\sigma^2}$$

Popular kernels:!
(1) Linear kernel: $K(x, y) = \langle x, y \rangle$!
(2) Gaussian kernel: $K(x, y) = e^{-\|x - y\|_2^2/\sigma^2}$!
(3) Polynomial kernel: $K(x, y) = (a\langle x, y \rangle + b)^c$!

Xavier Bresson

24

Feature Maps and Kernel Trick!

!! Definition: Any feature map $\phi$ defines a reproducing kernel K:!
$$\langle \phi(x), \phi(y) \rangle \stackrel{def}{=} K(x, y)$$
and inversely:!
$$K_\phi(x, y) \stackrel{def}{=} \langle \phi(x), \phi(y) \rangle$$

!! Summary (a bounded continuous function f, a reproducing kernel K and a
feature map $\phi$ are three views of the same object):!
Representer Theorem: $f(x) = \sum_i a_i K_{x_i}(x)$!
Reproducing Kernel K: $K(x, y) = \langle \phi(x), \phi(y) \rangle$!
Norm of f: $\|f\|_{H_K}^2 = a^T K a$!
Kernel trick: $f(x) = \sum_i a_i \langle \phi(x_i), \phi(x) \rangle$!

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

26

Non-Linear/Kernel SVM [Boser, Guyon, Vapnik 92]!

!! Motivation: Linear/soft SVM assume data are linearly separable (up to a few
outliers). For several real-world data, the hyperplane assumption is not
satisfied. A better separation is a non-linear hyperplane, that is, a hypersurface.!

!! Kernel Trick: Project data into a higher-dim space with the feature map $\phi$,
where the data are linearly separable (linear separator vs. non-linear separator).!

!! Decision function in high-dim:!
$$f(x) = \langle w, x \rangle + b, \ \ w = \sum_i \alpha_i \ell_i x_i
\quad \rightarrow \quad
f(x) = \langle w, \phi(x) \rangle + b, \ \ w = \sum_i \alpha_i \ell_i \phi(x_i)$$
$$f(x) = \sum_i \alpha_i \ell_i \langle \phi(x), \phi(x_i) \rangle + b = \sum_i \alpha_i \ell_i K(x_i, x) + b$$

Optimization!
Dual problem:!
$$\min_{0 \le \alpha \le \lambda} \ \frac{1}{2} \alpha^T Q \alpha - \langle \alpha, 1 \rangle \ \text{ s.t. } \ \langle \alpha, \ell \rangle = 0$$
With: $Q = L K L$, $L = \mathrm{diag}(\ell_1, ..., \ell_n)$, and the only change is the kernel:
$K_{ij} = \langle x_i, x_j \rangle \ \rightarrow \ K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.!
Same optimization algorithm!!

Recall: We never compute $\phi$ and the products $\langle \phi(x_i), \phi(x_j) \rangle$ explicitly, only
the kernel matrix, e.g.:!
$$K(x, y) = (a\langle x, y \rangle + b)^c \quad \text{or} \quad K(x, y) = e^{-\|x - y\|_2^2/\sigma^2}$$

Xavier Bresson

28
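
A minimal scikit-learn sketch of a non-linear SVM with a Gaussian (RBF) kernel on data that no hyperplane can separate (the C and gamma values are illustrative):

import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: a small inner blob (class -1) inside a ring (class +1)
rng = np.random.RandomState(0)
inner = 0.5 * rng.randn(100, 2)
t = rng.uniform(0, 2 * np.pi, 100)
outer = np.c_[3 * np.cos(t), 3 * np.sin(t)] + 0.2 * rng.randn(100, 2)
X = np.vstack([inner, outer])
y = np.r_[-np.ones(100), np.ones(100)]

# Kernel SVM: K(x, y) = exp(-gamma * ||x - y||^2)
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)
print('training accuracy:', clf.score(X, y))
print('prediction at the origin:', clf.predict([[0.0, 0.0]]))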

Demo: Kernel/Non-Linear SVM!

!! Run lecture05_code03.ipynb!

General Supervised Learning!

!! Generalization:!
$$\min_w \|w\|_2^2 + \lambda \sum_{i=1}^{n} V_{loss}(f_i, \ell_i)
\quad \rightarrow \quad
\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{loss}(f_i, \ell_i)$$
(regularity of f + error for inaccurate predictions, with a trade-off parameter $\lambda$;
note $\|f\|_{H_K}^2 = \|w\|_2^2$ for $f(x) = \langle w, x \rangle$).!

Representer theorem: $f(x) = \sum_{i=1}^{n} a_i K(x, x_i)$!
Norm of f in the RKHS: $\|f\|_{H_K}^2 = \langle f, f \rangle_{H_K} = \sum_{ij} f_i f_j K_{ij} = f^T K f$!

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

31

Semi-Supervised Learning: Graph SVM!

!! Motivation: It is time consuming to label data, but it is cheap to collect
unlabeled data! Besides, unlabeled data are not without information: they hold
the geometric structure of the data. The best design of the classification function
uses labeled and unlabeled data simultaneously.!

[Figure: labeled data (+,-) only - supervised learning, vs. labeled data (+,-) plus
unlabeled data (o) - semi-supervised learning; decision function f(x)=+1 / f(x)=-1.]!

Semi-Supervised Learning Graph SVM!


Case of few labels: Semi-supervised learning is particularly wanted in
the case of few labels (extreme case: one label per class). Let us call!
n: the number of labeled data!
m: the number of unlabeled data!
Then, semi-supervised learning plays an important role in the case $n \ll m$.!

Q: How to design SVM to deal simultaneously
with labeled and unlabeled data?!
SVM on graphs!!

Xavier Bresson

33

Manifold Assumption!
Observation: Geometry of data is independent of labels!!
(Labeled and unlabeled) data are assumed to lie on a manifold, where the
classification will be carried out. !

Xavier Bresson

34

Manifold Assumption!
!! How to introduce the manifold geometry in SVM?!
First, we approximate the manifold M with a neighborhood graph,
i.e. a k-NN graph.!
Second, we add a regularization term that forces the classification
function f to be smooth on the manifold (/graph).!

Optimization!
!! Optimization problem:!
$$\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{loss}(f_i, \ell_i) + \lambda_G \int_M |\nabla f|^2$$
Dirichlet energy:!
(1) It forces f to be smooth on M.!
(2) Its derivative is the heat diffusion $\Delta f = 0$.!

!! Discretization of the Dirichlet energy on graphs (Laplacian operator L):!
$$\int_M |\nabla f|^2 \ \approx \ \sum_{ij} W_{ij} |f(x_i) - f(x_j)|^2 = f^T L f$$

Algorithm!
Semi-supervised SVM or Laplacian SVM [Belkin, Niyogi, Sindhwani 06]:!
$$\min_{f \in H_K} \|f\|_{H_K}^2 + \lambda \sum_{i=1}^{n} V_{hinge}(f_i, \ell_i) + \lambda_G f^T L f$$
$$\Rightarrow \ f(x) = \mathrm{sign}\Big(\sum_{i=1}^{n} a_i^\star K(x, x_i)\Big), \qquad a^\star = (I + \lambda_G L K)^{-1} H L \alpha^\star$$
$$\alpha^\star = \mathrm{argmin}_{0 \le \alpha \le \lambda} \ \frac{1}{2}\alpha^T Q \alpha - \langle \alpha, 1 \rangle \ \text{ s.t. } \ \langle \alpha, \ell \rangle = 0,
\qquad Q = L H K (I + \lambda_G L K)^{-1} H L$$

Xavier Bresson

37

Demo: Graph SVM !


!! Run lecture05_code04.ipynb!

!"#$%&'(&%))*+'

.3'

Outline!
Learning techniques!
Linear SVM!
Soft-Margin SVM!
Kernel Techniques!
Non-Linear/Kernel SVM!
Graph SVM!
Conclusion!

Xavier Bresson

39

Summary!

[Figure: decision functions f(x)=+1 on C1, f(x)=-1 on C2 for each method.]!
Linear SVM - Supervised Learning [Vapnik-Chervonenkis63]!
Non-Linear/Kernel SVM - Supervised Learning [Boser, Guyon, Vapnik 92]!
Laplacian SVM - Semi-Supervised Learning [Belkin, Niyogi, Sindhwani 06]!

Summary!
!! General supervised and semi-supervised optimization techniques:!
$$\min_{f \in H_K} \underbrace{\|f\|_{H_K}^2}_{\text{regularity of } f} + \lambda \sum_{i=1}^{n} \underbrace{V_{loss}(f_i, \ell_i)}_{\text{error for inaccurate predictions}} + \lambda_G \underbrace{R_{graph}(f)}_{\text{graph regularization for unlabeled data}}$$

$$V_{loss} = \begin{cases} \text{Hinge} \\ \text{L2} \\ \text{L1} \\ \text{Huber} \\ \text{Logistic} \end{cases}
\qquad
R_{graph}(f) = \begin{cases} \text{Dirichlet: } \|\nabla_G f\|_2^2 \\ \text{Total Variation: } \|\nabla_G f\|_1 \\ \text{Wavelets: } \|D_{wavelets} f\|_2^2 \end{cases}$$

Questions?

Xavier Bresson

42

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 6: Recommender Systems!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

Introduction!
Recommendation has become a central part of intelligent systems.
Q: Where do you find recommender systems?

Examples: the Google search engine recommends webpages on the internet;
Netflix recommends movies, Facebook friends, Amazon products,
LinkedIn jobs, and the NY Times website articles:

Xavier Bresson

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

Google PageRank

A billion-dollar algorithm!
PageRank is an algorithm that ranks websites on the Internet. It is at
the core of the Google search engine, which introduced a revolution in 1998, as
ranking was previously done manually by humans.
Q: Do you know how many webpages existed in 1998 and in 2016?
In 1998, the size of the WWW was 2.4M webpages.
Today, in Aug 2016, the size of the WWW is 4.6B!

Xavier Bresson

5

PageRank Technique

It is a sound technique as it is:
(1) Mathematically well defined.
(2) Computationally efficient.

Core idea: PageRank sorts the vertices of a directed graph G using the
stationary state of G.

Definition: The stationarity and modes of vibration of graphs/networks
can be studied by EVD (Lecture 3) such that:

    A x_l = λ_l x_l

What is the operator A in the case of directed graphs? Or, how to design a
meaningful stationary state of G?
⟹ Perron-Frobenius Theorem.

Xavier Bresson

Perron-Frobenius Theorem

Given a graph G=(V,E,W) defined by a stochastic and irreducible matrix
W, the PF theorem establishes that the largest left eigenvector (with
eigenvalue 1) is the stationary state (PageRank) solution:

    x_maxᵀ W = λ_max x_maxᵀ = x_maxᵀ        (eigenvalue λ_max = 1)

Consequence: x_pagerank = solution of the left eigenproblem

    xᵀ W = xᵀ

Xavier Bresson

7

Stochastic Matrix

Definition: A matrix W̄ with the rows normalized as probability density
functions:

    Σ_j W̄_ij = 1,   i.e.   W̄ 1 = 1

Interpretation: W̄_ij = probability to move from vertex i to vertex j.

Make W stochastic:

    W̄ = D⁻¹ W,   with  D_ii = Σ_j W_ij  if the i-th row ≠ 0, and 0 otherwise.

Xavier Bresson

8

Irreducible Matrix

Definition: A matrix W that represents a strongly
connected graph, that is, W has, for any pair of
vertices (i,j):
(1) A directed path from i to j.
(2) A directed path from j to i.

Make W irreducible:

    W_si = α W̄ + (1 − α) (1_n 1_nᵀ)/n

    W_si: stochastic and irreducible matrix (= full matrix)
    W̄: original matrix (= sparse matrix)

Common choice: α = 0.85

Xavier Bresson

9

Interpretation

Q: What is a random surfer?
Term (1−α)(1_n 1_nᵀ)/n: It is equivalent to a random surfer/user who can jump to
any webpage.
Whole model α W̄ + (1−α)(1_n 1_nᵀ)/n: It represents a surfer/user who follows
the internet link structure α% of the time and, in the remaining (1−α)% of the time,
clicks on a random webpage that has no connection to the previous page.

Xavier Bresson

10

Naive Algorithm

PageRank simple algorithm: Solve the EVD problem:

    Left eigenproblem:    xᵀ W_si = xᵀ
    ⟺ (xᵀ W_si)ᵀ = (xᵀ)ᵀ
    Right eigenproblem:   W_siᵀ x = x

Limitations: Computing the eigensolution is:
(1) Slow: O(n²)
(2) Memory consuming: O(n²)
(3) Not easily parallelizable; eigenvectors are solutions of global linear systems.
⟹ EVD does not scale to big networks like the Internet.

Xavier Bresson

11

Power Method

[Mises, Pollaczek-Geiringer 29; Page, Brin, Motwani, Winograd 98]

The solution of W_siᵀ x = x can be found by a fixed-point iterative scheme:

    x^{k+1} = W_siᵀ x^k

At convergence:  x^∞ = W_siᵀ x^∞,  so  x^∞ = x_pagerank  because it solves x = W_siᵀ x.

Algorithm:

    Initialization:              x^{k=0} = 1_n / n
    Iterate until convergence:   x^{k+1} = α (D⁻¹W)ᵀ x^k + (1 − α) 1_n / n

Xavier Bresson

12
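A minimal sketch of the power iteration above (assumed toy graph and parameters, not the lecture notebook); dangling rows are handled in a simplified way, which is enough for illustration.

import numpy as np
import scipy.sparse as sp

def pagerank(W, alpha=0.85, tol=1e-6, max_iter=200):
    n = W.shape[0]
    deg = np.asarray(W.sum(axis=1)).ravel()
    deg[deg == 0] = 1                               # simplified handling of dangling nodes
    Wbar_T = (sp.diags(1.0 / deg) @ W).T.tocsr()    # transpose of the row-stochastic matrix
    x = np.full(n, 1.0 / n)                         # uniform initialization x^0 = 1_n/n
    for _ in range(max_iter):
        x_new = alpha * (Wbar_T @ x) + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:           # l1 stopping criterion
            break
        x = x_new
    return x_new

W = sp.random(1000, 1000, density=0.01, format='csr')   # random toy directed graph
scores = pagerank(W)
print(scores.argsort()[-5:])                        # five highest-ranked vertices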

Properties

(1) Memory efficiency:
    EVD:           W_siᵀ x = x                                        (full matrix, O(n²) memory)
    Power Method:  x^{k+1} = α (D⁻¹W)ᵀ x^k + (1 − α) 1_n / n          (sparse matrix, O(|E|) memory)
    Ex: For the WWW, n = 4.6B and |E| ≈ 100·n ≪ n².

(2) Speed efficiency:
    EVD: O(n²)
    Power Method: the number of iterations K needed to reach a precision
    δ = ‖x^{k+1} − x^k‖₁ is controlled:

        K = log₁₀ δ / log₁₀ α   ≈  85 for α = 0.85,   1833 for α = 0.99.

(3) Highly parallelizable: basic linear algebra operations Ax.

(4) Very popular: the power method is the standard technique to compute the largest
or smallest eigenvector.

Xavier Bresson

13

Demo: PageRank!
!! Run lecture06_code01.ipynb!

!"#$%&'(&%))*+'

,/'

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

15

Recommender Systems: Collaborative Filtering

Q: Do you know the Netflix Prize?
Famous example: the Netflix Prize, 1M$.
Netflix is the biggest online movie company in the U.S.
(launched in Europe in 2015); #Movies > 100,000 and #Users > 50M.

The Netflix prize was a competition for the best algorithm that
can predict user ratings for movies based on previous ratings.
(Big) data: 480,189 users
            17,770 movies
            100,480,507 ratings ⟹ only about 1% of the full rating matrix is observed.

Companies using the collaborative filtering technique: Facebook, LinkedIn,
MySpace, LastFM.

Xavier Bresson

16

Collaborative Filtering

Formulation: Given a few ratings/observations M_ij of movie j by user i, find
a low-rank matrix X that best fits the ratings.

    Recommendation = Matrix completion:  M (sparse, observed)  →  X (complete, low-rank)

Xavier Bresson

17

Low-Rank Recommendation

Definition: A low-rank matrix has many columns and rows that are linearly
dependent. The rank of a matrix is the number of linearly independent rows
(equivalently, columns).

[Figure: example matrix X with many linearly dependent rows and columns, hence low rank.]

Low-rank assumption: This hypothesis is valid for many real-world
recommender systems. For Netflix movie recommendation:
(1) There exist communities of users who rate movies the same way.
(2) There are groups of movies that receive the same ratings.

The same assumptions hold for Amazon (users, products), LinkedIn (users, jobs), Facebook
(users, ads), etc.

Xavier Bresson

18

Formalization

Modeling:

    min_X rank(X)   s.t.   X_ij = M_ij          ∀ ij ∈ obs     (noiseless case: observations are clean)
    min_X rank(X)   s.t.   X_ij = M_ij + n_ij   ∀ ij ∈ obs     (noisy case: observations may be corrupted)

This is a combinatorial NP-hard problem ⟹ relaxation needed:
either a convex relaxation or a non-convex relaxation.

Xavier Bresson

19

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

20

Convex Optimization!
Convex optimization has become a very powerful tool in the last
decade in data science (2nd topic at NIPS conference, behind deep learning). !
Several state-of-the-art techniques are based on convex opt s.a. (sparse)
data representation, recommender systems, unsupervised clustering, etc. !
Classes of optimization problems in data science:!
(1) Linear programming (LP)!
(2) Quadratic programming (QP)!
(3) Smooth convex optimization !
(4) Non-smooth convex optimization !
(5) Non-Convex optimization!

Xavier Bresson

21

Linear Programming

Linear programming (LP), very common:

    min_x ⟨c, x⟩   s.t.   Ax ≤ b

Linear objective, convex constraint set (polytope).

[Figure: convex set vs. non-convex set.]

Ex: Bipartite graph matching.

Xavier Bresson

22

Quadratic Programming

Quadratic programming (QP):

    min_x ½ xᵀ Q x   s.t.   Ax ≤ b

Ex: SVM.

Tikhonov / least-squares / ridge regression problem:

    min_x ‖Ax − b‖₂² + λ‖Rx‖₂²   ⟺   A′x = b′   (a linear system)

Trick: Never solve a linear system of equations Ax = b exactly!
Use a fast approximate solution, e.g. a few (10-50) steps of the conjugate
gradient technique [Hestenes-Stiefel 52].

Xavier Bresson

23
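A minimal sketch of this trick (assumed toy sizes, with R = I for simplicity): the ridge normal equations are solved only approximately with a few conjugate-gradient steps, never forming the matrix explicitly.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

np.random.seed(0)
A = np.random.randn(500, 200)
b = np.random.randn(500)
lam = 0.1

def matvec(x):                       # applies (A^T A + lam I) without forming it
    return A.T @ (A @ x) + lam * x

op = LinearOperator((200, 200), matvec=matvec)
x_approx, info = cg(op, A.T @ b, maxiter=30)    # ~30 CG steps are often enough
print(info, np.linalg.norm(A @ x_approx - b))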

Smooth Convex Optimization

Smooth convex optimization:

    min_x F_s(x)   s.t.   Ax ≤ b

Newton's algorithm:

    x^{k+1} = x^k − τ [H F_s(x^k)]⁻¹ ∇F_s(x^k)

    H F_s = Hessian matrix,  ∇F_s = gradient vector,  τ = optimal time step.

Advantages:
(1) (Very) fast convergence:

    F(x^k) − F(x*) = O(1/k²)    for convex functions
                   = O(e^{−k})  for strongly convex functions

(2) Also works for non-convex functions (but captures local minimizers).

Xavier Bresson

24

Non-Smooth Convex Optimization

Non-smooth convex optimization:

    min_x F_ns(x)   s.t.   Ax ≤ b

Three classes of efficient algorithms:
(1) Primal-dual techniques
(2) ADMM techniques
(3) Iterative shrinkage techniques (FISTA)

Convergence rate:  F(x^k) − F(x*) = O(1/k²)   (optimal [Nesterov])

Lasso (least absolute shrinkage) regression problem:

    min_x ‖Ax − b‖₂² + λ‖x‖₁

The l1 term encourages sparsity (feature selection).

Compressed sensing problem:
A revolution in digital signal processing (beyond Shannon's sampling theory).

Xavier Bresson

25
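As an illustration (a minimal sketch, not the lecture notebook), the Lasso can be solved by iterative soft-thresholding (ISTA); FISTA adds a momentum step on top of this basic loop. The data below are assumed toy data.

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, n_iter=200):
    # minimizes 0.5*||Ax - b||_2^2 + lam*||x||_1
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient of the smooth term
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)            # gradient of 0.5*||Ax - b||^2
        x = soft_threshold(x - grad / L, lam / L)
    return x

A = np.random.randn(100, 300)
x_true = np.zeros(300); x_true[:5] = 1.0    # sparse ground truth
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print(np.sum(np.abs(x_hat) > 1e-3))         # number of selected features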

Non-Convex Optimization

No general theory for non-convex problems.
Case-by-case mathematical analysis.
What always works: the standard gradient descent algorithm:

    min_x F_nc(x)   ⟹   x^{k+1} = x^k − τ ∂F/∂x (x^k)

Time step τ:
(1) Manual choice, or
(2) Automatic line search technique.

Be aware: Gradient descent techniques are slow!
⟹ A big issue/bottleneck in deep learning.

Feel free to use non-convex optimization techniques, but use them with care.

Xavier Bresson

26

Convex Optimization for Collaborative Recommendation

Combinatorial problem for robust recommendation:

    min_X rank(X)   s.t.   X_ij = M_ij + n_ij   ∀ ij ∈ obs

    ⟹   min_X rank(X) + λ ‖I_obs ∘ (X − M)‖_F²,    with (I_obs)_ij = 1 if ij ∈ obs, 0 otherwise

rank(X) promotes a low-rank solution; the data term promotes fidelity and robustness.
This is a combinatorial NP-hard problem.

Continuous (non-smooth) convex relaxation:

    min_X ‖X‖_* + λ ‖I_obs ∘ (X − M)‖_F²

with the nuclear norm  ‖X‖_* = Σ_{k=1}^{p} |σ_k(X)|,  p = min(m,n),
where X = U Σ Vᵀ (SVD) and Σ = diag(σ_1, ..., σ_p) contains the singular values.

Xavier Bresson

27

Primal-Dual Optimization

Algorithm:

    Initialization:   X^{k=0} = M,   Y^{k=0} = 0

    Iterate until convergence between the primal variable X and the dual variable Y:

        Y^{k+1} = (Y^k + σ X^k) − U h_{1/σ}(Σ) Vᵀ,   with U Σ Vᵀ = SVD(Y^k + σ X^k)

        X^{k+1} = (X^k − τ Y^{k+1} + τλ I_obs ∘ M) / (1 + τλ I_obs)

where h_μ(x) is the soft-thresholding operator.

Xavier Bresson

28

Demo: Collaborative Filtering!


!! Run lecture06_code02.ipynb!

!"#$%&'(&%))*+'

-4'

Properties!
Advantages:!
(1) Unique solution (whatever the initialization)!
(2) Well-posed optimization algorithms!
Limitations:!
(1) Complexity is dominated by SVD O(n3)!
(2) Memory requirement is O(n2)!
Convex algorithms do not scale up to big data.!

Xavier Bresson

30

Non-Convex Techniques

Combinatorial problem for robust recommendation:

    min_X rank(X) + λ ‖I_obs ∘ (X − M)‖_F²,    with (I_obs)_ij = 1 if ij ∈ obs, 0 otherwise

rank(X) promotes a low-rank solution; the data term promotes fidelity and robustness.
This is a combinatorial NP-hard problem.

Continuous non-smooth and non-convex relaxation:

    min_{L,R}  ½‖L‖_F² + ½‖R‖_F² + (λ/2) ‖I_obs ∘ (LR − M)‖_F²

with X = LR,  L of size n×r,  R of size r×m,  and r ≪ n, m.

Xavier Bresson

31
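A minimal sketch of this factorized formulation (assumed toy data and hyper-parameters, plain gradient descent rather than the stochastic or conjugate-gradient variants mentioned next): the observed entries drive L and R towards a rank-r completion of M.

import numpy as np

n, m, r, lam, tau = 100, 80, 5, 5.0, 2e-3
M = np.random.randn(n, r) @ np.random.randn(r, m)      # ground-truth low-rank matrix
I_obs = (np.random.rand(n, m) < 0.3).astype(float)     # 30% of entries observed

L = 0.1 * np.random.randn(n, r)
R = 0.1 * np.random.randn(r, m)
for _ in range(3000):
    E = I_obs * (L @ R - M)            # residual on the observed entries only
    grad_L = L + lam * E @ R.T         # gradient of the objective w.r.t. L
    grad_R = R + lam * L.T @ E         # gradient of the objective w.r.t. R
    L -= tau * grad_L
    R -= tau * grad_R

print(np.linalg.norm(I_obs * (L @ R - M)) / np.linalg.norm(I_obs * M))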

Properties

Advantages:
(1) The optimization problem is non-convex, but smooth and quadratic
⟹ conjugate gradient, Newton, etc.
(2) Big data optimization: As the objective is differentiable, stochastic
gradient descent techniques can be applied ⟹ large-scale optimization and
recommender systems.
(3) Monotonicity property: There exists a class of factorization algorithms,
called non-negative matrix factorization (NMF), that is monotonic
(discussed later):  E^{k+1} ≤ E^k  ∀k.

Limitations:
(1) Non-convex ⟹ local minimizers (a good initialization is essential).
(2) Extra parameter: The rank value r needs to be fixed a priori.

Xavier Bresson

32

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

33

Recommender Systems: Content Filtering

Collaborative recommendation focuses on a low-rank approximation of the ratings.
Content recommendation looks for similarities between users and between products.

Companies using this class of recommendation: Amazon, IMDB.

Formulation: Given (1) a few ratings M_ij of product j by user i, and
(2) a set of user features and product features,
find a matrix X that best fits the ratings and satisfies the similarities between
users and products.

User features and product features:
(1) User features/attributes: genre, age, job, hobbies, etc.
(2) Product features/attributes: field, price, age, new, etc.

Xavier Bresson

34

How to Encode Similarity Information?!


!! Similarities between users/products are encoded by graphs, as graphs
naturally store the proximities between any pair of data!
Rows/users graph:!

Gr = (Vr , Er , Wr )

Network of users'

Cols/products graph:!

Gc = (Vc , Ec , Wc )

Network of products'

!"#$%&'(&%))*+'

.0'

Content Recommendation !
[Huang-Chung-Ong-Chen02]!

Cols/products graph:'
Gc = (Vc , Ec , Wc )

Rows/users graph:'
Gr = (Vr , Er , Wr )

Recommendation !
=!
Matrix completion'

M'

X'

'How to fill M using the graphs of users and products?!


!"#$%&'(&%))*+'

.1'

Formalization

Simple idea: Diffuse the ratings on the networks of users and products.

Optimization formulation:

    min_X ‖X‖_{diff,G_rows} + ‖X‖_{diff,G_cols} + λ ‖I_obs ∘ (X − M)‖_F²

What is the diffusion objective ‖X‖_{diff,G} on graphs?

Xavier Bresson

37

How to Design the Diffusion Term?

Observation: When user i is close to user i' on
the graph G_r = G_users, it means that the two users
i, i' are similar, and so there is a high chance
they will rate all movies almost the same way.
Hence rows i and i' of X are expected to be close!

⟹ The diffusion term should be designed to force the ratings to be
smooth on the graphs. There is a popular choice of graph smoothness,
the Dirichlet objective:

    ‖X‖_{diff,G} = ‖X‖_Dir = tr(Xᵀ L X),   where L is the graph Laplacian.

Xavier Bresson

38

Content Recommendation

Optimization problem:

    min_X ‖X‖_{diff,G_rows} + ‖X‖_{diff,G_cols} + λ ‖I_obs ∘ (X − M)‖_F²

    ⟹   min_X tr(Xᵀ L_r X) + tr(X L_c Xᵀ) + λ ‖I_obs ∘ (X − M)‖_F²

⟹ The problem is smooth and quadratic:
(1) It reduces to a linear system Ax = b, schematically

    (I_m ⊗ L_r + L_c ⊗ I_n + λ I_nm) vec(X) = λ vec(M)

    ⟹ conjugate gradient technique.
(2) Stochastic gradient descent for large-scale datasets.

Xavier Bresson

39

Demo: Content Filtering!


!! Run lecture06_code03.ipynb!

!"#$%&'(&%))*+'

/5'

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

41

Hybrid Systems!

Q: Can we combine collaborative and content recommendation? !


!! Combine collaborative and content techniques for enhanced recommender
systems: [Ma-et.al11, Bresson-Vandergheynst14]!
!! Formulation: Given a few ratings Mij of product j and user i, user features
and product features, design a recommender systems that best benefits from:!
(1) Collaborative filtering by constraining X to be low-rank,!
(2) Content filtering by enforcing graph regularizing on X.!

!"#$%&'(&%))*+'

/-'

Formalization

Optimization:

    min_X ‖X‖_* + λ_r tr(Xᵀ L_r X) + λ_c tr(X L_c Xᵀ) + λ ‖I_obs ∘ (X − M)‖_F²

Xavier Bresson

43

Demo: Hybrid Filtering!


!! Run lecture06_code04.ipynb!

!"#$%&'(&%))*+'

//'

State-of-the-Art

Limitation: Graph Dirichlet regularization/smoothness forces:
(1) Two rows/columns of X to be similar if they are close on the graphs.
(2) X to be smooth on the graphs ⟹ this assumes the ratings vary
smoothly on the graphs. This assumption is actually limited; a
better one is to force X to be piecewise constant on the graphs.

Graph Dirichlet regularization/smoothness  vs.  Graph TV regularization/smoothness.

The state-of-the-art technique is available on GitHub.

Xavier Bresson

45

Outline!
Recommender Systems!
Google PageRank!
Collaborative Recommendation!
Convex Optimization!
Content Recommendation!
Hybrid Systems!
Conclusion!

Xavier Bresson

46

Fundamental Property of Recommender Systems

[Figure: prediction error (the lower the better) vs. the number of available observations/ratings.
With a small number of ratings, the hybrid system behaves like content filtering (graph regularization);
with a large number of ratings, it behaves like collaborative filtering (low-rank approximation),
and the hybrid recommender system dominates both.]

Conclusion:
(1) If there are not enough ratings, then focus on collecting data features.
(2) When there are enough ratings, then give less importance to features.

Xavier Bresson

47

Summary!
PageRank for data ranking according to pairwise relationships.!

Recommender systems are based on:!


(1) Collaborative filtering!
(2) Content filtering!
(3) Hybrid formulation !
Optimization:!
(1) For small-scale recommendation (n<10K), use convex techniques. !
(2) For large-scale recommendation, prefer non-convex techniques. !

Xavier Bresson

48

Questions?

Xavier Bresson

49

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 7: Feature Extraction!
Data Representation!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

The Feature Extraction Problem!

Q: What are features? !

!! Goal: Find the best possible representation of data that reveal special
structures useful for further applications (like classification, recognition, etc).!

Apply!
Filters!

Meaningful!
Features!

Raw data!

!! There are two types of filters for feature extraction:!


(1) Handcrafted features from domain expertise. !
(2) Learned features from data linear or non-linear representations. This
approach has become dominant. !
!"#$%&'(&%))*+'

.'

Handcrafted Features!
!! Domain expertise: Handcrafted features are domain-dependent, i.e. designed
from experts in specific fields with years of experience (usually not
generalizable to other fields).!
!! Popular example: SIFT - Best image features in Computer Vision, used for
many applications such as image recognition. It needed 30 years of experience
(1966-1999) to design good image features! !

Image!

SIFT!
Filters!

SIFT!
Features!

Features for!
image/object !
recognition!

!"#$%&'(&%))*+'

/'

Learned Features From Data!


Paradigm shift: Handcrafted filters/features have been successful for
decades but the emergence of big data has made available enough data to
learn the best features from data directly, without handcrafting anything. !
Besides, deep learning has shown us how to extract highly meaningful
features from data.

Learned filters: There exist two classes of learned filters depending on


the mathematical data representations:!
(1) Linear representation!
(2) Non-linear representation!

Xavier Bresson

Linear Representation

Formulation:

    z = D x

    x: high-dimensional data
    D: dictionary of filters (or basis functions)
    z: features, or coefficients of x in the dictionary D

    z = D x = [ ⟨D_{1,·}, x⟩ ; ... ; ⟨D_{K,·}, x⟩ ],   i.e.  z_i = ⟨D_{i,·}, x⟩   (i-th coefficient, i-th filter)

Xavier Bresson

6

Linear Representation

How to learn D and z?
⟹ Techniques available: PCA, ICA, NMF, Sparse Coding, etc.

Which technique to choose?
Each technique makes different assumptions on the data. Pick the one that
matches your data properties (discussed later).

Matrix factorization: The linear representation of data can also be seen as a
matrix factorization problem:

    X = Z D,    X: n×d data matrix,  Z: n×K coefficients,  D: K×d dictionary.

Xavier Bresson

7

Non-Linear Representation

Non-linear mapping φ:

    Linear representation:       x → z = D x
    Non-linear representation:   x → z = φ(x)    (z ≠ D x)

These techniques are called non-linear dimensionality reduction
techniques, and they are used for feature extraction, classification,
visualization, etc.

Examples:
(1) Non-linear PCA, Locally Linear Embedding (LLE), Laplacian
Eigenmaps, t-Distributed Stochastic Neighbor Embedding (t-SNE) (Lecture 8).
(2) Deep Learning (Lectures 9-12).

Xavier Bresson

Linear vs. Non-Linear Representations!


!! Which representation to use? The answer depends on the type of data
distributions:!
If data follow a Gaussian Mixture Model (GMM) like human faces, it is
then enough to use linear data representation (and useless to use non-linear
techniques).!
However, if data follow complex distributions like text documents then they
require non-linear techniques. !
Q: What is the shape of Gaussian Model? !

!"#$%&'(&%))*+'

4'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

10

Principal Component Analysis!


Q: Who does *not* know PCA? !
History: PCA introduced by Pearson in 1901 in physics.!
PCA is the most popular technique for linear representation (also
one of the top 10 techniques in data science). !
Formulation: Given a set of data, PCA aims at projecting data onto an
orthogonal basis that best captures the data variance. !
Consequence: The first basis function or principal direction v1 captures the
largest possible variance of data.!
The second basis function or principal direction v₂ captures the largest possible
variance of the data while satisfying the orthogonality constraint ⟨v₁, v₂⟩ = 0, etc.

Xavier Bresson

11

Formalization!
PCA defines an orthogonal transformation that maps the data to a new
coordinate system (v1,v2,,vK) called principal directions such that the vks
capture the largest possible data variances. !

Notes: !
(1) PCA requires the data to be centered.!
(2) PCA does not say anything about data normalization, but its
analysis may change (PCA is not invariant w.r.t. data normalization).!

Xavier Bresson

12

Covariance Matrix

Definition: The covariance matrix C is defined as

    C = Xᵀ X     (d×d),

where X is the n×d data matrix, n = number of data, d = number of dimensions.

The covariance matrix C encodes all data variances along each dimension:

    C_αα = ⟨X_{·,α}, X_{·,α}⟩ = ‖X_{·,α}‖₂² = Σ_{i=1}^n x_{i,α}²     (variance of the data along the α-th dimension)
    C_αβ = ⟨X_{·,α}, X_{·,β}⟩ = Σ_{i=1}^n x_{i,α} x_{i,β}            (covariance of the data along the α-th and β-th dimensions)

Note: the x_i are zero-centered along each dimension α:

    E({x_i}) = Σ_{i=1}^n x_{i,α} = 0   ∀α.

Xavier Bresson

13

PCA = EVD of Covariance Matrix

Claim:  C v₁ = λ₁ v₁,  where v₁ is the first principal direction (PD) and λ₁ is the
largest variance of the data along v₁.

Proof sketch:

    v₁ = 1st principal direction
       = direction of largest data variance
       = argmax_{‖v‖₂=1} Σ_{i=1}^n |⟨x_i, v⟩|²          (squared length of the data projected on v)
       = argmax_{‖v‖₂=1} ‖Xv‖₂² = argmax_{‖v‖₂=1} (Xv)ᵀ(Xv) = argmax_{‖v‖₂=1} vᵀ XᵀX v = argmax_{‖v‖₂=1} vᵀ C v

Spectral solution: v₁ is the largest eigenvector of C:

    C v₁ = λ₁ v₁   ⟹   v₁ᵀ C v₁ = λ₁ ‖v₁‖₂² = λ₁ = max_{‖v‖₂=1} Σ_{i=1}^n |⟨x_i, v⟩|².

Xavier Bresson

14

PCA = EVD of Covariance Matrix

    v₂ = 2nd principal direction
       = direction of second largest data variance
       = argmax_{‖v‖₂=1} vᵀ C v   s.t.  ⟨v, v₁⟩ = 0
    ⟹  C v₂ = λ₂ v₂,   with λ₂ ≤ λ₁.

    v₃ = 3rd principal direction, etc.

Matrix factorization: full EVD of C:

    C = V Λ Vᵀ,   with V = [v₁, ..., v_d],  VᵀV = I_d,  Λ = diag(λ₁, ..., λ_d).

Xavier Bresson

15

Principal Components!
Definition: PCs are the coordinates of original data projected into the basis
of principal directions (PDs): !

Xpca = XV

Xavier Bresson

16

PCA = SVD of Data Matrix

Matrix factorization:

    X = U Σ Vᵀ,   with U Uᵀ = I_n,  V Vᵀ = I_d
    (X: n×d,  U: n×n,  Σ: n×d,  V: d×d)

Relationship between EVD and SVD:

    C = XᵀX = (UΣVᵀ)ᵀ(UΣVᵀ) = V Σᵀ(UᵀU)Σ Vᵀ = V_svd Σ² V_svdᵀ = V_evd Λ V_evdᵀ

    ⟹  V_svd = V_evd,   λ_k = σ_k²,   X_pca = X V_evd = U_svd Σ.

Q: PCA with EVD or SVD? It depends on the size of the data matrix X:
(1) For d > n: use SVD.
(2) For d < n: use EVD.

Xavier Bresson

17
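A minimal sketch of both routes on assumed toy data (not the lecture notebook): PCA via the SVD of the centered data matrix, and the equivalent EVD of the covariance matrix.

import numpy as np

X = np.random.randn(500, 20)
X = X - X.mean(axis=0)                    # PCA requires centered data

# SVD route: X = U S V^T
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = U * S                             # principal components (= X @ Vt.T)

# EVD route: C = X^T X = V Lambda V^T
C = X.T @ X
lam, V = np.linalg.eigh(C)                # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]            # sort by decreasing variance

print(np.allclose(lam, S**2))             # lambda_k = sigma_k^2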

PCA as Dimensionality Reduction

Essential observation: Most (linear) data are concentrated along the
first principal directions:

    x_i = ⟨x_i, v₁⟩ v₁ + ⟨x_i, v₂⟩ v₂ + ...  ≈  ⟨x_i, v₁⟩ v₁ + ... + ⟨x_i, v_K⟩ v_K

The first principal directions are enough to provide a good data representation, i.e.

    ‖X − X_K‖₂² is small for a small K,

where the approximation of X with the first K PDs is

    X_K = U_K Σ_K V_Kᵀ     (Σ truncated to the first K data variances, V truncated to the first K PDs).

Xavier Bresson

18

How to select K?

Rule: The data variance is captured by each principal direction; if one wants
to retain 90% of the total variance then K must be selected such that:

    ( Σ_{k=1}^K λ_k ) / ( Σ_{k=1}^d λ_k )  ≥  0.9        (90% of the total variance)

[Figure: spectrum of the YaleB Faces dataset; the first eigenvalues capture the structure,
the tail corresponds to noise.]

Xavier Bresson

19

PCA as Visualization Tool!


Use first principal components for visualization: If high-dimensional
data are linear, follow a GMM distribution then they can be visualized in
2D, 3D spaces.!

Xavier Bresson

20

Demo: Standard/Linear PCA!


!! Run lecture07_code01.ipynb!

!"#$%&'(&%))*+'

-,'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

22

Sparse PCA

Q: Is PCA interpretable?
Motivation: Standard PCA is able to
(1) Capture most of the variability information contained in the data.
(2) Identify uncorrelated information (because the principal directions are
orthogonal).

However, PCA is limited in feature interpretation: it is hard to identify
the most relevant features for each principal direction.

Example: Analysis of genes with standard PCA gives: Gene1, Gene2, Gene3, ...
Q: Which genes are the most meaningful?

⟹ Solution: Sparse PCA.

Xavier Bresson

23

Sparse PCA Techniques!


Mainly, three classes exist:!
(1) Lasso PCA!
(2) Elastic PCA!
(3) Power PCA (based on power method) !
Lasso PCA: !
Advantage: Great feature selection technique as it finds sparse, accurate
and robust solutions.!
Limitation: The number of features that can be selected by Lasso is at
most n, the number of data. It may be an issue in some applications like in
bioinformatics: !
n = # microarray data = 100!
d = # gene predictors = 10,000!
Lasso PCA can select at most 100 genes. Solution: Elastic PCA. !

Xavier Bresson

24

Elastic PCA

Elastic PCA solves an elastic net regression problem:

    min_{A,B} ‖X − X B Aᵀ‖_F² + λ₂ ‖B‖_F² + λ₁ ‖B‖₁    s.t.   AᵀA = I_K

    (data fidelity term  +  L2 term  +  L1 term forcing a sparse solution)

Sparse principal directions:    sPD_j = V_{·,j} = B*_{·,j} / ‖B*_{·,j}‖₂
Sparse principal components:    X_spca = X V

Xavier Bresson

25

Algorithm

The optimization problem is non-smooth but convex in each variable separately:

    Initialization:   A^{m=0} = V_K^{pca}   (standard PCA)

    Iterate until convergence:

    Step 1:   B^{m+1} = argmin_B ‖X − X B (A^m)ᵀ‖_F² + λ₂‖B‖_F² + λ₁‖B‖₁            (1)
    Step 2:   A^{m+1} = argmin_A ‖X − X B^{m+1} Aᵀ‖_F²   s.t.  AᵀA = I_K             (2)

Problem (1) can be solved efficiently by FISTA.
Problem (2) can be solved by SVD.

Xavier Bresson

26

Demo: Sparse PCA!


!! Run lecture07_code02.ipynb!

!"#$%&'(&%))*+'

-2'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

28

Robust PCA!
Q: Is PCA robust to outliers? !

!! Motivation: Standard/sparse PCA are


sensitive to outliers, i.e. a single outlier may
change significantly the true PCA solution. !

!! Solution: Robust PCA is a technique that


separates the outliers from the clean data
where PCA is performed. !

!"#$%&'(&%))*+'

-4'

Formalization

Standard PCA:

    min_L ‖X − L‖_F²   s.t.   rank(L) = K

Robust PCA:

    min_{L,S} rank(L) + λ card(S)   s.t.   X = L + S        (1)

    L: low-rank matrix (structure, standard PCA part)
    S: sparse matrix capturing the outliers (no structure)

(1) is an NP-hard combinatorial problem ⟹ a continuous relaxation is needed:

    min_{L,S} ‖L‖_* + λ ‖S‖₁   s.t.   X = L + S              (2)   (convex relaxation)

Strong result: The solution of (2) is almost the solution of (1)!

Xavier Bresson

30

Algorithm

ADMM technique: fast, robust and accurate solutions.

    Initialization:   L^{m=0} = X,   S^{m=0} = Z^{m=0} = 0

    Iterate until convergence:

        L^{m+1} = U h_{1/r}(Σ) Vᵀ,   with U Σ Vᵀ = SVD(X − S^m + Z^m/r)
        S^{m+1} = h_{λ/r}( X − L^{m+1} + Z^m/r )
        Z^{m+1} = Z^m + r ( X − L^{m+1} − S^{m+1} )

where h_μ(x) is the soft-thresholding operator.

Xavier Bresson

31
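A simplified sketch of these iterations (assumed toy data and default parameters, not the lecture notebook): singular-value soft-thresholding for the low-rank part L, entrywise soft-thresholding for the sparse outlier part S.

import numpy as np

def soft(x, t):                                    # soft-thresholding h_t(x)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def robust_pca(X, lam=None, r=1.0, n_iter=100):
    if lam is None:
        lam = 1.0 / np.sqrt(max(X.shape))          # a common default choice
    L = X.copy(); S = np.zeros_like(X); Z = np.zeros_like(X)
    for _ in range(n_iter):
        U, sig, Vt = np.linalg.svd(X - S + Z / r, full_matrices=False)
        L = (U * soft(sig, 1.0 / r)) @ Vt          # singular-value thresholding
        S = soft(X - L + Z / r, lam / r)           # entrywise thresholding
        Z = Z + r * (X - L - S)                    # dual update
    return L, S

X = np.random.randn(100, 5) @ np.random.randn(5, 80)    # low-rank part
X[np.random.rand(*X.shape) < 0.05] += 10.0               # sparse outliers
L, S = robust_pca(X)
print(np.linalg.svd(L, compute_uv=False)[:8].round(1))   # singular-value decay of L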

Demo: Robust PCA!


!! Run lecture07_code03.ipynb!

!"#$%&'(&%))*+'

.-'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

33

PCA on Graphs

Q: Can we do PCA on networks like Facebook?
Motivation: When data similarities are available or can be computed,
they enhance PCA.

Formalization:

    min_{L,S} rank(L) + λ card(S) + γ_G ‖L‖_{G,smooth}   s.t.   X = L + S

    (the graph term forces smoothness on graphs)

    ⟹   min_{L,S} ‖L‖_* + λ ‖S‖₁ + γ_G ‖L‖_{G,Dir}   s.t.   X = L + S     (continuous convex relaxation)

Xavier Bresson

34

Demo Video Surveillance!

!! Separate background from moving objects: State-of-the-art [ICCV15]!

!"#$%&'(&%))*+'

.0'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

36

Non-Negative Matrix Factorization!


Q: Is PCA the best linear data representation? !
!! Motivation: PCA vs. NMF!
PCA learns the main variations of data.!
 Data representation is based on main directions of data variations.!
NMF learns the most common parts of data.!
 Data representation is based on main common data parts.!

[Lee-Seung99]'

PCA'

!"#$%&'(&%))*+'

NMF'

.2'

Matrix Factorization

PCA and NMF are both factorized models:

    PCA:  X = U Σ Vᵀ   (SVD)
    NMF:  X = L R      with  L, R ≥ 0     (the essential constraint to identify data parts)

    X: n×m,  L: n×r,  R: r×m.

Dimensionality reduction technique: L and R are small low-rank matrices
(compared to X). They can be interpreted as compressed features!

    Ex: X (users × movies)  ≈  L (users × 10 compressed user features) · R (10 compressed movie features × movies).

Xavier Bresson

38

Linear Representation

Text document representation: X is a matrix of n = 20,000 text documents
over m = 40,000 words.

    x_i = L r_i   ⟹  each document is represented by a linear
    combination of compressed word features.

    Same for a word:   x_j = Rᵀ ℓ_j.

Xavier Bresson

39

How to Compute L,R?

A factorization of the form

    X = L R   with  L, R ≥ 0

can be computed by optimizing one of these loss functions:

(1) Least-squares loss:

    min_{L,R ≥ 0} ‖X − LR‖_F²

(2) Kullback-Leibler (relative entropy) loss (a histogram distance):

    min_{L,R ≥ 0} KL(X, LR) = Σ_{ij} X_ij log( X_ij / (LR)_ij )

Xavier Bresson

40

Algorithms

Several techniques exist (a sketch of the classical multiplicative updates is given below):
(1) Multiplicative update techniques
    Advantage: Monotonic.
    Limitation: Slow to converge.
(2) ADMM, Primal-Dual techniques
    Advantage: Fast.
    Limitation: No theoretical guarantee.
(3) Power Methods (most recent)
    Advantage: Fast.
    Limitation: No theoretical guarantee.

Xavier Bresson

41
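A minimal sketch (assumed toy data, not the lecture notebook) of the Lee-Seung multiplicative updates for the least-squares NMF loss ‖X − LR‖_F²; each update is monotonic.

import numpy as np

def nmf(X, r=10, n_iter=200, eps=1e-10):
    n, m = X.shape
    L = np.abs(np.random.randn(n, r))
    R = np.abs(np.random.randn(r, m))
    for _ in range(n_iter):
        R *= (L.T @ X) / (L.T @ L @ R + eps)    # update R with L fixed
        L *= (X @ R.T) / (L @ R @ R.T + eps)    # update L with R fixed
    return L, R

X = np.abs(np.random.randn(100, 80))            # NMF requires non-negative data
L, R = nmf(X, r=10)
print(np.linalg.norm(X - L @ R) / np.linalg.norm(X))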

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

42

Sparse Coding

Q: Can we do better than PCA and NMF?

Motivation: PCA and NMF make strong assumptions about the dictionary
D used for the linear representation z = Dx:
    PCA: D captures the main directions of data variation.
    NMF: D captures the main common parts of the data.

Q: How to relax these assumptions to learn more generic filters?

Sparse coding (recently a.k.a. dictionary learning):
New data assumption: Represent the data as a sparse linear
combination of a few filters.

Note: This is the best linear representation and feature extraction
technique, and for any kind of data! A class of deep learning
techniques uses sparse coding as its core feature extraction ⟹ deconvolutional
neural networks.

Xavier Bresson

43

Formalization

Optimization problem:

    min_{D, z_j} Σ_{j=1}^n ‖x_j − D z_j‖₂² + λ ‖z_j‖₁    s.t.   ‖D_{i,·}‖₂ ≤ 1  ∀i

The l1 term forces sparsity; the constraint controls the filter energies
En_i = ‖D_{i,·}‖₂ ∈ {0, 1}, so the algorithm can learn the best number of filters
(unused filters end up with D_{i,·} = 0).

Xavier Bresson

44

Algorithm

Non-smooth and convex (in each variable) optimization:

    min_{Z,D} Σ_{j=1}^n ‖x_j − D z_j‖₂² + λ‖z_j‖₁   s.t.  ‖D_{i,·}‖₂ ≤ 1 ∀i
    =  min_{Z,D} ‖X − D Z‖_F² + λ‖Z‖₁               s.t.  ‖D_{i,·}‖₂ ≤ 1 ∀i

    Initialization:   D^{m=0} = randn

    Iterate until convergence:

        Z^{m+1} = argmin_Z ‖X − D^m Z‖_F² + λ‖Z‖₁
        D^{m+1} = argmin_D ‖X − D Z^{m+1}‖_F²   s.t.  ‖D_{i,·}‖₂ ≤ 1 ∀i

⟹ Each sub-optimization problem can be solved efficiently by FISTA.

Xavier Bresson

45

Demo: Sparse Coding!


!! Run lecture07_code04.ipynb!

Learned Dictionary= !
Human visual filters (V1 cells) !
in the primary visual cortex!
!"#$%&'(&%))*+'

/1'

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

47

Summary!
Feature Extraction Problem: !
(1) Handcrafted filters/features: less popular.!
(2) Learned filters/features: more and more common.!

Learned filters = Data representation problem:!


(1) Linear Representations!
(2) Non-linear representations (next lecture, and deep learning lectures) !
Linear Representations:!
(1) PCA: based on data variances.!
(2) NMF: based on positive common parts of data!
(3) Sparse Coding: based on sparse representation (highly adaptable !
technique)!

Xavier Bresson

48

Outline!
The Feature Extraction Problem!
Standard Principal Component Analysis (PCA)!
Sparse PCA!
Robust PCA!
PCA on Networks!
Non-Negative Matrix Factorization (NMF)!
Sparse Coding (SC)!
Conclusion!

Xavier Bresson

49

Questions?

Xavier Bresson

50

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 8: Data Visualization!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

Visualization Problem

Data visualization is the same problem as
(1) Data representation
(2) Feature extraction

Data representation looks for the best filters or dictionary D in which
the data x can be represented; the projected data z on D are used
as coordinates for 2D or 3D visualization:

    x_i ∈ R^d, d ≫ 1   →   z_i ∈ R²  (2D visualization)   or   z_i ∈ R³  (3D visualization)

Xavier Bresson

3

Visualization Techniques

Visualization techniques are also dimensionality reduction techniques
because they aim at mapping data into a much lower-dimensional space,
i.e. 2D or 3D Euclidean spaces.

Linear dimensionality reduction (LDR) techniques.
Assumption: Data can be represented on a low-dimensional hyperplane:

    x_i ∈ R^d, d ≫ 1   →   z_i = A x_i ∈ R^m,   hyperplane R^m with m ≪ d,

i.e. LDR finds a linear mapping A such that A : x_i → z_i.

Xavier Bresson

4

Non-Linear Dimensionality Reduction

Assumption: Data can be represented on low-dimensional curved
spaces, i.e. manifolds:

    Manifold M ⊂ R^d,  dim(M) = m ≪ d

    x_i ∈ R^d   →   z_i = φ(x_i) ∈ R^m,

i.e. NLDR finds a non-linear mapping φ such that φ : x_i → z_i.

Non-linear dimensionality reduction (NLDR) techniques are also called
manifold learning techniques.

Xavier Bresson

5

Demo: LDR vs. NLDR!


!! Run lecture08_code01.ipynb!

!"#$%&'(&%))*+'

1'

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

Kernel PCA

[Scholkopf-Smola-Muller 97]

Standard PCA:

    Gram matrix:  G = X Xᵀ = [ ⟨x_i, x_j⟩ ]_{ij}  =  U D Uᵀ  (EVD)   ⟹   X_pca = U D^{1/2}

Kernel PCA: Gram matrix in a higher-dimensional space:

    G = [ ⟨φ(x_i), φ(x_j)⟩ ]_{ij}  =  U D Uᵀ  (EVD)   ⟹   X_kpca = U D^{1/2}

Apply the kernel trick to the Gram matrix:

    K(x,y) = ⟨φ(x), φ(y)⟩ = e^{−‖x−y‖₂²/σ²}        (φ is never computed!)

Xavier Bresson

8

Demo: Kernel PCA!


!! Run lecture08_code02.ipynb!

!"#$%&'(&%))*+'

4'

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

10

Locally-Linear Embedding (LLE)

[Roweis, Saul 00]

Motivation: Design a mapping from the high-dimensional space to the
low-dimensional space such that the geometric distances between
neighboring data are preserved:

    x_i, x_j neighbors in R^d (d ≫ 1)   →   z_i, z_j stay neighbors in, e.g., R³.

Description: LLE aims at computing a manifold M by locally linear fits,
that is, each data point on M and its neighbors lie on a locally linear patch of M.

Xavier Bresson

11

Algorithm

Step 1: For each data point x_i, compute the k nearest neighbors.

Step 2: Compute linear patches: find the weights W_ij which best linearly
reconstruct x_i from its neighbors:

    min_W Σ_{i=1}^n ‖ x_i − Σ_j W_ij x_j ‖₂²    s.t.   Σ_j W_ij = 1  ∀i
    ⟹ Solution: a linear system Ax = b.

Step 3: Compute the low-dimensional embedding data z_i, which are best
reconstructed by the weights W_ij:

    min_{Z=[z_1,...,z_n]} Σ_{i=1}^n ‖ z_i − Σ_j W_ij z_j ‖₂²    s.t.   Σ_i z_i = 0,  ZᵀZ = I_m
    ⟹ Solution: EVD.

Xavier Bresson

12

Demo: LLE!
!! Run lecture08_code03.ipynb!

!"#$%&'(&%))*+'

,.'

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

14

Laplacian Eigenmaps

[Belkin, Niyogi03]!

Very popular visualization technique.!


Motivation: Same as LLE but stronger mathematical analysis and
understanding. !
Manifold assumption: Data are sampled from a manifold M
represented by a k - nearest neighbors graph.!

Xavier Bresson

15

Differential Geometry

The eigenfunctions v_k of the continuous Laplace-Beltrami operator Δ_M serve as
embedding coordinates of M.

Note: The discretization of Δ_M is the graph Laplacian L.

Xavier Bresson

16

Formalization

1D Visualization: Map a graph G=(V,E,W) to a line such that neighboring
data on G stay as close as possible on the line.

Note: We look for the mapping φ but we never compute it explicitly.
We look for the coordinates y_i of x_i on the low-dimensional manifold M
such that y_i = φ(x_i), that is:

    min_y Σ_{ij} W_ij (y_i − y_j)²        (1)

Interpretation: As W_ij = 1 if x_i is close to x_j, minimizing Σ_{ij} W_ij (y_i − y_j)²
forces y_i to be close to y_j.

Xavier Bresson

17

Generalization to 2D and 3D

K-D Visualization: Generalizing (1) to K dimensions is straightforward:

    min_Y Σ_k Y_{·,k}ᵀ L Y_{·,k} = tr(Yᵀ L Y)    s.t.   YᵀY = I_K

    (L is the graph Laplacian)

Spectral solution: the first K eigenvectors of the graph Laplacian L:

    L = U Λ Uᵀ  (EVD)   ⟹   Y = U_K

Advantages:
(1) Global solutions (independent of the initialization).
(2) Fast algorithms.

Xavier Bresson

18
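A minimal sketch (assumed toy data, dense eigensolver for simplicity, not the lecture notebook): Laplacian eigenmaps with a k-NN graph; the 2D embedding is given by the first non-trivial eigenvectors of L.

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

X = np.random.randn(500, 10)                            # high-dimensional toy data
W = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
W = 0.5 * (W + W.T)
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W     # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L.toarray())                # eigenvalues in ascending order
Y = vecs[:, 1:3]                                        # skip the trivial constant eigenvector
print(Y.shape)                                          # 2D embedding coordinates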

Demo: Laplacian Eigenmaps

Run lecture08_code04.ipynb

[Figures: MNIST with PCA, MNIST with Laplacian Eigenmaps, USPS with Laplacian Eigenmaps.]

Xavier Bresson

19

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

20

T-SNE

[van der Maaten, Hinton 08]

Q: Which visualization technique is the most used?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is the most popular
visualization technique, among the top 10 algorithms in data science.

Model description: t-SNE learns the mapping/embedding function φ, such
that y_i = φ(x_i), by minimizing the Kullback-Leibler distance between the
distribution of the high-dimensional data and the distribution of the computed
low-dimensional data:

    p_ij = e^{−‖x_i−x_j‖₂²/σ_i²} / Σ_k e^{−‖x_i−x_k‖₂²/σ_i²},      σ_i = k-th nearest neighbor distance from x_i

    q_ij(y) = (1 + ‖y_i−y_j‖₂²)⁻¹ / Σ_k (1 + ‖y_i−y_k‖₂²)⁻¹,       y = embedding coordinates of the high-dim data

Xavier Bresson

21

Optimizing Kullback-Leibler

Problem:

    min_y KL(P, Q(y)) = Σ_{ij} p_ij log( p_ij / q_ij(y) )

Gradient descent technique:

    y_i^{m+1} = y_i^m − τ Σ_j (p_ij − q_ij)(1 + ‖y_i − y_j‖²)⁻¹ (y_i^m − y_j)

Advantages:
(1) Local distance preservation (as LapEig, LLE): minimizing KL
forces q_ij to be close to p_ij, the distribution of the high-dimensional data.
(2) t-SNE does not assume the existence of a manifold ⟹ more
flexibility to visualize more complex hidden structures.

Limitations:
(1) Non-convex energy ⟹ existence of bad local solutions, problem of
initialization (PCA is used as initialization).
(2) Slow optimization (gradient descent).

Xavier Bresson

22

Demo: T-SNE!
!! Run lecture08_code05.ipynb!

!"#$%&'(&%))*+'

-.'

Outline!
Visualization Problem!
Kernel PCA!
Locally-Linear Embedding (LLE)!
Laplacian Eigenmaps!
T-SNE!
Conclusion!

Xavier Bresson

24

Summary

[Diagram: choosing a representation according to the data structure.]

Linear structure:       z_i = A x_i   (linear mapping from high-dim to low-dim data)
    Variability structure  ⟹  PCA (1901), Kernel PCA (1998)
    Sparsity structure     ⟹  Sparse Coding (1997)  (dictionary)

Non-linear structure:   z_i = φ(x_i)  (non-linear mapping/embedding)
    LLE (2000), Laplacian Eigenmaps (2000), t-SNE (2008)

Main property of these techniques: φ preserves local distances, both in the
high-dimensional space and in the low-dimensional space: neighboring x_i, x_j
map to neighboring z_i, z_j.

t-SNE: the most popular, but slow convergence (non-convex optimization, local minimizers).
Laplacian Eigenmaps: popular, mathematically sound, unique solution, but the manifold
assumption may be too strong.

Xavier Bresson

25

Gephi!
Q: What visualization software is the most used? !

!! Awesome visualization tool!!


!! Run lecture08_code06.ipynb to convert graphs to Gephi format.!
!"#$%&'(&%))*+'

-1'

Questions?

Xavier Bresson

27

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 9: Deep Learning 1!
Classification Techniques!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

Note: Some slides are taken from F.F. Li,


A. Karpathy, J. Johnsons course on Deep Learning !
!"#$%&'(&%))*+'

,'

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

Classification Problem

Q: What is the classification problem?
Classification is a core problem in many applications:
(1) Computer Vision: image classification
    Image → Class (original deep learning [Hinton-et.al 12])
(2) Speech: sound recognition
    Sound → Class (original deep learning [Dahl-et.al 12])
(3) Text documents: text categorization
    Text → Class (Wikipedia analysis)
(4) Neuroscience: brain functionality
    Activation pattern → Vision, hearing, body control

Pipeline of classification models:

    Raw data → Feature extraction → Classifier function

Xavier Bresson

3

Image Classification !
!! We will consider the image classification problem in Computer Vision as a
generic classification problem (generalization will be discussed in Lecture 11).!

!! Problem:!

Image !
Classification '

!"#$%&'(&%))*+'

/'

Main Challenge!
!! Bridge the semantic gap between raw data (N-D array of numbers) and
cognitive/human understanding.!

Images are represented as 3D


arrays of numbers, with
integers between [0, 255].'

What humans see [cat]'

!"#$%&'(&%))*+'

Matrices of e.g. size


300x100x3.'

0'

Semantic Information is Invariant !


to Many Deformations!
Deformations in Computer Vision:!
(1) Spatial variations!
(translation, rotation,!
Scaling, shearing)

(2) Illumination!
changes

(5) Background!
clutter

Xavier Bresson

(3) Object !
deformation

(4) Occlusion

(6) Intra-class!
variation

How to Solve the Classification Problem?!


!! Highly challenging problem ( sorting problem):!
In Computer Vision (CV), early works from 1950s (history later), only recently
(2012) algorithms have achieved super-human performances.!
!! Before 2012:!
Many works exist, mostly based on two separate steps:!
(1) Handcrafting best possible filters/features (e.g. SIFT features)!
(2) Linear SVM classification on extracted features!
!! After 2012:!
Deep learning revolution: Learn simultaneously !
(1) Filters/features from raw data (do not handcraft anything)!
(2) Linear SVM classification on extracted features!
" State-of-the-art in CV, speech recognition, etc!

!"#$%&'(&%))*+'

2'

Classification is a Data-Driven Approach!


!! Generic approach:!
(1) Collect a training dataset of images and labels (training set).!
(2) Train an image classifier.!
(3) Evaluate classifier on test images (test set). !

!! Note: Collecting data is easy (big data era) but labeling is time consuming.!
!"#$%&'(&%))*+'

3'

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

Nearest Neighbor Classifier

Q: What is the nearest neighbor classifier?
Naive classifier:

[Figure: training set images; for each test image, the nearest image in the training set.]

Xavier Bresson

10

How to find the Nearest Neighbor?

Distance metrics: L2, L1, cosine, Kullback-Leibler, your favorite...
In this example, the L1 distance:

    d_l1(I1, I2) = Σ_{ij} | I1_ij − I2_ij |

Python code: see the sketch below.

Xavier Bresson

11
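A minimal sketch of a nearest neighbor classifier with the L1 distance (assumed random arrays here, not the original slide's screenshot).

import numpy as np

class NearestNeighbor:
    def train(self, X, y):            # X: n x d training images (flattened), y: n labels
        self.Xtr, self.ytr = X, y     # "training" is just memorizing the data

    def predict(self, X):
        y_pred = np.empty(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            dists = np.sum(np.abs(self.Xtr - X[i]), axis=1)   # L1 distance to all training images
            y_pred[i] = self.ytr[np.argmin(dists)]            # label of the closest one
        return y_pred

Xtr = np.random.rand(1000, 3072); ytr = np.random.randint(0, 10, 1000)
Xte = np.random.rand(5, 3072)
nn = NearestNeighbor(); nn.train(Xtr, ytr)
print(nn.predict(Xte))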

Test Time!
!! Q: What is the test time? And how the classification speed depends on
the size n of training data? A: O(n), linearly ". This is a (major)
limitation. Fast test time is preferred in practice.!
Note: Neural Networks have fast test time, but expensive training time. !
!! Partial solution: Use approximate nearest neighbor techniques, which finds
approximate nearest neighbors quickly. !

!"#$%&'(&%))*+'

,-'

k-Nearest Neighbor Classifier!


!! Limitation: Nearest Neighbor is sensitive to outliers/noise.!
Solution: Find the k nearest images, and pick the label with majority voting
(sort of regularization process).!

Training set!
!"#$%&'(&%))*+'

data!
Test data

k=5!
k-Nearest data!
in training set!

,.'

Illustration!
outlier'

Data!

NN/1-NN classifier!

5NN voting is tied'

5-NN classifier!

Q1: What it the accuracy of 1-NN on training data?! 100% !


Q2: What it the accuracy of 5-NN on training data?! 100% !
Q3: What it the accuracy of 5-NN on test data?!
!"#$%&'(&%))*+'

,/'

Hyperparameters

Q: What is the difference between a parameter and a hyperparameter?
There exist two types of parameters:
(1) Parameters: variables that can be estimated by optimization.
(2) Hyperparameters: variables that cannot be estimated by optimization, but
can be estimated by cross-validation.

Examples of hyperparameters: the distance metric (L2, L1, cosine, Kullback-Leibler?)
and the value of k (k = 1, 2, 5, 10, 15?).

Q: What is cross-validation?
Cross-validation:
Q: Try out which hyperparameters work best on the test set? Bad idea:
the test set is used for the generalization performance! Use it only after training is
done.

Xavier Bresson

15

Cross-Validation

Split the training data into a training set and a validation set:

    Training data (used to learn the classifier)  |  Validation data (used to test hyperparameters)

Cross-validate: cycle through the 5 folds, and record the results:

    [Training data | Validation data | Training data]   (the validation fold rotates)

Xavier Bresson

16

Cross-Validation Result

Example of 5-fold cross-validation for finding the value of k:

[Figure: accuracy vs. k; each point is a single outcome, the line goes through the mean,
and the bars indicate the standard deviation.]

⟹ The value k = 7 works best for this data.

Xavier Bresson

17

Demo: K-Nearest Neighbor !


!! Run lecture09_code01.ipynb!

!"#$%&'(&%))*+'

,3'

k-Nearest Neighbor Performances!


Best accuracy (for k=7) is 29% on validation sets, may be even lower on
test set.!
Conclusion:!
(1) Never use k-NN (at least for image classification).!
(2) Not robust to perturbations (spatial variations, illumination changes, object
deformation, occlusion, background clutter, intra-class variation).!
(3) Bad test time.!

Xavier Bresson

19

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

20

Linear Classifier

Image classification task: array 32x32x3 → class of images (e.g. CAT).

Linear classifier / score function:

    f(x, W, b) = W x + b

    x: input image, vectorized into a 1D array of size 3072x1 (= 32x32x3)
    W: parameters/weights, 10x3072
    b: offset/bias, 10x1
    f: 10 numbers indicating the class scores (the highest is the chosen class), 10x1

Xavier Bresson

21

Interpreting the Linear Classifier

Q: What does a linear classifier do?
A: Template matching technique: it scores the image (data) by matching
it with 10 templates/filters:

    W = [ W_{1,·} ; ... ; W_{10,·} ]  (10x3072),     W x = [ ⟨W_{1,·}, x⟩ ; ... ; ⟨W_{10,·}, x⟩ ]  (10x1)

The highest score decides the class. Un-vectorizing a row W_{i,·} (1x3072) gives a
32x32x3 template image.

[Figure: trained weights W* visualized as one template image per class (discussed later).]

Xavier Bresson

22

Interpreting the Linear Classifier

Q: What does a linear classifier do?
A: Linear mapping: it maps the high-dimensional image (data) in R^3072 to a
low-dimensional linear space R^10.

[Figure: the hyperplane {x : f(x) = W_car x + b_car = 0} bounds the region
{x : f(x) = W_car x + b_car} where the car class has maximum score.]

Xavier Bresson

23

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

24

How to Compute the Weights?

(1) Define a loss function (/objective/energy): a loss function L
quantifies how well the parameters (weights, offset) are chosen to get the
highest possible score across all training data.
Examples of loss functions (Lecture 5): SVM, L2, Logistic, Huber, etc.

(2) Optimization: the process of changing the parameters (weights, offset) to
minimize the loss function in order to get the highest possible score across all
training data.

Xavier Bresson

25

SVM Loss Function

Multiclass SVM loss: Given (1) training data (x_i, y_i) and
(2) the score function s_i = f(x_i, W),
the multiclass SVM loss function is:

    L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

    (the "+1" comes from the margin)

The SVM loss measures how well the weights W are chosen to get the highest
possible score: L_i is 0 when x_i is well classified, that is, when it has the
highest score for its own class y_i, and L_i is large when x_i is misclassified.

[Example of the loss when x_i is well classified.]

Xavier Bresson

26

SVM Loss Function

[Example of the loss when x_i is misclassified.]

Total SVM loss:    L = (1/n) Σ_{i=1}^n L_i

Q: What is the min value of L?    A: 0.
Q: What is the max value of L?    A: +∞.

Xavier Bresson

27
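A minimal sketch of the multiclass SVM loss (toy numbers, assumed for illustration), vectorized with numpy.

import numpy as np

def svm_loss(scores, y):
    """scores: n x C matrix of class scores, y: n true labels."""
    n = scores.shape[0]
    correct = scores[np.arange(n), y][:, None]          # score of the true class, n x 1
    margins = np.maximum(0, scores - correct + 1.0)     # hinge margins
    margins[np.arange(n), y] = 0                        # do not count j = y_i
    return margins.sum() / n

scores = np.array([[3.2, 5.1, -1.7],                    # e.g. class scores for 3 images
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
y = np.array([0, 1, 2])
print(svm_loss(scores, y))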

Loss Functions

Q: Will we get the same classification with this loss function?

    L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)²

A: Probably not.

Reminder (Lecture 5): there are multiple available loss functions:
(1) Hinge/SVM loss
(2) L2 loss
(3) Logistic regression loss (later)
(4) Huber loss

Xavier Bresson

28

Non-Uniqueness of Solutions

Optimization problem:

    min_W (1/n) Σ_{i=1}^n L_i(W)                                       (1)
        = (1/n) Σ_i Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)
        = (1/n) Σ_i Σ_{j≠y_i} max(0, (W x_i)_j − (W x_i)_{y_i} + 1)

This optimization problem is ill-posed:
if W* is a solution of (1), then λW*, λ > 1, is also a solution of (1).

Q: How to fix this issue?  A: Regularization.

Xavier Bresson

29

Regularization

Remember Lecture 5:

    min_W (1/n) Σ_{i=1}^n L_i(W) + λ ‖W‖_F²

This is equivalent to maximizing the margins between training data.

Other mathematical interpretation: the strongly convex term makes the
solution unique!

Regularization terms:
(1) L2 regularization: smooth and differentiable.
(2) L1 regularization: non-smooth and non-differentiable, but promotes
sparsity (a few non-zero elements).
(3) Elastic net regularization: mixture of L1 and L2.
(4) Dropout for neural nets (discussed later).

Xavier Bresson

30

Demo: Linear Classifier and SVM Loss!


!! Run lecture09_code02.ipynb!

!"#$%&'(&%))*+'

.,'

Limitations of Linear Classifier!


Conclusion:!
(1) Never use linear classifier directly on raw data (at least for
image classification).!
(2) Not robust to perturbations (spatial variations, illumination

changes, object deformation, occlusion, background clutter, intra-class


variation).!
(3) Excellent test time.!

Xavier Bresson

32

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

33

Softmax Classifier

Softmax classifier = multinomial logistic regression.

Motivation (from statistics): Maximize the log likelihood of the score
probabilities of the classes:

    s_i = f(x_i, W)      (scores = unnormalized log probabilities of the classes)

    P(Y = y_i | X = x_i) = e^{s_{y_i}} / Σ_j e^{s_j}        (softmax function)

    L_i = − log P(Y = y_i | X = x_i) = − log( e^{s_{y_i}} / Σ_j e^{s_j} )

Xavier Bresson

34
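A minimal sketch of the softmax (cross-entropy) loss on toy numbers (assumed for illustration), with the usual max-subtraction trick for numerical stability.

import numpy as np

def softmax_loss(scores, y):
    """scores: n x C class scores, y: n true labels."""
    n = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)    # stability: does not change the probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), y].mean()               # average of -log P(y_i | x_i)

scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0]])
y = np.array([0, 1])
print(softmax_loss(scores, y))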

Demo: Softmax Classifier!


!! Run lecture09_code03.ipynb!

!"#$%&'(&%))*+'

.0'

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

36

Neural Network Classifier!


!! Image classification: !

Image !
classification !
task'

Class of Images: !
CAT!

Array 32x32x3'

!! Linear classifier: !
vectorize'

f = Wx
10x1!
10x1

3D array !
32x32x3'

!"#$%&'(&%))*+'

x!
1D array
3072x1!

10x3072! 3072x1!

s!
1D array !
10x1!

.2'

Neural Network Classifier

2-layer classifier:

    f = W₂ max(W₁ x, 0)

    x: vectorized image, 3072x1;  W₁: weights, 100x3072;  h = max(W₁x, 0): non-linear
    activation, 100x1;  W₂: weights, 10x100;  f: class scores, 10x1.

3-layer classifier:

    f = W₃ max(W₂ max(W₁ x, 0), 0)

Conclusion: Neural networks (NN) are simply series of linear classifications
and non-linear activations (max).

Xavier Bresson

38
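A minimal sketch of the forward pass of the 2-layer classifier f = W₂ max(W₁x, 0); the weights here are assumed random, not trained.

import numpy as np

d, h, C = 3072, 100, 10               # input dim, hidden width, number of classes
W1 = 0.01 * np.random.randn(h, d)
W2 = 0.01 * np.random.randn(C, h)

x = np.random.rand(d)                 # one vectorized 32x32x3 image
hidden = np.maximum(W1 @ x, 0)        # linear map + non-linear activation max(., 0)
scores = W2 @ hidden                  # 10 class scores
print(scores.argmax())                # predicted class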

Code for 2-Layer NN Classifier

A full implementation of training a 2-layer neural network needs 11 lines.

Xavier Bresson

39

Neural Network Architecture!


Fully connected (FC) layers: Each neuron is connected to all neurons in
the next layer.!

2-layer Neural Net!


or 1-hidden-layer Neural Net

3-layer Neural Net, !


or 2-hidden-layer Neural Net

The need for more structure: FC networks are very generic but also
highly computationally expensive to learn (huge number of parameters).
They cannot be deep! !
However, using special structures of data (like local stationarity in convolutional
neural networks, and recurrence in recurrent neural networks) allow to construct
deep networks that can be learned (later discussed).!
Xavier Bresson

40

Test Time!
!! Once training is done, it is fast to classify new data (simple linear
algebra operations):!

!! This operation is called forward pass (later discussed).!

!"#$%&'(&%))*+'

/,'

Demo: Neural Network Classifier!


!! Run lecture09_code04.ipynb!

!"#$%&'(&%))*+'

/-'

Demo: Linear vs. Neural Network Classifiers!


!! Run lecture09_code05.ipynb!

!"#$%&'(&%))*+'

/.'

Online Demo!
!! ConvNetJS:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html!

more neurons = !
more capacity'

Regularization !
handles outliers'

!"#$%&'(&%))*+'

//'

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

45

Brain Analogy!

Wx + b

!"#$%&'(&%))*+'

/1'

Very Limited Analogy!


!! Biological Neurons:!
- Many di#erent types!
- Dendrites can perform complex
non-linear computations!
- Synapses are not a single weight
but a complex non-linear
dynamical system!
- Rate code may not be adequate!

!"#$%&'(&%))*+'

/2'

Ouline!
The Classification Problem!
Nearest Neighbor Classifier!
Linear Classifier !
Loss Function!
Softmax Classifier!
Neural Network Classifier!
Brain Analogy!
Conclusion !

Xavier Bresson

48

Summary

Image/data classification: Given a training set, design a classifier and
predict labels for the test set.

k-Nearest Neighbor classifier:
Predict labels from the k nearest images in the training set.
(Almost) never used: bad accuracy, and bad test time.

Linear/softmax classifier:
Predict labels with a linear function.
Has been used for a long time (kernel techniques) but overtaken by deep learning.

    Score function:           f = W x + b
    SVM loss function:        L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
    Softmax loss function:    L_i = − log( e^{s_{y_i}} / Σ_j e^{s_j} )

Xavier Bresson

49

Summary!
!! Standard Neural Networks (NNs):!
Neurons arranged as fully connected layers. !
Series of linear functions and non-linear activations.!
Fast test time (matrix multiplications)!
Performances: bigger = better, but expensive training time (thanks GPUs)!
Bigger = (layer) width and depth (deep)!

width!

depth !

Q: How to train Neural Networks? !

!"#$%&'(&%))*+'

05'

Questions?

Xavier Bresson

51

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 10: Deep Learning 2!
Training Neural Networks!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

Note: Some slides are taken from F.F. Li,


A. Karpathy, and J. Johnson's course on Deep Learning !
!"#$%&'(&%))*+'

,'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

Loss Function for Classification!


Classification: Use training data (xi,yi) to design a score function s for
classification: !

s = f (W, x) = W x
Weight W: They are found by minimizing a loss function which quantifies
how well the training data have been classified:!
(1) SVM loss: L_i(W) = \sum_{j \neq i} \max(0, s_j - s_i + 1)

(2) Softmax loss: L_i(W) = -\log\left( e^{s_i} / \sum_j e^{s_j} \right)

(3) Regularized total loss: E(W) = \sum_i L_i(W) + R(W)

Q: How to minimize loss functions?!


A: Steepest gradient descent (follow the slope!)!

Xavier Bresson

Gradient Descent Techniques !


!! Gradient descent: Most standard optimization technique!!
Note: This class of techniques is weak in optimization, but it is the
most generic when the energy landscape is difficult, that is, non-convex.
Training neural networks is (very) slow because of the gradient descent
bottleneck → new research is ongoing to speed up the optimization.'

!"#$%&'(&%))*+'

/'

Gradient Operator!
!! Two types: !
(1) Analytic gradient: \nabla_W E = \partial E / \partial W = explicit formula

(2) Numerical gradient: \nabla_W E \approx \Delta E / \Delta W = \big( E(W + \Delta W) - E(W) \big) / \Delta W

!! Properties of numerical gradient:!


(i) Approximation of analytic gradient.!
(ii) Slow to evaluate (compared to analytic gradient).!
Evaluating the gradient numerically (finite differences):'

Analytic Gradient!
Properties:!
(1) Exact value (use Calculus)!
(2) Fast to evaluate.!

Example: E(W) = \| W \|_F^2  \Rightarrow  \nabla_W E = \partial E / \partial W = 2W

Common practice: Gradient Check!


Always use analytical gradient but check its implementation with
numerical gradient.!
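A minimal sketch of a gradient check, comparing a centered finite-difference gradient with the analytic gradient 2W of E(W) = ||W||_F^2:

```python
import numpy as np

def numerical_gradient(E, W, dw=1e-5):
    """Centered finite-difference approximation of the gradient of E at W."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + dw; E_plus = E(W)
        W[idx] = old - dw; E_minus = E(W)
        W[idx] = old
        grad[idx] = (E_plus - E_minus) / (2 * dw)
        it.iternext()
    return grad

W = np.random.randn(4, 3)
E = lambda W: np.sum(W**2)
rel_err = np.abs(numerical_gradient(E, W) - 2 * W).max() / np.abs(2 * W).max()
print(rel_err)   # should be very small (~1e-8), confirming the analytic gradient 2W
```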

Xavier Bresson

Update Rule!
!! Update: W^{m+1} = W^m - \eta \nabla_W E(W^m), i.e. the change \Delta W = W^{m+1} - W^m = -\eta \nabla_W E(W^m) points in the negative gradient direction.
The scalar \eta is the time step / learning rate / step size; it controls the speed of gradient descent techniques.

!! Code: !
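A minimal runnable sketch of this update rule on a toy loss E(W) = ||W||^2 (illustrative only; the real code applies the same loop to the neural network loss):

```python
import numpy as np

def evaluate_gradient(W):
    return 2 * W                      # analytic gradient of the toy loss ||W||^2

W = np.random.randn(5)
eta = 0.1                             # learning rate / step size
for m in range(100):
    W += -eta * evaluate_gradient(W)  # step in the negative gradient direction
print(np.linalg.norm(W))              # close to 0: the loss decreased at each iteration
```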


Monotonicity!
!! Loss/energy value decreases monotonically at each iteration m:!

Q: What happens with big data? !


With big data, n is a large value (e.g. n = billions) and

E(W) = \sum_{i=1}^{n} L_i(W) + R(W)

The analytic gradient uses all the data at the same time → it is not possible
to load all the data in memory!

!"#$%&'(&%))*+'

3'

Mini-batch/Stochastic Gradient Descent!


!! Special property of loss functions: Additively separable functions, i.e.
functions that are the sum of a single data function Li (independent of all
other data):!
\min_W L(W) = \frac{1}{n} \sum_{i=1}^{n} L_i(W)

Split the training set into q mini-batches of sizes n_1 + n_2 + ... + n_q = n. The loss, and hence the full gradient \nabla L(W), is a weighted average of the per-mini-batch gradients \frac{1}{n_j} \sum_{i \in \text{batch } j} \nabla L_i(W) → only use a small portion of the training set to compute (an estimate of) the gradient!

!! Deterministic gradient descent (all data): W^{m+1} = W^m - \eta \nabla L(W^m)

!! Stochastic gradient descent (one mini-batch j): W^{m+1} = W^m - \eta \frac{1}{n_j} \sum_{i \in \text{batch } j} \nabla L_i(W^m)

Mini-batch/Stochastic Gradient Descent!


More details:!
Iterate over n_e epochs: !
For each epoch, iterate over all mini-batches j = 1, ..., n_q:!

W^{m+1} = W^m - \eta \frac{1}{n_j} \sum_{i \in \text{batch } j} \nabla L_i(W^m)

Note1: An epoch is a complete pass of all training data.!


Note2: At each new epoch, randomly shuffle the training data (improves
results significantly).!
Note3: Stochastic consistency: in expectation, the mini-batch gradient equals the full gradient,

E\left[ \frac{1}{n_j} \sum_{i \in \text{batch } j} \nabla L_i(W^m) \right] = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i(W^m),

so the loss E(W^{m+1}) still decreases on average as m \to \infty.
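A minimal sketch of the mini-batch SGD loop on a toy least-squares problem (the data, loss and hyperparameter values are illustrative):

```python
import numpy as np

np.random.seed(0)
n, d = 1000, 5
X = np.random.randn(n, d)
w_true = np.random.randn(d)
y = X.dot(w_true)                                 # toy regression targets

def grad_loss(W, Xb, yb):
    """Average gradient of the squared loss over one mini-batch."""
    return 2 * Xb.T.dot(Xb.dot(W) - yb) / len(yb)

W, eta, batch_size, n_epochs = np.zeros(d), 0.05, 128, 20
for epoch in range(n_epochs):
    perm = np.random.permutation(n)               # reshuffle the training data at each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]      # indices of mini-batch j
        W -= eta * grad_loss(W, X[idx], y[idx])   # SGD update with the mini-batch gradient
print(np.abs(W - w_true).max())                   # close to 0
```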

Stochastic Monotonicity!
!! Code:!

!! Loss vs epochs (iteration m):!

Note1: Mini-batch size: 32, 64, 128, limited by GPU memory.!


Note2: Several works to speed up optimization: Momentum, Adagrad, Adam,
etc (later discussed), but still major bottleneck of NN training.!
Note3: Stochastic gradient descent technique is not only used for large-scale neural
networks, but also for most big data problems: k-means clustering, SVM
classification, Lasso regression, recommendation, etc.!
!"#$%&'(&%))*+'

,,'

Influence of Learning Rate η !


!! It is a challenging problem to find the optimal step size η!

(Loss curves: large η, small η, optimal η)

!"#$%&'(&%))*+'

,-'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

13

Computational Graph!
!! Neural networks (NNs) are represented by computational graphs (CGs).!
Definition: A series of operators applied to inputs. Easy to combine (lego
strategy), can be huge.!
Usefulness: Clear visualization of NN operations (great for debugging).!
CG are essential to derive gradients by backpropagation.!

Computational !
Graph: '

Google Tensorflow'
!"#$%&'(&%))*+'

,/'

Backpropagation!
Definition: A recursive application of chain rule along a
computational graph (CG) provides the gradients of all inputs,
weights, intermediate variables.
Chain rule (calculus):

\frac{\partial L}{\partial x} = \frac{\partial L(F(x))}{\partial x} = \frac{\partial L}{\partial F} \cdot \frac{\partial F(x)}{\partial x}

Essential property of backpropagation: It can compute the gradient of


any variable in the CG by a simple local rule, independently of the size of
the CG (including very deep NNs).

Xavier Bresson

15

Local Rule!
!! Any computational graph is a series of elementary neurons (also called
nodes, gates) → the gradient of the loss w.r.t. the inputs x, y of a local
neuron can be computed with the local rule:!
Gradient of L w.r.t. x, y = !
Recursive (upstream) gradient * Local gradient w.r.t. x, y!

(Figure: the recursive/upstream gradient of the loss L is combined with the local gradients via the chain rule.)

Backpropagation Techniques!
!! Backpropagation consists of two steps:!
(1) Forward pass/flow: Compute final loss value and all intermediate
output values of neurons/nodes. Save them in memory for gradient
computations (in backward step).!
(2) Backward pass/flow: Compute the gradient of the loss functions w.r.t.
all variables on the network using the local gradient rule.!

Backward flow: !
Compute gradient values'

Forward flow: !
Compute loss values'

An Example!
!! Step 1:!
\partial f / \partial f = 1

!! Step 2:!

!"#$%&'(&%))*+'
!"#$%&'(&%))*+

,3'

An Example!
!! Step 3:!

!! Step :
:!

!"#$%&'(&%))*+'

,4'

Another Example!

!"#$%&'(&%))*+'

-5'

Backpropagation Implementation!
Forward and Backward Functions!
!! Code:!

!"#$%&'(&%))*+'

-,'

Backpropagation Implementation!
Forward and Backward Functions!
!! Pseudo-code:!
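A minimal sketch of such forward/backward functions for a single multiply gate (illustrative class name), showing how the forward values are cached and then used by the local gradient rule:

```python
class MultiplyGate:
    """A minimal node of a computational graph: z = x * y (scalars)."""
    def forward(self, x, y):
        self.x, self.y = x, y          # cache forward values for the backward pass
        return x * y
    def backward(self, dz):
        # Local rule: gradient w.r.t. each input = upstream gradient * local gradient.
        dx = dz * self.y               # dz/dx = y
        dy = dz * self.x               # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)            # forward pass: z = -12
dx, dy = gate.backward(1.0)            # backward pass with dL/dz = 1: dx = -4, dy = 3
```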

!"#$%&'(&%))*+'

--'

Gradient with Vectorized Code!


!! When the variables x, y, z are (row) vectors, the chain rule involves a Jacobian matrix:

\frac{\partial L}{\partial x} = \frac{\partial f}{\partial x} \cdot \frac{\partial L}{\partial f}, \quad \text{with Jacobian } \left[ \frac{\partial f}{\partial x} \right]_{ij} = \frac{\partial f_i}{\partial x_j}

Q: What is the size of a Jacobian matrix? !
A: If the input dimension is 4096, then J is
4096 x 4096 = 16M entries! !

(Sizes: \partial L/\partial x is 4096x1, \partial f/\partial x is 4096x4096, \partial L/\partial f is 4096x1.)

!! Good news: Most computations are element-wise in computational graphs,
i.e. they apply to each element x independently of the other elements. In this case,
Jacobian = diagonal matrix → very fast computations in parallel with
vectorized operations on CPUs.!

!"#$%&'(&%))*+'

-.'

Example!
!! Activation gate:!

!"#$%&'(&%))*+'

-/'

Backpropagation Cost!
Cost: the backward pass costs about as much as the forward pass (slightly higher). !
The backward pass requires storing the forward values!!

Mini-batch optimization with backpropagation:!


(1) Sample randomly a batch of data!
(2) Forward propagate to get loss values!
(3) Backward propagate to get gradients!
(4) Update neural network weights and other intermediate variables!
!

Works on huge computational graphs.!

Xavier Bresson

25

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

26

Activation Functions!
!! Reminder: Neural network classifiers are succession of linear classification and
non-linear activations.!
Exs: 2-layer classifier:! f = W2 max(W1 x, 0)
3-layer classifier:! f = W3 max(W2 max(W1 x, 0), 0)
Activation !
function!

!"#$%&'(&%))*+'

-2'

Class of Activation Functions !

Xavier Bresson

28

Sigmoid Activation!
!! Historically popular by analogy
with neurobiology.!

Sigmoid: \sigma(x) = 1/(1 + e^{-x})

!! Three issues:!
(1) Saturated neurons kill gradients!
→ Vanishing gradient problem (later discussed); note \nabla\sigma = \sigma(1 - \sigma).!

(2) Exp is a bit computationally expensive.!

(3) Sigmoids are not zero-centered:!
Suppose the input neurons are always positive; then the
gradients on the weights are either all positive, or all negative: '
for f(z = w_1 x_1 + w_2 x_2),  \partial f/\partial w_i = (\partial f/\partial z)(\partial z/\partial w_i) = (\partial f/\partial z) x_i, which has the sign of \partial f/\partial z since x_i > 0.

→ Always zero-center your data!!


!"#$%&'(&%))*+'

-4'

Tanh Activation [LeCun-et.al91]!


!! Advantage: !
Zero-centered function " !

Tanh: \sigma(x) = \tanh(x)

!! Issue:!
Kill gradients  Vanishing
gradient problem (later discussed)!

!"#$%&'(&%))*+'

.5'

ReLu Activation [Hinton-et-al12]!


!! ReLU (Rectified Linear Unit):!

ReLU: \sigma(x) = \max(x, 0)

!! Advantages: !
(1) Converges ~6x faster than sigmoid/tanh!
(2) Does not saturate in the positive region!
(3) Max is computationally efficient !

!! Limitations:!
(1) Not a zero-centered function!
(2) It kills the gradient when the input is negative !
→ Standard trick: Initialize neurons with a small positive bias like 0.01. !

!"#$%&'(&%))*+'

.,'

Leaky ReLu [Mass-et.al13]!


!! Advantages over ReLU (Rectified Linear Unit):!
(1) Converges ~6x faster than sigmoid/tanh!
(2) Does not saturate in the positive region!
(3) Max is computationally efficient!
(4) It does not kill the gradient!
(5) The parameter α can be learned by
backpropagation.!

Leaky ReLU: \sigma(x) = \max(\alpha x, x), with \alpha = 0.01

!! In practice: Use ReLU, try out Leaky ReLU, do not expect much from
tanh, never use sigmoid. !
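For reference, minimal NumPy definitions of the four activation functions discussed above:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x): return np.tanh(x)
def relu(x): return np.maximum(x, 0)
def leaky_relu(x, alpha=0.01): return np.maximum(alpha * x, x)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x))   # ReLU zeroes negatives; Leaky ReLU keeps a small slope alpha
```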

!"#$%&'(&%))*+'

.-'

Demo: Train Fully Connected Neural


Networks with Backpropagation!
!! Run lecture10_code01.ipynb!

!"#$%&'(&%))*+'

..'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

34

Weight Initialization!
Q: What happens when initial W=0 is used? !
A: All neurons compute the same outputs and the weights are the same. !
→ Need to break symmetry!!

!! Natural idea: Use small random numbers for initialization.!


W = Gaussian/normal distribution with zero mean and 1e-2 standard deviation!

→ Works well for small networks, but not for deep networks.'
!"#$%&'(&%))*+'

.0'

Vanishing Gradient Problem!


!! Example: 10-layer net, 500 neurons on each layer, tanh activation, and
initialization:!
!! Collect output statistics at each layer: means, variances, histograms!

All activations become zero!!


Q: Why?'

!"#$%&'(&%))*+'

.1'

Vanishing Gradient Problem!


Vanishing gradient: At initialization, all weights
W are small. This has two consequences:!
(1) At each layer, output backpropagated gradient
is small!
(2) We chain all recursive gradients by
backpropagation:!
the gradient is multiplied by W at each layer → exponential decay at each
layer (e.g. W = 0.1 over 10 layers gives 0.1^10)!
The deeper the network, the smaller the gradient!!
This is the vanishing gradient problem (was a big issue for long time).

Xavier Bresson

37

Exploding Gradient Problem!


!! Let us try the opposite (large initial weights): !
→ All neurons get saturated to -1 and +1 (tanh
activation), and then the gradients = 0.!

→ It is tricky to set a good value for the scale of
the normal distribution: \sigma \in [0.01, 1]!

.3'

Xaviers Initialization [Glorot-et-al10] !


!! Idea: Pick an initial W that keeps the variance of the activations equal to 1 across all layers:!

' It scales the gradient:!


(1) Large #inputs  small weights!
(2) Small #inputs  large weights'
!! Theory: Reasonable initialization, but the mathematical derivation assumes linear
activations → it works well for tanh, but not for ReLU. It needs a small change (see the sketch below):!
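A minimal sketch of both initializations (illustrative layer sizes):

```python
import numpy as np

fan_in, fan_out = 3072, 100

# Xavier/Glorot initialization: keep the variance of the activations roughly constant across layers.
W_tanh = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)

# The small change suggested for ReLU (He et al.): scale by sqrt(2/fan_in) instead.
W_relu = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```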
!"#$%&'(&%))*+'

.4'

Batch Normalization [Ioffe-Szegedy15] !


!! Motivation: Unit Gaussian activations are desirable all over the network !
'let us enforce this property!!
!! Formula:!

BN along each dimension k of the layer:

\hat{x}_k = \frac{ x_k - E[x_k] }{ \sqrt{ Var[x_k] } }

!! BN is a node/gate in the NN: it is a smooth and differentiable function!
→ Backpropagation can be carried out! '

!"#$%&'(&%))*+'

/5'

Where to Insert BN in NN?!


!! Usually inserted after Fully Connected layers (or
convolutional layers, next lecture) and before
nonlinearity:!
!! Q: But do we necessarily want a unit Gaussian
input to a tanh layer?!
A: Not necessarily → let the network decide by
backpropagation.!

Normalize: '

\hat{x}_k = \frac{ x_k - E[x_k] }{ \sqrt{ Var[x_k] } }

Then allow the network to change the range if it wants to:'

y_k = \gamma_k \hat{x}_k + \beta_k

Note: the network can learn the identity mapping if it wants to:'

\gamma_k = \sqrt{ Var[x_k] }, \quad \beta_k = E[x_k]

Properties!
Pseudo-code:'

!! Properties:!
(1) Reduces strong dependence on
initialization!
(2) Improves the gradient flow through
the network!
(3) Allows higher learning rates !
 Learn faster the network!
(4) Acts as regularization!
!! Price: 30% more computational time.!
!! At test time: Mean and variance are estimated during training and
average values are selected.!
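A minimal sketch of the BN forward pass at training time (the running mean/variance used at test time are omitted; gamma and beta are the learnable scale/shift):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch. Normalize each dimension, then scale/shift with learnable gamma, beta."""
    mu = x.mean(axis=0)                       # E[x_k] over the mini-batch
    var = x.var(axis=0)                       # Var[x_k] over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # unit-Gaussian activations
    return gamma * x_hat + beta               # y_k = gamma_k * x_hat_k + beta_k

x = np.random.randn(64, 100) * 3 + 1
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                      # approximately 0 and 1
```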

!"#$%&'(&%))*+'

/-'

Demo: Batch Normalization !


!! Run lecture10_code02.ipynb!

!"#$%&'(&%))*+'

/.'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

44

Optimization for NNs!


!! Training Neural Networks:!
(1) Sample a batch of data.!
(2) Forward prop it through the graph, get loss values!
(3) Backprop to calculate the gradients!
(4) Update the parameters using gradients!
!! Code (main loop):!

Q: Can we do better than


simple stochastic gradient
descent (SGD) update? !

sgd is the slowest!!

A: Yes! Extensive works. Again


major bottleneck.'
!"#$%&'(&%))*+'

/0'

Why SGD is slow?!


!! Illustration:!
Level set of loss'

Gradient loss
is steep
vertically'

Gradient loss
is flat
horizontally'

Q: What is the trajectory along which we converge to the


minimum with SGD? !
A: Trajectory is jittering (up and down) because very slow progress along
flat directions.!
'Solution: Momentum update.!
!"#$%&'(&%))*+'

/1'

Momentum [Hinton-et.al86] !
!! New update rule (momentum; a code sketch follows below):

v^{m+1} = \mu v^m - \eta \nabla E(W^m), \quad W^{m+1} = W^m + v^{m+1}

(\mu: friction, -\eta \nabla E: acceleration, v: velocity)'

Physical interpretation: Momentum update is like rolling a ball along the


landscape of the loss function. !

!! Advantages: !
(1) Velocity builds up along flat directions.!
(2) Decrease velocity in steep directions.!
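A minimal runnable sketch of the momentum update on a toy quadratic loss with one steep and one flat direction (illustrative values):

```python
import numpy as np

def grad(x):                    # toy gradient: elongated quadratic bowl (steep and flat directions)
    return np.array([10.0 * x[0], 0.1 * x[1]])

x, v, mu, eta = np.array([1.0, 1.0]), np.zeros(2), 0.9, 0.05
for _ in range(200):
    v = mu * v - eta * grad(x)  # velocity builds up along flat directions, friction mu damps it
    x = x + v                   # the parameter update follows the velocity
print(x)                        # close to the minimum (0, 0)
```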

!"#$%&'(&%))*+'

/2'

Limitation of Momentum!
Momentum overshoots the minimum but overall gets faster than SGD
(too much velocity).!

In practice: !
(1) μ = 0.5 or 0.9!
(2) Initialization: v=0!
Xavier Bresson

48

Nesterov Momentum!
!! Nesterov accelerated gradient (NAG) technique used for momentum
update:!

Nesterov update (the only change: the gradient is evaluated at the look-ahead point):'

v^{t+1} = \mu v^t - \eta \nabla f(x^t + \mu v^t)
x^{t+1} = x^t + v^{t+1}

→ It can correct its trajectory
faster than standard momentum'
!"#$%&'(&%))*+'

/4'

Energy Landscape of NNs!


Energy landscapes of NNs are non-convex →
existence of local minimizers, which
is usually a big issue as most local solutions
are bad. !

Surprisingly, most local minimizers in large-scale networks are


satisfactory! Why? Nobody knows.!
So, initialize with random weights; training will (almost) always end up at a
good solution, with no bad local minimizers!!

The problem of local minimizers is for small-scale networks.!

Xavier Bresson

50

AdaGrad [Duchi-et.al11]!
!! Origin: Convex optimization.!
!! Update rule (AdaGrad, in its simplest form): accumulate the squared gradients, v^{m+1} = v^m + (\nabla E)^2, then

W^{m+1} = W^m - \eta \, \nabla E / \left( \sqrt{v^{m+1}} + \epsilon \right)   (\epsilon prevents division by 0)

→ The dynamics is controlled because each coordinate takes an (approximately) unit-scaled step → it equalizes the step size
in steep and flat directions.'

!! Issue: When v^m gets large, then x^m will stop moving! !
→ Solution: RMSProp!
!"#$%&'(&%))*+'

0,'

RMSProp [Hinton12]!
!! RMSProp update rule: It does not stop the learning process.!

!"#$%&'(&%))*+'

0-'

Adam [Kingma-Ba14]!
!! Adam = Momentum + Adagrad/RMSProp!

!! Adam = Default current optimization technique for NNs. !


!! Hyperparameters: the decay rates β1 ≈ 0.9 and β2 ≈ 0.99 (see the sketch below for common defaults).!
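A minimal runnable sketch of the Adam update on a toy quadratic loss, with commonly used default hyperparameter values (illustrative):

```python
import numpy as np

def grad(x):                               # toy gradient of ||x||^2
    return 2 * x

beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8   # common defaults (illustrative)
x = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx       # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx**2    # RMSProp-like second moment
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    x += -eta * m_hat / (np.sqrt(v_hat) + eps)
print(x)                                   # approximately (0, 0)
```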

!"#$%&'(&%))*+'

0.'

Global Learning Rate η !


!! All optimization algorithms have
a learning rate η as
hyperparameter.!
!! Q: Is a constant learning rate
good? !
A: No. The learning rate should decay
over time.!
Decay the learning rate at each epoch m, e.g.:!

(1) Exponential decay: η = η_0 e^{-k m}

(2) Polynomial (1/t) decay: η = η_0 / (1 + k m)

!! Common good practice: Babysit the loss value and the learning
rate.!
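A minimal sketch of the two decay schedules above (η0 and k are illustrative values):

```python
import numpy as np

eta0, k = 1e-2, 0.1
for m in range(5):                      # m = epoch index
    eta_exp = eta0 * np.exp(-k * m)     # exponential decay
    eta_poly = eta0 / (1 + k * m)       # polynomial (1/t) decay
    print(m, eta_exp, eta_poly)
```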

!"#$%&'(&%))*+'

0/'

Demo: Update Rules/!


Neural Network Optimization!
!! Run lecture10_code03.ipynb!

!"#$%&'(&%))*+'

00'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

56

Dropout Regularization [Hinton14]!


!! Dropout: A regularization process
tailored to NNs!
Idea: Randomly set some neurons to
zero in the forward pass.!

!! How to deactivate neurons?!


Use simple binary masks U.!
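A minimal sketch of (inverted) dropout with a binary mask U; p is the probability of keeping a neuron active:

```python
import numpy as np

p = 0.5                                     # probability of keeping a neuron active

def dropout_forward(h, train=True):
    """Inverted dropout: a binary mask U zeroes neurons at train time, no change at test time."""
    if not train:
        return h
    U = (np.random.rand(*h.shape) < p) / p  # mask, rescaled so the expected activation is unchanged
    return h * U

h = np.random.randn(4, 10)
print(dropout_forward(h))                   # about half of the activations are zeroed
```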

!"#$%&'(&%))*+'

02'

Why It Works?!
It prevents overfitting: It reduces the number of parameters to learn
for the NN.!

It increases the robustness of the learning process: at each dropout step,
dropout sub-samples the NN and we learn the best weights for this
sub-network. And we do it for many different sub-networks. Besides,
all these sub-networks share weights → global consistency of
weight values and better robustness, because we learn on smaller
networks.!

Xavier Bresson

58

Code!

Xavier Bresson

59

Demo: Dropout!
!! Run lecture10_code04.ipynb!

!"#$%&'(&%))*+'

15'

Outline!
Generic Gradient Descent Techniques!
Backpropagation!
Activation!
Weight Initialization!
Neural Network Optimization!
Dropout!
Conclusion!

Xavier Bresson

61

Summary!
!! Training Neural Networks:!
(1) Sample a batch of data.!
(2) Forward prop it through the graph, get loss values!
(3) Backprop to calculate the gradients!
(4) Update the parameters using gradients!

!! Neural Networks = Computational Graphs!


Lego approach of building large-scale NNs!
!! Activation functions:!
(1) Sigmoid (never)!
(2) Tanh (try)!
(3) ReLu (default)!
(4) Leaky ReLu (try)!

!"#$%&'(&%))*+'

1-'

Summary!
Weight initializations:!
(1) Xaviers initialization (default)!
(2) Batch Normalization (30% additional cost)!
Parameter updates/optimization:!
(1) SGD!
(2) Momentum!
(3) Nesterov momentum!
(4) Adagrad/RMSProp!
(5) Adam (default)!
Dropout regularization!

Xavier Bresson

63

Questions?

Xavier Bresson

64

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 10: Deep Learning 2 (Supplementary)!
Common Good Practices for NN Learning!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

Note: Some slides are taken from F.F. Li,


A. Karpathy, and J. Johnson's course on Deep Learning !
!"#$%&'(&%))*+'

,'

Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!

Xavier Bresson

Step 1: Pre-Process Data!


!! Assume a n x d data matrix X: !

!"#$%&'(&%))*+'

.'

Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!

Xavier Bresson

Step 2: Choose NN Architecture!


!! Start small: 1 hidden layer then increase number of layers.!

Example!
CIFAR:!

!! Initialization:!
(1) Small networks: Normal distribution with 0.01 standard deviation!
(2) Large networks: Xaviers initialization !

!"#$%&'(&%))*+'

0'

Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!

Xavier Bresson

Step 3: Monitor Loss Decrease!


!! Initialization: Loss value = - log(1/#classes)!

!! Add regularization: the loss value increases → good sanity check!

!"#$%&'(&%))*+'

2'

Step 3: Monitor Loss Decrease!


Let us train: !
(1) First, overfit small portion of data with SGD, reg=0!
Easy way to find a good value for the global learning rate η.!

Xavier Bresson

Step 3: Monitor Loss Decrease!


!! Let us train: !
(2) Second, set reg to a small value, then find a good learning rate η'
(3) Last, increase the reg value!
!"#$%&'(&%))*+'

5'

Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!

Xavier Bresson

10

Step 4: Hyperparameter Optimization!


!! Cross-validation strategy:!
(1) First step: Only use a few epochs to get an idea of what values work!
(2) Second step: Try out many values !
→ Very long computational times !

!"#$%&'(&%))*+'

,,'

Grid vs. Random Search!

!"#$%&'(&%))*+'

,-'

Hyperparameters!
!! List:!
(1) Network architecture!
(2) Learning rate, decay schedule!
(3) Regularization: L2 and dropout!

!"#$%&'(&%))*+'

,.'

Outline!
Step 1: Pre-process data!
Step 2: Choose NN Architecture!
Step 3: Monitor Loss Decrease!
Step 4: Hyperparameter Optimization!
Step 5: Monitor Test Accuracy!

Xavier Bresson

14

Step 5: Monitor Test Accuracy!

!"#$%&'(&%))*+'

,0'

Questions?

Xavier Bresson

16

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 11: Deep Learning 3!
Convolutional Neural Networks!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

Note: Some slides are taken from F.F. Li,


A. Karpathy, and J. Johnson's course on Deep Learning !
!"#$%&'(&%))*+'

,'

Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!

Xavier Bresson

A Brief History!
!! Hubel and Wiesel: experiments on the primary visual cortex starting in 1959; Nobel Prize in Medicine in 1981 for understanding
the primary visual cortex system.!
!! Visual system is composed of receptive fields called V1 cells that are
composed of neurons that activate depending on the orientation.!

!"#$%&'(&%))*+'

.'

Hierarchical Organization of Visual Neurons!


!! The second layer of the visual cortex is composed of V2 cells that takes as
inputs the outputs of V1 neurons. This forms a hierarchical organization.!

!"#$%&'(&%))*+'

/'

Perceptron [Rosenblatt57]!
!! Application: Character recognition!
Perceptron is only hardware (circuits, electronics), no code/simulations.!
Perceptron was connected to a camera that produced 400-pixel images.!
!

Update rule was empirical: W^{t+1} = W^t + \eta (D - Y^t) X

Activation was binary: \sigma(x) = 1 if \langle w, x \rangle + b > 0, and 0 otherwise.

No concept of loss function, no gradient, no backpropagation → learning was bad.!


Multilayer perceptron in 1960: stack perceptron, still hardware.!

!"#$%&'(&%))*+'

0'

Neocognitron [Fukushima80]!
Application: Handwritten character recognition!
Direct implementation of Hubel-Wiesel simple and complex cells (V1 and V2
cells) with hierarchical organization.!
Introduction of the concept of local features (receptive fields).!
No concept of loss function, no gradient, no backpropagation → learning was bad.!
Inspiration for convolutional neural networks (CNNs)!

Complex cells: perform pooling

Xavier Bresson

Backpropagation [Rumelhart-et.al86] !
Introduction of backpropagation: Concepts of loss function, gradient,
gradient descent.!

Issue: Backprop did not work for large-scale/deep NNs (vanishing gradient
problem).!
Xavier Bresson

Convolutional Neural Networks (CNNs)!


[LeCun-Bengio-et.al98]!
!! Like Google PageRank, a B$ algorithm, among top 10 algorithms in data
science.!

!! Computational issue in 1998: Very long to train for large-scale/deep


networks.!

!"#$%&'(&%))*+'

3'

2012 Breakthrough [Hinton-et.al12] !


!! AlexNet: CNN with 7 layers (5CL+2FC)!
'Error on ImageNet dataset is 15.3%, second is 26.3% (w/
handcrafted features)!
!! AlexNet uses:!
(1) ReLu activation!
(2) Dropout regularization!
(3) More layers!
(4) Graphics processing units (GPUs): A breakthrough in NN learning
as they allow to learn large-scale networks. !
!! Also breakthrough in speech
recognition [Dahl-et.al12]!
'Error decreased from 23.2% to 16.0%!!

!"#$%&'(&%))*+'

4'

CNNs/Deep Learning is Everywhere..!


in Computer Vision!!
!! Classification, Retrieval!
!! Detection, Segmentation!
!! Self-driving car!
!! Face detection!
!! Medical!
!! Go game!
!! Arts/deep dreams!

!"#$%&'(&%))*+'

,5'

DeepArts !
https://deepart.io

!"#$%&'(&%))*+'

,,'

CNNs are Used by All big IT Companies!


!! Facebook (Torch software)!
!! Google (TensorFlow software, Google Brain, Deepmind)!
!! Microsoft!
!
!! Tesla (OpenAI)!
!! Amazon (DSSTNE software)!
!! Apple!
!! IBM!

!"#$%&'(&%))*+'

,-'

Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!

Xavier Bresson

13

Convolutional Neural Networks!


CNNs are extremely efficient at extracting meaningful statistical patterns in
large-scale and high-dimensional datasets.!

Key idea: Learn local stationary structures and compose them to form
multiscale hierarchical patterns.!

Why are CNNs good? It is an open (math) question to prove the efficiency of
CNNs.!
Note: Despite the lack of theory, the entire ML and CV communities have
shifted to Deep Learning techniques! Ex: NIPS'16: 2326 submissions, 328 DL
(14%), Convex Optimization 90 (3.8%). !

Xavier Bresson

14

Local Stationarity!
!! Assumption: Data are locally stationary
across the data domain:!
→ similar local patches are shared.

!! How to extract local stationary patterns? !

Convolutional filters (filters with !
compact support kernels): the input x is convolved with a bank of filters F1, F2, F3, producing the feature maps x * F1, x * F2, x * F3.!

,0'

Convolutional Layer (CL)!


!! Step 1: Convolve the filter with the image: Slide filter
over the image spatially, and compute dot products.!

(Image: height x width x depth; the filter defines the receptive field of each neuron.)

1 number: The result of taking the
dot product between the filter and a
small 5x5x3 chunk of the image!
(i.e. 5*5*3 = 75-dimensional dot
product + bias): w^T x + b

,1'

Convolutional Layer (CL)!


!! Step 2: Produce a stack of activation maps. For example, using 6 5x5 filters,
we get 6 separate activation maps that we stack up to get a new image of
size 28x28x6.!

!! Step 3: Apply a non-linear activation function: For instance, ReLU.!
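A naive, illustrative sketch of these three steps (slide the filter, stack the activation maps, apply ReLU). A real implementation would use optimized convolution routines; the sizes below match the slide (6 filters of 5x5x3 on a 32x32x3 image):

```python
import numpy as np

def conv_layer(image, filters, bias):
    """Naive convolutional layer: image (H, W, C), filters (F, k, k, C), bias (F,).
    Returns a stack of F activation maps after ReLU (no padding, stride 1)."""
    H, W, C = image.shape
    F, k, _, _ = filters.shape
    out = np.zeros((H - k + 1, W - k + 1, F))
    for f in range(F):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patch = image[i:i+k, j:j+k, :]                        # local receptive field
                out[i, j, f] = np.sum(patch * filters[f]) + bias[f]   # dot product w^T x + b
    return np.maximum(out, 0)                                         # non-linear activation (ReLU)

maps = conv_layer(np.random.randn(32, 32, 3), np.random.randn(6, 5, 5, 3), np.zeros(6))
print(maps.shape)   # (28, 28, 6): six 28x28 activation maps, as on the slide
```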

!"#$%&'(&%))*+'

,2'

Multiscale Hierarchical Features!


!! Assumption: Local stationary patterns can be composed to form more abstract
complex patterns:!
(Figure: features from layer 1 to layer 4 → deep/hierarchical features, from simple to abstract.)

!! How to extract multiscale hierarchical patterns? !


Downsampling of data domain (s.a. image grid) with Pooling (s.a. max, average).!
(e.g. 2x2 max pooling)

!! Other advantage: Keep same computational complexity while increasing


#filters.!
!"#$%&'(&%))*+'

,3'

Illustration of CLs in CNNs!

!"#$%&'(&%))*+'

,4'

Illustration of Activation Maps!

Xavier Bresson

20

Classification Function!
Classifier: After extracting multiscale locally stationary features, use them to
design a classification function with the training labels.!
How to design a (linear) classifier? !
Fully connected neural networks.!

Class 1
Class 2

xout = W xlayer

Class K

Features
Xavier Bresson

Output signal
Class labels
21

Full Architecture of CNNs!


(Figure: full CNN architecture. The input signal x^{l=0} = x, e.g. an image, passes through convolutional layers, each made of convolutional filters F1, F2, F3, ReLU activation, grid downsampling and pooling. These layers extract local stationary features and compose them via downsampling and pooling, producing x^{l=1}, ..., x^l. Fully connected layers (the classification function) then map the features to the output signal y \in R^{n_c} of class labels.)

Example!

Xavier Bresson

23

Case Studies!
!! LeNet5 [LeCun-Bengio-et.al98]: !
Input is 32x32!
Architecture is CL-PL-CL-PL-FC-FC!
Accuracy on MNIST is 99.6%!

!! AlexNet [Krizhevsky-et.al12]: !
Input is 227x227x3!
Architecture is 7CL-3PL-2FC!
Top-5 error on ImageNet is 15.4%!
Note: CL1 with 96 filters 11x11: 227x227x3 → 55x55x96 (stride=4), #parameters=(11x11x3)x96=35K!
PL1 2x2: 55x55x96 → 27x27x96, #parameters=0!!

!"#$%&'(&%))*+'

-/'

Case Studies!
!! GoogleNet [Szegedy-et.al14]: !
Input is 227x227x3!
Architecture is 22 layers!
Top-5 error on ImageNet is 6.7%!

Architecture'

!! ResNet [He-et.al15]:
Microsoft Asia !
Input is 227x227x3!
Architecture is 152 layers!!
Top-5 error on ImageNet is 3.6%!

!"#$%&'(&%))*+'

-0'

The Deeper The Better!

!"#$%&'(&%))*+'

-1'

Demo: LeNet5!
!! Run lecture11_code01.ipynb!

TensorBoard!
!"#$%&'(&%))*+'

-2'

CNNs Only Process Euclidean-Structured Data!


!! CNNs are designed for Data lying on Euclidean spaces:!
(1) Convolution on Euclidean grids (FFT)!
(2) Downsampling on Euclidean grids!
(3) Pooling on Euclidean grids!
 Everything mathematically well defined and computationally fast!!

Q: What type of data can be processed with CNNs?!

Images (2D, 3D) !


videos (2+1D)'

Sound (1D)'

!! But not all data lie on Euclidean grids!!


!"#$%&'(&%))*+'

-3'

Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!

Xavier Bresson

29

Non-Euclidean Data!
!! Examples of irregularly/graph-structured data: !
(i) Social networks (Facebook, Twitter)!
(ii) Biological networks (genes, brain connectivity)!
(iii) Communication networks (Internet, wireless, traffic)!

(Figure: graph/network-structured data, e.g. social networks, brain structure, telecommunication networks.)

!! Main challenges: !
(1) How to define convolution, downsampling and pooling on graphs?!
(2) And how to make them numerically fast?!
!! Current solution: Map graph-structured data to regular/Euclidean grids with
e.g. kernel methods and apply standard CNNs. !
Limitation: Handcrafting the mapping is against CNN principle! !
!"#$%&'(&%))*+'

.5'

CNNs for Graph-Structured Data!


Our contribution [Defferrard-B-Vandergheynst16]: Generalizing CNNs
to any graph-structured data with same computational complexity as
standard CNNs!!
What tools for this generalization?!
(1) Graph spectral theory for convolution on graphs,!
(2) Balanced cut model for graph coarsening,!
(3) Graph pooling with binary tree structure of coarsened graphs.!

Xavier Bresson

31

Related Works!
!! Categories of graph CNNs: !
(1) Spatial approach!
(2) Spectral (Fourier) approach !
!! Spatial approach: !
! Local reception fields [Coates-Ng11, Gregor-LeCun10]:!
Find compact groups of similar features, but no defined convolution.!
! Locally Connected Networks [Bruna-Zaremba-Szlam-LeCun13]:!
Exploit multiresolution structure of graphs, but no defined convolution.!
! ShapeNet [Bronstein-et.al.1516]:!
Generalization of CNNs to 3D-meshes. Convolution well-defined in these
smooth low-dimensional non-Euclidean spaces. Handle multiple graphs. !
Obtained state-of-the-art results for 3D shape recognition.!
!! Spectral approach: !
! Deep Spectral Networks [Hena#-Bruna-LeCun15]:!
Computational complexity is O(n2), while ours is O(n). !
!"#$%&'(&%))*+'

.-'

Convolution on Graphs 1/3!


!! Graphs: G=(V,E,W), with V set of vertices, E set of edges, !
W similarity matrix, and |V|=n.!
(Figure: vertices i, j \in V connected by an edge e_{ij} \in E with weight W_{ij} = 0.9.)

!! Graph Laplacian (core operator to spectral graph theory [1]): !


2nd order derivative operator on graphs:!

L = D - W   (unnormalized)
L = I_n - D^{-1/2} W D^{-1/2}   (normalized)
[1] Chung, 1997'
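A minimal SciPy sketch for building these two Laplacians from a sparse similarity matrix W (illustrative function name):

```python
import numpy as np
import scipy.sparse as sp

def laplacian(W, normalized=True):
    """Build the graph Laplacian from a (sparse) symmetric similarity matrix W."""
    d = np.asarray(W.sum(axis=1)).flatten()              # vertex degrees
    D = sp.diags(d)
    if not normalized:
        return D - W                                     # L = D - W
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return sp.identity(W.shape[0]) - d_inv_sqrt.dot(W).dot(d_inv_sqrt)   # L = I - D^{-1/2} W D^{-1/2}

W = sp.random(10, 10, density=0.3, random_state=0)
W = W + W.T                                              # symmetrize to obtain a valid similarity matrix
L = laplacian(W)
```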
!"#$%&'(&%))*+'

..'

Convolution on Graphs 2/3!


Fourier transform on graphs [2]: L is a symmetric positive semidefinite matrix !
It has a set of orthonormal eigenvectors {u_l}_l known as the graph Fourier modes,
associated to nonnegative eigenvalues {\lambda_l}_l known as the graph frequencies.!

The Graph Fourier Transform of f \in R^n is

F_G f = \hat{f} = U^T f \in R^n,

whose value at frequency \lambda_l is:

\hat{f}(\lambda_l) = \hat{f}_l := \langle f, u_l \rangle = \sum_{i=0}^{n-1} f(i) u_l(i)

The inverse GFT is defined as:

F_G^{-1} \hat{f} = U \hat{f} = U U^T f = f,

whose value at vertex i is:

(U \hat{f})(i) = \sum_{l=0}^{n-1} \hat{f}_l u_l(i).

[2] Hammond, Vandergheynst, Gribonval, 2011

Xavier Bresson

34

Convolution on Graphs 3/3!


Convolution on graphs (in the Fourier domain) [2]:!

f *_G g = F_G^{-1}\big( F_G f \odot F_G g \big) \in R^n,

whose value at vertex i is:

(f *_G g)(i) = \sum_{l=0}^{n-1} \hat{f}_l \hat{g}_l u_l(i)

It is also convenient to see that f *_G g = \hat{g}(L) f, as

f *_G g = U\big( (U^T f) \odot (U^T g) \big) = U \, \mathrm{diag}\big( \hat{g}(\lambda_0), ..., \hat{g}(\lambda_{n-1}) \big) U^T f = U \hat{g}(\Lambda) U^T f = \hat{g}(L) f

[2] Hammond, Vandergheynst, Gribonval, 2011

Translation on Graphs 1/2!


Translation on graphs: A signal f defined on the graph can be translated to
any vertex i using the graph convolution operator [3]:!

T_i f := f *_G \delta_i,

where T_i is the graph translation operator with vertex shift i. !

The function f, translated to vertex i, has the following value at vertex j:

(T_i f)(j) = f(j - i) = (f *_G \delta_i)(j) = \sum_{l=0}^{n-1} \hat{f}_l u_l(i) u_l(j)

This formula is the graph counterpart of the continuous translation operator:

(T_s f)(x) = f(x - s) = (f * \delta_s)(x) = \int_R \hat{f}(\xi) e^{-2\pi i \xi s} e^{2\pi i \xi x} d\xi,

where \hat{f}(\xi) = \langle f, e^{2\pi i \xi x} \rangle, and the e^{2\pi i \xi x} are the eigenfunctions of the continuous
Laplace-Beltrami operator \Delta, i.e. the continuum version of the graph Fourier modes u_l.

[3] Shuman, Ricaud, Vandergheynst, 2016


Xavier Bresson

36

Translation on Graphs 2/2!


!! Note: Translation on graphs are easier to carry out with the spectral approach,
than directly in the spatial/graph domain.!

Figure 1 [Shuman-Ricaud-Vandergheynst16]: Translated signals in the continuous R^2 domain (a-c), and in the graph
domain (d-f). The component of the translated signal at the center vertex is
highlighted in green.
!"#$%&'(&%))*+'

.2'

Localized Filters on Graphs 1/3!


!! Localized convolutional kernels: As standard CNNs, we must !
define localized filters on graphs.!
!! Laplacian polynomial kernels [2]: We consider a family of spectral
kernels defined as:!

\hat{g}(\lambda_l) = p_K(\lambda_l) := \sum_{k=0}^{K} a_k \lambda_l^k,   (1)

where p_K is a K-th order polynomial function of the Laplacian eigenvalues \lambda_l.

This class of kernels defines spatially localized filters, as proved below:

Theorem 1. Laplacian-based polynomial kernels (1) are K-localized in the sense that

(T_i p_K)(j) = 0 \quad \text{if} \quad d_G(i, j) > K,   (2)

where d_G(i, j) is the discrete geodesic distance on the graph, that is, the shortest
path between vertex i and vertex j.
[2] Hammond, Vandergheynst, Gribonval, 2011'
!"#$%&'(&%))*+'

.3'

Localized Filters on Graphs 2/3!


Corollary 1. Consider the function \delta_{ij} := (T_i p_K)(j) = (p_K *_G \delta_i)(j) = (p_K(L)\,\delta_i)(j) = \sum_{k=0}^{K} a_k (L^k \delta_i)(j).

Then \delta_{ij} = 0 if d_G(i, j) > K.

The spatial profile of the polynomial filter is given by \delta_{ij}.

Figure 2. Illustration of localized filters on graphs. Laplacian-based polynomial
kernels are exactly localized in a K-ball B_i^K centered at vertex i (B_i^K = support of the
polynomial filter at vertex i).

Localized Filters on Graphs 3/3!


Corollary 2. Localized filters on graphs are defined according to the principle:

Frequency smoothness \Longleftrightarrow Spatial graph localization

This is the Heisenberg uncertainty principle extended to the graph
setting. Recent papers have studied the uncertainty principle on graphs.

(Figure: B_i^K = support of the polynomial filter at vertex i.)

Fast Chebyshev Polynomial Kernels 1/2!


!! Graph filtering: Let y be a signal x filtered by a Laplacian-based!
polynomial kernel:!

y = x *_G p_K = p_K(L) x = \sum_{k=0}^{K} a_k L^k x

The monomial basis {1, x, x^2, x^3, ..., x^K} provides localized spatial filters, but
does not form an orthogonal basis (e.g. \langle 1, x \rangle = \int_0^1 1 \cdot x \, dx = \tfrac{x^2}{2}\big|_0^1 = \tfrac{1}{2} \neq 0), which
limits its ability to learn good spectral filters.

!! Chebyshev polynomials: Let T_k(x) be the Chebyshev polynomial of order k, generated by the fundamental recurrence property T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x),
with T_0 = 1 and T_1 = x. The Chebyshev basis {T_0, T_1, ..., T_K} forms an orthogonal basis in [-1, 1].

Figure 3. First six Chebyshev polynomials.

Fast Chebyshev Polynomial Kernels 2/2!


Graph filtering with Chebyshev [2]: The filtered signal y defined with the
Chebyshev polynomials is:!

y = x *_G q_K = \sum_{k=0}^{K} \theta_k T_k(L) x,

with the Chebyshev spectral kernel:

q_K(\lambda) = \sum_{k=0}^{K} \theta_k T_k(\lambda).

!! Fast filtering: Denote X_k := T_k(L) x and rewrite y = \sum_{k=0}^{K} \theta_k X_k. Then
all {X_k} are generated with the recurrence equation X_k = 2 L X_{k-1} - X_{k-2}.
As L is sparse, all matrix multiplications are done between a sparse matrix
and a vector. The computational complexity is O(|E| K), and reduces to linear
complexity O(n) for k-NN graphs.

GPU parallel implementation: Linear algebra operations can be done in


parallel, allowing a fast GPU implementation of Chebyshev filtering. !
[2] Hammond, Vandergheynst, Gribonval, 2011
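A minimal SciPy sketch of this fast Chebyshev filtering, using the recurrence X_k = 2 L X_{k-1} - X_{k-2} with sparse matrix-vector products. The rescaling of the Laplacian spectrum to [-1, 1] and the value lmax = 2 (valid for the normalized Laplacian) are assumptions of this sketch, not details given on the slide:

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(L, x, theta, lmax=2.0):
    """y = sum_k theta_k T_k(L_scaled) x, via the recurrence X_k = 2 L X_{k-1} - X_{k-2}.
    L: sparse graph Laplacian, x: signal on the vertices, theta: K+1 filter coefficients."""
    L = sp.csr_matrix(L) * (2.0 / lmax) - sp.identity(L.shape[0])   # rescale spectrum to [-1, 1]
    X0 = x
    y = theta[0] * X0
    if len(theta) > 1:
        X1 = L.dot(x)
        y = y + theta[1] * X1
    for k in range(2, len(theta)):
        X2 = 2 * L.dot(X1) - X0            # Chebyshev recurrence: sparse matrix times vector
        y = y + theta[k] * X2
        X0, X1 = X1, X2
    return y

n = 100
A = sp.random(n, n, density=0.05, random_state=1)
A = A + A.T                                # symmetric similarity matrix
d = np.asarray(A.sum(axis=1)).flatten()
Dinv = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
L = sp.identity(n) - Dinv.dot(A).dot(Dinv)  # normalized Laplacian, spectrum in [0, 2]
y = chebyshev_filter(L, np.random.randn(n), theta=np.array([0.5, 0.3, 0.2]))
```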
Xavier Bresson

42

Graph Coarsening!
!! Graph coarsening: As standard CNNs, we must define a grid coarsening
process for graphs. It will be essential for pooling similar features together.!

Graph coarsening/
clustering

Gl=0 = G

Graph coarsening/
clustering

Gl=1

Gl=2

Figure 4: Illustration of graph coarsening.'

!! Graph partitioning: Graph coarsening is equivalent to graph clustering, which


is a NP-hard combinatorial problem.!

!"#$%&'(&%))*+'

/.'

Graph Partitioning!
!! Balanced Cuts [4]: Two powerful measures of graph clustering are the
Normalized Cut and Normalized Association defined as:!

Normalized Cut (partitioning by minimal edge cuts):

\min_{C_1, ..., C_K} \sum_{k=1}^{K} \frac{ Cut(C_k, C_k^c) }{ Vol(C_k) }

Normalized Association (partitioning by maximal vertex matching), equivalent by complementarity:

\max_{C_1, ..., C_K} \sum_{k=1}^{K} \frac{ Assoc(C_k) }{ Vol(C_k) }

Figure 5: Equivalence between NCut and NAssoc partitioning.'

where Cut(A, B) := \sum_{i \in A, j \in B} W_{ij}, Assoc(A) := \sum_{i \in A, j \in A} W_{ij},
Vol(A) := \sum_{i \in A} d_i, and d_i := \sum_{j \in V} W_{ij} is the degree of vertex i.

[4] Shi, Malik, 2000'


!"#$%&'(&%))*+'

//'

Graclus Graph Partitioning!


!! Graclus [5]: It is a greedy (fast) technique that computes clusters that locally
maximize the Normalized Association.!

(P1) Vertex matching: for an unmarked vertex i, pick the unmarked neighbor j that maximizes

\frac{ W_{ii}^l + 2 W_{ij}^l + W_{jj}^l }{ d_i^l + d_j^l }

The matched vertices {i, j} are merged into a super-vertex at the next coarsening level.

(P2) Graph coarsening G^l \to G^{l+1}:

W_{ij}^{l+1} = Cut(C_i^l, C_j^l), \quad W_{ii}^{l+1} = Assoc(C_i^l)

Partition energy at level l:

\sum_{\text{matched}\{i,j\}} \frac{ W_{ii}^l + 2 W_{ij}^l + W_{jj}^l }{ d_i^l + d_j^l } \approx \sum_{k=1}^{K} \frac{ Assoc(C_k^l) }{ Vol(C_k^l) },

where C_k^l is a super-vertex computed by (P1), i.e. C_k^l := matched{i, j}; it is also a cluster
of original vertices whose size at most doubles at each coarsening level.

Figure 6: Graph coarsening with Graclus. Graclus proceeds by two successive steps: (P1)
vertex matching, and (P2) graph coarsening. These two steps provide a local solution to
the Normalized Association clustering problem at each coarsening level l.'
[5] Dhillon, Guan, Kulis, 2007'

Fast Graph Pooling 1/2!


Graph pooling: As standard CNNs, we must define a pooling process such
as max pooling or average pooling. This operation will be done many times
during the optimization task. !

Unstructured pooling is inefficient: The graph and its coarsened versions
indexed by Graclus require storing a table with all matched vertices
→ memory consuming and not (easily) parallelizable.
!

Structured pooling is as efficient as a 1D grid pooling: Start from the
coarsest level, then propagate the ordering to the next finer level such that node
k has nodes 2k and 2k+1 as children → binary tree arrangement of the nodes
such that adjacent nodes are hierarchically merged at the next coarser level. !

Xavier Bresson

46

Fast Graph Pooling 2/2!


(Figure 7: Fast graph pooling using the graph coarsening structure. The matched vertices of the coarsened graphs G^{l=0}, G^{l=1}, G^{l=2} are reindexed with respect to the coarsening structure, replacing the unstructured arrangement of vertices by a binary tree arrangement, which allows a very efficient pooling on graphs, as fast as a regular 1D Euclidean grid pooling.)
!"#$%&'(&%))*+'

/2'

Full Architecture of Graph CNNs!


(Figure 8. Architecture of the proposed CNNs on graphs. The input signal x \in R^n on a graph G = G^{l=0}, e.g. a social, biological or telecommunication graph, passes through graph convolutional layers, each made of spectral filters g_{K}, ReLU activation, graph coarsening on the pre-computed coarsened graphs G^{l=1}, G^{l=2}, ..., and pooling. These layers extract multiscale local stationary features on graphs; fully connected layers then produce the output signal y \in R^{n_c} of class labels.)


Notation: l is the coarsening level, xl are the down sampled signals at layer
l, Gl is the coarser graph, gKl are the spectral filters at layer l, xlg are the
filtered signals, pl is the coarsening exponent, nc is the number of classes, y is
the output signal, and l is the number of parameters to learn at l.

!"#$%&'(&%))*+'

/3'

Optimization!
!! Backpropagation [6] = Chain rule applied to the neurons at each layer.!

Output of a graph convolutional layer (F_in input feature maps x_i, F_out output feature maps y_j):

y_j = \sum_{i=1}^{F_{in}} g_{\theta_{ij}}(L) \, x_i

Loss function: E = -\sum_{s \in S} l_s \log y_s

Gradient descent (local computations, accumulated by backpropagation):

\theta_{ij}^{n+1} = \theta_{ij}^n - \eta \, \frac{\partial E}{\partial \theta_{ij}}, \quad x_i^{n+1} = x_i^n - \eta \, \frac{\partial E}{\partial x_i}

Local gradients:

\frac{\partial E}{\partial \theta_{ij}} = \sum_{s \in S} \frac{\partial E}{\partial y_{j,s}} \frac{\partial y_{j,s}}{\partial \theta_{ij}} = \sum_{s \in S} [X_{0,s}, ..., X_{K-1,s}]^T \frac{\partial E}{\partial y_{j,s}}, \quad
\frac{\partial E}{\partial x_i} = \sum_{j=1}^{F_{out}} \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_i} = \sum_{j=1}^{F_{out}} g_{\theta_{ij}}(L) \frac{\partial E}{\partial y_j}

[6] Werbos 1982 and Rumelhart, Hinton, Williams, 1985'

Revisiting Euclidean CNNs!


!! Sanity check: MNIST is the most popular dataset in Deep Learning [8].!
It is a dataset of 70,000 images represented on a 2D grid of size 28x28
(dim data = 282 = 784) of handwritten digits, from 0 to 9.!

!! Graph: A k-NN graph (k=8) of the Euclidean grid: !

W_{ij} = \exp\left( -\| x_i - x_j \|_2^2 / \sigma^2 \right)

[8] LeCun, Bottou, Bengio, 1998'


!"#$%&'(&%))*+'

05'

Revisiting Euclidean CNNs!

!! Results: Classification rates!

Algorithm | Accuracy
Linear SVM | 91.76
Softmax | 92.36
CNNs [LeNet5] | 99.33
graph CNNs: CN32-P4-CN64-P4-FC512-softmax | 99.18

0,'

Non-Euclidean CNNs!
!! Text categorization with 20NEWS: It is a benchmark dataset introduced at
CMU [9]. It has 20,000 text documents (dim data = 33,000, #words in
dictionary) across 20 topics.!

Table 1. 20 Topics of 20NEWS!

Instance of document in topic:!


Auto!

Instance of document in topic:!


Medicine!

[9] Lang, 1995'


!"#$%&'(&%))*+'

0-'

Non-Euclidean CNNs!

Results: Classification rates!


Algorithm | Accuracy (word2vec features)
Linear SVM | 65.90
Multinomial NB | 68.51
Softmax | 66.28
FC + softmax + dropout | 64.64
FC + FC + softmax + dropout | 65.76
graph CNNs: CN32-softmax | 68.26

53

Demo: Convolutional Neural Networks for


Graph-Structured Data!
!! Run lecture11_code02.ipynb!

!"#$%&'(&%))*+'

0/'

Outline!
History of CNNs!
Standard CNNs!
CNNs for Graph-Structured Data!
Conclusion!

Xavier Bresson

55

Summary!
CNNs are a game changer:!
(1) Breakthrough for all Computer Vision-related problems!
(2) Revive the dream of Artificial Intelligence!
(3) Deep learning = Big Data + GPUs/Cloud + Neural Networks!
(4) Big question: why does it work so well?!

CNNs for unstructured data: Beyond Computer Vision!


(1) Generalization of CNNs to non-Euclidean domains/graph-structured data!
(2) Localized filters on graphs!
(3) Same learning complexity as CNNs while being universal to graphs!
(4) GPU implementation !

Xavier Bresson

56

Questions?

Xavier Bresson

57

Data Science
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 12: Deep Learning 4!
Recurrent Neural Networks!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

Note: Some slides are taken from F.F. Li,


A. Karpathy, and J. Johnson's course on Deep Learning !
!"#$%&'(&%))*+'

,'

Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!

Xavier Bresson

Motivation!
!! Recurrent Neural Networks (RNNs) operate on ordered sequences of
inputs and outputs. Examples: Text, financial series, videos, robot motion, etc.!

Output data!
Ex: class!

Hidden !
Layers!

Input data!
Ex: image'
Vanilla NNs!
One input vector maps to
one output vector.!
Ex: CNNs (image to
class)'
!"#$%&'(&%))*+'

Output data!
Ex: sequence of
words!
Hidden !
Layers!

Input data!
Ex: image'
RNNs!
One input vector maps to
multiple output vectors.!
Ex: Image captioning
(image to caption sentence)'
.'

Motivation!
Output data!
Ex: sequence of
words!

Output data!
Ex: class!

Hidden !
Layers!

Hidden !
Layers!

Input data!
Ex: sequence of
images!

Input data!
Ex: sequence of
words!

RNNs!
Multiple input vectors map
to an output vector.!
Ex: Video classification
(sequence of images to class)'

RNNs!
Multiple input vectors map
to multiple output vectors.!
Ex: Machine translation
(sentence to sentence)'

!! RNNs learn dynamics/temporal properties of data.!


!"#$%&'(&%))*+'

/'

Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!

Xavier Bresson

General Description!
!! RNNs are recurrent learning machines:!
!

(1) RNNs have a state ht. This state ht can be


modified by changing the RNN parameters
W. '
!

(2) RNNs receive at each time t an input vector


x, and learn to predict the next input vector
x at time t+1 with the output vector y.!
Example: the sequence 'hello'.

(Block RNN with state h_t and parameters W; input x, output y.)

!"#$%&'(&%))*+'

1'

Recurrence Formula!
!! Update of the RNN state is done with a
recurrence formula at each time step:!
h_t = f_W(h_{t-1}, x_t)

where h_t is the new state of the RNN, h_{t-1} is the previous state, x_t is the input vector at the
current time step, and f_W is the recurrence function with weights/parameters W.

!! Notes:!
(1) Recurrence function is independent of time t!!
Same function f is used at every time step.!
(2) Changing W will change the behavior of RNNs.!
(3) Weights W are learned by backpropagation on training data.!
!"#$%&'(&%))*+'

2'

Vanilla RNNs!
!! Simplest RNNs:!

h_t = f_W(h_{t-1}, x_t)   (recurrence formula; h_t = RNN state at step t)

h_t = \tanh( W_{hh} h_{t-1} + W_{xh} x_t )
y_t = W_{hy} h_t
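A minimal NumPy sketch of this vanilla RNN recurrence (illustrative class and sizes):

```python
import numpy as np

class VanillaRNN:
    """Minimal sketch of h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.W_xh = 0.01 * np.random.randn(hidden_dim, input_dim)
        self.W_hh = 0.01 * np.random.randn(hidden_dim, hidden_dim)
        self.W_hy = 0.01 * np.random.randn(output_dim, hidden_dim)
        self.h = np.zeros(hidden_dim)
    def step(self, x):
        self.h = np.tanh(self.W_hh.dot(self.h) + self.W_xh.dot(x))   # update the state
        return self.W_hy.dot(self.h)                                  # output (unnormalized scores)

rnn = VanillaRNN(input_dim=4, hidden_dim=16, output_dim=4)
y = rnn.step(np.array([1.0, 0.0, 0.0, 0.0]))   # one-hot 'h' from the vocabulary {h, e, l, o}
```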

Example: Character-Level Language Model!


!! Ask RNN to predict the next character in a sequence.!
Simple example: Training sequence is hello, Vocabulary={h,e,l,o}  input
vector is 4-dimensional.!

Recurrence formula: h_t = \tanh( W_{hh} h_{t-1} + W_{xh} x_t )

Linear/softmax classifier for the next character (unnormalized probabilities): y_t = W_{hy} h_t

The weights are learned by backpropagation.

Note: In text analysis, we never work with characters directly, but with
numbers (via a 1-to-1 mapping between characters and numbers).!

(Figure: the input characters of 'hello' are encoded as one-hot vocabulary vectors; the output scores show which next-character predictions are not yet good.)

VRNN = 100 Python lines!

https://gist.github.com/karpathy/d4dee566867f8291f086!

Xavier Bresson

10

VRNN = 100 Python lines!

!"#$%&'(&%))*+'

,,'

Example: Shakespeare-like Sequences !


!! Generate sequences during training: Seed with a few characters and
look at outputs. !

!"#$%&'(&%))*+'

,-'

Demo: Vanilla Recurrent Neural Networks!


!! Run lecture12_code01.ipynb!

!"#$%&'(&%))*+'

,.'

Example: Mathematics!
!! Training data: Open source textbooks on algebraic geometric!

!"#$%&'(&%))*+'

,/'

Example: Code!
!! Training data: Linux code!

!"#$%&'(&%))*+'

,0'

Image Captioning !
!! It is possible to merge CNNs and RNNs!!
Example: Image captioning !

!"#$%&'(&%))*+'

,1'

Design!
!! Step 1: Remove the last FC layer and softmax classifier in CNNs
(classification is not needed, only visual feature extractors).!

CNNs !

!"#$%&'(&%))*+'

,2'

Design!
!! Step 2: Connect CNN output to RNN.!

New!!
!"#$%&'(&%))*+'

,3'

Design!
!! Step 3: Construct the whole RNN.!

!"#$%&'(&%))*+'

,4'

Results!

!"#$%&'(&%))*+'

-5'

Demo: Image Captioning with RNNs!


!! Run lecture12_code02.ipynb!

!"#$%&'(&%))*+'

-,'

Deep RNNs!
!! Multilayer RNNs:!

The same recurrence is applied at every layer l, using the state of the layer below as input:

h_t = \tanh( W_{hh} h_{t-1} + W_{xh} x_t ) = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right), \quad W = [\, W_{xh} \;\; W_{hh} \,]

(one such matrix W per layer; h_t^l denotes the state of layer l at time t).

Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!

Xavier Bresson

23

Long Short-Term Memory (LSTM) !


[Hochreiter-Schmidhuber97]!

!! With CNNs, another B$ algorithm, among top 10 algorithms in data science.!


Use by all big IT companies: Facebook, Google, Microsoft, Apple, IBM, etc!
!! Standard RNNs suffer from the vanishing gradient problem, so they
cannot scale up to deep networks. LSTM does not have this issue.!
!! What is LSTM?!
LSTM is also a RNN but with a more complex
recurrence formula:!
(1) The state of RNN has more variables, and
more weights.!
(2) The update of the state variables is more
complex.!
Recurrence!
formula:'

!"#$%&'(&%))*+'

-/'

Understanding LSTM!
!! From paper: !

!"#$%&'(&%))*+'

-0'

Understanding LSTM!
LSTM has two state vectors: !
h: hidden state vector!
c: cell state vector. !
Besides:!
f: called forget vector!
i: called input vector!
o: called output vector!

Let us suppose the variables are binary, for easier analysis (see the sketch below).!
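A minimal NumPy sketch of one LSTM time step with the gates i, f, o and the candidate g. The weight layout, a single matrix W mapping the stacked [x, h] to the four gate pre-activations, is an assumption of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM time step. W maps the stacked [x, h] to the 4 gate pre-activations (i, f, o, g)."""
    H = h.size
    z = W.dot(np.concatenate([x, h]))       # all gates computed with one matrix multiply
    i = sigmoid(z[0:H])                     # input gate
    f = sigmoid(z[H:2*H])                   # forget gate: reset, or let the previous cell value flow
    o = sigmoid(z[2*H:3*H])                 # output gate
    g = np.tanh(z[3*H:4*H])                 # candidate values in [-1, 1]
    c = f * c + i * g                       # next cell state
    h = o * np.tanh(c)                      # next hidden state
    return h, c

H, D = 8, 4
W = 0.01 * np.random.randn(4 * H, D + H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(np.random.randn(D), h, c, W)
```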


Xavier Bresson

26

Understanding LSTM!
Time step t'

c=next cell state!


f=forget vector, f={0,1}!
This gate can reset the flow, or!
It can let flow the previous cell value'

!"#$%&'(&%))*+'

i=input vector, i={0,1}!


g = {-1,1}!
The gate i*g can add nothing, or!
It can increment the flow by 1, or -1'

-2'

Understanding LSTM!
Cell state c flows!
to hidden state'

tanh activation'

h=next hidden state!


c=cell state!
o={0,1}!
This gate can reset the flow, or!
It can let flow the previous hidden state'

Hidden state h flows!


to cell state'

!"#$%&'(&%))*+'

-3'

Understanding LSTM!
Stack up to get multilayer LSTM: !

Xavier Bresson

29

RNNs vs. LSTM!


The (+) gate distributes the gradient equally during
backpropagation. This allows one to avoid the vanishing gradient problem
(otherwise the gradient vanishes quickly).!

Xavier Bresson

30

LSTM Variants!

At the end of the day, LSTM gives the best performances over many
possible experimental conditions. !

Xavier Bresson

31

Demo: Image Captioning with LSTMs!


!! Run lecture12_code03.ipynb!

!"#$%&'(&%))*+'

.-'

Outline!
Motivation!
Vanilla Recurrent Neural Networks (RNNs)!
Long Short-Term Memory (LSTM)!
Conclusion!

Xavier Bresson

33

Summary!
RNNs offer lots of flexibility in NN architecture.!
!

Vanilla RNNs do not work well (vanishing gradient).!


!

Use LSTM (no vanishing gradient).!


!

Hot research:!
(1) Architecture design.!
(2) Better understanding.!
(3) Why performances are so good? Open theoretical question.!

Xavier Bresson

34

Questions?

Xavier Bresson

35

Data Science!
Sept 12-14, 2016!

EPFL-UNIL Continuing Education !


Lecture 13: Conclusion!
Xavier Bresson!
!

Swiss Federal Institute of Technology (EPFL) !

!"#$%&'(&%))*+'

,'

Data Science !
Science of transforming raw data into meaningful
knowledge to provide smart decisions to real-world
problems.!

!"#$%&'(&%))*+'

-'

Data Science!
Computer Science
Science!

Scalable databases for storing, accessing data. !


E.g. Cloud computing, Amazon EC2, Hadoop.!

Distributed and parallel frameworks !


for data processing. !
E.g. MapReduce, GraphLab.
GraphLab.!

Personalized !
Services!
Services

E.g. Healthcare (enhanced diagnostics)


diagnostics)!
(products)!
Commerce (products)

Mathematical!
Mathematical
Modeling!
Modeling

Design algorithms that transform


transform!
data into knowledge.
knowledge.!
Use Linear algebra, optimization, !
graph theory, statistics.!
statistics.

Data Science
Science!

Multidisciplinary field: 1+1=3


1+1=3!

Data!

Knowledge !
Discovery !
E.g. Physics, genomics, !
social sciences.
sciences.!

Collection of massive amounts of !


data at increasing rate.!
E.g. Social networks, sensor networks, !
mobile devices, biological networks,!
administrative, economics data!

Issues of privacy, !
ownership!
security, ownership

Domain
Domain!
Expertise
Expertise!
Sciences!
Sciences

E.g. Economy, Biology, Physics, Neuroscience, sociology.


sociology.!

Government!
Government
E.g. Healthcare, Defense, Education, Transportation..!

Industry!
Industry

Intelligent !
Systems!
Systems

E.g. Autonomous cars, security, !


interactive tools for data organization
organization!
and exploration. !

E.g. E-commerce, Telecommunications, !


Finance.
Finance.!

Major challenges: Multidisciplinary integration, large-scale databases, scalable


computational infrastructures, design math algorithms for massive datasets, trade-o"
speed and accuracy for real-time decisions, interactive visualization tools. !

Deep Learning!
Data Science = Big Data + Computational Infrastructure + Artificial Intelligence!
3rd industrial !
revolution!

!"#$%&'(&%))*+'

Cloud computing
computing!
GPU!

Math parts!

.'

A Brief History of Data Science/Deep Learning!


(Timeline:
1958: Perceptron (Rosenblatt)
1959: Primary visual cortex (Hubel-Wiesel)
1962: Birth of Data Science, split from Statistics (Tukey)
1975: Backprop (Werbos)
1980: Neocognitron (Fukushima)
1987: First NIPS
1989: First KDD
1995: SVM/kernel techniques (Vapnik)
1997: RNN/LSTM (Schmidhuber)
1998: CNN (LeCun)
1999: First NVIDIA GPU
2006: Auto-encoder (LeCun, Hinton, Bengio); first Amazon cloud center
2010: Kaggle platform
2012-2015: Deep Learning breakthrough (Hinton, Ng); data scientist becomes the 1st job in the US; Facebook AI (Torch), Google AI (TensorFlow), OpenAI center

From AI Hope to the AI Winter [1966-2012], dominated by kernel techniques, handcrafted features and graphical models, to the AI Resurgence.
Big Data: volume doubles every 1.5 years. Hardware: GPU speed doubles every year.
4th industrial revolution? A Digital Intelligence Revolution: breakthrough or a new AI bubble?)

Outline of the Course!


(Course map over the three days: Python (language for data science); Data Science; Graph Science (data structure, pattern extraction); Unsupervised Clustering (k-means, graph cuts); Supervised Classification (SVM); Feature Extraction (PCA, NMF, sparse coding); Recommender Systems (PageRank, collaborative and content filtering); Deep Learning (NNs, CNNs, RNNs); Data Visualization (manifold, t-SNE).)

Current Deep Learning!

!"#$%&'(&%))*+'

1'

Rapid Development!!

!"#$%&'(&%))*+'

2'

Future of Deep Learning!


!! Deep Learning is a new revolutionary
paradigm in AI.!
It has the capability to find highly meaningful
patterns in big data.!

!! Deep Learning is a breakthrough in


Computer Vision and Voice Recognition. !
However, it does not have yet the same
breakthrough in other fields.!
We are far away from a true AI.!

!"#$%&'(&%))*+'

3'

Future of Deep Learning!


!! Unsupervised learning: Google
brain. Self-taught learning with
unlabelled youtube videos and 16,000
computers!
!! Better hardware with bigger
machine cluster "!

!! Bigger data "!

!! Better understanding - why it works? # !

!"#$%&'(&%))*+'

,4'

Thank you!

Xavier Bresson

11
