Deep Learning through Examples

Arno Candel
0xdata, H2O.ai
Scalable In-Memory Machine Learning

Silicon Valley Big Data Science Meetup,
Vendavo, Mountain View, 9/11/14

Who am I?
@ArnoCandel

PhD in Computational Physics, 2005, from ETH Zurich, Switzerland

6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
9 months at 0xdata/H2O - Machine Learning

15 years in HPC/Supercomputing/Modeling

Named 2014 Big Data All-Star by Fortune Magazine

H2O Deep Learning, @ArnoCandel

H2O Deep Learning:
Kaggle #1 rank (out of 413) - 40 days left

[Leaderboard screenshot: #1, #17]

Achieved with H2O Deep Learning from R!

@matlabulous (Jo-fai Chow, "Blend it like a Bayesian!") says:
"I am 99.99999999999% sure that I can still go further with H2O."

Outline
Intro & Live Demo (10 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
  Higgs boson detection
  MNIST handwritten digits
  text classification
Q & A (5 mins)

About H2O (aka 0xdata)

Java, Apache v2 Open Source
Join the www.h2o.ai/community!
#1 Java Machine Learning project on GitHub

Customer Demands for Practical Machine Learning

Requirements      Value
In-Memory         Fast (Interactive)
Distributed       Big Data (No Sampling)
Open Source       Ownership of Methods
API / SDK         Extensibility

H2O was developed by 0xdata from scratch to meet these requirements

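H2O Integration

[Diagram: H2O is accessed via Java, Scala, Python and JSON, and runs in three deployment modes, each on top of HDFS]
Standalone: H2O on HDFS
Over YARN: H2O on YARN on HDFS
On MRv1: H2O on Hadoop MR on HDFS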

H2O Architecture

Prediction Engine
R Engine
Distributed In-Memory K-V store
Col. compression
Memory manager
MapReduce
Nano-fast Scoring Engine
Machine Learning Algorithms, e.g. Deep Learning

H2O - The Killer App on Spark

http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html

H2O DeepLearning on Spark

// Test if we can correctly learn A, B where Y = logistic(A + B*X)
test("deep learning log regression") {
  val nPoints = 10000
  val A = 2.0
  val B = -1.5

  // Generate training data (seed 42)
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
  // Create RDD from training data
  val trainRDD = sc.parallelize(trainData, 2)
  trainRDD.cache()

  import H2OContext._
  // Create H2O data frame (will be implicit in the future)
  val trainH2ORDD = toDataFrame(sc, trainRDD)

  // Create an H2O DeepLearning model
  val dlParams = new DeepLearningParameters()
  dlParams.source = trainH2ORDD
  dlParams.response = trainH2ORDD.lastVec()
  dlParams.classification = true
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.train().get()

  // Score validation data (generated with a different seed, 17)
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
  val validationRDD = sc.parallelize(validationData, 2)
  val validationH2ORDD = toDataFrame(sc, validationRDD)
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future

  // Validate prediction
  validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData)
}

Brand-Sparkling-New Sneak Preview!

H2O R CRAN package

John Chambers (creator of the S language, R-core member) names the H2O R API among the top three promising R projects

H2O + R = Happy Data Scientist


Machine Learning on Big Data with R:
Data resides on the H2O cluster!

Higgs Particle Discovery


Large Hadron Collider: Largest experiment of mankind!
$13+ billion, 16.8 miles long, 120 megawatts, -456 °F, 1 PB/day, etc.
Higgs boson discovery (July 2012) led to 2013 Nobel prize!

[Images: Higgs vs. Background]
Images courtesy CERN / LHC
http://arxiv.org/pdf/1402.4735v2.pdf

Machine Learning Meets Physics

Or rather: Back to the roots
(WWW was invented at CERN in 1989)

Higgs: Binary Classification Problem


Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)

HIGGS UCI Dataset:
21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows
Metric: AUC = Area under the ROC curve (range: 0.5...1, higher is better)

Algorithm                    low-level H2O AUC    all features H2O AUC
Generalized Linear Model     0.596                0.684
Random Forest                0.764                0.840
Gradient Boosted Trees       0.753                0.839
Neural Net 1 hidden layer    0.760                0.830

(low-level -> all features: add derived features)
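
To make the AUC metric concrete, here is a minimal sketch in Python (not H2O's implementation). It uses the rank interpretation of AUC: the probability that a randomly chosen positive row is scored above a randomly chosen negative row, with ties counting half. The function name and the sample scores are made up for illustration, and the pairwise loop is quadratic, so it only suits small examples.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive/negative pair counts 1 if ranked correctly, 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores only: three of the four positive/negative pairs are ordered correctly.
example_auc = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A model that ranks every positive above every negative reaches 1.0, and random scores land near 0.5, which is why the useful range is 0.5 to 1.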

Higgs: Can Deep Learning Do Better?

Algorithm                    low-level H2O AUC    all features H2O AUC
Generalized Linear Model     0.596                0.684
Random Forest                0.764                0.840
Gradient Boosted Trees       0.753                0.839
Neural Net 1 hidden layer    0.760                0.830
Deep Learning                <Your guess goes here>

reference paper results: baseline 0.733

Let's build an H2O Deep Learning model and find out! (That was my last weekend)

What is Deep Learning?


Wikipedia:
Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example:
Input data (image) -> Prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans

What is NOT Deep


Linear models are not deep
(by definition)

Neural nets with 1 hidden layer are not deep
(only 1 layer - no feature hierarchy)

SVMs and Kernel methods are not deep
(2 layers: kernel + linear)

Classification trees are not deep
(operate on original input space, no new features generated)

Deep Learning is Trending


[Google Trends chart, 2009 to 2013]

Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the "Google Brain" (Andrew Ng)

Deep Learning History


slides by Yann LeCun (now Facebook)

Deep Learning wins competitions
AND
makes humans, businesses and machines (cyborgs!?) smarter

Deep Learning in H2O


1970s multi-layer feed-forward Neural Network
(supervised learning with stochastic gradient descent using back-propagation)

+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)

+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)

+ smart algorithms for accuracy
(weight initialization, adaptive learning rate, momentum, dropout regularization, L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)

= Top-notch prediction engine!

Example Neural Network


fully connected directed graph of neurons
[Diagram legend: input/output neuron, hidden neuron, information flow]
Inputs: age, income, employment
Outputs: married, single

Layer             #neurons    #connections
Input layer       3
Hidden layer 1    4           3x4
Hidden layer 2    3           4x3
Output layer      2           3x2

Prediction: Forward Propagation


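neurons activate each other via weighted sums
$p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!

[Diagram: inputs $x_i$ (age, income, employment) -> weights $u_{ij}$ -> hidden $y_j$ -> weights $v_{jk}$ -> hidden $z_k$ -> weights $w_{kl}$ -> outputs $p_l$ (married, single)]

$y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
$z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
$p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$

per-class probabilities: $\sum_l p_l = 1$
activation function: tanh; alternative: rectifier, $x \to \max(0, x)$
$b_j$, $c_k$, $d_l$: bias values (indep. of inputs)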
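
The forward pass above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not H2O code: the helper names (`dense`, `forward`), the 3-4-3-2 layer sizes and the constant weights are all assumptions chosen to keep the example small.

```python
import math

def softmax(values):
    """Numerically stable softmax: exp(v) / sum(exp(v))."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(a) for a in dense(x, u, b)]   # hidden layer 1: y_j
    z = [math.tanh(a) for a in dense(y, v, c)]   # hidden layer 2: z_k
    return softmax(dense(z, w, d))               # per-class probabilities p_l

# Made-up inputs and constant weights for a 3-4-3-2 network.
x = [0.5, -1.0, 0.25]
u = [[0.1] * 4 for _ in range(3)]
b = [0.0] * 4
v = [[0.2] * 3 for _ in range(4)]
c = [0.0] * 3
w = [[0.3, -0.3] for _ in range(3)]
d = [0.0, 0.0]
p = forward(x, u, b, v, c, w, d)
```

Whatever the weights, the softmax output `p` has one entry per class and sums to 1.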

Data preparation & Initialization


Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)

[Diagram: example network with inputs $x_i$ (age, income, employment), weights $w_{kl}$ and outputs married, single]

Automatic standardization of data
$x_i$: mean = 0, stddev = 1
horizontalize categorical variables, e.g.
{full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed

Automatic initialization of weights
Poor man's initialization: random weights $w_{kl}$
Default (better): Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
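
As a rough sketch of these preparation steps (not H2O's internal code), the following Python shows standardization, the N-1 indicator encoding from the employment example, and the uniform weight initialization. The function names are hypothetical, the population standard deviation is an assumption, and a constant column would need special handling because its standard deviation is zero.

```python
import math
import random

def standardize(column):
    """Shift and scale a numeric column to mean 0 and (population) stddev 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def encode_categorical(value, levels):
    """Expand a categorical value into len(levels) - 1 indicator columns.
    The last level is the all-zeros reference, as in the slide's example."""
    return [1 if value == level else 0 for level in levels[:-1]]

def init_weights(n_prev, n_units, rng):
    """Uniform weights in +/- sqrt(6 / (#units + #units_previous_layer))."""
    limit = math.sqrt(6.0 / (n_units + n_prev))
    return [[rng.uniform(-limit, limit) for _ in range(n_units)] for _ in range(n_prev)]

levels = ["full-time", "part-time", "none", "self-employed"]
ages = standardize([25.0, 35.0, 45.0, 55.0])
part_time = encode_categorical("part-time", levels)          # [0, 1, 0]
self_employed = encode_categorical("self-employed", levels)  # [0, 0, 0]
weights = init_weights(3, 4, random.Random(42))              # 3 inputs -> 4 hidden units
```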

Training: Update Weights & Biases


For each training row, we make a prediction and compare with the actual label (supervised learning):

           predicted    actual
married    0.8          1
single     0.2          0

Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = $(0.2^2 + 0.2^2)/2$    penalize differences per-class
Cross-entropy = $-\log(0.8)$    strongly penalize non-1-ness

Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation):
$w \leftarrow w - \mathrm{rate} \cdot \partial E/\partial w$
[Plot: error E vs. weight w, annotated with rate]
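
The two error measures for the married/single example can be checked with a short Python sketch. The function names are illustrative, and the cross-entropy assumes the natural logarithm and a one-hot actual label.

```python
import math

def mean_square_error(predicted, actual):
    """Average of the squared per-class differences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def cross_entropy(predicted, actual):
    """-sum(actual * log(predicted)); for a one-hot label this is -log(p_true)."""
    return -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)

predicted = [0.8, 0.2]   # married, single
actual = [1, 0]
mse = mean_square_error(predicted, actual)   # (0.2**2 + 0.2**2) / 2
ce = cross_entropy(predicted, actual)        # -log(0.8)
```

Cross-entropy grows without bound as the predicted probability of the true class approaches 0, while the per-class squared error is capped at 1, which is the sense in which it penalizes "non-1-ness" more strongly.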

Backward Propagation
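How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E/\partial w_i$?
Naive: For every i, evaluate E twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$. Slow!
Backprop: Compute $\partial E/\partial w_i$ via chain rule going backwards

[Diagram: $x_i$ -> $w_i$ -> net -> y -> E]
$\mathrm{net} = \sum_i w_i x_i + b$
$y = \mathrm{activation}(\mathrm{net})$
$E = \mathrm{error}(y)$

$\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \mathrm{net} \cdot \partial \mathrm{net}/\partial w_i$
$= \partial(\mathrm{error}(y))/\partial y \cdot \partial(\mathrm{activation}(\mathrm{net}))/\partial \mathrm{net} \cdot x_i$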
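
A small Python sketch for a single neuron contrasts the two approaches. It assumes a tanh activation and a squared error E = 0.5 * (y - target)^2, neither of which is fixed by the slide, and all names and numbers are illustrative.

```python
import math

def error(weights, bias, x, target):
    """E = 0.5 * (y - target)^2 with y = tanh(net), net = sum_i(w_i * x_i) + b."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 0.5 * (math.tanh(net) - target) ** 2

def backprop_gradient(weights, bias, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, all from a single forward pass."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    y = math.tanh(net)
    dE_dy = y - target        # derivative of 0.5 * (y - target)^2
    dy_dnet = 1.0 - y * y     # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dw_i = x_i

def naive_gradient(weights, bias, x, target, delta=1e-6):
    """Central differences: two evaluations of E per weight (slow for many weights)."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += delta
        down = list(weights)
        down[i] -= delta
        grads.append((error(up, bias, x, target) - error(down, bias, x, target)) / (2 * delta))
    return grads

weights, bias, x, target = [0.3, -0.2, 0.1], 0.05, [1.0, 2.0, -1.5], 1.0
grad = backprop_gradient(weights, bias, x, target)
check = naive_gradient(weights, bias, x, target)
rate = 0.1
updated = [w - rate * g for w, g in zip(weights, grad)]   # one SGD step
```

Both gradients agree to numerical precision, but the naive version needs 2N error evaluations for N weights, while the chain rule reuses one forward pass.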

H2O Deep Learning Architecture


initial model: weights and biases w
(stored in the H2O atomic in-memory K-V store)

1. map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
   (example with four nodes: w1, w2, w3, w4)
2. reduce: model averaging: average weights and biases from all nodes
   w* = (w1+w2+w3+w4)/4, combined from partial sums w1+w3 and w2+w4
   speedup is at least #nodes/log(#rows), arxiv:1209.4129v3

updated model: w*

communication: nodes/JVMs: sync, threads: async
Query & display the model via JSON, WWW (HTTPD)

Keep iterating over the data (epochs), score from time to time
*auto-tuned (default) or user-specified number of points per MapReduce iteration
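
The map/reduce loop can be imitated on a single machine with a toy model. The sketch below is not H2O code: it fits a one-parameter model y = w * x, treats four lists as the "nodes", runs the local SGD passes sequentially rather than in asynchronous threads, and uses a made-up learning rate and data set.

```python
def local_sgd(w, rows, rate):
    """Map step: one SGD pass over a node's local rows for the model y = w * x."""
    for x, y in rows:
        w -= rate * (w * x - y) * x   # gradient of 0.5 * (w * x - y)^2
    return w

def average(values):
    """Reduce step: model averaging across nodes."""
    return sum(values) / len(values)

# Four hypothetical nodes, each holding a shard of data generated by y = 2 * x.
shards = [
    [(x / 10.0, 2.0 * x / 10.0) for x in range(start, 41, 4)]
    for start in range(1, 5)
]

w = 0.0   # initial model
for epoch in range(50):
    local_models = [local_sgd(w, rows, rate=0.05) for rows in shards]   # map: w1..w4
    w = average(local_models)                                           # reduce: w*
```

Each iteration starts every node from the same averaged model w*, so the nodes only need to exchange weights once per MapReduce pass.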

Secret Sauce to Higher Accuracy


Adaptive learning rate - ADADELTA (Google)
  Automatically set learning rate for each neuron based on its training history
Regularization
  L1: penalizes non-zero weights
  L2: penalizes large weights
  Dropout: randomly ignore certain inputs
Grid Search and Checkpointing
  Run a grid search to scan many hyperparameters, then continue training the most promising model(s)

Detail: Adaptive Learning Rate


Compute moving average of $\Delta w_i^2$ at time t for window length rho:
$E[\Delta w_i^2]_t = \rho \cdot E[\Delta w_i^2]_{t-1} + (1-\rho) \cdot \Delta w_i^2$

Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
$\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$

Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate:
$\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E/\partial w_i]_t$

Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
Adaptive annealing / progress: Gradient-dependent learning rate, moving window prevents freezing (unlike ADAGRAD: no window)

cf. ADADELTA paper
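
The update rule can be written out for a single weight as follows. This is a sketch of the ADADELTA recurrences as stated above, not H2O's implementation; the function name, the quadratic test error E = (w - 3)^2 and the step counts are assumptions for illustration.

```python
import math

def adadelta_minimize(grad, w, rho=0.99, epsilon=1e-8, steps=100):
    """Run ADADELTA steps on a single weight w, given a gradient function dE/dw."""
    avg_sq_grad = 0.0    # moving average E[(dE/dw)^2]
    avg_sq_delta = 0.0   # moving average E[(delta w)^2]
    for _ in range(steps):
        g = grad(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
    return w

def grad(w):
    """Gradient of the toy error E = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_after_1 = adadelta_minimize(grad, 0.0, steps=1)
w_after_100 = adadelta_minimize(grad, 0.0, steps=100)
```

No global learning rate appears anywhere. The first step has size of about sqrt(epsilon / (1 - rho)), so epsilon and rho together control how cautiously training starts.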

Detail: Dropout Regularization


Training:
For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.

[Diagram: example network with inputs age, income, employment and outputs married, single]

Testing:
Use all activations, but scale them by a factor (1 - p) (to compensate for the activations that were missing during training).

cf. Geoff Hinton's paper
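
A minimal Python sketch of the two phases, under two stated assumptions: each activation is dropped independently with probability p (so the dropped fraction is p on average), and at test time the surviving scale is the keep probability 1 - p (for the common choice p = 0.5 this is the same as halving). Function names are illustrative.

```python
import random

def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Testing: keep every activation, scaled by the keep probability (1 - p)."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout_train(activations, 0.5, rng)   # a different random mask on every call
test_out = dropout_test(activations, 0.5)          # [0.5, 1.0, 1.5, 2.0]
```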

MNIST: digits classification


MNIST = Digitized handwritten digits database (Yann LeCun)

Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."

Data: 28x28=784 pixels with (gray-scale) values in 0...255

Train: 60,000 rows    784 integer columns    10 classes
Test:  10,000 rows    784 integer columns    10 classes

Standing world record:
Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)

Let's see how H2O does on the MNIST dataset!

H2O Deep Learning on MNIST: 0.87% test set error (so far)

test set error:
1.5% after 10 mins
1.0% after 1.5 hours
0.87% after 4 hours

World-class results!
No pre-training
No distortions
No convolutions
No unsupervised training

Frequent errors: confuse 2/7 and 4/9

Running on 4 nodes with 16 cores each

Weather Dataset
Predict RainTomorrow from Temperature,
Humidity, Wind, Pressure, etc.

Live Demo: Weather Prediction


5-fold cross validation
3 hidden Rectifier layers, Dropout, L1-penalty
Interactive ROC curve with real-time updates

12.7% 5-fold cross-validation error is at least as good as GBM/RF/GLM models

Live Demo: Grid Search


How did I find those parameters? Grid Search!
(works for multiple hyperparameters at once)

Then continue training the best model
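
The idea of a grid search fits in a few lines of Python. This sketch is generic and does not call H2O: the hyperparameter names in the grid are hypothetical labels, and the error function is a made-up stand-in for "train a model with these settings and return its validation error".

```python
import itertools

def grid_search(grid, validation_error):
    """Evaluate every combination in the grid and return the one with the lowest error."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    candidates = [dict(zip(names, combo)) for combo in combos]
    return min(candidates, key=validation_error)

# Hypothetical hyperparameter names and values, for illustration only.
grid = {"l1": [1e-5, 1e-4, 1e-3], "hidden_dropout": [0.0, 0.25, 0.5]}

def toy_validation_error(params):
    """Stand-in for training a model and scoring it on a validation set."""
    return abs(params["l1"] - 1e-4) + abs(params["hidden_dropout"] - 0.25)

best = grid_search(grid, toy_validation_error)
```

The number of candidates is the product of the list lengths (here 3 x 3 = 9), which is why training each candidate only briefly and then continuing the best one from a checkpoint saves time.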

Text Classification

Goal: Predict the item from seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"

Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,...,0
(the 1 entries mark words such as "gold", "vintage" and "condition")

Train: 578,361 rows    8,647 cols    467 classes
Test:  64,263 rows     8,647 cols    143 classes

Let's see how H2O does on the ebay dataset!
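
A binary word vector for the example description can be built as sketched below. The five-word vocabulary is a made-up stand-in for the 8,647 columns of the real data set, and splitting on whitespace is an assumption about the tokenization.

```python
def binary_word_vector(text, vocabulary):
    """1 if the vocabulary term occurs in the text, else 0 (word counts are ignored)."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

# Hypothetical tiny vocabulary; one column per term.
vocabulary = ["condition", "gold", "silver", "vintage", "watch"]
vector = binary_word_vector("Vintage 18KT gold Rolex 2 Tone in great condition", vocabulary)
```

Here `vector` is `[1, 1, 0, 1, 0]`. With thousands of columns almost every entry is 0, so such data compresses very well.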

Text Classification
Train: 578,361 rows    8,647 cols    467 classes
Test:  64,263 rows     8,647 cols    143 classes

Out-Of-The-Box: 11.6% test set error after 10 epochs!
Predicts the correct class (out of 143) 88.4% of the time!

Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)

Parallel Scalability
(for 64 epochs on MNIST, with 0.87% parameters)
[Two plots vs. number of H2O nodes (1 to 63): training time in minutes (axis 0 to 100, annotated value: 2.7 mins) and speedup (axis 0 to 40)]
(4 cores per node, 1 epoch per node per MapReduce)

Deep Learning Auto-Encoders for Anomaly Detection

Toy example:
Find anomaly in ECG heart beat data. First, train a model on what's normal: 20 time-series samples of 210 data points each

Deep Auto-Encoder:
Learn low-dimensional non-linear structure of the data that allows reconstructing the orig. data

Also for categorical data!

Deep Learning Auto-Encoders for Anomaly Detection

Model of what's normal + Test set with anomaly
=> Test set prediction is reconstruction, looks normal
Found anomaly! large reconstruction error
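
The detection step itself, flagging rows whose reconstruction error is unusually large, can be sketched without a neural network. In the Python below the trained auto-encoder is replaced by a deliberately crude stand-in (the per-position mean of the normal series), so this illustrates only the reconstruction-error thresholding, not the auto-encoder. The data, names and threshold rule are assumptions.

```python
def mean_profile(series_list):
    """Stand-in 'model of what's normal': the per-position mean of the normal series."""
    n = len(series_list)
    return [sum(column) / n for column in zip(*series_list)]

def reconstruction_error(series, reconstruction):
    """Mean squared difference between a series and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(series, reconstruction)) / len(series)

# Made-up 'normal' training series (the slide uses 20 series of 210 points each).
normal = [[0.0, 1.0, 0.0, -1.0], [0.1, 0.9, 0.0, -1.1], [-0.1, 1.1, 0.0, -0.9]]
profile = mean_profile(normal)
# Assumed rule: anything reconstructed worse than the worst normal series is an anomaly.
threshold = max(reconstruction_error(s, profile) for s in normal)

def is_anomaly(series):
    return reconstruction_error(series, profile) > threshold

flag_normal = is_anomaly([0.05, 1.0, 0.0, -1.0])   # close to the normal shape
flag_spike = is_anomaly([0.0, 1.0, 1.5, -1.0])     # contains an unexpected spike
```

A real auto-encoder reconstructs each input separately from a learned low-dimensional code, but the final comparison against a threshold works the same way.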

H2O brings Deep Learning to R


All parameters are available from R

R Vignette with example R scripts
http://0xdata.com/h2o/algorithms/

POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!

Higgs Particle Discovery with H2O


How well did H2O Deep Learning do?
<Your guess goes here>
reference paper results

Any guesses for AUC on low-level features?
AUC=0.76 was the best for RF/GBM/NN (H2O)
Let's see how H2O did in the past 30 minutes!

H2O Steam: Scoring Platform


http://server:port/steam/index.html

Higgs Dataset Demo on 10-node cluster
Let's score all our H2O models and compare them!
Live Demo

Scoring Higgs Models in H2O Steam

Live Demo on 10-node cluster:


<10 minutes runtime for all algos!
Better than LHC baseline of AUC=0.73!

Higgs Particle Detection with H2O


HIGGS UCI Dataset:
21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows

Parameters were not heavily tuned; H2O running on 10 nodes.

Algorithm                        Paper's* l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters
Generalized Linear Model         -                  0.596               0.684                  default, binomial
Random Forest                    -                  0.764               0.840                  50 trees, max depth 50
Gradient Boosted Trees           0.73               0.753               0.839                  50 trees, max depth 15
Neural Net 1 layer               0.733              0.760               0.830                  1x300 Rectifier, 100 epochs
Deep Learning 3 hidden layers    -                  0.836               0.850                  3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning 4 hidden layers    -                  0.868               0.869                  4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning 6 hidden layers    0.880              running             -                      6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else!
H2O prelim. results compare well with paper's results* (TMVA & Theano)

*Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf

Tips for H2O Deep Learning

General:
More layers for more complex functions (exp. more non-linearity).
More neurons per layer to detect finer structure in data (memorizing).
Add some regularization for less overfitting (lower validation set error).
Specifically:
Do a grid search to get a feel for convergence, then continue training.
Try Tanh/Rectifier, try max_w2=10..50, L1=1e-5..1e-3 and/or L2=1e-5..1e-3
Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation
set. Input dropout is recommended for noisy high-dimensional input.
Distributed: More training samples per iteration: faster, but less accuracy?
With ADADELTA: Try epsilon = 1e-4,1e-6,1e-8,1e-10, rho = 0.9,0.95,0.99
Without ADADELTA: Try rate = 1e-4..1e-2, rate_annealing = 1e-5..1e-9,
momentum_start = 0.5..0.9, momentum_stable = 0.99,
momentum_ramp = 1/rate_annealing.
Try balance_classes = true for datasets with large class imbalance.
Enable force_load_balance for small datasets.
Enable replicate_training_data if each node can hold all the data.

Extensions for H2O Deep Learning


- Vision: Convolutional & Pooling Layers PUB-644
- Anomaly Detection PUB-806
- Pre-Training: Stacked Auto-Encoders PUB-1014
- Faster Training: GPGPU support PUB-1013
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs other Deep Learning packages
- Investigate other optimization algorithms

Contribute to H2O!
Add your own JIRA tickets!

Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!

Join our Community and Meetups!
https://github.com/h2oai
http://docs.h2o.ai
www.h2o.ai/community
@h2oai

Thank you!
