H2O Deep Learning through Examples
Arno Candel
0xdata, H2O.ai
Scalable In-Memory Machine Learning
Who am I?
@ArnoCandel
15 years in HPC / Supercomputing / Modeling
H2O Deep Learning: Kaggle #1 rank (out of 413 teams), 40 days left
[Kaggle leaderboard screenshot]
Outline
Intro & Live Demo (10 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
  Higgs boson detection
  MNIST handwritten digits
  Text classification
Q & A (5 mins)
Value
In-Memory
Fast (Interactive)
Distributed
Open Source
Ownership of Methods
API / SDK
Extensibility
H2O Integration
[Diagram] H2O runs Standalone, over YARN, or on Hadoop MRv1, reading data from HDFS; client APIs via Java, JSON, Scala, and Python.
H2O Architecture
[Diagram]
Distributed in-memory K-V store with column compression and memory manager
MapReduce framework
Machine learning algorithms (e.g. Deep Learning)
R engine, prediction engine, nano-fast scoring engine
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
Higgs vs Background
Algorithm                      low-level features AUC   add derived features AUC
Generalized Linear Model       0.596                    0.684
Random Forest                  0.764                    0.840
Gradient Boosted Trees         0.753                    0.839
Neural Net (1 hidden layer)    0.760                    0.830
Deep Learning                  ?                        ?
Input data (image) -> Prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
[Timeline: 2009, 2011, 2013]
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the Google Brain (Andrew Ng)
+ multi-threaded speedup: H2O Fork/Join worker threads update the model asynchronously.
Example: predict marital status (married / single) from age, income, employment.

Layer            #neurons   #connections (into the layer)
Input layer      3          -
Hidden layer 1   4          3x4
Hidden layer 2   3          4x3
Output layer     2          3x2
Forward Propagation
Inputs x_i (age, income, employment) feed hidden layers y_j and z_k, which feed the output layer p_l of per-class probabilities (married / single), with sum_l(p_l) = 1:

y_j = tanh(sum_i(x_i * u_ij) + b_j)
z_k = tanh(sum_j(y_j * v_jk) + c_k)
p_l = softmax(sum_k(z_k * w_kl) + d_l)
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))

Activation function: tanh. Alternative: the rectifier x -> max(0, x).
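As a sanity check, here is a minimal NumPy sketch of this forward pass for the 3-4-3-2 network above (random weights stand in for trained ones; this is not H2O's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    # 3 inputs -> 4 hidden -> 3 hidden -> 2 output classes
    u, b = rng.normal(size=(3, 4)), np.zeros(4)
    v, c = rng.normal(size=(4, 3)), np.zeros(3)
    w, d = rng.normal(size=(3, 2)), np.zeros(2)

    def softmax(a):
        e = np.exp(a - a.max())          # shift for numerical stability
        return e / e.sum()

    x = np.array([0.3, -1.2, 0.5])       # standardized age, income, employment
    y = np.tanh(x @ u + b)               # hidden layer 1
    z = np.tanh(y @ v + c)               # hidden layer 2
    p = softmax(z @ w + d)               # p = [p(married), p(single)], sums to 1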
Automatic standardization of the input data.

Example output: predicted p(married) = 0.8, p(single) = 0.2; actual: married = 1, single = 0.

Objective: minimize the prediction error (MSE or cross-entropy):
Mean Square Error = (0.2^2 + 0.2^2) / 2   (penalizes differences per class)
Cross-entropy = -log(0.8)
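Both objectives for the example above, in a few lines of NumPy:

    import numpy as np

    predicted = np.array([0.8, 0.2])   # p(married), p(single)
    actual    = np.array([1.0, 0.0])   # actual: married

    mse = np.mean((predicted - actual) ** 2)   # (0.2^2 + 0.2^2)/2 = 0.04
    ce  = -np.log(predicted[actual == 1.0])    # -log(0.8), about 0.223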
Backward Propagation
How to compute dE/dw_i for the update w_i <- w_i - rate * dE/dw_i ?
Naive: for every i, evaluate E twice at (w_1, ..., w_i +- delta, ..., w_N). Slow!
Backprop: per neuron, net = sum_i(w_i * x_i) + b, y = activation(net), E = error(y); the chain rule gives dE/dw_i = dE/dy * dy/dnet * dnet/dw_i in a single backward pass.
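A small NumPy sketch contrasting the naive finite-difference approach with the chain rule for a single neuron (tanh activation and squared error are illustrative choices, not H2O internals):

    import numpy as np

    x = np.array([0.3, -1.2, 0.5])   # inputs
    w = np.array([0.1, -0.4, 0.2])   # weights
    b, target = 0.0, 1.0

    def E(w):
        y = np.tanh(w @ x + b)            # y = activation(net)
        return 0.5 * (y - target) ** 2    # E = error(y)

    # Naive: two evaluations of E per weight
    eps = 1e-6
    naive = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                      for e in np.eye(len(w))])

    # Chain rule: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, one backward pass
    y = np.tanh(w @ x + b)
    backprop = (y - target) * (1 - y ** 2) * x

    assert np.allclose(naive, backprop)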
Distributed Training: MapReduce over the K-V store
Threads within a node (JVM) update the model asynchronously; nodes synchronize through the distributed in-memory K-V store (served over HTTPD).

map: each node trains a copy of the model (w1, w2, w3, w4) on its local data; communication is auto-tuned
reduce: model averaging: w* = (w1 + w2 + w3 + w4) / 4
The updated model w* is redistributed to all nodes. Keep iterating over the data (epochs), score from time to time.
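A toy sketch of the map/reduce scheme in plain Python (a linear model and a hypothetical local_train stand in for H2O's per-node, multi-threaded training over the K-V store):

    import numpy as np

    rng = np.random.default_rng(1)

    def local_train(w, shard):
        # Stand-in for one node's (asynchronous, multi-threaded) pass
        # over its local data shard.
        for x, target in shard:
            w = w - 0.01 * ((w @ x) - target) * x   # one SGD step
        return w

    # map: 4 nodes each train their own copy of the model
    shards = [[(rng.normal(size=3), rng.normal()) for _ in range(100)]
              for _ in range(4)]
    w1, w2, w3, w4 = (local_train(np.zeros(3), s) for s in shards)

    # reduce: model averaging
    w_star = (w1 + w2 + w3 + w4) / 4
    # keep iterating over the data (epochs), score from time to time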
ADADELTA: adaptive per-weight learning rate

Compute the moving average of the squared update Δw_i at time t:
E[Δw_i^2]_t = rho * E[Δw_i^2]_{t-1} + (1 - rho) * Δw_i^2

Compute the RMS of Δw_i at time t with smoothing epsilon:
RMS[Δw_i]_t = sqrt(E[Δw_i^2]_t + epsilon)

Do the same for the gradient dE/dw_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[Δw_i]_{t-1} / RMS[dE/dw_i]_t
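A per-weight sketch of this bookkeeping, following Zeiler's ADADELTA paper (the grad argument is a hypothetical gradient dE/dw_i):

    import numpy as np

    rho, epsilon = 0.95, 1e-6

    def adadelta_step(w, grad, E_g2, E_dw2):
        E_g2 = rho * E_g2 + (1 - rho) * grad ** 2   # E[g^2]_t
        rate = np.sqrt(E_dw2 + epsilon) / np.sqrt(E_g2 + epsilon)
        dw = -rate * grad                           # RMS[dw]_{t-1} / RMS[g]_t
        E_dw2 = rho * E_dw2 + (1 - rho) * dw ** 2   # E[dw^2]_t
        return w + dw, E_g2, E_dw2

    # e.g. minimize E = w^2 (grad = 2w), starting from w = 0.5
    w, E_g2, E_dw2 = 0.5, 0.0, 0.0
    for _ in range(100):
        w, E_g2, E_dw2 = adadelta_step(w, 2 * w, E_g2, E_dw2)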
Dropout (regularization)
Training: randomly drop hidden activations (each is kept with probability p).
Testing: use all activations, but reduce them by a factor p (to simulate the missing activations during training).
cf. Geoff Hinton's paper
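A NumPy sketch of that train/test asymmetry (p = 0.5 for hidden units, as in Hinton's paper; the layer itself is illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    p = 0.5   # probability that a hidden activation is kept

    def hidden_layer(x, w, train):
        y = np.tanh(x @ w)
        if train:
            return y * (rng.random(y.shape) < p)   # drop activations at random
        return y * p   # testing: use all activations, reduced by factor p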
MNIST handwritten digits (10 classes)
World-class results! No pre-training, no distortions, no convolutions, no unsupervised training.
Running on 4 nodes with 16 cores each.
Weather Dataset
Predict RainTomorrow from Temperature, Humidity, Wind, Pressure, etc.
Model: 3 hidden Rectifier layers, Dropout, L1-penalty.
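With today's h2o Python package (the deck itself uses R), such a model might be configured roughly like this; the file path and column name are placeholders:

    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()
    weather = h2o.import_file("weather.csv")          # placeholder path
    train, valid = weather.split_frame(ratios=[0.8])

    model = H2ODeepLearningEstimator(
        activation="RectifierWithDropout",   # 3 hidden Rectifier layers + Dropout
        hidden=[32, 32, 32],
        hidden_dropout_ratios=[0.5, 0.5, 0.5],
        l1=1e-5,                             # L1-penalty
    )
    model.train(y="RainTomorrow", training_frame=train, validation_frame=valid)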
Text Classification
Example input: "vintage condition"
Train: 578,361 rows; Test: 64,263 rows
8,647 cols; 467 classes / 143 classes
Parallel Scalability
(for 64 epochs on MNIST, with 0.87M parameters)
[Charts: training time in minutes vs. number of H2O nodes (1 to 63), down to 2.7 minutes; speedup vs. number of H2O nodes (2 to 63)]
Anomaly Detection
Train a model of what's normal; the test-set prediction is then the reconstruction, which looks normal.
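With the current h2o Python API this idea can be sketched via the deep learning autoencoder (file names are placeholders):

    import h2o
    from h2o.estimators import H2OAutoEncoderEstimator

    h2o.init()
    train = h2o.import_file("normal.csv")   # "normal" rows only
    test  = h2o.import_file("test.csv")

    # Learn a compressed model of what's normal
    ae = H2OAutoEncoderEstimator(activation="Tanh", hidden=[20])
    ae.train(x=train.columns, training_frame=train)

    # Per-row reconstruction error: small means "looks normal"
    errors = ae.anomaly(test)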
R Vignette with example R scripts: http://0xdata.com/h2o/algorithms/
Higgs: Comparison with the Paper
paper: http://arxiv.org/pdf/1402.4735v2.pdf

Algorithm                         paper l-l AUC   H2O l-l AUC   H2O AUC (all features)
Generalized Linear Model          -               0.596         0.684   (default, binomial)
Random Forest                     -               0.764         0.840
Gradient Boosted Trees            0.73            0.753         0.839
Neural Net (1 hidden layer)       0.733           0.760         0.830
Deep Learning (3 hidden layers)   0.836           0.850         -
Deep Learning (4 hidden layers)   0.868           0.869         -
Deep Learning (5 hidden layers)   0.880           running       -
Tips for H2O Deep Learning

General:
More layers for more complex functions (exponentially more non-linearity).
More neurons per layer to detect finer structure in the data (risk of memorizing).
Add some regularization for less overfitting (lower validation-set error).

Specifically (a configuration sketch follows after this list):
Do a grid search to get a feel for convergence, then continue training.
Try Tanh/Rectifier; try max_w2 = 10..50, L1 = 1e-5..1e-3 and/or L2 = 1e-5..1e-3.
Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set. Input dropout is recommended for noisy high-dimensional input.
Distributed: more training samples per iteration is faster, but possibly less accurate.
With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
Without ADADELTA: try rate = 1e-4..1e-2, rate_annealing = 1e-5..1e-9, momentum_start = 0.5..0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
Try balance_classes = true for datasets with large class imbalance.
Enable force_load_balance for small datasets.
Enable replicate_training_data if each node can hold all the data.

Contribute to H2O! Add your own JIRA tickets!
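For instance, the non-ADADELTA tips map onto today's h2o Python API roughly as follows (values picked from the ranges above; adjust per dataset):

    from h2o.estimators import H2ODeepLearningEstimator

    model = H2ODeepLearningEstimator(
        activation="RectifierWithDropout",
        hidden=[200, 200],
        input_dropout_ratio=0.1,        # input dropout: up to 20%
        hidden_dropout_ratios=[0.5, 0.5],
        max_w2=25,                      # 10..50
        l1=1e-4,                        # 1e-5..1e-3
        adaptive_rate=False,            # "without ADADELTA"
        rate=1e-3,                      # 1e-4..1e-2
        rate_annealing=1e-6,            # 1e-5..1e-9
        momentum_start=0.5,             # 0.5..0.9
        momentum_stable=0.99,
        momentum_ramp=1e6,              # 1 / rate_annealing
        balance_classes=True,           # for large class imbalance
        force_load_balance=True,        # for small datasets
        replicate_training_data=True,   # if each node can hold all the data
    )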
Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

Thank you!