
H2O World 2014 Training

Table of Contents
1. Introduction
2. H2O World Training Sandbox
3. H2O in Big Data Environments
4. Hands-On Training
i. H2O with the Web UI
ii. R with H2O
iii. Supervised Learning
i. Generalized Linear Models
ii. Gradient Boosted Models
iii. Random Forests
iv. Regression
v. Classification
vi. Deep Learning
iv. Unsupervised Learning
i. KMeans Clustering
ii. Dimensionality Reduction
iii. Anomaly Detection
v. Advanced Topics
i. Multi-model Parameter Tuning
ii. Categorical Feature Engineering
iii. Other Useful Tools
vi. Practical Use Cases for Marketing
5. Sparkling Water
6. Python on H2O
7. Demos
i. Tableau
ii. Excel
iii. Streaming Data
8. Build Applications on Top of H2O
i. KMeans
ii. Grep
iii. Quantiles
iv. Build with Sparkling Water
9. Troubleshooting
10. More Information

H2O World 2014 Training

Note: This information is now outdated. Please refer to the H2O World 2015 Training book instead!
November 2014's full day of immersive training with H2O. For
H2O versions 2.8 and prior.
What is H2O?
It is the only offering that combines the power of highly advanced algorithms, the freedom of open source, and the capacity
of truly scalable in-memory processing for big data on one or many nodes. Combined, these capabilities make it faster,
easier, and more cost-effective to harness big data to maximum benefit for the business.
Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through
faster and better predictive modeling. Existing Big Data stacks are batch-oriented. Search and analytics need to be
interactive. Use machines to learn machine-generated data. And more data beats better algorithms.
With H2O, you can make better predictions by harnessing sophisticated, ready-to-use algorithms and the processing power
you need to analyze bigger data sets, more models, and more variables.
Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic
way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you
can extend the platform seamlessly into your Hadoop environments.

Get H2O!
Download H2O
http://h2o.ai/download/

Join the Community


H2O Google Group

Hack H2O
https://github.com/h2oai/h2o.git

Introduction

H2O World 2014 Training

Setup H2O World Training Sandbox


1. Prerequisites
VirtualBox
Provided on H2O World USB sticks, or
Available for download
H2O World Training Sandbox Image (.ova format)
Provided on H2O World USB sticks, or
Available for download (4GB)
Note: You can also use your favorite virtualization tool, such as VMWare Fusion, VMWare Player, or Microsoft Hyper-V.

2. Setup Sandbox
Start VirtualBox

Select File > Import Appliance


Import wizard is shown

Open provided h2oworld-training.ova file from USB drive


Review configurations of the image file and hit Import

3. Launch Sandbox

Select the h2oworld-training item and click the Start icon

Login screen


3.1 Sandbox Credentials


user: h2o , password: h2o
Note: To access the root user account, use sudo su - (password: h2o ).

4. Sandbox Content
H2O v2.8.2.8 - "Maxwell"
Sparkling Water v0.2.1-31
RStudio Desktop v0.98.1091
R version 3.1.1 (2014-07-10) - "Sock it to Me"
H2O R package v2.8.2.8
Spark v1.1.0 for Hadoop 2.4
Git v1.7.1
Oracle JDK v1.7.0_45
IntelliJ Idea 14 Community Edition


H2O in Big Data Environments


H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful
algorithms while keeping the widely used languages of R and JSON as an API. H2O provides an elegant, Lego-like
infrastructure that brings fine-grained parallelism to math over simple distributed arrays.
Customers can use data locked in HDFS as a data source. H2O is a first-class citizen of the Hadoop infrastructure and interacts
naturally with the Hadoop JobTracker and TaskTrackers on all major distributions.
Similarly, users can also import data sitting in S3 by launching H2O on EC2 instances with access to S3 buckets. This gives
customers the flexibility to add nodes to a cloud with minimal difficulty.
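A minimal sketch of what this can look like from R, reusing the same h2o.importFile call that appears throughout the hands-on tutorials below; the connection details and the hdfs:// and s3n:// paths here are placeholders for illustration, not files shipped with the training sandbox:

library(h2o)
# Connect to a node of an H2O cloud already launched on the Hadoop/EC2 cluster
# (placeholder hostname, default port).
h2oServer <- h2o.init(ip = "cluster-node-1", port = 54321)
# Hypothetical locations; point these at data your cluster can actually reach.
airlines_hdfs <- h2o.importFile(h2oServer, path = "hdfs://namenode/user/h2o/allyears2k.csv",
                                key = "airlines_hdfs")
airlines_s3 <- h2o.importFile(h2oServer, path = "s3n://my-bucket/allyears2k.csv",
                              key = "airlines_s3")

Only the path scheme changes; the rest of the workflow is identical to working with local files.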
Read more about H2O on Big Data Environments from our VP of Engineering, Tom Kraljevic:
H2O Big Data Environments


Hands-On Training
In the following sections, you'll learn how to get started with H2O via the simple-to-use Web UI and R, and work through a number
of specific examples.

3.1 H2O with Web UI


Amy Wang: Step-by-step Tutorial (Video Tutorial)

3.2 H2O in R
Patrick Aboyoun: Basics and Exploratory Data Analysis (EDA)

3.3 Supervised Learning - Regression and Classification


Patrick Aboyoun: Introduction to Generalized Linear Models in H2O
Patrick Aboyoun: Introduction to Gradient Boosting Machines in H2O
Patrick Aboyoun: Introduction to Random Forests in H2O
Patrick Aboyoun: Regression
Patrick Aboyoun: Classification
Arno Candel: Deep Learning

3.4 Unsupervised Learning


Arno Candel & Spencer Aiello: K-Means Clustering
Arno Candel: Dimensionality Reduction on MNIST
Arno Candel: Anomaly Detection on MNIST with H2O Deep Learning

3.5 Advanced Topics


Arno Candel: Multi-model Parameter Tuning for Higgs Dataset
Arno Candel: Categorical Feature Engineering for Adult dataset
Arno Candel: Other Useful Tools

3.6 Yan: Marketing Algorithms and Use-Cases


Yan Zou: H2O for Marketing/CRM Applications - Presentation
Yan Zou: H2O for Marketing/CRM Applications - R Script
Vinod Iyengar: Lead Scoring For Real Time Bidding - Presentation


Data Science Flow from H2O's Web Interface


You can follow along with our video tutorial:

Step 1: Import Data


The airlines data set we are importing is a subset of the data made available by RITA with a mix of numeric and factor
columns. In the following tutorial, we will build multiple classification models predicting flight delays, run model
comparisons, and score on a specific model.
Navigate to Data > Import File
Input into path /data/h2o-training/airlines/allyears2k.csv and hit Submit
Click the nfs link C:\data\h2o-training\airlines\allyears2k.csv
Scroll down the page to get a preview of your data before clicking Submit again.

Step 2: Data Summary


On the data inspect page, navigate to the Summary, which you can also access via Data > Summary
Click Submit to get a summary of all the columns in the data:
Numeric Columns: Min, Max, and Quantiles
Factor Columns: Counts of each factor, Cardinality, NAs

Step 3: Split Data into Test and Training Sets


Navigate back to the data inspect page: Data > View All > allyears2k.hex > Split Frame
Select shuffle and click Submit
Select allyears2k_shuffled_part0.hex for the training frame

Step 4: Build a GLM model


Go to Model > Generalized Linear Model


Input for source: allyears2k_shuffled_part0.hex
Select for response: IsDepDelayed
Select all columns to be ignored (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest,
and Distance
Select for family: binomial
Check use all factor levels and variable importances
Hit submit to start the job

Step 5: Build a 50-Tree GBM model


Go to Model > Gradient Boosting Machine
Input for source: allyears2k_shuffled_part0.hex
Select for response: IsDepDelayed
Select all columns to be ignored (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest,
and Distance
Click Submit to start the job

Step 6: Build a simpler 5-Tree GBM model


Go to Model > Gradient Boosting Machine
Input for source: allyears2k_shuffled_part0.hex
Select for response: IsDepDelayed
Select all columns to be ignored (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest,
and Distance
Input for ntrees: 5
Click Submit to start the job
On the model output page, click the JSON tab.
On the model output page, click the JAVA tab.

Step 7: Deep Learning with Model Grid Search


Go to Model > Deep Learning
Input for source: allyears2k_shuffled_part0.hex
Select for response: IsDepDelayed
Select all columns to be ignored (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest,
and Distance
Input for hidden: (10,10), (20,20,20)
Click Submit to start the job
The models are sorted by error rates. Scroll all the way to the right to select the first model on the list.

Step 8: Multimodel Scoring Engine


Navigate to Score > Multi model Scoring (beta)
Select the data set allyears2k.hex, scroll to the compatible models, and select VIEW THESE MODELS...
Select all the models in the left-hand task bar.
Click SCORE..., select allyears2k_shuffled_part1.hex, and click OK
The tabular view of the models gives the user a side-by-side comparison of all the models.


Creating Visualizations
Navigate to the ADVANCED tab to see overlaid ROC curves
Click ADD VISUALIZATION...
For the X-Axis Field choose Training Time (ms)
For the Y-Axis Field choose AUC
Examine the new graph you created. Weigh the value of the extra gain in accuracy against the time taken to train the models.
Select the model with your desired level of accuracy, then copy and paste the model key and proceed to step 9.

Step 9: Create Frame with Predicted Values


Navigate back to Home Page > Score > Predict
Input for model: paste the model key you got from Step 8
Input for data: allyears2k_shuffled_part1.hex
Input for prediction: pred

Step 10: Export Predicted Values as CSV


Inspect the prediction frame
Select Download as CSV
or export any frame:
Navigate to Data > Export Files
Input for src key: pred
Input for path: /data/h2o-training/airlines/pred.csv

Step 11: Save a model for use later


Navigate to Data > View All
Choose to filter by the model key
Click Save Model
Input for path: /data/h2o-training/airlines/50TreesGBMmodel
Click Submit

Errors? Download and send us the log files!


Navigate to Admin > Inspect Log
Hit Download all logs

Step 12: Shutdown your H2O instance


Go to Admin > Shutdown

Extra Bonus: Reload that saved model


In an active H2O session:


Navigate to the Load Model page


Input for path: /data/h2o-training/airlines/50TreesGBMmodel
Hit Submit


R with H2O
This tutorial demonstrates basic data import, manipulations, and summarizations of data within an H2O cluster from within
R. It requires an installation of the h2o R package and its dependencies.

Load the H2O R package and start a local H2O cluster


Connection to an H2O cloud is established through the h2o.init function from the h2o package. For the purposes of this
training exercise, we will use a local H2O cluster running on the default port of 54321 . We will also use the default cluster
memory size and set nthreads = -1 to make all the CPUs available to the H2O cluster.

library(h2o)
h2oServer <- h2o.init(nthreads = -1)

Download Data
This tutorial uses a 10% sample of the Person-Level 1% 2013 Public Use Microdata Sample (PUMS) from the United States
Census Bureau, making it a Person-Level 0.1% 2013 PUMS.

wget https://s3.amazonaws.com/h2o-training/pums2013/adult_2013_full.csv.gz

Load data into the key-value store in the H2O cluster


We will use the h2o.importFile function to read the data into the H2O key-value store.

datadir <- "~/Downloads"


csvfile <- "adult_2013_full.csv.gz"
adult_2013_full <- h2o.importFile(h2oServer,
path = file.path(datadir, csvfile),
key = "adult_2013_full", sep = ",")

The key argument to the h2o.importFile function sets the name of the data set in the H2O key-value store. If the key
argument is not supplied, the data will reside in the H2O key-value store under a machine generated name.
The results of the h2o.ls function shows the size of the object held by the adult_2013_full key in the H2O key-value
store.

kvstore <- h2o.ls(h2oServer)


kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013_full"] / 1024^2

Examine the proxy object for the H2O resident data


The resulting adult_2013_full object is of class H2OParsedData , which implements methods commonly associated with
native R data.frame objects.

class(adult_2013_full)
dim(adult_2013_full)
head(colnames(adult_2013_full), 50)


Create an up-to-date UCI Adult Data Set


In the interest of familiarity, we will create a data set similar to the UCI Adult Data Set from the University of California Irvine
(UCI) Machine Learning Repository. In particular, we want to extract the age of person ( AGEP ), class of worker ( COW ),
educational attainment ( SCHL ), marital status ( MAR ), industry employed ( INDP ), relationship ( RELP ), race ( RAC1P ), sex
( SEX ), interest/dividends/net rental income over the past 12 months ( INTP ), usual hours worked per week over the past 12
months ( WKHP ), place of birth ( POBP ), and wages/salary income over the past 12 months ( WAGP ).

nms <- c("AGEP", "COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX",
"INTP", "WKHP", "POBP", "WAGP")
adult_2013 <- adult_2013_full[!is.na(adult_2013_full$WAGP) &
adult_2013_full$WAGP > 0, nms]
h2o.ls(h2oServer)

Although we created an object in R called adult_2013 , there is no value with that key in the H2O key-value store. To make
it easier to track our data set, we will copy its value to the adult_2013 key using the h2o.assign function and delete all the
machine-generated keys with the prefix Last.value that served as intermediary objects using the h2o.rm function.

adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")


h2o.ls(h2oServer)
rmLastValues <- function(pattern = "Last.value.")
{
keys <- h2o.ls(h2oServer, pattern = pattern)$Key
if (!is.null(keys))
h2o.rm(h2oServer, keys)
invisible(keys)
}
rmLastValues()
kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013"] / 1024^2

Summarize the 2013 update of the UCI Adult Data Set


As mentioned above, an R proxy object to an H2O data set implements several methods commonly associated with R
data.frame objects, including the summary function to obtain column-level summaries and the dim function to get the row
and column count.

summary(adult_2013)
dim(adult_2013)

As with R data.frame objects, individual columns within an H2O data set can be summarized using methods commonly
associated with R vector objects. For example, the quantile function in R is used to find sample quantiles at probability
values specified in the probs argument.

centiles <- quantile(adult_2013$WAGP, probs = seq(0, 1, by = 0.01))


centiles

The use of the $ operator to extract the column WAGP from the adult_2013 data set generated new Last.value keys that
we will clean up in the interest of maintaining a tidy key-value store.


h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Derive columns: capital gain and capital loss columns


The original UCI Adult Data Set contains columns for capital gain and capital loss, which can be extracted from the INTP
column within the Person-Level PUMS data set. We will derive these two columns using the ifelse function, where the test
condition is whether the INTP column is positive or negative; if that condition is met, the value is either INTP (capital
gain) / - INTP (capital loss), otherwise 0 . If we were just interested in measuring the magnitude of either a loss or a gain, we could
have used the abs function.

capgain <- ifelse(adult_2013$INTP > 0, adult_2013$INTP, 0)


caploss <- ifelse(adult_2013$INTP < 0, - adult_2013$INTP, 0)
adult_2013$CAPGAIN <- capgain
adult_2013$CAPLOSS <- caploss
adult_2013 <- adult_2013[,- match("INTP", colnames(adult_2013))]

Now that we have the capital gain and loss columns, we can assign our new data set to the adult_2013 key and remove all
the temporary keys from the H2O key-value store.

adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")


h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Derive columns: log transformations for income variables


The UCI Adult Data Set was originally created to predict whether a person's income in the early 1990s exceeds $50,000
per year. Given that incomes are right-skewed, transforming these measures to a log scale tends to make them more
conducive to use in predictive modeling.

adult_2013$LOG_CAPGAIN <- log(adult_2013$CAPGAIN + 1L)


adult_2013$LOG_CAPLOSS <- log(adult_2013$CAPLOSS + 1L)
adult_2013$LOG_WAGP <- log(adult_2013$WAGP + 1L)

Now that we have the log transformed columns, we can assign our new data set to the adult_2013 key and remove all the
temporary keys from the H2O key-value store.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Create cross-tabulations of original and derived categorical variables


We will begin an analysis of wages by exploring the pairwise relationships between wage groups subdivided into
percentiles and the variables we will use as predictors in our statistical models. In the code below, the h2o.cut function
creates the wage groups and the h2o.table function performs the cross-tabulations.

cutpoints <- centiles
cutpoints[1L] <- 0
adult_2013$CENT_WAGP <- h2o.cut(adult_2013$WAGP, cutpoints)
adult_2013$TOP2_WAGP <- adult_2013$WAGP > centiles[99L]
centcounts <- h2o.table(adult_2013["CENT_WAGP"], return.in.R = TRUE)
round(100 * centcounts/sum(centcounts), 2)
top2counts <- h2o.table(adult_2013["TOP2_WAGP"], return.in.R = TRUE)
round(100 * top2counts/sum(top2counts), 2)
relpxtabs <- h2o.table(adult_2013[c("RELP", "TOP2_WAGP")], return.in.R = TRUE)
relpxtabs
round(100 * relpxtabs/rowSums(relpxtabs), 2)
schlxtabs <- h2o.table(adult_2013[c("SCHL", "TOP2_WAGP")], return.in.R = TRUE)
schlxtabs
round(100 * schlxtabs/rowSums(schlxtabs), 2)

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Coerce integer columns to factor (categorical) columns


As with standard R integer vectors, integer columns in H2O can be converted to a categorical type using an as.factor
method. For our data set we have 8 columns that use integer codes to represent categorical levels.

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP"))
  adult_2013[[j]] <- as.factor(adult_2013[[j]])

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Create pairwise interaction terms for linear modeling


While some modeling approaches, such as gradient boosting machines (GBM), random forests, and deep learning, are
able to derive interactions between terms during the model training stage, other modeling approaches, such as
generalized linear models (GLM), require interactions to be user-defined inputs. We will use the h2o.interaction function to
generate a new column in our data set that pairs relationship ( RELP ) with educational attainment ( SCHL ) to form a new
column labeled RELP_SCHL that we will "column bind" to our data set using a cbind method.

inter_2013 <- h2o.interaction(adult_2013, factors = c("RELP", "SCHL"),
                              pairwise = TRUE, max_factors = 10000,
                              min_occurrence = 10)
adult_2013 <- cbind(adult_2013, inter_2013)
adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")
colnames(adult_2013)

Now that we have derived a few sets of variables, we can examine the H2O key-value store to ensure we have the
expected objects.


h2o.ls(h2oServer)
rmLastValues()
kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013"] / 1024^2

Generate group by aggregates


In addition to cross-tabulations, we can create more detailed group by aggregates using the h2o.ddply function that was
inspired by the ddply function from Hadley Wickham's plyr package, which is hosted on the Comprehensive R Archive
Network (CRAN).

wagpSummary <- function(frame) {
  cbind(N = length(frame$WAGP),
        Min = min(frame$WAGP),
        Median = median(frame$WAGP),
        Max = max(frame$WAGP),
        Mean = mean(frame$WAGP),
        StdDev = sd(frame$WAGP))
}
h2o.addFunction(h2oServer, wagpSummary)
statsByGroup <- as.data.frame(h2o.ddply(adult_2013, "RELP", wagpSummary))
colnames(statsByGroup)[-1L] <- c("N", "Min", "Median", "Max", "Mean", "StdDev")
statsByGroup <- statsByGroup[order(statsByGroup$Median, decreasing = TRUE), ]
rownames(statsByGroup) <- NULL
statsByGroup

Create training and test data sets to use during modeling


As a final step in an exploration of H2O basics in R, we will create a 75% / 25% split, where the larger data set will be used
for training a model and the smaller data set will be used for testing the usefulness of the model. We will achieve this by
using the h2o.runif function to generate random uniforms over [0, 1] for each row and using those random values to
determine the split designation for that row.

rand <- h2o.runif(adult_2013, seed = 1185)


adult_2013_train <- adult_2013[rand <= 0.75, ]
adult_2013_train <- h2o.assign(adult_2013_train, key = "adult_2013_train")
adult_2013_test <- adult_2013[rand > 0.75, ]
adult_2013_test <- h2o.assign(adult_2013_test, key = "adult_2013_test")

Now check to make sure the size of the resulting data sets meet expectations.

nrow(adult_2013)
nrow(adult_2013_train)
nrow(adult_2013_test)

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)


Supervised Learning - Regression and Classification


Patrick Aboyoun: Introduction to Generalized Linear Models in H2O
Patrick Aboyoun: Introduction to Gradient Boosting Machines in H2O
Patrick Aboyoun: Introduction to Random Forests in H2O
Patrick Aboyoun: Regression
Patrick Aboyoun: Classification
Arno Candel: Deep Learning


Introduction to Generalized Linear Models in H2O


This tutorial introduces H2O's Generalized Linear Models (GLM) framework in R.

Generalized Linear Models (GLM)


Intuition: A linear combination of predictors is sufficient for determining an
outcome.
Important components:
1. Exponential family for error distribution (Gaussian/Normal, Binomial, Poisson, Gamma, Tweedie, etc.)
2. Link function, whose inverse is used to generate predictions
3. (Elastic Net) Mixing parameter between the L1 and L2 penalties on the coefficient estimates.
4. (Elastic Net) Shrinkage parameter for the mixed penalty in 3 (see the sketch after this list).
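As a sketch of how components 3 and 4 above combine (writing $\alpha$ for the mixing parameter and $\lambda$ for the shrinkage parameter, matching the alpha and lambda arguments of h2o.glm ), the elastic net penalty added to the GLM objective is $\lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right)$, so $\alpha = 1$ gives the lasso, $\alpha = 0$ gives ridge regression, and $\lambda = 0$ turns the penalty off.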

R Documentation
The h2o.glm function fits H2O's Generalized Linear Models from within R.

library(h2o)
args(h2o.glm)

The R documentation (man page) for H2O's Generalized Linear Models can be opened from within R using the help or ?
functions:

help(h2o.glm)

We can run the example from the man page using the example function:

example(h2o.glm)

And run a longer demonstration from the h2o package using the demo function:

demo(h2o.glm)


Introduction to Gradient Boosting Machines in H2O


This tutorial introduces H2O's Gradient (Tree) Boosting Machines framework in R.

Gradient Boosting Machines (GBM)


Intuition: Average an ensemble of weakly predicting (small) trees where each tree "adjusts" to the "mistakes" of
the preceding trees.
Important components:
1. Number of trees
2. Maximum depth of tree
3. Learning rate ( shrinkage parameter), where smaller learning rates tend to require larger numbers of trees and vice versa (see the sketch after this list).
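As a rough sketch of why the learning rate and the number of trees trade off, each boosting stage adds a shrunken tree to the running prediction: $F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$, where $h_m$ is the tree fit to the current residuals and $\nu$ is the learning rate. A smaller $\nu$ means each tree corrects less of the remaining error, so more trees are typically needed to reach the same fit.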

R Documentation
The h2o.gbm function fits H2O's Gradient Boosting Machines from within R.

library(h2o)
args(h2o.gbm)

The R documentation (man page) for H2O's Gradient Boosting Machines can be opened from within R using the help or
? functions:

help(h2o.gbm)

We can run the example from the man page using the example function:

example(h2o.gbm)

And run a longer demonstration from the h2o package using the demo function:

demo(h2o.gbm)


Introduction to Random Forests in H2O


This tutorial introduces H2O's Random Forest framework in R.

Random Forests
Intuition: Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other
trees.
Important components:
1. Number of trees
2. Maximum depth of tree
3. Number of variables randomly sampled as candidates for splits
4. Sampling rate for constructing data set to use on each tree

R Documentation
The h2o.randomForest function fits H2O's Random Forest from within R.

library(h2o)
args(h2o.randomForest)

The R documentation (man page) for H2O's Random Forest can be opened from within R using the help or ? functions:

help(h2o.randomForest)

We can run the example from the man page using the example function:

example(h2o.randomForest)

And run a longer demonstration from the h2o package using the demo function:

demo(h2o.randomForest)


Regression
Regression using Generalized Linear Models, Gradient
Boosting Machines, and Random Forests in H2O
This tutorial demonstrates regression modeling in H2O using generalized linear models (GLM), gradient boosting machines
(GBM), and random forests. It requires an installation of the h2o R package and its dependencies.

Brief overview of regression modeling


Generalized Linear Models (GLM)
Intuition: A linear combination of predictors is sufficient for determining an outcome.
Important components:
1. Exponential family for error distribution (Gaussian/Normal, Poisson, Gamma, Tweedie, etc.)
2. Link function, whose inverse is used to generate predictions
3. (Elastic Net) Mixing parameter between the L1 and L2 penalties on the coefficient estimates.
4. (Elastic Net) Shrinkage parameter for the mixed penalty in 3.

Gradient (Tree) Boosting Machines (GBM)


Intuition: Average an ensemble of weakly predicting (small) trees where each tree "adjusts" to the "mistakes" of
the preceding trees.
Important components:
1. Number of trees
2. Maximum depth of tree
3. Learning rate ( shrinkage parameter), where smaller learning rates tend to require larger numbers of trees and vice versa.

Random Forests
Intuition: Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other
trees.
Important components:
1. Number of trees
2. Maximum depth of tree
3. Number of variables randomly sampled as candidates for splits
4. Sampling rate for constructing data set to use on each tree

Load the h2o R package and start a local H2O cluster



We will begin this tutorial by starting a local H2O cluster using the default heap size and as much compute as the operating
system will allow.

library(h2o)
h2oServer <- h2o.init(nthreads = -1)
rmLastValues <- function(pattern = "Last.value.")
{
keys <- h2o.ls(h2oServer, pattern = pattern)$Key
if (!is.null(keys))
h2o.rm(h2oServer, keys)
invisible(keys)
}

Load and prepare the training and testing data for analysis
This tutorial uses a 0.1% sample of the Person-Level 2013 Public Use Microdata Sample (PUMS) from the United States
Census Bureau, with 75% of that sample designated as the training data set and 25% as the test data set. This data
set is intended to be an update to the UCI Adult Data Set.

datadir <- "/data"


pumsdir <- file.path(datadir, "h2o-training", "pums2013")
trainfile <- "adult_2013_train.csv.gz"
testfile <- "adult_2013_test.csv.gz"
adult_2013_train <- h2o.importFile(h2oServer,
path = file.path(pumsdir, trainfile),
key = "adult_2013_train", sep = ",")
adult_2013_test <- h2o.importFile(h2oServer,
path = file.path(pumsdir, testfile),
key = "adult_2013_test", sep = ",")
dim(adult_2013_train)
dim(adult_2013_test)

For the purposes of validation, we will create a single column data set containing only the target variable LOG_WAGP from the
test data set.

actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"],
                              key = "actual_log_wagp")
rmLastValues()

Also, for our data set we have 8 columns that use integer codes to represent categorical levels, so we will coerce them to
factors after reading the data.

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
  adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
  adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
rmLastValues()

Fit a basic generalized linear model


To illustrate some regression concepts, we will add a column of random categories to the training data set.

rand <- h2o.runif(adult_2013_train, seed = 123)
randgrp <- h2o.cut(rand, seq(0, 1, by = 0.01))
adult_2013_train <- cbind(adult_2013_train, RAND_GRP = randgrp)
adult_2013_train <- h2o.assign(adult_2013_train, key = "adult_2013_train")
rmLastValues()

We will start with an ordinary linear model that is trained using the h2o.glm function with Gaussian (Normal) error and no
elastic net regularization ( lambda = 0 ).

log_wagp_glm_0 <- h2o.glm(x = "RAND_GRP", y = "LOG_WAGP",
                          data = adult_2013_train,
                          key = "log_wagp_glm_0",
                          family = "gaussian",
                          lambda = 0)
log_wagp_glm_0

Inspect an object containing a single generalized linear model


When a single model is trained by h2o.glm it produces an object of class H2OGLMModel .

class(log_wagp_glm_0)
getClassDef("H2OGLMModel")

In this model object, most of the values of interest are contained in the model slot.

names(log_wagp_glm_0@model)
coef(log_wagp_glm_0@model) # works through stats:::coef.default
log_wagp_glm_0@model$aic
1 - log_wagp_glm_0@model$deviance / log_wagp_glm_0@model$null.deviance

In addition the h2o.mse function can be used to calculate the mean squared error of prediction.

h2o.mse(h2o.predict(log_wagp_glm_0, adult_2013_test),
adult_2013_test[, "LOG_WAGP"])
rmLastValues()

Perform 10-fold cross-validation to measure variation in coefficient estimates


We can perform cross-validation within an H2O GLM model by specifying an integer greater than 1 in the nfolds argument
to the h2o.glm function.

log_wagp_glm_0_cv <- h2o.glm(x = "RAND_GRP", y = "LOG_WAGP",
                             data = adult_2013_train,
                             key = "log_wagp_glm_0_cv",
                             family = "gaussian",
                             lambda = 0,
                             nfolds = 10L)

The resulting model object is also of class H2OGLMModel , but now the xval slot is populated with model fits resulting when
one of the folds was dropped during training.

class(log_wagp_glm_0_cv)
length(log_wagp_glm_0_cv@xval)
class(log_wagp_glm_0_cv@xval[[1L]])


We can create boxplots for each of the coefficients to see how variable the estimates were as each of the folds was
dropped in turn.

boxplot(t(sapply(log_wagp_glm_0_cv@xval, function(x) coef(x@model)))[, -100L],
        names = NULL)
points(1:99, coef(log_wagp_glm_0_cv@model)[-100L], pch = "X", col = "red")
abline(h = 0, col = "blue")

Explore categorical predictors


In generalized linear models, a categorical variable with k categories is expanded into k - 1 model coefficients. This
expansion can occur in many different forms, with the most common being dummy variable encodings consisting of
indicator columns for all but the first or last category, depending on the convention.
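As a small base-R illustration of this expansion (plain R, not an H2O call, and the level labels are made up for the example), model.matrix shows how a factor with 3 categories becomes 2 indicator columns plus an intercept:

relp_demo <- factor(c("Husband", "Wife", "Own child", "Wife"))
model.matrix(~ relp_demo)
# Columns: (Intercept), relp_demoOwn child, relp_demoWife;
# the first level ("Husband") is the reference and is encoded by both indicators being 0.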
We will begin our modeling of the data by examining the regression of the natural logarithm of wages ( LOG_WAGP ) against
three sets of predictors: relationship ( RELP ), educational attainment ( SCHL ), and a combination of those two variables
( RELP_SCHL ).

log_wagp_glm_relp <- h2o.glm(x = "RELP", y = "LOG_WAGP",
                             data = adult_2013_train,
                             key = "log_wagp_glm_relp",
                             family = "gaussian",
                             lambda = 0)
log_wagp_glm_schl <- h2o.glm(x = "SCHL", y = "LOG_WAGP",
                             data = adult_2013_train,
                             key = "log_wagp_glm_schl",
                             family = "gaussian",
                             lambda = 0)
log_wagp_glm_relp_schl <- h2o.glm(x = "RELP_SCHL", y = "LOG_WAGP",
                                  data = adult_2013_train,
                                  key = "log_wagp_glm_relp_schl",
                                  family = "gaussian",
                                  lambda = 0)

As we can see below, both the Akaike information criterion (AIC) and percent deviance explained metrics point to using a
combination of RELP and SCHL in a linear model for LOG_WAGP .

log_wagp_glm_relp@model$aic
log_wagp_glm_schl@model$aic
log_wagp_glm_relp_schl@model$aic
1 - log_wagp_glm_relp@model$deviance / log_wagp_glm_relp@model$null.deviance
1 - log_wagp_glm_schl@model$deviance / log_wagp_glm_schl@model$null.deviance
1 - log_wagp_glm_relp_schl@model$deviance / log_wagp_glm_relp_schl@model$null.deviance

Fit an elastic net regression model across a grid of parameter settings


Now that we are familiar with H2O model fitting in R, we can fit more sophisticated models involving a larger set of
predictors.

addpredset <- c("COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP",
                "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

In the context of elastic net regularization, we need to search the parameter space defined by the mixing parameter alpha
and the shrinkage parameter lambda . To aid us in this search, H2O can produce a grid of models for all combinations of a
discrete set of parameters.
We will use different methods for specifying the alpha and lambda values as they are dependent upon one another. For
the alpha parameter, we will specify five values ranging from 0 (ridge) to 1 (lasso) by increments of 0.25. For lambda , we
will turn on an automated lambda search by setting lambda_search = TRUE and set the number of lambda values to 10 by
setting nlambda = 10 .

log_wagp_glm_grid <- h2o.glm(x = c("RELP_SCHL", addpredset), y = "LOG_WAGP",
                             data = adult_2013_train,
                             key = "log_wagp_glm_grid",
                             family = "gaussian",
                             lambda_search = TRUE,
                             nlambda = 10,
                             return_all_lambda = TRUE,
                             alpha = c(0, 0.25, 0.5, 0.75, 1))

We now have an object of class H2OGLMGrid that contains a list of H2OGLMModelList objects for each of the models fit on the
grid.

class(log_wagp_glm_grid)
getClassDef("H2OGLMGrid")
class(log_wagp_glm_grid@model[[1L]])
getClassDef("H2OGLMModelList")
length(log_wagp_glm_grid@model[[1L]]@models)
class(log_wagp_glm_grid@model[[1L]]@models[[1L]])

Currently the h2o package does not contain any helper functions for extracting models of interest, so we have to explore
the model object and choose the model we like best.

log_wagp_glm_grid@model[[1L]]@models[[1L]]@model$params$alpha # ridge
log_wagp_glm_grid@model[[2L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[3L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[4L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[5L]]@models[[1L]]@model$params$alpha # lasso
log_wagp_glm_grid_mse <- sapply(log_wagp_glm_grid@model,
                                function(x)
                                  sapply(x@models, function(y)
                                    h2o.mse(h2o.predict(y, adult_2013_test),
                                            actual_log_wagp)))
log_wagp_glm_grid_mse
log_wagp_glm_grid_mse == min(log_wagp_glm_grid_mse)
log_wagp_glm_best <- log_wagp_glm_grid@model[[5L]]@models[[10L]]

Fit a gaussian regression with a log link function


In the previous example we modeled WAGP on a natural logarithm scale, which implied a multiplicative error structure. We
can explore whether we instead have an additive error by supplying the untransformed wage variable ( WAGP ) as the response and using
the natural logarithm link function.
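To make the distinction concrete (a sketch, writing $x^\top \beta$ for the linear predictor): modeling the log-transformed response fits $E[\log(WAGP)] = x^\top \beta$, with errors that are additive on the log scale and hence multiplicative on the dollar scale, whereas the Gaussian family with a log link fits $\log(E[WAGP]) = x^\top \beta$, i.e. $E[WAGP] = \exp(x^\top \beta)$, with errors that stay additive on the dollar scale.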

wagp_glm_grid <- h2o.glm(x = c("RELP_SCHL", addpredset), y = "WAGP",
                         data = adult_2013_train,
                         key = "wagp_glm_grid",
                         family = "gaussian",
                         link = "log",
                         lambda_search = TRUE,
                         nlambda = 10,
                         return_all_lambda = TRUE,
                         alpha = c(0, 0.25, 0.5, 0.75, 1))
wagp_glm_grid

Fit a gradient boosting machine regression model


Given that not all relationships can be reduced to a linear combination of terms, we can compare the GLM results with those
of a gradient (tree) boosting machine. As with the final GLM exploration, we will fit a grid of GBM models by varying the
number of trees and the shrinkage rate and select the best model with respect to the test data set.

log_wagp_gbm_grid <- h2o.gbm(x = c("RELP", "SCHL", addpredset),
                             y = "LOG_WAGP",
                             data = adult_2013_train,
                             key = "log_wagp_gbm_grid",
                             distribution = "gaussian",
                             n.trees = c(10, 20, 40),
                             shrinkage = c(0.05, 0.1, 0.2),
                             validation = adult_2013_test,
                             importance = TRUE)
log_wagp_gbm_grid
class(log_wagp_gbm_grid)
getClassDef("H2OGBMGrid")
class(log_wagp_gbm_grid@model)
length(log_wagp_gbm_grid@model)
class(log_wagp_gbm_grid@model[[1L]])
log_wagp_gbm_best <- log_wagp_gbm_grid@model[[1L]]
log_wagp_gbm_best

A comparison of mean squared errors against the test set suggests our GBM fit outperforms our GLM fit.


h2o.mse(h2o.predict(log_wagp_glm_best, adult_2013_test),
actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_gbm_best, adult_2013_test),
actual_log_wagp)

Fit a random forest regression model


Lastly we will fit a single random forest model with 200 trees of maximum depth 10 and compare the mean squared errors
across the three model types.

log_wagp_forest <- h2o.randomForest(x = c("RELP", "SCHL", addpredset),
                                    y = "LOG_WAGP",
                                    data = adult_2013_train,
                                    key = "log_wagp_forest",
                                    classification = FALSE,
                                    depth = 10,
                                    ntree = 200,
                                    validation = adult_2013_test,
                                    seed = 8675309,
                                    type = "BigData")
log_wagp_forest
h2o.mse(h2o.predict(log_wagp_glm_best, adult_2013_test),
        actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_gbm_best, adult_2013_test),
        actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_forest, adult_2013_test),
        actual_log_wagp)

Fit a deep learning regression model


Lastly we will fit a single Deep Learning model with default settings (more details about Deep Learning follow later) and
compare the mean squared errors across the four model types.

log_wagp_dl <- h2o.deeplearning(x = c("RELP", "SCHL", addpredset),
                                y = "LOG_WAGP",
                                data = adult_2013_train,
                                key = "log_wagp_dl",
                                classification = FALSE,
                                validation = adult_2013_test)
log_wagp_dl
h2o.mse(h2o.predict(log_wagp_glm_best, adult_2013_test),
        actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_gbm_best, adult_2013_test),
        actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_forest, adult_2013_test),
        actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_dl, adult_2013_test),
        actual_log_wagp)


Classification using Generalized Linear Models, Gradient Boosting Machines, Random Forests and Deep Learning in H2O
This tutorial demonstrates classification modeling in H2O using generalized linear models (GLM), gradient boosting
machines (GBM), random forests and Deep Learning. It requires an installation of the H2O R package and its
dependencies.

Load the H2O R package and start a local H2O cluster


We will begin this tutorial by starting a local H2O cluster using the default heap size and as much compute as the operating
system will allow.

library(h2o)
h2oServer <- h2o.init(nthreads = -1)
rmLastValues <- function(pattern = "Last.value.")
{
keys <- h2o.ls(h2oServer, pattern = pattern)$Key
if (!is.null(keys))
h2o.rm(h2oServer, keys)
invisible(keys)
}

Load the training and testing data into the H2O key-value store
This tutorial uses a 0.1% sample of the Person-Level 2013 Public Use Microdata Sample (PUMS) from the United States
Census Bureau, with 75% of that sample designated as the training data set and 25% as the test data set. This data
set is intended to be an update to the UCI Adult Data Set.

datadir <- "/data"


pumsdir <- file.path(datadir, "h2o-training", "pums2013")
trainfile <- "adult_2013_train.csv.gz"
testfile <- "adult_2013_test.csv.gz"
adult_2013_train <- h2o.importFile(h2oServer,
path = file.path(pumsdir, trainfile),
key = "adult_2013_train", sep = ",")
adult_2013_test <- h2o.importFile(h2oServer,
path = file.path(pumsdir, testfile),
key = "adult_2013_test", sep = ",")
dim(adult_2013_train)
dim(adult_2013_test)

For the purposes of validation, we will create a single column data set containing only the target variable TOP2_WAGP from
the test data set.

actual_top2_wagp <- h2o.assign(adult_2013_test[, "TOP2_WAGP"],
                               key = "actual_top2_wagp")
rmLastValues()

Also, for our data set we have 8 columns that use integer codes to represent categorical levels, so we will coerce them to
factors after reading the data.


for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
  adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
  adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
rmLastValues()

Fit a basic logistic regression model


top2_wagp_glm_relp <- h2o.glm(x = "RELP", y = "TOP2_WAGP",
data = adult_2013_train,
key = "top2_wagp_glm_relp",
family = "binomial",
lambda = 0)
top2_wagp_glm_relp

Generate performance metrics from a single model


We can create detailed performance metrics using the h2o.performance function that was inspired by the performance
function from Tobias Sing's ROCR package, which is hosted on the Comprehensive R Archive Network (CRAN).

Binary classifier performance metrics

pred_top2_wagp_glm_relp <- h2o.predict(top2_wagp_glm_relp, adult_2013_test)
pred_top2_wagp_glm_relp
prob_top2_wagp_glm_relp <- h2o.assign(pred_top2_wagp_glm_relp[, 3L],
                                      key = "prob_top2_wagp_glm_relp")
rmLastValues()
f1_top2_wagp_glm_relp <- h2o.performance(prob_top2_wagp_glm_relp,
                                         actual_top2_wagp,
                                         measure = "F1")
f1_top2_wagp_glm_relp
class(f1_top2_wagp_glm_relp)
getClassDef("H2OPerfModel")

Plot Receiver Operating Characteristic (ROC) curve and find its Area Under the
Curve (AUC)
A receiver operating characteristic (ROC) curve is a graph of the true positive rate ( recall ) against the false positive rate
(1 - specificity) for a binary classifier.

plot(f1_top2_wagp_glm_relp, type = "cutoffs", col = "blue")

plot(f1_top2_wagp_glm_relp, type = "roc", col = "blue", typ = "b")


We can extract the area under the ROC curve from our model object to see how close it is to the ideal of 1.

f1_top2_wagp_glm_relp@model$auc

The Gini coefficient, not to be confused with the Gini impurity splitting metric used in decision tree based algorithms such
as h2o.randomForest , is a linear rescaling of AUC defined by $Gini = 2 * AUC - 1$.

f1_top2_wagp_glm_relp@model$gini
2 * f1_top2_wagp_glm_relp@model$auc - 1

Another metric we could have used to determine a cutoff is the lift metric, the ratio of the probability of a positive
prediction given the model to the baseline probability for the population. It is generated by the h2o.gains function,
which was inspired by Craig Rolling's gains package hosted on CRAN.

h2o.gains(actual_top2_wagp, prob_top2_wagp_glm_relp)

Making class predictions based on a probability cutoff


Using the cutoff that results in the largest harmonic mean of precision and recall (F1 score), we can now classify our
predictions on the test data set.
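For reference, the F1 score being maximized over candidate cutoffs is the harmonic mean of precision and recall, $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$, so a cutoff only scores well if it keeps both quantities reasonably high.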

class(f1_top2_wagp_glm_relp@model)
names(f1_top2_wagp_glm_relp@model)
class_top2_wagp_glm_relp <- prob_top2_wagp_glm_relp > f1_top2_wagp_glm_relp@model$best_cutoff
class_top2_wagp_glm_relp <- h2o.assign(class_top2_wagp_glm_relp,
                                       key = "class_top2_wagp_glm_relp")
rmLastValues()

Generate a confusion matrix from a single model


To understand how well our model fits the test data set, we can examine the results of the h2o.confusionMatrix function for
a tabular display of the predicted versus actual values.

h2o.confusionMatrix(class_top2_wagp_glm_relp, actual_top2_wagp)

Remove temporary objects from R and the H2O cluster


rm(pred_top2_wagp_glm_relp,
prob_top2_wagp_glm_relp,
class_top2_wagp_glm_relp)
rmLastValues()

Fit an elastic net logistic regression model across a grid of parameter settings
Now that we are familiar with H2O model fitting in R, we can fit more sophisticated models involving a larger set of
predictors.

addpredset <- c("COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP",
                "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

In the context of elastic net regularization, we need to search the parameter space defined by the mixing parameter alpha
and the shrinkage parameter lambda . To aid us in this search, H2O can produce a grid of models for all combinations of a
discrete set of parameters.
We will use different methods for specifying the alpha and lambda values as they are dependent upon one another. For
the alpha parameter, we will specify five values ranging from 0 (ridge) to 1 (lasso) by increments of 0.25. For lambda , we
will turn on an automated lambda search by setting lambda_search = TRUE and set the number of lambda values to 10 by
setting nlambda = 10 .

top2_wagp_glm_grid <- h2o.glm(x = c("RELP_SCHL", addpredset),
                              y = "TOP2_WAGP",
                              data = adult_2013_train,
                              key = "top2_wagp_glm_grid",
                              family = "binomial",
                              lambda_search = TRUE,
                              nlambda = 10,
                              return_all_lambda = TRUE,
                              alpha = c(0, 0.25, 0.5, 0.75, 1))

We now have an object of class H2OGLMGrid that contains a list of H2OGLMModelList objects for each of the models fit on the
grid.

class(top2_wagp_glm_grid)
getClassDef("H2OGLMGrid")
class(top2_wagp_glm_grid@model[[1L]])
getClassDef("H2OGLMModelList")
length(top2_wagp_glm_grid@model[[1L]]@models)
class(top2_wagp_glm_grid@model[[1L]]@models[[1L]])

Currently the h2o package does not contain any helper functions for extracting models of interest, so we have to explore
the model object and choose the model we like best.

top2_wagp_glm_grid@model[[1L]]@models[[1L]]@model$params$alpha # ridge
top2_wagp_glm_grid@model[[2L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[3L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[4L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[5L]]@models[[1L]]@model$params$alpha # lasso
top2_wagp_glm_grid_f1 <- sapply(top2_wagp_glm_grid@model,
                                function(x)
                                  sapply(x@models, function(y)
                                    h2o.performance(h2o.predict(y, adult_2013_test)[, 3L],
                                                    actual_top2_wagp,
                                                    measure = "F1")@model$error))
top2_wagp_glm_grid_f1
top2_wagp_glm_grid_f1 == min(top2_wagp_glm_grid_f1)
top2_wagp_glm_best <- top2_wagp_glm_grid@model[[4L]]@models[[7L]]
prob_top2_wagp_glm_best <- h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L]
prob_top2_wagp_glm_best <- h2o.assign(prob_top2_wagp_glm_best,
                                      key = "prob_top2_wagp_glm_best")
rmLastValues()
f1_top2_wagp_glm_best <- h2o.performance(prob_top2_wagp_glm_best,
                                         actual_top2_wagp,
                                         measure = "F1")
plot(f1_top2_wagp_glm_best, type = "cutoffs", col = "blue")


plot(f1_top2_wagp_glm_best, type = "roc", col = "blue")


We can find the number of non-zero coefficients in our best logistic regression model to gain some understanding about its
overall complexity.

table(coef(top2_wagp_glm_best@model) != 0)
nzcoefs <- coef(top2_wagp_glm_best@model)
nzcoefs <- names(nzcoefs)[nzcoefs != 0]
nzcoefs <- unique(sub("\\..*$", "", nzcoefs))
setdiff(c("RELP_SCHL", addpredset), nzcoefs) # all preds had non-zero coefs

Fit a gradient boosting machine binomial regression model


Given that not all relationships can be reduced to a linear combination of terms, we can compare the GLM results with those
of a gradient (tree) boosting machine. As with the final GLM exploration, we will fit a grid of GBM models by varying the
number of trees and the shrinkage rate and select the best model with respect to the test data set.

top2_wagp_gbm_grid <- h2o.gbm(x = c("RELP", "SCHL", addpredset),
                              y = "TOP2_WAGP",
                              data = adult_2013_train,
                              key = "top2_wagp_gbm_grid",
                              distribution = "multinomial",
                              n.trees = c(10, 20, 40),
                              shrinkage = c(0.05, 0.1, 0.2),
                              validation = adult_2013_test,
                              importance = TRUE)
top2_wagp_gbm_grid
class(top2_wagp_gbm_grid)
slotNames(top2_wagp_gbm_grid)
class(top2_wagp_gbm_grid@model)
length(top2_wagp_gbm_grid@model)
class(top2_wagp_gbm_grid@model[[1L]])
top2_wagp_gbm_best <- top2_wagp_gbm_grid@model[[1L]]
top2_wagp_gbm_best
h2o.performance(h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_gbm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error

Fit a random forest classifier


We will fit a single random forest model with 200 trees of maximum depth 10.

top2_wagp_forest <- h2o.randomForest(x = c("RELP", "SCHL", addpredset),
                                     y = "TOP2_WAGP",
                                     data = adult_2013_train,
                                     key = "top2_wagp_forest",
                                     classification = TRUE,
                                     depth = 10,
                                     ntree = 200,
                                     validation = adult_2013_test,
                                     seed = 8675309,
                                     type = "BigData")
top2_wagp_forest

Fit a deep learning classifier


Lastly we will fit a single Deep Learning model with default settings (more details about Deep Learning follow later) and
compare the errors across the four model types.

top2_wagp_dl <- h2o.deeplearning(x = c("RELP", "SCHL", addpredset),
                                 y = "TOP2_WAGP",
                                 data = adult_2013_train,
                                 key = "top2_wagp_dl",
                                 classification = TRUE,
                                 validation = adult_2013_test)
top2_wagp_dl

h2o.performance(h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_gbm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_forest, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_dl, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error


Classification and Regression with H2O Deep Learning


This tutorial shows how a Deep Learning model can be used to do supervised classification and regression. This file is both
valid R and markdown code.

R Documentation
The h2o.deeplearning function fits H2O's Deep Learning models from within R.

library(h2o)
args(h2o.deeplearning)

The R documentation (man page) for H2O's Deep Learning can be opened from within R using the help or ? functions:

help(h2o.deeplearning)

As you can see, there are a lot of parameters! Luckily, as you'll see later, you only need to know a few to get the most out of
Deep Learning. More information can be found in the H2O Deep Learning booklet and in our slides.
We can run the example from the man page using the example function:

example(h2o.deeplearning)

And run a longer demonstration from the h2o package using the demo function (requires an internet connection):

demo(h2o.deeplearning)

Start H2O and load the MNIST data


For the rest of this tutorial, we will use the well-known MNIST dataset of hand-written digits, where each row contains the
28^2=784 raw gray-scale pixel values from 0 to 255 of the digitized digits (0 to 9).
Initialize the H2O server and import the MNIST training/testing datasets.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
TRAIN = "train.csv.gz"
TEST = "test.csv.gz"
train_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ',', key = 'train.hex')
test_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TEST), header = F, sep = ',', key = 'test.hex')

While H2O Deep Learning has many parameters, it was designed to be just as easy to use as the other supervised training
methods in H2O. Automatic data standardization, handling of categorical variables and missing values, and per-neuron
adaptive learning rates reduce the number of parameters the user has to specify. Often, it's just the number and sizes of
hidden layers, the number of epochs, the activation function, and maybe some regularization techniques.

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                            hidden=c(50,50), epochs=0.1, activation="Tanh")

Let's look at the model summary, and the confusion matrix and classification error (on the validation set, since it was
provided) in particular:

dlmodel
dlmodel@model$confusion
dlmodel@model$valid_class_error

To confirm that the reported confusion matrix on the validation set (here, the test set) was correct, we make a prediction on
the test set and compare the confusion matrices explicitly:

pred_labels <- h2o.predict(dlmodel, test_hex)[,1]


actual_labels <- test_hex[,785]
cm <- h2o.confusionMatrix(pred_labels, actual_labels)
cm
dlmodel@model$confusion
dlmodel@model$confusion == cm

To see the model parameters again:

dlmodel@model$params

Hyper-parameter Tuning with Grid Search


Since there are a lot of parameters that can impact model accuracy, hyper-parameter tuning is especially important for
Deep Learning:

grid_search <- h2o.deeplearning(x=c(1:784), y=785, data=train_hex, validation=test_hex,
                                hidden=list(c(10,10),c(20,20)), epochs=0.1,
                                activation=c("Tanh", "Rectifier"), l1=c(0,1e-5))

Let's see which model had the lowest validation error (grid search automatically sorts the models by validation error):

grid_search
best_model <- grid_search@model[[1]]
best_model
best_params <- best_model@model$params
best_params$activation
best_params$hidden
best_params$l1

Hyper-parameter Tuning with Random Search


Multi-dimensional hyper-parameter optimization (more than 4 parameters) can be more efficient with random parameter
search than with a Cartesian product of pre-given values (grid search). For a random parameter search, we do a loop over
models with parameters drawn uniformly from a given range, thus sampling the high-dimensional space uniformly. This
code is for demonstration purposes only:

models <- c()
for (i in 1:10) {
  rand_activation <- c("TanhWithDropout", "RectifierWithDropout")[sample(1:2,1)]
  rand_numlayers <- sample(2:5,1)
  rand_hidden <- c(sample(10:50,rand_numlayers,T))
  rand_l1 <- runif(1, 0, 1e-3)
  rand_l2 <- runif(1, 0, 1e-3)
  rand_dropout <- c(runif(rand_numlayers, 0, 0.6))
  rand_input_dropout <- runif(1, 0, 0.5)
  dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex, epochs=0.1,
                              activation=rand_activation, hidden=rand_hidden, l1=rand_l1, l2=rand_l2,
                              input_dropout_ratio=rand_input_dropout, hidden_dropout_ratios=rand_dropout)
  models <- c(models, dlmodel)
}

We can then find the model with the lowest validation error:

best_err <- best_model@model$valid_class_error #best model from grid search above


for (i in 1:length(models)) {
err <- models[[i]]@model$valid_class_error
if (err < best_err) {
best_err <- err
best_model <- models[[i]]
}
}
best_model
best_params <- best_model@model$params
best_params$activation
best_params$hidden
best_params$l1
best_params$l2
best_params$input_dropout_ratio
best_params$hidden_dropout_ratios

Checkpointing
Let's continue training the best model, for 2 more epochs. Note that since many parameters such as epochs, l1, l2,
max_w2, score_interval, train_samples_per_iteration, score_duty_cycle, classification_stop, regression_stop,
variable_importances, force_load_balance can be modified between checkpoint restarts, it is best to specify as many

parameters as possible explicitly.

dlmodel_continued <- h2o.deeplearning(x=c(1:784), y=785, data=train_hex, validation=test_hex,
                                      checkpoint = best_model, l1=best_params$l1, l2=best_params$l2, epochs=0.5)
dlmodel_continued@model$valid_class_error

Once we are satisfied with the results, we can save the model to disk:

h2o.saveModel(dlmodel_continued, dir="/tmp", name="mybest_mnist_model", force=T)

It can be loaded later with

dlmodel_loaded <- h2o.loadModel(h2oServer, "/tmp/mybest_mnist_model")

Of course, you can continue training this model as well (with the same x , y , data , validation )

dlmodel_continued_again <- h2o.deeplearning(x=c(1:784), y=785, data=train_hex, validation=test_hex,
                                            checkpoint = dlmodel_loaded, l1=best_params$l1, epochs=0.5)
dlmodel_continued_again@model$valid_class_error


World-record results on MNIST


With the parameters shown below, H2O Deep Learning matched the current world record of 0.83% test set error for models
without pre-processing, unsupervised learning, convolutional layers or data augmentation after running for 8 hours on 10
nodes:

# > record_model <- h2o.deeplearning(x = 1:784, y = 785, data = train_hex, validation = test_hex,
# activation = "RectifierWithDropout", hidden = c(1024,1024,2048),
# epochs = 8000, l1 = 1e-5, input_dropout_ratio = 0.2,
# train_samples_per_iteration = -1, classification_stop = -1)
# > record_model@model$confusion
# Predicted
# Actual 0 1 2 3 4 5 6 7 8 9 Error
# 0 974 1 1 0 0 0 2 1 1 0 0.00612
# 1 0 1135 0 1 0 0 0 0 0 0 0.00088
# 2 0 0 1028 0 1 0 0 3 0 0 0.00388
# 3 0 0 1 1003 0 0 0 3 2 1 0.00693
# 4 0 0 1 0 971 0 4 0 0 6 0.01120
# 5 2 0 0 5 0 882 1 1 1 0 0.01121
# 6 2 3 0 1 1 2 949 0 0 0 0.00939
# 7 1 2 6 0 0 0 0 1019 0 0 0.00875
# 8 1 0 1 3 0 4 0 2 960 3 0.01437
# 9 1 2 0 0 4 3 0 2 0 997 0.01189
# Totals 981 1142 1038 1013 977 891 956 1031 964 1007 0.00830

Note: results are not 100% reproducible and also depend on the number of nodes and cores, due to thread race conditions and
model averaging effects, which can by themselves be useful to avoid overfitting. Often, it can help to run the initial
convergence with more train_samples_per_iteration (automatic values of -2 or -1 are good choices), and then continue
from a checkpoint with a smaller number (automatic value of 0 or a number less than the total number of training rows are
good choices here).
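
As an illustration of this two-phase strategy, the following sketch first trains with automatic per-iteration sampling and then continues from the checkpoint with a smaller value (the specific numbers are illustrative, not tuned recommendations):

phase1 <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex, hidden=c(100,100),
                           epochs=1, train_samples_per_iteration=-2)     # automatic sampling per iteration
phase2 <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex, hidden=c(100,100),
                           epochs=2, checkpoint=phase1, train_samples_per_iteration=10000)   # smaller value after restart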

Regression
If the response column is numeric and non-integer, regression is enabled by default. For integer response columns, as in
this case, you have to specify classification=FALSE to force regression. In that case, there will be only 1 output neuron,
and the loss function and error metric will automatically switch to the MSE (mean square error).

regression_model <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                                     hidden=c(50,50), epochs=0.1, activation="Rectifier",
                                     classification=FALSE)

Let's look at the model summary (i.e., the training and validation set MSE values):

regression_model
regression_model@model$train_sqr_error
regression_model@model$valid_sqr_error

We can confirm these numbers explicitly:

mean((h2o.predict(regression_model, train_hex)-train_hex[,785])^2)
mean((h2o.predict(regression_model, test_hex)-test_hex[,785])^2)
h2o.mse(h2o.predict(regression_model, test_hex), test_hex[,785])

The difference in the training MSEs is because only a subset of the training set was used for scoring during model building
(10k rows), see section "Scoring on Training/Validation Sets During Training" below.

Cross-Validation
For N-fold cross-validation, specify nfolds instead of a validation frame, and N+1 models will be built: 1 model on the full
training data, and N models, each with a different 1/N-th of the data held out. Those N models then score on their held-out
data, and the stitched-together predictions are scored to get the cross-validation error.

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex,
                            hidden=c(50,50), epochs=0.1, activation="Tanh",
                            nfolds=5)
dlmodel
dlmodel@model$valid_class_error

Note: There is a RUnit test to demonstrate the correctness of N-fold cross-validation results reported by the model.
The N individual cross-validation models can be accessed as well:

dlmodel@xval
sapply(dlmodel@xval, function(x) (x@model$valid_class_error))

Cross-Validation is especially useful for hyperparameter optimizations such as grid searches.

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex,
                            hidden=c(10,10), epochs=0.1, activation=c("Tanh","Rectifier"),
                            nfolds=2)
dlmodel

Variable Importances
Variable importances for Neural Network models are notoriously difficult to compute, and there are many pitfalls. H2O Deep
Learning has implemented the method of Gedeon, and returns relative variable importances in descending order of
importance.

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex,
                            hidden=c(10,10), epochs=0.5, activation="Tanh",
                            variable_importances=TRUE)
dlmodel@model$varimp[1:10]

Adaptive Learning Rate


By default, H2O Deep Learning uses an adaptive learning rate (ADADELTA) for its stochastic gradient descent
optimization. There are only two tuning parameters for this method: rho and epsilon , which balance the global and local
search efficiencies. rho is the similarity to prior weight updates (similar to momentum), and epsilon is a parameter that
prevents the optimization from getting stuck in local optima. Defaults are rho=0.99 and epsilon=1e-8 . For cases where
convergence speed is very important, it might make sense to perform a few grid searches to optimize these two parameters
(e.g., with rho=c(0.9,0.95,0.99,0.999) and epsilon=c(1e-10,1e-8,1e-6,1e-4) ). Of course, as always with grid searches,
caution has to be applied when extrapolating grid search results to a different parameter regime (e.g., for more epochs or
different layer topologies or activation functions, etc.).

dlmodel <- h2o.deeplearning(x=seq(1,784,10), y=785, data=train_hex, hidden=c(10,10), epochs=0.5,
                            rho=c(0.9,0.95,0.99,0.999), epsilon=c(1e-10,1e-8,1e-6,1e-4))
dlmodel


If adaptive_rate is disabled, several manual learning rate parameters become important: rate , rate_annealing ,
rate_decay , momentum_start , momentum_ramp , momentum_stable and nesterov_accelerated_gradient , the discussion of which
we leave to the H2O Deep Learning booklet.
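
For illustration only, a manual-rate configuration might look like the sketch below (the numeric values are placeholders, not recommendations):

dlmodel_manual <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                                   hidden=c(50,50), epochs=0.1,
                                   adaptive_rate=FALSE,                    # switch off ADADELTA
                                   rate=0.01, rate_annealing=1e-6, rate_decay=1,
                                   momentum_start=0.5, momentum_ramp=1e5, momentum_stable=0.99,
                                   nesterov_accelerated_gradient=TRUE)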

H2O Deep Learning Tips & Tricks


Activation Functions
While sigmoids have been used historically for neural networks, H2O Deep Learning implements Tanh , a scaled and
shifted variant of the sigmoid which is symmetric around 0. Since its output values are bounded by -1..1, the stability of the
neural network is rarely endangered. However, the derivative of the tanh function is always non-zero, so back-propagation
(training) of the weights is more computationally expensive than for rectified linear units ( Rectifier ), which use max(0,x)
and have a vanishing gradient for x<=0 . This leads to much faster training speed for large networks and is often the fastest path
to accuracy on larger problems. In case you encounter instabilities with the Rectifier (in which case model building is
automatically aborted), try one of these values to re-scale the weights: max_w2=c(1,10,100) . The Maxout activation function
is the least computationally efficient, and is not well field-tested at this point.

Generalization Techniques
L1 and L2 penalties can be applied by specifying the l1 and l2 parameters. Intuition: L1 lets only strong weights survive
(constant pulling force towards zero), while L2 prevents any single weight from getting too big. Dropout has recently been
introduced as a powerful generalization technique, and is available as a parameter per layer, including the input layer.
input_dropout_ratio controls the amount of input layer neurons that are randomly dropped (set to zero), while
hidden_dropout_ratios are specified for each hidden layer. The former controls overfitting with respect to the input data

(useful for high-dimensional noisy data such as MNIST), while the latter controls overfitting of the learned features. Note
that hidden_dropout_ratios require the activation function to end with ...WithDropout .
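
Putting these pieces together, a sketch of a regularized model could look like the following (the penalty strengths and dropout ratios are illustrative):

dlmodel_reg <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                                activation="RectifierWithDropout",    # ...WithDropout is required for hidden_dropout_ratios
                                hidden=c(100,100), epochs=0.1,
                                l1=1e-5, l2=1e-5,                      # weak L1/L2 penalties
                                input_dropout_ratio=0.2,               # drop 20% of input neurons
                                hidden_dropout_ratios=c(0.5,0.5))      # one ratio per hidden layer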

Early stopping and optimizing for lowest validation error


By default, override_with_best_model is set to TRUE and the model returned after training for the specified number of
epochs is the model that has the best training set error, or, if a validation set is provided, the lowest validation set error. This
is equivalent to early stopping, except that the determination to stop is made in hindsight, similar to a full grid search over
the number of epochs at the granularity of the scoring intervals. Note that for N-fold cross-validation,
override_with_best_model is disabled to give fair results (all N cross-validation models must run to completion to avoid

overfitting on the validation set). Also note that the training or validation set errors can be based on a subset of the training
or validation data, depending on the values for score_validation_samples or score_training_samples , see below. For actual
early stopping on a predefined error rate on the training data (accuracy for classification or MSE for regression), specify
classification_stop or regression_stop .
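
For example, to stop training as soon as the (sampled) training set classification error drops below 2% (an arbitrary threshold chosen for illustration):

dlmodel_stop <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                                 hidden=c(50,50), epochs=10,
                                 classification_stop=0.02)   # stop early once training classification error <= 2%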

Training Samples per (MapReduce) Iteration


This parameter is explained in Section 2.2.4 of the H2O Deep Learning booklet, and becomes important in multi-node
operation.

Categorical Data
For categorical data, a feature with K factor levels is automatically one-hot encoded (horizontalized) into K-1 input neurons.
Hence, the input neuron layer can grow substantially for datasets with high factor counts. In these cases, it might make
sense to reduce the number of hidden neurons in the first hidden layer, such that large numbers of factor levels can be
handled. In the limit of 1 neuron in the first hidden layer, the resulting model is similar to logistic regression with stochastic
gradient descent, except that for classification problems, there's still a softmax output layer, and that the activation function
is not necessarily a sigmoid ( Tanh ). If variable importances are computed, it is recommended to turn on
use_all_factor_levels (K input neurons for K levels). The experimental option max_categorical_features uses feature
hashing to reduce the number of input neurons via the hash trick at the expense of hash collisions and reduced accuracy.
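
As a sketch (assuming a hypothetical frame categorical_hex with categorical predictors in columns 1:10 and the response in column 11; the cap of 1000 hashed features is illustrative):

dlmodel_cat <- h2o.deeplearning(x=1:10, y=11, data=categorical_hex,
                                hidden=c(20,20), epochs=1,
                                variable_importances=TRUE,
                                use_all_factor_levels=TRUE,       # K input neurons for K factor levels
                                max_categorical_features=1000)    # experimental: hash down to at most 1000 input features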

Missing Values
H2O Deep Learning automatically does mean imputation for missing values during training (leaving the input layer
activation at 0 after standardizing the values). For testing, missing test set values are also treated the same way by default.
See the h2o.impute function to do your own mean imputation.

Reproducibility
Every run of DeepLearning produces different results since multithreading is done via Hogwild!, which benefits from intentional
lock-free race conditions between threads. To get reproducible results at the expense of speed for small datasets, set
reproducible=T and specify a seed. This will not work for big data for technical reasons, and is probably also not desired
because of the significant slowdown.
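
For example (intended for small data only; the seed value is arbitrary):

dlmodel_repro <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                                  hidden=c(10,10), epochs=0.1,
                                  reproducible=TRUE, seed=1234)   # deterministic, but slower (effectively single-threaded)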

Scoring on Training/Validation Sets During Training


The training and/or validation set errors can be based on a subset of the training or validation data, depending on the
values for score_validation_samples (defaults to 0: all) or score_training_samples (defaults to 10,000 rows, since the
training error is only used for early stopping and monitoring). For large datasets, Deep Learning can automatically sample
the validation set to avoid spending too much time in scoring during training, especially since scoring results are not
currently displayed in the model returned to R. For example:

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                            hidden=c(50,50), epochs=0.1, activation="Tanh",
                            score_training_samples=1000, score_validation_samples=1000)

Note that the default value of score_duty_cycle=0.1 limits the amount of time spent in scoring to 10%, so a large number of
scoring samples won't slow down overall training progress too much, but it will always score once after the first MapReduce
iteration, and once at the end of training.
Stratified sampling of the validation dataset can help with scoring on datasets with class imbalance. Note that this option
also requires balance_classes to be enabled (used to over/under-sample the training dataset, based on the max. relative
size of the resulting training dataset, max_after_balance_size ):

dlmodel <- h2o.deeplearning(x=1:784, y=785, data=train_hex, validation=test_hex,
                            hidden=c(50,50), epochs=0.1, activation="Tanh",
                            score_training_samples=1000, score_validation_samples=1000,
                            balance_classes=TRUE,
                            score_validation_sampling="Stratified")

Benchmark against nnet


We compare H2O Deep Learning against the nnet package for a regression problem (MNIST digit 0-9). For performance
reasons, we sample the predictors and use a small single hidden layer.

if (! "nnet" %in% rownames(installed.packages())) { install.packages("nnet") }


library("nnet")
train.R <- as.data.frame(train_hex)
train.R <- train.R[,c(seq(1, 784, by = 20), 785)]
test.R <- as.data.frame(test_hex)
nn <- nnet(x=train.R[,-ncol(train.R)], y=train.R$C785, size=10, linout=T)
nn_pred <- predict(nn, test.R)
nn_mse <- mean((nn_pred-test.R[,785])^2)


nn_mse
h2o_nn <- h2o.deeplearning(x=seq(1,784,by=20),y=785,data=train_hex,hidden=10,classification=F,activation="Tanh")
h2o_pred <- h2o.predict(h2o_nn, test_hex)
h2o_mse <- mean((as.data.frame(h2o_pred)-test.R[,785])^2)
h2o_mse

More information can be found in the H2O Deep Learning booklet and in our
slides.


Unsupervised Learning
Arno Candel & Spencer Aiello: K-Means Clustering
Arno Candel: Dimensionality Reduction on MNIST
Arno Candel: Anomaly Detection on MNIST with H2O Deep Learning


Unsupervised Learning and Clustering With H2O KMeans


This tutorial shows how a KMeans model is trained. This file is both valid R and markdown code. We will use a variety of
well-known datasets that are used in published papers, which evaluate various KMeans implementations.

Start H2O and build a KMeans model on iris


Initialize the H2O server and import the datasets we need for this session.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
datadir <- "/data"
homedir <- file.path(datadir, "h2o-training", "clustering")
iris.h2o <- as.h2o(h2oServer, iris)

Our first KMeans model


It's easy to run KMeans; it works just like the kmeans function available in the stats package. We'll leave out the Species column
and cluster on the iris flower attributes.

km.model <- h2o.kmeans(data = iris.h2o, centers = 5, cols = 1:4, init="furthest")

Let's look at the model summary:

km.model
km.model@model$centers # The centers for each cluster
km.model@model$tot.withinss # total within cluster sum of squares
km.model@model$cluster # cluster assignments per observation

To see the model parameters that were used, access the model@model$params field:

km.model@model$params

You can get the R documentation help here:

?h2o.kmeans

Use the Gap Statistic (Beta) To Find the Optimal Number of Clusters
This is essentially a grid search over KMeans.
You can get the R documentation help here:

?h2o.gapStatistic

The idea is that, for each 'k', generate WCSS from a reference distribution and examine the gap between the expected
WCSS and the observed WCSS. To obtain the W_k from the reference distribution, 'B' Monte Carlo replicates are drawn
from the reference distribution. For each replicate, a KMeans is constructed and the WCSS reported back.

gap_stat <- h2o.gapStatistic(data = iris.h2o, K = 10, B = 100, boot_frac = .1, cols=1:4)

Let's take a look at the output. The default output display will show the number of KMeans models that were run and the
optimal value of k:

gap_stat

We can also run summary on the gap_stat model:

summary(gap_stat)

We can also plot our gap_stat model:

plot(gap_stat)

Comparison against other KMeans implementations.


Let's use the Census 1990 dataset, which has 2.5 million data points with 68 integer features.

# census1990 <- "Census1990.csv.gz"


# census.1990 <- h2o.importFile(h2oServer, path = file.path(homedir,census1990), header = F, sep = ',', key = 'census.1990.hex')
# dim(census.1990)
# km.census <- h2o.kmeans(data = census.1990, centers = 12, init="furthest") # NOT RUN: Too long on VM
# km.census@model$tot.withinss

We can compare the result with the published result from Fast and Accurate KMeans on Large Datasets where the cost for
k = 12 and ~2GB of RAM was approximately 3.50E+18. This paper implements a streaming KMeans, so of course
accuracy in the streaming case will not be as good as a batch job, but results are comparable within a few orders of
magnitude. H2O gives the ability to work on datasets that don't fit in a single box's RAM without having to stream the data
from cold storage: simply use distributed H2O.
We can also compare with StreamKM++: A Clustering Algorithm for Data Streams. For various k, we can compare our
implementation, but we only do k = 30 here.

# km.census <- h2o.kmeans(data = census.1990, centers = 30, init="furthest") # NOT RUN: Too long on VM
# km.census@model$tot.withinss # NOT RUN: Too long on VM

We can also compare with the kmeans package:

# census.1990.r <- read.csv(file.path(homedir,census1990))


# km.census.r <- kmeans(census.1990.r, centers = 24) # NOT RUN: Quick-TRANSfer stage steps exceeded maximum (= 122914250)
# km.census.r$tot.withinss

Now let's compare on the big BigCross dataset, which has 11.6 million data points with 57 integer features.

# bigcross <- "BigCross.data.gz"


# big.cross <- h2o.importFile(h2oServer, path = file.path(homedir,bigcross), header = F, sep = ',', key = 'big.cross.hex')
# dim(big.cross) # NOT RUN: Too long on VM
# km.bigcross <- h2o.kmeans(data = big.cross, centers = 24, init="furthest") # NOT RUN: Too long on VM
# km.bigcross@model$tot.withinss # NOT RUN: Too long on VM

We can compare the result with the published result from Fast and Accurate KMeans on Large Datasets, where the cost for
k = 24 and ~2GB of RAM was approximately 1.50E+14.
We can also compare with StreamKM++: A Clustering Algorithm for Data Streams. For various k, we can compare our
implementation, but we only do k = 30 here.

# km.bigcross <- h2o.kmeans(data = big.cross, centers = 30, init="furthest") # NOT RUN: Too long on VM
# km.bigcross@model$tot.withinss # NOT RUN: Too long on VM


Dimensionality Reduction of MNIST


This tutorial shows how to reduce the dimensionality of a dataset with H2O. We will use both PCA and Deep Learning. This
file is both valid R and markdown code. We use the well-known MNIST dataset of hand-written digits, where each row
contains the 28^2=784 raw gray-scale pixel values from 0 to 255 of the digitized digits (0 to 9).

Start H2O and load the MNIST data


Initialize the H2O server and import the MNIST training/testing datasets.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
DATA = "train.csv.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,DATA), header = F, sep = ',', key = 'train.hex')

The data consists of 784 (=28^2) pixel values per row, with (gray-scale) values from 0 to 255. The last column is the
response (a label in 0,1,2,...,9).

predictors = c(1:784)
resp = 785

We do unsupervised training, so we can drop the response column.

data_hex <- data_hex[,-resp]

PCA - Principal Components Analysis


Let's use PCA to compute the principal components of the MNIST data, and plot the standard deviations of the principal
components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix).

pca_model <- h2o.prcomp(data_hex)


plot(pca_model@model$sdev)


We see that the first 50 or 100 principal components cover the majority of the variance of this dataset.
To reduce the dimensionality of MNIST to its 50 principal components, we use the h2o.predict() function with an extra
argument num_pc :

features_pca <- h2o.predict(pca_model, data_hex, num_pc=50)


summary(features_pca)

Deep Learning Autoencoder


ae_model <- h2o.deeplearning(x=predictors,
y=42, #ignored (pick any non-constant predictor)
data_hex,
activation="Tanh",
autoencoder=T,
hidden=c(100,50,100),
epochs=1,
ignore_const_cols = F)

We can now convert the data with the autoencoder model to 50-dimensional space (second hidden layer)

features_ae <- h2o.deepfeatures(data_hex, ae_model, layer=2)


summary(features_ae)


To get the full reconstruction from the output layer of the autoencoder, use h2o.predict() as follows

data_reconstr <- h2o.predict(ae_model, data_hex)


summary(data_reconstr)


Anomaly Detection on MNIST with H2O Deep Learning


This tutorial shows how a Deep Learning Auto-Encoder model can be used to find outliers in a dataset. This file is both
valid R and markdown code.
Consider the following three-layer neural network with one hidden layer and the same number of input neurons (features)
as output neurons. The loss function is the MSE between the input and the output. Hence, the network is forced to learn the
identity via a nonlinear, reduced representation of the original data. Such an algorithm is called a deep autoencoder; these
models have been used extensively for unsupervised, layer-wise pretraining of supervised deep learning tasks, but here we
consider the autoencoder's application for discovering anomalies in data.

We use the well-known MNIST dataset of hand-written digits, where each row contains the 28^2=784 raw gray-scale pixel
values from 0 to 255 of the digitized digits (0 to 9).

Start H2O and load the MNIST data


Initialize the H2O server and import the MNIST training/testing datasets.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
TRAIN = "train.csv.gz"
TEST = "test.csv.gz"
train_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ',', key = 'train.hex')
test_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TEST), header = F, sep = ',', key = 'test.hex')

The data consists of 784 (=28^2) pixel values per row, with (gray-scale) values from 0 to 255. The last column is the
response (a label in 0,1,2,...,9).

predictors = c(1:784)
resp = 785

We do unsupervised training, so we can drop the response column.



train_hex <- train_hex[,-resp]


test_hex <- test_hex[,-resp]

Finding outliers - ugly hand-written digits


We train a Deep Learning Auto-Encoder to learn a compressed (low-dimensional) non-linear representation of the dataset,
hence learning the intrinsic structure of the training dataset. The auto-encoder model is then used to transform all test set
images to their reconstructed images, by passing through the lower-dimensional neural network. We then find outliers in a
test dataset by comparing the reconstruction of each scanned digit with its original pixel values. The idea is that a high
reconstruction error of a digit indicates that the test set point doesn't conform to the structure of the training data and can
hence be called an outlier.

1. Learn what's normal from the training data


Train unsupervised Deep Learning autoencoder model on the training dataset. For simplicity, we train a model with 1
hidden layer of 50 Tanh neurons to create 50 non-linear features with which to reconstruct the original dataset. We learned
from the Dimensionality Reduction tutorial that 50 is a reasonable choice. For simplicity, we train the auto-encoder for only
1 epoch (one pass over the data). We explicitly include constant columns (all white background) for the visualization to be
easier.

ae_model <- h2o.deeplearning(x=predictors,
y=42, #response (ignored - pick any non-constant column)
data=train_hex,
activation="Tanh",
autoencoder=T,
hidden=c(50),
ignore_const_cols=F,
epochs=1)

Note that the response column is ignored (it is only required because of a shared DeepLearning code framework).

2. Find outliers in the test data


The Anomaly app computes the per-row reconstruction error for the test data set. It passes it through the autoencoder
model (built on the training data) and computes mean square error (MSE) for each row in the test set.

test_rec_error <- as.data.frame(h2o.anomaly(test_hex, ae_model))

In case you wanted to see the lower-dimensional features created by the auto-encoder deep learning model, here's a way
to extract them for a given dataset. This a non-linear dimensionality reduction, similar to PCA, but the values are capped by
the activation function (in this case, they range from -1...1)

test_features_deep <- h2o.deepfeatures(test_hex, ae_model, layer=1)


summary(test_features_deep)

3. Visualize the good, the bad and the ugly


We will need a helper function for plotting handwritten digits (adapted from http://www.r-bloggers.com/the-essence-of-a-handwritten-digit/). Don't worry if you don't follow this code...

plotDigit <- function(mydata, rec_error) {


len<-nrow(mydata)
N<-ceiling(sqrt(len))
op <- par(mfrow=c(N,N),pty='s',mar=c(1,1,1,1),xaxt='n',yaxt='n')
for (i in 1:nrow(mydata)) {
colors<-c('white','black')
cus_col<-colorRampPalette(colors=colors)
z<-array(mydata[i,],dim=c(28,28))
z<-z[,28:1]
image(1:28,1:28,z,main=paste0("rec_error: ", round(rec_error[i],4)),col=cus_col(256))
}
on.exit(par(op))
}
plotDigits <- function(data, rec_error, rows) {
row_idx <- order(rec_error[,1],decreasing=F)[rows]
my_rec_error <- rec_error[row_idx,]
my_data <- as.matrix(as.data.frame(data[row_idx,]))
plotDigit(my_data, my_rec_error)
}

Let's look at the test set points with low/median/high reconstruction errors. We will now visualize the original test set points
and their reconstructions obtained by propagating them through the narrow neural net.

test_recon <- h2o.predict(ae_model, test_hex)


summary(test_recon)

The good
Let's plot the 25 digits with lowest reconstruction error. First we plot the reconstruction, then the original scanned images.

plotDigits(test_recon, test_rec_error, c(1:25))


plotDigits(test_hex, test_rec_error, c(1:25))

Clearly, a well-written digit 1 appears in both the training and testing set, and is easy to reconstruct by the autoencoder with
minimal reconstruction error. Nothing is as easy as a straight line.

The bad
Now let's look at the 25 digits with median reconstruction error.

plotDigits(test_recon, test_rec_error, c(4988:5012))


plotDigits(test_hex, test_rec_error, c(4988:5012))


These test set digits look "normal" - it is plausible that they resemble digits from the training data to a large extent, but they
do have some particularities that cause some reconstruction error.

The ugly
And here are the biggest outliers - The 25 digits with highest reconstruction error!

plotDigits(test_recon, test_rec_error, c(9976:10000))


plotDigits(test_hex, test_rec_error, c(9976:10000))

Now here are some pretty ugly digits that are plausibly not commonly found in the training data - some are even hard to
classify by humans.

Voila!
We were able to find outliers with H2O Deep Learning Auto-Encoder models. We would love to hear your usecase
for Anomaly detection.
Note: Every run of DeepLearning produces different results since we use Hogwild! parallelization with intentional race
conditions between threads. To get reproducible results at the expense of speed for small datasets, set reproducible=T and
specify a seed.

More information can be found in the H2O Deep Learning booklet and in our
slides.


Advanced Topics
Arno Candel: Multi-model Parameter Tuning for Higgs Dataset
Arno Candel: Categorical Feature Engineering for Adult dataset
Arno Candel: Other Useful Tools


Higgs Particle Discovery with H2O Deep Learning


This tutorial shows how H2O can be used to classify particle detector events into Higgs bosons vs background noise. This
file is both valid R and markdown code.

HIGGS dataset and Deep Learning


We use the UCI HIGGS dataset with 11 million events and 28 features (21 low-level features and 7 human-created non-linear derived features). In this tutorial, we show that (only) Deep Learning can automatically generate these or similar high-level features on its own and reach the highest accuracies from just the low-level features alone, outperforming traditional
classifiers. This is in accordance with a recently published Nature paper on using Deep Learning for Higgs particle
detection. Remarkably, Deep Learning also won the Higgs Kaggle challenge on the full set of features.

Start H2O, import the HIGGS data, and prepare train/validation/test splits
Initialize the H2O server (enable all cores)

library(h2o)
h2oServer <- h2o.init(nthreads=-1)

Import the data: For simplicity, we use a reduced dataset containing the first 100k rows.

homedir <- "/data/h2o-training/higgs/"


TRAIN = "higgs.100k.csv.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ',', key = 'data_hex')

For small datasets, it can help to rebalance the dataset into more chunks to keep all cores busy

data_hex <- h2o.rebalance(data_hex, chunks=64, key='data_hex.rebalanced')

Prepare train/validation/test splits: We split the dataset randomly into 3 pieces. Grid search for hyperparameter tuning and
model selection will be done on the training and validation sets, and final model testing is done on the test set. We also
assign the resulting frames to meaningful names in the H2O key-value store for later use, and clean up all temporaries at
the end.

random <- h2o.runif(data_hex, seed = 123456789)


train_hex <- h2o.assign(data_hex[random < .8,], "train_hex")
valid_hex <- h2o.assign(data_hex[random >= .8 & random < .9,], "valid_hex")
test_hex <- h2o.assign(data_hex[random >= .9,], "test_hex")
h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))

The first column is the response label (background:0 or higgs:1). Of the 28 numerical features, the first 21 are low-level
detector features and the last 7 are high-level humanly derived features (physics formulae).

response = 1
low_level_predictors = c(2:22)
low_and_high_level_predictors = c(2:29)

Establishing the baseline performance reference with several H2O classifiers


To get a feel for the performance of different classifiers on this dataset, we build a variety of different H2O models
(Generalized Linear Model, Random Forest, Gradient Boosted Machines and Deep Learning). We would like to use grid search for hyper-parameter tuning with N-fold cross-validation, and we want to do this twice: once using just the low-level
features, and once using both low- and high-level features.
First, we source a few helper functions that allow us to quickly compare a multitude of binomial classification models, in
particular the h2o.fit() and h2o.leaderBoard() functions. Note that these specific functions require variable importances and
N-fold cross-validation to be enabled.

source("~/h2o-training/tutorials/advanced/binaryClassificationHelper.R.md")

The code below trains 60 models (2 loops, 5 classifiers with 2 grid search models each, each resulting in 1 full training and
2 cross-validation models). A leaderboard scoring the best models per h2o.fit() function is displayed.

N_FOLDS=2
for (preds in list(low_level_predictors, low_and_high_level_predictors)) {
data = list(x=preds, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
models <- c(
h2o.fit(h2o.glm, data,
list(family="binomial", variable_importances=T, lambda=c(1e-5,1e-4), use_all_factor_levels=T)),
h2o.fit(h2o.randomForest, data,
list(type="fast", importance=TRUE, ntree=c(5), depth=c(5,10))),
h2o.fit(h2o.randomForest, data,
list(type="BigData", importance=TRUE, ntree=c(5), depth=c(5,10))),
h2o.fit(h2o.gbm, data,
list(importance=TRUE, n.tree=c(10), interaction.depth=c(2,5))),
h2o.fit(h2o.deeplearning, data,
list(variable_importances=T, l1=c(1e-5), epochs=1, hidden=list(c(10,10,10), c(100,100))))
)
best_model <- h2o.leaderBoard(models, test_hex, response)
h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
}

The output contains a leaderboard (based on validation AUC) for the models using low-level features only:


# model_type train_auc validation_auc important_feat tuning_time_s


# DeepLearning_ba44837829dc8d1e h2o.deeplearning 0.6102486 0.6661170 C7,C3,C10,C11,C2 5.604354
# GBM_8aad39d45442ed2418646fac4 h2o.gbm 0.6393795 0.6483558 C7,C2,C5,C10,C11 3.379146
# DRF_b64f1aca48cfa9532f78408df h2o.randomForest 0.6358507 0.6439986 C7,C10,C2,C5,C11 4.423504
# SpeeDRF_a40eda3fe25b1271b6be8 h2o.randomForest 0.5977397 0.6000200 C11,C7,C5,C10,C14 3.381785
# GLMModel__9f5855ebfcb804a1664 h2o.glm 0.5907724 0.5893810 C5,C7,C14,C2,C10 1.422024

Note that training AUCs are based on cross-validation.


When using both low- and high-level features, the AUC values go up across the board:

# model_type train_auc validation_auc important_feat tuning_time_s


# DeepLearning_a7cb0d0cc6f6fb8 h2o.deeplearning 0.7280887 0.7586982 C27,C29,C28,C24,C7 6.663959
# GBM_bc87169c331b22e37339c643 h2o.gbm 0.7552951 0.7545594 C27,C29,C28,C7,C24 4.437348
# DRF_b3572d5d088a043addcb3684 h2o.randomForest 0.7467827 0.7534905 C27,C29,C28,C24,C7 6.636395
# SpeeDRF_ba048cb60c3aad567b17 h2o.randomForest 0.7274115 0.7301433 C27,C28,C29,C7,C24 5.516673
# GLMModel__b59d95ace34a2979da h2o.glm 0.6827518 0.6764049 C29,C28,C27,C7,C24 1.523096

Clearly, the high-level features add a lot of predictive power, but what if they are not easily available? On this sampled
dataset and with simple models from low-level features alone, Deep Learning seems to have an edge over the other
methods indicating that it is able to create useful high-level features on its own.
Note: Every run of DeepLearning produces different results since we use Hogwild! parallelization with intentional race
conditions between threads. To get reproducible results at the expense of speed for small datasets, set reproducible=T and
specify a seed.

Build an improved Deep Learning model on the low-level features alone


With slightly modified parameters, it is possible to build an even better Deep Learning model. We add dropout/L1/L2
regularization and increase the number of neurons and the number of hidden layers. Note that we don't use input layer
dropout, as there are only 21 features, all of which are assumed to be present and important for particle detection. We also
reduce the amount of dropout for derived features. Note that the importance of regularization typically goes down with
increasing dataset sizes, as overfitting is less an issue when the model is small compared to the data.

h2o.deeplearning(x=low_level_predictors, y=response, activation="RectifierWithDropout", data=train_hex,
                 validation=valid_hex, input_dropout_ratio=0, hidden_dropout_ratios=c(0.2,0.1,0.1,0),
                 l1=1e-5, l2=1e-5, epochs=20, hidden=c(200,200,200,200))

With this computationally slightly more expensive Deep Learning model, we achieve a nice boost over the simple models
above: AUC = 0.7245833 (on validation)

Voila!
We applied multi-model grid search with N-fold cross-validation on the real-world Higgs dataset, and demonstrated
the power of H2O Deep Learning in automatically creating non-linear derived features for highest predictive
accuracy!

Extension to the full dataset


Please note that this tutorial was on a small subsample (<1%) of the UCI HIGGS dataset, and results do not trivially
transfer from small samples to the full dataset. Previous results by H2O Deep Learning on the full dataset (training
on 10M rows, validation on 500k rows, testing on 500k rows) agree with a recently published Nature paper on
using Deep Learning for Higgs particle detection, where 5-layer H2O Deep Learning models have achieved a test
set AUC value of 0.869. We would love to hear about your best models or even ensembles!


More information can be found in the H2O Deep Learning booklet and in our
slides.


Feature Engineering on the Adult dataset


This tutorial shows feature engineering on the Adult dataset. This file is both valid R and markdown code.

Start H2O and load the Adult data


Initialize the H2O server and import the Adult dataset.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/adult/"
TRAIN = "adult.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ' ', key = 'data_hex')

We manually assign column names since they are missing in the original file.

colnames(data_hex) <- c("age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","income")


summary(data_hex)

We will try to predict whether income is <=50K or >50K .

summary(data_hex$income)
response = "income"
First, we source a few helper functions that allow us to quickly compare a multitude of binomial classification models, in
particular the h2o.fit() and h2o.leaderBoard() functions. Note that these specific functions require variable importances and
N-fold cross-validation to be enabled.

source("~/h2o-training/tutorials/advanced/binaryClassificationHelper.R.md")

We then add this simple helper function to split a frame into train/valid/test pieces, train a GLM and a GBM model with 2-fold cross-validation, and obtain the best model after printing a leaderboard. For more accurate results, you can increase the number of folds.

N_FOLDS = 2
h2o.trainModels <- function(frame) {
# split the data into train/valid/test
random <- h2o.runif(frame, seed = 123456789)
train_hex <- h2o.assign(frame[random < .8,], "train_hex")
valid_hex <- h2o.assign(frame[random >= .8 & random < .9,], "valid_hex")
test_hex <- h2o.assign(frame[random >= .9,], "test_hex")
predictors <- colnames(frame)[-match(response,colnames(frame))]
# multi-model comparison with N-fold cross-validation
data = list(x=predictors, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
models <- c(
h2o.fit(h2o.glm, data, glmparams),
h2o.fit(h2o.gbm, data, gbmparams)
)
best_model <- h2o.leaderBoard(models, test_hex, match(response,colnames(frame)))
h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
best_model
}

Baseline performance on original dataset



For simplicity, we use default parameters (no grid search parameter tuning) to establish baseline performance numbers for
this dataset.

glmparams <- list(family="binomial", variable_importances=T, use_all_factor_levels=T)


gbmparams <- list(importance=TRUE)
best_model <- h2o.trainModels(data_hex)

Both GLM and GBM do a great job on this dataset; we get validation AUC values above 90%: GLM: 0.9028564, GBM: 0.9009924.
According to GBM, the most important columns are marital-status, relationship, capital-gain, education-num, and age.

Feature engineering
The following section shows ways to create new derived features. We'll need this simple append function, as we're going to
add new columns to our dataset.

h2o.append <- function(frame, col) {


appended_frame <- h2o.assign(cbind(frame, col), "appended_frame")
appended_frame
}

1. Turn age into a factor


The feature age is an integer value, but GLM, for example, will have a tough time predicting income from age with a linear
relationship, while GBM should be able to carve out these non-linear dependencies by itself (but it might need more trees
and a deeper interaction depth than the default values).

data_hex <- h2o.append(data_hex, as.factor(data_hex$age))


colnames(data_hex)
summary(data_hex)
best_model <- h2o.trainModels(data_hex)

GLM clearly benefited from this. We see that ages 18,19 and 20 are among the most important predictors for income and
we get the following validation AUC values: GLM: 0.9066634 GBM: 0.9009924
For fun, let's look at the largest positively and negatively correlated coefficients:

head(sort(best_model@model$normalized_coefficients,decreasing=T),5)
head(sort(best_model@model$normalized_coefficients,decreasing=F),5)

2. Same for capital-gain/loss and work hours per week


data_hex <- h2o.append(data_hex, as.factor(data_hex$'hours-per-week'))
data_hex <- h2o.append(data_hex, as.factor(data_hex$'capital-gain'))
data_hex <- h2o.append(data_hex, as.factor(data_hex$'capital-loss'))
colnames(data_hex)
summary(data_hex)
best_model <- h2o.trainModels(data_hex)

With all these new factor levels as predictors, GLM now got a nice boost: GLM: 0.9285384 GBM: 0.9021225
Let's give GBM a shot at beating GLM by using better parameters:


gbmparams <- list(importance=TRUE, n.tree=50, interaction.depth=10)


best_model <- h2o.trainModels(data_hex)

Ok, now both algorithms reach similar validation AUC values: GLM: 0.9285384 GBM: 0.9286973

Replace money-related integer columns by their Log-Transform


data_hex$'capital-gain' <- log(1+data_hex$'capital-gain')
data_hex$'capital-loss' <- log(1+data_hex$'capital-loss')
data_hex
best_model <- h2o.trainModels(data_hex)

We see that the training AUC for GLM improves slightly, from 0.9269056070 to 0.9269586507 . Intuition: Money is often
distributed exponentially, and the log transform brings it back to a linear space. Note that the validation AUC drops, likely
due to small data statistical noise. We clearly got close to the limit of this dataset. Note that GBM didn't benefit from this
transform, it seems to be able to better split up the original integer space.

frame <- data_hex


random <- h2o.runif(frame, seed = 123456789)
train_hex <- h2o.assign(frame[random < .8,], "train_hex")
valid_hex <- h2o.assign(frame[random >= .8 & random < .9,], "valid_hex")
test_hex <- h2o.assign(frame[random >= .9,], "test_hex")
predictors <- colnames(frame)[-match(response,colnames(frame))]
# multi-model comparison with N-fold cross-validation
data = list(x=predictors, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
models <- c(
h2o.fit(h2o.deeplearning, data, list())
)
best_model <- h2o.leaderBoard(models, test_hex, match(response,colnames(frame)))


Other Useful Tools


Synthetic Data
Use h2o.createFrame to create synthetic random data in H2O. This method can also be used to quickly create very large datasets
for scaling tests. Note that there is no intrinsic structure in the data (it's either constant or random), so results from many
machine learning methods will not be very meaningful.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
myframe = h2o.createFrame(h2oServer, 'framekey', rows = 20, cols = 5,
seed = -12301283, randomize = TRUE, value = 0,
categorical_fraction = 0.8, factors = 10, real_range = 1,
integer_fraction = 0.2, integer_range = 10, missing_fraction = 0.2,
response_factors = 1)

We created a small random frame in H2O that contains missing values, categorical and numerical columns.

head(myframe,20)
# response C1 C2 C3 C4 C5
# 1 -0.4227451 4a41c a1290 3 326a0 82dce
# 2 0.4594655 0a79a a1290 -6 05f22 f4772
# 3 0.2784008 70d75 4c3a7 7 320a8 6b17b
# 4 -0.5674698 07067 6 320a8 a78d9
# 5 -0.7041911 24800 4c3a7 8 82dce
# 6 0.5853957 8e64f a1290 NA ea644
# 7 0.8540204 4 05f22 a6b1e
# 8 -0.1466706 4a41c 1 cef5a 89717
# 9 NA 0a79a c402d 3 070d5 a78d9
# 10 -0.7357408 4a41c a1290 9 e055d 64439
# 11 -0.2798403 4c3a7 3 f4772
# 12 0.3454386 24800 NA
# 13 NA 0a79a ca1de 9 070d5 a78d9
# 14 -0.6732674 8e64f 50b47 9 320a8 6b17b
# 15 NA cc3d5 ca1de 4 e055d f4772
# 16 NA cc3d5 07067 1 070d5 89717
# 17 0.8564855 4f8b9 a1290 -1 326a0 6b17b
# 18 -0.5555804 0a79a 50b47 NA 82dce
# 19 0.5639324 3 e055d
# 20 0.2511743 7 30c85 cae4c

We remove the response column and convert the integer column to a factor.

myframe <- myframe[,-1]


myframe[,3] <- as.factor(myframe$C3)
summary(myframe)
head(myframe, 20)

Interaction Features between Factors


Create pairwise interactions for 2 groups of columns, keep only up to 10 (most common) factors per interaction.

pairwise <- h2o.interaction(myframe, key = 'pairwise', factors = list(c(1,2),c(2,3,4)),
                            pairwise=TRUE, max_factors = 10, min_occurrence = 1)
head(pairwise, 20)
levels(pairwise[,2])


Create 5-th order interaction between the specified columns, and allow up to 10k resulting factors (per pair-wise
interaction).

higherorder <- h2o.interaction(myframe, key = 'higherorder', factors = c(1,2,3,4,5),
                               pairwise=FALSE, max_factors = 10000, min_occurrence = 1)
head(higherorder, 20)

Create a categorical variable out of the integer column via self-interaction, and keep at most 3 factors, and only if they
occur at least twice

summary(myframe$C3)
head(myframe$C3, 20)
trim_integer_levels <- h2o.interaction(myframe, key = 'trim_integers', factors = 3,
pairwise = FALSE, max_factors = 3, min_occurrence = 2)
head(trim_integer_levels, 20)

Append all interactions to the original frame and clean up temporaries

myframe <- cbind(myframe, pairwise, higherorder, trim_integer_levels)


myframe <- h2o.assign(myframe, 'final.key')
h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
myframe
head(myframe,20)
summary(myframe)
# > head(myframe,20)
# C1 C2 C3 C4 C5 C1_C2 C2_C3 C2_C4 C3_C4 C1_C2_C3_C4_C5 C3_C3
# 1 49ed9 d9ff0 3 c9523 00599 49ed9_d9ff0 d9ff0_3 d9ff0_c9523 other 49ed9_d9ff0_3_c9523_00599 3
# 2 e2271 d9ff0 -6 fe2d9 cb67d e2271_d9ff0 d9ff0_-6 d9ff0_fe2d9 -6_fe2d9 e2271_d9ff0_-6_fe2d9_cb67d other
# 3 408d2 6c5ce 7 28b4d 3e4cb 408d2_6c5ce other other other 408d2_6c5ce_7_28b4d_3e4cb other
# 4 ae93f 6 28b4d da0c6 other other other other NA_ae93f_6_28b4d_da0c6 other
# 5 722ea 6c5ce 8 00599 722ea_6c5ce other 6c5ce_NA 8_NA 722ea_6c5ce_8_NA_00599 other
# 6 5e310 d9ff0 NA 9dbca 5e310_d9ff0 d9ff0_NA other other 5e310_d9ff0_NA_9dbca_NA
# 7 4 fe2d9 8d2b5 NA_NA other NA_fe2d9 4_fe2d9 NA_NA_4_fe2d9_8d2b5 other
# 8 49ed9 1 87bef 92d9c 49ed9_NA NA_1 other 1_87bef 49ed9_NA_1_87bef_92d9c 1
# 9 e2271 14aa5 3 d77b0 da0c6 other 14aa5_3 14aa5_d77b0 3_d77b0 e2271_14aa5_3_d77b0_da0c6 3
# 10 49ed9 d9ff0 9 79727 b7a40 49ed9_d9ff0 other other other 49ed9_d9ff0_9_79727_b7a40 9
# 11 6c5ce 3 cb67d NA_6c5ce 6c5ce_3 6c5ce_NA 3_NA NA_6c5ce_3_NA_cb67d 3
# 12 722ea NA 722ea_NA NA_NA NA_NA NA_NA 722ea_NA_NA_NA_NA
# 13 e2271 0036a 9 d77b0 da0c6 other other 0036a_d77b0 9_d77b0 e2271_0036a_9_d77b0_da0c6 9
# 14 5e310 0a9ed 9 28b4d 3e4cb other other other other 5e310_0a9ed_9_28b4d_3e4cb 9
# 15 f76de 0036a 4 79727 cb67d other other other other f76de_0036a_4_79727_cb67d other
# 16 f76de ae93f 1 d77b0 92d9c other ae93f_1 ae93f_d77b0 1_d77b0 f76de_ae93f_1_d77b0_92d9c 1
# 17 853d4 d9ff0 -1 c9523 3e4cb 853d4_d9ff0 d9ff0_-1 d9ff0_c9523 other 853d4_d9ff0_-1_c9523_3e4cb other
# 18 e2271 0a9ed NA 00599 other 0a9ed_NA 0a9ed_NA NA_NA e2271_0a9ed_NA_NA_00599
# 19 3 79727 NA_NA other other other NA_NA_3_79727_NA 3
# 20 7 8007d 1280c NA_NA other NA_8007d 7_8007d NA_NA_7_8007d_1280c other

Imputation of Missing Values


First, we randomly replace 50 rows in each column of the iris dataset with missing values

ds <- iris
ds[sample(nrow(ds), 50),1] <- NA
ds[sample(nrow(ds), 50),2] <- NA
ds[sample(nrow(ds), 50),3] <- NA
ds[sample(nrow(ds), 50),4] <- NA
ds[sample(nrow(ds), 50),5] <- NA
summary(ds)


upload the NA'ed dataset to H2O

hex <- as.h2o(h2oServer, ds)


head(hex,20)

Impute the NAs in the first column in place with "median"

h2o.impute(hex, "Sepal.Length", method = "median")


head(hex,20)

Impute the NAs in the second column with the mean based on the groupBy columns Sepal.Length and Petal.Width and
Species

h2o.impute(hex, "Sepal.Width", method = "mean", groupBy = c("Sepal.Length", "Petal.Width", "Species"))


head(hex,20)

Impute the Species column with the "mode" based on the columns 1 and 4

h2o.impute(hex, 5, method = "mode", groupBy = c(1,4))


head(hex,20)

Splitting H2O Frames into Consecutive Subsets


First, we create a large frame

myframe = h2o.createFrame(h2oServer, 'large', rows = 1000000, cols = 10,
seed = -12301283, randomize = TRUE, value = 0,
categorical_fraction = 0.8, factors = 10, real_range = 1,
integer_fraction = 0.2, integer_range = 10, missing_fraction = 0.2,
response_factors = 1)
dim(myframe)

Now, we split that dataset into 4 consecutive pieces, so we need to specify the sizes of the first 3 splits

splits <- h2o.splitFrame(myframe, c(0.4,0.2,0.1))


dim(splits[[1]])
dim(splits[[2]])
dim(splits[[3]])
dim(splits[[4]])

Splitting H2O Frames into Random Subsets


We create a 1D vector with uniform values sampled from the interval 0...1 and use that to assign rows to the splits.

random <- h2o.runif(myframe, seed = 123456789)


train <- myframe[random < .8,]
valid <- myframe[random >= .8 & random < 0.9,]
test <- myframe[random >= .9,]
dim(train)
dim(valid)
dim(test)


Practical Use Cases for Marketing


Use the KDDCup98 dataset to design an intelligent direct mail campaign to maximize the overall profit (R tryout vs H2O solution)

Use the dataset with selected features to try to run random forests in R
rm(list=ls())
setwd("/data/h2o-training/")
Kdd98 <- read.csv("cup98LRN_z.csv")

featureSet <- c("ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB", "RECINHSE", "MDMAUD", "DOMAIN", "CLUSTER", "AGE",
"HOMEOWNR", "CHILD03", "CHILD07", "CHILD12", "CHILD18", "NUMCHLD", "INCOME", "GENDER", "WEALTH1", "HIT", "COLLECT1",
"VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER", "GARDENIN",
"BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES", "PEPSTRFL", "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
"RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "LASTGIFT", "LASTDATE", "FISTDATE", "TIMELAG", "AVGGIFT",
"HPHONE_D", "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A", "CLUSTER2", "GEOCODE2", "TARGET_D")

kdd98 <- Kdd98[, setdiff(featureSet, c("CONTROLN", "TARGET_B"))]


ls()
library(randomForest)
rf <- randomForest(TARGET_D ~ ., data=kdd98)

You'll quickly see an error message: "Error in na.fail.default(list(TARGET_D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, : missing values in object"

library(party)
cf <- cforest(TARGET_D ~ ., data= kdd98, control = cforest_unbiased(mtry=2, ntree=50))

After a while, a window will pop up showing "Unable to establish connection with R session". In order to continue
to use RStudio, you may go to the task manager and kill rsession. This demonstrates that cforest runs out of memory for
this trimmed dataset and cannot return any results.

Use the complete dataset to run big data random forest with
H2O
Let's use H2O to build a big data random forest model and predict whom to mail and how much profit the fund-raising campaign may generate.
To run H2O on your VM, double-click the H2O icon on your VM desktop.
Then open a browser and type localhost:54321 to access the H2O web UI.
Let's start uploading the datasets, build the model, and do the prediction.
Name the prediction key "drf_prediction" and download it to CSV after predicting.
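
If you prefer to stay in R rather than the web UI, here is a minimal sketch of the same workflow using the H2O R API (the file paths and tree settings are assumptions for illustration, not the exact values used in the UI):

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
train_hex <- h2o.importFile(h2oServer, path="/data/h2o-training/cup98LRN_z.csv", header=T, key="kdd98.train")
valid_hex <- h2o.importFile(h2oServer, path="/data/h2o-training/cup98VAL_z.csv", header=T, key="kdd98.valid")
predictors <- setdiff(colnames(train_hex), c("CONTROLN", "TARGET_B", "TARGET_D"))
drf_model <- h2o.randomForest(x=predictors, y="TARGET_D", data=train_hex,
                              classification=FALSE, ntree=50, depth=20)
drf_pred <- h2o.predict(drf_model, valid_hex)
write.csv(as.data.frame(drf_pred), "drf_prediction.csv", row.names=FALSE)   # same file name as read back in below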

Evaluate the total profit from the prediction



After getting the prediction, run the following with the H2O prediction.

kdd98v <- read.csv("cup98VAL_z.csv") # read test data


kdd_pred = read.csv("drf_prediction.csv") # read prediction value
kdd_pred_val <- apply(kdd_pred,1,function(row) if (row[1] > 0.68) 1 else 0 )
kdd98_withpred <- cbind(kdd98v, kdd_pred_val)
dim(kdd98v)
dim(kdd98_withpred)
kdd98_withpred$yield <- apply(kdd98_withpred,1,function(row) (as.numeric(row['TARGET_D']) - 0.68) * as.numeric(row[483]))

sum(kdd98_withpred$yield) # profit
max(kdd98_withpred$yield) # max donation
sum(kdd_pred_val) # mails sent


Sparkling Water
What is Sparkling Water?
Sparkling Water is an integration, developed by Michal Malohlava, that brings together the two open-source platforms H2O
and Spark. Because of this integration, users can take advantage of H2O's algorithms, Spark SQL, and both
parsers. The typical use case is to load data either into H2O or Spark, use Spark SQL to make a query, feed the results into
H2O Deep Learning to build a model, make predictions, and then use the results again in Spark.
The current integration requires an installation of Spark. To launch a Spark app, the user starts a
Spark cluster with H2O instances sitting in the same JVM. The user specifies the number of worker nodes and the amount of
memory on each node when starting a Spark Shell or when submitting a job to YARN on Hadoop.

Data Distribution
The key to this tutorial is understanding how data is distributed on the Sparkling Water cluster. After data has been
imported into either Spark or H2O, we can publish H2O data frames as SchemaRDDs in Spark in order to run Spark SQL
without replicating the data. However, reading data into H2O from Spark requires a one-time load of the data into H2O,
which isn't as cheap.


Setup and Installation


Prerequisites: Spark 1.2+
First install Spark and set the necessary environmental variables and then download the latest version of Sparkling Water
to get started.

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.2.1-bin-hadoop2.4.tgz
tar -xvf spark-1.2.1-bin-hadoop2.4.tgz
cd spark-1.2.1-bin-hadoop2.4
export SPARK_HOME=`pwd`
wget http://h2o-release.s3.amazonaws.com/sparkling-water/master/95/sparkling-water-0.2.12-95.zip
unzip sparkling-water-0.2.12-95.zip
cd sparkling-water-0.2.12-95

To launch a Sparkling Shell, which is essentially a Spark Shell with the H2O classes on the classpath, run the sparkling-shell
script from the command line.

bin/sparkling-shell

Launch H2O on the Spark Cluster


Now that we've started a Spark cluster, we launch an H2O JVM atop each of the Spark executors and import the H2O
Context. Now the user can use all of the H2O classes that the Sparkling shell was launched with.

// Prepare application environment


import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import water.Key
// Start H2O
val h2oContext = new H2OContext(sc).start()
import h2oContext._

Load data into the key-value store in the H2O cluster


This tutorial uses a simple airlines dataset that comes with your installation of H2O. This airlines dataset is actually
a subset of the 20+ years' worth of publicly available flight data. We will use the dataset to predict whether flights are
delayed at departure. At this point, you may type openFlow to access the new H2O web UI, which will list all frames
currently in the KV store.

// Import all year airlines into H2O


import java.io.File
val dataFile = "examples/smalldata/year2005.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))

Load data into Spark


At the same time, load weather data into Spark instead of H2O; this is mainly to demonstrate how the transparent
integration of H2O Data Frames allows the user to use H2ORDDs and SparkRDDs not only from the same API but also within SQL
queries. The following weather dataset is for O'Hare airport in Chicago during the year 2005.

// Import weather data into Spark


val weatherDataFile = "examples/smalldata/Chicago_Ohare_International_Airport.csv"
val wrawdata = sc.textFile(weatherDataFile,8).cache()
val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

Filter
Use Spark to filter the airlines dataset down to flights originating from O'Hare (ORD).

// Transfer data from H2O to Spark RDD


val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
val flightsToORD = airlinesTable.filter(f => f.Origin==Some("ORD"))

Spark SQL on the two data frames


Import the SQL Context and publish the airlines dataset as a SchemaRDD, which will allow us to join our flight and weather
dataframes together. In the SQL query, we only ask for the columns relevant for the model build, dropping columns with
many NA values or cheating columns such as the column Cancelled. So in addition to the original feature set we've added
temperature and precipitation information for each flight.

// Use Spark SQL to join flight and weather data in spark


import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")


weatherTable.registerTempTable("WeatherORD")

// Perform SQL Join on both tables


val bigTable = sql(
"""SELECT
|f.Year,f.Month,f.DayofMonth,
|f.UniqueCarrier,f.FlightNum,f.TailNum,
|f.Origin,f.Distance,
|w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
|f.IsDepDelayed
|FROM FlightsToORD f
|JOIN WeatherORD w
|ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)

Set columns to factors


It is important to set integer columns to factors before running any models, otherwise Deep Learning will assume it is
building a regression model rather than a binomial or multinomial model.

val trainFrame:DataFrame = bigTable


trainFrame.replace(19, trainFrame.vec("IsDepDelayed").toEnum)
trainFrame.update(null)

Create our predictive models


Now that we have our training frame in H2O we can build a GLM and Deep Learning Model. To create a Deep Learning
model, import the necessary classes and set all the parameters for the model in the script. Then after launching and
completing the job, grab the predictions frame from H2O and move it back into Spark as a SchemaRDD.

// Run deep learning to produce model estimating arrival delay


import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
val dlParams = new DeepLearningParameters()
dlParams._epochs = 100
dlParams._train = trainFrame
dlParams._response_column = 'IsDepDelayed
dlParams._variable_importances = true
dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[water.Key[Frame]]
// Create a job
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
// Use model to estimate delay on training data
val predictionH2OFrame = dlModel.score(bigTable)('predict)
val predictionsFromModel = toRDD[DoubleHolder](predictionH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

Do the same with GLM.

// Run GLM to produce model estimating arrival delay


import hex.glm.GLMModel.GLMParameters.Family
import hex.glm.GLM
import hex.glm.GLMModel.GLMParameters
val glmParams = new GLMParameters(Family.binomial)
glmParams._train = bigTable
glmParams._response_column = 'IsDepDelayed
glmParams._alpha = Array[Double](0.5)
glmParams._destination_key = Key.make("glmModel.hex").asInstanceOf[water.Key[Frame]]
val glm = new GLM(glmParams)
val glmModel = glm.trainModel().get()
// Use model to estimate delay on training data
val predGLMH2OFrame = glmModel.score(bigTable)('predict)
val predGLMFromModel = toRDD[DoubleHolder](predGLMH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

// Grab the AUC value of each model


val glm_auc = glmModel.validation().auc()
val trainMetrics = binomialMM(dlModel, trainFrame)
val dl_auc = trainMetrics.auc.AUC
println("AUC of General Linear Model" + glm_auc)
println("AUC of Deep Learning Model" + dl_auc)

Python on H2O
Similar to the R API, the H2O Python module provides access to the H2O JVM, its objects, its machine-learning algorithms,
and its modeling support (basic munging and feature generation) capabilities. The Python integration is only available in the
new-generation H2O, since all the REST endpoints it hits are different from the endpoints in the older version of H2O.
The H2O Python module is not intended as a replacement for other popular machine learning modules such as scikit-learn,
pylearn2, and their ilk. This module is a complementary interface to a modeling engine intended to make the transition of
models from development to production as seamless as possible. Additionally, it is designed to bring H2O to a wider
audience of data and machine learning devotees who work exclusively with Python (rather than R, Scala, or Java, the other
popular interfaces H2O supports) and who want another tool for building applications or doing data munging in a fast,
scalable environment without any extra mental anguish about threads and parallelism.
This tutorial demonstrates basic data import, manipulation, and summarization of data within an H2O cluster from within
Python. It requires an installation of the h2o Python module and its dependencies.

Setup and Installation


Prerequisites: Python 2.7 and Numpy 1.9.2
First install the necessary dependencies the H2O module is built on, and then install the H2O module itself.

pip install requests


pip install tabulate
# Remove any preexisting H2O module.
pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o-dev/master/1109/Python/h2o-0.3.0.1109-py2.py3-none-any.whl

Instantiate a H2O Cluster


The H2O JVM hosts a web server, so all communication occurs on a socket (specified by an IP address and a port)
via a series of REST calls (see connection.py for the REST layer implementation and details). There is a single active
connection to the H2O JVM at any one time, and this handle is stashed away out of sight in a singleton instance of
H2OConnection (the global __H2OConn__ ).

import h2o
h2o.init()

Get data
This tutorial uses NYC's public Citibike dataset, importing one month of trip information across 300+ stations into H2O.
Then a model predicting the demand and usage of bikes will be built to help load-balance the Citibike inventory.
Users can download the data from the Citibike-NYC website.

wget https://s3.amazonaws.com/tripdata/201310-citibike-tripdata.zip

Load data into the key-value store in the H2O cluster


We will use the h2o.import_frame function to read the data into the H2O key-value store.

dir_prefix = "~/Downloads/"
small_test = dir_prefix + "201310-citibike-tripdata.zip"
data = h2o.import_frame(path=small_test)

Examine the proxy object for the H2O resident data


The resulting data object is of class H2OFrame , which implements methods such as describe , dim , and head .

data.describe()
data.dim()
data.head()

Derive columns
In this step we will take the time since epoch and divide by the number of seconds in a day to get all the unique days in the
dataset and bind the new column created to the original H2OFrame .

startime = data["starttime"]
secsPerDay=1000*60*60*24
data["Days"] = (startime/secsPerDay).floor()
data.describe()

Data munging and aggregations


Once we have the unique Days column we can group by it. The goal is to count the number of bike starts per station per day. This will reduce the dataset from about 1,000,000 rows down to about 10,000 rows (the cardinality of Days times the
cardinality of start station name).

ddplycols=["Days","start station name"]


bpd = h2o.ddply(data[ddplycols],ddplycols,"(%nrow)") # Compute bikes-per-day
bpd["C1"]._name = "bikes" # Rename column from generic name
bpd.describe()
bpd.dim()

Data exploration and quantiles


To gauge the distribution of bike counts in the dataset, run quantile on the bikes column. It ranges widely, from stations with
fewer than 10 bike checkouts in a day to busy stations in the heart of NYC with hundreds of checkouts in one day.

print "Quantiles of bikes-per-day"


bpd["bikes"].quantile().show()

Now the dataset only has three features, but we can add Month and DayOfWeek columns to the frame, hopefully to detect
patterns such as bike stations near the financial district being more active during work days while stations near parks are
more active during the weekend.

secs = bpd["Days"]*secsPerDay
bpd["Month"] = secs.month()
bpd["DayOfWeek"] = secs.dayOfWeek()

print "Bikes-Per-Day"
bpd.describe()

Create training and test data sets to use during modeling


As a final step in this exploration of H2O basics from Python, we will create a train/test/holdout split, where the larger data set
will be used for training a model and the smaller data sets will be used for testing the usefulness of the model. We will achieve
this by using the runif function to generate random uniforms over [0, 1] for each row and using those random values to
determine the split designation for that row.
Then we will build GBM ( h2o.gbm ), Random Forest ( h2o.random_forest ), and GLM ( h2o.glm ) regression models predicting
bike counts. Once the models are built we can score on the different subsets of the data and report back the
performance of each model using the model_performance function.

def split_fit_predict(data):
    # Classic Test/Train split
    r = data['Days'].runif()   # Random UNIForm numbers, one per row
    train = data[r < 0.6]
    test  = data[(0.6 <= r) & (r < 0.9)]
    hold  = data[0.9 <= r]
    print "Training data has",train.ncol(),"columns and",train.nrow(),"rows, test has",test.nrow(),"rows, holdout has",hold.nrow()

    # Run GBM
    gbm = h2o.gbm(x=train.drop("bikes"),
                  y=train["bikes"],
                  validation_x=test.drop("bikes"),
                  validation_y=test["bikes"],
                  ntrees=500,        # 500 works well
                  max_depth=6,
                  learn_rate=0.1)

    # Run DRF
    drf = h2o.random_forest(x=train.drop("bikes"),
                            y=train["bikes"],
                            validation_x=test.drop("bikes"),
                            validation_y=test["bikes"],
                            ntrees=500,    # 500 works well
                            max_depth=50)

    # Run GLM
    glm = h2o.glm(x=train.drop("bikes"),
                  y=train["bikes"],
                  validation_x=test.drop("bikes"),
                  validation_y=test["bikes"],
                  dropNA20Cols=True)
    #glm.show()

    # ---------
    # 4- Score on holdout set & report
    train_r2_gbm = gbm.model_performance(train).r2()
    test_r2_gbm  = gbm.model_performance(test ).r2()
    hold_r2_gbm  = gbm.model_performance(hold ).r2()
    print "GBM R2 TRAIN=",train_r2_gbm,", R2 TEST=",test_r2_gbm,", R2 HOLDOUT=",hold_r2_gbm

    train_r2_drf = drf.model_performance(train).r2()
    test_r2_drf  = drf.model_performance(test ).r2()
    hold_r2_drf  = drf.model_performance(hold ).r2()
    print "DRF R2 TRAIN=",train_r2_drf,", R2 TEST=",test_r2_drf,", R2 HOLDOUT=",hold_r2_drf

    train_r2_glm = glm.model_performance(train).r2()
    test_r2_glm  = glm.model_performance(test ).r2()
    hold_r2_glm  = glm.model_performance(hold ).r2()
    print "GLM R2 TRAIN=",train_r2_glm,", R2 TEST=",test_r2_glm,", R2 HOLDOUT=",hold_r2_glm
    # --------------

Here we see an r^2 of 0.92 for GBM and 0.72 for GLM on the test set. This means that, given just the station, the month, and
the day-of-week, we can explain about 90% of the variance in bike-trip starts.

split_fit_predict(bpd)

Demos
Tableau
Excel
Streaming Data

Tableau
Beauty and Big Data: Tableau
Tableau
Download Tableau in order to use the notebooks available on our GitHub.

How does Tableau play with H2O?


Tableau is the front-end visual organizer that makes use of the statistical tools available from open source R and H2O. Tableau
connects to R via a socket server using the Rserve package for R. The H2O client package for R allows R to connect and
communicate with H2O via a REST API. So, by connecting Tableau to R, Tableau can launch H2O and run any of the features
already available from R.

R Component
First, make sure to install H2O in R:

> install.packages("h2o")

Then install the Rserve package in R, which allows the user to start an R server session locally that Tableau will
communicate with:

> install.packages("Rserve")
> library("Rserve")
> run.Rserve(port = 6311)

Tableau Front End


Step 1: Connection Setup
Open Demo_Template_8.1.twb, which should already have all the calculated fields containing R script in the sidebar.
Navigate to Help > Settings and Performance > Manage R Connection to establish a connection to Rserve.

Input for Server: 'localhost' and for Port: '6311' (the port Rserve was started on above)

Step 2: Data Connection


Set the workbook's connection to the airlines_meta.csv data by navigating to the data section on the left sidebar,
right-clicking on airlines_meta, and choosing Edit Connection.

Step 3: H2O Initialization


Configure the IP Address and Port that H2O will launch at as well as the path to the full airlines data file.

Step 4: Data Import

Execute 00 Load H2O and Tableau functions to run:

> library(h2o)
> tableauFunctions <- function(x){

> ...
> }
> print('Finish loading necessary functions')

Execute 01 Init H2O & Parse Data to:

> library(h2o)
> localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1)
> data.hex = h2o.importFile(localH2O, "/data/h2o-training/airlines/allyears2k.csv")

Execute 02 Compute Aggregation with H2O's ddply to group by columns and do roll-ups. First calculate the number of
flights coming and going per month:

numFlights = ddply(data.hex, 'Month', nrow)


numFlights.R = as.data.frame(numFlights)

Then compute the number of cancelled flights per month:

fun2 = function(df) {sum(df$Cancelled)}


h2o.addFunction(localH2O, fun2)
canFlights = ddply(data.hex, 'Month', fun2)
canFlights.R = as.data.frame(canFlights)

Execute 03 Run GLM to build a GLM model in H2O and grab back coefficient values that will be plotted in multiple
worksheets:

data.glm = h2o.glm(x = c('Origin', 'Dest', 'Distance', 'Unique Carrier') , y = 'Cancelled', data = data.hex, family = 'binomial', n

After the calculated fields finish running, scroll through the different dashboards to see the data from different angles.

Excel
Beauty and Big Data: Excel
Using Excel with H2O
When working with Excel, the HTTP call to H2O elicits a response in XML format that Microsoft Excel can parse in both the
32-bit and 64-bit versions. For users already familiar with Excel features and functions, this tutorial walks you through some
basic H2O capabilities written in VBA. The demonstration Excel file has to be macro-enabled to access all the
point-and-click features.
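For illustration only, that round trip boils down to an HTTP request against a running H2O node followed by parsing whatever comes back. Here is a minimal sketch in Java rather than VBA; the endpoint and class name are assumptions for a local H2O at its default port, and the workbook's macros ask for XML instead of the default response shown here.

// Hypothetical sketch: fetch a response from a local H2O node over HTTP,
// analogous to the calls the VBA macros make (which request XML responses).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class H2ORestCall {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:54321/Cloud.json");   // assumed endpoint for illustration
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null)
        System.out.println(line);                             // raw cloud-status response
    }
  }
}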

Initialization and Data Import

Start up a H2O session


Input for IP Address: 'localhost'
Input for Port: '54321'
Input for Heap Size: '1g'
Input for Number of Nodes: '1'
Input for File Path: '/data/h2o-training/airlines/allyears2k.csv'
Hit Submit to import the data with a progress bar in the lower left hand corner
The entry H2O Data Hex Key should automatically fill in with the destination key of the hex file now sitting in H2O

Data Summary
Hit Generate Summary to create a new worksheet with a summary of all the columns in the data set
All columns except NA% are taken directly from H2O's inspect page

Build GLM Model

Input for Response Variable: 'IsDepDelayed'


Then hit Populate Predictor Variables, which will remove the response variable from the Predictor Variables and
highlight the variables with high NA counts that should be ignored.
Input for family: 'binomial'
Fill in other information such as lambda, alpha, and the number of cross-validation folds you want to run.
Hit submit to start the model build

Visualize Model Output


On the Output Models section, select a model you'd like to visualize
Hit Show Output to show the coefficients for the GLM model.
Create a bar chart and sort the coefficients by magnitude.

Real-time Predictions With H2O on Storm


This tutorial shows how to create a Storm topology that can be used to make real-time predictions with H2O.

Where to find the latest version of this tutorial


https://github.com/0xdata/h2o-training/tree/master/tutorials/streaming/storm/README.md

1. What this tutorial covers

In this tutorial, we explore a combined modeling and streaming workflow as seen in the picture below:

We produce a GBM model by running H2O and emitting a Java POJO used for scoring. The POJO is very lightweight and
does not depend on any other libraries, not even H2O. As such, the POJO is perfect for embedding into third-party
environments, like a Storm bolt.
This tutorial walks you through the following sequence:
Installing the required software
A brief discussion of the data
Using R to build a gbm model in H2O
Exporting the gbm model as a Java POJO
Copying the generated POJO files into a Storm bolt build environment
Building Storm and the bolt for the model
Running a Storm topology with your model deployed


Watching predictions in real-time
(Note that R is not strictly required, but is used for convenience by this tutorial.)

2. Installing the required software


2.1. Clone the required repositories from Github
$ git clone https://github.com/apache/storm.git
$ git clone https://github.com/0xdata/h2o-training.git

NOTE: Building storm (c.f. Section 5) requires Maven. You can install Maven (version 3.x) by following the Maven
installation instructions.
Navigate to the directory for this tutorial inside the h2o-training repository:
$ cd h2o-training/tutorials/streaming/storm

You should see the following files in this directory:


README.md (This document)
example.R (The R script that builds the GBM model and exports it as a Java POJO)
training_data.csv (The data used to build the GBM model)
live_data.csv (The data that predictions are made on; used to feed the spout in the Storm topology)
H2OStormStarter.java (The Storm topology with two bolts: a prediction bolt and a classifying bolt)
TestH2ODataSpout.java (The Storm spout which reads data from the live_data.csv file and passes each observation
to the prediction bolt one observation at a time; this simulates the arrival of data in real-time)
And the following directories:
premade_generated_model (For those people who have trouble building the model but want to try running with Storm
anyway; you can ignore this directory if you successfully build your own generated_model later in the tutorial)
images (Images for the tutorial documentation, you can ignore these)
web (Contains the html and image files for watching the real-time prediction output (c.f. Section 8))

2.2. Install R
Get the latest version of R from CRAN and install it on your computer.

2.3. Install the H2O package for R


Note: The H2O package for R includes both the R code as well as the main H2O jar file. This is all you need to run
H2O locally on your laptop.
Step 1: Start R

$ R

Step 2: Install H2O from CRAN

> install.packages("h2o")
Note: For convenience, this tutorial was created with the Markov stable release of H2O (2.8.1.1) from CRAN, as
shown above. Later versions of H2O will also work.

2.4. Development environment


This tutorial was developed with the following software environment. (Other environments will work, but this is what we
used to develop and test this tutorial.)


H2O 2.8.1.1 (Markov)
MacOS X (Mavericks)
java version "1.7.0_51" (JDK)
R 3.1.2
Storm git hash (insert here)
curl 7.30.0 (x86_64-apple-darwin13.0) libcurl/7.30.0 SecureTransport zlib/1.2.5
Maven (Apache Maven 3.0.4)
For viewing predictions in real-time (Section 8) you will need the following:
npm (1.3.11) ( $ brew install npm )
http-server ( $ npm install http-server -g )
A modern web browser (animations depend on D3)

3. A brief discussion of the data


Let's take a look at a small piece of the training_data.csv file for a moment. This is a synthetic data set created for this
tutorial.
$ head -n 20 training_data.csv

Label  Has4Legs  CoatColor  HairLength  TailLength  EnjoysPlay  StaresOutWindow  HoursSpentNapping  ...
Note that the first row in the training data set is a header row specifying the column names.

The response column (i.e. the "y" column) we want to make predictions for is Label. It's a binary column, so we want to
build a classification model. The response column is categorical, and contains two levels, 'cat' and 'dog'. Note that the ratio
of dogs to cats is 3:1.
The remaining columns are all input features (i.e. the "x" columns) we use to predict whether each new observation is a
'cat' or a 'dog'. The input features are a mix of integer, real, and categorical columns.

4. Using R to build a gbm model in H2O and export it as a Java POJO

4.1. Build and export the model
The example.R script builds the model and exports the Java POJO to the generated_model temporary directory. Run
example.R as follows:
$ R -f example.R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"


Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> #
> # Example R code for generating an H2O Scoring POJO.
> #
>
> # "Safe" system. Error checks process exit status code. stop() if it failed.
> safeSystem <- function(x) {
+ print(sprintf("+ CMD: %s", x))
+ res <- system(x)
+ print(res)
+ if (res != 0) {
+ msg <- sprintf("SYSTEM COMMAND FAILED (exit status %d)", res)
+ stop(msg)
+ }
+ }
>
> library(h2o)
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson
Loading required package: statmod
Loading required package: survival
Loading required package: splines
Loading required package: tools
----------------------------------------------------------------------
Your next step is to start H2O and get a connection object (named
'localH2O', for example):
> localH2O = h2o.init()
For H2O package documentation, ask for help:
> ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321


For more information visit http://docs.h2o.ai
----------------------------------------------------------------------

Attaching package: h2o


The following objects are masked from package:base:
ifelse, max, min, strsplit, sum, tolower, toupper
>
> cat("Starting H2O\n")
Starting H2O
> myIP <- "localhost"
> myPort <- 54321
> h <- h2o.init(ip = myIP, port = myPort, startH2O = TRUE)
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/var/folders/tt/g5d7cr8d3fg84jmb5jr9dlrc0000gn/T//Rtmps9bRj4/h2o_tomk_started_from_r.out
/var/folders/tt/g5d7cr8d3fg84jmb5jr9dlrc0000gn/T//Rtmps9bRj4/h2o_tomk_started_from_r.err
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Successfully connected to http://localhost:54321


R is connected to H2O cluster:
H2O cluster uptime: 1 seconds 587 milliseconds
H2O cluster version: 2.8.1.1
H2O cluster name: H2O_started_from_R
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
Note: As started, H2O is limited to the CRAN default of 2 CPUs.
Shut down and restart H2O as shown below to use all your CPUs.
> h2o.shutdown(localH2O)
> localH2O = h2o.init(nthreads = -1)

>
> cat("Building GBM model\n")
Building GBM model
> df <- h2o.importFile(h, "training_data.csv");
^M | ^M |
> y <- "Label"
> x <- c("NumberOfLegs","CoatColor","HairLength","TailLength","EnjoysPlay","StairsOutWindow","HoursSpentNapping","RespondsToCommands","
> gbm.h2o.fit <- h2o.gbm(data = df, y = y, x = x, n.trees = 10)
^M | ^M |
>
> cat("Downloading Java prediction model code from H2O\n")
Downloading Java prediction model code from H2O
> model_key <- gbm.h2o.fit@key
> tmpdir_name <- "generated_model"
> cmd <- sprintf("rm -fr %s", tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: rm -fr generated_model"
[1] 0
> cmd <- sprintf("mkdir %s", tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: mkdir generated_model"
[1] 0
> cmd <- sprintf("curl -o %s/GBMPojo.java http://%s:%d/2/GBMModelView.java?_modelKey=%s", tmpdir_name, myIP, myPort, model_key)
> safeSystem(cmd)
[1] "+ CMD: curl -o generated_model/GBMPojo.java http://localhost:54321/2/GBMModelView.java?_modelKey=GBM_9d538f637ef85c78d6e2fea88ad54
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M100 27041 0 27041 0 0 2253k 0 --:--:[1] 0
> cmd <- sprintf("curl -o %s/h2o-model.jar http://127.0.0.1:54321/h2o-model.jar", tmpdir_name)
> safeSystem(cmd)

[1] "+ CMD: curl -o generated_model/h2o-model.jar http://127.0.0.1:54321/h2o-model.jar"


% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M100 8085 100 8085 0 0 3206k 0 --:--:[1] 0
> cmd <- sprintf("sed -i '' 's/class %s/class GBMPojo/' %s/GBMPojo.java", model_key, tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: sed -i '' 's/class GBM_9d538f637ef85c78d6e2fea88ad54bc9/class GBMPojo/' generated_model/GBMPojo.java"
[1] 0
>
> cat("Note: H2O will shut down automatically if it was started by this R script and the script exits\n")
Note: H2O will shut down automatically if it was started by this R script and the script exits
>

4.2. Look at the output


The generated_model directory is created and now contains two files:
$ ls -l generated_model

mbp2:storm tomk$ ls -l generated_model/


total 72
-rw-r--r-- 1 tomk staff 27024 Nov 15 16:49 GBMPojo.java
-rw-r--r-- 1 tomk staff 8085 Nov 15 16:49 h2o-model.jar

The h2o-model.jar file contains the interface definition, and the GBMPojo.java file contains the Java code for the POJO
model.
The following three sections from the generated model are of special importance.

4.2.1. Class name


public class GBMPojo extends water.genmodel.GeneratedModel {

This is the class to instantiate in the Storm bolt to make predictions. (Note that we simplified the class name using sed as
part of the R script that exported the model. By default, the class name has a long UUID-style name.)

4.2.2. Predict method


public final float[] predict( double[] data, float[] preds)

predict() is the method to call to make a single prediction for a new observation. data is the input, and preds is the output.
The return value is just preds, and can be ignored.
Inputs and Outputs must be numerical. Categorical columns must be translated into numerical values using the DOMAINS
mapping on the way in. Even if the response is categorical, the result will be numerical. It can be mapped back to a level
string using DOMAINS, if desired. When the response is categorical, the preds response is structured as follows:

preds[0] contains the predicted level number


preds[1] contains the probability that the observation is level0
preds[2] contains the probability that the observation is level1
...
preds[N] contains the probability that the observation is levelN-1
sum(preds[1] ... preds[N]) == 1.0

In this specific case, that means:

preds[0] contains 0 or 1
preds[1] contains the probability that the observation is ColInfo_15.VALUES[0]
preds[2] contains the probability that the observation is ColInfo_15.VALUES[1]

4.2.3. DOMAINS array


// Column domains. The last array contains domain of response column.
public static final String[][] DOMAINS = new String[][] {
/* NumberOfLegs */ ColInfo_0.VALUES,
/* CoatColor */ ColInfo_1.VALUES,
/* HairLength */ ColInfo_2.VALUES,
/* TailLength */ ColInfo_3.VALUES,
/* EnjoysPlay */ ColInfo_4.VALUES,
/* StairsOutWindow */ ColInfo_5.VALUES,
/* HoursSpentNapping */ ColInfo_6.VALUES,
/* RespondsToCommands */ ColInfo_7.VALUES,
/* EasilyFrightened */ ColInfo_8.VALUES,
/* Age */ ColInfo_9.VALUES,
/* Noise1 */ ColInfo_10.VALUES,
/* Noise2 */ ColInfo_11.VALUES,
/* Noise3 */ ColInfo_12.VALUES,
/* Noise4 */ ColInfo_13.VALUES,
/* Noise5 */ ColInfo_14.VALUES,
/* Label */ ColInfo_15.VALUES
};

The DOMAINS array contains information about the level names of categorical columns. Note that Label (the column we
are predicting) is the last entry in the DOMAINS array.
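As a small, hypothetical sketch of how a caller ties predict() and DOMAINS together (assuming the input row has already been converted to numeric/enum-mapped values, the way the Storm bolt in section 6 prepares it; this helper is not part of the tutorial's code):

// Hypothetical sketch: score one already-numeric row and map the result back to "cat"/"dog".
static String predictLabel(GBMPojo pojo, double[] data) {
  float[] preds = new float[GBMPojo.NCLASSES + 1];
  pojo.predict(data, preds);                 // data: numeric / enum-mapped inputs, Label dropped
  int level = (int) preds[0];                // predicted level number
  // The last DOMAINS entry is the response column's domain, so it maps the level to a label.
  return GBMPojo.DOMAINS[GBMPojo.DOMAINS.length - 1][level];
}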

5. Building Storm and the bolt for the model


5.1 Build storm and import into IntelliJ
To build storm navigate to the cloned repo and install via Maven:
$ cd storm && mvn clean install -DskipTests=true

Once storm is built, open up your favorite IDE to start building the h2o streaming topology. In this tutorial, we will be using
IntelliJ.
To import the storm-starter project into your IntelliJ please follow these screenshots:
Click on "Import Project" and find the storm repo. Select storm and click "OK"

Import the project from external model using Maven, click "Next"

Ensure that "Import Maven projects automatically" check box is clicked (it's off by default), click "Next"

That's it! Now click through the remaining prompts (Next -> Next -> Next -> Finish).
Once inside the project, open up storm-starter/test/jvm/storm.starter. Yes, we'll be working out of the test directory.

5.2 Build the topology


The topology we've prepared has one spout TestH2ODataSpout and two bolts (a "Predict Bolt" and a "Classifier Bolt").
Please copy the pre-built bolts and spout into the test directory in IntelliJ.
$ cp H2OStormStarter.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/

$ cp TestH2ODataSpout.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/

Your project should now look like this:

6. Copying the generated POJO files into a Storm bolt build environment

We are now ready to import the H2O pieces into the IntelliJ project. We'll need to add the h2o-model.jar and the scoring
POJO.
To import the h2o-model.jar into your IntelliJ project, please follow these screenshots:

File > Project Structure


Click the "+" to add a new dependency

Click on Jars or directories


Find the h2o-model.jar that we previously downloaded with the R script in section 4

Click "OK", then "Apply", then "OK".


You now have the h2o-model.jar as a dependency in your project.

We now copy over the POJO from section 4 into our storm project.
$ cp ./generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/

OR if you were not able to build the GBMPojo, copy over the pre-generated version:
$ cp ./premade_generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/

Your storm-starter project directory should now look like this:

In order to use the GBMPojo class, our PredictionBolt in H2OStormStarter has the following "execute" block:

@Override public void execute(Tuple tuple) {
  GBMPojo p = new GBMPojo();

  // get the input tuple as a String[]
  ArrayList<String> vals_string = new ArrayList<String>();
  for (Object v : tuple.getValues()) vals_string.add((String)v);
  String[] raw_data = vals_string.toArray(new String[vals_string.size()]);

  // the score pojo requires a single double[] of input.
  // We handle all of the categorical mapping ourselves
  double data[] = new double[raw_data.length-1]; // drop the Label
  String[] colnames = tuple.getFields().toList().toArray(new String[tuple.size()]);

  // if the column is a factor column, then look up the value, otherwise put the double
  for (int i = 1; i < raw_data.length; ++i) {
    data[i-1] = p.getDomainValues(colnames[i]) == null
            ? Double.valueOf(raw_data[i])
            : p.mapEnum(p.getColIdx(colnames[i]), raw_data[i]);
  }

  // get the predictions
  float[] preds = new float[GBMPojo.NCLASSES+1];
  p.predict(data, preds);

  // emit the results
  _collector.emit(tuple, new Values(raw_data[0], preds[1]));
  _collector.ack(tuple);
}

The probability emitted is the probability of being a 'dog'. We use this probability to decide whether the observation is of type
'cat' or 'dog' depending on some threshold. This threshold was chosen such that the F1 score was maximized for the
testing data (please see AUC and/or h2o.performance() from R).
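For reference, here is a minimal, hypothetical sketch of how such a cutoff could be searched for, assuming you already have an array of true labels and an array of scored probabilities for the test data. This helper is not part of the tutorial's code, and which class counts as "positive" is up to you.

// Hypothetical sketch: scan candidate thresholds and keep the one with the best F1 score.
static double bestF1Threshold(boolean[] isPositive, double[] prob) {
  double bestThresh = 0.5, bestF1 = -1.0;
  for (double t = 0.01; t < 1.0; t += 0.01) {
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < prob.length; i++) {
      boolean predPositive = prob[i] >= t;
      if (predPositive && isPositive[i]) tp++;
      else if (predPositive && !isPositive[i]) fp++;
      else if (!predPositive && isPositive[i]) fn++;
    }
    double f1 = (2.0 * tp) / (2.0 * tp + fp + fn);   // F1 = 2TP / (2TP + FP + FN)
    if (f1 > bestF1) { bestF1 = f1; bestThresh = t; }
  }
  return bestThresh;
}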
The ClassifierBolt then looks like:

public static class ClassifierBolt extends BaseRichBolt {
  OutputCollector _collector;
  final double _thresh = 0.54;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _collector = collector;
  }

  @Override
  public void execute(Tuple tuple) {
    String expected = tuple.getString(0);
    double dogProb = tuple.getFloat(1);
    String content = expected + "," + (dogProb <= _thresh ? "dog" : "cat");
    try {
      File file = new File("/Users/spencer/0xdata/h2o-training/tutorials/streaming/storm/web/out"); // EDIT ME! PUT YOUR PATH TO /we
      if (!file.exists()) file.createNewFile();
      FileWriter fw = new FileWriter(file.getAbsoluteFile());
      BufferedWriter bw = new BufferedWriter(fw);
      bw.write(content);
      bw.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
    _collector.emit(tuple, new Values(expected, dogProb <= _thresh ? "dog" : "cat"));
    _collector.ack(tuple);
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("expected_class", "class"));
  }
}

7. Running a Storm topology with your model deployed


Finally, we can run the topology by right-clicking on H2OStormStarter and running. Here's a screen shot of what that looks
like:

8. Watching predictions in real-time


To watch the predictions in real time, we start up an http-server on port 4040 and navigate to http://localhost:4040.
In order to get http-server, install npm (you may need sudo):
$ brew install npm
$ npm install http-server -g

Once these are installed, you may navigate to the web directory and start the server:
$ cd web
$ http-server -p 4040 -c-1

Now open up your browser and navigate to http://localhost:4040. Requires a modern browser (depends on D3 for
animation).
Here's a short video showing what it looks like all together.
Enjoy!

References
CRAN
GBM
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1, p. 339.
Springer New York, 2001.
Data Science with H2O (GBM)
Gradient Boosting (Wikipedia)
H2O
H2O Markov stable release
Java POJO
R
Storm

Build Applications on Top of H2O


Start Developing Your Application On Top of H2O-dev
This tutorial targets developers who are trying to build applications on top of the h2o-dev repository. It walks you
through a quick introduction to H2O droplets - project templates that can be used to build a new application.

1. Cloning the examples repository


The h2o-droplets repository on GitHub contains some very simple starter projects for different languages. Let's get started
by cloning the h2o-droplets repository, and changing to that directory.
$ git clone https://github.com/0xdata/h2o-droplets.git

Cloning into 'h2o-droplets'...


remote: Counting objects: 53, done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 53 (delta 10), reused 39 (delta 0)
Unpacking objects: 100% (53/53), done.
Checking connectivity... done.

$ cd h2o-droplets

Note: if you are using the H2O training sandbox, the repository is already cloned for you in ~/devel/h2o-droplets .

2. A quick look at the repo contents


As of this writing, the repository contains a Java example and a Scala example. Each of these is an independent starter
project.
$ ls -al

total 8
drwxr-xr-x 6 tomk staff 204 Oct 28 08:40 .
drwxr-xr-x 35 tomk staff 1190 Oct 28 08:40 ..
drwxr-xr-x 13 tomk staff 442 Oct 28 08:40 .git
-rw-r--r-- 1 tomk staff 322 Oct 28 08:40 README.md
drwxr-xr-x 11 tomk staff 374 Oct 28 08:40 h2o-java-droplet
drwxr-xr-x 11 tomk staff 374 Oct 28 08:40 h2o-scala-droplet

Let's take a closer look at the Java example:


$ find h2o-java-droplet -type f

h2o-java-droplet/.gitignore
h2o-java-droplet/build.gradle
h2o-java-droplet/gradle/wrapper/gradle-wrapper.jar
h2o-java-droplet/gradle/wrapper/gradle-wrapper.properties
h2o-java-droplet/gradle.properties
h2o-java-droplet/gradlew
h2o-java-droplet/gradlew.bat
h2o-java-droplet/README.md
h2o-java-droplet/settings.gradle
h2o-java-droplet/src/main/java/water/droplets/H2OJavaDroplet.java

h2o-java-droplet/src/test/java/water/droplets/H2OJavaDropletTest.java

As you can see, the Java example consists of a build.gradle file, a Java source file, and a Java test file.
Look at the build.gradle file and you will see the following sections, which link the Java droplet sample project to a version
of h2o-dev published in Maven Central:

repositories {
mavenCentral()
}

ext {
h2oVersion = '0.1.8'
}

dependencies {
// Define dependency on core of H2O
compile "ai.h2o:h2o-core:${h2oVersion}"
// Define dependency on H2O algorithm
compile "ai.h2o:h2o-algos:${h2oVersion}"
// Demands web support
compile "ai.h2o:h2o-web:${h2oVersion}"
// H2O uses JUnit for testing
testCompile 'junit:junit:4.11'
}

This is all very standard Gradle stuff. In particular, note that this example depends on three different H2O artifacts, all of
which are built in the h2o-dev repository.
h2o-core contains base platform capabilities like H2O's in-memory distributed key/value store and mapreduce
frameworks (the "water" package).
h2o-algos contains math algorithms like GLM and Random Forest (the "hex" package).
h2o-web contains the browser web UI (lots of javascript).

3. Preparing the example for use in your IDE


Let's walk through an example using IntelliJ IDEA. The first step is to use Gradle to build your IntelliJ project file.
$ cd h2o-java-droplet
$ ./gradlew idea

:ideaModule
Download http://repo1.maven.org/maven2/ai/h2o/h2o-core/0.1.8/h2o-core-0.1.8.pom
Download http://repo1.maven.org/maven2/ai/h2o/h2o-algos/0.1.8/h2o-algos-0.1.8.pom
Download http://repo1.maven.org/maven2/ai/h2o/h2o-web/0.1.8/h2o-web-0.1.8.pom
[... many more one-time downloads not shown ...]
:ideaProject
:ideaWorkspace
:idea
BUILD SUCCESSFUL
Total time: 51.429 secs

You will see three new files created with IDEA extensions. The .ipr file is the project file.

$ ls -al

total 168
drwxr-xr-x 15 tomk staff 510 Oct 28 10:03 .
drwxr-xr-x 6 tomk staff 204 Oct 28 08:40 ..
-rw-r--r-- 1 tomk staff 273 Oct 28 08:40 .gitignore
drwxr-xr-x 3 tomk staff 102 Oct 28 10:03 .gradle
-rw-r--r-- 1 tomk staff 1292 Oct 28 08:40 README.md
-rw-r--r-- 1 tomk staff 1409 Oct 28 08:40 build.gradle
drwxr-xr-x 3 tomk staff 102 Oct 28 08:40 gradle
-rw-r--r-- 1 tomk staff 23 Oct 28 08:40 gradle.properties
-rwxr-xr-x 1 tomk staff 5080 Oct 28 08:40 gradlew
-rw-r--r-- 1 tomk staff 2404 Oct 28 08:40 gradlew.bat
-rw-r--r-- 1 tomk staff 33316 Oct 28 10:03 h2o-java-droplet.iml
-rw-r--r-- 1 tomk staff 3716 Oct 28 10:03 h2o-java-droplet.ipr
-rw-r--r-- 1 tomk staff 9299 Oct 28 10:03 h2o-java-droplet.iws
-rw-r--r-- 1 tomk staff 39 Oct 28 08:40 settings.gradle
drwxr-xr-x 4 tomk staff 136 Oct 28 08:40 src

4. Opening the project


Since we have already created the project file, start up IDEA and choose Open Project.

Choose the h2o-java-droplet.ipr project file that we just created with gradle.

5. Running the test inside the project


Rebuild the project.

Run the test by right-clicking on the test name.

Watch the test pass!

KMeans
Hacking Algorithms into H2O: KMeans
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All tutorials start out the same - getting the h2o-dev code
and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example, and then
the content is customized for each algorithm. This tutorial describes the algorithm K-Means.

What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST and JSON support on your new algorithm. You still have to write the math,
of course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.

Building H2O-dev
I assume you are familiar with basic Java development and how GitHub repos work - so we'll start with a clean GitHub repo
of h2o-dev:

C:\Users\cliffc\Desktop> mkdir my-h2o


C:\Users\cliffc\Desktop> cd my-h2o
C:\Users\cliffc\Desktop\my-h2o> git clone https://github.com/0xdata/h2o-dev

This will download the h2o-dev source base; it took about 30 seconds for me from home onto my old-school Windows 7 box.
Then do an initial build:

C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE

:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradle --daemon build -x test


...
:h2o-web:signArchives SKIPPED
:h2o-web:assemble
:h2o-web:check
:h2o-web:build
BUILD SUCCESSFUL
Total time: 1 mins 44.645 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.

While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, the distributed Key-Value store, and the

web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example

algorithm and turning it into a K-Means algorithm.


h2o-web : The web interface and JavaScript. We will use jar files from here in our project, but probably won't need to look

at the code.
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one

line here to teach H2O about our new algorithm.


Within each top-level directory, there is a fairly straightforward maven'ized directory structure:

src/main/java - Java source code


src/test/java - Java test code

In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:

src/main/java/water - Java core source code


src/main/java/hex - Java math source code

Ok, let's set up our IDE. For me, since I'm using the default IntelliJ IDEA setup:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradlew idea


...
:h2o-test-integ:idea
:h2o-web:ideaModule
:h2o-web:idea

BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

Running H2O-dev Tests in an IDE


Then I switched to IDEAJ from my command window. I launched IDEAJ, selected "Open Project", navigated to the h2o-dev/ directory and clicked Open. After IDEAJ opened, I clicked the Make project button (or Build/Make Project or ctrl-F9)

and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring up the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:

"C:\Program Files\Java\jdk1.7.0_67\bin\java" -agentlib:jdwp=transport=dt_socket....


Connected to the target VM, address: '127.0.0.1:51321', transport: 'socket'

11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started ---- 11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

11-08 13:17:07.722 192.168.1.2:54321 4576 main INFO: ###########################################################


11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test class name: hex.example.ExampleTest
11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test method name: testIris
11-08 13:17:07.724 192.168.1.2:54321 4576 main INFO: ###########################################################
Start Parse
11-08 13:17:08.198 192.168.1.2:54321 4576 FJ-0-7 INFO: Parse result for _85a160bc2419316580eeaab88602418e (150 rows):
11-08 13:17:08.204 192.168.1.2:54321 4576 FJ-0-7 INFO: Col type min max NAs consta
11-08 13:17:08.205 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_len: numeric 4.30000 7.90000
11-08 13:17:08.206 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_wid: numeric 2.00000 4.40000
11-08 13:17:08.207 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_len: numeric 1.00000 6.90000
11-08 13:17:08.208 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_wid: numeric 0.100000 2.50000
11-08 13:17:08.209 192.168.1.2:54321 4576 FJ-0-7 INFO: class: categorical 0.00000 2.00000
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Internal FluidVec compression/distribution summary:
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Chunk type count fraction size rel. size
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1 1 20.000 % 218 B 19.156 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1S 4 80.000 % 920 B 80.844 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Total memory usage : 1.1 KB

Done Parse: 488

11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 0


11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 1
11-08 13:17:08.305 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 2
11-08 13:17:08.306 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 3
11-08 13:17:08.307 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 4
11-08 13:17:08.308 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 5
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 6
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 7
11-08 13:17:08.310 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 8
11-08 13:17:08.311 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 9
11-08 13:17:08.315 192.168.1.2:54321 4576 main INFO: #### TEST hex.example.ExampleTest#testIris EXECUTION TIME: 00:00:00

Disconnected from the target VM, address: '127.0.0.1:51321', transport: 'socket'


Process finished with exit code 0

Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it
down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:

11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java

code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).

Building Our Algorithm: Copying from the Example


Now let's get our own algorithm framework to start playing with in place. Because H2O-dev already has a KMeans
algorithm, that name is taken... but we want our own. (Besides just doing it ourselves, there are some cool extensions to
KMeans we can add and sometimes it's easier to start from a clean[er] slate).
So this algorithm is called KMeans2 (not too creative, I know). I cloned the main code and model from the h2o-algos/src/main/java/hex/example/ directory into h2o-algos/src/main/java/hex/kmeans2/ , and also the test from h2o-algos/src/test/java/hex/example/ directory into h2o-algos/src/test/java/hex/kmeans2/ .

Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their KMeans2* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its KMeans2 variant. Finally I renamed
the files and file contents from Example to KMeans2 .
I also dove into h2o-app/src/main/java/water/H2OApp.java and copied the two Example lines and made their KMeans2
variants. Because I'm old-school, I did this with a combination of shell hacking and Emacs; about 5 minutes all told.
At this point, back in IDEAJ, I navigated to KMeans2Test.java , right-clicked debug-test testIris again - and was rewarded
with my KMeans2 clone running a basic test. Not a very good KMeans, but definitely a start.

What's in All Those Files?


What's in all those files? Mainly there is a Model and a ModelBuilder , and then some support files.
A model is a mathematical representation of the world, an effort to approximate some interesting fact with numbers.
It is a static concrete unchanging thing, completely defined by the rules (algorithm) and data used to make it.
A model-builder builds a model; it is transient and active. It exists as long as we are actively trying to make a model,
and is thrown away once we have the model in-hand.
In our case, K-Means is the algorithm - so that belongs in the KMeans2ModelBuilder.java file, and the result is a set of
clusters (a model), so that belongs in the KMeans2Model.java file.
We also split Schemas from Models - to isolate slow-moving external APIs from rapidly-moving internal APIs: as a Java dev
you can hack the guts of K-Means to your heart's content - including the inputs and outputs - as long as the externally
facing V2 schemas do not change. If you want to report new stuff or take new parameters, you can make a new V3 schema
- which is not compatible with V2 - for the new stuff. Old external V2 users will not be affected by your changes (you'll still
have to make the correct mappings in the V2 schema code from your V3 algorithm).
One other important hack: K-Means is an unsupervised algorithm - no training data (no "response") tells it what the results
"should" be. So we need to hack the word Supervised out of all the various class names it appears in. After this is done,
your KMeans2Test probably fails to compile, because it is trying to set the response column name in the test, and
unsupervised models do not get a response to train with. Just delete the line for now:

parms._response_column = "class";

At this point we can run our test code again (still finding the max-per-column).

The KMeans2 Model


The KMeans2 model, in the file KMeans2Model.java , should contain what we expect out of K-Means: a set of clusters. We'll
represent a single cluster as an N-dimensional point (an array of doubles). For our K clusters, this will be:

public double _clusters[/*K*/][/*N*/]; // Our K clusters, each an N-dimensional point

Inside the KMeans2Model class, there is a class for the model's output: class KMeans2Output . We'll put our clusters there. The
various support classes and files will make sure our model's output appears in the correct REST and JSON responses and
gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that now.
To help assay the goodness of our model, we should also report some extra facts about the training results. The obvious
thing to report is the Mean Squared Error, or the average squared-error each training point has against its chosen cluster:

public double _mse; // Mean Squared Error of the training data
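As a hypothetical sketch of what that number means (this is an illustration, not the tutorial's final implementation), the MSE is just the average squared distance of each training row to its nearest cluster center:

// Hypothetical sketch: mean squared error of rows against their nearest cluster centers.
static double mse(double[][] rows, double[][] clusters) {
  double sse = 0;
  for (double[] row : rows) {
    double best = Double.MAX_VALUE;
    for (double[] center : clusters) {
      double d2 = 0;
      for (int i = 0; i < row.length; i++) {
        double d = row[i] - center[i];
        d2 += d * d;
      }
      best = Math.min(best, d2);       // squared distance to the nearest (chosen) cluster
    }
    sse += best;
  }
  return sse / rows.length;            // average over all training rows
}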

And finally a quick report on the effort used to train: the number of iterations training actually took. K-Means runs in
iterations, improving with each iteration. The algorithm typically stops when the model quits improving; we report how many
iterations it took here:

public int _iters; // Iterations we took

My final KMeans2Output class looks like:

public static class KMeans2Output extends Model.Output {


public int _iters; // Iterations executed
public double _clusters[/*K*/][/*N*/]; // Our K clusters, each an N-dimensional point
public double _mse; // Mean Squared Error
public KMeans2Output( KMeans2 b ) { super(b); }
@Override public ModelCategory getModelCategory() { return Model.ModelCategory.Clustering; }
}

Now, let's turn to the input to our model-building process. These are stored in the class KMeans2Model.KMeans2Parameters .
We already inherit an input training dataset (returned with train() ), possibly a validation dataset ( valid() ), and some
other helpers (e.g. which columns to ignore if we do an expensive scoring step with each iteration). For now, we can ignore
everything except the input dataset from train() .
However, we want some more parameters for K-Means: K , the number of clusters. Define it next to the left-over
_max_iters from the old Example code (which we might as well keep since that's a useful stopping condition for K-Means):

public int _K;

My final KMeans2Parameters class looks like:

public static class KMeans2Parameters extends Model.Parameters {


public int _max_iters = 1000; // Max iterations
public int _K = 0;
}
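To see how these pieces fit together, here is a rough, hypothetical sketch of the kind of loop that would fill in _iters, _clusters, and _mse from the _K and _max_iters parameters. The tutorial's real work happens later, inside KMeans2Driver, and the per-iteration helpers below are just placeholders:

// Hypothetical sketch only - not the tutorial's actual driver code.
KMeans2Output out = new KMeans2Output(this);
double[][] clusters = initialClusters(_parms._K);   // hypothetical helper: pick K starting points
double oldMse = Double.MAX_VALUE;
for (int iter = 0; iter < _parms._max_iters; iter++) {
  assignAndRecompute(clusters);                     // hypothetical helper: one K-Means (Lloyd) iteration
  double mse = computeMse(clusters);                // hypothetical helper: score the clustering
  out._iters = iter + 1;
  out._mse = mse;
  if (mse >= oldMse) break;                         // stop when the model quits improving
  oldMse = mse;
}
out._clusters = clusters;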

A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance
whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private).
The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason
to use (or not) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our JSON
output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new K field to the external input Schema in h2o-algos/src/main/java/hex/schemas/KMeans2V2.java :

public static final class KMeans2ParametersV2 extends ModelParametersSchema<KMeans2Model.KMeans2Parameters, KMeans2ParametersV2> {


static public String[] own_fields = new String[] { "max_iters", "K"};
// Input fields
@API(help="Maximum training iterations.") public int max_iters;
@API(help="K") public int K;
}

And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/KMeans2ModelV2.java :

public static final class KMeans2ModelOutputV2 extends ModelOutputSchema<KMeans2Model.KMeans2Output, KMeans2ModelOutputV2> {


// Output fields
@API(help="Iterations executed") public int iters;
@API(help="Cluster centers") public double clusters[/*K*/][/*N*/];
@API(help="Mean Squared Error") public double mse;

The KMeans2 Model Builder


Let's turn to the K-Means model builder, which includes some boilerplate we inherited from the old Example code, and a
place to put our real algorithm. There is a basic KMeans2 constructor which calls init :

public KMeans2( ... ) { super("KMeans2",parms); init(false); }

In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap)
argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front-end sanity-check
parameters as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on
while clicking in the browser". No running K-Means in the init() call!
Speaking of the init() call, the one we got from the old Example code limits our _max_iters to between 1 and 10 million.
Let's add some lines to check that K is sane:

if( _parms._K < 2 || _parms._K > 999999 )


error("K","must be between 2 and a million");

Immediately when testing the code, I get a failure because the KMeans2Test.java code does not set K and the default is
zero. I'll set K to 3 in the test code:

parms._K = 3;

In the KMeans2.java file there is a trainModel call that is used when you really want to start running K-Means (as opposed
to just checking arguments). In our case, the old boilerplate starts a KMeans2Driver in a background thread. Not required,
but for any long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and
from REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class KMeans2Driver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.

Scope.enter();

Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:

_parms.lock_frames(KMeans2.this);

Locking prevents situations like accidentally deleting or loading a new dataset with the same name while K-Means is
running. Like the Scope.exit() above, we will unlock in a finally block. While it might be nice to use Java locking, or even
JDK 5.0 locks, we need a distributed lock, which is not provided by JDK 5.0. Note that H2O locks are strictly cooperative - we cannot enforce locking at the JVM level like the JVM does.
Next, we make an instance of our model object (with no clusters yet) and place it in the DKV, locked (e.g., to prevent
another user from overwriting our model-in-progress with an unrelated model).

model = new KMeans2Model(dest(), _parms, new KMeans2Model.KMeans2Output(KMeans2.this));


model.delete_and_lock(_key);

Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.
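
Putting the boilerplate together, the driver body has roughly this shape - a sketch only; the cleanup calls in the finally block are assumed counterparts of the lock/enter calls above and may be named differently in the real code:

Scope.enter();                        // Track temp Keys made during the build
KMeans2Model model = null;
try {
  _parms.lock_frames(KMeans2.this);   // Cooperative lock on the input frames
  model = new KMeans2Model(dest(), _parms, new KMeans2Model.KMeans2Output(KMeans2.this));
  model.delete_and_lock(_key);        // Lock the model-in-progress in the DKV
  // ... the main K-Means loop (next section) runs here, calling model.update(_key) ...
} finally {
  // Assumed cleanup: unlock the model and the frames, then Scope.exit() to
  // wipe any temp Keys recorded since Scope.enter().
}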

The KMeans2 Main Algorithm


Finally we get to where the Math is!
K-Means starts with some clusters, generally picked from the dataset population, and then optimizes the cluster centers
and cluster assignments. The easiest (but not the best!) way to pick clusters is just to pick points at (pseudo) random. So
ahead of our iteration/main loop, let's pick some clusters.

// Pseudo-random initial cluster selection


Frame f = train(); // Input dataset
double clusters[/*K*/][/*N*/] = model._output._clusters = new double[_parms._K][f.numCols()];
Random R = new Random(); // Really awful RNG, we should pick a better one later
for( int k=0; k<_parms._K; k++ ) {
long row = Math.abs(R.nextLong() % f.numRows());
for( int j=0; j<f.numCols(); j++ ) // Copy the point into our cluster
clusters[k][j] = f.vecs()[j].at(row);
}
model.update(_key); // Update model in K/V store

My KMeans2 now has a leftover loop from the old Example code running up to some max iteration count. This sounds like a
good start to K-Means - we'll need several stopping conditions, and max-iterations is one of them.

// Stop after enough iterations


for( ; model._output._iters < _parms._max_iters; model._output._iters++ ) {
...
}

Let's also stop if somebody clicks the "cancel" button in the GUI:

// Stop after enough iterations


for( ; model._output._iters < _parms._max_iters; model._output._iters++ ) {
if( !isRunning() ) break; // Stopped/cancelled
...
}

I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. clusters, mse), and save the results back into the DKV, bump the progress bar, and log a little bit of progress:

// Fill in the model


model._output._clusters = ????? // we need to figure these out
model.update(_key); // Update model in K/V store
update(1); // One unit of work in the GUI progress bar
StringBuilder sb = new StringBuilder();
sb.append("KMeans2: iter: ").append(model._output._iters);
Log.info(sb);

The KMeans2 Main Loop


And now we need to figure out what to do in our main loop. Somewhere between the loop-top-isRunning check and the
model.update() call, we need to compute something to update our model with! This is the meat of K-Means - for each point,

assign it to the nearest cluster center, then compute new cluster centers from the assigned points, and iterate until the
clusters quit moving.
Anything that starts out with the words "for each point" when you have a billion points needs to run in-parallel and scale-out
to have a chance of completing fast - and this is exactly what H2O is built for! So let's write code that runs scale-out for-each-point... and the easiest way to do that is with an H2O Map/Reduce job - an instance of MRTask. For K-Means, this is an
instance of Lloyd's basic algorithm. We'll call it from the main-loop like this, and define it below (extra lines included so you
can see how it fits):

if( !isRunning() ) break; // Stopped/cancelled


Lloyds ll = new Lloyds(clusters).doAll(f);
clusters = model._output._clusters = ArrayUtils.div(ll._sums,ll._rows);
model._output._mse = ll._se/f.numRows();
// Fill in the model
model.update(_key); // Update model in K/V store
update(1); // One unit of work
StringBuilder sb = new StringBuilder();
sb.append("KMeans2: iter: ").append(model._output._iters).append(" MSE=").append(model._output._mse).append(" ROWS="
Log.info(sb);

Let's also add a stopping condition if the clusters stop moving:

// Stop after enough iterations


double last_mse = Double.MAX_VALUE; // MSE from prior iteration
for( ; model._output._iters < _parms._max_iters; model._output._iters++ ) {
if( !isRunning() ) break; // Stopped/cancelled
Lloyds ll = new Lloyds(clusters).doAll(f);
clusters = model._output._clusters = ArrayUtils.div(ll._sums,ll._rows);
model._output._mse = ll._se/f.numRows();
// Fill in the model
model.update(_key); // Update model in K/V store
update(1); // One unit of work
StringBuilder sb = new StringBuilder();
sb.append("KMeans2: iter: ").append(model._output._iters).append(" MSE=").append(model._output._mse).append(" ROWS="
Log.info(sb);
// Also stop if the model stops improving (if MSE stops dropping very much)
double improv = (last_mse-model._output._mse) / model._output._mse;
if( improv < 1e-4 ) break;
last_mse = model._output._mse;
}

Basically, we just called some not-yet-defined Lloyds code, computed some cluster centers by computing the average
point from the points in the new cluster, and copied the results into our model. I also printed out the Mean Squared Error
and row counts, so we can watch the progress over time. Finally we end with another stopping condition: stop if the latest
model is really not much better than the last model. Now class Lloyds can be coded as an inner class to the
KMeans2Driver class:

class Lloyds extends MRTask<Lloyds> {


private final double[/*K*/][/*N*/] _clusters; // Old cluster
private double[/*K*/][/*N*/] _sums; // Sum of points in new cluster
private long[/*K*/] _rows; // Number of points in new cluster
private double _se; // Squared Error
Lloyds( double clusters[][] ) { _clusters = clusters; }
@Override public void map( Chunk[] chks ) {
...
}
@Override public void reduce( Lloyds ll ) {
...
}
}

A Quick H2O Map/Reduce Diversion


This isn't your Hadoop-Daddy's Map/Reduce. This is an in-memory super-fast map-reduce... where "super-fast" generally
means "memory bandwidth limited", often 1000x faster than the usual hadoop-variant - MRTasks can often touch a
gigabyte of data in a millisecond, or a terabyte in a second (depending on how much hardware is in your cluster - more
hardware is faster for the same amount of data!)
The map() call takes data in Chunks - where each Chunk is basically a small array-like slice of the Big Data. Data in
Chunks is accessed with basic at0 and set0 calls (vs accessing data in Vecs with at and set ). The output of a map()
is stored in the Lloyds object itself, as a Plain Olde Java Object (POJO). Each map() call has private access to its own
fields and Chunks , which implies there are lots of instances of Lloyds objects scattered all over the cluster (one such
instance per Chunk of data... well, actually one instance per call to map() , but each map call is handed an aligned set of
Chunks, one per feature or column in the dataset).
Since there are lots of little Lloyds running about, their results need to be combined. That's what reduce does - combine
two Lloyds into one. Typically, you can do this by adding similar fields together - often array elements are added side-by-side, similar to a saxpy operation.
All code here is written in a single-threaded style, even as it runs in parallel and distributed. H2O handles all the
synchronization issues.
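
To make that concrete, here is a minimal MRTask that has nothing to do with K-Means: it just sums one numeric Vec. (A sketch; SumTask and someVec are made-up names, but the map / reduce / doAll / at0 / _len calls are the same ones used throughout this tutorial.)

class SumTask extends MRTask<SumTask> {
  double _sum;                            // Per-map partial sum; folded together in reduce
  @Override public void map( Chunk chk ) {
    for( int row=0; row<chk._len; row++ )
      _sum += chk.at0(row);               // at0: read a value within this Chunk
  }
  @Override public void reduce( SumTask st ) { _sum += st._sum; }
}
// Usage sketch: double total = new SumTask().doAll(someVec)._sum;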

A Quick Helper
A common op is to compute the distance between two points. We'll compute it as the squared Euclidean distance (squared
so as to avoid an expensive square-root operation):

static double distance( double[] ds0, double[] ds1 ) {


double sum=0;
for( int i=0; i<ds0.length; i++ )
sum += (ds0[i]-ds1[i])*(ds0[i]-ds1[i]);
return sum;
}

Lloyd's
Back to Lloyds: we loop over the Chunk[] of data handed to the map call, copying each row into a double[] for easy handling:

@Override public void map( Chunk[] chks ) {


double[] ds = new double[chks.length];
for( int row=0; row<chks[0]._len; row++ ) {
for( int i=0; i<ds.length; i++ ) ds[i] = chks[i].at0(row);
...
}
}

Then we need to find the nearest cluster center:

for( int i=0; i<ds.length; i++ ) ds[i] = chks[i].at0(row);


// Find distance to cluster 0
int nearest=0;
double dist = distance(ds,_clusters[nearest]);
// Find nearest cluster, and its distance
for( int k=1; k<_parms._K; k++ ) {
double dist2 = distance(ds,_clusters[k]);
if( dist2 < dist ) { dist = dist2; nearest = k; }
}
...

And then add the point into our growing pile of points in the new clusters we're building:

if( dist2 < dist ) { dist = dist2; nearest = k; }


}
// Add the point into the chosen cluster
ArrayUtils.add(_sums[nearest],ds);
_rows[nearest]++;
// Accumulate squared-error (which is just squared-distance)
_se += dist;

And that ends the map call and the Lloyds main work loop. To recap, here it is all at once:

@Override public void map( Chunk[] chks ) {


_sums = new double[_clusters.length][_clusters[0].length]; // Allocate the per-map accumulators
_rows = new long  [_clusters.length];
double[] ds = new double[chks.length];
for( int row=0; row<chks[0]._len; row++ ) {
for( int i=0; i<ds.length; i++ ) ds[i] = chks[i].at0(row);
// Find distance to cluster 0
int nearest=0;
double dist = distance(ds,_clusters[nearest]);
// Find nearest cluster, and its distance
for( int k=1; k<_parms._K; k++ ) {
double dist2 = distance(ds,_clusters[k]);
if( dist2 < dist ) { dist = dist2; nearest = k; }
}
// Add the point into the chosen cluster
ArrayUtils.add(_sums[nearest],ds);
_rows[nearest]++;
// Accumulate squared-error (which is just squared-distance)
_se += dist;
}
}

The reduce needs to fold together the returned results; the _sums , the _rows and the _se :

@Override public void reduce( Lloyds ll ) {


ArrayUtils.add(_sums,ll._sums);
ArrayUtils.add(_rows,ll._rows);
_se += ll._se;
}

And that completes KMeans2!

Running K-Means
Running the KMeans2Test returns:

11-09 16:47:36.609 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 0 1.9077333333333335 ROWS=[66, 34, 50]
11-09 16:47:36.610 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 1 0.6794683457919871 ROWS=[60, 40, 50]
11-09 16:47:36.611 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 2 0.604730388888889 ROWS=[54, 46, 50]
11-09 16:47:36.613 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 3 0.5870257150458589 ROWS=[51, 49, 50]
11-09 16:47:36.614 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 4 0.5828039837094235 ROWS=[50, 50, 50]
11-09 16:47:36.615 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 5 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.616 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 6 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.618 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 7 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.619 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 8 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.621 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 9 0.5823599999999998 ROWS=[50, 50, 50]

You can see the Mean Squared Error dropping over time, and the clusters stabilizing.
At this point there are a zillion ways we could take this K-Means:
Report MSE on the validation set
Report MSE per-cluster (some clusters will be 'tighter' than others) - see the sketch just after this list
Run K-Means with different seeds, which will likely build different clusters - and report the best cluster, or even sort and
group them by MSE.
Handle categoricals (e.g. compute Manhattan distance instead of Euclidean)
Normalize the data (optionally, on a flag). Without normalization, features with larger absolute values will have larger
distances (errors) and will take priority over other features.
Aggressively split high-variance clusters to see if we can get K-Means out of a local minimum
Handle clusters that "run dry" (zero rows), possibly by splitting the point with the most error/distance out from its cluster
to make a new one.
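
As a taste of the per-cluster MSE idea, here is a sketch of a second, stand-alone pass over the data with the final clusters. The class and field names are made up; the distance helper and the ArrayUtils calls are the ones already used above:

class ClusterMSE extends MRTask<ClusterMSE> {
  private final double[/*K*/][/*N*/] _clusters; // Final cluster centers
  double[/*K*/] _ses;                           // Squared error per cluster
  long  [/*K*/] _rows;                          // Row count per cluster
  ClusterMSE( double clusters[][] ) { _clusters = clusters; }
  @Override public void map( Chunk[] chks ) {
    _ses  = new double[_clusters.length];
    _rows = new long  [_clusters.length];
    double[] ds = new double[chks.length];
    for( int row=0; row<chks[0]._len; row++ ) {
      for( int i=0; i<ds.length; i++ ) ds[i] = chks[i].at0(row);
      int nearest = 0;
      double dist = distance(ds,_clusters[0]);
      for( int k=1; k<_clusters.length; k++ ) {
        double dist2 = distance(ds,_clusters[k]);
        if( dist2 < dist ) { dist = dist2; nearest = k; }
      }
      _ses [nearest] += dist;   // Charge the squared error to its cluster
      _rows[nearest]++;
    }
  }
  @Override public void reduce( ClusterMSE c ) {
    ArrayUtils.add(_ses , c._ses );
    ArrayUtils.add(_rows, c._rows);
  }
}
// Usage sketch: ClusterMSE c = new ClusterMSE(clusters).doAll(f);
//               per-cluster MSE for cluster k is then c._ses[k] / (double)c._rows[k]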
I'm sure you can think of lots more ways to extend K-Means!
Good luck with your own H2O algorithm,
Cliff


Grep
Hacking Algorithms into H2O: Grep
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All tutorials start out the same: getting the h2o-dev code
and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example, and then
the content is customized for each algorithm. This tutorial describes computing Grep.

What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST & JSON support on your new algorithm. You still have to write the math, of
course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.

Building H2O-dev
I assume you are familiar with basic Java development and how github repos work - so we'll start with a clean github repo
of h2o-dev:

C:\Users\cliffc\Desktop> mkdir my-h2o


C:\Users\cliffc\Desktop> cd my-h2o
C:\Users\cliffc\Desktop\my-h2o> git clone https://github.com/0xdata/h2o-dev

This will download the h2o-dev source base; took about 30secs for me from home onto my old-school Windows 7 box.
Then do an initial build:

C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE

:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradle --daemon build -x test


...
:h2o-web:signArchives SKIPPED
:h2o-web:assemble
:h2o-web:check
:h2o-web:build
BUILD SUCCESSFUL
Total time: 1 mins 44.645 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.

While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, distributed Key-Value store, the

web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example

algorithm and turning it into a grep algorithm.


h2o-web : The web interface and JavaScript. We will use jar files from here in our project, but probably won't need to look

at the code.
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one

line here to teach H2O about our new algorithm.


Within each top-level directory, there is a fairly straightforward maven'ized directory structure:

src/main/java - Java source code


src/test/java - Java test code

In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:

src/main/java/water - Java core source code


src/main/java/hex - Java math source code

Ok, let's setup our IDE. For me, since I'm using the default IntelliJ IDEA setup:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradlew idea


...
:h2o-test-integ:idea
:h2o-web:ideaModule
:h2o-web:idea

BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

Running H2O-dev Tests in an IDE


Then I switched to IDEAJ from my command window. I launched IDEAJ, selected "Open Project", navigated to the h2o-dev/ directory and clicked Open. After IDEAJ opened, I clicked the Make project button (or Build/Make Project or ctrl-F9)

and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring up the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:

"C:\Program Files\Java\jdk1.7.0_67\bin\java" -agentlib:jdwp=transport=dt_socket....


Connected to the target VM, address: '127.0.0.1:51321', transport: 'socket'

11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started -----
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

11-08 13:17:07.722 192.168.1.2:54321 4576 main INFO: ###########################################################


11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test class name: hex.example.ExampleTest
11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test method name: testIris
11-08 13:17:07.724 192.168.1.2:54321 4576 main INFO: ###########################################################
Start Parse
11-08 13:17:08.198 192.168.1.2:54321 4576 FJ-0-7 INFO: Parse result for _85a160bc2419316580eeaab88602418e (150 rows):
11-08 13:17:08.204 192.168.1.2:54321 4576 FJ-0-7 INFO: Col type min max NAs consta
11-08 13:17:08.205 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_len: numeric 4.30000 7.90000
11-08 13:17:08.206 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_wid: numeric 2.00000 4.40000
11-08 13:17:08.207 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_len: numeric 1.00000 6.90000
11-08 13:17:08.208 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_wid: numeric 0.100000 2.50000
11-08 13:17:08.209 192.168.1.2:54321 4576 FJ-0-7 INFO: class: categorical 0.00000 2.00000
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Internal FluidVec compression/distribution summary:
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Chunk type count fraction size rel. size
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1 1 20.000 % 218 B 19.156 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1S 4 80.000 % 920 B 80.844 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Total memory usage : 1.1 KB

Done Parse: 488

11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 0


11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 1
11-08 13:17:08.305 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 2
11-08 13:17:08.306 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 3
11-08 13:17:08.307 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 4
11-08 13:17:08.308 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 5
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 6
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 7
11-08 13:17:08.310 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 8
11-08 13:17:08.311 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 9
11-08 13:17:08.315 192.168.1.2:54321 4576 main INFO: #### TEST hex.example.ExampleTest#testIris EXECUTION TIME: 00:00:00

Disconnected from the target VM, address: '127.0.0.1:51321', transport: 'socket'


Process finished with exit code 0

Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it
down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:

11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java

code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).

Building Our Algorithm: Copying from the Example


Now let's get our own algorithm framework to start playing with in place.
We want to do a grep, so let's call the code grep . I cloned the main code and model from the h2o-algos/src/main/java/hex/example/ directory into h2o-algos/src/main/java/hex/grep/ , and the test from h2o-algos/src/test/java/hex/example/ directory into h2o-algos/src/test/java/hex/grep/ .

Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their Grep* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its Grep variant. Finally, I renamed the
files and file contents from Example to Grep .


I also dove into h2o-app/src/main/java/water/H2OApp.java and copied the two Example lines and made their Grep variants.
Because I'm old-school, I did this with a combination of shell hacking and Emacs; about 5 minutes all told.
At this point, back in IDEAJ, I navigated to GrepTest.java , right-clicked debug-test testIris again - and was rewarded with
my Grep clone running a basic test. Not a very good grep, but definitely a start.

Whats in All Those Files?


What's in all those files? Mainly there is a Model and a ModelBuilder , and then some support files.
A model is a mathematical representation of the world, an effort to approximate some interesting fact with numbers.
It is a static concrete unchanging thing, completely defined by the rules (algorithm) and data used to make it.
A model-builder builds a model; it is transient and active. It exists as long as we are actively trying to make a model,
and is thrown away once we have the model in-hand.
In our case, we want the regular expression matches (as a String[] ) of the match pattern applied to a text file - a
mathematical result - so that belongs in the GrepModel.java file. The algorithm to run grep belongs in the
GrepModelBuilder.java file.

We also split Schemas from Models to isolate slow-moving external APIs from rapidly-moving internal APIs. As a Java dev,
you can hack the guts of Grep to your heart's content, including the inputs and outputs, as long as the externally facing V2
schemas do not change. If you want to report new stuff or take new parameters, you can make a new V3 schema (which is
not compatible with V2) for the new stuff. Old external V2 users will not be affected by your changes - you'll still have to
make the correct mappings in the V2 schema code from your V3 algorithm.
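
Purely as an illustration of that versioning idea (not something in this tutorial's code), a hypothetical V3 parameters schema could mirror the V2 pattern shown later in this tutorial; the ignore_case field and every name here are made up:

public static final class GrepParametersV3 extends ModelParametersSchema<GrepModel.GrepParameters, GrepParametersV3> {
  static public String[] own_fields = new String[] { "regex", "ignore_case" };
  // Input fields
  @API(help="regex") public String regex;
  @API(help="Case-insensitive matching (hypothetical, new in V3)") public boolean ignore_case;
}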
One other important hack: grep is an unsupervised algorithm - no training data (no "response") tells it what the results
"should" be - it's not really a machine-learning algorithm. So we need to hack the word Supervised out of all the various
class names it appears in. After this is done, your GrepTest probably fails to compile, because it is trying to set the
response column name in the test, and unsupervised models do not get a response to train with. Just delete the line for
now:

parms._response_column = "class";

At this point, we can run our test code again (still finding the max-per-column) instead of running grep.

The Grep Model


The Grep model, in the file GrepModel.java , should contain what we expect out of grep: the set of matches as a String[] .
Note that this decision limits us to a set of results that can be held on a single machine. H2O can represent a Big Data set
of results, but only in a distributed data Frame. It's not too hard to do, but the String[] is easy to work with, and good for a
few million results before it gets painful.
We also want the line numbers (harder to get), and the byte offset of the match (easier to get), and perhaps the whole
matching line. For this example we'll skip the line numbers for a bit, and just report byte offsets and the matched Strings.

public String[] _matches; // Our String matches


public long[] _offsets; // File offsets

Inside the GrepModel class, there is a class for the model's output: class GrepOutput . We'll put our grep results there. The

various support classes and files will make sure our model's output appears in the correct REST and JSON responses, and
gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that now.
My final GrepOutput class looks like:

public static class GrepOutput extends Model.Output {


public String[] _matches; // Our String matches
public long[] _offsets; // File offsets
public GrepOutput( Grep b ) { super(b); }
@Override public ModelCategory getModelCategory() { return Model.ModelCategory.Unknown; }
}

Now, let's turn to the input for our model-building process. These are stored in the class GrepModel.GrepParameters . We
already inherit an input dataset (returned with train() ), and some other helpers (e.g. which columns to ignore). For now,
we can ignore everything except the input dataset from train() .
However, we want some more parameters for grep: the regular expression. Define it next to the left-over _max_iters from
the old Example code (which we might as well also nuke):

public String _regex;

My final GrepParameters class looks like:

public static class GrepParameters extends Model.Parameters {


public String _regex; // The regex
}

A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance
whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private).
The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason
to use (or not use) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our
JSON output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new regex field to the external schema in h2o-algos/src/main/java/hex/schemas/GrepV2.java :

public static final class GrepParametersV2 extends ModelParametersSchema<GrepModel.GrepParameters, GrepParametersV2> {


static public String[] own_fields = new String[] { "regex" };
// Input fields
@API(help="regex") public String regex;
}

And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/GrepModelV2.java :

public static final class GrepModelOutputV2 extends ModelOutputSchema<GrepModel.GrepOutput, GrepModelOutputV2> {


// Output fields
// Assume small-data results: string matches only
@API(help="Matching strings") public String[] matches;
@API(help="Byte offsets of matches") public long[] offsets;

The Grep Model Builder


Let's turn to the Grep model builder, which includes some boilerplate we inherited from the old Example code, and a place
to put our real algorithm. There is a basic Grep constructor which calls init :

public Grep( ... ) { super("Grep",parms); init(false); }

In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap)
argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front-end sanity
parameters function as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on
while clicking in the browser". No computing grep in the init() call!
Speaking of the init() call, the one we got from the old Example code sanity checks the now-deleted _max_iters .
Let's replace that with some basic sanity checking:

@Override public void init(boolean expensive) {


super.init(expensive);
if( _parms._regex == null ) {
error("regex", "regex is missing");
} else {
try { Pattern.compile(_parms._regex); }
catch( PatternSyntaxException pse ) { error("regex", pse.getMessage()); }
}
if( _parms._train == null ) return;
Vec[] vecs = _parms.train().vecs();
if( vecs.length != 1 )
error("train","Frame must contain exactly 1 Vec (of raw text)");
if( !(vecs[0] instanceof ByteVec) )
error("train","Frame must contain exactly 1 Vec (of raw text)");
}

In the Grep.java file there is a trainModel call that is used when you really want to start running grep (as opposed to just
checking arguments). In our case, the old boilerplate starts a GrepDriver in a background thread. Not required, but for any
long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and from
REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class GrepDriver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.

Scope.enter();

Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:

_parms.lock_frames(Grep.this);

Locking prevents situations like accidentally deleting or loading a new dataset with the same name while grep is running.
Like the Scope.exit() above, we will unlock in the finally block. While it might be nice to use Java locking, or even JDK
5.0 locks, we need a distributed lock, which is not provided by JDK 5.0. Note that H2O locks are strictly cooperative - we
cannot enforce locking at the JVM level like the JVM does.

Next, we make an instance of our model object (with no result matches yet) and place it in the DKV, locked (e.g., to prevent
another user from overwriting our model-in-progress with an unrelated grep search).

model = new GrepModel(dest(), _parms, new GrepModel.GrepOutput(Grep.this));


model.delete_and_lock(_key);

Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.

The Grep Main Algorithm


Finally we get to where the Real Stuff is!
Grep can be computed in a variety of ways, but for this demo I'm going to stick with the classic JDK Pattern and Matcher
classes. Note that 30secs of grep'ing the internet (well, google'ing, pretty much the same thing) pulls out a dozen Regex
packages claiming to be 1 or 2 orders of magnitude faster. I didn't try them (tempted though), but they're definitely worth a deeper
look in the next iteration.
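
As a quick standalone reminder of how these JDK classes behave (plain Java, outside H2O; the sample string is made up), here is the same regex the GrepTest will use later:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
  public static void main(String[] args) {
    // Five consecutive doubled letters - the pattern GrepTest later runs over text8
    Pattern p = Pattern.compile("(?:(\\w)\\1){5}");
    Matcher m = p.matcher("a racoonnookkeeper walked by");  // made-up sample text
    while( m.find() )
      System.out.println(m.start() + ": " + m.group());     // prints: 5: oonnookkee
  }
}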
My Grep now has a leftover loop from the old Example code running up to some max iteration count. Let's nuke it and just
make a pass over the single Big Data column of raw text.

// Run the main Grep Loop


GrepGrep gg = new GrepGrep(_parms._regex).doAll(train().vecs()[0]);

I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. grep), and save the results back into the DKV, bump the progress bar, and log a little bit of progress. For this
very simple 'model' we don't need any extra iteration, and we'll do progress on a per-byte instead of per-iteration basis. I'm
just going to assume my not-yet-defined GrepGrep class just ends up with the results, and save them to the model now:

// Fill in the model


model._output._matches = Arrays.copyOf(gg._matches,gg._cnt);
model._output._offsets = Arrays.copyOf(gg._offsets,gg._cnt);
StringBuilder sb = new StringBuilder();
sb.append("Grep: ").append("\n");
sb.append(Arrays.toString(model._output._matches)).append("\n");
sb.append(Arrays.toString(model._output._offsets)).append("\n");
Log.info(sb);

The GrepGrep Main Class


And now we need to figure out what to do in our main worker class. This is the meat of Grep - for each byte, search for a regex
match.
Anything that starts out with the words "for each byte" when you have a few gigabytes of text needs to run in-parallel and
scale-out to have a chance of completing fast - and this is exactly what H2O is built for! So let's write code that runs scale-out
for-each-byte... and the easiest way to do that is with an H2O Map/Reduce job - an instance of MRTask. class GrepGrep
can be coded as an inner class to the GrepDriver class:

private class GrepGrep extends MRTask<GrepGrep> {


private final String _regex;
// Outputs, hopefully not too big for one machine!
String[] _matches;
long [] _offsets;
int _cnt;
GrepGrep( String regex ) { _regex = regex; }

@Override public void map( Chunk chk ) {


...
}
@Override public void reduce( GrepGrep gg ) {
...
}
}

A Quick H2O Map/Reduce Diversion


This isn't your Hadoop-Daddy's Map/Reduce. This is an in-memory super-fast map-reduce... where "super-fast" generally
means "memory bandwidth limited", often 1000x faster than the usual hadoop-variant - MRTasks can often touch a
gigabyte of data in a millisecond, or a terabyte in a second (depending on how much hardware is in your cluster - more
hardware is faster for the same amount of data!)
The map() call takes data in Chunks - where each Chunk is basically a small array-like slice of the Big Data. Data in
Chunks is accessed with basic at0 and set0 calls (vs accessing data in Vecs with at and set ). The output of a map()
is stored in the GrepGrep object itself, as a Plain Old Java Object (POJO). Each map() call has private access to its own
fields and Chunks , which implies there are lots of instances of GrepGrep objects scattered all over the cluster (one such
instance per Chunk of data... well, actually one instance per call to map() , but each map call is handed an aligned set of
Chunks, one per feature or column in the dataset).
Since there are lots of little GrepGreps running about, their results need to be combined. That's what reduce does - combine two GrepGrep s into one. Typically, you can do this by adding similar fields together - often array elements are
added side-by-side, similar to a saxpy operation.
All code here is written in a single-threaded style, even as it runs in parallel and distributed. H2O handles all the
synchronization issues.

GrepGrep
Back to class GrepGrep , we create holders for the results, compile a JDK regex Pattern object, then start looping over
Matches .

@Override public void map( Chunk chk ) {


_matches = new String[1]; // Result holders; will lazy expand
_offsets = new long [1];
ByteSeq bs = new ByteSeq(chk,chk.nextChunk());
Pattern p = Pattern.compile(_regex);
// We already checked that this is an instance of a ByteVec, which means
// all the Chunks contain raw text as byte arrays.
Matcher m = p.matcher(bs);
while( m.find() && m.start() < bs._bs0.length )
add(bs.str(m.start(),m.end()),chk.start()+m.start());
update(chk._len); // Whole chunk of work, done all at once
}

Note that I let the matches end in the second Chunk of data. All matches must start in the first Chunk of data; this prevents
me from counting every match twice (once in the first Chunk on a map call, once again in the second Chunk on some other
map call). Ultimately I limit matches to one Chunk's worth of raw text - about 4Megs for a single match - but they can span
a Chunk boundary.
I defined a simple class allowing a byte[] to be used as an instance of a CharSequence to pass to the Pattern.matcher call,
and a method to add (accumulate) results. Both are very simple. add accumulates results in the two result arrays,
doubling their size as needed (to keep asymptotic costs low):

private void add( String s, long off ) {


if( _cnt == _matches.length ) {
_matches = Arrays.copyOf(_matches,_cnt<<1);
_offsets = Arrays.copyOf(_offsets,_cnt<<1);
}
_matches[_cnt ] = s;
_offsets[_cnt++] = off;
}

The class ByteSeq wraps byte arrays in a simple interface implementation. There are two special H2O properties at work
here making this very efficient. First, raw text data is stored as... raw text data. The Chunk wrappers used to manipulate
the big data are very thin wrappers over raw text - and the underlying byte[] s are directly available. They are also lazily
loaded on demand from the file system. Editing the byte[] s can't be stopped with this direct access, and is Bad Coding
Style, and won't be reflected around the cluster, so Caveat Emptor. Nonetheless, pure read operations can get direct
access.
The other special H2O property is a little harder to explain, but just as crucial - H2O does data-placement such that
accessing the elements past the end of one Chunk and into another guarantees that the next Chunk in line is statistically
very likely to be locally available. If the data was pseudo-randomly distributed, you might expect that asking for a neighbor
Chunk has poor odds of being local (for a size N cluster, 1/N Chunks would be local). Instead, something like 90% of next-Chunks are adjacent. For Chunks that are not adjacent, you'll get a copy (from the ubiquitous DKV) - meaning that some fraction
of the data is replicated to cover the edge cases. Basically, you get to ignore the edge case, and have your grep run off the
end of one Chunk and into the next, for free.

private class ByteSeq implements CharSequence {


private final byte _bs0[], _bs1[];
ByteSeq( Chunk chk0, Chunk chk1 ) { _bs0 = chk0.getBytes(); _bs1 = chk1==null ? null : chk1.getBytes(); }
@Override public char charAt(int idx ) {
return (char)(idx < _bs0.length ? _bs0[idx] : _bs1[idx-_bs0.length]);
}
@Override public int length( ) { return _bs0.length+(_bs1==null?0:_bs1.length); }
@Override public ByteSeq subSequence( int start, int end ) { throw H2O.unimpl(); }
@Override public String toString() { throw H2O.unimpl(); }
String str( int s, int e ) { return new String(_bs0,s,e-s); }
}

And that ends the map call and its helpers in the GrepGrep main work loop.
We also need a reduce to fold together the returned results: the _matches and the _offsets and the _cnt . I'm using an
add call again to make this easier. First I swap the larger set to the left, then I add the smaller set to the larger, then I make

sure all results go back into this object.

@Override public void reduce( GrepGrep gg1 ) {


GrepGrep gg0 = this;
if( gg0._cnt < gg1._cnt ) { gg0 = gg1; gg1 = this; } // Larger result on left
for( int i=0; i<gg1._cnt; i++ )
gg0.add(gg1._matches[i], gg1._offsets[i]);
if( gg0 != this ) {
_matches = gg0._matches;
_offsets = gg0._offsets;
_cnt = gg0._cnt;
}
}

And that completes the Big Data portion of Grep.

Running Grep
Back to the class GrepTest , I changed the file to something bigger - bigdata/laptop/text8.gz . You can get it via gradlew
syncBigdataLaptop which pulls down medium-large data from the Amazon cloud. Since we didn't add a GUnzip step, I

manually unzipped text8.txt to get a 100Mb file. GrepTest looks for words with 5 or more letter-pairs in this file.
Running the GrepTest returns:

11-15 22:12:58.489 192.168.1.11:54321 4740 FJ-0-7 INFO: Grep:


11-15 22:12:58.489 192.168.1.11:54321 4740 FJ-0-7 INFO: [tttttttttt, ttttttcccc, mmmmmmmmmm, nnmmmmmmmm, mmmmmmmmmm, nnmmmmmmmm
11-15 22:12:58.489 192.168.1.11:54321 4740 FJ-0-7 INFO: [7113831, 7113936, 7114204, 7114223, 7114233, 7114287, 7114297, 7114307

Takes about 5 seconds to run the regex (?:(\\w)\\1){5} on 100Mb; not too bad. Looking at the answers I see some funny
words there. I checked with grep -P '(?:(\\w)\\1){5}' text8.txt to see that I was getting the same answers - one of the
words is "racoonnookkeeper" - definitely a bit of a stretch! grep -P is somewhat faster than the JDK Pattern class; even
though H2O was 4-way parallel on my laptop, the underlying JDK is too slow.
As an obvious extension, it would be nice to download one of the other Java regex solutions I found on the web; most
declared themselves to be vastly faster than the JDK code. Other cool hacks would include getting the line number, and
perhaps returning the entire matching line.
Good luck with your own H2O algorithm,
Cliff


Quantiles
Hacking Algorithms into H2O: Quantiles
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All three tutorials start out the same: getting the h2o-dev
code and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example,
and then the content is customized for each algorithm. This tutorial describes computing Quantiles.

What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST & JSON support on your new algorithm. You still have to write the math, of
course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.

Building H2O-dev
I assume you are familiar with basic Java development and how github repos work - so we'll start with a clean github repo
of h2o-dev:

C:\Users\cliffc\Desktop> mkdir my-h2o


C:\Users\cliffc\Desktop> cd my-h2o
C:\Users\cliffc\Desktop\my-h2o> git clone https://github.com/0xdata/h2o-dev

This will download the h2o-dev source base; took about 30secs for me from home onto my old-school Windows 7 box.
Then do an initial build:

C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE

:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradle --daemon build -x test


...
:h2o-web:signArchives SKIPPED
:h2o-web:assemble
:h2o-web:check
:h2o-web:build
BUILD SUCCESSFUL
Total time: 1 mins 44.645 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.

While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, distributed Key-Value store, the

web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example

algorithm and turning it into a quantiles algorithm.


h2o-web : The web interface and JavaScript. We will use jar files from here in our project, but probably won't need to look

at the code.
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one

line here to teach H2O about our new algorithm.


Within each top-level directory, there is a fairly straightforward maven'ized directory structure:

src/main/java - Java source code


src/test/java - Java test code

In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:

src/main/java/water - Java core source code


src/main/java/hex - Java math source code

Ok, let's setup our IDE. For me, since I'm using the default IntelliJ IDEA setup:

C:\Users\cliffc\Desktop\my-h2o\h2o-dev> gradlew idea


...
:h2o-test-integ:idea
:h2o-web:ideaModule
:h2o-web:idea

BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>

Running H2O-dev Tests in an IDE


Then I switched to IDEAJ from my command window. I launched IDEAJ, selected "Open Project", navigated to the h2o-dev/ directory and clicked Open. After IDEAJ opened, I clicked the Make project button (or Build/Make Project or ctrl-F9)

and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring up the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:

"C:\Program Files\Java\jdk1.7.0_67\bin\java" -agentlib:jdwp=transport=dt_socket....


Connected to the target VM, address: '127.0.0.1:51321', transport: 'socket'

11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started -----
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

11-08 13:17:07.722 192.168.1.2:54321 4576 main INFO: ###########################################################


11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test class name: hex.example.ExampleTest
11-08 13:17:07.723 192.168.1.2:54321 4576 main INFO: * Test method name: testIris
11-08 13:17:07.724 192.168.1.2:54321 4576 main INFO: ###########################################################
Start Parse
11-08 13:17:08.198 192.168.1.2:54321 4576 FJ-0-7 INFO: Parse result for _85a160bc2419316580eeaab88602418e (150 rows):
11-08 13:17:08.204 192.168.1.2:54321 4576 FJ-0-7 INFO: Col type min max NAs consta
11-08 13:17:08.205 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_len: numeric 4.30000 7.90000
11-08 13:17:08.206 192.168.1.2:54321 4576 FJ-0-7 INFO: sepal_wid: numeric 2.00000 4.40000
11-08 13:17:08.207 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_len: numeric 1.00000 6.90000
11-08 13:17:08.208 192.168.1.2:54321 4576 FJ-0-7 INFO: petal_wid: numeric 0.100000 2.50000
11-08 13:17:08.209 192.168.1.2:54321 4576 FJ-0-7 INFO: class: categorical 0.00000 2.00000
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Internal FluidVec compression/distribution summary:
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Chunk type count fraction size rel. size
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1 1 20.000 % 218 B 19.156 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: C1S 4 80.000 % 920 B 80.844 %
11-08 13:17:08.212 192.168.1.2:54321 4576 FJ-0-7 INFO: Total memory usage : 1.1 KB

Done Parse: 488

11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 0


11-08 13:17:08.304 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 1
11-08 13:17:08.305 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 2
11-08 13:17:08.306 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 3
11-08 13:17:08.307 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 4
11-08 13:17:08.308 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 5
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 6
11-08 13:17:08.309 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 7
11-08 13:17:08.310 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 8
11-08 13:17:08.311 192.168.1.2:54321 4576 FJ-0-7 INFO: Example: iter: 9
11-08 13:17:08.315 192.168.1.2:54321 4576 main INFO: #### TEST hex.example.ExampleTest#testIris EXECUTION TIME: 00:00:00

Disconnected from the target VM, address: '127.0.0.1:51321', transport: 'socket'


Process finished with exit code 0

Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it
down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:

11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]

This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).

Building Our Algorithm: Copying from the Example


Now let's get our own algorithm framework to start playing with in place.
We want to compute Quantiles, so let's call the code quantile . I cloned the main code and model from the h2o-algos/src/main/java/hex/example/ directory into h2o-algos/src/main/java/hex/quantile/ , and the test from h2o-algos/src/test/java/hex/example/ directory into h2o-algos/src/test/java/hex/quantile/ .

Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their Quantile* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its Quantile variant. Finally, I renamed the files and file contents from Example to Quantile .


I also dove into h2o-app/src/main/java/water/H2OApp.java and copied the two Example lines and made their Quantile
variants. Because I'm old-school, I did this with a combination of shell hacking and Emacs; about 5 minutes all told.
At this point, back in IntelliJ IDEA, I navigated to QuantileTest.java , right-clicked debug-test testIris again - and was rewarded
with my Quantile clone running a basic test. Not a very good Quantile, but definitely a start.

What's in All Those Files?


What's in all those files? Mainly there is a Model and a ModelBuilder , and then some support files.
A model is a mathematical representation of the world, an effort to approximate some interesting fact with numbers.
It is a static concrete unchanging thing, completely defined by the rules (algorithm) and data used to make it.
A model-builder builds a model; it is transient and active. It exists as long as we are actively trying to make a model,
and is thrown away once we have the model in-hand.
In our case, we want quantiles as a result - a mathematical representation of the world - so that belongs in the
QuantileModel.java file. The algorithm to compute quantiles belongs in the QuantileModelBuilder.java file.
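To make the split concrete, here is a rough skeleton of the two classes (a sketch condensed from the code shown later in this chapter; the exact generic parameters on Model and ModelBuilder are an assumption, not copied from the repo):

// Sketch only - generic parameters assumed, bodies elided.
public class QuantileModel extends Model<QuantileModel, QuantileModel.QuantileParameters, QuantileModel.QuantileOutput> {
  public static class QuantileParameters extends Model.Parameters { /* inputs, e.g. _probs */ }
  public static class QuantileOutput     extends Model.Output     { /* results, e.g. _quantiles, _iters */ }
}
public class Quantile extends ModelBuilder<QuantileModel, QuantileModel.QuantileParameters, QuantileModel.QuantileOutput> {
  // transient and active: exists only while we are building a QuantileModel
}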

We also split Schemas from Models to isolate slow-moving external APIs from rapidly-moving internal APIs. As a Java dev,
you can hack the guts of Quantile to your heart's content, including the inputs and outputs, as long as the externally facing
V2 schemas do not change. If you want to report new stuff or take new parameters, you can make a new V3 schema
(which is not compatible with V2) for the new stuff. Old external V2 users will not be affected by your changes - you'll still
have to make the correct mappings in the V2 schema code from your V3 algorithm.
One other important hack: quantiles is an unsupervised algorithm - no training data (no "response") tells it what the results
"should" be. So we need to hack the word Supervised out of all the various class names it appears in. After this is done,
your QuantileTest probably fails to compile, because it is trying to set the response column name in the test, and
unsupervised models do not get a response to train with. Just delete the line for now:

parms._response_column = "class";

At this point, we can run our test code again (still finding the max-per-column).

The Quantile Model


The Quantile model, in the file QuantileModel.java , should contain what we expect out of quantiles: the quantile value - a
single double . Usually people request a set of Q probabilities (e.g., the 1%, 5%, 25% and 50% probabilities), so we'll report
a double[/*Q*/] of quantiles. Also, we'll report them for all N columns in the dataset, so it will really be a double[/*N*/]
[/*Q*/] . For our Q probabilities, this will be:

public double _quantiles[/*N*/][/*Q*/]; // Our N columns, Q quantiles reported

Inside the QuantileModel class, there is a class for the model's output: class QuantileOutput . We'll put our quantiles there.
The various support classes and files will make sure our model's output appears in the correct REST & JSON responses,
and gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that
now.
And finally a quick report on the effort used to build the model: the number of iterations we actually took. Our quantiles will
run in iterations, improving with each iteration until we get the exact answer. I'll describe the algorithm in detail below, but it's basically an aggressive radix sort.


My final QuantileOutput class looks like:

public static class QuantileOutput extends Model.Output {


public int _iters; // Iterations executed
public double _quantiles[/*N*/][/*Q*/]; // Our N columns, Q quantiles reported
public QuantileOutput( Quantile b ) { super(b); }
@Override public ModelCategory getModelCategory() { return Model.ModelCategory.Unknown; }
}

Now, let's turn to the input for our model-building process. These are stored in the class QuantileModel.QuantileParameters .
We already inherit an input training dataset (returned with train() ), and some other helpers (e.g. which columns to ignore
if we do an expensive scoring step with each iteration). For now, we can ignore everything except the input dataset from
train() .

However, we want some more parameters for quantiles: the set of probabilities. Let's also put in some default probabilities
(the GUI & REST/JSON layer will let these be overridden, but it's nice to have some sane defaults). Define it next to the
left-over _max_iters from the old Example code (which we might as well also nuke):

// Set of probabilities to compute


public double _probs[/*Q*/] = new double[]{0.01,0.05,0.10,0.25,0.333,0.50,0.667,0.90,0.95,0.99};

My final QuantileParameters class looks like:

public static class QuantileParameters extends Model.Parameters {


// Set of probabilities to compute
public double _probs[/*Q*/] = new double[]{0.01,0.05,0.10,0.25,0.333,0.50,0.667,0.90,0.95,0.99};
}

A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance
whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private).
The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason
to use (or not use) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our
JSON output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new probabilities field to the external schema in h2o-algos/src/main/java/hex/schemas/QuantileV2.java :

public static final class QuantileParametersV2 extends ModelParametersSchema<QuantileModel.QuantileParameters, QuantileParametersV2> {


static public String[] own_fields = new String[] {"probs"};
// Input fields
@API(help="Probabilities for quantiles") public double probs[];
}

And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/QuantileModelV2.java :

public static final class QuantileModelOutputV2 extends ModelOutputSchema<QuantileModel.QuantileOutput, QuantileModelOutputV2> {


// Output fields
@API(help="Iterations executed") public int iters;
@API(help="Quantiles per column") public double quantiles[/*Q*/][/*N*/];

Quantiles

142

H2O World 2014 Training

The Quantile Model Builder


Let's turn to the Quantile model builder, which includes some boilerplate we inherited from the old Example code, and a
place to put our real algorithm. There is a basic Quantile constructor which calls init :

public Quantile( ... ) { super("Quantile",parms); init(false); }

In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap)
argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front end sanity-check parameters as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on while clicking in the browser". No computing quantiles in the init() call!
Speaking of the init() call, the one we got from the old Example code only sanity-checks the now-deleted _max_iters .
Let's add some lines to check that our _probs are sane:

for( double p : _parms._probs )


if( p < 0.0 || p > 1.0 )
error("probs","Probabilities must be between 0 and 1 ");

In the Quantile.java file there is a trainModel call that is used when you really want to start running quantiles (as opposed
to just checking arguments). In our case, the old boilerplate starts a QuantileDriver in a background thread. Not required,
but for any long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and
from REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class QuantileDriver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.

Scope.enter();

Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:

_parms.lock_frames(Quantile.this);

Locking prevents situations like accidentally deleting or loading a new dataset with the same name while quantiles is
running. Like the Scope.exit() above, we will unlock in the finally block. While it might be nice to use Java locking, or
even JDK 5.0 locks, we need a distributed lock, which is not provided by JDK 5.0. Note that H2O locks are strictly
cooperative - we cannot enforce them at the JVM level the way the JVM enforces its own locks.
Next, we make an instance of our model object (with no quantiles computed yet) and place it in the DKV, locked (e.g., to prevent
another user from overwriting our model-in-progress with an unrelated model).

model = new QuantileModel(dest(), _parms, new QuantileModel.QuantileOutput(Quantile.this));


model.delete_and_lock(_key);
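Putting this boilerplate together, the driver's compute2() has roughly the following shape (a sketch assembled from the snippets above, not the actual file; the exact unlock calls in the finally block are paraphrased as comments because the text does not spell them out):

Scope.enter();                          // start tracking temp Keys in the DKV
try {
  _parms.lock_frames(Quantile.this);    // cooperative lock on the input frames
  model = new QuantileModel(dest(), _parms, new QuantileModel.QuantileOutput(Quantile.this));
  model.delete_and_lock(_key);          // lock the model-in-progress in the DKV
  // ... the main Quantile loop goes here ...
} finally {
  // unlock the model and the input frames here, then:
  Scope.exit();                         // wipe out any temp Keys created above
}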


Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.

The Quantile Main Algorithm


Finally we get to where the Math is!
Quantiles can be computed in a variety of ways, but the most common starting point is to sort the elements - then, finding
the Nth element is easy. Alas, sorting large datasets is expensive. Also, sorting provides a total order on the data, which is
overkill for what we need. Instead, we'll do a couple of rounds of a radix-sort, refining our solution with each iteration until
we get the exact answer. The key metric in a radix sort is the number of bins used; we want all our bins to fit in cache,
probably even L1 cache, so we'll start with limiting ourselves to 1024 bins.
Given a histogram / radix-sort of 1024 bins on a billion rows, each bin will represent approximately 1 million rows. We then
pick the bin holding the Nth element, and re-bin / re-histogram / re-radix-sort another 1024 bins refining the bin holding the
Nth element. We expect this bin to hold about 1000 elements. Finding the Nth element in a billion rows then probably takes
3 or 4 passes (worst case is sorta bad: 100 passes), and each pass will be really fast.
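A quick back-of-the-envelope check of that claim (my own arithmetic, not from the original text): each pass narrows the candidate range by a factor of roughly NBINS, so the number of passes needed grows with the log (base NBINS) of the row count.

long nrows  = 1_000_000_000L;   // a billion rows
int  nbins  = 1024;
int  passes = (int) Math.ceil( Math.log((double)nrows) / Math.log(nbins) );  // == 3
// pass 1: ~1,000,000 rows per candidate bin; pass 2: ~1,000; pass 3: ~1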
Basically, we're gonna fake a sort - we'll end up with the elements for the exact row numbers we want for each probability
(and not all the other row numbers). E.g., if the probability of 0.4567 on a billion row dataset needs the element (in sorted
order) for row 456,700,000 - we'll get it (this element is not related to the unsorted row number 456,700,000).
Besides a classic histogram (element counts per bucket), we will also need an actual value - if there is a unique one. This is
always true if we only have one element in a bucket, but it might also be true if we have a long run of the same value (e.g.,
lots of the year 2014's in our dataset).
My Quantile now has a leftover loop from the old Example code running up to some max iteration count. Let's nuke it and
just run a loop over all the dataset columns.

// Run the main Quantile Loop


Vec vecs[] = train().vecs();
for( int n=0; n<vecs.length; n++ ) {
...
}

Let's also stop if somebody clicks the "cancel" button in the GUI:

for( int n=0; n<vecs.length; n++ ) {


if( !isRunning() ) return; // Stopped/cancelled
...
}

I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. quantiles), and save the results back into the DKV, bump the progress bar, and log a little bit of progress:

// Update the model


model._output._quantiles = ????? // we need to figure these out
model._output._iters++; // One iter per-prob-per-column
model.update(_key); // Update model in K/V store
update(1); // One unit of work in the GUI progress bar
StringBuilder sb = new StringBuilder();
sb.append("Quantile: iter: ").append(model._output._iters);
Log.info(sb);

The Quantile Main Loop



And now we need to figure out what to do in our main loop. Somewhere between the loop-top isRunning check and the model.update() call, we need to compute something to update our model with! This is the meat of Quantile - for each point, bin it - build a histogram. With the histogram in hand, we see if we can compute the exact quantile; if we cannot, we then build a new refined histogram from the bounds computed in the prior histogram.
Anything that starts out with the words "for each point" when you have a billion points needs to run in-parallel and scale-out to have a chance of completing fast - and this is exactly what H2O is built for! So let's write code that runs scale-out for-each-point... and the easiest way to do that is with an H2O Map/Reduce job - an instance of MRTask. For Quantile, this is an instance of a histogram. We'll call it from the main-loop like this, and define it below (extra lines included so you can see how it fits):

if( !isRunning() ) return; // Stopped/cancelled


Vec vec = vecs[n];
// Compute top-level histogram
Histo h1 = new Histo(/*bounds for the full Vec*/).doAll(vec); // Need to figure this out???
// For each probability, see if we have it exactly - or else run
// passes until we do.
for( int p = 0; p < _parms._probs.length; p++ ) {
double prob = _parms._probs[p];
Histo h = h1; // Start from the first global histogram
while( h.not_have_exact_quantile??? ) // Need to figure this out???
h = h.refine_histogram???? // Need to figure this out???
// Update the model
model._output._iters++; // One iter per-prob-per-column
model.update(_key); // Update model in K/V store
update(1); // One unit of work
}
StringBuilder sb = new StringBuilder();
sb.append("Quantile: iter: ").append(model._output._iters).append(" Qs=").append(Arrays.toString(model._output._quantiles[n]));
Log.info(sb);

Basically, we just called some not-yet-defined Histo code on the entire Vec (column) of data, then looped over each
probability. If we can compute the quantile exactly for the probability, great! If not, we build and run another histogram over
a refined range, repeating until we can compute the exact histogram. I also printed out the quantiles per Vec , so we can
watch the progress over time. Now class Histo can be coded as an inner class to the QuantileDriver class:

class Histo extends MRTask<Histo> {


private static final int NBINS=1024; // Default bin count
private final int _nbins; // Actual bin count
private final double _lb; // Lower bound of bin[0]
private final double _step; // Step-size per-bin
private final long _start_row; // Starting row number for this lower-bound
private final long _nrows; // Total datasets rows
// Big Data output result
long _bins [/*nbins*/]; // Rows in each bin
double _elems[/*nbins*/]; // Unique element, or NaN if not unique
private Histo( double lb, double ub, long start_row, long nrows ) {
_nbins = NBINS;
_lb = lb;
_step = (ub-lb)/_nbins;
_start_row = start_row;
_nrows = nrows;
}
@Override public void map( Chunk chk ) {
...
}
@Override public void reduce( Histo h ) {
...
}
}


A Quick H2O Map/Reduce Diversion


This isn't your Hadoop-Daddy's Map/Reduce. This is an in-memory super-fast map-reduce... where "super-fast" generally
means "memory bandwidth limited", often 1000x faster than the usual hadoop-variant - MRTasks can often touch a
gigabyte of data in a millisecond, or a terabyte in a second (depending on how much hardware is in your cluster - more
hardware is faster for the same amount of data!)
The map() call takes data in Chunks - where each Chunk is basically a small array-like slice of the Big Data. Data in
Chunks is accessed with basic at0 and set0 calls (vs accessing data in Vecs with at and set ). The output of a map()
is stored in the Histo object itself, as a Plain Old Java Object (POJO). Each map() call has private access to its own fields
and Chunks , which implies there are lots of instances of Histo objects scattered all over the cluster (one such instance per
Chunk of data... well, actually one instance per call to map() , but each map call is handed an aligned set of Chunks, one

per feature or column in the dataset).


Since there are lots of little Histos running about, their results need to be combined. That's what reduce does - combine
two Histo s into one. Typically, you can do this by adding similar fields together - often array elements are added side-by-side, similar to a saxpy operation.
All code here is written in a single-threaded style, even as it runs in parallel and distributed. H2O handles all the
synchronization issues.
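As a warm-up before the Histo code, here is a minimal sketch of the pattern just described - an MRTask that sums one numeric column (my own illustration; it is not part of the Quantile code, but it uses only the calls shown in this tutorial: map , reduce , doAll , at0 ):

class SumTask extends MRTask<SumTask> {
  double _sum;                               // per-map partial result, merged in reduce()
  @Override public void map( Chunk chk ) {
    for( int row=0; row<chk._len; row++ )
      _sum += chk.at0(row);                  // at0: chunk-local row access
  }
  @Override public void reduce( SumTask that ) {
    _sum += that._sum;                       // fold two partial results into one
  }
}
// usage: double total = new SumTask().doAll(vec)._sum;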

Histogram
Back to class Histo , we create arrays to hold our results and loop over the data:

@Override public void map( Chunk chk ) {


long bins [] = _bins =new long [_nbins];
double elems[] = _elems=new double[_nbins];
for( int row=0; row<chk._len; row++ ) {
double d = chk.at0(row);
....
}
}

Then we need to find the correct bin (simple linear interpolation). If the bin is in-range, increment the bin count. Note that if a value is missing, it will be represented as a NaN; the computation of idx will then be a NaN, the range test will fail, and no bin will be incremented.

for( int row=0; row<chk._len; row++ ) {


double d = chk.at0(row);
double idx = (d-_lb)/_step;
if( !(0.0 <= idx && idx < bins.length) ) continue;
int i = (int)idx;
...
bins[i]++;
}

Also gather a unique element in the bin, if there is a unique element (otherwise, use NaN).

int i = (int)idx;
if( bins[i]==0 ) elems[i] = d; // Capture unique value
else if( !Double.isNaN(elems[i]) && elems[i]!=d )
elems[i] = Double.NaN; // Not unique
bins[i]++; // Bump row counts


And that ends the map call and the Histo main work loop. To recap, here it is all at once:

@Override public void map( Chunk chk ) {


long bins [] = _bins =new long [_nbins];
double elems[] = _elems=new double[_nbins];
for( int row=0; row<chk._len; row++ ) {
double d = chk.at0(row);
double idx = (d-_lb)/_step;
if( !(0.0 <= idx && idx < bins.length) ) continue;
int i = (int)idx;
if( bins[i]==0 ) elems[i] = d; // Capture unique value
else if( !Double.isNaN(elems[i]) && elems[i]!=d )
elems[i] = Double.NaN; // Not unique
bins[i]++; // Bump row counts
}
}

The reduce needs to fold together the returned results: the _elems and the _bins . It's a bit tricky for the unique elements - either the left Histo or the right Histo might have zero, one, or many elements, and if they both have a unique element, it might not be the same unique element.

@Override public void reduce( Histo h ) {


for( int i=0; i<_nbins; i++ ) // Keep unique elements
if( _bins[i]== 0 ) _elems[i] = h._elems[i]; // Left had none, so keep right unique
else if( h._bins[i] > 0 && _elems[i] != h._elems[i] )
_elems[i] = Double.NaN; // Left & right both had elements, but not equal
ArrayUtils.add(_bins,h._bins);
}

And that completes the Big Data portion of Quantile.

From Histogram to Quantile


Now we get to the nit-picky part of Quantiles. This isn't Big Data, but it is Math. Gotta get it right! We have this code we
need to figure out:

while( h.not_have_exact_quantile??? ) // Need to figure this out???


h = h.refine_histogram???? // Need to figure this out???

Let's make a function to return the exact Quantile (if we can compute it, or use NaN otherwise). We'll call it like this,
capturing the quantile in the model output as we test for NaN:

while( Double.isNaN(model._output._quantiles[n][p] = h.findQuantile(prob)) )


h = h.refine_histogram???? // Need to figure this out???

For the findQuantile function we will:


Find the fractional (sorted) row number for the given probability.
Find the lower actual integral row number. If the probability evenly divides the dataset, then we want the element for
the sorted row. If not, we need both elements on either side of the fractional row number.
Find the histogram bin for the lower row.
Find the unique element for this bin, or return a NaN (need another refinement pass)
Repeat for the higher row number: find the bin, find the unique element or return NaN.
With the two values spanning the desired probability in-hand, compute the quantile. We'll use R's Type-7 linear
interpolation, but we could just as well use any of the other types.


Here's my findQuantile code:

double findQuantile( double prob ) {


double p2 = prob*(_nrows-1); // Desired fractional row number for this probability
long r2 = (long)p2; // Lower integral row number
int loidx = findBin(r2); // Find bin holding low value
double lo = (loidx == _nbins) ? binEdge(_nbins) : _elems[loidx];
if( Double.isNaN(lo) ) return Double.NaN; // Needs another pass to refine lo
if( r2==p2 ) return lo; // Exact row number? Then quantile is exact
long r3 = r2+1; // Upper integral row number
int hiidx = findBin(r3); // Find bin holding high value
double hi = (hiidx == _nbins) ? binEdge(_nbins) : _elems[hiidx];
if( Double.isNaN(hi) ) return Double.NaN; // Needs another pass to refine hi
return computeQuantile(lo,hi,r2,_nrows,prob);
}

And a couple of simple helper functions:

private double binEdge( int idx ) { return _lb+_step*idx; }


// bin for row; can be _nbins if just off the end (normally expect 0 to nbins-1)
private int findBin( long row ) {
long sum = _start_row;
for( int i=0; i<_nbins; i++ )
if( row < (sum += _bins[i]) )
return i;
return _nbins;
}

Finally, the computeQuantile call:

static double computeQuantile( double lo, double hi, long row, long nrows, double prob ) {
if( lo==hi ) return lo; // Equal; pick either
// Unequal, linear interpolation
double plo = (double)(row+0)/(nrows-1); // Note that row numbers are inclusive on the end point, means we need a -1
double phi = (double)(row+1)/(nrows-1); // Passed in the row number for the low value, high is the next row, so +1
assert plo <= prob && prob < phi;
return lo + (hi-lo)*(prob-plo)/(phi-plo); // Classic linear interpolation
}
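A tiny worked example (my own numbers, not from the text) to sanity-check the interpolation: with nrows=5 and prob=0.10, the desired fractional row is 0.10*(5-1) = 0.4, so plo = 0/4 = 0.0, phi = 1/4 = 0.25, and the result lands 40% of the way from lo to hi:

double q = computeQuantile( /*lo=*/1.0, /*hi=*/2.0, /*row=*/0, /*nrows=*/5, /*prob=*/0.10 );
// q == 1.0 + (2.0-1.0) * (0.10-0.0)/(0.25-0.0) == 1.4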

Refining a Histogram
Now we need to define how to refine a Histogram. Let's just admit it needs a Big Data pass up front and call the doAll() in
the driver loop directly:

while( Double.isNaN(model._output._quantiles[n][p] = h.findQuantile(prob)) )


h = h.refinePass(prob).doAll(vec); // Full pass at higher resolution

Then the refinePass call needs to make a new Histogram from the old one, with refined endpoints. The original Histogram
had endpoints of vec.min() and vec.max() , but here we'll use endpoints from the same bins that findQuantile uses.

Histo refinePass( double prob ) {


double prow = prob*(_nrows-1); // Desired fractional row number for this probability
long lorow = (long)prow; // Lower integral row number
int loidx = findBin(lorow); // Find bin holding low value
// If loidx is the last bin, then high must be also the last bin - and we
// have an exact quantile (equal to the high bin) and we didn't need
// another refinement pass


assert loidx < _nbins;


double lo = binEdge(loidx); // Lower end of range to explore
// If probability does not hit an exact row, we need the elements on
// either side - so the next row up from the low row
long hirow = lorow==prow ? lorow : lorow+1;
int hiidx = findBin(hirow); // Find bin holding high value
// Upper end of range to explore - except at the very high end cap
double hi = hiidx==_nbins ? binEdge(_nbins) : binEdge(hiidx+1);
long sum = _start_row; // Compute adjusted starting row for this histogram
for( int i=0; i<loidx; i++ )
sum += _bins[i];
return new Histo(lo,hi,sum,_nrows);
}

Running Quantile
Running the QuantileTest returns:

11-13 12:19:04.179 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 11 Qs=[4.4, 4.6000000000000005, 4.800000000000001, 5.10
11-13 12:19:04.183 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 22 Qs=[2.2, 2.345, 2.5, 2.8000000000000003, 2.900000000
11-13 12:19:04.187 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 33 Qs=[1.1490000000000002, 1.3, 1.4000000000000001, 1.6
11-13 12:19:04.191 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 44 Qs=[0.1, 0.2, 0.2, 0.30000000000000004, 0.8468, 1.3,
11-13 12:19:04.194 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 55 Qs=[0.0, 0.0, 0.0, 0.0, 0.6169999999999999, 1.0, 1.3

You can see the Quantiles-per-column (and there are 11 of them by default, so the iteration printout bumps by 11's). I
checked these numbers vs R's default Type 7 quantiles for the iris dataset and got the same numbers.
Iris is too small to trigger the refinement pass, so I also tested on covtype.data - an old forest covertype dataset with about
a half-million rows. Took about 0.7 seconds on my laptop to compute 11 quantiles exactly, on 55 columns and 581000 rows.
Good luck with your own H2O algorithm,
Cliff


Build with Sparkling Water


This short tutorial introduces the Sparkling Water project, which enables execution of the H2O platform on top of Spark.
The tutorial guides you through building the project and demonstrates the capabilities of Sparkling Water on an example that uses the
Spark Shell to build a Deep Learning model with the help of H2O algorithms and Spark data wrangling.

Project Repository
The Sparkling Water project is hosted on GitHub. To clone it:

$ git clone https://github.com/0xdata/sparkling-water.git


$ cd sparkling-water

The repository is also already prepared on the provided sandbox image:

cd ~/devel/sparkling-water

Build the Project


The provided top-level gradlew command is used for building:

$ ./gradlew build

The command produces an assembly artifact which can be executed directly on the Spark platform.

Run Sparkling Water


Export your Spark distribution location
$ export SPARK_HOME=/opt/spark

Note: The provided sandbox image already contains the exported shell variable SPARK_HOME

Run Sparkling Water


Start a Spark cluster with Sparkling Water:

bin/run-sparkling.sh

Go to http://localhost:54321/steam/index.html to access H2O web interface.


Note: By default, the command creates a Spark cluster specified by the Spark master address local-cluster[3,2,1024]
(3 worker nodes, each with 2 cores and 1024 MB of memory), with each node also running H2O services.

Run Sparkling Shell


Start Spark shell with Sparkling Water:

bin/sparkling-shell

Note: The command launches a regular Spark shell with H2O services. To access the Spark UI, go to
http://localhost:4040, or to access H2O web UI, go to http://localhost:54321/steam/index.html.

Step-by-Step Example
1. Run Sparkling shell with an embedded cluster consisting of 3 Spark worker nodes:

export MASTER="local-cluster[3,2,1024]"
bin/sparkling-shell

2. You can go to http://localhost:4040/ to see the Sparkling shell (i.e., Spark driver) status.

3. Create an H2O cloud using all 3 Spark workers:

import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
val h2oContext = new H2OContext(sc).start()
import h2oContext._

4. Load weather data for Chicago international airport (ORD) with help from the regular Spark RDD API:

val weatherDataFile = "/data/h2o-training/sparkling-water/weather_ORD.csv"


val wrawdata = sc.textFile(weatherDataFile,3).cache()
// Parse and skip rows composed only of NAs
val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())


5. Load and parse flight data using H2O's API:

import java.io.File
val dataFile = "/data/h2o-training/sparkling-water/allyears2k_headers.csv.gz"
// Load and parse using H2O parser
val airlinesData = new DataFrame(new File(dataFile))

6. Go to H2O web UI and explore data:

7. Select flights with a destination in Chicago (ORD) with help from the Spark API:

val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)


val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))

8. Compute the number of these flights:

flightsToORD.count

9. Use Spark SQL to join the flight data with the weather data:

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")

10. Perform SQL JOIN on both tables:

val bigTable = sql(


"""SELECT


|f.Year,f.Month,f.DayofMonth,
|f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
|f.UniqueCarrier,f.FlightNum,f.TailNum,
|f.Origin,f.Distance,
|w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
|f.ArrDelay
|FROM FlightsToORD f
|JOIN WeatherORD w
|ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)

11. Run deep learning to produce a model estimating arrival delays:

import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
val dlParams = new DeepLearningParameters()
dlParams._train = bigTable
dlParams._response_column = 'ArrDelay
dlParams._epochs = 100
// Create a job
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get

12. Use the model to estimate delays using training data:

val predictionH2OFrame = dlModel.score(bigTable)('predict)


val predictionsFromModel = toRDD[DoubleHolder](predictionH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

13. Generate an R-code to show the residuals graph:

println(s"""
#
# R script for residual plot
#
# Import H2O library
library(h2o)
# Initialize H2O R-client
h = h2o.init()
# Fetch prediction and actual data, use remembered keys
pred = h2o.getFrame(h, "${predictionH2OFrame._key}")
act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columns
predDelay = pred$$predict
actDelay = act$$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind (as.data.frame(actDelay$$ArrDelay), as.data.frame(residuals$$predict))
nrow(compare)
plot( compare[,1:2] )
""")

14. Open RStudio and execute the generated code:


Note: RStudio must contain the newest H2O-DEV client library.

More information
Sparkling Water GitHub
H2O YouTube Channel
H2O SlideShare
H2O Blog


Troubleshooting
H2O
H2O: How Do I...?
H2O: Error Messages and Unexpected Behavior
H2O-Dev
H2O-Dev: How Do I...?
H2O-Dev: Error Messages and Unexpected Behavior
H2O Web UI
H2O Web UI: How Do I...?
H2O Web UI: Error Messages and Unexpected Behavior
Hadoop & H2O
Hadoop & H2O: How Do I...?
Hadoop & H2O: Error Messages and Unexpected Behavior
R & H2O
R & H2O: How Do I...?
R & H2O: Error Messages and Unexpected Behavior
Clusters and H2O
Clusters and H2O: How Do I...?
Clusters and H2O: Error Messages and Unexpected Behavior
Sparkling Water
Java
Java: How Do I...?

H2O
H2O: How Do I...?
Classify a single instance at a time
Export predicted values for all observations used in the model
Limit the CPU usage
Run multiple H2O projects by multiple team members on the same server
Run tests using Python
Specify multiple outputs
Tunnel between servers with H2O
Use H2O as a real-time, low-latency system to make predictions in Java

Classify a single instance at a time


The plain Java (POJO) scoring predict API is: public final float[] predict( double[] data, float[] preds)

// Pass in data in a double[], pre-aligned to the Model's requirements.


// Jam predictions into the preds[] array; preds[0] is reserved for the
// main prediction (class for classifiers or value for regression),
// and remaining columns hold a probability distribution for classifiers.
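A rough usage sketch (the class name MyGbmPojo is hypothetical - the real class name matches the model you exported - and the feature values and array sizes depend on your model):

double[] row   = new double[]{ 5.1, 3.5, 1.4, 0.2 };  // one instance, in the model's column order
float[]  preds = new float[4];                         // preds[0] = prediction; rest = class probabilities
new MyGbmPojo().predict(row, preds);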


Export predicted values for all observations used in the model


Click the Predict! link at the top of the model page, then enter the training data frame key in the data entry field. To export
the predicted data, click Download as CSV or Export to file and then select the file type (HDFS or NFS).

Limit the number of CPUs that H2O uses


Use the -nthreads option on the command line. For example, to limit the number of CPUs to two:

java -jar h2o.jar -nthreads 2

Run multiple H2O projects by multiple team members on the same server
We recommend keeping projects isolated, as opposed to running on the same instance. Launch each project in a new H2O
instance with a different port number and different cloud name to prevent the instances from combining into a cluster:

java -Xmx1g -jar h2o.jar -port 20000 -name jane &


java -Xmx1g -jar h2o.jar -port 20002 -name bob &
...

You can also set the -ip flag to localhost to prevent cluster formation in H2O.

Run tests using Python


Refer to the following page for information on running tests using Python: https://github.com/0xdata/h2o/wiki/How-To-Run-Tests

Specify multiple outputs


Multiple response columns are not supported. You must train a new model for each response column.

Tunnel between servers with H2O


1. Using ssh, log in to the machine where H2O will run.
2. Start an instance of H2O by locating the working directory and calling a java command similar to the following
example. (The port number chosen here is arbitrary; yours may be different.)

$ java -jar h2o.jar -port 55599

3. This returns output similar to the following:

irene@mr-0x3:~/target$ java -jar h2o.jar -port 55599


04:48:58.053 main INFO WATER: ----- H2O started ----
04:48:58.055 main INFO WATER: Build git branch: master
04:48:58.055 main INFO WATER: Build git hash: 64fe68c59ced5875ac6bac26a784ce210ef9f7a0
04:48:58.055 main INFO WATER: Build git describe: 64fe68c
04:48:58.055 main INFO WATER: Build project version: 1.7.0.99999
04:48:58.055 main INFO WATER: Built by: 'Irene'
04:48:58.055 main INFO WATER: Built on: 'Wed Sep 4 07:30:45 PDT 2013'


04:48:58.055 main INFO WATER: Java availableProcessors: 4


04:48:58.059 main INFO WATER: Java heap totalMemory: 0.47 gb
04:48:58.059 main INFO WATER: Java heap maxMemory: 6.96 gb
04:48:58.060 main INFO WATER: ICE root: '/tmp'
04:48:58.081 main INFO WATER: Internal communication uses port: 55600
+ Listening for HTTP and REST traffic on http://192.168.1.173:55599/
04:48:58.109 main INFO WATER: H2O cloud name: 'irene'
04:48:58.109 main INFO WATER: (v1.7.0.99999) 'irene' on
/192.168.1.173:55599, discovery address /230.252.255.19:59132
04:48:58.111 main INFO WATER: Cloud of size 1 formed [/192.168.1.173:55599]
04:48:58.247 main INFO WATER: Log dir: '/tmp/h2ologs'

4. Log into the remote machine where the running instance of H2O will be forwarded using a command similar to the
following example. Your port numbers and IP address will be different.

ssh -L 55577:localhost:55599 irene@192.168.1.173

5. Check cluster status.


You are now using H2O from localhost:55577, but the instance of H2O is running on the remote server (in this case, the
server with the ip address 192.168.1.xxx) at port number 55599.
To see this in action, note that the web UI is pointed at localhost:55577, but that the cluster status shows the cluster running
on 192.168.1.173:55599

Use H2O as a real-time, low-latency system to make predictions in Java


1. Build a model (for example, a GBM model with just a few trees).
2. Export the model as a Java POJO: Click the Java Model button at the top right of the web UI. This code is self-contained and can be embedded in any Java environment, such as a Storm bolt, without H2O dependencies.
3. Use the predict() function in the downloaded code.

H2O: Error Messages and Unexpected Behavior


Exception error with HDFS on CDH3
Unmatched quote char
Can't start up: not enough memory
Parser setup appears to be broken, got SVMLight data with (estimated) 0 columns.
Categories or coefficients missing after fitting the model

When I try to initiate a standalone instance of H2O using HDFS data on CDH3, I get an exception error - what should I
do?
The default HDFS version is CDH4. To use CDH3, use -hdfs_version cdh3 .

$ java -Xmx4g -jar h2o.jar -name Test -hdfs_version cdh3


...
05:37:13.588 main INFO WATER: Log dir: '/tmp/h2o-srisatish/h2ologs'
05:37:13.589 main INFO WATER: HDFS version specified on the command line: cdh3

I received an Unmatched quote char error message and now the number of columns does not match my dataset.
H2O supports only ASCII quotation marks ( Unicode 0022, " ) and apostrophes ( Unicode 0027, ' ) in datasets. All other
character types are treated as enum when they are parsed.

When launching using java -Xmx1g -jar h2o.jar , I received a Can't start up: not enough memory error message, but
according to free -m , I have more than enough memory. What's causing this error message?
Make sure you are using the correct version of the Java Development Kit (at least 1.6 or later). For example, the GNU compiler for Java and OpenJDK are not supported.
The current recommended JDK is JDK 7 from Oracle: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
However, JDK 6 is also supported: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase6-419409.html

What does the following error message indicate? "Parser setup appears to be broken, got SVMLight data with
(estimated) 0 columns."

H2O does not currently support non-numeric leading labels in SVMLight data. Convert rows from:

i 702101:1 732101:1 803101:1 808101:1 727101:1 906101:1 475101:1


j 702101:1 732101:1 803101:1 808101:1 727101:1 906101:1 475101:1


to

1 702101:1 732101:1 803101:1 808101:1 727101:1 906101:1 475101:1


2 702101:1 732101:1 803101:1 808101:1 727101:1 906101:1 475101:1

and the file should parse.

Why are my categories or coefficients missing after fitting the model?


If you are running a model with more than 7.5k predictors (including expanded factor levels) or a model with 0-value
coefficients, the output is truncated. To run without L1 penalties, set alpha=0 . However, if you have more than 7.5k
predictors, this is not recommended as it will increase the training time.

H2O-Dev
H2O-Dev: How Do I...?
Run Java APIs
Use an h2o-dev artifact as a dependency in a new project

Run developer APIs


After you've done the gradle build, refer to the following generated resources in your workspace:

REST:
- build/docs/REST/markdown
JAVA:
- h2o-core/build/docs/javadoc/index.html
- h2o-algos/build/docs/javadoc/index.html

Use an h2o-dev artifact as a dependency in a new project


Refer to the following blog post: http://h2o.ai/blog/2014/10/running-your-first-droplet-on-h2o/

H2O-Dev: Error Messages and Unexpected Behavior


ice_root not a read/writable directory

When I try to start H2O, I get the following output: ice_root not a read/writable directory - what do I need to do?
If you are running on OS X using an admin account, the user.name property is set to root by default. To resolve
this issue, disable the setuid bit in the Java executable.


H2O Web UI
H2O Web UI: How Do I...?
Load multiple files from multiple directories

Load multiple files from multiple directories


To load multiple files from multiple directories, import the first directory but do not parse. Import the second directory but do
not parse. After you have imported the directories, go to Data > Parse. In the source key entry field, enter the pattern for
your file types (for example, *.data or *.csv ). The preview displays all imported files matching the pattern.

H2O Web UI: Error Messages and Unexpected Behavior


Exception error after starting a job in the web UI when trying to access the job in R
Failure to load file using web UI

Why did I receive an exception error after starting a job in the web UI when I tried to access the job in R?
To prevent errors, H2O locks the job after it starts. Wait until the job is complete before accessing it in R or invoking APIs.

I tried to load a file of less than 4 GB using the web interface, but it failed; what should I do?
To increase the H2O JVM heap memory size, use java -Xmx8g -jar h2o.jar .

Hadoop & H2O


Hadoop & H2O: How Do I...?
Find the syntax for the file path of a data set sitting in hdfs
Launch the H2O cluster on Hadoop nodes using R's h2o.init() command
Launch and form a cluster on the Hadoop cluster

Find the syntax for the file path of a data set sitting in hdfs
To locate an HDFS file, go to Data > Import and enter hdfs:// in the path field. H2O automatically detects any HDFS
paths. This is a good way to verify the path to your data set before importing through R or any other non-web API.

Launch the H2O cluster on Hadoop nodes using R's h2o.init() command


This is not supported. Follow the instructions in the Hadoop Tutorial and add the IP address to the h2o.init() function to
connect to the cluster.


Launch H2O and form a cluster on the Hadoop cluster

$ hadoop jar <h2o_driver_jar_file> water.hadoop.h2odriver [-jt <jobtracker:port>]


-libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output hdfsOutputDirName

Hadoop & H2O: Error Messages and Unexpected Behavior


ERROR: Output directory hdfs://sandbox.hortonworks.com:8020/user/root/hdfsOutputDir already exists
H2O starts to launch but times out in 120 seconds
H2O job launches but terminates after 600 seconds
YARN doesn't launch and H2O times out
Can't connect to Hadoop cluster

What does ERROR: Output directory hdfs://sandbox.hortonworks.com:8020/user/root/hdfsOutputDir already exists mean?


Each job gets its own output directory in HDFS. To prevent overwriting multiple users' files, each job must have a unique
output directory name. Change the -output hdfsOutputDir argument to -output hdfsOutputDir1 and the job should launch.

What should I do if H2O starts to launch but times out in 120 seconds?
1. YARN or MapReduce is not configured correctly. Enable launching for mapper tasks of specified
memory sizes. If YARN only allows mapper tasks with a maximum memory size of 1g and the request requires 2g,
then the request will timeout at the default of 120 seconds. Read Configuration Setup to make sure your setup will run.
2. The nodes are not communicating with each other. If you request a cluster of two nodes and the output shows a stall in
reporting the other nodes and forming a cluster (as shown in the following example), check that the security settings for
the network connection between the two nodes are not preventing the nodes from communicating with each other. You
should also check to make sure that the flatfile that is generated and being passed has the correct home address; if
there are multiple local IP addresses, this could be an issue.

$ hadoop jar h2odriver_horton.jar water.hadoop.h2odriver -libjars ../h2o.jar


-driverif 10.115.201.59 -timeout 1800 -mapperXmx 1g -nodes 2 -output hdfsOutputDirName
13/10/17 08:51:14 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/10/17 08:51:14 INFO security.JniBasedUnixGroupsMapping: Using JniBasedUnixGroupsMapping for
Group resolution
Using mapper->driver callback IP address and port: 10.115.201.59:34389
(You can override these with -driverif and -driverport.)
Driver program compiled with MapReduce V1 (Classic)
Memory Settings:
mapred.child.java.opts: -Xms1g -Xmx1g
mapred.map.child.java.opts: -Xms1g -Xmx1g
Extra memory percent: 10
mapreduce.map.memory.mb: 1126
Job name 'H2O_61026' submitted
JobTracker job ID is 'job_201310092016_36664'
Waiting for H2O cluster to come up...
H2O node 10.115.57.45:54321 requested flatfile
H2O node 10.115.5.25:54321 requested flatfile
Sending flatfiles to nodes...
[Sending flatfile to node 10.115.57.45:54321]
[Sending flatfile to node 10.115.5.25:54321]
H2O node 10.115.57.45:54321 reports H2O cluster size 1
H2O node 10.115.5.25:54321 reports H2O cluster size 1


What should I do if the H2O job launches but terminates after 600 seconds?
The likely cause is a driver mismatch - check to make sure the Hadoop distribution matches the driver jar file used to
launch H2O. If your distribution is not currently available in the package, email us for a new driver file.

What should I do if I want to create a job with a bigger heap size but YARN doesn't launch and H2O times out?
First, try the job again but with a smaller heap size ( -mapperXmx ) and a smaller number of nodes ( -nodes ) to verify that a
small launch can proceed.
If the cluster manager settings are configured for the default maximum memory size but the memory required for the
request exceeds that amount, YARN will not launch and H2O will time out. If you have a default configuration, change the
configuration settings in your cluster manager to enable launching of mapper tasks for specific memory sizes. Use the
following formula to calculate the amount of memory required:

YARN container size
  == mapreduce.map.memory.mb
  == mapperXmx + (mapperXmx * extramempercent [default is 10%])
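As a worked example (my own arithmetic, using the numbers from the launch output below): with -mapperXmx 30g and -extramempercent 20, the container size is 30g + (30g * 20%) = 36g = 36 * 1024 MB = 36864 MB, which matches the mapreduce.map.memory.mb value shown in the output.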

Output from an H2O launch is shown below:

$ hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver


-libjars ../h2o.jar -mapperXmx 30g -extramempercent 20 -nodes 4 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.16.2.181]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.16.2.181:58280
(You can override these with -driverif and -driverport.)
Driver program compiled with MapReduce V1 (Classic)
14/10/10 18:39:53 INFO Configuration.deprecation: mapred.map.child.java.opts is deprecated.
Instead, use mapreduce.map.java.opts
Memory Settings:
mapred.child.java.opts: -Xms30g -Xmx30g
mapred.map.child.java.opts: -Xms30g -Xmx30g
Extra memory percent: 20
mapreduce.map.memory.mb: 36864

mapreduce.map.memory.mb must be less than the YARN memory configuration values for the launch to succeed. See the

examples below for how to change the memory configuration values for your version of Hadoop.
For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may
need to change the settings for more than one role group.
1. Click Configuration and enter the following search term in quotes: yarn.nodemanager.resource.memory-mb.
2. Enter the amount of memory (in GB) to allocate in the Value field. If more than one group is listed, change the values
for all listed groups.

3. Click the Save Changes button in the upper-right corner.



4. Enter the following search term in quotes: yarn.scheduler.maximum-allocation-mb

5. Change the value, click the Save Changes button in the upper-right corner, and redeploy.

For Hortonworks, configure the settings in Ambari.


1. Select YARN, then click the Configs tab.
2. Select the group.
3. In the Node Manager section, enter the amount of memory (in MB) to allocate in the
yarn.nodemanager.resource.memory-mb entry field.

4. In the Scheduler section, enter the amount of memory (in MB) to allocate in the yarn.scheduler.maximum-allocation-mb
entry field.

5. Click the Save button at the bottom of the page and redeploy the cluster.
For MapR:
1. Edit the yarn-site.xml file for the node running the ResourceManager.
2. Change the values for the yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb properties.
3. Restart the ResourceManager and redeploy the cluster.
To verify the values were changed, check the values for the following properties:
<name>yarn.nodemanager.resource.memory-mb</name>
<name>yarn.scheduler.maximum-allocation-mb</name>

Note: Due to Hadoop's memory allocation, the memory allocated for H2O on Hadoop is retained, even when no jobs are
running.

I can't connect to the Hadoop cluster - what do I do?


The mapper nodes need to be able to communicate with the driver node. The driver node (the machine running the job)
uses a default IP address that is displayed in the output as Using mapper->driver callback IP address and port: after
attempting to connect the cluster. If that network interface is not reachable by the mapper nodes, assign a driver node using
-driverif . The IP address used by the driver should start with the same numbers as the IP addresses used by the mapper nodes.

R and H2O
R and H2O: How Do I...?
Access the confusion matrix using R
Assign a value to a data key
Change an entire integer column to enum
Create test and train data split
Export data for post-analysis
Extract associated values (such as variable name and value) separately
Extract information and export the data locally
Extract the Neural Networks representation of the input
Install H2O in R without internet access
Join split datasets
Launch a multi-node H2O cluster in R
Load a .hex data set in R
Manage dependencies in R
Perform a mean imputation for NA values
Transfer data frames from R to H2O or vice versa
Specify column headings and data from separate files
Specify a threshold for validating performance
Sort and view variable importance for Random Forest
Sub-sample data
Update the libraries in R to the current version

Access the confusion matrix through the R interface


To access the confusion matrix & error rates, use dp.modelObj@model$train_class_error

Assign a value to a data key


To reassign the value of any data key, use

h2o.assign (data,key)

where data is the frame containing the key and key is the new key.

Change an entire integer column to enum


Run the following commands twice; after you run it the second time, the value changes from FALSE to TRUE.

data.hex[,2] = as.factor(data.hex[,2])
is.factor(data.hex[,2])


Create a test & train split of my data before building a model


To create a test and train data split (for example, to exclude rows with values outside of a specific range):

s = h2o.splitFrame(air)
air.train = air[s <= 0.8,]
air.valid = air[s > 0.8,]

Export scored or predicted data in a format that will allow me to do post-analysis work, such as building an ROC
curve
Use pred = h2o.predict(...) and convert the data parsed by H2O to an R data frame using predInR = as.data.frame(pred) .
For a two-class problem, the two class probabilities are available along with the labels output from h2o.predict() .

head(pred)
predict X0 X1
1 0 0.9930366 0.006963459
2 0 0.9933736 0.006626437
3 0 0.9936944 0.006305622
4 0 0.9939933 0.006006664
5 0 0.9942777 0.005722271
6 0 0.9950644 0.004935578

Here you can take the entire data frame into R:

predInR = as.data.frame(pred)

or compile it separately, subset it in H2O, then bring it into R:

class1InR = as.data.frame(pred[,2])
class2InR = as.data.frame(pred[,3])

Extract variable importance from the output


Launch R, then enter the following:

library(h2o)
h <- h2o.init()
hex <- as.h2o(h, iris)
m <- h2o.gbm(x=1:4, y=5, data=hex, importance=T)
m@model$varimp
Relative importance Scaled.Values Percent.Influence
Petal.Width 7.216290000 1.0000000000 51.22833426
Petal.Length 6.851120500 0.9493965043 48.63600147
Sepal.Length 0.013625654 0.0018881799 0.09672831
Sepal.Width 0.005484723 0.0007600474 0.03893596

The variable importances are returned as an R data frame and you can extract the names and values of the data frame as
follows:


is.data.frame(m@model$varimp)
# [1] TRUE
names(m@model$varimp)
# [1] "Relative importance" "Scaled.Values" "Percent.Influence"
rownames(m@model$varimp)
# [1] "Petal.Width" "Petal.Length" "Sepal.Length" "Sepal.Width"
m@model$varimp$"Relative importance"
# [1] 7.216290000 6.851120500 0.013625654 0.005484723

Extract information (such as coefficients or AIC) and export the data locally
When you run a model, for example

data.glm = h2o.glm(x=myX, y=myY, data=data.train, family="binomial", nfolds=0, standardize=TRUE)

The syntax is similar to the way you would use model$coefficients :

data.glm@model$coefficients

For other available parts of the data.glm@model object, press the Tab key in R to auto-complete data.glm@model$ or run
str(data.glm@model) to see all the components in the model object.

Extract the Neural Networks representation of the input


To extract the NN representation of the input at different hidden layers of the learned model (hidden layer learned features),
use the Deep Feature Extractor model in R, h2o.deepfeatures () after building a Deep Learning model.
Specify the DL model (which can be auto-encoder or regular DNN), the dataset to transform, and the desired hidden layer
index (0-based in GUI or Java, 1-based in R). The result is a new dataset with the features based on the specified hidden
layer.

library(h2o)
localH2O = h2o.init()
prosPath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prosPath)
prostate.dl = h2o.deeplearning(x = 3:9, y = 2, data = prostate.hex, hidden = c(100, 200),epochs = 5)
prostate.deepfeatures_layer1 = h2o.deepfeatures(prostate.hex, prostate.dl, layer = 1)
prostate.deepfeatures_layer2 = h2o.deepfeatures(prostate.hex, prostate.dl, layer = 2)
head(prostate.deepfeatures_layer1)
head(prostate.deepfeatures_layer2)

Install H2O in R without internet access


Download the package on another machine with internet access and copy it via a USB drive to the machine where you will
install H2O. You can run the install.packages command below, replacing the path with the location of the downloaded tar.gz file.

if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }


if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
install.packages("/PATH/TO/TAR/FILE/ON/DISK", repos = NULL, type = "source")


Caveat: You might need to install some dependencies such as statmod, rjson, bitops, and RCurl as well. If you don't already have them, download those packages onto a drive and install them in R before installing H2O.
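For example, a hypothetical loop over locally downloaded source packages (the directory and file names are placeholders; real tarballs include version numbers):

# Install locally downloaded dependency tarballs before installing H2O itself.
deps <- c("statmod", "rjson", "bitops", "RCurl")
for (pkg in deps) {
  install.packages(file.path("/PATH/TO/LOCAL/PKGS", paste0(pkg, ".tar.gz")),
                   repos = NULL, type = "source")
}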

Join split datasets


To join split datasets:

library (h2o)
localH2O = h2o.init(nthreads = -1)
df1 = data.frame(matrix(1:30, nrow=30, ncol=10))
df2 = data.frame(matrix(31:60, nrow=30, ncol=1))
df1.h2o = as.h2o(localH2O, df1)
df2.h2o = as.h2o(localH2O, df2)
binded_df = cbind(df1.h2o, df2.h2o)
binded_df

Launch a multi-node H2O cluster from R


H2O does not support launching a multi-node H2O cluster from R.
To start an H2O cluster in R:
1. From your cluster manager (for example, Hadoop), launch the cluster. For more information about launching clusters using Hadoop, refer to the Hadoop Tutorial.
2. Point to the cluster you just started in R:

localH2O = h2o.init(ip="<IPaddress>",port=54321)
airlines.hex = h2o.importFile(localH2O, "hdfs://sandbox.hortonworks.com:8020/user/root/airlines_allyears2k_headers.csv")

Load a .hex data set in R

To work with a data set that already exists in the H2O cloud (a .hex key shown in the web UI), construct a handle to it in R:

df.h2o = new("H2OParsedData", h2o=object, key=key)
head(df.h2o)

(Where "object" is the H2O connection object in R and "key" is the key displayed in the web UI.) Recent versions of the R package also provide h2o.getFrame(object, key) and h2o.getModel(object, key) to look up existing frames and models by key.

Manage dependencies in R
The H2O R package utilizes other R packages (like lattice and curl). Get the binary from CRAN directly and install the
package manually using:

install.packages("path/to/fpc/binary/file", repos = NULL, type = "binary")

Users may find this page on installing dependencies helpful: http://stat.ethz.ch/R-manual/R-devel/library/utils/html/install.packages.html


Perform a mean imputation for NA values

Look up ?h2o.impute in R; a simple mean imputation on a plain R data frame looks like this:


df = data.frame(x = 1:20, y = c(1:10,rep(NA,10)))
df$y[is.na(df$y)] = mean(df$y, na.rm=TRUE)

Source: http://www.r-bloggers.com/example-2014-5-simple-mean-imputation/

Transfer data frames from R to H2O or vice versa


Convert an H2O object to an R object using as.data.frame(h2oParsedDataObject) :

predCol = as.data.frame(pred$predict)

Convert an R object to an H2O object using as.h2o(localH2O, rDataFrame) :

predTesting = as.h2o(localH2O, predCol)

Specify column headings and data from separate files


To specify column headings and data from separate files, enter the following in R:

> colnames(df)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> colnames(df) = c("A", "B", "C", "D", "E")
> colnames(df)
[1] "A" "B" "C" "D" E"

Through the REST API, it looks like this:

05:25:25.545 # Session INFO HTTPD: POST /2/SetColumnNames2.json comma_separated_list=A,B,C,D,E source=key-name-here

You can also specify headers from another file when importing ( h2o.importFile(..., header=) ).
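A hedged sketch of the two-file approach; the header and col.names arguments are assumptions about the 2.8 R package, so check ?h2o.importFile for the exact argument names in your version:

# Import a header-only file first, then attach its names to the data file.
headers.hex = h2o.importFile(localH2O, "/path/to/headers.csv")
data.hex = h2o.importFile(localH2O, "/path/to/data.csv", header = FALSE, col.names = headers.hex)
colnames(data.hex)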

Specify a threshold for validating performance


By default, H2O chooses the best threshold and the best AUC value. When it scores against a new data set, the threshold parameter is set to the best threshold calculated by H2O. To specify your own threshold, use h2o.performance . In the following example, H2O takes the third column of the prediction frame, applies the given threshold, compares the resulting prediction vector to the reference vector, and reports the selected validation measure ("max_per_class_error").

h2o.performance(predict.glm[,3], reference, measure="max_per_class_error", thresholds)


Sort and view variable importance for Random Forest

library(h2o)
h <- h2o.init()
hex <- as.h2o(h, iris)
m <- h2o.randomForest(x = 1:4, y = 5, data = hex, importance = T)
m@model$varimp
as.data.frame(t(m@model$varimp)) # transpose and then cast to data.frame

Sub-sample data
To sub-sample data (select 10% of dataset randomly), use the following command:

splits=h2o.splitFrame(data.hex,ratios=c(.1),shuffle=T)
small = splits[[1]]
large = splits[[2]]

The following sample script takes data sitting in H2O, uses the column you want to subset by, and creates a new dataset with a given number of rows:

str_sampling <- function(data, label, rows) {
  # build a vector of row indices alongside the label column
  row_indices = h2o.exec(expr_to_execute = seq(1, nrow(data), 1), h2o = data@h2o)
  labels = data[, label]
  label_col = cbind(labels, row_indices)
  # split the row indices by class
  idx0 = label_col[label_col[, 1] == levels(labels)[1]][, 2]
  idx1 = label_col[label_col[, 1] == levels(labels)[2]][, 2]
  # draw roughly rows/2 indices at random from each class;
  # the small slack (0.01) ensures enough rows pass the random filter
  random0 = h2o.runif(idx0)
  random1 = h2o.runif(idx1)
  select_idx0 = idx0[random0 <= (rows * 0.5) / length(idx0) + 0.01][1:(rows * 0.5)]
  select_idx1 = idx1[random1 <= (rows * 0.5) / length(idx1) + 0.01][1:(rows * 0.5)]
  # combine the selections and subset the original data
  all_idx = h2o.exec(c(select_idx0, select_idx1))
  dataset = h2o.exec(data[all_idx, ])
  dataset
}

This is just one example of subsampling using H2O in R; adjust the weights to suit the type of sampling you want to do. Depending on the way you subsample your data, it might not necessarily be balanced. Updating to the latest version of H2O will give you the option to rebalance your data in H2O.
Keep in mind that H2O data frames are handled differently than R data frames. To subsample a data frame on a subset from a vector, add a new 0/1 column for each feature and then combine the relevant pieces using OR (in the following example, the result is df$newFeature3 ):

df.r = iris
df = as.h2o(h, df.r)
|========================================================| 100%
head(df)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa


6 5.4 3.9 1.7 0.4 setosa


tail(df)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
df$newFeature1 = df$Species == "setosa"
df$newFeature2 = df$Species == "virginica"
df$newFeature3 = (df$Species == "setosa" | df$Species == "virginica")

Update the libraries in R to the current version


First, do a clean uninstall of any previous packages:

if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }


if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

After unzipping the .zip file of the latest version of H2O, go to the h2o-<VersionNumber>/R folder and issue R CMD INSTALL h2o_<VersionNumber>.tar.gz .
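Alternatively, a minimal sketch of doing the same install from within R (the path is a placeholder for your unzipped download):

# Install the new tarball from inside R instead of using R CMD INSTALL.
install.packages("/PATH/TO/h2o-<VersionNumber>/R/h2o_<VersionNumber>.tar.gz",
                 repos = NULL, type = "source")
library(h2o)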

R and H2O: Error Messages and Unexpected Behavior


Version mismatch error during initialization
H2O connection severed during file import
Internal Server Error in R
H2O stalls/fails to start after allocating more memory
Fields associated with incorrect headers after loading in R using as.h2o
Error in function (type, msg, asError = TRUE) Failure when receiving data from the peer
Only one CPU used when starting H2O from R
names() does not display column names
String entries converted to NA during parsing
Column domain is too large to be represented as an enum : 10001>10000?
Picked up _JAVA_OPTIONS: -Xmx512M
Entire table does not display when using h2o.table()

I received a Version mismatch error during initialization. What do I do?


If you see a response message in R indicating that the instance of H2O running is a different version than that of the
corresponding H2O R package, download the correct version of H2O from http://h2o.ai/download/. Follow the installation
instructions on the download page.
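A quick check before reinstalling, as a sketch:

# Compare the installed R package version with the version of the running instance.
library(h2o)
packageVersion("h2o")
localH2O = h2o.init()   # prints the version of the H2O instance it connects to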

I was importing a file when I received the following error message: H2O connection has been severed. What should I
do?
Confirm the R package is the current version. If not, upgrade H2O and the R client and retry the file import.
If the instance is unavailable, relaunch the cluster. If you still can't connect, email support.

What do I do if I received an Internal Server Error in R error message?


This issue is only applicable to OS X and older versions of R that cannot make the build due to a missing GNU-TAR
package.
Enter the following in Terminal:

brew install gnu-tar


cd /usr/bin
sudo ln -s /usr/local/opt/gnu-tar/libexec/gnubin/tar gnutar

When I tried to start an instance larger than 1 GB using h2o.init() , H2O stalled and failed to start.
Check the output in the .out and .err files listed in the output from h2o.init() and verify that you have the 64-bit version of
Java installed using java -version .
To force Java to run in 64-bit mode, add the -d64 flag (for example, java -d64 -version confirms that 64-bit support is available).

I tried to load an R data frame into H2O using as.h2o and now the fields have shifted so that they are associated
with the incorrect header. What caused this error?
Check to make sure that none of the data fields contain a comma as a value.
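A minimal sketch for checking this in plain R before calling as.h2o (df is your local data frame; the name is an assumption for this sketch):

# List the columns whose entries contain an embedded comma.
has_comma = sapply(df, function(col) any(grepl(",", as.character(col))))
names(has_comma)[has_comma]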

When trying to run H2O in R, I received the following error message: Error in function (type, msg, asError = TRUE)
Failure when receiving data from the peer

To determine if the issue is with H2O launching or R connecting to H2O:


1. Close any current Java instances to kill any currently running H2O processes.
2. Confirm there is only one version of H2O in your R library. If there are multiple versions, do a clean install.
3. Go to the h2o.jar file and run the following in a terminal: $ java -Xmx1g -jar h2o.jar
4. Run the following in R:

library(h2o)
localH2O = h2o.init()

Why is only one CPU being used when I start H2O from R?

Depending on how you got your version of R, it may be configured by default to run with only one CPU. This is particularly
common for Linux installations. This can affect H2O when you use the h2o.init() function to start H2O from R.
To tell if this is happening, look in /proc/nnnnn/status at the Cpus_allowed bitmask (where nnnnn is the PID of R).

##(/proc/<nnnnn>/status: This configuration is BAD!)##


Cpus_allowed: 00000001
Cpus_allowed_list: 0

If you see a bitmask with only one CPU allowed, then any H2O process forked by R will inherit this limitation. To work
around this, set the following environment variable before starting R:

$ export OPENBLAS_MAIN_FREE=1
$ R

Now you should see something like the following in /proc/nnnnn/status

##(/proc/<nnnnn>/status: This configuration is good.)##


Cpus_allowed: ffffffff
Cpus_allowed_list: 0-31

At this point, the h2o.init() function will start an H2O that can use more than one CPU.
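As a sketch, you can also inspect the affinity of the running R process from inside R on Linux; it reads the same /proc status file described above:

# Show the Cpus_allowed lines for the current R process.
status = readLines(sprintf("/proc/%d/status", Sys.getpid()))
grep("^Cpus_allowed", status, value = TRUE)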

The data imported correctly but names() did not return the column names.
The version of R you are using is outdated; update to at least R 3.0.

Why were string entries converted into NAs during Parse?


At the moment, string entries in a column that otherwise contains numeric values are converted to NA when the data is ingested:

Data Frame in R:

  V1 V2 V3 V4
1  1  6 11  A
2  2  B  A  A
3  3  A 13 18
4  4  C 14 19
5  5 10 15 20

Data Frame in H2O:

  V1 V2 V3 V4
1  1  6 11 NA
2  2 NA NA NA
3  3 NA 13 18
4  4 NA 14 19
5  5 10 15 20

If the numeric values in the column were meant to be additional factor levels, you can concatenate them with a string prefix and the column will parse as an enum (factor) column:

V1 V2 V3 V4
1 1 i6 i11 A
2 2 B A A
3 3 A i13 i18
4 4 C i14 i19
5 5 i10 i15 i20
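A minimal sketch of that fix in plain R before uploading; df is the R data frame shown in the first table, localH2O is your H2O connection, and the "i" prefix is arbitrary (all names here are assumptions for illustration):

# Prefix numeric-looking entries so the whole column parses as a factor in H2O.
df$V2 = ifelse(grepl("^[0-9.]+$", as.character(df$V2)),
               paste0("i", as.character(df$V2)), as.character(df$V2))
df.hex = as.h2o(localH2O, df)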

Why did as.h2o(localH2O, data) generate the error: Column domain is too large to be represented as an enum :
10001>10000?


as.h2o , like h2o.uploadFile , uses a limited push method where the user initiates a request for information transfer, so it should only be used for smaller file transfers. For bigger data files or files with more than 10000 enumerators in a column, we recommend saving the file as a .csv and importing the data frame using h2o.importFile(localH2O, pathToData) .
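For example, a sketch of the recommended path (bigdf and the file location are placeholders):

# Write the R data frame to disk, then import it instead of pushing it with as.h2o.
write.csv(bigdf, "/tmp/bigdf.csv", row.names = FALSE)
big.hex = h2o.importFile(localH2O, "/tmp/bigdf.csv")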

When I tried to start H2O in R, I received the following error message: Picked up _JAVA_OPTIONS: -Xmx512M
There is a default setting in the environment for Java that conflicts with H2O's settings. To resolve this conflict, go to
System Properties. The easiest way is to search for system environment variables in the Start menu. Under the
Advanced tab, click Environment Variables. Under System Variables, locate and delete the variable _JAVA_OPTIONS,
whose value specifies -Xmx512M. You should be able to start H2O after this conflict is removed.

Why wasn't the entire table displayed when I used h2o.table() ?


If the table is very large (for example, with thousands of categoricals), it could potentially overwhelm R's memory, causing it
to crash. To avoid this, the table is stored in H2O and is only imported into R on request. Depending on the number of rows
in your table, the default output varies but is typically around six levels.
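A sketch of pulling a full table into R when you know it is small enough, reusing the prostate.hex frame imported in the deep features example above (and assuming h2o.table accepts a single-column frame as in the 2.8 package):

# Build the table in H2O, then explicitly bring it into R.
race.table = h2o.table(prostate.hex[, "RACE"])
as.data.frame(race.table)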

Clusters and H2O


H2O Cluster: How do I...?
Add new nodes to an existing cluster
Check if all the virtual machines in the cluster are communicating
Create a cloud behind a firewall
Join an existing cluster using Java
Use an asymmetric node cluster

Add new nodes to an existing cluster


You cannot add new nodes to an existing cluster. You must create a new cluster.

Check if all the virtual machines in the cluster are communicating


Use the Admin>Cloud status menu option to confirm that the cluster size is the size you expected.

Create a cloud behind a firewall


H2O uses two ports:
REST_API port (54321): Specify on the command line as -port ; uses TCP only.
INTERNAL_COMMUNICATION port (54322): Implied based on the port specified as the REST_API port, +1; requires
TCP and UDP.
You can start the cluster behind the firewall, but to reach it, you need to make a tunnel to reach the REST_API port. To use
the cluster, the REST_API port of at least one node needs to be reachable.


Join an existing cluster using Java


Make sure the cluster nodes share the same h2o.jar. Start the cluster from the command line using java -jar lib/h2o.jar -name "cluster_name" , then join it from Java:

final String[] args = new String[] { "-name", "cluster_name"};


H2O.main(args);
H2O.waitForCloudSize(X+1);

In this case, the new member joins the existing cloud with the same name. The resulting cloud size will be the number of existing nodes (X) plus 1.

Use an asymmetric node cluster


Asymmetric node clusters (for example, where one node is 16GB and the other is 120GB) are not supported. H2O attempts
to split up the data evenly across the nodes, so an asymmetric cluster may cause memory issues on the smaller nodes.

H2O Cluster: Error Messages and Unexpected Behavior


Cluster node unavailable
Nodes not forming cluster
Nodes forming independent clusters
Can't connect to Hadoop cluster
Java exception but can still run Pig and other jobs

One of the nodes in my cluster went down - what do I need to do?


H2O does not support high availability (HA). If a node in the cluster is unavailable, bring the cluster down and create a new
healthy cluster.

I launched H2O instances on my nodes - why won't they form a cloud?


When launching without specifying the IP address via the -ip argument:

$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321

and multiple local IP addresses are detected, H2O will use the default localhost (127.0.0.1) as shown below:

10:26:32.266 main WARN WATER: Multiple local IPs detected:


+ /192.168.1.161 /192.168.58.102
+ Attempting to determine correct address...
10:26:32.284 main WARN WATER: Failed to determine IP, falling back to localhost.
10:26:32.325 main INFO WATER: Internal communication uses port: 54322
+ Listening for HTTP and REST traffic on http://127.0.0.1:54321/
10:26:32.378 main WARN WATER: Flatfile configuration does not include self: /127.0.0.1:54321 but contains [/192.168.1.161:54321, /

To avoid reverting to 127.0.0.1 on servers with multiple local IP addresses, run the command with the -ip argument, forcing a launch at the appropriate location:

$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -ip 192.168.1.161 -port 54321

If this does not resolve the issue, try the following additional troubleshooting tips:
Test connectivity using curl: curl -v http://otherip:54321
Confirm ports 54321 and 54322 are available for both TCP and UDP
Confirm your firewall is not preventing the nodes from locating each other
Check if the Linux iptables are causing the issue
Check the configuration for the EC2 security group
Confirm that the username is the same on all nodes; if not, define the cloud name using -name
Check if the nodes are on different networks
Check if the nodes have different interfaces; if so, use the -network option to specify which network H2O should bind to.

Force the bind address using -ip

What should I do if I tried to start a cluster but the nodes started independent clouds that were not connected?
Because the default cloud name is the user name of the node, if the nodes are on different operating systems (for example,
one node is on a Windows machine and the other is on a Mac), the different user names on each machine will prevent the
nodes from recognizing they are part of the same cloud. To resolve this issue, use -name to configure the same name for
all nodes.

Java
Java: How Do I...?
Remove POJO data
Switch between Java versions

Remove POJO data


There are three ways to delete POJO data:
DKV.remove : Raw remove (similar to HashMap.remove); deletes the POJO associated with the specified key. If the POJO uses other keys in the DKV, they will not be removed. We recommend using this option for Iced objects that don't use other objects in the DKV. If you use this option on a vector, it deletes only the header; the chunks that comprise the vector are not deleted.
Keyed.remove : Removes the POJO associated with the specified key and all of its subparts. We recommend using this option to avoid leaks, especially when deleting a vector; all of the chunks that comprise the vector are deleted.
Lockable.delete : Removes all subparts of a Lockable by write-locking the key and then deleting it. You must unlock the key before using this method.

Sparkling Water

How does H2O work with Spark?


What is the current status of Sparkling Water?
Does H2O duplicate my data?
Is it possible to use H2O and MLBase or MLlib together?
Where can I download this and try for myself?
How does Scala work with Sparkling Water?
Is it possible to use Sparkling Water via R?
What algorithms are supported in Sparkling Water?
How do I compile and run my own code in Sparkling Water?
How do I set up Sparkling Water on an existing Spark cluster?
How do I load data into Spark?
Where can I find more info on Sparkling Water?
How can I use Tachyon to share data between H2O and Spark?
How do I increase the time-out to allow more time for the cluster to form?
How do I run examples in Sparkling Water?
How do I specify the number of Sparkling Water nodes to start?
How do I specify the amount of memory Sparkling Water should use?

How does H2O work with Spark?


This is a "deep" integration - H2O can convert Spark RDDs to H2O DataFrames and vice-versa directly. It is a fully inmemory, in-process, distributed, and parallel conversion, without ever touching the disk. Simply put, we're an application jar
that works with an existing Spark cluster.

What is the current status of Sparkling Water?


Sparkling Water is now integrated at the application level. You do not need to modify the Spark infrastructure; you can use the regular Spark 1.1.0+ distribution.
The project source is located at https://github.com/0xdata/sparkling-water and also contains pointers to downloads of the latest version.
To run examples or Sparkling Shell, which is the regular Spark Shell with an additional jar, you need to export
SPARK_HOME and point it to your Spark distribution.

There is a simple demo described in https://github.com/0xdata/sparkling-water/blob/master/examples/README.md showing use of H2O, the Spark RDD API, and Spark SQL.

Does H2O duplicate my data?


Yes, H2O makes a highly compressed version of your data that is optimized for machine learning and speed. Assuming
your cluster has enough memory, this compressed version lives purely in memory. Additional memory is needed to train
models using algorithms like Random Forest. Scoring does not require much additional memory.

Is it possible to use H2O and MLBase or MLlib together?


Yes, we have interoperability via Scala. You can find examples in the Sparkling Water github repository.

Where can I download Sparkling Water?


To download a pre-built version, visit the H2O download page.


To build it yourself, do git clone https://github.com/0xdata/sparkling-water.git and follow the directions in the
README.md file.

How does Scala work with Sparkling Water?


Spark is compatible with Scala already. What's new with Sparkling Water is we now have an H2O Scala interface to make it
even easier to integrate H2O with Spark.

Is it possible to use Sparkling Water via R?


Yes - however, because the "classic" H2O R package won't work, you need to install the R package for H2O built from the
h2o-dev repository.
Note: This feature is under active development; expect these instructions to change!
As of today, you need to build this yourself. It is produced by the normal h2o-dev ./gradlew build process, which puts the R package in h2o-r/R/src/contrib/h2o_0.1.7.99999.tar.gz. To install this package in R, run R CMD INSTALL h2o-r/R/src/contrib/h2o_0.1.7.99999.tar.gz .

What algorithms are supported in Sparkling Water?


All of the H2O algorithms in the h2o-dev build are available, and before long all of those in the main H2O build will be as well. Currently, we have Deep Learning, Generalized Linear Modeling, K-Means, standard statistics such as min, max, mean, and stddev, and anything you can write in Scala.
The list is growing on a monthly basis.

How do I compile and run my own code in Sparkling Water?


To get started right away, grab the source code and build it.

git clone https://github.com/0xdata/sparkling-water.git


./gradlew build

Stay tuned for a release build that we will push to MavenCentral.

How do I set up Sparkling Water on an existing Spark cluster?


Follow the instructions on the download page. Here is a link to the latest bleeding edge download.
Additional documentation is available in the Sparkling Water Github repository.

How do I load data into Spark?


Note: This feature is under active development; expect these instructions to change!
Once you have started the Spark shell with the Sparkling Water jar on the classpath, you can load data into Spark as
follows:


Directly via native Spark API:

val weatherDataFile = "examples/smalldata/Chicago_Ohare_International_Airport.csv"


val wrawdata = sc.textFile(weatherDataFile).cache()

Indirectly through Sparkling Water API:

import java.io.File
val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)

Where can I find more info on Sparkling Water?


Blog posts:
http://h2o.ai/blog/2014/09/sparkling-water-tutorials/
http://h2o.ai/blog/2014/09/how-sparkling-water-brings-h2o-to-spark/
Presentations:
http://www.slideshare.net/0xdata/sparkling-water-5-2814
Sparkling Water Github repository:
https://github.com/0xdata/sparkling-water
https://github.com/0xdata/sparkling-water/tree/master/examples

How can I use Tachyon to share data between H2O and Spark?
Tachyon is no longer needed! The Spark and H2O environments now share JVMs directly. Please update to the latest
version of Sparkling Water.

How do I run examples in Sparkling Water?


Follow the directions in the following tutorial:
http://h2o.ai/blog/2014/09/sparkling-water-tutorials/
And those in the Github repository itself:
https://github.com/0xdata/sparkling-water

How do I specify the starting number of Sparkling Water nodes?


From the API:

val h2oContext = new H2OContext(sc).start(3)


From the command line via Spark property spark.ext.h2o.cluster.size (as an example):

bin/sparkling-shell --conf "spark.ext.h2o.cluster.size=3"

How do I specify the amount of memory Sparkling Water should use?


From the command line via the Spark property spark.executor.memory (as an example):

bin/sparkling-shell --conf "spark.executor.memory=4g"

Switch between Java versions on Mac OS X


From the command line, enter:

export JAVA_8_HOME=$(/usr/libexec/java_home -v1.8)


export JAVA_7_HOME=$(/usr/libexec/java_home -v1.7)
export JAVA_6_HOME=$(/usr/libexec/java_home -v1.6)
alias java6='export JAVA_HOME=$JAVA_6_HOME'
alias java7='export JAVA_HOME=$JAVA_7_HOME'
alias java8='export JAVA_HOME=$JAVA_8_HOME'

To switch versions, enter the alias for the version you want (for example, java8 ). To confirm the version, enter java -version .

jessicas-mbp:src jessica$ java -version


java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
jessicas-mbp:src jessica$ java8
jessicas-mbp:src jessica$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)


More Information
H2O Documentation
H2O Booklets
H2O YouTube Channel
H2O SlideShare
H2O Blog
H2O GitHub
