Table of Contents
1. Introduction
2. H2O World Training Sandbox
3. H2O in Big Data Environments
4. Hands-On Training
i. H2O with the Web UI
ii. R with H2O
iii. Supervised Learning
i. Generalized Linear Models
ii. Gradient Boosted Models
iii. Random Forests
iv. Regression
v. Classification
vi. Deep Learning
iv. Unsupervised Learning
i. KMeans Clustering
ii. Dimensionality Reduction
iii. Anomaly Detection
v. Advanced Topics
i. Multi-model Parameter Tuning
ii. Categorical Feature Engineering
iii. Other Useful Tools
vi. Practical Use Cases for Marketing
5. Sparkling Water
6. Python on H2O
7. Demos
i. Tableau
ii. Excel
iii. Streaming Data
8. Build Applications on Top of H2O
i. KMeans
ii. Grep
iii. Quantiles
iv. Build with Sparkling Water
9. Troubleshooting
10. More Information
Get H2O!
Download H2O
http://h2o.ai/download/
Hack H2O
https://github.com/h2oai/h2o.git
Introduction
2. Setup Sandbox
Start VirtualBox
3. Launch Sandbox
H2O World Training Sandbox
Login screen
4. Sandbox Content
H2O v2.8.2.8 - "Maxwell"
Sparkling Water v0.2.1-31
RStudio Desktop v0.98.1091
R version 3.1.1 (2014-07-10) - "Sock it to Me"
H2O R package v2.8.2.8
Spark v1.1.0 for Hadoop 2.4
Git v1.7.1
Oracle JDK v1.7.0_45
IntelliJ Idea 14 Community Edition
Hands-On Training
In the following sections, you'll learn how to get started with H2O via the simple-to-use Web UI and R, and work through a number of specific examples.
3.2 H2O in R
Patrick Aboyoun: Basics and Exploratory Data Analysis (EDA)
Creating Visualizations
Navigate to ADVANCED Tab to see overlaying ROC curves
Click ADD VISUALIZATION...
For the X-Axis Field choose Training Time (ms)
For the Y-Axis Field choose AUC
Examine the new graph you created. Weigh the value of the extra gain in accuracy against the time taken to train the models.
Select the model with your desired level of accuracy, then copy and paste the model key and proceed to step 9.
R with H2O
This tutorial demonstrates basic data import, manipulations, and summarizations of data within an H2O cluster from within
R. It requires an installation of the h2o R package and its dependencies.
library(h2o)
h2oServer <- h2o.init(nthreads = -1)
Download Data
This tutorial uses a 10% sample of the Person-Level 1% 2013 Public Use Microdata Sample (PUMS) from United States
Census Bureau, making it a Person-Level 0.1% 2013 PUMS.
wget https://s3.amazonaws.com/h2o-training/pums2013/adult_2013_full.csv.gz
The key argument to the h2o.importFile function sets the name of the data set in the H2O key-value store. If the key argument is not supplied, the data will reside in the H2O key-value store under a machine-generated name.
The results of the h2o.ls function show the size of the object held by the adult_2013_full key in the H2O key-value store.
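A hedged sketch of the import and listing described above (the local path assumes the wget download landed in the working directory):

adult_2013_full <- h2o.importFile(h2oServer,
                                  path = "adult_2013_full.csv.gz",
                                  key = "adult_2013_full")
h2o.ls(h2oServer)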
class(adult_2013_full)
dim(adult_2013_full)
head(colnames(adult_2013_full), 50)
nms <- c("AGEP", "COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX",
"INTP", "WKHP", "POBP", "WAGP")
adult_2013 <- adult_2013_full[!is.na(adult_2013_full$WAGP) &
adult_2013_full$WAGP > 0, nms]
h2o.ls(h2oServer)
Although we created an object in R called adult_2013 , there is no value with that key in the H2O key-value store. To make it easier to track our data set, we will copy its value to the adult_2013 key using the h2o.assign function and delete all the machine-generated keys with the prefix Last.value that served as intermediary objects using the h2o.rm function.
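A minimal sketch of that bookkeeping step; the rmLastValues helper defined later in this training wraps the same h2o.rm call used here:

adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")
h2o.rm(h2oServer, keys = h2o.ls(h2oServer, pattern = "Last.value")$Key)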
summary(adult_2013)
dim(adult_2013)
As with R data.frame objects, individual columns within an H2O data set can be summarized using methods commonly associated with R vector objects. For example, the quantile function in R is used to find sample quantiles at probability values specified in the probs argument.
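For example, sample centiles of the wage column can be computed as in this hedged sketch (the exact probability grid used in the original is an assumption; these centiles feed the cutpoints used further below):

centiles <- quantile(adult_2013$WAGP, probs = seq(0, 1, by = 0.01))
cutpoints <- centiles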
The use of the $ operator to extract the column WAGP from the adult_2013 data set generated new Last.value keys that
we will clean up in the interest of maintaining a tidy key-value store.
h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)
Now that we have the capital gain and loss columns and their log-transformed versions, we can assign our new data set to the adult_2013 key and remove all the temporary keys from the H2O key-value store.
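A minimal sketch of the reassignment pattern, assuming the log-transformed wage is simply log(WAGP) (the capital gain and loss columns follow the same pattern):

adult_2013$LOG_WAGP <- log(adult_2013$WAGP)
adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")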
h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)
cutpoints[1L] <- 0
adult_2013$CENT_WAGP <- h2o.cut(adult_2013$WAGP, cutpoints)
adult_2013$TOP2_WAGP <- adult_2013$WAGP > centiles[99L]
centcounts <- h2o.table(adult_2013["CENT_WAGP"], return.in.R = TRUE)
round(100 * centcounts/sum(centcounts), 2)
top2counts <- h2o.table(adult_2013["TOP2_WAGP"], return.in.R = TRUE)
round(100 * top2counts/sum(top2counts), 2)
relpxtabs <- h2o.table(adult_2013[c("RELP", "TOP2_WAGP")], return.in.R = TRUE)
relpxtabs
round(100 * relpxtabs/rowSums(relpxtabs), 2)
schlxtabs <- h2o.table(adult_2013[c("SCHL", "TOP2_WAGP")], return.in.R = TRUE)
schlxtabs
round(100 * schlxtabs/rowSums(schlxtabs), 2)
h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)
h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)
Now that we have derived a few sets of variables, we can examine the H2O key-value store to ensure we have the
expected objects.
h2o.ls(h2oServer)
rmLastValues()
kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013"] / 1024^2
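The 75/25 split checked below is not preserved above; a hedged sketch using the random-uniform pattern used elsewhere in this training (the seed is an assumption):

random <- h2o.runif(adult_2013, seed = 1185)
adult_2013_train <- h2o.assign(adult_2013[random < 0.75, ], key = "adult_2013_train")
adult_2013_test  <- h2o.assign(adult_2013[random >= 0.75, ], key = "adult_2013_test")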
Now check to make sure the sizes of the resulting data sets meet expectations.
nrow(adult_2013)
nrow(adult_2013_train)
nrow(adult_2013_test)
h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)
Supervised Learning
Generalized Linear Models
R Documentation
The h2o.glm function fits H2O's Generalized Linear Models from within R.
library(h2o)
args(h2o.glm)
The R documentation (man page) for H2O's Generalized Linear Models can be opened from within R using the help or ?
functions:
help(h2o.glm)
We can run the example from the man page using the example function:
example(h2o.glm)
And run a longer demonstration from the h2o package using the demo function:
demo(h2o.glm)
Gradient Boosted Models
R Documentation
The h2o.gbm function fits H2O's Gradient Boosting Machines from within R.
library(h2o)
args(h2o.gbm)
The R documentation (man page) for H2O's Gradient Boosting Machines can be opened from within R using the help or
? functions:
help(h2o.gbm)
We can run the example from the man page using the example function:
example(h2o.gbm)
And run a longer demonstration from the h2o package using the demo function:
demo(h2o.gbm)
Random Forests
Intuition: Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other
trees.
Important components:
1. Number of trees
2. Maximum depth of tree
3. Number of variables randomly sampled as candidates for splits
4. Sampling rate for constructing data set to use on each tree
R Documentation
The h2o.randomForest function fits H2O's Random Forest from within R.
library(h2o)
args(h2o.randomForest)
The R documentation (man page) for H2O's Random Forest can be opened from within R using the help or ? functions:
help(h2o.randomForest)
We can run the example from the man page using the example function:
example(h2o.randomForest)
And run a longer demonstration from the h2o package using the demo function:
demo(h2o.randomForest)
Regression
Regression using Generalized Linear Models, Gradient
Boosting Machines, and Random Forests in H2O
This tutorial demonstrates regression modeling in H2O using generalized linear models (GLM), gradient boosting machines
(GBM), and random forests. It requires an installation of the h2o R package and its dependencies.
We will begin this tutorial by starting a local H2O cluster using the default heap size and as much compute as the operating
system will allow.
library(h2o)
h2oServer <- h2o.init(nthreads = -1)
rmLastValues <- function(pattern = "Last.value.")
{
  keys <- h2o.ls(h2oServer, pattern = pattern)$Key
  if (!is.null(keys))
    h2o.rm(h2oServer, keys)
  invisible(keys)
}
Load and prepare the training and testing data for analysis
This tutorial uses a 0.1% sample of the Person-Level 2013 Public Use Microdata Sample (PUMS) from United States
Census Bureau with 75% of that sample being designated to the training data set and 25% to the test data set. This data
set is intended to be an update to the UCI Adult Data Set.
For the purposes of validation, we will create a single column data set containing only the target variable LOG_WAGP from the
test data set.
Our data set also has 8 columns that use integer codes to represent categorical levels, so we will coerce them to factor after reading the data.
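A hedged sketch of the load-and-prepare step (the file names are assumptions; the eight coded columns are the PUMS fields used in the previous chapter):

adult_2013_train <- h2o.importFile(h2oServer, path = "adult_2013_train.csv.gz",
                                   key = "adult_2013_train")
adult_2013_test <- h2o.importFile(h2oServer, path = "adult_2013_test.csv.gz",
                                  key = "adult_2013_test")
# Coerce the integer-coded categorical columns to factors; the remaining coded
# columns (MAR, INDP, RELP, RAC1P, SEX, POBP) are handled the same way.
adult_2013_train$COW <- as.factor(adult_2013_train$COW)
adult_2013_train$SCHL <- as.factor(adult_2013_train$SCHL)
adult_2013_test$COW <- as.factor(adult_2013_test$COW)
adult_2013_test$SCHL <- as.factor(adult_2013_test$SCHL)
# Single-column frame holding the regression target for later validation scoring
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")
rmLastValues()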
We will start with an ordinary linear model that is trained using the h2o.glm function with Gaussian (Normal) error and no
elastic net regularization ( lambda = 0 ).
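The call that produced log_wagp_glm_0 was not preserved; a hedged sketch (the predictor set is an assumption):

log_wagp_glm_0 <- h2o.glm(x = "RELP", y = "LOG_WAGP", data = adult_2013_train,
                          family = "gaussian", lambda = 0)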
class(log_wagp_glm_0)
getClassDef("H2OGLMModel")
In this model object, most of the values of interest are contained in the model slot.
names(log_wagp_glm_0@model)
coef(log_wagp_glm_0@model) # works through stats:::coef.default
log_wagp_glm_0@model$aic
1 - log_wagp_glm_0@model$deviance / log_wagp_glm_0@model$null.deviance
In addition the h2o.mse function can be used to calculate the mean squared error of prediction.
h2o.mse(h2o.predict(log_wagp_glm_0, adult_2013_test),
adult_2013_test[, "LOG_WAGP"])
rmLastValues()
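The cross-validated refit discussed next could look like the following hedged sketch (the number of folds is an assumption):

log_wagp_glm_0_cv <- h2o.glm(x = "RELP", y = "LOG_WAGP", data = adult_2013_train,
                             family = "gaussian", lambda = 0, nfolds = 10)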
The resulting model object is also of class H2OGLMModel , but now the xval slot is populated with model fits resulting when
one of the folds was dropped during training.
class(log_wagp_glm_0_cv)
length(log_wagp_glm_0_cv@xval)
class(log_wagp_glm_0_cv@xval[[1L]])
We can create boxplots for each of the coefficients to see how variable the estimates were as each of the folds was dropped in turn.
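One way to draw those boxplots with base R graphics (a sketch; coefficients are pulled from each fold's model as shown earlier with coef):

xval_coefs <- sapply(log_wagp_glm_0_cv@xval, function(m) coef(m@model))
boxplot(t(xval_coefs), las = 2, main = "GLM coefficients across cross-validation folds")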
As we can see below, both the Akaike information criterion (AIC) and percent deviance explained metrics point to using a
combination of RELP and SCHL in a linear model for LOG_WAGP .
log_wagp_glm_relp@model$aic
log_wagp_glm_schl@model$aic
log_wagp_glm_relp_schl@model$aic
1 - log_wagp_glm_relp@model$deviance / log_wagp_glm_relp@model$null.deviance
1 - log_wagp_glm_schl@model$deviance / log_wagp_glm_schl@model$null.deviance
1 - log_wagp_glm_relp_schl@model$deviance / log_wagp_glm_relp_schl@model$null.deviance
In the context of elastic net regularization, we need to search the parameter space defined by the mixing parameter alpha and the shrinkage parameter lambda . To aid us in this search, H2O can produce a grid of models for all combinations of a discrete set of parameters.
We will use different methods for specifying the alpha and lambda values as they are dependent upon one another. For the alpha parameter, we will specify five values ranging from 0 (ridge) to 1 (lasso) by increments of 0.25. For lambda , we will turn on an automated lambda search by setting lambda_search = TRUE and specify the number of lambda values to 10 by setting nlambda = 10 .
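A hedged sketch of the grid call described above (the predictor set and the exclusion of the wage-derived columns are assumptions):

log_wagp_glm_grid <- h2o.glm(x = setdiff(colnames(adult_2013_train),
                                         c("WAGP", "LOG_WAGP", "CENT_WAGP", "TOP2_WAGP")),
                             y = "LOG_WAGP", data = adult_2013_train,
                             family = "gaussian",
                             alpha = c(0, 0.25, 0.5, 0.75, 1),
                             lambda_search = TRUE, nlambda = 10)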
We now have an object of class H2OGLMGrid that contains a list of H2OGLMModelList objects for each of the models fit on the
grid.
class(log_wagp_glm_grid)
getClassDef("H2OGLMGrid")
class(log_wagp_glm_grid@model[[1L]])
getClassDef("H2OGLMModelList")
length(log_wagp_glm_grid@model[[1L]]@models)
class(log_wagp_glm_grid@model[[1L]]@models[[1L]])
Currently the h2o package does not contain any helper functions for extracting models of interest, so we have to explore the model object and choose the model we like best.
log_wagp_glm_grid@model[[1L]]@models[[1L]]@model$params$alpha # ridge
log_wagp_glm_grid@model[[2L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[3L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[4L]]@models[[1L]]@model$params$alpha
log_wagp_glm_grid@model[[5L]]@models[[1L]]@model$params$alpha # lasso
log_wagp_glm_grid_mse <- sapply(log_wagp_glm_grid@model, function(x)
  sapply(x@models, function(y)
    h2o.mse(h2o.predict(y, adult_2013_test), actual_log_wagp)))
log_wagp_glm_grid_mse
log_wagp_glm_grid_mse == min(log_wagp_glm_grid_mse)
log_wagp_glm_best <- log_wagp_glm_grid@model[[5L]]@models[[10L]]
A comparison of mean squared errors against the test set suggests our GBM fit outperforms our GLM fit.
h2o.mse(h2o.predict(log_wagp_glm_best, adult_2013_test),
actual_log_wagp)
h2o.mse(h2o.predict(log_wagp_gbm_best, adult_2013_test),
actual_log_wagp)
Classification
library(h2o)
h2oServer <- h2o.init(nthreads = -1)
rmLastValues <- function(pattern = "Last.value.")
{
  keys <- h2o.ls(h2oServer, pattern = pattern)$Key
  if (!is.null(keys))
    h2o.rm(h2oServer, keys)
  invisible(keys)
}
Load the training and testing data into the H2O key-value store
This tutorial uses a 0.1% sample of the Person-Level 2013 Public Use Microdata Sample (PUMS) from United States
Census Bureau with 75% of that sample being designated to the training data set and 25% to the test data set. This data
set is intended to be an update to the UCI Adult Data Set.
For the purposes of validation, we will create a single column data set containing only the target variable TOP2_WAGP from
the test data set.
Our data set also has 8 columns that use integer codes to represent categorical levels, so we will coerce them to factor after reading the data.
Plot Receiver Operating Characteristic (ROC) curve and find its Area Under the Curve (AUC)
A receiver operating characteristic (ROC) curve is a graph of the true positive rate ( recall ) against the false positive rate (1
- specificity) for a binary classifier.
We can extract the area under the ROC curve from our model object to see how close it is to the ideal of 1.
f1_top2_wagp_glm_relp@model$auc
The Gini coefficient, not to be confused with the Gini impurity splitting metric used in decision tree based algorithms such
as h2o.randomForest , is a linear rescaling of AUC defined by $Gini = 2 * AUC - 1$.
f1_top2_wagp_glm_relp@model$gini
2 * f1_top2_wagp_glm_relp@model$auc - 1
Another metric we could have used to determine a cutoff is the lift metric, which is the ratio of the probability of a positive prediction given the model to the baseline probability for the population. It is generated by the h2o.gains function, which was inspired by Craig Rolling's gains package hosted on CRAN.
h2o.gains(actual_top2_wagp, prob_top2_wagp_glm_relp)
class(f1_top2_wagp_glm_relp@model)
names(f1_top2_wagp_glm_relp@model)
class_top2_wagp_glm_relp <- prob_top2_wagp_glm_relp > f1_top2_wagp_glm_relp@model$best_cutoff
h2o.confusionMatrix(class_top2_wagp_glm_relp, actual_top2_wagp)
Fit an elastic net logistic regression model across a grid of parameter settings
Now that we are familiar with H2O model fitting in R, we can fit more sophisticated models involving a larger set of
predictors.
In the context of elastic net regularization, we need to search the parameter space defined by the mixing parameter alpha and the shrinkage parameter lambda . To aid us in this search, H2O can produce a grid of models for all combinations of a discrete set of parameters.
We will use different methods for specifying the alpha and lambda values as they are dependent upon one another. For the alpha parameter, we will specify five values ranging from 0 (ridge) to 1 (lasso) by increments of 0.25. For lambda , we will turn on an automated lambda search by setting lambda_search = TRUE and specify the number of lambda values to 10 by setting nlambda = 10 .
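A hedged sketch of the logistic grid (the predictor set mirrors the columns examined at the end of this chapter; addpredset is defined earlier in the original tutorial and is not preserved here):

top2_wagp_glm_grid <- h2o.glm(x = c("RELP_SCHL", addpredset),
                              y = "TOP2_WAGP", data = adult_2013_train,
                              family = "binomial",
                              alpha = c(0, 0.25, 0.5, 0.75, 1),
                              lambda_search = TRUE, nlambda = 10)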
We now have an object of class H2OGLMGrid that contains a list of H2OGLMModelList objects for each of the models fit on the
grid.
class(top2_wagp_glm_grid)
getClassDef("H2OGLMGrid")
class(top2_wagp_glm_grid@model[[1L]])
getClassDef("H2OGLMModelList")
length(top2_wagp_glm_grid@model[[1L]]@models)
class(top2_wagp_glm_grid@model[[1L]]@models[[1L]])
Currently the h2o package does not contain any helper functions for extracting models of interest, so we have to explore
the model object and choose the model we like best.
top2_wagp_glm_grid@model[[1L]]@models[[1L]]@model$params$alpha # ridge
top2_wagp_glm_grid@model[[2L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[3L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[4L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[5L]]@models[[1L]]@model$params$alpha # lasso
top2_wagp_glm_grid_f1 <- sapply(top2_wagp_glm_grid@model, function(x)
  sapply(x@models, function(y)
    h2o.performance(h2o.predict(y, adult_2013_test)[, 3L],
                    actual_top2_wagp,
                    measure = "F1")@model$error))
top2_wagp_glm_grid_f1
top2_wagp_glm_grid_f1 == min(top2_wagp_glm_grid_f1)
top2_wagp_glm_best <- top2_wagp_glm_grid@model[[4L]]@models[[7L]]
prob_top2_wagp_glm_best <- h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L]
prob_top2_wagp_glm_best <- h2o.assign(prob_top2_wagp_glm_best,
key = "prob_top2_wagp_glm_best")
rmLastValues()
f1_top2_wagp_glm_best <- h2o.performance(prob_top2_wagp_glm_best,
actual_top2_wagp,
measure = "F1")
plot(f1_top2_wagp_glm_best, type = "cutoffs", col = "blue")
We can find the number of non-zero coefficients in our best logistic regression model to gain some understanding about its
overall complexity.
table(coef(top2_wagp_glm_best@model) != 0)
nzcoefs <- coef(top2_wagp_glm_best@model)
nzcoefs <- names(nzcoefs)[nzcoefs != 0]
nzcoefs <- unique(sub("\\..*$", "", nzcoefs))
setdiff(c("RELP_SCHL", addpredset), nzcoefs) # all preds had non-zero coefs
Deep Learning
R Documentation
The h2o.deeplearning function fits H2O's Deep Learning models from within R.
library(h2o)
args(h2o.deeplearning)
The R documentation (man page) for H2O's Deep Learning can be opened from within R using the help or ? functions:
help(h2o.deeplearning)
As you can see, there are a lot of parameters! Luckily, as you'll see later, you only need to know a few to get the most out of
Deep Learning. More information can be found in the H2O Deep Learning booklet and in our slides.
We can run the example from the man page using the example function:
example(h2o.deeplearning)
And run a longer demonstration from the h2o package using the demo function (requires an internet connection):
demo(h2o.deeplearning)
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
TRAIN = "train.csv.gz"
TEST = "test.csv.gz"
train_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ',', key = 'train.hex')
test_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TEST), header = F, sep = ',', key = 'test.hex')
While H2O Deep Learning has many parameters, it was designed to be just as easy to use as the other supervised training methods in H2O. Automatic data standardization, handling of categorical variables and missing values, and per-neuron adaptive learning rates reduce the number of parameters the user has to specify. Often, it's just the number and sizes of the hidden layers, the number of epochs, the activation function, and maybe some regularization techniques.
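The dlmodel examined below would come from a call along these lines (a hedged sketch; the hidden-layer sizes, epochs, and activation are assumptions):

dlmodel <- h2o.deeplearning(x = 1:784, y = 785,
                            data = train_hex, validation = test_hex,
                            activation = "Tanh", hidden = c(50, 50), epochs = 1)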
Let's look at the model summary, and the confusion matrix and classification error (on the validation set, since it was
provided) in particular:
dlmodel
dlmodel@model$confusion
dlmodel@model$valid_class_error
To confirm that the reported confusion matrix on the validation set (here, the test set) was correct, we make a prediction on
the test set and compare the confusion matrices explicitly:
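A hedged sketch of that check (for classification, the first column returned by h2o.predict holds the predicted label):

pred_labels <- h2o.predict(dlmodel, test_hex)[, 1]
h2o.confusionMatrix(pred_labels, test_hex[, 785])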
dlmodel@model$params
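The grid_search object used next is produced by passing a list of hidden-layer configurations and a vector of l1 values, as in this hedged sketch (the specific values are assumptions):

grid_search <- h2o.deeplearning(x = 1:784, y = 785,
                                data = train_hex, validation = test_hex,
                                activation = "Tanh",
                                hidden = list(c(10, 10), c(50, 50), c(100, 100)),
                                l1 = c(0, 1e-5), epochs = 0.1)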
Let's see which model had the lowest validation error (grid search automatically sorts the models by validation error):
grid_search
best_model <- grid_search@model[[1]]
best_model
best_params <- best_model@model$params
best_params$activation
best_params$hidden
best_params$l1
We can then find the model with the lowest validation error:
Checkpointing
Let's continue training the best model, for 2 more epochs. Note that since many parameters such as epochs, l1, l2, max_w2, score_interval, train_samples_per_iteration, score_duty_cycle, classification_stop, regression_stop, variable_importances, force_load_balance can be modified between checkpoint restarts, it is best to specify as many of them explicitly as possible.
Once we are satisfied with the results, we can save the model to disk:
Of course, you can continue training this model as well (with the same x , y , data , validation )
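A hedged sketch of the checkpoint restart and of saving the model to disk (the checkpoint and h2o.saveModel arguments shown are assumptions):

dlmodel_continued <- h2o.deeplearning(x = 1:784, y = 785,
                                      data = train_hex, validation = test_hex,
                                      checkpoint = best_model,
                                      activation = best_params$activation,
                                      hidden = best_params$hidden,
                                      l1 = best_params$l1, epochs = 2)
h2o.saveModel(dlmodel_continued, dir = "/tmp", name = "best_mnist_dlmodel")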
# > record_model <- h2o.deeplearning(x = 1:784, y = 785, data = train_hex, validation = test_hex,
# activation = "RectifierWithDropout", hidden = c(1024,1024,2048),
# epochs = 8000, l1 = 1e-5, input_dropout_ratio = 0.2,
# train_samples_per_iteration = -1, classification_stop = -1)
# > record_model@model$confusion
# Predicted
# Actual 0 1 2 3 4 5 6 7 8 9 Error
# 0 974 1 1 0 0 0 2 1 1 0 0.00612
# 1 0 1135 0 1 0 0 0 0 0 0 0.00088
# 2 0 0 1028 0 1 0 0 3 0 0 0.00388
# 3 0 0 1 1003 0 0 0 3 2 1 0.00693
# 4 0 0 1 0 971 0 4 0 0 6 0.01120
# 5 2 0 0 5 0 882 1 1 1 0 0.01121
# 6 2 3 0 1 1 2 949 0 0 0 0.00939
# 7 1 2 6 0 0 0 0 1019 0 0 0.00875
# 8 1 0 1 3 0 4 0 2 960 3 0.01437
# 9 1 2 0 0 4 3 0 2 0 997 0.01189
# Totals 981 1142 1038 1013 977 891 956 1031 964 1007 0.00830
Note: results are not 100% reproducible and also depend on the number of nodes and cores, due to thread race conditions and model averaging effects, which can by themselves be useful to avoid overfitting. Often, it can help to run the initial convergence with more train_samples_per_iteration (automatic values of -2 or -1 are good choices), and then continue from a checkpoint with a smaller number (automatic value of 0 or a number less than the total number of training rows are good choices here).
Regression
If the response column is numeric and non-integer, regression is enabled by default. For integer response columns, as in
this case, you have to specify classification=FALSE to force regression. In that case, there will be only 1 output neuron,
and the loss function and error metric will automatically switch to the MSE (mean square error).
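A hedged sketch of the regression fit described above (layer sizes and epochs are assumptions):

regression_model <- h2o.deeplearning(x = 1:784, y = 785,
                                     data = train_hex, validation = test_hex,
                                     classification = FALSE,
                                     activation = "Tanh", hidden = c(50, 50), epochs = 1)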
Let's look at the model summary (i.e., the training and validation set MSE values):
regression_model
regression_model@model$train_sqr_error
regression_model@model$valid_sqr_error
mean((h2o.predict(regression_model, train_hex)-train_hex[,785])^2)
mean((h2o.predict(regression_model, test_hex)-test_hex[,785])^2)
h2o.mse(h2o.predict(regression_model, test_hex), test_hex[,785])
The difference in the training MSEs is because only a subset of the training set was used for scoring during model building
(10k rows), see section "Scoring on Training/Validation Sets During Training" below.
Cross-Validation
For N-fold cross-validation, specify nfolds instead of a validation frame, and N+1 models will be built: 1 model on the full training data, and N models with a successive 1/N-th of the data held out in each. Those N models then score on the held-out data, and the stitched-together predictions are scored to get the cross-validation error.
Note: There is a RUnit test to demonstrate the correctness of N-fold cross-validation results reported by the model.
The N individual cross-validation models can be accessed as well:
dlmodel@xval
sapply(dlmodel@xval, function(x) (x@model$valid_class_error))
Variable Importances
Variable importances for Neural Network models are notoriously difficult to compute, and there are many pitfalls. H2O Deep
Learning has implemented the method of Gedeon, and returns relative variable importances in descending order of
importance.
If adaptive_rate is disabled, several manual learning rate parameters become important: rate , rate_annealing , rate_decay , momentum_start , momentum_ramp , momentum_stable and nesterov_accelerated_gradient ; their discussion is beyond the scope of this tutorial (see the H2O Deep Learning booklet for details).
Generalization Techniques
L1 and L2 penalties can be applied by specifying the l1 and l2 parameters. Intuition: L1 lets only strong weights survive
(constant pulling force towards zero), while L2 prevents any single weight from getting too big. Dropout has recently been
introduced as a powerful generalization technique, and is available as a parameter per layer, including the input layer.
input_dropout_ratio controls the amount of input layer neurons that are randomly dropped (set to zero), while
hidden_dropout_ratios are specified for each hidden layer. The former controls overfitting with respect to the input data
(useful for high-dimensional noisy data such as MNIST), while the latter controls overfitting of the learned features. Note
that hidden_dropout_ratios require the activation function to end with ...WithDropout .
overfitting on the validation set). Also note that the training or validation set errors can be based on a subset of the training
or validation data, depending on the values for score_validation_samples or score_training_samples , see below. For actual
early stopping on a predefined error rate on the training data (accuracy for classification or MSE for regression), specify
classification_stop or regression_stop .
Categorical Data
For categorical data, a feature with K factor levels is automatically one-hot encoded (horizontalized) into K-1 input neurons.
Hence, the input neuron layer can grow substantially for datasets with high factor counts. In these cases, it might make
sense to reduce the number of hidden neurons in the first hidden layer, such that large numbers of factor levels can be
handled. In the limit of 1 neuron in the first hidden layer, the resulting model is similar to logistic regression with stochastic
gradient descent, except that for classification problems, there's still a softmax output layer, and that the activation function
is not necessarily a sigmoid ( Tanh ). If variable importances are computed, it is recommended to turn on
use_all_factor_levels (K input neurons for K levels). The experimental option max_categorical_features uses feature
hashing to reduce the number of input neurons via the hash trick at the expense of hash collisions and reduced accuracy.
Missing Values
H2O Deep Learning automatically does mean imputation for missing values during training (leaving the input layer
activation at 0 after standardizing the values). For testing, missing test set values are also treated the same way by default.
See the h2o.impute function to do your own mean imputation.
Reproducibility
Every run of Deep Learning produces different results since multithreading is done via Hogwild!, which benefits from intentional lock-free race conditions between threads. To get reproducible results at the expense of speed for small datasets, set reproducible=T and specify a seed. This will not work for big data for technical reasons, and is probably also not desired because of the significant slowdown.
Note that the default value of score_duty_cycle=0.1 limits the amount of time spent in scoring to 10%, so a large number of
scoring samples won't slow down overall training progress too much, but it will always score once after the first MapReduce
iteration, and once at the end of training.
Stratified sampling of the validation dataset can help with scoring on datasets with class imbalance. Note that this option
also requires balance_classes to be enabled (used to over/under-sample the training dataset, based on the max. relative
size of the resulting training dataset, max_after_balance_size ):
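A hedged sketch combining class balancing with stratified validation scoring (the score_validation_sampling argument name is an assumption; balance_classes and max_after_balance_size are the options named above):

dlmodel_balanced <- h2o.deeplearning(x = 1:784, y = 785,
                                     data = train_hex, validation = test_hex,
                                     hidden = c(50, 50), epochs = 1,
                                     balance_classes = TRUE,
                                     max_after_balance_size = 2,
                                     score_validation_sampling = "Stratified")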
nn_mse
h2o_nn <- h2o.deeplearning(x=seq(1,784,by=20),y=785,data=train_hex,hidden=10,classification=F,activation="Tanh")
h2o_pred <- h2o.predict(h2o_nn, test_hex)
h2o_mse <- mean((as.data.frame(h2o_pred)-test.R[,785])^2)
h2o_mse
More information can be found in the H2O Deep Learning booklet and in our
slides.
Unsupervised Learning
Arno Candel & Spencer Aiello: K-Means Clustering
Arno Candel: Dimensionality Reduction on MNIST
Arno Candel: Anomaly Detection on MNIST with H2O Deep Learning
KMeans Clustering
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
datadir <- "/data"
homedir <- file.path(datadir, "h2o-training", "clustering")
iris.h2o <- as.h2o(h2oServer, iris)
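The km.model inspected below would come from a call like this hedged sketch (the number of centers and restricting to the four numeric columns are assumptions):

km.model <- h2o.kmeans(data = iris.h2o, centers = 3, cols = 1:4)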
km.model
km.model@model$centers # The centers for each cluster
km.model@model$tot.withinss # total within cluster sum of squares
km.model@model$cluster # cluster assignments per observation
To see the model parameters that were used, access the model@model$params field:
km.model@model$params
?h2o.kmeans
Use the Gap Statistic (Beta) To Find the Optimal Number of Clusters
This is essentially a grid search over KMeans.
You can get the R documentation help here:
?h2o.gapStatistic
The idea is that, for each 'k', generate WCSS from a reference distribution and examine the gap between the expected
WCSS and the observed WCSS. To obtain the W_k from the reference distribution, 'B' Monte Carlo replicates are drawn
from the reference distribution. For each replicate, a KMeans is constructed and the WCSS reported back.
Let's take a look at the output. The default output display will show the number of KMeans models that were run and the
optimal value of k:
gap_stat
summary(gap_stat)
plot(gap_stat)
We can compare the result with the published result from Fast and Accurate KMeans on Large Datasets where the cost for
k = 12 and ~2GB of RAM was approximately 3.50E+18. This paper implements a streaming KMeans, so of course
accuracy in the streaming case will not be as good as a batch job, but results are comparable within a few orders of
magnitude. H2O gives the ability to work on datasets that don't fit in a single box's RAM without having to stream the data
from cold storage: simply use distributed H2O.
We can also compare with StreamKM++: A Clustering Algorithm for Data Streams. For various k, we can compare our
implementation, but we only do k = 30 here.
# km.census <- h2o.kmeans(data = census.1990, centers = 30, init="furthest") # NOT RUN: Too long on VM
# km.census@model$tot.withinss # NOT RUN: Too long on VM
Let's now compare on the big BigCross dataset, which has 11.6 million data points with 57 integer features.
We can compare the result with the published result from Fast and Accurate KMeans on Large Datasets, where the cost for
k = 24 and ~2GB of RAM was approximately 1.50E+14.
We can also compare with StreamKM++: A Clustering Algorithm for Data Streams. For various k, we can compare our
implementation, but we only do k = 30 here.
# km.bigcross <- h2o.kmeans(data = big.cross, centers = 30, init="furthest") # NOT RUN: Too long on VM
# km.bigcross@model$tot.withinss # NOT RUN: Too long on VM
Dimensionality Reduction
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
DATA = "train.csv.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,DATA), header = F, sep = ',', key = 'train.hex')
The data consists of 784 (=28^2) pixel values per row, with (gray-scale) values from 0 to 255. The last column is the
response (a label in 0,1,2,...,9).
predictors = c(1:784)
resp = 785
We see that the first 50 or 100 principal components cover the majority of the variance of this dataset.
To reduce the dimensionality of MNIST to its 50 principal components, we use the h2o.predict() function with an extra
argument num_pc :
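A hedged sketch of the PCA step and the projection onto the first 50 components (the h2o.prcomp arguments other than the data are assumptions; num_pc is the argument named above):

pca_model <- h2o.prcomp(data_hex[, predictors], standardize = TRUE)
features_pca <- h2o.predict(pca_model, data_hex[, predictors], num_pc = 50)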
We can now convert the data with the autoencoder model to 50-dimensional space (second hidden layer)
To get the full reconstruction from the output layer of the autoencoder, use h2o.predict() as follows
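A hedged sketch of the autoencoder-based reduction and reconstruction (the ae_model definition, its layer sizes, and the layer index are assumptions):

ae_model <- h2o.deeplearning(x = predictors, y = resp, data = data_hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(100, 50, 100), epochs = 1)
# Non-linear 50-dimensional features from the second (narrow) hidden layer
features_ae <- h2o.deepfeatures(data_hex[, predictors], ae_model, layer = 2)
# Full reconstruction from the output layer
reconstruction <- h2o.predict(ae_model, data_hex[, predictors])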
Anomaly Detection
We use the well-known MNIST dataset of hand-written digits, where each row contains the 28^2=784 raw gray-scale pixel
values from 0 to 255 of the digitized digits (0 to 9).
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/mnist/"
TRAIN = "train.csv.gz"
TEST = "test.csv.gz"
train_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ',', key = 'train.hex')
test_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TEST), header = F, sep = ',', key = 'test.hex')
The data consists of 784 (=28^2) pixel values per row, with (gray-scale) values from 0 to 255. The last column is the
response (a label in 0,1,2,...,9).
predictors = c(1:784)
resp = 785
Note that the response column is ignored (it is only required because of a shared Deep Learning code framework).
In case you want to see the lower-dimensional features created by the auto-encoder deep learning model, here's a way to extract them for a given dataset. This is a non-linear dimensionality reduction, similar to PCA, but the values are capped by the activation function (in this case, they range from -1 to 1).
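A hedged sketch of the anomaly-detection workflow in this chapter: train an autoencoder on the training set, then score per-row reconstruction error and extract deep features on the test set (layer sizes, epochs, and object names are assumptions):

ae_model <- h2o.deeplearning(x = predictors, y = resp, data = train_hex,
                             autoencoder = TRUE, activation = "Tanh",
                             hidden = c(50), epochs = 1)
test_rec_error <- as.data.frame(h2o.anomaly(test_hex, ae_model))
test_recon <- h2o.predict(ae_model, test_hex[, predictors])
test_features_deep <- h2o.deepfeatures(test_hex[, predictors], ae_model, layer = 1)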
plotDigit <- function(mydata, rec_error) {
  len <- nrow(mydata)
  N <- ceiling(sqrt(len))
  op <- par(mfrow = c(N, N), pty = 's', mar = c(1, 1, 1, 1), xaxt = 'n', yaxt = 'n')
  for (i in 1:nrow(mydata)) {
    colors <- c('white', 'black')
    cus_col <- colorRampPalette(colors = colors)
    z <- array(mydata[i,], dim = c(28, 28))
    z <- z[, 28:1]
    image(1:28, 1:28, z, main = paste0("rec_error: ", round(rec_error[i], 4)), col = cus_col(256))
  }
  on.exit(par(op))
}
plotDigits <- function(data, rec_error, rows) {
  row_idx <- order(rec_error[,1], decreasing = F)[rows]
  my_rec_error <- rec_error[row_idx,]
  my_data <- as.matrix(as.data.frame(data[row_idx,]))
  plotDigit(my_data, my_rec_error)
}
Let's look at the test set points with low/median/high reconstruction errors. We will now visualize the original test set points
and their reconstructions obtained by propagating them through the narrow neural net.
The good
Let's plot the 25 digits with lowest reconstruction error. First we plot the reconstruction, then the original scanned images.
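A hedged usage sketch, assuming the test_recon and test_rec_error objects sketched earlier in this chapter:

# 25 digits with the lowest reconstruction error: reconstructions, then originals
plotDigits(test_recon, test_rec_error, c(1:25))
plotDigits(test_hex[, predictors], test_rec_error, c(1:25))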
Clearly, a well-written digit 1 appears in both the training and testing set, and is easy to reconstruct by the autoencoder with
minimal reconstruction error. Nothing is as easy as a straight line.
The bad
Now let's look at the 25 digits with median reconstruction error.
These test set digits look "normal" - it is plausible that they resemble digits from the training data to a large extent, but they
do have some particularities that cause some reconstruction error.
The ugly
And here are the biggest outliers - The 25 digits with highest reconstruction error!
Now here are some pretty ugly digits that are plausibly not commonly found in the training data - some are even hard to
classify by humans.
Voila!
We were able to find outliers with H2O Deep Learning Auto-Encoder models. We would love to hear your use case for anomaly detection.
Note: Every run of Deep Learning produces different results since we use Hogwild! parallelization with intentional race conditions between threads. To get reproducible results at the expense of speed for small datasets, set reproducible=T and specify a seed.
More information can be found in the H2O Deep Learning booklet and in our
slides.
Advanced Topics
Arno Candel: Multi-model Parameter Tuning for Higgs Dataset
Arno Candel: Categorical Feature Engineering for Adult dataset
Arno Candel: Other Useful Tools
Multi-model Parameter Tuning
Start H2O, import the HIGGS data, and prepare train/validation/test splits
Initialize the H2O server (enable all cores)
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
Import the data: For simplicity, we use a reduced dataset containing the first 100k rows.
For small datasets, it can help to rebalance the dataset into more chunks to keep all cores busy
Prepare train/validation/test splits: We split the dataset randomly into 3 pieces. Grid search for hyperparameter tuning and
model selection will be done on the training and validation sets, and final model testing is done on the test set. We also
assign the resulting frames to meaningful names in the H2O key-value store for later use, and clean up all temporaries at
the end.
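A hedged sketch of that split (the imported frame name, split fractions, and seed are assumptions; the pattern mirrors the helper used in the next chapter):

random <- h2o.runif(higgs_hex, seed = 123456789)
train_hex <- h2o.assign(higgs_hex[random < 0.6, ], "train_hex")
valid_hex <- h2o.assign(higgs_hex[random >= 0.6 & random < 0.8, ], "valid_hex")
test_hex <- h2o.assign(higgs_hex[random >= 0.8, ], "test_hex")
h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))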
The first column is the response label (background: 0 or higgs: 1). Of the 28 numerical features, the first 21 are low-level detector features and the last 7 are high-level, human-derived features (physics formulae).
response = 1
low_level_predictors = c(2:22)
low_and_high_level_predictors = c(2:29)
source("~/h2o-training/tutorials/advanced/binaryClassificationHelper.R.md")
The code below trains 60 models (2 loops, 5 classifiers with 2 grid search models each, each resulting in 1 full training and
2 cross-validation models). A leaderboard scoring the best models per h2o.fit() function is displayed.
N_FOLDS=2
for (preds in list(low_level_predictors, low_and_high_level_predictors)) {
  data = list(x=preds, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
  models <- c(
    h2o.fit(h2o.glm, data,
            list(family="binomial", variable_importances=T, lambda=c(1e-5,1e-4), use_all_factor_levels=T)),
    h2o.fit(h2o.randomForest, data,
            list(type="fast", importance=TRUE, ntree=c(5), depth=c(5,10))),
    h2o.fit(h2o.randomForest, data,
            list(type="BigData", importance=TRUE, ntree=c(5), depth=c(5,10))),
    h2o.fit(h2o.gbm, data,
            list(importance=TRUE, n.tree=c(10), interaction.depth=c(2,5))),
    h2o.fit(h2o.deeplearning, data,
            list(variable_importances=T, l1=c(1e-5), epochs=1, hidden=list(c(10,10,10), c(100,100))))
  )
  best_model <- h2o.leaderBoard(models, test_hex, response)
  h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
}
The output contains a leaderboard (based on validation AUC) for the models using low-level features only:
Clearly, the high-level features add a lot of predictive power, but what if they are not easily available? On this sampled
dataset and with simple models from low-level features alone, Deep Learning seems to have an edge over the other
methods indicating that it is able to create useful high-level features on its own.
Note: Every run of Deep Learning produces different results since we use Hogwild! parallelization with intentional race conditions between threads. To get reproducible results at the expense of speed for small datasets, set reproducible=T and specify a seed.
With this computationally slightly more expensive Deep Learning model, we achieve a nice boost over the simple models
above: AUC = 0.7245833 (on validation)
Voila!
We applied multi-model grid search with N-fold cross-validation to the real-world Higgs dataset, and demonstrated the power of H2O Deep Learning in automatically creating non-linear derived features for the highest predictive accuracy!
More information can be found in the H2O Deep Learning booklet and in our
slides.
Categorical Feature Engineering
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/adult/"
TRAIN = "adult.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ' ', key = 'data_hex')
We manually assign column names since they are missing in the original file.
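A hedged sketch of the column-name assignment, assuming the standard UCI Adult column order and that column names can be replaced on H2O frames as on data.frames:

colnames(data_hex) <- c("age", "workclass", "fnlwgt", "education", "education-num",
                        "marital-status", "occupation", "relationship", "race", "sex",
                        "capital-gain", "capital-loss", "hours-per-week",
                        "native-country", "income")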
We will try to predict whether income is <=50K or >50K.
summary(data_hex$income)
response = "income"
First, we source a few helper functions that allow us to quickly compare a multitude of binomial classification models, in
particular the h2o.fit() and h2o.leaderBoard() functions. Note that these specific functions require variable importances and
N-fold cross-validation to be enabled.
source("~/h2o-training/tutorials/advanced/binaryClassificationHelper.R.md")
We then add this simple helper function to split a frame into train/valid/test pieces, train a GLM and a GBM model with 2-fold cross-validation, and obtain the best model after printing a leaderboard. For more accurate results, N_FOLDS can be increased.
N_FOLDS = 2
h2o.trainModels <- function(frame) {
  # split the data into train/valid/test
  random <- h2o.runif(frame, seed = 123456789)
  train_hex <- h2o.assign(frame[random < .8,], "train_hex")
  valid_hex <- h2o.assign(frame[random >= .8 & random < .9,], "valid_hex")
  test_hex <- h2o.assign(frame[random >= .9,], "test_hex")
  predictors <- colnames(frame)[-match(response,colnames(frame))]
  # multi-model comparison with N-fold cross-validation
  data = list(x=predictors, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
  models <- c(
    h2o.fit(h2o.glm, data, glmparams),
    h2o.fit(h2o.gbm, data, gbmparams)
  )
  best_model <- h2o.leaderBoard(models, test_hex, match(response,colnames(frame)))
  h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
  best_model
}
For simplicity, we use default parameters (no grid search parameter tuning) to establish baseline performance numbers for
this dataset.
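The parameter lists consumed by h2o.trainModels are not shown above; a hedged sketch using near-default settings plus the options the helper functions require (variable importances), followed by the call itself:

glmparams <- list(family = "binomial", variable_importances = TRUE, use_all_factor_levels = TRUE)
gbmparams <- list(importance = TRUE)
best_model <- h2o.trainModels(data_hex)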
Both GLM and GBM do a great job on this dataset; we get validation AUC values above 90% (GLM: 0.9028564, GBM: 0.9009924). According to GBM, the most important columns are marital-status, relationship, capital-gain, education-num, and age.
Feature engineering
The following section shows ways to create new derived features. We'll need this simple append function, as we're going to add new columns to our dataset.
GLM clearly benefited from this. We see that ages 18,19 and 20 are among the most important predictors for income and
we get the following validation AUC values: GLM: 0.9066634 GBM: 0.9009924
For fun, let's look at the largest positively and negatively correlated coefficients:
head(sort(best_model@model$normalized_coefficients,decreasing=T),5)
head(sort(best_model@model$normalized_coefficients,decreasing=F),5)
With all these new factor levels as predictors, GLM now got a nice boost: GLM: 0.9285384 GBM: 0.9021225
Let's give GBM a shot at beating GLM by using better parameters:
Ok, now both algorithms reach similar validation AUC values: GLM: 0.9285384 GBM: 0.9286973
We see that the training AUC for GLM improves slightly, from 0.9269056070 to 0.9269586507 . Intuition: Money is often
distributed exponentially, and the log transform brings it back to a linear space. Note that the validation AUC drops, likely
due to small data statistical noise. We clearly got close to the limit of this dataset. Note that GBM didn't benefit from this transform; it seems to be able to better split up the original integer space.
Other Useful Tools
library(h2o)
h2oServer <- h2o.init(nthreads=-1)
myframe = h2o.createFrame(h2oServer, 'framekey', rows = 20, cols = 5,
seed = -12301283, randomize = TRUE, value = 0,
categorical_fraction = 0.8, factors = 10, real_range = 1,
integer_fraction = 0.2, integer_range = 10, missing_fraction = 0.2,
response_factors = 1)
We created a small random frame in H2O that contains missing values, categorical and numerical columns.
head(myframe,20)
# response C1 C2 C3 C4 C5
# 1 -0.4227451 4a41c a1290 3 326a0 82dce
# 2 0.4594655 0a79a a1290 -6 05f22 f4772
# 3 0.2784008 70d75 4c3a7 7 320a8 6b17b
# 4 -0.5674698 07067 6 320a8 a78d9
# 5 -0.7041911 24800 4c3a7 8 82dce
# 6 0.5853957 8e64f a1290 NA ea644
# 7 0.8540204 4 05f22 a6b1e
# 8 -0.1466706 4a41c 1 cef5a 89717
# 9 NA 0a79a c402d 3 070d5 a78d9
# 10 -0.7357408 4a41c a1290 9 e055d 64439
# 11 -0.2798403 4c3a7 3 f4772
# 12 0.3454386 24800 NA
# 13 NA 0a79a ca1de 9 070d5 a78d9
# 14 -0.6732674 8e64f 50b47 9 320a8 6b17b
# 15 NA cc3d5 ca1de 4 e055d f4772
# 16 NA cc3d5 07067 1 070d5 89717
# 17 0.8564855 4f8b9 a1290 -1 326a0 6b17b
# 18 -0.5555804 0a79a 50b47 NA 82dce
# 19 0.5639324 3 e055d
# 20 0.2511743 7 30c85 cae4c
We remove the response column and convert the integer column to a factor.
Create 5-th order interaction between the specified columns, and allow up to 10k resulting factors (per pair-wise
interaction).
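A hedged sketch of the fifth-order interaction (the column set is an assumption; the argument style follows the h2o.interaction call shown below for the integer column):

higher_order <- h2o.interaction(myframe, key = 'higherorder',
                                factors = c("C1", "C2", "C3", "C4", "C5"),
                                pairwise = FALSE, max_factors = 10000,
                                min_occurrence = 1)
head(higher_order, 20)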
Create a categorical variable out of the integer column via self-interaction, and keep at most 3 factors, and only if they
occur at least twice
summary(myframe$C3)
head(myframe$C3, 20)
trim_integer_levels <- h2o.interaction(myframe, key = 'trim_integers', factors = 3,
pairwise = FALSE, max_factors = 3, min_occurrence = 2)
head(trim_integer_levels, 20)
ds <- iris
ds[sample(nrow(ds), 50),1] <- NA
ds[sample(nrow(ds), 50),2] <- NA
ds[sample(nrow(ds), 50),3] <- NA
ds[sample(nrow(ds), 50),4] <- NA
ds[sample(nrow(ds), 50),5] <- NA
summary(ds)
Impute the NAs in the second column with the mean based on the groupBy columns Sepal.Length and Petal.Width and
Species
Impute the Species column with the "mode" based on the columns 1 and 4
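A hedged sketch of the two imputations described above (the h2o.impute argument names are assumptions; the iris copy with NAs is first loaded into H2O):

ds.hex <- as.h2o(h2oServer, ds, key = "ds.hex")
h2o.impute(ds.hex, "Sepal.Width", method = "mean",
           groupBy = c("Sepal.Length", "Petal.Width", "Species"))
h2o.impute(ds.hex, "Species", method = "mode", groupBy = c(1, 4))
summary(ds.hex)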
Now, we split that dataset into 4 consecutive pieces, so we need to specify the sizes of the first 3 splits
Practical Use Cases for Marketing
featureSet <- c("ODATEDW", "OSOURCE", "STATE", "ZIP", "PVASTATE", "DOB", "RECINHSE", "MDMAUD", "DOMAIN", "CLUSTER", "AGE",
"HOMEOWNR", "CHILD03", "CHILD07", "CHILD12", "CHILD18", "NUMCHLD", "INCOME", "GENDER", "WEALTH1", "HIT", "COLLECT1",
"VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER", "GARDENIN",
"BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES", "PEPSTRFL", "CARDPROM", "MAXADATE", "NUMPROM", "CARDPM12", "NUMPRM12",
"RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "LASTGIFT", "LASTDATE", "FISTDATE", "TIMELAG", "AVGGIFT",
"HPHONE_D", "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F", "MDMAUD_A", "CLUSTER2", "GEOCODE2", "TARGET_D")
library(party)
cf <- cforest(TARGET_D ~ ., data= kdd98, control = cforest_unbiased(mtry=2, ntree=50))
After a while, a window will pop up showing "Unable to establish connection with R session". In order to continue to use RStudio, you may go to the task manager and kill rsession. This demonstrates that cforest runs out of memory for this trimmed dataset and cannot return any results.
Use the complete dataset to run big data random forest with H2O
Let's use H2O to build a big data random forest model and predict who to mail and how much profit the fundraising campaign may generate.
To run H2O on your VM, double click the H2O icon on your VM desktop.
Then open a browser, type localhost:54321 to access H2O web UI.
Let's start uploading the datasets, build the model, and do prediction.
Name the prediction key "drf_prediction" and download to csv after predicting.
After getting the prediction, run the following with the H2O prediction values joined back onto the dataset.
sum(kdd98_withpred$yield) # profit
max(kdd98_withpred$yield) # max donation
sum(kdd_pred_val) # mails sent
Sparkling Water
What is Sparkling Water?
Sparkling Water is an integration, developed by Michal Malohlava, that brings together the two open-source platforms H2O and Spark. With this integration, users of both H2O and Spark can take advantage of H2O's algorithms, Spark SQL, and both parsers. The typical use case is to load data into either H2O or Spark, use Spark SQL to make a query, feed the results into H2O Deep Learning to build a model, make predictions, and then use the results again in Spark.
The current integration requires an installation of Spark. To launch a Sparkling Water application, the user starts a Spark cluster with H2O instances sitting in the same JVM. The user specifies the number of worker nodes and the amount of memory on each node when starting a Spark Shell or when submitting a job to YARN on Hadoop.
Data Distribution
The key to this tutorial is understanding how data is distributed on the Sparkling Water cluster. After data has been imported into either Spark or H2O, we can publish H2O data frames as SchemaRDDs in Spark in order to run Spark SQL without replicating the data. However, reading data into H2O from Spark requires a one-time load of the data into H2O, which is not as cheap.
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.2.1-bin-hadoop2.4.tgz
tar -xvf spark-1.2.1-bin-hadoop2.4.tgz
cd spark-1.2.1-bin-hadoop2.4
export SPARK_HOME=`pwd`
wget http://h2o-release.s3.amazonaws.com/sparkling-water/master/95/sparkling-water-0.2.12-95.zip
unzip sparkling-water-0.2.12-95.zip
cd sparkling-water-0.2.12-95
To launch a Sparkling Shell, which is essentially a Spark Shell with the H2O classes on the classpath, run the script sparkling-shell from the command line.
bin/sparkling-shell
Create an H2O context. Now the user can use all of the H2O classes that the Sparkling Shell was launched with.
Filter
Use Spark to filter out all flights coming from O'Hare in the airlines dataset.
weatherTable.registerTempTable("WeatherORD")
Python on H2O
Similar to the R API, the H2O python module provides access to the H2O JVM, its objects, its machine-learning algorithms,
and modeling support (basic munging and feature generation) capabilities. The python integration is only available in the
new generation H2O since all the REST endpoints it hits are different from the endpoints in the older version of H2O.
The H2O python module is not intended as a replacement for other popular machine learning modules such as scikit-learn, pylearn2, and their ilk. This module is a complementary interface to a modeling engine intended to make the transition of models from development to production as seamless as possible. Additionally, it is designed to bring H2O to a wider audience of data and machine learning devotees who work exclusively with Python (rather than R, Scala, or Java -- other popular interfaces that H2O supports), and who want another tool for building applications or doing data munging in a fast, scalable environment without any extra mental anguish about threads and parallelism.
This tutorial demonstrates basic data import, manipulations, and summarizations of data within an H2O cluster from within
Python. It requires an installation of the h2o python module and its dependencies.
import h2o
h2o.init()
Get data
This tutorial in particular will use New York City's public Citi Bike dataset and import one month of all trip information across 300+ stations into H2O. Then a model predicting the demand and usage of bikes will be built to load-balance the Citi Bike inventory. Users can download the data from the Citi Bike NYC website.
wget https://s3.amazonaws.com/tripdata/201310-citibike-tripdata.zip
We will use the h2o.import_frame function to read the data into the H2O key-value store.
dir_prefix = "~/Downloads/"
small_test = dir_prefix + "201310-citibike-tripdata.zip"
data = h2o.import_frame(path=small_test)
data.describe()
data.dim()
data.head()
Derive columns
In this step we will take the time since epoch and divide by the number of seconds in a day to get all the unique days in the
dataset and bind the new column created to the original H2OFrame .
startime = data["starttime"]
secsPerDay=1000*60*60*24
data["Days"] = (startime/secsPerDay).floor()
data.describe()
Now the dataset has only three features, but we can add Month and DayOfWeek columns to the frame, hopefully to detect patterns where bike stations near the financial district are more active during work days while bikes near parks are more active during the weekend.
secs = bpd["Days"]*secsPerDay
bpd["Month"] = secs.month()
bpd["DayOfWeek"] = secs.dayOfWeek()
print "Bikes-Per-Day"
bpd.describe()
def split_fit_predict(data):
    # Classic Test/Train split
    r = data['Days'].runif()   # Random UNIForm numbers, one per row
    train = data[r < 0.6]
    test = data[(0.6 <= r) & (r < 0.9)]
    hold = data[0.9 <= r]
    print "Training data has", train.ncol(), "columns and", train.nrow(), "rows, test has", test.nrow(), "rows, holdout has", hold.nrow()
    # Run GBM
    gbm = h2o.gbm(x=train.drop("bikes"),
                  y=train["bikes"],
                  validation_x=test.drop("bikes"),
                  validation_y=test["bikes"],
                  ntrees=500,      # 500 works well
                  max_depth=6,
                  learn_rate=0.1)
    # Run DRF
    drf = h2o.random_forest(x=train.drop("bikes"),
                            y=train["bikes"],
                            validation_x=test.drop("bikes"),
                            validation_y=test["bikes"],
                            ntrees=500,  # 500 works well
                            max_depth=50)
    # Run GLM
    glm = h2o.glm(x=train.drop("bikes"),
                  y=train["bikes"],
                  validation_x=test.drop("bikes"),
                  validation_y=test["bikes"],
                  dropNA20Cols=True)
    #glm.show()
Here we see an r^2 of 0.92 for GBM, and 0.72 for GLM on the test set. This means that given just the station, the month, and the day of week, we can predict most of the variance in bike usage.
split_fit_predict(bpd)
Demos
Tableau
Excel
Streaming Data
Tableau
Beauty and Big Data: Tableau
Tableau
Download Tableau in order to use notebooks available on our github.
R Component
First, make sure to install H2O in R:
> install.packages("h2o")
Then install the Rserve package in R that will allow the user to start up a R session on a local server that Tableau will
communicate with:
> install.packages("Rserve")
> library("Rserve")
> run.Rserve(port = 6311)
> library(h2o)
> tableauFunctions <- function(x){
> ...
> }
> print('Finish loading necessary functions')
> library(h2o)
> localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1)
> data = h2o.importFile(localH2O, "/data/h2o-training/airlines/allyears2k.csv")
Execute 02 Compute Aggregation with H2O's ddply to group by columns and do roll-ups. First calculate the number of flights coming and going per month:
Execute 03 Run GLM to build a GLM model in H2O and grab back coefficient values that will be plotted in multiple
worksheets:
data.glm = h2o.glm(x = c('Origin', 'Dest', 'Distance', 'Unique Carrier') , y = 'Cancelled', data = data.hex, family = 'binomial', n
After the calculated fields finish running, scroll through the different dashboards to see the data from different angles.
Excel
Beauty and Big Data: Excel
Using Excel with H2O
When working with Excel, the HTTP call to H2O will elicit a response in XML format that Microsoft Excel can parse in both 32-bit and 64-bit versions. For Excel users who are already familiar with Excel features and functions, this tutorial will walk you through some basic H2O capabilities written in VBA. The demonstration Excel file has to be macro-enabled to access all the point-and-click features.
Data Summary
Hit Generate Summary to create a new worksheet with a summary of all the columns in the data set
All columns except NA% are taken directly from H2O's Inspect page.
In this tutorial, we explore a combined modeling and streaming workflow as seen in the picture below:
We produce a GBM model by running H2O and emitting a Java POJO used for scoring. The POJO is very lightweight and
does not depend on any other libraries, not even H2O. As such, the POJO is perfect for embedding into third-party
environments, like a Storm bolt.
This tutorial walks you through the following sequence:
Installing the required software
A brief discussion of the data
Using R to build a gbm model in H2O
Exporting the gbm model as a Java POJO
Copying the generated POJO files into a Storm bolt build environment
Building Storm and the bolt for the model
NOTE: Building storm (c.f. Section 5) requires Maven. You can install Maven (version 3.x) by following the Maven
installation instructions.
Navigate to the directory for this tutorial inside the h2o-training repository:
$ cd h2o-training/tutorials/streaming/storm
2.2. Install R
Get the latest version of R from CRAN and install it on your computer.
(The sample of the training data could not be recovered cleanly from the original layout. Its columns include Label, Has4Legs, CoatColor, HairLength, TailLength, EnjoysPlay, StaresOutWindow, and HoursSpentNapping; Label is either 'cat' or 'dog', and CoatColor takes values such as Brown, Grey, Spotted, Black, and White.)
Note that the first row in the training data set is a header row specifying the column names.
The response column (i.e. the "y" column) we want to make predictions for is Label. It's a binary column, so we want to
build a classification model. The response column is categorical, and contains two levels, 'cat' and 'dog'. Note that the ratio
of dogs to cats is 3:1.
The remaining columns are all input features (i.e. the "x" columns) we use to predict whether each new observation is a
'cat' or a 'dog'. The input features are a mix of integer, real, and categorical columns.
>
> cat("Building GBM model\n")
Building GBM model
> df <- h2o.importFile(h, "training_data.csv");
> y <- "Label"
> x <- c("NumberOfLegs","CoatColor","HairLength","TailLength","EnjoysPlay","StairsOutWindow","HoursSpentNapping","RespondsToCommands","
> gbm.h2o.fit <- h2o.gbm(data = df, y = y, x = x, n.trees = 10)
>
> cat("Downloading Java prediction model code from H2O\n")
Downloading Java prediction model code from H2O
> model_key <- gbm.h2o.fit@key
> tmpdir_name <- "generated_model"
> cmd <- sprintf("rm -fr %s", tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: rm -fr generated_model"
[1] 0
> cmd <- sprintf("mkdir %s", tmpdir_name)
> safeSystem(cmd)
[1] "+ CMD: mkdir generated_model"
[1] 0
> cmd <- sprintf("curl -o %s/GBMPojo.java http://%s:%d/2/GBMModelView.java?_modelKey=%s", tmpdir_name, myIP, myPort, model_key)
> safeSystem(cmd)
[1] "+ CMD: curl -o generated_model/GBMPojo.java http://localhost:54321/2/GBMModelView.java?_modelKey=GBM_9d538f637ef85c78d6e2fea88ad54
[1] 0
> cmd <- sprintf("curl -o %s/h2o-model.jar http://127.0.0.1:54321/h2o-model.jar", tmpdir_name)
> safeSystem(cmd)
The h2o-model.jar file contains the interface definition, and the GBMPojo.java file contains the Java code for the POJO
model.
The following three sections from the generated model are of special importance.
This is the class to instantiate in the Storm bolt to make predictions. (Note that we simplified the class name using sed as
part of the R script that exported the model. By default, the class name has a long UUID-style name.)
predict() is the method to call to make a single prediction for a new observation. data is the input, and preds is the output.
The return value is just preds, and can be ignored.
Inputs and Outputs must be numerical. Categorical columns must be translated into numerical values using the DOMAINS
mapping on the way in. Even if the response is categorical, the result will be numerical. It can be mapped back to a level
string using DOMAINS, if desired. When the response is categorical, the preds response is structured as follows:
preds[0] contains 0 or 1
preds[1] contains the probability that the observation is ColInfo_15.VALUES[0]
preds[2] contains the probability that the observation is ColInfo_15.VALUES[1]
The DOMAINS array contains information about the level names of categorical columns. Note that Label (the column we
are predicting) is the last entry in the DOMAINS array.
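As a rough illustration of this calling convention - a hedged sketch, not the tutorial's exact code; the feature count and the exact DOMAINS layout here are assumptions - scoring one observation with the generated POJO might look like this:

// Hypothetical scoring snippet for the generated GBMPojo class described above.
// Assumes categorical inputs have already been translated to numeric indices via DOMAINS.
double[] data  = new double[8];     // one row of input features (the count is illustrative)
float[]  preds = new float[3];      // [predicted label, P(level 0), P(level 1)]

GBMPojo model = new GBMPojo();
model.predict(data, preds);         // fills preds in place; the return value can be ignored

// Map the numeric prediction back to a level string, if desired.
// The response column's levels are the last entry in the DOMAINS array.
String[] labels    = GBMPojo.DOMAINS[GBMPojo.DOMAINS.length - 1];
String   predicted = labels[(int) preds[0]];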
Once Storm is built, open up your favorite IDE to start building the H2O streaming topology. In this tutorial, we will be using IntelliJ.
To import the storm-starter project into IntelliJ, please follow these screenshots:
Click on "Import Project" and find the storm repo. Select storm and click "OK".
Import the project from external model using Maven, then click "Next".
Ensure that the "Import Maven projects automatically" checkbox is checked (it's off by default), then click "Next".
That's it! Now click through the remaining prompts (Next -> Next -> Next -> Finish).
Once inside the project, open up storm-starter/test/jvm/storm.starter. Yes, we'll be working out of the test directory.
$ cp TestH2ODataSpout.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
We now copy over the POJO from section 4 into our storm project.
$ cp ./generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
OR if you were not able to build the GBMPojo, copy over the pre-generated version:
$ cp ./premade_generated_model/GBMPojo.java /PATH_TO_STORM/storm/examples/storm-starter/test/jvm/storm/starter/
In order to use the GBMPojo class, our PredictionBolt in H2OStormStarter has the following "execute" block:
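As a hedged sketch only - assuming the GBMPojo class from section 4 and a tuple laid out as the expected label followed by the numeric feature values, both of which are assumptions - such an execute block might look roughly like this:

@Override
public void execute(Tuple tuple) {
  // Assumed tuple layout: field 0 is the expected label, fields 1..n are numeric features.
  String expected = tuple.getString(0);
  double[] data = new double[tuple.size() - 1];
  for (int i = 1; i < tuple.size(); i++)
    data[i - 1] = tuple.getDouble(i);

  // Score with the generated POJO; preds[2] is assumed to hold the 'dog' probability,
  // per the DOMAINS ordering discussed earlier.
  float[] preds = new float[3];
  new GBMPojo().predict(data, preds);
  float dogProb = preds[2];

  _collector.emit(tuple, new Values(expected, dogProb));
  _collector.ack(tuple);
}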
The probability emitted is the probability of being a 'dog'. We use this probability to decide whether the observation is of type 'cat' or 'dog' depending on some threshold. This threshold was chosen such that the F1 score was maximized for the testing data (please see AUC and/or h2o.performance() from R).
The ClassifierBolt then looks like:
@Override
public void execute(Tuple tuple) {
  String expected = tuple.getString(0);
  double dogProb = tuple.getFloat(1);
  String content = expected + "," + (dogProb <= _thresh ? "dog" : "cat");
  try {
    File file = new File("/Users/spencer/0xdata/h2o-training/tutorials/streaming/storm/web/out"); // EDIT ME! PUT YOUR PATH TO /we
    if (!file.exists()) file.createNewFile();
    FileWriter fw = new FileWriter(file.getAbsoluteFile());
    BufferedWriter bw = new BufferedWriter(fw);
    bw.write(content);
    bw.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
  _collector.emit(tuple, new Values(expected, dogProb <= _thresh ? "dog" : "cat"));
  _collector.ack(tuple);
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
  declarer.declare(new Fields("expected_class", "class"));
}
}
To watch the predictions in real time, we start up an http-server on port 4040 and navigate to http://localhost:4040.
In order to get http-server, install npm (you may need sudo):
$ brew install npm
$ npm install http-server -g
Once these are installed, you may navigate to the web directory and start the server:
$ cd web
$ http-server -p 4040 -c-1
Now open up your browser and navigate to http://localhost:4040. Requires a modern browser (depends on D3 for
animation).
Here's a short video showing what it looks like all together.
Enjoy!
References
CRAN
GBM
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1, p. 339. Springer New York, 2001.
Data Science with H2O (GBM)
Gradient Boosting (Wikipedia)
H2O
H2O Markov stable release
Java POJO
R
Storm
$ cd h2o-droplets
Note: if you are using H2O training sandbox, the repository is already cloned for you in ~/devel/h2o-droplets .
total 8
drwxr-xr-x 6 tomk staff 204 Oct 28 08:40 .
drwxr-xr-x 35 tomk staff 1190 Oct 28 08:40 ..
drwxr-xr-x 13 tomk staff 442 Oct 28 08:40 .git
-rw-r--r-- 1 tomk staff 322 Oct 28 08:40 README.md
drwxr-xr-x 11 tomk staff 374 Oct 28 08:40 h2o-java-droplet
drwxr-xr-x 11 tomk staff 374 Oct 28 08:40 h2o-scala-droplet
h2o-java-droplet/.gitignore
h2o-java-droplet/build.gradle
h2o-java-droplet/gradle/wrapper/gradle-wrapper.jar
h2o-java-droplet/gradle/wrapper/gradle-wrapper.properties
h2o-java-droplet/gradle.properties
h2o-java-droplet/gradlew
h2o-java-droplet/gradlew.bat
h2o-java-droplet/README.md
h2o-java-droplet/settings.gradle
h2o-java-droplet/src/main/java/water/droplets/H2OJavaDroplet.java
h2o-java-droplet/src/test/java/water/droplets/H2OJavaDropletTest.java
As you can see, the Java example consists of a build.gradle file, a Java source file, and a Java test file.
Look at the build.gradle file and you will see the following sections, which link the Java droplet sample project to a version of h2o-dev published in Maven Central:
repositories {
  mavenCentral()
}

ext {
  h2oVersion = '0.1.8'
}

dependencies {
  // Define dependency on core of H2O
  compile "ai.h2o:h2o-core:${h2oVersion}"
  // Define dependency on H2O algorithm
  compile "ai.h2o:h2o-algos:${h2oVersion}"
  // Demands web support
  compile "ai.h2o:h2o-web:${h2oVersion}"
  // H2O uses JUnit for testing
  testCompile 'junit:junit:4.11'
}
This is all very standard Gradle stuff. In particular, note that this example depends on three different H2O artifacts, all of
which are built in the h2o-dev repository.
h2o-core contains base platform capabilities like H2O's in-memory distributed key/value store and mapreduce
frameworks (the "water" package).
h2o-algos contains math algorithms like GLM and Random Forest (the "hex" package).
h2o-web contains the browser web UI (lots of javascript).
:ideaModule
Download http://repo1.maven.org/maven2/ai/h2o/h2o-core/0.1.8/h2o-core-0.1.8.pom
Download http://repo1.maven.org/maven2/ai/h2o/h2o-algos/0.1.8/h2o-algos-0.1.8.pom
Download http://repo1.maven.org/maven2/ai/h2o/h2o-web/0.1.8/h2o-web-0.1.8.pom
[... many more one-time downloads not shown ...]
:ideaProject
:ideaWorkspace
:idea
BUILD SUCCESSFUL
Total time: 51.429 secs
You will see three new files created with IDEA extensions. The .ipr file is the project file.
$ ls -al
total 168
drwxr-xr-x 15 tomk staff 510 Oct 28 10:03 .
drwxr-xr-x 6 tomk staff 204 Oct 28 08:40 ..
-rw-r--r-- 1 tomk staff 273 Oct 28 08:40 .gitignore
drwxr-xr-x 3 tomk staff 102 Oct 28 10:03 .gradle
-rw-r--r-- 1 tomk staff 1292 Oct 28 08:40 README.md
-rw-r--r-- 1 tomk staff 1409 Oct 28 08:40 build.gradle
drwxr-xr-x 3 tomk staff 102 Oct 28 08:40 gradle
-rw-r--r-- 1 tomk staff 23 Oct 28 08:40 gradle.properties
-rwxr-xr-x 1 tomk staff 5080 Oct 28 08:40 gradlew
-rw-r--r-- 1 tomk staff 2404 Oct 28 08:40 gradlew.bat
-rw-r--r-- 1 tomk staff 33316 Oct 28 10:03 h2o-java-droplet.iml
-rw-r--r-- 1 tomk staff 3716 Oct 28 10:03 h2o-java-droplet.ipr
-rw-r--r-- 1 tomk staff 9299 Oct 28 10:03 h2o-java-droplet.iws
-rw-r--r-- 1 tomk staff 39 Oct 28 08:40 settings.gradle
drwxr-xr-x 4 tomk staff 136 Oct 28 08:40 src
Choose the h2o-java-droplet.ipr project file that we just created with gradle.
KMeans
Hacking Algorithms into H2O: KMeans
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All tutorials start out the same - getting the h2o-dev code
and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example, and then
the content is customized for each algorithm. This tutorial describes the algorithm K-Means.
What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST and JSON support on your new algorithm. You still have to write the math,
of course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.
Building H2O-dev
I assume you are familiar with basic Java development and how GitHub repos work - so we'll start with a clean GitHub repo of h2o-dev:
This will download the h2o-dev source base; it took about 30 seconds for me from home onto my old-school Windows 7 box.
Then do an initial build:
C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE
:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:
But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.
While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, the distributed Key-Value store, and the web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example algorithm and looking at its code.
h2o-web : The browser web UI (lots of JavaScript).
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one line here to register our new algorithm.
In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:
Ok, let's set up our IDE. For me, since I'm using the default IntelliJ IDEA setup:
BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:
11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started -----
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java
code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).
Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their KMeans2* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its KMeans2 variant. Finally I renamed
the files and file contents from Example to KMeans2 .
I also dove into h2o-app/src/main/java/water/H2OApp.java and copied the two Example lines and made their KMeans2
variants. Because I'm old-school, I did this with a combination of shell hacking and Emacs; about 5 minutes all told.
At this point, back in IDEAJ, I navigated to KMeans2Test.java , right-clicked debug-test testIris again - and was rewarded
with my KMeans2 clone running a basic test. Not a very good KMeans, but definitely a start.
parms._response_column = "class";
At this point we can run our test code again (still finding the max-per-column).
Inside the KMeans2Model class, there is a class for the model's output: class KMeans2Output . We'll put our clusters there. The
various support classes and files will make sure our model's output appears in the correct REST and JSON responses and
gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that now.
To help assess the goodness of our model, we should also report some extra facts about the training results. The obvious thing to report is the Mean Squared Error, or the average squared-error each training point has against its chosen cluster:
And finally a quick report on the effort used to train: the number of iterations training actually took. K-Means runs in
iterations, improving with each iteration. The algorithm typically stops when the model quits improving; we report how many
iterations it took here:
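A minimal sketch of what these extra output fields could look like (the field names here are illustrative guesses, not the tutorial's exact code):

// Hypothetical output fields inside KMeans2Output.
public double[][] _clusters; // K cluster centers, one double[] per cluster
public double     _mse;      // mean squared error of training points against their chosen cluster
public int        _iters;    // number of Lloyd's iterations actually run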
Now, let's turn to the input to our model-building process. These are stored in the class KMeans2Model.KMeans2Parameters .
We already inherit an input training dataset (returned with train() ), possibly a validation dataset ( valid() ), and some
other helpers (e.g. which columns to ignore if we do an expensive scoring step with each iteration). For now, we can ignore
everything except the input dataset from train() .
However, we want some more parameters for K-Means: K , the number of clusters. Define it next to the left-over _max_iters from the old Example code (which we might as well keep since that's a useful stopping condition for K-Means):
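A minimal sketch of the new parameter field (the exact declaration in the tutorial may differ):

// Hypothetical parameter declarations inside KMeans2Model.KMeans2Parameters.
public int _K;                // number of clusters to build
public int _max_iters = 1000; // left over from Example; reused as a stopping condition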
A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private). The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason to use (or not use) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our JSON output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new K field to the external input Schema in h2o-algos/src/main/java/hex/schemas/KMeans2V2.java:
And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/KMeans2ModelV2.java:
In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap) argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front-end sanity-check parameters as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on while clicking in the browser". No running K-Means in the init() call!
Speaking of the init() call, the one we got from the old Example code limits our _max_iters to between 1 and 10 million.
Let's add some lines to check that K is sane:
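A hedged sketch of such a check, assuming the ModelBuilder-style error(field, message) helper used elsewhere in h2o-dev:

// Hypothetical sanity check inside init(); the bounds are illustrative.
if( _parms._K < 2 || _parms._K > 10000000 )
  error("_K", "K must be between 2 and 1e7");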
Immediately when testing the code, I get a failure because the KMeans2Test.java code does not set K and the default is
zero. I'll set K to 3 in the test code:
parms._K = 3;
In the KMeans2.java file there is a trainModel call that is used when you really want to start running K-Means (as opposed
to just checking arguments). In our case, the old boilerplate starts a KMeans2Driver in a background thread. Not required,
but for any long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and
from REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class KMeans2Driver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.
Scope.enter();
Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:
_parms.lock_frames(KMeans2.this);
Locking prevents situations like accidentally deleting or loading a new dataset with the same name while K-Means is running. Like the Scope.exit() above, we will unlock in the finally block. While it might be nice to use Java locking, or even JDK 5.0 locks, we need a distributed lock, which is not provided by JDK 5.0. Note that H2O locks are strictly cooperative - we cannot enforce locking at the JVM level like the JVM does.
Next, we make an instance of our model object (with no clusters yet) and place it in the DKV, locked (e.g., to prevent
another user from overwriting our model-in-progress with an unrelated model).
Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.
My KMeans2 now has a leftover loop from the old Example code running up to some max iteration count. This sounds like a
good start to K-Means - we'll need several stopping conditions, and max-iterations is one of them.
Let's also stop if somebody clicks the "cancel" button in the GUI:
I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. clusters, mse), and save the results back into the DKV, bump the progress bar, and log a little bit of progress:
And now we need to figure out what to do in our main loop. Somewhere between the loop-top isRunning check and the model.update() call, we need to compute something to update our model with! This is the meat of K-Means - for each point, assign it to the nearest cluster center, then compute new cluster centers from the assigned points, and iterate until the clusters quit moving.
Anything that starts out with the words "for each point" when you have a billion points needs to run in-parallel and scale-out to have a chance of completing fast - and this is exactly what H2O is built for! So let's write code that runs scale-out, for each point... and the easiest way to do that is with an H2O Map/Reduce job - an instance of MRTask. For K-Means, this is an instance of Lloyd's basic algorithm. We'll call it from the main-loop like this, and define it below (extra lines included so you can see how it fits):
Basically, we just called some not-yet-defined Lloyds code, computed some cluster centers by computing the average
point from the points in the new cluster, and copied the results into our model. I also printed out the Mean Squared Error
and row counts, so we can watch the progress over time. Finally we end with another stopping condition: stop if the latest
model is really not much better than the last model. Now class Lloyds can be coded as an inner class to the
KMeans2Driver class:
A Quick Helper
A common op is to compute the distance between two points. We'll compute it as the squared Euclidean distance (squared
so as to avoid an expensive square-root operation):
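A small, self-contained sketch of such a helper (plain Java, no H2O types):

// Squared Euclidean distance between a data point and a cluster center.
// Squared on purpose: the relative ordering is preserved and we skip the sqrt.
static double distance(double[] point, double[] center) {
  double sqr = 0;
  for (int i = 0; i < point.length; i++) {
    double diff = point[i] - center[i];
    sqr += diff * diff;
  }
  return sqr;
}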
Lloyd's
Back to Lloyds, we loop over the Chunk[] of data handed to the map call, pulling each row out into a double[] for easy handling:
And then add the point into our growing pile of points in the new clusters we're building:
And that ends the map call and the Lloyds main work loop. To recap, here it is all at once:
The reduce needs to fold together the returned results; the _sums , the _rows and the _se :
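To make the shape of that work concrete, here is a hedged, single-node sketch of the map and reduce steps - plain Java arrays instead of H2O's Chunk/MRTask plumbing, reusing the distance helper sketched above; the field names mirror the _sums , _rows and _se mentioned here but are otherwise illustrative:

// One Lloyd's pass, sketched without the H2O distribution machinery.
static class Lloyds {
  final double[][] _centers; // current cluster centers, K x nfeatures
  double[][] _sums;          // per-cluster sum of assigned points
  long[]     _rows;          // per-cluster row counts
  double     _se;            // total squared error

  Lloyds(double[][] centers) {
    _centers = centers;
    _sums = new double[centers.length][centers[0].length];
    _rows = new long[centers.length];
  }

  // "map": assign each point to its nearest center and accumulate sums, rows and squared error.
  void map(double[][] points) {
    for (double[] p : points) {
      int best = 0;
      double bestDist = distance(p, _centers[0]);
      for (int c = 1; c < _centers.length; c++) {
        double d = distance(p, _centers[c]);
        if (d < bestDist) { bestDist = d; best = c; }
      }
      for (int f = 0; f < p.length; f++) _sums[best][f] += p[f];
      _rows[best]++;
      _se += bestDist;
    }
  }

  // "reduce": fold another Lloyds' partial results into this one.
  void reduce(Lloyds other) {
    for (int c = 0; c < _sums.length; c++) {
      for (int f = 0; f < _sums[c].length; f++) _sums[c][f] += other._sums[c][f];
      _rows[c] += other._rows[c];
    }
    _se += other._se;
  }
}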
Running K-Means
Running the KMeans2Test returns:
11-09 16:47:36.609 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 0 1.9077333333333335 ROWS=[66, 34, 50]
11-09 16:47:36.610 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 1 0.6794683457919871 ROWS=[60, 40, 50]
11-09 16:47:36.611 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 2 0.604730388888889 ROWS=[54, 46, 50]
11-09 16:47:36.613 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 3 0.5870257150458589 ROWS=[51, 49, 50]
11-09 16:47:36.614 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 4 0.5828039837094235 ROWS=[50, 50, 50]
11-09 16:47:36.615 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 5 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.616 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 6 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.618 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 7 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.619 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 8 0.5823599999999998 ROWS=[50, 50, 50]
11-09 16:47:36.621 192.168.1.2:54321 3036 FJ-0-7 INFO: KMeans2: iter: 9 0.5823599999999998 ROWS=[50, 50, 50]
You can see the Mean Squared Error dropping over time, and the clusters stabilizing.
At this point there are a zillion ways we could take this K-Means:
Report MSE on the validation set
Report MSE per-cluster (some clusters will be 'tighter' than others)
Run K-Means with different seeds, which will likely build different clusters - and report the best cluster, or even sort and
group them by MSE.
Handle categoricals (e.g. compute Manhattan distance instead of Euclidean)
Normalize the data (optionally, on a flag). Without normalization, features with larger absolute values will have larger distances (errors) and will get priority over other features.
Aggressively split high-variance clusters to see if we can get K-Means out of a local minimum
Handle clusters that "run dry" (zero rows), possibly by splitting the point with the most error/distance out from its cluster
to make a new one.
I'm sure you can think of lots more ways to extend K-Means!
Good luck with your own H2O algorithm,
Cliff
Grep
Hacking Algorithms into H2O: Grep
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All tutorials start out the same: getting the h2o-dev code
and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example, and then
the content is customized for each algorithm. This tutorial describes computing Grep.
What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST & JSON support on your new algorithm. You still have to write the math, of
course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.
Building H2O-dev
I assume you are familiar with basic Java development and how GitHub repos work - so we'll start with a clean GitHub repo of h2o-dev:
This will download the h2o-dev source base; it took about 30 seconds for me from home onto my old-school Windows 7 box.
Then do an initial build:
C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE
:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:
But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.
While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, the distributed Key-Value store, and the web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example algorithm and looking at its code.
h2o-web : The browser web UI (lots of JavaScript).
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one line here to register our new algorithm.
In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:
Ok, let's set up our IDE. For me, since I'm using the default IntelliJ IDEA setup:
BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:
11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started -----
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java
code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).
Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their Grep* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its Grep variant. Finally, I renamed the files and file contents from Example to Grep.
We also split Schemas from Models to isolate slow-moving external APIs from rapidly-moving internal APIs. As a Java dev,
you can hack the guts of Grep to your heart's content, including the inputs and outputs, as long as the externally facing V2
schemas do not change. If you want to report new stuff or take new parameters, you can make a new V3 schema (which is
not compatible with V2) for the new stuff. Old external V2 users will not be affected by your changes - you'll still have to
make the correct mappings in the V2 schema code from your V3 algorithm.
One other important hack: grep is an unsupervised algorithm - no training data (no "response") tells it what the results
"should" be - it's not really a machine-learning algorithm. So we need to hack the word Supervised out of all the various
class names it appears in. After this is done, your GrepTest probably fails to compile, because it is trying to set the
response column name in the test, and unsupervised models do not get a response to train with. Just delete the line for
now:
parms._response_column = "class";
At this point, we can run our test code again (still finding the max-per-column) instead of running grep.
Inside the GrepModel class, there is a class for the model's output: class GrepOutput . We'll put our grep results there. The
various support classes and files will make sure our model's output appears in the correct REST and JSON responses, and
gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that now.
My final GrepOutput class looks like:
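A minimal sketch of the kind of fields it might hold (the names follow the _matches and _offsets results discussed later in this tutorial, but are still illustrative):

// Hypothetical result fields inside GrepOutput.
public String[] _matches; // the matched strings
public long[]   _offsets; // byte offset of each match within the file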
Now, let's turn to the input for our model-building process. These are stored in the class GrepModel.GrepParameters . We
already inherit an input dataset (returned with train() ), and some other helpers (e.g. which columns to ignore). For now,
we can ignore everything except the input dataset from train() .
However, we want some more parameters for grep: the regular expression. Define it next to the left-over _max_iters from
the old Example code (which we might as well also nuke):
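A minimal sketch of the new parameter (the exact declaration in the tutorial may differ):

// Hypothetical parameter inside GrepModel.GrepParameters.
public String _regex; // the regular expression to search for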
A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance
whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private).
The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason
to use (or not use) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our
JSON output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new regex field to the external schema in h2o-algos/src/main/java/hex/schemas/GrepV2.java:
And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/GrepModelV2.java :
In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap) argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front-end sanity-check parameters as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on while clicking in the browser". No computing grep in the init() call!
Speaking of the init() call, the one we got from the old Example code sanity checks the now-deleted _max_iters . Let's replace that with some basic sanity checking:
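A hedged sketch of that check, assuming the same error(field, message) helper and the JDK's Pattern / PatternSyntaxException classes:

// Hypothetical init() check: the regex must be present and must compile.
if( _parms._regex == null ) {
  error("_regex", "regex is missing");
} else {
  try {
    java.util.regex.Pattern.compile(_parms._regex);
  } catch (java.util.regex.PatternSyntaxException pse) {
    error("_regex", pse.getMessage());
  }
}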
In the Grep.java file there is a trainModel call that is used when you really want to start running grep (as opposed to just
checking arguments). In our case, the old boilerplate starts a GrepDriver in a background thread. Not required, but for any
long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and from
REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class GrepDriver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.
Scope.enter();
Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:
_parms.lock_frames(Grep.this);
Locking prevents situations like accidentally deleting or loading a new dataset with the same name while grep is running.
Like the Scope.exit() above, we will unlock in the finally block. While it might be nice to use Java locking, or even JDK
5.0 locks, we need a distributed lock, which is not provided by JDK 5.0. Note that H2O locks are strictly cooperative - we
cannot enforce locking at the JVM level like the JVM does.
Next, we make an instance of our model object (with no result matches yet) and place it in the DKV, locked (e.g., to prevent
another user from overwriting our model-in-progress with an unrelated grep search).
Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.
I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. grep), and save the results back into the DKV, bump the progress bar, and log a little bit of progress. For this
very simple 'model' we don't need any extra iteration, and we'll do progress on a per-byte instead of per-iteration basis. I'm
just going to assume my not-yet-defined GrepGrep class just ends up with the results, and save them to the model now:
GrepGrep
Back to class GrepGrep , we create holders for the results, compile a JDK regex Pattern object, then start looping over
Matches .
Note that I let the matches end in the second Chunk of data. All matches must start in the first Chunk of data; this prevents
me from counting every match twice (once in the first Chunk on a map call, once again in the second Chunk on some other
map call). Ultimately I limit matches to one Chunk's worth of raw text - about 4Megs for a single match - but they can span
a Chunk boundary.
I defined a simple class allowing a byte[] to be used as an instance of a CharSequence to pass to the Pattern.match call,
and a method to add (accumulate) results. Both are very simple. add accumulates results in the two result arrays,
doubling their size as needed (to keep asymptotic costs low):
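A self-contained sketch of such an accumulator (the field names mirror the _matches , _offsets and _cnt used in the reduce described below):

// Append one match, doubling the backing arrays whenever they fill up.
String[] _matches = new String[1];
long[]   _offsets = new long[1];
int      _cnt;

void add(String match, long offset) {
  if (_cnt == _matches.length) {        // out of room: double both arrays
    _matches = java.util.Arrays.copyOf(_matches, _cnt << 1);
    _offsets = java.util.Arrays.copyOf(_offsets, _cnt << 1);
  }
  _matches[_cnt] = match;
  _offsets[_cnt] = offset;
  _cnt++;
}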
The class ByteSeq wraps byte arrays in a simple interface implementation. There are two special H2O properties at work here making this very efficient. First, raw text data is always stored as... raw text data. The Chunk wrappers used to manipulate the big data are very thin wrappers over raw text - and the underlying byte[] s are directly available. They are also lazily loaded on demand from the file system. Editing the byte[] s can't be stopped with this direct access, and is Bad Coding Style, and won't be reflected around the cluster, so Caveat Emptor. Nonetheless, pure read operations can get direct access.
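A simple sketch of such a wrapper - a minimal version over a single byte[] ; the tutorial's real class presumably spans the two Chunks described above:

// Expose a byte[] of raw text as a CharSequence so it can be fed to Pattern.matcher().
static class ByteSeq implements CharSequence {
  private final byte[] _bs;
  ByteSeq(byte[] bs) { _bs = bs; }
  @Override public int length() { return _bs.length; }
  @Override public char charAt(int i) { return (char) (_bs[i] & 0xFF); }
  @Override public CharSequence subSequence(int start, int end) {
    return new String(_bs, start, end - start);
  }
  @Override public String toString() { return new String(_bs); }
}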
The other special H2O property is a little harder to explain, but just as crucial - H2O does data-placement such that accessing the elements past the end of one Chunk and into another guarantees that the next Chunk in line is statistically very likely to be locally available. If the data was pseudo-randomly distributed, you might expect that asking for a neighbor Chunk has poor odds of being local (for a size N cluster, only 1/N Chunks would be local). Instead, something like 90% of next-Chunks are adjacent. For Chunks that are not adjacent, you'll get a copy (from the ubiquitous DKV) - meaning that some fraction of the data is replicated to cover the edge cases. Basically, you get to ignore the edge case, and have your grep run off the end of one Chunk and into the next, for free.
And that ends the map call and its helpers in the GrepGrep main work loop.
We also need a reduce to fold together the returned results: the _matches , the _offsets and the _cnt . I'm using an add call again to make this easier. First I swap the larger set to the left, then I add the smaller set to the larger, then I make
Running Grep
Back to the class GrepTest , I changed the file to something bigger - bigdata/laptop/text8.gz . You can get it via gradlew syncBigdataLaptop which pulls down medium-large data from the Amazon cloud. Since we didn't add a GUnzip step, I manually unzipped text8.gz to get a 100MB file. GrepTest looks for words with 5 or more letter-pairs in this file.
Running the GrepTest returns:
Takes about 5 seconds to run the regex (?:(\\w)\\1){5} on 100Mb; not too bad. Looking at the answers I see some funny
words there. I checked with grep -P '(?:(\\w)\\1){5}' text8.txt to see that I was getting the same answers - one of the
words is "racoonnookkeeper" - definitely a bit of a stretch! grep -P is somewhat faster than the JDK Pattern class; even though H2O was 4-way parallel on my laptop, the underlying JDK is too slow.
As an obvious extension, it would be nice to download one of the other Java regex solutions I found on the web; most
declared themselves to be vastly faster than the JDK code. Other cool hacks would include getting the line number, and
perhaps returning the entire matching line.
Good luck with your own H2O algorithm,
Cliff
Quantiles
Hacking Algorithms into H2O: Quantiles
This is a presentation of hacking a simple algorithm into the new dev-friendly branch of H2O, h2o-dev.
This is one of three "Hacking Algorithms into H2O" tutorials. All three tutorials start out the same: getting the h2o-dev
code and building it. They are the same until the section titled Building Our Algorithm: Copying from the Example,
and then the content is customized for each algorithm. This tutorial describes computing Quantiles.
What is H2O-dev ?
As I mentioned, H2O-dev is a dev-friendly version of H2O, and is soon to be our only version. What does "dev-friendly"
mean? It means:
No classloader: The classloader made H2O very hard to embed in other projects. No more! Witness H2O's
embedding in Spark.
Fully integrated into IdeaJ: You can right-click debug-as-junit any of the junit tests and they will Do The Right Thing
in your IDE.
Fully gradle-ized and maven-ized: Running gradlew build will download all dependencies, build the project, and run
the tests.
These are all external points. However, the code has undergone a major revision internally as well. What worked well was
left alone, but what was... gruesome... has been rewritten. In particular, it's now much easier to write the "glue" wrappers
around algorithms and get the full GUI, R, REST & JSON support on your new algorithm. You still have to write the math, of
course, but there's not nearly so much pain on the top-level integration.
At some point, we'll flip the usual H2O github repo to have h2o-dev as our main repo, but at the moment, h2o-dev does not
contain all the functionality in H2O, so it is in its own repo.
Building H2O-dev
I assume you are familiar with basic Java development and how GitHub repos work - so we'll start with a clean GitHub repo of h2o-dev:
This will download the h2o-dev source base; it took about 30 seconds for me from home onto my old-school Windows 7 box.
Then do an initial build:
C:\Users\cliffc\Desktop\my-h2o> cd h2o-dev
C:\Users\cliffc\Desktop\my-h2o\h2o-dev> .\gradlew build
...
:h2o-web:test UP-TO-DATE
:h2o-web:check UP-TO-DATE
:h2o-web:build
BUILD SUCCESSFUL
Total time: 11 mins 41.138 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
The first build took about 12mins, including all the test runs. Incremental gradle-based builds are somewhat faster:
But faster yet will be IDE-based builds. There's also a functioning Makefile setup for old-schoolers like me; it's a lot faster
than gradle for incremental builds.
While that build is going, let's look at what we got. There are 4 top-level directories of interest here:
h2o-core : The core H2O system - including clustering, clouding, distributed execution, the distributed Key-Value store, and the web, REST and JSON interfaces. We'll be looking at the code and javadocs in here - there are a lot of useful utilities - but we won't be changing it.
h2o-algos : Where most of the algorithms lie, including GLM and Deep Learning. We'll be copying the Example algorithm and looking at its code.
h2o-web : The browser web UI (lots of JavaScript).
h2o-app : A tiny sample Application which drives h2o-core and h2o-algos, including the one we hack in. We'll add one line here to register our new algorithm.
In the Java directories, we further use water directories to hold core H2O functionality and hex directories to hold
algorithms and math:
Ok, let's set up our IDE. For me, since I'm using the default IntelliJ IDEA setup:
BUILD SUCCESSFUL
Total time: 38.378 secs
C:\Users\cliffc\Desktop\my-h2o\h2o-dev>
and after a few seconds, IDEAJ reports the project is built (with a few dozen warnings).
Let's use IDEAJ to run the JUnit test for the Example algorithm I mentioned above. Navigate to the ExampleTest.java file. I
used a quick double-press of Shift to bring the generic project search, then typed some of ExampleTest.java and selected it
from the picker. Inside the one obvious testIris() function, right-click and select Debug testIris() . The testIris code
should run, pass pretty quickly, and generate some output:
11-08 13:17:07.536 192.168.1.2:54321 4576 main INFO: ----- H2O started -----
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git branch: master
11-08 13:17:07.642 192.168.1.2:54321 4576 main INFO: Build git hash: cdfb4a0f400edc46e00c2b53332c312a96566cf0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build git describe: RELEASE-0.1.10-7-gcdfb4a0
11-08 13:17:07.643 192.168.1.2:54321 4576 main INFO: Build project version: 0.1.11-SNAPSHOT
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built by: 'cliffc'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Built on: '2014-11-08 13:06:53'
11-08 13:17:07.644 192.168.1.2:54321 4576 main INFO: Java availableProcessors: 4
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap totalMemory: 183.5 MB
11-08 13:17:07.645 192.168.1.2:54321 4576 main INFO: Java heap maxMemory: 2.66 GB
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: OS version: Windows 7 6.1 (amd64)
11-08 13:17:07.646 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 127.0.0.1
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: lo (Software Loopback Interface 1), 0:0:0:0:0:
11-08 13:17:07.647 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), 192
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Possible IP Address: eth3 (Realtek PCIe GBE Family Controller), fe8
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Internal communication uses port: 54322
11-08 13:17:07.648 192.168.1.2:54321 4576 main INFO: Listening for HTTP and REST traffic on http://192.168.1.2:54321/
11-08 13:17:07.649 192.168.1.2:54321 4576 main INFO: H2O cloud name: 'cliffc' on /192.168.1.2:54321, discovery address /
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: If you have trouble connecting, try SSH tunneling from your local m
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 cliffc@1
11-08 13:17:07.650 192.168.1.2:54321 4576 main INFO: 2. Point your browser to http://localhost:55555
11-08 13:17:07.652 192.168.1.2:54321 4576 main INFO: Log dir: '\tmp\h2o-cliffc\h2ologs'
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
Ok, that's a pretty big pile of output - but buried in it is some cool stuff we'll need to be able to pick out later, so let's break it down a little.
The yellow stuff is H2O booting up a cluster of 1 JVM. H2O dumps out a bunch of stuff to diagnose initial cluster setup
problems, including the git build version info, memory assigned to the JVM, and the network ports found and selected for
cluster communication. This section ends with the line:
11-08 13:17:07.719 192.168.1.2:54321 4576 main INFO: Cloud of size 1 formed [/192.168.1.2:54321]
This tells us we formed a Cloud of size 1: one JVM will be running our program, and its IP address is given.
The lightblue stuff is our ExampleTest JUnit test starting up and loading some test data (the venerable iris dataset with
headers, stored in the H2O-dev repo's smalldata/iris/ directory). The printout includes some basic stats about the loaded
data (column header names, min/max values, compression ratios). Included in this output are the lines Start Parse and
Done Parse . These come directly from the System.out.println("Start Parse") lines we can see in the ExampleTest.java
code.
Finally, the green stuff is our Example algorithm running on the test data. It is a very simple algorithm (finds the max per
column, and does it again and again, once per requested _max_iters ).
Then I copied the three GUI/REST files in h2o-algos/src/main/java/hex/schemas with Example in the name
( ExampleHandler.java , ExampleModelV2.java , ExampleV2 ) to their Quantile* variants.
I also copied the h2o-algos/src/main/java/hex/api/ExampleBuilderHandler.java file to its Quantile variant. Finally, I renamed the files and file contents from Example to Quantile.
We also split Schemas from Models to isolate slow-moving external APIs from rapidly-moving internal APIs. As a Java dev,
you can hack the guts of Quantile to your heart's content, including the inputs and outputs, as long as the externally facing
V2 schemas do not change. If you want to report new stuff or take new parameters, you can make a new V3 schema
(which is not compatible with V2) for the new stuff. Old external V2 users will not be affected by your changes - you'll still
have to make the correct mappings in the V2 schema code from your V3 algorithm.
One other important hack: quantiles is an unsupervised algorithm - no training data (no "response") tells it what the results
"should" be. So we need to hack the word Supervised out of all the various class names it appears in. After this is done,
your QuantileTest probably fails to compile, because it is trying to set the response column name in the test, and
unsupervised models do not get a response to train with. Just delete the line for now:
parms._response_column = "class";
At this point, we can run our test code again (still finding the max-per-column).
Inside the QuantileModel class, there is a class for the model's output: class QuantileOutput . We'll put our quantiles there.
The various support classes and files will make sure our model's output appears in the correct REST & JSON responses,
and gets pretty-printed in the GUI. There is also the left-over _maxs array from the old Example code; we can delete that
now.
And finally a quick report on the effort used to build the model: the number of iterations we actually took. Our quantiles will run in iterations, improving with each iteration until we get the exact answer. I'll describe the algorithm in detail below, but for now it is enough to report how many iterations it took:
Now, let's turn to the input for our model-building process. These are stored in the class QuantileModel.QuantileParameters .
We already inherit an input training dataset (returned with train() ), and some other helpers (e.g. which columns to ignore
if we do an expensive scoring step with each iteration). For now, we can ignore everything except the input dataset from
train() .
However, we want some more parameters for quantiles: the set of probabilities. Let's also put in some default probabilities
(the GUI & REST/JSON layer will let these be overridden, but it's nice to have some sane defaults). Define it next to the
left-over _max_iters from the old Example code (which we might as well also nuke):
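A minimal sketch of the new parameter with some plausible default probabilities (the exact defaults in the tutorial may differ):

// Hypothetical parameter inside QuantileModel.QuantileParameters.
public double[] _probs = new double[]{0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99};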
A bit on field naming: I always use a leading underscore _ before all internal field names - it lets me know at a glance
whether I'm looking at a field name (stateful, can be changed by other threads) or a function parameter (functional, private).
The distinction becomes interesting when you are sorting through large piles of code. There's no other fundamental reason
to use (or not use) the underscores. External APIs, on the other hand, generally do not care for leading underscores. Our
JSON output and REST URLs will strip the underscores from these fields.
To make the GUI functional, I need to add my new probabilities field to the external schema in h2o-algos/src/main/java/hex/schemas/QuantileV2.java:
And I need to add my result fields to the external output schema in h2o-algos/src/main/java/hex/schemas/QuantileModelV2.java:
In this case, init(false) means "only do cheap stuff in init ". Init is defined a little ways down and does basic (cheap) argument checking. init(false) is called every time the mouse clicks in the GUI and is used to let the front-end sanity-check parameters as people type. In this case "only do cheap stuff" really means "only do stuff you don't mind waiting on while clicking in the browser". No computing quantiles in the init() call!
Speaking of the init() call, the one we got from the old Example code sanity checks the now-deleted _max_iters . Let's add some lines to check that our _probs are sane:
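A hedged sketch of that check, again assuming the error(field, message) helper:

// Hypothetical init() check: every requested probability must lie in [0, 1].
if( _parms._probs == null || _parms._probs.length == 0 )
  error("_probs", "No probabilities requested");
else
  for( double p : _parms._probs )
    if( p < 0.0 || p > 1.0 )
      error("_probs", "Probabilities must be between 0 and 1, got " + p);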
In the Quantile.java file there is a trainModel call that is used when you really want to start running quantiles (as opposed
to just checking arguments). In our case, the old boilerplate starts a QuantileDriver in a background thread. Not required,
but for any long-running algorithm, it is nice to have it run in the background. We'll get progress reports from the GUI (and
from REST/JSON) with the option to cancel the job, or inspect partial results as the model builds.
The class QuantileDriver holds the algorithmic meat. The compute2() call will be called by a background Fork/Join worker
thread to drive all the hard work. Again, there is some brief boilerplate we need to go over.
First up: we need to record Keys stored in H2O's DKV: Distributed Key/Value store, so a later cleanup, Scope.exit(); , will
wipe out any temp keys. When working with Big Data, we have to be careful to clean up after ourselves - or we can swamp
memory with Big Temps.
Scope.enter();
Next, we need to prevent the input datasets from being manipulated by other threads during the model-build process:
_parms.lock_frames(Quantile.this);
Locking prevents situations like accidentally deleting or loading a new dataset with the same name while quantiles is running. Like the Scope.exit() above, we will unlock in the finally block. While it might be nice to use plain Java synchronization or the JDK 5.0 lock classes, we need a distributed lock, which the JDK does not provide. Note that H2O locks are strictly cooperative - unlike JVM monitors, we cannot enforce them at the JVM level.
Next, we make an instance of our model object (with no quantiles computed yet) and place it in the DKV, locked (e.g., to prevent another user from overwriting our model-in-progress with an unrelated model).
Also, near the file bottom is a leftover class Max from the old Example code. Might as well nuke it now.
Let's also stop if somebody clicks the "cancel" button in the GUI:
I removed the "compute Max" code from the old Example code in the loop body. Next up, I see code to record any new
model (e.g. quantiles), and save the results back into the DKV, bump the progress bar, and log a little bit of progress:
And now we need to figure out what to do in our main loop. Somewhere between the loop-top isRunning check and the model.update() call, we need to compute something to update our model with! This is the meat of Quantile - for each point, bin it - build a histogram. With the histogram in hand, we see if we can compute the exact quantile; if we cannot, we build a new refined histogram from the bounds computed in the prior histogram.
Anything that starts out with the words "for each point" when you have a billion points needs to run in parallel and scale out to have a chance of completing fast - and this is exactly what H2O is built for! So let's write code that runs scale-out, for-each-point... and the easiest way to do that is with an H2O Map/Reduce job - an instance of MRTask. For Quantile, this is a histogram builder. We'll call it from the main loop like this, and define it below (extra lines included so you can see how it fits):
Basically, we just called some not-yet-defined Histo code on the entire Vec (column) of data, then looped over each probability. If we can compute the quantile exactly for the probability, great! If not, we build and run another histogram over a refined range, repeating until we can compute the exact quantile. I also printed out the quantiles per Vec , so we can watch the progress over time. Now class Histo can be coded as an inner class to the QuantileDriver class:
Histogram
Back to class Histo , we create arrays to hold our results and loop over the data:
Then we need to find the correct bin (simple linear interpolation). If the bin is in range, increment the bin count. Note that if a value is missing, it will be represented as a NaN; the computation of idx will then also be NaN, the range test will fail, and no bin will be incremented.
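The interpolation itself is a one-liner; a minimal sketch of the idea, assuming the histogram carries its current range in fields _lo and _hi (names assumed here), might look like this:

// Map value d to a fractional bin index over the histogram's current range [_lo, _hi)
double idx = (d - _lo) / (_hi - _lo) * bins.length;
// A missing value (NaN) makes idx NaN, so the in-range test fails and the value is skipped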
Also gather a unique element in the bin, if there is a unique element (otherwise, use NaN).
int i = (int)idx;
if( bins[i]==0 ) elems[i] = d;   // Capture unique value
else if( !Double.isNaN(elems[i]) && elems[i]!=d )
  elems[i] = Double.NaN;         // Not unique
bins[i]++;                       // Bump row counts
And that ends the map call and the Histo main work loop. To recap, here it is all at once:
The reduce needs to fold together the returned results: the _elems and the _bins . It's a bit tricky for the unique elements - either the left Histo or the right Histo might have zero, one, or many elements in a given bin, and if they both have a unique element, it might not be the same unique element.
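A minimal sketch of such a reduce, under the assumption that _bins holds per-bin counts and _elems the per-bin candidate unique value (the real merge may differ in detail):

@Override public void reduce( Histo h ) {
  for( int i = 0; i < _bins.length; i++ ) {
    if( _bins[i] == 0 ) _elems[i] = h._elems[i];            // Left bin empty: take the right side's candidate
    else if( h._bins[i] > 0 && _elems[i] != h._elems[i] )
      _elems[i] = Double.NaN;                               // Both sides saw values but disagree: not unique
    _bins[i] += h._bins[i];                                 // Fold the row counts
  }
}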
Let's make a function that returns the exact quantile if we can compute it, and NaN otherwise. We'll call it like this, capturing the quantile in the model output as we test for NaN:
static double computeQuantile( double lo, double hi, long row, long nrows, double prob ) {
  if( lo==hi ) return lo;                    // Equal; pick either
  // Unequal, linear interpolation
  double plo = (double)(row+0)/(nrows-1);    // Row numbers are inclusive on the end point, hence the -1
  double phi = (double)(row+1)/(nrows-1);    // We were passed the row of the low value; the high value is the next row, hence +1
  assert plo <= prob && prob < phi;
  return lo + (hi-lo)*(prob-plo)/(phi-plo);  // Classic linear interpolation
}
Refining a Histogram
Now we need to define how to refine a Histogram. Let's just admit it needs a Big Data pass up front and call the doAll() in
the driver loop directly:
Then the refinePass call needs to make a new Histogram from the old one, with refined endpoints. The original Histogram
had endpoints of vec.min() and vec.max() , but here we'll use endpoints from the same bins that findQuantiles uses.
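A rough sketch of what computing the refined endpoints involves (the field names _lo, _hi, _bins, and _nrows are assumptions here; the real code differs in bookkeeping details):

// Find the bin holding the target row for this probability; its edges become
// the next histogram's smaller [lo,hi) range.
double binsz = (_hi - _lo) / _bins.length;   // width of each current bin
long target  = (long)(prob * (_nrows - 1));  // row number we are looking for
long cnt = 0;                                // rows counted in bins below the current one
int  i   = 0;
while( i < _bins.length-1 && cnt + _bins[i] <= target )
  cnt += _bins[i++];
double newLo = _lo + i * binsz;              // refined lower endpoint
double newHi = newLo + binsz;                // refined upper endpoint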
Running Quantile
Running the QuantileTest returns:
11-13 12:19:04.179 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 11 Qs=[4.4, 4.6000000000000005, 4.800000000000001, 5.10
11-13 12:19:04.183 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 22 Qs=[2.2, 2.345, 2.5, 2.8000000000000003, 2.900000000
11-13 12:19:04.187 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 33 Qs=[1.1490000000000002, 1.3, 1.4000000000000001, 1.6
11-13 12:19:04.191 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 44 Qs=[0.1, 0.2, 0.2, 0.30000000000000004, 0.8468, 1.3,
11-13 12:19:04.194 172.16.2.47:54321 1940 FJ-0-7 INFO: Quantile: iter: 55 Qs=[0.0, 0.0, 0.0, 0.0, 0.6169999999999999, 1.0, 1.3
You can see the Quantiles-per-column (and there are 11 of them by default, so the iteration printout bumps by 11's). I
checked these numbers vs R's default Type 7 quantiles for the iris dataset and got the same numbers.
Iris is too small to trigger the refinement pass, so I also tested on covtype.data - an old forest covertype dataset with about
a half-million rows. Took about 0.7 seconds on my laptop to compute 11 quantiles exactly, on 55 columns and 581000 rows.
Good luck with your own H2O algorithm,
Cliff
Project Repository
The Sparkling Water project is hosted on GitHub. To clone and build it:
$ git clone https://github.com/0xdata/sparkling-water.git
$ cd ~/devel/sparkling-water
$ ./gradlew build
The build produces an assembly artifact which can be executed directly on the Spark platform.
Note: The provided sandbox image already contains the exported shell variable SPARK_HOME . To start Sparkling Water on the Spark platform, run:
bin/run-sparkling.sh
bin/sparkling-shell
Note: The command launches a regular Spark shell with H2O services. To access the Spark UI, go to http://localhost:4040; to access the H2O web UI, go to http://localhost:54321/steam/index.html.
Step-by-Step Example
1. Run Sparkling shell with an embedded cluster consisting of 3 Spark worker nodes:
export MASTER="local-cluster[3,2,1024]"
bin/sparkling-shell
2. You can go to http://localhost:4040/ to see the Sparkling shell (i.e., Spark driver) status.
3. Create an H2O cloud inside the Spark cluster:
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
val h2oContext = new H2OContext(sc).start()
import h2oContext._
4. Load weather data for Chicago international airport (ORD) with help from the regular Spark RDD API:
5. Load airlines data using the H2O parser:
import java.io.File
val dataFile = "/data/h2o-training/sparkling-water/allyears2k_headers.csv.gz"
// Load and parse using the H2O parser
val airlinesData = new DataFrame(new File(dataFile))
7. Select flights with a destination in Chicago (ORD) with help from the Spark API:
flightsToORD.count
9. Use Spark SQL to join the flight data with the weather data:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")
weatherTable.registerTempTable("WeatherORD")
val bigTable = sql("""SELECT
|f.Year,f.Month,f.DayofMonth,
|f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
|f.UniqueCarrier,f.FlightNum,f.TailNum,
|f.Origin,f.Distance,
|w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
|f.ArrDelay
|FROM FlightsToORD f
|JOIN WeatherORD w
|ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)
Next, train a Deep Learning model to predict flight arrival delay ( ArrDelay ) on the joined table:
import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
val dlParams = new DeepLearningParameters()
dlParams._train = bigTable
dlParams._response_column = 'ArrDelay
dlParams._epochs = 100
// Create a job
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
Finally, print an R script that plots the residuals of the predictions:
println(s"""
#
# R script for residual plot
#
# Import H2O library
library(h2o)
# Initialize H2O R-client
h = h2o.init()
# Fetch prediction and actual data, use remembered keys
pred = h2o.getFrame(h, "${predictionH2OFrame._key}")
act = h2o.getFrame (h, "${rame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57}")
# Select right columns
predDelay = pred$$predict
actDelay = act$$ArrDelay
# Make sure that number of rows is same
nrow(actDelay) == nrow(predDelay)
# Compute residuals
residuals = predDelay - actDelay
# Plot residuals
compare = cbind (as.data.frame(actDelay$$ArrDelay), as.data.frame(residuals$$predict))
nrow(compare)
plot( compare[,1:2] )
""")
More Information
Sparkling Water GitHub
H2O YouTube Channel
H2O SlideShare
H2O Blog
Troubleshooting
H2O
H2O: How Do I...?
H2O: Error Messages and Unexpected Behavior
H2O-Dev
H2O-Dev: How Do I...?
H2O-Dev: Error Messages and Unexpected Behavior
H2O Web UI
H2O Web UI: How Do I...?
H2O Web UI: Error Messages and Unexpected Behavior
Hadoop & H2O
Hadoop & H2O: How Do I...?
Hadoop & H2O: Error Messages and Unexpected Behavior
R & H2O
R & H2O: How Do I...?
R & H2O: Error Messages and Unexpected Behavior
Clusters and H2O
Clusters and H2O: How Do I...?
Clusters and H2O: Error Messages and Unexpected Behavior
Sparkling Water
Java
Java: How Do I...?
H2O
H2O: How Do I...?
Classify a single instance at a time
Export predicted values for all observations used in the model
Limit the CPU usage
Run multiple H2O projects by multiple team members on the same server
Run tests using Python
Specify multiple outputs
Tunnel between servers with H2O
Use H2O as a real-time, low-latency system to make predictions in Java
Run multiple H2O projects by multiple team members on the same server
We recommend keeping projects isolated, as opposed to running on the same instance. Launch each project in a new H2O
instance with a different port number and different cloud name to prevent the instances from combining into a cluster:
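For example (the heap sizes, port numbers, and cloud names below are only illustrative):

$ java -Xmx2g -jar h2o.jar -name projectA -port 54321
$ java -Xmx2g -jar h2o.jar -name projectB -port 54331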
You can also set the -ip flag to localhost to prevent cluster formation in H2O.
4. Log into the remote machine where the running instance of H2O will be forwarded using a command similar to the
following example. Your port numbers and IP address will be different.
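A typical command looks something like the following (the host name and port numbers are placeholders); it forwards the remote instance's web UI to port 55555 on your local machine, so you can browse to http://localhost:55555:

$ ssh -L 55555:localhost:54321 yourUserName@remote-host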
1. Build a model (for example, a GBM model with just a few trees).
2. Export the model as a Java POJO: Click the Java Model button at the top right of the web UI. This code is self-contained and can be embedded in any Java environment, such as a Storm bolt, without H2O dependencies.
3. Use the predict() function in the downloaded code.
When I try to initiate a standalone instance of H2O using HDFS data on CDH3, I get an exception error - what should I do?
The default HDFS version is CDH4. To use CDH3, use -hdfs_version cdh3 .
I received an Unmatched quote char error message and now the number of columns does not match my dataset.
H2O supports only ascii quotation marks ( Unicode 0022, " ) and apostrophes ( Unicode 0027, ' ) in datasets. All other
character types are treated as enum when they are parsed.
When launching using java -Xmx1g -jar h2o.jar , I received a Can't start up: not enough memory error message, but
according to free -m , I have more than enough memory. What's causing this error message?
Make sure you are using a supported version of the Java Development Kit (at least 1.6). For example, the GNU compiler for Java and OpenJDK are not supported.
The current recommended JDK is JDK 7 from Oracle: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
However, JDK 6 is also supported: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase6-419409.html
What does the following error message indicate? "Parser setup appears to be broken, got SVMLight data with
(estimated) 0 columns."
H2O does not currently support a leading label line. Convert a row from:
to
H2O-Dev
H2O-Dev: How Do I...?
Run Java APIs
Use an h2o-dev artifact as a dependency in a new project
REST:
- build/docs/REST/markdown
JAVA:
- h2o-core/build/docs/javadoc/index.html
- h2o-algos/build/docs/javadoc/index.html
When I try to start H2O, I get the following output: ice_root not a read/writable directory - what do I need to do?
If you are on OS X and using an admin account, the user.name property is set to root by default. To resolve this issue, disable the setuid bit on the Java executable.
H2O Web UI
H2O Web UI: How Do I...?
Load multiple files from multiple directories
Why did I receive an exception error after starting a job in the web UI when I tried to access the job in R?
To prevent errors, H2O locks the job after it starts. Wait until the job is complete before accessing it in R or invoking APIs.
I tried to load a file of less than 4 GB using the web interface, but it failed; what should I do?
To increase the H2O JVM heap memory size, use java -Xmx8g -jar h2o.jar .
Find the syntax for the file path of a data set sitting in hdfs
To locate an HDFS file, go to Data > Import and enter hdfs:// in the path field. H2O automatically detects any HDFS
paths. This is a good way to verify the path to your data set before importing through R or any other non-web API.
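For example, a full HDFS path typically looks like the following (the host, port, and file name here are placeholders):

hdfs://namenode-host:8020/user/yourname/yourdata.csv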
What should I do if H2O starts to launch but times out in 120 seconds?
1. YARN or MapReduce is not configured correctly. Enable launching of mapper tasks of the specified memory size. If YARN only allows mapper tasks with a maximum memory size of 1g and the request requires 2g, then the request will time out at the default of 120 seconds. Read Configuration Setup to make sure your setup will run.
2. The nodes are not communicating with each other. If you request a cluster of two nodes and the output shows a stall in
reporting the other nodes and forming a cluster (as shown in the following example), check that the security settings for
the network connection between the two nodes are not preventing the nodes from communicating with each other. You
should also check to make sure that the flatfile that is generated and being passed has the correct home address; if
there are multiple local IP addresses, this could be an issue.
What should I do if the H2O job launches but terminates after 600 seconds?
The likely cause is a driver mismatch - check to make sure the Hadoop distribution matches the driver jar file used to launch H2O. If your distribution is not currently available in the package, email us for a new driver file.
What should I do if I want to create a job with a bigger heap size but YARN doesn't launch and H2O times out?
First, try the job again but with a smaller heap size ( -mapperXmx ) and a smaller number of nodes ( -nodes ) to verify that a
small launch can proceed.
If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O will time out. If you have a default configuration, change the configuration settings in your cluster manager to enable launching of mapper tasks for specific memory sizes. Use the following formula to calculate the amount of memory required:
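As a rough guide (assuming the default -extramempercent overhead of 10%), the YARN container request is the mapper heap plus that overhead:

mapreduce.map.memory.mb ≈ mapperXmx * (1 + extramempercent/100)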
mapreduce.map.memory.mb must be less than the YARN memory configuration values for the launch to succeed. See the
examples below for how to change the memory configuration values for your version of Hadoop.
For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may
need to change the settings for more than one role group.
1. Click Configuration and enter the following search term in quotes: yarn.nodemanager.resource.memory-mb.
2. Enter the amount of memory (in GB) to allocate in the Value field. If more than one group is listed, change the values
for all listed groups.
For Hortonworks:
1. In the Scheduler section, enter the amount of memory (in MB) to allocate in the yarn.scheduler.maximum-allocation-mb
entry field.
2. Click the Save button at the bottom of the page and redeploy the cluster.
For MapR:
1. Edit the yarn-site.xml file for the node running the ResourceManager.
2. Change the values for the yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb properties.
3. Restart the ResourceManager and redeploy the cluster.
To verify the values were changed, check the values for the following properties:
<name>yarn.nodemanager.resource.memory-mb</name>
<name>yarn.scheduler.maximum-allocation-mb</name>
Note: Due to Hadoop's memory allocation, the memory allocated for H2O on Hadoop is retained, even when no jobs are running.
R and H2O
R and H2O: How Do I...?
Access the confusion matrix using R
Assign a value to a data key
Change an entire integer column to enum
Create test and train data split
Export data for post-analysis
Extract associated values (such as variable name and value) separately
Extract information and export the data locally
Extract the Neural Networks representation of the input
Install H2O in R without internet access
Join split datasets
Launch a multi-node H2O cluster in R
Load a .hex data set in R
Manage dependencies in R
Perform a mean imputation for NA values
Transfer data frames from R to H2O or vice versa
Specify column headings and data from separate files
Specify a threshold for validating performance
Sort and view variable importance for Random Forest
Sub-sample data
Update the libraries in R to the current version
Assign a value to a data key
h2o.assign(data, key)
where data is the frame containing the key and key is the new key.
Change an entire integer column to enum
data.hex[,2] = as.factor(data.hex[,2])
is.factor(data.hex[,2])
Create test and train data split
s = h2o.splitFrame(air)
air.train = air[s <= 0.8,]
air.valid = air[s > 0.8,]
Export scored or predicted data in a format that will allow me to do post-analysis work, such as building an ROC curve
Use pred = h2o.predict(...) and convert the data parsed by H2O to an R data frame using predInR = as.data.frame(pred) . For a two-class set, the two class probabilities should be available alongside the labels output from h2o.predict() .
head(pred)
predict X0 X1
1 0 0.9930366 0.006963459
2 0 0.9933736 0.006626437
3 0 0.9936944 0.006305622
4 0 0.9939933 0.006006664
5 0 0.9942777 0.005722271
6 0 0.9950644 0.004935578
predInR = as.data.frame(pred)
class1InR = as.data.frame(pred[,2])
class2InR = as.data.frame(pred[,3])
Extract associated values (such as variable name and value) separately
library(h2o)
h <- h2o.init()
hex <- as.h2o(h, iris)
m <- h2o.gbm(x=1:4, y=5, data=hex, importance=T)
m@model$varimp
Relative importance Scaled.Values Percent.Influence
Petal.Width 7.216290000 1.0000000000 51.22833426
Petal.Length 6.851120500 0.9493965043 48.63600147
Sepal.Length 0.013625654 0.0018881799 0.09672831
Sepal.Width 0.005484723 0.0007600474 0.03893596
The variable importances are returned as an R data frame and you can extract the names and values of the data frame as
follows:
is.data.frame(m@model$varimp)
# [1] TRUE
names(m@model$varimp)
# [1] "Relative importance" "Scaled.Values" "Percent.Influence"
rownames(m@model$varimp)
# [1] "Petal.Width" "Petal.Length" "Sepal.Length" "Sepal.Width"
m@model$varimp$"Relative importance"
# [1] 7.216290000 6.851120500 0.013625654 0.005484723
Extract information (such as coefficients or AIC) and export the data locally
When you run a model (for example, a GLM stored in an R object called data.glm ), you can extract its coefficients with:
data.glm@model$coefficients
For the other available parts of the data.glm@model object, press the Tab key in R to auto-complete data.glm@model$ or run str(data.glm@model) to see all the components in the model object.
Extract the Neural Networks representation of the input
library(h2o)
localH2O = h2o.init()
prosPath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(localH2O, path = prosPath)
prostate.dl = h2o.deeplearning(x = 3:9, y = 2, data = prostate.hex, hidden = c(100, 200),epochs = 5)
prostate.deepfeatures_layer1 = h2o.deepfeatures(prostate.hex, prostate.dl, layer = 1)
prostate.deepfeatures_layer2 = h2o.deepfeatures(prostate.hex, prostate.dl, layer = 2)
head(prostate.deepfeatures_layer1)
head(prostate.deepfeatures_layer2)
Install H2O in R without internet access
Caveat: You might need to install some dependencies, such as statmod, rjson, bitops, and RCurl, as well. If you don't already have them, download each package onto a drive and install it in R before installing H2O.
Join split datasets
library(h2o)
localH2O = h2o.init(nthreads = -1)
df1 = data.frame(matrix(1:30, nrow=30, ncol=10))
df2 = data.frame(matrix(31:50, nrow=30, ncol=1))
df1.h2o = as.h2o(localH2O, df1)
df2.h2o = as.h2o(localH2O, df2)
binded_df = cbind(df1.h2o, df2.h2o)
binded_df
localH2O = h2o.init(ip="<IPaddress>",port=54321)
airlines.hex = h2o.importFile(localH2O, "hdfs://sandbox.hortonworks.com:8020/user/root/airlines_allyears2k_headers.csv")
(Where "object" is the H2O object in R and "key" is the key displayed in the web UI.)
Manage dependencies in R
The H2O R package utilizes other R packages (like lattice and curl). Get the binary from CRAN directly and install the
package manually using:
Source: http://www.r-bloggers.com/example-2014-5-simple-mean-imputation/
predCol = as.data.frame(pred$predict)
predTesting = as.h2o(predCol)
> colnames(df)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> colnames(df) = c("A", "B", "C", "D", "E")
> colnames(df)
[1] "A" "B" "C" "D" E"
You can also specify headers from another file when importing( h2o.importFiles...header= )
Sort and view variable importance for Random Forest
library(h2o)
h <- h2o.init()
hex <- as.h2o(h, iris)
m <- h2o.randomForest(x = 1:4, y = 5, data = hex, importance = T)
m@model$varimp
as.data.frame(t(m@model$varimp)) # transpose and then cast to data.frame
Sub-sample data
To sub-sample data (select 10% of dataset randomly), use the following command:
splits=h2o.splitFrame(data.hex,ratios=c(.1),shuffle=T)
small = splits[[1]]
large = splits[[2]]
The following sample script takes data sitting in H2O, takes the column you want to subset by, and creates a new dataset with a certain number of rows:
This is just one example of subsampling using H2O in R; adjust the weights to suit the type of sampling you want to do. Depending on the way you subsample your data, it might not necessarily be balanced. Updating to the latest version of H2O will give you the option to rebalance your data in H2O.
Keep in mind that H2O data frames are handled differently than R data frames. To subsample a data frame on a subset
from a vector, add a new 0/1 column for each feature and then combine the relevant pieces using OR (in the following
example, the result is df$newFeature3 ):
df.r = iris
df = as.h2o(h, df.r)
|========================================================| 100%
head(df)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Update the libraries in R to the current version
After unzipping the .zip file of the latest version of H2O, go to the h2o-<VersionNumber>/R folder and issue R CMD INSTALL h2o_<VersionNumber>.tar.gz .
If you see a response message in R indicating that the instance of H2O running is a different version than that of the
corresponding H2O R package, download the correct version of H2O from http://h2o.ai/download/. Follow the installation
instructions on the download page.
I was importing a file when I received the following error message: H2O connection has been severed. What should I
do?
Confirm the R package is the current version. If not, upgrade H2O and the R client and retry the file import.
If the instance is unavailable, relaunch the cluster. If you still can't connect, email support.
When I tried to start an instance larger than 1 GB using h2o.init() , H2O stalled and failed to start.
Check the output in the .out and .err files listed in the output from h2o.init() and verify that you have the 64-bit version of
Java installed using java -version .
To check whether your Java installation supports 64-bit mode, run java -d64 -version .
I tried to load an R data frame into H2O using as.h2o and now the fields have shifted so that they are associated
with the incorrect header. What caused this error?
Check to make sure that none of the data fields contain a comma as a value.
When trying to run H2O in R, I received the following error message: Error in function (type, msg, asError = TRUE)
Failure when receiving data from the peer
library(h2o)
localh2o = h2o.init()
Why is only one CPU being used when I start H2O from R?
Depending on how you got your version of R, it may be configured by default to run with only one CPU. This is particularly
common for Linux installations. This can affect H2O when you use the h2o.init() function to start H2O from R.
To tell if this is happening, look in /proc/nnnnn/status at the Cpus_allowed bitmask (where nnnnn is the PID of R).
If you see a bitmask with only one CPU allowed, then any H2O process forked by R will inherit this limitation. To work
around this, set the following environment variable before starting R:
$ export OPENBLAS_MAIN_FREE=1
$ R
At this point, the h2o.init() function will start an H2O that can use more than one CPU.
The data imported correctly but names() did not return the column names.
The version of R you are using is outdated; update to at least R 3.0.
If the numeric values in the column were meant to be additional factor levels, you can concatenate them with a string so that the column parses as an enum column:
V1 V2 V3 V4
1 1 i6 i11 A
2 2 B A A
3 3 A i13 i18
4 4 C i14 i19
5 5 i10 i15 i20
Why did as.h2o(localH2O, data) generate the error: Column domain is too large to be represented as an enum :
10001>10000?
as.h2o , like h2o.uploadFile , uses a limited push method where the user initiates a request for information transfer, so it
should only be used for smaller file transfers. For bigger data files or files with more than 10000 enumerators in a column,
we recommend saving the file as a .csv and importing the data frame using h2o.importFile(localH2O, pathToData) .
When I tried to start H2O in R, I received the following error message: Picked up _JAVA_OPTIONS: -Xmx512M
There is a default setting in the environment for Java that conflicts with H2O's settings. To resolve this conflict, go to
System Properties. The easiest way is to search for system environment variables in the Start menu. Under the
Advanced tab, click Environment Variables. Under System Variables, locate and delete the variable _JAVA_OPTIONS,
whose value specifies -Xmx512mb. You should be able to start H2O after this conflict is removed.
In this case, the last member will join the existing cloud with the same name. The resulting cloud size will be the number of nodes (X) + 1.
If no -ip argument is given and multiple local IP addresses are detected, H2O will use the default localhost (127.0.0.1) as shown below:
To avoid reverting to 127.0.0.1 on servers with multiple local IP addresses, run the command with the -ip argument:
$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -ip 192.168.1.161 -port 54321
If this does not resolve the issue, try the following additional troubleshooting tips:
Test connectivity using curl: curl -v http://otherip:54321
Confirm ports 54321 and 54322 are available for both TCP and UDP
Confirm your firewall is not preventing the nodes from locating each other
Check if the Linux iptables are causing the issue
Check the configuration for the EC2 security group
Confirm that the username is the same on all nodes; if not, define the cloud name using -name
Check if the nodes are on different networks
Check if the nodes have different interfaces; if so, use the -network option to define the network (for example, -network 127.0.0.1 ).
What should I do if I tried to start a cluster but the nodes started independent clouds that were not connected?
Because the default cloud name is the user name of the node, if the nodes are on different operating systems (for example,
one node is on a Windows machine and the other is on a Mac), the different user names on each machine will prevent the
nodes from recognizing they are part of the same cloud. To resolve this issue, use -name to configure the same name for
all nodes.
Java
Java: How Do I...?
Remove POJO data
Switch between Java versions
DKV.remove : Remove only the POJO associated with the specified key; if the POJO uses other keys in the DKV, they will not be removed. We recommend using this option for Iced objects that don't use other objects in the DKV. If you use this option on a vector, it deletes only the header; the chunks that comprise the vector are not deleted.
Keyed.remove : Remove the POJO associated with the specified key and all its subparts. We recommend using this
option to avoid leaks, especially when deleting a vector. If you use this option on a vector, all of the chunks that
comprise the vector are deleted.
Lockable.delete : Remove all subparts of a Lockable by write-locking the key and then deleting it. You must unlock the
Sparkling Water
To build it yourself, do git clone https://github.com/0xdata/sparkling-water.git and follow the directions in the
README.md file.
import java.io.File
val dataFile = "examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
How can I use Tachyon to share data between H2O and Spark?
Tachyon is no longer needed! The Spark and H2O environments now share JVMs directly. Please update to the latest
version of Sparkling Water.
From the command line via Spark property spark.ext.h2o.cluster.size (as an example):
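For example (the cluster size value here is illustrative, and --conf is the standard Spark mechanism for setting a property on the command line):

$ bin/sparkling-shell --conf "spark.ext.h2o.cluster.size=3"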
To switch versions, enter java followed by the version number (for example, java8 ). To confirm the version, enter java -version .
More Information
H2O Documentation
H2O Booklets
H2O YouTube Channel
H2O SlideShare
H2O Blog
H2O GitHub