
Revolution Confidential

New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data
Presented by: Sue Ranney
VP Product Development

In today's webcast:

- High Performance Analytics (HPA) with Revolution R Enterprise
- Big Data Decision Trees
- Revolution's HPA with Hadoop Data
- Resources, Q&A

Revolution R Enterprise: What Gets Installed?

- Latest stable version of Open-Source R
- High performance math libraries
- RevoScaleR package that adds:
  - High performance big data capabilities to R
  - Access to a variety of data sources (e.g., SAS, SPSS, text files, ODBC)
  - Ability to compute in a variety of compute contexts (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
  - High performance computing capabilities
- Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)

High Performance Analytics (HPA) in RevoScaleR

High Performance Computing + Data

- Full-featured, fast, and scalable analysis functions
- Same code works on small and big data, and a variety of data sources
- Same code works on a variety of compute contexts: a laptop, server, cluster, or the cloud
- Scales approximately linearly with the number of observations without increasing memory requirements

RevoScaleR: HPA Algorithms

- Descriptive statistics (rxSummary)
- Tables and cubes (rxCube, rxCrossTabs)
- Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
- K-means clustering (rxKmeans)
- Linear regressions (rxLinMod)
- Logistic regressions (rxLogit)
- Generalized Linear Models (rxGlm)
- Predictions (scoring) (rxPredict)
- Decision Trees (rxDTree) NEW!

Decision Trees

- Relatively easy-to-interpret models
- Widely used in a variety of disciplines. For example:
  - Predicting which patient characteristics are associated with high risk of, for example, heart attack
  - Deciding whether or not to offer a loan to an individual based on individual characteristics
  - Predicting the rate of return of various investment strategies
  - Retail target marketing
- Can handle multi-factor response easily
- Useful in identifying important interactions

Decision Tree Types

- Classification tree: predict what class or group an observation belongs in (dependent variable is a factor) for each terminal node or leaf
- Regression tree: predict average value of dependent variable for each terminal node or leaf
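To make the distinction concrete, here is a minimal base R sketch of what a single leaf predicts in each case. The observations are hypothetical, invented only for illustration; nothing here calls RevoScaleR.

```r
# Hypothetical observations that ended up in one leaf (illustrative only).
leaf_class <- factor(c("Phone", "Email", "Email", "Mail", "Email"),
                     levels = c("Phone", "Email", "Mail"))
leaf_value <- c(2.0, 3.5, 1.5, 4.0, 3.0)

# Classification leaf: class proportions, and the majority class as prediction.
class_probs <- prop.table(table(leaf_class))
predicted_class <- names(which.max(class_probs))

# Regression leaf: the prediction is the mean of the dependent variable.
predicted_value <- mean(leaf_value)

print(class_probs)
print(predicted_class)
print(predicted_value)
```

The class proportions computed here play the same role as the "y probabilities" that a fitted classification tree reports for each node.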

Simple Example: Marketing Response

Data set containing the following information:
- Response: Was response to a phone call, email, or mailing?
- Age
- Income
- Marital status
- Attended college?

Simple Example: Specifying the model

    treeOut <- rxDTree(response ~ age + income + college + marital, data = rdata)

where rdata is the name of the data set
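The call above can be tried on a small stand-in data set. The sketch below simulates a data frame with the same column names; the values are random assumptions and are not the webinar's marketing data, so the fitted tree will not resemble the output shown in the slides. The rxDTree call is guarded so the script still runs where RevoScaleR is not installed.

```r
# Simulated stand-in for the marketing data (random values, assumption only).
set.seed(1)
n <- 1000
rdata <- data.frame(
  response = factor(sample(c("Phone", "Email", "Mail"), n, replace = TRUE)),
  age      = sample(18:80, n, replace = TRUE),
  income   = round(runif(n, 10, 100), 1),
  college  = factor(sample(c("No College", "College"), n, replace = TRUE)),
  marital  = factor(sample(c("Single", "Married"), n, replace = TRUE))
)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Same formula as on the slide; the response is a factor, so this
  # fits a classification tree.
  treeOut <- RevoScaleR::rxDTree(response ~ age + income + college + marital,
                                 data = rdata)
  print(treeOut)
} else {
  # Without RevoScaleR, just show the structure the model would be fit to.
  str(rdata)
}
```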

Simple Example: Basic Output

Each line gives the node number, the split, the number of observations in the node, the number that do not match the predicted y value (the loss), the predicted y value, and the y probabilities. An asterisk marks a terminal node.

    1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
      2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
        4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
          8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
          9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
        5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
          10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
            20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
            21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
          11) marital=Married 1721 87 Email (0.02556653 0.9494480 0.02498547) *
      3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
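The second count on each line can be checked against the probabilities: it equals the node size times one minus the probability of the predicted class, that is, the number of observations in the node that are not in the predicted class. A short base R check using five of the printed nodes (values copied from the output above; the column names are my own):

```r
# n, loss, and the probability of the predicted class, copied from the
# printed output for nodes 1, 2, 4, 8 and 9.
nodes <- data.frame(
  node   = c(1, 2, 4, 8, 9),
  n      = c(10000, 5074, 2518, 2256, 262),
  loss   = c(4069, 2378, 330, 77, 9),
  p_yval = c(0.59310000, 0.53133622, 0.86894361, 0.96586879, 0.96564885)
)

# Observations outside the predicted class: n * (1 - p_yval).
nodes$implied_loss <- round(nodes$n * (1 - nodes$p_yval))
print(nodes)
```

For the root, for example, 10000 * (1 - 0.5931) = 4069 responses were something other than Email.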
Simple Example: Visual Representation

[Figure: tree diagram of the fitted model. The root splits on College vs. No College; lower levels split on age (at 40 and 65), marital status (Single vs. Married), and income (at 30). Leaves are labeled Phone, Email, or Mail.]

Scaling HPA with RevoScaleR

- RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
- RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
  - Multiple cores on a desktop/server
  - Clusters/grids have the added advantage of more hard drives for storing and accessing data
    - Windows HPC Server Cluster
    - Burst computations to Azure in the cloud
    - IBM Platform LSF Grid
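The chunking idea can be sketched in base R: each chunk is reduced to a few summary numbers, and only those summaries are combined, so the full data never has to be in memory at once. This is an illustration of the general external-memory pattern under my own assumptions (simulated data, an in-memory vector standing in for chunks read from disk, and a simple sum-of-squares variance formula); it is not RevoScaleR's implementation.

```r
set.seed(42)
x <- rnorm(100000, mean = 5, sd = 2)

# Stand-in for reading the data from disk in 10 chunks of 10,000 rows.
chunks <- split(x, ceiling(seq_along(x) / 10000))

# Per-chunk step: reduce each chunk to count, sum, and sum of squares.
# Each call is independent, so the chunks could be processed in parallel.
chunk_stats <- lapply(chunks, function(ch) {
  c(n = length(ch), sum = sum(ch), sumsq = sum(ch^2))
})

# Combine step: the summaries simply add up.
total <- Reduce(`+`, chunk_stats)
chunk_mean <- total[["sum"]] / total[["n"]]
chunk_var  <- (total[["sumsq"]] - total[["n"]] * chunk_mean^2) / (total[["n"]] - 1)

print(c(mean = chunk_mean, var = chunk_var))
```

The memory needed is set by the chunk size rather than the number of observations, which is why adding rows lengthens the run without increasing memory requirements.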
The Big Data Decision Tree Algorithm

- Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
- This sorting step becomes time and memory prohibitive when dealing with large data.
- rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data
- rxDTree partitions the data horizontally, processing in parallel different sets of observations
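A rough base R sketch of the binning idea for one numeric predictor and a numeric response follows. It is my own illustration, not rxDTree's code: the data are simulated with a known step at x = 40, the bin edges are fixed in advance, each block of rows is reduced to per-bin counts and sums, the per-block histograms are added together, and candidate splits are evaluated only at bin edges.

```r
set.seed(7)
n <- 20000
x <- runif(n, 0, 100)
y <- ifelse(x < 40, 1, 3) + rnorm(n, sd = 0.5)   # true split at x = 40

n_bins <- 100
breaks <- seq(0, 100, length.out = n_bins + 1)   # bin edges, fixed up front

# Reduce one block of rows to a histogram: count and sum of y per bin.
bin_summary <- function(xc, yc) {
  b <- cut(xc, breaks, include.lowest = TRUE, labels = FALSE)
  cbind(n   = tabulate(b, nbins = n_bins),
        sum = vapply(seq_len(n_bins), function(k) sum(yc[b == k]), numeric(1)))
}

# Horizontal partitioning: four blocks of rows, summarized independently,
# then merged by adding the histograms.
chunk_rows <- split(seq_len(n), ceiling(seq_len(n) / 5000))
hist_all <- Reduce(`+`, lapply(chunk_rows, function(i) bin_summary(x[i], y[i])))

# Evaluate a split at every interior bin edge from cumulative totals.
# Maximizing s_left^2/n_left + s_right^2/n_right minimizes the
# within-node sum of squares.
cum_n   <- cumsum(hist_all[, "n"])
cum_sum <- cumsum(hist_all[, "sum"])
k       <- seq_len(n_bins - 1)
n_left  <- cum_n[k];   n_right <- cum_n[n_bins] - n_left
s_left  <- cum_sum[k]; s_right <- cum_sum[n_bins] - s_left
score   <- s_left^2 / n_left + s_right^2 / n_right

split_at <- breaks[which.max(score) + 1]
print(split_at)
```

No sort of x is needed, and the work per split is proportional to the number of bins rather than the number of observations once the histograms are built.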
Useful rxDTree Arguments for Big Data

- cp: complexity parameter. Increasing cp will decrease the number of splits attempted.
- maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
- maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.
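One way to see why depth matters so much: in a binary tree the number of possible nodes doubles at every level. The arithmetic below assumes binary splits and that the root counts as depth 0 (the rpart convention; I am assuming rxDTree follows it).

```r
depth <- 0:10

# Upper bounds for a binary tree grown to a given depth.
max_nodes_at_depth <- 2^depth            # nodes on the deepest level
max_total_nodes    <- 2^(depth + 1) - 1  # all nodes down to that level

print(data.frame(depth, max_nodes_at_depth, max_total_nodes))
```

Under these assumptions maxDepth = 6 allows at most 64 leaves, while maxDepth = 10 allows up to 1024, each of which needs split statistics computed from the data.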
Big Data Example
CDC Report in Jan. 2012

The U.S. Birth Data: 1985 - 2009

Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download:
http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm

"These natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process" (Joseph Adler, R in a Nutshell)

I've imported key variables from each year into a single .xdf file with over 100 million observations.

Regression Tree: Multiple Births

    Call:
    rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC +
        MRACEREC + MHISP_REC + DOB_YY, data = birthAllC, maxDepth = 6,
        cp = 1e-05, blocksPerRead = 10, verbose = 1)
    File: C:\Revolution\Data\CDC\BirthUS.xdf
    Number of valid observations: 100672041
    Number of missing observations: 0
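A regression tree predicts the mean of the response in each leaf. If IsMultiple is coded as a 0/1 indicator (an assumption on my part; the slides do not show its coding), that mean is the proportion of multiple births in the leaf, which is why the leaves can be read as percentages. A tiny base R illustration with made-up values:

```r
# Hypothetical 0/1 indicator for ten births that fall in one leaf.
is_multiple <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)

leaf_mean    <- mean(is_multiple)   # what a regression tree reports for the leaf
leaf_percent <- 100 * leaf_mean     # the same number read as a percentage

print(leaf_percent)
```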
Leaves with Lowest Percent of Multiple Births

- 1.3%: Mom is not black and under the age of 20
- 1.6%: Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997
- 1.7%: Mom is black and under the age of 18

Leaves with Highest Percent of Multiple Births

- 38.6%: Mom is over 47 years old and the birth is after 1996
- 28.1%: Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996
- 15.5%: Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996

Poll Question

Are you using Hadoop?

RevoScaleR with Hadoop Data Files (NEW)

- The Hadoop Distributed File System (HDFS) is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- RevoScaleR supports accessing data in HDFS for import or for direct analysis

RevoScaleR Data Sources

- Data Sources can be used for import or directly for analysis
  - External: delimited text, fixed format text, SAS, SPSS, ODBC connections
  - Provided with RevoScaleR: efficient .xdf file format
- Data Sources contain information about their file system
  - Delimited text and .xdf data sources can both be used with HDFS
- Data sources are used as input to HPA functions

An Example Using Hadoop Data

- Hadoop cluster in our office
  - Five nodes of commodity hardware
  - Red Hat Enterprise Linux (RHEL) operating system
  - Cloudera's Hadoop (CDH3)
  - Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
- My colleague, Dawn Kinsey, recorded a data analysis session
  - 22 comma-delimited files stored in HDFS
  - Contain information on U.S. flight arrivals, 1997-2008

Steps in Analysis

- Set up a file system object and a data source object
- Explore the HDFS airline data for the year 2000 directly
- Extract variables of interest from all the files into an .xdf file in the native file system
- Use R's great plotting capabilities on summary information
- Perform a big logistic regression on an .xdf file stored in HDFS
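The steps above might look roughly like the following in RevoScaleR. This is a hedged sketch, not the recorded session: the function name, all arguments (host, port, file paths, formulas, variables to keep) are hypothetical placeholders to be supplied by the caller, and the plotting step is omitted. It uses RxHdfsFileSystem, RxTextData, RxXdfData, rxGetInfo, rxSummary, rxImport and rxLogit as I understand their interfaces; check the RevoScaleR documentation for your version before relying on it.

```r
# Sketch only: nothing runs until the function is called with real values.
analyze_hdfs_sketch <- function(hostName, port, csvFiles, csv2000,
                                localXdf, hdfsXdf, keepVars,
                                summaryFormula, logitFormula) {
  if (!requireNamespace("RevoScaleR", quietly = TRUE)) {
    message("RevoScaleR is not installed; nothing to run.")
    return(invisible(NULL))
  }

  # 1. File system object and a data source object that points into HDFS.
  hdfsFS  <- RevoScaleR::RxHdfsFileSystem(hostName = hostName, port = port)
  air2000 <- RevoScaleR::RxTextData(csv2000, fileSystem = hdfsFS)

  # 2. Explore one year's file directly, without importing it first.
  print(RevoScaleR::rxGetInfo(air2000, getVarInfo = TRUE))
  print(RevoScaleR::rxSummary(summaryFormula, data = air2000))

  # 3. Extract the variables of interest from every file into one local .xdf,
  #    creating the file on the first pass and appending rows afterwards.
  for (i in seq_along(csvFiles)) {
    src <- RevoScaleR::RxTextData(csvFiles[i], fileSystem = hdfsFS)
    RevoScaleR::rxImport(inData = src, outFile = localXdf,
                         varsToKeep = keepVars,
                         append = if (i == 1) "none" else "rows",
                         overwrite = (i == 1))
  }

  # 5. Logistic regression on an .xdf file that itself lives in HDFS.
  bigXdf <- RevoScaleR::RxXdfData(hdfsXdf, fileSystem = hdfsFS)
  RevoScaleR::rxLogit(logitFormula, data = bigXdf)
}
```

Passing the paths and formulas as arguments keeps the sketch free of guesses about the cluster layout or the airline data's variable names.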
Poll Question

What features of Revolution R Enterprise 6.1 are most interesting to you?

Thank You!

- Download slides, replay from today's webinar: http://bit.ly/QJfR4A
- Learn more about Revolution R Enterprise
  - Overview: revolutionanalytics.com/products
  - New feature videos: http://www.revolutionanalytics.com/products/new-features.php
- Contact Revolution Analytics: http://bit.ly/hey-revo
- November 29: Real-Time Big Data Analytics: from Deployment to Production
  - David Smith, VP Marketing and Community, Revolution Analytics
  - www.revolutionanalytics.com/news-events/free-webinars

The leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com +1 (650) 646 9545 Twitter: @RevolutionR
