Hortonworks Inc. 2012
Enabling R on Hadoop

July 11, 2013
Your Presenters
Ravi Mutyala, Systems Architect
Paul Codding, Solutions Engineer
Agenda
A Brief History of R
How R is Typically Used
How R is Used with Hadoop
Getting Started
A Brief History of R

History of R
S
1976: S created by John Chambers (Fortran)
1988: S V3, written in C, statistical models included
1998: S V4
R
1991: R created by Ross Ihaka & Robert Gentleman
1997: R Core Group formed
2000: R Version 1.0 released
How R is Typically Used
Main Uses of R
Statistical Analysis & Modeling
Classification
Scoring
Ranking
Clustering
Finding relationships
Characterization
Common Uses
Interactive Data Analysis
General Purpose Statistics
Predictive Modeling
How R is Used with Hadoop
Hadoop Components
[Diagram: Hortonworks Data Platform (HDP)]
Operational Services (manage & operate at scale): Oozie, Ambari
Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
Hadoop Core (distributed storage & processing): HDFS, YARN, WebHDFS, Map Reduce
Platform Services (enterprise readiness): HA, DR, Snapshots, Security, ...
Runs on: OS, Cloud, VM, Appliance
Hadoop Components & R
Data Service Components
Hive
HBase

Hadoop Core
Map Reduce
HDFS
Options for R on Hadoop
Options
RODBC/RJDBC
RHive
RHadoop
Analysis
Focus
Integration Ease
Benefits
Limitations
RODBC/RJDBC
Focus
SQL Access from R
Integration Ease
Install Hortonworks Hive ODBC Driver
Install Hive libraries
Benefits
Low impact on existing R scripts leveraging other DB packages
Not required to install Hadoop configuration/binaries on client machines
Limitations
Parallelism limited to Hive
Result set size
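The SQL access path above can be sketched with the RODBC package. This is only an illustration, assuming RODBC is installed and an ODBC data source for Hive has been configured with the Hive ODBC driver. The DSN name "HiveDSN", the table "web_logs" and the helper query_hive are hypothetical names, not taken from the slides.

```r
# Sketch only: requires the RODBC package and an ODBC DSN configured for Hive.
# "HiveDSN", "web_logs" and query_hive() are hypothetical placeholders.
query_hive <- function(sql, dsn = "HiveDSN") {
  channel <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 instead of a connection object on failure
  if (!inherits(channel, "RODBC")) stop("could not connect to DSN: ", dsn)
  # Close the connection even if the query fails
  on.exit(RODBC::odbcClose(channel))
  # Hive executes the query; only the result set comes back as a data frame
  RODBC::sqlQuery(channel, sql)
}

# Aggregate inside Hive so the result set returned to the client stays small
sql <- "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status"
# result <- query_hive(sql)  # uncomment on a machine where the DSN exists
```

Pushing the aggregation into the query is one way to work within the two limitations listed: the parallel work happens in Hive, and the result set returned to R stays small.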
Deployment Considerations
[Diagram: cluster layout with JT (JobTracker), NN (NameNode) and HS (HiveServer), plus worker nodes each running TT (TaskTracker) and DN (DataNode)]
RHive
Focus
Broad access to Hive and HDFS
Integration Ease
Requires Hadoop binaries, libraries, and configuration files on client machines
Uses Java DFS Client and HiveServer
Benefits
Wide range of features expressed through HQL
rhive-apply: R distributed apply function using HQL
Limitations
Requires heavy client deployment
Dependent on HiveServer, and can't be used with HiveServer2
Deployment Considerations
[Diagram: cluster layout with JT (JobTracker), NN (NameNode), HS (HiveServer) and an R Edge Node, plus worker nodes each running TT (TaskTracker) and DN (DataNode)]
RHadoop
Focus
Tight integration with core Hadoop components
Benefit
Ability to run R on a massively distributed system
Ability to work with full data sets instead of sample sets
Additional Information
https://github.com/RevolutionAnalytics/RHadoop/wiki
RHadoop Architecture
[Diagram: R client packages and the Hadoop components they connect to]
rhdfs -> HDFS
rhbase -> HBase Thrift Gateway -> HBase
rmr2 -> Streaming -> Map Reduce, with R running inside the tasks
rhdfs
Access HDFS from R
Read from HDFS to R data frame
Write from R data frame to HDFS
1.0.6 adds support for Windows (using HDP)
rhdfs
Hadoop CLI Commands & rhdfs equivalent
library(rhdfs); hdfs.init()   # load rhdfs and initialize it before calling hdfs.* functions
hadoop fs -ls /
hdfs.ls("/")
hadoop fs -mkdir /user/rhdfs/ppt
hdfs.mkdir("/user/rhdfs/ppt")
hadoop fs -put 1.txt /user/rhdfs/ppt/
localData <- system.file(file.path("unitTestData", "1.txt"), package="rhdfs")
hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
hdfs.get("/user/rhdfs/ppt/1.txt", "test")
hadoop fs -rm /user/rhdfs/ppt/1.txt
hdfs.delete("/user/rhdfs/ppt/1.txt")
rhbase
Access and change data within HBase
Uses Thrift API
Command Examples
hb.new.table
hb.insert
hb.scan.ex
hb.scan
rmr2
Enables writing MapReduce jobs using R
Ability to parallelize algorithms
Ability to use big data sets without needing to sample data
mapreduce(input, output, map, reduce, ...)
Reduce takes a key and a collection of values, which can be a vector, list, data frame or matrix
2.2.1 adds support for Windows (using HDP)
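To make the map and reduce contract described above concrete, here is a minimal single-process sketch in base R. It does not use rmr2 or Hadoop. The helper local_mapreduce is hypothetical and only imitates the flow of map, group by key, then reduce with a key and the collection of its values.

```r
# Hypothetical helper (not part of rmr2): a single-process imitation of the
# map -> group-by-key -> reduce flow, using only base R.
local_mapreduce <- function(input, map, reduce) {
  # Map phase: each call returns list(key = ..., val = ...) as parallel vectors
  mapped <- lapply(input, map)
  keys <- unlist(lapply(mapped, function(kv) kv$key))
  vals <- unlist(lapply(mapped, function(kv) kv$val))
  # Shuffle phase: collect all values that share a key
  grouped <- split(vals, keys)
  # Reduce phase: one call per key, with the full collection of its values
  mapply(reduce, names(grouped), grouped, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

lines <- c("a b a", "b c")
result <- local_mapreduce(
  input = lines,
  map = function(line) {
    words <- strsplit(line, " ")[[1]]
    list(key = words, val = rep(1, length(words)))
  },
  reduce = function(word, counts) list(key = word, val = sum(counts))
)

# Flatten the reduce outputs into a named vector of counts per word
counts <- setNames(sapply(result, function(kv) kv$val),
                   sapply(result, function(kv) kv$key))
print(counts)
```

In rmr2 the grouping step is carried out by Hadoop across the cluster, and key-value pairs are built with keyval() rather than plain lists.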
Sample code - wordcount
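library(rmr2)
# input, output and pattern are arguments of the enclosing wordcount function
wordcount =
  function(input, output = NULL, pattern = " ") {
    # Map: split each line into words and emit (word, 1)
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}
    # Reduce: sum the counts collected for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = TRUE)}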
More Sample Code
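library(rmr2)
# In plain R: 50 binomial draws, then count how often each value occurs
groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)
# The same count as a MapReduce job: write the data to HDFS, emit (value, 1), count per key
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))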
Deployment Considerations
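[Diagram: cluster layout with JT (JobTracker), NN (NameNode), HTG (HBase Thrift Gateway) and an R Edge Node, plus worker nodes each running TT (TaskTracker), DN (DataNode), RS (RegionServer) and R]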
RHadoop
Limitations
Requires installation of R on all TaskTracker nodes
Does not automatically parallelize algorithms
A different slot/memory configuration is recommended, to leave memory and CPU resources for R
[Diagram: node resources split between OS and Map Reduce, compared with OS, Map Reduce and R]
Getting Started
Your Fastest On-ramp to Enterprise Hadoop!
http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud and no internet connection needed!

The Hortonworks Sandbox is:
A free download: http://hortonworks.com/products/hortonworks-sandbox/
A complete, self-contained virtual machine with Apache Hadoop pre-configured
A personal, portable and standalone Hadoop environment
A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
Installation
Install R on all nodes
Install dependent packages
RJSONIO
itertools
digest
Rcpp
rJava
functional
RCurl
httr
plyr
Download & Install RHadoop Packages
rmr2
rhdfs
rhbase (requires Thrift)
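The dependency step above could be scripted roughly as follows. This is a sketch, assuming the listed packages are available from the configured CRAN mirror. The helper missing_deps is a hypothetical name, and the install call is left commented out because it needs network access.

```r
# Dependencies listed on the slide
deps <- c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
          "functional", "RCurl", "httr", "plyr")

# Hypothetical helper: report which packages are not yet installed on this node
missing_deps <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_deps(deps)
print(to_install)

# Uncomment to install the missing packages (needs network access):
# if (length(to_install) > 0) install.packages(to_install)

# The RHadoop packages themselves are installed from their downloaded
# archives, for example from a shell with R CMD INSTALL <downloaded archive>.
```

Running the check on every node before installing rmr2 is one way to confirm that the nodes are consistent.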

Questions & Answers
TRY
Download HDP at hortonworks.com

LEARN
Applying Data Science using Apache Hadoop Training

FOLLOW
twitter: @hortonworks
Facebook: facebook.com/hortonworks



Further questions & comments:
paul@hortonworks.com
ravi@hortonworks.com
