RHadoop and MapR
Accessing Enterprise-Grade Hadoop from R
Version 2.0 (14 March 2014)
Table of Contents

Introduction
Environment
    R
Special Installation Notes
Install R
Install RHadoop
    Install rhdfs
    Install rmr2
    Install rhbase
Introduction

RHadoop is an open source collection of three R packages created by Revolution Analytics that allow users to manage and analyze data with Hadoop from an R environment. It lets data scientists familiar with R quickly combine the enterprise-grade capabilities of the MapR Hadoop distribution with the analytic capabilities of R. This paper provides step-by-step instructions to install and use RHadoop with MapR and R on RedHat Enterprise Linux.
RHadoop consists of the following packages:

• rhdfs - functions providing file management of HDFS from within R
• rmr2 - functions providing Hadoop MapReduce functionality in R
• rhbase - functions providing database management for the HBase distributed database from within R

Each of the RHadoop packages can be installed and used independently or in conjunction with the others.
Environment

The integration testing described in this paper was performed in March 2014 on a 3-node Amazon EC2 cluster. The product versions used in the test are listed in the table below. Note that Revolution Analytics currently provides Linux support only on RedHat.
Product                    Version
EC2                        AMI: RHEL-6.4 x86_64 (ami-74557e31); Type: m1.large;
                           Root/Boot: 8GB EBS standard; MapR storage: (3) 450GB EBS standard
RedHat Enterprise Linux    6.4 64-bit (2.6.32-358.el6.x86_64)
Java                       java-1.7.0-openjdk.x86_64, java-1.7.0-openjdk-devel.x86_64
MapR                       M7 3.0.2 (3.0.2.22510.GA-1)
HBase                      0.94.13
GNU R                      3.0.2
RHadoop                    rhdfs 1.0.8, rmr2 3.0.1, rhbase 1.2.0
Apache Thrift              0.9.1
R

R is both a language and an environment for statistical computation and is freely available from GNU. Revolution Analytics provides two versions of R: the free Revolution R Community and the premium Revolution R Enterprise for workstations and servers. Revolution R Community is an enhanced distribution of open source R for users looking for faster performance and greater stability. Revolution R Enterprise adds commercial enhancements and professional support for real-world use, bringing higher performance, greater scalability, and stronger reliability to R at a fraction of the cost of legacy products.
R is easily extended with libraries that are distributed in packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, packages must be loaded into the session to be used.
Unless otherwise indicated, all commands should be run as the root user on the client and task tracker systems. This document assumes there is a non-root user called user01 on your client system for purposes of validating the installation. You can use any non-root user that can access the MapR cluster in those commands; just remember to replace all occurrences of user01 with your non-root user.

Finally, note that these installation instructions assume your client systems are running RedHat Linux, as that is the only operating system supported by Revolution Analytics.
Install R

Testing for this paper was done with GNU R. If you have Revolution R Community or Enterprise, you can use that version of R instead; follow Revolution Analytics' installation instructions for the appropriate edition.

A version of R must always be installed on the client system accessing the cluster with the RHadoop libraries. Additionally, to execute MapReduce jobs with the rmr2 library, R must be installed on all task tracker nodes in the cluster. As the root user, follow the installation steps below to install GNU R on all client systems and task trackers.
1) Install GNU R.

   # yum -y --enablerepo=epel install R

2) Install the GNU R developer package.

   # yum -y --enablerepo=epel install R-devel

   Note that the R-devel package may already be up to date when installing the R package.

3) Confirm the installation was successful by running R as a non-root user on your client system and all your task trackers. At the command line, type the following command to determine the version of R that is installed.

   # su - user01 -c "R --version"
Install RHadoop

The installation instructions that follow are complete for each RHadoop package (rhdfs, rmr2, rhbase). System administrators can skip to the installation instructions for just the package(s) they want to install. Recall that R must be installed before installing any of the RHadoop packages.
Install rhdfs

The rhdfs package uses the hadoop command to access MapR file services. To use rhdfs, R and the rhdfs package only need to be installed on the client system that is accessing the cluster. This can be a node in the cluster or any client system that can access the cluster with the hadoop command. As the root user, perform the following steps on every client node.
1) Confirm that you can access the MapR file services by listing the contents of the top-level directory.

   # su - user01 -c "hadoop fs -ls /"

2) From R, install the rJava package (a dependency of rhdfs) from the CRAN repository.

   # R --no-save
   > install.packages(c('rJava'), repos="http://cran.revolutionanalytics.com")
   > quit()
4) Set the HADOOP_CMD environment variable to the hadoop command script and install the rhdfs package. Whereas the rJava package in the previous step was downloaded and installed from a CRAN repository, rhdfs is installed from within the R client.

   # export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
   # R CMD INSTALL ~/rhdfs/pkg
5) Set the required environment variables. In addition to the HADOOP_CMD environment variable set in the previous step, LD_LIBRARY_PATH must include the location of the MapR client library (libMapRClient.so), and HADOOP_CONF must specify the Hadoop configuration directory. Any user wanting to use the rhdfs library must set these environment variables.

   # export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
   # export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf
6) Switch user to your non-root user to validate that the installation was successful.

   # su user01
   $
7) From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.

   $ R --no-save
   > library(rhdfs)
   > hdfs.init()
   > hdfs.ls('/')
   > quit()

   Note: When loading an R library with the library() command, dependent libraries are also loaded. For rhdfs, the rJava library will be loaded if it has not already been loaded.
8) Exit from your su command.

   $ exit
   #
9) Check the installation of the rhdfs package.

   # R CMD check ~/rhdfs/pkg; echo $?

   An exit code of 0 means the installation was successful. If no errors are reported, you have successfully installed rhdfs and can use it to access the MapR file services from R (you can safely ignore any "notes").
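The `; echo $?` idiom prints the exit status of the check so the validation can be scripted. A minimal sketch of how shell exit codes behave (the commands here are generic illustrations, not part of the rhdfs check):

```shell
# $? holds the exit status of the most recent command:
# 0 means success, non-zero means failure.
true
echo $?     # prints 0
false
echo $?     # prints 1
# The same idiom can gate follow-up actions in an install script:
true && echo "check passed"
```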
10) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

   # echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" >> /etc/profile
   # echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> /etc/profile
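The profile line prepends the MapR library directory while preserving whatever LD_LIBRARY_PATH already held. A small sketch of that composition (the starting value is hypothetical, chosen only for illustration):

```shell
# Start from a pre-existing library path (hypothetical value for illustration).
LD_LIBRARY_PATH=/usr/local/lib
# Prepend the MapR client library directory, as the /etc/profile line does.
export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"   # prints /opt/mapr/lib:/usr/local/lib
```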
Install rmr2

The rmr2 package uses Hadoop Streaming to invoke R map and reduce functions. To use rmr2, the R and rmr2 packages must be installed on the clients as well as on every task tracker node in the MapR cluster. Install rmr2 on every task tracker node AND on every client system as the root user.
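Hadoop Streaming runs a job's map and reduce phases as external processes connected by sorted text streams, which is how rmr2 drives R from Hadoop. The classic word-count flow can be mimicked locally with ordinary Unix pipes; this is a sketch of the streaming model only, not an rmr2 command:

```shell
# map: emit one token per line; shuffle: sort brings equal keys together;
# reduce: count each run of equal keys and emit "key count".
printf 'a b a\nb c\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2, $1}'
# prints:
# a 2
# b 2
# c 1
```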
1) Install the required R packages.

   # R --no-save
   > install.packages(c('Rcpp'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('RJSONIO'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('itertools'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('digest'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('functional'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('stringr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('plyr'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('bitops'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('reshape2'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > install.packages(c('caTools'), repos="http://cran.revolutionanalytics.com", INSTALL_opts=c('--byte-compile'))
   > quit()
   Note: Warnings may be safely ignored while the packages are being built.
2) Download the quickcheck and rmr2 packages from github. RHadoop includes the quickcheck package to support the randomized unit tests performed by the rmr2 package check.

   # cd ~
   # git clone git://github.com/RevolutionAnalytics/quickcheck.git
   # git clone git://github.com/RevolutionAnalytics/rmr2.git

3) Install the quickcheck and rmr2 packages.

   # R CMD INSTALL ~/quickcheck/pkg
   # R CMD INSTALL ~/rmr2/pkg

   Note: Warnings may be safely ignored while the packages are being built.
4) From any task tracker node, create a directory called /tmp (if it doesn't already exist) in the root of your MapR-FS, owned by the mapr user, and give it global read-write permissions. This directory is required for running MapReduce applications in R with the rmr2 package.

   # su - mapr -c "hadoop fs -mkdir /tmp"

   Note: the command above fails gracefully if the /tmp directory already exists.

   # su - mapr -c "hadoop fs -chmod 777 /tmp"
Validate rmr2 on any single client system as a non-root user with the following steps. Note that the rmr2 package must be installed on ALL the task trackers before proceeding.
1) Confirm that your MapR cluster is configured to run a simple MapReduce job outside of the RHadoop environment. You only need to perform this step on one client from which you intend to run your R MapReduce programs.

   # su - user01 -c "hadoop fs -mkdir /tmp/mrtest/wc-in"
   # su - user01 -c "hadoop fs -put /opt/mapr/NOTICE.txt /tmp/mrtest/wc-in"
   # su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"
   # su - user01 -c "hadoop fs -cat /tmp/mrtest/wc-out/part-r-00000"
   # su - user01 -c "hadoop fs -rmr /tmp/mrtest"
2) Copy the wordcount.R script to a directory that is accessible by your non-root user.

   # cp ~/rmr2/pkg/tests/wordcount.R /tmp
3) Set the required environment variables HADOOP_CMD and HADOOP_STREAMING. Since rmr2 uses Hadoop Streaming, it needs access to both the hadoop command and the streaming jar.

   # export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
   # export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar
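A missing variable is a common cause of rmr2 job failures, so it can help to verify both before launching a job. A small pre-flight sketch; the values are the MapR 3.x example paths from this paper, so adjust them for your installation:

```shell
# Hypothetical pre-flight check: confirm the rmr2-required variables are set.
export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar
for v in HADOOP_CMD HADOOP_STREAMING; do
  # ${!v} would be bash-only; eval keeps this POSIX-sh friendly.
  val=$(eval echo "\$$v")
  if [ -n "$val" ]; then
    echo "$v is set"
  else
    echo "$v is NOT set" >&2
  fi
done
```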
4) Switch user to your non-root user to validate that the installation was successful.

   # su user01
5) Run the wordcount.R program from the R environment as your non-root user.

   $ R --no-save < /tmp/wordcount.R; echo $?

   Note that an exit code of 0 means the command was successful.

6) Exit from your su command.

   $ exit
   #
7) Run the full rmr2 check. The examples run by the rmr2 check below will sequentially generate 81 streaming MapReduce jobs on the cluster. 80 of the jobs have just 2 mappers, so a large cluster will not speed this up. On a 3-node medium EC2 cluster with two task trackers, the examples take just over 1 hour. You may wish to launch this under nohup as shown below and wait for it to complete before you proceed in this document.

   # nohup R CMD check ~/rmr2/pkg > ~/rmr2-check.out &

   Check the output in ~/rmr2-check.out for any errors.
8) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

   # echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" >> /etc/profile
   # echo "export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf" >> /etc/profile
   # echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> /etc/profile
Install rhbase

The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to and receives responses from the Thrift server. The Thrift server listens for Thrift requests and in turn uses the HBase HTable Java class to access HBase. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Any MapR HBase cluster node can also be a client.
For the client system to access a local Thrift server, the client system must have the mapr-hbase-internal package installed, which includes the MapR HBase Thrift server. If your client system is one of the MapR HBase Masters or Region Servers, it will already have this package installed. These rhbase installation instructions assume that mapr-hbase-internal is already installed on the client system.
In addition to the HBase Thrift server, the rhbase package requires the Thrift include files to compile and the C++ Thrift library at runtime in order to act as a Thrift client. Since these Thrift components are not included in the MapR distribution, Thrift must be installed before rhbase.
By default, rhbase connects to a Thrift server on the local host. A remote server can be specified in the rhbase hb.init() call, but the rhbase package check expects the Thrift server to be local. These installation instructions assume the Thrift server is running locally and that HBase is installed and running in your cluster.

As the root user, perform the following installation steps on ALL task tracker nodes and client systems.
1) Install (or update) prerequisite packages.

   # yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel openssl-devel ruby-devel qt qt-devel php-devel
4) Modify the thrift.pc file. We need to add the thrift directory to the end of the includedir configuration.

   # sed -i 's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' ~/thrift/lib/cpp/thrift.pc
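The substitution only rewrites the line beginning `includedir=${prefix}/include`. A self-contained demonstration on a sample pkg-config fragment (the temporary file here is hypothetical, created only for the demo):

```shell
# Create a sample pkg-config fragment mimicking thrift.pc (demo file only).
demo=$(mktemp)
cat > "$demo" <<'EOF'
prefix=/usr/local
includedir=${prefix}/include
EOF
# Same substitution as in step 4: append /thrift to the includedir value.
sed -i 's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' "$demo"
grep '^includedir' "$demo"   # prints includedir=${prefix}/include/thrift
rm -f "$demo"
```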
5) Install the rhbase package. LD_LIBRARY_PATH must be set to find the Thrift library (libthrift.so), which was installed as part of the Thrift installation. Also, PKG_CONFIG_PATH must point to the directory containing the thrift.pc package configuration file.

   # export LD_LIBRARY_PATH=/usr/local/lib
   # export PKG_CONFIG_PATH=~/thrift/lib/cpp
   # R CMD INSTALL ~/rhbase/pkg
6) Confirm that your HBase configuration (hbase-site.xml) specifies the ZooKeeper client port used by MapR, 5181.

   <property>
     <name>hbase.zookeeper.property.clientPort</name>
     <value>5181</value>
   </property>
   . . .
7) Start the MapR HBase Thrift server as a background daemon.

   # /opt/mapr/hbase/hbase-0.94.13/bin/hbase-daemon.sh start thrift

   Note: Pass the parameters stop thrift to the hbase-daemon.sh script to stop the daemon.
8) Run the rhbase package checks. LD_LIBRARY_PATH and PKG_CONFIG_PATH must still be set. Note that the command will produce errors you can safely ignore if you are not running Hadoop HBase in your MapR cluster. Recall that the rhbase package is compatible with MapR tables, which do not require Hadoop HBase to be installed and running in your MapR cluster.

   # R CMD check ~/rhbase/pkg

   Note: When running the rhbase package check, warnings can be safely ignored.
9) (optional) Persist the required environment variable settings for the shells in your environment. The commands below set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

   # echo -e "export LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64/R/library/rhbase/libs:\$LD_LIBRARY_PATH" >> /etc/profile
   # echo "export PKG_CONFIG_PATH=~root/thrift/lib/cpp" >> /etc/profile
   # echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> /etc/profile
   > hb.describe.table('/mapr/mycluster/home/user01/mytable')
   > hb.delete.table('/mapr/mycluster/home/user01/mytable')
   > q()
Conclusion

With Revolution Analytics' RHadoop packages and MapR's enterprise-grade Hadoop distribution, data scientists can utilize the full potential of Hadoop from the familiar R environment.
Resources

More information on RHadoop and the other technologies referenced in this paper can be found at the links below.

GNU R home page: http://www.r-project.org
RHadoop home page: https://github.com/RevolutionAnalytics/
Apache Thrift: http://thrift.apache.org
Revolution Analytics: http://www.revolutionanalytics.com
MapR Technologies: http://www.mapr.com
About MapR Technologies

MapR's advanced distribution for Apache Hadoop delivers on the promise of Hadoop, making the management and analysis of big data a practical reality for more organizations. MapR's advanced capabilities, such as streaming analytics, mission-critical data protection, and MapR tables, expand the breadth and depth of use cases across industries.
About Revolution Analytics

Revolution Analytics (formerly Revolution Computing) was founded in 2007 to foster the R community as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.
03/14/2014