
DATA SCIENCE RESEARCH AND BIG DATA ANALYTICS

AICTE SPONSORED Faculty Development Programme (FDP) on “DATA SCIENCE RESEARCH AND
BIG DATA ANALYTICS” scheduled from 11.12.2017 to 23.12.2017

BIG DATA ANALYTICS


Organized By
Department of Information Technology,
National Engineering College,
Kovilpatti
12 & 14 December 2017
Resource Person
Mr. D. Kesavaraja, M.E., MBA, (PhD), MISTE
Assistant Professor,
Department of Computer Science and Engineering,
Dr.Sivanthi Aditanar College of Engineering
Tiruchendur - 628215

❖ Introduction to Big Data


❖ BIG DATA Analogy
❖ Big Data Analytics
❖ Installing CentOS 7
❖ Hadoop Installation – Single Node
❖ Hadoop Distributed File System
❖ Setting up a one-node Hadoop cluster
❖ Mounting the one-node Hadoop cluster using FUSE
❖ JAVA API’s of Hadoop
❖ Map and Reduce tasks
❖ JAVA wordcount program to demonstrate the use of Map and Reduce tasks
❖ Big Data Analytics Job Opportunities

• Hands on - Live Experiments


• E-Resources, Forums and Groups
• Discussion and Clarifications
More details: www.k7cloud.in | http://k7training.blogspot.in
*************


Set Up a One-Node Apache Hadoop Cluster in CentOS 7


Aim:
To find the procedure to set up a one-node Hadoop cluster.

Introduction :

Apache Hadoop is an open-source framework built for distributed Big Data storage
and processing across computer clusters. The project is based on the following
components:
1. Hadoop Common – contains the Java libraries and utilities needed by the other
Hadoop modules.
2. HDFS (Hadoop Distributed File System) – a Java-based scalable file system
distributed across multiple nodes.
3. MapReduce – a YARN-based framework for parallel Big Data processing.
4. Hadoop YARN – a framework for cluster resource management.

Procedure :

Step 1: Install Java on CentOS 7


1. Before proceeding with the Java installation, first log in as root (or a user with
root privileges) and set your machine hostname with the following command.
# hostnamectl set-hostname master

Set Hostname in CentOS 7


Also, add a new record to the hosts file mapping your own machine's FQDN to your
system IP address.
# vi /etc/hosts
Add the below line:
192.168.1.41 master.hadoop.lan
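Optionally, verify that the record resolves (a quick check, assuming the entry above):
# getent hosts master.hadoop.lan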


Set Hostname in /etc/hosts File


Replace the above hostname and FQDN records with your own settings.
2. Next, go to the Oracle Java download page and grab the latest version of the Java SE
Development Kit 8 using the curl command:
# curl -LO -H "Cookie: oraclelicense=accept-securebackup-cookie" \
"http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.rpm"

Download Java SE Development Kit 8


3. After the Java binary download finishes, install the package by issuing the below
command:
# rpm -Uvh jdk-8u92-linux-x64.rpm
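Afterwards, a quick check (assuming the RPM installed cleanly) confirms the JDK is on the PATH:
# java -version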

Install Java in CentOS 7


Step 2: Install Hadoop Framework in CentOS 7
4. Next, create a new user account without root privileges, which we'll use for the
Hadoop installation path and working environment. The new account's home
directory will reside in /opt/hadoop.


# useradd -d /opt/hadoop hadoop


# passwd hadoop
5. In the next step, visit the Apache Hadoop page to get the link to the latest
stable version and download the archive to your system.
# curl -O http://apache.javapipe.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

Download Hadoop Package


6. Extract the archive, then copy the directory contents to the hadoop account's home
path. Also, make sure you change the ownership of the copied files accordingly.
# tar xfz hadoop-2.7.2.tar.gz
# cp -rf hadoop-2.7.2/* /opt/hadoop/
# chown -R hadoop:hadoop /opt/hadoop/

Extract-and Set Permissions on Hadoop


7. Next, log in as the hadoop user and configure the Hadoop and Java environment
variables on your system by editing the .bash_profile file.
# su - hadoop
$ vi .bash_profile
Append the following lines at the end of the file:
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin

export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Configure Hadoop and Java Environment Variables


8. Now, initialize the environment variables and check their status by issuing the
below commands:
$ source .bash_profile
$ echo $HADOOP_HOME
$ echo $JAVA_HOME


Initialize Linux Environment Variables


9. Finally, configure SSH key-based authentication for the hadoop account by running the
commands below (replace the hostname or FQDN in the ssh-copy-id command accordingly).
Also, leave the passphrase field blank in order to log in automatically via SSH.
$ ssh-keygen -t rsa
$ ssh-copy-id master.hadoop.lan
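Passwordless login can now be verified (assuming the key was copied above); the connection should not prompt for a password:
$ ssh master.hadoop.lan
$ exit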

Configure SSH Key Based Authentication


Step 3: Configure Hadoop in CentOS 7
10. Now it's time to set up the Hadoop cluster on a single node in pseudo-distributed
mode by editing its configuration files.
The Hadoop configuration files are located in $HADOOP_HOME/etc/hadoop/, which in
this tutorial resolves to the hadoop account's home directory (/opt/hadoop/).
Once you're logged in as user hadoop you can start editing the following
configuration files.
The first file to edit is core-site.xml. It contains information about the port
number used by the Hadoop instance, the memory allocated for the file system, the
data store memory limit, and the size of read/write buffers.


$ vi etc/hadoop/core-site.xml

Add the following properties between the <configuration> ... </configuration> tags.
Use localhost or your machine FQDN for the hadoop instance.
<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.lan:9000/</value>
</property>
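Once the configuration is in place, the value can be double-checked with the getconf helper (a quick sanity check, assuming fs.defaultFS was set as above):
$ hdfs getconf -confKey fs.defaultFS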

Configure Hadoop Cluster


11. Next, open and edit the hdfs-site.xml file. This file holds the data replication
value and the namenode and datanode paths on the local file system.
$ vi etc/hadoop/hdfs-site.xml

Here, add the following properties between the <configuration> ... </configuration>
tags. In this guide we'll use the /opt/volume/ directory to store our hadoop file
system.
Replace the dfs.data.dir and dfs.name.dir values accordingly.
<property>
<name>dfs.data.dir</name>
<value>file:///opt/volume/datanode</value>
</property>


<property>
<name>dfs.name.dir</name>
<value>file:///opt/volume/namenode</value>
</property>
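Optionally, since this is a single-node cluster, the replication factor can also be set to 1 here (an optional property; assuming the default of 3 is otherwise in effect):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>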

Configure Hadoop Storage


12. Because we've specified /opt/volume/ as our hadoop file system storage, we
need to create those two directories (datanode and namenode) from the root account
and grant all permissions to the hadoop account by executing the commands below.
$ su root
# mkdir -p /opt/volume/namenode
# mkdir -p /opt/volume/datanode
# chown -R hadoop:hadoop /opt/volume/
# ls -al /opt/ #Verify permissions
# exit #Exit root account to turn back to hadoop user


Configure Hadoop System Storage


13. Next, create the mapred-site.xml file to specify that we are using the YARN
MapReduce framework (see the copy step below).
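In Hadoop 2.7.2 this file ships as a template; assuming the default archive layout, copy it into place first:
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml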
$ vi etc/hadoop/mapred-site.xml
Add the following excerpt to mapred-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>


Set Yarn MapReduce Framework


14. Now, edit the yarn-site.xml file with the statements below enclosed
between the <configuration> ... </configuration> tags:

$ vi etc/hadoop/yarn-site.xml
Add the following excerpt to yarn-site.xml file:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Add Yarn Configuration


15. Finally, set the Java home variable for the Hadoop environment by editing the
line below in the hadoop-env.sh file.

$ vi etc/hadoop/hadoop-env.sh
Edit the following line to point to your Java system path.
export JAVA_HOME=/usr/java/default/


Set Java Home Variable for Hadoop


16. Also, replace the localhost value in the slaves file with the machine
hostname set at the beginning of this tutorial.
$ vi etc/hadoop/slaves
Step 4: Format Hadoop Namenode
17. Once the hadoop single-node cluster has been set up, it's time to initialize the
HDFS file system by formatting the /opt/volume/namenode storage directory with the
following command:
$ hdfs namenode -format


Format Hadoop Namenode

Hadoop Namenode Formatting Process


Step 5: Start and Test Hadoop Cluster
18. The Hadoop service scripts are located in the $HADOOP_HOME/sbin directory. To
start the Hadoop services, run the commands below in your console:
$ start-dfs.sh
$ start-yarn.sh


Check the status of the services with the following command.


$ jps

Start and Test Hadoop Cluster


Alternatively, you can view a list of all open sockets for Apache Hadoop on your
system using the ss command.
$ ss -tul
$ ss -tuln # Numerical output


Check Apache Hadoop Sockets


19. To test the hadoop file system, create a directory in HDFS and copy a file from
the local file system to HDFS storage (i.e., insert data into HDFS).
$ hdfs dfs -mkdir /my_storage
$ hdfs dfs -put LICENSE.txt /my_storage

Check Hadoop Filesystem Cluster


To view file contents or list a directory inside the HDFS file system, issue the
commands below:
$ hdfs dfs -cat /my_storage/LICENSE.txt
$ hdfs dfs -ls /my_storage/


List Hadoop Filesystem Content

Check Hadoop Filesystem Directory


To retrieve data from HDFS to our local file system, use the command below:
$ hdfs dfs -get /my_storage/ ./

Copy Hadoop Filesystem Data to Local System


Get the full list of HDFS command options by issuing:
$ hdfs dfs -help
Step 6: Browse Hadoop Services
20. In order to access the Hadoop services from a remote browser, visit the links
below (replace the IP address or FQDN accordingly). Also, make sure the listed ports
are open in your system firewall; a firewall-cmd sketch follows the list of links.
For Hadoop Overview of NameNode service.
http://192.168.1.41:50070


Access Hadoop Services


For Hadoop file system browsing (Directory Browse).
http://192.168.1.41:50070/explorer.html

Hadoop Filesystem Directory Browsing


For Cluster and Apps Information (ResourceManager).
http://192.168.1.41:8088


Hadoop Cluster Applications


For NodeManager Information.
http://192.168.1.41:8042

Hadoop NodeManager
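If firewalld is active, the web UI ports above can be opened as follows (a minimal sketch, assuming the default zone; adjust to your firewall policy):
# firewall-cmd --permanent --add-port=50070/tcp
# firewall-cmd --permanent --add-port=8088/tcp
# firewall-cmd --permanent --add-port=8042/tcp
# firewall-cmd --reload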
Step 7: Manage Hadoop Services
21. To stop all hadoop instances, run the commands below:
$ stop-yarn.sh
$ stop-dfs.sh


Stop Hadoop Services


22. In order to start the Hadoop daemons automatically at boot, log in as the root user,
open the /etc/rc.local file for editing, and add the lines below:
$ su - root
# vi /etc/rc.local
Add this excerpt to the rc.local file.
su - hadoop -c "/opt/hadoop/sbin/start-dfs.sh"
su - hadoop -c "/opt/hadoop/sbin/start-yarn.sh"
exit 0

Enable Hadoop Services at System-Boot


Then, make the rc.local file executable, and enable, start, and check the service
status by issuing the commands below:
$ chmod +x /etc/rc.d/rc.local
$ systemctl enable rc-local
$ systemctl start rc-local
$ systemctl status rc-local

Enable and Check Hadoop Services


That’s it! Next time you reboot your machine the Hadoop services will be
automatically started for you!

Hadoop Fuse Installation and Configuration on Centos

What is FUSE?
• FUSE allows you to write a normal userspace application as a bridge
to the conventional file system interface.
• The hadoop-hdfs-fuse package allows you to use your HDFS cluster as if it
were a conventional file system on Linux.
• It is assumed that you have a working HDFS cluster and know the
hostname and port that your NameNode exposes.
• The Hadoop FUSE installation and configuration, with HDFS mounted
via FUSE, is done by following the steps below.
Step 1 : Required Dependencies
Step 2 : Download and Install FUSE
Step 3 : Install RPM Packages
Step 4 : Modify HDFS FUSE
Step 5 : Check HADOOP Services
Step 6 : Create a Directory to Mount HADOOP


Step 7 : Modify HDFS-MOUNT Script


Step 8 : Create softlinks of LIBHDFS.SO
Step 9 : Check Memory Details

To start the Hadoop FUSE installation and configuration, follow the steps below.

Step 1 : Required Dependencies


Hadoop single-/multi-node cluster (running)
JDK (preinstalled)
This FUSE mount installation and configuration guide was prepared on the
following platform and services:
Operating System : CentOS release 6.4 (Final) 32bit
hadoop           : hadoop-1.2.1
mysql-server     : 5.1.71
JDK              : java version "1.7.0_45" 32bit (jdk-7u45-linux-i586.rpm)
fuse             : hdfs-fuse-0.2.linux2.6-gcc4.1-x86.tar.gz
fuse RPMs        : fuse-libs-2.8.3-4.el6.i686,
                   fuse-2.8.3-4.el6.i686,
                   fuse-devel-2.8.3-4.el6.i686

Step 2 : Download and Install FUSE

Log in as the hadoop user on a node in the hadoop cluster (master / datanode).

Download hdfs-fuse from the following location:

[hadoop@hadoop ~]$ wget https://hdfs-fuse.googlecode.com/files/hdfs-fuse-0.2.linux2.6-gcc4.1-x86.tar.gz

Extract hdfs-fuse-0.2.linux2.6-gcc4.1-x86.tar.gz:
[hadoop@hadoop ~]$ tar -zxvf hdfs-fuse-0.2.linux2.6-gcc4.1-x86.tar.gz

Step 3 : Install RPM Packages

Switch to the root user to install the following RPM packages:
fuse-libs-2.8.3-4.el6.i686
fuse-2.8.3-4.el6.i686
fuse-devel-2.8.3-4.el6.i686

[hadoop@hadoop ~]$ su - root

[root@hadoop ~]# yum install fuse*
[root@hadoop ~]# chmod +x /usr/bin/fusermount
Step 4 : Modify HDFS FUSE

After installing the RPM packages, switch back to the hadoop user:

[root@hadoop ~]# su - hadoop

Modify the hdfs-fuse configuration / environment variables:

[hadoop@hadoop ~]$ cd hdfs-fuse/conf/


Add the following lines to hdfs-fuse.conf:

[hadoop@hadoop conf]$ vi hdfs-fuse.conf

export JAVA_HOME=/usr/java/jdk1.7.0_45             # JAVA HOME path
export HADOOP_HOME=/home/hadoop/hadoop-1.2.1       # hadoop installation home path
export FUSE_HOME=/home/hadoop                      # fuse installation path
export HDFS_FUSE_HOME=/home/hadoop/hdfs-fuse       # fuse home path
export HDFS_FUSE_CONF=/home/hadoop/hdfs-fuse/conf  # fuse configuration path
LogDir /tmp
LogLevel LOG_DEBUG
Hostname 192.168.1.52                              # hadoop master node IP
Port 9099                                          # hadoop port number

Step 5 : Check hadoop services


[hadoop@hadoop conf]$cd ..

Verify that the hadoop instance is running:


[hadoop@hadoop hdfs-fuse]$ jps
2643 TaskTracker
4704 Jps
2206 NameNode
2516 JobTracker
2432 SecondaryNameNode
2316 DataNode

Step 6 : Create a Directory to Mount HADOOP

Create a directory into which to mount the hadoop file system:

[hadoop@hadoop hdfs-fuse]$ mkdir /home/hadoop/hdfsmount
[hadoop@hadoop hdfs-fuse]$ cd
[hadoop@hadoop ~]$ pwd

Step 7 : Modify HDFS-MOUNT Script

Switch to the hdfs-fuse binary folder in order to run the mount script:

[hadoop@hadoop ~]$ cd hdfs-fuse/bin/

Modify the hdfs-mount script to set the JVM path location and other environment
settings; in our installation guide the JVM location is
/usr/java/jdk1.7.0_45/jre/lib/i386/server.

[hadoop@hadoop bin]$ vi hdfs-mount


JAVA_JVM_DIR=/usr/java/jdk1.7.0_45/jre/lib/i386/server
export JAVA_HOME=/usr/java/jdk1.7.0_45
export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
export FUSE_HOME=/home/hadoop
export HDFS_FUSE_HOME=/home/hadoop/hdfs-fuse
export HDFS_FUSE_CONF=/home/hadoop/hdfs-fuse/conf

Step 8 : Create softlinks of libhdfs.so


Create softlinks of libhdfs.so, which is located in
/home/hadoop/hadoop-1.2.1/c++/Linux-i386-32/lib/libhdfs.so:

[root@hadoop ~]# cd /home/hadoop/hdfs-fuse/lib/
[root@hadoop lib]# ln -s /home/hadoop/hadoop-1.2.1/c++/Linux-i386-32/lib/libhdfs.so .

Mount the HDFS file system to /home/hadoop/hdfsmount:
[hadoop@hadoop bin]$ ./hdfs-mount /home/hadoop/hdfsmount
or
[hadoop@hadoop bin]$ ./hdfs-mount -d /home/hadoop/hdfsmount   (the -d option enables debug output)

Step 9 : Check memory details


[hadoop@hadoop bin]$ df -h
Filesystem                     Size  Used  Avail  Use%  Mounted on
/dev/mapper/vg_hadoop-lv_root   50G  1.4G    46G    3%  /
tmpfs                          504M     0   504M    0%  /dev/shm
/dev/sda1                      485M   30M   430M    7%  /boot
/dev/mapper/vg_hadoop-lv_home   29G  1.2G    27G    5%  /home
hdfs-fuse                      768M   64M   704M    9%  /home/hadoop/hdfsmount

[hadoop@hadoop bin]$ ls /home/hadoop/hdfsmount/

tmp user

Use the fusermount command below to unmount the hadoop file system:


[hadoop@hadoop bin]$fusermount -u /home/hadoop/hdfsmount

The FUSE mount is now ready to use as a local file system.
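As a quick illustration (assuming the mount above is active and the tmp directory listed earlier exists), ordinary shell tools now operate on HDFS through the mount point:
[hadoop@hadoop ~]$ cp /etc/hosts /home/hadoop/hdfsmount/tmp/
[hadoop@hadoop ~]$ ls -l /home/hadoop/hdfsmount/tmp/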

Using FileSystem API to read and write data to HDFS


Reading data from and writing data to the Hadoop Distributed File System (HDFS)
can be done in a number of ways. Let us start by using the FileSystem API to create
and write to a file in HDFS, followed by an application to read a file from HDFS
and write it back to the local file system.
Step 1: Once you have downloaded a test dataset, we can write an application to
read a file from the local file system and write its contents to the Hadoop
Distributed File System.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;


import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriter extends Configured implements Tool {
    public static final String FS_PARAM_NAME = "fs.defaultFS";

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("HdfsWriter [local input path] [hdfs output path]");
            return 1;
        }
        String localInputPath = args[0];
        Path outputPath = new Path(args[1]);
        Configuration conf = getConf();
        System.out.println("configured filesystem = " + conf.get(FS_PARAM_NAME));
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            System.err.println("output path exists");
            return 1;
        }
        OutputStream os = fs.create(outputPath);
        InputStream is = new BufferedInputStream(new FileInputStream(localInputPath));
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}
Step 2: Compile the code, package the jar, and run it from the terminal to write a
sample file to HDFS.
[root@localhost student]# vi HdfsWriter.java
[root@localhost student]# /usr/java/jdk1.8.0_91/bin/javac HdfsWriter.java
[root@localhost student]# /usr/java/jdk1.8.0_91/bin/jar cvfe HdfsWriter.jar HdfsWriter HdfsWriter.class
[root@localhost student]# hadoop jar HdfsWriter.jar a.txt kkk.txt
configured filesystem = hdfs://localhost:9000/

Step 3: Verify that the file was written into HDFS and check its contents.


[root@localhost student]# hadoop fs -cat /user/root/kkk.txt


Step 4: Next, we write an application to read the file we just created in the Hadoop
Distributed File System and write its contents back to the local file system.
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HdfsReader extends Configured implements Tool {
    public static final String FS_PARAM_NAME = "fs.defaultFS";

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("HdfsReader [hdfs input path] [local output path]");
            return 1;
        }
        Path inputPath = new Path(args[0]);
        String localOutputPath = args[1];
        Configuration conf = getConf();
        System.out.println("configured filesystem = " + conf.get(FS_PARAM_NAME));
        FileSystem fs = FileSystem.get(conf);
        InputStream is = fs.open(inputPath);
        OutputStream os = new BufferedOutputStream(new FileOutputStream(localOutputPath));
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsReader(), args);
        System.exit(returnCode);
    }
}
Step 5: Compile the code, package the jar, and run it from the terminal to read the
file back from HDFS into the local file system.
[root@localhost student]# vi HdfsReader.java
[root@localhost student]# /usr/java/jdk1.8.0_91/bin/javac HdfsReader.java
[root@localhost student]# /usr/java/jdk1.8.0_91/bin/jar cvfe HdfsReader.jar HdfsReader HdfsReader.class
[root@localhost student]# hadoop jar HdfsReader.jar /user/root/kkk.txt sample.txt
configured filesystem = hdfs://localhost:9000/
Step 6: Verify that the file was written back to the local file system.

[root@localhost student]# cat sample.txt

MAP REDUCE

[student@localhost ~]$ su
Password:
[root@localhost student]# su - hadoop
Last login: Wed Aug 31 10:14:26 IST 2016 on pts/1
[hadoop@localhost ~]$ mkdir mapreduce
[hadoop@localhost ~]$ cd mapreduce
[hadoop@localhost mapreduce]$ vi WordCountMapper.java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

[hadoop@localhost mapreduce]$ vi WordCountReducer.java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reduce method sums the counts emitted for each word by the mappers
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // iterate through all the values available for a key, add them
        // together, and emit the key with the sum of its values
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

[hadoop@localhost mapreduce]$ vi WordCount.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // get the configuration object and set the job name
        Configuration conf = getConf();
        Job job = new Job(conf, "Word Count hadoop-0.20");

        // set the driver, mapper and reducer classes
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // set the output data type classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // accept the HDFS input and output directories at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}
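To run the word count job, compile the classes against the Hadoop classpath, package them, and submit the jar (a minimal sketch; /input and /output are example HDFS paths, not from the original session):

[hadoop@localhost mapreduce]$ javac -classpath $(hadoop classpath) *.java
[hadoop@localhost mapreduce]$ jar cf wordcount.jar *.class
[hadoop@localhost mapreduce]$ hdfs dfs -mkdir /input
[hadoop@localhost mapreduce]$ hdfs dfs -put LICENSE.txt /input
[hadoop@localhost mapreduce]$ hadoop jar wordcount.jar WordCount /input /output
[hadoop@localhost mapreduce]$ hdfs dfs -cat /output/part-r-00000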


********************************************
“Knowing is not enough
We must apply
Willing is not enough
We must do”
Best Wishes
By
D. Kesavaraja, M.E., (PhD), MISTE, AMIE
Assistant Professor/CSE

Dr.Sivanthi Aditanar College of Engineering

Tiruchendur
Website: www.k7cloud.in | Mail: k7cloud@gmail.com | Mobile: +91 9865213214

