
Big Data & Hadoop

Architecture and Development

Raghavan Solium
Big Data Consultant
raghavan.solium@gmail.com

Day - 1

Understanding Big Data

What is Big Data


Challenges with Big Data
Why not RDBMS / EDW?
Distributed Computing & MapReduce Model

What is Apache Hadoop

Hadoop & its eco system


Components of Hadoop (Architecture)
Hadoop deployment modes
Install & Configure Hadoop
Hands on with Standalone mode

Day - 1
HDFS The Hadoop DFS

Building Blocks
Name Node & Data Node
Starting HDFS Services
HDFS Commands
Hands on
Configure HDFS
Start & Examine the daemons
Export & Import files into HDFS

Map Reduce Anatomy

MapReduce Workflow
Job Tracker & Task Tracker
Starting MapReduce Services
Hands on
Configure MapReduce
Start & Examine the daemons

Day - 2
MapReduce Programming

Java API
Data Types
Input & Output Formats
Hands on

Advance Topics

Combiner
Partitioner
Counters
Compression, Speculative Execution, Zero & One Reducer
Distributed Cache
Job Chaining
HDFS Federation
HDFS HA
Hadoop Cluster Administration

Day - 3
Pig

What is Pig Latin?


Pig Architecture
Install & Configure Pig
Data Types & Common Query algorithms
Hands On

Hive

What is Hive?
Hive Architecture
Install & Configure Hive
Hive Data Models
Hive Metastore
Partitioning and Bucketing

Hands On

Day - 4
Sqoop

What is Sqoop
Install & Configure Sqoop
Import & Export
Hands On

Introduction to Amazon Cloud


What is AWS
EC2, S3
How to leverage AWS for Hadoop

Day - 4
Hadoop Administration

HDFS Persistent Data Structure


HDFS Safe Mode
HDFS File system Check
HDFS Block Scanner
HDFS Balancer
Logging
Hadoop Routine Maintenance
Commissioning & Decommissioning of nodes
Cluster Machine considerations
Network Topology
Security

What is the BIG hype about Big Data?


Maybe part of it is hype, but the problems are big, real and of big value. How?
We are in the age of advanced analytics (that is where the problem lies; we want
to analyze the data), where valuable business insight is mined out of historical data
But we also live in the age of a data deluge, where individuals, enterprises and
machines leave so much data behind, summing up to many Terabytes and often
Petabytes, and it is only expected to grow
Good news - a blessing in disguise: more data means better precision
More data usually beats better algorithms
But how are we going to analyze it?
Traditional database or warehouse systems crawl or crack at these volumes
They are inflexible to handle most of these formats
This is the very characteristic of Big Data

Nature of Big Data


Huge volumes of data that cannot be handled by traditional database or
warehouse systems; it is mostly machine produced, largely unstructured and
grows at high velocity
7

Let's Define

Variety
Sensor Data
Machine logs
Social media data
Scientific data
RFID readers
sensor networks
vehicle GPS traces
Retail transactions

Volume
The New York Stock Exchange has
several petabytes of data for analysis
Facebook hosts approximately 10
billion photos, taking up one
petabyte of storage
At the end of 2010 the Large Hadron
Collider near Geneva, Switzerland had
about 150 petabytes of data

Velocity
The New York Stock Exchange
generates about one terabyte of new
trade data every day
The Large Hadron Collider produces
about 15 petabytes of data per year
Weather sensors collect data every
hour at many locations across the
globe and gather a large volume of
log data
8

Inflection Points
Data Storage
Big Data ranges from several terabytes to petabytes.
At these volumes the access speed of the storage devices will dominate overall analysis
time.
A terabyte of data requires about 2.5 hours to be read from a drive with a 100 MB/s transfer rate
Writing will be even slower

Is dividing the data to conquer it a solution here?


Have multiple disk drives, split your data file into small enough pieces across the
drives and do parallel reads and processing
Hardware reliability (failure of any drive) is a challenge
Resolving data interdependency between drives is a notorious challenge
The number of disk drives that can be added to a single server is limited

Analysis
Much of Big Data is unstructured; traditional RDBMS / EDW systems cannot handle it
A lot of Big Data analysis is ad hoc in nature and involves whole-data scans,
self-referencing, joining, combining etc.
Traditional RDBMS / EDW systems cannot handle these with their limited scalability
options and architectural limitations
You can incorporate better servers and processors and throw in more RAM, but there
is a limit to it
9

Inflection Points
We need a drastically different approach
A distributed file system with high capacity and high reliability
A processing engine that can handle structured / unstructured
data
A computation model that can operate on distributed data
and abstracts data dispersion
PRAM and MapReduce are such models

Let us see what MapReduce is

10

What is MapReduce Model

[Diagram: An input file is divided into splits (K1, V1). Each split is processed by a
Map task on a separate computer, the intermediate key/value pairs (K2, V2) are
sorted, and Reduce tasks running on the cluster combine them into output part
files (K3, V3).]

11

What is MapReduce Model


MapReduce is a computation model that supports parallel processing on
distributed data using a cluster of computers.
The MapReduce model expects the input data to be split and distributed to the
machines on the cluster so that each split can be processed independently and
in parallel.
There are two stages of processing in the MapReduce model to achieve the final
result: Map and Reduce. Every computer in the cluster can run independent
map and reduce processes.
Map processes the input splits. The output of map is distributed again to the
reduce processes, which combine the map output to give the final expected result.
The model treats data at every stage as key/value pairs, transforming one
set of key/value pairs into a different set of key/value pairs to arrive at the end
result.
Map processes transform input key/value pairs into intermediate key/value
pairs. The MapReduce framework passes this output to the reduce processes, which
transform it to get the final result, which again will be in the form of key/value
pairs.
12

MapReduce Model
A MapReduce implementation should have
The ability to initiate and monitor parallel processes and
coordinate between them
A mechanism to route map outputs with the same key
to a single reduce process
The ability to recover from any failures transparently

13

Big Data Universe


Evolving and expanding..

14

So what's going to happen to our good friend RDBMS?


We don't know! As of now it looks like they are going to
coexist
Hadoop is a batch-oriented analysis system. It is not
suitable for low-latency data operations
MapReduce systems can output the analysis outcome to
RDBMS/EDWs for online access and point queries
RDBMS / EDW compared to MapReduce

              Traditional RDBMS             MapReduce
Data size     Gigabytes                     Petabytes
Access        Interactive and batch         Batch
Updates       Read and write many times     Write once, read many times
Structure     Static schema                 Dynamic schema
Integrity     High                          Low
Scaling       Nonlinear                     Linear

(Some of these points are debatable as the Big Data and Hadoop ecosystems are fast evolving and moving to a higher degree of
maturity and flexibility. For example, HBase brings in the ability to serve point queries.)
15

Some Use Cases


Web/Content Indexing
Finance & Insurance
Fraud detection
Sentiment Analysis

Retail
Trend analysis, Personalized promotions

Scientific simulation & analysis


Aeronautics, Particle physics, DNA Analysis

Machine Learning
Log Analytics

16

What is Apache Hadoop and how can it help with Big Data?
It is an open source Apache project for handling Big Data
It addresses the data storage and analysis (processing) issues through its
HDFS file system and its implementation of the MapReduce computation model
It is designed for massive scalability and reliability
The model enables leveraging cheap commodity servers, keeping the cost in
check

Who Loves it?

Yahoo! runs 20,000 servers running Hadoop


Its largest Hadoop cluster is 4000 servers, 16 PB raw storage
Facebook runs 2000 Hadoop servers
24 PB raw storage and 100 TB of raw logs / day
eBay and LinkedIn have production use of Hadoop
Sears retail uses Hadoop
17

Hadoop & Its ecosystem

[Diagram: HDFS (Hadoop Distributed File System) at the base, with the MapReduce
framework on top. PIG, Hive, Mahout and HBase sit above MapReduce; Apache Oozie
(workflow) and ZooKeeper (coordination service) run alongside; Sqoop brings in
structured data and Flume brings in unstructured data such as log files.]

18

Hadoop & Its ecosystem


Avro: A serialization system for efficient, cross-language RPC and persistent data storage.

MapReduce: A distributed data processing model and execution environment that runs on large
clusters of commodity machines.

HDFS: A distributed file system that runs on large clusters of commodity machines.

Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on
HDFS and MapReduce clusters.

Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (which is translated by the runtime engine to MapReduce jobs) for
querying the data.

HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and
supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such
as distributed locks that can be used for building distributed applications.

Sqoop: A tool for efficient bulk transfer of data between structured data stores (such as relational
databases) and HDFS.

Oozie: A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig,
Hive, and Sqoop jobs).
19

Hadoop Requirements
Supported Platforms
GNU/Linux is supported as a development and production platform
Win32 is supported as a development platform only
Cygwin is required for running on Windows

Required Software
Java 1.6.x
ssh must be installed and sshd must be running (for launching the
daemons on the cluster with password-less login)

Development Environment
Eclipse 3.5 or above

20

Lab Requirements
Windows 7 - 64 bit OS, Min 4 GB Ram
VMWare Player 5.0.0
Linux VM - Ubuntu 12.04 LTS
User: hadoop, Password: hadoop123
Java 6 installed on Linux VM
Open SSH installed on Linux VM
Putty - For opening Telnet sessions to the Linux VM
WinSCP - For transferring files between Windows / VM
Eclipse 3.5
21

Hands On
Using the VM
Install & Configure hadoop

Install & Configure ssh


Set up Putty & WinScp
Set up lab directories
Install open JDK
Install & Verify hadoop

22

Starting VM

23

Starting VM

Enter user ID/ Password : hadoop / hadoop123

24

Install & Configure ssh

Install ssh
>>sudo apt-get install ssh

Check ssh installation


>>which ssh
>>which sshd
>>which ssh-keygen

Generate an ssh key


>>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Copy the public key as an authorized key (equivalent to slaves)


>>cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
>>chmod 700 ~/.ssh
>>chmod 600 ~/.ssh/authorized_keys
25

Verify ssh
Verify SSH by logging into target (localhost here)
>>ssh localhost
This command should log you into the machine localhost

26

Accessing VM Putty & WinSCP


Get IP address of the Linux VM
>>ifconfig

Use Putty to telnet to VM


Use WinSCP to FTP to VM

27

Lab VM Directory Structure


User Home Directory for user hadoop (created by default by the OS)
/home/hadoop

Working directory for the lab session


/home/hadoop/lab

Downloads directory (installables downloaded and stored under this)


/home/hadoop/lab/downloads

Data directory (sample data is stored under this)


/home/hadoop/lab/data

Create directory for installing the tools


/home/hadoop/lab/install

28

Install & Configure Java


Install Open JDK
>>sudo apt-get install openjdk-6-jdk

Check Installation
>>java -version

Configure Java Home in environment


Add a line to .bash_profile to set Java Home
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
Hadoop will use this during runtime

Install Hadoop

Download Hadoop Jar


http://apache.techartifact.com/mirror/hadoop/common/hado
op-1.0.3/hadoop-1.0.3.tar.gz
FTP the file to Linux VM into ~/lab/downloads folder

Untar (execute the following commands)


>>cd ~/lab/install
>>tar xvf ~/lab/downloads/hadoop-1.0.3.tar.gz
Check the extracted directory hadoop-1.0.3
>>ls -l hadoop-1.0.3

Configure environment in .bash_profile


Add below two lines and execute bash profile
>>export HADOOP_INSTALL=~/lab/install/hadoop-1.0.3
>>export PATH=$PATH:$HADOOP_INSTALL/bin
>>. .bash_profile

30

Run an Example
Verify Hadoop installation
>> hadoop version

Try the following


>>hadoop
Will provide command usage

>>cd $HADOOP_INSTALL
>>hadoop jar hadoop-examples-1.0.3.jar
Will provide the list of classes in the above jar file

>>hadoop jar hadoop-examples-1.0.3.jar wordcount <input directory> <output directory>


31

Component of Core Hadoop

[Diagram: A Client talks over the network to the Name Node and the Job Tracker. A
Secondary Name Node supports the Name Node. Each cluster machine runs a Data Node
and a Task Tracker; the Task Trackers run the Map and Reduce tasks.]

(Hadoop supports many other file systems other than HDFS itself. To leverage Hadoop's abilities completely, HDFS is one of
the most reliable file systems)
32

Components of Core Hadoop


At a high level, Hadoop architectural components can be classified into two categories
Distributed file management system - HDFS
This has central and distributed sub-components
NameNode - Centrally monitors and controls the whole file system
DataNode - Takes care of the local file segments and constantly communicates
with the NameNode
Secondary NameNode - Do not confuse: this is not a NameNode backup. It
just backs up the file system status from the NameNode periodically
Distributed computing system - MapReduce Framework

This again has central and distributed sub-components


Job Tracker - Centrally monitors the submitted job and controls all processes
running on the nodes (computers) of the cluster. It communicates with the Name
Node for file system access

Task Tracker - Takes care of the local job execution on the local file segment. It
talks to the DataNode for file information and constantly communicates with the Job
Tracker daemon to report the task progress
When the Hadoop system is running in a distributed mode, all these daemons run
on their respective computers
33

Hadoop Operational Modes

Hadoop can be run in one of three modes


Standalone (Local) Mode
No daemons launched
Everything runs in a single JVM
Suitable for development

Pseudo Distributed Mode


All daemons are launched on a single machine, thus
simulating a cluster environment
Suitable for testing & debugging

Fully Distributed Mode


The Hadoop daemons run in a cluster environment
Each daemon runs on the machine assigned to it
Suitable for Integration Testing / Production
A typical distributed setup runs the Name Node on one machine, the Job Tracker & Secondary Name Node on another machine, and the
rest of the machines in the cluster each run a Data Node and a Task Tracker daemon
34

Hadoop Configuration Files

The configuration files can be found under the conf directory

File Name                   Format                     Description
hadoop-env.sh               Bash script                Environment variables that are used in the scripts to run Hadoop
core-site.xml               Hadoop configuration XML   Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml               Hadoop configuration XML   Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml             Hadoop configuration XML   Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers
masters                     Plain text                 List of machines (one per line) that run a secondary namenode
slaves                      Plain text                 List of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties   Java properties            Properties for controlling how metrics are published in Hadoop
log4j.properties            Java properties            Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process
35

Key Configuration Properties

Property Name        Conf File         Standalone           Pseudo Distributed   Fully Distributed
fs.default.name      core-site.xml     file:/// (default)   hdfs://localhost/    hdfs://namenode/
dfs.replication      hdfs-site.xml     N/A                  1                    3 (default)
mapred.job.tracker   mapred-site.xml   local (default)      localhost:8021       jobtracker:8021

36

HDFS
37

Design of HDFS
HDFS is Hadoop's Distributed File System
Designed for storing very large files (of sizes up to petabytes)
A single file can be stored across several disks
Designed for streaming data access patterns
Not suitable for low-latency data access
Designed to be highly fault tolerant, hence can run on
commodity hardware

38

HDFS Concepts
Like any file system, HDFS stores files by breaking them
into units called blocks
The default HDFS block size is 64 MB
The large block size helps in maintaining high
throughput
Each block is replicated across multiple machines in the
cluster for redundancy

39

Design of HDFS - Daemons

[Diagram: A Client gets the block information for a file from the Name Node, then
reads the data blocks directly from the Data Nodes in the Hadoop cluster. The
Secondary Name Node sits alongside the Name Node.]
40

Design of HDFS - Daemons


The HDFS file system is managed by two daemons
NameNode & DataNode
NameNode & DataNode function in master/ slave fashion
NameNode Manages File system namespace
Maintains file system tree and the metadata of all the files and directories
Filesystem Image
Edit log

DataNodes store and retrieve the blocks for the files when they are told to by the
NameNode
The NameNode maintains the information on which DataNodes all the blocks for a
given file are located
DataNodes report to the NameNode periodically with the list of blocks they are
storing
With the NameNode down, HDFS is inaccessible

Secondary NameNode
Not a backup for the NameNode
Just helps in merging the filesystem image with the edit log to avoid the edit log
becoming too large
41

Hands On
Configure HDFS file system for hadoop
Format HDFS
Start & Verify HDFS services
Verify HDFS
Stop HDFS services
Change replication

42

Configuring HDFS core-site.xml (Pseudo Distributed Mode)


Set JAVA_HOME in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
The property is used on the remote machines

Set up core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
Add fs.default.name property under configuration tag to specify NameNode location.
localhost for Pseudo distributed mode. Name node runs at port 8020 by default if no
port is specified

43

Starting HDFS
Format NameNode
>>hadoop namenode -format
Creates empty file system with storage directories and
persistent data structures
Data nodes are not involved

Start dfs services & verify daemons


>>start-dfs.sh
>>jps

List / Check HDFS


>>hadoop fs -ls
>>hadoop fsck / -files -blocks
>>hadoop fs -mkdir testdir
44

Verify HDFS
List / Check HDFS again
>>hadoop fs -ls
>>hadoop fsck / -files -blocks

Stop dfs services


>>stop-dfs.sh
>>jps
No java processes should be running

45

Configuring HDFS - hdfs-site.xml (Pseudo Distributed Mode)


<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Add dfs.replication property under configuration tag. value is set


to 1 so that no replication is done
46

Configuring HDFS - hdfs-site.xml (Pseudo Distributed Mode)

Property Name       Description                                                  Default Value
dfs.name.dir        Directories for the NameNode to store its persistent data   ${hadoop.tmp.dir}/dfs/name
                    (comma-separated directory names). A copy of the metadata
                    is stored in each of the listed directories
dfs.data.dir        Directories where the DataNode stores blocks. Each block    ${hadoop.tmp.dir}/dfs/data
                    is stored in only one of these directories
fs.checkpoint.dir   Directories where the secondary NameNode stores              ${hadoop.tmp.dir}/dfs/namesecondary
                    checkpoints. A copy of the checkpoint is stored in each
                    of the listed directories
47

Basic HDFS Commands


Creating a directory
hadoop fs -mkdir <dirname>

Removing a directory
hadoop fs -rmr <dirname>

Copying files to HDFS from the local filesystem
hadoop fs -copyFromLocal <local dir>/<filename> <hdfs dir name>/<hdfs file name>

Copying files from HDFS to the local filesystem
hadoop fs -copyToLocal <hdfs dir name>/<hdfs file name> <local dir>/<filename>

Listing files and directories
hadoop fs -ls <dir name>

Listing the blocks that make up each file in HDFS
hadoop fsck / -files -blocks

48
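(The same operations can also be done programmatically. A minimal Java sketch using the HDFS FileSystem API, assuming the Hadoop 1.x core jar on the classpath; the file and directory names are illustrative:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);                // connects to fs.default.name (HDFS here)

    fs.mkdirs(new Path("testdir"));                      // hadoop fs -mkdir testdir
    fs.copyFromLocalFile(new Path("/home/hadoop/lab/data/data.txt"),
                         new Path("testdir/data.txt"));  // hadoop fs -copyFromLocal

    for (FileStatus status : fs.listStatus(new Path("testdir"))) {   // hadoop fs -ls testdir
      System.out.println(status.getPath() + " " + status.getLen());
    }
    fs.close();
  }
}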

Hands On
Create data directories for
NameNode
Secondary NameNode
DataNode

Configure the nodes


Format HDFS
Start DFS service and verify daemons
Create directory retail in HDFS
Copy files from lab/data/retail directory to HDFS retail
directory
Verify the blocks created
Do fsck on HDFS to check the health of HDFS file system
49

Create data directories for HDFS


Create directory for NameNode
>>cd ~/lab
>>mkdir hdfs
>>cd hdfs
>>mkdir namenode
>>mkdir secondarynamenode
>>mkdir datanode
>>chmod 755 datanode

50

Configuring data directories for HDFS


Configure HDFS directories
Add the following properties in hdfs-site.xml
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/lab/hdfs/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/lab/hdfs/datanode</value>
<final>true</final>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/home/hadoop/lab/hdfs/secondarynamenode</value>
<final>true</final>
</property>
51

HDFS Web UI

Hadoop provides a web UI for viewing HDFS


Available at http://<VM host IP>:50070/
Browse file system
Log files

52

MapReduce

53

MapReduce
A distributed parallel processing engine of Hadoop
Processes the data in sequential parallel steps called
Map
Reduce

Best run with a DFS supported by Hadoop to exploit its parallel


processing abilities
Has the ability to run on a cluster of computers
Each computer is called a node

Input/output data at every stage is handled in terms of key/value


pairs
Keys and values can be chosen by the programmer

Mapper outputs with the same key are sent to the same reducer
Input to a reducer is always sorted by key
The number of mappers and reducers per node can be configured
54

MapReduce Workflow - Word count

[Diagram: The input file "If you go up and down / The weight go down and the health
go up" is split across three computers. Each Map task emits (word, 1) pairs (K2, V2)
for its split, e.g. (if,1) (you,1) (go,1) (up,1) (and,1) (down,1). The intermediate
pairs are shuffled so that all pairs for a key reach the same Reduce task, which sums
them to produce the final counts (K3, V3): and 2, down 2, go 3, if 1, up 2, you 1,
the 2, health 1, weight 1.]

55

Design of MapReduce - Daemons


The MapReduce system is managed by two daemons
JobTracker & TaskTracker
JobTracker & TaskTracker function in master/ slave
fashion
JobTracker coordinates the entire job execution
TaskTracker runs the individual tasks of map and reduce
JobTracker does the bookkeeping of all the tasks run on the
cluster
One map task is created for each input split
Number of reduce tasks is configurable
mapred.reduce.tasks
56

Design of MapReduce - Daemons

[Diagram: A Client submits a job to the Job Tracker over the network. Task Trackers
on the cluster machines run the Map and Reduce tasks and read/write their data on
HDFS.]
57

Hands On
Configure MapReduce
Start MapReduce daemons
Verify the daemons
Stop the daemons

58

mapred-site.xml - Pseudo Distributed Mode

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Add mapred.job.tracker property under configuration tag to specify JobTracker
location. localhost:8021 for Pseudo distributed mode.
59

Starting Hadoop MapReduce Daemons

Start MapReduce Services


>>start-mapred.sh
>>jps

Stop MapReduce Services


>>stop-mapred.sh
>>jps

60

MapReduce
Programming
61

MapReduce Programming

Having seen the functioning of MapReduce, to perform a


job in hadoop a programmer needs to create
A MAP function
A REDUCE function
A Driver to communicate with the framework, configure and
launch the job
[Diagram: The Driver passes execution parameters to the MapReduce framework in the
execution environment, which runs the Map tasks and then the Reduce tasks to produce
the output.]
62

Retail Use case


Set of transactions in txn.csv
Txn ID, TXN Date, Cust ID, Amt, Category, Sub-Cat, Addr-1, Addr-2, Credit/Cash
00999990,08-19-2011,4003754,147.66,Team Sports,Lacrosse,Bellevue,Washington,credit
00999991,10-09-2011,4006641,126.19,Water Sports,Surfing,San Antonio,Texas,credit
00999992,06-09-2011,4005497,097.78,Water Sports,Windsurfing,San Diego,California,credit

Customer details in custs.csv


Cust ID, First Name, Last Name, Age, Profession
4009983,Jordan,Tate,35,Coach
4009984,Justin,Melvin,43,Loan officer
4009985,Rachel,Corbett,66,Human resources assistant

63
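(As a preview of the MapReduce code that follows, a hedged sketch of a mapper for this retail data set: it reads one txn.csv line per map() call and emits (category, amount) pairs so that a reducer can total sales per category. The class name is illustrative; the field positions follow the record layout above.)

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CategoryAmountMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private final Text category = new Text();
  private final DoubleWritable amount = new DoubleWritable();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // txn_id, txn_date, cust_id, amount, category, sub_category, addr1, addr2, pay_type
    String[] fields = value.toString().split(",");
    if (fields.length >= 5) {
      category.set(fields[4]);
      amount.set(Double.parseDouble(fields[3]));
      context.write(category, amount);   // (category, amount) for the reducer to sum
    }
  }
}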

Map Function
The Map function is represented by Mapper class, which declares
an abstract method map()
The Mapper class is a generic type with four type parameters for the
input and output key/value pairs
Mapper <K1, V1, K2, V2>
K1, V1 are the types of the input key/value pair
K2, V2 are the types of the output key/value pair

Hadoop provides its own types that are optimized for network
serialization
Text         - corresponds to Java String
LongWritable - corresponds to Java Long
IntWritable  - corresponds to Java Integer

The map() method must be implemented to achieve the input key/


value transformation
map() is called by MapReduce framework passing the input key/
values from the input split
map() is provided with a context object in its call, to which the
transformed key/ values can be written to
64

Mapper Word Count


public static class TokenizerMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.;:?![]'");

while (itr.hasMoreTokens()) {
word.set(itr.nextToken().toLowerCase());
context.write(word, one);
}
}

}
65

Reduce Function
The Reduce function is represented by Reducer class, which
declares an abstract method reduce()
The Reducer class is a generic type with four type parameters for the
input and output key/value pairs
Reducer <K2, V2, K3, V3>
K2, V2 are the types of the input key/value pair; these must match
the output types of the Mapper
K3, V3 are the types of the output key/value pair

The reduce() method must be implemented to achieve the


desired transformation of input key/ value
reduce() method is called by MapReduce framework passing the
input key/ values from out of map phase
MapReduce framework guarantees that the records with the
same key from all the map tasks will reach a single reduce task
Similar to the map, reduce method is provided with a context
object to which the transformed key/ values can be written to
66

Reducer Word Count


public static class IntSumReducer
extends Reducer< Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum) );
}
}

67

Driver MapReduce Job


The Job object forms the specification of a job
Configuration conf = new Configuration();
Job job = new Job(conf, "Word Count");

The Job object gives you control over how the job is run
Set the jar file containing the mapper and reducer for distribution
around the cluster
job.setJarByClass(WordCount.class);

Set Mapper and Reducer classes


job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

Input/ output location is specified by calling static methods on


FileInputFormat and FileOutputFormat classes by passing the job
FileInputFormat.addInputPath(job, path);
FileOutputFormat.setOutputPath(job, path);

Set Mapper and Reducer output types


Set Input and Output formats
Input key/ value types are controlled by the Input Formats
68

MapReduce Job Word Count


public class WordCount {

public static void main(String args[]) throws Exception {


if (args.length != 2) {
System.err.println("Usage: WordCount <input Path> <output Path>");
System.exit(-1);
}
Configuration conf = new Configuration();
Job job = new Job(conf, "Word Count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]) );
FileOutputFormat.setOutputPath(job, new Path(args[1]) );
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

69

The MapReduce Web UI


Hadoop provides a web UI for viewing job information

Available at http://<VM host IP>:50030/


Follow a job's progress while it is running
find job statistics
View job logs
Task Details

70

Combiner
Combiner function helps to aggregate the map output
before passing on to reduce function
Reduces intermediate data to be written to disk
Reduces data to be transferred over network

Combiner is represented by same interface as Reducer


Combiner for a job is specified as
job.setCombinerClass(<combinerclassname>.class);

71

Word Count With Combiner


public class WordCount {

public static void main(String args[]) throws Exception {


if (args.length != 2) {
System.err.println("Usage: WordCount <input Path> <output Path>");
System.exit(-1);
}
Configuration conf = new Configuration();
Job job = new Job(conf, "Word Count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]) );
FileOutputFormat.setOutputPath(job, new Path(args[1]) );
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

(In the case of commutative and associative functions the reducer can work as the combiner.
Otherwise a separate combiner needs to be created.)

72

Partitioning
Map tasks partition their output keys by the number of
reducers
There can be many keys in a partition
All records for a given key will be in a single partition
A Partitioner class controls partitioning based on the Key
Hadoop uses hash partition by default (HashPartitioner)

The default behavior can be changed by implementing


the getPartition() method in the Partitioner (abstract) class
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int
numPartitions);
}

A custom partitioner for a job can be set as


job.setPartitionerClass(<customPartitionerClass>.class);
73

Partitioner Example
public class WordPartitioner extends Partitioner <Text, IntWritable>{

@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
String ch = key.toString().substring(0,1);
/*if (ch.matches("[abcdefghijklm]")) {
return 0;
} else if (ch.matches("[nopqrstuvwxyz]")) {
return 1;
}
return 2;*/
//return (ch.charAt(0) % numPartitions); //round robin based on ASCII value
return 0; // default behavior
}
}

74

One or Zero Reducers


Number of reducers is to be set by the developer
job.setNumReduceTasks();

OR

mapred.reduce.tasks=10

One Reducer
The map output data is not partitioned; all key/values will reach
the only reducer
Only one output file is created
Output file is sorted by Key
Good way of combining files or producing a sorted output for
small amounts of data

Zero Reducers or Map-only


The job will have only map tasks
Each mapper output is written into a separate file (similar to
multiple reducers case) into HDFS
Useful in cases where the input split can be processed
independent of other parts
75
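(A minimal driver fragment for a map-only job, shown as an illustration; the job and class names are placeholders:)

Job job = new Job(conf, "Map Only Example");
job.setJarByClass(MapOnlyExample.class);
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);            // zero reducers: map output is written straight to HDFS
// with zero reducers the map output types are the job output types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);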

Data Types

Hadoop provides its own data types


Data types implement Writable interface
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}

Optimized for network serialization

Key data types implement WritableComparable interface


which enables key comparison
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Keys are compared with each other during the sorting phase
The respective registered RawComparator is used for the comparison
public interface RawComparator<T> extends Comparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

76

Data Types
Writable wrapper classes for Java primitives

Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
short            ShortWritable             2
int              IntWritable               4
                 VIntWritable              1-5
float            FloatWritable             4
long             LongWritable              8
                 VLongWritable             1-9
double           DoubleWritable            8

NullWritable
A special Writable class with zero-length
serialization
Used as a placeholder for a key/value
when you do not need to use that
position
77

Data Types (Custom)


Custom Data types (Custom Writables)
Custom and complex data types can be implemented per need to be used
as key and values
key data types must implement WritableComparable
Values data types need to implement at least Writable

Custom types can implement raw comparators for speed


public static class CustComparator extends WritableComparator {
public CustComparator () {
super(<custDataType>.class);
}
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
}
}
static {
WritableComparator.define(<custDataType>.class, new CustComparator());
}

WritableComparator is a general-purpose implementation of RawComparator
Custom comparators for a job can also be set as
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);

78
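(As an illustration of a custom value type, a hedged sketch of a Writable holding a transaction amount and category. The class and field names are illustrative; a key type would additionally implement WritableComparable.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TxnWritable implements Writable {
  private double amount;
  private String category;

  public TxnWritable() { }                      // no-arg constructor required by the framework

  public TxnWritable(double amount, String category) {
    this.amount = amount;
    this.category = category;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeDouble(amount);                    // serialize fields in a fixed order
    out.writeUTF(category);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    amount = in.readDouble();                   // deserialize in the same order
    category = in.readUTF();
  }

  public double getAmount()   { return amount; }
  public String getCategory() { return category; }
}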

Input Formats
An Input Format determines how the input data is to be
interpreted and passed on to the mapper
Based on an Input Format, the input data is divided into
equal chunks called splits
Each split is processed by a separate map task
Each split in turn is divided into records based on Input
Format and passed with each map call
The Key and the value from the input record is determined
by the Input Format (including types)

All input Formats implement InputFormat interface


Input format for a job is set as follows
job.setInputFormatClass(<Input Format Class Name>.class);

Two categories of Input Formats


File Based
Non File Based

79

Input Formats

[Diagram: InputFormat class hierarchy]
80

File Input Formats


FileInputFormat is the base class for all file based data
sources
Implements InputFormat interface
FileInputFormat offers static convenience methods for
setting a Job's input paths
FileInputFormat.addInputPath(job, path)

Each split corresponds to either all or part of a single file,


except for CombineFileInputFormat
File Input Formats
Text Based

TextInputFormat
KeyValueTextInputFormat
NLineInputFormat
CombineFileInputFormat (meant for lots of small files, to avoid too many splits)

Binary
SequenceFileInputFormat

81

File Input Formats - TextInputFormat


Each line is treated as a record
Key is byte offset of the line from beginning of the file
Value is the entire line
Input File

2001220,John ,peter,35,320,1st Main,lombard,NJ,manager


2001221,Karev,Yong,39,450,2nd Main,Hackensack,NJ,Sys Admin
2001223,Vinay,Kumar,27,325,Sunny Brook,2nd Block,Lonely Beach,NY,Web Designer

Input to Mapper - key: LongWritable, value: Text

K1 = 0 V1 = 2001220,John ,peter,35,320,1st Main,lombard,NJ,manager


K2 = 54 V2 = 2001221,Karev,Yong,39,450,2nd Main,Hackensack,NJ,Sys Admin
K3 = 102 V3 = 2001223,Vinay,Kumar,27,325,Sunny Brook,2nd Block,Lonely Beach,NY,Web Designer

TextInputFormat is the default input format if none is


specified
82

File Input Formats - Others


KeyValueTextInputFormat
Splits each line into key/ value based on specified delimiter
Key is the part of the record before the first appearance of the
delimiter and rest is the value
Default delimiter is tab character
A different delimiter can be set through the property
mapreduce.input.keyvaluelinerecordreader.key.value.separator

NLineInputFormat
Each file split contains a fixed number of lines
The default is one, which can be changed by setting the
property
mapreduce.input.lineinputformat.linespermap

CombineFileInputFormat
A split can consist of multiple files (based on max split size)
Typically used for lots of small files
This is an abstract class; one needs to subclass it to use it 83

File Input Formats - SequenceFileInputFormat


Sequence File
provides persistent data structure for binary key-value pairs
Provides sync points in the file at regular intervals, which
makes a sequence file splittable
The keys / values can be stored compressed or uncompressed
Two types of compression
Record
Block

SequenceFileInputFormat
Enables reading data from a Sequence File
Can read MapFiles as well
Variants of SequenceFileInputFormat
SequenceFileAsTextInputFormat
Converts keys and values into Text objects

SequenceFileAsBinaryInputFormat
Retrieves the keys and values as BytesWritable objects

84

Non File Input Formats - DBInputFormat


DBInputFormat is an input format to read data from
RDBMS through JDBC

85

Output Formats
OutputFormat class hierarchy

86

Output Formats - Types


Output Format for a job is set as
job.setOutputFormatClass(TextOutputFormat.class);

FileBased
FileOutputFormat is the Base class
FileOutputFormat offers static method for setting output path
FileOutputFormat.setOutputPath(job, path);
One file per reducer is created (default file name: part-r-nnnnn),
where nnnnn is a number designating the part number, starting from zero

TextOutputFormat
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat

MapFileOutputFormat

NullOutputFormat
DBOutputFormat
Output format to dump output data to RDBMS through JDBC
87

Lazy Output
FileOutputFormat subclasses will create output files,
even if there is no record to write
LazyOutputFormat can be used to delay output file
creation until there is a record to write
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class)
Instead of
job.setOutputFormatClass(TextOutputFormat.class)

88

Unit Testing - MRUnit


MRUnit is a unit testing library for MapReduce programs
Mappers and Reducers can be tested independently by
passing inputs
MapDriver<K1, V1, K2, V2> has methods to run a mapper by
passing an input key/value and the expected output key/values
ReduceDriver<K2, V2, K3, V3> has methods to
run a reducer by passing an input key/values and the expected output
key/values

89
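(A hedged MRUnit sketch for the word-count mapper shown earlier, assuming the MRUnit 1.x jar built against the new mapreduce API and JUnit are on the classpath:)

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {
  @Test
  public void emitsOneForEachWord() throws Exception {
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.TokenizerMapper());

    driver.withInput(new LongWritable(0), new Text("go up go"))
          .withOutput(new Text("go"), new IntWritable(1))   // expected outputs, in emit order
          .withOutput(new Text("up"), new IntWritable(1))
          .withOutput(new Text("go"), new IntWritable(1))
          .runTest();                                        // fails if the actual output differs
  }
}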

Counters
Useful means of
Monitoring job progress
Gathering statistics
Problem diagnosis

Built-in counters fall into the groups below:

MapReduce task counters


Filesystem counters
FileInputFormat counters
FileOutputFormat counters
Job counters

Each counter is either a task counter or a job counter


Counters are global. MapReduce framework aggregates
them across all maps and reduces to produce a grand
total at the end of the job
90

User Defined Counters


Counters are defined in a job by Java enum
enum Temperature {
MISSING,
MALFORMED
}

Counters are set and incremented as


context.getCounter(Temperature.MISSING).increment(1);

Dynamic counters
Counters can also be set without predefining them as enums
context.getCounter(groupName, counterName).increment(1);

Counters are retrieved as


Counters cntrs = job.getCounters();
long total = cntrs.getCounter(Task.Counter.MAP_INPUT_RECORDS);
long missing =
cntrs.getCounter(MaxTemperatureWithCounters.Temperature.MISSING);
91
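(A hedged sketch that ties the pieces together: a map() fragment that increments counters while parsing records. It mirrors the enum pattern above with an illustrative enum Quality { MISSING, MALFORMED } declared in the same Mapper class; the field layout and group name are illustrative.)

@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split(",");
  if (fields.length < 5 || fields[3].isEmpty()) {
    context.getCounter(Quality.MISSING).increment(1);        // enum (static) counter
    return;
  }
  try {
    double amt = Double.parseDouble(fields[3]);
    context.write(new Text(fields[4]), new DoubleWritable(amt));
  } catch (NumberFormatException e) {
    context.getCounter(Quality.MALFORMED).increment(1);      // bad numeric field
    context.getCounter("DataQuality", "BadAmount").increment(1);   // dynamic counter
  }
}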

Side Data Distribution


Side data: typically the read only data needed by the job
for processing the main dataset
Two methods to make such data available to task trackers
Using Job Configuration
Using Distributed Cache

Using Job Configuration


Small amount of metadata can be set as key value pairs in the
job configuration
Configuration conf = new Configuration();
conf.set("Product", "Binoculars");
conf.set("Conversion", "54.65");

The same can be retrieved in the map or reduce tasks


Configuration conf = context.getConfiguration();
String product = conf.get("Product").trim();

Effective only for small amounts of data (few KB). Else will put
pressure on memory of daemons
92

Side Data Distribution Distributed Cache


A mechanism for copying read only data in files/ archives
to the task nodes just in time
Can be used to provide 3rd party jar files
Hadoop copies these files to DFS then tasktracker copies
them to the local disk relative to tasks working directory
Distributed cache for a job can be set up by calling
methods on the Job
job.addCacheFile(new URI("<file path>/<file name>"));
job.addCacheArchive(new URI("<file path>/<file name>"));
job.addFileToClassPath(new Path("<file path>/<file name>"));

The files can be retrieved from the distributed cache


through the methods on JobContext
Path[] localPaths = context.getLocalCacheFiles();
Path[] localArchives = context.getLocalCacheArchives();
93
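(A hedged sketch of the consuming side: a mapper's setup() method reading a small lookup file placed in the distributed cache. The file name, field layout and map name are illustrative; imports needed include java.io.BufferedReader, java.io.FileReader, java.util.HashMap and org.apache.hadoop.fs.Path.)

// driver side (illustrative path): job.addCacheFile(new URI("/lookup/categories.txt"));

private final Map<String, String> categoryNames = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  Path[] cached = context.getLocalCacheFiles();       // local copies on the task node
  if (cached != null && cached.length > 0) {
    BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] parts = line.split(",");               // e.g. "code,name"
      categoryNames.put(parts[0], parts[1]);          // in-memory lookup used by map()
    }
    reader.close();
  }
}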

Multiple Inputs
Often in real life you get the related data from different
sources in different formats
Hadoop provide MultipleInputs class to handle the
situation
MultipleInputs.addInputPath(job, inputPath1, <inputformat>.class);
MultipleInputs.addInputPath(job, inputPath2, <inputformat>.class);

There is no need to set the input path or InputFormat class separately


You can even have a separate Mapper class for each input file
MultipleInputs.addInputPath(job, inputPath1, <inputformat>.class,
MapperClass1.class);
MultipleInputs.addInputPath(job, inputPath2, <inputformat>.class,
MapperClass2.class);
Both Mappers must emit same key/ value types
94

Joins
More than one record sets to be joined based on a key
Two techniques for joining data in MapReduce
Map side join (Replicated Join)
Possible only when
one of the data sets is small enough to be distributed across the data
nodes and fits into the memory for maps to independently join OR
Both the data sets are partitioned in such a way that they have an equal
number of partitions, sorted by the same key, and all records for a given key
must reside in the same partition

The smaller data set is used for the look up using the join key
Faster as the data is loaded into the memory

95

Joins
Reduce side join
Mapper will tag the records from both the data sets distinctly
Join key is used as maps output key
The records for the same key are brought together in the
reducer and reducer will complete the joining process
Less efficient as both the data sets have to go through
mapreduce shuffle

96
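(A hedged sketch of the tagging step in a reduce-side join of the retail data sets used earlier: both mappers emit cust_id as the key and prefix the value with a tag so the reducer can tell the record types apart. Tags and field positions are illustrative; the two mappers would be wired up with the MultipleInputs class shown on the previous slide.)

// imports: java.io.IOException, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper

// Mapper for custs.csv: emit (cust_id, "C|first last")
public static class CustomerJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split(",");
    context.write(new Text(f[0]), new Text("C|" + f[1] + " " + f[2]));
  }
}

// Mapper for txn.csv: emit (cust_id, "T|amount")
public static class TxnJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split(",");
    context.write(new Text(f[2]), new Text("T|" + f[3]));
  }
}
// The reducer receives all "C|" and "T|" values for one cust_id together and joins them.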

Job Chaining
Multiple jobs can be run in a linear or complex dependent fashion
Simple way is to call the job drivers one after the other with
respective configurations
JobClient.runJob(conf1);
JobClient.runJob(conf2);

Here the second job is not launched until first job is completed

For complex dependencies you can use JobControl, and


ControlledJob classes
ControlledJob cjob1 = new ControlledJob(conf1);
ControlledJob cjob2 = new ControlledJob(conf2);

cjob2.addDependingJob(cjob1);
JobControl jc = new JobControl("Chained Job");
jc.addjob(cjob1);
jc.addjob(cjob2);

jc.run();

JobControl can run jobs in parallel if there is no dependency or the


dependencies are met

97

Speculative Execution
A MapReduce job's execution time is typically determined
by the slowest running task
A job is not complete until all tasks are completed
One slow task could bring down the overall performance of the job

Tasks could be slow due to various reasons


Hardware degradation
Software issues

Hadoop Strategy Speculative Execution

Determines when a task is running longer than expected


Launches another equivalent task as backup
Output is taken from the task whichever completes first
Any duplicate tasks running are killed post that
98

Speculative Execution - Settings


Is ON by default
The behavior can be controlled independently for map
and reduce tasks
For Map Tasks
mapred.map.tasks.speculative.execution

to true/ false

For Reduce Tasks


mapred.reduce.tasks.speculative.execution

to true/ false

99
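(A hedged example of turning speculative execution off for both task types on a single job, using the Hadoop 1.x property names above; the job name is a placeholder:)

Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);     // no backup map tasks
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);  // no backup reduce tasks
Job job = new Job(conf, "No Speculation");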

Skipping Bad Records


While handling a large datasets you may not anticipate
every possible error scenario
This will result in unhandled exception leading to task
failure
Hadoop retries failed tasks (a task can fail due to other
reasons) up to four times before marking the whole job
as failed
Hadoop provides skipping mode for automatically
skipping bad records
The mode is OFF by default
Can be enabled by setting
mapred.skip.mode.enabled = true
100

Skipping Bad Records


When skipping mode is enabled, if the task fails for two
times, the record is noted during the third time and
skipped during the fourth attempt
The number of total attempts for map and reduce tasks
can be increased by setting
mapred.map.max.attempts
mapred.reduce.max.attempts

Bad records are stored under the _logs/skip directory as


sequence files

101

Hadoop Archive Files

HDFS stores small files (size << block size) inefficiently


Each file is stored in a block, increasing the disk seeks
Too many small files take up a lot of NameNode memory

Hadoop Archives (HAR) is Hadoop's file format that packs


files into HDFS blocks efficiently
HAR files are not compressed
HAR files reduce NameNode memory usage
HAR files can be used as MapReduce input directly
hadoop archive is the command to work with HAR files
hadoop archive -archiveName files.har /my/files /my
hadoop fs -ls /my

HAR files always have the .har extension


HAR files can be accessed by applications using the har:// URI scheme 102

Some Operations for Thought


Sorting, Counting, Summing
Secondary Sort
Searching, Validation and transformation
Statistical Computations
Grouping
Unions and Intersections
Inverted Index

103
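(As one worked illustration from this list, a hedged sketch of an inverted index: the mapper emits (word, document id) and the reducer collects the distinct document ids for each word. Using the input file name as the document id is an assumption for this sketch.)

// imports: java.io.IOException, java.util.Set, java.util.TreeSet, org.apache.hadoop.io.*,
// org.apache.hadoop.mapreduce.*, org.apache.hadoop.mapreduce.lib.input.FileSplit

public static class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // use the input file name as the document id
    String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
    for (String token : value.toString().toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        context.write(new Text(token), new Text(docId));
      }
    }
  }
}

public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Set<String> docs = new TreeSet<String>();        // de-duplicate and sort doc ids
    for (Text v : values) {
      docs.add(v.toString());
    }
    context.write(key, new Text(docs.toString()));   // word -> [doc1, doc2, ...]
  }
}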

Disadvantages of MapReduce
MapReduce (Java API) is difficult to program, with a long
development cycle
Trivial operations like join and filter need to be rewritten
in map/reduce key/value terms
Being locked to Java makes it hard for data
analysts to work with Hadoop
There are several abstraction layers on top of
MapReduce which make working with Hadoop simpler;
PIG and HIVE are at the leading front

104

PIG
105

PIG
PIG is an abstraction layer on top of MapReduce that frees analysts
from the complexity of MapReduce programming
Architected towards handling unstructured and semi-structured
data
It is a dataflow language, which means the data is processed in a
sequence of steps transforming the data
The transformations support relational-style operations such as
filter, union, group and join
Designed to be extensible and reusable
Programmers can develop their own functions (UDFs) and use them

Programmer friendly
Allows you to introspect data structures
Can do a sample run on a representative subset of your input

PIG internally converts each transformation into a MapReduce job


and submits it to the hadoop cluster
40 percent of Yahoo's Hadoop jobs are run with PIG
106

PIG Architecture
Pig runs as a client-side application; there is no need to
install anything on the cluster

[Diagram: Pig scripts and the Grunt shell run on the client and are translated into
Map and Reduce jobs that execute on the Hadoop cluster.]

107

Install & Configure PIG


Download a version of PIG compatible with your
hadoop installation
http://pig.apache.org/releases.html

Untar into a designated folder. This will be Pig's home


directory
>>tar xvf pig-x.y.z.tar.gz

Configure Environment Variables - add in .bash_profile


export PIG_INSTALL=/<parent directory path>/pig-x.y.z
export PATH=$PATH:$PIG_INSTALL/bin
>>. .bash_profile

Verify Installation
>>pig -help
Displays command usage

>>pig
Takes you into Grunt shell
grunt>

108

PIG Execution Modes


Local Mode

Runs in a single JVM


Operates on local file system
Suitable for small datasets and for development
To run PIG in local mode
>>pig -x local

MapReduce Mode
In this mode the queries are translated into MapReduce jobs
and run on hadoop cluster
PIG version must be compatible with hadoop version
Set HADOOP_HOME environment variable to indicate pig
which hadoop client to use
export HADOOP_HOME=$HADOOP_INSTALL
If not set, it will use the bundled version of hadoop

109

Ways of Executing PIG programs

Grunt
An interactive shell for running Pig commands
Grunt is started when the pig command is run without any
options

Script
Pig commands can be executed directly from a script file
>>pig pigscript.pig

It is also possible to run Pig scripts from Grunt shell using run
and exec.

Embedded
You can run Pig programs from Java using the PigServer class,
much like you can use JDBC
For programmatic access to Grunt, use PigRunner

110

An Example
A sequence of transformation steps to get the end result

grunt> transactions = LOAD 'retail/txn.csv' USING PigStorage(',')
       AS (txn_id, txn_dt, cust_id, amt, cat, sub_cat, adr1, adr2, trans_type);   -- LOAD

grunt> txn_100plus = FILTER transactions BY amt > 100.00;                         -- FILTER

grunt> txn_grpd = GROUP txn_100plus BY cat;                                       -- GROUP

grunt> txn_cnt_bycat = FOREACH txn_grpd GENERATE group, COUNT(txn_100plus);       -- AGGREGATE

grunt> DUMP txn_cnt_bycat;

A relation is created with every statement


111

Data Types

Simple Types
Category   Type        Description
Numeric    int         32-bit signed integer
           long        64-bit signed integer
           float       32-bit floating-point number
           double      64-bit floating-point number
Text       chararray   Character array in UTF-16 format
Binary     bytearray   Byte array
112

Data Types

Complex Types
Type    Description                                    Example
tuple   Sequence of fields of any type                 (1,'pomegranate')
bag     An unordered collection of tuples, possibly    {(1,'pomegranate'),(2)}
        with duplicates
map     A set of key-value pairs; keys must be         ['a'#'pomegranate']
        character arrays, but values may be any type
113

LOAD Operator

<relation name> = LOAD <input file with path> [USING UDF()]


[AS (<field name1>:dataType, <field name2>:dataType, ,<field name3>:dataType)]

Loads data from a file into a relation


Uses the PigStorage load function as default unless specified
otherwise with the USING option
The data can be given a schema using the AS option
The default data type is bytearray if none is specified
records = LOAD 'sales.txt';
records = LOAD 'sales.txt' AS (f1:chararray, f2:int, f3:float);
records = LOAD 'sales.txt' USING PigStorage('\t');
records = LOAD 'sales.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:float);

114

Diagnostic Operators

DESCRIBE
Describes the schema of a relation

EXPLAIN
Display the execution plan used to compute a relation

ILLUSTRATE
Illustrate step-by-step how data is transformed
Uses sample of the input data to simulate the execution.

115

Data Write Operators

LIMIT
Limits the number of tuples from a relation

DUMP
Display the tuples from a relation

STORE
Store the data from a relation into a directory.
The directory must not exist

116

Relational Operators

FILTER
Selects tuples based on Boolean expression
teenagers = FILTER cust BY age < 20;

ORDER
Sort a relation based on one or more fields
Further processing (FILTER, DISTINCT, etc.) may destroy the
ordering
ordered_list = ORDER cust BY name DESC;

DISTINCT
Removes duplicate tuples
unique_custlist = DISTINCT cust;

117

Relational Operators

GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL

FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM

118

Relational Operators

GROUP BY
Within a relation, group tuples with the same group key
GROUP ALL will group all tuples into one group
groupByProfession=GROUP cust BY profession
groupEverything=GROUP cust ALL

FOREACH
Loop through each tuple in nested_alias and generate new
tuple(s).
At least one of the fields of nested_alias should be a bag
DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed
operations in nested_op to operate on the inner bag(s).
countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
Built in aggregate functions AVG, COUNT, MAX, MIN, SUM
119

Operating on Multiple datasets

JOIN
Compute the inner join of two or
more relations based on common
field values.
X = JOIN A BY a1, B BY b1;

DUMP A;      DUMP B;      DUMP X;
(1,2,3)      (2,4)        (1,2,3,1,3)
(4,2,1)      (8,9)        (8,3,4,8,9)
(8,3,4)      (1,3)        (7,2,5,7,9)
(4,3,3)      (2,7)
(7,2,5)      (7,9)
120

Operating on Multiple datasets

COGROUP
Group tuples from two or
more relations, based on
common group values.

>>DUMP A;    >>DUMP B;
(1,2,3)      (2,4)
(4,2,1)      (8,9)
(8,3,4)      (1,3)
(4,3,3)      (2,7)
(7,2,5)      (7,9)

>>X = COGROUP A BY a1, B BY b1;
>>DUMP X;
(1, {(1,2,3)}, {(1,3)} )
(8, {(8,3,4)}, {(8,9)} )
(7, {(7,2,5)}, {(7,9)} )
(2, {}, {(2,4),(2,7)} )
(4, {(4,2,1), (4,3,3)}, {} )
121

Operating on Multiple datasets

UNION
Creates the union of two or
more relations
>>X = UNION A, B;

>>DUMP A;    >>DUMP B;    >>DUMP X;
(1,2,3)      (2,4)        (1,2,3)
(4,2,1)      (8,9)        (4,2,1)
(8,3,4)                   (8,3,4)
                          (2,4)
                          (8,9)

SPLIT
Splits a relation into two or more relations, based on Boolean


expressions
>>SPLIT X INTO C IF a1 < 5, D IF a1 > 5;

>>DUMP C;    >>DUMP D;
(1,2,3)      (8,3,4)
(4,2,1)      (8,9)
(2,4)
122

Operating on Multiple datasets

SAMPLE
Randomly samples a relation as per given sampling factor.
There is no guarantee that the same number of tuples are
returned every time.

>>sample_data = SAMPLE large_data 0.01;


Above statement generates a 1% sample of data in relation
large_data

123

UDFs
PIG lets users define their own functions and lets them
be used in the statements
The UDFs can be developed in Java, Python or
Javascript
Filter UDF
Must be a subclass of FilterFunc, which is a subclass of EvalFunc

Eval UDF
Must be a subclass of EvalFunc
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}

Load UDF
Must be a subclass of LoadFunc

Define and use a UDF


REGISTER pig-examples.jar;
DEFINE <funcName> com.training.myfunc.isCustomerTeen();
filtered = FILTER cust BY isCustomerTeen(age);
124
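(A hedged Java sketch of the isCustomerTeen filter UDF used above, written against the org.apache.pig.FilterFunc API; the package, null handling and age range are illustrative assumptions.)

package com.training.myfunc;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class isCustomerTeen extends FilterFunc {
  @Override
  public Boolean exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return false;                      // no age field -> filter the record out
    }
    int age = Integer.parseInt(input.get(0).toString());
    return age >= 13 && age <= 19;       // keep only teenage customers
  }
}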

Macros

Package reusable pieces of Pig Latin code


Define a Macro
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
max_temp = max_by_group(filtered_records, year, temperature);

Macros can be defined in separate files to Pig scripts for


reuse, in which case they need to be imported
IMPORT '<path>/<macrofile>';

125

HIVE
126

HIVE
A data warehousing framework built on top of Hadoop
Abstracts away the MapReduce complexity
Target users are generally data analysts who are
comfortable with SQL
Provides a SQL-like language called HiveQL
Hive is meant only for structured data
You can interact with Hive using several methods
CLI (Command Line Interface)
A Web GUI
JDBC

127

HIVE Architecture

[Diagram: CLI, Web GUI and JDBC clients talk to Hive's parser / planner / optimizer,
which uses the Hive Metastore and translates queries into Map and Reduce jobs that
run on the Hadoop cluster.]

128

Install & Configure HIVE


Download a version of HIVE compatible with your
hadoop installation
http://hive.apache.org/releases.html

Untar into a designated folder. This will be HIVE's home


directory
>>tar xvf hive-x.y.z.tar.gz

Configure
Environment Variables add in .bash_profile
export HIVE_INSTALL=/<parent directory path>/hive-x.y.z
export PATH=$PATH:$HIVE_INSTALL/bin

Verify Installation
>>hive -help
Displays command usage

>>hive
Takes you into hive shell
hive>

129

Install & Configure HIVE


Hadoop needs to be running
Configure Hive to point to Hadoop
Create hive-site.xml under conf directory
specify the filesystem and jobtracker using the hadoop
properties
fs.default.name
mapred.job.tracker

If not set, they default to the local file system and the local
(in-process) job runner - just like they do in Hadoop

Create following directories under HDFS


/tmp
/user/hive/warehouse
chmod g+w for both above directories

130

Install & Configure HIVE


Data store
Hive stores data under /user/hive/warehouse by default

Metastore
Out of the box, Hive comes with the lightweight SQL database
Derby to store and manage metadata
This can be configured to use other databases like MySQL

131

Hive Data Models


Databases
Tables
Partitions
Buckets

132

Hive Data Types


TINYINT 1 byte integer
SMALLINT 2 byte integer
INT 4 byte integer
BIGINT 8 byte integer
BOOLEAN true / false
FLOAT single precision
DOUBLE double precision
STRING sequence of characters
STRUCT
A column can be of type STRUCT with data {a INT, b STRING}

MAP
A collection of key-value pairs

ARRAY
An ordered collection of values, e.g. [a, b, c]
133

Tables
A Hive table is logically made up of the data being
stored and the associated metadata
Creating a Table
CREATE TABLE emp_table (id INT, name STRING, address STRING)
PARTITIONED BY (designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;

Loading Data (a partition value must be given for a partitioned table)
LOAD DATA LOCAL INPATH '/home/hadoop/employee.csv'
OVERWRITE INTO TABLE emp_table PARTITION (designation = 'manager');

View Table Schema
SHOW TABLES;
DESCRIBE emp_table;

An external table is a table whose data is stored outside the
warehouse directory
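For illustration, a hedged sketch of creating an external table (the HDFS path and columns are assumptions):

CREATE EXTERNAL TABLE ext_emp (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/external/emp';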
134

Hands On
Create retail, customers tables
hive> CREATE DATABASE retail;
hive> USE retail;
hive> CREATE TABLE retail_trans (txn_id INT, txn_date STRING, Cust_id INT,
Amount FLOAT, Category STRING, Sub_Category STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> CREATE TABLE customers (Cust_id INT, FirstName STRING,
LastName STRING, Profession STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SHOW TABLES;
hive> DESCRIBE retail_trans;

135

Hands On
Load data and run queries
hive> LOAD DATA INPATH 'retail/txn.csv' INTO TABLE retail_trans;
hive> LOAD DATA INPATH 'retail/custs.csv' INTO TABLE customers;
hive> SELECT Category, count(*) FROM retail_trans
GROUP BY Category;
hive> SELECT Category, count(*) FROM retail_trans WHERE Amount > 100
GROUP BY Category;
hive> SELECT Concat (cu.FirstName, ' ', cu.LastName), rt.Category, count(*)
FROM retail_trans rt JOIN customers cu
ON rt.cust_id = cu.cust_id
GROUP BY cu.FirstName, cu.LastName, rt.Category;

136

Queries
SELECT
SELECT id, name FROM emp_table WHERE designation = 'manager';
SELECT count(*) FROM emp_table;
SELECT designation, count(*) FROM emp_table
GROUP BY designation;

INSERT
INSERT OVERWRITE TABLE new_emp
SELECT * FROM emp_table WHERE id > 100;
Inserting into a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/results'
SELECT * FROM emp_table WHERE id > 100;

JOIN
SELECT emp_table.*, detail.age FROM emp_table JOIN detail
ON (emp_table.id = detail.id);
137

Partitioning & Bucketing


Hive can organize tables into partitions based on
column values
Partitions are specified at table creation time
When we load data into a partitioned table, the
partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

Bucketing
Bucketing imposes extra structure on the table, which
makes sampling more efficient
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
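As a hedged illustration of why bucketing helps sampling, a query that reads only one of the four buckets of the table above might look like:

SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);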
138

UDFs
UDFs have to be written in Java
Must subclass UDF
(org.apache.hadoop.hive.ql.exec.UDF)
A UDF must implement at least one evaluate() method.
public class Strip extends UDF {
  public Text evaluate(Text str) {
    if (str == null) return null;
    return new Text(str.toString().trim());  // strip surrounding whitespace
  }
}
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS
'com.hadoopbook.hive.Strip';
SELECT strip(' bee ') FROM dummy;
139

SQOOP
140

SQOOP

Sqoop allows users to extract data from a structured
data store into Hadoop for analysis
Sqoop can also export the data back to structured
stores
Installing & Configuring Sqoop
Download a version of Sqoop compatible with your Hadoop
installation
Untar into a designated folder. This will be Sqoop's home
directory
>>tar xvf sqoop-x.y.z.tar.gz

Configure
Environment variables (add to .bash_profile)
export SQOOP_HOME=/<parent directory path>/sqoop-x.y.z
export PATH=$PATH:$SQOOP_HOME/bin

Verify Installation
>>sqoop
>>sqoop help

141

Importing Data
[Import flow diagram: (1) the Sqoop client examines the table schema in the
RDBMS, (2) generates code such as MyClass.java, (3) launches multiple map
tasks on the Hadoop cluster, and (4) the map tasks use the generated code
to read the table]
142

Importing Data
Copy the MySQL JDBC driver to Sqoop's lib directory
Sqoop does not ship with JDBC drivers

Sample import
>>sqoop import --connect jdbc:mysql://localhost/retail
--table transactions -m 1
>>hadoop fs -ls transactions
The import tool runs a MapReduce job that connects to
the database and reads the table
By default, four map tasks are used (-m 1 above forces a single mapper)
The output is written to a directory named after the table, under the
user's HDFS home directory

Comma-delimited text files are generated by default

In addition to downloading data, the import tool also
generates a Java class matching the table schema
143

Codegen
The code can also be generated without import action
>>sqoop codegen --connect jdbc:mysql://localhost/hadoopguide
--table widgets --class-name Widget
The generated class can hold a single record retrieved from
the table
The generated code can be used in MapReduce programs to
manipulate the data

144

Working with Hive

Importing data into Hive
Generate a Hive table definition directly from the source
>>sqoop create-hive-table
--connect jdbc:mysql://localhost/retail --table transactions
--fields-terminated-by ','

Generate the table definition and import data into Hive
>>sqoop import
--connect jdbc:mysql://localhost/retail
--table transactions -m 1 --hive-import

Exporting data from Hive
Create the target table in the MySQL database first, then:
>>sqoop export
--connect jdbc:mysql://localhost/retail -m 1
--table customers
--export-dir /user/hive/warehouse/retail.db/customers
--input-fields-terminated-by ','
145

Administration
146

NameNode Persistent Data Structure


A newly formatted Namenode creates the
directory structure shown below
VERSION: Java properties file with the HDFS
version
edits: Any write operation, such as
creating or moving a file, is logged into edits
fsimage: Persistent checkpoint of the file
system metadata. It is updated whenever the
edit log rolls over
fstime: Records the time when fsimage
was last updated

${dfs.name.dir}/
    current/
        VERSION
        edits
        fsimage
        fstime

147

Persistent Data Structure DataNode & SNN


Secondary Namenode directory structure
${fs.checkpoint.dir}/
    current/
        VERSION
        edits
        fsimage
        fstime
    previous.checkpoint/
        VERSION
        edits
        fsimage
        fstime

Datanode directory structure
${dfs.data.dir}/
    current/
        blk_<id_1>
        blk_<id_1>.meta
        blk_<id_2>
        blk_<id_2>.meta
        subdir0/
        subdir1/

Neither needs to be formatted explicitly;
they create their directories on startup
148

HDFS Safe Mode


When a Namenode starts, it enters safe mode
It loads fsimage into memory and applies the edits from the edit log
During safe mode it does not permit any modifications to the filesystem
Safe mode is exited when the minimal replication condition is met,
plus an extension time of 30 seconds

Check whether Namenode is in safe mode


hadoop dfsadmin -safemode get

Wait until the safe mode is off


hadoop dfsadmin -safemode wait

Enter or leave safe mode


hadoop dfsadmin -safemode enter / leave
149

HDFS Filesystem Check

Hadoop provides the fsck utility to check the health of HDFS
hadoop fsck /

Affected files can either be moved (to /lost+found) or deleted
hadoop fsck / -move
hadoop fsck / -delete
Finding Blocks for a given file


hadoop fsck /user/hadoop/weather/1901 -files -blocks -racks
150

HDFS Block Scanner


Datanodes run a block scanner periodically to verify
the blocks stored on them and guard against disk errors
The default period is 3 weeks (dfs.datanode.scan.period.hours)
Corrupt blocks are reported to the Namenode for fixing
The block scan report for a datanode can be accessed at
http://<datanode>:50075/blockScannerReport
The list of blocks can be accessed by appending ?listblocks
to the above URL

151

HDFS Balancer

Over a period of time, the block distribution across the
cluster may become unbalanced
This affects data locality for MapReduce and puts
strain on highly utilized datanodes
Hadoop's balancer daemon redistributes the blocks to
restore the balance
The balancing act can be initiated through
start-balancer.sh

It produces a log file in the standard log directory
The bandwidth available to the balancer can be changed by setting
the dfs.balance.bandwidthPerSec property in hdfs-site.xml
Default bandwidth: 1 MB/sec
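For example, a hedged hdfs-site.xml snippet raising the balancer bandwidth (the value is in bytes per second; 10 MB/sec is an illustrative choice):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>  <!-- 10 * 1024 * 1024 bytes/sec, illustrative -->
</property>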
152

Logging
All Hadoop daemons produce their own log files
Log files are stored under $HADOOP_INSTALL/logs
The location can be changed by setting the HADOOP_LOG_DIR
environment variable in hadoop-env.sh
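For example, in hadoop-env.sh (the path is an assumption):
export HADOOP_LOG_DIR=/var/log/hadoop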

The log levels can be set under log4j.properties


Name Node
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

Job Tracker
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG

Stack Traces
The stack traces for all the Hadoop daemons can be obtained from the
/stacks page of each daemon's web UI
The jobtracker's stack traces, for example, can be found at
http://<jobtracker-host>:50030/stacks

153

Hadoop Routine Maintenance


Metadata Backup
Good practice to keep copies of different ages (one hour, one day,
one week, etc.)
One way is to periodically archive the secondary namenode's
previous.checkpoint directory to an offsite location
Test the integrity of the copies regularly

Data Backup
HDFS replication is not a substitute for data backup
As the data volume is very high, it is good practice to prioritize
the data to be backed up
Business-critical data
Data that cannot be regenerated

distcp is a good tool for backing up from HDFS to other filesystems

Run the filesystem check (fsck) and balancer tools regularly
154

Commissioning of New Nodes


The datanodes that are permitted to connect to the Namenode
are specified in a file pointed to by the property dfs.hosts
The tasktrackers that are permitted to connect to the Jobtracker
are specified in a file pointed to by the property mapred.hosts
This prevents arbitrary machines from connecting to the
cluster and compromising data integrity and security
To add new nodes (see the sketch after this list)
Add the network addresses of the new nodes to the above files
Run the commands to refresh the Namenode and Jobtracker
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes

Update the slaves file with the new nodes
Note that the slaves file is not used by the hadoop daemons. It is used by
the control scripts for cluster-wide operations
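A hedged sketch of pointing the daemons at include files (the file paths are assumptions):

In hdfs-site.xml:
<property>
  <name>dfs.hosts</name>
  <value>/opt/hadoop/conf/include</value>
</property>

In mapred-site.xml:
<property>
  <name>mapred.hosts</name>
  <value>/opt/hadoop/conf/include</value>
</property>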
155

Decommissioning of Nodes

To remove nodes from the cluster, the Namenode and
the Jobtracker must be informed
The decommissioning process is controlled by an exclude
file. The file location is set through a property
For HDFS it is dfs.hosts.exclude
For MapReduce it is mapred.hosts.exclude

To remove nodes from the cluster
Add their network addresses to the respective exclude files
Run the commands to update the Namenode and Jobtracker
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes

During the decommissioning process the Namenode replicates the data
on those nodes to other datanodes
Afterwards, remove the nodes from the include file as well as the slaves file
156

Cluster & Machines Considerations

Several options
Build your own cluster from scratch
Use offerings that provide Hadoop as a service in the cloud

When building your own, choose server-grade commodity
machines (commodity does not mean low-end)
Unix / Linux platform
Hadoop is designed to use multiple cores and disks
A typical machine for running a datanode and tasktracker:
Processor: two quad-core 2-2.5 GHz CPUs
Memory:    16-24 GB ECC RAM
Storage:   four 1 TB SATA disks
Network:   Gigabit Ethernet

Cluster size is typically estimated based on the storage
capacity and its expected growth
157

Master Node Scenarios

The machines running the master daemons should be
resilient, as their failure would lead to data loss and
unavailability of the cluster
On a small cluster (a few tens of nodes) you can run all master
daemons on a single machine
As the cluster grows, their memory requirements grow and they
need to be run on separate machines
The control scripts should then be run as follows
Run the HDFS control scripts from the namenode machine
The masters file should contain the address of the secondary namenode

Run the MapReduce control scripts from the jobtracker machine
The slaves file on both machines should be kept in sync so that each node
runs one datanode and one tasktracker
158

Network Topology

A common architecture consists of a two-level network
topology
[Diagram: a core switch (1 Gb or better) connecting per-rack 1 Gb switches,
with 30 to 40 servers per rack]
159

Network Topology
For a multi-rack cluster, the admin needs to map nodes to
racks so that Hadoop is network-aware and can place data as well as
MapReduce tasks as close as possible to the data
Two ways to define the network map (see the sketch after this list)
Implement the Java interface DNSToSwitchMapping
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
Have the property topology.node.switch.mapping.impl point to the
implementing class. The namenode and jobtracker will make use of it

A user-supplied script pointed to by the property
topology.script.file.name

The default behavior is to map all nodes to the same rack
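A minimal sketch of the first option, matching the single-method interface shown above (Hadoop 1.x); the class name, package, IP ranges and rack assignments are illustrative:

package com.training.topology;   // illustrative package

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {
    // Return one rack path per requested host name / IP address.
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            if (name.startsWith("10.1.1.")) {
                racks.add("/rack1");
            } else if (name.startsWith("10.1.2.")) {
                racks.add("/rack2");
            } else {
                racks.add("/default-rack");
            }
        }
        return racks;
    }
}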


160

Cluster Setup and Installation


Use automated installation tools such as kickstart or Debian
to install software on nodes
Create one master script and use the same to automate

Following steps to be carried to complete cluster setup


Install Java (6 or later) on all nodes
Create user account on all nodes for Hadoop activities
Have the same user name on all nodes
Having NFS drive as home directory makes SSH key distribution simple

Install Hadoop on all nodes and change the owner of files


Install SSH. Hadoop control scripts (not the daemons) rely on SSH
to perform cluster-wide operations

Configure
Generate an RSA key pair, share public key on all nodes
Configure Hadoop. Better way of doing it is by using tools like Chef
161
or Puppet

Memory Requirements Worker Node


The memory allocated to each daemon is controlled by the
HADOOP_HEAPSIZE setting in hadoop-env.sh
The default value is 1 GB

The tasktracker launches separate child JVMs to run map and
reduce tasks
The memory for each child JVM is set by mapred.child.java.opts.
The default value is 200 MB

The number of map and reduce tasks that can run at
any one time is set by the properties
Map    - mapred.tasktracker.map.tasks.maximum
Reduce - mapred.tasktracker.reduce.tasks.maximum
The default is two for both map and reduce tasks
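A hedged mapred-site.xml sketch for the eight-core worker discussed on the next slide (the values are assumptions, not defaults):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx400m</value>
</property>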
162

Memory Requirements Worker Node

The number of tasks that can run simultaneously on a
tasktracker depends on the number of processors
A good rule of thumb is to run between one and two times
as many tasks as processors
For example, on an eight-core machine:
One core is kept for the datanode and tasktracker daemons
On the remaining 7 cores we can run 7 map and 7 reduce tasks
Increasing the child JVM memory to 400 MB, the total
memory required is 7.6 GB (14 tasks x 400 MB = 5.6 GB, plus
1 GB each for the datanode and tasktracker daemons)

163

Other Properties to consider


Cluster Membership
Buffer Size
HDFS Block size
Reserved storage space
Trash
Job Scheduler
Reduce slow start
Task Memory Limits

164

Security
Hadoop uses Kerberos for authentication
Kerberos does not manage permissions for Hadoop
To enable Kerberos authentication, set the property
hadoop.security.authentication in core-site.xml to kerberos
Enable service-level authorization by setting
hadoop.security.authorization to true in the same file
To control which users and groups can do what, configure
Access Control Lists (ACLs) in hadoop-policy.xml
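A minimal core-site.xml sketch enabling both settings:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>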
165

Security Policies
Allow only alice, bob, and users in the mapreduce group to submit jobs
<property>
  <name>security.job.submission.protocol.acl</name>
  <value>alice,bob mapreduce</value>
</property>

Allow only users in the datanode group to communicate with the Namenode
<property>
  <name>security.datanode.protocol.acl</name>
  <value>datanode</value>
</property>

Allow any user to talk to the HDFS cluster as a DFSClient
<property>
  <name>security.client.protocol.acl</name>
  <value>*</value>
</property>
166

Recommended Readings

167
