
Course Topics

Week 1
Introduction to HDFS

Week 2
Setting Up Hadoop Cluster

Week 3
Map-Reduce Basics, Types and Formats

Week 4
PIG

Week 5
HIVE

Week 6
HBASE

Week 7
ZOOKEEPER

Week 8
SQOOP

Topics for Today


Revision
Hadoop Modes
Terminal Commands
Web UI URLs
Use Case in Healthcare
Sample Examples in Hadoop
Running the Teragen Example
Hadoop Configuration Files
Slaves & Masters
NameNode Recovery
Dump of MR Jobs
Data Loading Techniques

Class 1 - Revision
HDFS - Hadoop Distributed File System (storage)
MapReduce (processing)

Let's Revise
1. What is HDFS?
2. What is the difference between Hadoop and a relational database?
3. What is the NameNode?
4. What is the Secondary NameNode?
5. What is the difference between Gen 1 and Gen 2 Hadoop?

Hadoop Modes
Hadoop can be run in one of three modes:
Standalone (or local) mode
no daemons, everything runs in a single JVM
suitable for running MapReduce programs during development
has no HDFS

Pseudo-distributed mode
Hadoop daemons run on the local machine
Fully distributed mode
Hadoop daemons run on a cluster of machines
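
A minimal sketch of how the first two modes differ at the terminal, assuming a Hadoop 1.x installation with $HADOOP_HOME/bin on the PATH and the bundled examples jar; the input and output paths are hypothetical.

# Standalone mode: no daemons are running (jps lists none), and MapReduce
# jobs read and write the local filesystem directly.
jps
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /tmp/local-input /tmp/local-output

# Pseudo-distributed mode: start the daemons first; jps should then list
# NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.
start-all.sh
jps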

Terminal Commands

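A few commonly used HDFS shell commands, as a minimal sketch assuming a running pseudo-distributed cluster; all paths shown are hypothetical examples.

# List the contents of the HDFS root directory
hadoop fs -ls /

# Create a directory in HDFS and copy a local file into it
hadoop fs -mkdir /user/training/input
hadoop fs -put /home/training/sample.txt /user/training/input

# Read a file stored in HDFS
hadoop fs -cat /user/training/input/sample.txt

# Check overall cluster health and filesystem integrity
hadoop dfsadmin -report
hadoop fsck /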

Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
JobTracker status: http://localhost:50030/jobtracker.jsp
TaskTracker status: http://localhost:50060/tasktracker.jsp
DataBlock Scanner Report: http://localhost:50075/blockScannerReport
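
The same status information can also be checked from the terminal; a small sketch, assuming the daemons are running locally.

# Summary of HDFS capacity, live datanodes and block reports
hadoop dfsadmin -report

# Fetch the NameNode status page from the command line
curl -s http://localhost:50070/dfshealth.jsp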

Sample Examples List
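
Running the bundled examples jar with no arguments prints the list of available example programs (wordcount, teragen, terasort, pi, grep and others). A minimal sketch, assuming a Hadoop 1.x layout where the jar sits under $HADOOP_HOME.

# Print the list of example programs shipped with Hadoop
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar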

Running the Teragen Example
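
A minimal sketch of a teragen run, assuming the Hadoop 1.x examples jar; the output directory is a hypothetical example and must not already exist in HDFS.

# Generate 1,000,000 rows (100 bytes each) of synthetic data into HDFS
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000 /user/training/teragen-output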

Checking the Output

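A short sketch of checking the teragen output in HDFS; the output directory is the hypothetical one used above, and the part-file name may vary.

# List the generated part files
hadoop fs -ls /user/training/teragen-output

# Peek at the first few records of one part file
hadoop fs -cat /user/training/teragen-output/part-00000 | head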

Hadoop Configuration Files

Sample Cluster Configuration


Master
  runs the NameNode and the JobTracker

Slave01, Slave02, Slave03, Slave04, Slave05
  each runs a DataNode and a TaskTracker

Hadoop Configuration Files


Configuration Filenames and Descriptions

hadoop-env.sh - Environment variables that are used in the scripts to run Hadoop.
core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml - Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
mapred-site.xml - Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters - A list of machines (one per line) that each run a secondary namenode.
slaves - A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties - Properties for controlling how metrics are published in Hadoop.
log4j.properties - Properties for system log files, the namenode audit log and the task log for the tasktracker child process.
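
All of these files live in the conf/ directory of the installation; a quick way to inspect them, assuming a Hadoop 1.x tarball layout with $HADOOP_HOME set.

# Configuration files sit under the conf/ directory of the install
ls $HADOOP_HOME/conf
# hadoop-env.sh  core-site.xml  hdfs-site.xml  mapred-site.xml
# masters  slaves  hadoop-metrics.properties  log4j.properties  ...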

DD for each component

Core - core-site.xml
HDFS - hdfs-site.xml
MapReduce - mapred-site.xml

core-site.xml and hdfs-site.xml


core-site.xml

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Defining HDFS details in hdfs-site.xml


Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary

mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.

Slaves and Masters

Two files are used by the startup and shutdown commands:

slaves
contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

masters
contains a list of hosts, one per line, that are to host secondary NameNode servers
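
A minimal sketch of what these two files might contain for the sample cluster shown earlier; the hostnames are hypothetical.

# conf/masters - host(s) that run the secondary namenode
cat conf/masters
# master

# conf/slaves - hosts that each run a datanode and a tasktracker
cat conf/slaves
# slave01
# slave02
# slave03
# slave04
# slave05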

Per-process runtime environment


The hadoop-env.sh file sets the JAVA_HOME parameter, which tells every Hadoop daemon which JVM to use.

This file also offers a way to provide custom parameters for each of the servers. hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.

hadoop-env.sh
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"

hadoop-env.sh Sample
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun

# Extra Java runtime options. Empty by default.


export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
...
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

NameNode Recovery

1. Shut down the secondary NameNode.
2. Copy the contents of secondary:fs.checkpoint.dir to namenode:dfs.name.dir.
3. Copy the contents of secondary:fs.checkpoint.edits to namenode:dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
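
A minimal sketch of these steps from the shell, assuming a Hadoop 1.x layout with $HADOOP_HOME/bin on the PATH; the directories and the namenode hostname are hypothetical and should be taken from your own fs.checkpoint.dir and dfs.name.dir settings.

# 1. Stop the secondary namenode (run on the secondary namenode host)
hadoop-daemon.sh stop secondarynamenode

# 2./3. Copy the checkpoint data onto the namenode host
#       (paths are placeholders; use your configured directories)
scp -r /disk1/hdfs/namesecondary/* namenode-host:/disk1/hdfs/name/

# 4. Start the namenode again, then restart the secondary namenode
hadoop-daemon.sh start namenode            # on the namenode host
hadoop-daemon.sh start secondarynamenode   # on the secondary namenode host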

Reporting
hadoop-metrics.properties

This file controls how Hadoop metrics are reported. The default is not to report them.

Dump of a MR Job
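
A few commands for dumping information about MapReduce jobs from the terminal; the job ID and history directory below are hypothetical placeholders.

# List running jobs, or all jobs known to the jobtracker
hadoop job -list
hadoop job -list all

# Dump the status and counters of a specific job (job ID is a placeholder)
hadoop job -status job_201301011234_0001

# Print the history of a completed job from its output directory
hadoop job -history /user/training/teragen-output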

Data Loading Techniques

Data can be loaded into HDFS in several ways:

Using Hadoop Copy Commands
Using Flume
Using Sqoop

FLUME
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data.
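
A minimal sketch of starting a Flume agent; the agent name and configuration file are hypothetical, and the exact command depends on the Flume version installed (this assumes the Flume NG flume-ng launcher).

# Start a Flume agent named "agent1" using a hypothetical configuration file
flume-ng agent --conf conf --conf-file conf/flume-agent.conf --name agent1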

SQOOP

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
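
A minimal sketch of a Sqoop import into HDFS; the JDBC URL, credentials, table name and target directory are all hypothetical placeholders.

# Import a table from MySQL into HDFS (-P prompts for the password)
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table customers \
  --target-dir /user/training/customers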

Assignment for this Week


Attempt the following assignment using the document
present in the LMS under the tab Week 2:
Flume Set-up on Cloudera

Attempt Assignment Week 2


Refresh your Java Skills using Java for Hadoop
Tutorial on LMS

Ask your doubts

Q & A..?

Thank You
See You in Class Next Week
