
Course Topics

Week 1
Introduction to HDFS

Week 2
Setting Up Hadoop Cluster

Week 3
Map-Reduce Basics, Types and Formats

Week 4
PIG

Week 5
HIVE

Week 6
HBASE

Week 7
ZOOKEEPER

Week 8
SQOOP

Topics for Today


Revision
Hadoop Modes
Terminal Commands
Web UI URLs
Use Case in Healthcare
Sample Examples in Hadoop
Running the Teragen Example
Hadoop Configuration Files
Slaves & Masters
NameNode Recovery
Dump of MR Jobs
Data Loading Techniques

Class 1 - Revision
HDFS - Hadoop Distributed File System (storage)
MapReduce (processing)

Let's Revise
1. What is HDFS?
2. What is the difference between Hadoop and a relational database?
3. What is the NameNode?
4. What is the Secondary NameNode?
5. What is the difference between Gen 1 and Gen 2 Hadoop?

Hadoop Modes
Hadoop can be run in one of three modes:
Standalone (or local) mode
no daemons, everything runs in a single JVM
suitable for running MapReduce programs during development
has no HDFS

Pseudo-distributed mode
Hadoop daemons run on the local machine
Fully distributed mode
Hadoop daemons run on a cluster of machines
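
A minimal sketch of how the first two modes differ at the terminal, assuming a Hadoop 1.x installation with $HADOOP_HOME/bin on the PATH and the bundled examples jar; the input and output paths are hypothetical.

# Standalone mode: no daemons are running (jps lists none), and MapReduce
# jobs read and write the local filesystem directly.
jps
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /tmp/local-input /tmp/local-output

# Pseudo-distributed mode: start the daemons first; jps should then list
# NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.
start-all.sh
jps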

Terminal Commands

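A few commonly used HDFS shell commands, as a minimal sketch assuming a running pseudo-distributed cluster; all paths shown are hypothetical examples.

# List the contents of the HDFS root directory
hadoop fs -ls /

# Create a directory in HDFS and copy a local file into it
hadoop fs -mkdir /user/training/input
hadoop fs -put /home/training/sample.txt /user/training/input

# Read a file stored in HDFS
hadoop fs -cat /user/training/input/sample.txt

# Check overall cluster health and filesystem integrity
hadoop dfsadmin -report
hadoop fsck /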

Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
JobTracker status: http://localhost:50030/jobtracker.jsp
TaskTracker status: http://localhost:50060/tasktracker.jsp
DataBlock Scanner Report: http://localhost:50075/blockScannerReport
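
The same status information can also be checked from the terminal; a small sketch, assuming the daemons are running locally.

# Summary of HDFS capacity, live datanodes and block reports
hadoop dfsadmin -report

# Fetch the NameNode status page from the command line
curl -s http://localhost:50070/dfshealth.jsp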

Sample Examples List
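
Running the bundled examples jar with no arguments prints the list of available example programs (wordcount, teragen, terasort, pi, grep and others). A minimal sketch, assuming a Hadoop 1.x layout where the jar sits under $HADOOP_HOME.

# Print the list of example programs shipped with Hadoop
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar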

Running the Teragen Example
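
A minimal sketch of a teragen run, assuming the Hadoop 1.x examples jar; the output directory is a hypothetical example and must not already exist in HDFS.

# Generate 1,000,000 rows (100 bytes each) of synthetic data into HDFS
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000 /user/training/teragen-output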

Checking the Output

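A short sketch of checking the teragen output in HDFS; the output directory is the hypothetical one used above, and the part-file name may vary.

# List the generated part files
hadoop fs -ls /user/training/teragen-output

# Peek at the first few records of one part file
hadoop fs -cat /user/training/teragen-output/part-00000 | head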

Hadoop Configuration Files

Sample Cluster Configuration


Master
  runs the NameNode and the JobTracker

Slave01, Slave02, Slave03, Slave04, Slave05
  each runs a DataNode and a TaskTracker

Hadoop Configuration Files


Configuration Filenames and Descriptions

hadoop-env.sh - Environment variables that are used in the scripts to run Hadoop.
core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml - Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
mapred-site.xml - Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters - A list of machines (one per line) that each run a secondary namenode.
slaves - A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties - Properties for controlling how metrics are published in Hadoop.
log4j.properties - Properties for system log files, the namenode audit log and the task log for the tasktracker child process.
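
All of these files live in the conf/ directory of the installation; a quick way to inspect them, assuming a Hadoop 1.x tarball layout with $HADOOP_HOME set.

# Configuration files sit under the conf/ directory of the install
ls $HADOOP_HOME/conf
# hadoop-env.sh  core-site.xml  hdfs-site.xml  mapred-site.xml
# masters  slaves  hadoop-metrics.properties  log4j.properties  ...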

DD for each component

Core - core-site.xml
HDFS - hdfs-site.xml
MapReduce - mapred-site.xml

core-site.xml and hdfs-site.xml


core-site.xml

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Defining HDFS details in hdfs-site.xml


Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary

mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.

Slaves and Masters

Two files are used by the startup and shutdown commands:

slaves
contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

masters
contains a list of hosts, one per line, that are to host secondary NameNode servers
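
A minimal sketch of what these two files might contain for the sample cluster shown earlier; the hostnames are hypothetical.

# conf/masters - host(s) that run the secondary namenode
cat conf/masters
# master

# conf/slaves - hosts that each run a datanode and a tasktracker
cat conf/slaves
# slave01
# slave02
# slave03
# slave04
# slave05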

Per-process runtime environment


The hadoop-env.sh file sets the JAVA_HOME parameter, which tells every Hadoop daemon which JVM to use.

This file also offers a way to provide custom parameters for each of the servers. hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.

hadoop-env.sh
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"

hadoop-env.sh Sample
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun

# Extra Java runtime options. Empty by default.


export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
...
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

NameNode Recovery

1. Shut down the secondary NameNode.
2. Copy the contents of secondary:fs.checkpoint.dir to namenode:dfs.name.dir.
3. Copy the contents of secondary:fs.checkpoint.edits to namenode:dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
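
A minimal sketch of these steps from the shell, assuming a Hadoop 1.x layout with $HADOOP_HOME/bin on the PATH; the directories and the namenode hostname are hypothetical and should be taken from your own fs.checkpoint.dir and dfs.name.dir settings.

# 1. Stop the secondary namenode (run on the secondary namenode host)
hadoop-daemon.sh stop secondarynamenode

# 2./3. Copy the checkpoint data onto the namenode host
#       (paths are placeholders; use your configured directories)
scp -r /disk1/hdfs/namesecondary/* namenode-host:/disk1/hdfs/name/

# 4. Start the namenode again, then restart the secondary namenode
hadoop-daemon.sh start namenode            # on the namenode host
hadoop-daemon.sh start secondarynamenode   # on the secondary namenode host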

Reporting
hadoop-metrics.properties

This file controls how Hadoop metrics are reported. The default is not to report them.

Dump of a MR Job
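
A few commands for dumping information about MapReduce jobs from the terminal; the job ID and history directory below are hypothetical placeholders.

# List running jobs, or all jobs known to the jobtracker
hadoop job -list
hadoop job -list all

# Dump the status and counters of a specific job (job ID is a placeholder)
hadoop job -status job_201301011234_0001

# Print the history of a completed job from its output directory
hadoop job -history /user/training/teragen-output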

Data Loading Techniques

Data can be loaded into HDFS in several ways:

Using Hadoop Copy Commands
Using Flume
Using Sqoop

FLUME
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data.
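
A minimal sketch of starting a Flume agent; the agent name and configuration file are hypothetical, and the exact command depends on the Flume version installed (this assumes the Flume NG flume-ng launcher).

# Start a Flume agent named "agent1" using a hypothetical configuration file
flume-ng agent --conf conf --conf-file conf/flume-agent.conf --name agent1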

SQOOP

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
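
A minimal sketch of a Sqoop import into HDFS; the JDBC URL, credentials, table name and target directory are all hypothetical placeholders.

# Import a table from MySQL into HDFS (-P prompts for the password)
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table customers \
  --target-dir /user/training/customers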

Assignment for this Week


Attempt the following assignment using the document
present in the LMS under the tab Week 2:
Flume Set-up on Cloudera

Attempt Assignment Week 2


Refresh your Java Skills using Java for Hadoop
Tutorial on LMS

Ask your doubts

Q & A..?

Thank You
See You in Class Next Week
