✓ Hadoop Features
✓ Hadoop Components
✓ Hadoop Processes
✓ Hadoop Architecture
✓ MapReduce Framework
✓ What is YARN
✓ What is ZooKeeper
➢ The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
➢ HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
➢ Achieving data localization
   ❖ Moving the application to the place where the data is residing, OR
   ❖ Making data local to the application
➢ Two Masters
   • NameNode: if it is down, HDFS cannot be accessed
➢ Components used by HDFS
   ▪ NameNode
   ▪ DataNode
   ▪ Secondary NameNode
➢ Components used by the MapReduce framework
   ▪ Task Tracker
   ▪ Job Tracker
An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster,
which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.
The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories.
DataNodes are responsible for serving read and write requests from the file system’s
clients.
DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.
HDFS is built using the Java language; any machine that supports Java can run the
NameNode or the DataNode software.
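As a small illustration of these namespace operations (not part of the original slides; the paths are made up), a client can drive the NameNode through the Java FileSystem API:

```java
// Minimal sketch, assuming a reachable HDFS and illustrative paths.
// Namespace calls (mkdirs, rename, listStatus) are handled by the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // client handle to the file system

        Path dir = new Path("/user/demo/input");      // hypothetical directory
        fs.mkdirs(dir);                               // namespace operation on the NameNode
        fs.rename(dir, new Path("/user/demo/input-renamed"));

        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}
```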
✓ Blocks of data distributed across several machines are processed by map tasks in parallel
✓ Its only purpose is to take a snapshot of the NameNode's metadata and merge the edits log
contents into the metadata (fsimage) file on the local file system
✓ Over a period of time the edits file can become very big and the next NameNode restart can
take much longer
✓ The Secondary NameNode therefore periodically merges the edits file contents with the fsimage
file to keep the edits file size within a reasonable limit
✓ MapReduce master
✓ The client submits the job to the JobTracker
✓ The JobTracker talks to the NameNode to get the list of blocks
✓ The JobTracker locates TaskTrackers on the machines where the data is located
✓ Data Localization
✓ The JobTracker then schedules the mapper tasks first
✓ Once all the mapper tasks are over, it runs the reducer tasks (see the driver sketch below)
✓ The TaskTracker is responsible for running the tasks (map or reduce tasks) delegated by the JobTracker
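The submission flow above can be sketched with the MapReduce Job API. This is a minimal, hedged example, not taken from the slides: it uses the framework's built-in identity Mapper and Reducer and made-up input/output paths, purely to show how a client hands a job to the MapReduce master, which runs the map tasks first and the reduce tasks after all maps complete.

```java
// Minimal job-submission sketch; paths are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pass-through demo");
        job.setJarByClass(SubmitJobDemo.class);
        job.setMapperClass(Mapper.class);            // base Mapper = identity map
        job.setReducerClass(Reducer.class);          // base Reducer = identity reduce
        job.setOutputKeyClass(LongWritable.class);   // TextInputFormat key type (byte offset)
        job.setOutputValueClass(Text.class);         // TextInputFormat value type (the line)
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
        // Blocks until all map tasks and then all reduce tasks have finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```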
What is heartbeat in Hadoop
[Figure: Block A, Block B, and Block C each replicated across multiple DataNodes]
✓ Ask the NN to give the list of DataNodes (DN) that are hosting the replicas of the blocks of the file
✓ The client then reads directly from the DataNodes without contacting the NN again (see the read sketch below)
✓ Along with the data, a checksum is also shipped for verifying data integrity
• If a replica is corrupt, the client informs the NN and tries to get the data from another
DataNode (DN)
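A minimal read-path sketch, assuming an illustrative file path: open() asks the NameNode for block locations, and the bytes (with their checksums, which the client verifies) then stream directly from the DataNodes.

```java
// Minimal HDFS read sketch; the path is made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() contacts the NameNode for block locations; the data itself
        // streams from DataNodes, with checksums verified on the client side.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```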
✓ What is mapper?
✓ What is reducer?
✓ What is job?
• Complete execution of mappers and reducers over the entire data set
✓ What is task?
• Single unit of execution (map or reduce task)
• Map task executes typically over a block of data (dfs.block.size)
• Reduce task works on mapper output
✓ What is “task attempt”?
• Instance of an attempt to execute a task (map or reduce task)
• If a task fails while working on a particular portion of data, another attempt
will run on that portion of data, possibly on that machine itself
• If a task fails 4 times, then the task is marked as failed and the entire job fails
• The framework makes sure that at least one attempt of the task is run on a different machine
✓ The MapReduce framework ensures that a map task is run close to the data to
avoid network traffic
• Several map tasks run in parallel on different machines, each working
on a different portion (block) of the data
✓ NOTE: all the values for a particular intermediate key go to one reducer (see the word-count sketch below)
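The classic word-count example makes the mapper and reducer roles above concrete. This is a standard sketch, not from the original slides: each map task runs over one block/split of the input and emits (word, 1) pairs, and each reducer receives all values for a given word.

```java
// Word-count sketch illustrating the map and reduce roles described above.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map task: runs once per block/split, close to the data; emits (word, 1).
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce task: all values for a given intermediate key arrive at one reducer.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

These classes would be wired into a job with setMapperClass/setReducerClass, as in the driver sketch shown earlier.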
Input Format                | Key                                          | Value
Text Input Format           | Offset of the line within the file           | Entire line till "\n"
Key Value Text Input Format | Part of the record till the first delimiter  | Remaining record after the first delimiter
Sequence File Input Format  | Key determined from the header               | Value determined from the header
Example:
Input file:
Hello, How are you
Hey, I am good
I am from Mphasis

Key   Value
0     Hello, How are you
1     Hey, I am good
2     I am from Mphasis
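In the driver, the input format is selected with setInputFormatClass. The sketch below is illustrative, not from the slides; the separator property name shown is the one used by KeyValueTextInputFormat in the MRv2 API.

```java
// Sketch of choosing an input format for the job; nothing is submitted here.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each record is split at the first tab (or another delimiter) into key and value.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "input format demo");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Leaving the input format unset keeps the default TextInputFormat,
        // which delivers (byte offset, whole line) pairs as in the table above.
    }
}
```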
✓ Similar to reducer
✓ Runs on the same machine as the mapper task
✓ Runs the reducer code on the intermediate output of the mapper
✓ Thus minimizing the intermediate key-value pairs
✓ Combiner runs on intermediate output of each mapper
Advantages
✓ Minimizes data transfer across the network
✓ Speeds up execution
✓ Reduces the burden on the reducer
✓ The combiner is called after you emit your key-value pairs from the mapper with
context.write(key, value)
✓ A large number of mappers running will generate a large amount of intermediate data
• And if only one reducer is specified, then every intermediate key and its
list of values goes to that single reducer
✓ Copying will take a lot of time
✓ Sorting will also be time consuming
✓ Whether a single machine can handle that much intermediate data is doubtful; the
combiner, wired in as sketched below, cuts this volume down on the mapper side
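A minimal sketch of wiring a combiner into the job, reusing the hypothetical WordCountMapper/WordCountReducer classes from the earlier sketch (an assumption, not part of the slides); since word count's reduce function is associative and commutative, the reducer class can double as the combiner.

```java
// Driver sketch showing where the combiner plugs in; assumes the WordCount
// classes from the earlier sketch are on the classpath and the paths are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerDemo.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setCombinerClass(WordCount.WordCountReducer.class); // pre-aggregates on the mapper's machine
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));     // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```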
➢ If the Hadoop framework feels that a certain task (mapper or reducer) is taking longer on
average compared to the other tasks from the same job, it clones the "long running"
task and runs it on another node. This is called Speculative Execution.
➢ In other words, Hadoop is speculating that something is wrong with the "long running" task
and runs a clone of the task on another node
➢ The slowness of the "long running" task could be due to faulty hardware, network
congestion, or the node simply being busy, etc.
➢ Most of the time this is a false alarm and the task which was considered long
running or problematic completes successfully. In that case Hadoop kills the cloned
task and proceeds with the results from the completed task (see the configuration sketch below).
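Speculative execution can be tuned per job. The sketch below is an illustration under the assumption that the MRv2 property names mapreduce.map.speculative and mapreduce.reduce.speculative are in use; it only shows where such a switch would be set.

```java
// Sketch of toggling speculative execution for map and reduce tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);      // allow cloning slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false);   // but not slow reduce tasks

        Job job = Job.getInstance(conf, "speculative execution demo");
        System.out.println("map speculative = "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
    }
}
```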
What is YARN
✓ In MRv2 (YARN), the two major functionalities of the JobTracker are split apart
✓ Resource management and job scheduling/monitoring are handled by separate daemons (a global ResourceManager and a per-application ApplicationMaster)
What is ZooKeeper
https://en.wikipedia.org/wiki/Apache_Hadoop
https://hadoop.apache.org/
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html