
HDFS and MAPREDUCE

Prof S.Ramachandram

DISTRIBUTED FILE SYSTEMS

A Distributed File System (DFS) is simply the classical model of a file system (as
discussed before) distributed across multiple machines. The purpose is to promote
sharing of dispersed files.

This is an area of active research interest today.

The resources on a particular machine are local to it; resources on other
machines are remote.

A file system provides a service for clients. The server interface is the normal set of
file operations: create, read, etc. on files.

Goals
1 Network transparency: users do not have to be aware of the location of files in
order to access them.
Location transparency: the name of a file does not reveal the file's
physical storage location.
Example: /server1/dir1/dir2/X
server1 can be moved anywhere (e.g., from CIS to SEAS).
Location independence: the name of a file does not need to be changed when
the file's physical storage location changes.
With location transparency alone, the above file X cannot be moved to server2
even if server1 is full and server2 is not.

2 High availability: files should remain available despite system failures or
scheduled activities such as backups or the addition of nodes.

Architecture
Computation model
file servers -- machines dedicated to storing files and performing storage
and retrieval operations (for high performance)
clients -- machines used for computational activities; they may have a local
disk for caching remote files
Two most important services
name server -- maps user-specified names to stored objects, files and
directories
cache manager -- caches remote files near the client to reduce network and
disk delays; it introduces the problem of cache inconsistency
Typical data access actions
open, close, read, write, etc.

Design Issues

Naming and name resolution


Semantics of file sharing
Stateless versus stateful servers
Caching -- where to store files
Cache consistency
Replication

Distributed File Systems - Present Needs


Need to process huge datasets on large
clusters of computers
Very expensive to build reliability into each
application
Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant

Need a common infrastructure


Efficient, reliable, easy to use
Open Source, Apache Licence

What is Hadoop?

Hadoop MapReduce
MapReduce is a programming model and software
framework first developed by Google (Google's
MapReduce paper, published in 2004)
Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
Petabytes of data
Thousands of nodes

Computational processing occurs on both:


Unstructured data : file system
Structured data : database
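
To make the model concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop Java MapReduce API (org.apache.hadoop.mapreduce); the map phase emits (word, 1) pairs and the reduce phase sums them, and the input and output paths taken from the command line are placeholders for HDFS directories.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would typically be packaged in a jar and submitted with "hadoop jar" (jar and path names here are placeholders).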

Hadoop Distributed File System (HDFS)


Inspired by Google File System
Scalable, distributed, portable file system written in Java for the
Hadoop framework
Primary distributed storage used by Hadoop applications

HDFS can be part of a Hadoop cluster or can be a stand-alone


general purpose distributed file system
An HDFS cluster primarily consists of
a NameNode that manages the file system metadata, and
DataNodes that store the actual data

Stores very large files in blocks across machines in a large cluster


Reliability and fault tolerance ensured by replicating data across
multiple hosts

Has data awareness between nodes


Designed to be deployed on low-cost hardware
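
A minimal sketch, assuming a hypothetical NameNode address and file path, of how an application can ask where a file's blocks and replicas are stored, using the standard Hadoop FileSystem Java API; it illustrates the block-and-replica organization just described.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally taken from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/large-input.txt");  // placeholder path
    FileStatus status = fs.getFileStatus(file);

    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
          + ", hosts " + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}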

Assumptions and Goals


Hardware Failure
Hardware failure is the norm rather than the exception.

Streaming Data Access


Applications that run on HDFS need streaming access to
their data sets. The emphasis is on high throughput of data
access rather than low latency of data access.

Large Data Sets (GB to TB)

Simple Coherency Model (a write-once-read-many
access model for files)
Moving Computation is Cheaper than Moving Data
Portability Across Heterogeneous Hardware and
Software Platforms

Hadoop Distributed File System


Expects large file size
Small number of large files
Hundreds of MB to GB each

Expects sequential access


Default block size in HDFS is 64MB
Result:
Reduces amount of metadata storage per file
(for example, a 1 GB file at 64 MB per block needs only 16 block entries on the NameNode)
Supports fast streaming of data (large amounts of
contiguous data)

Hadoop Distributed File System


HDFS expects to read a block start-to-finish
Useful for MapReduce
Not good for random access
Not a good general purpose file system

Hadoop Distributed File System


HDFS files are NOT part of the ordinary file system
HDFS files are in a separate namespace

Not possible to interact with files using ls, cp, mv, etc.
However, HDFS provides similar utilities

Hadoop Distributed File System


Metadata handled by the NameNode
Synchronization is simplified by allowing only one
machine to handle the metadata
Stores metadata for the entire file system
Not much data: file names, permissions, and
locations of each block of each file

HDFS Architecture

HDFS has a master/slave architecture.


An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster,
which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set
of DataNodes.
The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories.
It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file
system's clients.
The DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode.

HDFS Architecture
The NameNode and DataNode are pieces of software designed to
run on commodity machines. These machines typically run a
GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software.
Usage of the highly portable Java language means that HDFS can be
deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only the
NameNode software.
Each of the other machines in the cluster runs one instance of the
DataNode software
The NameNode is the arbitrator and repository for all HDFS
metadata.

HDFS Architecture
HDFS supports a traditional hierarchical file
organization.
A user or an application can create directories
and store files inside these directories.
The NameNode maintains the file system
namespace.
An application can specify the number of replicas
of a file that should be maintained by HDFS.
The number of copies of a file is called the
replication factor of that file. This information is
stored by the NameNode.
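
A small sketch, assuming a placeholder path and an example factor of five, of how an application can set and read back the per-file replication factor through the FileSystem Java API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/important.dat");   // placeholder path

    // Ask the NameNode to keep 5 replicas of every block of this file.
    boolean accepted = fs.setReplication(file, (short) 5);

    // The replication factor is ordinary metadata stored by the NameNode.
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("request accepted: " + accepted + ", replication: " + current);
    fs.close();
  }
}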

HDFS Architecture
An HDFS client wanting to read a file first contacts the NameNode
for the locations of data blocks comprising the file and then reads
block contents from the DataNode closest to the client.
When writing data, the client requests the NameNode to nominate
a suite of three DataNodes to host the block replicas.
The client then writes data to the DataNodes in a pipeline fashion.
The current design has a single NameNode for each cluster.
The cluster can have thousands of DataNodes and tens of
thousands of HDFS clients.
Each DataNode may execute multiple application tasks
concurrently.
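
A minimal sketch of this read and write flow from the application's point of view, assuming a placeholder path; the FileSystem API hides the NameNode lookups, the DataNode write pipeline and the choice of the closest replica.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Write: the client asks the NameNode for target DataNodes and then
    // streams the data to them in a pipeline; all of that is hidden here.
    Path out = new Path("/user/demo/hello.txt");         // placeholder path
    try (FSDataOutputStream os = fs.create(out, true)) { // overwrite if present
      os.writeBytes("hello, HDFS\n");
    }

    // Read: the client gets the block locations from the NameNode and reads
    // each block from the closest DataNode holding a replica.
    try (FSDataInputStream is = fs.open(out)) {
      byte[] buf = new byte[64];
      int n = is.read(buf);
      System.out.println(new String(buf, 0, n, "UTF-8"));
    }
    fs.close();
  }
}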

HDFS Architecture
[Figure: HDFS architecture. The NameNode holds the metadata (file names and replica counts, e.g. /home/foo/data with 6 replicas) and serves metadata operations for clients; DataNodes on Rack 1 and Rack 2 hold the blocks, serve client read and write (block ops) requests, and replicate blocks between one another.]

HDFS Architecture
HDFS keeps the entire namespace in RAM.
The inode data and the list of blocks belonging to each file
comprise the metadata of the name system called the image.
The persistent record of the image stored in the local host's
native file system is called a checkpoint.
The NameNode also stores the modification log of the image,
called the journal, in the local host's native file system.
For improved durability, redundant copies of the checkpoint
and journal can be made at other servers.
During restarts the NameNode restores the namespace by
reading the checkpoint and replaying the journal.
The locations of block replicas may change over time and are
not part of the persistent checkpoint.

Architecture - DataNode
Each block replica on a DataNode is represented by
two files in the local host's native file system.
The first file contains the data itself and the second file
is the block's metadata, including checksums for the block
data and the block's generation stamp.
The size of the data file equals the actual length of the
block and does not require extra space to round it up
to the nominal block size as in traditional file systems.
Thus, if a block is half full it needs only half of the
space of the full block on the local drive.

HDFS Architecture
The HDFS namespace is a hierarchy of files and
directories.
Files and directories are represented on the
NameNode by inodes, which record attributes like
permissions, modification and access times
The file content is split into large blocks (typically
128 megabytes, but user-selectable file by file).
Each block of the file is independently
replicated at multiple DataNodes (typically three).
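
A small sketch of choosing a non-default block size at file-creation time, assuming a placeholder path and an example size of 256 MB; the FileSystem.create overload shown here is the standard Hadoop way to select the block size file by file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/big-log.dat");  // placeholder path

    long blockSize = 256L * 1024 * 1024;  // 256 MB instead of the default
    short replication = 3;
    int bufferSize = 4096;

    // FileSystem.create lets the block size be chosen file by file.
    try (FSDataOutputStream os =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      os.writeBytes("payload...\n");
    }
    fs.close();
  }
}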

REPLICATION

Replication
Large HDFS instances run on a cluster of computers that commonly spread
across many racks.
Communication between two nodes in different racks has to go through
switches.
In most cases, network bandwidth between machines in the same rack is
greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the
process outlined in Hadoop Rack Awareness.
A simple but non-optimal policy is to place replicas on unique racks. This
prevents losing data when an entire rack fails and allows use of bandwidth
from multiple racks when reading data.
This policy evenly distributes replicas in the cluster which makes it easy to
balance load on component failure. However, this policy increases the cost
of writes because a write needs to transfer blocks to multiple racks.

Replication
For the common case, when the replication factor is three, HDFS's
placement policy is to put one replica on one node in the local rack,
another on a node in a different (remote) rack, and the last on a different
node in the same remote rack.
This policy cuts the interrack write traffic which generally improves write
performance. The chance of rack failure is far less than that of node
failure; this policy does not impact data reliability and availability
guarantees.
However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than
three.
With this policy, the replicas of a file do not evenly distribute across the
racks. One third of replicas are on one node, two thirds of replicas are on
one rack, and the other third are evenly distributed across the remaining
racks.
This policy improves write performance without compromising data
reliability or read performance.
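
To make the rule concrete, the following is an illustrative toy sketch in plain Java of the three-replica placement decision just described; it is NOT the actual BlockPlacementPolicy code inside the NameNode, and the Node class, host names and rack handling are invented for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: a toy model of the default three-replica placement rule.
// Assumes the cluster spans at least two racks.
public class PlacementSketch {

  // Hypothetical node descriptor: a host name plus the rack it sits on.
  static class Node {
    final String host;
    final String rack;
    Node(String host, String rack) { this.host = host; this.rack = rack; }
  }

  static final Random RNG = new Random();

  static List<Node> chooseTargets(Node writer, List<Node> cluster) {
    List<Node> targets = new ArrayList<>();
    targets.add(writer);                        // replica 1: the writer's own node (local rack)

    Node remote = pickRemoteRack(cluster, writer.rack);
    targets.add(remote);                        // replica 2: a node on a different (remote) rack

    targets.add(pickSameRack(cluster, remote)); // replica 3: another node on that remote rack
    return targets;
  }

  static Node pickRemoteRack(List<Node> cluster, String localRack) {
    List<Node> candidates = new ArrayList<>();
    for (Node n : cluster) {
      if (!n.rack.equals(localRack)) candidates.add(n);
    }
    return candidates.get(RNG.nextInt(candidates.size()));
  }

  static Node pickSameRack(List<Node> cluster, Node chosen) {
    List<Node> candidates = new ArrayList<>();
    for (Node n : cluster) {
      if (n.rack.equals(chosen.rack) && !n.host.equals(chosen.host)) candidates.add(n);
    }
    return candidates.get(RNG.nextInt(candidates.size()));
  }
}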

Replica Selection
To minimize global bandwidth consumption and
read latency
HDFS tries to satisfy a read request from a replica
that is closest to the reader. If there exists a
replica on the same rack as the reader node, then
that replica is preferred to satisfy the read
request.
If angg/ HDFS cluster spans multiple data centers,
then a replica that is resident in the local data
center is preferred over any remote replica.

Safemode Startup

On startup, the Namenode enters Safemode.
Replication of data blocks does not occur in Safemode.
Each DataNode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number
of replicas.
After a configurable percentage of safely replicated blocks have checked
in with the Namenode, the Namenode exits Safemode.
It then determines the list of blocks (if any) that still need to be replicated.
The Namenode then proceeds to replicate these blocks to other
Datanodes.
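
A hedged sketch of checking Safemode programmatically from a client; the DistributedFileSystem.setSafeMode call and the SafeModeAction enum are assumed to match the Hadoop 2.x HDFS client API (the equivalent command-line check is hadoop dfsadmin -safemode get).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // SAFEMODE_GET only queries the state; ENTER/LEAVE would change it.
      boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
      System.out.println("NameNode in safe mode: " + inSafeMode);
    }
    fs.close();
  }
}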


Filesystem Metadata
The HDFS namespace is stored by Namenode.
Namenode uses a transaction log called the EditLog to
record every change that occurs to the file system
metadata.
For example, creating a new file, or
changing the replication factor of a file.
The EditLog is stored in the Namenode's local file system.

The entire file system namespace, including the mapping of
blocks to files and the file system properties, is stored in a
file called FsImage, which is also kept in the Namenode's
local file system.

Namenode
Keeps an image of the entire file system namespace and the file
Blockmap in memory.
4 GB of local RAM is sufficient to support these data
structures, even for a huge number of files and
directories.
When the Namenode starts up, it reads the FsImage and
EditLog from its local file system, applies the EditLog
transactions to the FsImage, and then stores a fresh copy of
the FsImage back to disk as a checkpoint.
Periodic checkpointing is done so that the system can
recover to the last checkpointed state in case of a
crash.

Datanode
A Datanode stores data in files in its local file system.
The Datanode has no knowledge about the HDFS file system.
It stores each block of HDFS data in a separate file.
Datanode does not create all files in the same directory.
It uses heuristics to determine optimal number of files
per directory and creates directories appropriately:
Research issue?

When the Datanode starts up, it generates a list of all the
HDFS blocks it stores and sends this report to the Namenode:
the Blockreport.

The Communication Protocol


All HDFS communication protocols are layered on top
of the TCP/IP protocol
A client establishes a connection to a configurable TCP
port on the Namenode machine. It talks ClientProtocol
with the Namenode.
The Datanodes talk to the Namenode using Datanode
protocol.
RPC abstraction wraps both ClientProtocol and
Datanode protocol.
Namenode is simply a server and never initiates a
request; it only responds to RPC requests issued by
DataNodes or clients.

ROBUSTNESS


Objectives
The primary objective of HDFS is to store data
reliably in the presence of failures.
Three common failures are Namenode failures,
Datanode failures, and network partitions.


DataNode failure and heartbeat


A network partition can cause a subset of Datanodes to
lose connectivity with the Namenode.
Namenode detects this condition by the absence of a
Heartbeat message.
The Namenode marks Datanodes without recent Heartbeats as
dead and does not send any further I/O requests to them.
Any data registered to a dead Datanode is no longer
available to HDFS.
Also the death of a Datanode may cause replication
factor of some of the blocks to fall below their specified
value.

Re-replication
The necessity for re-replication may arise due
to:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor on the block may be
increased.


Cluster Rebalancing
HDFS architecture is compatible with data
rebalancing schemes.
A scheme might move data from one Datanode to
another if the free space on a Datanode falls below
a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically create
additional replicas and rebalance other data in the
cluster.
These types of data rebalancing are not yet
implemented: research issue.

Data Integrity
Consider a situation: a block of data fetched from a
Datanode arrives corrupted.
This corruption may occur because of faults in a
storage device, network faults, or buggy software.
An HDFS client computes a checksum of every block
of a file it writes and stores these checksums in a hidden
file in the same HDFS namespace.
When a client retrieves the contents of a file, it verifies
that the data matches the corresponding checksums.
If they do not match, the client can retrieve the block
from another replica.
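
A hedged sketch of what this looks like from the client side, assuming a placeholder file path; checksum verification happens automatically inside the FileSystem read path, and a ChecksumException surfaces only when no healthy replica can satisfy the read.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/data.bin");   // placeholder path

    // Checksum verification is on by default; it can be switched off by
    // tools that want to salvage as much of a corrupt file as possible.
    fs.setVerifyChecksum(true);

    try (FSDataInputStream in = fs.open(file)) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // process the data ...
      }
    } catch (ChecksumException e) {
      // Raised when a block's data does not match its stored checksum and no
      // healthy replica could satisfy the read.
      System.err.println("corrupt block detected at offset " + e.getPos());
    }
    fs.close();
  }
}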

Metadata Disk Failure


FsImage and EditLog are central data structures of HDFS.
A corruption of these files can cause an HDFS instance to be non-functional.
For this reason, a Namenode can be configured to maintain
multiple copies of the FsImage and EditLog.
Multiple copies of the FsImage and EditLog files are updated
synchronously.
Metadata is not data-intensive, so this synchronous updating is acceptable.
The Namenode can be a single point of failure: automatic failover
is NOT supported! Another research topic.
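
A minimal sketch, assuming Hadoop 2.x property names and placeholder directory paths, of how the redundant metadata directories are configured; in practice this property is set in hdfs-site.xml on the NameNode host rather than in code.

import org.apache.hadoop.conf.Configuration;

public class NameDirConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Write the checkpoint (FsImage) and journal (EditLog) to two directories,
    // ideally on different physical disks or an NFS mount, so the loss of one
    // copy does not destroy the file system metadata.
    conf.set("dfs.namenode.name.dir",
        "/disk1/hdfs/name,/remote-nfs/hdfs/name");  // placeholder paths
    System.out.println(conf.get("dfs.namenode.name.dir"));
  }
}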


DATA ORGANIZATION


Data Blocks
HDFS supports write-once-read-many semantics, with reads
at streaming speeds.
A typical block size is 64 MB (or even 128 MB).
A file is chopped into 64 MB chunks and stored.


Staging
A client request to create a file does not reach the
Namenode immediately.
Instead, the HDFS client caches the file data into a temporary local file.
When the cached data reaches the HDFS block size, the client
contacts the Namenode.
The Namenode inserts the file name into the file system hierarchy
and allocates a data block for it.
The Namenode responds to the client with the
identity of the Datanode and the destination replicas
(Datanodes) for the block.
The client then flushes the block of data from the local
temporary file to the designated Datanode.

Staging (contd.)
The client sends a message that the file is
closed.
The Namenode then commits the file creation
operation into its persistent store.
If the Namenode dies before the file is closed, the
file is lost.
This client-side caching is used to avoid network
congestion; it has precedent in AFS
(the Andrew File System).

Replication Pipelining
When the client receives the response from the
Namenode, it flushes its block in small pieces
(4 KB) to the first replica, which in turn copies them to
the next replica, and so on.
Thus data is pipelined from one Datanode to the
next.


API (ACCESSIBILITY)


Application Programming Interface


HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C language wrapper for the Java API is also
available.
An HTTP browser can be used to browse the files
of an HDFS instance.
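
A small sketch of browsing the namespace through the Java API, assuming a placeholder directory; it prints roughly the same information the FS shell's ls command and the HTTP browser interface expose.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // List the entries of a directory, much like "hadoop dfs -ls /user/demo".
    for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {  // placeholder dir
      System.out.printf("%s\t%d bytes\treplication=%d%n",
          s.getPath(), s.getLen(), s.getReplication());
    }
    fs.close();
  }
}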


FS Shell, Admin and Browser Interface


HDFS organizes its data in files and directories.
It provides a command line interface called the FS shell
that lets the user interact with data in the HDFS.
The syntax of the commands is similar to bash and csh.
Example: to create a directory /foodir:
/bin/hadoop dfs -mkdir /foodir
There is also a DFSAdmin command-line interface available.
A browser interface is also available to view the
namespace.


Space Reclamation
When a file is deleted by a client, HDFS renames the file to a
file in the /trash directory, where it remains for a configurable
amount of time.
A client can request an undelete within this allowed time.
After the specified time the file is deleted and the space
is reclaimed.
When the replication factor is reduced, the Namenode
selects excess replicas that can be deleted.
The next Heartbeat transfers this information to the
Datanode, which then removes the corresponding blocks
and frees the space.


Summary
We discussed the features of the Hadoop Distributed File
System, a peta-scale file system for handling big data sets.
What we discussed: architecture, protocols, API, etc.
Missing element: implementation
The Hadoop file system (internals)
An implementation of an instance of HDFS (for
use by applications such as web crawlers).
