
1. INTRODUCTION

HDFS is an Apache Software Foundation project and a subproject of the Apache Hadoop project. Hadoop is ideal for storing large amounts of data, such as terabytes and petabytes, and uses HDFS as its storage system. HDFS lets you connect nodes (commodity personal computers) contained within clusters over which data files are distributed. You can then access and store the data files as one seamless file system. Access to data files is handled in a streaming manner, meaning that applications and commands are executed directly against the data using the MapReduce processing model.

HDFS is fault tolerant and provides high-throughput access to large data sets. This
article explores the primary features of HDFS and provides a high-level view of the
HDFS architecture. HDFS has many similarities with other distributed file systems, but
is different in several respects. One noticeable difference is HDFS's write-once-read-
many model that relaxes concurrency control requirements, simplifies data coherency,
and enables high-throughput access. Another unique attribute of HDFS is the viewpoint
that it is usually better to locate processing logic near the data rather than moving the
data to the application space. HDFS rigorously restricts data writing to one writer at a
time. Bytes are always appended to the end of a stream, and byte streams are guaranteed
to be stored in the order written. HDFS has many goals. Here are some of the most
notable:

 Fault tolerance by detecting faults and applying quick, automatic recovery.
 Data access via MapReduce streaming.
 Simple and robust coherency model.
 Processing logic close to the data, rather than the data close to the processing logic.
 Portability across heterogeneous commodity hardware and operating systems.
 Scalability to reliably store and process large amounts of data.
 Economy by distributing data and processing across clusters of commodity personal computers.
 Efficiency by distributing data and the logic to process it in parallel on nodes where the data is located.

Apache Hadoop is an open source software framework for storage and large scale
processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-
level project being built and used by a global community of contributors and users. It
is licensed under the Apache License 2.0.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to talk; he called his beloved stuffed yellow elephant "Hadoop".

All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common and thus
should be automatically handled in software by the framework. Apache Hadoop's
MapReduce and HDFS components originally derived respectively from Google's
MapReduce and Google File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is
now commonly considered to consist of a number of related projects as well: Apache
Pig, Apache Hive, Apache HBase, and others.

2. WHAT IS HDFS

 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
 It is designed to scale up from a single machine to thousands of machines, each offering local computation and storage.
 HDFS supports the rapid transfer of data between compute nodes. At its outset,
it was closely coupled with MapReduce, a programmatic framework for data
processing.
 The file system replicates, or copies, each piece of data multiple times and
distributes the copies to individual nodes, placing at least one copy on a
different server rack than the others. As a result, the data on nodes that crash
can be found elsewhere within a cluster.
 HDFS uses master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single NameNode that managed file system operations and
supporting DataNodes that managed data storage on individual compute nodes.
The HDFS elements combine to support applications with large data sets.
 This master-node "data chunking" architecture takes its design cues from the Google File System (GFS), a proprietary file system outlined in Google technical papers, as well as from IBM's General Parallel File System (GPFS), a format that boosts I/O by striping blocks of data over multiple disks and writing blocks in parallel. While HDFS is not compliant with the Portable Operating System Interface (POSIX) model, it echoes POSIX design style in some respects.
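To make the client-side view concrete, here is a minimal Java sketch (not taken from this report; the NameNode URI and the file path are placeholders) that uses the Hadoop FileSystem API to connect to HDFS and print basic metadata for one file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStatusExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the URI below is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem is the client-side entry point to HDFS.
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Length (bytes): " + status.getLen());
            System.out.println("Block size:     " + status.getBlockSize());
            System.out.println("Replication:    " + status.getReplication());
        }
    }
}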

3. HDFS ARCHITECTURE

HDFS comprises interconnected clusters of nodes where files and directories reside. An HDFS cluster contains a single NameNode, which manages the file system namespace and regulates client access to files, and DataNodes, which store data as blocks within files. Within HDFS, a given name node manages file system namespace operations such as opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node. This design facilitates a simplified model for managing each namespace and arbitrating data distribution.

Name nodes and data nodes are software components designed to run in a decoupled manner on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports Java can run HDFS. A typical installation cluster has a dedicated machine that runs the name node and possibly one data node. Each of the other machines in the cluster runs one data node.

Figure 1 HDFS Architecture
Data nodes continuously loop, asking the name node for instructions. A name node does not connect directly to a data node; it simply returns values from functions invoked by a data node. Each data node maintains an open server socket so that client code or other data nodes can read or write data. The host and port for this server socket are known to the name node, which provides the information to interested clients or other data nodes. The name node maintains and administers changes to the file system namespace.
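As a small illustration of how the name node's block-to-data-node mapping is visible to clients, the sketch below (the file path is hypothetical and the HDFS configuration is assumed to be on the classpath) asks for the block locations of a file through the standard FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // assumes HDFS settings on the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // The name node maps each block of the file to the data nodes that store it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}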

4. GOALS OF HDFS

 Distributed – files are split into blocks, 128 MB by default, and spread across the cluster.
 Hardware Failure – detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS, achieved through replication (default factor 3).
 Streaming Data Access – applications that run on HDFS need streaming access to their data sets.
 They are not general-purpose applications that typically run on general-purpose file systems.
 HDFS is designed more for batch processing than for interactive use by users.
 The emphasis is on high throughput of data access rather than low latency of data access.
 POSIX imposes many hard requirements that are not needed for applications targeted at HDFS.
 POSIX semantics in a few key areas have been traded to increase data throughput rates.
 Large Data Sets – tuned for large data sets.
 Simple Coherency Model – write-once-read-many; HDFS files are immutable.
 Data Locality.
 Portability across Heterogeneous Hardware and Software Platforms.
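The block size and replication factor mentioned above can also be overridden per file when it is created. The following sketch is illustrative only; the path and the chosen values (2 replicas, 256 MB blocks) are assumptions for the example, not recommendations from this report:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/large-output.bin"); // hypothetical path

            // Override the cluster defaults for this one file:
            // 2 replicas instead of 3, 256 MB blocks instead of 128 MB.
            short replication = 2;
            long blockSize = 256L * 1024 * 1024;
            try (FSDataOutputStream out =
                         fs.create(file, true, 4096, replication, blockSize)) {
                out.writeUTF("payload written once, read many times");
            }
        }
    }
}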

5. BENEFITS OF HDFS

 Open-source – Apache Hadoop is an open-source project, so its code can be modified according to business requirements.
 Distributed Processing – as data is stored in a distributed manner in HDFS across the cluster, it is processed in parallel on the cluster of nodes.
 Fault Tolerance – by default, 3 replicas of each block are stored across the cluster, and this can be changed as required. If any node goes down, the data on that node can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
 Reliability – due to replication, data is stored reliably on the cluster of machines despite machine failures. Even if your machine goes down, your data remains safely stored.
 High Availability – data is highly available and accessible despite hardware failure because multiple copies are kept. If a machine or some hardware crashes, the data can be accessed from another path.
 Scalability – Hadoop is highly scalable in that new hardware can easily be added to the nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
 Economic – Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. Hadoop also provides huge cost savings because it is easy to add more nodes on the fly, so if requirements grow, nodes can be added without downtime and without much advance planning.
 Easy to Use – the client does not need to deal with distributed computing; the framework takes care of everything, so it is easy to use.
 Data Locality – Hadoop works on the data locality principle, which states that computation should be moved to the data instead of moving data to the computation. When a client submits an algorithm, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the algorithm was submitted and then processing it.
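As a small illustration of the fault-tolerance point above, the replication factor of an existing file can be changed after the fact through the FileSystem API. This is a sketch only; the path and the new factor are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplicationExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/data/important.log"); // hypothetical file

            // Ask the name node to keep 5 copies of every block of this file
            // instead of the default 3; re-replication happens in the background.
            boolean accepted = fs.setReplication(file, (short) 5);
            System.out.println("Replication change accepted: " + accepted);
        }
    }
}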

6. LIMITATIONS OF HDFS

 Issue with Small Files – Hadoop is not suited to small files; they are a major problem in HDFS. A small file is one significantly smaller than the HDFS block size. HDFS is designed for a small number of large files rather than a large number of small ones, so storing huge numbers of small files overloads the NameNode, which holds the entire HDFS namespace.
 Processing Speed – MapReduce processes large data sets with a parallel, distributed algorithm in two phases, Map and Reduce. These phases take considerable time, which increases latency; because the data is distributed and processed across the cluster, this adds time and reduces processing speed.
 Support Only for Batch Processing – Hadoop supports only batch processing. It does not process streamed data, so overall performance is slower. The MapReduce framework does not leverage the memory of the cluster to the maximum.
 Iterative Processing – Hadoop is not efficient for iterative processing, as it does not support cyclic data flow, that is, a chain of stages in which the input to each stage is the output of the previous one.
 Vulnerable by Nature – Apache Hadoop is written entirely in Java, one of the most widely used languages and one heavily exploited by cyber-criminals, which has implicated it in numerous security breaches.
 Security – Hadoop can be challenging when managing complex applications. Hadoop lacks encryption at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
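One common workaround for the small-files issue is to pack many small files into a single large HDFS file so that the NameNode tracks one namespace entry instead of thousands. The sketch below assumes a hypothetical input directory and output file; note that this simple concatenation discards file boundaries, whereas formats such as SequenceFiles or Hadoop archives preserve them:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PackSmallFilesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path smallDir = new Path("/data/small-files");   // hypothetical input directory
            Path packed = new Path("/data/packed/all.dat");  // hypothetical output file

            try (FSDataOutputStream out = fs.create(packed, true)) {
                // Every small file is its own name node namespace entry; concatenating
                // them into one large file reduces that metadata load.
                for (FileStatus status : fs.listStatus(smallDir)) {
                    if (status.isFile()) {
                        try (InputStream in = fs.open(status.getPath())) {
                            // false = do not close the streams here; 'in' is closed by
                            // try-with-resources and 'out' must stay open for the next file.
                            IOUtils.copyBytes(in, out, 4096, false);
                        }
                    }
                }
            }
        }
    }
}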

7. HDFS STORAGE

 It is used to store and manage huge amounts of data on a cluster of machines.
 It is scalable and provides fast access.
 Fault Tolerance: by detecting faults and applying quick, automatic recovery.
 Processing logic close to the data, rather than the data close to the processing logic.
 Efficiency: by distributing data and the logic to process it in parallel on nodes where the data is located.
 Reliability: by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failure.

8. HDFS CLUSTER

Figure 2 Hadoop Cluster



File systems that manage the storage across a network of machines are called distributed file systems.

Name Node
Functions of a NameNode:
 It directs the Datanodes (Slave nodes) to execute the low-level I/O
operations.
 It records the metadata of all the files stored in the cluster, e.g. the
location, the size of the files, permissions, hierarchy, etc.
 The NameNode regularly receives a Heartbeat and a Blockreport from all the DataNodes in the cluster to make sure that the data nodes are working properly. A Blockreport contains a list of all blocks on a DataNode.

Data Node:

Various functions of Datanodes:

 Datanodes serve the low-level read and write requests from the file system's clients.
 They regularly send a report on all the blocks they store to the NameNode.
 Datanodes also enable pipelining of data.
 They forward data to other specified DataNodes.
 Datanodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.
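These reports and heartbeats feed the NameNode's picture of the cluster, and a coarse summary of that picture is available to any client. The following sketch (assuming HDFS configuration is on the classpath) prints the aggregate capacity figures:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterStatusExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Aggregate view derived from the data nodes' reports to the name node.
            FsStatus status = fs.getStatus();
            System.out.println("Capacity  (bytes): " + status.getCapacity());
            System.out.println("Used      (bytes): " + status.getUsed());
            System.out.println("Remaining (bytes): " + status.getRemaining());
        }
    }
}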

9. HDFS MAPREDUCE

Figure 3 MapReduce

Job Tracker –

 Receives requests for MapReduce execution from the client.
 Talks to the NameNode to determine the location of the data.
 Finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the available slots to execute a task on a given node.
 Monitors the individual TaskTrackers and submits the overall status of the job back to the client.
 If the JobTracker goes down, HDFS will still be functional, but MapReduce execution cannot be started and existing MapReduce jobs will be halted.

TaskTracker –

 Runs on DataNodes; usually on every DataNode.
 Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
 Is assigned Mapper and Reducer tasks to execute by the JobTracker.
 Is in constant communication with the JobTracker, signalling the progress of the task in execution.
 TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker assigns the task it was executing to another node.
 As with the storage daemons, the computing daemons also follow a master/slave architecture.
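To show what a client actually submits for execution, here is a minimal word-count job written against the org.apache.hadoop.mapreduce API. It is a generic sketch, not code from this report; the HDFS input and output paths are supplied on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split local to this task.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On an MRv1 cluster this submission is received by the JobTracker; on YARN the ResourceManager plays the corresponding role.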

10. HDFS FEATURES

Features of HDFS:

 Support for very large files.
 Commodity Hardware: HDFS runs on commodity hardware; Hadoop does not require high-end hardware or expensive software as part of its basic installation.
 With commodity hardware, the chance of node failure somewhere in the cluster is high, at least for large clusters.
 Streaming data access: HDFS is built around the most efficient data-processing pattern, which is write once, read many times.
 It supports streaming data access.
 Once a file is stored in HDFS, it cannot be overwritten in place.
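The write-once-read-many model can be seen directly in the client API: creating a file with the overwrite flag set to false succeeds only once, after which the file can be read any number of times. The sketch below uses a hypothetical path and assumes it does not exist before the first run:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadManyExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/data/write-once.txt"); // hypothetical path

            // Write the file exactly once.
            try (FSDataOutputStream out = fs.create(file, false)) { // overwrite = false
                out.writeUTF("written once");
            }

            // It can now be read any number of times...
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            // ...but a second create with overwrite = false is rejected,
            // because existing file contents cannot be changed in place.
            try (FSDataOutputStream out = fs.create(file, false)) {
                out.writeUTF("attempted rewrite");
            } catch (IOException e) {
                System.out.println("Rewrite rejected: " + e.getMessage());
            }
        }
    }
}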

11. HDFS COMPONENTS
All the components of the Hadoop ecosystem are evident as explicit entities. A holistic view of the Hadoop architecture gives prominence to Hadoop Common, Hadoop YARN, the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Hadoop Common provides the Java libraries, utilities, OS-level abstractions, and the Java files and scripts needed to run Hadoop, while Hadoop YARN is a framework for job scheduling and cluster resource management. HDFS provides high-throughput access to application data, and Hadoop MapReduce provides YARN-based parallel processing of large data sets.

Figure 4 Hadoop Components

12. CONCLUSION

HDFS is used by many companies and is easy to install, requiring only Linux workstations on a network. It takes care of hard problems such as failover, provides an enormous distributed file system, and comes with good web-based monitoring tools. It supports processing large sets of structured, semi-structured and unstructured data and the systems application architecture around them, as well as planning and estimating cluster capacity and creating roadmaps for Hadoop cluster deployment.

Hadoop cluster administration involves adding and removing cluster nodes, cluster capacity planning, performance tuning, cluster monitoring and troubleshooting: adding new nodes to an existing cluster, recovering from a NameNode failure, decommissioning and commissioning nodes on a running cluster, recovering from node failures, and troubleshooting common Hadoop cluster issues.
