1. INTRODUCTION
HDFS is fault tolerant and provides high-throughput access to large data sets. This
article explores the primary features of HDFS and provides a high-level view of the
HDFS architecture. HDFS has many similarities with other distributed file systems but
differs in several respects. One noticeable difference is HDFS's write-once-read-many
model, which relaxes concurrency-control requirements, simplifies data coherency,
and enables high-throughput access. Another unique attribute of HDFS is the viewpoint
that it is usually better to locate processing logic near the data rather than moving the
data to the application space. HDFS rigorously restricts data writing to one writer at a
time. Bytes are always appended to the end of a stream, and byte streams are guaranteed
to be stored in the order written. HDFS has many goals; the most notable are discussed
in Section 4.
Apache Hadoop is an open-source software framework for storage and large-scale
processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-
level project being built and used by a global community of contributors and users. It
is licensed under the Apache License 2.0.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally
developed to support distribution for the Nutch search engine project. Doug, who was
working at Yahoo! at the time and is now Chief Architect of Cloudera, named the
project after his son's toy elephant. Cutting's son was two years old at the time and just
beginning to talk; he called his beloved stuffed yellow elephant "Hadoop".
All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common and thus
should be automatically handled in software by the framework. Apache Hadoop's
MapReduce and HDFS components originally derived respectively from Google's
MapReduce and Google File System (GFS) papers.
Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is
now commonly considered to consist of a number of related projects as well: Apache
Pig, Apache Hive, Apache HBase, and others.
2. WHAT IS HDFS
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
It is designed to scale up from a single machine to thousands of machines, each
offering local computation and storage.
HDFS supports the rapid transfer of data between compute nodes. At its outset,
it was closely coupled with MapReduce, a programmatic framework for data
processing.
The file system replicates, or copies, each piece of data multiple times and
distributes the copies to individual nodes, placing at least one copy on a
different server rack than the others. As a result, the data on nodes that crash
can be found elsewhere within a cluster.
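The rack-aware replication just described can be sketched in a few lines of Python. This is a simplified model, not HDFS's actual placement code; the function name, the two-rack policy, and the node-picking logic are illustrative assumptions:

```python
import random

def place_replicas(racks, replication=3):
    """Pick nodes for each replica so that at least one copy lands on a
    different rack than the others. `racks` maps rack id -> list of nodes."""
    rack_ids = list(racks)
    first_rack = random.choice(rack_ids)
    other_rack = random.choice([r for r in rack_ids if r != first_rack])
    # First replica on one rack; remaining replicas on a second rack,
    # so losing either rack still leaves at least one copy.
    placement = [(first_rack, racks[first_rack][0])]
    for i in range(replication - 1):
        node = racks[other_rack][i % len(racks[other_rack])]
        placement.append((other_rack, node))
    return placement

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
chosen = place_replicas(racks)
# The copies always span at least two racks.
assert len({rack for rack, _ in chosen}) >= 2
```

Because copies span racks, the loss of any single node, or even a whole rack, still leaves a readable replica elsewhere in the cluster.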
HDFS uses master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single NameNode that managed file system operations and
supporting DataNodes that managed data storage on individual compute nodes.
The HDFS elements combine to support applications with large data sets.
This master-node "data chunking" architecture takes as its design guides
elements from the Google File System (GFS), a proprietary file system outlined
in Google technical papers, as well as IBM's General Parallel File System
(GPFS), a format that boosts I/O by striping blocks of data over multiple disks
and writing blocks in parallel. While HDFS is not Portable Operating System
Interface (POSIX)-compliant, it echoes POSIX design style in some respects.
3. HDFS ARCHITECTURE
Relationships between name nodes and data nodes
Name nodes and data nodes are software components designed to run in a decoupled manner on commodity
machines across heterogeneous operating systems. HDFS is built using the Java
programming language; therefore, any machine that supports the Java programming
language can run HDFS. A typical installation cluster has a dedicated machine that runs
a name node and possibly one data node. Each of the other machines in the cluster runs
one data node.
Figure 1 HDFS Architecture
Data nodes continuously loop, asking the name node for instructions. A name
node can't connect directly to a data node; it simply returns values from functions
invoked by a data node. Each data node maintains an open server socket so that client
code or other data nodes can read or write data. The host or port for this server socket
is known by the name node, which provides the information to interested clients or
other data nodes. The name node maintains and administers changes to the file system
namespace.
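The one-way relationship above, in which the name node never initiates contact and only answers calls made by data nodes, can be sketched as a small Python simulation. The class and method names here are illustrative assumptions, not the real HDFS protocol:

```python
class NameNode:
    """Holds pending instructions; never contacts data nodes directly."""
    def __init__(self):
        self.pending = {}   # data node id -> queued instructions
        self.sockets = {}   # data node id -> (host, port) of its server socket

    def register(self, dn_id, host, port):
        # Record the data node's server socket so clients and other
        # data nodes can be told where to read or write data.
        self.sockets[dn_id] = (host, port)
        self.pending[dn_id] = []

    def heartbeat(self, dn_id):
        # Called BY the data node in its loop; the name node merely
        # returns whatever instructions have been queued for it.
        cmds, self.pending[dn_id] = self.pending[dn_id], []
        return cmds

nn = NameNode()
nn.register("dn1", "10.0.0.5", 50010)
nn.pending["dn1"].append("replicate block_42")
assert nn.heartbeat("dn1") == ["replicate block_42"]
assert nn.heartbeat("dn1") == []   # queue drained until new work arrives
```

The design choice this mirrors is that all connections are initiated by data nodes or clients, which keeps the name node simple and stateless with respect to connection handling.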
4. GOALS OF HDFS
As noted in the introduction, the most notable goals of HDFS include: tolerating
hardware failures, which are common on commodity clusters, by handling them
automatically in software; providing high-throughput access to large data sets;
simplifying data coherency through the write-once-read-many model; and locating
processing logic near the data rather than moving the data to the application space.
5. BENEFITS OF HDFS
The benefits of HDFS follow from the same design: replication places copies of each
piece of data on multiple nodes and racks, so data on crashed nodes can be found
elsewhere in the cluster; the framework scales from a single machine to thousands of
machines, each offering local computation and storage; and it runs on inexpensive
commodity hardware.
6. LIMITATIONS OF HDFS
Issue with small files- Hadoop is not suited to small files. Small files are a
major problem in HDFS: a small file is one significantly smaller than the HDFS
block size. HDFS is designed to work with a small number of large files rather
than a large number of small files. Because the namenode stores the namespace
of HDFS in memory, storing a huge number of small files overloads it.
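The namenode pressure from small files can be shown with rough arithmetic. The ~150 bytes of namenode heap per file or block object is a commonly cited rule of thumb, and the file counts below are illustrative assumptions:

```python
BYTES_PER_OBJECT = 150   # rule of thumb: namenode heap per file/block object

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # Each file, plus each of its blocks, is an object held in
    # namenode memory.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# One large file split into 8 blocks, versus the same data spread
# over 8,192 tiny one-block files:
one_big = namenode_heap_bytes(1, blocks_per_file=8)       # 1,350 bytes
many_small = namenode_heap_bytes(8192, blocks_per_file=1)  # 2,457,600 bytes
assert many_small > 1000 * one_big   # well over 1000x more namenode memory
```

Either way the data on disk is the same size; only the metadata burden on the namenode changes, which is why HDFS favors a small number of large files.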
Processing speed- MapReduce processes large data sets with a parallel,
distributed algorithm, performing two tasks: Map and Reduce. MapReduce
requires a lot of time to perform these tasks, which increases latency; because
the data is distributed and processed over the cluster, this adds time and
reduces processing speed.
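The two MapReduce stages can be sketched with the classic word count in plain Python. This simulates the Map and Reduce steps locally; it is not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/sort groups the pairs by key; Reduce then sums the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hdfs stores blocks", "mapreduce reads blocks"]
assert reduce_phase(map_phase(lines)) == {
    "hdfs": 1, "stores": 1, "blocks": 2, "mapreduce": 1, "reads": 1,
}
```

In a real cluster the map outputs are written to disk and shuffled across the network between the two stages, which is precisely where the latency described above comes from.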
Supports only batch processing- Hadoop supports only batch processing; it
does not process streamed data, so overall performance is slower. The
MapReduce framework does not leverage the memory of the cluster to the
maximum.
Iterative processing- Hadoop is not efficient for iterative processing, as it does
not support cyclic data flow, that is, a chain of stages in which the input to each
stage is the output of the previous stage.
Vulnerable by nature- Apache Hadoop is written entirely in Java, one of the
most widely used languages and one heavily exploited by cyber-criminals; this
has implicated Hadoop in numerous security breaches.
Security- Managing a complex application such as Hadoop can be challenging.
Hadoop lacks encryption at the storage and network levels, which is a major
point of concern. Hadoop supports Kerberos authentication, which is hard to
manage.
7. HDFS STORAGE
HDFS places processing logic close to the data rather than moving the data close
to the processing logic.
8. HDFS CLUSTER
File systems that manage the storage across a network of machines are called
distributed file systems. An HDFS cluster consists of a NameNode and DataNodes.
Name Node:
Functions of a NameNode:
It directs the Datanodes (Slave nodes) to execute the low-level I/O
operations.
It records the metadata of all the files stored in the cluster, e.g. the
location, the size of the files, permissions, hierarchy, etc.
NameNode regularly receives a Heartbeat and a Blockreport from all the
DataNodes in the cluster to make sure that the datanodes are working
properly. A Block Report contains a list of all blocks on a DataNode.
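Because a block report lists every block on a DataNode, the NameNode can combine the reports to find blocks with too few copies. A minimal sketch of that check, with illustrative names rather than real HDFS code:

```python
from collections import Counter

def under_replicated(block_reports, target=3):
    """block_reports maps data node id -> list of block ids stored there.
    Returns the blocks that have fewer than `target` replicas."""
    counts = Counter()
    for blocks in block_reports.values():
        counts.update(blocks)
    return {block for block, n in counts.items() if n < target}

reports = {
    "dn1": ["b1", "b2"],
    "dn2": ["b1", "b2"],
    "dn3": ["b1"],          # b2 is missing here: only 2 copies exist
}
assert under_replicated(reports) == {"b2"}
```

In HDFS the NameNode would respond by directing a DataNode holding b2 to forward a copy to another node, restoring the replication target.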
Data Node:
Datanodes perform the low-level read and write requests from the file
system’s clients.
They regularly send a report on all the blocks present in the cluster to
the NameNode.
Datanodes also enable pipelining of data: they forward data to other
specified DataNodes.
Datanodes send heartbeats to the NameNode once every 3 seconds, to
report the overall health of HDFS.
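The 3-second heartbeat above gives the NameNode a simple liveness check: a DataNode that has been silent too long is presumed dead. The 10-minute timeout used in this Python sketch is an illustrative assumption, not a value taken from this article:

```python
HEARTBEAT_INTERVAL_S = 3    # from the text: one heartbeat every 3 seconds
DEAD_NODE_TIMEOUT_S = 600   # illustrative: declare a node dead after 10 min

def live_datanodes(last_heartbeat, now):
    """last_heartbeat maps data node id -> timestamp of its most recent
    heartbeat; returns the ids still considered alive at time `now`."""
    return {dn for dn, t in last_heartbeat.items()
            if now - t <= DEAD_NODE_TIMEOUT_S}

beats = {"dn1": 1000.0, "dn2": 1000.0 - 700}   # dn2 silent for ~700 s
assert live_datanodes(beats, now=1003.0) == {"dn1"}
```

Once a node drops out of the live set, its blocks become under-replicated and the NameNode schedules new copies from the surviving replicas.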
9. HDFS MAPREDUCE
Figure 3 MapReduce
Job Tracker – The JobTracker accepts MapReduce jobs from clients, consults the
name node to locate the data, and assigns map and reduce tasks to TaskTrackers,
preferably on nodes close to that data.
TaskTracker – A TaskTracker runs the map and reduce tasks assigned to it on its
node and reports progress back to the JobTracker through regular heartbeats.
10. HDFS FEATURES
Features of HDFS:
Streaming data access: HDFS is built around the idea that the most efficient data
processing pattern is write once, read many times.
Once a file is stored in HDFS, it cannot be overwritten.
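The write-once-read-many rule can be sketched as a tiny namespace model. The class below is an illustrative simulation of the semantics, not HDFS code:

```python
class WriteOnceFS:
    """Files can be created once and appended to, but never overwritten."""
    def __init__(self):
        self.files = {}

    def create(self, path, data=b""):
        if path in self.files:
            raise FileExistsError(path)   # no overwriting an existing file
        self.files[path] = data

    def append(self, path, data):
        # Bytes are always added at the end, preserving write order.
        self.files[path] += data

    def read(self, path):
        return self.files[path]

fs = WriteOnceFS()
fs.create("/logs/a", b"one ")
fs.append("/logs/a", b"two")
assert fs.read("/logs/a") == b"one two"
try:
    fs.create("/logs/a")          # a second create is rejected
    raise AssertionError("overwrite should have been refused")
except FileExistsError:
    pass
```

Ruling out in-place updates is what lets HDFS relax concurrency control and keep data coherency simple, as described in the introduction.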
11. HDFS COMPONENTS
All the components of the Hadoop ecosystem are evident as explicit entities. The
holistic view of the Hadoop architecture gives prominence to Hadoop Common,
Hadoop YARN, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce.
Hadoop Common provides the Java libraries, utilities, OS-level abstractions, and the
Java files and scripts necessary to run Hadoop; Hadoop YARN is a framework for job
scheduling and cluster resource management; HDFS provides high-throughput access
to application data; and Hadoop MapReduce provides YARN-based parallel processing
of large data sets.
12. CONCLUSION
HDFS provides fault-tolerant, high-throughput storage for very large data sets on
clusters of commodity hardware. Its write-once-read-many model, replication of blocks
across nodes and racks, and placement of processing logic near the data make it well
suited to batch workloads, while small files, streaming and iterative processing, and
security remain its main limitations.
13. REFERENCES