RESEARCH ARTICLE
OPEN ACCESS
ABSTRACT
An estimated 2.5 quintillion bytes of data are created every day: so much that 90% of the data in the world has been created in the last two years alone. The volume of data is rising tremendously. Organisations have generated larger and larger data sets for years, but struggle to use them effectively. Moreover, data size is increasing at an exponential rate with the advent of pervasive devices such as Android phones, social networking sites such as LinkedIn, Facebook and Google+, and other sources such as sensor devices. This plethora of data is termed Big Data. Big Data must be managed and processed in a suitable manner to produce meaningful information, and traditional techniques of managing data fall short of analysing it. Because of the varied nature of Big Data, various file system architectures are used to store it.
Handling Big Data is a challenging task, as it involves large distributed file systems which should be fault tolerant, scalable and flexible. Apache Hadoop provides open source software for reliable, scalable, distributed computing, and its MapReduce technique is used for efficient processing of Big Data. This paper gives a brief overview of Big Data, Hadoop MapReduce and the Hadoop Distributed File System (HDFS), along with their architecture.
Keywords: Big Data; Hadoop; MapReduce; HDFS (Hadoop Distributed File System)
I. INTRODUCTION

II. BIG DATA
www.ijetajournal.org
International Journal of Engineering Trends and Applications (IJETA) Volume 3 Issue 3, May -Jun 2016
Big Data can be defined as [9]: a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. Doug Laney described Big Data in terms of its attributes with 3 Vs: Volume, Variety and Velocity, in 2011. Later, in 2012, IBM described two more Vs, Value and Veracity, making 5 Vs of Big Data. Then, in 2013, one more V, Variability, was proposed, making 6 Vs in all: Volume, Variety, Velocity, Value, Veracity and Variability.
Hadoop supports the running of applications on Big Data, and addresses the three main challenges (the 3 Vs) created by it:
Volume: The large volume of data is the main storage challenge. Hadoop provides a framework to process, store and analyse large data sets to address the volume of data; data volumes are expected to grow 60-fold by 2020.
Velocity: Hadoop handles the furious rate of incoming data from very large systems.
Variety: Hadoop handles different types of structured and unstructured data, such as text, audio, video, log files and many more.
A. What is the Big Data Problem?
Big Data has become prominent because of the heavy use of data-intensive technologies. The main difficulty with big data is that it is hard to work with using traditional relational databases and desktop statistics packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". Challenges faced in large-scale data management include scalability, accessibility, unstructured data, real-time analytics, fault tolerance and many more. Moreover, the amount of data stored in different sectors, and the types of data generated and stored, i.e., whether the data is structured, semi-structured or quasi-structured, differ from industry to industry [4].
B. Need of Hadoop
Massive storage: The Hadoop framework can store large amounts of data by splitting the data into blocks and storing them on clusters of low-cost commodity hardware.
Distributed: Data is divided and stored across multiple nodes, and computations can be run in parallel across multiple connected machines.
Faster processing: Hadoop processes huge data sets in parallel across clusters of tightly connected low-cost computers for quick results [1].

III. HADOOP
Hadoop's existence originates from the Google File System (GFS) and MapReduce, which became Apache HDFS and Apache MapReduce respectively [13]. The Hadoop brand contains many different tools; the following two, HDFS and MapReduce, are the core parts of Hadoop.

A. Components of Hadoop
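The block-based storage model described under "Need of Hadoop" can be imitated in a few lines of Python. This is a sketch only, not Hadoop's actual code: the function names and the round-robin placement are invented for illustration, though the 128 MB block size and replication factor of 3 match HDFS 2.x defaults.

```python
# Toy sketch of HDFS-style storage: a file is split into fixed-size blocks,
# and each block is assigned to several distinct nodes.

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # HDFS 2.x default block size, in bytes
REPLICATION = 3                      # HDFS default replication factor

def split_into_blocks(data: bytes, block_size: int = HDFS_BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}

# Tiny demo: a 10-byte "file" with 4-byte blocks, so it runs instantly.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)        # [b'abcd', b'efgh', b'ij']
print(placement[0])  # ['node1', 'node2', 'node3']
```

The last block may be shorter than the block size, exactly as in HDFS, and every block ends up on three different nodes, which is what makes the loss of a single machine survivable.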
Comparison of Hadoop distributions (Amazon EMR and MapR) by parameter:
Technical Features: MapR is customized to support the NFS file system.
Deployment: Deployment is managed through the AWS toolkit and AWS management console.
Maintenance: Easy to maintain, as the cluster is managed through the AWS management console and AWS toolkit; basic configuration management and performance tuning are supported through the AWS EMR management console.
Cost: MapR is a proprietary distribution.

C. MapReduce
MapReduce follows a programming model for processing large datasets. The Map() function takes an input key/value pair and produces a list of intermediate key/value pairs [1]:

Map(input_key, input_value) -> list(output_key, intermediate_value)
Reduce(output_key, list(intermediate_value)) -> list(output_value)
The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all the counts emitted for a particular word and finally emits (word, sum).
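The word-count example just described can be imitated in plain Python. This single-process sketch is illustrative only: run_mapreduce, map_fn and reduce_fn are invented names, not Hadoop APIs, and the in-memory grouping of intermediate pairs by key stands in for Hadoop's distributed shuffle phase.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    # Emit (word, 1) for every word, as in the word-count example.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum together all counts emitted for a particular word.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase, then shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one call per distinct intermediate key.
    result = {}
    for out_key, values in sorted(groups.items()):
        for final_key, final_value in reduce_fn(out_key, values):
            result[final_key] = final_value
    return result

counts = run_mapreduce([("doc1", "big data big hadoop"),
                        ("doc2", "big data")],
                       map_fn, reduce_fn)
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1}
```

In real Hadoop the map and reduce calls run on different machines and the grouped values stream from disk, but the key/value contract is exactly the one shown in the Map and Reduce signatures above.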
4) Comparison of MapReduce with RDBMS [11]
In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS).

Figure 5. Traditional RDBMS vs MapReduce
Data Size: Gigabytes (RDBMS) vs Petabytes (MapReduce)
Updates: Read and write many times (RDBMS) vs Write once, read many times (MapReduce)
Structure: Static schema (RDBMS) vs Dynamic schema (MapReduce)
Integrity: High (RDBMS) vs Low (MapReduce)
IV. HADOOP SYSTEM ARCHITECTURE
B. HDFS Architecture
Figure 6. An HDFS client
V. FAULT TOLERANCE
A. Replica Placement
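By default, HDFS with a replication factor of 3 places the first replica on the writer's own node, the second on a node in a different rack, and the third on a different node of that same remote rack, balancing reliability against inter-rack bandwidth. A toy sketch of that rule follows; it is illustrative only, not HDFS source code, and the rack/node names are invented.

```python
import random

def place_replicas(writer_node, racks):
    """Pick 3 nodes following the default HDFS replica placement rule.

    racks: dict mapping rack name -> list of node names; the remote rack
    chosen must contain at least two nodes for the third replica.
    """
    # 1st replica: the writer's own node.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    # 2nd replica: a node on a different (remote) rack.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second = random.choice(racks[remote_rack])
    # 3rd replica: a different node on that same remote rack.
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [writer_node, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", racks)
print(replicas)  # e.g. ['n1', 'n3', 'n4']
```

With this policy the loss of an entire rack still leaves at least one replica reachable, while two of the three replicas sit on the same remote rack, so a write crosses the rack boundary only once.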
Figure 7. Rack awareness

VI. HADOOP INSTALLATION ON SINGLE NODE
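As an example of a minimal pseudo-distributed (single-node) setup, the official Hadoop documentation configures the default filesystem in core-site.xml and, since there is only one DataNode, a replication factor of 1 in hdfs-site.xml (the port number can vary by Hadoop version):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

After editing these files, the NameNode is formatted with `hdfs namenode -format` and HDFS is started with `start-dfs.sh`.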
VII. CONCLUSION
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6] http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data
[7] http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/
[8] http://blog.blazeclan.com/252/
[9]
[10]
[11]
[12]
[13]
[14]