
The solution for Big data

HADOOP

J. Sai Krishna and G. Sravya Lahari
2nd B.Tech (CSE)
K.O.R.M College of Engineering
Kadapa

Contents
1. Data trends in storing data
2. Big data problems in the IT industry
3. Introduction to Hadoop
4. HDFS (Hadoop Distributed File System)
5. MapReduce
6. Prominent users of Hadoop
7. Conclusion

Data trends in storing data


What is data? Any real-world symbol (character, numeric, special character) or a group of them is said to be data. It may be visual, audio, textual, etc.

Big data
What is big data? In IT, it is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
As of 2012, limits on the size of data sets that are feasible to process in a reasonable time were on the order of exabytes of data.

BIG DATA and problems with it

About 0.5 petabytes of updates are made to Facebook every day, including 40 million photos.
Every day, YouTube receives enough new video to be watched continuously for one year.
Large data sets cause limitations in many areas, including meteorology, genomics, complex physics simulations, and biological and environmental research.
They also affect Internet search, finance and business informatics.
The challenges include capture, retrieval, storage, search, sharing, analysis, and visualization.

THEN WHAT COULD BE THE SOLUTION FOR BIG DATA?

HADOOP

What is Hadoop?
It is open-source software written in Java.
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

The project includes these modules:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop MapReduce

1. Hadoop Common
It provides access to the filesystems supported by Hadoop.
The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community (Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper).

2. Hadoop Distributed File System (HDFS)
Hadoop uses HDFS, a distributed file system based on GFS (Google File System), as its shared filesystem.
The HDFS architecture divides files into large chunks (about 64 MB; this is configurable) that are distributed across data servers.
It has a namenode and datanodes.
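As a rough illustration of the chunking described above, the following Python sketch splits a file of a given size into fixed-size blocks. The function name `split_into_blocks` and the 200 MB example file are assumptions made for illustration; this is not HDFS code, only the arithmetic behind it.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the configurable chunk size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    size = 200 * 1024 * 1024  # hypothetical 200 MB file
    for index, (offset, length) in enumerate(split_into_blocks(size)):
        print(index, offset, length)
```

Under these assumptions a 200 MB file yields three full 64 MB blocks and one 8 MB block; each block could then be placed on a different datanode.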

What does HDFS contain?

HDFS consists of multiple namenodes (namespaces), and they are federated.
The datanodes are used as common storage for blocks by all the namenodes.
Each datanode registers with all the namenodes in the cluster.
Datanodes send periodic heartbeats and block reports to the namenodes and handle commands from them.
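The registration, heartbeat and block-report flow above can be pictured with a toy model. The classes `NameNode` and `DataNode`, their methods, and the timeout value are all hypothetical and chosen only for illustration; the real HDFS protocol is far richer than this sketch.

```python
class NameNode:
    """Toy namenode: tracks heartbeats and which datanode holds which block."""

    def __init__(self, name):
        self.name = name
        self.last_heartbeat = {}  # datanode id -> time of last contact
        self.block_map = {}       # block id -> set of datanode ids

    def register(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def block_report(self, datanode_id, block_ids):
        # Forget what this datanode reported earlier, then record the new report.
        for holders in self.block_map.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode_id)

    def live_datanodes(self, now, timeout):
        # A datanode counts as alive if it was heard from within `timeout`.
        return sorted(d for d, t in self.last_heartbeat.items() if now - t <= timeout)


class DataNode:
    """Toy datanode: talks to every namenode in the cluster."""

    def __init__(self, datanode_id, namenodes):
        self.datanode_id = datanode_id
        self.namenodes = namenodes
        self.blocks = set()

    def register_all(self, now):
        for namenode in self.namenodes:
            namenode.register(self.datanode_id, now)

    def send_heartbeat(self, now):
        for namenode in self.namenodes:
            namenode.heartbeat(self.datanode_id, now)

    def send_block_report(self):
        for namenode in self.namenodes:
            namenode.block_report(self.datanode_id, sorted(self.blocks))
```

In this sketch time is passed in explicitly, so a namenode can decide which datanodes are still alive by comparing the last heartbeat against a timeout of its choosing.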

Structure of Hadoop system:

MASTER NODE

Keeps track of the namespace and metadata about items
Keeps track of MapReduce jobs in the system
Hadoop is currently configured with centurion064 as the master node.
Hadoop is locally installed on each system.
The installed location is /localtmp/hadoop/hadoop0.15.3

SLAVE NODES

Manage blocks of data sent from the master node
In general terms, these are the chunkservers.
Currently, centurion060 and centurion064 are the two slave nodes being used.
Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is automatically created by the DFS).
Once you use the DFS, relative paths are from /usr/{your usr id}

Advantages and Limitations of HDFS

Advantages
Reduces traffic on job scheduling.
File access can be achieved through the native Java API or in a language of the user's choice (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml).

Limitations
It cannot be directly mounted by an existing operating system.
It requires a UNIX or Linux system.

3. Hadoop MapReduce System

MAP AND REDUCE METHODS USAGE

Map function

Reduce function

Run this program as a MapReduce job

WORD COUNT OVER A GIVEN SET OF STRINGS

Input: "We love India", "We play tennis"
Map output: (We, 1), (love, 1), (India, 1), (We, 1), (play, 1), (tennis, 1)
Reduce output: We 2, love 1, India 1, play 1, tennis 1
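The word count above can be sketched in plain Python to show what the map and reduce steps each contribute. This is a minimal single-process illustration, not Hadoop API code; the function names are hypothetical, and lowercasing the words is an assumption made so that "Love" and "love" count as the same key.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the values of all pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["We love India", "We play tennis"]))
```

For the two input strings, the map step emits six pairs and the reduce step collapses the two ("we", 1) pairs into a count of 2.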

MAPREDUCE WITH NO REDUCE TASKS

MAPREDUCE WITH TWO REDUCE TASKS - AUTOMATIC PARALLEL EXECUTION IN MAPREDUCE

MapReduce - lifecycle

Input splits -> Map function (map phase) -> Reduce function (reduce phase)
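The lifecycle above (input splits, then a map phase, then a reduce phase) can be sketched as follows, with the map tasks run in parallel on a thread pool. The record format "year,temperature", the function names, and the split size are assumptions made for illustration; Hadoop schedules tasks across machines, which this single-process sketch does not attempt.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, split_size):
    """Cut the input into fixed-size splits, one per map task."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Map phase for one split: parse 'year,temperature' into (year, temperature)."""
    output = []
    for record in split:
        year, temperature = record.split(",")
        output.append((year, int(temperature)))
    return output


def reduce_task(year, temperatures):
    """Reduce phase for one key: keep the highest temperature of the year."""
    return year, max(temperatures)


def run_job(records, split_size=2):
    splits = make_splits(records, split_size)
    # Map tasks are independent of each other, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, splits))
    # Group map output by key before handing it to the reduce function.
    groups = defaultdict(list)
    for output in map_outputs:
        for year, temperature in output:
            groups[year].append(temperature)
    return dict(reduce_task(year, temps) for year, temps in sorted(groups.items()))
```

Because each map task only sees its own split, adding more splits (or more workers) does not change the map function itself, which is the point of the programming model.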

Shuffle and sort in MapReduce with multiple reduce tasks
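A small sketch may help picture shuffle and sort with more than one reduce task: every key is assigned to exactly one reducer, and each reducer receives its keys in sorted order. The CRC32-based partition function below is a stand-in chosen because it is stable across Python runs; it is an assumption for illustration and not the partitioner Hadoop itself uses. All function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Return one list per reducer of (key, [values]) entries sorted by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def run_reducers(pairs, num_reducers=2):
    """Sum the values per key inside each reducer and return one output per reducer."""
    outputs = []
    for reducer_input in shuffle_and_sort(pairs, num_reducers):
        outputs.append([(key, sum(values)) for key, values in reducer_input])
    return outputs


if __name__ == "__main__":
    map_output = [("we", 1), ("love", 1), ("india", 1),
                  ("we", 1), ("play", 1), ("tennis", 1)]
    for index, output in enumerate(run_reducers(map_output)):
        print("reducer", index, output)
```

Which reducer a given word lands on depends on the hash, but the union of all reducer outputs is the same word count as with a single reduce task.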

Prominent users of HADOOP

Amazon: 100 nodes
Facebook: two clusters of 8000 and 3000 nodes
Adobe: 80-node system
eBay: 532-node cluster
Yahoo: cluster of about 4500 nodes
IIIT Hyderabad: 30-node cluster

Achievements
March 2011 - Apache Hadoop takes top prize at the Media Guardian Innovation Awards
July 2012 - Hadoop wins the Terabyte Sort Benchmark

Conclusion:
Hadoop eases the challenges of capture, storage, search, sharing, analysis, and visualization of big data.
A huge amount of data can be stored and large computations can be carried out within a single system, safely and securely, at low cost.
Big data and big data solutions are among the most pressing topics in the present IT industry, so working on them will make you more valuable to it.

Thank you
Any queries?
