The Solution For Big Data Hadoop

J. Sai Krishna and G. Sravya Lahari
2nd B.Tech (CSE)
K.O.R.M College of Engineering, Kadapa
Contents
1. Data trends in storing data
2. Big data problems in the IT industry
3. Introduction to Hadoop
4. HDFS (Hadoop Distributed File System)
5. MapReduce
6. Prominent users of Hadoop
7. Conclusion
Data trends in storing data

What is data? Any real-world symbol (a character, a number, a special character) or a group of such symbols is said to be data. It may be visual, audio, textual, etc.
Big data
What is big data? In IT, it is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
As of 2012, limits on the size of data sets that are feasible to process in a reasonable time were on the order of exabytes of data.
Big data and its problems
About 0.5 petabytes of updates are made to Facebook every day, including 40 million photos.
Every day, YouTube receives enough new video to be watched continuously for one year.
Large data sets impose limitations in many areas, including meteorology, genomics, complex physics simulations, and biological and environmental research.
They also affect Internet search, finance, and business informatics.
The challenges include capture, retrieval, storage, search, sharing, analysis, and visualization.
Then what could be the solution for big data?
HADOOP
What is Hadoop?
It is open-source software written in Java.
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
The project includes these modules:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop MapReduce
1. Hadoop Common
It provides access to the filesystems supported by Hadoop.
The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community (Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper).
2. Hadoop Distributed File System (HDFS)
Hadoop uses HDFS, a distributed file system based on GFS (Google File System), as its shared filesystem.
The HDFS architecture divides files into large chunks (about 64 MB by default; the size is configurable) that are distributed across data servers.
It has a namenode and datanodes.
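To make the chunking idea concrete, here is a minimal sketch of how a file of a given size could be cut into fixed-size blocks. It is not HDFS code: the function name `split_into_blocks` and the 200 MB example file are assumptions made for illustration, and only the 64 MB figure comes from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted above (configurable)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper: every block is block_size bytes long except
    possibly the last one, which holds whatever remains.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    size = 200 * 1024 * 1024  # assumed example: a 200 MB file
    for index, (offset, length) in enumerate(split_into_blocks(size)):
        print(index, offset, length)
```

Under these assumptions a 200 MB file yields three full 64 MB blocks and one 8 MB block; in a real cluster each block would then be placed on datanodes.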
What does HDFS contain?
An HDFS cluster can have several namenodes, each managing its own namespace, and they are federated.
The datanodes are used as common storage for blocks by all the namenodes.
Each datanode registers with all the namenodes in the cluster.
Datanodes send periodic heartbeats and block reports, and handle commands from the namenodes.
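The registration and heartbeat behaviour described above can be sketched as follows. This is a toy model, not the HDFS implementation: the class `NamenodeSketch`, the helper `register_with_all`, and the 30-second timeout are all assumptions chosen for illustration.

```python
class NamenodeSketch:
    """Toy namenode that remembers when each datanode last reported in."""

    def __init__(self, timeout_seconds=30.0):  # assumed timeout, not an HDFS default
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}

    def register(self, datanode_id, now):
        # Registration counts as the first sign of life.
        self.last_heartbeat[datanode_id] = now

    def heartbeat(self, datanode_id, now):
        if datanode_id not in self.last_heartbeat:
            raise KeyError("unregistered datanode: " + datanode_id)
        self.last_heartbeat[datanode_id] = now

    def live_datanodes(self, now):
        # A datanode is considered live if it reported within the timeout.
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_seconds
        )


def register_with_all(datanode_id, namenodes, now):
    """Each datanode registers with every namenode in the cluster."""
    for namenode in namenodes:
        namenode.register(datanode_id, now)


if __name__ == "__main__":
    namenodes = [NamenodeSketch(), NamenodeSketch()]
    for node in ("dn1", "dn2"):
        register_with_all(node, namenodes, now=0.0)
    for namenode in namenodes:
        namenode.heartbeat("dn1", now=20.0)  # dn2 stays silent
    print(namenodes[0].live_datanodes(now=40.0))
```

Times are passed in explicitly rather than read from a clock so the behaviour is easy to follow; a datanode that stops sending heartbeats simply drops out of the live list once the assumed timeout has passed.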
Structure of the Hadoop system:
Master node
Keeps track of the namespace and metadata about items.
Keeps track of MapReduce jobs in the system.
Hadoop is currently configured with centurion064 as the master node.
Hadoop is installed locally on each system.
The install location is /localtmp/hadoop/hadoop0.15.3
Slave nodes
Manage blocks of data sent from the master node.
In GFS terms, these are the chunkservers.
Currently centurion060 and centurion064 are the two slave nodes being used.
Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is created automatically by the DFS).
Once you use the DFS, relative paths are from /usr/{your usr id}
Advantages and Limitations of HDFS
Advantages:
Scheduling jobs close to where their data is stored reduces network traffic.
File access can be achieved through the native Java API, or through a Thrift-generated client in the language of the user's choice (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml).
Limitations:
It cannot be directly mounted by an existing operating system.
It requires a UNIX or Linux system.
3. Hadoop MapReduce
Using the map and reduce methods:
Write a map function.
Write a reduce function.
Run this program as a MapReduce job.
Word count over a given set of strings

Input:
  We love India
  We play tennis

Map output:
  (We, 1), (love, 1), (India, 1)
  (We, 1), (play, 1), (tennis, 1)

Reduce output:
  India 1
  love 1
  play 1
  tennis 1
  We 2
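The word count example can be imitated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop's API: the names `map_words`, `reduce_counts`, and `word_count` are hypothetical, and the grouping step that Hadoop performs between map and reduce is simulated here with a dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Map function: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reduce function: add up all the counts emitted for one word."""
    return (word, sum(counts))


def word_count(lines):
    """Run map over every line, group pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    for word, total in sorted(word_count(["We love India", "We play tennis"]).items()):
        print(word, total)
```

On a cluster the map calls would run in parallel over different input splits and the groups would be spread over several reduce tasks, but the two user-written functions keep the same shape.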
[Figure: MapReduce with no reduce tasks]
[Figure: MapReduce with two reduce tasks (automatic parallel execution in MapReduce)]
MapReduce lifecycle
[Figure: input -> splits -> map function (map phase) -> reduce function (reduce phase)]
[Figure: Shuffle and sort in MapReduce with multiple reduce tasks]
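The shuffle and sort step with several reduce tasks can be sketched as follows. This is an illustration only: the function names are hypothetical, and `zlib.crc32` is used as a stand-in hash so that the result is repeatable, whereas Hadoop's default partitioner works from the key's Java hash code.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    """Pick a reduce task for a key; every pair with this key goes there."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group each bucket.

    Returns one list per reduce task, each holding (key, [values]) entries
    in key order, which is the form a reduce function consumes.
    """
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)].append((key, value))

    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=lambda pair: pair[0])
        grouped = [
            (key, [value for _, value in group])
            for key, group in groupby(bucket, key=lambda pair: pair[0])
        ]
        reducer_inputs.append(grouped)
    return reducer_inputs


if __name__ == "__main__":
    pairs = [("We", 1), ("love", 1), ("India", 1),
             ("We", 1), ("play", 1), ("tennis", 1)]
    for index, reducer_input in enumerate(shuffle_and_sort(pairs, 2)):
        print("reduce task", index, reducer_input)
```

Because the reduce task is chosen from the key alone, both ("We", 1) pairs land on the same task, which is what allows the reduce tasks to run in parallel without coordinating with each other.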
Prominent users of Hadoop
Amazon: 100 nodes
Facebook: two clusters of 8000 and 3000 nodes
Adobe: 80-node system
eBay: 532-node cluster
Yahoo: cluster of about 4500 nodes
IIIT Hyderabad: 30-node cluster
Achievements
July 2008 - Hadoop wins the Terabyte Sort Benchmark
March 2011 - Apache Hadoop takes top prize at the Media Guardian Innovation Awards
Conclusion:
Hadoop eases the burden of capture, storage, search, sharing, analysis, and visualization of large data sets.
A huge amount of data can be stored, and large computations can be run, within a single cluster, reliably and at low cost.
Big data and big data solutions are among the most pressing topics in today's IT industry, so working on them will make you more valuable to it.
Thank you
Any queries?
