
Submitted By:
Payal M. Wadhwani
Guided By:
Prof. K. Kakwani
DEFINITION OF HADOOP
WORKING
ADVANTAGES & DISADVANTAGES
APPLICATIONS
CONCLUSION
REFERENCES


Hadoop was created by Doug Cutting. It was developed to support the Lucene and Nutch search engine projects.

Hadoop is a free, open-source software framework that supports the processing of large data sets in a distributed computing environment.
Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures.
In a non-distributed architecture, data is stored on one server, and any client program accesses this central data server to retrieve the data. If the main server fails, you have to restore the data from a backup, which can take a long time. Hadoop recovers from such failures in a comparatively short time.
In Hadoop, every server offers local computation and storage. When you run a query against a large data set, each server in this distributed architecture executes the query on its local machine against its local data set. Finally, the result sets from all these local servers are consolidated.
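The scatter-gather idea above can be sketched in a few lines of plain Python. The shard data and the query (filtering by a click count) are hypothetical stand-ins; lists play the role of remote servers.

```python
# Toy sketch of the scatter-gather pattern described above: each "server"
# filters its own local shard, and the client only consolidates the
# partial results. (Plain Python lists stand in for remote machines.)

local_shards = [                      # hypothetical per-server data sets
    [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 9}],
    [{"user": "c", "clicks": 7}],
    [{"user": "d", "clicks": 1}, {"user": "e", "clicks": 8}],
]

def run_local_query(shard, min_clicks):
    """Each server evaluates the query against only its local data."""
    return [row for row in shard if row["clicks"] >= min_clicks]

# Consolidate the per-server result sets into one answer.
result = [row for shard in local_shards for row in run_local_query(shard, 5)]
print([row["user"] for row in result])   # ['b', 'c', 'e']
```

In a real cluster the per-shard work runs on separate machines in parallel; only the small filtered results travel over the network.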
Suppose you need to process a 100 TB data set:
On 1 node: scanning @ 50 MB/s = 23 days
On a 1000-node cluster: scanning @ 50 MB/s = 33 min
Hence the need for an efficient, reliable, and usable framework.
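The arithmetic behind those two figures can be checked directly. This is illustrative back-of-the-envelope math only; real scan throughput varies with disks and workload.

```python
# Back-of-the-envelope scan times for a 100 TB data set at 50 MB/s per node.

DATA_BYTES = 100 * 10**12          # 100 TB
RATE_BYTES_PER_S = 50 * 10**6      # 50 MB/s per node

def scan_seconds(nodes: int) -> float:
    """Time to scan the data set when it is split evenly across `nodes`."""
    return DATA_BYTES / (RATE_BYTES_PER_S * nodes)

print(f"1 node:     {scan_seconds(1) / 86400:.0f} days")    # ~23 days
print(f"1000 nodes: {scan_seconds(1000) / 60:.0f} min")     # ~33 min
```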

Hadoop is designed to run on commodity (affordable) hardware. If you are using Hadoop for learning or experimental purposes, then 2 GB of RAM, a 40 GB HDD, and a dual-core processor will do. A Linux environment is preferable to running a VM on Windows. To try it out on Windows using a VM, you can download a free sandbox VM from vendors such as Cloudera or Hortonworks.


Comparison with RDBMS
Unless you are dealing with very large volumes of unstructured data (hundreds of GBs, TBs, or PBs) and have large numbers of machines available, you will likely find a Hadoop Map/Reduce query much slower than a comparable SQL query on a relational database. Hadoop uses a brute-force access method, whereas RDBMSs have optimization methods for accessing data, such as indexes and read-ahead. The benefits really only come into play when mass parallelism is achieved, or when the data is unstructured to the point where no RDBMS optimizations can be applied to help query performance.

Hadoop has a variety of node types within each Hadoop cluster; these include DataNodes, NameNodes, and EdgeNodes. The names of these nodes can vary from site to site, but the functionality is common across sites. Hadoop's architecture is modular, allowing individual components to be scaled up and down as the needs of the environment change.
Hadoop consists of:
Hadoop Common: the common utilities that support the other Hadoop subprojects.
HDFS: a distributed file system that provides high-throughput access to application data.
MapReduce: a software framework for distributed processing of large data sets on compute clusters.

[Diagram: data flow from web servers through Scribe servers and network storage into a Hadoop cluster, feeding Oracle RAC and MySQL. Diagram by Harsha Jain.]

Hadoop Distributed File System
MapReduce
HDFS stands for Hadoop Distributed File
System, which is the storage system used
by Hadoop. Both data and processing are
distributed across multiple servers.
MapReduce is a parallel programming model used to retrieve and process data from the Hadoop cluster.
In MapReduce, records are processed in isolation by tasks called Mappers.
The output from the Mappers is then brought together into a second set of tasks called Reducers.
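The Mapper/Reducer contract can be illustrated with the classic word-count example. This is a pure-Python sketch of the programming model, not Hadoop's actual API; the record strings are made up.

```python
# Minimal word-count sketch of the Mapper/Reducer contract: each record is
# mapped in isolation to (key, value) pairs, then all values for a key are
# brought together in a reducer. (Pure Python, no Hadoop required.)

from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Process one record in isolation, emitting (word, 1) pairs."""
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    """Combine all values observed for one key."""
    return (key, sum(values))

records = ["the cat", "the dog", "the cat sat"]
pairs = sorted(p for r in records for p in mapper(r))          # map + sort
counts = dict(reducer(k, (v for _, v in g))
              for k, g in groupby(pairs, key=itemgetter(0)))   # reduce
print(counts)   # {'cat': 2, 'dog': 1, 'sat': 1, 'the': 3}
```

The sort between the two phases is what lets each reducer see all values for its key together; in Hadoop this is the shuffle-and-sort step.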

1. Input files are split (M splits)
2. Master and workers are assigned
3. Map tasks run
4. Intermediate data is written to disk (R regions)
5. Intermediate data is read and sorted
6. Reduce tasks run
7. Results are returned
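The seven steps above can be walked through in a single-process simulation. The splits, keys, and the choice of M = 3 and R = 2 are illustrative; hash partitioning stands in for the intermediate regions written to disk.

```python
# A pure-Python walk-through of the seven MapReduce steps, with M input
# splits and R reduce regions. Everything runs in one process here purely
# for illustration.

from collections import defaultdict

M, R = 3, 2
splits = ["a b a", "b c", "a c c"]               # 1. input in M splits

def map_task(split):                             # 3. a map task
    return [(w, 1) for w in split.split()]

regions = [defaultdict(list) for _ in range(R)]  # 4. R intermediate regions
for split in splits:                             # 2. (master assigns splits)
    for key, val in map_task(split):
        regions[hash(key) % R][key].append(val)  # partition by hash(key) % R

output = {}
for region in regions:                           # 5. read & sort by key
    for key in sorted(region):
        output[key] = sum(region[key])           # 6. one reduce per key

print(output)                                    # 7. return final result
```

Because partitioning routes every occurrence of a key to the same region, each reduce task sees the complete value list for its keys.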

Flow Chart
Processes 100 TB data sets.
Automatically handles data replication and node failure.
Cost-saving, efficient, and reliable data processing.
Hadoop is designed to run on cheap commodity hardware.
It does the hard work, so you can focus on processing data.

A9.com (Amazon): to build Amazon's product search indices; to process millions of sessions daily for analytics.
Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop.
Facebook: to store copies of internal log and dimension data sources and use them as a source for reporting/analytics and machine learning.

HDFS and MapReduce use single-master models, which can result in single points of failure.
Security is also one of the major concerns, because Hadoop offers only a minimal security model.
Hadoop Users
Adobe
Alibaba
Amazon
AOL
Facebook
Google
IBM

Major Contributor
Apache
Cloudera
Yahoo

Hadoop is a data storage and analysis platform for large volumes of data.
Hadoop will sit alongside, not replace, your existing RDBMS.
Hadoop has many tools to ease data analysis.
Apache Hadoop (http://hadoop.apache.org)
Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
Free Search by Doug Cutting (http://cutting.wordpress.com)
Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)

Thank You!
