
The Hadoop Framework

What was the problem?


[Diagram: traditional pipeline: collect raw data → ETL → RDBMS → reports]

Moving data to compute can't keep up with the
volume of data generated (bottleneck: I/O)

Archiving data = dead data

ETL processes lose data

Hard to go back to the original raw data

History

90s: WebCrawler, Excite, Lycos, Infoseek, AltaVista ...

2000: Google rose to prominence

2003: Google released a paper on GFS:
"The Google File System"

2004: Google released a paper on MapReduce:
"MapReduce: Simplified Data Processing on Large Clusters"

2005: Hadoop was born (created by Doug Cutting, also the
creator of Lucene, and Mike Cafarella)

2006: Yahoo donated Hadoop to Apache

GFS

A single master maintains the file system metadata

Files are divided into fixed-size chunks

Each chunk is replicated on multiple chunkservers

Clients interact with the master for metadata operations,
but data transfers go directly to the chunkservers

Assumes files are written once and seldom modified afterwards

Throughput is more important than low latency

MapReduce

[Figure: MapReduce data flow; src: http://blog.sqlauthority.com]

WordCount Example of MapReduce


map(String name, String document):
  // key: document name
  // value: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

reduce(String word, Iterator partialCounts):
  // key: a word
  // values: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(AsString(result));
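
The same word count expressed against Hadoop's Java MapReduce API might look
roughly like the sketch below. It mirrors the standard Hadoop WordCount example
rather than anything specific to these slides, so treat the class names and
types as illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (offset, line) -> emits (word, 1) for every word in the line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // EmitIntermediate(w, "1")
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> emits (word, total count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                    // result += ParseInt(v)
            }
            result.set(sum);
            context.write(key, result);            // Emit(word, count)
        }
    }
}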

Apache Hadoop Framework

An open source framework of tools for big data
storage and processing

Scalable

Fault tolerant

Main components:

Hadoop Distributed File System (HDFS)
MapReduce
YARN (MapReduce 2.0)

HDFS

[Figure: HDFS architecture; src: http://sundar5.wordpress.com/2010/03/19/hadoop-basic/]

HDFS

Designed to store gigantic files (gigabytes to terabytes)

Suitable for mostly immutable files

Not suitable for concurrent writes

Block structured (large files are broken into fixed-size
blocks)
Default block size 64 MB (a typical local file system
block size is 4 KB ~ 8 KB)
Replicates data across multiple machines (2 on the same
rack, 1 on a different rack)
Master namenode, cluster of datanodes (plus a secondary
namenode)
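
As a small illustration (not from the slides), this is roughly how a client
writes and reads a file through the HDFS Java FileSystem API; the path is made
up, and the namenode address is assumed to come from the cluster's
core-site.xml. The client only asks the namenode for metadata and block
locations; the actual bytes stream to and from the datanodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the namenode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks and
        // replicates each block across datanodes.
        Path path = new Path("/tmp/hello.txt");    // illustrative path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back: the namenode supplies block locations, then the
        // data is streamed directly from the datanodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}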

Hadoop Architecture

HDFS -- fault-tolerant, high-bandwidth data storage
layer
MapReduce -- distributed, fault-tolerant resource
management and data processing
Move compute to data
Schema on read (late binding) instead of schema on
write (RDBMS)
YARN (MapReduce 2.0) splits the JobTracker into
resource management and job scheduling/execution,
and allows easy plug-in of non-MapReduce apps.
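
To make "move compute to data" concrete, here is a minimal (illustrative)
driver for the WordCount job sketched earlier. The job is submitted to the
cluster, and YARN schedules its map and reduce tasks on nodes close to the
input blocks; the input and output paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (illustrative paths)
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Blocks until the job finishes; YARN handles resource allocation
        // and task scheduling across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}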

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

High-level data flow language:
Pig Latin. Scripts are parsed and
executed as a series of MapReduce
jobs on a Hadoop cluster. Much
faster and easier to write than
raw MapReduce.

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

Data warehouse infrastructure on
top of Hadoop for data
summarization, query, and analysis.
SQL-like language: HiveQL. Full
support for map/reduce.

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

A distributed storage system for
structured data. Designed for
random, realtime read/write access
to big data. Just as Google
Bigtable works on top of GFS,
HBase works on top of HDFS.
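
A minimal sketch of that random read/write access through the HBase Java
client API; the table name, row key, and column are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // assumed table

            // Random write: a single cell in a single row
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read: fetch that row back by key
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}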

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

Flume is for collecting and
integrating large volumes of
log data.

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

Transfers bulk data between
Hadoop and structured data
stores such as relational
databases.

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

Oozie is a workflow
scheduler system to
manage Apache
Hadoop jobs.

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

Scalable Machine Learning Library

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache OOZIE

ZooKeeper allows distributed processes to coordinate with
each other through a shared hierarchical name space of
data registers.

References

Introducing Apache Hadoop: The Modern Data Operating
System, by Dr. Amr Awadallah
http://web.stanford.edu/class/ee380/Abstracts/111116-slides.pdf

Big Data Buzz Words: What is MapReduce, by Pinal Dave
http://blog.sqlauthority.com

Yahoo Hadoop Tutorial:
https://developer.yahoo.com/hadoop/tutorial/module1.html
