Nilaf T 1RV08CS068
Why Hadoop
Large scale data processing is difficult!
Managing hundreds or thousands of processors
Managing parallelization and distribution
I/O scheduling
Status and monitoring
Fault/crash tolerance
What is Hadoop
It's a framework for running applications on large clusters of commodity hardware, used to store and process huge amounts of data. Apache Software Foundation project. Open source; the 1.0.1 stable release is available for download.
Hadoop History
Google used MapReduce internally to index web pages and to access petabytes of data. Google used its own filesystem, the Google File System (GFS). In 2004 Google published its MapReduce paper; based on it, Hadoop was started as an Apache project with contributions from Yahoo! and Amazon.
Hadoop Contains
Hadoop consists of two key services:
Reliable data storage using the Hadoop Distributed File System (HDFS)
High-performance parallel data processing using a technique called MapReduce
HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. The filesystem keeps checksums of data for corruption detection and recovery. Data is organized into files and directories.
A major advantage of HDFS is that it provides very high input and output throughput. HDFS stores each file as a sequence of blocks; every block in a file except the last is the same size. It is inspired by the Google File System.
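As a rough illustration (not part of the original slides), the sketch below writes and reads a file through the HDFS Java API. The NameNode address and the paths are placeholders, not values taken from these slides.

// Minimal sketch: writing and reading a file through the HDFS Java API.
// The cluster URI (hdfs://namenode:9000) and the paths are assumptions for illustration.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; the address below is a placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS splits larger files into fixed-size blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back; checksums are verified on read to detect corruption.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}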
Map Reduce
Started by Google; processes and generates large data sets in parallel across a cluster.
MapReduce
Is the style in which most programs running on Hadoop are written. Input is broken into small pieces that are processed independently. MapReduce is a programming model for processing and generating large data sets.
MapReduce
It consists of two steps: map and reduce. The map step takes a key/value pair and produces intermediate key/value pairs. The reduce step takes a key and the list of that key's values and outputs the final key/value pairs.
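To make the two steps concrete, here is a minimal word-count sketch using the Hadoop Java MapReduce API. Word count is the standard illustrative example; the class names are assumptions, not something taken from these slides.

// A minimal word-count sketch illustrating the map and reduce steps.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: take a (byte offset, line) pair and emit an intermediate
    // (word, 1) pair for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: take a word and the list of counts emitted for it,
    // sum them, and output the final (word, total) pair.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}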
MapReduce Benefits
Map-reduce jobs are automatically parallelized. Partial failure of the processing cluster is expected and tolerable. Redundancy and fault tolerance are built in, so the programmer doesn't have to worry. It scales very well. Many jobs are naturally expressible in the map/reduce paradigm.
MapReduce Benefits
In any MapReduce framework there are two key elements:
1. A Mapper, which applies a processing function to a given key/value pair and produces a list of key/value pairs.
2. A Reducer, which: 1. fetches a key and the list of all values created for that key by the Mappers, 2. sorts the fetched key/values into buckets, and 3. reduces the list by processing the key/values and outputting new key/value pairs.
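A minimal driver sketch that wires a Mapper and Reducer (the word-count classes above) into a job and submits it; the class names and command-line arguments are assumptions for illustration.

// Driver: configure the job, point it at the Mapper and Reducer, and submit it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework parallelizes the mappers over input splits and
        // groups/sorts intermediate keys before handing them to the reducers.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line (placeholders).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}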
HDFS Architecture
Disk space on one or more data nodes can be underutilized. Therefore, HDFS supports rebalancing data blocks using various models. One model might move data blocks from one data node to another automatically if the free space on a data node falls too low. Another model might dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file occurs. HDFS also provides the hadoop balancer command for manually rebalancing data blocks across the cluster.
Hadoop is not
Apache Hadoop is not a substitute for a database. Hadoop is not a database, nor does it need to replace any existing data systems you may have. Hadoop is a massively scalable storage and batch data processing system. Hadoop stores data in files and does not index them.
Hadoop is not
Hadoop works where the data is too big for a database. Hadoop does not store indexes and relations the way databases do, and hence saves space.
Hadoop is not
MapReduce is not always the best algorithm. Hadoop works best when jobs can run in parallel; it is not a good fit when one job depends on the output of another. Hadoop is also not a good choice for building systems that carry out intense calculations with little or no data.
Advantages of Hadoop
Hadoop compute clusters are built on cheap commodity hardware. Hadoop automatically handles node failures and data replication. Hadoop is a good framework for building batch data processing systems. Hadoop provides an API and framework implementation for working with MapReduce. The Hadoop job infrastructure can manage and handle huge amounts of data, in the range of petabytes.
Uses
Article clustering for Google News: crawl for news and cluster related stories together. Website indexing. Spam filtering. Data analysis.
Index building for Google Search: web crawlers need to crawl each website and index it based on its keywords and contents. Article clustering for Google News: crawl for news and cluster related stories together.
Index building for Yahoo! Search. Spam detection for Yahoo! Mail: too many mails are received every day, and the Yahoo! spam filter needs to detect spam.
Data mining, e.g. finding relations between two users based on profile information. Ad optimization: optimize ads based on the contents of a user's profile. Spam detection.
Hadoop
How Hadoop saved money and time: The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
END