What is in this presentation for you?
• How does Google process petabytes of data daily?
• Can you learn from them?
• To interest you in using this tool for your work.
from 'Does IT matter?' by Nicholas Carr
Presentation Purpose
• To outline
  – What is the secret sauce behind Google
• This is aimed at a technical audience.
• This is an article summary.
• Contents
  – Why is this important?
  – What is MapReduce?
  – Why use MapReduce?
  – MapReduce in detail
  – LISP example
  – MapReduce Flow Chart
  – Extras
  – Conclusion
• Questions as you like
2009 Why is Google a verb? MapReduce! 3
Why is Google a Verb?
• Google exploits an effect called "The Wisdom of Crowds", which argues that large groups hold a collective wisdom. How?
  – In 2004, two Google developers published how they were able to completely rewrite the indexing system for Google's web search service, which works across 20 TB of data.
  – This is explained in an article called "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat.
  – This presentation summarises that article.
What is MapReduce?
• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
  – Users specify
    • a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and
    • a reduce function that merges all intermediate values associated with the same intermediate key.
  – Many real world tasks are expressible in this model.
• The MapReduce abstraction is inspired by the map and reduce primitives present in Lisp, which is a functional language.
Why use MapReduce?
• Functional style makes parallelization possible.
• When fault tolerance, data distribution and load balancing are added in, Google is able to achieve massive scalability.
• Programs written in this style are automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system takes care of partitioning, scheduling, handling machine failures, and inter-machine communication.
• This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
• Once understood, programmers find the system easy to use.
• Result => Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
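To illustrate the first bullet, here is a minimal Python sketch (not Google's implementation) of why a side-effect-free function is easy to parallelize: each call depends only on its own argument, so the calls can run in any order or on any worker. The function `word_length`, the sample words and the use of a thread pool are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def word_length(word):
    # Pure function: the result depends only on the argument,
    # so calls can safely run in parallel.
    return len(word)


words = ["map", "reduce", "google", "cluster"]

# Sequential execution.
sequential = list(map(word_length, words))

# Parallel execution over a small pool of worker threads.
# pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(word_length, words))

print(sequential == parallel)
```

Because nothing is shared between calls, the parallel run needs no locks and gives the same answer as the sequential one.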
MapReduce in detail
• The MapReduce model takes a set of input key/value pairs, and produces a set of output key/value pairs.
• The user of the MapReduce library expresses the computation as two user-written functions: Map and Reduce.
• The Map function takes an input pair and produces a set of intermediate key/value pairs.
  – (map (key1, value1)) => list of (key2, value2)
• The Reduce function accepts an intermediate key and the set of values for that key, and merges these values to form a possibly smaller set of values.
• Typically, zero or one output value is produced per Reduce call.
  – (reduce (key2, list of value2)) => (value3)
• E.g., how to count each word across a collection of documents.
  – The map function emits each word plus a count (1 in the simplest case).
  – The reduce function sums together all counts for each word.
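The word-count example above can be sketched in a few lines of Python. This is a single-process illustration of the model only, not the Google library: the names `map_fn`, `reduce_fn` and `map_reduce`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    return [(word, 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce: merge all counts emitted for one word into a single total."""
    return sum(counts)


def map_reduce(documents):
    # Group intermediate values by intermediate key (the step that sits
    # between map and reduce).
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    # One reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


docs = {"doc1": "the quick fox", "doc2": "the lazy dog the end"}
word_counts = map_reduce(docs)
print(word_counts["the"])  # 3
```

The user writes only `map_fn` and `reduce_fn`; everything in `map_reduce` stands in for what the run-time system would do across many machines.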
MapReduce Examples
• Add one to each list element:
  – (mapcar #'1+ '(1 2 3)) => (2 3 4)
• Determine the length of each sub list:
  – (mapcar #'length '((a) (a b) (a b c))) => (1 2 3)
• Sum all list elements:
  – (reduce #'+ '(1 2 3 4)) => 10
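For readers who do not use Lisp, the same three examples can be written with Python's built-in `map` and `functools.reduce`. This is a rough equivalent for illustration; the variable names are arbitrary.

```python
from functools import reduce
from operator import add

# Add one to each list element.
plus_one = list(map(lambda x: x + 1, [1, 2, 3]))

# Determine the length of each sub list.
lengths = list(map(len, [["a"], ["a", "b"], ["a", "b", "c"]]))

# Sum all list elements.
total = reduce(add, [1, 2, 3, 4])

print(plus_one, lengths, total)  # [2, 3, 4] [1, 2, 3] 10
```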
MapReduce Flowchart
[Figure: MapReduce (System Architecture). The User Program forks (1) a Master, which assigns Map and Reduce work (2). Map Phase: Workers read the input file splits (3) and write intermediate files locally (4), then report Completed Map (5). Reduce Phase: Workers check for files (6), read the intermediate files remotely (7), write the output files (8), then report Completed Reduce (9). The Master wakes the User Program when MapReduce is completed (10). The diagram shows the input, intermediate and output files spread over Machines 1 to 3.]
Diagram comment (System Architect, Sun Sep 27, 2009 21:02):
(1) Master is invoked by User Program.
(2) Master assigns M Map Workers and R Reduce Workers.
(3) Each Map Worker reads pre-split input data.
(4) Each Map Worker writes local intermediate data.
(5) Each Map Worker notifies Master of completion.
(6) Reduce Worker checks for Map Worker output.
(7) Each Reduce Worker reads the smaller intermediate file remotely.
(8) Each Reduce Worker writes the summarised output to a small number of files.
(9) Each Reduce Worker notifies Master of completion.
(10) Master wakes User Program with final results.
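Steps (3) to (8) can be simulated in one process to show how data moves from M map workers to R reduce workers. This is a sketch under stated assumptions: in-memory lists stand in for files, there is no real Master, and the partitioning rule (CRC32 of the key modulo R) is an assumption of this example, since the slides do not say how keys are routed to reduce workers. All function names are hypothetical.

```python
import zlib
from collections import defaultdict

R = 2  # number of reduce workers (assumed for this sketch)


def partition(key, r=R):
    # Assumed routing rule: a deterministic hash of the key, modulo R.
    return zlib.crc32(key.encode("utf-8")) % r


def run_map_worker(split):
    """Steps 3-4: read one input split, write R local intermediate 'files'."""
    files = [[] for _ in range(R)]
    for word in split.split():
        files[partition(word)].append((word, 1))
    return files


def run_reduce_worker(r, all_map_outputs):
    """Steps 7-8: read partition r from every map worker, group by key, sum."""
    grouped = defaultdict(list)
    for files in all_map_outputs:
        for key, value in files[r]:
            grouped[key].append(value)
    return {key: sum(values) for key, values in sorted(grouped.items())}


splits = ["a b a", "b c", "a c c"]  # M = 3 pre-split inputs
map_outputs = [run_map_worker(s) for s in splits]  # map phase
outputs = [run_reduce_worker(r, map_outputs) for r in range(R)]  # reduce phase

# Each key lands in exactly one output file, so the files can simply be merged.
merged = {}
for out in outputs:
    merged.update(out)
print(merged)
```

Each reduce worker produces one output file, which is why the job ends with a small number of files (R of them) rather than a single one.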
Extras
• Performance
  – Network bandwidth is a relatively scarce resource.
  – All input data is stored on the local disks of the machines in the Google cluster.
  – All files are divided into 64 MB blocks.
• Certainty
  – When the user-supplied map and reduce operators are deterministic functions of their input values, the Google MapReduce implementation produces the same output as would have been produced by a non-faulting sequential execution of the entire program.
  – This is a natural outcome of using a functional approach.
• Other matters covered in the article, but not in this presentation:
  – Worker Failure, Master Failure, Stragglers, Ordering Guarantees, Combiner Function, Input and Output Types, Skipping Bad Records, Local Execution for debugging, Status Information, Counters
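The "Certainty" point can be demonstrated with a small Python sketch: if the map and reduce steps are deterministic, running the map tasks in a different order and re-running one of them after a simulated failure gives the same result as a plain sequential run. The task functions and the way the failure is simulated are assumptions for illustration, not the recovery mechanism of the real system.

```python
import random
from collections import Counter


def map_task(split):
    # Deterministic: the same split always yields the same word counts.
    return Counter(split.split())


def reduce_all(partials):
    # Deterministic merge: addition of counts does not depend on order.
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)


splits = ["a b a", "b c", "a c c"]

# Reference: non-faulting sequential execution.
sequential = reduce_all(map_task(s) for s in splits)

# "Faulty" run: tasks complete in random order, and one task is
# executed a second time, as if its first worker had died.
order = splits[:]
random.shuffle(order)
results = {}
for s in order:
    results[s] = map_task(s)
results[order[0]] = map_task(order[0])  # re-execution replaces the earlier result
faulty = reduce_all(results.values())

print(sequential == faulty)
```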
Conclusions
> The MapReduce model has been successful at Google because:
  1. The model is easy to use, even for programmers without experience with parallel and distributed systems.
  2. Many problems can use the MapReduce model/design pattern.
  3. MapReduce scales to clusters of thousands of machines.
> Google have learned several things from this work.
  1. Restricting the programming model makes it easy to parallelize and distribute computations, and to make them fault-tolerant.
  2. Network bandwidth is a scarce resource.
  3. Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.