
A Modest Proposal

for Taming and Clarifying the Promises of Big Data

and the Software-Driven Future

Brendan McAdams
10gen, Inc.
brendan@10gen.com
@rit



"In short, software is eating the world."
- Marc Andreessen
Wall Street Journal, Aug. 2011
http://on.wsj.com/XLwnmo



Software is Eating the World

Amazon.com (and .uk, .es, etc) started as a bookstore


Today, they sell just about everything - bicycles,
appliances, computers, TVs, etc.
In some cities in America, they even do home grocery
delivery
No longer just a physical goods company - increasingly
built around, and defined by, software
Pioneering the eBook revolution with Kindle
EC2 is running a huge percentage of the public
internet



Software is Eating the World

Netflix started as a company to deliver DVDs to the home...



Software is Eating the World

Netflix started as a company to deliver DVDs to the home...


But as they've grown, the business has shifted to an
online streaming service
They are now rolling out rapidly in many countries
including Ireland, the UK, Canada and the Nordics
No need for physical inventory or postal distribution ...
just servers and digital copies



Disney Found Itself Forced To Transform...

From This...



Disney Found Itself Forced To Transform...

... To This



But What Does All This Software Do?

Software always eats data, be it text files, user form input, emails, etc.

All things that eat must eventually excrete...



Ingestion = Excretion

Yeast Ingests Sugars, and Excretes Ethanol



Ingestion = Excretion

Cows, er...

well, you get the point.


So What Does Software Eat?

Software always eats data, be it text files, user form input, emails, etc.

But what does software excrete?


More Data, of course...
This data gets bigger and bigger
The solutions for storing & processing this data become
narrower as it grows
Data Fertilizes Software, in an endless cycle...



There's a Big Market Here...

Lots of Solutions for Big Data


Data Warehouse Software
Operational Databases
Old style systems being upgraded to scale storage +
processing
NoSQL - Cassandra, MongoDB, etc
Platforms
Hadoop



Don't Tilt At Windmills...



Don't Tilt At Windmills...

It is easy to get distracted by all of these solutions


Keep it simple
Use tools you (and your team) can understand
Use tools and techniques that can scale
Try not to reinvent the wheel



... And Don't Bite Off More Than You Can Chew

Break it into smaller pieces


You can't fit a whole pig into your mouth...
... slice it into small parts that you can consume.



Big Data at a Glance

(Diagram: a large dataset, with username as the primary key)

Big Data can be gigabytes, terabytes, petabytes or exabytes

An ideal big data system scales up and down around various data sizes while providing a uniform view

Major concerns:
Can I read & write this data efficiently at different scales?
Can I run calculations on large portions of this data?



Big Data at a Glance
(Diagram: the large dataset, keyed by username)

Systems like Google File System (which inspired Hadoop's HDFS) and MongoDB's Sharding handle the scale problem by chunking

Break up pieces of data into smaller chunks, spread across many data nodes
Each data node contains many chunks
If a chunk gets too large or a node overloaded, data can be rebalanced
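To make the chunking idea concrete, here is a minimal Python sketch (not MongoDB's or HDFS's actual implementation) of routing a key to the chunk, and thus the data node, responsible for it; the chunk boundaries and node names are made up for illustration.

# Illustrative only: each chunk owns a half-open range of shard-key values
# (None stands in for the minimum/maximum key) and lives on one data node.
chunks = [
    {"lower": None, "upper": "Ba", "node": "node-1"},  # (min, "Ba")
    {"lower": "Ba", "upper": "Be", "node": "node-2"},  # ["Ba", "Be")
    {"lower": "Be", "upper": "Br", "node": "node-3"},  # ["Be", "Br")
    {"lower": "Br", "upper": None, "node": "node-4"},  # ["Br", max)
]

def find_chunk(key):
    """Return the chunk whose range contains the given shard key."""
    for chunk in chunks:
        above_lower = chunk["lower"] is None or chunk["lower"] <= key
        below_upper = chunk["upper"] is None or key < chunk["upper"]
        if above_lower and below_upper:
            return chunk
    raise KeyError(key)

# Reads and writes for a username go straight to the node owning its chunk.
print(find_chunk("Becky")["node"])    # -> node-3
print(find_chunk("Brendan")["node"])  # -> node-4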



Chunks Represent Ranges of Values

Initially, an empty collection has a single chunk, running the range of minimum (-∞) to maximum (+∞)

INSERT {USERNAME: Bill}

As we add data, more chunks are created covering new ranges, e.g. [-∞, B), [B, C), [C, +∞)

INSERT {USERNAME: Becky}
INSERT {USERNAME: Brendan}

Individual or partial letter ranges (Ba, Be, Br, ...) are one possible chunk value... but they can get smaller!

INSERT {USERNAME: Brad}

The smallest possible chunk value is not a range, but a single possible value, e.g. [Brad] [Brendan]
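Here is a rough Python sketch of that progression. It is purely illustrative: the split threshold is an artificially tiny document count rather than a size in megabytes, and splitting at the median key is a simplification of what real shard balancers do.

# Illustrative sketch: a chunk covers a range of usernames and splits
# at its median key once it grows past a (tiny, artificial) limit.
MAX_DOCS_PER_CHUNK = 2

class Chunk:
    def __init__(self, lower=None, upper=None):
        self.lower, self.upper = lower, upper   # None means -inf / +inf
        self.docs = []

    def covers(self, key):
        return ((self.lower is None or self.lower <= key) and
                (self.upper is None or key < self.upper))

chunks = [Chunk()]                 # one chunk covering (-inf, +inf)

def insert(username):
    chunk = next(c for c in chunks if c.covers(username))
    chunk.docs.append(username)
    if len(chunk.docs) > MAX_DOCS_PER_CHUNK:    # too big: split at the median
        chunk.docs.sort()
        mid = chunk.docs[len(chunk.docs) // 2]
        left, right = Chunk(chunk.lower, mid), Chunk(mid, chunk.upper)
        left.docs = [d for d in chunk.docs if d < mid]
        right.docs = [d for d in chunk.docs if d >= mid]
        chunks[chunks.index(chunk):chunks.index(chunk) + 1] = [left, right]

for name in ["Bill", "Becky", "Brendan", "Brad"]:
    insert(name)

print([(c.lower, c.upper, c.docs) for c in chunks])
# -> [(None, 'Bill', ['Becky']), ('Bill', 'Brad', ['Bill']), ('Brad', None, ['Brad', 'Brendan'])]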



Big Data at a Glance

(Diagram: the large dataset, keyed by username, split into chunks labeled a, b, c, d, e, f, g, h ... s, t, u, v, w, x, y, z)

To simplify things, let's look at our dataset split into chunks by letter
Each chunk is represented by a single letter marking its contents
You could think of B as really being Ba-Bz





Big Data at a Glance

(Diagram: the large dataset, keyed by username, broken up into chunks)

MongoDB Sharding (as well as HDFS) breaks data into chunks (~64 MB)



Big Data at a Glance

(Diagram: the chunks spread across Data Node 1, 2, 3 and 4, each node holding 25% of the chunks)

Representing data as chunks allows many levels of scale across n data nodes
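A toy sketch of that distribution in Python: deal the chunks out round-robin so each of n data nodes ends up with an equal share. The chunk labels and the four-node layout are simply the ones from the diagram.

from collections import defaultdict

chunks = list("xbvtdfzsheucwayg")     # the 16 chunk labels from the diagram
nodes = ["Data Node 1", "Data Node 2", "Data Node 3", "Data Node 4"]

# Deal chunks out round-robin: with 16 chunks and 4 nodes,
# each node ends up owning 25% of them.
placement = defaultdict(list)
for i, chunk in enumerate(chunks):
    placement[nodes[i % len(nodes)]].append(chunk)

for node, owned in placement.items():
    print(node, owned, f"{len(owned) / len(chunks):.0%}")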



Scaling

(Diagram: the chunks evenly distributed across Data Nodes 1 through 5)

The set of chunks can be evenly distributed across n data nodes



Add Nodes: Chunk Rebalancing

(Diagram: the chunks redistributed across Data Nodes 1 through 5)

The goal is equilibrium - an equal distribution.
As nodes are added (or even removed), chunks can be redistributed for balance (sketched below).
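A minimal sketch of that rebalancing idea: when a node joins (or leaves), move chunks one at a time from the most-loaded node to the least-loaded one until the counts are as even as possible. This is only the balancing heuristic in miniature, not MongoDB's real balancer.

def rebalance(placement):
    """Move chunks from the fullest node to the emptiest until even."""
    while True:
        fullest = max(placement, key=lambda n: len(placement[n]))
        emptiest = min(placement, key=lambda n: len(placement[n]))
        if len(placement[fullest]) - len(placement[emptiest]) <= 1:
            return placement                  # as balanced as it gets
        placement[emptiest].append(placement[fullest].pop())

# Four nodes each hold four chunks; a fifth, empty node joins the cluster.
placement = {
    "Data Node 1": ["x", "c", "b", "z"],
    "Data Node 2": ["t", "f", "v", "y"],
    "Data Node 3": ["a", "s", "u", "g"],
    "Data Node 4": ["e", "w", "h", "d"],
    "Data Node 5": [],
}
rebalance(placement)
print({node: len(owned) for node, owned in placement.items()})
# -> every node ends up with 3 or 4 chunks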



Don't Bite Off More Than You Can Chew...

The answer to calculating big data is much the same as for storing it

We need to break our data into bite-sized pieces
Build functions which can be composed together repeatedly on partitions of our data
Process portions of the data across multiple calculation nodes
Aggregate the results into a final set of results (see the sketch below)
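A tiny Python sketch of that break-it-up pattern: split the data into partitions, apply the same function to each partition (each call could run on a separate calculation node), then aggregate the partial results. The dataset here is just a stand-in.

def partition(data, size):
    """Split the dataset into bite-sized pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(piece):
    """The per-partition work; each call could run on a different node."""
    return sum(piece)

data = list(range(1, 1001))                 # stand-in for a large dataset
partial_results = [process(piece) for piece in partition(data, 100)]
total = sum(partial_results)                # aggregate into a final result
print(total)                                # -> 500500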



Bite Sized Pieces Are Easier to Swallow

These pieces are not chunks; rather, they are the individual data points that make up each chunk

Chunks make a useful data transfer unit for processing as well
Transfer chunks as Input Splits to calculation nodes, allowing for scalable parallel processing



MapReduce the Pieces

The most common application of these techniques is MapReduce

Based on a Google whitepaper, it works with two primary functions, map and reduce, to calculate against large datasets



MapReduce to Calculate Big Data

MapReduce is designed to effectively process data at varying scales

Composable function units can be reused repeatedly for scaled results



MapReduce to Calculate Big Data

In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation

MongoDB can be integrated with Hadoop to MapReduce its data

No HDFS storage needed - data moves directly between MongoDB and Hadoop's MapReduce engine



What is MapReduce?

MapReduce is made up of a series of phases, the primary of which are:
Map
Shuffle
Reduce

Let's look at a typical MapReduce job:
Email records
Count the # of times a particular user has received email
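To keep the walk-through concrete, here are the six email records from the next slide written out as plain Python dicts; the map, shuffle and reduce sketches that follow all operate on this list.

# The sample email records used throughout the walk-through.
emails = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
]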



MapReducing Email
to: tyler
from: brendan
subject: Ruby Support

to: brendan
from: tyler
subject: Re: Ruby Support

to: mike
from: brendan
subject: Node Support

to: brendan
from: mike
subject: Re: Node Support

to: mike
from: tyler
subject: COBOL Support

to: tyler
from: mike
subject: Re: COBOL Support
(WTF?)



Map Step

The map function breaks each document into a key (grouping) & value, calling emit(k, v):

to: tyler, from: brendan, subject: Ruby Support -> key: tyler, value: {count: 1}
to: brendan, from: tyler, subject: Re: Ruby Support -> key: brendan, value: {count: 1}
to: mike, from: brendan, subject: Node Support -> key: mike, value: {count: 1}
to: brendan, from: mike, subject: Re: Node Support -> key: brendan, value: {count: 1}
to: mike, from: tyler, subject: COBOL Support -> key: mike, value: {count: 1}
to: tyler, from: mike, subject: Re: COBOL Support (WTF?) -> key: tyler, value: {count: 1}
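A minimal Python rendering of the map step above, operating on the emails list defined earlier (MongoDB's built-in mapReduce takes JavaScript functions; this is just the same idea sketched in Python):

def map_email(doc):
    """Break one document into a key (the recipient) and a value: emit(k, v)."""
    yield doc["to"], {"count": 1}

mapped = [pair for doc in emails for pair in map_email(doc)]
# -> [('tyler', {'count': 1}), ('brendan', {'count': 1}), ('mike', {'count': 1}), ...]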



Group/Shuffle Step

Group like keys together, creating an array of their distinct values
(Automatically done by M/R frameworks)

key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: tyler, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}



Group/Shuffle Step

Group like keys together, creating an array of their distinct values
(Automatically done by M/R frameworks)

key: tyler, values: [{count: 1}, {count: 1}]
key: mike, values: [{count: 1}, {count: 1}]
key: brendan, values: [{count: 1}, {count: 1}]
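The same step in the running Python sketch: group the mapped pairs by key (a real MapReduce framework does this for you as part of the shuffle):

from collections import defaultdict

def shuffle(pairs):
    """Group like keys together, collecting each key's values into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

grouped = shuffle(mapped)
# -> {'tyler': [{'count': 1}, {'count': 1}], 'brendan': [...], 'mike': [...]}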



Reduce Step

For each key, the reduce function flattens the list of values to a single result
(reduce function: aggregate values, return result)

key: tyler, values: [{count: 1}, {count: 1}] -> key: tyler, value: {count: 2}
key: mike, values: [{count: 1}, {count: 1}] -> key: mike, value: {count: 2}
key: brendan, values: [{count: 1}, {count: 1}] -> key: brendan, value: {count: 2}
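Finishing the Python sketch, the reduce step collapses each key's list of values into one result, giving the per-user email counts from the slide:

def reduce_counts(key, values):
    """Aggregate the values emitted for one key into a single result."""
    return {"count": sum(v["count"] for v in values)}

results = {key: reduce_counts(key, values) for key, values in grouped.items()}
print(results)
# -> {'tyler': {'count': 2}, 'brendan': {'count': 2}, 'mike': {'count': 2}}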



Processing Scalable Big Data

MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)

MapReduce is supported in many places, including MongoDB & Hadoop

We have effective answers for both of our concerns:
Can I read & write this data efficiently at different scales?
Can I run calculations on large portions of this data?



Batch Isn't a Sustainable Answer

There are downsides here - fundamentally, MapReduce is a batch process

Batch systems like Hadoop give us a Catch-22:
You can get answers to questions from petabytes of data
But you can't guarantee you'll get them quickly
In some ways, this is a step backwards in our industry
Business stakeholders tend to want answers now
We must evolve



Moving Away from Batch

The Big Data world is moving rapidly away from slow, batch-based processing solutions

Google moved forward from batch into more realtime over the last few years

Hadoop is replacing "MapReduce as the assembly language" with more flexible resource management in YARN
Now MapReduce is just a feature implemented on top of YARN
We can build anything we want
Newer systems like Spark & Storm provide platforms for realtime processing



In Closing
The World IS Being Eaten By Software

All that software is leaving behind an awful lot of data

We must be careful not to step in it
More Data Means More Software Means More Data Means...

Practical Solutions for Processing & Storing Data will save us

We as Data Scientists & Technologists must always evolve our strategies, thinking and tools



[Download the Hadoop Connector]
http://github.com/mongodb/mongo-hadoop
[Docs]
http://api.mongodb.org/hadoop/

QUESTIONS?

*Contact Me*
brendan@10gen.com
(twitter: @rit)

