
Data Warehousing

Jens Teubner, TU Dortmund


jens.teubner@cs.tu-dortmund.de

Winter 2014/15

Jens Teubner Data Warehousing Winter 2014/15 1


Part VII

MapReduce et al.



Scaling Up Data Warehouse Systems

Growing expectations toward Data Warehouses:


increasing data volumes (Big Data)
increasing complexity of analyses

Problems:
OLAP queries are multi-dimensional queries
Curse of Dimensionality: indexes become ineffective
Indexes can't help to fight growing query complexity.
→ Workloads become scan-heavy.
Scaling up a server becomes expensive.



Curse of Dimensionality

[Figure: elapsed time for NN-search (ms); N = 50,000, image
database, k = 10. Compared: sequential scan, R*-tree, X-tree, and
VA-File; x-axis: number of dimensions in vectors (0–45); y-axis:
100–100,000 ms, log scale.]


Parallel Query Evaluation

Scans can be parallelized, however:

[Diagram: a user query (zip=44227) is split up and evaluated in
parallel against several disks.]

    parallel hardware (e.g., graphics processors)
    cluster systems


OLAP Using GPUs

E.g., Jedox OLAP Accelerator (uses NVIDIA Tesla GPUs):

[Excerpt from a Jedox factsheet (in German), summarized: modern
GPU modules consist of thousands of processing units; their
parallel processing significantly speeds up accesses and
computations on consolidated cells in the OLAP cube. With Jedox's
In-GPU-Memory technology, the cube's cell data resides entirely
in GPU memory, so only queries and results have to be transferred
between CPU and GPU; multiple GPUs can be combined for larger
cubes.]

source: Jedox White Paper
Parallel Databases

E.g., Teradata Database:

The bigger the data, the more important parallel processing
becomes. The key is to distribute units of work evenly among the
nodes, avoiding bottlenecks where a single node sequentially
performs heavy operations (typically sorts, aggregations, and
joins). Work is carried out by software processes called Access
Module Processors (AMPs).

[Figure 1: "Traditional Parallel Database versus Teradata."
Left: an initial query is compiled, executed, and serialized
through a bottleneck process across the AMPs. Right: true query
parallelism with balanced performance across the AMPs and a final
merge producing the final result.]

source: Teradata White Paper
Scalability Challenges

Challenges:

Robustness:
More components → higher risk of failure.
Failure of a single component might take the whole system
off-line.
Scalability/Elasticity:
Provision for peak load?
Use resources otherwise when the DW is not at peak load?
Add resources later (when the business grows)?
Cost:
(Reliable) large installations tend to become expensive.
(There's a relatively small market for very large systems.)



Scalability in Web Search

Search engines have faced similar challenges very early on.

Task: generate inverted files

doc1: "data warehouses are cool"
doc2: "cool guys distribute their data"

    term         cnt   posting list
    are           1    doc1:3
    cool          2    doc1:4, doc2:1
    data          2    doc1:1, doc2:5
    distribute    1    doc2:3
    guys          1    doc2:2
    their         1    doc2:4
    warehouses    1    doc1:2


Inverted File Generation

Idea: Break up index generation into two parts:


1 For each document, extract terms.
2 Collect terms into groups and emit an index entry per group.

E.g.,

1   foreach document doc do
2       pos ← 1;
3       tokens ← parse(doc);
4       foreach word in tokens do
5           emit ⟨word, doc.id:pos⟩;
6           pos ← pos + 1;

7   collect ⟨key, (values …)⟩ pairs;

8   foreach ⟨key, (values)⟩ do
9       count ← 0;
10      pList ← ();
11      foreach v ∈ values do
12          pList.append(v);
13          count ← count + 1;
14      emit ⟨key, count, pList⟩;
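As a toy illustration, the two phases and the intermediate collect step can be sketched in Python (single-process only; the documents mirror the doc1/doc2 example from the previous slide):

```python
from collections import defaultdict

def map_terms(doc_id, text):
    """Phase 1 (lines 1-6): emit one (word, "docid:pos") pair per token."""
    for pos, word in enumerate(text.split(), start=1):
        yield word, f"{doc_id}:{pos}"

def shuffle(pairs):
    """Line 7: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_postings(term, postings):
    """Phase 2 (lines 8-14): emit one index entry per term."""
    return term, len(postings), postings

docs = {"doc1": "data warehouses are cool",
        "doc2": "cool guys distribute their data"}

pairs = [p for doc_id, text in docs.items() for p in map_terms(doc_id, text)]
index = [reduce_postings(t, ps) for t, ps in sorted(shuffle(pairs).items())]
# e.g., ("data", 2, ["doc1:1", "doc2:5"]) appears in the index
```

In a real MapReduce run, `map_terms` and `reduce_postings` would execute on many nodes in parallel, with the framework performing the shuffle.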



Inverted File Generation

Observations: (for parallel execution)


For part 1, documents can be partitioned arbitrarily over the
nodes.
For part 2, all postings of one term must be collocated on the
same node (postings for different terms may be on different
nodes).
To establish collocation, data may have to be moved
("shuffled") across nodes.



Distributed Index Generation

[Diagram: the partitioned input feeds several parallel "terms"
extraction tasks; a shuffle phase routes their outputs to several
parallel "entries" tasks, which produce the partitioned result.]


Generalization (→ MapReduce)

The application pattern turns out to be highly versatile.

Only replace the foreach bodies:

    lines 2–6:   f1 :: item → [⟨key, value⟩]      ("Mapper")
    lines 8–14:  f2 :: ⟨key, [value]⟩ → output    ("Reducer")

Shuffling (line 7) combines a [⟨key, value⟩] (list of key/value
pairs) into a list of ⟨key, [value]⟩ (pairs of key and list of
values).
Shuffling (combining) is generic.

MapReduce³ is a framework for distributed computing, where f1 and
f2 can be instantiated by the user.

³ Dean and Ghemawat. "MapReduce: Simplified Data Processing on
Large Clusters." OSDI 2004.
Example: Webserver Log File Analysis

Task: For each client IP, report the total traffic (in bytes).

→ Mapper and Reducer implementations?
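One possible answer, sketched in Python (the log line format assumed here, "client_ip request status bytes", is a simplification for illustration, not a real webserver format):

```python
from collections import defaultdict

def mapper(log_line):
    """Emit (client IP, bytes sent) for one log line."""
    fields = log_line.split()
    yield fields[0], int(fields[-1])

def reducer(ip, byte_counts):
    """Sum up all byte counts of one client IP."""
    yield ip, sum(byte_counts)

log = ["10.0.0.1 GET/index.html 200 532",
       "10.0.0.2 GET/a.html 200 100",
       "10.0.0.1 GET/b.html 404 10"]

shuffled = defaultdict(list)          # stands in for the shuffle phase
for line in log:
    for ip, n in mapper(line):
        shuffled[ip].append(n)
traffic = dict(kv for ip, ns in shuffled.items() for kv in reducer(ip, ns))
# traffic == {"10.0.0.1": 542, "10.0.0.2": 100}
```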



MapReduce Illustrated

[Cartoon illustration of MapReduce; by @kerzol on Twitter]
MapReduce

The MapReduce framework

    decides on a number of Mappers and Reducers to instantiate,
    decides the partitioning of data and computation,
    moves data as necessary and implements shuffling;
    considers cluster topology, system load, etc.;
    interfaces with a distributed file system (Google File System).

Apache Hadoop provides an open-source implementation of the
MapReduce concept; it also comes with the Hadoop Distributed File
System, HDFS.



So What?

The idea seems straightforward. Why all the fuss?

Remember the challenges we stated?


Risk of failures; elasticity; cost

MapReduce was designed for large clusters of cheap machines.


Think of thousands of machines.
Failures are frequent (and have to be dealt with).
This is why MapReduce has become popular.



Failure Tolerance?

Trick:
Mapper and Reducer must be pure functions.
    Their output depends only on their input.
    No side effects.
→ Computation can be done anywhere, and repeated if necessary.

MapReduce runtime:
    Monitor job execution.
    Job does not finish within the expected time?
    → Restart on a different node.
    Might end up processing a task unit twice → discard all
    results but one.
    Also used to improve performance (in case of stragglers).



Performance: Grep

E.g., scan 10¹⁰ 100-byte words for a three-character pattern.

Setup (from the 2004 MapReduce paper):
    1800 machines,
    2 × 2 GHz CPUs and a 160 GB IDE HDD each,
    Gigabit Ethernet.

[Figure 2 of the paper: data transfer rate (input, MB/s) over
time, peaking at roughly 30,000 MB/s within the first minute.]

→ Leverage aggregate disk bandwidth.
→ This is what we need for OLAP, too.
Performance: Sort

[Figure 3 of the MapReduce paper: data transfer rates over time
for different executions of the sort program; input, shuffle, and
output rates (MB/s) shown for (a) normal execution, (b) no backup
tasks, and (c) 200 tasks killed.]


MapReduce for Data Warehousing

MapReduce is not a database.
    No tables, tuples, rows, schemas, indexes, etc.

Rather, MapReduce is based on files.
    Typically kept in a distributed file system.

This is unfortunate:
    No schema information to optimize, validate, etc.
    No indexes (or other means to improve the physical
    representation).

This is good:
    Start analyzing immediately; don't wait for index creation, etc.
    May ease ad-hoc analyses.



Beyond the Basic Idea

While the original MapReduce is proprietary to Google, Hadoop is
widely used in industry and research.
    Java-based
    Can run on heterogeneous platforms, cloud systems, etc.
    Integration with other Apache technology
        Hadoop Distributed File System (HDFS), HBase, etc.
    Can hook into more functions than just Mapper and Reducer
        e.g., pre-aggregate between map and shuffle,
        modify partitioning, etc.
    Many interfaces Hadoop ↔ database/data warehouse



Hadoop and Petabyte Sort Benchmark

Challenge: sort 1 TB of 100-byte records.

Hardware:
    3800 nodes, 2 × 4 cores @ 2.5 GHz per node
    4 SATA disks, 8 GB RAM per node

Results:

      GBytes     Nodes     Maps    Reduces   Repl.   Time
         500     1406      8000      2600      1      59 sec
       1,000     1460      8000      2700      1      62 sec
     100,000     3452   190,000    10,000      2     173 min
   1,000,000     3658    80,000    20,000      2     975 min





MapReduce ↔ Databases: Load Times

[Figures 2 and 3 of Pavlo et al., "A Comparison of Approaches to
Large-Scale Data Analysis," SIGMOD 2009: load times for the Grep
task data sets (1 TB/cluster; 20 GB/node) on up to 100 nodes,
Hadoop vs. Vertica.]

→ Schema and physical data organization make loading slower on
  the databases.
MapReduce ↔ Databases: Grep Benchmark

[Figure 5 of Pavlo et al., "A Comparison of Approaches to
Large-Scale Data Analysis," SIGMOD 2009: Grep task results for
the 1 TB/cluster data set; Vertica vs. Hadoop on 25/50/100
nodes.]

→ MapReduce leaves its result as a collection of files; collecting
  them into a single result costs additional time.
MapReduce ↔ Databases: Aggregation

[Figures 7 and 8 of Pavlo et al., "A Comparison of Approaches to
Large-Scale Data Analysis," SIGMOD 2009: aggregation task results
with 2.5 million groups and 2,000 groups, respectively; Vertica
vs. Hadoop on 1–100 nodes.]

→ Databases are limited by communication cost (transmitting the
  local groups and merging them at the coordinator), which is
  lower for smaller group counts.
MapReduce ↔ Databases: Join

[Figure 9 of Pavlo et al., "A Comparison of Approaches to
Large-Scale Data Analysis," SIGMOD 2009: join task results;
Vertica, DBMS-X, and Hadoop on 1–100 nodes. The databases finish
in well under two minutes, while Hadoop needs hundreds to
thousands of seconds.]

→ Joins are rather complex to formulate in MapReduce.
→ Repartitioning incurs high communication overhead.
→ Joins can be accelerated using indexes.
MapReduce ↔ Databases

Persistent data ↔ data read ad hoc:
    Overhead for schema design, loading, indexing, etc.
    Cost might amortize only after several queries/analyses.
    Databases feature support for transactions.
        Not needed for read-only workloads.

Language: SQL ↔ Java/C++/…:
    Write a new MapReduce program for each and every analysis?
    User-defined functionality in SQL?
        E.g., similarity measures, statistics functions, etc.
    Debug SQL or a MapReduce job?

Is there a good middle ground?



Apache Pig

Idea:
    Data processing language that sits in-between SQL and
    MapReduce.
    Declarative, SQL-like parts (→ allow for optimization, easy
    re-use and maintenance).
    Procedural style, rich data model (→ programmers feel
    comfortable).
Pig programs are compiled into MapReduce (Hadoop) jobs.



Pig Latin Example
S = LOAD 'sailors.csv' USING PigStorage(',')
        AS (sid:int, name:chararray, rating:int, age:int);
B = LOAD 'boats.csv' USING PigStorage(',')        -- schema on-the-fly
        AS (bid:int, name:chararray, color:chararray);
R = LOAD 'reserves.csv' USING PigStorage(',')
        AS (sid:int, bid:int, day:chararray);

-- SELECT S.sid, R.day
-- FROM   Sailors AS S, Reserves AS R
-- WHERE  S.sid = R.sid AND R.bid = 101

A = FILTER R BY (bid == 101);                  -- programming style:
B = JOIN S BY sid, A BY sid;                   -- sequence of assignments
X = FOREACH B GENERATE S::sid, A::day AS day;  --   → data flow

STORE X INTO 'result.csv';



Pig Latin Data Model
Pig Latin features a fairly rich data model:

    atoms:
        e.g., 'foo', 42
    tuples: sequence of fields of any data type
        e.g., ('foo', 42)
        access by field name or position; tuples can be nested
    bag: collection of tuples (possibly with duplicates)
        e.g., { ('foo', 42),
                (17, ('hello', 'world')) }
    map: collection of key → value mappings
        e.g., [ 'fan of' → { ('lakers'),
                             ('iPod') },
                'age'    → 20 ]


Pig Latin Data Model

Pig Latin's data types can be arbitrarily nested.⁴

    Contrast to the 1NF data model in relational databases:
        Avoid joins, which MapReduce can't do too well.
        Allow for a sound data model, including grouping, etc.
        Easier integration with user-defined functions.

⁴ Keys for map types must be atomic, though (for efficiency
reasons).
Pig Latin Operators: FILTER

kids = FILTER users BY (age < 18);

Comparison operators: ==, eq, !=, neq, AND, …
Can use user-defined functions arbitrarily.

→ Implementation in MapReduce?
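One plausible answer (a sketch, not Pig's actual compiled plan): FILTER needs no shuffle at all; it becomes a map-only job whose mapper re-emits exactly those tuples that satisfy the predicate. In Python:

```python
def filter_mapper(record, predicate):
    """Map-only job: pass a record through iff the predicate holds."""
    if predicate(record):
        yield record

users = [{"name": "ann", "age": 12}, {"name": "bob", "age": 30}]
kids = [r for u in users for r in filter_mapper(u, lambda r: r["age"] < 18)]
# kids == [{"name": "ann", "age": 12}]
```

Since every record is handled independently, the mappers can run on arbitrary partitions of the input with no data movement.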



Pig Latin Operators: FOREACH

FOREACH Sailors GENERATE
    sid AS sailorId,
    name AS sailorName,
    ( rating, age ) AS sailorInfo;

Apply some processing (e.g., item re-structuring) to every item
of a data set (→ projection in Relational Algebra).
No loop dependence! → parallel execution
(XQuery's FLWOR expressions provide a similar form of iteration.)



Pig Latin Operators: GROUP

sales_by_cust = GROUP sales BY customerName;

Returns a bag ("relation") with two fields: the group key and a
bag of all tuples with that key value.
    The first field is named "group".
    The second field is named by the variable ("alias" in Pig
    terminology) used in the GROUP statement (here: sales).

→ Implementation in MapReduce?
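A possible answer, sketched in Python: GROUP maps directly onto the shuffle phase. The mapper emits the group key with each tuple; the framework collects the bag per key; the reducer merely packages it (field names follow Pig's convention described above):

```python
from collections import defaultdict

def group_mapper(record):
    """Emit (group key, tuple) so the shuffle collects the bags."""
    yield record["customerName"], record

def group_reducer(key, records):
    # Pig's GROUP output: the key field 'group' plus the bag of tuples.
    yield {"group": key, "sales": records}

sales = [{"customerName": "ann", "amount": 5},
         {"customerName": "bob", "amount": 7},
         {"customerName": "ann", "amount": 3}]

shuffled = defaultdict(list)          # stands in for the shuffle phase
for rec in sales:
    for k, v in group_mapper(rec):
        shuffled[k].append(v)
sales_by_cust = [g for k, rs in sorted(shuffled.items())
                 for g in group_reducer(k, rs)]
```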



Pig Latin Operators: COGROUP

Group items from multiple data sets:

O = LOAD 'owner.csv' USING PigStorage(',')
        AS (owner:chararray, pet:chararray);
{ (Alice, turtle), (Alice, goldfish), (Alice, cat),
  (Bob, dog), (Bob, cat) }

F = LOAD 'friend.csv' USING PigStorage(',')
        AS (person:chararray, friend:chararray);
{ (Cindy, Alice), (Mark, Alice), (Paul, Bob), (Paul, Jane) }

X = COGROUP O BY owner, F BY friend;

( Alice, { (Alice, turtle),
           (Alice, goldfish),
           (Alice, cat) },   { (Cindy, Alice),
                               (Mark, Alice) } )
( Bob,   { (Bob, dog),
           (Bob, cat) },     { (Paul, Bob) } )
( Jane,  { },                { (Paul, Jane) } )


Pig Latin Operators: JOIN

join_result = JOIN results BY queryString,
              revenue BY queryString;

Equi-joins only.

→ Implementation in MapReduce?
Cross product between fields 1 and 2 of the COGROUP result:

temp = COGROUP results BY queryString,
               revenue BY queryString;
join_result = FOREACH temp GENERATE
              FLATTEN(results), FLATTEN(revenue);
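The COGROUP-plus-FLATTEN strategy corresponds to a classic reduce-side join. A toy Python sketch (the relation contents and field names below are made up for illustration):

```python
from collections import defaultdict

def join_mapper(source, record):
    """Tag each record with the relation it came from."""
    yield record["queryString"], (source, record)

def join_reducer(key, tagged):
    """COGROUP: split the bag by source; FLATTEN: cross product."""
    left = [r for s, r in tagged if s == "results"]
    right = [r for s, r in tagged if s == "revenue"]
    for l in left:
        for r in right:
            yield {**l, **r}

results = [{"queryString": "q1", "pageRank": 1},
           {"queryString": "q2", "pageRank": 3}]
revenue = [{"queryString": "q1", "adRevenue": 10},
           {"queryString": "q1", "adRevenue": 2}]

shuffled = defaultdict(list)          # stands in for the shuffle phase
for src, rel in (("results", results), ("revenue", revenue)):
    for rec in rel:
        for k, v in join_mapper(src, rec):
            shuffled[k].append(v)
joined = [t for k, vs in shuffled.items() for t in join_reducer(k, vs)]
```

The shuffle ships every tuple of both relations to the node responsible for its key, which is exactly the repartitioning overhead discussed earlier.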



Pig Latin: More Operators

Many additional operators ease common data analysis tasks, e.g.,


LOAD/STORE
(Not surprisingly, Pig works well together with HDFS.)
UNION
CROSS
ORDER
DISTINCT



Pig Latin: Debugging

Pig Latin was also designed with the development and analysis
workflow in mind.
    Interactive use of Pig (grunt).
    Can run Pig programs locally (without Hadoop).
    Commands to examine expression results:
        DUMP: Write an (intermediate) result to storage.
        DESCRIBE: Print the schema of an (intermediate) result.
        EXPLAIN: Print the execution plan.
        ILLUSTRATE: View step-by-step execution of a plan; show
        representative examples of (intermediate) result data.
