
Distributed Query Processing

Based on "The State of the Art in Distributed Query Processing" by Donald Kossmann (ACM Computing Surveys, 2000)

Motivation
Cost and scalability: networks of off-the-shelf machines
Integration of software from different vendors (each with its own DBMS)
Integration of legacy systems
Inherently distributed applications, such as workflow or collaborative design
State-of-the-art distributed information technologies (e-business)

Part 1 : Basics
Query Processing Basics
centralized query processing
distributed query processing

Problem Statement
Input: a query, such as "Biological objects in study A referenced in literature in journal Y"
Output: the answer
Objectives: response time, throughput, first answers, little I/O, ...

Centralized vs. Distributed Query Processing


Same basic problem, but with more and different parameters (such as data sites or available machine power) and different objectives

Steps in Query Processing


Input: Declarative Query
SQL, XQuery, ...

Step 1: Translate Query into Algebra


Tree of operators (query plan generation)

Step 2: Optimize Query


Tree of operators (logical): also select partitions of tables
Tree of operators (physical): also site annotations
(Compilation)

Step 3: Execution
Interpretation; Query result generation

Algebra
SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35

[Figure: operator tree for the query: a projection on A.d over a selection (A.a = B.b, A.c = 35) over the cross product A X B]

relational algebra for SQL very well understood
algebra for XQuery mostly understood

Query Optimization
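[Figure: the logical plan (projection on A.d over the selection A.a = B.b, A.c = 35 over A X B) next to a physical plan (projection on A.d over a hashjoin whose inputs are an index lookup on A.c and B.b over B)]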

logical, e.g., push down cheap predicates
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan

Basic Query Optimization


Classical Dynamic Programming algorithm
Performs join order optimization
Input: join query on n relations
Output: best join order

The Dynamic Prog. Algorithm


for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do
    for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S, O ≠ ∅ do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S \ O))
            prunePlans(optPlan(S))
        }
    }
return optPlan({R1, R2, ..., Rn})
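The enumeration above can be sketched in Python. This is a minimal illustration, not the algorithm as given in the survey: the cost model (sum of intermediate result sizes), the function name `dp_join_order`, the selectivity dictionary, and the sample cardinalities are all assumptions made for the example, and pruning is simplified to keeping one cheapest plan per subset.

```python
from itertools import combinations

def dp_join_order(cards, selectivity):
    """Return (cost, cardinality, plan) of the cheapest join order.

    cards:       {relation name: row count}
    selectivity: {frozenset({r1, r2}): join selectivity}; missing pairs
                 count as 1.0, i.e. a cross product (assumed convention).
    """
    rels = sorted(cards)
    best = {}  # optPlan: frozenset of relations -> (cost, cardinality, plan)
    for r in rels:  # access plans for the single relations
        best[frozenset([r])] = (0.0, float(cards[r]), r)
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            for k in range(1, size):  # every non-empty proper subset O of S
                for left in combinations(subset, k):
                    o = frozenset(left)
                    rest = s - o
                    cost_l, card_l, plan_l = best[o]
                    cost_r, card_r, plan_r = best[rest]
                    sel = 1.0
                    for a in o:
                        for b in rest:
                            sel *= selectivity.get(frozenset([a, b]), 1.0)
                    card = card_l * card_r * sel
                    # Assumed cost model: sum of intermediate result sizes.
                    cost = cost_l + cost_r + card
                    # Simplified prunePlans: keep the cheapest plan per subset.
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, card, (plan_l, plan_r))
    return best[frozenset(rels)]

# Hypothetical statistics for three relations.
cards = {"A": 1000, "B": 10, "C": 100}
selectivity = {frozenset(["A", "B"]): 0.01, frozenset(["B", "C"]): 0.5}
cost, card, plan = dp_join_order(cards, selectivity)
print(cost, card, plan)
```

With these made-up statistics, joining A and B first is cheaper than joining B and C first, because the A-B join produces the smaller intermediate result.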

Query Execution
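[Figure: tuples flowing through the physical plan. The index on A.c returns (John, 35, CS) and (Mary, 35, EE); B supplies (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0), reduced to the B.b values (CS) and (AS); the hashjoin outputs (John, 35, CS); the projection on A.d returns John.]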

library of operators (hash join, merge join, ...)
exploit indexes and clustering in database
pipelining (iterator model)
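The iterator model can be sketched with Python generators, where each operator pulls tuples from its input on demand. This is a minimal sketch under assumptions: the column layout (A as (d, c, a), B with b in the second position) is inferred from the example tuples, the row ("Ann", 40, "CS") is invented to give the selection something to filter, and a filtered scan stands in for the index lookup on A.c.

```python
def scan(table):
    """Leaf operator: produce the stored tuples one at a time."""
    for row in table:
        yield row

def select(rows, predicate):
    """Pass through only the tuples that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on one input, then stream the other input through it."""
    table = {}
    for row in build_rows:                      # blocking build phase
        table.setdefault(build_key(row), []).append(row)
    for row in probe_rows:                      # pipelined probe phase
        for match in table.get(probe_key(row), []):
            yield row + match

def project(rows, columns):
    """Keep only the listed column positions."""
    for row in rows:
        yield tuple(row[c] for c in columns)

# Assumed layout: A = (d, c, a), B = (city, b, score).
A = [("John", 35, "CS"), ("Mary", 35, "EE"), ("Ann", 40, "CS")]
B = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]

plan = project(
    hash_join(
        scan(B),
        select(scan(A), lambda r: r[1] == 35),  # stands in for the index on A.c
        build_key=lambda r: r[1],               # B.b
        probe_key=lambda r: r[2],               # A.a
    ),
    columns=[0],                                # A.d
)
result = list(plan)
print(result)
```

Nothing runs until `list(plan)` pulls on the root operator, which is the point of the iterator model: tuples flow upward one at a time without materializing intermediate results, except in the build phase of the hash join.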

Summary : Centralized Queries


Basic SQL (SPJG, nesting) well understood
Very good extensibility
spatial joins, time series, UDF, XQuery, etc.

Current problems
Better statistics: cost model for optimization
Physical database design expensive & complex

Some Trends
interactiveness during execution
approximate answers, top-k
self-tuning capabilities (adaptive; robust; etc.)

Distributed Query Processing: Basics


Idea:
Extension of centralized query processing (System R* et al. in the 1980s)

What is different?
extend physical algebra: send & receive operators
other metrics: optimize for response time
resource vectors, network interconnect matrix
caching and replication
less predictability in cost model (adaptive algorithms)
heterogeneity in data formats and data models

Issues in Distributed Databases


Plan enumeration
The time and space complexity of the traditional dynamic programming algorithm is very large
Iterative Dynamic Programming (heuristic for large queries)

Cost Models
Classic Cost Model
Response Time Model
Economic Models

Distributed Query Plan


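Forms of parallelism?

[Figure: distributed query plan. A projection on A.d sits above a hashjoin; each hashjoin input arrives through a receive operator fed by a send operator. One send is fed by the index lookup on A.c, the other by B.b over B.]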

Cost : Resource Utilization


Total Cost = Sum of Cost of Ops
Cost = 40

[Figure: the distributed query plan annotated with per-operator costs 1, 8, 1, 1, 6, 6, 2, 5 and 10, which sum to 40]

Another Metric : Response Time


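Total Cost = 40
first tuple = 25, last tuple = 33

[Figure: the distributed query plan annotated with (first tuple, last tuple) times per operator. The two topmost operators show (25, 33) and (24, 32), next to the label "Pipelined parallelism". The lower operators show (0, 7), (0, 6), (0, 24), (0, 18), (0, 12), (0, 5) and (0, 10), next to the label "Independent parallelism"; the (0, 10) operator is marked first tuple = 0, last tuple = 10.]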
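The difference between the two metrics can be shown with a small calculation. This is a simplified sketch, not the survey's response time model: it only contrasts summing operator costs with taking the maximum over branches that run independently in parallel. The function names are hypothetical, and reading the (0, 5) and (0, 10) annotations as two leaf operators running concurrently is an assumption about the figure.

```python
def total_cost(op_costs):
    """Resource utilization: every operator's cost counts, wherever it runs."""
    return sum(op_costs)

def response_time_independent(branch_times):
    """Independent branches run concurrently, so the slowest one decides."""
    return max(branch_times)

# Per-operator costs as annotated in the resource-utilization figure.
figure_costs = [1, 8, 1, 1, 6, 6, 2, 5, 10]
# Last-tuple times of the two leaf operators (assumed reading of the figure).
leaf_times = [5, 10]

plan_cost = total_cost(figure_costs)
parallel_finish = response_time_independent(leaf_times)
sequential_finish = total_cost(leaf_times)
print(plan_cost, parallel_finish, sequential_finish)
```

The total cost stays the same no matter how the work is spread over sites, whereas the response time drops whenever branches overlap in time. That is why a plan with a higher total cost can still be the better choice when optimizing for response time.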

Query Execution Techniques for Distributed Databases


Row Blocking
Multi-cast optimization
Multi-threaded execution
Joins with horizontal partitioning
Semi joins
Top n queries

Query Execution Techniques for DD


Row Blocking
SEND and RECEIVE operators in the query plan model communication
Implemented with TCP/IP, UDP, etc.
Ship tuples block-wise (in batches); smooths burstiness
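Row blocking can be sketched as a pair of generators. This is a minimal illustration with no real networking: `send_in_blocks` and `receive` are hypothetical names, and each yielded list stands in for one network message.

```python
from itertools import islice

def send_in_blocks(tuples, block_size):
    """SEND side: group tuples into blocks of at most block_size."""
    iterator = iter(tuples)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block            # one block corresponds to one message

def receive(blocks):
    """RECEIVE side: unpack blocks back into a stream of tuples."""
    for block in blocks:
        yield from block

messages = list(send_in_blocks(range(5), block_size=2))
received = list(receive(send_in_blocks(range(5), block_size=2)))
print(messages, received)
```

Five tuples travel in three messages instead of five, and the operator above RECEIVE still sees an ordinary tuple stream.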



Multi-cast Optimization
Location of sending/receiving may affect communication costs; forwarding versus multi-casting

Multi-threaded execution
Several threads for operators at the same site (intra-query parallelism)
May be useful to enable concurrent reads from diverse machines (while continuing query processing)
Must consider whether resources warrant concurrent operator execution (say, two sorts each needing all memory)



Joins with Data (horizontal) partitioning:
Hash-based partitioning to conduct joins on independent partitions
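A partitioned join can be sketched as follows. This is a single-process illustration: in a distributed system each partition pair would be joined at a different site, while here they are joined in a loop. The function names and the sample relations are assumptions made for the example.

```python
def partition(rows, key, n):
    """Assign each row to one of n partitions by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def hash_join(left, right, left_key, right_key):
    """Plain in-memory hash join of two lists of rows."""
    table = {}
    for row in left:
        table.setdefault(left_key(row), []).append(row)
    return [(l, r) for r in right for l in table.get(right_key(r), [])]

def partitioned_join(left, right, left_key, right_key, n):
    """Join matching partitions independently and concatenate the results."""
    result = []
    left_parts = partition(left, left_key, n)
    right_parts = partition(right, right_key, n)
    for lp, rp in zip(left_parts, right_parts):
        result.extend(hash_join(lp, rp, left_key, right_key))
    return result

# Hypothetical relations: (customer id, item) and (customer id, name).
orders = [(1, "pen"), (2, "ink"), (3, "pad"), (1, "cap")]
customers = [(1, "Ann"), (3, "Bob"), (4, "Eve")]
joined = partitioned_join(orders, customers, lambda o: o[0], lambda c: c[0], n=3)
print(sorted(joined))
```

Because both inputs use the same hash function on the join key, rows that can join always land in the same partition number, so no partition pair needs to see the others.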

Semi Joins :
Reduce communication costs: send only the join keys, instead of complete tuples, to the other site, which uses them to extract the relevant join partners
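The three steps of a semi join can be sketched as below. This is an illustration under assumptions: both "sites" are plain Python lists in one process, and the function name `semi_join` and the sample tuples are made up for the example.

```python
def semi_join(site_a_rows, site_b_rows, a_key, b_key):
    """Join rows of site A with rows of site B, shipping as little as possible."""
    # Step 1 (site A): project the distinct join keys; only these are shipped.
    keys = {a_key(row) for row in site_a_rows}
    # Step 2 (site B): keep only the tuples that have a join partner.
    reduced_b = [row for row in site_b_rows if b_key(row) in keys]
    # Step 3 (site A): join the shipped-back tuples with the local tuples.
    by_key = {}
    for row in reduced_b:
        by_key.setdefault(b_key(row), []).append(row)
    joined = [(a, b) for a in site_a_rows for b in by_key.get(a_key(a), [])]
    return joined, reduced_b

# Hypothetical data: people with a department, departments with a city.
site_a = [("John", "CS"), ("Mary", "EE")]
site_b = [("CS", "Edinburgh"), ("AS", "Edinburgh"), ("ME", "Glasgow")]
joined, shipped = semi_join(site_a, site_b, lambda a: a[1], lambda b: b[0])
print(joined, shipped)
```

Only one of the three tuples at site B crosses the network here. The technique pays off when the key set is small compared to the tuples it filters out; otherwise the extra round trip can cost more than it saves.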

Double-pipelined hash joins :


Non-blocking join operators to deliver first results quickly; fully exploit pipelined parallelism, and reduce overall response time

Top n queries :
Isolate the top n tuples quickly and perform other expensive operations (such as sort or join) only on those few (use stop operators)
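The effect of a stop operator can be sketched with `heapq.nlargest` from the Python standard library. This is an illustration only: `stop`, `expensive_lookup`, and the sample rows are hypothetical, and the call counter exists just to show how much work is avoided.

```python
import heapq

lookup_calls = 0

def expensive_lookup(row):
    """Stands in for a costly operation such as a remote join."""
    global lookup_calls
    lookup_calls += 1
    name, score = row
    return (name, score, name.upper())

def stop(rows, n, key):
    """Stop operator: return only the n rows with the largest key."""
    return heapq.nlargest(n, rows, key=key)

rows = [("a", 3), ("b", 9), ("c", 5), ("d", 7)]
top2 = [expensive_lookup(r) for r in stop(rows, 2, key=lambda r: r[1])]
print(top2, lookup_calls)
```

Placing the stop operator below the expensive operation is only safe when that operation cannot discard tuples; if a later join might drop some of the top n, the stop has to keep more candidates or move above the join.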

Adaptive Algorithms
Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomy of nodes, changes in policies

Example: double pipelined hash joins


build hash table for both input streams
read inputs in separate threads
good for bursty arrival of data
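A double-pipelined (symmetric) hash join can be sketched as follows. This is a single-threaded simulation rather than the threaded design named on the slide: the interleaved `events` list stands in for tuples arriving from two input threads, and all names are assumptions made for the example.

```python
def symmetric_hash_join(events, left_key, right_key):
    """Join two streams whose tuples arrive interleaved.

    events: iterable of ("L", row) or ("R", row) in arrival order.
    Each arriving tuple is inserted into its own side's hash table and
    immediately probed against the other side's table.
    """
    left_table, right_table = {}, {}
    for side, row in events:
        if side == "L":
            k = left_key(row)
            left_table.setdefault(k, []).append(row)
            for match in right_table.get(k, []):
                yield (row, match)
        else:
            k = right_key(row)
            right_table.setdefault(k, []).append(row)
            for match in left_table.get(k, []):
                yield (match, row)

events = [
    ("L", (1, "a")),
    ("R", (2, "x")),
    ("R", (1, "y")),   # matches the left tuple that arrived first
    ("L", (2, "b")),   # matches the right tuple that arrived second
]
results = list(symmetric_hash_join(events, lambda r: r[0], lambda r: r[0]))
print(results)
```

The first result is produced as soon as the third tuple arrives, without waiting for either input to finish. The price is memory, since both inputs are held in hash tables.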

Re-optimization at run time (LEO, etc.)


monitor execution of query
adjust estimates of cost model
re-optimize if delta is too large
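The re-optimization trigger can be sketched as a comparison of estimated and observed cardinalities. This is not how LEO or any particular system decides: the ratio test, the function name `should_reoptimize`, and the default threshold of 2.0 are assumptions chosen to illustrate the "delta is too large" check.

```python
def should_reoptimize(estimated_rows, observed_rows, max_ratio=2.0):
    """Return True when the estimate is off by more than max_ratio either way."""
    low = max(min(estimated_rows, observed_rows), 1)   # avoid division by zero
    high = max(estimated_rows, observed_rows)
    return high / low > max_ratio

print(should_reoptimize(100, 120))    # estimate is close enough
print(should_reoptimize(100, 1000))   # underestimated by a factor of 10
```

A symmetric ratio treats overestimates and underestimates alike. A real system would also weigh the cost of re-optimizing against the work already done.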

Special Techniques for Client-Server Architectures


Shipping techniques
Query shipping
Data shipping
Hybrid shipping

Query Optimization
Site Selection
Where to optimize
Two Phase Optimization

Special Techniques for Federated Database Systems


Wrapper architecture
Query optimization
Query capabilities
Cost estimation
Calibration Approach
Wrapper Cost Model

Parameter Binding

Heterogeneity
Use Wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
connections, cursors, data, ...

Wrappers map queries into local dialect
Wrappers participate in query planning!
define the subset of queries that can be handled
give cost information, statistics
capability-based rewriting
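How a wrapper might take part in planning can be sketched as follows. This is a hypothetical interface, not the API of any real federated system: the `Wrapper` class, its `can_handle` and `estimate_cost` methods, and the per-row cost figures are all assumptions made for the example.

```python
class Wrapper:
    """Hides one data source and tells the middleware what it can do."""

    def __init__(self, name, rows, supports_select, cost_per_row):
        self.name = name
        self.rows = rows
        self.supports_select = supports_select
        self.cost_per_row = cost_per_row

    def can_handle(self, operation):
        """Declare the subset of operations this source can evaluate."""
        return operation == "scan" or (operation == "select" and self.supports_select)

    def estimate_cost(self):
        """Give the optimizer a cost figure (here: rows times a constant)."""
        return len(self.rows) * self.cost_per_row

    def execute(self, predicate=None):
        if predicate is not None and not self.supports_select:
            raise ValueError(f"{self.name} cannot evaluate selections")
        return [r for r in self.rows if predicate is None or predicate(r)]

def run_selection(wrapper, predicate):
    """Capability-based rewriting: push the selection down only if supported."""
    if wrapper.can_handle("select"):
        return wrapper.execute(predicate), "pushed down"
    return [r for r in wrapper.execute() if predicate(r)], "filtered in middleware"

sql_source = Wrapper("DB1", [1, 5, 9], supports_select=True, cost_per_row=1.0)
file_source = Wrapper("DB2", [2, 6, 8], supports_select=False, cost_per_row=3.0)

print(run_selection(sql_source, lambda x: x > 4))
print(run_selection(file_source, lambda x: x > 4))
print(sql_source.estimate_cost(), file_source.estimate_cost())
```

Both sources return the same kind of answer, but only the first does the filtering itself. The middleware compensates for the missing capability of the second, which is the core idea behind capability-based rewriting.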

Middleware
Two kinds of middleware
data warehouses
virtual integration

Data Warehouses
good: query response times
good: materializes results of data cleaning
bad: high resource requirements in middleware
bad: staleness of data

Virtual Integration
the opposite trade-offs of data warehouses
caching possible to improve response times

Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition), which sends sub queries through one wrapper each to DB1 and DB2]

IBM Data Joiner


[Figure: an SQL query enters Data Joiner, which sends sub queries through one wrapper each to SQL DB1 and SQL DB2]

Adding XML
[Figure: a query enters an XML publishing layer on top of the SQL middleware, which sends sub queries through one wrapper each to DB1 and DB2]

XML Data Integration


[Figure: an XML query enters the XML middleware, which sends XML queries through one wrapper each to DB1 and DB2]

XML Data Integration


Example: BEA Liquid Data
Advantage
Availability of XML wrappers for all major databases

Problems
XML - SQL mapping is very difficult
XML is not always the right language (e.g., decision support style queries)

Summary
Middleware looks like a homogeneous centralized database
location transparency
data model transparency

Middleware provides global schema


data sources map local schemas to global schema

Various kinds of middleware (SQL, XML)
Stacks of middleware possible
Data cleaning requires special attention
