Distributed Query Processing

Based on "The State of the Art in Distributed Query Processing" by Donald Kossmann (ACM Computing Surveys, 2000)
Motivation
Cost and scalability: networks of off-the-shelf machines
Integration of software from different vendors (each with its own DBMS)
Integration of legacy systems
Inherently distributed applications, such as workflow or collaborative design
State-of-the-art distributed information technologies (e-business)
Part 1 : Basics
Query Processing Basics
Centralized query processing
Distributed query processing
Problem Statement
Input: Query, such as "biological objects in study A referenced in the literature in journal Y"
Output: Answer
Objectives: response time, throughput, first answers, little I/O, ...
Centralized vs. Distributed Query Processing

Same basic problem, but with more and different parameters (such as data sites or available machine power) and objectives
Steps in Query Processing

Input: Declarative Query
SQL, XQuery, ...
Step 1: Translate Query into Algebra

Tree of operators (query plan generation)
Step 2: Optimize Query

Tree of operators (logical): also select partitions of tables
Tree of operators (physical): also site annotations
(Compilation)
Step 3: Execution
Interpretation; Query result generation
Algebra
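SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35

[Figure: operator tree for the query above. From the root down: projection on A.d; selection on A.a = B.b, A.c = 35; Cartesian product (X) of A and B.]
Relational algebra for SQL: very well understood
Algebra for XQuery: mostly understood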
Query Optimization
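[Figure: two plans for the same query. Logical plan: projection on A.d; selection on A.a = B.b, A.c = 35; Cartesian product (X) of A and B. Optimized physical plan: projection on A.d over a hashjoin whose inputs are an index lookup on A.c and the B.b values of B.]
Logical, e.g., push down cheap predicates
Enumerate alternative plans, apply cost model
Use search heuristics to find the cheapest plan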
Basic Query Optimization

Classical Dynamic Programming algorithm
Performs join order optimization
Input: Join query on n relations
Output: Best join order
The Dynamic Prog. Algorithm

for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do
    for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S, O ≠ ∅ do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S \ O))
            prunePlans(optPlan(S))
        }
    }
return optPlan({R1, R2, ..., Rn})
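The enumeration above can be sketched in Python. This is a minimal illustration rather than the algorithm of any particular system: the cardinalities, the selectivities, and the cost function (the sum of intermediate result sizes) are hypothetical, and pruning is reduced to keeping the single cheapest plan per subset of relations.

```python
from itertools import combinations

# Hypothetical statistics, for illustration only.
CARD = {"R1": 1000, "R2": 100, "R3": 10}
SEL = {frozenset(("R1", "R2")): 0.01, frozenset(("R2", "R3")): 0.1}

def join_selectivity(left, right):
    """Product of the selectivities of all predicates linking the two sides.
    A missing predicate counts as 1.0, i.e. a cross product."""
    s = 1.0
    for a in left:
        for b in right:
            s *= SEL.get(frozenset((a, b)), 1.0)
    return s

def best_join_order(relations):
    # opt[S] = (cost, cardinality, plan) of the cheapest plan found for subset S.
    opt = {frozenset([r]): (0.0, float(CARD[r]), r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            # Try every split of S into a non-empty O and its complement S \ O.
            for k in range(1, size):
                for left in combinations(subset, k):
                    o = frozenset(left)
                    rest = s - o
                    lcost, lcard, lplan = opt[o]
                    rcost, rcard, rplan = opt[rest]
                    card = lcard * rcard * join_selectivity(o, rest)
                    cost = lcost + rcost + card
                    # Pruning: keep only the cheapest plan for S.
                    if s not in opt or cost < opt[s][0]:
                        opt[s] = (cost, card, (lplan, rplan))
    return opt[frozenset(relations)]
```

Calling `best_join_order(["R1", "R2", "R3"])` with these statistics joins R2 and R3 first, because that pair produces the smallest intermediate result. The nested loops visit every subset and every split of it, which is where the exponential time and space of the approach come from.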
Query Execution
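[Figure: execution of the optimized plan. The index on A.c returns (John, 35, CS) and (Mary, 35, EE); B holds (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0), from which B.b yields (CS) and (AS); the hashjoin outputs (John, 35, CS); the projection on A.d outputs John.]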
Library of operators (hash join, merge join, ...)
Exploit indexes and clustering in the database
Pipelining (iterator model)
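The iterator model can be sketched with Python generators, where each operator pulls tuples from its child on demand. This is an illustrative sketch only: the column names `city` and `score` for B are made up, the `select` operator stands in for the index lookup on A.c, and the sample rows follow the example tuples of the execution slide.

```python
def scan(rows):
    """Leaf operator: produce stored tuples one at a time."""
    for row in rows:
        yield row

def select(child, predicate):
    """Pass on only the tuples that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then stream the other input past it."""
    table = {}
    for row in build:  # build phase: consumes the whole build input
        table.setdefault(row[build_key], []).append(row)
    for row in probe:  # probe phase: pipelined, one tuple at a time
        for match in table.get(row[probe_key], []):
            yield {**row, **match}

def project(child, column):
    """Keep a single column of each tuple."""
    for row in child:
        yield row[column]

A = [{"d": "John", "c": 35, "a": "CS"}, {"d": "Mary", "c": 35, "a": "EE"}]
B = [{"city": "Edinburgh", "b": "CS", "score": 5.0},
     {"city": "Edinburgh", "b": "AS", "score": 6.0}]

def run_query():
    # SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
    filtered_a = select(scan(A), lambda r: r["c"] == 35)
    plan = project(hash_join(scan(B), filtered_a, "b", "a"), "d")
    return list(plan)
```

No operator runs until the consumer at the top asks for a tuple, so apart from the build phase of the hash join no intermediate result is materialized.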
Summary : Centralized Queries

Basic SQL (SPJG, nesting) well understood
Very good extensibility
    spatial joins, time series, UDFs, XQuery, etc.
Current problems
    Better statistics: cost model for optimization
    Physical database design expensive and complex
Some Trends
    interactiveness during execution
    approximate answers, top-k
    self-tuning capabilities (adaptive, robust, etc.)
Distributed Query Processing: Basics

Idea:
Extension of centralized query processing (System R* et al. in the 1980s)
What is different?
Extend physical algebra: send and receive operators
Other metrics: optimize for response time
Resource vectors, network interconnect matrix
Caching and replication
Less predictability in cost model (adaptive algorithms)
Heterogeneity in data formats and data models
Issues in Distributed Databases

Plan enumeration
The time and space complexity of the traditional dynamic programming algorithm is very large
Iterative Dynamic Programming (heuristic for large queries)
Cost Models
Classic Cost Model
Response Time Model
Economic Models
Distributed Query Plan

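[Figure: distributed plan. Projection on A.d over a hashjoin; each hashjoin input arrives through a receive operator fed by a send operator at the remote site; one input is the index lookup on A.c, the other is B.b over B.]
Forms of parallelism?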
Cost : Resource Utilization

Total Cost = Sum of Cost of Ops
[Figure: the distributed plan annotated with per-operator costs 1, 8, 1, 1, 6, 6, 2, 5, and 10.]
Cost = 40
Another Metric : Response Time

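[Figure: the distributed plan annotated with (first tuple, last tuple) times per operator: 25, 33; 24, 32; 0, 7; 0, 6; 0, 24; 0, 18; 0, 12; 0, 5; 0, 10. Labels: pipelined parallelism; independent parallelism.]
Total Cost = 40
At the top of the plan: first tuple = 25, last tuple = 33
At a leaf: first tuple = 0, last tuple = 10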
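The difference between the two metrics can be sketched as follows. The plan and its costs are hypothetical, and the response-time rule used here (an operator starts once its slowest input has finished, and inputs on different sites run in parallel) is a deliberate simplification that ignores pipelining, so it does not reproduce the numbers in the figure.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    cost: float
    children: list = field(default_factory=list)

def total_cost(op):
    """Resource utilization: every operator's cost is counted."""
    return op.cost + sum(total_cost(c) for c in op.children)

def response_time(op):
    """Simplified response time: independent inputs run in parallel,
    so only the slowest child delays the parent."""
    return op.cost + max((response_time(c) for c in op.children), default=0)

# Hypothetical plan: a join at one site fed by scans at two other sites.
plan = Op("join", 4, [Op("scan A at site 1", 3), Op("scan B at site 2", 5)])
```

For this plan `total_cost(plan)` is 12 while `response_time(plan)` is 9: independent parallelism shortens the elapsed time without reducing the work done.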
Query Execution Techniques for Distributed Databases

Row Blocking
Multi-cast optimization
Multi-threaded execution
Joins with horizontal partitioning
Semi joins
Top n queries
Query Execution Techniques for DD

Row Blocking
SEND and RECEIVE operators in the query plan model communication
Implemented over TCP/IP, UDP, etc.
Ship tuples block-wise (in batches); smooths burstiness
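Row blocking can be sketched as a pair of generators. This is an illustration under simplifying assumptions: the network is replaced by an in-process iterator, and the names `send_in_blocks` and `receive` are made up for the example.

```python
from itertools import islice

def send_in_blocks(tuples, block_size):
    """SEND side: group tuples into blocks so each message carries a batch."""
    it = iter(tuples)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

def receive(blocks):
    """RECEIVE side: unpack blocks and hand tuples to the consuming operator."""
    for block in blocks:
        yield from block
```

With a block size of 2, five tuples travel as three messages instead of five, and the consumer still sees an ordinary tuple stream.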

Multi-cast Optimization
Location of sending/receiving may affect communication costs; forwarding versus multi-casting
Multi-threaded execution
Several threads for operators at the same site (intra-query parallelism)
May be useful to enable concurrent reads from diverse machines (while continuing query processing)
Must consider whether resources warrant concurrent operator execution (say, two sorts each needing all memory)

Joins with Data (horizontal) partitioning:
Hash-based partitioning to conduct joins on independent partitions
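A minimal sketch of a hash-partitioned join: both inputs are split with the same hash function, so matching tuples always land in the same partition and each pair of partitions can be joined independently (for instance, on a different site). Here the partitions are processed in a plain loop, and the function names are made up for the example.

```python
def partition(rows, key, n):
    """Split rows into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def partitioned_join(r_rows, s_rows, r_key, s_key, n):
    """Join R and S partition by partition; each iteration is independent."""
    result = []
    for r_part, s_part in zip(partition(r_rows, r_key, n),
                              partition(s_rows, s_key, n)):
        index = {}
        for row in r_part:
            index.setdefault(r_key(row), []).append(row)
        for row in s_part:
            for match in index.get(s_key(row), []):
                result.append((match, row))
    return result
```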
Semi Joins :
Reduce communication costs; Send only join keys instead of complete tuples to the site to extract relevant join partners
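The three steps of a semi join can be sketched as follows. Both relations live in one process here, so the "shipping" is only simulated; the function name and the returned counters are made up to show how much data would cross the network.

```python
def semi_join(r_rows, s_rows, r_key, s_key):
    """Join R (local site) with S (remote site) while shipping as little of S as possible."""
    # Step 1: the local site ships only the distinct join keys of R.
    shipped_keys = {r_key(row) for row in r_rows}
    # Step 2: the remote site returns only the S tuples that have a join partner.
    reduced_s = [row for row in s_rows if s_key(row) in shipped_keys]
    # Step 3: the local site completes the join against the reduced S.
    by_key = {}
    for row in reduced_s:
        by_key.setdefault(s_key(row), []).append(row)
    result = [(r, s) for r in r_rows for s in by_key.get(r_key(r), [])]
    return result, len(shipped_keys), len(reduced_s)
```

The technique pays off when the keys are small relative to full tuples and only a small fraction of S has join partners.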
Double-pipelined hash joins :

Non-blocking join operators to deliver first results quickly; fully exploit pipelined parallelism, and reduce overall response time
Top n queries :
Isolate the top n tuples quickly and perform other expensive operations (such as sort or join) only on those few (use stop operators)
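A minimal sketch of the idea, assuming the ranking can be computed before the expensive step: `heapq.nlargest` plays the role of the stop operator, and `expensive_op` is a placeholder for a join, a remote lookup, or similar work.

```python
import heapq

def top_n_then_process(rows, n, score, expensive_op):
    """Apply the stop operator first, then run the expensive step on n tuples only."""
    top = heapq.nlargest(n, rows, key=score)  # keeps at most n tuples in memory
    return [expensive_op(row) for row in top]
```

Placing the stop operator below the expensive operation is only correct when that operation neither drops tuples nor changes their ranking; otherwise the stop has to stay above it.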
Adaptive Algorithms
Deal with unpredictable events at run time
Delays in arrival of data, burstiness of network
Autonomy of nodes, changes in policies
Example: double pipelined hash joins

Build hash tables for both input streams
Read inputs in separate threads
Good for bursty arrival of data
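The core of a double pipelined (symmetric) hash join can be sketched in a single thread. Instead of real reader threads, the sketch assumes the two inputs arrive as one interleaved stream of `("L", row)` and `("R", row)` events; that stream format is an assumption made for the example.

```python
from collections import defaultdict

def symmetric_hash_join(arrivals, left_key, right_key):
    """Each arriving tuple is inserted into its own side's hash table
    and immediately probed against the other side's table."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            k = left_key(row)
            left_table[k].append(row)
            for match in right_table.get(k, []):
                yield (row, match)
        else:
            k = right_key(row)
            right_table[k].append(row)
            for match in left_table.get(k, []):
                yield (match, row)
```

Because neither input has to be read completely before results appear, a stall on one input does not block tuples from the other, at the price of keeping both hash tables in memory.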
Re-optimization at run time (LEO, etc.)

Monitor execution of the query
Adjust estimates of the cost model
Re-optimize if the delta is too large
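The "delta is too large" test can be sketched as a comparison of estimated and observed cardinalities. The ratio-based rule and the threshold of 2.0 are assumptions made for this example; they are not taken from LEO or any other system.

```python
def should_reoptimize(estimated_rows, observed_rows, max_ratio=2.0):
    """Return True when the estimate is off by more than max_ratio in either direction."""
    # Clamp to 1 so that empty results do not cause a division by zero.
    low, high = sorted((max(estimated_rows, 1), max(observed_rows, 1)))
    return high / low > max_ratio
```

A monitor would call such a check at points where an intermediate result is complete, so that the remainder of the plan can be replanned with the corrected statistics.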
Special Techniques for Client-Server Architectures

Shipping techniques
Query shipping
Data shipping
Hybrid shipping
Query Optimization
Site Selection
Where to optimize
Two Phase Optimization
Special Techniques for Federated Database Systems

Wrapper architecture
Query optimization
    Query capabilities
    Cost estimation
        Calibration Approach
        Wrapper Cost Model
    Parameter Binding
Heterogeneity
Use wrappers to hide heterogeneity
Wrappers take care of data format and packaging
Wrappers map from local to global schema
Wrappers carry out caching
    connections, cursors, data, ...
Wrappers map queries into the local dialect
Wrappers participate in query planning
    define the subset of queries that can be handled
    give cost information, statistics
    capability-based rewriting
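How a wrapper might take part in planning can be sketched as follows. The `Wrapper` class, its fields, and the operator names are all hypothetical; the sketch only shows the idea of pushing supported operators to the source and leaving the rest to the middleware.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Wrapper:
    name: str
    supported_ops: frozenset  # subset of queries the source can handle
    cost_per_row: float       # cost information exposed to the optimizer

    def can_handle(self, ops):
        return set(ops) <= self.supported_ops

    def estimate_cost(self, expected_rows):
        return self.cost_per_row * expected_rows

def split_for_wrapper(wrapper, ops):
    """Capability-based rewriting: push what the source supports,
    keep the remaining operators in the middleware."""
    pushed = [op for op in ops if op in wrapper.supported_ops]
    residual = [op for op in ops if op not in wrapper.supported_ops]
    return pushed, residual
```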
Middleware
Two kinds of middleware
    data warehouses
    virtual integration
Data Warehouses
    good: query response times
    good: materializes results of data cleaning
    bad: high resource requirements in middleware
    bad: staleness of data
Virtual Integration
    the opposite
    caching possible to improve response times
Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition), which sends one sub query through a wrapper to each of DB1 and DB2.]
IBM Data Joiner

[Figure: an SQL query enters Data Joiner, which sends one sub query through a wrapper to each of SQL DB1 and SQL DB2.]
Adding XML
[Figure: a query enters an XML publishing layer on top of the (SQL) middleware, which sends one sub query through a wrapper to each of DB1 and DB2.]
XML Data Integration

[Figure: an XML query enters the (XML) middleware, which sends one XML query through a wrapper to each of DB1 and DB2.]
XML Data Integration

Example: BEA Liquid Data
Advantage
    Availability of XML wrappers for all major databases
Problems
    XML - SQL mapping is very difficult
    XML is not always the right language (e.g., decision-support-style queries)
Summary
Middleware looks like a homogeneous centralized database
    location transparency
    data model transparency
Middleware provides global schema
    data sources map local schemas to global schema
Various kinds of middleware (SQL, XML)
Stacks of middleware possible
Data cleaning requires special attention
