Based on "The State of the Art in Distributed Query Processing" by Donald Kossmann (ACM Computing Surveys, 2000)
Motivation
Cost and scalability: network of off-the-shelf machines
Integration of different software vendors (each with its own DBMS)
Integration of legacy systems
Applications that are inherently distributed, such as workflow or collaborative design
State-of-the-art distributed information technologies (e-businesses)
Part 1 : Basics
Query Processing Basics
centralized query processing
distributed query processing
Problem Statement
Input: a query, such as "biological objects in study A referenced in the literature in journal Y"
Output: the answer
Objectives: response time, throughput, first answers, little IO, ...
Step 3: Execution
Interpretation; Query result generation
Algebra
relational algebra for SQL: very well understood
algebra for XQuery: mostly understood
Query Optimization
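[Figure: example plans for a query that projects A.d with predicates A.a = B.b and A.c = 35; a logical plan over the cross product A × B, and a physical plan using an index on A.c and a hash join with B on B.b]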
logical, e.g., push down cheap predicates
enumerate alternative plans, apply cost model
use search heuristics to find cheapest plan
Query Execution
library of operators (hash join, merge join, ...)
exploit indexes and clustering in database
pipelining (iterator model)
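The iterator model can be illustrated with a minimal sketch in which every operator is a Python generator that pulls tuples from its child one at a time. The operator names (`scan`, `select`, `hash_join`, `project`) and the sample tables are assumptions made for illustration; they are not taken from the survey.

```python
from typing import Callable, Iterable, Iterator

Row = dict

def scan(table: list[Row]) -> Iterator[Row]:
    """Leaf operator: yield the rows of an in-memory table one at a time."""
    for row in table:
        yield row

def select(child: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row

def hash_join(build: Iterable[Row], probe: Iterable[Row],
              build_key: str, probe_key: str) -> Iterator[Row]:
    """Build a hash table on one input, then stream the other input through it."""
    table: dict = {}
    for row in build:
        table.setdefault(row[build_key], []).append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

def project(child: Iterable[Row], columns: list[str]) -> Iterator[Row]:
    """Keep only the requested columns."""
    for row in child:
        yield {c: row[c] for c in columns}

# Hypothetical sample data.
A = [{"a": 1, "c": 35, "d": "x"}, {"a": 2, "c": 35, "d": "y"}, {"a": 3, "c": 10, "d": "z"}]
B = [{"b": 1}, {"b": 3}]

# Project A.d where A.a = B.b and A.c = 35, with the cheap predicate pushed below the join.
plan = project(
    hash_join(scan(B), select(scan(A), lambda r: r["c"] == 35), "b", "a"),
    ["d"],
)
result = list(plan)
```

No operator except the build side of the hash join materializes its input, so the first result tuple can be produced before the probe input has been read completely.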
Current problems
Better statistics: cost model for optimization
Physical database design expensive & complex
Some Trends
interactiveness during execution
approximate answers, top-k
self-tuning capabilities (adaptive; robust; etc.)
What is different?
extend physical algebra: send & receive operators
other metrics: optimize for response time
resource vectors, network interconnect matrix
caching and replication
less predictability in cost model (adaptive algorithms)
heterogeneity in data formats and data models
Cost Models
Classic Cost Model
Response Time Model
Economic Models
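The difference between the classic cost model and the response time model can be sketched as follows. This is a deliberate simplification with stated assumptions: the classic model sums the cost of all operators (total resource consumption), while the response time model here assumes that the children of an operator run fully in parallel at different sites and that the parent starts only when the slowest child has finished (no pipelining). The `Op` class and all cost numbers are hypothetical, and economic models are not covered.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """A plan operator with its own (hypothetical) cost and its input operators."""
    name: str
    cost: float
    children: list["Op"] = field(default_factory=list)

def total_cost(op: Op) -> float:
    """Classic cost model: total resource consumption of the whole plan."""
    return op.cost + sum(total_cost(c) for c in op.children)

def response_time(op: Op) -> float:
    """Simplified response time: children run in parallel, so only the slowest counts."""
    return op.cost + max((response_time(c) for c in op.children), default=0.0)

# Two scans at different sites feed a join (numbers are made up for illustration).
plan = Op("hashjoin", 2, [Op("scan A", 6), Op("scan B", 5)])

classic = total_cost(plan)        # 2 + 6 + 5
elapsed = response_time(plan)     # 2 + max(6, 5)
```

Under these assumptions a plan with a higher total cost can still have the lower response time, which is why the two models may prefer different plans.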
Forms Of Parallelism?
[Figure: distributed plan in which a hashjoin consumes two inputs through send/receive operator pairs (an index on A.c and B on B.b), annotated with operator costs and (first tuple, last tuple) times to illustrate independent parallelism]
Multi-threaded execution
Several threads for operators at the same site (intraquery parallelism)
Can be useful for reading from several machines concurrently (while query processing continues)
Must consider whether resources warrant concurrent operator execution (e.g., two sorts that each need all memory)
Semi Joins:
Reduce communication costs: send only the join keys, instead of complete tuples, to the other site to extract the relevant join partners
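A minimal sketch of the semi-join idea, assuming table A lives at site 1 and table B at site 2. The function name `semi_join_reduce`, the table contents, and the column names are hypothetical; network transfer is only simulated by passing values between plain Python variables.

```python
def semi_join_reduce(b_rows: list[dict], a_keys: set, key: str) -> list[dict]:
    """Runs at site 2: keep only the B tuples that have a join partner in A."""
    return [row for row in b_rows if row[key] in a_keys]

# Hypothetical data at the two sites.
site1_A = [{"a": 1, "d": "x"}, {"a": 2, "d": "y"}]
site2_B = [
    {"b": 1, "payload": "p1"},
    {"b": 3, "payload": "p3"},
    {"b": 4, "payload": "p4"},
]

# Step 1: site 1 ships only the join keys of A, not complete tuples.
shipped_keys = {row["a"] for row in site1_A}

# Step 2: site 2 ships back only the matching B tuples.
reduced_B = semi_join_reduce(site2_B, shipped_keys, "b")

# Step 3: site 1 performs the final join on the reduced input.
joined = [{**a, **b} for a in site1_A for b in reduced_B if a["a"] == b["b"]]
```

In this example one B tuple crosses the network instead of three; whether the extra round trip pays off depends on how selective the join is.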
Top n queries:
Isolate the top n tuples quickly and perform other expensive operations (such as sort, join, etc.) only on those few (use stop operators)
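A stop operator can be sketched as a generator that ends the pipeline after n tuples, so that an expensive operation placed above it runs only n times. The sketch assumes the input already arrives in ranking order (the `sorted` call stands in for, say, an index that delivers tuples by score); `stop`, `expensive_lookup`, and the data are hypothetical names.

```python
from typing import Iterable, Iterator

def stop(child: Iterable[dict], n: int) -> Iterator[dict]:
    """Stop operator: pass through at most n tuples, then stop pulling from the child."""
    if n <= 0:
        return
    for count, row in enumerate(child, start=1):
        yield row
        if count >= n:
            return

lookups = [0]  # counts how often the expensive operation runs

def expensive_lookup(row: dict) -> dict:
    """Stand-in for an expensive join or remote call."""
    lookups[0] += 1
    return {**row, "detail": f"detail-{row['id']}"}

rows = [{"id": i, "score": s} for i, s in enumerate([5, 9, 1, 7, 3])]

# Assumption: tuples arrive ordered by descending score.
ranked = iter(sorted(rows, key=lambda r: r["score"], reverse=True))

# The stop operator sits below the expensive operation.
result = [expensive_lookup(r) for r in stop(ranked, 2)]
```

Placing the stop operator below the expensive operation is only safe when that operation neither drops tuples nor changes their order.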
Adaptive Algorithms
Deal with unpredictable events at run time
delays in arrival of data, burstiness of network
autonomy of nodes, changes in policies
Query Optimization
Site Selection
Where to optimize
Two Phase Optimization
Parameter Binding
Heterogeneity
Use wrappers to hide heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
  connections, cursors, data, ...
Wrappers map queries into local dialect
Wrappers participate in query planning!
  define the subset of queries that can be handled
  give cost information, statistics
  capability-based rewriting
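How a wrapper might take part in planning can be sketched as follows. The `Wrapper` and `Predicate` classes, the `rewrite` function, and the operator set are all hypothetical: the wrapper declares which predicate operators its source can evaluate, reports a row-count statistic, and translates the supported predicates into its local SQL dialect, while the planner keeps the unsupported predicates for evaluation in the middleware.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    column: str
    op: str
    value: object

class Wrapper:
    def __init__(self, table: str, supported_ops: set[str], row_count: int):
        self.table = table
        self.supported_ops = supported_ops  # subset of queries the source can handle
        self.row_count = row_count          # statistic offered to the optimizer

    def can_handle(self, pred: Predicate) -> bool:
        return pred.op in self.supported_ops

    def translate(self, preds: list[Predicate]) -> str:
        """Map predicates into the local dialect (illustrative string building only;
        a real wrapper would use parameterized queries)."""
        where = " AND ".join(f"{p.column} {p.op} {p.value!r}" for p in preds)
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

def rewrite(wrapper: Wrapper, preds: list[Predicate]) -> tuple[str, list[Predicate]]:
    """Capability-based rewriting: push down what the source supports, keep the rest."""
    pushed = [p for p in preds if wrapper.can_handle(p)]
    residual = [p for p in preds if not wrapper.can_handle(p)]
    return wrapper.translate(pushed), residual

# A source that supports only equality and less-than comparisons.
source = Wrapper("A", {"=", "<"}, row_count=1000)
sub_query, residual = rewrite(
    source, [Predicate("c", "=", 35), Predicate("d", "LIKE", "J%")]
)
```

Here the equality predicate is shipped to the source, and the `LIKE` predicate remains for the middleware to apply to the returned tuples.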
Middleware
Two kinds of middleware
data warehouses
virtual integration
Data Warehouses
good: query response times
good: materializes results of data cleaning
bad: high resource requirements in middleware
bad: staleness of data
Virtual Integration
the opposite of data warehouses (pros and cons reversed)
caching possible to improve response times
Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition); the middleware sends one sub-query to each wrapper, and each wrapper issues SQL against its database (DB1, DB2)]
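The virtual-integration flow (query decomposition in the middleware, one sub-query per wrapper, result composition) can be sketched with two in-memory SQLite databases standing in for DB1 and DB2. The schema, the data, and the function names `wrapper` and `middleware` are assumptions made for illustration.

```python
import sqlite3

def make_db(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'objects' table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (name TEXT, study TEXT)")
    db.executemany("INSERT INTO objects VALUES (?, ?)", rows)
    return db

db1 = make_db([("geneX", "A"), ("geneY", "B")])
db2 = make_db([("proteinZ", "A")])

def wrapper(db: sqlite3.Connection, study: str) -> list[tuple[str]]:
    """Wrapper: run the sub-query in the source's SQL dialect."""
    return db.execute(
        "SELECT name FROM objects WHERE study = ?", (study,)
    ).fetchall()

def middleware(study: str) -> list[str]:
    """Decompose the query into one sub-query per source, then compose the results."""
    partial_results = [wrapper(db, study) for db in (db1, db2)]
    return sorted(name for rows in partial_results for (name,) in rows)

answer = middleware("A")
```

No data is copied ahead of time, so the answer reflects the current state of both sources, at the price of contacting every source for each query.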
Adding XML
[Figure: a query enters an XML publishing layer placed on top of the SQL middleware; the middleware sends sub-queries through wrappers to DB1 and DB2]
Problems
XML-SQL mapping is very difficult
XML is not always the right language (e.g., decision support style queries)
Summary
Middleware looks like a homogeneous centralized database
  location transparency
  data model transparency
Various kinds of middleware (SQL, XML)
Stacks of middleware possible
Data cleaning requires special attention