M. Venkatesan Assistant Professor (Selection Grade) School of Computing Science & Engineering VIT University
Syllabus
CSE 515 ADVANCED DATABASE SYSTEMS (L T P C: 3 1 0 4)

Contents:

DATABASE DESIGN AND TUNING: Introduction to physical database design; guidelines for index selection; overview of database tuning; conceptual schema tuning; queries and view tuning.

PARALLEL AND DISTRIBUTED DATABASE: Parallel database systems: architecture of parallel databases, parallel query evaluation, parallelizing joins and parallel query optimization. Distributed database systems: distributed database architecture, properties of distributed database, types of distributed database, storing data in a distributed DBMS, distributed query processing.

EMERGING DATABASE TECHNOLOGIES: Multimedia databases: multimedia sources, multimedia database queries, multimedia database applications. Mobile databases: architecture of mobile databases, characteristics of mobile computing, mobile DBMS. Object database system: abstract data types, object identity and reference types, inheritance, and database design for ORDBMS.

DATA WAREHOUSING: Data warehousing: definition and terminology, data preprocessing, main components of data warehouse, data warehouse architecture, OLAP technology, data mart.

Text/Reference Books
1. Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, III Edition, McGraw Hill, 2000.
2. S. K. Singh, Database Systems: Concepts, Design & Applications, Pearson Education, 2006.
3. Ramez Elmasri and B. Navathe, Fundamentals of Database Systems, IV Edition, Addison Wesley, 2005.
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2005.
Database Workload
Understanding the workload:
  The most important queries and how often they arise.
  The most important updates and how often they arise.
  The desired performance for these queries and updates.

For each query in the workload:
  Which relations does it access?
  Which attributes are retrieved?
  Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?

For each update in the workload:
  Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
  The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.

Note: indexes may slow down updates, because every index on an affected attribute must be maintained as well.
Index
An index:
  Is a schema object.
  Is used by the Oracle server to speed up the retrieval of rows by using a pointer.
  Can reduce disk I/O by using a rapid path access method to locate data quickly.
  Is independent of the table it indexes.
  Is used and maintained automatically by the Oracle server.

How Are Indexes Created?
  Automatically: A unique index is created automatically when you define a PRIMARY KEY or UNIQUE constraint in a table definition.
  Manually: Users can create nonunique indexes on columns to speed up access to the rows.
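The two creation paths described above can be sketched with Python's built-in sqlite3 module. This is an illustration only: the slide describes the Oracle server, while SQLite is assumed here so the example runs anywhere, and the table, column, and index names (emp, email, emp_dno_idx) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Automatic: PRIMARY KEY and UNIQUE constraints make the DBMS build unique indexes.
# (Assumption: SQLite stands in for Oracle; eid is TEXT so SQLite creates a separate index.)
conn.execute(
    "CREATE TABLE emp ("
    " eid TEXT PRIMARY KEY,"
    " email TEXT UNIQUE,"
    " ename TEXT,"
    " dno INTEGER)"
)

# Manual: the user creates a nonunique index to speed up access by dno.
conn.execute("CREATE INDEX emp_dno_idx ON emp(dno)")

# PRAGMA index_list returns rows whose second and third fields are name and unique flag.
indexes = {row[1]: bool(row[2]) for row in conn.execute("PRAGMA index_list('emp')")}
for name, is_unique in sorted(indexes.items()):
    print(name, "unique" if is_unique else "nonunique")
```

The listing should show two automatically created unique indexes plus the manually created nonunique one.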
Indexes
An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
Index Classification
Primary vs. secondary: If the search key contains the primary key, the index is called a primary index.
  Unique index: the search key contains a candidate key.

Clustered vs. unclustered: If the order of data records is the same as, or "close to", the order of data entries, the index is called a clustered index.
  Alternative 1 implies clustered.
  A file can be clustered on at most one search key.
  The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!
[Figure: CLUSTERED vs. UNCLUSTERED index. In both panels, index entries direct the search for data entries, and the data entries point to the data records.]
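The cost difference between clustered and unclustered retrieval can be made concrete with a small simulation. This is a rough sketch under stated assumptions, not a model of any particular DBMS: records are packed 100 per page (a made-up figure), a range selection matches 1,000 of 10,000 keys, and cost is counted as the number of distinct pages that hold matching records.

```python
import random

def pages_touched(record_order, matching_keys, records_per_page):
    """Count the distinct data pages that hold at least one matching record."""
    position = {key: i for i, key in enumerate(record_order)}
    return len({position[k] // records_per_page for k in matching_keys})

keys = list(range(10_000))

# Clustered: data records are stored in search-key order.
clustered_file = keys

# Unclustered: data records are stored in an order unrelated to the search key.
rng = random.Random(0)
unclustered_file = keys[:]
rng.shuffle(unclustered_file)

matching = range(2_000, 3_000)  # a range selection on the search key
clustered_pages = pages_touched(clustered_file, matching, 100)
unclustered_pages = pages_touched(unclustered_file, matching, 100)
print(clustered_pages, unclustered_pages)
```

In the clustered file the matching records sit on adjacent pages, while in the shuffled file they are scattered over far more pages, which is why the retrieval cost varies so much.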
Choice of Indexes
One approach: consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
Before creating an index, we must also consider the impact on updates in the workload.
Trade-off: indexes can make queries go faster, but updates slower. They require disk space, too.
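The "check the plan with the current indexes, then see whether an additional index helps" step can be tried out directly. The sketch below assumes SQLite (through Python's sqlite3 module) and its EXPLAIN QUERY PLAN statement; the emp table and the emp_age_idx index are hypothetical names, and other systems expose plans through different commands.

```python
import sqlite3

def plan(conn, sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (eid INTEGER PRIMARY KEY, ename TEXT, age INTEGER, sal REAL)"
)
query = "SELECT ename FROM emp WHERE age = 25"

# Best plan with the current indexes: only a full table scan is possible.
plan_before = plan(conn, query)

# Candidate additional index; every INSERT/DELETE/UPDATE touching age must now maintain it.
conn.execute("CREATE INDEX emp_age_idx ON emp(age)")
plan_after = plan(conn, query)

print(plan_before)
print(plan_after)
```

Comparing the two plans shows whether the optimizer actually picks up the new index; the update cost and disk space it adds still have to be weighed against the workload.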
A range selection such as age > 20 AND sal > 10 benefits from a B+ tree index.
Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
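Why a tree index suits range selections while a hash index only suits equality can be sketched in a few lines. Here a sorted Python list stands in for the ordered leaf level of a B+ tree and a dict stands in for a hash index; both are simplifications, and the age values are invented.

```python
from bisect import bisect_left, bisect_right

# Sorted search-key values, standing in for the leaf level of a tree index.
ages = [18, 21, 25, 25, 30, 42, 57]

def range_lookup(sorted_keys, low, high):
    """Return all keys k with low <= k <= high using two binary searches."""
    return sorted_keys[bisect_left(sorted_keys, low):bisect_right(sorted_keys, high)]

# A hash index modelled as key -> positions: equality lookups only, no ordering.
hash_index = {}
for pos, age in enumerate(ages):
    hash_index.setdefault(age, []).append(pos)

in_range = range_lookup(ages, 21, 30)   # contiguous run in the sorted order
equal_25 = hash_index.get(25, [])       # positions of the duplicates with age 25
print(in_range, equal_25)
```

The range lookup returns one contiguous slice of the ordered keys, which is the property that clustering then turns into contiguous page reads.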
For an index nested loops join, the index on the inner relation should be clustered if the join column is not a key for the inner relation and inner tuples need to be retrieved.
Clustered B+ tree on join column(s) is good for Sort-Merge.
SELECT E.ename, D.mgr FROM Emp E, Dept D WHERE D.dname='Toy' AND E.dno=D.dno
Hash index on D.dname supports the 'Toy' selection. Given this, an index on D.dno is not needed.
Hash index on E.dno allows us to get matching (inner) Emp tuples for each selected (outer) Dept tuple.
What if WHERE included: "... AND E.age=25"? Could retrieve Emp tuples using an index on E.age, then join with Dept tuples satisfying the dname selection. Comparable to the strategy that used the E.dno index. So, if an E.age index is already created, this query provides much less motivation for adding an E.dno index.
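The plan discussed above (select Dept tuples through an index on dname, then probe an index on E.dno for matching Emp tuples) can be sketched as an index nested loops join. Python dicts stand in for the hash indexes, and the sample rows and helper names (build_hash_index, toy_employees) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data for the Dept and Emp relations.
dept = [
    {"dno": 1, "dname": "Toy", "mgr": "Ann"},
    {"dno": 2, "dname": "Shoe", "mgr": "Bob"},
]
emp = [
    {"ename": "Carl", "dno": 1, "age": 25},
    {"ename": "Dina", "dno": 2, "age": 31},
    {"ename": "Eve", "dno": 1, "age": 40},
]

def build_hash_index(rows, field):
    """Map each value of `field` to the list of rows carrying that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return index

dname_index = build_hash_index(dept, "dname")   # supports the dname selection
emp_dno_index = build_hash_index(emp, "dno")    # supports probing the inner relation

def toy_employees(dname):
    """Index nested loops: outer = selected Dept tuples, inner = Emp via the dno index."""
    result = []
    for d in dname_index.get(dname, []):
        for e in emp_dno_index.get(d["dno"], []):
            result.append((e["ename"], d["mgr"]))
    return result

rows = toy_employees("Toy")
print(rows)
```

No index on D.dno is consulted anywhere, which mirrors the observation that it is not needed once the dname index drives the outer side.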
GROUP BY E.dno
Clustering is especially important when accessing inner tuples in INL (index nested loops join). Should make the index on E.dno clustered.
Suppose that the WHERE clause is instead: WHERE E.hobby='Stamps' AND E.dno=D.dno
If many employees collect stamps, Sort-Merge join may be worth considering. A clustered index on D.dno would help.
Summary: Clustering is useful whenever many tuples are to be retrieved.
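The Sort-Merge alternative mentioned above can be sketched as follows. This is a minimal in-memory version with invented rows; a real DBMS sorts on disk, and a clustered index on D.dno would let it skip the sort of Dept because the tuples are already stored in dno order.

```python
def sort_merge_join(left, right, key):
    """Join two lists of dict rows on `key` by sorting both and merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

# Hypothetical sample data.
emps = [
    {"ename": "A", "dno": 2, "hobby": "Stamps"},
    {"ename": "B", "dno": 1, "hobby": "Stamps"},
    {"ename": "C", "dno": 2, "hobby": "Chess"},
]
depts = [{"dno": 1, "mgr": "X"}, {"dno": 2, "mgr": "Y"}]

stamp_collectors = [e for e in emps if e["hobby"] == "Stamps"]
joined = sort_merge_join(stamp_collectors, depts, "dno")
pairs = [(row["ename"], row["mgr"]) for row in joined]
print(pairs)
```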
Co-Clustering
It can speed up joins, in particular key-foreign key joins corresponding to 1:N relationships.
A sequential scan of either relation becomes slower.
All inserts, deletes, and updates that alter record lengths become slower.
A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available.

Query: SELECT D.mgr, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno
Suitable index: <E.dno, E.eid>

Query: SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno
Suitable index: <E.dno>

Query: SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno
Suitable index: <E.dno, E.sal> (B-tree trick!)

Query: SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
Suitable index: <E.age, E.sal> or <E.sal, E.age>
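Whether a plan really is index-only can be checked against an actual optimizer. The sketch below assumes SQLite, which reports such plans as using a "COVERING INDEX" in EXPLAIN QUERY PLAN; the sample rows and the index name emp_age_sal_idx are invented, and other systems use different terminology and tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (eid INTEGER PRIMARY KEY, ename TEXT, dno INTEGER, age INTEGER, sal REAL)"
)
conn.executemany(
    "INSERT INTO emp (ename, dno, age, sal) VALUES (?, ?, ?, ?)",
    [
        ("Ann", 1, 25, 4000.0),
        ("Bob", 1, 25, 6000.0),
        ("Cy", 2, 30, 3500.0),
        ("Di", 2, 25, 3000.0),
    ],
)

# Composite index <age, sal>: it contains every attribute the query mentions.
conn.execute("CREATE INDEX emp_age_sal_idx ON emp(age, sal)")

query = "SELECT AVG(sal) FROM emp WHERE age = 25 AND sal BETWEEN 3000 AND 5000"
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
avg_sal = conn.execute(query).fetchone()[0]

print(plan_details)
print(avg_sal)
```

Because both age and sal live in the index, the query is answered from index entries alone and no Emp tuples are fetched.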
Example Schemas
Contracts (Cid, Sid, Jid, Did, Pid, Qty, Val)
Depts (Did, Budget, Report)
Suppliers (Sid, Address)
Parts (Pid, Cost)
Projects (Jid, Mgr)

We will concentrate on Contracts, denoted as CSJDPQV. The following ICs are given to hold: JP → C, SD → P, and C is the primary key.
  What are the candidate keys for CSJDPQV?
  What normal form is this relation schema in?
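The candidate-key question can be worked through mechanically with attribute closures. The following is a brute-force sketch (fine for seven attributes, not meant for large schemas); the function names closure and candidate_keys are introduced here for illustration, and the only inputs are the three constraints listed above.

```python
from itertools import combinations

ATTRS = frozenset("CSJDPQV")
FDS = [
    (frozenset("C"), ATTRS),             # C is the primary key
    (frozenset("JP"), frozenset("C")),   # JP -> C
    (frozenset("SD"), frozenset("P")),   # SD -> P
]

def closure(attrs, fds):
    """Return the set of attributes functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attrs, fds):
    """Enumerate minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            candidate = frozenset(combo)
            if any(k <= candidate for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == attrs:
                keys.append(candidate)
    return keys

keys = candidate_keys(ATTRS, FDS)
print(sorted("".join(sorted(k)) for k in keys))
```

Under these three constraints the enumeration finds C, JP, and SDJ as candidate keys. SD → P then violates BCNF because SD is not a superkey, but not 3NF because P is part of the key JP.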
Denormalization
Suppose that the following query is important: Is the value of a contract less than the budget of the department?
To speed up this query, we might add a field budget B to Contracts.
This introduces the FD D → B with respect to Contracts. Thus, Contracts is no longer in 3NF.
Choice of Decompositions
There are 2 ways to decompose CSJDPQV into BCNF:
  SDP and CSJDQV; lossless-join but not dependency-preserving.
  SDP, CSJDQV and CJP; dependency-preserving as well.
The difference between these is really the cost of enforcing the FD JP → C.
  2nd decomposition: index on JP on relation CJP.
Horizontal Decompositions
Our definition of decomposition: Relation is replaced by a collection of relations that are projections. Most important case. Sometimes, might want to replace relation by a collection of relations that are selections. Each new relation has same schema as the original, but a subset of the rows. Collectively, new relations contain all rows of the original. Typically, the new relations are disjoint.
Horizontal Decompositions
Suppose that contracts with value > 10000 are subject to different rules. This means that queries on Contracts will often contain the condition val>10000.
One way to deal with this is to build a clustered B+ tree index on the val field of Contracts.
A second approach is to replace Contracts by two new relations, LargeContracts and SmallContracts, with the same attributes (CSJDPQV).
  Performs like an index on such queries, but with no index overhead.
  Can build clustered indexes on other attributes, in addition!
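The second approach can be sketched with SQLite through Python's sqlite3 module. The sample rows are invented, and the Contracts view that reassembles the two fragments is an addition of this sketch (one common way to keep existing queries working), not something the slide prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = (
    "cid INTEGER PRIMARY KEY, sid INTEGER, jid INTEGER, "
    "did INTEGER, pid INTEGER, qty INTEGER, val REAL"
)
# Both fragments have the same schema (CSJDPQV) as the original relation.
conn.execute(f"CREATE TABLE LargeContracts ({columns})")
conn.execute(f"CREATE TABLE SmallContracts ({columns})")

# Hypothetical contracts; the last field is val.
contracts = [
    (1, 10, 100, 7, 55, 3, 25000.0),
    (2, 11, 101, 7, 56, 1, 800.0),
    (3, 10, 102, 8, 55, 9, 10000.0),
]
for row in contracts:
    # Route each row by the selection condition val > 10000; fragments stay disjoint.
    table = "LargeContracts" if row[6] > 10000 else "SmallContracts"
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?, ?, ?)", row)

# Assumption of this sketch: a view presents the union as the original relation.
conn.execute(
    "CREATE VIEW Contracts AS "
    "SELECT * FROM LargeContracts UNION ALL SELECT * FROM SmallContracts"
)

large_ids = [r[0] for r in conn.execute("SELECT cid FROM LargeContracts ORDER BY cid")]
total = conn.execute("SELECT COUNT(*) FROM Contracts").fetchone()[0]
print(large_ids, total)
```

Queries restricted to val>10000 now read only LargeContracts, and collectively the two fragments still contain every row of the original.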
Tuning Queries
If a query runs slower than expected, check if an index needs to be re-built, or if statistics are too old.
Sometimes, the DBMS may not be executing the plan you had in mind. Common areas of weakness:
  Selections involving null values.
  Selections involving arithmetic or string expressions.
  Selections involving OR conditions.
  Lack of evaluation features like index-only strategies or certain join methods, or poor size estimation.
Check the plan that is being used! Then adjust the choice of indexes or rewrite the query/view.
Tuning Queries
Complicated by interaction of: NULLs, duplicates, aggregation, subqueries. Guideline: Use only one query block, if possible
SELECT DISTINCT * FROM Sailors S WHERE S.sname IN (SELECT Y.sname FROM YoungSailors Y)
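The one-query-block guideline can be illustrated by rewriting the nested query above as a join and checking that both forms return the same rows. The flattened form is the standard unnesting of an IN subquery and is supplied by this sketch; the Sailors and YoungSailors columns and rows are assumed sample data, run here on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, age REAL)")
conn.execute("CREATE TABLE YoungSailors (sid INTEGER PRIMARY KEY, sname TEXT, age REAL)")
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?, ?)",
    [(1, "Dustin", 45.0), (2, "Lubber", 19.0), (3, "Rusty", 17.0)],
)
conn.executemany(
    "INSERT INTO YoungSailors VALUES (?, ?, ?)",
    [(2, "Lubber", 19.0), (3, "Rusty", 17.0), (4, "Rusty", 16.0)],
)

# Nested form: two query blocks.
nested = conn.execute(
    "SELECT DISTINCT * FROM Sailors S "
    "WHERE S.sname IN (SELECT Y.sname FROM YoungSailors Y) "
    "ORDER BY S.sid"
).fetchall()

# Flattened form: a single query block; DISTINCT removes the duplicates the join creates.
flat = conn.execute(
    "SELECT DISTINCT S.* FROM Sailors S, YoungSailors Y "
    "WHERE S.sname = Y.sname "
    "ORDER BY S.sid"
).fetchall()

print(nested == flat, nested)
```

The duplicate "Rusty" in YoungSailors is the case to watch: without DISTINCT the join form would return that sailor twice.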
Consider DBMS use of index when writing arithmetic expressions: E.age=2*D.age will benefit from index on E.age, but might not benefit from index on D.age!
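A single-table analogue of this effect can be observed in SQLite: a column compared directly to an expression can use its index, while a column wrapped inside an arithmetic expression cannot. This is a sketch of one optimizer's behavior under assumed names (emp, emp_age_idx), not a statement about every DBMS, and it is a simplification of the two-relation join condition in the slide.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eid INTEGER PRIMARY KEY, ename TEXT, age INTEGER)")
conn.execute("CREATE INDEX emp_age_idx ON emp(age)")

# The indexed column stands alone on one side of the comparison.
bare_column = plan(conn, "SELECT ename FROM emp WHERE age = 2 * 20")

# The indexed column is buried inside an arithmetic expression.
wrapped_column = plan(conn, "SELECT ename FROM emp WHERE age * 2 = 40")

print(bare_column)
print(wrapped_column)
```

The first plan searches emp_age_idx; the second falls back to scanning the table, which is the kind of weakness worth checking for when a selection involves arithmetic.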
D.mgrname='Joe'
SELECT T.dno, AVG(T.sal) FROM Temp T GROUP BY T.dno
SELECT E.dno, AVG(E.sal) FROM Emp E, Dept D WHERE E.dno=D.dno AND D.mgrname='Joe' GROUP BY E.dno
If there is a dense B+ tree index on <dno, sal>, an index-only plan can be used to avoid retrieving Emp tuples in the second query.
Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.
  What are the important queries and updates? What attributes/relations are involved?

Indexes must be chosen to speed up important queries (and perhaps some updates!).
  Index maintenance overhead on updates to key fields.
  Choose indexes that can help many queries, if possible.
  Build indexes to support index-only strategies.
  Clustering is an important decision; only one index on a given relation can be clustered!
  Order of fields in composite index key can be important.

Static indexes may have to be periodically re-built.
Statistics have to be periodically updated.