Indexing
Chapter 8
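[Figure: system layers. The File/Index layer works with records, identified by record key or record ID, on top of the Buffer Manager and the Storage Manager. An index on a search key holds entries that pair a search key value with the record ID (*) of a data record.]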
A data entry for search key value k can be one of:
1. The data record itself (with search key value k)
2. <k, rid> pair, where rid is the record id of data record with search key value k
3. <k, rid-list> pair, where rid-list is a list of rids of data records with search key k
Choice of alternative for data entries is orthogonal to the indexing technique.
[Figure: data entries pointing to data records]
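The three alternatives can be sketched as plain data structures. This is a minimal illustration, not any DBMS's actual layout; the class names, the `(page, slot)` shape of a rid, and the helper `to_alt3` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rid = Tuple[int, int]  # assumed record id shape: (page number, slot number)

@dataclass
class Alt1Entry:
    """Alternative 1: the data entry is the data record itself."""
    key: int
    record: dict

@dataclass
class Alt2Entry:
    """Alternative 2: <k, rid> pair."""
    key: int
    rid: Rid

@dataclass
class Alt3Entry:
    """Alternative 3: <k, rid-list> pair."""
    key: int
    rids: List[Rid]

def to_alt3(entries: List[Alt2Entry]) -> List[Alt3Entry]:
    """Group <k, rid> pairs that share a key into <k, rid-list> pairs."""
    grouped = {}
    for e in entries:
        grouped.setdefault(e.key, []).append(e.rid)
    return [Alt3Entry(k, rids) for k, rids in sorted(grouped.items())]
```

Alternative 3 stores each key once, so it tends to be more compact than Alternative 2 when many records share a key, at the price of variable-length entries.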
Alternative 1:
[Figure: an index structure ordered by key in which the data records (DR) themselves are stored, below the auxiliary information used to locate them.]
B-tree
B-tree can be used to implement Alternative 1
Data records (instead of data entries) are stored in the tree nodes.
The tree is relatively large.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
[Figure: an index ordered by key whose data entries point to data records in a heap file; the index is smaller than with Alternative 1.]
Index Classification
Clustered Index
Suppose that Alternative (2) is used for data entries,
and that the data records are stored in a Heap file.
[Figure: CLUSTERED index. Data entries (index file) point to data records (data file) kept in a heap file.]
Clustered vs Unclustered
Alternative 1 implies clustered; in practice, clustered also implies Alternative 1 (since sorted files are rare).
[Figure: clustered index with <k, rid> data entries sorted according to SSN, over a data file sorted according to SSN. Data file columns: SSN, Age, Income, Phone.]
Hash-based Index
[Figure: h(key) = ID of hash bucket. Buckets are numbered 0 to N-1; a bucket can have an overflow page, and its entries hold record IDs pointing to data records.]
An entry may be a record, a <k, rid>, or a <k, list_rid>.
Hash-Based Indexes
Good for equality selections.
Index is a collection of buckets.
Bucket = primary page plus zero or more overflow pages.
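The bucket structure above can be sketched in a few lines. This is a toy in-memory model under stated assumptions: pages are Python lists, entries are <k, rid> pairs, and the names `Bucket`, `HashIndex`, `page_capacity`, and `n_buckets` are invented for the example.

```python
class Bucket:
    """A bucket: one primary page plus zero or more overflow pages."""

    def __init__(self, page_capacity):
        self.page_capacity = page_capacity
        self.pages = [[]]  # pages[0] is the primary page

    def insert(self, entry):
        if len(self.pages[-1]) == self.page_capacity:
            self.pages.append([])  # last page is full: chain an overflow page
        self.pages[-1].append(entry)

    def lookup(self, key):
        # Every page of the bucket may have to be read.
        return [rid for page in self.pages for (k, rid) in page if k == key]


class HashIndex:
    """Index = collection of buckets; h(key) picks the bucket."""

    def __init__(self, n_buckets=4, page_capacity=2):
        self.buckets = [Bucket(page_capacity) for _ in range(n_buckets)]

    def _h(self, key):
        return hash(key) % len(self.buckets)  # h(key) = ID of hash bucket

    def insert(self, key, rid):
        self.buckets[self._h(key)].insert((key, rid))

    def search(self, key):
        return self.buckets[self._h(key)].lookup(key)
```

An equality search touches only one bucket, which is why this organization suits equality selections; a range of keys is scattered over all buckets.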
[Figure: tree-structured index with non-leaf pages above leaf pages.]
Example B+ Tree
Root: 17
Entries <= 17: index page with keys 5, 13
Entries > 17: index page with keys 27, 30
Leaf pages shown: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29*
Find 28*? 29*?
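The two lookups on the example tree can be traced with a short sketch. The nested-tuple representation and the function `search` are assumptions made for illustration; keys equal to a separator are routed to the right child, which matches where 5* and 27* sit in the leaves. The leaf for keys of 30 and above is not legible in the slide text, so it is left empty here.

```python
import bisect

# Non-leaf pages are (separator keys, children); leaf pages are sorted key lists.
LEFT = ([5, 13], [[2, 3], [5, 7, 8], [14, 16]])
RIGHT = ([27, 30], [[22, 24], [27, 29], []])  # last leaf not shown in the slide text
ROOT = ([17], [LEFT, RIGHT])


def search(node, key):
    """Return (found, pages_visited) for an equality search."""
    pages = 1
    while isinstance(node, tuple):  # descend through non-leaf pages
        keys, children = node
        node = children[bisect.bisect_right(keys, key)]
        pages += 1
    return key in node, pages


found_28, pages_28 = search(ROOT, 28)  # reaches leaf 27* 29*, 28 is absent
found_29, pages_29 = search(ROOT, 29)  # same leaf, 29 is present
```

Both searches read one page per level, so the cost depends on the height of the tree rather than on the number of leaf pages.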
B: Number of data pages
R: Number of records per page
D: Average time to read/write disk page
Good enough to show the overall trends!
Review
[Figure: a binary tree with 2^2 = 4 leaves; number of levels is log_2 4 = 2.]
Cost Computation
[Figure: tree over B leaf pages. Fanout is F; height is log_F B.]
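As a rough check of the height formula, the sketch below counts how many levels of fanout F are needed to reach B leaf pages. The function name `tree_height` and the sample values are assumptions for illustration; integer arithmetic is used so the result is the exact ceiling of log_F B.

```python
def tree_height(num_leaf_pages: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= num_leaf_pages, i.e. ceil(log_F B)."""
    height, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout  # one more level multiplies the reachable pages by F
        height += 1
    return height


# With a fanout of 100, a million leaf pages need only three levels.
levels = tree_height(1_000_000, 100)
```

A large fanout is what keeps the tree shallow: each extra level multiplies the number of reachable leaf pages by F.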
Cost Computation
1 more I/O to read the data record
Cost Computation
Operations to Compare
Scan
Equality search
Range selection
Insert a record
Delete a record
Heap Files:
Sorted Files:
Indexes:
Alt (2), (3): data entry size = 10% size of record
Hash: No overflow buckets.
[Figure: a file of 400 records. With full disk pages it takes 4 pages of 100 records each; at 80% occupancy it takes 5 pages of 80 records each, a larger file size.]
80% occupancy: use 25% more pages
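The 25% figure follows from simple arithmetic, sketched below. The function name `pages_needed` and the percentage parameter are assumptions chosen for the example; the record counts are the ones in the figure.

```python
def pages_needed(num_records, records_per_page, occupancy_pct=100):
    """Pages needed when each page is filled to occupancy_pct percent."""
    usable = records_per_page * occupancy_pct // 100  # records actually stored per page
    return -(-num_records // usable)                  # ceiling division


full = pages_needed(400, 100)        # full pages
at_80 = pages_needed(400, 100, 80)   # pages at 80% occupancy
extra = (at_80 - full) / full        # fraction of additional pages
```

The same reasoning at 67% occupancy gives roughly 1/0.67, about 1.5 times as many pages, which is where the 1.5B factor in the tree-index costs comes from.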
Cost of Operations
Several assumptions underlie these (rough) estimates!
D: Time to read or write disk page
R: Number of records per page

| File type | Scan | Equality | Range | Insert | Delete |
|---|---|---|---|---|---|
| Heap | BD | 0.5BD | BD | 2D | Search + D |

Heap files (not sorted; insert at eof). Delete: Search + D, where D is to write the page.
Rows still to be filled in: Sorted, Clustered, Unclustered Tree Index, Unclustered Hash Index.
Cost of Operations
Several assumptions underlie these (rough) estimates!

| File type | Scan | Equality | Range | Insert | Delete |
|---|---|---|---|---|---|
| Heap | BD | 0.5BD | BD | 2D | Search + D |
| Sorted | BD | D log_2 B | D(log_2 B + # matching pages) | Search + BD | Search + BD |

Sorted files, sorted on <age,sal>.
Insert: Search, then fetch and rewrite the latter half of the file after adding the new record.
Cost of Operations
Several assumptions underlie these (rough) estimates!

| File type | Scan | Equality | Range | Insert | Delete |
|---|---|---|---|---|---|
| Heap | BD | 0.5BD | BD | 2D | Search + D |
| Sorted | BD | D log_2 B | D(log_2 B + # matching pages) | Search + BD | Search + BD |
| Clustered | 1.5BD | D log_F(1.5B) | D(log_F(1.5B) + # matching pages) | Search + D | Search + D |

Clustered B+ tree file, Alternative (1), search key <age,sal>.
Scan: 67% page occupancy, 50% more pages to scan.
Equality: log_F(1.5B) is the height of the tree, with 1.5B as the number of pages as leaf nodes.
Insert: Search + 1 write to insert the new record.
Cost of Operations
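Heap File w/ Unclustered B+ tree
SCAN (to obtain data records in sorted order)
SCAN COST:
Data size is B pages; at 67% occupancy the same data would take 1.5B pages
Data entry is only 10% the size of data record
# leaf pages is 0.1(1.5B) = 0.15B
Cost of scanning the leaf pages is 0.1(1.5B)D
Fetching the records then costs one I/O per record, BRD, for a total of BD(R+0.15)
[Figure: index with 0.1(1.5)B leaf pages above a data file of B pages]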
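The scan cost can be put side by side with a plain heap scan. This is a sketch of the arithmetic only; the function name and the sample values B = 1000, R = 100, D = 1 are assumptions, and the one-I/O-per-record term is the one that appears in the summary table as BD(R+0.15).

```python
def unclustered_tree_scan_cost(B, R, D, entry_ratio=0.1, occupancy_factor=1.5):
    """Scan in sorted order through an unclustered B+ tree: BD(R + 0.15)."""
    leaf_pages = entry_ratio * occupancy_factor * B  # 0.1(1.5B) leaf pages
    leaf_scan = leaf_pages * D                       # read every leaf page
    record_fetches = B * R * D                       # one I/O per data record
    return leaf_scan + record_fetches


def heap_scan_cost(B, D):
    """Scan of the heap file itself: BD."""
    return B * D


ordered = unclustered_tree_scan_cost(B=1000, R=100, D=1.0)
unordered = heap_scan_cost(B=1000, D=1.0)
```

The ordered scan is about R times more expensive than the heap scan, because the unclustered index forces a separate page read for almost every record.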
Cost of Operations
Heap File w/ Unclustered B+ tree
EQUALITY SEARCH
Search for the matching data entry in the index
Fetch the corresponding data record from the data file
SEARCH COST:
Cost of searching the index
# leaf pages is 0.1(1.5B), so tree height is log_F(0.15B)
Descending the index tree visits log_F(0.15B) pages
Cost of finding the matching data entry is D log_F(0.15B)
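Adding the one I/O that fetches the data record gives the equality cost D(1 + log_F(0.15B)) used in the summary table. The sketch below evaluates it; the function name and the sample values B = 1,000,000 and F = 100 are assumptions for illustration.

```python
import math


def unclustered_tree_equality_cost(B, D, F):
    """Equality search via an unclustered B+ tree: D(1 + log_F(0.15B))."""
    index_descent = D * math.log(0.15 * B, F)  # pages visited going down the tree
    record_fetch = D                           # 1 more I/O to read the data record
    return index_descent + record_fetch


cost = unclustered_tree_equality_cost(B=1_000_000, D=1.0, F=100)
heap_cost = 0.5 * 1_000_000 * 1.0  # 0.5BD for the same search on a plain heap file
```

With these numbers the index answers the query in under four page reads, against half a million for the heap file.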
Cost of Operations
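Heap File w/ Unclustered B+ tree
Equality (from last slide): D(1 + log_F(0.15B))
  D log_F(0.15B): search the B+ tree
  D: fetch the data record
Range Selection: D(# matches + log_F(0.15B))
  D(# matches): fetching each match in the range incurs one I/O
  D log_F(0.15B): search the B+ tree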
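Because every match costs an I/O, the unclustered index only pays off for selective ranges. The sketch below compares the two formulas; the function names and the sample values are assumptions for illustration.

```python
import math


def unclustered_tree_range_cost(B, D, F, matches):
    """Range selection via an unclustered B+ tree: D(# matches + log_F(0.15B))."""
    return D * (matches + math.log(0.15 * B, F))


def heap_range_cost(B, D):
    """Range selection on a heap file scans everything: BD."""
    return B * D


B, D, F = 1000, 1.0, 100
selective = unclustered_tree_range_cost(B, D, F, matches=10)
unselective = unclustered_tree_range_cost(B, D, F, matches=5000)
full_scan = heap_range_cost(B, D)
```

Once the number of matches approaches the number of data pages B, scanning the whole file becomes the cheaper plan.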
Cost of Operations
Heap File w/ Unclustered B+ tree
INSERT
INSERT COST:
Insert the new record into the heap file: 2D
Search the B+ tree and insert the data entry: D log_F(0.15B) + D
Total: D + D log_F(0.15B) + 2D = D(3 + log_F(0.15B))
Cost of Operations
Heap File w/ Unclustered B+ tree
Insert (from last slide): D(3 + log_F(0.15B))
  1 I/O to insert the data entry + 2 I/Os to insert the new record + search the B+ tree
Delete: D(3 + log_F(0.15B)) = 2D + Search
  1 I/O to delete the data entry + 2 I/Os to delete the data record + search the B+ tree
  2D: 1 I/O to write back the data-entry page and another I/O to write back the data-record page
Cost of Operations
Several assumptions underlie these (rough) estimates!
Heap file with unclustered B+ tree index on search key <age,sal>

| File type | Scan | Equality | Range | Insert | Delete |
|---|---|---|---|---|---|
| Heap | BD | 0.5BD | BD | 2D | Search + D |
| Sorted | BD | D log_2 B | D(log_2 B + # matches) | Search + BD | Search + BD |
| Clustered | 1.5BD | D log_F(1.5B) | D(log_F(1.5B) + # matching pages) | Search + D | Search + D |
| Unclustered Tree Index | BD(R+0.15) | D(1 + log_F(0.15B)) | D(log_F(0.15B) + # matches) | D(3 + log_F(0.15B)) | Search + 2D |

Unclustered tree index:
Scan: cost of scanning data entries is 0.1(1.5B)D, plus one I/O per record.
Range: each match requires an I/O.
Insert: 1 I/O to insert the data entry + 2 I/Os to insert the data record + Search.
Delete: Search + 1 I/O to write back the data-entry page and 1 I/O to write back the data-record page.
Cost of Operations
Several assumptions underlie these (rough) estimates!
Heap file with unclustered hash index

| File type | Scan | Equality | Range | Insert | Delete |
|---|---|---|---|---|---|
| Heap | BD | 0.5BD | BD | 2D | Search + D |
| Sorted | BD | D log_2 B | D(log_2 B + # matches) | Search + BD | Search + BD |
| Clustered | 1.5BD | D log_F(1.5B) | D(log_F(1.5B) + # matching pages) | Search + D | Search + D |
| Unclustered Tree Index | BD(R+0.15) | D(1 + log_F(0.15B)) | D(log_F(0.15B) + # matches) | D(3 + log_F(0.15B)) | Search + 2D |

Unclustered hash index:
Scan: cost of scanning data entries is 1.25(0.1B)D.
Range: hash structure cannot help.
Insert, Delete: 2D to update the index file and 2D to update the data file.
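The cost table for the heap, sorted, clustered, and unclustered tree organizations can be evaluated for concrete parameters. This is a sketch under stated assumptions: the function name `cost_table`, the dictionary layout, and the sample values are invented for the example, and "Search" in the Insert and Delete columns is taken to mean the equality-search cost of the same file organization.

```python
import math


def cost_table(B, R, D, F, matches, matching_pages):
    """Rough I/O cost estimates for four file organizations."""
    log2_b = math.log2(B)
    log_f_15 = math.log(1.5 * B, F)    # clustered tree: 1.5B leaf pages
    log_f_015 = math.log(0.15 * B, F)  # unclustered tree: 0.15B leaf pages
    eq = {  # equality-search cost, reused as "Search"
        "heap": 0.5 * B * D,
        "sorted": D * log2_b,
        "clustered": D * log_f_15,
        "unclustered_tree": D * (1 + log_f_015),
    }
    return {
        "heap": {"scan": B * D, "equality": eq["heap"], "range": B * D,
                 "insert": 2 * D, "delete": eq["heap"] + D},
        "sorted": {"scan": B * D, "equality": eq["sorted"],
                   "range": D * (log2_b + matching_pages),
                   "insert": eq["sorted"] + B * D,
                   "delete": eq["sorted"] + B * D},
        "clustered": {"scan": 1.5 * B * D, "equality": eq["clustered"],
                      "range": D * (log_f_15 + matching_pages),
                      "insert": eq["clustered"] + D,
                      "delete": eq["clustered"] + D},
        "unclustered_tree": {"scan": B * D * (R + 0.15),
                             "equality": eq["unclustered_tree"],
                             "range": D * (log_f_015 + matches),
                             "insert": D * (3 + log_f_015),
                             "delete": eq["unclustered_tree"] + 2 * D},
    }


table = cost_table(B=1024, R=100, D=1.0, F=100, matches=50, matching_pages=1)
```

Printing `table` for a few workloads makes the trade-off visible: the heap file is cheapest to update, while the tree organizations win on equality and selective range searches.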
Choice of Indexes
Clustered? Hash/tree?
Employees older than 40:
SELECT E.dno
FROM Employees E
WHERE E.age > 40

SELECT E.name
FROM Employees E
WHERE E.dno=123
(many employees)
SELECT E.name
FROM Employees E
WHERE E.age = 20 AND E.sal = 50000
SELECT E.dno
FROM Emp E
WHERE E.age>30
the condition ?
On age
Examples of Clustered
Indexes (2)
Using E.age index may be costly:
Retrieve tuples with E.age > 25
Sort the tuples on dno, and count number of tuples for each dno
Good to know DBMS internals
On age
An unclustered index on E.eid is good enough for the second query since no two employees have the same E.eid.
SELECT E.dno
FROM Emp E
WHERE E.eid=32816945
Composite Search Keys: Search on a combination of fields.
[Figure: four indexes over the same data records]
Data records sorted by name (name, age, sal): (name not legible) 12 10; cal 11 80; joe 12 20; sue 13 75
Data entries <age, sal>: 11,80  12,10  12,20  13,75
Data entries <age>: 11  12  12  13
Data entries <sal, age>: 10,12  20,12  75,13  80,11
Data entries sorted by <sal>: 10  20  75  80
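The four orderings in the figure can be reproduced by sorting the same records on different key combinations. In this sketch the first record's name is not legible in the figure, so `"r1"` is a stand-in; the variable names are likewise assumptions.

```python
# (name, age, sal); "r1" is a placeholder for the record whose name is not legible
records = [("r1", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of each index are kept sorted on the index's search key.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)
age_only = sorted(age for _, age, _ in records)
sal_only = sorted(sal for _, _, sal in records)

# Entries with age = 12 are adjacent in <age, sal> order but not in <sal, age> order.
age_12 = [entry for entry in age_sal if entry[0] == 12]
```

The order of the fields matters: <age, sal> groups equal ages together, while <sal, age> groups equal salaries together.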
key<age,sal> key<sal,age>
best.
Database Example
We use this database for the following query examples
Dept(dno, budget, mgr)
Emp(eid, dno, age, phone, sal)
[Figure: a foreign key links the two tables]
Find managers of departments with at least one employee
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno
Index on dno
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno
Index on dno
If <E.dno> index is available, need only scan the data entries and count employees for each dno
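An index-only evaluation of this query can be sketched as a single pass over the sorted data entries. The function name `count_per_dno`, the <dno, rid> entry shape, and the sample entries are assumptions for illustration; no data record is ever fetched.

```python
from itertools import groupby


def count_per_dno(dno_entries):
    """Count employees per department from sorted <dno, rid> data entries."""
    return {
        dno: sum(1 for _ in group)
        for dno, group in groupby(dno_entries, key=lambda entry: entry[0])
    }


# Hypothetical data entries of a tree index on <E.dno>, already in key order.
entries = [(10, "rid1"), (10, "rid7"), (20, "rid3"), (30, "rid2"), (30, "rid5"), (30, "rid9")]
counts = count_per_dno(entries)
```

Since the entries arrive sorted on dno, each group is contiguous and no separate sort or hash table is needed for the GROUP BY.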
If <E.dno,E.sal> tree index is available, need only scan the data entries and compute MIN(E.sal) for each dno
Compute average salary of young executives
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
If <E.age,E.sal> or <E.sal,E.age> tree index is available, the average salary can be computed using only data entries in the index
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age=30
GROUP BY E.dno

Using <dno,age> index:
Scan all data entries
For each dno, count number of tuples with age=30

Using <age,dno> index: Better !!
Do not scan all ages
Use index to find first data entry w/ age = 30
Scan data entries with age = 30, and count number of tuples for each dno (the departments are arranged contiguously for age=30)
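The difference between the two indexes shows up in how many data entries are scanned. In this sketch both indexes are modeled as sorted Python lists of data entries; the function names and the sample entries are assumptions for illustration.

```python
import bisect
from collections import Counter


def count_with_age_dno(entries, age):
    """entries: sorted (age, dno) data entries. Scans only the entries with this age."""
    start = bisect.bisect_left(entries, (age,))  # first data entry with this age
    counts, scanned = Counter(), 0
    for entry_age, dno in entries[start:]:
        if entry_age != age:
            break  # past the last entry with this age
        counts[dno] += 1
        scanned += 1
    return dict(counts), scanned


def count_with_dno_age(entries, age):
    """entries: sorted (dno, age) data entries. Every entry must be scanned."""
    counts = Counter(dno for dno, entry_age in entries if entry_age == age)
    return dict(counts), len(entries)


emps = [(25, 1), (29, 3), (30, 1), (30, 2), (30, 2), (41, 1)]  # hypothetical (age, dno)
by_age_dno = sorted(emps)
by_dno_age = sorted((dno, age) for age, dno in emps)

result_a, scanned_a = count_with_age_dno(by_age_dno, 30)
result_b, scanned_b = count_with_dno_age(by_dno_age, 30)
```

Both plans return the same counts, but the <age,dno> plan reads only the run of entries with age = 30.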
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>30
GROUP BY E.dno
No sorting. Better !
Summary
Summary (Contd.)
Summary (Contd.)
Understanding
Indexes