
ADVANCED DATABASE INDEXING

The Kluwer International Series on


ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907

Other books in the Series:

MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil
Jajodia, Binto George ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen ISBN: 0-7923-8253-6
INTERCONNECTING HETEROGENEOUS INFORMATION SYSTEMS, Athman
Bouguettaya, Boualem Benatallah, Ahmed Elmagarmid ISBN: 0-7923-8216-1
FOUNDATIONS OF KNOWLEDGE SYSTEMS: With Applications to Databases
and Agents, Gerd Wagner ISBN: 0-7923-8212-9
DATABASE RECOVERY, Vijay Kumar, Sang H. Son ISBN: 0-7923-8192-0
PARALLEL, OBJECT-ORIENTED, AND ACTIVE KNOWLEDGE BASE
SYSTEMS, Ioannis Vlahavas, Nick Bassiliades ISBN: 0-7923-8117-3
DATA MANAGEMENT FOR MOBILE COMPUTING, Evaggelia Pitoura, George
Samaras ISBN: 0-7923-8053-3
MINING VERY LARGE DATABASES WITH PARALLEL PROCESSING, Alex
A. Freitas, Simon H. Lavington ISBN: 0-7923-8048-7
INDEXING TECHNIQUES FOR ADVANCED DATABASE SYSTEMS, Elisa
Bertino, Beng Chin Ooi, Ron Sacks-Davis, Kian-Lee Tan, Justin Zobel, Boris
Shidlovsky, Barbara Catania ISBN: 0-7923-9985-4
INDEX DATA STRUCTURES IN OBJECT-ORIENTED DATABASES, Thomas
A. Mueck, Martin L. Polaschek ISBN: 0-7923-9971-4
DATABASE ISSUES IN GEOGRAPHIC INFORMATION SYSTEMS, Nabil R.
Adam, Aryya Gangopadhyay ISBN: 0-7923-9924-2
VIDEO DATABASE SYSTEMS: Issues, Products, and Applications, Ahmed K.
Elmagarmid, Haitao Jiang, Abdelsalam A. Helal, Anupam Joshi, Magdy Ahmed
ISBN: 0-7923-9872-6
REPLICATION TECHNIQUES IN DISTRIBUTED SYSTEMS, Abdelsalam A.
Helal, Abdelsalam A. Heddaya, Bharat B. Bhargava ISBN: 0-7923-9800-9
SEARCHING MULTIMEDIA DATABASES BY CONTENT, Christos Faloutsos
ISBN: 0-7923-9777-0
TIME-CONSTRAINED TRANSACTION MANAGEMENT: Real-Time
Constraints in Database Transaction Systems, Nandit R. Soparkar, Henry F.
Korth, Abraham Silberschatz ISBN: 0-7923-9752-5
DATABASE CONCURRENCY CONTROL: Methods, Performance, and Analysis,
Alexander Thomasian, IBM T. J. Watson Research Center ISBN: 0-7923-9741-X
ADVANCED DATABASE INDEXING

by

Yannis Manolopoulos
Aristotle University, Greece

Yannis Theodoridis
Computer Technology Institute, Greece

Vassilis J. Tsotras
University of California, Riverside, U.S.A.

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


Library of Congress Cataloging-in-Publication Data

Manolopoulos, Yannis, 1957-


Advanced database indexing / Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras.
p. cm. -- (The Kluwer international series on advances in database systems ; 17)
Includes bibliographical references and index.
ISBN 978-1-4613-4641-8 ISBN 978-1-4419-8590-3 (eBook)
DOI 10.1007/978-1-4419-8590-3
1. Database management. 2. Indexing. I. Theodoridis, Yannis, 1967- II. Tsotras,
Vassilis J., 1961- III. Title. IV. Series.

QA76.9.D3 M3375 1999


005.74'1--dc21
99-048329

Copyright © 2000 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 2000
Softcover reprint of the hardcover 1st edition 2000

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical,
photocopying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC

Printed on acid-free paper.


To Paulina, Vassiliki and Helga

for their love and patience


Contents

List of Figures xi

List of Tables xv

Contributors xvii

Preface xix

Chapter 1: STORAGE SYSTEMS 1
1. Introduction 1
2. Primary Storage Devices 2
3. Secondary Storage Devices 3
4. Tertiary Storage Devices 8
5. Connecting Storage Together 10
6. Important Issues of Storage Systems 12
7. Alternative Storage Systems 13
8. Future 14
9. Further Reading 15
References 15
Chapter 2: EXTERNAL SORTING 17
1. Introduction 18
2. Run Formation Algorithms 19
3. Merging Algorithms 23
4. Memory Adaptive External Sorting 29
5. Further Reading 34
References 34
Chapter 3: FUNDAMENTAL ACCESS METHODS 37


1. Introduction 37
2. Basic Indices 40
3. External Dynamic Hashing 47
4. Multiattribute Access Methods 53
5. Document Searching 56
6. Further Reading 57
References 57
Chapter 4: ACCESS METHODS FOR INTERVALS 61
1. Introduction 61
2. External Memory Structures for Intervals 69
3. Further Reading 79
References 80
Chapter 5: TEMPORAL ACCESS METHODS 83
1. Introduction 83
2. Transaction-time Indexing 90
3. Bitemporal Indexing 109
4. Further Reading 113
References 113
Chapter 6: SPATIAL ACCESS METHODS 117
1. Introduction 117
2. Spatial Indexing Methods 122
3. Extensions 134
4. Further Reading 136
References 137
Chapter 7: SPATIOTEMPORAL ACCESS METHODS 141
1. Introduction 141
2. The Discrete Spatiotemporal Environment 146
3. The Continuous Spatiotemporal Environment 152
4. Further Reading 162
References 162
Chapter 8: IMAGE AND MULTIMEDIA INDEXING 167
1. Introduction 167
2. Spatial Similarity Retrieval 169
3. Visual Similarity Retrieval 174
4. Extensions 182
5. Further Reading 183
References 184
Chapter 9: EXTERNAL PERFECT HASHING 187
1. Introduction 187
2. Framework and Definitions 188


3. Perfect Hashing and Performance Characteristics 190
4. Dynamic External Perfect Hashing 192
5. Static External Perfect Hashing 196
6. Performance Comparison 205
7. Further Reading 205
References 207
Chapter 10: PARALLEL EXTERNAL SORTING 209
1. Introduction 209
2. Merge-based Parallel Sorting 212
3. Partition-based Parallel Sorting 214
4. Further Reading 216
References 217
Chapter 11: PARALLEL INDEX STRUCTURES 219
1. Introduction 219
2. Declustering Techniques 221
3. Multi-Disk B-trees 224
4. Parallel Linear Quadtrees 226
5. Parallel R-trees 228
6. Parallel S-trees 230
7. Further Reading 232
References 233
Chapter 12: CONCURRENCY ISSUES IN ACCESS METHODS 235
1. Introduction 235
2. Concurrency Control for B+-trees 237
3. Concurrency Control for R-trees 245
4. Concurrency Control for Hash Files 250
5. Further Reading 254
References 255
Chapter 13: LATEST DEVELOPMENTS 259
1. Data Warehouses 259
2. Semistructured Data over the Web 263
3. Main-memory Databases 264
4. Further Reading 266
References 267
Author Index 271
Term Index 279
List of Abbreviations 285
List of Figures

Figure 1.1. Levels of memory hierarchy. 1


Figure 1.2. Organization of a magnetic disk. 3
Figure 1.3. (a) linear, (b) serpentine, (c) helical scan and (d) transverse
tape types. 9
Figure 2.1. 2-way merging of four initial runs. 18
Figure 2.2. Application of first-fit technique. 22
Figure 2.3. Example of 4 runs and 7 buffers. 26
Figure 2.4. Splitting the merging phase. 31
Figure 2.5. Combining merging steps. 31
Figure 3.1. An Indexed Sequential File. 41
Figure 3.2. A B+-tree with q = 5. 44
Figure 3.3. An Extendible Hashing Scheme. 49
Figure 3.4. A Linear Hashing Scheme. 51
Figure 3.5. A Grid File. 54
Figure 4.1. An Interval Tree with U= {1, ... , 8} and n = 5 intervals. 64
Figure 4.2. A Segment Tree with U= {1, ... , 5} and n = 5 intervals. 65
Figure 4.3. An interval I is translated into a point in a 2-dimensional
space. 67
Figure 4.4. A Priority Search Tree. 68
Figure 4.5. An SR-tree. 75
Figure 5.1. An example of a transaction-time evolution. 86
Figure 5.2. Two valid-time databases. 87
Figure 5.3. A bitemporal database. 89
Figure 5.4. The access forest for a given collection of usefulness
intervals. 95
Figure 5.5. Each page is storing data from a time-key range. 97
Figure 5.6. An example of the TSB-tree. 99


Figure 5.7. An example of a plain key-split. 100
Figure 5.8. The Overlapping B-tree. 105
Figure 5.9. Evolution of a set and its linear hashing scheme. 108
Figure 5.10. The bounding-rectangle approach for bitemporal objects. 110
Figure 5.11. The two R-tree methodology for bitemporal data. 110
Figure 6.1. The fundamental spatial data types (points, lines, regions). 118
Figure 6.2. Examples of spatial operators. 119
Figure 6.3. Two-step spatial query processing. 120
Figure 6.4. MBR approximations of objects in Figure 6.2. 121
Figure 6.5. Boundaries and capital locations in Europe. 122
Figure 6.6. The LSD-tree. 124
Figure 6.7. MBR approximations introduce the notions of dead space
and overlapping. 126
Figure 6.8. The R-tree. 128
Figure 6.9. Access of R-tree nodes for (a) point and (b) nearest-neighbor
queries. 130
Figure 6.10. The Quadtree. 132
Figure 6.11. Object o's MBR meets q's MBR while the covering node
rectangle N does not. 135
Figure 7.1. Moving objects in a 2-dimensional reference space; the
third dimension is time. 145
Figure 7.2. A conceptual view of a discrete spatiotemporal evolution. 146
Figure 7.3. Treating time as another dimension. 148
Figure 7.4. Alive objects are stored as slices at the time they are
inserted. 149
Figure 7.5. The overlapping approach. 151
Figure 7.6. Query in the primal plane. 154
Figure 7.7. Query in the (a) dual Hough-X and (b) dual Hough-Y
plane. 155
Figure 7.8. Data regions for R- and k-d trees. 157
Figure 8.1. A symbolic image and its 2-D string representation. 169
Figure 8.2. 2-D string indexing. 170
Figure 8.3. Example image and its corresponding ARG [Petrakis and
Faloutsos, 1997]. 171
Figure 8.4. Using R-trees for spatial similarity retrieval. 172
Figure 8.5. Two solutions of a query involving spatial configuration
of three image objects. 173
Figure 8.6. Mapping of images to points in feature space. 174
Figure 8.7. The X-tree. 176
Figure 8.8. Searching the M-tree. 180
Figure 8.9. Sub-pattern matching. 183
Figure 9.1. (a) hashing function with collisions (b) perfect hashing
function and (c) minimal perfect hashing function. 189
Figure 9.2. A dynamic external perfect hashing scheme. 193
Figure 9.3. Dependency graph of the set of six words {chef, clean,
sigma, take, taken, tea}. 197
Figure 9.4. Searching step for the key set of interest. 199
Figure 9.5. A 2-dimensional array-based trie of the word set of the
example. 203
Figure 9.6. The Packed Trie array for the key set of the example. 204
Figure 10.1. The three architectures proposed to support a parallel
database system. 210
Figure 10.2. (a) backend sorting and (b) distributed sorting. 212
Figure 10.3. Example of a parallel merge-sort. 213
Figure 10.4. Calculation of the exact splitting vector and generation
of fragments. 215
Figure 10.5. Example of load balance and load imbalance for a
system with two processors. 216
Figure 11.1. Using multiple disks to store different files. 220
Figure 11.2. A simple tree-based index. 221
Figure 11.3. Record distribution approach for a three-disk system. 222
Figure 11.4. Example of a super page, partitioned into four disks. 223
Figure 11.5. A B-tree with four nodes. 224
Figure 11.6. Distribution of a B-tree to 3 disks. The horizontal links
are omitted for clarity. 226
Figure 11.7. A binary image (left) and the corresponding Region
Quadtree (right). 227
Figure 11.8. The shaded region represents a range query, which is
the area of interest. 229
Figure 11.9. Page P4 has been split to P4a and P4b. The proximity index
of the MBR of P4b with R1, R2, R3 and R4 is calculated. 230
Figure 11.10. An S-tree example. 231
Figure 12.1. Search structure before and after a split. 238
Figure 12.2. Node layout of a Blink-tree. 240
Figure 12.3. Two stages of split in a Blink-tree. 241
Figure 12.4. Node layout for operation specific locking. 243
Figure 12.5. Node of an Rlink-tree. 246
Figure 13.1. The architecture of a data warehouse. 260
Figure 13.2. A data cube. 261
Figure 13.3. A data integration system. 264
Figure 13.4. (a) A T-tree, (b) a T-tree node. 265
List of Tables

Table 9.1. Set of six words with their random h0, h1 and h2 values. 198
Table 9.2. Levels in the Ordering step of the example of the six word
strings. 198
Table 9.3. (a) vertices with their g values, (b) word strings (keys) with
their computed hash addresses 201
Table 9.4. Performance characteristics of hashing techniques that
produce external perfect hashing functions. 206
Table 12.1. Compatibility table for lock modes. 236
Table 12.2. Alternative compatibility table for lock modes. 236
Table 12.3. Performance results for R-trees 249
Contributors

Yannis Manolopoulos was born in Thessaloniki, Greece in 1957. He received
a B.Eng. (1981) in Electrical Eng. and a Ph.D. (1986) in Computer
Eng., both from the Aristotle University of Thessaloniki. He has been with
the Department of Computer Science of the University of Toronto, the De-
partment of Computer Science of the University of Maryland at College Park
and the Department of Electrical and Computer Eng. of the Aristotle Univer-
sity. Currently, he is Associate Professor at the Department of Informatics of
the latter University. He has published over 90 papers in refereed scientific
journals and conference proceedings. He is author of two textbooks on
data/file structures, which are recommended in the vast majority of the
computer science/engineering departments in Greece. He has served as a PC
member in a number of conferences such as SIGMOD, EDBT, SSD, ADBIS, and
ACM-GIS, and he is currently a member of the Editorial Board of The
Computer Journal. His research interests include spatiotemporal databases,
databases for the web, data mining, data/file structures and algorithms, and
performance evaluation of secondary and tertiary storage systems.
Yannis Theodoridis was born in Athens, Greece in 1967. He received a
B.Eng. (1990) in Electrical Eng. and a Ph.D. (1996) in Electrical and Com-
puter Eng., both from the National Technical University of Athens. Cur-
rently, he is a Senior Researcher at the Computer Technology Institute (CTI)
in Patras, Greece. He has published over 20 papers in refereed scientific
journals and conference proceedings, including Algorithmica, IEEE TKDE,
ACM Multimedia Journal, and the ACM SIGMOD/PODS Conference. He has
served on the PC of the SSD'99 Symposium and the STDBM'99 Workshop.
His research interests include spatial and spatiotemporal databases, multime-
dia systems, access methods, query optimization, and benchmarking.
Vassilis J. Tsotras was born in Athens, Greece in 1961. He received his
B.Eng. in Electrical Eng. from the National Technical University of Athens,
Greece (1985), and the M.S., M.Phil. and Ph.D. degrees from Columbia
University (1986, 1988, and 1991 respectively). Currently he is an Associate
Professor at the Department of Computer Science and Engineering of the
University of California, Riverside (UCR). Before joining UCR he was an
Associate Professor of Computer and Information Science at Polytechnic
University, Brooklyn, NY. His research interests include temporal and spa-
tiotemporal databases, access methods, wireless data dissemination and par-
allel database systems. He has published over 45 papers in refereed scientific
journals and conference proceedings. Dr. Tsotras has served as PC member
in various database conferences and workshops, including ICDE'98,
SIGMOD'99, ICDE'00 and EDBT'00. He was the Program Co-Chair of the
5th International Workshop on Multimedia Information Systems (MIS'99).
His research has been funded through grants from the National Science
Foundation, DARPA, the Department of Defence, etc. In 1991 he received
the National Science Foundation's Research Initiation Award.
Preface

Thanks to the development of Database Management Systems (DBMSs),
application programmers no longer need to build their own file systems, and
the emphasis has moved to design and tuning issues. Thus, physical design
(e.g. access methods, query optimization techniques, transaction processing)
might seem to be a stabilized area. For example, the standard "ubiquitous"
B-tree is still the dominant structure in virtually all DBMSs, although many
elegant variations as well as other robust structures have appeared in the
literature. The reason is that industry is unwilling to change certain
modules, since such a costly operation would improve system performance only
marginally. As a result, the price/performance ratio would increase instead
of decreasing.
However, in modern applications, new data types are required (such as
images, text, video, voice, intervals, etc.) in addition to the standard ones
(e.g. integer, real, date and the like) that are found in all commercial
systems. In recent years, there has been ongoing research in the following
areas:
• indexing methods for special purpose databases, such as multidimensional,
  spatial, temporal, multimedia, text, object-oriented, or WWW databases, or
  databases used for data mining or on-line analytical processing,
in connection with
• database architectures, e.g. client/server, distributed and parallel
  databases, which means that access methods must be distributed or parallel
  as well.
It is only recently that a "new" data structure, i.e. the R-tree, has been
integrated in certain systems, such as Oracle, Informix/Illustra, O2, etc.,
almost 10-15 years after its appearance in the literature. Thus, the belief
that physical level issues have come to a steady state no longer seems to be
correct.
Although transparent to the user of a DBMS, access methods play a key
role in database performance. Thus, careful tuning or selection of the
appropriate access method is important in order to develop efficient systems
in the present era of transition towards object-relational and other special
purpose systems. Also, an understanding of the state of the art is essential
in order to propose more efficient indexing techniques.
This book may serve as a textbook for graduate students specializing in
database systems, or for database professionals who are keen to become
acquainted with recent developments. Emphasis has been placed on structure
description, implementation techniques, and the operations performed. Most
books on related topics are based on COBOL, PL/1, Pascal, C or a
pseudo-language. This book uses a simple algorithmic pseudo-language (for
some of the access methods), and the interested reader is encouraged to
implement some of them. Note also that code for certain structures can be
found on the Internet.
The book is divided into two parts. The first part, consisting of 3 chapters,
contains some fundamentals, which more or less may be found in most books on
file structures, physical database design, or computer architecture. It
serves as background knowledge for the second part, which consists of 10
chapters dealing with more advanced material on access methods. Every
chapter ends with references for further reading.
Chapter 1 briefly discusses issues related to storage media, such as mag-
netic disks, optical disks and tertiary storage, parallel disks and RAID sys-
tems. The next chapter describes external sorting methods, introducing no-
tions useful for a chapter in the second part of the book, dedicated to parallel
external sorting. The third chapter is about the most important file structures,
which are currently used in any DBMS, such as B+-trees, Hashing with
Chaining, Linear Hashing and Inverted Files, as well as other popular struc-
tures such as Grid Files and k-d trees.
The first chapter of the second part, Chapter 4, explains some structures
that are used to store not single integers but ranges of integers, i.e.
intervals. These structures are Segment Indices, Interval B-trees and
External Segment trees. Chapter 5 concerns structures for temporal databases,
such as the Snapshot index, Time-Split B-tree, Multiversion B-trees and
Overlapping B-trees. In Chapter 6 we examine structures used in spatial
databases and GISs. The best-known methods, Quadtrees, R-trees and variants,
are examined along with other interesting structures such as LSD-trees.
Chapter 7 deals with spatiotemporal data, in other words, spatial data that
evolve over time. Certain indexing techniques based on overlapping and
partial persistence are described. Chapter 8 examines representations and
access methods for image and multimedia databases, such as 2-D strings,
X-trees, M-trees and R-tree based methods.
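To make the kind of query that the interval structures of Chapter 4 are designed to answer more concrete, the following minimal Python sketch evaluates two typical interval queries by a plain linear scan. It is only a baseline for illustration, not one of the book's structures: the function names, the closed-interval convention and the sample data are assumptions introduced here. An index such as an Interval B-tree aims to return the same answers without inspecting every stored interval.

```python
from typing import List, Tuple

# Hypothetical representation: a closed interval [low, high] of integers.
Interval = Tuple[int, int]


def stabbing_query(intervals: List[Interval], point: int) -> List[Interval]:
    """Return all intervals containing `point`, by scanning every interval."""
    return [(low, high) for (low, high) in intervals if low <= point <= high]


def intersection_query(
    intervals: List[Interval], q_low: int, q_high: int
) -> List[Interval]:
    """Return all intervals that overlap the closed query range [q_low, q_high]."""
    return [
        (low, high)
        for (low, high) in intervals
        if low <= q_high and q_low <= high
    ]


if __name__ == "__main__":
    data = [(1, 4), (3, 8), (6, 7), (10, 12)]  # made-up sample intervals
    print(stabbing_query(data, 3))
    print(intersection_query(data, 5, 10))
```

Both functions cost time proportional to the number of stored intervals, which is exactly the cost that a disk-based interval index tries to avoid.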
Chapter 9 contains material based on hashing. For a number of years,
perfect hashing was considered a method useful exclusively for main memory
applications, e.g. for tables with a small number of keywords. Here we
describe two methods that can apply perfect hashing to very large numbers of
records. Chapter 10 returns to the issue of external sorting, which was
examined in the second chapter. However, due to the advent of new
architectures and fast disks, a parallel environment is assumed, and thus
the approach is quite different. Chapter 11 introduces the notion of
declustering, i.e. techniques used to distribute a single file structure
across several disks. In this context, B-trees, R-trees, Quadtrees and
Linear Hashing will be examined. In Chapter 12 we deal with an important
issue related to the performance of indices, i.e. concurrency control, and
we examine particular techniques such as the Blink-tree and Rlink-tree
methods. Finally, the book ends with a chapter dedicated to the newest
developments in indexing, such as indexing for on-line analytical
processing, data warehouses, semistructured data and main-memory databases.
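As a rough illustration of the declustering idea mentioned for Chapter 11, the sketch below spreads the pages of a single file over several disks in round-robin order. Round-robin placement is only one simple policy, chosen here as an assumption for illustration; it is not claimed to be the scheme the book develops, and the function names and page identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def decluster_round_robin(
    page_ids: List[int], num_disks: int
) -> Dict[int, List[int]]:
    """Assign the i-th page of a file to disk i mod num_disks."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    placement: Dict[int, List[int]] = {disk: [] for disk in range(num_disks)}
    for position, page_id in enumerate(page_ids):
        placement[position % num_disks].append(page_id)
    return placement


def max_pages_on_one_disk(
    placement: Dict[int, List[int]], requested: Iterable[int]
) -> int:
    """Largest number of requested pages that fall on a single disk.

    If the disks work in parallel, this is a crude proxy for the response
    time of a request, measured in page reads.
    """
    wanted = set(requested)
    return max(
        sum(1 for page_id in pages if page_id in wanted)
        for pages in placement.values()
    )


if __name__ == "__main__":
    placement = decluster_round_robin(list(range(10, 16)), num_disks=3)
    print(placement)
    print(max_pages_on_one_disk(placement, [11, 12, 13]))
```

With this placement, a request for three consecutive pages touches each of the three disks once, so the pages can be fetched in parallel rather than one after another from a single disk.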
Thanks are due to many friends and colleagues for their help during the
various stages of authoring this book. In particular, we would like to thank
(in alphabetical order) Robert Alcock, Alex Biliris, Alex Nanopoulos, Nikos
Karayannidis, George Kollios, Dimitris Papadias, Apostolos Papadopoulos,
Evi Pitoura, Timos Sellis, Eleni Tousidou, Michael Vassilakopoulos, and
especially Theodoros Tzouramanis. Also, Scott Delman and Melissa Fearon
of Kluwer Academic Publishers provided invaluable support.

Yannis Manolopoulos
Yannis Theodoridis
Vassilis J. Tsotras
