Storage Management
Prof. Beat Signer
Department of Computer Science, Vrije Universiteit Brussel
http://www.beatsigner.com
2 December 2005
[Figure: components of a DBMS, including Authorisation Control, Catalogue Manager, Integrity Checker, Command Processor, Query Optimiser, Scheduler, Buffer Manager, Recovery Manager, Access Methods, File Manager and System Buffers]
Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010
[Figure: memory hierarchy, from fastest to slowest: cache, main memory, flash memory, magnetic disk, optical disk, magnetic tapes]
Cache
On-board cache on the same chip as the microprocessor, e.g. level 1 (L1) cache
- temporary storage of instructions and data
- typical size of ~64 kB
Level 2 (L2) cache
- typical size of ~1 MB
Main Memory
Main memory can be several gigabytes large
- normally too small and too expensive for storing the entire database
- content is lost during power failure or crash (volatile memory)
- in-memory databases (IMDB) primarily rely on main memory
  - note that IMDBs lack durability (the D of the ACID properties)
- time to access data is more or less independent of its location (different from magnetic tapes)
Magnetic Disk
- the transfer units are blocks (tendency for larger block sizes)
- the buffer manager of the DBMS manages the loading and unloading of blocks for specific DBMS operations
- access is ~1'000'000 times slower than main memory access
Hard Disk
A hard disk contains one or more platters and one or more heads
- typical block I/O time (seek time, rotational delay and transfer time) of ~10 milliseconds
- SSDs might help to reduce the gap between primary and secondary storage in DBMS systems
  - the limited number of SSD write operations before failure can be a problem for DBs with a lot of update operations
  - write operations are often still much slower than read operations
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 8
Tertiary Storage
No random access
Different devices
- tape silos
  - room-sized devices holding racks of tapes operated by tape robots
  - e.g. StorageTek PowderHorn with up to 28.8 petabytes
Models of Computation
RAM model of computation
- assumes that all data is held in main memory
I/O model of computation
- assumes that data does not fit into main memory
- efficient algorithms must take into account secondary and even tertiary storage
- the best algorithms for processing large amounts of data often differ from those for the RAM model of computation
- minimising disk accesses plays a major role
- the time to move a block between disk and memory is much higher than the time for the corresponding computation
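The gap the I/O model is built on can be made concrete with back-of-the-envelope arithmetic; the two timing figures below are illustrative assumptions, not measurements:

```python
# Rough comparison of one random block I/O against one in-memory operation.
# Both figures are assumed round numbers for illustration only.
DISK_BLOCK_IO_S = 10e-3   # assumed ~10 ms per random block I/O
MEMORY_OP_S = 10e-9       # assumed ~10 ns per in-memory operation

ops_per_io = DISK_BLOCK_IO_S / MEMORY_OP_S
print(f"one block I/O costs as much as {round(ops_per_io):,} in-memory operations")
```

Under these assumptions a single saved block access pays for about a million in-memory operations, which is why I/O-efficient algorithms count block accesses rather than CPU steps.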
- placement of blocks that are often accessed together on the same disk cylinder
- distribute data across multiple disks to profit from parallel disk accesses (e.g. RAID)
- mirroring of data
- use of disk scheduling algorithms in OS, DBMS or disk controller to determine the order of requested block reads/writes
  - e.g. elevator algorithm
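The elevator algorithm can be sketched as a single SCAN-style pass (a minimal illustration, not a real disk controller's implementation; the cylinder numbers are made up):

```python
def elevator_order(head, requests, direction_up=True):
    """Elevator (SCAN) disk scheduling sketch: service all pending cylinder
    requests in the current direction of head movement, then reverse."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down if direction_up else down + up

# head at cylinder 50, pending requests on both sides of it
print(elevator_order(50, [10, 22, 55, 70, 33, 90]))  # [55, 70, 90, 33, 22, 10]
```

Compared with first-come-first-served, this keeps the head sweeping in one direction and avoids long seeks back and forth.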
RAID
- divide and replicate data across multiple hard disks
- introduced in 1987 by D.A. Patterson, G.A. Gibson and R. Katz
RAID ...
There are three main concepts in RAID systems
- identical data is written to more than one disk (mirroring)
- data is split across multiple disks (striping)
- redundant parity data is stored on separate disks and used to detect and fix problems (error correction)
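The error-correction idea can be illustrated with bytewise XOR parity (a simplified sketch, not tied to a specific RAID level):

```python
# Bytewise XOR parity: parity = d0 XOR d1 XOR d2, so any single missing
# block equals the XOR of all remaining blocks plus the parity block.
from functools import reduce

def xor_blocks(*blocks):
    return bytes(reduce(lambda a, b: a ^ b, bs) for bs in zip(*blocks))

d0, d1, d2 = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"   # made-up data blocks
parity = xor_blocks(d0, d1, d2)

# pretend the disk holding d1 failed: rebuild it from the survivors
recovered = xor_blocks(d0, d2, parity)
assert recovered == d1
```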
RAID Reliability
The mean time between failures (MTBF) is the average time until a disk failure occurs
- e.g. a hard disk might have an MTBF of 200'000 hours (22.8 years)
  - note that the MTBF decreases as disks get older
- e.g. the MTBF for a disk array of 100 of the disks mentioned above is 200'000 hours / 100 = 2'000 hours (83 days)
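The array figure above follows from dividing the per-disk MTBF by the number of disks, assuming failures are independent; a minimal sketch:

```python
# With n independently failing disks, the expected time until the first
# failure anywhere in the array shrinks to MTBF_disk / n.
def array_mtbf_hours(disk_mtbf_hours, num_disks):
    return disk_mtbf_hours / num_disks

mtbf = array_mtbf_hours(200_000, 100)
print(mtbf, "hours, i.e. about", round(mtbf / 24), "days")  # 2000.0 hours, about 83 days
```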
- if we mirror the information on two disks with an MTBF of 200'000 hours and a mean time to repair (MTTR) of 10 hours, then the mean time to data loss (MTTDL) is 200'000^2 / (2 * 10) hours = 228'000 years
- of course, in reality it is more likely that errors occur on multiple disks around the same time
  - drives have the same age
  - power failure, earthquake, fire, ...
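The MTTDL figure can be reproduced with the mirrored-pair formula MTBF^2 / (2 * MTTR), which again assumes independent failures:

```python
# Mirrored-pair mean time to data loss, using the formula from the slide:
# MTTDL = MTBF^2 / (2 * MTTR), valid only under independent failures.
HOURS_PER_YEAR = 8760

def mttdl_hours(mtbf_hours, mttr_hours):
    return mtbf_hours ** 2 / (2 * mttr_hours)

years = mttdl_hours(200_000, 10) / HOURS_PER_YEAR
print(f"MTTDL: {round(years):,} years")   # roughly 228'000 years, as on the slide
```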
RAID Levels
[http://en.wikipedia.org/wiki/RAID]
RAID 0
- block-level striping without any redundancy
RAID 1
- mirroring without striping
RAID 2
- bit-level striping with multiple parity disks
RAID 3
- byte-level striping with one parity disk
April 20, 2012
RAID 4
- block-level striping with one parity disk (similar to RAID 3)
RAID 5
- block-level striping with distributed parity (no dedicated parity disk)
RAID 6
- block-level striping with dual distributed parity (no dedicated parity disk, similar to RAID 5)
Data Representation
A DBMS has to define how the elements of its data model
(e.g. relational model) are mapped to secondary storage
- a field contains a fixed- or variable-length sequence of bytes and represents an attribute
- a record contains a fixed- or variable-length sequence of fields and represents a tuple
- records are stored in fixed-length physical block storage units representing a set of tuples
  - the blocks also represent the units of data transfer
- how to map the SQL datatypes to fields?
- how to represent tuples as records?
- how to represent records in blocks?
- how to represent a relation as a collection of blocks?
- how to deal with record sizes that do not fit into blocks?
- how to deal with variable-length records?
- how to deal with schema updates and growing record lengths?
- ...
Represented as a field which is an array of n bytes
- strings that are shorter than n bytes are filled up with a special "pad" character
Two common representations (non-fixed-length version later)
- length plus content
  - allocate an array of n + 1 bytes
  - the first byte represents the length of the string (8-bit integer), followed by the string content
  - limited to a maximal string length of 255 characters
- null-terminated string
  - allocate an array of n + 1 bytes
  - terminate the string with a special null character (like in C)
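Both representations can be sketched in a few lines (the helper names and the ASCII-only assumption are illustrative, not a DBMS API):

```python
# Two sketches of fixed-size string fields of n bytes of content (n + 1 in total).

def encode_length_prefixed(s: str, n: int) -> bytes:
    data = s.encode("ascii")
    assert len(data) <= min(n, 255)               # length byte is an 8-bit integer
    return bytes([len(data)]) + data.ljust(n, b" ")   # pad content to n bytes

def encode_null_terminated(s: str, n: int) -> bytes:
    data = s.encode("ascii")
    assert len(data) < n                          # need room for the terminator
    return (data + b"\x00").ljust(n + 1, b" ")    # n + 1 bytes in total

field = encode_length_prefixed("cat", 10)
assert len(field) == 11 and field[0] == 3
```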
Dates (DATE)
- fixed-length character string
Times (TIME(n))
- the precision n leads to strings of variable length and two possible representations
  - fixed-precision
    - limit the precision to a fixed value and store as VARCHAR(m)
  - true variable-length
    - store the time as a true variable-length value
Bits (BIT(n))
- bit values of size n can be packed into single bytes
- packing of multiple bit values into a single byte is not recommended
  - makes the retrieval and updating of a value more complex and error-prone
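For illustration, packing one BIT(n) value into whole bytes might look as follows (a hypothetical helper, shown precisely because the bit juggling is fiddly):

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into bytes, left-aligning and zero-padding
    the final partial byte. Illustrative sketch, not a DBMS storage format."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= (8 - len(chunk)) % 8   # shift a short final chunk into place
        out.append(byte)
    return bytes(out)

# BIT(10) value: 10 bits occupy 2 bytes, the last 6 bits are padding
assert pack_bits([1, 0, 1, 1, 0, 1, 0, 1, 1, 1]) == bytes([0b10110101, 0b11000000])
```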
Storage Access
A part of the system's main memory is used as a buffer
to store copies of disk blocks
- the number of block transfers between disk and memory should be minimised
- as many blocks as possible should be kept in memory
- the buffer manager has to check whether a requested block is already allocated in the buffer (main memory)
Buffer Manager
If the requested block is already in the buffer, the buffer manager returns the corresponding memory address
- otherwise, the buffer manager has to read the block from the disk, add it to the buffer and return the corresponding memory address
- a request to the DBMS involves multiple steps and the DBMS might be able to determine which blocks will be needed by analysing the different steps of the operation
- note that LRU might not always be the best replacement strategy for a DBMS
note that we will see more efficient solutions for this problem when discussing query optimisation
for each tuple o of order {
    for each tuple c of customer {
        if o.customerID = c.customerID {
            create a new tuple r with:
                r.customerID := c.customerID
                r.name := c.name
                ...
                r.orderID := o.orderID
                ...
            add tuple r to the result set of the join operation
        }
    }
}
once a customer tuple has been processed, it is not accessed again until all the other customer tuples have been accessed
- when the processing of a customer block has been finished, the least recently used customer block will be requested next
- we should replace the block that has been most recently used (MRU)
note that if we want to use an MRU strategy for the inner loop of the previous example, the block has to be pinned
- the block has to be unpinned after the last tuple in the block has been processed
the pinning of blocks provides some control to restrict the time when blocks can be written back to disk
- important for crash recovery
- blocks that are currently updated should not be written to disk
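The interplay of replacement and pinning can be sketched with a toy buffer manager (illustrative only, not actual DBMS code; it assumes at least one unpinned block is always available for eviction):

```python
# Minimal buffer-manager sketch: LRU eviction that skips pinned blocks.
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                 # block_id -> block contents (fake disk)
        self.buffer = OrderedDict()      # iteration order = least recently used first
        self.pinned = set()

    def pin(self, block_id):
        self.pinned.add(block_id)

    def unpin(self, block_id):
        self.pinned.discard(block_id)

    def request(self, block_id):
        if block_id in self.buffer:                  # buffer hit
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id]
        if len(self.buffer) >= self.capacity:        # evict the LRU unpinned block
            victim = next(b for b in self.buffer if b not in self.pinned)
            del self.buffer[victim]
        self.buffer[block_id] = self.disk[block_id]  # buffer miss: load from disk
        return self.buffer[block_id]

disk = {1: "b1", 2: "b2", 3: "b3"}
bm = BufferManager(capacity=2, disk=disk)
bm.request(1)
bm.pin(1)          # block 1 may not be evicted while pinned
bm.request(2)
bm.request(3)      # buffer full: evicts block 2, not the pinned block 1
assert 1 in bm.buffer and 2 not in bm.buffer
```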
the system catalogue (data dictionary) with its metadata is one of the most frequently accessed parts of the database
- if possible, system catalogue blocks should always be in the buffer
index files might be accessed more often than the corresponding files themselves
- do not remove index files from the buffer if not necessary
the crash recovery manager can also provide constraints for the buffer manager
- the recovery manager might demand that other blocks have to be written first (force-output) before a specific block can be written to disk
System Catalogue
- names of the relations
- names, domains and lengths of the attributes of each relation
- names of views
- names of indices
  - name of the relation that is indexed
  - names of the attributes
  - type of index
- ...
File Organisation
A file is logically organised as a sequence of records
- each record contains a sequence of fields
- name, datatype and offset of the record fields are defined by the schema
- record types (schema) might change over time
- the block size is fixed and defined by the physical properties of the disk and the operating system
- the record size might vary for different relations and even between tuples of the same relation (variable field size)
Two approaches
- use multiple files and only store fixed-length records in each file
- store variable-length records in a file
Fixed-Length Records
type customer = record
    cID int;
    name varchar(30);
    street varchar(30)
end
[Figure: two layouts of the fixed-length customer record: without alignment the field boundaries fall at bytes 33 (name) and 64 (street); with offsets aligned to multiples of 4 they fall at bytes 4 (cID), 36 (name) and 68 (street)]
- the first byte of a block loaded from disk is placed at a memory address that is a multiple of 4
- we have to ensure that we have the appropriate field offsets (e.g. divisible by 4)
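Computing 4-byte-aligned field start offsets can be sketched as follows (the field sizes are assumptions for illustration):

```python
# Round each field's start offset up to the next alignment boundary.
def align(offset, boundary=4):
    return (offset + boundary - 1) // boundary * boundary

def field_offsets(sizes, boundary=4):
    offsets, pos = [], 0
    for size in sizes:
        pos = align(pos, boundary)   # insert padding if needed
        offsets.append(pos)
        pos += size
    return offsets

# assumed sizes: cID (4-byte int), name and street (31 bytes each, incl. length byte)
print(field_offsets([4, 31, 31]))  # [0, 4, 36]
```

The alignment wastes a few padding bytes per record but keeps every field at an address the CPU can access efficiently.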
[Figure: customer record with a record header: the header (including the length l) occupies bytes 0-11, followed by cID (bytes 12-15), name (bytes 16-47) and street (bytes 48-79)]
- the record schema (a pointer s to the DBMS schema information)
- a timestamp t about the last access or modification time
- the length l of the record
  - could be computed from the schema, but the information is convenient if we want to quickly access the next record without having to consult the schema
...
[Figure: block after record deletions: the remaining records 3 and 4 hold Albert Einstein (Bergstrasse 18) and Max Frisch (ETH Zentrum)]
- after a record has been deleted, its space has to be filled with another record
  - could move all records after the deleted one, but that is too expensive
  - can move the last record to the deleted record's position, but also that might require an additional block access
- if the block size is not a multiple of the record size, some records will cross block boundaries and we need two block accesses to read/write such a record
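A quick calculation shows the trade-off of not letting records cross block boundaries (block and record sizes are assumed for illustration):

```python
# With unspanned records, a block holds floor(block_size / record_size)
# whole records and the remainder of the block is left unused.
BLOCK_SIZE = 4096   # assumed block size in bytes
RECORD_SIZE = 80    # assumed fixed record size in bytes

records_per_block = BLOCK_SIZE // RECORD_SIZE
wasted_bytes = BLOCK_SIZE % RECORD_SIZE
print(records_per_block, "records per block,", wasted_bytes, "unused bytes")
```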
- we cannot just add an additional boolean flag ("free") to the record, since it will be hard to find the free records
- allocate a certain amount of bytes for a file header containing metadata about the file
- each deleted record contains a pointer (address) to the next deleted record
- the linked list of deleted records is called a free list
[Figure: free list: the file header points to the first deleted record slot, which points to the next one; the remaining records include Claude Debussy (12 Rue Louise) and Max Frisch (ETH Zentrum)]
to save some space, the pointers of the free list can also be stored in the unused space of deleted records (no additional field)
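The free-list mechanics can be sketched with a toy in-memory model (slot indices stand in for record addresses and -1 marks the end of the list; the pointer is stored in the unused space of the deleted slot, as described above):

```python
# Toy free-list sketch: the file header holds the head of the list and each
# deleted slot stores the index of the next deleted slot.
class File:
    def __init__(self, records):
        self.slots = list(records)   # record slots
        self.free_head = -1          # "file header": first free slot, -1 = none

    def delete(self, slot):
        self.slots[slot] = self.free_head   # reuse the freed slot for the pointer
        self.free_head = slot

    def insert(self, record):
        if self.free_head == -1:
            self.slots.append(record)       # no free slot: grow the file
            return len(self.slots) - 1
        slot = self.free_head
        self.free_head = self.slots[slot]   # pop the slot off the free list
        self.slots[slot] = record
        return slot

f = File(["r0", "r1", "r2"])
f.delete(1)
f.delete(2)                  # free list is now 2 -> 1
assert f.insert("new") == 2  # reuses the most recently freed slot
assert f.insert("new2") == 1
```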
Address Space
There are several ways in which the database address space (blocks and block offsets) can be represented
[Figure: map table translating logical record addresses to physical addresses]
- introduces some indirection, since the map table has to be consulted to get the physical address
- flexibility to rearrange records within blocks or move them to other blocks without affecting the record's logical address
- different combinations of logical and physical addresses are possible (structured address schemes)
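The map-table indirection can be sketched as a simple dictionary (illustrative names only, not a DBMS data structure):

```python
# Logical-to-physical map table: records keep a stable logical address
# while their physical location (block, offset) may change.
map_table = {}
map_table["r42"] = ("block7", 128)

def resolve(logical):
    return map_table[logical]        # the extra lookup is the indirection cost

# the record is moved; only the map table entry changes
map_table["r42"] = ("block9", 0)
assert resolve("r42") == ("block9", 0)
```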
Variable-Length Data
Records of the same type may have different lengths
We may want to represent
- record fields with varying size (e.g. VARCHAR(n))
- large fields (e.g. images)
- ...
[Figure: variable-length customer record with fields cID, name and street]
- put all fixed-length fields first (e.g. cID)
- add the length of the record to the record header
- add the offsets of the variable-length fields to the record header
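One possible encoding of this layout (a hypothetical format, using Python's struct module for illustration): the fixed-length field comes first, then the record length and the offsets of the two variable-length fields, then their contents.

```python
import struct

# Header layout (assumed): cID (int32), record length, name offset,
# street offset (uint32 each), followed by the variable-length contents.
def encode_record(cID, name, street):
    name_b, street_b = name.encode(), street.encode()
    header_size = 16                         # 4 bytes * 4 header values
    name_off = header_size
    street_off = name_off + len(name_b)
    length = street_off + len(street_b)
    return struct.pack("<iIII", cID, length, name_off, street_off) + name_b + street_b

def decode_record(buf):
    cID, length, name_off, street_off = struct.unpack_from("<iIII", buf)
    return cID, buf[name_off:street_off].decode(), buf[street_off:length].decode()

rec = encode_record(7, "Max Frisch", "ETH Zentrum")
assert decode_record(rec) == (7, "Max Frisch", "ETH Zentrum")
```

Because the offsets sit in the header, a field can be located without scanning the preceding variable-length fields.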
Variable-Length Records
[Figure: block with an offset table at one end and records record1 to record3 at the other]
- to store records that have at least one field with a variable length
- to store different record types in a single block/file
- the address of a record consists of the block address in combination with an offset table index
- records can be moved around
Large Records
[Figure: spanned records: block 1 holds record1 and fragment record2a, block 2 holds fragment record2b and record3; each block and record has a header]
- a record that is split across two or more blocks is called a spanned record
- spanned records can also be used to pack blocks more efficiently
- each record header carries a bit to indicate whether it is a fragment
  - fragments have some more bits telling whether they are the first or last fragment of a record
- the user has to explicitly load parts of the BLOB
- possibly index structures to retrieve parts of a BLOB
Insertion of Records
[Figure: insertion of a record into a block with an offset table]
If there is not enough space left in a block to insert a new record
- find space in a nearby block and rearrange some records
- create an overflow block and link it from the header of the original block
  - note that an overflow block might point to another overflow block and so on
Deletion of Records
[Figure: deletion of a record from a block with an offset table]
Update of Records
[Figure: update of a record in a block with an offset table]
Homework
Study the following chapter of the Database System Concepts book
- chapter 10 (Storage and File Structure)
  - sections 10.1-10.9
Exercise 8
- Structured Query Language (SQL)
- PostgreSQL
References
H. Garcia-Molina, J.D. Ullman and J. Widom, Database Systems: The Complete Book, Prentice Hall, 2002
Next Lecture
Access Methods