Storage Management
Prof. Beat Signer
Department of Computer Science, Vrije Universiteit Brussel
http://www.beatsigner.com
2 December 2005
[Figure: components of a DBMS, including Authorisation Control, Catalogue Manager, Integrity Checker, Command Processor, Query Optimiser, Scheduler, Buffer Manager, Recovery Manager, Access Methods, File Manager and System Buffers]
Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010
[Figure: memory hierarchy, from fastest to slowest: cache, main memory, flash memory, magnetic disk, optical disk, magnetic tapes]
Cache
On-board cache on the same chip as the microprocessor, e.g. level 1 (L1) cache
- temporary storage of instructions and data
- typical size of ~64 kB
Level 2 (L2) cache
- typical size of ~1 MB
Main Memory
Main memory can be several gigabytes large
- normally too small and too expensive for storing the entire database
- content is lost during power failure or crash (volatile memory)
- in-memory databases (IMDB) primarily rely on main memory
  - note that IMDBs lack durability (the D of the ACID properties)
- time to access data is more or less independent of its location (different from magnetic tapes)
Magnetic Disk
- the transfer units are blocks (tendency for larger block sizes)
- the buffer manager of the DBMS manages the loading and unloading of blocks for specific DBMS operations
- access is ~1'000'000 times slower than main memory access
Hard Disk
A hard disk contains one or more platters and one or more heads
- typical block I/O time (seek time, rotational delay and transfer time) of ~10 milliseconds
- SSDs might help to reduce the gap between primary and secondary storage in DBMS systems
  - the limited number of SSD write operations before failure can be a problem for DBs with a lot of update operations
  - write operations are often still much slower than read operations
Beat Signer - Department of Computer Science - bsigner@vub.ac.be 8
Tertiary Storage
No random access
Different devices
- tape silos
  - room-sized devices holding racks of tapes operated by tape robots
  - e.g. StorageTek PowderHorn with up to 28.8 petabytes
Models of Computation
RAM model of computation
- assumes that all data is held in main memory
I/O model of computation
- assumes that data does not fit into main memory
- efficient algorithms must take into account secondary and even tertiary storage
- the best algorithms for processing large amounts of data often differ from those for the RAM model of computation
- minimising disk accesses plays a major role
- the time to move a block between disk and memory is much higher than the time for the corresponding computation
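The gap the I/O model is built on can be made concrete with back-of-the-envelope arithmetic; the two timing figures below are illustrative assumptions, not measurements:

```python
# Rough comparison of one random block I/O against one in-memory operation.
# Both figures are assumed round numbers for illustration only.
DISK_BLOCK_IO_S = 10e-3   # assumed ~10 ms per random block I/O
MEMORY_OP_S = 10e-9       # assumed ~10 ns per in-memory operation

ops_per_io = DISK_BLOCK_IO_S / MEMORY_OP_S
print(f"one block I/O costs as much as {round(ops_per_io):,} in-memory operations")
```

Under these assumptions a single saved block access pays for about a million in-memory operations, which is why I/O-efficient algorithms count block accesses rather than CPU steps.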
- placement of blocks that are often accessed together on the same disk cylinder
- distribute data across multiple disks to profit from parallel disk accesses (e.g. RAID)
- mirroring of data
- use of disk scheduling algorithms in OS, DBMS or disk controller to determine the order of requested block reads/writes
  - e.g. elevator algorithm
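The elevator algorithm can be sketched as a single SCAN-style pass (a minimal illustration, not a real disk controller's implementation; the cylinder numbers are made up):

```python
def elevator_order(head, requests, direction_up=True):
    """Elevator (SCAN) disk scheduling sketch: service all pending cylinder
    requests in the current direction of head movement, then reverse."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    return up + down if direction_up else down + up

# head at cylinder 50, pending requests on both sides of it
print(elevator_order(50, [10, 22, 55, 70, 33, 90]))  # [55, 70, 90, 33, 22, 10]
```

Compared with first-come-first-served, this keeps the head sweeping in one direction and avoids long seeks back and forth.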
RAID
- divide and replicate data across multiple hard disks
- introduced in 1987 by D.A. Patterson, G.A. Gibson and R. Katz
RAID ...
There are three main concepts in RAID systems
- identical data is written to more than one disk (mirroring)
- data is split across multiple disks (striping)
- redundant parity data is stored on separate disks and used to detect and fix problems (error correction)
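The error-correction idea can be illustrated with bytewise XOR parity (a simplified sketch, not tied to a specific RAID level):

```python
# Bytewise XOR parity: parity = d0 XOR d1 XOR d2, so any single missing
# block equals the XOR of all remaining blocks plus the parity block.
from functools import reduce

def xor_blocks(*blocks):
    return bytes(reduce(lambda a, b: a ^ b, bs) for bs in zip(*blocks))

d0, d1, d2 = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"   # made-up data blocks
parity = xor_blocks(d0, d1, d2)

# pretend the disk holding d1 failed: rebuild it from the survivors
recovered = xor_blocks(d0, d2, parity)
assert recovered == d1
```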
RAID Reliability
The mean time between failures (MTBF) is the average time until a disk failure occurs
- e.g. a hard disk might have an MTBF of 200'000 hours (22.8 years)
  - note that the MTBF decreases as disks get older
- e.g. the MTBF for a disk array of 100 of the disks mentioned above is 200'000 hours / 100 = 2'000 hours (83 days)
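The array figure above follows from dividing the per-disk MTBF by the number of disks, assuming failures are independent; a minimal sketch:

```python
# With n independently failing disks, the expected time until the first
# failure anywhere in the array shrinks to MTBF_disk / n.
def array_mtbf_hours(disk_mtbf_hours, num_disks):
    return disk_mtbf_hours / num_disks

mtbf = array_mtbf_hours(200_000, 100)
print(mtbf, "hours, i.e. about", round(mtbf / 24), "days")  # 2000.0 hours, about 83 days
```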
- if we mirror the information on two disks with an MTBF of 200'000 hours and a mean time to repair (MTTR) of 10 hours, then the mean time to data loss (MTTDL) is 200'000^2 / (2 * 10) hours = 228'000 years
- of course, in reality it is more likely that errors occur on multiple disks around the same time
  - drives have the same age
  - power failure, earthquake, fire, ...
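The MTTDL figure can be reproduced with the mirrored-pair formula MTBF^2 / (2 * MTTR), which again assumes independent failures:

```python
# Mirrored-pair mean time to data loss, using the formula from the slide:
# MTTDL = MTBF^2 / (2 * MTTR), valid only under independent failures.
HOURS_PER_YEAR = 8760

def mttdl_hours(mtbf_hours, mttr_hours):
    return mtbf_hours ** 2 / (2 * mttr_hours)

years = mttdl_hours(200_000, 10) / HOURS_PER_YEAR
print(f"MTTDL: {round(years):,} years")   # roughly 228'000 years, as on the slide
```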
RAID Levels
[http://en.wikipedia.org/wiki/RAID]
RAID 0
- block-level striping without any redundancy
RAID 1
- mirroring without striping
RAID 2
- bit-level striping with multiple parity disks
RAID 3
- byte-level striping with one parity disk
April 20, 2012
RAID 4
- block-level striping with one parity disk (similar to RAID 3)
RAID 5
- block-level striping with distributed parity (no dedicated parity disk)
RAID 6
- block-level striping with dual distributed parity (no dedicated parity disk, similar to RAID 5)
Data Representation
A DBMS has to define how the elements of its data model
(e.g. relational model) are mapped to secondary storage
- a field contains a fixed- or variable-length sequence of bytes and represents an attribute
- a record contains a fixed- or variable-length sequence of fields and represents a tuple
- records are stored in fixed-length physical block storage units representing a set of tuples
  - the blocks also represent the units of data transfer
- how to map the SQL datatypes to fields?
- how to represent tuples as records?
- how to represent records in blocks?
- how to represent a relation as a collection of blocks?
- how to deal with record sizes that do not fit into blocks?
- how to deal with variable-length records?
- how to deal with schema updates and growing record lengths?
- ...
Represented as a field which is an array of n bytes
- strings that are shorter than n bytes are filled up with a special "pad" character
Two common representations (non-fixed-length version later)
- length plus content
  - allocate an array of n + 1 bytes
  - the first byte represents the length of the string (8-bit integer), followed by the string content
  - limited to a maximal string length of 255 characters
- null-terminated string
  - allocate an array of n + 1 bytes
  - terminate the string with a special null character (like in C)
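Both representations can be sketched in a few lines (the helper names and the ASCII-only assumption are illustrative, not a DBMS API):

```python
# Two sketches of fixed-size string fields of n bytes of content (n + 1 in total).

def encode_length_prefixed(s: str, n: int) -> bytes:
    data = s.encode("ascii")
    assert len(data) <= min(n, 255)               # length byte is an 8-bit integer
    return bytes([len(data)]) + data.ljust(n, b" ")   # pad content to n bytes

def encode_null_terminated(s: str, n: int) -> bytes:
    data = s.encode("ascii")
    assert len(data) < n                          # need room for the terminator
    return (data + b"\x00").ljust(n + 1, b" ")    # n + 1 bytes in total

field = encode_length_prefixed("cat", 10)
assert len(field) == 11 and field[0] == 3
```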
Dates (DATE)
- fixed-length character string
Times (TIME(n))
- the precision n leads to strings of variable length and two possible representations
  - fixed-precision
    - limit the precision to a fixed value and store as VARCHAR(m)
  - true variable-length
    - store the time as a true variable-length value
Bits (BIT(n))
- bit values of size n can be packed into single bytes
- packing of multiple bit values into a single byte is not recommended
  - makes the retrieval and updating of a value more complex and error-prone
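For illustration, packing one BIT(n) value into whole bytes might look as follows (a hypothetical helper, shown precisely because the bit juggling is fiddly):

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into bytes, left-aligning and zero-padding
    the final partial byte. Illustrative sketch, not a DBMS storage format."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= (8 - len(chunk)) % 8   # shift a short final chunk into place
        out.append(byte)
    return bytes(out)

# BIT(10) value: 10 bits occupy 2 bytes, the last 6 bits are padding
assert pack_bits([1, 0, 1, 1, 0, 1, 0, 1, 1, 1]) == bytes([0b10110101, 0b11000000])
```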
Storage Access
A part of the system's main memory is used as a buffer
to store copies of disk blocks
- the number of block transfers between disk and memory should be minimised
- as many blocks as possible should be kept in memory
- the buffer manager has to check whether a requested block is already allocated in the buffer (main memory)
Buffer Manager
If the requested block is already in the buffer, the buffer manager returns the corresponding memory address
- otherwise, the buffer manager has to read the block from the disk, add it to the buffer and return the corresponding memory address
- a request to the DBMS involves multiple steps and the DBMS might be able to determine which blocks will be needed by analysing the different steps of the operation
- note that LRU might not always be the best replacement strategy for a DBMS
note that we will see more efficient solutions for this problem when discussing query optimisation
for each tuple o of order {
    for each tuple c of customer {
        if o.customerID = c.customerID {
            create a new tuple r with:
                r.customerID := c.customerID
                r.name := c.name
                ...
                r.orderID := o.orderID
                ...
            add tuple r to the result set of the join operation
        }
    }
}
once a customer tuple has been processed, it is not accessed again until all the other customer tuples have been accessed
- when the processing of a customer block has been finished, the least recently used customer block will be requested next
- we should replace the block that has been most recently used (MRU)
note that if we want to use an MRU strategy for the inner loop of the previous example, the block has to be pinned
- the block has to be unpinned after the last tuple in the block has been processed
the pinning of blocks provides some control to restrict the time when blocks can be written back to disk
- important for crash recovery
- blocks that are currently updated should not be written to disk
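The interplay of replacement and pinning can be sketched with a toy buffer manager (illustrative only, not actual DBMS code; it assumes at least one unpinned block is always available for eviction):

```python
# Minimal buffer-manager sketch: LRU eviction that skips pinned blocks.
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                 # block_id -> block contents (fake disk)
        self.buffer = OrderedDict()      # iteration order = least recently used first
        self.pinned = set()

    def pin(self, block_id):
        self.pinned.add(block_id)

    def unpin(self, block_id):
        self.pinned.discard(block_id)

    def request(self, block_id):
        if block_id in self.buffer:                  # buffer hit
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id]
        if len(self.buffer) >= self.capacity:        # evict the LRU unpinned block
            victim = next(b for b in self.buffer if b not in self.pinned)
            del self.buffer[victim]
        self.buffer[block_id] = self.disk[block_id]  # buffer miss: load from disk
        return self.buffer[block_id]

disk = {1: "b1", 2: "b2", 3: "b3"}
bm = BufferManager(capacity=2, disk=disk)
bm.request(1)
bm.pin(1)          # block 1 may not be evicted while pinned
bm.request(2)
bm.request(3)      # buffer full: evicts block 2, not the pinned block 1
assert 1 in bm.buffer and 2 not in bm.buffer
```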
the system catalogue (data dictionary) with its metadata is one of the most frequently accessed parts of the database
- if possible, system catalogue blocks should always be in the buffer
index files might be accessed more often than the corresponding files themselves
- do not remove index files from the buffer if not necessary
the crash recovery manager can also provide constraints for the buffer manager
- the recovery manager might demand that other blocks have to be written first (force-output) before a specific block can be written to disk
System Catalogue
- names of the relations
- names, domains and lengths of the attributes of each relation
- names of views
- names of indices
  - name of the relation that is indexed
  - names of the attributes
  - type of index
- ...
File Organisation
A file is logically organised as a sequence of records
- each record contains a sequence of fields
- name, datatype and offset of the record fields are defined by the schema
- record types (schema) might change over time
- the block size is fixed and defined by the physical properties of the disk and the operating system
- the record size might vary for different relations and even between tuples of the same relation (variable field size)
Two approaches
- use multiple files and only store fixed-length records in each file
- store variable-length records in a file
Fixed-Length Records
type customer = record
    cID int;
    name varchar(30);
    street varchar(30)
end
[Figure: two layouts of the fixed-length customer record: without alignment the field boundaries fall at bytes 33 (name) and 64 (street); with offsets aligned to multiples of 4 they fall at bytes 4 (cID), 36 (name) and 68 (street)]
- the first byte of a block loaded from disk is placed at a memory address that is a multiple of 4
- we have to ensure that we have the appropriate field offsets (e.g. divisible by 4)
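Computing 4-byte-aligned field start offsets can be sketched as follows (the field sizes are assumptions for illustration):

```python
# Round each field's start offset up to the next alignment boundary.
def align(offset, boundary=4):
    return (offset + boundary - 1) // boundary * boundary

def field_offsets(sizes, boundary=4):
    offsets, pos = [], 0
    for size in sizes:
        pos = align(pos, boundary)   # insert padding if needed
        offsets.append(pos)
        pos += size
    return offsets

# assumed sizes: cID (4-byte int), name and street (31 bytes each, incl. length byte)
print(field_offsets([4, 31, 31]))  # [0, 4, 36]
```

The alignment wastes a few padding bytes per record but keeps every field at an address the CPU can access efficiently.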
[Figure: customer record with a record header: the header (including the length l) occupies bytes 0-11, followed by cID (bytes 12-15), name (bytes 16-47) and street (bytes 48-79)]
- the record schema (a pointer s to the DBMS schema information)
- a timestamp t about the last access or modification time
- the length l of the record
  - could be computed from the schema, but the information is convenient if we want to quickly access the next record without having to consult the schema
...
[Figure: block after record deletions: the remaining records 3 and 4 hold Albert Einstein (Bergstrasse 18) and Max Frisch (ETH Zentrum)]
- after a record has been deleted, its space has to be filled with another record
  - could move all records after the deleted one, but that is too expensive
  - can move the last record to the deleted record's position, but also that might require an additional block access
- if the block size is not a multiple of the record size, some records will cross block boundaries and we need two block accesses to read/write such a record
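A quick calculation shows the trade-off of not letting records cross block boundaries (block and record sizes are assumed for illustration):

```python
# With unspanned records, a block holds floor(block_size / record_size)
# whole records and the remainder of the block is left unused.
BLOCK_SIZE = 4096   # assumed block size in bytes
RECORD_SIZE = 80    # assumed fixed record size in bytes

records_per_block = BLOCK_SIZE // RECORD_SIZE
wasted_bytes = BLOCK_SIZE % RECORD_SIZE
print(records_per_block, "records per block,", wasted_bytes, "unused bytes")
```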
- we cannot just add an additional boolean flag ("free") to the record, since it will be hard to find the free records
- allocate a certain amount of bytes for a file header containing metadata about the file
- each deleted record contains a pointer (address) to the next deleted record
- the linked list of deleted records is called a free list
[Figure: free list: the file header points to the first deleted record slot, which points to the next one; the remaining records include Claude Debussy (12 Rue Louise) and Max Frisch (ETH Zentrum)]
to save some space, the pointers of the free list can also be stored in the unused space of deleted records (no additional field)
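The free-list mechanics can be sketched with a toy in-memory model (slot indices stand in for record addresses and -1 marks the end of the list; the pointer is stored in the unused space of the deleted slot, as described above):

```python
# Toy free-list sketch: the file header holds the head of the list and each
# deleted slot stores the index of the next deleted slot.
class File:
    def __init__(self, records):
        self.slots = list(records)   # record slots
        self.free_head = -1          # "file header": first free slot, -1 = none

    def delete(self, slot):
        self.slots[slot] = self.free_head   # reuse the freed slot for the pointer
        self.free_head = slot

    def insert(self, record):
        if self.free_head == -1:
            self.slots.append(record)       # no free slot: grow the file
            return len(self.slots) - 1
        slot = self.free_head
        self.free_head = self.slots[slot]   # pop the slot off the free list
        self.slots[slot] = record
        return slot

f = File(["r0", "r1", "r2"])
f.delete(1)
f.delete(2)                  # free list is now 2 -> 1
assert f.insert("new") == 2  # reuses the most recently freed slot
assert f.insert("new2") == 1
```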
Address Space
There are several ways in which the database address space (blocks and block offsets) can be represented
[Figure: map table translating logical record addresses to physical addresses]
- introduces some indirection, since the map table has to be consulted to get the physical address
- flexibility to rearrange records within blocks or move them to other blocks without affecting the record's logical address
- different combinations of logical and physical addresses are possible (structured address schemes)
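The map-table indirection can be sketched as a simple dictionary (illustrative names only, not a DBMS data structure):

```python
# Logical-to-physical map table: records keep a stable logical address
# while their physical location (block, offset) may change.
map_table = {}
map_table["r42"] = ("block7", 128)

def resolve(logical):
    return map_table[logical]        # the extra lookup is the indirection cost

# the record is moved; only the map table entry changes
map_table["r42"] = ("block9", 0)
assert resolve("r42") == ("block9", 0)
```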
Variable-Length Data
Records of the same type may have different lengths
We may want to represent
- record fields with varying size (e.g. VARCHAR(n))
- large fields (e.g. images)
- ...
[Figure: variable-length customer record with fields cID, name and street]
- put all fixed-length fields first (e.g. cID)
- add the length of the record to the record header
- add the offsets of the variable-length fields to the record header
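One possible encoding of this layout (a hypothetical format, using Python's struct module for illustration): the fixed-length field comes first, then the record length and the offsets of the two variable-length fields, then their contents.

```python
import struct

# Header layout (assumed): cID (int32), record length, name offset,
# street offset (uint32 each), followed by the variable-length contents.
def encode_record(cID, name, street):
    name_b, street_b = name.encode(), street.encode()
    header_size = 16                         # 4 bytes * 4 header values
    name_off = header_size
    street_off = name_off + len(name_b)
    length = street_off + len(street_b)
    return struct.pack("<iIII", cID, length, name_off, street_off) + name_b + street_b

def decode_record(buf):
    cID, length, name_off, street_off = struct.unpack_from("<iIII", buf)
    return cID, buf[name_off:street_off].decode(), buf[street_off:length].decode()

rec = encode_record(7, "Max Frisch", "ETH Zentrum")
assert decode_record(rec) == (7, "Max Frisch", "ETH Zentrum")
```

Because the offsets sit in the header, a field can be located without scanning the preceding variable-length fields.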
Variable-Length Records
[Figure: block with an offset table at one end and records record1 to record3 at the other]
- to store records that have at least one field with a variable length
- to store different record types in a single block/file
- the address of a record consists of the block address in combination with an offset table index
- records can be moved around
Large Records
[Figure: spanned records: block 1 holds record1 and fragment record2a, block 2 holds fragment record2b and record3; each block and record has a header]
- a record that is split across two or more blocks is called a spanned record
- spanned records can also be used to pack blocks more efficiently
- each record header carries a bit to indicate whether it is a fragment
  - fragments have some more bits telling whether they are the first or last fragment of a record
- the user has to explicitly load parts of the BLOB
- possibly index structures to retrieve parts of a BLOB
Insertion of Records
[Figure: insertion of a record into a block with an offset table]
If there is not enough space left in a block to insert a new record
- find space in a nearby block and rearrange some records
- create an overflow block and link it from the header of the original block
  - note that an overflow block might point to another overflow block and so on
Deletion of Records
[Figure: deletion of a record from a block with an offset table]
Update of Records
[Figure: update of a record in a block with an offset table]
Homework
Study the following chapter of the Database System Concepts book
- chapter 10 (Storage and File Structure)
  - sections 10.1-10.9
Exercise 8
- Structured Query Language (SQL)
- PostgreSQL
References
H. Garcia-Molina, J.D. Ullman and J. Widom, Database Systems: The Complete Book, Prentice Hall, 2002
Next Lecture
Access Methods