
Data vs. Information
Data: raw facts; data constitute the building blocks of information. Data are unprocessed information. Raw data must be properly formatted for storage, processing and presentation.
Information: the result of processing raw data to reveal its meaning. Accurate, relevant and timely information is the key to good decision making, and good decision making is the key to organizational survival in a global environment.
Data Management: the discipline that focuses on the proper generation, storage and retrieval of data. Data management is a core activity for any business, government agency, service organization or charity.
Why store the data as raw facts?

Historical Roots: Files and File Systems
Although file systems as a way of managing data are now largely obsolete, there are several good reasons for studying them in detail:
- understanding file system characteristics makes database design easier to understand
- awareness of the problems with file systems helps prevent similar problems in a DBMS
- knowledge of file systems is helpful if you plan to convert an obsolete file system to a DBMS.
In the recent past, a manager of almost any small organization was (and sometimes still is) able to keep track of the necessary data by using a manual file system. Such a file system was traditionally composed of a collection of file folders, each properly tagged and kept in a filing cabinet. Unfortunately, report generation from a manual file system can be slow and cumbersome.

Files and File Systems
Data: raw facts, e.g. a telephone number, a birth date, a customer name or a year-to-date (YTD) sales value. Data have little meaning unless they have been organized in some logical manner. The smallest piece of data that can be recognized by the computer is a single character, such as the letter A, the number 5 or a symbol such as /. A single character requires 1 byte of computer storage.
Field: a character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.
Record: a logically connected set of one or more fields that describes a person, place or thing, e.g. the fields that constitute a record for a customer: name, address, phone number, date of birth.
File: a collection of related records, e.g. a file might contain data about the vendors of the ROBCOR Company, or a file might contain the records of the students currently enrolled at UEL.
Historical Roots: Files and File Systems

A simple file system in which each department has multiple application programs that directly access the data. Note: there is no separation between programs and data, as there will be in a DBMS.

As the number of files increased, a simple file system evolved. Each file in the system used its own application programs to store, retrieve and modify data, and each file was owned by the individual or department that commissioned its creation. As the file system grew, the demand for the data processing (DP) specialist's programming skills grew even faster, and the DP specialist was authorized to hire additional programmers. The size of the file system also required a larger, more complex computer. The new computer and the additional programming staff caused the DP specialist to spend less time programming and more time managing technical and human resources. Therefore, the DP specialist's job evolved into that of a Data Processing (DP) Manager, who supervised a DP department.

File-based System: a collection of application programs that perform services for the end users, such as the production of reports. Each program defines and manages its own data.

Problems with File System Data Management
- Data Redundancy: multiple file locations/copies can lead to update, insert and delete anomalies.
- Structural Dependence/Data Dependence: access to a file depends on its structure. Making changes to an existing file structure is difficult, and file structure changes require modifications in all programs that use data in that file. Different programming languages have different file structures, and modifications are likely to produce errors, requiring additional time to debug the programs.
Programs were written in third-generation languages (3GLs); examples of 3GLs are COBOL, BASIC and FORTRAN. The programmer must specify both the task and how it is done. Modern databases use fourth-generation languages (4GLs), which allow the user to specify what must be done without specifying how it must be done. 4GLs are used for data retrieval (such as query-by-example and report generator tools) and can work with different DBMSs. The need to write a 3GL program to produce even the simplest report makes ad hoc queries impossible.

- Security features, such as effective password protection, the ability to lock out parts of files or parts of the system itself, and other measures designed to safeguard data confidentiality, are difficult to program and are therefore often omitted in a file system environment.

To summarize the limitations of file system data management:
- It requires extensive programming.
- There are no ad hoc query capabilities.
- System administration can be complex and difficult.
- It is difficult to make changes to existing structures.
- Security features are likely to be inadequate.

Limitations of the File-based Approach
Separation and isolation of data: when data are isolated in separate files, it is more difficult to access data that should be readily available.
Duplication of data: owing to the decentralized approach taken by each department, the file-based approach encouraged, if not necessitated, the uncontrolled duplication of data. Duplication is wasteful: it costs time and money to enter data more than once, and it takes up additional storage space, again with associated costs. Duplication can also lead to loss of data integrity.
Data dependence: the physical structure and storage of the data files and records are defined in the application code. This means that changes to an existing structure are difficult to make.
Incompatible file formats.
Fixed queries / proliferation of application programs: file-based systems are very dependent upon the application developer, who has to write any queries or reports that are required.
No provision for security or integrity.
Recovery in the event of hardware/software failure was limited or non-existent.
Access to the files was restricted to one user at a time; there was no provision for shared access by staff in the same department.

Introducing DB and DBMS
DB (Database): a shared, integrated computer structure that stores:
- end-user data (raw facts)
- metadata (data about data), through which the end-user data are integrated and managed.

Metadata provide a description of the data characteristics and the set of relationships that link the data found within the database. A database resembles a very well-organized electronic filing cabinet in which powerful software (the DBMS) helps manage the cabinet's contents.

DBMS (database management system): a collection of programs that manages the database structure and controls access to the data. It makes it possible to share data among multiple applications or users and makes data management more efficient and effective.
Roles: the DBMS serves as the intermediary between the user and the DB. The DBMS receives all application requests and translates them into the complex operations required to fulfil those requests. The DBMS hides much of the DB's internal complexity from the application programs and users.
The DBMS uses a System Catalog: a detailed data dictionary that provides access to the system tables (metadata) which describe the database. It typically stores:
- names, types and sizes of data items
- constraints on the data
- names of authorized users
- data items accessible by a user and the type of access
- usage statistics (used for optimisation).

Role & Advantages of the DBMS
- End users have better access to more and better-managed data, which promotes an integrated view of the organization's operations.
- Minimized data inconsistency: the probability of data inconsistency is greatly reduced. Data inconsistency exists when different versions of the same data appear in different places.
- Improved data access: it is possible to produce quick answers to ad hoc queries, which is particularly important when compared with the earlier, historical DBMSs.
- Improved data sharing: the DBMS creates an environment in which end users have better access to more and better-managed data; such access makes it possible for end users to respond quickly to changes in their environment.
- Better data integration: wider access to well-managed data promotes an integrated view of the organization's operations and a clearer view of the big picture.
- Increased end-user productivity: the availability of data, combined with tools that transform data into usable information, empowers end users to make quick, informed decisions that can be the difference between success and failure in a global economy.

First, the DBMS enables the data in the DB to be shared among multiple applications or users.

Second, it integrates the many different users' views of the data into a single, all-encompassing data repository. Because data are the crucial raw material from which information is derived, you must have a good way of managing such data.

Role & Advantages of the DBMS
The DBMS serves as the intermediary between the users/applications and the DB; compare this with the earlier file-based systems.
Database System: refers to an organization of components that define and regulate the collection, storage, management and use of data within a DB environment.

Types of Databases
Classified by number of users:
- Single-user: supports only one user at a time. If user A is using the DB, users B and C must wait until user A is done.
- Multiuser: supports multiple users at the same time. When a multiuser DB supports a relatively small number of users (usually fewer than 50) or a single department within an organization, it is called a workgroup database. When the DB is used by the entire organization and supports many users across many departments, it is called an enterprise database.
Classified by location:
- Centralized: supports data located at a single site.
- Distributed: supports data distributed across several sites.
Classified by use:
- Transactional (or production) database (OLTP): supports a company's day-to-day operations.
- Data warehouse (OLAP): stores data used to generate the information required to make tactical or strategic decisions. It is often used to store historical data, and its structure is quite different.

History of Database Systems
First generation: Hierarchical and Network

Second generation: Relational
Third generation: Object-Oriented, Object-Relational

DBMS FUNCTIONS
The DBMS performs several important functions that guarantee the integrity and consistency of the data in the DB. These include:
1. Data Dictionary Management: stores definitions of the data elements and their relationships (metadata) in a data dictionary. The DBMS uses the data dictionary to look up the required data components' structures and relationships, thus relieving you from having to code such complex relationships in each program.
2. Data Storage Management: the DBMS creates and manages the complex structures required for data storage, thus relieving you of the difficult task of defining and programming the physical data characteristics.
3. Data Transformation and Presentation: transforms entered data to conform to the required data structures. The DBMS relieves you of the chore of making a distinction between the logical data format and the physical data format.
4. Security Management: the DBMS creates a security system that enforces user security and data privacy. Security rules determine which users can access the DB, which data items each user can access, and which data operations (read, add, delete or modify) the user can perform.
5. Multiuser Access Control: provides data integrity and data consistency when several users access the data at the same time.
6. Backup and Recovery Management: ensures data safety and integrity. It allows the DBA to perform routine and special backup and recovery procedures.
7. Data Integrity Management: promotes and enforces integrity rules, thus minimising data redundancy and maximising data consistency.
8. DB Access Languages and Application Programming Interfaces: provides data access through a query language.
9. DB Communication Interfaces: current-generation DBMSs accept end-user requests via multiple, different network environments, e.g. a DBMS might provide access to the DB via the internet through the use of web browsers such as Mozilla Firefox or Internet Explorer.

Data Models
Hierarchical Model (HM), e.g. IMS (Information Management System)
It was developed in the 1960s to manage large amounts of data for complex manufacturing projects, e.g. the Apollo rocket that landed on the moon in 1969. Its basic logical structure is represented by an upside-down tree.

The HM contains levels, or segments. A segment is the equivalent of a file system's record type. The root segment is the parent of the level 1 segments, which in turn are the parents of the level 2 segments, and so on; each segment below is a child of the segment above it. In short, the HM depicts a set of one-to-many (1:*) relationships between a parent and its child segments: each parent can have many children, but each child has only one parent.
Limitations:
i. Complex to implement
ii. Difficult to manage
iii. Lacks structural independence

Hierarchical Structure diag

Network Model (NM), e.g. IDMS (Integrated Database Management System)
The NM was created to represent complex data relationships more effectively than the HM, to improve DB performance and to impose a DB standard. The lack of a DB standard was troublesome to programmers and application designers because it made designs and applications less portable. The Conference on Data Systems Languages (CODASYL) created the Database Task Group (DBTG), which defined these crucial components:
Network Schema: the conceptual organization of the entire DB as viewed by the DB administrator. It includes a definition of the DB name, the record type for each record and the components that make up those records.
Network Subschema: defines the portion of the DB seen by the programs that actually produce the desired information from the data contained within the DB. The existence of subschema definitions allows all DB programs to simply invoke the subschema required to access the application's DB file(s).

Data Management Language (DML): defines the environment in which data can be managed. To produce the desired standardisation for each of the three components, the DBTG specified three distinct DML components:
- A schema data definition language (DDL): enables the DBA to define the schema components.
- A subschema DDL: allows the application programs to define the DB components that will be used by the application.
- A data manipulation language: used to work with the data in the DB.
The NM allows a record to have more than one parent. A relationship is called a SET. Each set is composed of at least two record types: an owner record, which is equivalent to the hierarchical model's parent, and a member record, which is equivalent to the hierarchical model's child. A set represents a 1:* relationship between the owner and its members.

Network Model diag

The Relational Model (RM), e.g. Oracle, DB2
The RM is implemented through a sophisticated Relational Database Management System (RDBMS). The RDBMS performs the same basic functions provided by the hierarchical and network DBMS systems. An important advantage of the RDBMS is that it hides the physical details from the user: the RDBMS manages all of the physical details while the user sees the relational DB as a collection of tables in which data are stored, and can manipulate and query the data in a way that seems intuitive and logical.
Each table is a matrix consisting of a series of row/column intersections. Tables, also called relations, are related to each other through the sharing of a field which is common to both entities.
- A relational diagram is a representation of the relational DB's entities, the attributes within those entities and the relationships between those entities.
- A relational table stores a collection of related entities; therefore, a relational DB table resembles a file. The crucial difference between a table and a file: a table yields complete data and structural independence because it is a purely logical structure.
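The idea of tables related through a shared field can be sketched in SQL as follows (a minimal sketch; the table and column names CUSTOMER, INVOICE, CUS_CODE and INV_NUM are assumed here for illustration, not taken from the notes above):

CREATE TABLE CUSTOMER (
  CUS_CODE   INTEGER      PRIMARY KEY,   -- uniquely identifies each customer row
  CUS_LNAME  VARCHAR(30)  NOT NULL
);

CREATE TABLE INVOICE (
  INV_NUM    INTEGER      PRIMARY KEY,
  CUS_CODE   INTEGER      NOT NULL,      -- shared field: links each invoice to one customer
  INV_DATE   DATE,
  FOREIGN KEY (CUS_CODE) REFERENCES CUSTOMER (CUS_CODE)
);

The shared CUS_CODE column is what relates the two tables; the RDBMS, not the application program, enforces the link.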

- A reason for the relational DB model's rise to dominance is its powerful and flexible query language. The RDBMS uses SQL (a 4GL) to translate user queries into instructions for retrieving the requested data.

Object-Oriented Model (OOM) (see appendix G)
In the object-oriented data model (OODM), both data and their relationships are contained in a single structure known as an OBJECT. Like the relational model's entity, an object is described by its factual content. But quite unlike an entity, an object includes information about the relationships between the facts within the object, as well as information about its relationships with other objects. Therefore, the facts within the object are given greater meaning. The OODM is said to be a semantic data model because semantic indicates meaning.
The OO data model is based on the following components:
i. An object is an abstraction of a real-world entity, i.e. an object may be considered equivalent to an ER model's entity. An object represents only one individual occurrence of an entity.
ii. Attributes describe the properties of an object.
iii. Objects that share similar characteristics are grouped in classes. A class is a collection of similar objects with a shared structure (attributes) and behaviour (methods). A class resembles the ER model's entity set. However, a class is different from an entity set in that it contains a set of procedures known as methods. A class's methods represent real-world actions, such as finding a selected person's name or changing a person's name.
iv. Classes are organised in a class hierarchy. The class hierarchy resembles an upside-down tree in which each class has only one parent, e.g. the CUSTOMER class and the EMPLOYEE class share the parent class PERSON.
v. Inheritance is the ability of an object within the class hierarchy to inherit the attributes and methods of the classes above it. For example, two classes, CUSTOMER and EMPLOYEE, can be created as subclasses of the class PERSON; in this case, CUSTOMER and EMPLOYEE will inherit all attributes and methods from PERSON.

Entity Relationship Model
Complex design activities require conceptual simplicity to yield successful results. Although the relational model was a vast improvement over the hierarchical and network models, it still lacked the features that would make it an effective database design tool. Because it is easier to examine structures graphically than to describe them in text, database designers prefer to use a graphical tool in which entities and their relationships are pictured. Thus, the entity relationship (ER) model, or ERM, has become a widely accepted standard for data modelling.
One of the more recent versions of Peter Chen's notation is known as the Crow's Foot model. The Crow's Foot notation was originally invented by Gordon Everest. In Crow's Foot notation, graphical symbols are used instead of the simple notation, such as n to indicate many, used by Chen. The label Crow's Foot is derived from the three-pronged symbol used to represent the many side of a relationship. Although there is a general shift towards the use of UML, many organisations today still use the Crow's Foot notation. This is particularly true of legacy systems

which are running on obsolete hardware and software but are vital to the organisation. It is therefore important that you are familiar with both Chen's and Crow's Foot modelling notations. More recently, the class diagram component of the Unified Modelling Language (UML) has been used to produce entity relationship models. Although class diagrams were developed as part of the larger UML object-oriented design method, the notation is emerging as the industry data modelling standard.
The ERM uses ERDs to represent the conceptual database as viewed by the end user. The ERM's main components are entities, relationships and attributes. The ERD also includes connectivity and cardinality notations. An ERD can also show relationship strength, relationship participation (optional or mandatory) and degree of relationship (unary, binary, ternary, etc.). ERDs may be based on many different ERMs.

The Object-Oriented Model
Objects that share similar characteristics are grouped in classes. Classes are organized in a class hierarchy and contain attributes and methods. Inheritance is the ability of an object within the class hierarchy to inherit the attributes and methods of the classes above it. An OODBMS will use pointers to link objects together. Is this a backwards step?
The OO data model represents an object as a box; all of the object's attributes and relationships to other objects are included within the object box. The object representation of the INVOICE includes all related objects within the same object box. Note that the connectivities (1:1 and 1:*) indicate the relationships of the related objects to the INVOICE. The ER model, by contrast, uses three separate entities and two relationships to represent an invoice transaction. As customers can buy more than one item at a time, each invoice references one or more lines, one item per line.

The Relational Model
- Developed by Codd (IBM) in 1970
- Considered ingenious but impractical in 1970: conceptually simple, but computers at the time lacked the power to implement the relational model
- Today, even microcomputers can run sophisticated relational database software

Advantages of a DBMS
- Control of data redundancy
- Data consistency

- More information from the same amount of data
- Sharing of data
- Improved data integrity
- Improved security
- Enforcement of standards
- Economy of scale
- Balance of conflicting requirements
- Improved data accessibility and responsiveness
- Increased productivity
- Improved maintenance through data independence
- Increased concurrency
- Improved backup and recovery services

Disadvantages of a DBMS
- Cost of the DBMS
- Additional hardware costs
- Cost of conversion
- Complexity
- Size
- Maintenance / performance
- Higher dependency / greater impact of a failure

Degrees of Data Abstraction
ANSI (American National Standards Institute) / SPARC (Standards Planning and Requirements Committee) defined a framework for data modelling based on degrees of data abstraction: the ANSI-SPARC three-level architecture.

External Model
A specific representation of an external view is known as an External Schema. The use of external views representing subsets of the DB has some important advantages:
- It makes it easy to identify the specific data required to support each business unit's operations.
- It makes the designer's job easier by providing feedback about the model's adequacy.
- It helps to ensure that security constraints are included in the DB design.
- It makes application program development much simpler.

Conceptual Model
It represents a global view of the entire DB: a representation of the data as viewed by the entire organisation. It integrates all external views (entities, relationships, constraints and processes) into a single global view of the entire data in the enterprise, known as the Conceptual Schema. The most widely used conceptual model is the ER model, which is used to graphically represent the conceptual schema.
Advantages: it provides a relatively easily understood macro-level view of the data environment, and it is independent of both software and hardware. Software independence means the model does not depend on the DBMS software used to implement the model. Hardware independence means the model does not depend on the hardware used in the implementation of the model.

Internal Model
Once a specific DBMS has been selected, the internal model maps the conceptual model to the DBMS. The internal model is the representation of the DB as seen by the DBMS. An Internal

Schema depicts a specific representation of an internal model, using the DB constructs supported by the chosen DBMS. The internal model depends on the specific DB software. When you can change the internal model without affecting the conceptual model, you have Logical Independence. The internal model is, however, hardware-independent, because it is unaffected by the choice of the computer on which the software is installed.

Physical Model
It operates at the lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes. The physical model requires the definition of both the physical storage devices and the (physical) access methods required to reach the data within those storage devices, making it both software and hardware dependent.
- When you can change the physical model without affecting the internal model, you have Physical Independence. Therefore, a change in storage devices or access methods, and even a change in operating system, will not affect the internal model.

External Level: the users' view of the database. Describes the part of the database that is relevant to a particular user.
Conceptual Level: the community view of the database. Describes what data are stored in the database and the relationships among the data.
Internal Level: the physical representation of the database on the computer. Describes how the data are stored in the database.

The Importance of Data Models
Data models are relatively simple representations, usually graphical, of complex real-world data structures. They facilitate interaction among the designer, the applications programmer and the end user. A data model is a (relatively) simple abstraction of a complex real-world data environment. DB designers use data models to communicate with applications programmers and end users. The basic data-modelling components are entities, attributes, relationships and constraints. Business rules are used to identify and define the basic modelling components within a specific real-world environment.

Database Modelling - Alternate Notations
Crow's Foot notation, Chen notation, UML (Unified Modelling Language)
Review the coverage provided on the book's CD in appendix E regarding these alternative notations.

Data Model Basic Building Blocks - Entity Relationship Diagrams (ERD)
The ERD represents the conceptual DB as viewed by the end user. The ERD depicts the DB's main components:
- Entity: anything about which data is to be collected and stored
- Attribute: a characteristic of an entity
- Relationship: describes an association among entities
  - One-to-many (1:m) relationship
  - Many-to-many (m:n) relationship

  - One-to-one (1:1) relationship
- Constraint: a restriction placed on the data

Entity Relationship Diagrams
An Entity: a thing of independent existence about which you may wish to hold data. Example: an Employee, a Department. An entity is an object of interest to the end user. The word entity in the ERM corresponds to a table, and an entity instance to a row, in the relational environment. The ERM refers to a specific table row as an entity instance or entity occurrence.
In UML notation, an entity is represented by a box that is subdivided into three parts:
- the top part is used to name the entity: a noun, usually written in capital letters
- the middle part is used to name and describe the attributes
- the bottom part is used to list the methods. Methods are used only when designing object-relational/object-oriented DB models.
The two terms ER model and ER diagram are often used interchangeably to refer to the same thing: a graphical representation of a database. To be more precise, you would refer to the specific notation being used as the model (e.g. Chen, Crow's Foot), describing what type of symbols are used, whereas an actual example of that notation being used in practice would be called an ER diagram. So the model is the notation specification; the diagram is an actual drawing.
Relationships: an association between entities. Entity types may bear relationships to one another. Example: an Employee works in a Department; to record which Dept an Emp is in, the relationship could be 'Works in'.
Existence Dependence: an entity is said to be existence-dependent if it can exist in the database only when it is associated with another related entity occurrence. In implementation terms, an entity is existence-dependent if it has a mandatory foreign key, that is, a foreign key attribute that cannot be null.
Relationship Strength: this concept is based on how the primary key of a related entity is defined.

Weak (Non-identifying) Relationship: exists if the primary key of the related entity does not contain a primary key component of the parent entity, e.g.
COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT)
CLASS (CLASS_CODE, CRS_CODE, CLASS_SECTION)
Strong (Identifying) Relationship: exists when the primary key of the related entity contains a primary key component of the parent entity, e.g.
COURSE (CRS_CODE, DEPT_CODE, CRS_CREDIT)
CLASS (CRS_CODE, CLASS_SECTION, CLASS_TIME, ROOM_CODE)

Weak Entities
A weak entity is one that meets two conditions:
1. It is existence-dependent: it cannot exist without the entity with which it has a relationship.
2. It has a primary key that is partially or totally derived from the parent entity in the relationship.
A sketch of how the two cases above might look as SQL table definitions follows.
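The difference can be seen in how CLASS is keyed in each case (a minimal sketch based on the COURSE/CLASS example above; the column data types are assumptions):

-- Weak (non-identifying) relationship: CLASS has its own single-attribute key
CREATE TABLE COURSE (
  CRS_CODE    CHAR(8)  PRIMARY KEY,
  DEPT_CODE   CHAR(4),
  CRS_CREDIT  INTEGER
);

CREATE TABLE CLASS (
  CLASS_CODE     CHAR(8)  PRIMARY KEY,                    -- key does not include CRS_CODE
  CRS_CODE       CHAR(8)  REFERENCES COURSE (CRS_CODE),
  CLASS_SECTION  CHAR(4)
);

-- Strong (identifying) relationship: the parent's key is part of the child's key
CREATE TABLE CLASS2 (
  CRS_CODE       CHAR(8)  REFERENCES COURSE (CRS_CODE),
  CLASS_SECTION  CHAR(4),
  CLASS_TIME     CHAR(20),
  ROOM_CODE      CHAR(6),
  PRIMARY KEY (CRS_CODE, CLASS_SECTION)                   -- composite key containing CRS_CODE
);

(The table name CLASS2 is used here only so that both variants can be shown side by side.)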

Attributes: characteristics of entities. For example, the STUDENT entity includes the attributes STU_LNAME, STU_FNAME and STU_INITIAL.
Domains: attributes have a domain; a domain is the attribute's set of possible values.
Relationship Degree
The relationship degree indicates the number of entities or participants associated with a relationship.
- Unary relationship: exists when an association is maintained within a single entity.
- Binary relationship: exists when two entities are associated in a relationship.
- Ternary relationship: exists when three entities are associated.

Recursive Relationship: a relationship that can exist between occurrences of the same entity set. (Naturally, such a condition is found within a unary relationship.)

Composite Entity (Bridge Entity): an entity composed of the primary keys of each of the entities to be connected. An example is converting a *:* relationship into two 1:* relationships; a sketch of this follows.
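For instance, a many-to-many relationship between STUDENT and CLASS can be resolved with a bridge table (a minimal sketch; the bridge entity name ENROLL and the column details are assumed for illustration):

CREATE TABLE STUDENT (
  STU_NUM    INTEGER PRIMARY KEY,
  STU_LNAME  VARCHAR(30)
);

CREATE TABLE CLASS (
  CLASS_CODE CHAR(8) PRIMARY KEY,
  CRS_CODE   CHAR(8)
);

-- Bridge (composite) entity: its key is built from the keys of the entities it connects
CREATE TABLE ENROLL (
  STU_NUM      INTEGER  REFERENCES STUDENT (STU_NUM),
  CLASS_CODE   CHAR(8)  REFERENCES CLASS (CLASS_CODE),
  ENROLL_GRADE CHAR(1),
  PRIMARY KEY (STU_NUM, CLASS_CODE)   -- one row per student/class pair: two 1:* relationships
);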

Composite and Simple Attributes
A composite attribute is an attribute that can be further subdivided to yield additional attributes, e.g. ADDRESS can be subdivided into street, city, state and postcode. A simple attribute is an attribute that cannot be subdivided, e.g. age, sex and marital status.
Single-valued attribute: an attribute that can have only a single value, e.g. a person can have only one social security number.
Multivalued attribute: an attribute that can have many values, e.g. a person may have several university degrees.
Derived attribute: an attribute whose value is calculated from other attributes, e.g. an employee's age, EMP_AGE, may be found by computing the integer value of the difference between the current date and EMP_DOB.

Properties of an entity are what we want to record. Example: employee number and name; the attributes could be EMP_NO and EMP_NAME.
Relation Types
- Relation between two entities: Emp and Dept
- More than one relation between entities: Lecturer and Student (Teaches, Personal Tutor)
- Relationship with itself (recursive): a Part made up of parts
Degree and cardinality are two important properties of the relational model. The word relation, also known as a dataset in Microsoft Access, is based on the mathematical set theory from which Codd derived his model. Since the relational model uses attribute values to establish relationships among tables, many database users incorrectly assume that the term relation

refers to such relationships. Many then incorrectly conclude that only the relational model permits the use of relationships.
A Relation Schema is a textual representation of a DB table, where each table is described by its name followed by the list of its attributes in parentheses, e.g. LECTURER (EMP_NUM, LECTURER_OFFICE, LECTURER_EXTENSION).
Rows are sometimes referred to as records, columns are sometimes labelled fields, and tables are occasionally labelled files. However, a DB table is a logical rather than a physical concept, whereas file, record and field describe physical concepts.
Properties of a Relation
1. A table is perceived as a two-dimensional structure composed of rows and columns.
2. Each table row (tuple) represents a single entity occurrence within the entity set and must be distinct. Duplicate rows are not allowed in a relation.
3. Each table column represents an attribute, and each column has a distinct name.
4. Each cell (row/column intersection) in a relation should contain only a single data value.
5. All values in a column must conform to the same data format.
6. Each column has a specific range of values known as the attribute domain.
7. The order of the rows and columns is immaterial to the DBMS.
8. Each table must have an attribute or a combination of attributes that uniquely identifies each row.
Cardinality of a Relationship
Determines the number of occurrences of one entity associated with the other entity. Example: for each Dept there are a number of Employees that work in it. Cardinality is used to express the maximum number of entity occurrences associated with one occurrence of the related entity. Participation determines whether all occurrences of an entity participate in the relationship or not.
Three Types of Cardinality
- One-to-many: Dept - Emp
- Many-to-many: Student - Courses (must be resolved into 1:m relationships)
- One-to-one: Husband - Wife (UK law)
Optionality / Participation
Identifies the minimum cardinality of a relationship between entities: 0 = may be related, 1 = must be related.
Developing an ER Diagram
The process of database design is an iterative rather than a linear or sequential process. An iterative process is one based on repetition of processes and procedures.

1. Create a detailed narrative of the organization's description of operations.
2. Identify the business rules based on the description of operations.
3. Identify the entities.
4. Work out the relationships.
5. Develop an initial ERD.
6. Work out cardinality/optionality.
7. Identify the primary and foreign keys.
8. Identify the attributes.
9. Revise and review the ERD.

Types of Keys
- Primary key: the attribute which uniquely identifies each entity occurrence.
- Candidate key: one of a number of possible attributes which could be used as the key field.
- Composite key: when more than one attribute is required to identify each occurrence. A composite primary key is a primary key composed of more than one attribute.
- Foreign key: when an entity has a key attribute from another entity stored in it.
More formally:
- Superkey: an attribute (or combination of attributes) that uniquely identifies each row in a table.
- Candidate key: a superkey that does not contain a subset of attributes that is itself a superkey.
- Primary key: a candidate key selected to uniquely identify all other attribute values in any given row. It cannot contain null entries.
- Identifiers (primary keys): the ERM uses identifiers to uniquely identify each entity instance. Identifiers are underlined in the ERD; key attributes are also underlined when writing the relational schema.
- Secondary key: an attribute (or combination of attributes) used strictly for data retrieval purposes.
- Foreign key: an attribute (or combination of attributes) in one table whose values must either match the primary key in another table or be null.
A short sketch showing several of these key types in one table definition follows the ERD notation figures below.

The basic UML ERD

The basic Crows foot ERD
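Returning to the key types above, several of them can appear in a single table definition (a minimal sketch; the MODULE_RUN table and its columns are assumed for illustration, with LECTURER based on the relation schema shown earlier):

CREATE TABLE LECTURER (
  EMP_NUM          INTEGER PRIMARY KEY,
  LECTURER_OFFICE  VARCHAR(10)
);

CREATE TABLE MODULE_RUN (
  CRS_CODE      CHAR(8),
  CLASS_SECTION CHAR(4),
  ROOM_CODE     CHAR(6),
  EMP_NUM       INTEGER REFERENCES LECTURER (EMP_NUM),  -- foreign key to another table
  MODULE_REF    CHAR(10) UNIQUE,                        -- candidate key not chosen as the primary key
  PRIMARY KEY (CRS_CODE, CLASS_SECTION)                 -- composite primary key
);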

Example Problem 1
A college library holds books for its members to borrow. Each book may be written by more than one author, and any one author may have written several books. If no copies of a wanted book are currently in stock, a member may make a reservation for the title until it is available. If books are not returned on time, a fine is imposed, and if the fine is not paid the member is barred from borrowing any other books until the fine is paid.
ER Diag One

Example Problem 2

A local authority wishes to keep a database of all its schools and the school children attending each school. The system should also be able to record the teachers available to be employed at a school and be able to show which teachers teach which children. Each school has one head teacher, whose responsibility it is to manage their individual school; this should also be modelled.

Example Problem 3
A university runs many courses. Each course consists of many modules, and each module can contribute to many courses. Students can attend a number of modules, but first they must possess the right qualifications to be accepted on a particular course. Each course requires a set of qualifications at particular grades to allow students to be accepted; for example, the Science course requires at least 2 A levels, one of which must be mathematics at grade B or above. There is the normal teaching student/lecturer relationship, but you will also have to record personal tutor assignments.

Review Questions ch1
Discuss each of the following: data, field, record, file.
What is data redundancy, and which characteristics of the file system can lead to it?
Discuss the lack of data independence in file systems.
What is a DBMS and what are its functions?
What is structural independence and why is it important?
Explain the difference between data and information.
What is the role of a DBMS, and what are its advantages?
List and describe the different types of databases.
What are the main components of a database system?
What is metadata?
Explain why database design is important.
What are the potential costs of implementing a database system?

Review Questions ch2
1. Discuss the importance of data modelling.
2. What is a business rule, and what is its purpose in data modelling?
3. How would you translate business rules into data model components?
5. What three languages were adopted by the DBTG to standardize the basic network data model, and why was such standardisation important to users and designers?
6. Describe the basic features of the relational data model and discuss their importance to the end user and the designer.
7. Explain how the entity relationship (ER) model helped produce a more structured relational database design environment.
9. Why is an object said to have greater semantic content than an entity?
10. What is the difference between an object and a class in the object-oriented data model (OODM)?
12. What is an ERDM, and what role does it play in the modern (production) database environment?
14. What is a relationship, and what three types of relationships exist?
15. Give an example of each of the three types of relationships.
16. What is a table, and what role does it play in the relational model?
17. What is a relational diagram? Give an example.
18. What is logical independence?

19. What is physical independence?
20. What is connectivity? Draw ERDs to illustrate connectivity.

Review Questions ch3
1. What is the difference between a database and a table?
2. What does a database expert mean when he/she says that a database displays both entity integrity and referential integrity?
3. Why are entity integrity and referential integrity important in a database?

Review Questions ch5
1. What two conditions must be met before an entity can be classified as a weak entity? Give an example of a weak entity.
2. What is a strong (or identifying) relationship?
4. What is a composite entity, and when is it used?
6. What is a recursive relationship? Give an example.
7. How would you graphically identify each of the following ERM components in a UML model:
   I. an entity
   II. the multiplicity (0:*)
8. Discuss the difference between a composite key and a composite attribute. How would each be indicated in an ERD?
9. What two courses of action are available to a designer when he or she encounters a multivalued attribute?
10. What is a derived attribute? Give an example.
11. How is a composite entity represented in an ERD, and what is its function? Illustrate using the UML notation.
14. What three (often conflicting) database requirements must be addressed in database design?
15. Briefly, but precisely, explain the difference between single-valued attributes and simple attributes. Give an example of each.
16. What are multivalued attributes, and how can they be handled within the database design?

Enhanced Entity Relationship (EER) Modelling (Extended Entity Relationship Model)
This is the result of adding more semantic constructs to the original entity relationship (ER) model. Examples of the additional concepts in EER models are:
- Specialisation/Generalisation
- Superclass/Subclass
- Aggregation
- Composition
In modelling terms, an entity supertype is a generic entity type that is related to one or more entity subtypes, where the entity supertype contains the common characteristics and the entity subtypes contain the unique characteristics of each entity subtype.

Specialization Hierarchy
Entity supertypes and subtypes are organised in a specialization hierarchy. The specialization hierarchy depicts the arrangement of higher-level entity supertypes (parent entities) and lower-level entity subtypes (child entities). In UML notation, subtypes are called subclasses and supertypes are known as superclasses.
Specialization and Generalization
Specialization is the top-down process of identifying lower-level, more specific entity subtypes from a higher-level entity supertype. Specialization is based on grouping the unique characteristics and relationships of the subtypes. Generalization is the bottom-up process of identifying a higher-level, more generic entity supertype from lower-level entity subtypes. Generalization is based on grouping the common characteristics and relationships of the subtypes.
Superclass: an entity type that includes one or more distinct sub-groupings of its occurrences; therefore a generalization. Subclass: a distinct sub-grouping of occurrences of an entity type; therefore a specialization.
Attribute Inheritance: an entity in a subclass represents the same real-world object as in the superclass and may possess subclass-specific attributes, as well as those associated with the superclass. A sketch of one common way to implement a supertype/subtype arrangement in tables is given at the end of this section.
Composition and Aggregation
Aggregation is where a larger entity can be composed of smaller entities, e.g. a University and its Departments. A special case of aggregation is known as Composition. This is a much stronger relationship than aggregation, since when the parent entity instance is deleted, all child entity instances are automatically deleted. An Aggregation construct is used when an entity is composed of a collection of other entities, but the entities are independent of each other. A Composition construct is used when two entities are associated in an aggregation association with a strong identifying relationship: that is, deleting the parent deletes the child instances.
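One common way to implement a supertype and its subtypes in the relational model is to give the subtypes the same primary key as the supertype (a minimal sketch under that assumption; the PERSON/EMPLOYEE/CUSTOMER names follow the class-hierarchy example used earlier, while the column names are illustrative):

CREATE TABLE PERSON (
  PER_NUM    INTEGER PRIMARY KEY,     -- supertype holds the common attributes
  PER_LNAME  VARCHAR(30),
  PER_FNAME  VARCHAR(30)
);

CREATE TABLE EMPLOYEE (
  PER_NUM        INTEGER PRIMARY KEY REFERENCES PERSON (PER_NUM),  -- shares the supertype key
  EMP_HIRE_DATE  DATE                                              -- subtype-specific attribute
);

CREATE TABLE CUSTOMER (
  PER_NUM           INTEGER PRIMARY KEY REFERENCES PERSON (PER_NUM),
  CUS_CREDIT_LIMIT  DECIMAL(9,2)                                   -- subtype-specific attribute
);

Each EMPLOYEE or CUSTOMER row "inherits" the common attributes by joining back to its PERSON row on PER_NUM.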

Normalization of Database Tables
This is a process for evaluating and correcting table structures to minimise data redundancies, thereby reducing the likelihood of data anomalies. Normalization works through a series of stages called normal forms: first normal form (1NF), second normal form (2NF) and third normal form (3NF). From a structural point of view, 2NF is better than 1NF and 3NF is better than 2NF. For most business database design purposes, 3NF is as high as we need to go in the normalization process; the highest level of normalization is not always the most desirable, and almost all business designs use 3NF as the ideal normal form.
A table is in 1NF when all key attributes are defined and when all remaining attributes are dependent on the primary key. However, a table in 1NF can still contain both partial and transitive dependencies. (A partial dependency is one in which an attribute is functionally dependent on only

a part of a multi-attribute primary key. A transitive dependency is one in which a non-key attribute is functionally dependent on another non-key attribute.) A table with a single-attribute primary key cannot exhibit partial dependencies.
A table is in 2NF when it is in 1NF and contains no partial dependencies. Therefore, a 1NF table is automatically in 2NF when its primary key is based on only a single attribute. A table in 2NF may still contain transitive dependencies.
A table is in 3NF when it is in 2NF and contains no transitive dependencies. When a table has only a single candidate key, a 3NF table is automatically in BCNF (Boyce-Codd Normal Form).
Normalization Process
Checking the ER model using functional dependency. Result: removes data duplication problems, saves excess storage space, and removes insertion, update and deletion anomalies. A short worked example is given after the review questions below.
Functional Dependency
A -> B: B is functionally dependent on A; if we know A then we can find B, e.g. StudNo -> StudName.

Review Questions
1. What is an entity supertype, and why is it used?
2. What kinds of data would you store in an entity subtype?
3. What is a specialization hierarchy?

Review Questions
1. What is normalization?
2. When is a table in 1NF?
3. When is a table in 2NF?
4. When is a table in 3NF?
5. When is a table in BCNF?
7. What is a partial dependency? With what normal form is it associated?
8. What three data anomalies are likely to be the result of data redundancy? How can such anomalies be eliminated?
9. Define and discuss the concept of transitive dependency.
11. Why is a table whose primary key consists of a single attribute automatically in 2NF when it is in 1NF?
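As a short worked example of the normal forms above (the table and attribute names are assumed for illustration, not taken from the notes): suppose assignment data are recorded as
  ASSIGNMENT (PROJ_NUM, EMP_NUM, PROJ_NAME, EMP_NAME, HOURS)
with primary key (PROJ_NUM, EMP_NUM). The table is in 1NF, but PROJ_NAME depends only on PROJ_NUM and EMP_NAME depends only on EMP_NUM, so it contains partial dependencies. Splitting it removes them and yields 2NF:
  PROJECT (PROJ_NUM, PROJ_NAME)
  EMPLOYEE (EMP_NUM, EMP_NAME)
  ASSIGNMENT (PROJ_NUM, EMP_NUM, HOURS)
If EMPLOYEE also held JOB_CODE and JOB_RATE, where JOB_RATE depends on JOB_CODE (a non-key attribute), that transitive dependency would be removed by a further split into EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CODE) and JOB (JOB_CODE, JOB_RATE), giving 3NF.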

Relational Algebra and SQL
Relational DB Roots
Relational algebra and relational calculus are the mathematical basis for relational databases, proposed by E.F. Codd in 1971 as the basis for defining the relational model.
- Relational algebra: procedural, describes operations.
- Relational calculus: non-procedural / declarative.
Relational algebra is a collection of formal operations acting on relations which produce new relations as a result. The algebra is based on predicate logic and set theory and is described as a procedural language. Relational algebra defines a theoretical way of manipulating table contents through a number of relational operators.
Set Theory

Relational Algebra Operators


- UNION
- INTERSECT
- DIFFERENCE
- SELECT (Restrict)
- PROJECT
- CARTESIAN PRODUCT
- DIVISION
- JOIN

The SELECT operator, denoted by sigma, is formally defined as σp(R) or σ<criterion>(RELATION), where σp(R) is the set of tuples of the relation R satisfying the predicate (or criterion) p used to extract the required tuples.
The PROJECT operator, denoted by pi, returns all values for the selected attributes and is formally defined as πa1,...,an(R) or π<list of attributes>(RELATION).

Relational Operators
Union: R ∪ S builds a relation consisting of all tuples appearing in either or both of two specified relations.
Intersection: R ∩ S builds a relation consisting of all tuples appearing in both of two specified relations.
Difference (complement): R − S builds a relation consisting of all tuples appearing in the first and not in the second of two specified relations.

Union

Intersection

Difference

Select (Restrict): σp(R) — extracts specified tuples (rows) from a specified relation.

Project: πa,b(R) — extracts specified attributes (columns) from a specified relation.
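In SQL terms (covered later in these notes), the restrict operator corresponds to the WHERE clause and the project operator to the SELECT column list. A minimal sketch, assuming a STUDENT table with the columns named earlier:

SELECT STU_LNAME, STU_FNAME          -- project: keep only these columns
FROM   STUDENT
WHERE  STU_INITIAL = 'J';            -- restrict: keep only the matching rows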

Cartesian product: R × S — builds a relation from two specified relations consisting of all possible concatenated pairs of tuples, one from each of the two specified relations.

Cartesian Product Example
The Cartesian product of R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with cardinality j is a relation R3 with degree k = n + m, cardinality i * j and attributes (a1, a2, ..., an, b1, b2, ..., bm). This can be denoted R3 = R1 × R2.

Division: R ÷ S — takes two relations, one binary and one unary, and builds a relation consisting of all values of one attribute of the binary relation that match (in the other attribute) all values in the unary relation.

Join: R ⋈ S

Builds a relation from two specified relations consisting of all possible concatenated pairs of tuples, one from each of the two specified relations, such that in each pair the two tuples satisfy some specified condition.
The DIVISION of two relations R1(a1, a2, ..., an) with cardinality i and R2(b1, b2, ..., bm) with cardinality j is a relation R3 with degree k = n − m and cardinality of at most i ÷ j.
The JOIN of two relations R1(a1, a2, ..., an) and R2(b1, b2, ..., bm) is a relation R3 with degree k = n + m and attributes (a1, a2, ..., an, b1, b2, ..., bm) that satisfy the specified join condition.

Division

See page 129

Equijoin Example

i. Compute R1 × R2: this first performs a Cartesian product to form all possible combinations of the rows of R1 and R2.
ii. Restrict the Cartesian product to only those rows where the values in certain columns match.

See page 131

Secondary Algebraic Operators
Intersection:  R ∩ S = R − (R − S)
Division:      R ÷ S = πA(R) − πA((πA(R) × S) − R)
Join:          R ⋈F S = σF(R × S), where F is the join condition
Equijoin:      R ⋈R.a=S.b S = σR.a=S.b(R × S)
Natural join:  R ⋈ S = πA(σR.x=S.x(R × S)), matching the common attributes and projecting out the duplicate columns
Semijoin:      R ⋉F S = πA(R ⋈F S), where A is the set of attributes of R

Example Tables

S1
S#   SNAME   CITY
S1   Smith   London
S4   Clark   London

S2
S#   SNAME   CITY
S1   Smith   London
S2   Jones   Paris

P
P#   PNAME   WEIGHT
P1   Bolt    10
P2   Nut     15
P3   Screw   15

SP
S#   P#   QTY
S1   P1   10
S1   P2   5
S4   P2   7
S2   P3   8

Union
S1 ∪ S2 produces a table consisting of the rows in either S1 or S2 (or both):

S1                          S2
S#   SNAME   CITY           S#   SNAME   CITY
S1   Smith   London         S1   Smith   London
S4   Clark   London         S2   Jones   Paris

S1 ∪ S2
S#   SNAME   CITY
S1   Smith   London
S4   Clark   London
S2   Jones   Paris

Intersection
S1 ∩ S2 produces a table consisting of the rows in both S1 and S2:

S1 ∩ S2
S#   SNAME   CITY
S1   Smith   London

Difference
S1 − S2 produces a table consisting of the rows in S1 and not in S2:

S1 − S2
S#   SNAME   CITY
S4   Clark   London

Restriction (Select)
σcity='London'(S2) extracts the rows from a table that meet a specific criterion:

S2
S#   SNAME   CITY
S1   Smith   London
S2   Jones   Paris

σcity='London'(S2)
S#   SNAME   CITY
S1   Smith   London

Project
πPNAME(P) extracts the values of specified columns from a table:

P
P#   PNAME   WEIGHT
P1   Bolt    10
P2   Nut     15
P3   Screw   15

πPNAME(P)
PNAME
Bolt
Nut
Screw

Cartesian product
S1 × P produces a table of all combinations of rows from the two tables:

S1                          P
S#   SNAME   CITY           P#   PNAME   WEIGHT
S1   Smith   London         P1   Bolt    10
S4   Clark   London         P2   Nut     15
                            P3   Screw   15

S1 × P
S#   SNAME   CITY     P#   PNAME   WEIGHT
S1   Smith   London   P1   Bolt    10
S1   Smith   London   P2   Nut     15
S1   Smith   London   P3   Screw   15
S4   Clark   London   P1   Bolt    10
S4   Clark   London   P2   Nut     15
S4   Clark   London   P3   Screw   15

Divide
P ÷ S produces a new table by selecting the column values from the rows in P that match every row in S:

P                        S
PARTNAME   S#            S#
Bolt       1             1
Nut        1             2
Screw      1             3
Washer     1
Bolt       2
Screw      2
Washer     2
Bolt       3
Nut        3
Washer     3

P ÷ S
PARTNAME
Bolt
Washer

(Which parts does every supplier supply?)

Natural Join
In a natural join you select only the rows in which the common attribute values match. You could also do a right outer join or a left outer join to also select the rows that have no matching values in the other related table. An inner join is one in which only rows that meet a given criterion are selected; an outer join returns the matching rows as well as the rows with unmatched attribute values for one of the tables being joined.

S1 ⋈ SP produces a table from the two tables by matching on their common column (S#):

S1                          SP
S#   SNAME   CITY           S#   P#   QTY
S1   Smith   London         S1   P1   10
S4   Clark   London         S1   P2   5
                            S4   P2   7
                            S2   P3   8

S1 ⋈ SP
S#   SNAME   CITY     P#   QTY
S1   Smith   London   P1   10
S1   Smith   London   P2   5
S4   Clark   London   P2   7

Read pg 132 - 136

Consider
Get supplier numbers and cities for suppliers who supply part P2.

Algebra:
- Join relation S on S# with SP on S#.
- Restrict the result of that join to tuples with P# = 'P2'.
- Project the result of that restriction on S# and CITY.
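For comparison, the same request can be written in SQL (introduced below). A minimal sketch, assuming supplier and shipment tables named S and SP with columns SNUM, SNAME, CITY and SNUM, PNUM, QTY (renamed from S# and P#, which are not valid plain SQL column names):

SELECT S.SNUM, S.CITY
FROM   S, SP
WHERE  S.SNUM = SP.SNUM      -- join S and SP on the supplier number
AND    SP.PNUM = 'P2';       -- restrict to shipments of part P2

The SELECT list performs the projection, while the WHERE conditions perform the join and the restriction.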

Calculus:
Get S# and CITY for suppliers such that there exists a shipment SP with the same S# value and with P# value 'P2'.

SQL - Structured Query Language
SQL is a non-procedural language. It provides:
- Data Manipulation Language (DML)
- Data Definition Language (DDL)
- Data Control Language (DCL)
- Embedded and dynamic SQL
- Security
- Transaction management
- Client/server execution and remote DB access

Types of Operations
Data Definition Language (DDL): defines the underlying DB structure. SQL includes commands to create DB objects such as tables, indexes and views, e.g. CREATE TABLE, NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CREATE INDEX, CREATE VIEW, ALTER TABLE, DROP TABLE, DROP INDEX, DROP VIEW.
Data Definition Language
- Create / amend / drop a table
- Specify integrity checks
- Build indexes
- Create virtual views of a table

Data Manipulation Language (DML): used for retrieving and updating the data. It includes commands to insert, update, delete and retrieve data within the DB tables, e.g. INSERT, SELECT, WHERE, GROUP BY, HAVING, ORDER BY, UPDATE, DELETE, COMMIT, ROLLBACK.
Data Manipulation Language
- Query the DB to show selected data
- Insert, delete and update table rows
- Control transactions: commit / rollback

Data Control Language: controls access rights to parts of the DB.
- GRANT: allow specified users to perform specified tasks.
- DENY: disallow specified users from performing specified tasks.
- REVOKE: cancel previously granted or denied permissions.
- UPDATE permission: allows a user to update records.
- READ permission: prevents a user from editing the database; the user can only view the data.
- DELETE permission: allows a user to delete records in a database.
A small example of granting and revoking privileges is sketched after the syntax notes below.

Reading the Syntax
- UPPER CASE = reserved words
- lower case = user-defined words
- Vertical bar | = a choice, i.e. ASC|DESC
- Curly braces { } = a choice from a list
- Square brackets [ ] = an optional element
- Dots ... = optional repeating items
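Returning to the DCL commands above, a minimal sketch of granting and revoking access (the user name joe and table name STUDENT are assumed for illustration; the exact privilege names vary between DBMS products):

GRANT SELECT, UPDATE ON STUDENT TO joe;   -- joe may query and change STUDENT rows
REVOKE UPDATE ON STUDENT FROM joe;        -- withdraw the update right, leaving SELECT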

Syntax of SQL
SELECT [ALL | DISTINCT] {[table.]* | expression [alias], ...}
FROM table [alias] [, table [alias]]
[WHERE condition]
[GROUP BY expression [, expression] [HAVING condition]]
[ORDER BY {expression | position} [ASC | DESC]]
[{UNION | INTERSECT | MINUS} query]

Purpose of the Commands
- SELECT: specifies which columns are to appear
- FROM: specifies which table(s) are to be used
- WHERE: applies a restriction to the retrieval
- GROUP BY: groups rows with the same column value
- HAVING: adds a restriction to the groups retrieved
- ORDER BY: specifies the order of the output

A DB schema is a group of DB objects, such as tables and indexes, that are related to each other.
[CREATE SCHEMA AUTHORIZATION {creator};]

Creating Table Structures:
CREATE TABLE tablename (
  column1 datatype [constraint],
  column2 datatype [constraint],
  PRIMARY KEY (column1),
  FOREIGN KEY (column1) REFERENCES tablename
  [CONSTRAINT constraint]);

A foreign key constraint definition ensures that:
- You cannot delete a row in the referenced table if at least one row in the referencing table points to it.
- On the other hand, if a change is made to an existing referenced value, that change must be reflected automatically in the rows that reference it.
NOT NULL constraint: used to ensure that a column does not accept nulls.
UNIQUE constraint: used to ensure that all values in a column are unique.
DEFAULT constraint: used to assign a value to an attribute when a new row is added to a table.
CHECK constraint: when the condition is met for the specified attribute (that is, the condition is true), the data are accepted for that attribute.
The COMMIT and ROLLBACK commands are used to ensure DB update integrity in transaction management.
The EXISTS special operator: EXISTS can be used wherever there is a requirement to execute a command based on the result of another query.
A VIEW is a virtual table based on a SELECT query. The query can contain columns, computed columns, aliases and aggregate functions from one or more tables.
[CREATE VIEW viewname AS SELECT query;]
A concrete example of these DDL and view commands is sketched below.
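A minimal sketch putting the syntax above together (the VENDOR/PRODUCT table names and columns are assumed for illustration, not a specific example from the notes):

CREATE TABLE VENDOR (
  V_CODE     INTEGER      PRIMARY KEY,
  V_NAME     VARCHAR(35)  NOT NULL,
  V_STATE    CHAR(2)      DEFAULT 'NA'
);

CREATE TABLE PRODUCT (
  P_CODE     CHAR(10)     PRIMARY KEY,
  P_DESCRIPT VARCHAR(35)  NOT NULL,
  P_PRICE    DECIMAL(8,2) CHECK (P_PRICE >= 0),   -- CHECK constraint
  V_CODE     INTEGER,
  FOREIGN KEY (V_CODE) REFERENCES VENDOR (V_CODE)
);

CREATE VIEW EXPENSIVE_PRODUCT AS
  SELECT P_CODE, P_DESCRIPT, P_PRICE
  FROM   PRODUCT
  WHERE  P_PRICE > 100;

SELECT   V_CODE, COUNT(*) AS NUM_PRODUCTS
FROM     PRODUCT
GROUP BY V_CODE
HAVING   COUNT(*) > 1
ORDER BY 2 DESC;      -- order by the second column in the select list (position form)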

Embedded SQL refers to the use of SQL statements within an application programming language such as COBOL, C++, ASP, Java or .NET. The language in which the SQL statements are embedded is called the host language. Embedded SQL is still the most common approach to maintaining procedural capabilities in DBMS-based applications.
Get remaining note from the slide pg 12 - 18

Review Questions
1. What are the main operations of relational algebra?
2. What is the Cartesian product? Illustrate your answer with an example.
3. What is the difference between PROJECTION and SELECTION?
4. Explain the difference between a natural join and an outer join.

DBMS Optimization
Database Performance-Tuning Concepts
- The goal of database performance tuning is to execute queries as fast as possible.
- Database performance tuning refers to a set of activities and procedures designed to reduce the response time of the DB system, i.e. to try to ensure that an end-user query is processed by the DBMS in the minimum amount of time.
- The performance of a typical DBMS is constrained by three main factors:
  i. CPU processing power
  ii. Available primary memory (RAM)
  iii. Input/output (hard disk and network) throughput

System Resources
Hardware:
- CPU: Client - fastest possible; Server - multiple processors, fastest possible, e.g. quad-core Intel 2.66 GHz.
- RAM: Client - maximum possible; Server - maximum possible (e.g. 64 GB).
- Hard disk: Client - fast IDE hard disk with sufficient free space; Server - multiple high-speed, high-capacity disks, e.g. 750 GB.
- Network: Client - high-speed connection; Server - high-speed connection.
Software:
- Operating system: Client - fine-tuned for best client application performance; Server - fine-tuned for best server application performance.
- Network: Client - fine-tuned for best throughput; Server - fine-tuned for best throughput.
- Application: Client - optimize the SQL in the client application; Server - optimize the DBMS for best performance.

The system performs best when its hardware and software resources are optimized. Fine-tuning the performance of a system requires a holistic approach, i.e. all factors must be checked to ensure that each one operates at its optimum level and has sufficient resources to minimize the occurrence of bottlenecks.
Note: good DB performance starts with good DB design. No amount of fine-tuning will make a poorly designed DB perform as well as a well-designed DB.

Performance Tuning: Client and Server
- On the client side, the objective is to generate a SQL query that returns the correct answer in the least amount of time, using the minimum amount of resources at the server end. The activities required to achieve that goal are commonly referred to as SQL performance tuning.
- On the server side, the DBMS environment must be properly configured to respond to clients' requests in the fastest way possible, while making optimum use of existing resources. The activities required to achieve that goal are commonly referred to as DBMS performance tuning.

DBMS Architecture
The DBMS architecture is represented by the processes and structures (in memory and in permanent storage) used to manage a DB.

[Diagram: DBMS architecture.]

DBMS Architecture Components and Functions - All data in the DB are stored in DATA FILES. A data file can contain rows from one single table, or it can contain rows from many different tables. The DBA determines the initial size of the data files that make up the DB. Data files can automatically expand in predefined increments known as EXTENTS. For example, if more space is required, the DBA can define that each new extent will be added in 10KB or 10MB increments.
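A hedged, Oracle-style sketch of how a DBA might create a data file as part of a tablespace (a logical grouping of data files, described in the next paragraph) and control its growth increments; the file name and sizes are invented for illustration:

  -- Create a tablespace backed by one data file that starts at 100MB
  -- and grows automatically in 10MB extents up to a 500MB ceiling.
  CREATE TABLESPACE app_data
    DATAFILE 'app_data01.dbf' SIZE 100M
    AUTOEXTEND ON NEXT 10M MAXSIZE 500M;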

Data files are generally grouped into file groups or table spaces. A table space or file group is a logical grouping of several data files that store data with similar characteristics.
- The DBMS retrieves data from permanent storage and places it in RAM (the data cache).
- The SQL cache or procedure cache is a shared, reserved memory area that stores the most recently executed SQL statements or PL/SQL procedures, including triggers and functions.
- The data cache or buffer cache is a shared, reserved memory area that stores the most recently accessed data blocks in RAM.
- To move data from permanent storage (data files) to RAM (the data cache), the DBMS issues I/O requests and waits for the replies. An input/output request is a low-level (read or write) data access operation to/from computer devices. The purpose of the I/O operation is to move data to and from different computer components or devices.
- Working with data in the data cache is many times faster than working with data in the data files because the DBMS does not have to wait for the hard disk to retrieve the data.
- The majority of performance-tuning activities focus on minimising the number of I/O operations.
Processes are:
Listener: listens for clients' requests and hands the processing of the SQL requests to other DBMS processes.
User: the DBMS creates a user process to manage each client session.
Scheduler: schedules the concurrent execution of SQL requests.
Lock Manager: manages all locks placed on DB objects.
Optimizer: analyses SQL queries and finds the most efficient way to access the data.
Database Statistics: refers to a number of measurements about DB objects, such as tables and indexes, and available resources, such as the number of processors used, processor speed and temporary space available. These statistics give a snapshot of DB characteristics.
Reasons for the DBMS Optimiser:
The DBMS prevents direct access to the DB
The Optimiser is part of the DBMS
The Optimiser processes user requests
Removes the need for knowledge of the data format - hence data independence
References the data dictionary
- Therefore increased productivity
- Provides ad-hoc query processing.
Query Processing: the DBMS processes queries in 3 phases:
Parsing: the DBMS parses the SQL query and chooses the most efficient access/execution plan.
Execution: the DBMS executes the SQL query using the chosen execution plan.
Fetching: the DBMS fetches the data and sends the result set back to the client.
The SQL parsing activities are performed by the query optimiser. The query optimiser analyses the SQL query and finds the most efficient way to access the data.

Parsing a SQL query requires several steps:
Interpretation
- Syntax check: validates the SQL statement
- Validation: confirms the existence of the objects referenced (tables/attributes)
- Translation: into relational algebra
- Relational algebra optimisation
- Strategy selection: choose an execution plan
- Code generation: executable code
Accessing (I/O, disk access): read the data from the physical data files and generate the result set.
Processing time (CPU computation): process the data in the CPU.
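To see the execution plan the optimiser has chosen, many DBMSs expose an EXPLAIN facility. A minimal sketch, assuming an Oracle-style DBMS and a hypothetical EMP table:

  -- Ask the optimiser to record its chosen execution plan
  EXPLAIN PLAN FOR
    SELECT ENAME FROM EMP WHERE DEPTNO = 20;

  -- Display the recorded plan (access paths, join methods, estimated cost)
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);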

Query Optimisation: is the central activity during the parsing phase of query processing. In this phase, the DBMS must choose which indexes to use, how to perform join operations, which table to use first and so on. Indexes facilitate searching, sorting, using aggregate functions and even join operations. The improvement in data access speed occurs because an index is an ordered set of values that contains the index key and pointers. An optimizer is used to work out how to retrieve the data from the database in the most efficient way.
Types of Optimisers
Heuristic (rule-based): uses a set of preset rules and points to determine the best approach to execute a query. There are 15 rules, ranked in order of efficiency; a particular access path for a table is only chosen if the statement contains a predicate or other construct that makes that access path available. A score is assigned to each execution strategy using these rankings and the strategy with the best (lowest) score is selected. The rule-based (heuristic) optimizer uses a set of rules to quickly choose between alternative ways of retrieving the data. It has the advantage of quickly arriving at a solution with a low processing overhead, but the disadvantage of possibly not arriving at the most optimal solution.
Cost estimation (cost-based): uses sophisticated algorithms based on statistics about the objects being accessed to determine the best approach to execute a query. The optimiser adds up the processing costs, the I/O costs and the resource costs (RAM and temporary space) to come up with the total cost of a given execution plan. The cost-based optimizer uses statistics, which the DBA instructs the DBMS to gather from the database tables, and based on these values it estimates the expected amount of disk I/O and CPU usage required for alternative solutions. It then chooses the solution with the lowest cost and executes it. It has the advantage of being more likely to arrive at an optimal solution, but the disadvantage of taking more time, with a higher overhead in terms of processing requirements.
Cost-based + hints: the cost-based optimizer (with hints) is the same as the cost-based optimizer with the additional facility of allowing the DBA to supply hints to the optimizer, which instruct it to carry out certain

access methods and therefore eliminate the need for the optimizer to consider a number of alternative strategies. It has the advantage of giving control to the DBA, who may well know the best access method based on the current database data, plus the ability to quickly compare alternative execution plans, but it has the disadvantage of taking us back to hard coding, where the instructions on retrieving data are written into the application code. This could lead to the need to rewrite application code in the future when the situation changes. Optimiser hints are special instructions for the optimiser that are embedded inside the SQL command text.
Query Execution Plan (QEP)
SELECT ENAME FROM EMP E, DEPT D WHERE E.DEPTNO = D.DEPTNO AND DNAME = 'RESEARCH';
Two candidate plans: Option 1 performs the join of EMP and DEPT first and then applies the selection on DNAME; Option 2 applies the selection on DEPT first and then joins the (smaller) result with EMP; both finish with a projection on ENAME.
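Optimiser hints are written as special comments inside the SQL text. A minimal sketch, assuming Oracle-style hint syntax and a hypothetical index EMP_DEPTNO_IDX (both the hints shown and the index name are illustrative):

  -- Force a full table scan of EMP
  SELECT /*+ FULL(E) */ ENAME
  FROM EMP E, DEPT D
  WHERE E.DEPTNO = D.DEPTNO AND D.DNAME = 'RESEARCH';

  -- Ask the optimiser to use a specific index on EMP.DEPTNO instead
  SELECT /*+ INDEX(E EMP_DEPTNO_IDX) */ ENAME
  FROM EMP E, DEPT D
  WHERE E.DEPTNO = D.DEPTNO AND D.DNAME = 'RESEARCH';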

To calculate the QEP, the cost of both strategies is compared. [Diagram: cost comparison of the two execution plans.]

Cost-based optimisation makes use of statistics held in the data dictionary:
Number of rows
Number of blocks
Number of occurrences of each value
Largest/smallest value
It then calculates the cost of the alternative solutions to the query.
Statistics and the Cost-based Optimiser: the optimiser depends on statistics for all tables, clusters and indexes accessed by the query. It is the user's (DBA's) responsibility to generate these statistics and keep them current. Oracle's DBMS_STATS package can be used to generate and manage statistics and histograms; other DBMSs (e.g. SQL Server) provide an UPDATE STATISTICS procedure and auto-update and auto-create statistics options in their initialization parameters.
Gathering statistics with ANALYZE (Oracle-style):
ANALYZE TABLE TA COMPUTE STATISTICS;                      -- exact statistics
ANALYZE TABLE TA ESTIMATE STATISTICS SAMPLE n PERCENT;    -- estimate from a sample
ANALYZE INDEX TA_PK ESTIMATE STATISTICS;
Accessing statistics:
View USER_TABLES
View USER_TAB_COLUMNS
View USER_INDEXES
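A hedged sketch of gathering statistics with Oracle's DBMS_STATS package (the schema name is a placeholder):

  -- Gather table, column and index statistics for one table
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname => 'APP_SCHEMA',   -- placeholder schema
      tabname => 'TA',
      cascade => TRUE);          -- also gather statistics for the table's indexes
  END;
  /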
Review the pros and cons of each type of optimiser (see the Week 4 exercise solution).

Review Questions 1. What is SQL performance tuning? 2. What is database performance tuning? 3. What is the focus of most performance-tuning activities, and why does that focus exist? 4. What are database statistics, and why are they important? 5. How are DB statistics obtained? 6. What DB statistics measurements are typical of tables, indexes and resources? 7. How is the processing of SQL DDL statements (such as CREATE TABLE) different from the processing required by DML statements? 8. In simple terms, the DBMS processes queries in three phases. What are those phases, and what is accomplished in each phase? 9. If indexes are so important, why not index every column in every table? 10. What is the difference between a rule-based optimizer and a cost-based optimiser? 11. What are optimizer hints, and how are they used? 12. What recommendations would you make for managing the data files in a DBMS with many tables and indexes?

Production System
A DB is a carefully designed and constructed repository of facts. The fact repository is part of a larger whole known as an Information System. An information system provides for data collection, storage and retrieval. It also facilitates the transformation of data into information and the mgt of both data and information. A complete information system is composed of people, hardware, software, the DB, application programs and procedures. Systems analysis is the process that establishes the need for and the scope of an information system. The process of creating an information system is known as systems development. A successful database design must reflect the information system of which the database is a part. A successful information system is developed within a framework known as the Systems Development Life Cycle (SDLC). Applications transform data into the information that forms the basis for decision making. The most successful DBs are subject to frequent evaluation and revision within a framework known as the DB Life Cycle (DBLC). Database Design Strategies: top-down vs bottom-up and centralized vs decentralized. Information system applications: - Transform data into the information that forms the basis for decision making - Usually produce the following: formal reports, tabulations, graphic displays - Every application is composed of 2 parts: the data and the code by which the data are transformed into information.

The performance of an information system depends on a triad of factors: - DB design and implementation - Application design and implementation - Administrative procedures. The term DB development describes the process of DB design and implementation. The primary objective in DB design is to create complete, normalized, non-redundant and fully integrated conceptual, logical and physical DB models. Systems Development Life Cycle (SDLC)

The SDLC is an iterative rather than a sequential process. The SDLC is divided into five phases: Planning: such an assessment should answer some important questions: Should the existing system be continued? Should the existing system be modified? Should the existing system be replaced? The feasibility study must address the following: - The technical aspects of hardware and software requirements - The system cost. Analysis: problems defined during the planning phase are examined in greater detail during the analysis phase. Questions to address are: What are the requirements of the current system's end users? Do those requirements fit into the overall information requirements? The analysis phase of the SDLC is, in effect, a thorough audit of user requirements. The existing hardware and software systems are also studied in order to give a better understanding of the system's functional areas, actual and potential problems, and opportunities. The analysis phase also includes

the creation of a logical system design. The logical design must specify the appropriate conceptual data model, inputs, processes and expected output requirements. When creating the logical design, the designer might use tools such as data flow diagrams (DFDs), hierarchical input process output (HIPO) diagrams and ER diagrams. Defining the logical system also yields a functional description of the system's components (modules) for each process within the DB environment. Detailed Systems Design: the designer completes the design of the system's processes. The design includes all necessary technical specifications for the screens, menus, reports and other devices that might be used to help make the system a more efficient information generator. Implementation: the hardware, DBMS software and application programs are installed and the DB design is implemented. During the initial stages of the implementation phase, the system enters a cycle of coding, testing and debugging until it is ready to be delivered. The DB contents may be loaded interactively or in batch mode, using a variety of methods and devices: - Customised user programs - DB interface programs - Conversion programs that import the data from a different file structure, using batch programs, a DB utility or both. The system is subjected to exhaustive testing until it is ready for use. After testing is concluded, the final documentation is reviewed and printed and end users are trained. Maintenance: as soon as the system is operational, end users begin to request changes in it. These changes generate system maintenance: - Corrective maintenance in response to system errors - Adaptive maintenance due to changes in the business environment - Perfective maintenance to enhance the system. The DB Life Cycle (DBLC): it contains 6 phases

Database Design Strategies
Two classical approaches to DB design:
Top-down Design: - Identifies the data sets - Defines the data elements for each of these sets. This process involves the identification of different entity types and the definition of each entity's attributes.
Bottom-up Design: - Identifies the data elements (items) - Groups them together into data sets, i.e. it first defines attributes and then groups them to form entities.
The selection of a primary emphasis on top-down or bottom-up procedures often depends on the scope of the problem or on personal preferences. The 2 methodologies are complementary rather than mutually exclusive.

Top-down vs Bottom-up Design Sequencing: even when a primarily top-down approach is selected, the normalization process that revises existing table structures is (inevitably) a bottom-up technique. ER models constitute a top-down process even when the selection of attributes and entities can be described as bottom-up. Because both the ER model and normalization techniques form the basis for most designs, the top-down vs bottom-up debate may be based on a distinction without a difference.
Production System (continued):
- Use estimates to refresh statistics
- Use declarative & procedural integrity
- Use stored PL/SQL procedures (already compiled, held in the shared pool cache)
- System configuration

System Configuration
- Size & configuration of the DB caches
- Number/size of the data (buffer) caches

- Size of the shared pool (SQL, PL/SQL, triggers, data dictionary)
- Log buffers
Options for the DBA - Table structure: Heap, Hash, ISAM, BTree.
The main difference between the table structures is as follows: the Heap table has no indexing ability built into it, so if left as a heap it would require a secondary index if it were large and speedy access were required. The others have indexing ability built into them, but Hash and ISAM tables degrade over time if many modifications are made to them - the additional data is simply added to the end as a heap in overflow pages - as opposed to the BTree, which is dynamic and grows as data is added.
Data Structures
Heap: no key columns; queries, other than appends, scan every page; rows are appended at the end; 1 main page, all others are overflow; duplicate rows are allowed.
Do use when: inserting a lot of new rows; bulk loading data; the table is only 1-2 pages; queries always scan the entire table.
Do not use when: you need fast access to 1 or a small subset of rows; tables are large; you need to make sure a key is unique.
Hash
Do use when: data is retrieved based on the exact value of a key.
Do not use when: you need pattern matching or range searches; you need to scan the table entirely; you need to use a partial key for retrieval.
ISAM
Do use when: queries involve pattern matching and range scans; the table is growing slowly; the key is large; the table is small enough to modify frequently.

Do not use when: only doing exact matches; the table is large and growing rapidly.
BTree: the index is dynamic; data can be accessed sorted by key; overflow does not occur if there are no duplicate keys; deleted space on associated data pages is reused.
Do use when: you need pattern matching or range searches on the key; the table is growing fast; applications use a sequential key; the table is too big to modify; joining entire tables to each other.
Do not use when: the table is static or growing slowly; the key is large.
Creating Indexes: key fields, foreign keys, frequently accessed fields (see the index sketch below).
Disk Layout: multiple disks; location of tables/indexes, the log file and DBMS components; disk striping.
Other factors: CPU; disk access speed; operating system; available memory (swapping to disk); network performance.
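A minimal sketch of creating indexes on key, foreign-key and frequently searched columns, using standard SQL; the table and column names are hypothetical:

  -- Unique index enforcing the key on EMP.EMPNO
  CREATE UNIQUE INDEX EMP_PK_IDX ON EMP (EMPNO);

  -- Index on the foreign key used in joins to DEPT
  CREATE INDEX EMP_DEPTNO_IDX ON EMP (DEPTNO);

  -- Index on a column that appears frequently in search conditions
  CREATE INDEX EMP_ENAME_IDX ON EMP (ENAME);

Indexes such as these speed up searches and joins at the cost of extra storage and slower inserts and updates, which is why not every column should be indexed.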

De-normalisation (possible techniques):
- Including children with parents
- Storing the most recent child with the parent
- Hard-coding static data
- Storing running totals
- Using system-assigned keys
- Combining reference/code tables
- Creating extract tables

Centralized vs Decentralized Design
The two general approaches (bottom-up and top-down) to DB design can be influenced by factors such as the scope and size of the system, the company's mgt style and the company's structure (centralised or decentralised).
Centralized Design is productive when the data component is composed of a relatively small number of objects and procedures. Centralised design is relatively simple, suits small DBs and can be successfully done by a single person (the DBA). The company operations and the scope of the problem are sufficiently limited to allow even a single designer to define the problem, create the conceptual design and verify the conceptual design against the user views.

Decentralized Design: this might be used when the data component of the system has a considerable number of entities and complex relations on which very complex operations are performed. Decentralised design is likely to be employed when the problem itself is spread across several operational sites and each element is a subset of the entire data set. A carefully selected team of DB designers is employed to tackle a complex DB project. Within the decentralised design framework, the DB design task is divided into several modules. Each design group creates a conceptual data model corresponding to the subset being modelled. Each conceptual model is then verified individually against the user views, processes and constraints for each of the modules. After the verification process has been completed, all modules are integrated into one conceptual model. Naturally, after the subsets have been aggregated into a larger conceptual model, the lead designer must verify that the combined conceptual model is still able to support all of the required transactions.

Database Design
Conceptual, Logical and Physical Database Design.
Conceptual DB design is where we create the conceptual representation of the DB by producing a data model which identifies the relevant entities and relationships within our system.
Logical DB design is where we design relations based on each entity and define integrity rules to ensure there are no redundant relationships within our DB.
Physical DB design is where the physical DB is implemented in the target DBMS. In this stage we have to consider how each relation is stored and how data is accessed.

Three Stages of DB Design

Selecting a suitable file organisation is important for fast data retrieval and efficient use of storage space. The 3 most common types of file organisation are:
Heap files: which contain randomly ordered records.
Indexed sequential files: which are sorted on one or more fields and accessed using indexes.
Hashed files: in which a hashing algorithm is used to determine the address of each record based upon the value of the primary key.
Within a DBMS, indexes are often stored in a data structure known as a B-tree, which allows fast data retrieval. Two other kinds of indexes are bitmap indexes and join indexes. These are often used on multi-dimensional data held in data warehouses. Indexes are crucial in speeding up data access. Indexes facilitate searching, sorting, using aggregate functions and even join operations. The improvement in data access speed occurs because an index is an ordered set of values that contains the index key and pointers. Data sparsity refers to the number of different values a column could possibly have. Indexes are recommended on highly sparse columns used in search conditions.

Concurrency and Recovery A transaction is any action that reads from and/or writes to a DB. A transaction may consist of a simple SELECT statement to generate a list of table contents, of an UPDATE or INSERT statement, or of combinations of SELECT, UPDATE and INSERT statements.

A transaction is a logical unit of work that must be entirely completed or entirely aborted; no intermediate states are acceptable. All of the SQL statements in the transaction must be completed successfully. If any of the SQL statements fail, the entire transaction is rolled back to the original DB state that existed before the transaction started. A successful transaction changes the DB from one consistent state to another. A consistent DB state is one in which all data integrity constraints are satisfied. To ensure consistency of the DB, every transaction must begin with the DB in a known consistent state. If the DB is not in a consistent state, the transaction will yield an inconsistent DB that violates its integrity and business rules. All transactions are controlled and executed by the DBMS to guarantee DB integrity. Most real-world DB transactions are formed by two or more DB requests; a DB request is the equivalent of a single SQL statement in an application program or transaction.
Terms to know
Transaction: logical unit of work
Consistent state: DB reflecting the true position
Concurrent: at the same time
Sequence: read disk block, update data, rewrite disk block
Serializability: ensures that the concurrent execution of several transactions yields consistent results.
Transaction properties
All transactions must display atomicity, consistency, isolation, durability and serializability (the ACIDS test).
Atomicity: requires that all operations (SQL requests) of a transaction be completed; if not, the transaction is aborted.
Consistency: indicates the permanence of the DB's consistent state; when a transaction is completed, the DB reaches a consistent state.
Isolation: means that the data used during the execution of a transaction cannot be used by a 2nd transaction until the 1st one is completed.
Durability: ensures that once transaction changes are done (committed), they cannot be undone or lost, even in the event of a system failure.
Serializability: ensures that the concurrent execution of several transactions yields consistent results. This is important in multi-user and distributed databases, where multiple transactions are likely to be executed concurrently. Naturally, if only a single transaction is executed, serializability is not an issue.
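A minimal sketch of a transaction as a single logical unit of work, in standard SQL; the ACCOUNT table and values are invented for illustration:

  -- Move 100 from account 10 to account 20: both updates must succeed or neither.
  BEGIN TRANSACTION;   -- some DBMSs use BEGIN WORK or start the transaction implicitly

  UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACCT_NO = 10;
  UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ACCT_NO = 20;

  COMMIT;              -- make both changes permanent
  -- If either UPDATE had failed, ROLLBACK would restore the original consistent state.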

The Transaction Log DBMS uses a transaction log to keep track of all transactions that update the DB. The information stored in this log is used by the DBMS for a recovery requirement triggered by a ROLLBACK statement. Log with Deferred Updates - Transaction recorded in log file - Updates are not written to the DB

Log entries are used to update the DB

In the event of a failure:
- Any transactions not completed are ignored
- Any transactions committed are redone
- A checkpoint is used to limit the amount of rework.
Log with Immediate Updates
- Writes to the DB as well as to the log file
- The transaction record contains the old and new values
- Once the log record is written, the DB can be updated.
In the event of a failure:
- Transactions not completed are undone (old values); the updates are undone in reverse order
- Transactions committed are redone (new values).
Concurrency Control
The coordination of the simultaneous execution of transactions in a multi-user DB system is known as concurrency control. The objective of concurrency control is to ensure the serializability of transactions in a multi-user DB environment. Concurrency control is important because the simultaneous execution of transactions over a shared DB can create several data integrity and consistency problems. Both disk I/O and CPU are used. The 3 main problems are: lost updates, uncommitted data and inconsistent retrievals.
Uncommitted Data: occurs when 2 transactions are executed concurrently and the 1st transaction is rolled back after the 2nd transaction has already accessed the uncommitted data, thus violating the isolation property of transactions.
Inconsistent Retrievals: occur when a transaction calculates some summary (aggregate) functions over a set of data while other transactions are updating the data. The problem is that the transaction might read some data before they are changed and other data after they are changed, thereby yielding inconsistent results.
Lost Updates: occur when the 1st transaction T1 has not yet been committed when the 2nd transaction T2 is executed; T2 therefore still operates on the initial value, and one of the updates is lost.
The Scheduler: is responsible for establishing the order in which the concurrent transaction operations are executed. The transaction execution order is critical and ensures DB integrity in multi-user DB systems. Locking, time-stamping and optimistic methods are used by the scheduler to ensure the serializability of transactions. Serializability of schedules is guaranteed through the use of two-phase locking. The two-phase locking scheme has a growing phase, in which the transaction acquires all of the locks that it needs without unlocking any data, and a shrinking phase, in which the transaction releases all of the locks without acquiring new locks.

Serializability: serial execution means performing transactions one after another. If 2 transactions only read a variable, they do not conflict and order is not important. If 2 transactions operate on separate variables, they do not conflict and order is not important. Only when one transaction writes to a variable and another either reads or writes the same variable does order become important. Serializability is making sure that, when it counts, transactions operate in order.
Lock Granularity: indicates the level of lock use. Locking can take place at the following levels: database, table, page, row, or even field (attribute).
Database-level lock: the entire DB is locked, thus preventing the use of any tables in the DB by transaction T2 while transaction T1 is being executed. This level of locking is good for batch processes, but it is unsuitable for online multi-user DBMSs. Note that transactions T1 and T2 cannot access the same DB concurrently even when they use different tables.
Table-level lock: the entire table is locked, preventing access to any row by transaction T2 while transaction T1 is using the table. If a transaction requires access to several tables, each table may be locked. However, 2 transactions can access the same DB as long as they access different tables.
Page-level lock: the DBMS locks an entire disk page. A disk page, or page, is the equivalent of a disk block, which can be described as a directly addressable section of a disk. A page has a fixed size.
Row-level lock: much less restrictive than the locks discussed above. The DBMS allows concurrent transactions to access different rows of the same table, even when the rows are located on the same page.
Field-level lock: allows concurrent transactions to access the same row as long as they require the use of different fields (attributes) within that row.
Lock Types: Shared/Exclusive locks: an exclusive lock exists when access is reserved specifically for the transaction that locked the object. A read (shared) lock allows the reading but not the updating of a data item, allowing multiple concurrent readers. A write (exclusive) lock allows exclusive update of a data item. A shared lock is issued when a transaction wants to read data from the DB and no exclusive lock is held on that data item. An exclusive lock is issued when a transaction wants to update (write) a data item and no locks are currently held on that data item by any other transaction.
Two-Phase Locking: defines how transactions acquire and relinquish locks. It guarantees serializability, but it does not prevent deadlocks. The two phases are:

1. Growing phase: the transaction acquires all required locks without unlocking any data. Once all locks have been acquired, the transaction is at its locked point.
2. Shrinking phase: the transaction releases all locks and cannot obtain any new lock.
The two-phase locking protocol is governed by the following rules: two transactions cannot have conflicting locks; no unlock operation can precede a lock operation in the same transaction; no data are affected until all locks are obtained, i.e. until the transaction is at its locked point.
Deadlocks: a deadlock occurs when 2 transactions wait for each other to unlock data.
Three basic techniques to control deadlocks:
Deadlock prevention: a transaction requesting a new lock is aborted when there is the possibility that a deadlock can occur. If the transaction is aborted, all changes made by this transaction are rolled back and all locks obtained by the transaction are released. (Statically make deadlock structurally impossible.)
Deadlock detection: the DBMS periodically tests the DB for deadlocks; if a deadlock is found, one of the transactions (the victim) is aborted (rolled back and restarted) and the other transaction continues. (Let deadlocks occur, detect them and try to recover.)
Deadlock avoidance: the transaction must obtain all of the locks it needs before it can be executed. (Avoid deadlocks by allocating resources carefully.)
Concurrency Control with Time-Stamping Methods: the time-stamping approach to scheduling concurrent transactions assigns a global, unique time stamp to each transaction. Time stamps must have two properties: uniqueness and monotonicity. Uniqueness ensures that no equal time stamp values can exist. Monotonicity ensures that time stamp values always increase. All DB operations (read and write) within the same transaction must have the same time stamp. The DBMS executes conflicting operations in time stamp order, thereby ensuring serializability of the transactions. If 2 transactions conflict, one is stopped, rolled back, rescheduled and assigned a new time stamp value. No locks are used, so no deadlock can occur. The disadvantage of the time-stamping approach is that each value stored in the DB requires 2 additional time stamp fields.
Concurrency Control with Optimistic Methods: the optimistic approach is based on the assumption that the majority of DB operations do not conflict. The optimistic approach does not require locking or time-stamping techniques. Instead, a transaction is executed without restrictions until it is committed. Each transaction moves through 2 or 3 phases, which are READ, VALIDATION and WRITE. Some environments may have relatively few conflicts between transactions, so locking would be an inefficient overhead. The optimistic technique eliminates this overhead: assume there will be no problems, and before committing, perform a check.

If a conflict has occurred, the transaction is rolled back.
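Most SQL dialects let a transaction take explicit locks rather than relying solely on the scheduler's defaults. A hedged sketch, assuming Oracle/PostgreSQL-style syntax and a hypothetical ACCOUNT table:

  -- Row-level lock: lock just the rows being read before updating them
  SELECT BALANCE FROM ACCOUNT WHERE ACCT_NO = 10 FOR UPDATE;
  UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACCT_NO = 10;
  COMMIT;   -- committing releases the lock

  -- Table-level exclusive lock: no other transaction may write to the table
  LOCK TABLE ACCOUNT IN EXCLUSIVE MODE;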

Database Recovery: DB recovery restores a DB from a given state, usually inconsistent, to a previously consistent state.
Need for Recovery:
Physical disasters - fire, flood
Sabotage - internal
Carelessness - unintentional
Disk malfunctions - head crash, unreadable tracks
System crashes - hardware
System software errors - termination of the DBMS
Application software errors - logical errors.
Recovery techniques are based on the atomic transaction property: all portions of the transaction must be treated as a single, logical unit of work in which all operations are applied and completed to produce a consistent DB. Techniques to restore the DB to a consistent state roll back transactions that were not completed and record transactions using a log file, which contains transaction and checkpoint records; a checkpoint record lists the current transactions.

Four Important Concepts that Affect the Recovery Process:
Write-ahead-log protocol: ensures that transaction logs are always written before any DB data are actually updated.
Redundant transaction logs: most DBMSs keep several copies of the transaction log to ensure that a physical disk failure will not impair the DBMS's ability to recover data.
Database buffers: a buffer is a temporary storage area in primary memory used to speed up disk operations.
Database checkpoints: a checkpoint is an operation in which the DBMS writes all of its updated buffers to disk. The checkpoint operation is also registered in the transaction log.
When the recovery procedure uses deferred write (or deferred update), the transaction operations do not immediately update the physical DB; instead, only the transaction log is updated. The recovery process for all started and committed transactions (before the failure) follows these steps: identify the last checkpoint in the transaction log. For a transaction that started and committed before the last checkpoint, nothing needs to be done because the data are already saved. For a transaction that performed a commit operation after the last checkpoint, the DBMS uses the transaction log records to redo the transaction and to update the DB, using the "after" values in the transaction log. For any transaction that had a rollback operation after the last checkpoint, or that was left active before the failure occurred, nothing needs to be done because the DB was never updated. When the recovery procedure uses write-through (or immediate update), the DB is immediately updated by transaction operations during the transaction's execution, even before the transaction reaches its commit point.

Deadlocks in Distributed Systems
Deadlocks in distributed systems are similar to deadlocks in single-processor systems, only worse:
- They are harder to avoid, prevent or even detect.
- They are harder to cure when tracked down, because all the relevant information is scattered over many machines.
Distributed Deadlock Detection
Since preventing and avoiding deadlocks is difficult, researchers have worked on detecting the occurrence of deadlocks in distributed systems. The presence of atomic transactions in some distributed systems makes a major conceptual difference. When a deadlock is detected in a conventional system, we kill one or more processes to break the deadlock. When a deadlock is detected in a system based on atomic transactions, it is resolved by aborting one or more transactions. But transactions have been designed to withstand being aborted. When a transaction is aborted, the system is first restored to the state it had before the transaction began, at which point the transaction can start again. With a bit of luck, it will succeed the second time. Thus the difference is that the consequences of killing off a process are much less severe when transactions are used.
1. Centralised Deadlock Detection: we use a centralised deadlock detection algorithm and try to imitate the non-distributed algorithm. Each machine maintains the resource graph for its own processes and resources. A centralised coordinator maintains the resource graph for the entire system. In updating the coordinator's graph, messages have to be passed:
- Method 1: whenever an arc is added to or deleted from the resource graph, a message is sent to the coordinator.
- Method 2: periodically, every process sends a list of arcs added and deleted since the previous update.
- Method 3: the coordinator asks for information when it needs it.
One possible way to prevent false deadlocks is to use Lamport's algorithm to provide global timing for the distributed system. When the coordinator gets a message that leads to a suspected deadlock, it sends everybody a message saying: "I just received a message with timestamp T which leads to a deadlock. If anyone has a message for me with an earlier timestamp, please send it immediately." When every machine has replied, positively or negatively, the coordinator will see whether the deadlock has really occurred or not.
2. The Chandy-Misra-Haas algorithm: processes are allowed to request multiple resources at once, so the growing phase of a transaction can be speeded up. The consequence of this change is that a process may now wait on two or more resources at the same time. When a process has to wait for some resource, a probe message is generated and sent to the process holding the resource. The message consists of three numbers: the process being blocked, the process sending the message and the process receiving the message. When the message arrives, the recipient checks to see if it itself is waiting for any processes. If so, the message is updated, keeping the first number unchanged and replacing the second and third fields with the corresponding process numbers. The message is then sent to the process holding the needed resources. If a message goes all the way around and comes back to the original sender

- the process that initiated the probe - a cycle exists and the system is deadlocked.
Review Questions
I. Explain the following statement: a transaction is a logical unit of work.
II. What is a consistent database state, and how is it achieved?
III. The DBMS does not guarantee that the semantic meaning of the transaction truly represents the real-world event. What are the possible consequences of that limitation? Give an example.
IV. List and discuss the four transaction properties.
V. What is a transaction log, and what is its function?
VI. What is a scheduler, what does it do, and why is its activity important to concurrency control?
VII. What is a lock, and how, in general, does it work?
VIII. What is concurrency control, and what are its objectives?
IX. What is an exclusive lock, and under what circumstances is it granted?
X. What is a deadlock, and how can it be avoided? Discuss several deadlock avoidance strategies.
XI. What three levels of backup may be used in DB recovery mgt? Briefly describe what each of those three backup levels does.
Database Security Issues
Types of Security
Legal and ethical issues regarding the right to access certain information. Some information may be deemed to be private and cannot be accessed legally by unauthorized persons.
Policy issues at the governmental, institutional or corporate level as to what kinds of information should not be made publicly available - for example, credit ratings and personal medical records.
System-related issues: such as the system level at which various security functions should be enforced - for example, whether a security function should be handled at the physical hardware level, the operating system level or the DBMS level.
The need to identify multiple security levels and to categorize the data and users based on these classifications - for example, top secret, secret, confidential and unclassified. The security policy of the organisation with respect to permitting access to various classifications of data must be enforced.
Threats to Databases: these result in the loss or degradation of some or all of the following commonly accepted security goals: integrity, availability and confidentiality.
Loss of Integrity: database integrity refers to the requirement that information be protected from improper modification. Modification of data includes creation, insertion, updating, changing the status of data, and deletion. Integrity is lost if unauthorised

changes are made to the data by either intentional or accidental acts. If the loss of system or data integrity is not corrected, continued use of the contaminated system or corrupted data could result in inaccuracy, fraud or erroneous decisions.
Loss of Availability: database availability refers to making objects available to a human user or a program which has a legitimate right to them.
Loss of Confidentiality: database confidentiality refers to the protection of data from unauthorized disclosure. Unauthorized, unanticipated or unintentional disclosure could result in loss of public confidence, embarrassment or legal action against the organisation.
Control Measures
Four main control measures are used to provide security of data in databases:
Access control
Inference control
Flow control
Data encryption
Access Control: the security mechanism of a DBMS must include provisions for restricting access to the database system as a whole. This function is called access control and is handled by creating user accounts and passwords to control the login process in the DBMS.
Inference Control: statistical databases are used to provide statistical information or summaries of values based on various criteria, e.g. a database for population statistics. Statistical database users, e.g. government statisticians or market research firms, are allowed to access the database to retrieve statistical information about a population but not to access the detailed confidential information about specific individuals. Statistical database security ensures that information about individuals cannot be accessed. It is sometimes possible to deduce or infer certain facts concerning individuals from queries that involve only summary statistics on groups; consequently, this must not be permitted either. The corresponding control measures are called inference control.
Flow Control: prevents information from flowing in such a way that it reaches unauthorized users. Channels that are pathways for information to flow implicitly, in ways that violate the security policy of an organisation, are called covert channels.
Data Encryption: is used to protect sensitive data, such as credit card numbers, that are transmitted via some type of communications network. The data are encoded using some coding algorithm. An unauthorized user who accesses encoded data will have difficulty deciphering it, but authorized users are given decoding or decrypting algorithms (or keys) to decipher the data.
A DBMS typically includes a database security and authorization subsystem that is responsible for the security of portions of a database against unauthorized access.
Two types of database security mechanism:
i) Discretionary security mechanisms: these are used to grant privileges to users, including the capability to access specific data files, records or fields in a specified mode (such as read, insert, delete or update).
ii) Mandatory security mechanisms: used to enforce multilevel security by classifying the data and users into various security classes/levels and then implementing the appropriate security policy of the organisation, e.g. a typical security policy is to permit users at a certain classification

level to see only the data items classified at the user's own or lower classification levels. An extension of this is role-based security, which enforces policies and privileges based on the concept of roles.
Database Security and the DBA
The DBA is the central authority for managing a database system. The DBA's responsibilities include granting privileges to users who need to use the system and classifying users and data in accordance with the policy of the organisation. The DBA has a DBA account in the DBMS, sometimes called a system or superuser account, which provides powerful capabilities that are not made available to regular database accounts and users. DBA-privileged commands include commands for granting and revoking privileges to individual accounts or user groups and for performing the following types of actions:
i. Account creation: creates a new account and password for a user or group of users to enable access to the DBMS.
ii. Privilege granting: permits the DBA to grant certain privileges to certain accounts.
iii. Privilege revocation: permits the DBA to revoke certain privileges that were previously given to certain accounts.
iv. Security level assignment: consists of assigning user accounts to the appropriate security classification level.
The DBA is responsible for the overall security of the database system. Action i above is used to control access to the DBMS as a whole, whereas actions ii and iii are used to control discretionary database authorization, and action iv is used to control mandatory authorization.
Access Protection, User Accounts and Database Audits
The DBA will create a new account number and password for a user if there is a legitimate need to access the database. The user must log in to the DBMS by entering the account number and password whenever database access is needed. It is straightforward to keep track of database users and their accounts and passwords by creating an encrypted table or file with two fields: account number and password. This table can be easily maintained by the DBMS. The database system must also keep track of all operations on the database that are applied by a certain user throughout each login session, which consists of the sequence of database interactions that a user performs from the time of logging in to the time of logging off. To keep a record of all updates applied to the database and of the particular user who applied each update, we can use the system log, which includes an entry for each operation applied to the database that may be required for recovery from a transaction failure or system crash. If any tampering with the database is suspected, a database audit is performed, which consists of reviewing the log to examine all accesses and operations applied to the database during a certain time period. When an illegal or unauthorized operation is found, the DBA can determine the account number used to perform the operation. Database audits are particularly important for sensitive databases that are updated by many transactions and users, such as a banking database that is updated by many bank tellers. A database log that is used mainly for security purposes is sometimes called an audit trail.
Discretionary Access Control based on Granting and Revoking Privileges
The typical method of enforcing discretionary access control in a database system is based on the granting and revoking of privileges.

Types of Discretionary Privileges:
The account level: at this level, the DBA specifies the particular privileges that each account holds independently of the relations in the database. The privileges at the account level apply to the capabilities provided to the account itself and can include the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relations; the CREATE VIEW privilege; the ALTER privilege, to apply schema changes such as adding or removing attributes from relations; the DROP privilege, to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples; and the SELECT privilege, to retrieve information from the database by using a SELECT query.
The relation (or table) level: at this level, the DBA can control the privileges to access each individual relation or view in the database. This second level of privileges applies at the relation level, whether the relations are base relations or virtual (view) relations.
The granting and revoking of privileges generally follows an authorization model for discretionary privileges known as the access matrix model, where the rows of a matrix M represent subjects (users, accounts, programs) and the columns represent objects (relations, records, columns, views, operations). Each position M(i, j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j.
To control the granting and revoking of relation privileges, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was created in the first place. The owner of a relation is given all privileges on that relation. In SQL2, the DBA can assign an owner to a whole schema by creating the schema and associating the appropriate authorization identifier with that schema, using the CREATE SCHEMA command. The owner account holder can pass privileges on any of the owned relations to other users by granting privileges to their accounts. In SQL the following types of privileges can be granted on each individual relation R:
SELECT (retrieval or read) privilege on R: gives the account retrieval privilege. In SQL this gives the account the privilege to use the SELECT statement to retrieve tuples from R.
MODIFY privilege on R: gives the account the capability to modify the tuples of R. In SQL this privilege is divided into UPDATE, DELETE and INSERT privileges to apply the corresponding SQL commands to R. Additionally, both the INSERT and UPDATE privileges can specify that only certain attributes of R can be updated by the account.
REFERENCES privilege on R: gives the account the capability to reference relation R when specifying integrity constraints. This privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have the SELECT privilege on all relations involved in the view definition.
Specifying Privileges using Views
The mechanism of views is an important discretionary authorization mechanism in its own right.

For example, if the owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a view V of R that includes only those attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving only certain tuples of R; a view V can be created by defining the view by means of a query that selects only those tuples from R that A wants to allow B to access.
Revoking Privileges: in some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE command is included for the purpose of cancelling privileges.
Propagation of Privileges using the GRANT OPTION: whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B with or without the GRANT OPTION. If the GRANT OPTION is given, this means that B can also grant that privilege on R to other accounts. Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R to a third account C, also with the GRANT OPTION. In this way, privileges on R can propagate to other accounts without the knowledge of the owner of R. If the owner account A now revokes the privilege granted to B, all the privileges that B propagated based on that privilege should automatically be revoked by the system. It is possible for a user to receive a certain privilege from two or more sources, e.g. A4 may receive a certain UPDATE R privilege from both A2 and A3. In such a case, if A2 revokes this privilege from A4, A4 will still continue to have the privilege by virtue of having been granted it by A3. If A3 later revokes the privilege from A4, A4 totally loses the privilege. Hence a DBMS that allows propagation of privileges must keep track of how all the privileges were granted, so that revoking of privileges can be done correctly and completely.
Specifying Limits on Propagation of Privileges: techniques to limit the propagation of privileges have been developed, although they have not yet been implemented in most DBMSs and are not a part of SQL. Limiting horizontal propagation to an integer number i means that an account B given the GRANT OPTION can grant the privilege to at most i other accounts. Vertical propagation is more complicated; it limits the depth of the granting of privileges. Granting a privilege with a vertical propagation of zero is equivalent to granting the privilege with no GRANT OPTION. If account A grants a privilege to account B with the vertical propagation set to an integer number j > 0, this means that account B has the GRANT OPTION on that privilege, but B can grant the privilege to other accounts only with a vertical propagation less than j.
Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main security mechanism for relational database systems. This is an all-or-nothing method: a user either has or does not have a certain privilege. In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms.
It is important to note that most commercial DBMSs currently provide mechanisms only for discretionary access control.
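A minimal sketch of these discretionary mechanisms in standard SQL, using hypothetical accounts A and B and a hypothetical EMPLOYEE relation (all names are illustrative):

  -- A restricts B to two columns of EMPLOYEE via a view, then grants SELECT on it
  CREATE VIEW EMP_PUBLIC AS
    SELECT ENAME, DEPTNO FROM EMPLOYEE;
  GRANT SELECT ON EMP_PUBLIC TO B;

  -- A grants UPDATE on EMPLOYEE to B with the GRANT OPTION, so B may pass it on
  GRANT UPDATE ON EMPLOYEE TO B WITH GRANT OPTION;

  -- Later, A revokes the privilege; privileges B propagated should also be revoked
  REVOKE UPDATE ON EMPLOYEE FROM B;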

Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U.
The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-property).
The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is higher than the subject's security clearance. The second restriction is less intuitive: it prohibits a subject from writing an object at a lower security classification than the subject's security clearance. Violation of this rule would allow information to flow from higher to lower classifications, which violates a basic tenet of multilevel security. For example, a user with secret (S) clearance may read S, C and U objects but not TS objects, and may write S and TS objects but not C or U objects.
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. Hence, a multilevel relation schema R with n attributes would be represented as R(A1, C1, A2, C2, ..., An, Cn, TC), where each Ci represents the classification attribute associated with attribute Ai. The value of the TC attribute in each tuple t, which is the highest of all attribute classification values within t, provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple. The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding tuples at a lower classification level through a process known as filtering. In other cases, it is necessary to store two or more tuples at different classification levels with the same value for the apparent key. This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different attribute values for users at different classification levels. In general, the entity integrity rule for multilevel relations states that all attributes that are members of the apparent key must not be null and must have the same security classification within each individual tuple. In addition, all other attribute values in the tuple must have a security classification greater than or equal to that of the apparent key. This constraint ensures that a user can see the key if the user is

permitted to see any part of the tuple at all. Other integrity rules, called Null Integrity and Interinstance Integrity, informally ensure that if a tuple value at some security level can be filtered from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.

Comparing Discretionary Access Control and Mandatory Access Control
Discretionary access control (DAC) policies are characterized by a high degree of flexibility, which makes them suitable for a large variety of application domains. The main drawback of DAC models is their vulnerability to malicious attacks, such as Trojan horses embedded in application programs. By contrast, mandatory policies ensure a high degree of protection in that they prevent any illegal flow of information. Mandatory policies have the drawback of being too rigid, and they are only applicable in limited environments. In many practical situations, discretionary policies are preferred because they offer a better trade-off between security and applicability.
Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for managing and enforcing security in large-scale enterprise-wide systems. Its basic notion is that permissions are associated with roles, and users are assigned to appropriate roles. Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE commands discussed under DAC can then be used to assign and revoke privileges from roles. RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls; it ensures that only authorized users are given access to certain data or resources. Role hierarchy in RBAC is a natural way to organize roles to reflect the organization's lines of authority and responsibility. Another important consideration in RBAC systems is the possible temporal constraints that may exist on roles, such as the time and duration of role activations and the timed triggering of a role by the activation of another role. The RBAC model is a highly desirable goal for addressing the key security requirements of Web-based applications. RBAC models have several desirable features, such as flexibility, policy neutrality, better support for security management and administration, and other aspects that make them attractive candidates for developing secure Web-based applications. RBAC models can represent traditional DAC and MAC policies as well as user-defined or organization-specific policies. The RBAC model provides a natural mechanism for addressing the security issues related to the execution of tasks and workflows. Easier deployment over the internet has been another reason for the success of RBAC models.
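A minimal sketch of role-based privileges, reusing the GRANT mechanism shown earlier; the role, table and account names are hypothetical, and the exact role syntax varies by DBMS:

  -- Create a role, attach privileges to it, then assign the role to users
  CREATE ROLE SALES_CLERK;
  GRANT SELECT, INSERT ON ORDERS TO SALES_CLERK;
  GRANT SALES_CLERK TO USER_ANN;
  GRANT SALES_CLERK TO USER_BOB;

  -- Removing a privilege from the role affects every user holding the role
  REVOKE INSERT ON ORDERS FROM SALES_CLERK;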

Access Control Policies for E-commerce and the Web
E-commerce environments require elaborate policies that go beyond traditional DBMSs. In conventional database environments, access control is usually performed using a set of authorizations stated by security officers or users according to some security policies. Such a simple paradigm is not well suited to a dynamic environment like e-commerce. In an e-commerce environment, the resources to be protected are not only traditional data but also knowledge and experience. Such peculiarities call for more flexibility in specifying access control policies. The access control mechanism should be flexible enough to support a wide spectrum of heterogeneous protection objects. A second, related requirement is support for content-based access control. Content-based access control allows one to express access control policies that take the protection object's content into account. In order to support content-based access control, access control policies must allow the inclusion of conditions based on the object content. A third requirement is related to the heterogeneity of subjects, which requires access control policies based on user characteristics and qualifications rather than on specific and individual characteristics (e.g., user IDs). A credential is a set of properties concerning a user that are relevant for security purposes. It is believed that the XML language can play a key role in access control for e-commerce applications, because XML is becoming the common representation language for document interchange over the Web and is also becoming the language for e-commerce.
Statistical Database Security
Statistical databases are used mainly to produce statistics on various populations. The database may contain confidential data on individuals, which should be protected from user access. Users are permitted to retrieve statistical information on the populations, such as averages, sums, counts, maximums, minimums, and standard deviations. A population is a set of tuples of a relation (table) that satisfy some selection condition. Statistical queries involve applying statistical functions to a population of tuples. For example, we may want to retrieve the number of individuals in a population or the average income in the population. However, statistical users are not allowed to retrieve individual data, such as the income of a specific person. Statistical database security techniques must prohibit the retrieval of individual data. This can be achieved by prohibiting queries that retrieve attribute values and by allowing only queries that involve statistical aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE, and STANDARD DEVIATION. Such queries are sometimes called statistical queries. It is the DBMS's responsibility to ensure the confidentiality of information about individuals while still providing useful statistical summaries of data about those individuals to users. Provision of privacy protection for the individuals represented in a statistical database is paramount. In some cases it is possible to infer the values of individual tuples from a sequence of statistical queries. This is particularly true when the conditions result in a population consisting of a small number of tuples.
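For example, the following contrast, a hedged sketch using an invented PERSON table, shows the kind of query a statistical database would allow versus the kind it must reject. Note that even an allowed aggregate becomes equivalent to individual retrieval if the selection condition narrows the population to a single tuple.

    -- Allowed: an aggregate over a population
    SELECT AVG(Income)
    FROM   PERSON
    WHERE  City = 'London';

    -- Must be rejected: retrieves an individual's attribute value
    SELECT Income
    FROM   PERSON
    WHERE  Name = 'J. Smith';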

Flow Control
Flow control regulates the distribution or flow of information among accessible objects. A flow between object X and object Y occurs when a program reads values from X and writes values into Y. Flow controls check that information contained in some objects does not flow, explicitly or implicitly, into less protected objects. A flow policy specifies the channels along which information is allowed to move. The simplest flow policy specifies just two classes of information, confidential (C) and nonconfidential (N), and allows all flows except those from class C to class N. This policy can solve the confinement problem that arises when a service program handles data such as customer information, some of which may be confidential. Flow controls can be enforced by an extended access control mechanism, which involves assigning a security class (usually called the clearance) to each running program. The flow control mechanism must verify that only authorized flows, both explicit and implicit, are executed. A set of rules must be satisfied to ensure secure information flow.
Covert Channels
A covert channel allows a transfer of information that violates the security policy. It allows information to pass from a higher classification level to a lower classification level through improper means. Covert channels can be classified into two broad categories: timing channels and storage channels. In a timing channel the information is conveyed by the timing of events or processes, whereas storage channels do not require any temporal synchronization: information is conveyed by accessing system information that is otherwise inaccessible to the user.
Encryption and Public Key Infrastructures
Encryption is a means of maintaining secure data in an insecure environment. Encryption consists of applying an encryption algorithm to data using some prespecified encryption key. The resulting data have to be decrypted using a decryption key to recover the original data.
The Data and Advanced Encryption Standards
The Data Encryption Standard (DES) is a system developed by the U.S. government for use by the general public. It has been widely accepted as a cryptographic standard both in the United States and abroad. DES can provide end-to-end encryption on the channel between sender A and receiver B. The DES algorithm is a careful and complex combination of two of the fundamental building blocks of encryption: substitution and permutation (transposition).
Public Key Encryption
The two keys used for public key encryption are referred to as the public key and the private key. Invariably, the private key is kept secret, but it is referred to as a private key rather than a secret key (the key used in conventional encryption) to avoid confusion with conventional encryption. Public key encryption refers to a type of cipher architecture known as public key cryptography that utilizes two keys, or a key pair, to encrypt and decrypt data. One of the two keys is a public key, which anyone can use to encrypt a message for the owner of that key. The encrypted message is sent, and the recipient uses his or her private key to decrypt it. This is the basis of public key encryption. Other encryption technologies use a single shared key to both encrypt and decrypt data, and rely on both parties deciding on a key ahead of time without other parties finding out what that key is. This type of encryption technology is called symmetric encryption, while public key encryption is known as asymmetric encryption.

The public key of the pair is made public for others to use, whereas the private key is known only to its owner.
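As a toy illustration of the arithmetic behind one well-known public key scheme, RSA (discussed later in this document), with deliberately tiny and insecure numbers chosen only to make the calculation easy to follow: pick primes p = 3 and q = 11, so n = p*q = 33 and (p-1)(q-1) = 20. Choose a public exponent e = 3 and a private exponent d = 7, since e*d = 21 = 1 (mod 20). Encrypting the message m = 4 with the public key gives c = m^e mod n = 4^3 mod 33 = 31; decrypting with the private key gives c^d mod n = 31^7 mod 33 = 4, recovering the original message. Real systems use numbers that are hundreds of digits long, which is what makes deriving the private key from the public key computationally infeasible.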

A public key encryption scheme or infrastructure has six ingredients:
i. Plaintext: the data or readable message that is fed into the algorithm as input.
ii. Encryption algorithm: performs various transformations on the plaintext.
iii. Public key and iv. Private key: a pair of keys selected so that if one is used for encryption, the other is used for decryption.
v. Ciphertext: the scrambled message produced as output. It depends on the plaintext and the key; for a given message, two different keys will produce two different ciphertexts.
vi. Decryption algorithm: accepts the ciphertext and the matching key and produces the original plaintext.
A "key" is simply a small piece of text code that triggers the associated algorithm to encode or decode text. In public key encryption, a key pair is generated using an encryption program and the pair is associated with a name or email address. The public key can then be made public by posting it to a key server, a computer that hosts a database of public keys. Public key encryption can also be used for secure storage of data files; in this case, your public key is used to encrypt the files while your private key decrypts them.
User Authentication is a way of identifying the user and verifying that the user is allowed to access some restricted data or application. This can be achieved through the use of passwords and access rights.
Methods of attacking a distributed system:
- Eavesdropping: the act of surreptitiously listening to a private conversation.
- Masquerading.
- Message tampering.
- Replaying.
- Denial of service: a denial-of-service attack (DoS attack) or distributed denial-of-service attack (DDoS attack) is an attempt to make a computer resource unavailable to its intended users. Although the means to carry out, motives for, and targets of a DoS attack may vary, it generally consists of the concerted efforts of a person or persons to prevent an Internet site or service from functioning efficiently or at all, temporarily or indefinitely. Perpetrators of DoS attacks typically target sites or services hosted on high-profile web servers such as banks, credit card payment gateways, and even root nameservers.
- Phishing: phishers use electronic communications that look as if they came from legitimate banks or other companies to persuade people to divulge sensitive information, including passwords and credit card numbers.
Why Cryptography is Necessary in a Distributed System
Supporting the facilities of a distributed system, such as resource distribution, requires the use of an underlying message passing system. Such systems are, in turn, reliant on the use of a physical transmission network, upon which the messages may physically be communicated between hosts.

Physical networks, and therefore the basic message passing systems built over them, are vulnerable to attack. For example, hosts may easily attach to the network and listen in on the messages (or 'conversations') being held. If the transmissions are in a readily understandable form, the eavesdroppers may be able to pick out units of information, in effect stealing their information content. Aside from the theft of user data, which may in itself be of great value, there may also be system information being passed around as messages. Eavesdroppers from both inside and outside the system may attempt to steal this system information as a means of either breaching internal access constraints or aiding in the attack of other parts of the system. Two possibly worse scenarios exist, in which an attacker may modify or insert fake transmissions on the network. Accepting faked or modified messages as valid could lead a system into chaos. Without adequate protection techniques, distributed systems are extremely vulnerable to the standard types of attack outlined above. The encryption techniques discussed in the remainder of this section aim to provide the missing protection by transforming a message into a form in which, if it were intercepted in transit, the contents of the original message could not be explicitly discovered. Such encrypted messages, when they reach their intended recipients, are capable of being transformed back into the original message. There are two main frameworks in which this goal may be achieved: secret key encryption systems and public key encryption systems.
Secret Key Encryption Systems
Secret key encryption uses a single key to both encrypt and decrypt messages. As such, it must be present at both the source and destination of a transmission to allow the message to be transmitted securely and recovered upon receipt at the correct destination. The key must be kept secret by all parties involved in the communication. If the key fell into the hands of an attacker, they would then be able to intercept and decrypt messages, thus thwarting the attempt to attain secure communications by this method of encryption. Secret key algorithms like DES assert that, even though it is theoretically possible to derive the secret key from the encrypted message alone, the quantities of computation involved make any such attempt infeasible with current computing hardware. The Kerberos architecture is a system based on the use of secret key encryption.
Public Key Encryption
Public key systems use a pair of keys, each of which can decrypt the messages encrypted by the other. Provided one of these keys is kept secret (the private key), any communication encrypted using the corresponding public key can be considered secure, as the only person able to decrypt it holds the corresponding private key. The algorithmic properties of the encryption and decryption processes make it infeasible to derive a private key from a public key, an encrypted message, or a combination of both. RSA is an example of a public key algorithm for encryption and decryption. It can be used within a protocol framework to ensure that communication is secure and authentic.
Data Privacy through Encryption
There are two aspects to determining the level of privacy that can be attained through the Kerberos and RSA systems. To begin with, there is an analysis of the security of the two systems from an algorithmic view.
The questions raised at this stage aim to consider exactly how hard it is to derive a private or secret key from encrypted text or public keys. Currently, one of the main secret key algorithms is DES, although two other, more recent algorithms, RC2 and RC4, have also arisen. The size (i.e., length) of the keys employed is considered a useful metric of cryptographic strength, because longer keys generally make encrypted text more difficult to decrypt without the appropriate key. The DES algorithm has a fixed key length of 56 bits. Current consensus is that this key size yields keys that are strong enough to withstand attacks using current technologies. The algorithm's fixed key size may, however, constrain it in the future as hardware and theoretical advances are made. The RC2 and RC4 algorithms also have bounded maximum key sizes that similarly limit their usefulness. A major problem associated with secret key systems, however, is their need for a secure channel within which keys can be propagated. In Kerberos, every client needs to be made aware of its secret key before it can begin communication. To do so without giving away the key to any eavesdroppers requires a secure channel, and in practice, maintaining a channel that is completely secure is very difficult and often impractical. A second aspect of privacy concerns how much inferential information can be obtained through the system: for example, how much information is it possible to deduce without explicitly decrypting actual messages? One particularly disastrous situation would be if it were possible to derive the secret or private keys without mounting attacks on public keys or encrypted messages. In Kerberos, there is a danger that an attacker is able to watch a client progress through the authentication protocol. Such information may be enough to mount an attack on the client by jamming the network at strategic points in the protocol. Denial of service like this may be very serious in a time-critical system.

In pure algorithmic terms, RSA is strong. It has the ability to support much longer key lengths than DES and similar algorithms. Key length is limited only by technology, so the algorithm can keep step with advancing technology and become stronger by supporting longer keys. Unlike secret key systems, the private keys of a public key system need never be transmitted. Provided local security is strong, the overall strength of the scheme gains from the fact that the private key never leaves the client. RSA is susceptible to information leakage, however, and some recent theoretical work has outlined an attack that could infer the private key of a client based on some leaked, incidental information. Overall, however, the RSA authentication protocol is not as verbose as the Kerberos equivalent. Having fewer interaction stages limits the bandwidth of any channel through which information may escape. A verbose protocol like Kerberos's simply gives an eavesdropper more opportunity to listen, and possibly defines a larger and more identifiable pattern of interaction to listen for.

Distributed systems require the ability to communicate securely with other computers in the network. To accomplish this, most systems use key management schemes that require prior knowledge of the public keys associated with critical nodes. In large, dynamic, anonymous systems, this key sharing method is not viable. Scribe is a method for efficient key management inside a distributed system that uses Identity Based Encryption (IBE). Public resources in a network are addressable by unique identifiers; using this identifier as a public key, other entities are able to securely access that resource. Work on Scribe evaluates key distribution schemes and provides recommendations for practical implementation, allowing for secure, efficient, authenticated communication inside a distributed system.

Parallel and Distributed Databases
In parallel system architecture, two main types of multiprocessor system architecture are commonplace:
- Shared memory (tightly coupled) architecture: multiple processors share secondary (disk) storage and also share primary memory.
- Shared disk (loosely coupled) architecture: multiple processors share secondary (disk) storage, but each has its own primary memory.
These architectures enable processors to communicate without the overhead of exchanging messages over a network. Database mgt systems developed using the above types of architecture are termed parallel database mgt systems rather than DDBMSs, since they utilize parallel processor technology. Another type of multiprocessor architecture is called shared nothing architecture. In this architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network.
Benefits of a Parallel DBMS
- Interquery parallelism: it is possible to process a number of transactions in parallel with each other; this improves throughput.
- Intraquery parallelism: it is possible to process the sub-tasks of a transaction in parallel with each other; this improves response time.
How to Measure the Benefits
- Speed-up: as you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor. For example: 10 seconds to scan a DB of 10,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.
- Scale-up: as you multiply resources, the size of a task that can be executed in a given time should be increased by the same factor. For example: 1 second to scan a DB of 1,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.
Characteristics of Parallel DBMSs
CPUs will be co-located: same machine or same building (tightly coupled). The biggest problem is interference: contention for memory access and bandwidth. Shared architectures only!

The Evolution of Distributed Database Management Systems
A distributed database management system (DDBMS) governs the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed among several sites. To understand how and why the DDBMS is different from the DBMS, it is useful to briefly examine the changes in the database environment that set the stage for the development of the DDBMS. Corporations implemented centralized database mgt systems to meet their structured information needs, and such needs are well served by centralized systems. Basically, the use of a centralized database required that corporate data be stored in a single central site, usually a mainframe or midrange computer. Database mgt systems based on the relational model could provide the environment in which unstructured information needs would be met by employing ad hoc queries. End users would be given the ability to access data when needed. Social and technological changes that affected DB development and design:

- Business operations became more decentralized geographically.
- Competition increased at the global level.
- Customer demands and market needs favoured a decentralised mgt style.
- Rapid technological change created low-cost microcomputers with mainframe-like power.
- The large number of applications based on DBMSs and the need to protect investments in centralised DBMS software made the notion of data sharing attractive.

Those factors created a dynamic business envt in which companies had to respond quickly to competitive and technological pressures. Two database requirements became obvious:
- Rapid ad hoc data access became crucial in the quick-response decision-making envt.
- The decentralisation of mgt structures, based on the decentralisation of business units, made decentralised multiple-access and multiple-location databases a necessity.
However, the way those factors were addressed was strongly influenced by:
- The growing acceptance of the Internet.
- The increased focus on data analysis that led to data mining and data warehousing.
The decentralised DB is especially desirable because centralised DB mgt is subject to problems such as:
- Performance degradation due to a growing number of remote locations over greater distances.
- High costs associated with maintaining and operating large central database systems.
- Reliability problems created by dependence on a central site.
The dynamic business environment and the centralized database's shortcomings spawned a demand for applications based on data access from different sources at multiple locations. Such a multiple-source/multiple-location database envt is managed by a distributed DB mgt system (DDBMS).

DDBMS Advantages and Disadvantages
Advantages:
I. Data are located near the greatest demand site: the data in a distributed DB system are dispersed to match business requirements.
II. Faster data access: end users often work with only a locally stored subset of the company's data.
III. Faster data processing: a distributed DB system spreads out the system's workload by processing data at several sites.
IV. Growth facilitation: a new site can be added to the network without affecting the operations of other sites.
V. Improved communications: because local sites are smaller and located closer to customers, they foster better communication among departments and between customers and company staff.

VI. Reduced operating costs: development work is done more cheaply and more quickly on low-cost PCs than on mainframes.
VII. User-friendly interface: the GUI simplifies use and training for end users.
VIII. Less danger of a single-point failure.
IX. Processor independence: the end user is able to access any available copy of the data, and an end user's request is processed by any processor at the data location.
Disadvantages:
1. Complexity of mgt and control: applications must recognise data location, and they must be able to stitch together data from different sites. The DBA must have the ability to coordinate DB activities to prevent DB degradation due to data anomalies.
2. Security: the probability of security lapses increases when data are located at multiple sites.
3. Lack of standards: there are no standard communication protocols at the DB level.
4. Increased storage requirements: multiple copies of data are required at different sites, thus requiring additional disk storage space.
5. Increased training costs: these are generally higher in a distributed model than they would be in a centralised model.
Distributed Processing and Distributed Databases
In distributed processing, a DB's logical processing is shared among two or more physically independent sites that are connected through a network.

A distributed database, on the other hand, stores a logically related DB over two or more physically independent sites. In contrast, a distributed processing system uses only a single-site DB but shares the processing chores among several sites. In a distributed database system, the DB is composed of several parts known as database fragments. An example of a distributed DB envt is shown below:

The DB is divided into three database fragments (E1, E2 and E3) located at different sites. The computers are connected through a network system. The users Alan, Betty and Hernando do not need to know the name or location of each fragment in order to access the DB. As you examine and contrast figs 14.2 and 14.3, you should keep in mind that:
- Distributed processing does not require a distributed DB, but a distributed DB requires distributed processing.
- Distributed processing may be based on a single DB located on a single computer.
- Both distributed processing and distributed DBs require a network to connect all components.
Characteristics of a DDBMS
i. Application interface to interact with the end user.
ii. Validation to analyse data requests.
iii. Transformation to determine which data request components are distributed and which are local.
iv. Query optimisation to find the best access strategy.
v. Mapping to determine the data location of local and remote fragments.
vi. I/O interface to read or write data from or to permanent local storage.
vii. Formatting to prepare the data for presentation to the end user.
viii. Security to provide data privacy at both local and remote DBs.
ix. Concurrency control to manage simultaneous data access and to ensure data consistency across DB fragments in the DDBMS.
DDBMS Components
- Computer workstations (sites or nodes) that form the network system.
- Network hardware and software components that reside in each workstation.
- Communications media that carry the data from one workstation to another.
- Transaction processor (TP), the software component found in each computer that requests data. The TP receives and processes the application's data requests. The TP is also known as the application processor (AP) or the transaction manager (TM).
- Data processor (DP), the software component residing on each computer that stores and retrieves data located at the site. Also known as the data manager (DM). A data processor may even be a centralised DBMS.

Levels of Data and Process Distribution
Current DB systems can be classified on the basis of how process distribution and data distribution are supported. For example, a DBMS may store data at a single site (centralised DB) or at multiple sites (distributed DB), and it may support data processing at a single site or at multiple sites. The matrix below classifies DB systems according to data and process distribution.
Database systems: levels of data and process distribution
- Single-site process, single-site data: host DBMS (mainframe).
- Single-site process, multiple-site data: not applicable (requires multiple processes).
- Multiple-site process, single-site data: file server; client/server DBMS (LAN DBMS).
- Multiple-site process, multiple-site data: fully distributed client/server DDBMS.

Single-site Processing, Single-site Data (SPSD): all processing is done on a single CPU or host computer, and all data are stored on the host computer's local disk. Processing cannot be done on the end user's side of the system. The functions of the TP and the DP are embedded within the DBMS located on a single computer. All data storage and data processing are handled by a single CPU.
Multiple-site Processing, Single-site Data (MPSD): multiple processes run on different computers sharing a single data repository. The MPSD scenario requires a network file server running conventional applications that are accessed through a LAN.

Note that under MPSD:
- The TP on each workstation acts only as a redirector to route all network data requests to the file server.
- The end user sees the file server as just another hard disk. The end user must make a direct reference to the file server in order to access remote data.
- All record- and file-locking activity is done at the end-user location.
- All data selection, search and update functions take place at the workstation.

Multiple-site Processing, Multiple-site Data (MPMD): describes a fully distributed database management system with support for multiple data processors and transaction processors at multiple sites. Such systems are classified as either homogeneous or heterogeneous:
- Homogeneous DDBMSs integrate only one type of centralized DBMS over a network.
- Heterogeneous DDBMSs integrate different types of centralized DBMSs over a network.
- Fully heterogeneous DDBMSs support different DBMSs that may even support different data models (relational, hierarchical, or network) running under different computer systems, such as mainframes and microcomputers.

Distributed Database Transparency Features
These features have the common property of allowing the end user to feel like the DB's only user. The user believes that (s)he is working with a centralised DBMS; all complexities of the distributed DB are hidden, or transparent, to the user. The features are:
- Distribution transparency, which allows a distributed DB to be treated as a single logical DB.
- Transaction transparency, which allows a transaction to update data at several network sites.
- Failure transparency, which ensures that the system will continue to operate in the event of a node failure.
- Performance transparency, which allows the system to perform as if it were a centralised DBMS. It also ensures that the system will find the most cost-effective path to access remote data.
- Heterogeneity transparency, which allows the integration of several different local DBMSs under a common or global schema.
Distribution Transparency
Distribution transparency allows management of a physically dispersed database as though it were a centralized database. Three levels of distribution transparency are recognized (a query sketch for each level follows the list):
- Fragmentation transparency: the highest level of transparency. The end user or programmer does not need to know that the DB is partitioned.
- Location transparency: exists when the end user or programmer must specify the DB fragment names but does not need to specify where those fragments are located.
- Local mapping transparency: exists when the end user or programmer must specify both the fragment names and their locations.
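To make the three levels concrete, suppose an EMPLOYEE table is split into the fragments E1, E2 and E3 stored at different sites. The following is a hedged sketch: the table, column and node names are invented, and the NODE clause is illustrative pseudo-syntax rather than standard SQL, since the exact notation varies by DDBMS.

    -- Fragmentation transparency: the user queries one logical table
    SELECT * FROM EMPLOYEE WHERE EMP_DOB < '1970-01-01';

    -- Location transparency: fragment names must be given, but not their locations
    SELECT * FROM E1 WHERE EMP_DOB < '1970-01-01'
    UNION
    SELECT * FROM E2 WHERE EMP_DOB < '1970-01-01'
    UNION
    SELECT * FROM E3 WHERE EMP_DOB < '1970-01-01';

    -- Local mapping transparency: both fragment names and locations must be specified
    SELECT * FROM E1 NODE NY  WHERE EMP_DOB < '1970-01-01'
    UNION
    SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '1970-01-01'
    UNION
    SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '1970-01-01';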

Transaction Transparency
Transaction transparency ensures that database transactions will maintain the distributed database's integrity and consistency. It ensures that a transaction is completed only when all DB sites involved in the transaction complete their part of the transaction.

Distributed Requests and Distributed Transactions
A distributed transaction can update or request data from several different remote sites on the network. More precisely (a sketch contrasting a remote request with a distributed request follows this list):
- Remote request: lets a single SQL statement access data to be processed by a single remote database processor.
- Remote transaction: accesses data at a single remote site.
- Distributed transaction: allows a transaction to reference several different (local or remote) DP sites.
- Distributed request: lets a single SQL statement reference data located at several different local or remote DP sites. Because each request (SQL statement) can access data from more than one local or remote DP site, a transaction can access several sites.
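A minimal sketch of the difference between a remote request and a distributed request follows; the CUSTOMER and INVOICE tables are invented, and the SITE2/SITE3 qualifiers are illustrative pseudo-syntax for naming remote DP sites rather than standard SQL.

    -- Remote request: one SQL statement, processed entirely by a single remote DP
    SELECT * FROM SITE2.CUSTOMER WHERE Balance > 1000;

    -- Distributed request: one SQL statement referencing data at two different DP sites
    SELECT c.Cust_Name, i.Inv_Total
    FROM   SITE2.CUSTOMER c
    JOIN   SITE3.INVOICE  i ON c.Cust_ID = i.Cust_ID;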

Distributed Concurrency Control
Concurrency control becomes especially important in the distributed envt because multisite, multiple-process operations are much more likely to create data inconsistencies and deadlocked transactions than are single-site systems. For example, the TP component of a DDBMS must ensure that all parts of the transaction are completed at all sites before a final COMMIT is used to record the transaction.
Performance Transparency and Query Optimization
Because all data reside at a single site in a centralised DB, the DBMS must evaluate every data request and find the most efficient way to access the local data. In contrast, the DDBMS makes it possible to partition a DB into several fragments, thereby rendering query translation more complicated because the DDBMS must decide which fragment of the DB to access. The objective of a query optimization routine is to minimise the total cost associated with the execution of a request. One of the most important characteristics of query optimisation in a distributed DB system is that it must provide distribution transparency as well as replica transparency. Replica transparency refers to the DDBMS's ability to hide the existence of multiple copies of data from the user. Operation modes can be classified as manual or automatic. Automatic query optimisation means that the DDBMS finds the most cost-effective access path without user intervention. Manual query optimisation requires that the optimisation be selected and scheduled by the end user or programmer.

Query optimisation algorithms can also be classified as:
- Static query optimisation: takes place at compilation time, creating the plan necessary to access the DB. When the program is executed, the DBMS uses that plan to access the DB.
- Dynamic query optimisation: takes place at execution time; the DB access strategy is defined when the program is executed. It is efficient, but its cost is measured by run-time processing overhead. The best strategy is determined every time the query is executed, which could happen several times in the same program.

Distributed Database Design
- Data fragmentation deals with how to partition the database into fragments.
- Data replication deals with which fragments to replicate.
- Data allocation deals with where to locate those fragments and replicas.
Data Fragmentation
Data fragmentation breaks a single object into two or more segments, or fragments. Each fragment can be stored at any site over the computer network. Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the TP to process user requests. There are three types of data fragmentation strategy (illustrated in the sketch after this list):
- Horizontal fragmentation: division of a relation into subsets (fragments) of tuples (rows). Each fragment is stored at a different node, and each node has unique rows. However, the unique rows all have the same attributes (columns). In short, each fragment represents the equivalent of a SELECT statement, with the WHERE clause on a single attribute.
- Vertical fragmentation: division of a relation into attribute (column) subsets. Each subset (fragment) is stored at a different node, and each fragment has unique columns, with the exception of the key column, which is common to all fragments. This is the equivalent of the PROJECT statement.
- Mixed fragmentation: a combination of the horizontal and vertical strategies. In other words, a table may be divided into several horizontal subsets (rows), each one having a subset of the attributes (columns).
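A minimal sketch of how fragments could be derived from a single CUSTOMER table follows; the table, column names and region values are invented for illustration, and in a real DDBMS the fragmentation would be defined in the distributed data catalog rather than materialised by hand.

    -- Horizontal fragmentation: each fragment holds the rows for one region
    CREATE TABLE CUSTOMER_WEST AS
        SELECT * FROM CUSTOMER WHERE Region = 'WEST';
    CREATE TABLE CUSTOMER_EAST AS
        SELECT * FROM CUSTOMER WHERE Region = 'EAST';

    -- Vertical fragmentation: each fragment holds a subset of the columns,
    -- with the key (Cust_ID) repeated in every fragment
    CREATE TABLE CUSTOMER_CONTACT AS
        SELECT Cust_ID, Cust_Name, Phone FROM CUSTOMER;
    CREATE TABLE CUSTOMER_FINANCE AS
        SELECT Cust_ID, Credit_Limit, Balance FROM CUSTOMER;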

Data Replication
Data replication refers to the storage of data copies at multiple sites served by a computer network. Fragment copies can be stored at several sites to serve specific information requirements. Replication can enhance data availability and response time, and it can help to reduce communication and total query costs. Replicated data are subject to the mutual consistency rule, which requires that all copies of data fragments be identical. Therefore, to maintain data consistency among the replicas, the DDBMS must ensure that a DB update is performed at all sites where replicas exist.

Three replication scenarios exist; a DB can be:
- Fully replicated: stores multiple copies of each database fragment at multiple sites. This can be impractical due to the amount of overhead it imposes on the system.
- Partially replicated: stores multiple copies of some database fragments at multiple sites. Most DDBMSs are able to handle the partially replicated database well.
- Unreplicated: stores each database fragment at a single site, with no duplicate database fragments.
Several factors influence the decision to use data replication: database size, usage frequency, and costs.
Data Allocation
Data allocation describes the process of deciding where to locate data. Data allocation strategies are as follows:
- With centralized data allocation, the entire database is stored at one site.
- With partitioned data allocation, the database is divided into several disjoint parts (fragments) and stored at several sites.
- With replicated data allocation, copies of one or more database fragments are stored at several sites.
Data distribution over a computer network is achieved through data partitioning, data replication, or a combination of both. Data allocation is closely related to the way a database is divided or fragmented. Data allocation algorithms take into consideration a variety of factors, including:
- Performance and data availability goals.
- Size, number of rows, and the number of relationships that an entity maintains with other entities.
- Types of transactions to be applied to the DB, and the attributes accessed by each of those transactions.
Client/Server vs. DDBMS
Client/server architecture refers to the way in which computers interact to form a system. The architecture features a user of resources, or client, and a provider of resources, or server. The client/server architecture can be used to implement a DBMS in which the client is the TP and the server is the DP.
Client/server advantages:
- Less expensive than alternative minicomputer or mainframe solutions.
- Allows the end user to use the microcomputer's GUI, thereby improving functionality and simplicity.
- More people in the job market have PC skills than mainframe skills.
- The PC is well established in the workplace.
- Numerous data analysis and query tools exist to facilitate interaction with the DBMSs available in the PC market.
- There is a considerable cost advantage to offloading applications development from the mainframe to powerful PCs.
Client/server disadvantages:
- Creates a more complex environment in which different platforms (LANs, operating systems, etc.) are often difficult to manage.
- An increase in the number of users and processing sites often paves the way for security problems.
- The C/S envt makes it possible to spread data access to a much wider circle of users. Such an envt increases the demand for people with a broad knowledge of computers and software applications, and the burden of training increases the cost of maintaining the environment.

C. J. Date's Twelve Commandments for Distributed Databases
1. Local site independence. Each local site can act as an independent, autonomous, centralized DBMS. Each site is responsible for security, concurrency control, backup, and recovery.

2. Central site independence. No site in the network relies on a central site or on any other site. All sites have the same capabilities.
3. Failure independence. The system is not affected by node failures.
4. Location transparency. The user does not need to know the location of the data in order to retrieve those data.
5. Fragmentation transparency. The user sees only one logical DB. Data fragmentation is transparent to the user, who does not need to know the names of the DB fragments in order to retrieve them.
6. Replication transparency. The user sees only one logical DB. The DDBMS transparently selects the DB fragment to access.
7. Distributed query processing. A distributed query may be executed at several different DP sites.
8. Distributed transaction processing. A transaction may update data at several different sites. The transaction is executed transparently at several different DP sites.
9. Hardware independence. The system must run on any hardware platform.
10. Operating system independence. The system must run on any operating system platform.
11. Network independence. The system must run on any network platform.
12. Database independence. The system must support any vendor's DB product.
Two-phase commit protocol
Two-phase commit is a standard protocol in distributed transactions for achieving the ACID properties. Each transaction has a coordinator, which initiates and coordinates the transaction. In the first phase of two-phase commit, the coordinator sends a prepare message to all participants (nodes) and waits for their answers (votes); it then sends the outcome to all participating sites. Every participant waits for this answer from the coordinator before committing or aborting the transaction. If the decision is to commit, the coordinator records this in a log and sends a commit message to all participants. If for any reason a participant aborts the process, the coordinator sends a rollback message and the transaction is undone using the log file created earlier. The advantage of this is that all participants reach a decision consistently, yet independently. However, the two-phase commit protocol also has limitations, in that it is a blocking protocol. For example, participants will block resources while waiting for a message from the coordinator; if this message fails to arrive, the participant may wait and never resolve its transaction, so the resource could be blocked indefinitely. Likewise, the coordinator blocks resources while waiting for replies from participants and can also block indefinitely if no acknowledgement is received from a participant. Despite these limitations, most systems still use the two-phase commit protocol.
Three-phase commit protocol
An alternative to the two-phase commit protocol used by many database systems is the three-phase commit. Dale Skeen describes the three-phase commit as a non-blocking protocol, developed to avoid the failures that occur in two-phase commit transactions. As with the two-phase commit, the three-phase commit also has a coordinator that initiates and coordinates the transaction. However, the three-phase protocol introduces a third phase called the pre-commit. Its aim is to remove the uncertainty period for participants that have voted to commit and are waiting for the global abort or commit message from the coordinator. When receiving a pre-commit message, participants know that all others have voted to commit. If a pre-commit message has not been received, the participant will abort and release any blocked resources.
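Many relational DBMSs expose the participant side of two-phase commit at the SQL level. The sketch below follows PostgreSQL's prepared-transaction commands; other products use different statements, and the ACCOUNT table and the transaction identifier 'order_42' are invented for illustration.

    -- Phase 1 (prepare): the participant does its work and votes by preparing
    BEGIN;
    UPDATE ACCOUNT SET Balance = Balance - 100 WHERE Acc_No = 1;
    PREPARE TRANSACTION 'order_42';   -- changes are persisted but not yet committed

    -- Phase 2 (commit or abort): the coordinator later resolves the transaction
    COMMIT PREPARED 'order_42';
    -- or, if any participant voted to abort:
    -- ROLLBACK PREPARED 'order_42';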

Review Questions

1. Describe the evolution from centralized DBMS to distributed DBMS.
2. List and discuss some of the factors that influenced the evolution of the DDBMS.
3. What are the advantages and disadvantages of a DDBMS?
4. Explain the difference between a distributed DB and distributed processing.
5. What is a fully distributed DB mgt system?
6. What are the components of a DDBMS?
7. Explain the transparency features of a DDBMS.
8. Define and explain the different types of distribution transparency.
9. Explain the need for the two-phase commit protocol. Then describe the two phases.
10. What is the objective of the query optimisation function?
11. To which transparency feature are the query optimisation functions related?
12. What are the different types of query optimisation algorithms?
13. Describe the three data fragmentation strategies. Give some examples.
14. What is data replication, and what are the three replication strategies?
15. Explain the difference between file server and client/server architectures.

Data Warehouse
The Need for Data Analysis
Managers must be able to track daily transactions to evaluate how the business is performing. By tapping into the operational database, management can develop strategies to meet organizational goals. Data analysis can provide information about short-term tactical evaluations and strategies.

Given the many and varied competitive pressures, managers are always looking for a competitive advantage through product development and maintenance, service, market positioning, and sales promotion. In addition, the modern business climate requires managers to approach increasingly complex problems that involve a rapidly growing number of internal and external variables.

Different managerial levels require different decision support needs. Managers require detailed information designed to help them make decisions in a complex data and analysis environment. To support such decision making, information systems (IS) departments have created decision support systems, or DSSs.
Decision Support Systems
Decision support is a methodology (or series of methodologies) designed to extract information from data and to use such information as a basis for decision making. A decision support system (DSS) is an arrangement of computerized tools used to assist managerial decision making within a business. It usually requires extensive data massaging to produce information, is used at all levels within the organization, and is often tailored to focus on specific business areas or problems such as finance, insurance, banking, and sales. The DSS is interactive and provides ad hoc query tools to retrieve data and to display data in different formats. Keep in mind that managers must initiate the decision support process by asking the appropriate questions. The DSS exists to support the manager; it does not replace the mgt function. A DSS is composed of the following four main components:
- Data store component: basically the DSS database. The data store contains two main types of data: business data and business model data. The business data are extracted from the operational DB and from external data sources; the external data sources provide data that cannot be found within the company. The business models are generated by special algorithms that model the business to identify and enhance the understanding of business situations and problems.
- Data extraction and data filtering component: used to extract and validate data taken from the operational database and external data sources. For example, to determine the relative market share by selected product line, the DSS requires data on competitors' products. Such data can be located in external DBs provided by industry groups or by companies that market the data. This component extracts the data, filters the extracted data to select the relevant records, and packages the data in the right format to be added to the DSS data store component.
- End-user query tool: used by the data analyst to create queries that access the database. Depending on the DSS implementation, the query tool accesses either the operational DB or, more commonly, the DSS DB. The tool advises the user on which data to select and how to build a reliable business data model.
- End-user presentation tool: used to organize and present data. It also helps the end user select the most appropriate presentation format, such as a summary report or mixed graphs.
Although the DSS is used at the strategic and tactical managerial levels within the organization, its effectiveness depends on the quality of data gathered at the operational level.

Operational Data vs. Decision Support Data
Operational data are mostly stored in relational databases in which the structures (tables) tend to be highly normalized. Operational data storage is optimized to support transactions representing daily operations.

DSS data give tactical and strategic business meaning to operational data. They differ from operational data in the following three main areas:
- Timespan: operational data cover a short time frame.
- Granularity (level of aggregation): DSS data must be presented at different levels of aggregation, from highly summarized to near atomic.
- Dimensionality: operational data focus on representing individual transactions rather than on the effects of the transactions over time.

Differences between operational and DSS data:
- Data currency. Operational: current operations, real-time data. DSS: historic data, snapshot of company data, time component (week/month/year).
- Granularity. Operational: atomic, detailed data. DSS: summarized data.
- Summarization level. Operational: low; some aggregate yields. DSS: high; many aggregation levels.
- Data model. Operational: highly normalized, mostly relational DBMS. DSS: non-normalised, complex structures; some relational, but mostly multidimensional DBMS.
- Transaction type. Operational: mostly updates. DSS: mostly queries.
- Transaction volumes. Operational: high update volumes. DSS: periodic loads and summary calculations.
- Transaction speed. Operational: updates are critical. DSS: retrievals are critical.
- Query activity. Operational: low to medium. DSS: high.
- Query scope. Operational: narrow range. DSS: broad range.
- Query complexity. Operational: simple to medium. DSS: very complex.
- Data volumes. Operational: hundreds of megabytes, up to gigabytes. DSS: hundreds of gigabytes, up to terabytes.
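The contrast in granularity can be made concrete with a simple query pair: where an operational system records individual sales rows, a DSS query typically rolls them up along a time dimension. A hedged sketch against an invented SALES table:

    -- Operational view: individual transactions
    SELECT Sale_ID, Cust_ID, Sale_Date, Amount FROM SALES;

    -- DSS view: summarized by region, year and month
    SELECT Region,
           EXTRACT(YEAR  FROM Sale_Date) AS Sale_Year,
           EXTRACT(MONTH FROM Sale_Date) AS Sale_Month,
           SUM(Amount) AS Total_Sales
    FROM   SALES
    GROUP BY Region, EXTRACT(YEAR FROM Sale_Date), EXTRACT(MONTH FROM Sale_Date);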

DSS Database Requirements
A DSS DB is a specialized DBMS tailored to provide fast answers to complex queries. There are four main requirements:
- Database schema
- Data extraction and loading
- End-user analytical interface
- Database size

Database schema
- Must support complex data representations.
- Must contain aggregated and summarized data.
- Queries must be able to extract multidimensional time slices.

Data extraction and loading
- Should allow batch and scheduled data extraction.
- Should support different data sources: flat files; hierarchical, network, and relational databases; multiple vendors.

Data filtering: must allow checking for inconsistent data or data validation rules.
End-user analytical interface
The DSS DBMS must support advanced data modeling and data presentation tools. Using those tools makes it easy for data analysts to define the nature and extent of business problems. The end-user analytical interface is one of the most critical DSS DBMS components. When properly implemented, an analytical interface permits the user to navigate through the data to simplify and accelerate the decision-making process.
Database size
In 2005, Wal-Mart had 260 terabytes of data in its data warehouses. A DSS DB typically contains redundant and duplicated data to improve retrieval and simplify information generation. Therefore, the DBMS must support very large databases (VLDBs).
The Data Warehouse
The acknowledged father of the data warehouse, Bill Inmon, defines the term as an "integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support for decision making."
- Integrated: the data warehouse is a centralized, consolidated DB that integrates data derived from the entire organization and from multiple sources with diverse formats. Data integration implies that all business entities, data elements, data characteristics, and business metrics are described in the same way throughout the enterprise.
- Subject-oriented: data warehouse data are arranged and optimized to provide answers to questions coming from diverse functional areas within a company. Data warehouse data are organized and summarized by topic, such as sales or marketing.
- Time-variant: warehouse data represent the flow of data through time. The data warehouse can even contain projected data generated through statistical and other models. It is also time-variant in the sense that once data are periodically uploaded to the data warehouse, all time-dependent aggregations are recomputed.
- Non-volatile: once data enter the data warehouse, they are never removed. Because the data in the warehouse represent the company's history, the operational data, representing the near-term history, are always added to it. Data are never deleted and new data are continually added, so the data warehouse is always growing.
Comparison of data warehouse and operational database characteristics:
- Integrated. Operational database data: similar data can have different representations or meanings. Data warehouse data: provide a unified view of all data elements, with a common definition and representation for all business units.
- Subject-oriented. Operational database data: data are stored with a function or process orientation; for example, data may be stored for invoices, payments, and credit amounts. Data warehouse data: data are stored with a subject orientation that facilitates multiple views of the data and facilitates decision making; for example, sales may be recorded by product, by division, by manager, or by region.
- Time-variant. Operational database data: data are recorded as current transactions; for example, the sales data may be the sale of a product on a given date. Data warehouse data: data are recorded with a historical perspective in mind; a time dimension is added to facilitate data analysis and various time comparisons.
- Non-volatile. Operational database data: data updates are frequent and common; for example, an inventory amount changes with each sale, so the data environment is fluid. Data warehouse data: data cannot be changed; data are added only periodically from historical systems, and once the data are properly stored, no changes are allowed, so the data environment is relatively static.

In summary, a data warehouse is usually a read-only database optimized for data analysis and query processing. Creating a data warehouse requires time, money, and considerable managerial effort.
Data Warehouse Properties
- Subject-oriented: the warehouse is organized around the major subjects of an enterprise (e.g., customers, products, and sales) rather than the major application areas (e.g., customer invoicing, stock control, and order processing).
- Integrated: the data warehouse integrates corporate application-oriented data from different source systems, which often include data that are inconsistent. Such data must be made consistent to present a unified view of the data to the users.
- Time-variant: data in the warehouse are only accurate and valid at some point in time or over some time interval. Time-variance is also shown in the extended time that the data are held, the association of time with all data, and the fact that the data represent a series of historical snapshots.
- Non-volatile: data in the warehouse are not updated in real time but are refreshed from operational systems on a regular basis. New data are always added as a supplement to the database, rather than a replacement.

A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people. Some organizations choose to implement data marts not only because of the lower cost and shorter implementation time, but also because of current technological advances and the inevitable people issues that make data marts attractive. Data marts can serve as a test vehicle for companies exploring the potential benefits of data warehouses. By migrating gradually from data marts to data warehouses, a specific department's decision support needs can be addressed within a reasonable time frame, as compared to the longer time frame usually required to implement a data warehouse. The difference between a data mart and a data warehouse is only the size and scope of the problem being solved.
Twelve Rules That Define a Data Warehouse
1. The data warehouse and operational environments are separated.
2. Data warehouse data are integrated.
3. The data warehouse contains historical data over a long time horizon.
4. Data warehouse data are snapshot data captured at a given point in time.
5. Data warehouse data are subject oriented.
6. Data warehouse data are mainly read-only, with periodic batch updates from operational data; no online updates are allowed.
7. The data warehouse development life cycle differs from classical systems development: data warehouse development is data-driven, whereas the classical approach is process-driven.
8. The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data.
9. The data warehouse environment is characterized by read-only transactions to very large data sets; the operational envt is characterized by numerous update transactions to a few data entities at a time.
10. The data warehouse environment has a system that traces data sources, transformations, and storage.
11. The data warehouse's metadata are a critical component of this environment; the metadata identify and define all data elements.
12. The data warehouse contains a chargeback mechanism for resource usage that enforces optimal use of the data by end users.

Online Analytical Processing (OLAP)
OLAP tools create an advanced data analysis environment that supports decision making, business modeling, and operations research. OLAP systems share four main characteristics:
- Use multidimensional data analysis techniques.
- Provide advanced database support.
- Provide easy-to-use end-user interfaces.
- Support client/server architecture.
Multidimensional Data Analysis Techniques
The most distinctive characteristic of modern OLAP tools is their capacity for multidimensional analysis, in which data are processed and viewed as part of a multidimensional structure. This type of data analysis is particularly attractive to business decision makers because they tend to view business data as data that are related to other business data. Multidimensional data analysis techniques are augmented by the following functions:
- Advanced data presentation functions: 3-D graphics, pivot tables, and crosstabs.
- Advanced data aggregation, consolidation, and classification functions that allow the data analyst to create multiple data aggregation levels and to slice and dice data.
- Advanced computational functions: business-oriented variables, financial and accounting ratios.
- Advanced data modeling functions: support for what-if scenarios, variable assessment, variable contributions to outcome, linear programming, and other modeling tools.

Advanced Database Support
To deliver efficient decision support, OLAP tools must have advanced data access features. These include:
- Access to many different kinds of DBMSs, flat files, and internal and external data sources.
- Access to aggregated data warehouse data as well as to the detail data found in operational databases.
- Advanced data navigation features such as drill-down and roll-up.

- Rapid and consistent query response times.
- Ability to map end-user requests to the appropriate data source and then to the proper data access language (usually SQL).
- Support for very large databases.

Easy-to-Use End-User Interface: Many of the interface features are borrowed from previous generations of data analysis tools that are already familiar to end users. This familiarity makes OLAP easily accepted and readily used.

Client/Server Architecture: This provides a framework within which new systems can be designed, developed, and implemented. The client/server environment:
- Enables the OLAP system to be divided into several components that define its architecture.
- Allows OLAP to meet ease-of-use as well as system flexibility requirements.

OLAP Architecture: OLAP operational characteristics can be divided into three main modules:
- Graphical user interface (GUI)
- Analytical processing logic
- Data-processing logic

OLAP is designed to use both operational and data warehouse data. It is defined as an advanced data analysis environment that supports decision making, business modeling, and operations research activities. In most implementations, the data warehouse and OLAP are interrelated and complementary environments.

RELATIONAL OLAP (ROLAP): Provides OLAP functionality by using relational databases and familiar relational query tools to store and analyze multidimensional data. ROLAP adds the following extensions to a traditional RDBMS (a star schema sketch follows the list):
- Multidimensional data schema support within the RDBMS
- Data access language and query performance optimized for multidimensional data
- Support for very large databases (VLDBs)
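The sketch below shows, under the same placeholder JDBC URL, what a small star schema of the kind a ROLAP engine queries might look like. The table and column names (FACT_SALES, DIM_TIME, DIM_PRODUCT) are illustrative only, not taken from any particular product: the dimension tables carry the descriptive attributes, and the fact table carries the measures plus foreign keys to each dimension.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StarSchemaDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL; substitute that of your own relational DBMS.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:yourdb://localhost/dw", "user", "password");
                 Statement stmt = conn.createStatement()) {

                // Dimension tables hold the descriptive attributes used for analysis.
                stmt.executeUpdate("CREATE TABLE dim_time ("
                    + "time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER)");
                stmt.executeUpdate("CREATE TABLE dim_product ("
                    + "product_id INTEGER PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(30))");

                // The fact table holds the numeric measures and foreign keys to each dimension.
                stmt.executeUpdate("CREATE TABLE fact_sales ("
                    + "time_id INTEGER REFERENCES dim_time, "
                    + "product_id INTEGER REFERENCES dim_product, "
                    + "sales_amount DECIMAL(12,2), units_sold INTEGER)");
            }
        }
    }

Because the schema stays relational, new dimensions can be added by creating another dimension table and foreign key, which is why ROLAP can grow its dimensionality dynamically.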

Relational vs. Multidimensional OLAP

Characteristic: ROLAP vs. MOLAP
- Schema: ROLAP uses a star schema, and additional dimensions can be added dynamically; MOLAP uses data cubes, and additional dimensions require re-creation of the data cube.
- Database size: ROLAP small to medium; MOLAP medium to large.
- Architecture: ROLAP client/server, standards-based, open; MOLAP client/server, proprietary.
- Access: ROLAP supports ad hoc requests and unlimited dimensions; MOLAP is limited to predefined dimensions.
- Resources: ROLAP high; MOLAP very high.
- Flexibility: ROLAP high; MOLAP low.
- Scalability: ROLAP high; MOLAP low.
- Speed: ROLAP good with small data sets, average for medium to large data sets; MOLAP faster for small to medium data sets, average for large data sets.

Review Questions
1. What are decision support systems, and what role do they play in the business environment?
2. Explain how the main components of a DSS interact to form a system.
3. What are the most relevant differences between operational and decision support data?
4. What is a data warehouse, and what are its main characteristics?
5. Give three examples of problems likely to be encountered when operational data are integrated into the data warehouse.
While working as a database analyst for a national sales organization, you are asked to be part of its data warehouse project team.
6. Prepare a high-level summary of the main requirements for evaluating DBMS products for data warehousing.
8. Suppose you are selling the data warehouse idea to your users. How would you define multidimensional data analysis for them? How would you explain its advantages to them?
9. Before making a commitment, the data warehousing project group has invited you to provide an OLAP overview. The group's members are particularly concerned about the OLAP client/server architecture requirements and how OLAP will fit the existing environment. Your job is to explain to them the main OLAP client/server components and architectures.
11. The project group is ready to make a final decision, choosing between ROLAP and MOLAP. What should be the basis for this decision? Why?

14. What is OLAP, and what are its main characteristics?
15. Explain ROLAP, and give the reasons you would recommend its use in the relational database environment.
20. Explain some of the most important issues in data warehouse implementation.

Database System: An Introduction to OODBMS and Web DBMS

PROBLEMS WITH RDBMSs
- Poor representation of real-world entities
- Semantic overloading
- Poor support for integrity and business constraints
- Homogeneous data structure
- Limited operations
- Difficulty handling recursive queries
- Impedance mismatch
- Difficulty with long transactions

Object Oriented Database Management Systems (OODBMSs): These are an attempt at marrying the power of object-oriented programming languages with the persistence and associated technologies of a DBMS.
From OOPLs, an OODBMS takes: complex objects, object identity, methods and messages, inheritance, polymorphism, extensibility and computational completeness.
From DBMSs, it takes: persistence, disc management, data sharing, reliability, security and ad hoc querying.

THE OO DATABASE MANIFESTO: CHARACTERISTICS THAT MUST BE SUPPORTED
- Complex objects
- Object identity
- Encapsulation
- Classes
- Inheritance
- Overriding and late binding
- Extensibility
- Computational completeness
- Persistence
- Concurrency
- Recovery
- Ad hoc querying

Requirements and Features
Requirements:
- Transparently add persistence to object-oriented programming languages.
- Ability to handle complex data, e.g., multimedia data.
- Ability to handle data complexity, e.g., interrelated data items.
- Add DBMS features to object-oriented programming languages.
With this approach the host programming language is also the DML, the in-memory and storage models are merged, and no conversion code between models and languages is needed (a sketch of the resulting programming model follows).
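A minimal sketch of the programming model this implies. The ObjectDatabase class here is a hypothetical stand-in for a real OODBMS client API, implemented as an in-memory stub only so the example compiles and runs on its own; a real product would persist the stored object graph to disk.

    import java.util.ArrayList;
    import java.util.List;

    public class PersistenceSketch {

        // An ordinary domain class: with transparent persistence, no mapping code
        // is needed between the in-memory object and its stored representation.
        static class Customer {
            final String name;
            final String phone;
            Customer(String name, String phone) { this.name = name; this.phone = phone; }
        }

        // Hypothetical stand-in for an OODBMS client API. This stub only keeps
        // objects in memory so the sketch is self-contained.
        static class ObjectDatabase implements AutoCloseable {
            private final List<Object> stored = new ArrayList<>();
            static ObjectDatabase open(String fileName) { return new ObjectDatabase(); }
            void store(Object obj) { stored.add(obj); }           // persist an object as-is
            List<Customer> customersNamed(String name) {          // query in the host language
                List<Customer> out = new ArrayList<>();
                for (Object o : stored) {
                    if (o instanceof Customer && ((Customer) o).name.equals(name)) {
                        out.add((Customer) o);
                    }
                }
                return out;
            }
            @Override public void close() { /* a real OODBMS would release the connection */ }
        }

        public static void main(String[] args) {
            // The host programming language is also the DML: objects are stored and
            // queried directly, with no conversion code between models and languages.
            try (ObjectDatabase db = ObjectDatabase.open("customers.odb")) {
                db.store(new Customer("Acme Ltd", "555-0100"));
                for (Customer c : db.customersNamed("Acme Ltd")) {
                    System.out.println(c.name + " " + c.phone);
                }
            }
        }
    }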

Features:

Data Storage for Web Sites
File-based systems:
- Information in separate HTML files
- File management problems
- Information update problems
- Static, non-interactive Web pages
Database-based systems:
- Database accessed from the Web
- Dynamic information handling
- Data management and data updating through the DBMS

The Internet: interconnected networks that communicate using TCP/IP (Transmission Control Protocol/Internet Protocol).

HTTP (HyperText Transfer Protocol)

Internet Databases
Web-to-database connectivity allows new innovative services that:
- Permit rapid responses to competitive pressures by bringing new services and products to market quickly.
- Increase customer satisfaction through the creation of Web-based support services.
- Yield fast and effective information dissemination through universal access, whether from across the street or across the globe.

Characteristics and Benefits of Internet Technologies
- Hardware and software independence: savings in equipment/software acquisition; ability to run on most existing equipment; platform independence and portability; no need for multiple platform development.
- Common and simple user interface: reduced training time and cost; reduced end-user support cost; no need for multiple platform development.
- Location independence: global access through the Internet infrastructure; reduced requirements (and costs) for dedicated connections.
- Rapid development at manageable costs: availability of multiple development tools; plug-and-play development tools (open standards); more interactive development; reduced development times; relatively inexpensive tools; free client access tools (Web browsers); low entry costs and frequent availability of free Web servers; reduced costs of maintaining private networks; distributed processing and scalability using multiple servers.

Web-to-Database Middleware: Server-Side Extensions
A server-side extension is a program that interacts directly with the Web server to handle specific types of requests. It also makes it possible to retrieve and present query results, but what is more important is that it provides its services to the Web server in a way that is totally transparent to the client browser. In short, the server-side extension adds significant functionality to the Web server, and therefore to the Internet.

A database server-side extension program is also known as Web-to-database middleware.

How the Web-to-database middleware works:
1. The client browser sends a page request to the Web server.
2. The Web server receives and validates the request.
3. The Web-to-database middleware reads, validates and executes the script. In this case, it connects to the database and passes the query using the database connectivity layer.
4. The database server executes the query and passes the result back to the Web-to-database middleware.
5. The Web-to-database middleware compiles the result set, dynamically generates an HTML-formatted page that includes the data retrieved from the database, and sends it to the Web server.
6. The Web server returns the just-created HTML page, which now includes the query result, to the client browser.
7. The client browser displays the page on the local computer.
The interaction between the Web server and the Web-to-database middleware is crucial to the development of a successful Internet database implementation. Therefore, the middleware must be well integrated with the other Internet services and the components that are involved in its use. A minimal sketch of such a middleware component follows.
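The sketch below shows what such a middleware component might look like when written as a Java servlet, assuming the standard javax.servlet API, a placeholder JDBC URL, and an illustrative PRODUCT table: it handles the page request, queries the database through the connectivity layer, and dynamically generates the HTML page that the Web server returns to the browser.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ProductListServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            response.setContentType("text/html");
            PrintWriter out = response.getWriter();
            out.println("<html><body><h1>Product List</h1><ul>");

            // Connect through the database connectivity layer and run the query.
            // The JDBC URL, credentials and table name are placeholders.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:yourdb://localhost/shop", "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT product_name, price FROM product")) {

                // Dynamically generate the HTML page from the query result.
                while (rs.next()) {
                    out.println("<li>" + rs.getString("product_name")
                              + " - " + rs.getDouble("price") + "</li>");
                }
            } catch (Exception e) {
                out.println("<li>Query failed: " + e.getMessage() + "</li>");
            }

            out.println("</ul></body></html>");
        }
    }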

Web Server Interfaces: a Web server interface defines how a Web server communicates with external programs. Two well-defined Web server interfaces are the Common Gateway Interface (CGI) and the Application Programming Interface (API).

The Common Gateway Interface (CGI) uses script files that perform specific functions based on the client's parameters that are passed to the Web server. The script file is a small program containing commands written in a programming language. The script file's contents can be used to connect to the database and to retrieve data from it, using the parameters passed to the Web server. A script is a series of instructions executed in interpreter mode; the script is a plain text file that is not compiled the way COBOL, C++, or Java programs are. Scripts are normally used in Web application development environments.

An Application Programming Interface (API) is a newer Web server interface standard that is more efficient and faster than a CGI script. APIs are more efficient because they are implemented as shared code or as dynamic-link libraries (DLLs). APIs are faster than CGI because the code resides in memory and there is no need to run an external program for each request. Because APIs share the same memory space as the Web server, however, an API error can bring down the server. The other disadvantage is that APIs are specific to the Web server and to the operating system.

The Web Browser: This is the application software, e.g. Microsoft Internet Explorer or Mozilla Firefox, that lets users navigate (browse) the Web. Each time the end user clicks a hyperlink, the browser generates an HTTP GET page request that is sent to the designated Web server using the TCP/IP Internet protocol. The Web browser's job is to interpret the HTML code that it receives from the Web server and to present the different page components in a standard formatted way.

The Web as a Stateless System: "Stateless" means that at any given time, the Web server does not know the status of any of the clients communicating with it. Client and server computers interact in very short conversations that follow the request-reply model.

XML Presentation: Extensible Markup Language (XML) is a metalanguage used to represent and manipulate data elements. XML is designed to facilitate the exchange of structured documents, such as orders and invoices, over the Internet. XML provides the semantics that facilitate the exchange, sharing and manipulation of structured documents across organizational boundaries.

One of the main benefits of XML is that it separates data structure from its presentation and processing.
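As a small illustration, the sketch below parses a hypothetical invoice document with Java's standard javax.xml.parsers API. The markup describes only the data structure; any presentation (for example via an XSLT stylesheet or CSS) would be applied separately.

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class XmlInvoiceDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical invoice: the markup describes the data, not how to display it.
            String xml =
                "<invoice number=\"1001\">" +
                "  <customer>Acme Ltd</customer>" +
                "  <line product=\"Widget\" qty=\"3\" price=\"9.99\"/>" +
                "</invoice>";

            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

            // Manipulate the structured data independently of any presentation.
            String customer = doc.getElementsByTagName("customer").item(0).getTextContent();
            NodeList lines = doc.getElementsByTagName("line");
            System.out.println("Invoice for " + customer + " with "
                + lines.getLength() + " line(s).");
        }
    }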

Data Storage for Web Sites
HTTP supports multiple transactions between clients and the server, based on a request-response paradigm (a client-side sketch of this cycle follows below):
- Connection (from client to Web server)
- Request (message to Web server)
- Response (information returned, e.g., as an HTML file)
- Close (connection to Web server closed)
The Web is used as an interface platform that provides access to one or more databases, which raises the question of database connectivity. Open-architecture approaches that allow interoperability include:
- (Distributed) Component Object Model (MS DCOM/COM)
- CORBA (Common Object Request Broker Architecture)
- Java/RMI (Remote Method Invocation)
DBMS - Web architecture requirements:
- Integration of the Web with database applications
- Vendor/product-independent connectivity
- An interface independent of proprietary Web browsers
- Capability to use all the features of the DBMS
- Access to corporate data in a secure manner
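A minimal client-side sketch of the request-response cycle, using Java's standard HttpURLConnection against a placeholder URL:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RequestResponseDemo {
        public static void main(String[] args) throws Exception {
            // Connection: open a connection from the client to the Web server.
            URL url = new URL("http://www.example.com/index.html"); // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // Request: send an HTTP GET message to the Web server.
            conn.setRequestMethod("GET");

            // Response: the server returns the requested resource (e.g., an HTML page).
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }

            // Close: the connection to the Web server is released.
            conn.disconnect();
        }
    }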

Two-tier client-server architecture

- User interface / transaction logic
- Database application / data storage

DBMS - Web Architecture

Three-tier client-server architecture
- User interface
- Transaction / application logic
- DBMS

The three-tier client-server architecture maps suitably to the Web environment:
- First tier: Web browser (thin client)
- Second tier: Application server, Web server
- Third tier: DBMS server, DBMS

DBMS - Web Architecture

Three-tier client-server architecture


N-tier client-server architecture (Internet Computing Model)
- Web browser (thin client)
- Web server
- Application server
- DBMS server, DBMS

N-Tier Client-Server (Internet Computing Model)

DBMS - Web Architecture




Integrating the Web and DBMS
Integration between the Web server and the application server: Web requests received by the Web server invoke transactions on the application server, via either CGI (Common Gateway Interface) or non-CGI gateways.
CGI (Common Gateway Interface):
- Transfers information between a Web server and a CGI program.
- CGI programs (scripts) run on either the Web server or the application server; scripts can be written in VBScript or Perl.
- CGI is Web server independent and scripting language independent.

Integrating the Web and DBMS

CGI (Common Gateway Interface) Environment


Non-CGI Gateways
- Proprietary interfaces, specific to a vendor's Web server
- Netscape API (Sun Microsystems)
- ASP (Active Server Pages), for Microsoft Internet Information Server
Integration between the Application Server and the DBMS:

Applications on the application server connect to and interact with the database; connections between the application server and the databases are provided by an API (Application Programming Interface). Standard APIs:
- ODBC (Open Database Connectivity) connects application programs to the DBMS.
- JDBC (Java Database Connectivity) connects Java applications to the DBMS.
ODBC (Open Database Connectivity):
- A standard API and common interface for accessing SQL databases.
- The DBMS vendor provides a set of library functions for database access.
- Functions are invoked by application software to execute SQL statements and return rows of data as the result of a data search.
- A de facto industry standard.
ODBC is Microsoft's implementation of a superset of the SQL Access Group Call Level Interface (CLI) standard for database access. ODBC is probably the most widely supported database connectivity interface. ODBC allows any Windows application to access a relational data source using SQL via a standard application programming interface (API). Microsoft also developed two other data access interfaces: Data Access Objects (DAO) and Remote Data Objects (RDO).

DAO is an object-oriented API used to access MS Access, MS FoxPro, and dBase databases (using the Jet data engine) from Visual Basic programs. RDO is a higher-level object-oriented application interface used to access remote database servers.
The basic ODBC architecture has three main components:
- A high-level ODBC API through which application programs access ODBC functionality.
- A driver manager that is in charge of managing all database connections.
- An ODBC driver that communicates directly with the DBMS.
Defining a data source is the first step in using ODBC. To define a data source, you must create a data source name (DSN) for the data source. To create a DSN you need to provide:
- An ODBC driver
- A DSN name
- ODBC driver parameters

JDBC (Java Database Connectivity)
- Modelled after ODBC, a standard API that provides access to DBMSs from Java application programs.
- Machine-independent architecture.
- Direct mapping of RDBMS tables to Java classes.
- SQL statements used as string variables in Java methods (embedded SQL).
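A minimal JDBC sketch, with a placeholder JDBC URL and an illustrative STUDENT table: the SQL statement is carried as a string variable in the Java method, and the result rows are read back into Java variables.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JdbcDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL and credentials; each DBMS vendor supplies its own driver.
            String url = "jdbc:yourdb://localhost/uel";

            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
                // The SQL statement is an ordinary string variable in the Java method.
                String sql = "SELECT student_id, student_name FROM student WHERE course = ?";

                try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                    stmt.setString(1, "Databases");                 // bind the query parameter
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {                          // map rows to Java variables
                            int id = rs.getInt("student_id");
                            String name = rs.getString("student_name");
                            System.out.println(id + " " + name);
                        }
                    }
                }
            }
        }
    }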

Review Questions
1. What is the difference between DAO and RDO?
2. What are the three basic components of the ODBC architecture?
3. What steps are required to create an ODBC data source name?
4. What are Web server interfaces used for? Give some examples.
5. What does this statement mean: "The Web is a stateless system"? What implications does a stateless system have for database application developers?
6. What is a Web application server, and how does it work from a database perspective?
7. What are scripts, and what is their function? (Think in terms of database application development.)
8. What is XML, and why is it important?