Sie sind auf Seite 1von 112

Introduction to Data Warehouse

(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

Introduction to Data Warehousing and Data Introduction to Data Warehousing and Data Mining Mining

1) Data Warehouse Introduction 2) Engineering Conflicts 3) OLTP and DSS 4) Stovepipe vs. Integration 5) Data Warehouse Solution 6) Enterprise Information System 7) Security in a Data Warehouse 8) Moving Data to a Data Warehouse 9) Data Marts 10) Data Mining
2

Introduction Introduction
Key topics for this course include:
Data Warehouse Data Mart Data Mining

Background and review of relational database systems Main focus on data warehouse and data mining

Data Warehouse Introduction Data Warehouse Introduction


A data warehouse is a single source for key, corporate information needed to enable business decisions A database application is a piece of software that provides a user interface for users to add, delete, query and update data Typically, a database management system is used to actually do the work of adding, deleting, querying or updating data
Application
Database System Data

Engineering Conflicts, Query and Update Engineering Conflicts, Query and Update
It is often an engineering problem when data is updated and long-running queries occur at the same time In some cases, the users who are doing updates must wait for queries to complete One way to avoid this is to make a read-only copy of data
Database System Application
Data for update Data for query

OLTP and DSS Defined OLTP and DSS Defined


An application that updates is called an on-line transaction processing (OLTP) application An application that issues queries to the readonly database is called a decision support system (DSS)

OLTP Application Database System OLTP Data DSS Data

DSS Application

Applications in a Typical Enterprise Applications in a Typical Enterprise


Most organizations have several disparate OLTP/DSS applications in several databases
Finance OLTP Application Inventory OLTP Application Sales OLTP Application

Finance DSS Application

Inventory DSS Application


DATABASE SYSTEM

Sales DSS Application

Finance OLTP Data Finance DSS Data

Inventory OLTP Data Inventory DSS Data

Sales OLTP Data Sales DSS Data

Stovepipe vs Integration Stovepipe vs Integration


When systems stand by themselves they are often referred to as stovepipes Systems that easily share data are called well integrated systems
Finance OLTP Application Inventory OLTP Application

Finance DSS Application

Inventory DSS Application


8

Problems with Stovepipe Architecture Problems with Stovepipe Architecture


Problems:
Users who wish to access data must query several different DSS to find it Data may have fundamental conflicts between DSS a department code table in one DSS may differ in another DSS a measurement may be stored in meters in one DSS and yards in another

Solution:
Use a data warehouse, where data is integrated from the several different stovepipe systems Data warehouse is really sharing-lite -- you dont have to co-ordinate as much when applications are built and you still reap the benefits of data sharing

Data Warehouse Solution Data Warehouse Solution


A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions A data warehouse has the key, corporate data in the organization A data warehouse tracks historical data

10

Data Warehouse -- A Success Story Data Warehouse A Success Story


Largest data warehouse is Wal-Mart (9 TB) Uses for Wal-Mart data warehouse
Identifies where a new store should be built based on customer demand Identifies how stores are performing across the nation Contains every scan from every purchase

Benefits Wal-Mart gained from their data warehouse


Provided competitive advantage over K-Mart Reduced excess inventory in individual stores Avoided wasted funds in building stores which would fail

11

Selling the Data Warehouse Selling the Data Warehouse


A data warehouse project will fail without corporate sponsorship
Preferably, the project should be sponsored by the CEO The CEO must be sold on the value to the business to improve competitive advantage by deploying a data warehouse

If an active, corporate sponsor does not exist, data sources will be very difficult to identify Only add data to the warehouse that will answer key, corporate questions asked by the corporate sponsor. Otherwise, you will have a data dump
12

Building a Useful Data Warehouse Building a Useful Data Warehouse


You really need:
strong executive sponsorship good knowledge of the data sound software engineering stability from source systems users who want a success

A 75 percent failure rate is often cited It is WORTH the effort!!!

13

Enterprise Information System Enterprise Information System


An EIS (Enterprise Information System) allows users to query data in a data warehouse Users can access key, corporate data in the data warehouse
Enterprise Information System

Data Warehouse
14

Users of an Enterprise Information System Users of an Enterprise Information System


Frequently, multiple EIS are needed to satisfy different types of users
Some users only want a system that has pre-defined reports so they only need to click one button to see data they need. These users want the system to be no harder to use than a coffee pot Other users want to delve into the data and build their own queries

Executives want a high-level, summary data and a simple tool


Must be VERY easy to use, users want to click a few buttons and get data they want Results must be graphs Users should be able to drill-down into key areas.

15

Users of an Enterprise Information System Users of an Enterprise Information System


Analysts want a flexible, more detailed tool
Often very knowledgeable about the data Willing to do more work to learn about the data Sometimes even learn SQL to issue their own ad-hoc queries

General users want a tool that provides detailed data, but is very easy to use
Want access to the data warehouse to do routine tasks such as Find me Hanks phone number, etc. Simple application, but not so focused on large reports

16

Data Warehouse // EIS Data Warehouse EIS


Finance OLTP Application Inventory OLTP Application
Inventory OLTP Data

Sales OLTP Application

Finance OLTP Data

Enterprise Information System

S a lle s Sa es OLTP OLTP Data Data

Data Warehouse
Finance Subject Area Inventory Subject Area Sales Subject Area

17

Need for Data Warehouses Need for Data Warehouses


Data warehouses provide a single place to store key corporate data
The idea is that users can go one place to find this key data using an enterprise information system (EIS)

Data warehouse is also a place to store and access historical data


Users measure performance goals for their company over a period of time Company statistics are available Data not stored in the same place is difficult to locate and compare, easily lost Single query can be used to access key data

18

Security in Data Warehouse Security in Data Warehouse


Building a data warehouse does increase security risk because key, corporate information is all in one place To mitigate that risk, database system components can be used to protect the data warehouse. These include
Views Access control Security Administration Encryption Audit

19

Moving Data into the Data Warehouse Moving Data into the Data Warehouse
Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing Updates to the data warehouse are performed periodically
weekly nightly monthly

Occasionally, real-time data is needed in a data warehouse, but this is not very common
20

10

Using Middleware to Move Data Using Middleware to Move Data


Data can be moved to the warehouse via data migration software This is often called middleware because it sits between the source OLTP and the data warehouse

Source OLTP System

Data Warehouse Migration Software Middleware

Data Warehouse

21

Need for a Data Mart Need for a Data Mart


A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data
Sometimes, users only need a piece of data from the data warehouse

The data mart is typically fed from the data warehouse


Data Warehouse
Inventory Subject Area Finance Subject Area Sales Subject Area

New York Data Mart

California Data Mart

22

11

Data Mart in Action Data Mart in Action


Finance OLTP Application Inventory OLTP Application
Inventory OLTP Data

Sales OLTP Application

Finance OLTP Data

Enterprise Information System

S a lle s Sa es OLTP OLTP Data Data

Data Warehouse
Finance Subject Area Inventory Subject Area Sales Subject Area California Data Mart

New York Data Mart

23

Data Mining Introduction Data Mining Introduction


Data Mining is done by running software that examines a database and looks for patterns in the data A data warehouse by itself will respond to queries from users
It will not tell users about patterns in data that users may not have thought about To find patterns in data, data mining is used to try and mine key information from a data warehouse

24

12

Advantages of Data Mining Advantages of Data Mining


Data mining allows companies to collect information and make them more productive and beat their competition Data mining helps identify
why customers buy certain products ideas for very direct marketing ideas for shelf placement training of employees vs. employee retention employee benefits vs. employee retention

25

Implementing Data Mining Implementing Data Mining


Apply data mining tools to run data mining algorithms against data There are two approaches:
Copy data from the Data Warehouse and mine it Mine the data in the Data Warehouse

Popular tools use a variety of different data mining algorithms:


association rules genetic algorithms decision trees neural networks

26

13

Data Mining using Separate Data Data Mining using Separate Data
You can move data from the data warehouse to data mining tools
Advantages Data mining tools may organize data so they can run faster Disadvantages Could be very expensive to move large amounts of data
Data Warehouse

Data Mining Tool


Copy of data made by the Data Mining Tool
27

Data Mining Against the Data Warehouse Data Mining Against the Data Warehouse
Data mining tools can access data directly in the Data Warehouse
Advantages No copy of data is needed for data mining Disadvantages Data may not be organized in a way that is efficient for the tool
Data Warehouse

Data Mining Tool

28

14

Data Mining: Summary Data Mining: Summary


Data mining attempts to find patterns in data that we did not know about Often data mining is just a new buzzword for statistics Data mining differs from statistics in that large volumes of data are used Many different data mining algorithms exist and we will discuss them in the course Examples
identify users who are most likely to commit credit card fraud identify what attributes about a person most results in them buying product x.
29

SQL Review
(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

30

15

Introduction to SQL Introduction to SQL

1) Introduction to SQL 2) Data Definition Language (DDL) 3) Data Manipulation Language (DML) 4) SELECT Construct 5) SELECT Operators 6) Wildcard Searches 7) Aggregate Operators 8) Calculated Attributes 9) Sorting Results
31

Introduction to Structured Query Language Introduction to Structured Query Language


Structured Query Language (SQL) is the language used to communicate with a relational database
Industry standard Based on set theory

SQL composed of two types of constructs:


Data Definition Language (DDL) Defines the structure of the database Data Manipulation Language (DML) Provides the constructs to input and retrieve data

32

16

SQL Overview -- DDL SQL Overview DDL


Data Definition Language (DDL) is used to describe the structure of the database
Create tables, indexes, etc. Typical Operations are: CREATE TABLE defines what columns are in the table and establishes the table CREATE INDEX defines an index for the table. Indexes are used to improve database performance

33

SQL Overview -- DML SQL Overview DML


Data Manipulation Language (DML) is used for storing, updating, and retrieving data. Typical operations include:
SELECT is used to retrieve data. Ex: SELECT * FROM PRODUCTS INSERT is used to add new rows to the database. INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares') UPDATE is used to change rows that already exist in the database. UPDATE PRODUCTS SET PRICE = PRICE + 4 DELETE is used to eliminate rows of data from the database. DELETE FROM PRODUCTS

34

17

SELECT Overview SELECT Overview


SELECT is used to retrieve records from the database. Single table SELECT constructs:
WHERE IN BETWEEN LIKE

Aggregate Operators DISTINCT ORDER BY

35

SELECT Examples SELECT Examples


Query Purpose: Retrieve names and prices of all products SELECT ProductName, Price FROM TinyProducts Query Purpose: Retrieve all information for all employees from the TinyProducts table SELECT * FROM TinyProducts

36

18

SELECT with WHERE SELECT with WHERE


The WHERE clause is used to filter which information is returned from a SELECT Query Purpose: Retrieve all information only for product type of food SELECT * FROM TinyProducts WHERE ProductType = Food

37

Use of Boolean Operators Use of Boolean Operators


Conditions can be separated by Boolean operators:
AND, OR, NOT

Query Purpose: List all information about food products that are either cereal or fruit SELECT * FROM TinyProducts WHERE (ProductName = 'Cereal') OR (ProductName = 'Fruit')

38

19

Boolean Operator Example Boolean Operator Example


Query Purpose: List the names of all products that the type is fruit and the price is less than $2.00 SELECT ProductType, ProductName FROM TinyProducts WHERE Price < 2 AND ProductName = 'Fruit'

39

IN Operator IN Operator
The IN operator allows a search for records that match one value in a set of unordered values Example questions to use IN:
'Find all products whose type is Food, Hardware, or Housewares' 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'

40

20

IN Example IN Example
Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes SELECT ProductName, ProductType FROM TinyProducts WHERE ProductName in ('Cookware', 'Linens', 'Dishes') instead of:
SELECT ProductName, ProductType FROM TinyProducts WHERE (ProductName = Cookware') OR (ProductName = 'Linens') OR (ProductName = 'Dishes')
41

BETWEEN Operator BETWEEN Operator


The BETWEEN operator allows a search for a range of values Example Queries:
'Find all fruit between Bananas and Grapes' 'Find all cereals whose price is between $1.50 and $4.00 a box

1.50

4.00

42

21

BETWEEN Example BETWEEN Example


Query Purpose: Find all products whose price is between $2.00 and $8.00 SELECT ProductName, Price FROM TinyProducts WHERE Price BETWEEN 2.00 AND 8.00 instead of: SELECT ProductName, Hardware FROM TinyProducts WHERE (Price >= 2.00) OR (Price <= 8.00)
43

Wildcard Searches of Strings Wildcard Searches of Strings


The LIKE operator is used to search parts of a string The following wildcard characters are used:
% to match any zero or more characters _ to match exactly one character

44

22

Wildcard Search Examples Wildcard Search Examples


Query Purpose: List all products whose name starts with an C' SELECT * FROM TinyProducts WHERE ProductName LIKE 'C%' Query Purpose: List all products that have a SKU number with the last 2 characters of 23' when you don't know the first character SELECT * FROM TinyProducts WHERE SKUNumber LIKE '_23'
45

Aggregate Operators Aggregate Operators


MIN, MAX, and AVERAGE are used when computing statistics on a range of data Query Examples:
'What is the highest batting average on the team?' 'What is the average number of hits for all the little league teams in the National League?' 'What are the names of the players that had the lowest average on the little league team?'

46

23

Aggregate Operators Example Aggregate Operators Example


Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League SELECT MIN(Average), MAX(Average), AVG(Average) FROM PLAYERS WHERE League = 'National'

47

SUM and COUNT Operators SUM and COUNT Operators


Use the SUM operator to total the results of a query COUNT will count the total number of occurrences of an item in a search

1+2+3+4
48

24

SUM And COUNT Examples SUM And COUNT Examples


Query Purpose: Find the total number of homeruns hit by all players in the American League? SELECT SUM(HomeRuns) FROM PLAYERS WHERE League='American' Query Purpose: List the names of players that have hit 3 home runs in the National League? SELECT COUNT(*) FROM PLAYERS WHERE HomeRuns = '3' AND League = 'National'
49

Calculated Attributes Calculated Attributes


A new attribute can be obtained by using arithmetic operators (+,-, *, /) on other numeric attributes All operators follow standard precedence:
Multiplication and division are computed first left to right Addition and subtraction are computed last left to right Use parenthesis to override the standard precedence

(+,-, *, /)
50

25

Calculated Attributes Example Calculated Attributes Example


Query Purpose: List all players with their hits, at bats, and their batting average SELECT Name, Hits, AtBats, (Hits / AtBats) FROM PLAYERS

51

DISTINCT Operator DISTINCT Operator


DISTINCT is used to exclude duplicate occurrences in the result of a query Query Purpose: List all distinct batting averages SELECT DISTINCT(Average) FROM PLAYERS

52

26

Sorting Query Results Sorting Query Results


The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order

53

Sorting Example Sorting Example


Query Purpose: List all players in ascending order of their batting average SELECT Name, Average FROM PLAYERS ORDER BY Average For descending order add the keyword DESC SELECT Name, Average FROM PLAYERS ORDER BY Name DESC
54

27

Sorting Calculated Attributes Sorting Calculated Attributes


To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query) SELECT Name, Hits, AtBats, Hits / AtBats FROM PLAYERS ORDER BY 3 DESC
55

More SQL More SQL

1) GROUP BY Construct 2) HAVING Filter 3) Multiple Tables 4) Joins 5) Equijoins 6) Cartesian Product 7) Nulls 8) OUTER JOIN
56

28

GROUP BY Clause GROUP BY Clause


GROUP BY will partition a table into multiple groups of related rows. As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets: Engineering Marketing Finance Customer

57

GROUP BY Example GROUP BY Example


Query Purpose: For each department, list the average salary using the EMPLOYEE table SELECT Department, AVG(Salary) FROM EMPLOYEE GROUP BY Department

58

29

GROUP BY With WHERE GROUP BYWith WHERE GROUP BY GROUP BY With WHERE WHERE
To filter data further, we can use the WHERE clause with GROUP BY clause Query Purpose: For each department, list the highest salary of their administrative assistants. SELECT Department, MAX(Salary) FROM EMPLOYEE WHERE Title='administrative assistant' GROUP BY Department

59

HAVING Construct HAVING Construct


HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition. Query Purpose: List the average salary for all departments that have more than three employees. SELECT Department, AVG(Salary) FROM EMPLOYEE GROUP BY Department HAVING COUNT(*) > 3

60

30

Multi-Table SQL Multi-Table SQL


It is often necessary to combine data into multiple tables. EMPLOYEE EmpID Name Salary 1 2 3 4 Fred 200 ATTENDS EmpID Name 1 2 2 3 3 3 Harvard GMU Yale MIT Stanford GMU
61

Ethel 300 Mike 400

David 100

Joins Joins
Joins are the means by which multiple tables can be combined. A join allows us to combine data from different tables. A join operation is done through the SELECT construct. Types of Joins: Equijoin, Outer Join, Inner Join

62

31

Equijoin Equijoin
Joins only those rows where a foreign key matches the primary key Allows information from multiple tables to be linked together in a single query Can be used to link as many tables as needed in a single query

63

Equijoin Query Example Equijoin Query Example


Query Purpose: List the names of all colleges attended by Ethel SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel'

64

32

Equijoin Example Equijoin Example


EMPLOYEE
EmpID 1 2 3 Name Fred Ethel Mike Salary 200 300 400

ATTENDS
EmpID 1 2 2 3 3 3 College Harvard GMU Nova Yale Nova GMU GPA 2.45 3.79 3.65 2.85 2.65 4.0

65

Warning about Joining Tables Warning about Joining Tables


A join is really just a subset of a cartesian product. When no fields are 'joined' in the WHERE clause, a cartesian product is produced
Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.

Sample Query: SELECT b.Name FROM EMPLOYEE a, ATTENDS b WHERE a.Name = 'Ethel'

66

33

Cartesian Product Cartesian Product


Each row in one table with every other row in other table a.EmpID a.Name a.Salary 2 2 2 2 Ethel Ethel Ethel Ethel .... 300 300 300 300 b.EmpID 1 2 3 4 b.GPA 3.4 2.8 3.7 3.5

67

Nulls Nulls
An attribute may be defined as null. This indicates that the value is unknown and avoids the need for user-defined special indicators. To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLE statement when setting up the database.

68

34

Nulls Examples Nulls Examples


Statement Purpose: Add an employee whose salary is unknown INSERT INTO EMPLOYEE (3,'Hank', NULL) Query Purpose: Find all employees whose salary is unknown (or null) SELECT * FROM EMPLOYEE WHERE Salary IS NULL

69

OUTER JOIN OUTER JOIN


An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables. A LEFT OUTER JOIN returns all rows from the 'left' table. Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.

70

35

LEFT OUTER JOIN Example LEFT OUTER JOIN Example


Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges SELECT a.Name, b.GPA FROM EMPLOYEE a LEFT OUTER JOIN ATTENDS b on a.EmpID = b.EmpID

71

LEFT OUTER JOIN Example LEFT OUTER JOIN Example


Result of the outer join
All employees are listed. For an equijoin, only those who attended a college would be listed Here, employee number 4 did not attend college, but is still retrieved by the outer join.

Name

GPA

---------- ----Fred 2.45 Ethel 3.79 Ethel 3.65 Mike 2.85 Mike 2.65 Mike 4.00 David NULL

72

36

Advanced SQL
(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

73

Advanced SQL Advanced SQL

1) Finding the nth element in a list 2) Finding the median 3) Correlated subquery 4) Data Definition Language Constructs

74

37

Find the Nth Element Find the Nth Element


It is very common to try to find the nth element in a list.
Examples: Who makes the second highest salary in marketing department? What is the fifth best product in sales? This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.

We show another way of doing this using standard SQL.

75

Find the Nth Element: Example Table Find the Nth Element: Example Table
Consider a table, called TEST, with just one column, x, with the following values: X 4 5 8

76

38

Find the Nth Element: Step 1 Find the Nth Element: Step 1
First join TEST with itself, this yields each element matched with every other element: 4 4 4 5 5 5 8 8 8 4 5 8 4 5 8 4 5 8
77

Find the Nth Element: Step 2 Find the Nth Element: Step 2
Next keep only those rows where the first column is greater than or equal the second column. 4 4 4 5 5 5 8 8 8 4 5 8 4 5 8 4 5 8 4 5 5 8 8 8 4 4 5 4 5 8

Notice the pattern that just developed, each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, 8 has three matches.
78

39

Find the Nth Element: Step 3 Find the Nth Element: Step 3
Now group by the column on the left and identify the size of each group. The same ideas can be applied to any SELECT statement output. 4 5 5 8 8 8 4 4 5 4 5 8 4 5 8 1 2 3

79

Finding the Nth Element: Example Finding the Nth Element: Example
Query Purpose: Find the information about the product with the second highest price.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts b WHERE a.Price >= b.Price GROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber HAVING COUNT(*) = (SELECT COUNT(*)-1 FROM TinyProducts)

80

40

Finding the Top N Elements: Example Finding the Top N Elements: Example
To ask for the top n values instead of the nth value, specify a range (>=) instead of just an equality (=) in the HAVING. Query Purpose: Find information about the products with the two highest prices.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts b WHERE a.Price >= b.Price GROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber HAVING COUNT(*) >= (SELECT COUNT(*)-1 FROM TinyProducts) ORDER BY a.Price

81

Finding the Median Finding the Median


The median is defined as the element in the middle of the list. Query Purpose: Find the median price in TinyProducts.
SELECT FROM WHERE GROUP HAVING a.ProductName, a.ProductType, a.Price, a.SKUNumber TinyProducts a, TinyProducts b a.Price >= b.Price BY a.ProductName,a.ProductType, a.Price, a.SKUNumber COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)

82

41

Using Subqueries Using Subqueries


A subquery may be used in the middle of a query. Query Purpose: Find the information about the highest priced product, using a simple subquery.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)

83

Correlated Subquery Correlated Subquery


If the subquery references a data element from outside of the subquery, it is called a correlated subquery.
For each row in the outer part of the query, the correlated subquery is executed.
The following query will indicate who makes more money than Ethel

SELECT a.Name, a.Salary FROM Employee a WHERE EXISTS (SELECT FROM WHERE AND b.Salary Employee b a.Salary > b.Salary b.Name = 'Ethel')

84

42

Other Data Manipulation Other Data Manipulation

INSERT
Add rows to a single table

UPDATE
Modify rows in a single table

DELETE
Remove rows from a single table

85

INSERT Examples INSERT Examples


Statement Purpose: Add a record for employee #1, Fred' with a salary of 200 to the EMPLOYEE table INSERT INTO Employee VALUES (1, Fred', 200) Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE INSERT INTO New_Employee SELECT * FROM Employee

86

43

UPDATE Example UPDATE Example


Statement Purpose: Modify Freds salary to 150 UPDATE Employee SET Salary = 150.00 WHERE Name = 'Fred' Statement Purpose: Give all employees a ten percent raise UPDATE Employee SET Salary = Salary * 1.10

87

DELETE Examples DELETE Examples


Statement Purpose: Remove all employees who have a salary higher than 100. DELETE FROM Employee WHERE Salary > 100 To remove all employees: DELETE FROM Employee

88

44

CREATE TABLE Example CREATE TABLE Example


Statement Purpose: Create a table to store employee information CREATE TABLE EMPLOYEE (EmpId SMALLINT, Name CHAR(10), Salary DECIMAL(5,2)) To drop the EMPLOYEE table DROP TABLE EMPLOYEE

89

Data Warehouse Security


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

90

45

Data Warehouse Security Data Warehouse Security

1) Key Security Services 2) Views 3) Access Control 4) Roles 5) Encryption 6) Audit Trails 7) Security Holes 8) Intrusion Detection 9) Misuse Detection
91

Introduction Introduction
A key feature provided by database systems is good security services.
In a database system with good security, applications do not have to worry about problems that arise with security violations.

A data warehouse also requires good security services because it holds key, corporate data.
Database System

EIS

Security Services

92

46

Key Security Services Key Security Services


Access Control
Controls who accesses what data

Administration of Access Control


Used to give access to users as well as track who has various accesses and what kind of accesses are given to a user or group of users Audit tracks the usage of the data warehouse

93

Security in a Data Warehouse Security in a Data Warehouse


A data warehouse consolidates organizations key data in one place.
A data warehouse increases the security risk that unauthorized users will try to obtain this data

Security aspects of EIS applications must be designed and implemented very thoroughly. Access control and audits are two of the critical components of security.

94

47

Data Warehouse Security Components Data Warehouse Security Components


Database system components that can be used to protect a data warehouse include:
Views Allow users to only see certain rows or columns of data Access control Indicate which users have access to what data Administration This component is used to actually give access to groups of users and to define the accesses given to either an individual or a group. Encryption Protect data from access outside of the DBMS Audit Track what users are doing

95

Views in Data Warehouse Views in Data Warehouse


A view is a logical view into one or more tables. Users may be given access to the view without access to the base table. Views provide some security assistance because they can hide data from users.
EMPLOYEE
Name Hank Esther Tom Sue Dave Pete Kathy Address 1 South Street 2 North Street 34 Main Street 45 Easy Street 56 5th Avenue 7 Broadway 89 Western Avenue Salary $50,000 $80,000 $90,000 $28,500 $35,000 $60,000 $85,000

96

48

View Example View Example


A view called SAFE_EMPLOYEE may be created as:
CREATE VIEW SAFE_EMPLOYEE AS (SELECT name, address FROM EMPLOYEE)

Now users of the view SAFE_EMPLOYEE will not even know that salary exists.
SAFE_EMPLOYEE
Name Hank Esther Tom Sue Dave Pete Kathy Address 1 South Street 2 North Street 34 Main Street 45 Easy Street 56 5th Avenue 7 Broadway 89 Western Avenue Salary

VIEW (SAFE_EMPLOYEE) Salary is effectively hidden

97

Updating Views Updating Views


Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.
Example : INSERT INTO SAFE_EMPLOYEE VALUES (Hank, 300) This will insert a NULL into the SALARY column of the base table EMPLOYEE.

Other restrictions to view updates exist:


Cannot update a view that is defined with an aggregate Cannot update a view that is defined with a GROUP BY

98

49

Data Warehouse Access Control Data Warehouse Access Control


Access control is implemented in a data warehouse with the SQL Grant and Revoke commands. Syntax
GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON <object-name> TO <user name> Example: GRANT SELECT ON EMPLOYEE TO MARY

Access control is done by DBAs and creators of tables. To remove access the REVOKE command is used.
Example: REVOKE SELECT ON EMPLOYEE FROM MARY
99

Database Roles Database Roles


Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users.
As an example, some roles for a company might be: Administrative assistant Loan officer Salesperson

Accesses may be assigned based on roles.


This dramatically simplifies administration. If new tables are created, it is not necessary to add thousands of new accesses. Examples:
CREATE ROLE loan_officer AS (Hank, John, Mike) GRANT SELECT ON LOAN TO LOAN_OFFICER

100

50

Example of Application-based Roles Example of Application-based Roles


Consider:
Users Applications Database System Data

If the database system controls accesses than it does not matter what the application does, accesses are controlled consistently (same for SALES as MARKETING) However, more fine-grained access control can be granted in the application.

101

Application Roles Application Roles


The application can restrict:
Data entry screens Reports

Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.

102

51

Role Based Security in a Data Warehouse Role Based Security in a Data Warehouse
Both application and database level security are useful in a data warehouse. Database level security is needed so that users are only allowed to see data they need to see. Application level security can be used to control access to certain menus so that users do not even know what reports exist.

103

Encryption Encryption
Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data.
Example: A message sell 500 shares would appear as xyzzy without the key. Once the key is paired with the encrypted string xyzzy, it can then be decrypted. The size of the key is a factor in how difficult it is to attack the encryption scheme.

Three places where encryption might be used in a data warehouse:


Network Data Tape backups

104

52

Network Encryption Network Encryption


In a data warehouse application, data and queries are transmitted through a network.
Attackers might be able to steal network traffic just by breaking into the network medium.

One way to reduce the risk of this threat is to encrypt traffic on the network.
User

Network

Data Warehouse

Application

Database System
Tape Backup

105

Network Encryption Network Encryption


Network encryption is critical because the network connects all of the key components in a data warehouse. Encrypting network traffic mitigates the risk that an attacker could succeed with the man in the middle attack. Without this, it may be possible for the man in the middle to masquerade as another user and circumvent existing application and database security.

106

53

Data Encryption Data Encryption


Data encryption refers to encrypting the actual data in the data warehouse. If the attackers were to retrieve data from the warehouse, they would have to decrypt it in order to read it.

EIS

Database System

Data Warehouse

107

Backup Encryption Backup Encryption


Periodically, databases are copied to some kind of long-term storage (usually tapes). If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.
EIS Database System Data Warehouse

Tape Backup

108

54

Audit Trails Audit Trails


Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.
Audit trails are turned on when the DBMS is started and all activity that uses the data warehouse is tracked in the audit trail.

If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by users.

109

Details of DW Audit Trails Details of DW Audit Trails


An audit trail of a database system typically includes the following information:
User ID, Date, Time, Object that has been accessed (table or view), Action that accessed the object (INSERT, UPDATE, DELETE, SELECT) For UPDATE, the old value and new value is tracked.

For data warehouses, the SELECT is often used to track the queries that have been run against the warehouse.

110

55

Other Uses for DW Audit Trails Other Uses for DW Audit Trails
Audit trails can be used to identify the most popular data in the warehouse.
This information can be used to optimize queries

An additional use for audit trails is performance tuning of the data warehouse.
Administrators know where to focus their efforts Reduces administrative overhead

111

Dealing with Known Security Holes Dealing with Known Security Holes
Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.
To reduce the risk of these known holes, vendors often provide fixes to their products as soon as these holes become public.

It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released. One of the key risks surrounding a data warehouse is that privileged users have the keys to the kingdom.
112

56

The Risk of Privileged Users The Risk of Privileged Users


"Privileged users" include:
Data warehouse administrators Operating system programmers Operators in the computer center These users can: Modify, delete and query any data in the warehouse Modify the audit trail to mask their actions Give other users unauthorized access

Numbers of "privileged users" could be anywhere from 20 to 30 in some organizations.

113

Reducing the Risk of Privileged Users Reducing the Risk of Privileged Users
One way to reduce the risk of privileged users is to separate security administration from database administration.
This would separate the task of giving accesses and managing the audit trail from the task of making sure the data in the warehouse was correct and properly optimized.

Security Services
Access Control Audit

Security Services
Access Control Audit

Database Services
Database Tuning Query Optimization Backups

Database Services
Database Tuning Query Optimization Backups

114

57

Information Security Attacks Information Security Attacks


Two types of Information security attacks on data warehouses are:
Intrusion An intrusion occurs when an unauthorized user gains access to the data warehouse. Misuse Misuse, often referred to as the insider problem occurs when a user who has access to the warehouse uses that access for an unauthorized purpose

Audit Trails can be used to identify either type of attack, but identification of misuse is typically MUCH harder to do than intrusion.
115

Intrusion Detection Intrusion Detection


An intrusion is defined as an unauthorized access to a system. The assumption is the user is external to the environment (e.g.; a hacker). To reduce the risk of intrusion, intrusion detection tools are used.
These tools monitor access to the data warehouse and sound an alarm if unauthorized accesses are detected.
INTRUSION DETECTION SYSTEM

USER

DATA WAREHOUSE

116

58

Misuse Detection Misuse Detection


Unwanted access by a user that has the ability to access data is referred to as misuse.
This is also known as the insider problem. Some estimates have shown that 80 % of computer crime is a result of misuse.

For data warehouses the threat of misuse is high especially by privileged users.

117

Summary Summary
DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE. Application Level Security assists EIS that access data warehouses by hiding certain reports from users. Encryption can be used to further protect against the risk of someone walking off with the data warehouse. Audit Trails are useful for: Catching attackers Identifying usage trends of the data warehouse
118

59

Moving Data to the Data Warehouse


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

119

Moving Data to the Data Warehouse Moving Data to the Data Warehouse

1) Moving Data into the Data Warehouse 2) Updating the Data Warehouse 3) Full Refresh 4) Copy Only the Changes 5) BCP 6) Simple Transformations 7) Complex Transformations 8) Commercial ETL Tools
120

60

Moving Data into the Data Warehouse Moving Data into the Data Warehouse
Data must be moved to the data warehouse from source systems. Some key issues:
Determine the frequency of data updates -- how often should data be moved from source systems to the data warehouse. Various means of updating data in the warehouse exist: SQL Commands Database system load programs (e.g.; SQL Servers BCP) Commercial tools

121

Updating the Data Warehouse Updating the Data Warehouse


OLTP (On-Line Transaction Processing) Systems have to send their updates to the data warehouse.
Finance OLTP Application Inventory OLTP Application Sales OLTP Application

Data Warehouse
Finance Subject Area Inventory Subject Area Sales Subject Area

122

61

Frequency of Updates to the Data Frequency of Updates to the Data Warehouse Warehouse
Updates may occur daily, weekly, monthly, or in real-time.
Finance OLTP Application
te da Up ily Da
Finance Subject Area

Inventory OLTP Application


ate Upd kly Wee

Sales OLTP Application


te pda ly U nth Mo

Data Warehouse
Inventory Subject Area Sales Subject Area

123

Determining the Frequency of Updates Determining the Frequency of Updates


Requirements should drive update frequency Range of updates runs from real-time, to quarterly.
Real time update Expensive Requires update of warehouse while users are querying Daily update Somewhat cheaper than real time, but significant maintenance required if the warehouse has lots of tables. Monthly or weekly update Much more manageable

124

62

Updating the Warehouse Updating the Warehouse


Full Refresh vs. Only the Changes
Inventory OLTP Application
ges an Ch

Finance OLTP Application

Sales OLTP Application


es tabl o m e les of s b e s h ther ta refr o F u l l ges for n cha

ate pd tu las ce sin

Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.

Source OLTP

esh efr ll R Fu
Finance Subject Area

Data Warehouse
Inventory Subject Area Sales Subject Area

125

Full Refresh Full Refresh

Target Data Warehouse

Source Table

Target Table

126

63

Copy Only the Changes Copy Only the Changes


Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.
Source OLTP Source Table Target Table Target Data Warehouse

Modified data since last update to the warehouse Data from two updates ago. Historical data no longer in source OLTP.

127

Full Refresh vs. Only the Changes Full Refresh vs. Only the Changes
Full Refresh
Pros Much easier to implement Less chance of messing up your database (good data integrity) Cons Can take a lot longer to actually do -- may run out of night Can lose out on warehouse ability to track historical data.

Only the Changes (DELTA)


Pros Tracks historical data Cons Can be very hard to implement Can require changes in source applications (more on this later)
128

64

Full Refresh Using INSERT-SELECT Full Refresh Using INSERT-SELECT


One way to move data from one table to another is via the INSERT-SELECT.
Syntax: INSERT INTO <target_table> <any sql SELECT statement>

Example:
INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE

TARGET
129

Updating Changes Using INSERT-SELECT Updating Changes Using INSERT-SELECT


Changes may be moved by adding a WHERE clause to the INSERT-SELECT. Example:
INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE WHERE DATE-UPDATED = DATEPART(m, CURRENT_TIMESTAMP)

130

65

Updating Using BCP Updating Using BCP


BCP is the bulk copy program that comes with MS SQL Server.
Bulk copy (BCP) moves data to or from a flat file to a SQL table.

Syntax: bcp <table> [in | out] <data file>


Source OLTP Unload
Temporary Flat File

Target Data Warehouse Load

Source Table

Target Table

131

BCP Example BCP Example


To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:
bcp pubs..publishers out publishers.txt -c -Sservername -Usa -Ppassword

To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:
bcp pubs..pub2 in publishers.txt -c -Sservername -Usa -Ppassword

132

66

Simple Transformation Simple Transformation


In addition to moving data from OLTP to the warehouse, it is often necessary to transform data.
Example: System A stores TOTAL_CLOTH in meters and system B stores TOTAL_CLOTH in yards. Before the data is moved from system A, we need to transform the data.

Store 31
(Pattern = 31, TOTAL_CLOTH = 50

Store 32

yards )

TRANSFORMATION

(Pattern = 32, Total Cloth = 20

meters)

Data Warehouse
P a t t e r n = 3 1 , T o t a l C l o t h = 5 0 yards P a t t e r n = 3 2 , T o t a l C l o t h = 7 0 yards 133

Complex Transformation Complex Transformation


More complex transformations occur when a value in a source table must be moved to several locations in a data warehouse.
BLUE3 4 8 4
(Color = Blue, 34 Inches, LS)

34 in CONVERT TO CENTIMETERS

BLUE

TO RT VE 4 ON DE 8 and C O C eves) bles sle o ta g (lon in tw put

84

TABLE 2 TABLE 1 86.36 cm COLOR TABLE 3 Long Sleeves TABLE 4 Long Sleeves

Data Warehouse

134

67

Commercial ETL Tools Commercial ETL Tools


Key tools in the marketplace
Informatica Ardent DecisionBase (Platinum) Microsoft Data Transformation Services

All provide libraries of common transformations. All provide the ability to code complex transformations.

135

Data Transformation Services Data Transformation Services

136

68

Choose a Source Choose a Source

137

Choose a Destination Choose a Destination

138

69

Choose to use a Query for Transfer Choose to use a Query for Transfer

139

Enter SQL Query Enter SQL Query

140

70

Choose Destination TableName Choose Destination TableName

141

Verify Transformation Verify Transformation

142

71

Decide When to Run Transformation Decide When to Run Transformation

143

Final Verification Final Verification

144

72

Run Transformation Run Transformation

145

Check Results Check Results


select * from orderfact
orderid 10248 10248 10248 10249 10249 orderdate productid productname 11 42 72 14 51 Queso Cabrales Singaporean Hokkien Fried Mozzarella di Giovanni Tofu Manjimup Dried Apples quantity unitprice 12 10 5 9 40 14.0000 9.8000 34.8000 18.6000 42.4000 discount 0.0 0.0 0.0 0.0 0.0

1996-07-04 00:00:00.000 1996-07-04 00:00:00.000 1996-07-04 00:00:00.000 1996-07-05 00:00:00.000 1996-07-05 00:00:00.000

146

73

Summary Summary
ETL is one of the hard parts of building a data warehouse. Either full refreshes of data or just the changes may be done. Doing full refresh is easy, but historical data is lost and it may take a lot of time. Tracking changes is a tough business. ETL commercial tools are beginning to mature and can lessen the pain of this task.

147

More Ways of Moving Data to the Data Warehouse


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

148

74

More Ways of Moving Data More Ways of Moving Data to the Data Warehouse to the Data Warehouse

1) Determining What Data Has Changed 2) Recovery Logs 3) Triggers 4) Insert Triggers 5) Delete Triggers 6) Update Triggers 7) Manual Detection

149

More Ways of Moving Data More Ways of Moving Data to the Data Warehouse to the Data Warehouse
There is a need to move data into the data warehouse from OLTP and DSS applications The problem is detecting what data needs to be moved into the data warehouse Three methods:
Recovery Logs Triggers Manual Techniques

150

75

Determining What Data Has Changed Determining What Data Has Changed
Problem: How to get updates made to the source to the same information in the data warehouse?

SOURCE

How to get updates from Source Table A to Data Warehouse Table B

DATA WAREHOUSE

A LE TAB
S TE DA UP
P OLT

B LE TAB

151

Determining What Data Has Changed (cont.) Determining What Data Has Changed (cont.)
Problem: How to get updates made to multiple sources to the same information in the data warehouse?
SOURCE DATA WAREHOUSE

A LE TAB
ROW X

Employee
UPD A ROWTES X
NAME DEPT. Fred Mktg Hank Sales Sue Joe UPDATES IT Sales SALARY 35000 60000 71000 50000

? ?

A LE TAB
ROW X

B LE TAB
ROW X

EmployeeCount
DEPT Mktg Sales IT HR COUNT 1 1 2 1 0

SalaryInfo
DEPT AVG SAL TOT SAL

P OLT

Insert into Employee Values (Joe,Sales,50000)

Mktg 35000 IT 71000 HR 0 Sales 60000

35000 71000 0 60000

55000

110000

152

76

What is the Recovery Log? What is the Recovery Log?


Recovery log is used for transaction processing
Used to handle errors Does contain before and after image.

Recovery log can be used to identify the data to be updated in the data warehouse.
Change Data Capture Utility This scans the database log and identifies all changes that the user is interested in and either writes them to a file or stores them in another table.

153

Change Data Capture Utility in Action Change Data Capture Utility in Action

SOURCE
OLTP

DATA

DBMS LOG

All changes to DBMS

RECOVERY LOG

S AD RE

CHANGE DATA CAPTURE UTILITY

DATA WAREHOUSE

WRITES

154

77

Example of Using Recovery Log Example of Using Recovery Log


Consider an update to the Employee table
The information is recorded in the log The change data capture reconstructs update Can then be sent to the data warehouse
UPDATE EMPLOYEE Where SSN=10

LOG
TABLE=EMPLOYEE SSN=10 OldSalary=100, NewSalary=200

SET Salary=Salary*2.0

CHANGE DATA CAPTURE RECONSTRUCTS

DATA WAREHOUSE

UPDATE
155

Using the Recovery Log Using the Recovery Log


Recovery logs are usually in proprietary format. Use commercial tools to read the log and identify the changes. Commercial tools such as CAs log analyzer can place the results of their work in a table.

156

78

Summary of Change Data Capture Summary of Change Data Capture


Pro
Log exists anyway, might as well use it to find what has changed

Con
Some difficult scenarios may occur where it is hard to see what the new update should be in the Data Warehouse. Proprietary format, may not be supported in many DBMS and will always lag behind DBMS development. Many tables will be in the source that have nothing to do with the data warehouse, but change data capture will process their changes as well.

157

Triggers Triggers
Triggers allow DBAs to specify that when an event such as an INSERT, UPDATE, or DELETE occurs on a table, another event is triggered.
Triggers are used to identify changes that are needed by the warehouse. A trigger can be added to a source table and whenever the source table is updated, an update can be placed either directly in the warehouse or in a staging table that tracks all updates.

Triggers can be used to detect the changes and perform data warehouse updates.
A different trigger might be run on key updates so that the data warehouse nightly process would know what data has changed.
158

79

Example of a Trigger Example of a Trigger


STAGING
STEP 2

A LE TAB
Values (X, Y) are inserted

When values are inserted, sets off the TRIGGER

X, Y
TRIGGER inserts values (X, Y) into a STAGING area

STEP 3

Nightly Process

STEP 1 STEP 4

Nightly Process inserts values (X, Y) into the Data Warehouse DATA WAREHOUSE

INSERT into TABLE A VALUES (X, Y)

A LE TAB
Values (X, Y)

159

Real-Life Trigger Example Real-Life Trigger Example


OLTP/DSS Data - Employee table:
Employee (ssn, name, salary)

DW Data - Summary table:


EmployeeStatistics (total number employees, total salary paid, average salary).

When a row is inserted in the employee table, we need to do an insert into the EmployeeStatistics table.
Shown on the next page

160

80

Insert Trigger Example Insert Trigger Example


CREATE TRIGGER EmployeeInsertTrigger ON Employee FOR INSERT AS BEGIN UPDATE EmployeeStatistics SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED) UPDATE EmployeeStatistics SET TotSalary = TotSalary + (SELECT SUM(Salary) FROM INSERTED) UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee END

161

Insert Trigger in Action Insert Trigger in Action


COMMANDS INSERT INTO EMPLOYEE VALUES (1, 'John', 300) RESULTS

(1 ROW(S) AFFECTED)

INSERT INTO EMPLOYEE VALUES (2,'Mike', 400)

(1 ROW(S) AFFECTED)

SELECT * FROM EMPLOYEE

Employee EmpId Name Salary ------ -------------------------1 John 300.00 2 Mike 400.00 EmployeeStatistics NoEmployee TotSalary ---------- ---------2 700.00

SELECT * FROM EMPLOYEESTATISTICS

AvgSalary --------350.00
162

81

Delete Trigger Example Delete Trigger Example


CREATE TRIGGER EmployeeDeleteTrigger ON Employee FOR DELETE AS BEGIN DECLARE @numberEmployee int UPDATE EmployeeStatistics SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED) UPDATE EmployeeStatistics SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED) SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics IF @numberEmployee > 0 BEGIN UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee End ELSE UPDATE EmployeeStatistics SET AvgSalary = 0.0 END 163

Update Trigger Example Update Trigger Example


CREATE TRIGGER EmployeeUpdateTrigger ON Employee FOR UPDATE AS BEGIN IF UPDATE (Salary) UPDATE EmployeeStatistics SET TotSalary = TotSalary (SELECT SUM(Salary) FROM DELETED) + (SELECT SUM(Salary) FROM INSERTED) UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee END

164

82

Summary of Using Triggers Summary of Using Triggers


Pro
Only needed for tables whose data is going to go to the DW

Con
Additional work needed to create detailed triggers Non-trivial to generate a trigger to implement appropriate action May not be acceptable for commercial software on source system

165

Other Ways to Determine What Has Changed Other Ways to Determine What Has Changed
There are other manual ways of detecting the change and doing DW updates
Look at each row of OLTP and the data in the warehouse Compare the differences between the two files, if the data is not in the warehouse, add it!

OLTP
Hank John Mike Sam

DATA WAREHOUSE
Hank John

RE PA Mike M CO
ADD THE DIFFERENCES

166

83

Manually Identifying What Has Changed Manually Identifying What Has Changed
Pro
Flexible

Con
Very expensive Could take a long time

167

Summary Summary
Recovery Logs Triggers Manual Detection

168

84

Data Warehouse Design


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

169

Data Warehouse Design Data Warehouse Design

1) Overview 2) Describing a Design - ER Diagrams 3) Design Normalization 4) Star Schema Design

170

85

Overview Overview
How to describe a design
Entity Relationship (ER) Diagram

Types of Designs
Normalized Star Schema Snowflake

171

Describing a Design Describing a Design


Different techniques exist, the most prevalent is the ER (Entity-Relationship) Diagram Entities
Things that occur in the real world, usually nouns e.g.; employee, part, product, etc.

Relationships
How entities interact, example: one employee may attend many colleges -- usually verbs Types of relationships 1-1 1-Many Many-1 Many-Many
172

86

Examples of Relationships Examples of Relationships


1-1 1-MANY

MANY-1

MANY- MANY

173

Normalized Design Normalized Design


Methodology
All 1-1 relationships are placed in a single table. Many-many relationships require two tables that store the singlevalued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.

Checking the design


In a Normalized Design, there are many different normalized forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF. 1NF 2NF 3NF
174

87

Dealing With Many-Many Relationships Dealing With Many-Many Relationships


For Many-Many
Two 1-1 Tables (SUPPLIER, PARTS) One linking table (SP) Ex: Suppliers, Parts are the 1-1, SP is the linking table that says who sells what parts.

SUPPLIER
S# 1 2 SNAME SEARS OFFICE DEPOT

PARTS
P# 1 2 PNAME HAMMERS NAILS

SP
S# 1 1 2 2 P# 1 2 1 2

175

Normalized Design: Example Normalized Design: Example


A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
Entities: Customer, Product, Store Relationships: Customer buys Product Product is located in Store Product is supplied By a Vendor
VENDOR

CUSTOMER

PRODUCT

STORE

BUYS

IS-LOCATED-IN

176

88

Checking a Normalized Design Checking a Normalized Design


Normalization
Used to reduce data insertion, delete, and update anomalies caused by bad designs. Enables users to quickly check a design and make sure there are no glaring holes in the design. 1NF All cells are atomic -- i.e. each entry in a column contains only one value 2NF All non-key values are functionally dependent upon the entire primary key -- i.e. if the primary key changes, all other columns change. 3NF No transitive dependencies -- i.e. all keys are completely dependent on the primary key. If the primary key changes, all non-key columns are affected.
177

Overview of Normalized Design Overview of Normalized Design


Pro
Relatively easy to change

Con
Queries can involve numerous joins The massive number of tables and links between tables makes it hard for customers to build their own queries

178

89

Star Schema Star Schema


Methodology
Single fact table in the middle describing a key event (e.g. sale) surrounded by dimension tables (i.e. location, time, employee)
D = DIMENSIONS

D1

D2

FACT
D5 D4

D3

179

Star Schema: Methodology Star Schema: Methodology


Identify a key fact that occurs.
Usually some event creates a real fact. Selling a product in a store on Wednesday, patient visiting a hospital, etc.

Identify all the dimensions of the data being used. Think of a dimension as a way to slice the data.
Ex: by time, by product, by customer, etc.

Drill down operations are very well supported

180

90

Star Schema: Example Star Schema: Example


A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time. Fact
CustomerPurchase

Dimensions are
Customer Product Time Vendor

181

Star Schema: Example (cont.) Star Schema: Example (cont.)


Customer

Time

Sale
Store Product

Price

SALE

SALE ID

CUST. ID

STORE ID

PROD. ID

PRICE

TIME

1
CUSTOMER CUST. ID

3
NAME

7
PHONE

$3.00
Buys Apples

4/24/99
Has Big Car

3
TIME

FRED
DAY 24 4

1234

Y
YEAR 99

MONTH

QTR 2Q

182

91

Star Schema: Overview Star Schema: Overview


Pro
Easy for users to navigate and understand

Con
Performance Can end up with one monster fact table, millions of rows Flexibility Not as easy for customers to change the design

183

Snowflake Schema Snowflake Schema


Several stars can be connected to form a snowflake
MARKETING
Distribution

Ad

Direct Mail Price

SALES
Marketing Revenue

Sales Location

PRODUCT
Parts
Manufacturing

Sale
Price Vendor

Make Chips
Product

Cost

Price Labor

184

92

Summary Summary
Two basic types of design
Star Schema Normalized

Many Data Warehouse vendors sell products built specifically for the star schema Some data warehouses insist that normalization is the way to build the data warehouse.

185

Building a Data Warehouse


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

186

93

Building a Data Warehouse Building a Data Warehouse

1) Top Down Approaches 2) Enterprise Data Model Approach 3) "Let Data Users Decide" 4) "Let Data Warehouse Builders Decide" 5) "Let Senior Management Decide" 6) Bottom Up Approach

187

Building the Data Warehouse Building the Data Warehouse


How to decide what data goes into the data warehouse? Methods:
Top Down Using Enterprise Data Models "Let data users decide" approach "Let data warehouse builders decide" approach "Let senior management decide" approach Bottom Up Combine data marts into a data warehouse

188

94

Using Enterprise Data Models Using Enterprise Data Models


Use the Enterprise Data Model to decide what data goes into the data warehouse.
Model key processes. This approach says let the business decide. Identify key data used by these processes in an enterprise data model -- might be a giant Entity-Relationship diagram.

Put data in the warehouse based on the enterprise data model.

189

An Enterprise Data Model Example An Enterprise Data Model Example


MAKE CHIPS
PUT IN BAGS SELL CHIPS

COUNT $$

BUY MORE POTATOES

CHIP SUPPLIERS
CHIP RECIPES

INGREDIANTS

190

95

"Enterprise Data Model" Approach "Enterprise Data Model" Approach


Pro
All inclusive -- no chance of leaving key data out.

Con
Very difficult to build an EDM. If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.

Ways of Avoiding the Con


In some cases you can buy an EDM -- if the business is common enough the packaged EDM might be very close and then you just have to modify it to fit your business.

191

"Let Data Users Decide" "Let Data Users Decide"


Let the users of the data warehouse choose what data will go into the warehouse.
The data users deciding the data warehouse data and design will pay for it as well. Also, you can charge users who query the data as well.

SOURCE

USERS

DATA WAREHOUSE
192

96

"Let Data Users Decide": An Example "Let Data Users Decide": An Example
DATA WAREHOUSE

DATA DATA
demographics

DATA
budget

trends
Advertising

Ethnic group

?
education

Age

spending

Revenue

MARKETING

HUMAN RESOURCES

FINANCE

193

"Let Data Users Decide" Approach "Let Data Users Decide" Approach
Pro
Reduces budget problems Users know best!

Con
Requires marketing Could end up with data in the warehouse that is meaningless to the people who run the place. Users may not place important data in the warehouse because their budget is small. Users who need the data may not use the DW because of budget concerns.

Ways of Mitigating the Con


Do not just take money -- try to determine if data is really corporate.
194

97

Pay As You Go Warehouse Analogy Pay As You Go Warehouse Analogy

I-495

195

"Let Data Warehouse Builders Decide" "Let Data Warehouse Builders Decide"
The technical staff who is building the warehouse decides what data gets put in the warehouse.
LETS PUT INFORMATION ON HOW TO BUILD VIRUSES IN THE DATA WAREHOUSE

DATA WAREHOUSE

196

98

"Let Data Warehouse Builders Decide" "Let Data Warehouse Builders Decide" Approach Approach
Pro
Very easy to design Does not take much time Do not have to deal with users

Con
Could easily result in data DUMP not data warehouse

Ways to mitigate the con


Talk to lots of users to help you guess what should go in the DW

197

Let Senior Management Decide Let Senior Management Decide


The senior management decides what data goes into the warehouse. Asking the senior management is the safest way to build a data warehouse. Identify the key questions on senior managements mind and get the data to answer these questions.

198

99

Let Senior Management Decide Approach Let Senior Management Decide Approach
Pro
Ensures executive support for the project

Con
Senior management does not have much time for this -- you will have to only get a few questions at a time This dramatically increases visibility - if you do not move quickly senior management will become very angry with the DW.

Ways to mitigate the con


Do your homework before talking to the senior management -- talk to the aides of senior management to find out what is on their mind. Allocate resources so you can plan to move very quickly once you hear from the senior management.
199

Bottom-Up Approach Bottom-Up Approach


Move data from existing OLTP Applications to data marts. Combine data marts into a data warehouse.
DATA WAREHOUSE

DATA MART 25 YARDS

DATA MART 50 METERS

DATA MART 200 CM

OLTP APP

OLTP APP

OLTP APP

200

100

Bottom-Up Approach Bottom-Up Approach


Pro
Data marts are much easier to build than full-fledged DW.

Con
Could end up with a bunch of stove pipe data marts.

Ways to mitigate the con


Develop standards for data when building the data marts so that you can glue data from different data marts together.

201

Recommendations for an Approach Recommendations for an Approach


"Let senior management decide"

202

101

User Interface to the Data Warehouse


(slides in this section are used courtesy of Carrig Emerging Technology Ph: 410- 553- 6760 www.c a r r i g e t. c o m )

203

User Interface to the Data Warehouse User Interface to the Data Warehouse

1) Introduction 2) Types of Users 3) Functions Users Want to Do 4) Approaches to Building a User Interface 5) Hand Built 6) Class Libraries 7) OLAP Tools 8) Types of User Interfaces
204

102

Introduction Introduction
A User Interface (UI) is a front end application designed for the user that presents information in a simplified manner.
Data in a data warehouse does nothing if users cannot access it Users do not want to learn SQL to drive DW applications
Finance OLTP Application Inventory OLTP Application Sales OLTP Application

DATA WAREHOUSE Finance OLTP Data Inventory OLTP Data Sales OLTP Data

USER INTERFACE

205

Building User Interfaces Building User Interfaces


DW applications have different types of users with different functionality requirements.
It is critical to identify the key users. Once you do this, you need to identify their functional requirements.

There are three main approaches to building UIs


Build your own entirely Use commercial Class Libraries Using OLAP Tools

206

103

Types of Users Types of Users


CEO Executive Marketing Analysts Everyone Executive Sales Analysts Everyone Executive Finance Analysts Everyone
207

Types of Users (cont.) Types of Users (cont.)


Executives
People who run the place Need answers quickly May not be very technical Expect UI to get them what they want quickly and efficiently without any need for special training

Analysts
Have time to really analyze data and think about it May have strong statistical and IT background (i.e. Power user of Excel) Expect UI to have many complex features, and provide the ability to generate new queries and perform statistical analysis of the data.
208

104

Types of Users (cont.) Types of Users (cont.)


Regular User
All other users Just need some simple answers to simple questions like What is Hanks phone number) Expect UI to be simplistic, easy to understand, and provide access to basic information.

209

Subject Matter Experts Expect Subject Matter Experts Expect


Query data in the data warehouse Trend analysis
show me how much money we have spent on computers in the last four years
Trend

Sales

1995

1999

Benchmark to competitors
what are all our competitors charging for product X
210

105

Subject Matter Experts Expect (cont.) Subject Matter Experts Expect (cont.)
Drill Down
on that chart you just showed me, I noticed that revenue was down in Region #4. Please drill down and show me the breakdown of each area in Region #4.
DRILL DOWN WAL-MART
20 15 10 5 0 1 2 3 4 REGIONS
DRILL DOWN Revenue

REVENUE

Y Values X Values

MD

DC VA Region 4

211

Approaches to Building User Interfaces Approaches to Building User Interfaces


Hand-Built
Write all of your own code

Use Class Libraries


Use an object oriented approach and buy the CLASS libraries that do all the hard work

OLAP
Use an On-Line Analytical Processing package to build user interfaces for you.

212

106

Architecture of User Interfaces (cont.) Architecture of User Interfaces (cont.)


Hand Built
USER INTERFACE i.e. JAVA

DATA WAREHOUSE DBMS


Commercial Off The Shelf

Class Libraries
USER INTERFACE
GRAP HIC CLAS S S LIBR ARY

(COTS)

OLAP CLASS LIBRARY

Hand Built

USER E FAC INTER SS CLA RY LIBRA

213

Architecture of User Interfaces (cont.) Architecture of User Interfaces (cont.)


OLAP
YEAR
E OR ST

REGION

Result Cube Commercial Off The Shelf (COTS)

REVENUE
USER INTERFACE

DBMS

214

107

Hand-Building User Interfaces Hand-Building User Interfaces


Write all the code yourself
Requires many design documents, coding and testing for all of the code components.

Pros
Very flexible

Cons
Could take a long time to develop Requires substantial resources May need lots of testing and debugging

215

Using Class Libraries to Build User Interfaces Using Class Libraries to Build User Interfaces
Write initial user dialog yourself and call class libraries for the hard part (graphics and data access functionality). Pro
Many class libraries available -- avoid doing a lot of coding yourself

Con
Not as flexible -- if the class library does not do what you want it to do you have to Find a new class library Live without the functionality Can take a while to find the class library you need and learn how to interface to it
216

108

Using OLAP Tools to Build User Interfaces Using OLAP Tools to Build User Interfaces
Many different OLAP tools
Need to survey an OLAP tool Buy an OLAP tool Install it If it does not match all requirements some code may be needed to communicate with the OLAP tool.

Three types multi-dimensional OLAP


Relational OLAP (ROLAP) Multi-dimensional (MOLAP) Hybrid (HOLOP) Distributed (DOLAP)

217

Summary of Tools for UI Development of DW Summary of Tools for UI Development of DW


Tools that may be used include:
Development of in-house software Do it all yourself Use Class Libraries OLAP ROLAP MOLAP HOLAP DOLAP

Different tools or techniques may be useful depending upon what kind of user interface is being developed.
Executive Information Systems Analytical Systems Enterprise Information Systems
218

109

Types of User Interfaces Types of User Interfaces


Executive Information System
Developed for the person who runs the place

Analytical System
Developed for business analysts

Enterprise Information System


Developed for users throughout the organization
EXECUTIVE INFORMATION SYSTEM

CEO

Executive

Executive

Executive

Marketing Analysts
ENTERPRISE INFORMATION SYSTEM

Sales Analysts

Finance Analysts
ANALYTICAL SYSTEM

Everyone

Everyone

Everyone
219

Executive Information System Executive Information System


The Executive IS is developed specifically for people who run the organization. Development process:
No clean life cycle Prototype constantly. Usually have to guess at what executives will want to see Show executives let them come up with ideas for revisions Drill down functionality required

Tools
Frequently hand-built, but purchasing a class library can help lower the development cost. May just want to use tools that allow development of a subscription service in which users may Subscribe to a few canned reports.
220

110

Analytical System Analytical System


Analytical systems are user interfaces developed for business analysts in an organization. Development process:
Allow users to drag-and-drop data around to further the analysis of this data. More complex interface is acceptable Users may be required to know some SQL knowledge

Tools:
OLAP Tools are frequently used to build the interface

221

Enterprise Information System Enterprise Information System


Enterprise IS is written for the general user to retrieve simple, key information. Development process:
Frequently developed in-house So many users around that you really cannot pick a few and ask what they need. Simpler than Executive IS as it does not require drill down functionality.

Tools
Place some simple, key information on a few screens and control access and then deploy.

222

111

Summary of Types of User Interfaces Summary of Types of User Interfaces


Executive Information System
For the senior executives Use in-house development or in -house development augmented by class libraries

Analytical System
OLAP may make sense here as the interface is more complicated, but OLAP has drawbacks due to: Data sparseness No well accepted query language

Enterprise Information System


Much simpler than executive system Good candidate for in-house development
223

112