
Aalborg University

DE-1 project

Building a Data Warehouse

Authors: Dovydas Sabonis, Femi Adisa

Supervisor: Liu Xiufeng

December 19, 2008

Faculty of Engineering and Science


Aalborg University

Department of Computer Science

PROJECT TITLE: Building a Data Warehouse

PROJECT PERIOD: DE-1 September 2, 2008 - December 19, 2008.

GROUP MEMBERS: Dovydas Sabonis, Femi Adisa

SUPERVISOR: Liu Xiufeng

REPORT PAGES: 60

Contents
1 Introduction
    What is a Data Warehouse?
    What is Data Warehousing?
    Why build a Data Warehouse?
    The Case Study
    Summary

2 The Data Warehouse Architecture
    Data Flow Architecture
    Summary

3 The Methodology
    The Four-Step Design Process
    Summary

4 Functional Requirements
    4.1 Summary

5 Data Modelling
    5.1 Data Modelling
    5.2 Data Modelling Primer
        5.2.1 Dimensional Model
        5.2.2 Metadata
    5.3 Designing the Dimensional Data Store
        5.3.1 STEP 1: Selecting the Business Model
        5.3.2 STEP 2: Declaring the Grain
        5.3.3 STEP 3: Choosing the dimensions
    5.4 Slowly Changing Dimensions
    5.5 Data Hierarchy
    5.6 The Date Dimension
    5.7 The Office Dimension
    5.8 The Product Dimension
    5.9 The Customer Dimension
    5.10 Step 4: Identifying the Facts
    5.11 Source System Mapping
    5.12 Summary

6 The Physical Database Design
    6.1 The Physical Database Design
    6.2 The source system database
    6.3 The Staging area database
    6.4 The DDS database
    6.5 The Metadata database
    6.6 Views
    6.7 Summary

7 Populating the Data Warehouse
    7.1 Populating the Data Warehouse
    7.2 Populating the Stage database
    7.3 Data Mappings
    7.4 Control Flow
    7.5 Moving Data to the DDS
    7.6 Populating the Dimension tables
    7.7 Populating the Fact table
    7.8 Preparing for the next upload
    7.9 Scheduling the ETL
    7.10 Summary

8 Building Reports
    Selecting Report fields
    Summary

Bibliography

List of Figures
1.1 A simple Data Warehouse
2.1 Elements of a Data Warehouse
2.2 A single DDS architecture
3.1 The Four-Step Dimensional Design Process
5.1 The Four-Step Dimensional Design Process
5.2 Product Sales Data mart
5.3 The Product dimension Hierarchy
5.4 The Customer dimension Hierarchy
5.5 The Date dimension Hierarchy
5.6 The Office dimension Hierarchy
5.7 The Date dimension
5.8 The Office dimension
5.9 The product dimension
5.10 The Customer dimension
5.11 The Product Sales Data mart
7.1 Data flowing through the warehouse
7.2 Sample customer table
7.3 The Metadata data flow table
7.4 Source-to-stage mappings
7.5 Stage ETL Control Flow
7.6 DDS ETL Control Flow
7.7 Populating the Customer Dimension
7.8 Slowly changing Customer Dimension
7.9 Merge Joining Orders and OrderDetails tables
7.10 Retrieving the Office code
7.11 Business to Surrogate key
7.12 Fact table mapping
7.13 Populating the Fact table
7.14 Creating an SQL Agent Job
7.15 The ETL scheduler
8.1 Creating the Profit report
8.2 Building the Profit report
8.3 Designing the report matrix
8.4 Sales by country report
8.5 Model Sales Report
8.6 Model Sales Report

List of Tables
4.1 Functional Requirements
5.1 Type 2 response to SCD
5.2 Type 3 response to SCD

Faculty of Engineering and Science


Aalborg University

Department of Computer Science

TITLE: Building a Data Warehouse
PROJECT PERIOD: DE, September 1st 2008 - December 19th 2008
PROJECT GROUP: DE-1
GROUP MEMBERS: Dovydas Sabonis, Femi Adisa
SUPERVISOR: Liu Xiufeng
NUMBER OF COPIES: 4
REPORT PAGES: ??
TOTAL PAGES: ??

ABSTRACT:
This report documents our experience learning the fundamental aspects of data warehousing. It describes our foray into building a data warehouse for a fictitious company, the design decisions we made along the way, and the obstacles we encountered.

Chapter 1 Introduction
What is a Data Warehouse?
Before we get down to work and try to build a data warehouse, we feel it is important to first define a data warehouse and its related terminology, and to explain why organizations decide to implement one. Further down we will discuss what should be the driving force behind the decision to build a data warehouse and where the focus should lie during implementation. While various definitions abound for what constitutes a data warehouse, the definition we believe best describes one is given by [1]: A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process. We take a moment to go through the definition. A data warehouse is subject oriented; it is specifically designed to address a particular business domain. A data warehouse is integrated; it is a repository of data from multiple, possibly heterogeneous data sources, presented with consistent and coherent semantics. Data in a data warehouse comes from one or more source systems. These are usually online transaction processing (OLTP) systems that handle the day-to-day transactions of a business or organization. A data warehouse is time-variant; each unit of data in a data warehouse is relevant to some moment in time. A data warehouse is non-volatile; it contains historic snapshots of various operational system data, and is durable. Data in the data warehouse is usually neither updated nor deleted; instead, new rows are uploaded, usually in batches on a regular basis. A data warehouse supports management's decision making process; the main reason for building a data warehouse is to be able to query it for business intelligence and other analytical activities. Users employ various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to probe, retrieve and analyze (slice and dice) the data in a data warehouse to get a deeper understanding of their businesses. They can analyze sales by time, customer, and product, or analyze the revenue and cost for a certain month, region, and product type.

ETL: Data from source systems is moved into the data warehouse by a process known as ETL (Extract, Transform and Load). It is basically a system that connects to the source systems, reads the data, transforms it, and loads it into a target system. It is the ETL system that integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The data is moved from the source system into the DDS because data in the DDS is arranged in a dimensional format that is more suitable for analysis, and because this avoids querying the source system directly. Another reason is that a DDS is a one-stop shop for data from several source systems.

Figure 1.1: A Data Warehouse in its simplest form.

What is Data Warehousing?


This is the process of designing, building, and maintaining a data warehouse system.

Why build a Data Warehouse?


The most compelling reason why an organization should want a data warehouse is to help it make sense of the vast amount of transactional data that the business generates, the volume of which grows tremendously from day to day. Before the advent of data warehousing, data from OLTP systems was regularly archived onto magnetic disk and kept in storage for a period of time: in case something went wrong and the data needed to be restored, because regulations required it (as in the banking and insurance industries), or simply to keep the operational systems performing well. It was only much later that the potential these data hold for analyzing business activities over time, as well as for forecasting and analyzing trends, was recognized. Even then, it was not feasible to get a consolidated or integrated overview of the data: the technology was lacking, the information often came from several disparate systems, and the available reporting tools could not deal with them. Technology has come a long way since, and data warehousing has matured with it. Any organization that runs an OLTP system in the day-to-day running of its business knows that the information contained within it, when analyzed properly, can help leverage the business and support management decision making.

It is important to mention at this early juncture that the decision to build a data warehouse should, to a large extent, be a business decision and not one of technology. Early data warehouse projects failed because project managers focused on delivering a technology, and in that respect they succeeded. But what they delivered was beautiful nonsense: nice to look at and state of the art, but of little benefit to business users. The business users and their needs were not properly incorporated into the data warehousing effort; instead the focus was on delivering the technology. These projects failed not because a data warehouse was not delivered; on the contrary, they delivered a product that did not meet the needs of the business users and, as a result, was abandoned. It is of utmost importance to get business users involved in every stage of the data warehouse development cycle, and to put in place a mechanism for constant interaction and feedback, from the moment a need is identified until the final delivery of a fully functional data warehouse.

The Classic Car Case study


During the course of this project we will be building a data warehouse for a fictitious company called Classic Cars Inc. We try to cover all the core aspects of data warehousing: architecture, methodology, requirements, data modeling, ETL, metadata, and reports. Building a complete data warehouse is not feasible given our time frame and human resources, so it is very important that we define a scope for our project. We do this by analyzing the source system to learn what kind of data resides in it and what we can derive from it. The Classic Cars source database contains sales order transaction data, which makes it ideal for constructing a sales data mart.

Classic Cars Inc. is in the business of selling scale models of classic/vintage cars, aeroplanes, ships, trucks, motorbikes, trains and buses. Its customer base spans the globe. It sells only to retailers in different regions, and there is usually more than one customer in a country. The company is headquartered in the USA and has branch offices in several countries, each responsible for different geographical regions. Customers send in their orders and the company ships them via courier. Each customer has an employee responsible for dealing with them. The company also gives credit facilities to its customers, and each customer has a credit limit depending on their standing with the company. Customers usually mail in their payment checks after they receive their orders. The company does not manufacture the products it sells, but there is no information in the database about its suppliers. We can only assume that its operations are not fully computerized or that it runs several disparate systems.


Summary
In this chapter we gave a breakdown of what data warehousing is and explained what should be the driving force behind every decision to build a data warehouse. We finished by introducing our case study. In the next chapter we will look at the various data warehousing architectures.


Chapter 2 The Data Warehouse Architecture

In this chapter we give a brief overview of data warehouse elements. We describe typical data warehouse architectures and explain which one we have chosen and why. A data warehouse system comprises two architectures: the data flow architecture and the system architecture. The system architecture deals with the physical configuration of the servers, network, software, storage, and clients, and will not be discussed in this report. Choosing which architecture to implement when building a data warehouse is largely based on the business environment that the warehouse will operate in: for example, how many source systems feed into the data warehouse, how the data flows through the data stores to the users, or what kind of data will be requested by end-user applications. Figure 2.1 illustrates the basic elements of a data warehouse.


Figure 2.1: Basic elements of a Data Warehouse

Data Flow Architecture.


According to [3], there are four data flow architectures: single Dimensional Data Store (DDS), Normalized Data Store (NDS) + DDS, Operational Data Store (ODS) + DDS, and federated data warehouse. The first three use a dimensional model as their back-end data stores, but they differ in the middle-tier data store. The federated data warehouse architecture consists of several data warehouses integrated by a data retrieval layer. We have chosen to implement the single DDS architecture, both because our data warehouse will be fed from only one source system and because our DDS will consist of only the sales data mart; it is the simplest, quickest and most straightforward architecture to implement. The architecture is nevertheless extensible: it can quite easily be scaled up to be fed by more than one source system, and the DDS can also comprise several data marts.

Figure 2.2: A single DDS Data Warehouse architecture.


A data store is one or more databases or files containing data warehouse data, arranged in a particular format and involved in data warehouse processes [3]. The stage is an internal data store used for transforming and preparing the data obtained from the source systems before it is loaded into the DDS. Extracting data into the stage minimizes the connection time with the source system and allows processing to be done in the staging area without undue strain on the OLTP systems. We have also incorporated the staging area to make the design extensible: if in the future the DDS is fed from multiple source systems, the staging area will be vital for the processing and transformation. The dimensional data store (DDS) is a user-facing data store, in the form of a database, made up of one or more data marts, each comprising dimension and fact tables arranged in dimensional format for the purpose of supporting analytical queries. We will describe the format of the DDS later. For applications that require the data to be in the form of a multidimensional database (MDB) rather than a relational database, an MDB is incorporated into our design. An MDB is a database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions [3]. Each cell represents a business event, and the values of the dimensions show when and where this event happened. The MDB is populated from the DDS. In between the data stores sit the ETL processes that move data from one data store (the source) into another (the target). Embedded within the ETL is the logic to extract, transform and load the data. Information about each ETL process is stored in metadata: the source and target information, the transformations applied, the parent process, and the run schedule of each ETL process. The technology we have chosen for this data warehousing project is Microsoft SQL Server Integration Services and Analysis Services (MSSIS, MSSAS). It provides a platform for building data integration and workflow applications. It is an integrated set of tools that provides database, multidimensional cube, ETL and reporting capabilities. It also includes the Business Intelligence Development Studio (BIDS), which allows us to edit SSIS packages.


Summary
In this chapter we explained what constitutes a data warehouse architecture. We mentioned the four types of data flow architecture available, explained why we adopted the single DDS architecture, and went on to describe it in detail. We also introduced the technology we will be using. In the next chapter we will explain the methodology we will follow to build the data warehouse and why we have adopted that particular approach.


Chapter 3 The Methodology

In this chapter we discuss the process we will adopt in building our data warehouse. We have chosen to go with Ralph Kimball's Four-Step Dimensional Design Process [2]. The approach was mentioned and recommended in all the literature we read. It is followed by experts in the field, and it was easy to see why after consulting The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling [2] ourselves. Dimensional modelling is well outlined and quite straightforward there, and we felt it provided us with the right footing to hit the ground running when it came to building our own data warehouse.

Figure 3.1: Key Input to the four-step dimensional design process

The Four-Step Design Process.


STEP 1: Selecting a business process to model. A process is a natural business activity performed in an organization that is typically supported by a source system [3].

It should not be confused with a business department. Orders, purchasing, shipments, invoicing and inventory all fall under business processes. For example, a single dimensional model is built to handle orders rather than building separate models for the sales and marketing departments. That way both departments can access orders data; the data is published once and inconsistencies are avoided. After a careful analysis of our source system database, we have selected sales as the business process to model, because it is the only process that can be supported by the data available to us in the source system. We will build a sales data mart for the Classic Cars Co., which should allow business users to analyze individual and overall product sales and individual stores' performances. The norm would have been to set up a series of meetings with the prospective users of the data warehouse as a means of gathering the requirements and selecting which model to implement, but because we do not have this opportunity, we are confined to selecting a model which we feel can best be implemented based on the data available from our source system database.

STEP 2: Declaring the grain of the business process. Here we identify what exactly constitutes a row in the fact table. The grain conveys the level of detail associated with the fact table measurements [3]. Kimball and Ross recommend that a dimensional model be developed for the most atomic information captured by a business process. Typical examples of suitable candidates:

- An individual line item on a customer's retail sales ticket, as measured by a scanner device.
- A daily snapshot of the inventory levels of each product in a warehouse.
- A monthly snapshot for each bank account.

When data is in its atomic form, it provides maximum analytic flexibility because it can be rolled up and cut through (sliced and diced) in every possible manner. Detailed data in a dimensional model is most suitable for ad hoc user queries, a must if the data warehouse is to be accepted by the users.

STEP 3: Choosing the dimensions. By choosing the correct grain for the fact table, the dimensions automatically become evident. These are basically fields that describe the grain items. We try to create very robust dimensions, and this means enriching them with descriptive text-like attributes: fields like order date, which represents the date the order was made, or product description, which helps to describe the product, and so on.

As we understand the problem better, more dimensions will be added as required. Sometimes adding a new dimension causes us to take a closer look at the fact table; adding dimensions should, however, not cause additional fact rows to be generated.

STEP 4: Identifying the numeric facts that will populate each fact table row. Numeric facts are basically business performance measures. According to [2], all candidate facts in a design must be true to the grain defined in step 2. In our case, an individual order details line includes facts such as quantity sold, unit cost amount and total sale amount. These facts are numeric, additive figures that allow for slicing and dicing: their sums will be correct across dimensions, and additional measures can be derived or computed from them. With the proper facts, things like gross profit (sales amount minus cost amount) can easily be computed, and this derived figure is also additive across dimensions.

In building a data warehouse, it is highly important to keep the business users' requirements and the realities of the source data in tandem. One should normally use an understanding of the business to determine what dimensions and facts are required to build the dimensional model. We will do our best to apply Kimball and Ross's four-step methodology to what we believe would be the normal business requirements for this project.


Summary
In this chapter we outlined Ralph Kimball's four-step methodology and explained why it is so popular within the data warehousing community. We talked briefly about our constraint of not having business users to interact with as a means of gathering business requirements for this project, and how we hope to work around this. In the next chapter we will discuss the functional requirements for the data warehouse.


Chapter 4 Functional Requirements

Before diving into the process of data warehousing, it is important to define what is expected from the completed data mart, i.e. what the business users expect to be able to do with our system or, as in our case, what we believe will help Classic Cars achieve their business objectives. Functional requirements define what the system does. By defining the functional requirements, we have a measure of success at the completion of the project, as we can easily look at the data warehouse and determine how well it conforms to, or provides answers to, the various requirements posed in table 4.1. In trying to define the functional requirements, we explored the source system and analyzed the business operations of Classic Cars. In the end, we agreed that the data warehouse should be able to help users answer the following:

No.  Requirement                              Priority
1    Customer purchase history                High
2    Product order history                    High
3    Product sales per geographic region      High
4    Store sales performance                  High
5    Customer payment history                 High
6    Buying patterns per geographic region    High

Table 4.1: Functional requirements for the Classic Cars Data Warehouse.


4.1 Summary

In this short but very important chapter, we outlined what the business users expect from our finished data warehouse. This will very much be the yardstick that determines whether the data warehouse is accepted by the users or not. A data warehouse that does not meet the expectations of the business users would not be used and, from that perspective, would be deemed to have failed. In the next chapter, we combine the functional requirements and the methodology and try to come up with a dimensional model of our data warehouse.


Chapter 5 Data Modelling


5.1 Data Modelling
We start off this chapter by explaining some dimensional modeling terms, and then design the data stores. The functional requirements tell us what to include in the data stores. We will use the dimensional modeling approach and follow the Four-Step Dimensional Design Process [2] outlined in the previous chapter. We will first define and then build our fact table surrounded by its dimension tables; the contents of the fact and dimension tables are dictated by the functional requirements defined in the previous chapter. We will also construct a data hierarchy and a metadata database.

5.2 Data Modelling Primer

Fact Table: A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored [2]. Measurements from a single business process are stored in a single data mart. A fact represents a business measure, e.g. quantities sold or dollar sales amount per product, per day, in a store. The most useful facts in a fact table are numeric and additive, because the usual operation on warehouse data is retrieving thousands of rows and adding them up. A fact table has a primary key that is a combination of primary keys from the dimension tables (foreign keys). Also known as a composite or concatenated key, this expresses the many-to-many relationships among the dimension tables. Not every foreign key in the fact table is needed to guarantee uniqueness. Fact tables may also contain a degenerate dimension (DD) column: a dimension with only one attribute, which is added to the fact table instead of having a dimension table of its own with only one column.

Dimension Tables: These contain the textual descriptors that accompany the data in the fact table. The aim is to include as many descriptive attributes as possible, because they serve as the primary source of query constraints, groupings, and report labels. For example, when a user wants to see model sales by country and region, country and region must be available as dimension attributes. Dimension tables are the key to making the data warehouse usable and understandable, and should contain verbose business terminology as opposed to cryptic abbreviations [2]. They are highly denormalized and as a result contain redundant data, but this is a small price to pay for the trade-off: what we gain is ease of use and better query performance, as fewer joins are required. The data warehouse is only as good as its dimension attributes. Dimension tables also represent hierarchical relationships in the business.
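To make these concepts concrete, the T-SQL sketch below shows a minimal dimension table and a fact table with surrogate foreign keys, a composite primary key and degenerate dimension columns. The table and column names are our own illustrations, not the final design presented later in this chapter.

CREATE TABLE dbo.DimProduct (
    product_key         INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    product_code        VARCHAR(15)  NOT NULL,           -- natural (business) key
    product_description VARCHAR(100) NOT NULL,           -- verbose descriptive attributes
    product_line        VARCHAR(50)  NOT NULL
);

CREATE TABLE dbo.FactProductSales (
    product_key       INT           NOT NULL REFERENCES dbo.DimProduct(product_key),
    customer_key      INT           NOT NULL,            -- FK to the customer dimension
    date_key          INT           NOT NULL,            -- FK to the date dimension
    order_number      INT           NOT NULL,            -- degenerate dimension
    order_line_number SMALLINT      NOT NULL,            -- degenerate dimension
    quantity_ordered  INT           NOT NULL,            -- numeric, additive measures
    buy_price         DECIMAL(10,2) NOT NULL,            -- cost per unit
    sales_amount      DECIMAL(10,2) NOT NULL,
    -- A subset of the keys is enough to guarantee uniqueness of each row.
    CONSTRAINT PK_FactProductSales PRIMARY KEY (order_number, order_line_number)
);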

5.2.1 Dimensional Model

When we join the fact table together with its corresponding dimension tables, we get what is known as a data mart. This forms a star-like structure and is also referred to as the star join schema [2]. The star schema is based on simplicity and symmetry, and is very easy to understand and navigate. Because data in the dimension tables is highly denormalized and contains meaningful, verbose business descriptors, users can quickly recognize that the dimensional model properly represents their business. Another advantage of a dimensional model is that it is gracefully extensible to accommodate changes [2]; it can easily withstand unexpected changes in user behavior. We can add completely new dimensions to the schema as long as a single value of that dimension is defined for each existing fact row. The model has no built-in bias as to query expectations and no preference for likely business questions: all dimensions are equal and present symmetrical, equal entry points into the fact table. The schema should not have to be adjusted every time users come up with new ways to analyze the business. The key to achieving this lies in the choice of granularity, as the most granular or atomic data has the most dimensionality [2]. According to [2], atomic data that has not been aggregated is the most expressive, and since the fact table incorporates atomic data, it should be able to withstand ad hoc user queries; a must if our warehouse is to be useful and durable. Creating a report should be as simple as dragging and dropping dimensional attributes and facts onto it.
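As an illustration of how such a report translates into a query against the star schema, the sketch below sums sales by country and product line. It reuses the illustrative names from the earlier sketch and assumes a DimCustomer table with a country column.

-- Model sales by country and product line: a typical star-join query.
SELECT
    c.country,
    p.product_line,
    SUM(f.quantity_ordered) AS total_units,
    SUM(f.sales_amount)     AS total_sales
FROM dbo.FactProductSales AS f
JOIN dbo.DimProduct  AS p ON p.product_key  = f.product_key
JOIN dbo.DimCustomer AS c ON c.customer_key = f.customer_key
GROUP BY c.country, p.product_line
ORDER BY total_sales DESC;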

5.2.2 Metadata

Metadata is the encyclopedia of a data warehouse: it contains all the information about the data in the data warehouse. It supports the various activities required to keep the data warehouse functioning, be they technical (information about source systems, source tables, target tables, load times, last successful load, transformations on data, etc.), administrative (indexes, view definitions, security privileges and access rights, ETL run schedules, run-log results, usage statistics, etc.) or business user support (user documentation, business names and definitions, etc.). We build a metadata database, which will serve as the catalogue of the data warehouse.

5.3 Designing the Dimensional Data Store

To produce a good DDS design, we must ensure that it is driven by the functional requirements defined in the previous chapter, because the functional requirements represent the kind of analysis that the business users will want to perform on the data in the warehouse.

Figure 5.1: Key Input to the four-step dimensional design process


5.3.1 STEP 1: Selecting the Business Model

Understanding the business requirements, coupled with analysis of the available data, helps us choose which business process to model. In a normal real-life situation we would choose the area that would have the most immediate impact on business users, as a means of getting them to adopt the system easily. However, we are constrained by the fact that the only data available to us in our source system is sales data, so our business process to model is product sales. We will build a Product-Sales data mart; a data mart is simply a fact table surrounded by its corresponding dimension tables, modelling one business process. It will allow users to answer the questions posed in the functional requirements. The product sales event happens when a customer, through a sales rep, places an order for some of the products. The roles (who, what, where) in this case are the customer, the product, and the store. The measures are the quantity, unit price and value of sales. We will put the measures into the fact table and the roles (plus dates) into the dimension tables. The business events become individual rows in the fact table.

Figure 5.2: Preliminary Sales Data Mart

5.3.2 STEP 2: Declaring the Grain

Declaring the grain means deciding what level of data detail should be available in the dimensional model, the goal being to create a dimensional model for the most atomic information captured by the business process outlined in step 1. Different arguments abound about how low, or how atomic, the grain should be. According to Ralph Kimball, tackling data at its lowest, most atomic grain makes sense on multiple fronts. Atomic data is highly dimensional: the more detailed and atomic the fact measurement, the more things we know for sure. Atomic data provides maximum analytic flexibility because it can be constrained and rolled up in every possible way, and detailed data in a dimensional model is poised and ready for the ad hoc attack by the business users. Selecting a higher-level grain limits the potential to less detailed dimensions and makes the model vulnerable to unexpected user requests to drill down into the details; the same is true if summary or aggregated data is used. In our Classic Cars study, we have chosen an individual line item in the order details transaction table as the most granular data item. In other words, the grain, or one row of the Product-Sales fact table, corresponds to one unit of a model sold (car, truck, motorcycle, etc.). By choosing such a low-level grain, we are not restricting the potential of the data warehouse by anticipating user queries, but ensuring maximum dimensionality and flexibility, because queries need to cut through the details (slicing and dicing) in precise ways, whether comparing sales between particular days or comparing model sales by scale size. While users will probably not want to analyze every single line item of a particular order, providing access only to summarized data would not be able to answer such questions.

5.3.3 STEP 3: Choosing the dimensions

After we have identified what constitutes the business measure of the event we are modeling (product sales), certain fields which describe or qualify the event (the roles) become obvious: product, store, customer and date will form the dimensions. We will also have the order number as a dimension, but because it does not have any other attributes of its own, it will sit in our fact table as a degenerate dimension; it will help to identify products belonging to a particular order. Dimension tables need to be robust and as verbose as possible. Dimensions implement the user interface to a data warehouse, and it is not uncommon to have a dimension table containing 50-100 columns. Unlike fact tables, they are updated infrequently, and updates are usually minor additions like adding a new product or customer, or updating prices.

5.4 Slowly Changing Dimensions

This brings us to the problem of slowly changing dimensions (SCD) and how it is handled. Recall from our definition that a data warehouse stores historical data. What happens, for example, if the value of a dimensional attribute changes, say an office that was overseeing a particular region is reassigned, or a customer changes address? Merely updating the dimension by overwriting the address means that all previous transactions carried out under the old region or address can no longer be isolated; queries have no means to refer to them explicitly, since they now belong to the new region or address, and a fundamental function of our data warehouse, storing historical data, is no longer fulfilled. According to [2], the problem of SCD can be handled by overwriting existing values (type 1), preserving the old attribute values as rows (type 2), or storing them as columns (type 3).

A type 1 response is only suitable if the attribute change is a correction or there is no value in retaining the old description. This is not usually desirable, and it should be up to the business users to determine whether they want to keep the old value or not.

A type 2 response is the most common technique, as it is the most flexible to implement and does not limit the number of times we can reflect a change in a dimension attribute. It involves adding a new dimension row every time an attribute changes: the old value is preserved in the existing row and the new value is recorded in the new row. Using this method we stay true to our definition of a data warehouse keeping historical data, while allowing users to track historical changes and perform analysis constrained on either or both values. Suppose, in our case study, that a particular car model is sold only in Region 1 up until a certain period, and then Classic Cars decides to discontinue its sale there and move it to Region 2. Under a type 1 response, from the moment the attribute is overwritten to reflect Region 2 as the new region, there is no way of analyzing car model X's sales performance prior to when it was moved to Region 2. Furthermore, analysis of the sales figures in Region 2 will incorrectly include car model X's sales from when it was still in Region 1. Using the type 2 approach, when the region changes we add a new dimension row to reflect the change in the region attribute. We will then have two product dimension rows for car model X:

Product Key   Product Description   Region     Product Code
1233          Ferrari Blazer        Region 1   FERR-12
2346          Ferrari Blazer        Region 2   FERR-12

Table 5.1: Type 2 response to SCD


The table above also helps us see why we introduce surrogate keys into our dimension tables instead of using the natural keys. A surrogate key identifies a unique product attribute profile that was true for a span of time [2]. We also do not need to go into the fact table to modify the product keys, and the new dimension row automatically partitions history in the fact table: constraining a query on Region 1 for car model X prior to the change date will only pick up product key 1233, when the model was still in Region 1, while constraining on a date after the change will no longer pick up that product key, because the model now rolls up into Region 2. We also introduce a date stamp column on the dimension row, which helps track newly added rows, and a valid/invalid indicator to show the state of the attributes. Effective and expiration dates are necessary in the staging area because they help determine which surrogate keys are valid when the ETL is loading historical fact records.

A type 3 response uses a technique that requires adding a new column to the dimension table to hold the new attribute value. The advantage it offers is that, unlike the type 2 response, it allows us to associate the new value with old fact history and vice versa [2]. Recall that in the type 2 response the new row had to be assigned a new product key (surrogate key) to guarantee uniqueness, so the only way to connect the two rows was through the product code (natural key). Using a type 3 response, the solution would look like this:

Product Key   Product Description   Region     Prior Region   Product Code
1233          Ferrari Blazer        Region 2   Region 1       FERR-12

Table 5.2: Type 3 response to SCD


A type 3 response is suitable when there is a need to support both the current and the previous view of an attribute value simultaneously. It is quite obvious, though, that adding a new column involves structural changes to the physical design of the underlying dimension table, so it is only preferable if the business users decide that just the last two or three prior attribute values need to be tracked. Its biggest drawback is that it cannot track the impact of intermediate attribute values [2]. There are hybrid methods for solving the problem of SCD which combine features of the above techniques, but while they can offer greater flexibility, they usually introduce more complexity and, according to [2], should be avoided if possible.

We introduce surrogate keys into our dimension tables and use them as the primary keys. This approach is more suitable because, for one, it helps to tackle the problem of SCD. It is also essential for the stage ETL process, especially because we have chosen the type 2 response for dealing with SCDs. Surrogate keys help the ETL process keep track of rows that already exist in the data warehouse and avoid reloading them. Surrogate keys are very easy to automate and assign because they are usually integer values; the last assigned value is stored in metadata and is easily retrieved and incremented on the next run.
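To make the type 2 mechanics concrete, the T-SQL sketch below shows what the ETL could do when a product's region changes. It assumes the dimension carries region, effective_date, expiration_date and is_current columns in addition to the attributes shown above; these names are our own assumptions, not the actual implementation.

-- Type 2 response: expire the current row and insert a new one with a new surrogate key.
-- @product_code and @new_region would be supplied by the ETL for the changed record.
DECLARE @product_code VARCHAR(15) = 'FERR-12',
        @new_region   VARCHAR(30) = 'Region 2',
        @today        DATE        = CAST(GETDATE() AS DATE);

-- 1. Close off the row that is currently valid for this product.
UPDATE dbo.DimProduct
SET expiration_date = @today,
    is_current      = 0
WHERE product_code = @product_code
  AND is_current   = 1;

-- 2. Insert a new row carrying the changed attribute; IDENTITY hands out the next surrogate key.
INSERT INTO dbo.DimProduct (product_code, product_description, region,
                            effective_date, expiration_date, is_current)
SELECT product_code, product_description, @new_region,
       @today, '9999-12-31', 1
FROM dbo.DimProduct
WHERE product_code    = @product_code
  AND expiration_date = @today;   -- the row we just expired supplies the unchanged attributes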

5.5 Data Hierarchy

Dimension tables often also represent hierarchical relationships in the business. Hierarchies help us roll up and drill down to analyze information based on related facts. For example, state rolls up into country and country into region; in the date dimension, days roll up into weeks, weeks into months, and months into periods; products roll up into product lines, and product lines into vendors. Having a hierarchy translates into better query performance and more efficient slicing and dicing through grouping along a path. Users are able, for example, to view a product's performance during a week and later roll it up into a month and further into a quarter or period. All four of our dimensions have hierarchies.
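Rolling up along a hierarchy maps directly onto grouped queries. The sketch below, using the illustrative table and column names assumed earlier (a DimDate table with calendar_year, calendar_quarter and month_name columns), produces month, quarter and year subtotals in a single pass.

-- Roll sales up the date hierarchy: month, quarter and year subtotals in one query.
SELECT
    d.calendar_year,
    d.calendar_quarter,
    d.month_name,
    SUM(f.sales_amount) AS total_sales
FROM dbo.FactProductSales AS f
JOIN dbo.DimDate AS d ON d.date_key = f.date_key
GROUP BY ROLLUP (d.calendar_year, d.calendar_quarter, d.month_name);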

Figure 5.3: The Product dimension hierarchy

Figure 5.4: The Customer dimension hierarchy


Figure 5.5: The Date dimension hierarchy

Figure 5.6: The Office dimension hierarchy with multiple paths.


5.6 The Date Dimension

Every business event takes place on a particular date, so the date dimension is very important to a data warehouse. It is the primary basis of every report, and virtually every data mart is a time series [2]. It is also common to every data mart in a data warehouse and, as a result, must be designed correctly. When modeling the date dimension, care must be taken to fill it with the attributes that every fact table using it will need. Assigning the right columns makes it possible to create reports that, for example, compare sales on a Monday with sales on a Sunday, or compare one month with another. According to [3], the columns or attributes in a date dimension can be categorized into four groups:

- Date formats: the date format columns contain dates in various formats.
- Calendar date attributes: these contain the various elements of a date, such as day, month name, and year.
- Fiscal attributes: these contain elements related to the fiscal calendar, such as fiscal week, fiscal period, and fiscal year.
- Indicator columns: these contain Boolean values used to determine whether a particular date satisfies a certain condition, e.g. a national holiday.
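Because the date dimension is fully predictable, it is usually generated rather than extracted from a source system. The T-SQL sketch below fills a simplified date dimension covering the first three of the four groups plus one indicator column; the column names are our own assumptions and fiscal attributes are omitted for brevity.

-- Generate one row per calendar day between two boundary dates.
DECLARE @d DATE = '2003-01-01', @end DATE = '2008-12-31';

WHILE @d <= @end
BEGIN
    INSERT INTO dbo.DimDate (date_key, full_date, day_name, month_name,
                             calendar_quarter, calendar_year, is_weekend)
    VALUES (CONVERT(INT, CONVERT(VARCHAR(8), @d, 112)),   -- date format column, e.g. 20081219
            @d,
            DATENAME(WEEKDAY, @d),                        -- calendar attributes
            DATENAME(MONTH, @d),
            DATEPART(QUARTER, @d),
            YEAR(@d),
            CASE WHEN DATENAME(WEEKDAY, @d) IN ('Saturday', 'Sunday')
                 THEN 1 ELSE 0 END);                      -- indicator column
    SET @d = DATEADD(DAY, 1, @d);
END;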


Figure 5.7: The Date dimension table.


5.7 The Office Dimension

The office dimension describes every branch office outlet in the business. It is a geographic dimension. Each outlet is a location and so can be rolled up into city, state or country. Each office can easily be rolled up into its corresponding geographic region as well. To accommodate the movement of an office's coverage region, we have introduced the store key as a surrogate key, and this will be used to implement a type 2 SCD response.

Figure 5.8: The Office dimension table.


5.8 The Product Dimension

The product dimension describes the complete portfolio of products sold by the company. We have introduced the product key as the surrogate key; it is mapped to the product code in the source system(s). This helps to integrate product information sourced from different operational systems, it helps to overcome the problem that arises when the company discontinues a product and assigns the same code to a new product, and, as mentioned earlier, it helps with the problem of SCD. Apart from a very few dimension attributes, most attributes stay the same over time. Hierarchies are also very apparent in our product dimension: products roll up into product line, product scale and product vendor, and business users will normally constrain on a product hierarchy attribute. Drilling down simply means adding more row headers, and drilling up is just the opposite. As with all dimension tables, we try to make our attributes as rich and textually verbose as possible, since they will also be used to construct row headers for reports.


Figure 5.9: The product dimension table.


5.9 The Customer Dimension

The customer forms an important part of the product sales event; the customer is actually the initiator of this event. All Classic Cars customers are commercial entities, since they are all resellers, so a single customer name column makes sense, but we do keep contact first name and contact last name columns for correspondence. The customer key is a surrogate key that helps with SCD. Attributes are chosen based on the business users' needs outlined in the functional requirements.

Figure 5.10: The customer dimension table.


5.10 Step 4: Identifying the Facts.

The final step is identifying the facts that will form the columns of the fact table. The facts are dictated by the grain declared in step 2; according to [2], the facts must be true to the grain, which in our case is an individual order line item. The facts available to us are the sales quantity, the buy price per unit and the sales amount, all purely additive across all dimensions. From them we can calculate the gross profit (sales amount minus buy price) on items sold. We can calculate the gross profit of any combination of products sold in any set of stores over any number of days. And in cases where stores sell products at slightly different prices from the recommended retail price, we can also calculate the average selling price for a product in a series of stores or across a period of time. Kimball et al. recommend that such computed facts be stored in the physical database to eliminate the possibility of user error: the cost of a user incorrectly deriving gross profit overwhelms the minor incremental storage cost. We agree with this, since storage cost is no longer the issue it once was. Since the fact table connects to our dimension tables to form a data mart, it must contain attributes that link it with the dimension tables, in other words, attributes that enforce referential integrity. All the surrogate keys in the dimension tables are present in the fact table as foreign keys, and a combination of these keys helps us define a primary key for our fact table to guarantee uniqueness. Our fact table also contains two degenerate dimensions, namely the order number and the order line number.
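Following this recommendation, the derived gross profit can be stored physically, for example as a persisted computed column on the illustrative fact table sketched earlier (all names are assumptions, not the actual implementation):

-- Store gross profit physically so every query derives it the same way.
ALTER TABLE dbo.FactProductSales
    ADD gross_profit AS (sales_amount - (quantity_ordered * buy_price)) PERSISTED;

-- Being additive, it can then be summed across any combination of dimensions.
SELECT c.country, SUM(f.gross_profit) AS total_profit
FROM dbo.FactProductSales AS f
JOIN dbo.DimCustomer AS c ON c.customer_key = f.customer_key
GROUP BY c.country;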


Figure 5.11: The Product Sales Data mart.


5.11 Source System Mapping.

After completing the DDS design, the next step is to map the source system columns to the DDS columns. This helps the ETL process know, during the extraction phase, which columns to extract from and which target columns to populate. Since the fact table columns comprise attributes from different tables, which in turn could come from different source systems, we need a source system code to identify the source system each record comes from, so that the ETL can map to the column in whichever system it resides. The only requirement is that the source system code and its mappings be stored in the metadata database. At this stage we also consider the transformations and calculations to be performed by the ETL logic during extraction; but because our source system database is rather simple and straightforward, we will not be performing any.
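A minimal sketch of how such a source-to-target mapping could be recorded in the metadata database follows; the table layout and names are our own illustration rather than the mapping tables actually used.

-- One row per target column, recording where it comes from and how it is derived.
CREATE TABLE dbo.data_mapping (
    mapping_id         INT IDENTITY(1,1) PRIMARY KEY,
    source_system_code TINYINT      NOT NULL,   -- identifies the originating source system
    source_table       VARCHAR(128) NOT NULL,
    source_column      VARCHAR(128) NOT NULL,
    target_table       VARCHAR(128) NOT NULL,
    target_column      VARCHAR(128) NOT NULL,
    transformation     VARCHAR(255) NULL        -- e.g. 'trim', 'upper case', or NULL for a straight copy
);

-- Example row: the source system's product code feeds the product dimension's natural key.
INSERT INTO dbo.data_mapping
    (source_system_code, source_table, source_column, target_table, target_column, transformation)
VALUES (1, 'products', 'productCode', 'DimProduct', 'product_code', NULL);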


5.12 Summary

In this chapter we went in depth into data modeling and what guides the modeling process, and then designed our DDS. We started by defining some data modeling terms and used the Kimball four-step approach in our DDS construction process. We also looked at how columns from the source system are mapped to the columns in the DDS. In the next chapter, we will look at the physical elements of our data warehouse.


Chapter 6 The Physical Database Design


6.1 The Physical Database Design
In this chapter we look at the physical structure of our data warehouse and its supporting technology. We show how we implement our DDS and data warehouse structure using Microsoft SQL Server. We will not discuss the hardware structure or requirements, as these are beyond our defined scope for this project. In a normal business environment the source system, the ETL server and the DDS would ideally run on separate machines, all the more so because the source system is an OLTP system and we must not interfere with its smooth running. For the purpose of implementation, we needed a way to simulate multiple systems on a single computer. Our solution is to represent each element as a separate database running on a single SQL Server installation on one computer. The result is an environment where each database behaves like an individual system: using MSSIS we can connect and move data between the different elements of the data warehouse through OLE DB, just as we would if the databases resided on separate systems.
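In practice this simulation amounts to nothing more than creating one database per element on the same instance. The database names below are our own choices, not necessarily the ones used in the project.

-- One database per data warehouse element, all on the same SQL Server instance.
CREATE DATABASE ClassicCarsSource;   -- simulated OLTP source system
CREATE DATABASE ClassicCarsStage;    -- staging area
CREATE DATABASE ClassicCarsDDS;      -- dimensional data store
CREATE DATABASE ClassicCarsMeta;     -- metadata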

6.2 The source system database.

This simulates our source system: a database of transactions pulled from the order management system of the Classic Cars company. It is an OLTP system and records the day-to-day transactions of receiving and dispatching orders, as well as inventories and all the supporting data. In our case it is a single system, but as Classic Cars has stores in various regions, it would ideally be OLTP data from the various stores. As data warehouse designers and implementers we do not create the source systems, but they are the first place we start our feasibility study for the functional requirements of the data warehouse. Careful thought must be put into giving the OLTP source system as little interference as possible from the other elements of the data warehouse. According to Kimball and Ross, a well-designed data warehouse can help relieve OLTP systems of the responsibility of storing historical data.

6.3 The Staging area database.

In trying to conform to the last sentence of the above paragraph, it is very essential that we have a staging area. A data warehouse diers from an OLTP system in that, the data in a data warehouse is accurate up until the last time it was updated. A data warehouse does not contain live data and is not updated in real time. Updating the data in a data warehouse might mean uploading hundreds of megabytes to tens of gigabytes of data from OLTP systems on a daily basis. OLTP systems are not designed to be tolerant to this kind of extraction. So in order to avoid slowing the source systems down, we create a stage database from where our stage ETL will connect to the source system at a predened time of the day (usually at a time of low transaction trac) extract the data, dump it into the stage database and immediately disconnect from the source system database. The internal structure of the staging database is basically the same as that of the source system, except that the tables have been stripped of all constraints and indexes. We have added the columns: source system code and date of record creation as a means of identifying the originating source of the data and the date of extraction as a bookmark. These are for auditing and ETL purposes. That way the ETL can avoid reloading the same data on the next load. The stage ETL performs all the necessary transformations on the extracted data in this area and then loads them into the dimension and fact tables of the DDS. 45

The stage database area is akin to a workshop in that it is not accessible to user queries. It is just an intermediate place that data warehouse data pass through on their way to the DDS.
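As a minimal sketch, and with illustrative column names for the audit fields, a staging copy of the Customers table might be declared as follows. Note the absence of keys, constraints and indexes, and the two extra columns used for auditing.

-- Staging copy of the source Customers table: no constraints or indexes,
-- plus two audit columns added for ETL bookkeeping.
CREATE TABLE stage_Customers (
    customerNumber         INT,
    customerName           VARCHAR(50),
    country                VARCHAR(50),
    salesRepEmployeeNumber INT,
    created                DATETIME,
    lastUpdated            DATETIME,
    source_system_code     INT,       -- which source system the row came from
    record_created         DATETIME   -- when the row was extracted into the stage
);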

6.4

The DDS database.

The DDS database houses the Classic Cars DDS that contains our dimension and fact tables. Our data mart contains four dimensions and one fact table, but in the real world a DDS could house tens of data marts, and we would recommend giving it a standalone system of its own. This is our data presentation area and will be accessed by various report writers, analytic applications, data mining and other data access tools. We aim to design a DDS that is unbiased and transparent to the accessing application or tool. This way, users are not tied to any particular tool for querying or analysis purposes. Because of referential integrity, it is important to create the dimensions before the fact tables.
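The ordering constraint can be seen in a simplified sketch of the fact table definition below; the table and column names are illustrative, not our exact schema. The CREATE TABLE statement only succeeds if the four dimension tables already exist, because every dimension key is declared as a foreign key.

-- Simplified fact table: the dimension tables must be created first.
CREATE TABLE factOrders (
    orderNumber     INT,  -- order number kept on the fact row as a degenerate dimension
    productDimKey   INT NOT NULL REFERENCES dimProduct(productDimKey),
    customerDimKey  INT NOT NULL REFERENCES dimCustomer(customerDimKey),
    officeDimKey    INT NOT NULL REFERENCES dimOffice(officeDimKey),
    dateDimKey      INT NOT NULL REFERENCES dimDate(dateDimKey),
    quantityOrdered INT,
    priceEach       MONEY
);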

6.5

The Metadata database.

The metadata database maintains all the information in a data warehouse that is not actual data itself; it is data about data. Kimball likens it to the encyclopedia of the data warehouse. Under normal circumstances, it would be filled with information about everything that is done in the warehouse and how it is done, and it would support all user groups, from technical to administrative to business users. Our metadata database is a stripped-down version whose primary purpose is to support our ETL processes. We store information about source system and column mappings and about ETL scheduling. The date and time of the last successful and unsuccessful loads are recorded, as are the last increments of the surrogate keys. The metadata is the starting point of every ETL process.
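A stripped-down sketch of the ETL control table, with assumed column names, could look like this:

-- One row per source table: where the last extraction stopped (LSET)
-- and when the current extraction started (CET).
CREATE TABLE data_flow (
    name    VARCHAR(50) PRIMARY KEY,  -- source table name, e.g. 'Customers'
    LSET    DATETIME,                 -- last successful extraction time
    CET     DATETIME,                 -- current extraction time
    status  VARCHAR(20)               -- outcome of the last run
);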


6.6

Views.

A view is a database object akin to a table with rows and columns, but it is not physically stored on disk. It is a virtual table formed by using a join to select subsets of the rows and columns of one or more tables. We created a view in order to be able to link a sale to a particular store. This was necessary because the store table does not connect to the orders transaction table, and the only way to deduce in which store a transaction took place is through the employee who made the sale. To extract this information, we join the order transaction table to the employees table through the salesRepEmployee number, and from that we can retrieve the store ID.
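A sketch of such a view is given below. The view name is our own, and the join path through the Customers table is our reading of the source schema as shown later in Figure 7.10.

-- Links every order to the office (store) it was sold from,
-- going through the customer's sales representative.
CREATE VIEW vOrderOffice AS
SELECT o.orderNumber,
       e.officeCode
FROM   Orders o
       JOIN Customers c ON c.customerNumber = o.customerNumber
       JOIN Employees e ON e.employeeNumber = c.salesRepEmployeeNumber;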


6.7

Summary

This chapter looked at the physical components of our data warehouse. We explained how we were able to simulate the various elements and environment of a data warehouse on a single system. We have built our databases and can now look forward to the next phase in our implementation: populating the data warehouse. In the next chapter, we will look at how to move data from our source system into the DDS.


Chapter 7 Populating the Data Warehouse


7.1
In this chapter we look at how we move data from our source system into the data warehouse. Populating our data warehouse is done in two steps: first the data is moved from our source database to the staging database, where the necessary transformations are applied; thereafter the data is transferred to the DDS. While transferring the data from the staging database to the DDS we need to denormalize it first, a necessary step in preparing it for the DDS. To achieve this we have implemented two ETL processes: the Stage ETL, which connects to the source system, moves the data to the stage database and disconnects from the source system; and the DDS ETL, which denormalizes the data and then loads it into the DDS. Both steps are illustrated in the figure below.


Figure 7.1: Data flow through the data warehouse showing the ETL processes.


7.2

Populating the Stage database

As we mentioned in an earlier chapter, our decision to include the stage database in our data warehouse architecture is primarily to reduce the amount of time during which our ETL is connected to the source system. In order to minimize this burden on the source database, we have chosen to implement the incremental extract method in the Stage ETL. Using this approach, only the initial load of the ETL requires that all the data in the source system be moved into the stage database. Thereafter, at regular intervals, usually once a day and normally at a time when the OLTP system is handling fewer transactions, the ETL connects, picks up only the records that are new or have been updated since its last connection to the source system, and loads them into the data warehouse, hence the name incremental extract method. To enable the ETL to recognize and extract the data incrementally, we have added the created and lastUpdated timestamp columns to each table in our source database. Below is an extract from the Customer table:

Figure 7.2: Sample customer table showing the created and the lastUpdated columns.
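The two timestamp columns can be added to a source table with a statement along these lines (shown here for the Customers table):

-- Audit columns that make incremental extraction possible.
ALTER TABLE Customers ADD created DATETIME, lastUpdated DATETIME;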


We use the metadata database to store the last successful extraction time (LSET) and the current extraction time (CET) for each table in the source system. This mechanism helps the ETL process figure out where to begin the next run of the incremental extraction, and it also helps with error recovery if there is a failure during an extract and the process does not complete [3].

Figure 7.3: Snapshot from the Metadata data flow table showing the LSET and the CET.


From Figure 7.3 we can see that the last ETL run successfully loaded all the records from the source database into the stage up to the 11th of November 2008 (the LSET). Therefore, in the next extraction session the ETL process only needs to load those records which were created or updated after the last successful extraction time. As an example, let us assume that we run our ETL process on the 11th of December 2008 (the CET). Of the three customers shown in the figure above, only one will be transferred to our staging area, namely Atelier graphique, because this record was last updated on 2008/12/08, which is after our last successful extraction time for the Customers table (2008/11/11). In order to pick up all the new or updated records from the source database, we must first save the current time as the CET in our metadata data flow table. Next we need to get the LSET for the table in question, for example the Customers table. This is achieved with a simple SQL query:

SELECT LSET FROM metadata.data_flow WHERE name = 'Customers'

The query returns the LSET for the Customers table. Armed with these two parameters, we can then extract the new or updated customers from our source database with the following query:

SELECT * FROM Customers
WHERE (created > LSET AND created <= CET)
   OR (lastUpdated > LSET AND lastUpdated <= CET)

Picking logically correct data also requires that a record's created or lastUpdated timestamp not be greater than the current extraction time, which is why both conditions are bounded by the CET. For even better performance of our stage ETL process, our staging database tables do away with constraints and indexes. While transferring the data to the staging area we do not want our ETL process to be bogged down by unnecessary checks for constraint violations. We use constraints to ensure data quality only when inserting new data or updating old data in our source systems. This way we are sure that the data that comes to our staging area is correct and there is no need to double-check it again.

7.3

Data Mappings

Another important step in the design of the ETL process is data mapping. It is common to find that the columns that make up a table in the DDS are derived from multiple tables in the source systems. Mapping them helps the ETL process populate the tables with their rightful columns. Columns are mapped to tables in the stage area, and mappings are also done between the stage area and the DDS tables. Below is the data mapping of our source-to-stage ETL process:


Figure 7.4: Column mappings in the ETL process for source system to staging database.

7.4

Control Flow

The next figure shows the control flow diagram of our ETL process. It is based on the incremental extraction algorithm that we described earlier in this chapter. The whole procedure is divided into four main steps:
1. Set the current extraction time (CET = current time).
2. Get the last successful extraction time (LSET) of the particular table.
3. Pick the records that were created or updated during the time interval T, where LSET < T <= CET, and load them into the stage database.
4. Set the last successful extraction time (LSET = CET).
This ETL process extracts and loads the data for each table in parallel. If an error occurs while executing one of those steps, then the operation is halted and marked as failed. However, the individual tables are not dependent on each other: if an error occurs while loading the Customers table, then only the procedure that works with this table is marked as failed, while the others continue as usual. Using this approach we are sure that step 4 is executed if and only if there were no errors during the whole procedure, i.e. the extraction was successful. A sketch of these four steps expressed in T-SQL is shown below.
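Expressed in T-SQL rather than as an SSIS control flow, the four steps for a single table could look roughly like the sketch below. The metadata table and the staging table are the illustrative ones introduced earlier, and everything is kept in one database purely for readability; in the real setup the tables live in separate databases.

-- Step 1: record the current extraction time.
UPDATE data_flow SET CET = GETDATE() WHERE name = 'Customers';

-- Step 2: read the extraction window for this table.
DECLARE @LSET DATETIME, @CET DATETIME;
SELECT @LSET = LSET, @CET = CET FROM data_flow WHERE name = 'Customers';

-- Step 3: copy the new or updated rows into the staging table.
INSERT INTO stage_Customers (customerNumber, customerName, country,
                             salesRepEmployeeNumber, created, lastUpdated)
SELECT customerNumber, customerName, country,
       salesRepEmployeeNumber, created, lastUpdated
FROM   Customers
WHERE (created     > @LSET AND created     <= @CET)
   OR (lastUpdated > @LSET AND lastUpdated <= @CET);

-- Step 4: only reached if the previous steps succeeded; move the watermark forward.
UPDATE data_flow SET LSET = CET WHERE name = 'Customers';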


Figure 7.5: ETL process for populating the staging database.


7.5

Moving Data to the DDS

Now that we have the data in our staging database, it is time to apply any transformations that are needed and move the data to the DDS. This is the task our second ETL process is responsible for. One of the first things to do when moving data to the DDS is to populate the dimension tables before the fact table. That is because our fact table references every dimension table through a foreign key; trying to populate the fact table first would violate its referential integrity. Having successfully populated the dimension tables, we can safely load the fact table. If no errors occurred and all the data has been successfully transferred to the DDS, we clear the staging database: since the data is now in the data warehouse, we no longer need to keep a copy of it in the stage database. Our DDS contains four dimensions: Customer, Product, Office and Time. This ETL process only populates three of them, because the Time dimension is populated only once, when our DDS is created. This dimension rarely changes, except when an organization needs to redefine its fiscal calendar or update a holiday, so we do not need to update it very often. While designing the control flow architecture of this ETL process in Business Intelligence Development Studio, we placed every dimension-populating procedure in a sequence container and made it the starting point of our ETL process. This loads the dimension tables first and separates this task from loading the fact table. If an error occurs in one of the procedures in the sequence container, all further execution is halted and the entire ETL process results in a failure. The figure below illustrates the control flow diagram from Business Intelligence Development Studio.


Figure 7.6: ETL process for populating the DDS.


7.6

Populating the Dimension tables

While designing this ETL process, one of the most important requirements was to incorporate slowly changing dimensions into our data warehouse. This feature makes the loading in this ETL process less straightforward than it was when populating the staging database. For this project we are implementing Type 1 and Type 2 SCD. Slowly changing dimensions are used only when populating dimension tables, the reason being that dimension records are updated more frequently than fact table records. Implementing SCD using Microsoft Business Intelligence Development Studio is a rather easy task: for that purpose we use the Slowly Changing Dimension data flow transformation, as seen in the next figure. The same data flow architecture is used to populate each dimension in the Populate Dimensions sequence container.


Figure 7.7: Data flow architecture for populating the Customer Dimension.

While populating the Customer dimension we can select which columns correspond to an SCD Type 1 or Type 2 response. The following figure illustrates the SCD response type for those columns in the Customer dimension.

Figure 7.8: Handling SCD in the Customer Dimension.


The Changing attribute represents a Type 1 SCD response: new values overwrite existing values and no history is preserved. The Historical attribute corresponds to a Type 2 SCD response: changes in these column values are saved in new record rows, and previous values are kept in records marked as outdated. To show a record's current status, we use the currentStatus column with the possible values Active, meaning the record is up to date, and Expired, meaning the record is outdated. When data passes through the SCD data flow transformation, it can go to three different outputs:
1. If we receive an updated value of a changing attribute, the data is directed to the Changing Attribute Updates Output and the old value in the DDS is overwritten with the new one. We do not need to insert anything into the DDS in this case.
2. If a historical attribute is updated, the data is directed to the Historical Attribute Inserts Output. Since we want to keep a history of the previous value of this attribute, we do not update the record but instead create a new Customer record with the new data and mark it as Active. The old record is then marked as Expired. Both of those records reside in the DDS.
3. Data is redirected to the New Output when we receive a new record that is not currently in our DDS. Those records are by default marked as Active and inserted into the DDS.
The equivalent logic is sketched in plain SQL below.
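The SSIS transformation does this work for us, but the underlying logic can be sketched in plain T-SQL. The sketch below assumes, purely for illustration, that phone is handled as a Type 1 (changing) attribute and address as a Type 2 (historical) attribute of the Customer dimension, and that the staging table also carries those columns; the names are the illustrative ones used earlier.

-- Type 1 (changing attribute): overwrite in place, no history kept.
UPDATE d
SET    d.phone = s.phone
FROM   dimCustomer d
       JOIN stage_Customers s ON s.customerNumber = d.customerNumber
WHERE  d.currentStatus = 'Active' AND d.phone <> s.phone;

-- Type 2 (historical attribute): expire the old row ...
UPDATE d
SET    d.currentStatus = 'Expired'
FROM   dimCustomer d
       JOIN stage_Customers s ON s.customerNumber = d.customerNumber
WHERE  d.currentStatus = 'Active' AND d.address <> s.address;

-- ... and insert a fresh Active row; this also covers brand new customers.
INSERT INTO dimCustomer (customerNumber, customerName, address, phone, currentStatus)
SELECT s.customerNumber, s.customerName, s.address, s.phone, 'Active'
FROM   stage_Customers s
WHERE  NOT EXISTS (SELECT 1 FROM dimCustomer d
                   WHERE d.customerNumber = s.customerNumber
                     AND d.currentStatus = 'Active');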


7.7

Populating the Fact table

The next step after populating the dimension tables is to load data into the fact table. Remember that the dimensions are connected to the fact table through surrogate keys. This means that we cannot simply load the fact table with the data we get from the source database: first, we need to find each record's matching surrogate key in the dimension tables, and only then can we link those tables together through foreign keys. What distinguishes this procedure from the ones mentioned earlier is that here the data flow architecture has two data sources. This is because our fact table is composed of columns from two different source tables, namely the Orders and OrderDetails tables, which is not uncommon in data warehousing. So when populating the fact table we need to join those two tables first. We do that with the help of the Merge Join data flow transformation, as shown below.

Figure 7.9: Joining the Orders and OrderDetails tables with a Merge Join.


Before joining the two data sets, it is required that they be sorted first. Here, Sort 1 is the sorted OrderDetails table and Sort is the sorted Orders table; the orderNumber column naturally forms the join key. After joining the two tables, we have all the data we need to form our fact table. The only thing left is to find the surrogate keys that join every fact table record with each of the dimensions. Since our fact table is connected to four dimensions, every fact record must contain four of those keys, very much like a composite key. At this moment our fact table can join directly with the following dimensions:
the Product dimension on productCode;
the Customer dimension on customerNumber;
the Date dimension on orderDate.
At this point, we do not have a column that joins directly to the Office dimension. This is because neither the Orders nor the OrderDetails table has a direct link with the Office table in the source database. To work around this, we created a view that joins every order to the office where the order was made through the sales rep number: the order record contains the employee number of the person who handled the sale, so we use that employee number as a handle to get the office code and, voila, problem solved.


Figure 7.10: Joining the Orders, customers, employees and office tables.


Using the data now in our fact table, we can successfully join with each dimension using a business key. However, we need to be using surrogate keys for our star join. Every dimension carries its business key (although it is not used as the primary key), so we can use it to look up the surrogate key with the Lookup data flow transformation. The Lookup transformation joins the fact table with one of the dimensions using the business key, retrieves the surrogate key and adds it to the fact table. The figure below shows the Lookup data flow transformation for getting the Customer dimension surrogate key.

Figure 7.11: Retrieving the Customer dimension surrogate key.


The business key in this example is customerNumber. Using this column we join our fact table with the Customer dimension and retrieve its surrogate key, customerDimKey, which we insert into our fact table as a new column. Using the same approach, we get the remaining three surrogate keys. After getting the surrogate keys for all four dimensions we can finally insert our first record into the fact table. The only thing left to do is to get rid of the business keys that are currently stored in the data set that is going to be inserted into our fact table. This is easily solved using data mapping: we simply leave the business keys out of the mappings, because we do not want to include them in the fact table now that we have all the surrogate keys.

Figure 7.12: Data mappings for the Fact table.
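Expressed as one SQL statement instead of a chain of Lookup transformations and mappings, the whole fact load would look roughly like the sketch below; table and column names are the illustrative ones used in the earlier sketches, and the office key is resolved through the view described in chapter 6.

-- Resolve every business key to its surrogate key and load the fact table.
INSERT INTO factOrders (orderNumber, productDimKey, customerDimKey,
                        officeDimKey, dateDimKey, quantityOrdered, priceEach)
SELECT o.orderNumber,
       p.productDimKey,
       c.customerDimKey,
       ofc.officeDimKey,
       d.dateDimKey,
       od.quantityOrdered,
       od.priceEach
FROM   stage_Orders o
       JOIN stage_OrderDetails od ON od.orderNumber  = o.orderNumber
       JOIN dimProduct   p   ON p.productCode    = od.productCode
       JOIN dimCustomer  c   ON c.customerNumber = o.customerNumber
                            AND c.currentStatus  = 'Active'
       JOIN dimDate      d   ON d.fullDate       = o.orderDate
       JOIN vOrderOffice v   ON v.orderNumber    = o.orderNumber
       JOIN dimOffice    ofc ON ofc.officeCode   = v.officeCode;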


7.8

Preparing for the next upload

After populating the fact table with data from the staging database, the data warehouse population task is complete. The last step is to clear the staging database in preparation for the next scheduled data extraction, as sketched below. The complete data flow architecture for populating the fact table is depicted in Figure 7.13.
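A sketch of this step, assuming the staging table names used earlier, is simply:

-- Empty the staging tables so the next incremental extract starts clean.
TRUNCATE TABLE stage_Orders;
TRUNCATE TABLE stage_OrderDetails;
TRUNCATE TABLE stage_Customers;
TRUNCATE TABLE stage_Products;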


Figure 7.13: Data flow diagram for populating the Fact table.


7.9

Scheduling the ETL

To keep our data warehouse up to date we need to run our ETL processes regularly. Using incremental extraction, the best option is to run them on a daily basis, usually at a time when business transaction volume is low or at the end of a work day, although the latter argument carries less weight in these days of online shopping, where as one part of the world is shutting down, another is resuming work. The general idea is that we do not want to interfere with the smooth running of the OLTP source systems, so it is purely a business decision when an organization would like the ETL process that connects to the source systems to run. On the technical side, to schedule the ETL and execute all of our previously built SSIS packages, we need to create a new SQL Server Agent job. It is a multi-step job with one step per ETL process, as Figure 7.14 illustrates.

Figure 7.14: Creating an SQL Agent Job.


The whole job is atomic. First it populates the staging database; if this fails at any point, the job does not continue to step 2, for the obvious reason that it makes no sense to try to load data from the stage database into the DDS if no data was transferred to the staging area. Instead, we simply quit the job and report failure. If step 1 succeeds, the job moves on to step 2, and if both steps succeed the whole job is marked as a successful execution of our ETL processes. There are also options for notifying the administrator in case of a failure; these notifications can be sent by e-mail or network message, or written to the metadata or the Windows Application event log. So as not to interfere with the smooth running of the source system, we have scheduled our ETL processes to run at 3:00 AM.

Figure 7.15: Scheduling the ETL.
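The same job can also be defined through the msdb stored procedures rather than through the Management Studio dialogs. The sketch below is only indicative: the job, step and package names are placeholders, and the exact dtexec arguments in @command depend on where the SSIS packages are stored.

-- Create the job, its two steps and a daily 3:00 AM schedule.
EXEC msdb.dbo.sp_add_job        @job_name = N'Populate Data Warehouse';

EXEC msdb.dbo.sp_add_jobstep    @job_name  = N'Populate Data Warehouse',
                                @step_name = N'Step 1: Stage ETL',
                                @subsystem = N'SSIS',
                                @command   = N'/FILE "C:\ETL\StageETL.dtsx"',
                                @on_success_action = 3;  -- go to the next step

EXEC msdb.dbo.sp_add_jobstep    @job_name  = N'Populate Data Warehouse',
                                @step_name = N'Step 2: DDS ETL',
                                @subsystem = N'SSIS',
                                @command   = N'/FILE "C:\ETL\DDSETL.dtsx"';

EXEC msdb.dbo.sp_add_jobschedule @job_name = N'Populate Data Warehouse',
                                 @name = N'Nightly at 3 AM',
                                 @freq_type = 4, @freq_interval = 1,  -- daily
                                 @active_start_time = 30000;          -- 03:00:00

EXEC msdb.dbo.sp_add_jobserver   @job_name = N'Populate Data Warehouse';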


At this point we have an up-to-date data warehouse. Our ETL processes are built. The data flows into the DDS regularly. The data warehouse is ready and fully functional.


7.10

Summary

In this chapter, we explained our implementation in detail with illustrated diagrams. This was the part that took the longest to complete, and completing it was a milestone for us. In the next chapter we will look at reporting from our completed data warehouse.


Chapter 8 Building Reports

Now that our data warehouse is up and running, it is ready to be tested. We have chosen to build some sample reports as a means of seeing our data warehouse in action, but our main goal is not to limit the data warehouse to certain pre-built reports. We believe a data warehouse should not be biased towards any particular reporting or analysis tool and should be flexible enough to handle users slicing and dicing in whatever manner they choose; to us, this represents how successful our planning and design process was. One of the most common ways to study data in the data warehouse is by building reports. Using this approach, the data is gathered from the DDS and presented to the user in a convenient way: instead of plain data fields from the DDS, the user can use charts, diagrams and other representations of the data. This makes it much easier to inspect and analyze the data that resides in the DDS. For creating our reports we are using SQL Server 2008 Reporting Services. With these services it is possible to deploy the reports so that they can be accessed via a web site; we are not using this feature because we do not have access to an HTTP server with SQL Server 2008 Enterprise edition installed.


Selecting the Report fields


The first thing to do when building a report is to decide what tables and columns to include. This of course depends on what kind of information we expect from the report. If we want a report that shows the sales of a particular product in different countries, then we should build a SQL query that extracts all the sales and groups the result by product and country. Before continuing with the reports we need to create another view: one that relates every order line item of a particular product to a value indicating the profit from that sale. To create this view we have to collect data from two tables, the fact table and the product dimension. To calculate the profit for one particular order line in the fact table we use the expression:

Profit = (selling price of one unit - buying price of one unit) * quantity of units sold

We then cast this value as Money and insert this column into the view, so the view contains three columns: orderNumber, profit and productDimKey. Why not use a nested query, one might ask? We are forced to take this approach because grouping by nested queries is not allowed when creating reports in SQL Server 2008.


Figure 8.1: Creating a view to relate each order line with the profit it made.
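A sketch of what this view might look like is given below. It assumes that the fact table carries the order number and the unit selling price, and that the product dimension carries the buying price; the names are illustrative.

-- Profit for each order line, linked to the product dimension.
CREATE VIEW vOrderProfit AS
SELECT f.orderNumber,
       CAST((f.priceEach - p.buyPrice) * f.quantityOrdered AS MONEY) AS profit,
       f.productDimKey
FROM   factOrders f
       JOIN dimProduct p ON p.productDimKey = f.productDimKey;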


Now that we have this view, we can build some reports to test our data warehouse. The first report shows product sales by country over a period of time; the time period is given by the year and quarter columns from the date dimension, and the two other important columns are profit and country. Figure 8.2 shows how the query groups the data by year, quarter and country and then sums up the profit. The result of this operation is exactly what we need to analyze the company's sales in different countries over time.

Figure 8.2: A query used to provide data to the Sales by country report.
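One way the query behind this report could be written is sketched below, again with illustrative names; it assumes that an order number together with a product key identifies exactly one order line, so the join back to the fact table does not duplicate rows.

-- Total profit per country, per quarter, per year.
SELECT d.year,
       d.quarter,
       c.country,
       SUM(v.profit) AS totalProfit
FROM   factOrders f
       JOIN vOrderProfit v ON v.orderNumber   = f.orderNumber
                          AND v.productDimKey = f.productDimKey
       JOIN dimCustomer  c ON c.customerDimKey = f.customerDimKey
       JOIN dimDate      d ON d.dateDimKey     = f.dateDimKey
GROUP BY d.year, d.quarter, c.country
ORDER BY d.year, d.quarter, c.country;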


The final step is to wrap this data into a nice and readable format, i.e. build a report or paint a picture. Once we have the data we need, building a report is a very straightforward process. In this report we are using a matrix-style layout. This basically means that we have three dimensions that we can use: columns, rows and details:

The year and quarter from the time dimension are represented as columns.
Countries are displayed as rows.
Profit is shown as details.
The matrix design template is depicted in Figure 8.3.

Figure 8.3: Designing the report matrix.


The main group in the time dimension is the year field, which contains the quarter field as a child group. This groups the data hierarchically, where every year column is a parent of four quarter columns. There is also an additional row, Total per quarter, that shows the total profit made during each quarter. A picture, they say, speaks a thousand words, so to present the data in a more easily readable format we also included a chart in this report. Notice that having three dimensions (time, profit and country) in the matrix, we also need a three-dimensional chart. A simple column or pie chart would be unsuitable because those only present two-dimensional data, so we use a stacked cylinder chart instead. To achieve the third dimension, the cylinder is split into parts that in our case represent different countries.


Figure 8.4: The Sales by country report.


Including charts with reports is a very helpful practice. When we have a big matrix and large numbers, it can be hard to analyze the data by just looking at the plain digits. For example, a quick look at the chart in the previous report makes it easy to notice that the two most profitable countries in our case are France and the USA. The following two reports were built while testing our data warehouse, to demonstrate its flexibility, using the same methods as the previous one. The report in Figure 8.5, Sales by model type, gives an overview of the sales grouped by model type (cars, planes, ships, etc.), and the report in Figure 8.6, Sales by manufacturer, shows the performance of the manufacturers over time.


Figure 8.5: Sales by model type report.


Figure 8.6: Sales by manufacturer report.


Summary
In this chapter, we demonstrated creating reports from our data warehouse. While we only created three reports, any number of reports can be generated according to the users' needs. It is also possible to connect third-party reporting and analysis tools to the data warehouse and slice and dice through the data as required. This flexibility rests on the solid design principles adopted during the implementation of the data warehouse.


Bibliography
[1] W. H. Inmon, Building the Data Warehouse, 3rd Edition. Wiley and Sons, Inc., 2002.
[2] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling, 2nd Edition. Wiley and Sons, Inc., 2002.
[3] Vincent Rainardi, Building a Data Warehouse. Apress, CA, 2008.
[4] Paulraj Ponniah, Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Wiley and Sons, Inc., 2001.
[5] C. Imhoff, N. Galemmo and J. G. Geiger, Mastering Data Warehouse Design. Wiley and Sons, Indiana, 2003.

