
Course: 7023T - Advanced Database Systems
Topic: BASIC PROCESSES OF THE DATA WAREHOUSE
Session 5
Course Objective
Understand that the data staging process is the iceberg of the data warehouse project.
Understand the ten-step plan for creating the data staging application for a data mart.
Understand additional administrative issues.
Understand hand-coded staging systems and data staging tools.
Understand the functions of the data staging team, the data modelers, the DBAs, and the project manager.

Data Staging Overview
Plan:
1. Create a very high-level, one-page schematic of the
source-to-target flow.
2. Test, choose, and implement a data staging tool.
3. Drill down by target table, graphically sketching any
complex data restructuring or transformations.
Graphically illustrate the surrogate-key generation
process. Develop preliminary job sequencing.

Data Staging Overview (cont'd)

Dimension loads:
4. Build and test a static dimension table load. The
primary goal of this step is to work out the infrastructure
kinks, including connectivity, file transfer, and security
problems.
5. Build and test the slowly changing process for one
dimension.
6. Build and test remaining dimension loads.

Data Staging Overview (cont'd)

Fact tables and automation:


7. Build and test the historical fact table loads (base
tables only), including surrogate key lookup and
substitution.
8. Build and test the incremental load process.
9. Build and test aggregate table loads and/or MOLAP
loads.
10. Design, build, and test the staging application
automation.

Do the Preliminary Work
Before you begin the data staging application:
Design the set of fact tables (complete the logical design).
Draft your high-level architecture plan.
Complete the source-to-target mapping for all data elements.
Gather all the relevant information.
Test some of the key alternatives: which tools are available for each step, and how effective are they?
Set up the development environment, including directories and naming conventions.
Importance of Good System Development Practices
Set up a header format and comment fields for your code.
Hold structured design reviews early enough to allow changes.
Write clean, well-commented code.
Stick to the naming standards.
Use the code library and management system.
Test everything: both unit testing and system testing.
Document everything, hopefully in the information catalog.
Plan Effectively
Step 1. High-Level Plan: Start the design process with a very simple schematic of the pieces of the plan that you know: the sources and targets.

Plan Effectively

Basic high-level data staging plan schematic.



Step 2. Data Staging Tools
The extracts are typically written in source system code or produced by source system code generators.
Newer tools have a different architecture from the first generation: they operate as transformation engines.

Step 3. Detailed Plan
Drill down on each of the flows.
Start planning which tables to work on, in which order, and how to sequence the transformations within each data set.
Graphically diagram the complex restructurings.
It often makes more sense to structure the diagrams around the source tables instead of the target tables. The schematic is backed up with a few pages of pseudocode detailing complex transformations.
First pass on the detailed plan schematic for the main utility fact table.

First draft of the historical load schematic for the fact table.

Step 3. Detailed Plan (cont'd)
Organizing the Data Staging Area
The data staging area is the data warehouse
workbench. It is the place where raw data is
loaded, cleaned, combined, archived, and quickly
exported to one or more presentation server
platforms.


Dimension Table Staging: Step 4. Populate a Simple Dimension Table
Static Dimension Extract: The primary source is often a lookup table or file that can be pulled in its entirety to the staging area.
Creating and Moving the Result Set: There are two primary methods for getting data from a source system: as a file or as a stream.
An extract-to-file approach consists of three or four discrete steps: extract to file, move the file to the staging server, transform the file contents, and load the transformed data into the staging database.
Data compression can shorten the time needed to move large extract files to the staging server.
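A minimal sketch of the extract-to-file approach, written in Python. The table, column, and file names are illustrative assumptions, and source_conn/staging_conn stand in for whatever database connections the source and staging environments provide (qmark-style parameters as in sqlite/ODBC):

import csv
import gzip
import shutil

def extract_to_file(source_conn, out_path):
    # Step 1: extract the source lookup table in its entirety to a flat file.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in source_conn.execute("SELECT product_code, product_name FROM product_lookup"):
            writer.writerow(row)

def compress_and_move(out_path, staging_path):
    # Step 2: compress the extract file while copying it to the staging server's drop area.
    with open(out_path, "rb") as src, gzip.open(staging_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

def transform_and_load(staging_conn, staging_path):
    # Steps 3-4: transform the file contents and load them into the staging database.
    with gzip.open(staging_path, "rt", newline="") as f:
        for code, name in csv.reader(f):
            staging_conn.execute(
                "INSERT INTO stg_product (product_code, product_name) VALUES (?, ?)",
                (code.strip(), name.strip().upper()),  # trivial cleanup transformation
            )
    staging_conn.commit()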

Step 4. Populate a Simple Dimension Table (cont'd)
Static Dimension Transformation: Even the simplest dimension table may require substantial data cleanup, and it will certainly require surrogate key assignment. Data cleanup is a huge topic, tackled in a separate section at the end of this chapter.
Simple Data Transformations: The most common, and easiest, form of data transformation is data type conversion.
Surrogate Key Assignment: Surrogate keys are typically assigned as integers, increasing by one for each new key. If your staging area is in an RDBMS, surrogate key assignment is elegantly accomplished by creating a sequence.
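A minimal Python sketch of the same idea for staging areas that are not in an RDBMS; the natural-key column name is an illustrative assumption:

import itertools

class SurrogateKeyGenerator:
    # Hands out integer surrogate keys, increasing by one for each new natural key,
    # much like an RDBMS sequence paired with a key-lookup table.
    def __init__(self, start=1):
        self._next = itertools.count(start)
        self._assigned = {}  # natural key -> surrogate key

    def key_for(self, natural_key):
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next)
        return self._assigned[natural_key]

keygen = SurrogateKeyGenerator()
for row in [{"product_code": "A17"}, {"product_code": "B02"}, {"product_code": "A17"}]:
    row["product_key"] = keygen.key_for(row["product_code"])
# A17 gets key 1, B02 gets key 2, and the repeated A17 reuses key 1.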


Step 4. Populate a Simple Dimension Table (cont'd)
Load: tips for using the bulk load utility:
Turn off logging.
Pre-sort the file.
Transform with caution.
Aggregations.
Use the bulk loader to perform within-database inserts.
Truncate the target table before a full refresh.
Index management: either drop and re-create the indexes around the load, or keep the indexes in place.

Step 4. Populate a Simple Dimension Table (cont'd)
Maintaining Dimension Tables
This can happen in two places:
Warehouse-Managed Maintenance
Source System-Managed Maintenance


Step 5. Implement Dimension Change Logic
Dimension Table Extracts
Processing Slowly Changing Dimensions
The meaning of the types:
Type 1: Overwrite. We take the revised description in the raw data and overwrite the dimension table contents.
Type 2: Create a new dimension record. We copy the previous version of the dimension record and create a new dimension record with a new surrogate key.
Type 3: Push down the changed value into an old attribute field.
Slowly Changing Dimension Table Transformation and Load
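A minimal Python sketch of the Type 1 versus Type 2 choice for one dimension record. Treating address as a Type 2 attribute and name as a Type 1 attribute of a customer dimension is an illustrative assumption, not a rule from the course material:

import itertools

next_surrogate_key = itertools.count(1001)  # continue numbering after the keys already assigned

def apply_dimension_change(current, incoming):
    # current: the most recent dimension row for this customer (a dict).
    # incoming: the changed record arriving from the source extract.
    # Returns the list of rows that should exist for this customer afterwards.
    if incoming["address"] != current["address"]:
        # Type 2: copy the previous version and create a new record with a new
        # surrogate key, keeping the old row so history is preserved.
        new_row = dict(current)
        new_row.update(incoming)
        new_row["customer_key"] = next(next_surrogate_key)
        return [current, new_row]
    # Type 1: overwrite the revised description in place (no new key, no history).
    revised = dict(current)
    revised["name"] = incoming["name"]
    return [revised]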

Step 5. Implement Dimension Change Logic (cont'd)
The lookup and key assignment logic for handling a changed dimension record when we know that the input represents only changed records.

Step 5. Implement Dimension Change Logic (cont'd)
The logic for determining whether an input dimension record has been changed.
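One common way to implement this comparison is to store a fingerprint (a hash or CRC) of each dimension record and compare fingerprints on the next load; a hedged Python sketch, with the compared columns as illustrative assumptions rather than the exact logic in the figure:

import hashlib

COMPARE_COLUMNS = ["name", "address", "segment"]  # attributes worth comparing (illustrative)

def fingerprint(row):
    # Concatenate the comparable attributes and hash them; storing this value with
    # the dimension row lets the next load detect changes without comparing column by column.
    joined = "|".join(str(row[c]) for c in COMPARE_COLUMNS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def has_changed(incoming_row, stored_fingerprint):
    return fingerprint(incoming_row) != stored_fingerprint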

Step 6. Populate Remaining Dimensions
Implementing Your Own INSERT Statements
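If you do implement your own INSERT statements rather than using a bulk loader, a minimal Python sketch of building parameterized statements from dimension rows; the table name, column names, and qmark placeholders are illustrative assumptions:

def build_insert(table, row):
    # Build one parameterized INSERT for a dimension row; the column list comes
    # straight from the row, so it matches whatever your dimension table uses.
    columns = ", ".join(row.keys())
    placeholders = ", ".join("?" for _ in row)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return sql, tuple(row.values())

sql, params = build_insert("customer_dim", {"customer_key": 1001, "name": "ACME", "city": "Jakarta"})
# conn.execute(sql, params)  # run against the warehouse connection of your choice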

Fact Table Loads and Warehouse Operations
Step 7. Historical Load of Atomic-Level Facts
Historic Fact Table Extracts (Audit Statistics)
Fact Table Processing (Fact Table Surrogate Key Lookup)
Null Fact Table Values
Improving Fact Table Content
Data Restructuring
Data Mining Transformations
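A minimal Python sketch of the surrogate key lookup and substitution performed while processing fact rows; the dimension names, natural-key columns, and measure are illustrative assumptions:

def substitute_surrogate_keys(fact_rows, product_keys, customer_keys):
    # product_keys / customer_keys: lookups from natural key to the current surrogate key.
    # Rows with no matching dimension row are set aside as referential-integrity
    # failures instead of being loaded.
    loaded, failed = [], []
    for row in fact_rows:
        try:
            loaded.append({
                "product_key": product_keys[row["product_code"]],
                "customer_key": customer_keys[row["customer_id"]],
                "sales_amount": row["sales_amount"],
            })
        except KeyError:
            failed.append(row)
    return loaded, failed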

Step 7. Historical Load of Atomic-Level Facts
Data Mining Transformations (cont'd)
Flag normal, abnormal, out-of-bounds, or impossible facts.
Recognize random or noise values from context and mask them out.
Apply a uniform treatment to null values.
Flag fact records with changed status.
Classify an individual record by one of its aggregates.
Divide data into training, test, and evaluation sets.
Map continuous values into ranges.
Normalize values between 0 and 1.
Convert from textual to numeric or numerical categories.
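A minimal Python sketch of three of these transformations (normalizing to [0, 1], mapping continuous values into ranges, and splitting into training/test/evaluation sets); the band boundaries and split proportions are illustrative assumptions:

import random

def normalize_0_1(values):
    # Rescale a list of numeric facts into the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def map_to_band(value, boundaries=(0, 100, 500, 1000)):
    # Map a continuous value into a labeled range ("band 0", "band 1", ...).
    band = max(sum(value >= b for b in boundaries) - 1, 0)
    return f"band {band}"

def split_sets(rows, seed=42):
    # Divide records into training (60%), test (20%), and evaluation (20%) sets.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    return shuffled[:int(0.6 * n)], shuffled[int(0.6 * n):int(0.8 * n)], shuffled[int(0.8 * n):]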

Step 8. Incremental Fact Table Staging
Incremental Fact Table Extracts
New transactions.
Updated transactions.
Database logs.
Replication.
Incremental Fact Table Load
Speeding Up the Load Cycle
More Frequent Loading
Partitioned Files and Indexes
Parallel Processing
Multiple load steps.
Parallel execution.
Parallel Databases
Duplicate Tables


Step 8. Incremental Fact Table Staging (cont'd)
Duplicate table loading technique.

Step 8. Incremental Fact Table Staging (cont'd)
Step 0. The active_fact1 table is active and queried by users.
Step 1. Load the incremental facts into the load_fact1 table.
Step 2. Prepare for the next load by duplicating load_fact1 into dup_fact1.
Step 3. Index the load_fact1 table according to the final index list.
Step 4. Bring down the warehouse, forbidding user access.
Step 5. Rename the active_fact1 table to old_fact1.
Step 6. Rename the load_fact1 table to active_fact1.
Step 7. Open the warehouse to user activity. Total downtime: seconds!
Step 8. Rename the dup_fact1 table to load_fact1. This step prepares for the next load cycle.
Step 9. Drop the old_fact1 table.
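A minimal Python sketch of the same cycle issued as SQL statements; the rename syntax varies by DBMS, and load_incremental_facts and build_indexes are placeholder callables standing in for steps 1 and 3:

def run_duplicate_table_load_cycle(conn, load_incremental_facts, build_indexes):
    # Steps 1-3 run while users continue to query active_fact1.
    load_incremental_facts(conn, "load_fact1")                          # Step 1
    conn.execute("CREATE TABLE dup_fact1 AS SELECT * FROM load_fact1")  # Step 2
    build_indexes(conn, "load_fact1")                                   # Step 3
    # Steps 4-7: the only user-visible downtime is the pair of renames below.
    conn.execute("ALTER TABLE active_fact1 RENAME TO old_fact1")        # Step 5
    conn.execute("ALTER TABLE load_fact1 RENAME TO active_fact1")       # Step 6
    # Steps 8-9: prepare for the next cycle and clean up.
    conn.execute("ALTER TABLE dup_fact1 RENAME TO load_fact1")          # Step 8
    conn.execute("DROP TABLE old_fact1")                                # Step 9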

Step 9. Aggregate Table and MOLAP Loads
There are two general cases:
The aggregate table does not include the most recent
month until the month has finished. In this case, define a
monthly process that recomputes the latest month from
the base fact table when the full month of data is in.
The aggregate table keeps the current month as month-
to-date. In this case, define a nightly process that
updates existing rows in the aggregate table for this
month where they already exist and appends new rows
where the key combination does not exist.
Computation and Loading Techniques
MOLAP Engines
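For the second case above (month-to-date aggregates), a minimal Python sketch of the nightly update-or-append logic; the key columns and the measure name are illustrative assumptions:

def apply_nightly_facts_to_mtd_aggregate(aggregate, nightly_fact_rows):
    # aggregate: dict keyed by (month, product_key) -> running sales total for the month to date.
    # Update rows that already exist for the current month and append new rows
    # where the key combination does not exist yet.
    for row in nightly_fact_rows:
        key = (row["month"], row["product_key"])
        aggregate[key] = aggregate.get(key, 0.0) + row["sales_amount"]
    return aggregate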


Step 10. Warehouse Operation and Automation
Typical Operational Functions
Job definition: flows and dependencies
Job scheduling: time and event based
Monitoring
Logging
Exception handling
Error handling
Notification
Job Control Approaches
Extract Metadata
Process Management (file existence, flag set)
Process Measurement
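A minimal Python sketch of job definition as flows and dependencies, run in dependency order; the job names and the dependency graph are illustrative assumptions:

from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job flows and dependencies for one load cycle.
jobs = {
    "extract_dimensions": set(),
    "extract_facts": set(),
    "process_dimensions": {"extract_dimensions"},
    "process_facts": {"extract_facts", "process_dimensions"},  # fact processing needs the surrogate keys
    "load_dimensions": {"process_dimensions"},
    "load_facts": {"load_dimensions", "process_facts"},
    "load_aggregates": {"load_facts"},
}

def run_job(name):
    print(f"running {name}")  # placeholder for the real job body, logging, and notification

for job in TopologicalSorter(jobs).static_order():
    run_job(job)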

Step 10. Warehouse Operation and Automation
Operations Metadata
Typical Job Schedule
Extract dimensions and write out metadata
Extract facts and write out metadata
Process dimensions
o Surrogate key / slowly changing processing / key lookup, etc.
o Data quality checks: write out metadata
Process facts
o Surrogate key lookup and RI check: write out failed records
o Process failed records
o Data transformations

Step 10. Warehouse Operation and Automation
Typical Job Schedule (cont'd)
Load dimensions before the fact table to enforce referential integrity
Load facts
Load aggregates
Review the load process: validate the load against metadata
Change pointers or switch instances for high-uptime (24 x 7) or parallel-load warehouses
Extract, load, and notify downstream data marts (and other systems)
Update metadata as needed
Write job metadata
Review job logs; verify a successful load cycle

Data Quality and Cleansing
Criteria: Accurate, Complete, Consistent, Unique, Timely.
Data Improvement
Inconsistent or incorrect use of codes and special
characters.
A single field is used for unofficial or undocumented
purposes.
Overloaded codes.
Evolving data.
Missing, incorrect, or duplicate values.
Processing Names and Addresses
An Approach to Improving the Data


Data Quality and Cleansing (cont'd)
An Approach to Improving the Data (cont'd)
Where there are alternatives, identify the highest-quality source system: the organization's system of record.
Examine the source to see how bad it is.
Upon scanning this list, you will immediately find minor variations in spelling.
Raise problems with the steering committee.
Fix problems at the source if at all possible.
Fix some problems during data staging.
Don't fix all the problems.
Work with the source system owners to help them institute regular examination and cleansing of the source systems.
If it's politically feasible, make the source system team responsible for a clean extract.

Data Quality Assurance
Cross-Footing
Manual Examination
Process Validation

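Cross-footing amounts to comparing control totals computed independently from the extract and from the loaded warehouse; a minimal Python sketch, with the measure name and tolerance as illustrative assumptions:

def cross_foot(extract_rows, warehouse_rows, tolerance=0.01):
    # Compare a control total computed from the extract with the total actually loaded
    # into the warehouse; a mismatch signals lost or duplicated rows.
    extract_total = sum(r["sales_amount"] for r in extract_rows)
    warehouse_total = sum(r["sales_amount"] for r in warehouse_rows)
    return abs(extract_total - warehouse_total) <= tolerance, extract_total, warehouse_total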

Miscellaneous Issues
Archiving in the Data Staging Area
Source System Rollback Segments
Disk Space Management
Key Roles
o The data staging team is on the front line for this entire
part of the project.
o The project manager must take a proactive role to partner
closely with the data staging team.
o The quality assurance analysts and data stewards begin to
take an active role during data staging development.
o The database administrator continues active involvement
by setting up appropriate backup, recovery, and archival
processes.
Estimating Considerations

Summary
The data staging application is one of the most difficult pieces of the data warehouse project. We have found that no matter how thorough our interviewing and analysis, there is nothing like working with real data to uncover data quality problems, some of which may be bad enough to force a redesign of the schema.

The extract logic is a challenging step. You need to work closely with the source system programmers to develop an extract process that generates quality data, does not place an unbearable burden on the transaction system, and can be incorporated into the flow of your automated data staging application.

Summary (cont'd)

Data transformation may sound mysterious, but hopefully at this point you realize that it's really fairly straightforward. We have tried to demystify the jargon around implementing dimensional designs, emphasizing the importance of using surrogate keys and describing in detail how to implement key assignment. Data loading at its heart is quite simple: always use the bulk loader. But, as with all aspects of data warehousing, there are many tricks and techniques that help minimize the load window.

Summary (cont'd)

Finally, we hope that the many ideas presented in this chapter have sparked your creativity for designing your data staging application. More than anything else, we want to leave you with the notion that there are very many ways to solve any problem. The most difficult problems require patience and perseverance, a willingness to keep bashing your head against the wall, but an elegant and efficient solution is its own reward. Be sure to consult the project plan spreadsheet at the end of this chapter for a summary of the tasks and responsibilities for data staging development.
