
Chapter 8: Populating the Data Warehouse

Now that we have extracted the data from the source system, we will populate the NDS and the DDS with that data. In this chapter, we will look at the five main subjects regarding data warehouse population, in the sequence they occur in a data warehouse system (a sketch of this sequence follows the list):

1. Loading the stage: We load the source system data into the stage. Usually the focus is to extract the data as soon as possible without doing too much transformation; in other words, the structure of the stage tables is similar to that of the source system tables. In the previous chapter we discussed the extraction; in this chapter we will discuss the loading.

2. Creating the data firewall: We check the quality of the data when it is loaded from the stage into the NDS or ODS. The check is done using predefined rules that define what action to take: reject the data, allow the data, or fix the data.

3. Populating a normalized data store: We load the data from the stage into the NDS or ODS, after the data passes through the data firewall. Both are normalized data stores consisting of entities with minimal data redundancy. Here we deal with data normalization and key management.

4. Populating dimension tables: We load the data from the NDS or ODS into the DDS dimension tables. This is done after we have populated the normalized data store. The DDS is a dimensional store where the data is denormalized, so when populating dimension tables we deal with issues such as denormalization and slowly changing dimensions.

5. Populating fact tables: This is the last step in populating the DW, done after we have populated the dimension tables in the DDS. The data from the NDS or ODS is loaded into the DDS fact tables. In this process we deal with surrogate key lookup and late-arriving fact rows.
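As a rough illustration of this sequence, the whole load could be orchestrated as one master procedure. The following T-SQL is only a sketch; the etl schema and all procedure names are hypothetical placeholders, not from this chapter.

    -- Hypothetical master ETL procedure showing the population sequence.
    -- Each step is assumed to be implemented elsewhere (as an SSIS
    -- package or a stored procedure); the names are placeholders.
    CREATE PROCEDURE etl.master_load
    AS
    BEGIN
        EXEC etl.load_stage;          -- 1. load source system data into the stage
        EXEC etl.run_data_firewall;   -- 2. apply the data quality rules
        EXEC etl.populate_nds;        -- 3. normalize the data into the NDS/ODS
        EXEC etl.populate_dimensions; -- 4. load the DDS dimension tables
        EXEC etl.populate_facts;      -- 5. load the DDS fact tables last
    END;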

Stage Loading
If your stage is a database, which is common, it is better not to put any indexes or constraints (such as NOT NULL, primary key, or check constraints) in the stage database. The main reason for this is not performance; it is that we want to capture and report the bad data in the data quality process. We want to allow bad data, such as NULLs and duplicate primary keys, into the stage.
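As a minimal sketch of this idea, a stage table mirrors the source table but deliberately has no keys, constraints, or indexes. The table and column names below are hypothetical.

    -- Hypothetical stage table: no primary key, no NOT NULL or check
    -- constraints, and no indexes. Bad data such as NULLs and duplicate
    -- natural keys is allowed in, so the data quality process can
    -- capture and report it later.
    CREATE TABLE stage.customer
    ( customer_id    varchar(10)  NULL  -- source natural key; duplicates allowed
    , customer_name  varchar(100) NULL
    , date_of_birth  varchar(20)  NULL  -- left as text; conversion happens downstream
    , load_timestamp datetime     NULL  -- when the row arrived in the stage
    );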

Data Firewall
The concept of a data firewall is similar to the firewall concept in networking. The data firewall is a program that checks incoming data; physically, it is an SSIS package or a stored procedure. We place a data firewall between the stage and the NDS, and it allows, rejects, or fixes data depending on the data quality rules that we set. Like network firewalls, data firewalls have mechanisms to report what data has been rejected, by what rule, and when: every time the data firewall finds bad data (as defined by a data quality rule), the bad data is stored in the data quality database, along with the rule that captured it, the action that was taken, and when it happened. We can then report from the data quality database, and we can set the data quality system to notify the appropriate people when certain rules are violated. Unlike a network firewall, a data firewall can also fix or correct bad data. When the data firewall detects bad data, we can set it to take one of three actions:

a. Reject the data (do not load it into the DW).
b. Allow the data (load it into the DW).
c. Fix the data (correct it before loading it into the DW).

A data firewall is an important part of DW loading because it ensures data quality. Before we load data into the NDS, we check the data by passing it through the firewall rules.
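As an illustration, here is one firewall rule written as a stored procedure. The dq schema, the table names, and the rule name are hypothetical; the rule rejects customer rows that have no natural key and logs them, together with the rule name, action, and timestamp, to the data quality database.

    -- Hypothetical firewall rule: a customer row must have a customer_id.
    CREATE PROCEDURE dq.rule_customer_id_not_null
    AS
    BEGIN
        -- Log each violating row with which rule fired, what action
        -- was taken, and when, so we can report from the DQ database.
        INSERT INTO dq.violation (rule_name, action_taken, logged_at, row_description)
        SELECT 'customer_id_not_null', 'reject', GETDATE(), s.customer_name
        FROM stage.customer AS s
        WHERE s.customer_id IS NULL;

        -- 'Reject' means these rows are excluded from the NDS load: the
        -- downstream load selects only rows WHERE customer_id IS NOT NULL.
    END;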

Populating NDS
In the NDS + DDS architecture, we need to populate the tables in the NDS before we populate the dimension and fact tables in the DDS, because the DDS is populated from the NDS data. Populating the NDS is quite different from populating the stage: when populating the NDS we need to normalize the data, whereas when populating the stage we don't. We extract data from the stage tables (or straight from the source system) and load it into the NDS database. If a record does not exist in the NDS, we insert it; if it already exists, we update it. When populating the NDS, we need to consider several issues:

a. Normalization: In the NDS, the tables are normalized, so when loading the data from the stage, we need to normalize it to fit the NDS structure. This means that we need to populate certain tables first before we can populate the main table.

b. External data: It is possible that data from external sources does not match the data from the source systems. This means that when we load data into the NDS, we may need to do some data conversion.

c. Key management: In the NDS, we need to create and maintain internal DW keys, which will also be used in the DDS. When loading data into the NDS, we need to manage these keys.

d. Junction tables: Junction tables enable us to implement many-to-many relationships. When populating them, we need to follow a certain sequence: the parent tables first, then the junction table.

The purpose of having our own keying system in the data warehouse is twofold: first, it enables integration with a second source system, and second, it lets us absorb key changes in the source system(s). Data warehouse keys are simple incremental integers (1, 2, 3, 4, and so on). The data warehouse key is known as the surrogate key (SK), and the source system key is known as the natural key (NK). The SK enables the integration of several source systems because we map, or associate, each NK to an SK. For example, say Jupiter has a set of product status codes, each mapped to a surrogate key and an active flag.

If Quality Control was deleted from Jade, we would not delete the corresponding row in the NDS; instead, we would change the active flag for Quality Control from T to F.
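The following T-SQL sketches this key management pattern: the surrogate key is an incremental integer, each natural key is mapped to a surrogate key, new records are inserted while existing ones are updated, and deletions in the source are handled by flipping the active flag rather than deleting the row. All table and column names here are hypothetical.

    -- Hypothetical NDS table: the surrogate key (SK) is an IDENTITY
    -- column, so SKs are simple incremental integers (1, 2, 3, ...).
    CREATE TABLE nds.product_status
    ( status_sk     int IDENTITY(1,1) PRIMARY KEY -- surrogate key (SK)
    , status_nk     varchar(10) NOT NULL          -- natural key (NK) from the source
    , source_system varchar(20) NOT NULL          -- e.g., 'Jade' or 'Jupiter'
    , description   varchar(50) NOT NULL
    , active_flag   char(1)     NOT NULL DEFAULT 'T'
    );

    -- Upsert from the stage: insert rows whose NK is new, update rows
    -- whose NK already exists, and soft-delete rows that have
    -- disappeared from the source by setting active_flag to 'F'.
    MERGE nds.product_status AS t
    USING stage.product_status AS s
       ON t.status_nk = s.status_nk AND t.source_system = s.source_system
    WHEN MATCHED THEN
        UPDATE SET t.description = s.description, t.active_flag = 'T'
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (status_nk, source_system, description)
        VALUES (s.status_nk, s.source_system, s.description)
    WHEN NOT MATCHED BY SOURCE THEN
        UPDATE SET t.active_flag = 'F';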

Using SSIS to Populate NDS

In this section, you will learn how to populate a table in the NDS using the Slowly Changing Dimension Wizard in SSIS. The best way to learn about data warehouse population is by doing it: it is good to know the theory, but if you haven't actually done it, you won't have encountered the problems (or their solutions). So, let's populate the NDS using SSIS. Open BIDS, and open the Amadeus ETL SSIS package that we created in the previous chapter. The SSIS packages in this solution follow a three-part naming convention: [destination data store] [frequency] [extraction method]

The first element is the destination data store, such as the stage, the NDS, or the DDS. The second element is the frequency with which the ETL package is executed, such as weekly, daily, monthly, or ad hoc. The third element is the extraction method, such as incremental, full reload, or external data. There are five stage packages in the Amadeus ETL SSIS solution:

Stage daily incremental.dtsx contains the order header, order detail, customer, product, and store.
Stage monthly incremental.dtsx contains the currency rate.
Stage daily full reload.dtsx contains the product status, customer status, product type, household income, product category, interest, currency, package, and package type.
Stage weekly external data.dtsx contains the country and language.
Stage adhoc full reload.dtsx contains the region, division, and state.
