
DATA WAREHOUSING

Data warehousing, as the name suggests, is a game played with data: the data for today, yesterday, the day before yesterday, a week ago, a year ago, stretching back perhaps to the age of your great-grandfather. The data warehouse is mainly a term used in companies to describe how they manage their data. The larger the company, the bigger the data and the bigger the requirements for managing it. The concept of a data warehouse is to have a centralized database that captures information from different parts of the business process. What counts as a data warehouse is determined by the collection of data and how it is used by the company and the individuals it supports. Data warehousing is the method businesses use to create and maintain a company-wide view of their data.

The History

Many people, when they first hear the basic principles of data warehousing, particularly copying data from one place to another, think (or even say), "That doesn't make any sense! Why waste time copying and moving data, and storing it in a different database? Why not just get it directly from its original location when someone needs it?"

The 1970s: The Preparation


The computing world was dominated by the mainframe in those days. Real data-processing applications, the ones run on the corporate mainframe, almost always had a complicated set of files or early-generation databases (not the table-oriented relational databases most applications use today) in which they stored data. Although the applications did a fairly good job of performing routine data-processing functions, the data created as a result of these functions (such as information about customers, the products they ordered, and how much money they spent) was locked away in the depths of the files and databases. The way the data was stored made it hard to use for anything beyond routine processing. It was almost impossible, for example, to see how retail stores in one region were doing against stores in another region, against their competitors, or even against their own performance in some earlier period. Between 1976 and 1979, the concept grew out of research at the California Institute of Technology (Caltech), driven by discussions with Citibank's advanced technology group, and gave rise to Teradata. The name Teradata was chosen to symbolize the ability to manage terabytes (trillions of bytes) of data.

The 1980s: The Birth


As the era began, computers were making their presence felt everywhere, and organizations started to have them in every department. How could an organization hope to compete if its data was scattered all over the place on different computer systems that weren't even all under the control of the centralized data-processing department? (Never mind that even when the data was all stored on mainframes, it was still isolated in different files and databases, so it was just as inaccessible.) A special kind of software then came into existence that promised to make life simple for everyone using or analyzing the data. This new type of software, called a distributed database management system (distributed DBMS, or DDBMS), would magically pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. Although this was the thing everyone was expecting, a chocolate is always better if it is sweeter, and the sweetest is perfect; that chocolate turned out to be the world's first RDBMS (relational database management system) for decision support.

Data Warehousing: The Need

The need has been established for a company-wide view of data in operational systems. Data warehouses are designed to help management and businesses analyze data, and this helps to fill the need for subject-oriented concepts. Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When this is achieved, the data warehouse is considered to be integrated. The form of the stored data has nothing to do with whether something is a data warehouse: a data warehouse can be normalized or de-normalized, and it can be a relational database, a multidimensional database, a flat file, a hierarchical database, an object database, and so on. Data warehouse data often gets changed, and data warehouses will most often be directed to a specific action or entity. Data warehouse success cannot be guaranteed for each project. The techniques involved can become quite complicated, and erroneous data can cause errors and failure. When management support is strong, resources are committed for business value, and an enterprise vision is established, the end results are far more likely to be helpful for the organization or business. The main factors that create the need for data warehousing in most businesses today are the requirement for a company-wide view of quality information and the separation of informational from operational systems for improved performance in managing data.

What is a Data Warehouse?


The most popular definition came from Bill Inmon (the father of data warehousing), who provided the following: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."

Subject-Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject-oriented.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Data Warehouse Architecture


Different data warehousing systems have different structures. Since every company is structured differently from the next, its data warehouse is bound to be different too. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture than to discuss the specifics of any one system. In general, all data warehouse systems have the following layers:

Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

Each layer is discussed individually below.

Data Source Layer: This represents the data sources that feed data into the data warehouse. The data source can be of any format: flat file, relational database, other types of database, Excel file, and so on. Some examples are operations data (such as sales, HR, product, inventory, marketing, and systems data), web server logs with user browsing data, internal market research data, and third-party data such as census, demographics, or survey data. All these data sources together form the Data Source Layer.

Data Extraction Layer: Data is extracted from the data sources into the data warehouse system. There is likely some minimal data cleansing here, but unlikely any major data transformation.

Staging Area: This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes subsequent data processing / integration easier.

ETL Layer: The data is transformed in this layer. This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.

Data Storage Layer (Target): This is where the transformed and cleansed data sits. Based on scope and functionality, three types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer: This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer: This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others.

Metadata Layer: This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something in the metadata layer.

System Operations Layer: This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

Data Warehousing Concepts


Dimensional Data Model
The basic terminology used for data warehousing is as follows.

Dimension: A category of information, mainly containing descriptive text. For example, the time dimension. Dimension tables are used to describe dimensions; they contain dimension keys, values, and attributes. For example, the time dimension would contain every hour, day, week, month, quarter, and year. A product dimension could contain a name and description of the products you sell, their unit price, color, weight, and other attributes as applicable. Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally, however, dimensions can grow fairly large; a large credit card company could have a customer dimension with millions of rows. For example, a customer dimension could have the columns Customer_key, Customer_full_name, Customer_city, Customer_state, Customer_country.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

Hierarchy: The specification of levels representing the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity; for example, sales amount by store by day, in which case the fact table would contain three columns: a date column, a store column, and a sales amount column. Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain".
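To make the terminology concrete, here is a minimal SQL sketch of the customer dimension and a sales fact table at the grain described above (sales amount by store by product by day). All table and column names (dim_customer, fact_sales, and so on) are hypothetical, chosen only to mirror the examples in the text.

-- Dimension table: descriptive attributes, one row per customer
CREATE TABLE dim_customer (
    customer_key       INT PRIMARY KEY,
    customer_full_name VARCHAR(100),
    customer_city      VARCHAR(50),
    customer_state     VARCHAR(50),
    customer_country   VARCHAR(50)
);

-- Fact table: one row per product per store per day (the "grain"),
-- holding keys to the dimensions plus the measure itself
CREATE TABLE fact_sales (
    sale_date    DATE,
    store_key    INT,
    product_key  INT,
    sales_amount DECIMAL(12,2)
);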

Fact and Fact Table Types

Types of Facts
There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns: Date, Store, Product, Sales_Amount. The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table: date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with a fact table containing the following columns: Date, Account, Current_Balance, Profit_Margin. The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
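A short SQL sketch of the distinction, reusing the hypothetical tables from these two examples (fact_account_balance is a made-up name for the bank table): the additive fact can be summed along any dimension, while the semi-additive balance may only be summed across accounts for a single date.

-- Additive: Sales_Amount sums meaningfully along any dimension,
-- e.g. total sales for one week across all stores and products
SELECT SUM(sales_amount) AS weekly_sales
FROM   fact_sales
WHERE  sale_date BETWEEN DATE '2003-01-06' AND DATE '2003-01-12';

-- Semi-additive: Current_Balance sums meaningfully across accounts
-- for one day, but not across days for one account
SELECT SUM(current_balance) AS total_bank_balance
FROM   fact_account_balance
WHERE  balance_date = DATE '2003-01-15';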

Types of Fact Tables


Based on the above classifications, there are two types of fact tables:

Cumulative: This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive. The first example presented above is a cumulative fact table.

Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented above is a snapshot fact table.

Types of Schema
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema, Snowflake Schema and Fact Constellation Schema.

Star Schema

The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of a fact table, and the points of the star are the dimension tables. In the star schema design, a single object (the fact table) sits in the middle and is radially connected to the surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table, and the primary key in each dimension table is related to a foreign key in the fact table. A star schema can be simple or complex: a simple star consists of one fact table, while a complex star can have more than one fact table.
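As an illustration, a typical query against a star schema joins the central fact table to its dimension tables through the foreign keys and then aggregates. The dimension tables named here (dim_date, dim_store) are hypothetical, continuing the earlier sketch.

-- A "star join": the fact table in the middle, dimensions around it
SELECT d.year,
       s.store_name,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date  d ON d.date_value = f.sale_date
JOIN   dim_store s ON s.store_key  = f.store_key
GROUP  BY d.year, s.store_name;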

Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy. For example, take a Time dimension that consists of 2 different hierarchies: 1. Year > Month > Day, and 2. Week > Day. We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day; Week is only connected to Day. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage is the additional maintenance effort needed due to the increased number of lookup tables.
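A minimal sketch of the snowflaked Time dimension just described, with hypothetical table names: each hierarchy level becomes its own lookup table, and the day table points upward to both month and week.

CREATE TABLE lookup_year (
    year_key   INT PRIMARY KEY,
    year_value INT
);

CREATE TABLE lookup_month (
    month_key  INT PRIMARY KEY,
    month_name VARCHAR(10),
    year_key   INT REFERENCES lookup_year (year_key)   -- Year > Month
);

CREATE TABLE lookup_week (
    week_key    INT PRIMARY KEY,
    week_number INT
);

CREATE TABLE lookup_day (
    day_key   INT PRIMARY KEY,
    day_date  DATE,
    month_key INT REFERENCES lookup_month (month_key), -- Month > Day
    week_key  INT REFERENCES lookup_week (week_key)    -- Week > Day
);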

Fact Constellation Schema

This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts and dimensions.

Granularity

The first step in designing a fact table is to determine its granularity, meaning the lowest level of information that will be stored in the fact table. This involves two steps: 1. Determine which dimensions will be included. 2. Determine where along the hierarchy of each dimension the information will be kept.

Slowly Changing Dimensions


The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below: Minal is a customer with ABC Inc. She first lived in Mumbai. So, the original entry in the customer lookup table has the following record: Customer Key 1001 Name Minal State Mumbai

At a later date, in January 2003, she moved to Pune. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem. There are in general three ways to solve this type of problem, categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. The customer is therefore treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios, and at how the data model and the data look for each of them; finally, we compare and contrast the three alternatives.

Type 1 Slowly Changing Dimension

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept. In our example, recall we originally have the following table:

Customer Key | Name  | State
1001         | Minal | Mumbai

After Minal moved from Mumbai to Pune, the new information replaces the old record, and we have the following table:

Customer Key | Name  | State
1001         | Minal | Pune
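In SQL terms, a Type 1 change is a plain in-place update; a sketch against a hypothetical dim_customer table:

-- Type 1: overwrite in place; no history of 'Mumbai' survives
UPDATE dim_customer
SET    state = 'Pune'
WHERE  customer_key = 1001;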

Advantages: This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Minal lived in Mumbai before.

Usage: About 50% of the time.

When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. In our example, recall we originally have the following table:

Customer Key | Name  | State
1001         | Minal | Mumbai

After Minal moved from Mumbai to Pune, we add the new information as a new row into the table:

Customer Key | Name  | State
1001         | Minal | Mumbai
1005         | Minal | Pune
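In SQL terms, a Type 2 change leaves the original row alone and inserts a new row under a new surrogate key; a sketch against the same hypothetical table:

-- Type 2: add a new row with its own surrogate key;
-- row 1001 ('Mumbai') is left untouched
INSERT INTO dim_customer (customer_key, name, state)
VALUES (1005, 'Minal', 'Pune');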

Advantages: This allows us to accurately keep all historical information.

Disadvantages: This will cause the size of the table to grow fast. In cases where the number of rows in the table is very high to start with, storage and performance can become a concern. It also necessarily complicates the ETL process.

Usage: About 50% of the time.

When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

Type 3 Slowly Changing Dimension

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value became active. In our example, recall we originally have the following table:

Customer Key | Name  | State
1001         | Minal | Mumbai

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key, Name, Original State, Current State, Effective Date. After Minal moved from Mumbai to Pune, the original record gets updated, and we have the following table (assuming the effective date of the change is January 15, 2003):

Customer Key | Name  | Original State | Current State | Effective Date
1001         | Minal | Mumbai         | Pune          | 15-JAN-2003
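In SQL terms, a Type 3 change modifies the row in place while keeping the prior value in its own column; a sketch, assuming the two extra columns already exist on the hypothetical table:

-- Type 3: old and new values live side by side in one row;
-- original_state still holds 'Mumbai' after the update
UPDATE dim_customer
SET    current_state  = 'Pune',
       effective_date = DATE '2003-01-15'
WHERE  customer_key = 1001;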

Advantages: This does not increase the size of the table, since the new information is simply updated in place. It also allows us to keep some part of the history.

Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Minal later moves to Texas on December 15, 2003, the Pune information will be lost.

Usage: Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Data Integrity

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data warehousing field, we frequently hear the phrase "Garbage In, Garbage Out": if there is no data integrity in the data warehouse, any resulting report and analysis will not be useful. In a data warehouse or a data mart, there are several areas where data integrity needs to be enforced. At the database level, common ways of enforcing data integrity include:

Referential integrity: The relationship between the primary key of one table and the foreign key of another table must always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that refers to it.

Primary key / unique constraint: Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified.

NOT NULL vs NULL-able: Columns identified as NOT NULL may not hold a NULL value.

Valid values: Only allowed values are permitted in the database. For example, if a column can only hold positive integers, a value of '-1' cannot be allowed.
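These database-level mechanisms map directly onto standard SQL constraints. A sketch, revisiting the hypothetical fact table from earlier with the constraints now spelled out:

CREATE TABLE fact_sales (
    sale_date    DATE NOT NULL REFERENCES dim_date (date_value), -- referential integrity
    store_key    INT  NOT NULL REFERENCES dim_store (store_key),
    product_key  INT  NOT NULL,                                  -- NOT NULL enforced
    sales_amount DECIMAL(12,2) CHECK (sales_amount >= 0),        -- valid values only
    PRIMARY KEY (sale_date, store_key, product_key)              -- unique identification
);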

OLTP vs. OLAP


We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains detailed and current data, and the schema used to store transactional data is the entity model (usually 3NF).

- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
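A small sketch of the contrast in query styles, with hypothetical table names: the OLTP statement is a short transaction against one current row, while the OLAP query aggregates history across dimensions.

-- OLTP: a short, fast transaction touching one current row
UPDATE customer
SET    address = '12 MG Road, Pune'
WHERE  customer_id = 1001;

-- OLAP: a complex aggregation over historical, multi-dimensional data
SELECT d.quarter,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date    d ON d.date_value  = f.sale_date
JOIN   dim_product p ON p.product_key = f.product_key
GROUP  BY d.quarter, p.product_category;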

The following table summarizes the major differences between OLTP and OLAP system design.

OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
  OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Types of OLAP

Depending on the underlying technology used, OLAP can be broadly divided into two camps: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP. Each is discussed below.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible, but in that case only summary-level information will be included in the cube itself.

Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the data size limitation of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

Can leverage functionalities inherent in the relational database: Relational databases already come with a host of functionalities, and ROLAP technologies, since they sit on top of the relational database, can leverage them.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building out-of-the-box complex functions into the tool, as well as the ability to allow users to define their own functions.
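As a sketch of the point about WHERE clauses: slicing and dicing in a ROLAP tool boils down to extra predicates on the SQL it generates. The table and value names below are hypothetical.

-- Dicing the sales data down to one store and one quarter is,
-- in ROLAP, just additional WHERE conditions on the generated SQL
SELECT p.product_name,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
JOIN   dim_store   s ON s.store_key   = f.store_key
JOIN   dim_date    d ON d.date_value  = f.sale_date
WHERE  s.store_name = 'Pune Central'  -- slice on store
  AND  d.quarter    = 'Q1-2003'       -- slice on time
GROUP  BY p.product_name;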

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Requirement Gathering

Task Description: The first thing the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen in one-to-one meetings or in Joint Application Development (JAD) sessions, where multiple people talk about the project scope in the same meeting. The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements. Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and, most importantly, a concrete project plan indicating the finishing date of the data warehousing project. Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable it. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

Time Requirement: 2 - 8 weeks.

Deliverables: A list of reports / cubes to be delivered to the end users by the end of this phase. An updated project plan that clearly identifies resource loads and milestone delivery dates.

Possible Pitfalls: This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or fails to start in the direction originally defined. When this happens, it is ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.

Physical Environment Setup

Task Description: Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects with three environments: development, testing, and production. It is not enough to simply have different physical environments set up; the different processes (such as ETL, OLAP cube, and reporting) also need to be set up properly for each environment. It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application and database servers, and the production environment will have its own set. Having different environments is very important for the following reasons: All changes can be tested and QA'd first without affecting the production environment. Development and QA can occur while users are accessing the data warehouse. When there is any question about the data, having separate environments allows the data warehousing team to examine the data without impacting the production environment.

Time Requirement: Getting the servers and databases ready should take less than 1 week.

Deliverables: Hardware / software setup document for all of the environments, including hardware specifications and scripts / settings for the software.

Possible Pitfalls: To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments, achieving environment separation through either a directory structure or distinct instances of the database. This is problematic for the following reasons: 1. Sometimes the server needs to be rebooted for the development environment; having a separate development environment prevents the production environment from being impacted. 2. There may be interference when different database environments share a single box. For example, multiple long queries running on the development database could affect performance on the production database.

Data Modeling

Task Description: This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allow for good performance. In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step; however, it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it will become a much tougher and more complex process.

Time Requirement: 2 - 6 weeks.

Deliverables: Identification of data sources. Logical data model. Physical data model.

Possible Pitfalls: It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or someone in-house with extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.

ETL

Task Description: The ETL (extraction, transformation, loading) process typically takes the longest to develop, and it can easily consume 50% of the data warehouse implementation cycle or longer. The reason is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.

Time Requirement: 1 - 6 weeks.

Deliverables: Data mapping document. ETL script / ETL package in the ETL tool.

Possible Pitfalls: There is a tendency to give this particular phase too little development time. This can prove fatal to the project, because end users will usually tolerate less formatting, longer report run times, less functionality (slicing and dicing), or fewer delivered reports; the one thing they will not tolerate is wrong information. A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.
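A minimal sketch of one set-based transform-and-load step, assuming hypothetical staging and dimension tables: rows are pulled from the staging area, cleansed, mapped to surrogate keys, and inserted into the fact table.

-- Transform and load: move cleansed rows from staging into the fact table
INSERT INTO fact_sales (sale_date, store_key, product_key, sales_amount)
SELECT stg.sale_date,
       s.store_key,
       p.product_key,
       ROUND(stg.gross_amount - stg.discount, 2)                     -- transformation rule
FROM   staging_sales stg
JOIN   dim_store   s ON s.store_code   = UPPER(TRIM(stg.store_code)) -- cleansing
JOIN   dim_product p ON p.product_code = stg.product_code
WHERE  stg.gross_amount IS NOT NULL;                                 -- reject incomplete rows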
