
This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor.

This site is divided into six main areas.

- Tools: The selection of business intelligence tools and the selection of the data warehousing team. Tools covered are: Database, Hardware, ETL (Extraction, Transformation, and Loading), OLAP, Reporting, and Metadata.
- Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.
- Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing.
- Concepts: This section discusses several concepts particular to the data warehousing field. Topics include: Dimensional Data Model, Star Schema, Snowflake Schema, Slowly Changing Dimension, Conceptual Data Model, Logical Data Model, Physical Data Model, Conceptual vs. Logical vs. Physical Data Model, Data Integrity, What is OLAP, MOLAP vs. ROLAP vs. HOLAP, and Bill Inmon vs. Ralph Kimball.
- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry.
- Glossary: A glossary of common data warehousing terms.

This site is updated frequently to reflect the latest technology, information, and reader feedback. Please bookmark this site now.

A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions, and concepts in each implementation.

Unlike in operational systems, where data redundancy is normally avoided, data replication is expected in the data warehouse world.
To provide fast access and intuitive "drill down" capabilities of data originating from multiple operational systems, it is often necessary to replicate dimensional data in Data Warehouses and in Data Marts. Un-conformed dimensions imply the existence of logical and/or physical inconsistencies that should be avoided.

Conformed dimensions are dimensions which are common across cubes (cubes are the schemas containing fact and dimension tables). Consider Cube-1 containing fact table F1 and dimensions D1, D2, D3, and Cube-2 containing fact table F2 and dimensions D1, D2, D4. Here D1 and D2 are the conformed dimensions.

[Diagram: five fact tables (REGISTRATION_FACT, TRANSCRIPT_FACT, RETAIL_LEARNING_FACT, ASSESSMENT_FACT, QUESTION_BANK_FACT) sharing conformed dimensions such as LOCATION_DIM, VENDOR_DIM, FACILITY_DIM, DELEGATE_DIM, CATALOG_DIM, GROUP_TREE_DIM, RBS_DEPT_ROLLUP_DIM, PRODUCT_AUD_TYPE_DIM, DELEGATE_AUD_TYPE_DIM, COST_DIM, CERTIFICATION_DIM, and CERTIFICATION_ASSIGN_DIM.]
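The idea of one physically shared dimension serving several fact tables can be sketched in a few lines of SQL. This is a minimal illustration only; the table and column names below are hypothetical, not from the diagram above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical date dimension, conformed (shared) across both fact tables.
cur.execute("""CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT, year INTEGER, month INTEGER)""")

# Two fact tables from different subject areas reference the same dimension.
cur.execute("""CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    amount REAL)""")
cur.execute("""CREATE TABLE shipment_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    units INTEGER)""")

cur.execute("INSERT INTO date_dim VALUES (20030115, '2003-01-15', 2003, 1)")
cur.execute("INSERT INTO sales_fact VALUES (20030115, 99.5)")
cur.execute("INSERT INTO shipment_fact VALUES (20030115, 3)")

# Because the dimension is conformed, results from the two fact tables can be
# combined on the same dimension attributes without any mapping layer.
row = cur.execute("""
    SELECT d.year, SUM(s.amount), SUM(sh.units)
    FROM date_dim d
    JOIN sales_fact s  ON s.date_key  = d.date_key
    JOIN shipment_fact sh ON sh.date_key = d.date_key
    GROUP BY d.year""").fetchone()
print(row)  # (2003, 99.5, 3)
```

If the two fact tables instead carried un-conformed date dimensions (different keys or domain values), this cross-fact query would require translation logic, which is exactly the inconsistency the conformed approach avoids.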

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema. Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes when there is a business case to analyze the information at that particular level.

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

Sample star schema

All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Let's look at an example: Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
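A star schema like the one described above can be sketched directly in SQL: each dimension gets one table, and the fact table holds a foreign key to each dimension plus the measure. This is a toy sketch; the names below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension (lookup) tables: in a star schema, one single table per dimension.
cur.execute("CREATE TABLE time_dim  (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT)")
cur.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT)")

# The fact table sits in the middle, holding a foreign key to every dimension
# plus the measure (sales amount) at a single grain: sales by store by day.
cur.execute("""CREATE TABLE sales_fact (
    time_key  INTEGER REFERENCES time_dim(time_key),
    store_key INTEGER REFERENCES store_dim(store_key),
    sales_amount REAL)""")

cur.execute("INSERT INTO time_dim VALUES (1, 2001, 'Q1 2001')")
cur.execute("INSERT INTO store_dim VALUES (10, 'Downtown')")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(1, 10, 100.0), (1, 10, 250.0)])

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by the dimension attributes of interest.
row = cur.execute("""
    SELECT t.quarter, s.store_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN time_dim t  ON f.time_key  = t.time_key
    JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY t.quarter, s.store_name""").fetchone()
print(row)  # ('Q1 2001', 'Downtown', 350.0)
```

Note that the two dimension tables are joined only through the fact table, never to each other, matching the rule that dimensions are not related to one another.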

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Sample snowflake schema

For example, if the Time Dimension consists of 2 different hierarchies:

1. Year > Month > Day
2. Week > Day

we will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown to the right.

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key  Name       State
1001          Christina  Illinois

At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast among the three alternatives.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key  Name       State
1001          Christina  California

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:

About 50% of the time.

When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California

Advantages:
- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage:
About 50% of the time.

When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:

Customer Key  Name       State
1001          Christina  Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

Customer Key  Name       Original State  Current State  Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003

Advantages:
- This does not increase the size of the table, since the new information is updated in place.
- This allows us to keep some part of history.

Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage:
Type 3 is rarely used in actual practice.

When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of times.

A conceptual data model identifies the highest-level relationships between the different entities. Features of the conceptual data model include:
- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.

The figure below is an example of a conceptual data model.

Conceptual Data Model

From the figure above, we can see that the only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.

A logical data model describes the data in as much detail as possible, without regard to how they will be physically implemented in the database. Features of a logical data model include:
- Includes all entities and relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity is specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.

The steps for designing the logical data model are as follows:
1. Specify primary keys for all entities.
2. Find the relationships between different entities.
3. Find all attributes for each entity.
4. Resolve many-to-many relationships.
5. Normalization.

The physical data model represents how the model will be built in the database. The steps for physical data model design are as follows:
1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.
4. Modify the physical data model based on physical constraints / requirements.

[Table: feature comparison of the conceptual, logical, and physical data models.]
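The logical-to-physical conversion described above (entities into tables, attributes into columns, relationships into foreign keys) can be sketched with a toy example. The entity and column names here are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

# Entity "Customer" becomes a table; its attributes become columns;
# its identifier becomes the primary key.
cur.execute("""CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL)""")

# The one-to-many relationship "Customer places Order" becomes a
# foreign key column on the child table.
cur.execute("""CREATE TABLE "order" (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount REAL)""")

cur.execute("INSERT INTO customer VALUES (1, 'Christina')")
cur.execute('INSERT INTO "order" VALUES (100, 1, 25.0)')

# The foreign key now enforces the relationship from the logical model:
# an order for an unknown customer is rejected by the database.
try:
    cur.execute('INSERT INTO "order" VALUES (101, 999, 10.0)')
    ok = False
except sqlite3.IntegrityError:
    ok = True
print(ok)  # True
```

A real physical model would go further than this sketch, adding indexes, data type tuning, and any denormalization required by physical constraints.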

Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, and would allow end users to slice and dice the data on the report without having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director.

Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report in the front end. There is no need to wait for a major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.

Time Requirement
1 - 2 weeks.

Deliverables
- Report Specification Documentation.
- Reports set up in the front end / reports delivered to user's preferred channel.

Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.
Performance Tuning

Task Description
There are three major areas where a data warehousing system can use a little performance tuning:

- ETL: Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system that has its ETL process finishing right on time is going to have a lot of problems, simply because often the jobs do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
- Query Processing: Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.
- Report Delivery: It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.

Time Requirement
3 - 5 days.

Deliverables

Performance tuning document - Goal and Result

Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible - performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.

Query Optimization

For any production database, SQL query performance becomes an issue sooner or later. Having long-running queries not only consumes system resources that make the server and application run slowly, but also may lead to table locking and data corruption issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer, and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.

2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to expend to process and store these data. So for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.

3. Store intermediate results

Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. For those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows.

The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.

Below are several specific query optimization strategies.

- Use Index: Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed separately.
- Aggregate Table: Pre-populate tables at higher levels so less data needs to be parsed.
- Vertical Partitioning: Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.
- Horizontal Partitioning: Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
- Denormalization: The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.
- Server Tuning: Each server has its own parameters, and often tuning server parameters so that the server can fully take advantage of the hardware resources can significantly speed up query performance.

Quality Assurance

Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client.
Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

Time Requirement
1 - 4 weeks.

Deliverables
- QA Test Plan
- QA verification that the data warehousing system is ready to go to production

Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.

Rollout To Production

Task Description

Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.

Time Requirement
1 - 3 days.

Deliverables
Delivery of the data warehousing system to the end users.

Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than to spend several months to develop and QA the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

Production Maintenance

Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.

Time Requirement
Ongoing.

Deliverables
Consistent availability of the data warehousing system to the end users.

Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance.
There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.

Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.

Incremental Enhancements

Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.

Deliverables
- Change management documentation
- Actual change to the data warehousing system

Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.
