
OLTP - On-Line Transaction Processing

The major task of an OLTP system is to perform on-line transaction and query processing.

OLTP covers most of the day-to-day operations of an organization, such as purchasing, inventory, banking, payroll, registration and accounting. OLTP is designed to get data in quickly and to record current events. It is characterized by:
- Process oriented
- Normalized data
- Current data
- Volatile data
- Updated in real time

Data Warehousing is a concept, not a technology. In layman's words: a data warehouse is a dump or collection of historical data from different source databases, formatted and stored in a common target in order to produce intelligent output.

According to Bill Inmon, known as the Father of Data Warehousing, a data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.

Subject Oriented
- Data is organized around the major subjects of the enterprise
- Applications are typically designed around processes/functions; a data warehouse has a subject/data-driven orientation
- It only includes data used for decision making
- Data warehouse data spans time and allows more complex relations

Integrated
- Data in the data warehouse is integrated so that it is referred to in only one way, unlike the many ways of legacy systems
- Data is in the same format
- Attributes are measured in the same units

Non-Volatile
- Data is not updated regularly on a record-by-record basis
- Data in the data warehouse is refreshed at certain intervals
- Data in the data warehouse can be accessed as and when required

Time Variant
- Data warehouse data is accurate as of some moment in time, not necessarily right now
- The data warehouse key always contains a unit of time (day, week etc.)
- Correctly recorded data warehouse data is not updated

A data warehouse is designed to get data out and analyze it quickly. It is mainly characterized by:
- Subject oriented rather than process oriented
- Integrated across subjects and the entire enterprise
- De-normalized data
- Time variant (historical)
- Non-volatile
- Summary data

Difference between Data Warehouse and OLTP

Data Warehouse                              OLTP
Works with enterprise-wide information      Works with small pieces of information
Large to very large database                Small to large database
Non-volatile data                           Volatile data
De-normalized data                          Normalized data
Fewer joins                                 More joins
Multidimensional data structures            Complex data structures
Updated on a schedule                       Updated in real time
Read queries                                Insert/update queries
Analyzes the business                       Runs the business
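As a minimal illustration of the contrast above, compare a typical OLTP statement with a typical warehouse query (table and column names here are hypothetical, not from the original material):

    -- OLTP: a short, real-time transaction recording a single order
    INSERT INTO orders (order_id, customer_id, product_id, quantity, order_date)
    VALUES (1001, 42, 7, 2, CURRENT_DATE);

    -- Data warehouse: a read-only analytical query scanning history
    SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date    d ON f.date_key    = d.date_key
    JOIN   dim_product p ON f.product_key = p.product_key
    GROUP  BY d.year, p.category;

The OLTP statement touches one normalized row and must complete immediately; the warehouse query joins a de-normalized star schema and aggregates across years of history.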

Drawbacks of Data Warehousing
- Handling large volumes of data
- Data warehouse solutions complicate business processes
- Data warehouse solutions may have too long a learning curve
- Cost factors: hiring professionals, licensing data warehousing tools, and cleaning, capturing and delivering data

Benefits of a Data Warehouse
- The ability to scale to large volumes of data and large numbers of concurrent users
- Consistent, fast query response times that allow for iterative, speed-of-thought analysis
- Integrated metadata that seamlessly links the OLAP server and the data warehouse relational database
- The ability to automatically drill from summary calculated data, which is managed by the OLAP server, to detail data stored in the data warehouse relational database
- A calculation engine that includes robust mathematical functions for computing derived data (aggregations, matrix calculations, cross-dimensional calculations, OLAP-aware formulas and procedural calculations)
- Seamless integration of historical, projected and derived data
- A multi-user read/write environment to support users' what-if analysis, modeling and planning requirements
- The ability to be deployed quickly, adopted easily and maintained cost-effectively
- Robust data-access security and user management
- Availability of a wide variety of viewing and analysis tools to support different user communities

Goals of Data Warehousing
- To provide a reliable, single, integrated source of key corporate information
- To give end users access to their data without relying on reports produced by the information systems department
- To allow analysts to analyze corporate data and even produce predictive "what if" models from that data
- The data warehouse is simply one component of modern reporting architectures; the real goal is decision support, or its modern equivalent Business Intelligence, to help people make better, more intelligent decisions

History of Data Warehousing

Data warehousing philosophy falls into two schools:
- William H. Inmon: known as the Father of the Data Warehouse; he started his work in the early 70s
- Ralph Kimball: builds on Bill Inmon's ideas but takes a different approach to building a data warehouse

Different Approaches to Building a Data Warehouse
- Top-down approach [William H. Inmon]: the enterprise has one data warehouse, and data marts source their information from the data warehouse. Information is stored in 3rd normal form.
- Bottom-up approach [Ralph Kimball]: the enterprise data warehouse sources its information from the data marts. Information is stored in a dimensional model.

Top-Down Approach Architecture

Bottom-Up Approach Architecture

Important Data Warehousing Terminology


Data warehouses most often use dimensional data modeling. Some frequently used terms:

- Dimension table: a category by which summarized data can be viewed, e.g. the Time dimension.
- Level: a unique level within a dimension table, e.g. Month is a level in the Time dimension.
- Hierarchy: the specification of levels that represents the relationships between the different levels within a dimension; e.g., one possible hierarchy in the Time dimension is Year-Quarter-Month-Day.
- Data mart: a focused subset of a data warehouse that is organized for quick analysis.
- Metadata: data about data; a description of what kind of information is stored where, how it is encoded, how it is related to other information, where it comes from and how it relates to the business.
- Surrogate key: a system-generated key that acts as the primary key in dimension tables. Surrogate keys are used:
  - To preserve the history of changes instead of updating in place
  - When there is a high possibility of restructuring the business keys
  - To increase join performance
- Fact table: a table that contains facts and foreign keys to the primary keys of the related dimension tables.
- Fact: a collection of related data items, consisting of measures and context data.
- Measure: a numeric attribute of a fact; e.g., a sales fact table contains Profit as a measure representing the profit on each sale.
- Aggregates: pre-calculated numeric data; aggregates are key to providing fast query performance.
- Cubes: data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, and querying and analytical capabilities to clients.

Types of Dimensions
- Causal dimension: a dimension which explains the existence of a record in the fact table; e.g., a transaction such as a hiring or termination warrants an additional row in the fact table.
- Degenerate dimension: a dimension key stored in the fact table that doesn't refer to any dimension table; it often serves as (part of) the primary key of the fact table.

- Conformed dimension: a dimension that is shared by more than one fact table.
- Hierarchical dimension: a dimension which has hierarchies, e.g. Geography.

Types of Dimension Tables
- Mini dimension: a group of attributes which change frequently, or which would be queried frequently.
- Junk dimension: a group of static attributes, such as random flags.
- Aggregate dimension: a group of aggregated attributes.

Types of Facts (illustrated in the sketch after this section)
- Additive: facts that can be summed up across all of the dimensions in the fact table.
- Semi-additive: facts that can be summed up across some of the dimensions in the fact table, but not the others.
- Non-additive: facts that cannot be summed up across any of the dimensions in the fact table.
- Factless fact table: a fact table that has no additive or semi-additive measures; it contains only foreign key values from the dimension tables.

Data warehousing is done in two stages:
- ETL: data is pulled from different source databases into a common area called the staging area, and then transformed into warehouse tables in the desired way and format.
- OLAP: the warehouse data is processed to generate cubes and to provide a multidimensional view, either to generate reports or to analyze or mine the data.

Why data modeling?
- Helps in managing complex data relationships
- A data model helps keep track of a complex environment like a data warehouse
- Many complex relationships exist, with the ability to change over time
- Transformations and integrations from various systems of record need to be worked out and maintained
- Provides the means of supplying users with a road map through the data and relationships
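To make the additive/semi-additive distinction from the fact types above concrete, here is a small SQL sketch. The fact_balance table and all column names are hypothetical:

    -- Additive fact: sales_amount can be summed across any dimension
    SELECT store_key, SUM(sales_amount) AS total_sales
    FROM   fact_sales
    GROUP  BY store_key;

    -- Semi-additive fact: an account balance sums correctly across accounts for one day...
    SELECT date_key, SUM(balance) AS total_balance
    FROM   fact_balance
    GROUP  BY date_key;

    -- ...but summing it across time double-counts; average (or take period-end) instead
    SELECT account_key, AVG(balance) AS avg_balance
    FROM   fact_balance
    GROUP  BY account_key;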

Steps in Data Modeling

Step 1: Conceptual Data Model
- Includes the important entities and the relationships among them
- No attributes are specified
- No primary keys are specified
- At this level, the data modeler attempts to identify the highest-level relationships among the different entities

Step 2: Logical Data Model
- Includes all entities and the relationships among them
- All attributes for each entity are specified
- The primary key for each entity is specified
- Foreign keys are specified
- Normalization occurs at this level
- At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how it will be physically implemented in the database

Step 3: Physical Data Model
- Specification of all tables and columns
- Foreign keys are used to identify relationships between tables
- De-normalization may occur based on user requirements
- Physical considerations may cause the physical data model to be quite different from the logical data model
- At this level, the data modeler specifies how the logical data model will be realized in the database schema

Available schemas:
- Star schema: consists of a central fact table surrounded by de-normalized dimensions
- Snowflake schema: consists of a central fact table surrounded by partly or completely normalized dimensions
- Constellation schema: consists of more than one fact table surrounded by de-normalized dimensions

Star Schema

Features of the star schema:
- Dimension tables are separated from the fact table
- De-normalized dimension tables
- Attribute information stored within the dimension table
- De-normalized fact table
- Each dimension table has a key in the fact table

Typical Star Schema
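A minimal DDL sketch of such a star schema (table and column names are illustrative, not taken from the original figure):

    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,  -- surrogate key
        calendar_date DATE,
        month         VARCHAR(10),
        quarter       VARCHAR(2),
        year          INTEGER
    );

    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,  -- surrogate key
        product_name  VARCHAR(100),
        category      VARCHAR(50)           -- de-normalized: attribute stored in-line
    );

    CREATE TABLE fact_sales (
        date_key      INTEGER REFERENCES dim_date (date_key),
        product_key   INTEGER REFERENCES dim_product (product_key),
        sales_amount  NUMERIC(12,2),        -- additive measure
        units_sold    INTEGER               -- additive measure
    );

Each dimension key in fact_sales points to exactly one flat dimension table, which is what gives the schema its star shape.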

Snowflake Schema

Features of the snowflake schema:
- Dimensions are separated from facts
- At least one normalized dimension table
- Attribute information of normalized dimension tables stored in outrigger tables
- Attribute information of de-normalized dimension tables stored in the dimension tables
- De-normalized fact table
- Each dimension table has a key into the fact table

Typical Snowflake Schema
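A sketch of how the hypothetical dim_product from the star example would be snowflaked, with the category attribute normalized into an outrigger table:

    -- Outrigger table: category attributes moved out of dim_product
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(100),
        category_key  INTEGER REFERENCES dim_category (category_key)  -- normalized out
    );

Queries now need one extra join (dim_product to dim_category) compared with the star version; the trade-off is less redundancy in the dimension.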

Constellation Schema

Features of the constellation schema:
- Dimension tables are separated from the fact tables
- At least one normalized dimension table (usually the largest)
- Attributes of normalized dimension tables stored in outrigger tables
- Attributes of de-normalized dimension tables stored in the dimension tables
- Normalized fact tables
- Each dimension table is keyed into one or more fact tables

Typical Constellation Schema

What is ETL?

ETL - Extraction, Transformation, Loading
- Extraction of data from different source databases
- Transformation of the extracted data into a desired common format
- Loading of the transformed data into staging/warehouse tables

ETL Architecture

Why do we need ETL?
- Migrates data from one database to another (or within the same database)
- Cleanses the data
- Eliminates duplicates
- Organizes the data so that data handling is easier
- Reformats the data for the target repository
- Captures data changes
- Maintains historical data

How to implement ETL? (a PL/SQL sketch follows this list)

Using PL/SQL scripts:
- Coding is tedious and cumbersome
- Needs more resources for coding
- Difficult to implement and takes more time to implement
- Needs no additional cost to implement
- Data retrieval is faster

Using data warehouse tools:
- ETL tools are user friendly and easy to handle
- Need fewer resources to implement
- Easy to implement and take less time to implement
- Require a licensed copy of the tool
- Data retrieval is slower compared to PL/SQL scripting
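As a rough illustration of the PL/SQL route (not a production script; the staging table, dimension table, sequence and cleansing rules are all hypothetical):

    BEGIN
       -- Extract from staging, transform, and load new customers into the dimension
       INSERT INTO dim_customer (customer_key, customer_name, city)
       SELECT customer_seq.NEXTVAL,          -- system-generated surrogate key
              UPPER(TRIM(s.cust_name)),      -- cleanse: trim spaces, standardize case
              s.city
       FROM   stg_customer s
       WHERE  NOT EXISTS (                   -- eliminate duplicates already loaded
                 SELECT 1
                 FROM   dim_customer d
                 WHERE  d.customer_name = UPPER(TRIM(s.cust_name)));
       COMMIT;
    END;
    /

Every load, comparison and cleansing rule has to be hand-coded this way, which is the tedium referred to above; an ETL tool generates the equivalent logic from a graphical mapping.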

Available ETL Tools in the Market

Some of the ETL tools in the market:
- Informatica PowerCenter
- Ab Initio
- Ascential DataStage
- Oracle Warehouse Builder
- BusinessObjects Data Integrator
- Cognos DecisionStream
- Microsoft DTS
- Pervasive Data Junction
- Hummingbird Genio

How to capture data changes?

Data changes in the dimension tables can be captured using the following methods:
- Method 1: Simple Pass-Through Dimension - truncate and load the data
- Method 2: Slowly Changing Dimension - new data is inserted as a new row, and changed data is either updated or inserted as a new row or new column, depending on the type of SCD
- Method 3: Slowly Growing Target Dimension - new and changed data is inserted as new rows

Method 1 - Simple Pass-Through Dimension:
- Used to load current data after truncating the target table
- Doesn't filter out the existing rows; loads all the source rows
- Data flow for all existing rows in the source

Method 2 - Slowly Changing Dimension [SCD]: history can be maintained in three different types (see the SQL sketches below):
- Type 1: changed data overwrites the existing data; history is not maintained
- Type 2 - Flag: changed data is added as a new row; history is maintained using a flag
- Type 2 - Version: changed data is added as a new row; history is maintained using version numbers
- Type 2 - Time variant: changed data is added as a new row; history is maintained using timestamps
- Type 3: changed data is added as a new column; history is restricted to the previous and current data
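A minimal SCD Type 1 sketch in SQL, using MERGE to overwrite changed attributes in place (tables, keys and the compared column are hypothetical):

    MERGE INTO dim_customer d
    USING stg_customer s
    ON (d.customer_id = s.customer_id)        -- match on the business key
    WHEN MATCHED THEN
       UPDATE SET d.city = s.city             -- overwrite: no history kept
    WHEN NOT MATCHED THEN
       INSERT (customer_key, customer_id, city)
       VALUES (customer_seq.NEXTVAL, s.customer_id, s.city);  -- brand-new row

After this runs, the dimension holds only the latest city for each customer; the previous value is gone, which is exactly the Type 1 trade-off.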

SCD Type 1
- Used to maintain current data without a historical log
- Filters source rows based on user-defined comparisons and updates only those found to contain new data
- Data flow for new and changed data: for each changed row in the source, the flow marks the row for update and overwrites the corresponding row in the target
- Updates changed data in the target, overwriting the existing data

SCD Type 2 - Flag
- Used to maintain the full history of data in the dimension table, with the most current data flagged
- Filters source rows based on user-defined comparisons and inserts changed data into the target
- The designer uses two instances of the same target definition so that two separate data flows can write to the same target table; only one target table is generated in the target database
- Data flow for changed data: increments the existing primary key by 1, sets the current flag to 1 for changed data, and inserts it into the target
- Data flow to update existing rows: updates the corresponding target row for each changed source row, resetting the current flag to 0 to indicate the row is no longer current

SCD Type 2 - Version
- Used to maintain the full history of data in the dimension table
- Filters source rows based on user-defined comparisons and inserts changed data into the target
- The current version of a record has the highest version number
- Data flow for changed data: increments the primary key and version number for changed rows, and inserts the changed data into the target

SCD Type 2 - Time Variant
- Used to maintain the full history of data in the dimension table
- An effective date range tracks the chronological history of changes for each dimension record
- Filters source rows based on user-defined comparisons and inserts changed data into the target
- The current data has a begin date with no corresponding end date
- The designer uses two instances of the same target definition so that two separate data flows can write to the same target table; only one target table is generated in the target database
- Data flow for changed data: increments the existing primary key by 1, generates the beginning of the effective date range for changed rows, and inserts them into the target
- Data flow to update existing rows: updates the existing row of changed data in the target and generates the end of the effective date range

SCD Type 3
- Used to maintain only the current and previous versions of changed data in the dimension table
- Filters source rows based on user-defined comparisons and inserts only those found to contain new data into the target
- Rows containing changes to existing data are updated in the target; while updating, the existing data is saved into a different column of the same row and replaced with the new data
- Data flow to update existing rows: writes the previous value of each changed row into the "previous" columns, replaces the previous values with the updated values, and updates the changed data in the target

Method 3 - Slowly Growing Target Dimension:
- Used to maintain current and historical data
- Filters source data based on user-defined comparisons and updates only those found to contain new data
- Data flow for new and updated data: for each changed row in the source, the flow inserts a new row into the target
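A sketch of the Type 2 time-variant pattern in SQL: the first flow closes out the current row, the second inserts the new version with an open-ended date range (all names are hypothetical; a real mapping would also handle brand-new business keys):

    -- Flow 1: update existing rows - end-date the current version of changed customers
    UPDATE dim_customer d
    SET    d.end_date = CURRENT_DATE
    WHERE  d.end_date IS NULL                      -- current row: no end date yet
    AND    EXISTS (SELECT 1
                   FROM   stg_customer s
                   WHERE  s.customer_id = d.customer_id
                   AND    s.city <> d.city);       -- user-defined comparison

    -- Flow 2: changed data - insert the new version with an open date range
    INSERT INTO dim_customer (customer_key, customer_id, city, begin_date, end_date)
    SELECT customer_seq.NEXTVAL, s.customer_id, s.city, CURRENT_DATE, NULL
    FROM   stg_customer s
    JOIN   dim_customer d
      ON   d.customer_id = s.customer_id
     AND   d.end_date    = CURRENT_DATE            -- the rows just expired in flow 1
    WHERE  s.city <> d.city;

The row with a NULL end date (or, in the flag variant, current_flag = 1) is always the current version.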

What is OLAP?

OLAP stands for On-Line Analytical Processing. It processes the warehouse data to generate cubes and provide a multidimensional view for generating reports. OLAP tools have the ability to rapidly analyze multiple simultaneous factors, something that relational databases alone can't do.

What is Data Mining?

Data mining is the running of automated routines that search through data organized in a warehouse, looking for patterns in order to point us to areas we should be addressing. Data mining deals with five kinds of analysis:
- Associations (things done together)
- Sequences (events over time)
- Classifications (pattern recognition)
- Clusters (defining new groups)
- Forecasting (predictions from time series)

Why is it required?
- Makes the organization's information easily accessible
- Presents information consistently
- Adaptive and resilient to change
- A secure bastion that protects information
- A foundation for improved decision making
- Must be accepted by the business community to be deemed successful

Types of OLAP
- MOLAP - Multidimensional OLAP: the traditional way of OLAP analysis. The data is stored in a multidimensional cube; the storage is not in a relational database but in proprietary formats.
- ROLAP - Relational OLAP

  - This methodology relies on manipulating data stored in a relational database
  - Gives the appearance of traditional OLAP's slicing and dicing functionality
  - Each slicing-and-dicing action is equivalent to adding a WHERE clause to the SQL statement, as in the sketch below
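A sketch of what that means in practice, reusing the hypothetical star tables from earlier plus an assumed dim_store dimension:

    -- "Slice" the cube to one year, then "dice" it by region and quarter:
    -- in ROLAP each narrowing step is just another WHERE predicate
    SELECT d.quarter, st.region, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date  d  ON f.date_key  = d.date_key
    JOIN   dim_store st ON f.store_key = st.store_key
    WHERE  d.year = 2023                    -- slice
    AND    st.region IN ('East', 'West')    -- dice
    GROUP  BY d.quarter, st.region;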

- HOLAP - Hybrid OLAP: attempts to combine the MOLAP and ROLAP technologies
- DOLAP - Desktop/Database OLAP: provides multidimensional analysis locally on the client machine, using data collected from relational or multidimensional database servers

How to implement? OLAP can be implemented using some of these user-friendly tools:
- For MOLAP: Cognos
- For ROLAP: Business Objects or MicroStrategy
- For HOLAP: Relational Access Manager
- For DOLAP: Business Objects

Available OLAP Tools in the Market

Some of the OLAP tools in the market:
- Business Objects - BusinessObjects
- Business Objects - Crystal Reports
- Cognos - Cognos
- Hyperion - Hyperion
- MicroStrategy - MicroStrategy
- Microsoft - Microsoft Reporting Services
- BRIO
- Relational Access Manager
