
Business Intelligence, Data Warehousing & ETL Concepts

Business Intelligence

How intelligent can you make your business processes?

What insight can you gain into your business?

How integrated can your business processes be?

How much more interactive can your business be with customers, partners,
employees and managers?

What is Business Intelligence (BI)?

Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.

Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.

An alternative way of describing BI: the technology required to turn raw data into information to support decision-making within corporations and business processes.
Why BI?

BI technologies bring decision-makers data in a form they can quickly digest and apply to their decision-making.

BI turns data into information for managers, executives, and, in general, people making decisions in a company.

Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.
Benefits
The benefits of a well-planned BI implementation are closely tied to the business objectives driving the project:

Identify trends and anomalies in business operations more quickly, allowing for more accurate and timely decisions.

Deliver actionable insight and information to the right place with less effort.

Identify and operate based on a single version of the truth, allowing all analysis to be completed on a common foundation with confidence.
Business Intelligence Platform Requirements

Data Warehouse Databases

OLAP

Data Mining

Interfaces

Build and Manage Capabilities

The business intelligence platform should provide good integration across these
technologies. It should be a coherent platform, not a set of diverse and
heterogeneous technologies.

Business Intelligence Components

[Diagram: operational data is extracted, transformed, and loaded into the data warehouse, which in turn feeds OLAP and data mining tools.]
Business Intelligence Architecture

Business Intelligence Technologies

[Diagram: a pyramid of BI technologies with increasing potential to support business decisions. From bottom to top: data sources such as paper, files, information providers, database systems, and OLTP (DB admin); data warehouses / data marts (DB admin); data exploration via OLAP, DSS, EIS, querying and reporting (data analyst); data mining for information discovery (data analyst); data presentation and visualization techniques (business analyst); end-user decision making.]
Data Warehousing
What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived
from transaction data.

A data warehouse environment includes an extraction, transportation,


transformation, and loading (ETL) solution, online analytical processing (OLAP)
and data mining capabilities, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.

It is a series of processes, procedures, and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers, and the market it serves.
Facts!

A Data Warehouse is NOT a specific technology. It is NOT possible to purchase a Data Warehouse, but it is possible to build one.
Why Data Warehousing?

Need for Intelligent Information in a Competitive Market

Who are the potential customers?
Which products are sold the most?
What will be the impact on revenue?
What are the results of promotion schemes introduced?
What are the region-wise preferences?
What are the competitor products?
What are the projected sales?
What if you sell more quantity of a particular product?
Defining Data warehouse

“A Data Warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.”

William H. Inmon
Subject Oriented
 The data in a data warehouse is organized around the major subjects of the enterprise (i.e., the high-level entities), such as Customer, Product, and Supplier.

 The orientation around the major subject areas causes the data warehouse design to be data driven.

 Operational systems are designed around applications and functions, e.g., loans, savings, and credit cards in the case of a bank, whereas the data warehouse is designed around subjects such as Customer, Product, or Vendor.

[Diagram: operational systems organized by processes or tasks vs. the data warehouse organized by subject.]
Time Variant

Data is stored as a series of snapshots or views which record how it was collected across time.

[Diagram: each data warehouse record carries a time element as part of its key.]

 It helps in business trend analysis.
 In contrast to the OLTP environment, the data warehouse focuses on change over time; that is what is meant by time variant.
Integrated
Data is stored once in a single integrated location.

[Diagram: customer data stored in several databases (auto policy processing system, fire policy processing system, FACTS, LIFE, commercial and accounting applications) is consolidated in the data warehouse under the subject Customer.]

 Integration is closely related to subject orientation.
 Data from disparate sources needs to be put in a consistent format.
 This requires resolving problems such as naming conflicts and inconsistencies.
Non-Volatile

Existing data in the warehouse is not overwritten or updated.

[Diagram: production applications (with update, insert, and delete operations) and external sources feed the data warehouse environment, where data is loaded once and then accessed read-only.]

 This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.
So, what’s different between OLTP and a Data Warehouse?
OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, while the workload of a data warehouse is not known in advance.

Special data organization, access methods, and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9 AM and 5 PM in Pune during the month of December.

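A multidimensional query like the one above can be sketched against a simple fact table. This is a minimal illustration using Python's built-in sqlite3; the table and column names (call_facts, city, call_month, call_hour, amount) are hypothetical.

```python
import sqlite3

# Hypothetical fact table of phone-call charges, one row per call.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE call_facts (
    city TEXT, call_month INTEGER, call_hour INTEGER, amount REAL)""")
conn.executemany(
    "INSERT INTO call_facts VALUES (?, ?, ?, ?)",
    [("Pune", 12, 10, 40.0),    # 10 AM call in December
     ("Pune", 12, 16, 60.0),    # 4 PM call in December
     ("Pune", 12, 20, 99.0),    # evening call: outside 9 AM - 5 PM
     ("Mumbai", 12, 11, 75.0)]) # different city

# Average amount spent on calls between 9 AM and 5 PM in Pune in December.
row = conn.execute("""
    SELECT AVG(amount) FROM call_facts
    WHERE city = 'Pune' AND call_month = 12
      AND call_hour BETWEEN 9 AND 16
""").fetchone()
print(row[0])  # average over the two qualifying calls
```

The query constrains three dimensions (location, time of year, time of day) and aggregates one measure, which is the typical shape of a warehouse query.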
OLTP vs. Data Warehouse

OLTP                      WAREHOUSE (DSS)
Application oriented      Subject oriented
Used to run business      Used to analyze business
Detailed data             Summarized and refined
Current, up to date       Snapshot data
Isolated data             Integrated data
Repetitive access         Ad-hoc access
Clerical user             Knowledge user (manager)
OLTP vs Data Warehouse

OLTP                                     DATA WAREHOUSE
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - a few terabytes
OLTP vs Data Warehouse

OLTP                                              Data Warehouse
Transaction throughput is the performance metric  Query throughput is the performance metric
Thousands of users                                Hundreds of users
Managed in entirety                               Managed by subsets
To summarize ...

OLTP systems are used to “run” a business.

The Data Warehouse helps to “optimize” the business.
Data Warehouse Architectures
 Centralized

In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. The disadvantage is a loss of performance compared with distributed approaches.

[Diagram: central architecture.]
Data Warehouse Architectures Contd…
 Federated

In a federated architecture, the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the relevant information for a department. The amount of data is reduced in contrast to a central data warehouse, while the level of detail is enhanced.

[Diagram: federated architecture.]
Data Warehouse Architectures Contd…
Tiered:

A tiered architecture is a distributed data approach. The process cannot be done in one step, because many sources have to be integrated into the warehouse. On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.

Advantages:
 Faster response time, because the data is located closer to the client applications.
 Reduced volume of data to be searched.

[Diagram: tiered architecture.]
Complete Warehouse Solution Architecture

[Diagram: data flows from data sources (legacy data, operational data, and external sources such as purchase and VISA data) through extract, transform, and load into the organizationally structured enterprise data warehouse with its metadata, and on to departmentally structured data marts (sales, inventory, and so on) for access. The flow moves from data to information to knowledge, from asset assembly (and management) to asset exploitation.]
Data Warehouse Architecture Components

Data Sources: legacy data, operational data, and external data resources, i.e., disparate data sources.

Data Management:
Metadata - at all levels of the data warehouse, information is required to support the maintenance and use of the Data Warehouse.
Data Mart - a data mart is a subject-oriented data warehouse.
Introduction To Data Marts

What is a Data Mart

From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. The departments can also use whatever analytical software they find convenient. The cost of processing becomes very low.

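The periodic extract-and-load described above can be sketched as follows. This is a minimal sketch, assuming a warehouse table of atomic sales rows and a departmental mart that keeps only a lightly summarized subset; all table and column names are hypothetical.

```python
import sqlite3

# Hypothetical enterprise warehouse with atomic sales rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, units INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "Widget", 10), ("East", "Widget", 5),
    ("West", "Widget", 7), ("East", "Gadget", 3)])

# Periodic refresh: extract a lightly summarized subset for the
# East-region department's local data mart.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE east_sales (product TEXT, total_units INTEGER)")
rows = warehouse.execute("""
    SELECT product, SUM(units) FROM sales
    WHERE region = 'East' GROUP BY product""").fetchall()
mart.executemany("INSERT INTO east_sales VALUES (?, ?)", rows)

print(sorted(mart.execute("SELECT * FROM east_sales").fetchall()))
# Gadget 3, Widget 15
```

The mart holds a coarser grain than the warehouse (totals per product rather than individual sales), which is what "lightly summarized" means in practice.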
Data Mart Overview

[Diagram: the data warehouse feeds data marts for Sales, Marketing, HR, and Finance. Sales representatives and analysts use DM Sales; human resources uses DM HR; financial analysts, strategic planners, and executives use DM Finance and DM Marketing. Data marts satisfy 80% of the local end-users’ requests.]
From The Data Warehouse To Data Marts

[Diagram: data flows from the organizationally structured data warehouse (normalized, detailed, more history) to departmentally structured and individually structured data marts (less history), turning data into information.]
Operational Data Store (ODS)
What is an ODS
An Operational Data Store (ODS) integrates data from multiple business operation
sources to address operational problems that span one or more business functions.
An ODS has the following features:

Subject-oriented — Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).

Integrated — Presents an integrated image of subject-oriented data pulled from fragmented operational source systems.

Current — Contains a snapshot of the current content of legacy source systems. History is not kept in the ODS; it might be moved to the data warehouse for analysis.

Volatile — Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.

Detailed — ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.
Operational Data Store (ODS) Contd…

The ODS provides an integrated view of data in operational systems. As the figure below indicates, there is a clear separation between the ODS and the data warehouse.

[Diagram: operational systems A, B, and C feed the Operational Data Store, which feeds the Data Warehouse; EIS, DSS, and PC applications sit on top.]

Operational Data Store           Data Warehouse
Current or near-current data     Historical data
Detailed data                    Summary and detail
Updates allowed                  Non-volatile snapshots only
Benefits Of ODS

Supports operational reporting needs of the organization.

Provides a complete view of customer relationships, the data for which might be stored in several operational databases; this data can include data from an organization’s internal systems, as well as external data from third-party vendors.

Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.

Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.

Provides data that is more current than in a data warehouse and more integrated than in an OLTP system.

Feeds other operational systems in addition to the data warehouse.
Definition Of Data Warehouse

Ralph Kimball's paradigm: the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

Bill Inmon's paradigm: the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.
Basic Design Approaches of Data Warehouse

There are two major approaches to building or designing the Data Warehouse:

The Top-Down Approach
The Bottom-Up Approach
The Top Down Approach

The Dependent Data Mart structure, or Hub & Spoke: Inmon advocated a “dependent data mart structure.”

The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS).

Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, then extracted and loaded into the data warehouse.

Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This organizes the data into the particular structures required by the data marts. The data marts can then be loaded, and the OLAP environment becomes available to the users.
The Top Down Approach Contd…

Inmon Approach

The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of the particular department for which it is created.
The Bottom-Up Approach

The Data Warehouse Bus Structure: Ralph Kimball designed the data warehouse with the data marts connected to it by a bus structure.

The bus structure contains all the common elements used by data marts, such as conformed dimensions and measures, defined for the enterprise as a whole.

This architecture makes the data warehouse more of a virtual reality than a physical reality: all data marts could be located on one server or on different servers across the enterprise, while the data warehouse is a virtual entity, nothing more than the sum total of all the data marts.

In this context, even the cubes constructed using OLAP tools can be considered data marts.
The Bottom-Up Approach Contd…

Kimball Approach

• The bottom-up approach reverses the positions of the data warehouse and the data marts: data marts are directly loaded with data from the operational systems through the staging area.

• The data flow in the bottom-up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.
The Bottom-Up Approach Contd…

The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized, and so on, loaded into the data warehouse, and made available to the end user for analysis.
Modeling Fundamentals:
What is Data Model ?

 A data model is a conceptual representation of the data structures (tables) required for a database, and is very powerful in expressing and communicating the business requirements. A data model is an abstract model that describes how data is represented and used.

 The term data model has two generally accepted meanings:
 A data model theory, i.e., a formal description of how data may be structured and used.
 A data model instance, i.e., the result of applying a data model theory to create a practical data model for some particular application.
Modeling Fundamentals:
What is Data Modeling ?

 A technique aimed at optimizing the way that information is stored and used within an organization. It begins with the identification of the main data groups, for example the invoice, and continues by defining the detailed content of each of these groups. This results in structured definitions for all of the information that is stored and used within a given system.

 Data modeling is an essential precursor to analysis & design, maintenance & documentation, and improving the performance of an existing system.

 It is the process of creating a data model by applying a data model theory to create a data model instance.
Modeling Fundamentals:
Types OF Data Modeling
Logical Data Model (LDM) - A logical design is conceptual and abstract. The process of logical design involves arranging data into a series of logical relationships called entities and attributes.
 A logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.

[Diagram: logical data model.]


Modeling Fundamentals:
Types OF Data Modeling
Physical Data Model (PDM) - A physical data model is a representation of a data design which takes into account the facilities and constraints of a given database management system.
 A complete physical data model will include all the database artifacts required to create relationships between tables or to achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables, or clusters.

[Diagram: physical data model.]
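As a sketch, the physical artifacts listed above (tables, keys, constraints, indexes) might look like the following DDL, here executed through Python's built-in sqlite3; the customer/orders schema is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A physical model spells out tables, keys, constraints, and indexes.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- primary key column
    email       TEXT NOT NULL UNIQUE,  -- alternate key -> unique constraint
    status      TEXT CHECK (status IN ('active', 'closed'))  -- rule -> check
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- relationship -> FK
    amount      REAL DEFAULT 0.0       -- rule -> default value
);
CREATE INDEX ix_orders_customer ON orders(customer_id);  -- non-unique index
""")
conn.execute("INSERT INTO customer VALUES (1, 'a@example.com', 'active')")
conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (10, 1)")
print(conn.execute("SELECT amount FROM orders").fetchone()[0])  # default applied: 0.0
```

Each line of DDL maps back to one of the logical-to-physical translations shown in the comparison that follows.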


Modeling Fundamentals:
Types OF Data Modeling

LOGICAL DATA MODEL                  PHYSICAL DATA MODEL
Represents business information     Represents the physical implementation
and defines business rules          of the model in the database
Entity                              Table
Attribute                           Column
Primary Key                         Primary Key Column
Alternate Key                       Unique Constraint or Unique Index
Inversion Key Entry                 Non-Unique Index
Rule                                Check Constraint, Default Value
Relationship                        Foreign Key
Definition                          Comment
Modeling Fundamentals:
Types OF Data Modeling
Entity Relationship Diagram (ERD) – A data model utilizing several notations to depict data in terms of the entities and relationships described by that data.
 Databases are used to store structured data. The structure of this data, together with other constraints, can be designed using a variety of techniques, one of which is called entity-relationship modeling, or ERM.

[Diagram: example ERD.]
Modeling Fundamentals:
Types OF Data Modeling

Important Terminology –
 Entity – The principal data object about which information is to be collected: a class of persons, places, objects, events, or concepts about which we need to capture and store data.
•Persons: agency, contractor, customer, department, division, employee, instructor, student, supplier.
•Places: sales region, building, room, branch office, campus.
•Objects: book, machine, part, product, raw material, software license, software package, tool, vehicle model, vehicle.
•Events: application, award, cancellation, class, flight, invoice, order, registration, renewal, requisition, reservation, sale, trip.
•Concepts: account, block of time, bond, course, fund, qualification, stock.
Modeling Fundamentals:
Types OF Data Modeling

 Relationship – A natural business association that exists between one or more entities. The relationship may represent an event that links the entities, or merely a logical affinity that exists between the entities.

 Examples of relationships:
• Employees are assigned to projects
• A student enrolls in a curriculum
• Projects have subtasks
• Departments manage one or more projects

[Diagram: STUDENT "is enrolled in" CURRICULUM; CURRICULUM "is being studied by" STUDENT.]
Modeling Fundamentals:
Types OF Data Modeling

 Cardinality – The cardinality of a relationship is the actual number of related occurrences for each of the two entities. The basic types of connectivity for relationships are one-to-one, one-to-many, and many-to-many: the minimum and maximum number of occurrences of one entity that may be related to a single occurrence of the other entity. Because all relationships are bidirectional, cardinality must be defined in both directions for every relationship.

[Diagram: the bidirectional Student-Curriculum relationship, read as "is enrolled in" in one direction and "is being studied by" in the other.]
Modeling Fundamentals:
Types OF Data Modeling
Dimensional Data Modeling (DDM) - Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse.
 It is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It adheres to a discipline that uses the relational model with some important restrictions.
 Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables.

Components of a dimensional model:
 Fact table
 Dimension tables
 Attributes

 Good examples of dimensions are location, product, time, promotion, and organization. Dimension tables store records related to that particular dimension; no facts (measures) are stored in these tables.

 A fact (measure) table contains measures (sales gross value, total units sold) and dimension columns. These dimension columns are actually foreign keys to the respective dimension tables.
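The fact-and-dimension split described above can be sketched in DDL. This is a minimal illustration using Python's built-in sqlite3; the table names (fact_sales, dim_product, dim_time) and columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context, single-part primary keys.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, month TEXT);

-- Fact table: measures plus foreign keys to each dimension; the foreign
-- keys together form the multi-part primary key.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    units_sold  INTEGER,
    gross_value REAL,
    PRIMARY KEY (product_id, time_id)
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO dim_time VALUES (1, 'Jan')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20, 500.0)")

# A query joins the fact table to its dimensions for context.
row = conn.execute("""
    SELECT p.product_name, t.month, f.units_sold
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time t    ON f.time_id = t.time_id
""").fetchone()
print(row)  # ('Widget', 'Jan', 20)
```

Note how the measures live only in the fact table, while all descriptive text lives in the dimensions.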
Types OF Data Modeling
Why Dimensional Modeling?
Early Days

[Diagram: a single fat record with many fields, later separated into distinct normalized tables.]

 Separating the redundant data into distinct tables removed inconsistencies in the data and improved updating of the transaction data.

 Our software systems for retrieving and manipulating the data became complex and inefficient, because they required careful attention to the processing algorithms for linking these sets of tables together.

 We needed a database system that was very good at linking tables. This paved the way for the relational database revolution, where the database was devoted to just this task.
Types OF Data Modeling
Why Dimensional Modeling?

The success of transaction processing in relational databases is largely due to the discipline of ER modeling.

“We have created databases that cannot be queried!”
Ralph Kimball
Types OF Data Modeling
Why Dimensional Modeling?

 End users cannot understand or navigate ER models.

 Software cannot usefully query an ER model.

 Use of ER modeling techniques defeats intuitive and high-performance retrieval of data.

When the designer places understandability and performance as the highest goals, dimensional modeling is the natural approach.
Transaction vs. Query Environments:
Design Goals
Transaction Environments:
 Get data in fast
 Organize data by transaction flow
 Expect high volumes of inserts, updates, and deletes
 Allow retrieval of specific information quickly
 Provide transaction-level access
 Stability over time

Query Environments:
• Get data out fast
• Organize data by business analysis
• Expect high volumes of complicated queries
• Provide reasonable performance for a variety of information requests
• Provide a multidimensional data view
• Ease of use

[Diagram: transaction databases feed a data refinery.]
What is a Star Schema ?

Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. This characteristic "star-like" structure is often called a star schema.

[Diagram: star schema.]
What is a Star Schema ?

 The star schema model is essentially a method to store data which is multi-dimensional in nature in a relational database. It consists of a single "fact table" with a compound primary key, with one segment for each "dimension," and with additional columns of additive, numeric facts.

[Diagram: a central SALES fact table surrounded by the Channel, Time, Organization, Customer, and Product dimensions.]

The star schema makes multi-dimensional database (MDDB) functionality possible using a traditional relational database.
Fact Tables

 A fact table, because it has a multi-part primary key made up of two or more foreign keys, always expresses a many-to-many relationship.
 The most useful fact tables also contain one or more numerical measures, or "facts," that occur for the combination of keys that define each record.
 The most useful facts in a fact table are numeric. Numeric additivity is crucial because data warehouse applications rarely retrieve a single fact table record. Rather, they retrieve hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up.
Fact Tables – Additivity of Measures
 Characteristics of a measure: fully additive, semi-additive, or non-additive.

Fully Additive – It is meaningful to summarize the measure by adding values together across any dimension.
Example – Sales_Dollar; we can add Sales_Dollar values together across all dates in a particular month.

Non-Additive – These measures cannot be added together across any dimension.
Example – Margin expressed as a percentage of sales. On a particular day, a salesperson sells a customer 4 different products, each at a 25% margin rate. Can we add all these and say that the margin rate for the customer for that day is 100%?

Semi-Additive – These measures can be summarized across some dimensions, but not all.
Example – Bank account balances at the end of the day are fully additive across accounts. But if we add balances across different days, will the result be meaningful?
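The three additivity classes above can be sketched with a few made-up numbers (all values here are hypothetical):

```python
# Fully additive: sales dollars sum meaningfully across any dimension.
daily_sales = [100.0, 250.0, 150.0]
month_sales = sum(daily_sales)            # 500.0: a meaningful monthly total

# Non-additive: margin percentages must be recomputed, not summed.
margins = [0.25, 0.25, 0.25, 0.25]        # four products at a 25% margin
wrong = sum(margins)                      # 1.0 -> "100% margin" is nonsense
sales  = [200.0, 100.0, 100.0, 100.0]
profit = [s * m for s, m in zip(sales, margins)]
right = sum(profit) / sum(sales)          # 0.25: recompute from the parts

# Semi-additive: balances add across accounts, but not across days.
balances_today = {"acct_a": 500.0, "acct_b": 300.0}
total_today = sum(balances_today.values())  # 800.0: meaningful
balances_by_day = [800.0, 800.0]            # the same money on two days
# sum(balances_by_day) would double-count; average or take a snapshot instead.

print(month_sales, wrong, right, total_today)
```

The practical rule: store the additive components (profit and sales) in the fact table and derive ratios at query time.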
Defining Fact Table Structure

[Diagram: the Fact Item Day Store table with key columns ITEM_ID, WEEK_ID, and STORE_ID (foreign keys to the Item, Week, and Store dimensions) and fact columns SALES_DOLLARS and SALES_UNITS.]
What is a Dimension?

Recall that a data warehouse is subject-oriented, integrated, time-variant, and non-volatile; each subject maps to a dimension.

In a dimensional model, the context of the measurements is represented in dimension tables.

The dimension attributes are the various columns in a dimension table.
Dimensional Hierarchy
 A dimensional hierarchy expresses the one-to-many relationships between attributes.

[Diagram: a time dimension hierarchy of Year → Quarter → Month → Date, with attributes such as sequence, current flag, and day of week at the date level.]
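A hierarchy like Year → Quarter → Month lets the same facts be rolled up at different levels. A minimal sketch using sqlite3; the dim_date/fact_sales schema and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A time dimension carrying the hierarchy as attributes of each date row.
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE fact_sales (date_id INTEGER REFERENCES dim_date(date_id),
                         amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    (1, "Jan", "Q1", 2024), (2, "Feb", "Q1", 2024), (3, "Apr", "Q2", 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0), (3, 70.0)])

# Roll the same facts up to the quarter level of the hierarchy.
by_quarter = conn.execute("""
    SELECT d.quarter, SUM(f.amount) FROM fact_sales f
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.quarter ORDER BY d.quarter""").fetchall()
print(by_quarter)  # Q1 totals 150.0, Q2 totals 70.0
```

Grouping by `d.month` or `d.year` instead walks up or down the same hierarchy without changing the fact table.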
Types of Dimensions
 Conformed Dimensions
 Degenerate Dimensions
 Junk Dimensions

 Conformed Dimensions:
Data marts may have several fact tables, and any two data marts in an enterprise may share common dimensions between their fact tables. These common dimensions must be conformed, meaning that they are either identical or one is strictly a rollup of the other. The advantage of conformed dimensions is that the two data marts don't have to be on the same machine and don't need to be created at the same time.
Conformed Dimensions … But Why?

Build dimension tables to serve all fact tables!

Ability to "mix and match" views!

Two data marts don't have to be on the same machine and don't need to be created at the same time!
Types of Dimensions
 Degenerate Dimensions:
 Certain fields which cannot be grouped with any dimension table are usually stored in the fact table, but they are not true fact values. Common examples include invoice numbers or order numbers.
 A degenerate dimension is represented by a dimension key attribute with no corresponding dimension table.

Junk Dimensions:
 A junk dimension is a convenient grouping of flags and attributes, moved out of the fact table into a useful dimensional framework.
What are Slow changing Dimensions?

Dimensions have often been assumed to be independent of time, but in the real world this is not strictly true.

Slowly changing dimensions are dimensions where a "constant" actually evolves slowly and asynchronously.

Examples: humans change their names, get married or divorced.
Three Methods…

The three choices for dealing with slowly changing dimensions are:

Type 1: Overwrite the old values in the dimension record. Result: the ability to track the old history is lost.

Type 2: Create an additional dimension record at the time of the change, with the new attribute values. Result: history is segmented very accurately between the old description and the new description.

Type 3: Create new "current" fields alongside the original fields. Result: both the old and the new values can be described.
Type one
Implementing Type 1:

 Overwrite the field with the new value.
 There is no effect anywhere else in the database.

Advantages: easy to implement; no keys are affected.
Disadvantages: history is lost.

Scenarios where applicable:
 When the original data was in error.
 When there is no value in keeping the old description/attribute.

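A Type 1 change can be sketched in a few lines; the customer dimension row here (keys, attribute names, values) is hypothetical.

```python
# Type 1 sketch: a hypothetical customer dimension keyed by surrogate key.
customer_dim = {101: {"cust_key": 101, "name": "A. Sharma", "city": "Pune"}}

def scd_type1_update(dim, cust_key, **changes):
    """Overwrite attributes in place; no new key, the old value is lost."""
    dim[cust_key].update(changes)

scd_type1_update(customer_dim, 101, city="Mumbai")
print(customer_dim[101]["city"])  # "Mumbai" -- the old city is gone
```

Any fact rows pointing at key 101 now report the new city for all of history, which is exactly the trade-off the slide describes.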
Type two

Implementing Type 2:

 Create a new record with a new unique (surrogate) key.
 Generalize the key, for example by adding 2 or 3 version digits to the end of the original key.

Advantages: automatically partitions history; no time constraints are required.
Disadvantages: abrupt point-in-time constraints are not effective.

Scenarios where applicable:
 Most commonly used where history is of importance.

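A Type 2 change can be sketched as follows; the row layout, the `current` flag, and the surrogate-key scheme are hypothetical illustrations of the idea above.

```python
# Type 2 sketch: each change adds a new row with a fresh surrogate key.
dim_rows = [
    {"surrogate_key": 1, "cust_id": 101, "city": "Pune", "current": True},
]

def scd_type2_update(rows, cust_id, **changes):
    """Expire the current row and append a new version with a new key."""
    old = next(r for r in rows if r["cust_id"] == cust_id and r["current"])
    old["current"] = False
    new = dict(old, **changes)
    new["surrogate_key"] = max(r["surrogate_key"] for r in rows) + 1
    new["current"] = True
    rows.append(new)

scd_type2_update(dim_rows, 101, city="Mumbai")
# Both versions survive, so history is partitioned between them.
print([(r["surrogate_key"], r["city"], r["current"]) for r in dim_rows])
```

Facts loaded before the change keep pointing at surrogate key 1 (Pune) and facts loaded after it point at key 2 (Mumbai), which is how history is segmented accurately.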
Type three

Implementing Type 3:

 Add a new "current" field for each affected attribute, along with an effective-date field.

Advantages: useful for tracking new and old values.
Disadvantages: intermediate values are lost; complex.

Scenarios where applicable:
 Only when there is a legitimate need to track both the old and the new value, forward and backward across the time of the change.
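A Type 3 change can be sketched like this; the paired current/previous fields and the effective date column are hypothetical names illustrating the approach above.

```python
from datetime import date

# Type 3 sketch: the row carries both current and previous values plus an
# effective date; each new change discards the intermediate history.
customer = {"cust_key": 101, "city_current": "Pune",
            "city_previous": None, "city_effective": None}

def scd_type3_update(row, new_city, effective):
    """Shift the current value into the 'previous' slot, then overwrite."""
    row["city_previous"] = row["city_current"]
    row["city_current"] = new_city
    row["city_effective"] = effective

scd_type3_update(customer, "Mumbai", date(2024, 1, 1))
print(customer["city_previous"], "->", customer["city_current"])
```

A second update would overwrite "Pune" entirely, which is why Type 3 tracks only one change back, as the disadvantages above note.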
Basic Dimensions

[Diagram: a sales fact table (item, store, acct week) linked to a Time dimension (acct year → acct period → acct week), a Location dimension (region → department → store), and a Product dimension (item).]
Dimension Tables

 Dimension tables most often contain descriptive textual information.

 Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.

 It should be obvious that the power of the data warehouse is proportional to the quality and depth of the dimension tables.
Attributes in a Dimension Table

 Attributes allow users to constrain data by one or more attributes.
 Attributes allow users to define aggregation levels for data, e.g., present classes by department, aggregate by class, and qualify by department:

DEPT    CLASS      SALES
Dept 1  Class 101  1000
        Class 120  1100
        Class 133  1900
Dept 2  Class 127  2100
        Class 141  1500
        Class 145  1800
Basic Dimensional Model

[Diagram: the Fact Item Day Store table (ITEM_ID, WEEK_ID, STORE_ID, SALES_DOLLARS, SALES_UNITS) joined to Lookup Item (ITEM_ID, ITEM_DESC, DEPT_ID, DEPT_DESC), Lookup Store (STORE_ID, STORE_DESC, REGION_ID, REGION_DESC), and Lookup Week (WEEK_ID, PERIOD_ID, YEAR_ID).]
ETL Concepts
ETL (Extract, Transform, Load) –
ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database. During the ETL process, most often, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
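The extract-transform-load cycle just described can be sketched end to end. This is a minimal sketch using sqlite3; the source and warehouse schemas, column names, and values are hypothetical.

```python
import sqlite3

# Hypothetical source OLTP table, with amounts stored as text and a
# status flag in the source system's own convention.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT, status TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "19.99", "on"), (2, "5.00", "off")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount REAL, active INTEGER)")

# Extract
rows = source.execute("SELECT id, amount, status FROM orders").fetchall()

# Transform: convert the amount to a numeric type and normalize the flag
# to match the warehouse schema.
transformed = [(oid, float(amt), 1 if status == "on" else 0)
               for oid, amt, status in rows]

# Load
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

summary = warehouse.execute(
    "SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone()
print(summary)  # row count and total of the loaded amounts
```

In practice each stage runs against a staging area rather than in memory, but the shape of the pipeline is the same.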
WHAT IS ETL?

E  EXTRACT: extract data from disparate sources
T  TRANSFORM: transform the data
L  LOAD: load the data where we want it
EXTRACTION (Data Capturing)

The ETL extraction element is responsible for extracting data from the source system.
During extraction, data may be removed from the source system or a copy made and the
original data retained in the source system.

80
EXTRACTION (Data Transmission)
Legacy systems may require too much effort to implement such offload processes, so
legacy data is often copied into the data warehouse, leaving the original data in place.
Extracted data is loaded into the data warehouse staging area (a relational database
usually separate from the data warehouse database), for manipulation by the
remaining ETL processes.

81
EXTRACTION (Cleansing Process)
Data extraction is generally performed within the source system itself.

Data extraction processes can be implemented using Transact-SQL stored procedures,
Data Transformation Services (DTS) tasks, or custom applications developed in
programming or scripting languages.

82
TRANSFORMATION
The ETL transformation element is responsible for data validation, data accuracy, data
type conversion, and business rule application. An ETL system that uses inline
transformations during extraction is less robust and flexible than one that confines
transformations to the reformatting element. Transformations performed in the OLTP
system impose a performance burden on the OLTP database.

83
TRANSFORMATION (contd.)
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion


Ensure that all values for a specified field are stored the same way in the data warehouse
regardless of how they were stored in the source system. For example, if one source system
stores "off" or "on" in its status field and another source system stores "0" or "1" in its status
field, then a data type conversion transformation converts the content of one or both of the
fields to a specified common value such as "off" or "on".

Business Rule Application


Ensure that the rules of the business are enforced on the data stored in the warehouse. For
example, check that all customer records contain values for both FirstName and LastName
fields.
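The four transformation kinds above can be sketched in one small routine; the field names (`status`, `FirstName`, `LastName`, `dept_id`) follow the slide's examples and are illustrative, not a prescribed record layout:

```python
# Normalise '0'/'1' and 'off'/'on' to a single common representation.
STATUS_MAP = {"0": "off", "1": "on", "off": "off", "on": "on"}

def transform(record, valid_dept_ids):
    # Data validation: the fact row must match a row in the dimension.
    if record["dept_id"] not in valid_dept_ids:
        raise ValueError("unknown dept_id: %r" % record["dept_id"])
    # Data type conversion: store the status field one way regardless
    # of how the source system represented it.
    record["status"] = STATUS_MAP[str(record["status"])]
    # Data accuracy: after conversion, only 'off' or 'on' is acceptable.
    assert record["status"] in ("off", "on")
    # Business rule application: both name fields must contain values.
    if not (record.get("FirstName") and record.get("LastName")):
        raise ValueError("customer record missing a name field")
    return record

rec = transform(
    {"dept_id": 7, "status": "1", "FirstName": "Ada", "LastName": "Byron"},
    valid_dept_ids={7, 8},
)
# rec["status"] == "on"
```

Doing these checks in the ETL layer, rather than inline during extraction, keeps the performance burden off the OLTP database, as the previous slide notes.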

84
LOADING
The ETL loading element is responsible for loading transformed data into the data
warehouse database.

Data warehouses are usually updated periodically rather than continuously, and large
numbers of records are often loaded to multiple tables in a single data load.

The data warehouse is often taken offline during update operations so that data can be
loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to
incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools
for data loading operations.

The design of the loading element should focus on efficiency and performance to
minimize the data warehouse offline time.
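BULK INSERT, bcp, and the Bulk Copy API are SQL Server-specific; as a portable sketch of the same idea, batching many rows into a single transaction with `executemany` (here via sqlite3, with an invented fact table) shows why bulk loading beats row-by-row inserts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (item_id INT, week_id INT, amount REAL)")

# Periodic batch load: one transaction for the whole batch, not one
# commit per row, keeping the warehouse's offline window short.
batch = [(i, 202401, float(i)) for i in range(1000)]
with con:  # the connection context manager commits once at the end
    con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", batch)

count = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
# count == 1000
```

In a real SQL Server deployment the same batch would instead flow through BULK INSERT or bcp, which bypass per-row overhead entirely.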

85
ETL Tools
What are ETL Tools?

ETL tools are meant to extract, transform and load the data into a data warehouse for
decision making. Before the advent of ETL tools, this ETL process was done manually
using SQL code written by programmers. This task was tedious and cumbersome in many
cases, since it involved many resources, complex coding and long work hours. On top of
that, maintaining the code posed a great challenge for the programmers.

Selecting an appropriate ETL tool is one of the most important decisions to be made
when choosing the components of a data warehousing application. The ETL tool
operates at the heart of the data warehouse, extracting data from multiple data sources,
transforming the data to make it accessible for business analysis, and loading multiple
target databases.

86
Features of ETL Tools
ETL tools have the ability to extract data from various sources such as RDBMSs,
DB2, COBOL data files and flat files at scheduled intervals, perform the required
transformations and load the data into a data warehouse residing on an RDBMS.
ETL tools can connect to an RDBMS and retrieve the list of tables and their
attributes. The general steps for designing an ETL process are:
Define the structure of the source data
Define the structure of the destination data
Map elements of the source data to elements of the destination data
Define the required transformations, such as changing values or summing
Schedule the execution of the process
The process, once executed, generates a log showing the status of the process,
the number of records inserted, etc. Various reports about the processes are
available, which can form the metadata.
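The five design steps above can be sketched end to end; the source rows, table names and transformation here are invented purely for illustration:

```python
import sqlite3

# 1-2. Structure of the source data (CSV-like rows of strings) and of
#      the destination table in the warehouse.
source_rows = [
    {"prod": "A", "qty": "2", "price": "10.0"},
    {"prod": "A", "qty": "1", "price": "10.0"},
    {"prod": "B", "qty": "5", "price": "3.0"},
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_sales (product TEXT PRIMARY KEY, revenue REAL)")

# 3-4. Map source elements to destination elements and transform:
#      convert string fields to numbers and sum revenue per product.
totals = {}
for r in source_rows:
    totals[r["prod"]] = totals.get(r["prod"], 0.0) + int(r["qty"]) * float(r["price"])

# 5. Load (scheduling would wrap this script, e.g. via cron or a DTS
#    task), then log the status and record count as the slide notes.
with con:
    con.executemany("INSERT INTO dw_sales VALUES (?, ?)", sorted(totals.items()))
inserted = con.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
print("ETL run complete: %d records inserted" % inserted)
```

A real ETL tool generates the equivalent of this pipeline from a graphical mapping, along with the execution log and process metadata.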

87