Pristine www.edupristine.com
Pristine
Data Mart & Data Warehousing
Data Warehousing
ETL Concepts
Database Design
Why should we consider Data Warehousing solutions?
Developing reports often required writing specific computer programs, which
was slow and expensive
Data Warehousing
A data warehouse (DWH) is a type of relational database system designed specifically for
query and analysis processing rather than transaction processing.
DWH systems are also called historical databases, read-only databases, integrated databases,
decision support systems, executive information systems, and business information systems.
Differences..
DWH Architecture
DWH Architecture (Basic)
[Figure: basic data warehouse architecture. Labels: Operational System and Flat Files (data sources); Metadata (warehouse); Analysis, Reporting, and Mining (users).]
DWH Architecture (with a staging area)
[Figure: data warehouse architecture with a staging area. Labels: Operational System, Flat Files, Metadata, Reporting, Mining.]
DWH Architecture (with a staging area and data marts)
[Figure: data warehouse architecture with a staging area and data marts. Labels: Operational System, Flat Files, Metadata, Purchasing, Inventory, Analysis, Mining.]
Dimensional Data Modeling
Conceptual modeling
Logical Modeling
Physical Modeling
Before implementing the schema design, a data modeler should understand the
following process
Example of Dimensional Data Model (Star Schema Design)
Fact Table
Facts or measures
A fact table might contain either detail-level facts or facts that have been
aggregated
Steps in Designing a Fact Table
Identify a business process for analysis (like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension (region name, branch name).
Determine the lowest level of summary in a fact table (sales dollar).
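The steps above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table and column names (sales_fact, product_dim, location_dim, time_dim, sales_dollar, and so on) are hypothetical, the organization dimension is left out for brevity, and the chosen grain (one row per product, location, and day) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions for the sales process, each with its descriptive columns.
    CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY,
                               region_name TEXT, branch_name TEXT);
    CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, calendar_date TEXT);

    -- Fact table for the sales process: one foreign key per dimension
    -- plus the measure. Assumed grain: one row per product, location, day.
    CREATE TABLE sales_fact (
        product_id   INTEGER REFERENCES product_dim(product_id),
        location_id  INTEGER REFERENCES location_dim(location_id),
        time_id      INTEGER REFERENCES time_dim(time_id),
        sales_dollar REAL
    );
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(sales_fact)")]
print(fact_columns)
```

The fact table holds only keys and the measure; all descriptive text lives in the dimension tables.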
Dimension Tables
Contain textual information that represents attributes of the business
Contain relatively static data
Are joined to the fact table through a foreign key reference
Are usually smaller than fact tables
Example of Location Dimension
Country Lookup: Country Code (PK), Country Name, Date Time Stamp
State Lookup: State Code (PK), State Name, Date Time Stamp
County Lookup: County Code (PK), County Name, Date Time Stamp
City Lookup: City Code (PK), City Name, Date Time Stamp
Location Dimension: Location Dimension Identifier (PK), Country Name, State Name, County Name, City Name, Date Time Stamp
Location Dimension
Identifier  Country Name  State Name  County Name  City Name    Date Time Stamp
1           USA           New York    Shelby       Manhattan    1/1/2005 11:23:31 AM
2           USA           Florida     Jefferson    Panama City  1/1/2005 11:23:31 AM
3           USA           California  Montgomery   San Jose     1/1/2005 11:23:31 AM
4           USA           New Jersey  Hudson       Jersey City  1/1/2005 11:23:31 AM
Star Schema Design benefits
Other schema designs are also used in practice, e.g. Clickstar and Snowflake.
Data Warehouse Star Schema
In contrast, this data model makes it easier to develop reports and to run simple,
efficient summarization queries.
[Figure: star schema with the Sales fact table at the center, joined to the Customers, Dates, Channels, Promotions, and Products dimension tables.]
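As a rough illustration of a summarization query against a star schema, the sketch below joins a Sales fact table to two of its dimensions and totals the sales. The column names and sample rows are invented for the example; only the table names Sales, Products, and Channels come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE channels (channel_id INTEGER PRIMARY KEY, channel_name TEXT);
    CREATE TABLE sales (product_id INTEGER, channel_id INTEGER, amount REAL);

    -- Hypothetical sample rows.
    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO channels VALUES (1, 'Store'), (2, 'Web');
    INSERT INTO sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 30.0), (1, 1, 5.0);
""")

# One join per dimension, then GROUP BY the descriptive columns.
totals = conn.execute("""
    SELECT p.product_name, c.channel_name, SUM(s.amount)
    FROM sales AS s
    JOIN products AS p ON p.product_id = s.product_id
    JOIN channels AS c ON c.channel_id = s.channel_id
    GROUP BY p.product_name, c.channel_name
    ORDER BY p.product_name, c.channel_name
""").fetchall()
print(totals)
```

Every dimension is one join away from the fact table, which is what keeps such queries short.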
Data Acquisition
Data acquisition is the process of extracting the relevant business information from different
source systems, transforming the data from one format into another, integrating the data into a
homogeneous format, and loading it into a warehouse database.
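A minimal sketch of this extract, transform, integrate, and load sequence is shown below. The two sources, their field names, and the target table orders are all hypothetical; a real ETL tool would read from actual systems rather than in-memory samples.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Two hypothetical sources that describe the same kind of record differently.
CSV_SOURCE = "cust_name,order_date,amount\nalice,31/01/2005,10.50\n"
APP_SOURCE = [{"customer": "Bob", "date": "2005-02-01", "total": 20}]

def extract():
    """Read raw records from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return csv_rows, APP_SOURCE

def transform(csv_rows, app_rows):
    """Convert both sources to one homogeneous (customer, ISO date, amount) format."""
    unified = []
    for r in csv_rows:
        iso = datetime.strptime(r["order_date"], "%d/%m/%Y").date().isoformat()
        unified.append((r["cust_name"].title(), iso, float(r["amount"])))
    for r in app_rows:
        unified.append((r["customer"], r["date"], float(r["total"])))
    return unified

def load(conn, rows):
    """Write the integrated rows into the warehouse table."""
    conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_date").fetchall()
print(loaded)
```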
ETL
Extraction, Transformation, and Loading (ETL) Processes
Sample ETL Process Flow
Sources: Source System 1, Source System 2, Source System 3
File sources: Excel (.xls) files, Text (.txt) files, XML (.xml) files, and other files
Other sources: SAP, PeopleSoft, Siebel, and several others
Transformations: Data Profiling, Aggregation, Filtering, Joining, Sorting, etc.
Loading: creation and execution of workflows to load data from source to target
Target: Data Mart 3
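The transformation steps named in this flow (filtering, joining, aggregation, sorting, loading) can be sketched in plain Python. The sample orders, the customer-to-region lookup, and the data_mart list that stands in for the target are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical source rows.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50.0, "status": "shipped"},
    {"order_id": 2, "customer_id": 11, "amount": 20.0, "status": "cancelled"},
    {"order_id": 3, "customer_id": 10, "amount": 25.0, "status": "shipped"},
    {"order_id": 4, "customer_id": 11, "amount": 40.0, "status": "shipped"},
]
customers = {10: "North", 11: "South"}  # customer_id -> region

# Filtering: keep only the rows the target cares about.
shipped = [o for o in orders if o["status"] == "shipped"]

# Joining: attach the region from the second source.
joined = [{**o, "region": customers[o["customer_id"]]} for o in shipped]

# Aggregation: total amount per region.
totals = defaultdict(float)
for row in joined:
    totals[row["region"]] += row["amount"]

# Sorting: largest total first.
sorted_rows = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Loading: a list stands in for the target table here.
data_mart = list(sorted_rows)
print(data_mart)
```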
Data Sources and Types
Technology exists for storing unstructured data, and this is expected to become more
important over time
ETL Process
Transforms cleansed source data and then loads it into the target system
Data Staging
Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat files, FTP
sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and loaded to the
warehouse
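One way to picture the staging step is a staging table that accumulates raw rows as they arrive, plus a job that transforms and loads everything received up to the cutoff. The sketch below assumes hypothetical table names (staging, warehouse), helper names (stage, load_at_cutoff), and ISO-formatted timestamps so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (source TEXT, raw_amount TEXT, received_at TEXT);
    CREATE TABLE warehouse (source TEXT, amount REAL);
""")

def stage(source, raw_amount, received_at):
    """Accumulate a raw record from any source, untouched."""
    conn.execute("INSERT INTO staging VALUES (?, ?, ?)",
                 (source, raw_amount, received_at))

def load_at_cutoff(cutoff):
    """Transform and load every staged row received up to the cutoff."""
    rows = conn.execute(
        "SELECT source, raw_amount FROM staging WHERE received_at <= ?", (cutoff,)
    ).fetchall()
    conn.executemany(
        "INSERT INTO warehouse VALUES (?, ?)",
        [(src, float(raw.strip())) for src, raw in rows],
    )
    conn.execute("DELETE FROM staging WHERE received_at <= ?", (cutoff,))
    return len(rows)

stage("flat_file", " 12.5 ", "2005-01-01T09:00")
stage("ftp", "7", "2005-01-01T17:30")
stage("flat_file", "3", "2005-01-02T08:00")  # arrives after the cutoff

loaded_count = load_at_cutoff("2005-01-01T23:59")
remaining = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(loaded_count, remaining)
```

Rows that arrive after the cutoff simply wait in staging for the next run.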
Data Transformation
Transforms the data in accordance with the business rules and standards that have
been established
Data Loading
The trend is toward near-real-time updates of the data warehouse, as the warehouse is
increasingly used for operational applications
Meta Data
IT personnel need to know data sources and targets; database, table and column
names; refresh schedules; data usage measures; etc.
OLAP is a Data Warehouse Tool
OLAP organizes data warehouse data into multidimensional cubes based on this
dimensional model, and then preprocesses these cubes to provide maximum
performance for queries that summarize data in various ways.
OLAP is not designed to store large volumes of text or binary data, nor is it designed
to support high volume update transactions.
The inherent stability and consistency of historical data in a data warehouse enables
OLAP to provide its remarkable performance in rapidly summarizing information for
analytical queries.
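The idea of preprocessing a cube can be illustrated with a toy example that precomputes the sales total for every combination of dimensions, so that a summary query becomes a dictionary lookup. This is a simplified sketch, not how any particular OLAP product works; the facts, the dimension names, and the helpers build_cube and total are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with three dimensions and one measure.
facts = [
    {"product": "Widget", "region": "East", "year": 2004, "sales": 10.0},
    {"product": "Widget", "region": "West", "year": 2005, "sales": 20.0},
    {"product": "Gadget", "region": "East", "year": 2005, "sales": 5.0},
]
dimensions = ("product", "region", "year")

def build_cube(rows):
    """Precompute the sales total for every subset of the dimensions."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row["sales"]
    return cube

cube = build_cube(facts)

def total(**filters):
    """Answer a summary query with a single lookup in the precomputed cube."""
    dims = tuple(d for d in dimensions if d in filters)
    return cube[(dims, tuple(filters[d] for d in dims))]

print(total())                             # grand total
print(total(region="East"))                # one dimension fixed
print(total(product="Widget", year=2005))  # two dimensions fixed
```

Because historical data rarely changes, the precomputed totals stay valid between loads, which is why this trade of storage for query speed suits a warehouse.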
Data Mining is a Data Warehouse Tool
OLAP organizes data in a model suited for exploration by analysts, and data mining
performs analysis on data and provides the results to decision makers.
Thus, OLAP supports model-driven analysis and data mining supports data-driven
analysis.
Conclusion
Focus on the users, determine what data is needed, locate sources for the data, and
organize the data in a dimensional model that represents the business needs.
Database & Design: An Introduction
Spreadsheet vs. Database
A spreadsheet is a way of describing a table of numeric data, and having some of that data
interact.
A spreadsheet is like an accountant's ledger. It has columns and rows. You can enter bits of
information (typically numbers) directly into a 'cell' that can be identified by its column and row
number. You can then manipulate the data and possibly reach some conclusions concerning what
you've entered. For example, you can add columns or rows, compute averages, or multiply values. The
electronic spreadsheet is a very useful and powerful method of analysis.
A database is a collection of like things: names, birthdays, weights, ages, etc. If, for example, you
had 100 friends, you could use a database to enter various information about each of them: name,
age, address, birthday, height, weight, favorite color, or whatever information you think useful or
interesting. Then you could 'sort' that information based on 'filters' that you think important or
interesting. For example, you could rank them by age, height, or weight.
Approach to Database design
To consider cases of 'Inheritance', where there are General Entities and Specific
Entities.
For example, a Customer is a General Entity, and Commercial Customer and Personal Customer
would be Specific Entities.
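One common way to map such General and Specific Entities to tables is a supertype table plus one subtype table per Specific Entity, all sharing the same key. The sketch below assumes that mapping; the column names (company_registration, date_of_birth) and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- General Entity: attributes shared by every customer.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- Specific Entities: same key as the general row, plus their own attributes.
    CREATE TABLE commercial_customer (
        customer_id          INTEGER PRIMARY KEY REFERENCES customer(customer_id),
        company_registration TEXT
    );
    CREATE TABLE personal_customer (
        customer_id   INTEGER PRIMARY KEY REFERENCES customer(customer_id),
        date_of_birth TEXT
    );
    INSERT INTO customer VALUES (1, 'Acme Ltd'), (2, 'Jane Doe');
    INSERT INTO commercial_customer VALUES (1, 'REG-001');
    INSERT INTO personal_customer VALUES (2, '1980-05-17');
""")

# A Specific Entity is read by joining it back to its General Entity.
commercial = conn.execute("""
    SELECT c.name, cc.company_registration
    FROM customer AS c
    JOIN commercial_customer AS cc ON cc.customer_id = c.customer_id
""").fetchall()
print(commercial)
```

Other mappings exist, such as a single table with a customer type column; which one fits depends on how different the Specific Entities are.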
To identify the Static and Reference Data, such as Country Codes or Customer
Types.
To review Code or Type Data which is (more or less) constant, which can be
classified as Reference Data.
For example, Currency or Country Codes. Where possible, use standard values, such as ISO
Codes.
To look for 'has a' relationships. These can become Foreign Keys, or 'Parent-Child'
relationships.
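As a small sketch of these points, the example below keeps country codes in a reference table and links customers to it with a foreign key. The table and column names are hypothetical; the codes 'IN' and 'US' are ISO 3166-1 alpha-2 values. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    -- Reference data: more or less constant, uses standard ISO codes.
    CREATE TABLE country (
        country_code TEXT PRIMARY KEY,
        country_name TEXT NOT NULL
    );
    -- A customer 'has a' country, so it carries a foreign key to the reference table.
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        country_code TEXT NOT NULL REFERENCES country(country_code)
    );
    INSERT INTO country VALUES ('IN', 'India'), ('US', 'United States');
    INSERT INTO customer VALUES (1, 'Asha', 'IN');
""")

# A code that is not in the reference table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO customer VALUES (2, 'Bob', 'ZZ')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```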
To confirm the first draft of the Database design against the Sample Data.
To review and obtain from the Users some representative enquiries for the
Database,
e.g., "How many Maintenance Engineers do we have on staff coming available in the next
4 weeks?"
Define User Scenarios and step through them with some sample data to check that
the Database supports the required functionality.
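Stepping through the sample enquiry about Maintenance Engineers could look like the sketch below. The staff table, its columns, the sample rows, and the fixed "today" date are all assumptions; the point is only to check that a draft design can answer the question.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, job_title TEXT, available_from TEXT)")

# Hypothetical sample data covering the interesting cases.
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", [
    ("Ann", "Maintenance Engineer", "2005-01-10"),
    ("Raj", "Maintenance Engineer", "2005-03-01"),  # beyond four weeks
    ("Lee", "Sales Executive",      "2005-01-05"),  # different job
    ("Sam", "Maintenance Engineer", "2005-01-28"),
])

today = date(2005, 1, 1)               # fixed so the example is repeatable
horizon = today + timedelta(weeks=4)   # 2005-01-29

count = conn.execute(
    "SELECT COUNT(*) FROM staff "
    "WHERE job_title = ? AND available_from BETWEEN ? AND ?",
    ("Maintenance Engineer", today.isoformat(), horizon.isoformat()),
).fetchone()[0]
print(count)
```

If the draft design had no column for availability dates, this scenario would expose the gap before the database is built.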
Data Models, Schemas, and Instances
Data types
Relationships
Data Model: A set of concepts to describe the structure of a database, and certain
constraints that the database should obey.
Data Model Operations: Operations for specifying database retrievals and updates
by referring to the concepts of the data model.
Generic operations: insert, delete, modify, retrieve
User-defined operations
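In the relational model the generic operations correspond to the SQL statements INSERT, DELETE, UPDATE, and SELECT. The sketch below shows each once, plus a user-defined operation built on top of them; the student table and the change_major helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_number INTEGER PRIMARY KEY, name TEXT, major TEXT)"
)

def change_major(conn, student_number, new_major):
    """User-defined operation: wraps a generic 'modify' behind a business name."""
    conn.execute(
        "UPDATE student SET major = ? WHERE student_number = ?",
        (new_major, student_number),
    )

# insert
conn.execute("INSERT INTO student VALUES (17, 'Smith', 'CS')")
conn.execute("INSERT INTO student VALUES (8, 'Brown', 'CS')")
# modify (through the user-defined operation)
change_major(conn, 8, "MATH")
# delete
conn.execute("DELETE FROM student WHERE student_number = 17")
# retrieve
rows = conn.execute("SELECT student_number, name, major FROM student").fetchall()
print(rows)
```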
Categories of Data Models
Conceptual (high-level, semantic) data models: Provide concepts that are close to the way many
users perceive data. (Also called entity-based or object-based data models.)
Entity
Attribute
Relationship
Physical (low-level, internal) data models: Provide concepts that describe details of how data is
stored in the computer.
Record formats
Record ordering
Access paths
Implementation (record-oriented) data models: Provide concepts that fall between the above
two, balancing user views with some computer storage details.
Relational
Network
Hierarchical
Schemas, Instances and Database State
Schema diagram for University database
schema construct
Known data:
Name of record types, data items
Student
Name   Student Number  Class  Major
Smith  17              1      CS
Brown  8               2      CS
Course
Course Name                Course Number  Credit Hours  Department
Intro to Computer Science  CS1310         4             CS
Data Structures            CS3320         4             CS
Discrete Mathematics       MATH2410       3             MATH
Database                   CS3380         3             CS
Section
Section Identifier  Course Number  Semester  Year  Instructor
85                  MATH2410       Fall      98    King
92                  CS1310         Fall      98    Anderson
102                 CS3320         Spring    99    Knuth
112                 MATH2410       Fall      99    Chang
119                 CS1310         Fall      99    Anderson
135                 CS3380         Fall      99    Stone
Grade Report
Student Number  Section Identifier  Grade
17              112                 B
17              119                 C
8               85                  A
8               92                  A
8               102                 B
8               135                 A
Prerequisite
Course Number  Prerequisite Number
CS3380         CS3320
CS3380         MATH2410
CS3320         CS1310
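The sample data above can be loaded into SQLite to show how the tables relate. The sketch below loads three of the five tables and retrieves Smith's grades by following Student Number and Section Identifier across them. The SQL table and column names are adaptations of the slide's headings, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT, student_number INTEGER PRIMARY KEY,
                          class INTEGER, major TEXT);
    CREATE TABLE section (section_identifier INTEGER PRIMARY KEY, course_number TEXT,
                          semester TEXT, year TEXT, instructor TEXT);
    CREATE TABLE grade_report (student_number INTEGER, section_identifier INTEGER,
                               grade TEXT);

    INSERT INTO student VALUES ('Smith', 17, 1, 'CS'), ('Brown', 8, 2, 'CS');
    INSERT INTO section VALUES
        (85,  'MATH2410', 'Fall',   '98', 'King'),
        (92,  'CS1310',   'Fall',   '98', 'Anderson'),
        (102, 'CS3320',   'Spring', '99', 'Knuth'),
        (112, 'MATH2410', 'Fall',   '99', 'Chang'),
        (119, 'CS1310',   'Fall',   '99', 'Anderson'),
        (135, 'CS3380',   'Fall',   '99', 'Stone');
    INSERT INTO grade_report VALUES
        (17, 112, 'B'), (17, 119, 'C'),
        (8, 85, 'A'), (8, 92, 'A'), (8, 102, 'B'), (8, 135, 'A');
""")

# Follow student -> grade_report -> section to list each course and grade.
smith_grades = conn.execute("""
    SELECT sec.course_number, g.grade
    FROM student AS s
    JOIN grade_report AS g ON g.student_number = s.student_number
    JOIN section AS sec ON sec.section_identifier = g.section_identifier
    WHERE s.name = 'Smith'
    ORDER BY sec.course_number
""").fetchall()
print(smith_grades)
```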
DBMS Architecture and Data Independence
Mappings among schema levels are also needed. Programs refer to an external
schema, and are mapped by the DBMS to the internal schema for execution
The Three-schema architecture
DBMS Interfaces
The Database System Environment
DBMS Component Modules
Database System Utilities
Tools, Application Environments, and Communications Facilities
Communications Facilities
Allow users at locations remote from the database system site to access the
database.
DB (DBMS)/DC (Data Communication System)
Classification of Database Management Systems
A Network Schema
[Figure: network schema diagram. Labels: Student, Course, Is A, Course Offerings, Has A, Section Grades, Grade Report.]
Database Management System (DBMS)
An architecture for a database system
View level
Logical level
Physical level
Data Models
A collection of tools for describing
Data
Data relationships
Data semantics
Data constraints
Entity-Relationship model
Relational model
Other models:
Object-oriented model
Semi-structured data models
Older models: network model and hierarchical model
Entity-Relationship Model
Relational Model
A Sample Relational Database
Entity Relationship Model (Cont.)
Entities (objects)
A database design in the E-R model is usually converted to a design in the relational model
(coming up next), which is used for storage and processing
Relational Database
A relational database is created using the relational model. The software used in a
relational database is called a relational database management system (RDBMS).
A relational database is the predominant choice for storing data, over other models
like the hierarchical database model or the network model. It consists of n tables,
and each table has its own primary key.
Relational database example
Contain tables
Tables contain records (rows)
Records are broken into columns (fields)
ID (PK)  Quote                                                  Sources (FK)
1        I don't like that man; I must get to know him better.  4
Overview of Object-Oriented Concepts
Object
State(Value)
Behavior(operations)
Transient vs. persistent
Object Identity
Properties of OID
Object Structure
(i, c, v)
c: a type constructor
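The triple can be sketched as a small Python class. The slide only states that c is a type constructor; reading i as a system-generated object identifier (OID) and v as the object's value or state follows the usual textbook presentation and is an assumption here, as are the constructor names and the make_object helper.

```python
from dataclasses import dataclass
from itertools import count
from typing import Any

_next_oid = count(1)  # stand-in for a system-generated identifier
VALID_CONSTRUCTORS = {"atom", "tuple", "set", "list", "bag", "array"}  # assumed set

@dataclass(frozen=True)
class DbObject:
    i: int   # object identifier (assumed meaning)
    c: str   # type constructor
    v: Any   # value or state (assumed meaning)

def make_object(c, v):
    """Create an object with a fresh identifier."""
    if c not in VALID_CONSTRUCTORS:
        raise ValueError(f"unknown type constructor: {c}")
    return DbObject(next(_next_oid), c, v)

name = make_object("atom", "Research")
location = make_object("atom", "Houston")
# A complex object refers to its components by identifier, not by value.
dept = make_object("tuple", {"dname": name.i, "location": location.i})
# Same state as `name`, but a distinct object because the identifier differs.
same_state = make_object("atom", "Research")
print(name.v == same_state.v, name.i == same_state.i)
```

Two objects with equal values but different identifiers are distinct objects, which is the point of object identity.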
Example 1: Complex Object
Transaction Management
Storage Management
A storage manager is a program module that provides the interface between the
low-level data stored in the database and the application programs and queries
submitted to the system.
Overall System Structure
Application Architectures
Planning Overall
Annually.
Thank you!
Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069, INDIA
www.edupristine.com
Ph. +91 22 3215 6191