Data Warehousing Concepts

vanuguard@gmail.com
Data Warehouse Concepts
What is a Data Warehouse?
A DWH is a collection of Data Marts representing historical data from different
operational data sources (OLTP). The data from these OLTP systems is structured and
optimized for querying and data analysis in the DWH.
A Data warehouse is a relational database that is designed for Query and analysis
rather than for transaction processing. It usually contains historical data derived from
transactional data, but it can include data from other sources. It separates analysis
workload and enables an organization to consolidate data from several sources. In
addition to a relational database, a DWH environment includes an ETL solution, an
OLAP engine, client analysis tools and other applications that manage the process of
gathering data and delivering it to business users. The characteristics of a DWH are:
Subject-Oriented: Information in the data warehouse should revolve around a
subject and should give all the information regarding that subject. DWHs are designed to
help you analyze data. For example, to learn more about the company's sales data, you
can build a warehouse that concentrates on sales. This ability to define a DWH by subject
matter, sales in this case, makes the DWH subject oriented.
Integrated: There should be consistency when loading data from different
heterogeneous systems and transforming it. Integration is closely related to subject
orientation. DWHs put data from disparate sources into a consistent format. They must
resolve such problems as naming conflicts and inconsistencies among units of measure.
When they achieve this, they are said to be integrated.
Nonvolatile: Once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable you to analyze
what has occurred, and whatever once happened never changes.
Time-Variant: In order to discover trends, analysts need large amounts of data.
This is very much in contrast to OLTP systems, where performance requirements demand
that historical data be moved to an archive. A DWH's focus on change over time is what is
meant by the term time variant.
What are the uses of a Data Warehouse?
It separates analysis workload and enables an organization to consolidate data
from several sources.
It manages the process of gathering data and delivering it to business users.
It is used to analyze data.
It puts data from disparate sources into a consistent format.
What is a Data Mart?
A Data Mart is a focused subset of a DWH that deals with a single area of data
and is organized for quick analysis. It contains the summarized data of the warehouse
and is referred to as a High Performance Query Structure. Data marts consist of
Materialized Views and special indexes. In some businesses the data marts are maintained
within the warehouse, whereas in other scenarios they are maintained apart
from the DWH.
What are the differences between a Database, a Data Warehouse and a Data Mart?
A Database is an organized collection of data.
A DWH is a very large database with a special set of tools to extract and cleanse data from
operational systems and to analyze data.
A Data Mart is a focused subset of a DWH that deals with a single area of data and is
organized for quick analysis.
What is Dimension Modeling?
Dimension Modeling is a high-level methodology used to implement the Star Schema
structure as part of Data Modeling.
Or
A Dimension Model is composed of one table with a multipart key, called the fact table,
and a set of smaller tables called dimension tables. Each dimension table has a
single-part primary key that corresponds exactly to one of the components of the
multipart key in the fact table.
Or
Dimension Modeling is nothing but maintaining the relationship between the dimension
tables and the fact table using primary keys and foreign keys.
What is meant by OLTP?
OLTP stands for On-Line Transaction Processing. It uses a standard,
normalized database structure. OLTP is designed for transactions, i.e. day-to-day
transactions. An OLTP database has hundreds of users connected to it. These databases are
normalized to reduce the redundancy of the data and increase the performance while
inserting data. The number of records being inserted is greater than the number
of records being updated or deleted. OLTP systems are not designed for analysis,
reporting and decision support. Examples: ATMs, Online Shopping, Online
Application Filling, and Online Railway Reservations.
What is meant by OLAP? What are the types of OLAP?
OLAP stands for On-Line Analytical Processing. An OLAP system stores data in
multidimensional databases. Users access these databases to perform financial and
statistical analysis on different combinations of the data. An OLAP database is generally
used to analyze data. It is optimized so that users can retrieve data quickly. An OLAP
database is generally created from the information we have put in an OLTP database.
OLAP products can be grouped into 3 categories.
MOLAP: (Multidimensional OLAP)
Data is stored in multidimensional arrays (cubes) so that it can be viewed in a
multidimensional manner.
Multidimensional arrays provide efficiency in storage and operations. Examples:
Oracle Express Server, Essbase by Hyperion Software, PowerPlay by
Cognos.
MOLAP is optimized for multidimensional operations, so it is poorly suited to ad-hoc
queries that fall outside the dimensions built into the cube.
It can perform complex calculations. All the calculations are pre-generated
when the cube is created.
Retrieval is fast.
Storage is very efficient.
ROLAP: (Relational OLAP)

This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
Data is stored in a Relational model because OLAP capabilities are best provided
against the relational database.
Examples: Oracle, SQL Server etc.
ROLAP integrates naturally with existing technology and standards.
ROLAP can readily take advantage of parallel relational technology.
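The point that each slice or dice adds a WHERE condition can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the sales table, its columns and its rows are hypothetical and are not taken from any particular ROLAP product.

```python
import sqlite3

# Hypothetical relational table standing in for the data behind a ROLAP tool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2023, "East", "Pen", 10.0),
        (2023, "West", "Pen", 20.0),
        (2024, "East", "Pen", 30.0),
        (2024, "East", "Ink", 40.0),
    ],
)

# Slice: fix one dimension (year) with a single WHERE condition.
slice_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE year = 2024 GROUP BY region ORDER BY region"
).fetchall()

# Dice: restrict several dimensions at once by adding more WHERE conditions.
dice_total = conn.execute(
    "SELECT SUM(amount) FROM sales "
    "WHERE year = 2024 AND region = 'East' AND product = 'Pen'"
).fetchone()[0]

print(slice_rows, dice_total)
```

A real ROLAP engine generates SQL of this kind behind the scenes; the sketch only shows the shape of the generated statements.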
HOLAP: (Hybrid OLAP)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
For summary-type information, HOLAP leverages cube technology for faster
performance. When detail information is needed, HOLAP can "drill through" from the
cube into the underlying relational data. The result is a reduction of network traffic
and better performance.
These products combine MOLAP and ROLAP.
With HOLAP products, a relational database stores most of the data.
A separate multidimensional database stores a small portion of the data.
Transformations: Transformations are the manipulation of data from how it appears in
the source systems into another form in the DWH or data mart in a way that enhances or
simplifies its meaning. In other words, you transform data into information. This
includes the following:
Data Cleansing: It is the process of validating the data brought from multiple sources.
This involves identifying and eliminating inconsistencies or inaccuracies in the data
from multiple sources, converting data from different systems into a single consistent
data set suitable for analysis, correcting data errors and filling in missing
data values.
Data Merging: It is the process of standardizing data types and fields. Suppose one source
system stores integer data as smallint whereas another stores the same data as decimal.
The data from the two source systems needs to be rationalized when moved into the Oracle
data type called NUMBER.
Aggregation: The process whereby multiple detailed values are combined into a single
summary value.
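The three transformations above can be sketched in a few lines of plain Python. The source rows, the field names and the rule of filling a missing quantity with 0 are assumptions made only for illustration; a real ETL tool would apply business-specific rules.

```python
from decimal import Decimal

# Hypothetical rows from two source systems: one sends quantity as an int,
# the other as a decimal string; region spelling differs and one value is missing.
source_a = [{"region": "east", "qty": 3}, {"region": "West", "qty": None}]
source_b = [{"region": "EAST ", "qty": "2.0"}, {"region": "west", "qty": "4.5"}]

def transform(row):
    # Cleansing: make the region spelling consistent and fill a missing
    # quantity with 0 (an assumed rule for this sketch).
    qty = row["qty"] if row["qty"] is not None else 0
    # Merging: standardize both source types on one numeric type.
    return {"region": row["region"].strip().title(), "qty": Decimal(str(qty))}

merged = [transform(r) for r in source_a + source_b]

# Aggregation: combine the detailed values into one summary value per region.
totals = {}
for row in merged:
    totals[row["region"]] = totals.get(row["region"], Decimal("0")) + row["qty"]

print(totals)
```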
What is an ODS (Operational Data Store)?
It is used to store current data coming from sources such as transactional web
applications, SAP and MQ Series.
Current data means data for a particular period, from one date to another.
An ODS contains 30-90 days of data.
Def: An ODS is an environment where data from different operational databases is
integrated. The purpose is to provide the end user community with an integrated view of
enterprise data. It enables the user to address operational challenges that span over more
than one business function.
What is Star Schema design?
A Star Schema is defined as a logical database design in which there is a
centrally located fact table surrounded by one or more dimension tables.
This design is best suited for a DWH or Data Mart.
A star schema can be simple or complex. A simple star consists of one fact table;
a complex star can have more than one fact table. Its dimension tables are de-normalized.
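A simple star can be sketched as one fact table whose multipart key is made of foreign keys to two dimension tables. The table names, columns and rows below are hypothetical; sqlite3 is used only so the sketch can be run as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: single-part primary key plus descriptive columns.
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
-- Fact table: multipart key built from the dimension keys, plus the measure.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL,
    PRIMARY KEY (date_key, store_key)
);
INSERT INTO dim_date  VALUES (1, '2024-01-01', '2024-01'), (2, '2024-01-02', '2024-01');
INSERT INTO dim_store VALUES (10, 'Main St', 'Pune'), (20, 'Mall', 'Delhi');
INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 10, 150.0), (1, 20, 80.0);
""")

# A typical star query: join the fact to each dimension and aggregate.
rows = conn.execute("""
    SELECT s.city, d.month, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date  d ON d.date_key  = f.date_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.city, d.month
    ORDER BY s.city
""").fetchall()

print(rows)
```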
What is Snow Flake Schema?
In a Snow Flake Schema the dimension tables are further divided into one or
more dimensions (normalized) to organize the information in a better structural format.
To design a snowflake schema we should first design the star schema.
The main advantage of the snowflake schema is the reduction in disk storage
requirements, since repeated attribute values move into smaller lookup
tables. The main disadvantages are the additional joins needed at query time and the
additional maintenance effort needed due to the increased number of lookup tables.
Its dimension tables are normalized.
What is Dimension?
A dimension is a star schema object which contains descriptive data to provide
meaningful information for the facts in the fact table.
E.g. a Profit Summary Fact Table can be viewed by a Time Dimension.
What are Conformed Dimensions?
Dimensions which are reusable and fixed in nature. E.g.: Customer, Time,
Geography dimensions.
Or
A dimension table which is used by more than one fact table is known as a Conformed
Dimension.
What is Junk Dimension?
A Junk Dimension is simply a structure that provides a convenient place to
store junk attributes,
e.g. transactional codes, flags and/or text attributes.
For example, a trade fact would contain several metrics and would be related to several
dimensions such as account, date, office, rep...
This fact would also contain several codes and flags that are related to the transaction
rather than to any of the dimensions, such as:
Origin Code: indicates whether the trade was initiated with a phone call or via the
web.
Reinvest Flag.
Comment Field: for storing special instructions from the
customer.
These three would normally be removed from the fact table and stored in a junk
dimension called the trade dimension.
In this way, the number of indexes on the fact table would be reduced and performance
would be enhanced.
A junk dimension is a dimension which can be used for conveniently grouping low
cardinality flags and indicators.
What is a Degenerate Dimension?
Dimension values that are stored in the fact table are called Degenerate
Dimensions. These dimensions do not have their own dimension tables.
E.g.: Invoice_No and Invoice_line_no in a fact table will be degenerate dimensions if you
don't have a dimension called Invoice.
Or
A column in the fact table whose value is neither a dimension key nor a measure.
What is the Difference between Junk Dimension and Degenerate Dimension?
In a Junk Dimension the data is stored in a separate dimension table, whereas a
Degenerate Dimension holds data that is dimensional in nature but stored in the fact table.
What are the different methods of loading Dimension tables?
There are two methods:
Conventional Load:
Before loading the data, all the table constraints are checked against the data.
Direct Load (Faster Loading):
All the constraints are disabled and the data is loaded directly. Later the data is
checked against the table constraints, and the bad data won't be indexed.
What is a Fact?
Facts are metrics that describe the process; dimensions give facts their context.
Def: A fact is a numeric value (measure).
What is Fact Table?
A Fact table is a table that contains summarized numerical data (facts) and historical data.
The Fact table has a Foreign Key - Primary Key relation with each dimension table.
The Fact table maintains the information in third normal form.
Because the fact table holds keys that reference the dimension tables, the
descriptive data is not repeated in it, which keeps the table in normal form. Being in
normal form, finer granularity is achieved with less
coding, i.e. fewer joins while retrieving the facts.
For example, sales amount would be such a measure. This measure is stored in the
fact table with the appropriate granularity. For example, it can be sales amount by
store by day. In this case, the fact table would contain three columns: A date
column, a store column, and a sales amount column.
Why is the fact table in normal form?
Fact tables are in normal form because they only store the additive measures,
degenerate dimensions, indicators, and dimension surrogate keys. By definition, fact
tables contain the actual data that the users want to query (such as amounts, numbers etc.).
Traditionally the fact tables are huge in size and in number of columns. De-normalizing
them (unless for performance reasons required by a specific BI tool) has a big impact on
the total size of the table.
What are the types of Facts?
There are three types of facts:
Additive Facts:
A Fact which can be summed up across any of the dimensions available in the Fact
Table.
Semi Additive Fact:
A Fact which can be summed up across some of the dimensions but not across all
dimensions available in the Fact Table.
Non Additive Fact:
A Fact which can't be summed up across any of the dimensions available in the Fact Table.
What are the Types of Fact Tables?
There are two types of Fact Tables:
Cumulative Fact table:
This type of Fact table generally describes what has happened over a period
of time. It contains Additive Facts.
Semi Additive Fact Table (often called a Snapshot fact table):
This type of Fact Table generally describes the state of things at a particular
point in time. It contains Semi Additive Facts and Non Additive Facts.
Let us use examples to illustrate each of the three types of facts. The first example
assumes that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each
store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an
additive fact, because you can sum up this fact along any of the three dimensions present
in the fact table -- date, store, and product. For example, the sum of Sales_Amount for
all 7 days in a week represents the total sales amount for that week.
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the
end of each day, as well as the profit margin for each account for each day.
Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive
fact, as it makes sense to add it up across all accounts (what's the total current balance
for all accounts in the bank?), but it does not make sense to add it up through time
(adding up all current balances for a given account for each day of the month does not
give us any useful information). Profit_Margin is a non-additive fact, for it does not
make sense to add it up at the account level or the day level.
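The bank example can be checked with a small sketch. The table name fact_balance and the sample balances are hypothetical; the point is only that summing Current_Balance across accounts is meaningful while summing it across days is not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_balance (day TEXT, account TEXT, current_balance REAL)")
conn.executemany(
    "INSERT INTO fact_balance VALUES (?, ?, ?)",
    [
        ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
        ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 50.0),
    ],
)

# Across accounts on one day: a meaningful total for the bank.
total_on_day = conn.execute(
    "SELECT SUM(current_balance) FROM fact_balance WHERE day = '2024-01-02'"
).fetchone()[0]

# Across days for one account: the sum is a number account A never held.
misleading_sum = conn.execute(
    "SELECT SUM(current_balance) FROM fact_balance WHERE account = 'A'"
).fetchone()[0]

# Through time, take the latest balance instead of a sum.
closing_balance = conn.execute(
    "SELECT current_balance FROM fact_balance "
    "WHERE account = 'A' ORDER BY day DESC LIMIT 1"
).fetchone()[0]

print(total_on_day, misleading_sum, closing_balance)
```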
What is a Factless Fact Table?
Factless fact tables are fact tables which do not contain any
facts (measures). They contain only foreign keys to dimensions.
Generally, when we need to combine two data marts, one data mart will have a
factless fact table and the other a common fact table.
Examples:
Identifying product promotion events (to determine promoted products that didn't
sell)
Tracking student attendance or registration events
Tracking insurance-related accident events
Identifying building, facility, and equipment schedules for a hospital or
university
There are two kinds of factless fact table:
1. Fact tables which record events
2. Coverage fact tables
A student attendance fact table is an example of the first type. Here we can have
dimensions such as date, student, teacher and course, and the fact table contains keys to
these dimensions.
Using this model, we can answer questions such as: Which courses were the most
heavily attended? Which courses were the most consistently attended? Which teachers
taught the most students?
A coverage fact table is required when the primary fact table cannot answer
certain questions. For example, a sales fact table cannot answer the question "Which
items were not sold?". A coverage fact table keeps information on all items which were on
promotion irrespective of whether they were sold or not. The answer to the above question
is the set difference between the coverage and primary fact tables.
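The set difference between a coverage table and the sales fact table can be sketched with SQL's EXCEPT operator. The table names, keys and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Coverage (factless) table: keys only, no measures.
CREATE TABLE fact_promotion_coverage (date_key INTEGER, product_key INTEGER);
-- Primary fact table: only products that actually sold appear here.
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, sales_amount REAL);
INSERT INTO fact_promotion_coverage VALUES (1, 101), (1, 102), (1, 103);
INSERT INTO fact_sales VALUES (1, 101, 20.0), (1, 103, 5.0);
""")

# Promoted products that did not sell = coverage rows minus sales rows.
unsold = conn.execute("""
    SELECT date_key, product_key FROM fact_promotion_coverage
    EXCEPT
    SELECT date_key, product_key FROM fact_sales
""").fetchall()

print(unsold)
```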
What is a Measure?
Measures are numerical data based on columns in a fact table.
What is Conformed Fact?
Conformed facts are allowed to have the same name in separate tables and can be
combined and compared mathematically.
What is the difference between OLAP and OLTP?
OLTP Schema                                  OLAP Schema
Normalized                                   De-Normalized
More no. of transactions                     Less no. of transactions
More no. of users                            Less no. of users
Less time for query execution                More time for query execution
Has insert, delete and update transactions   Will not have many insert, update and delete transactions
What is the Difference between Star Schema and Snow Flake Schemas?
Star Schema                        Snow Flake
De-Normalized                      Normalized
Easy to use and understand         End user may get confused
Needs little maintenance effort    Easy to maintain
Fast execution of queries          More time for execution because of more joins
Are OLAP databases called decision support systems?
Yes
What are the methodologies of Data Warehousing?
Top Down Approach (Bill Inmon):
The individual departments' data (Data Marts) is prepared from the Enterprise Data
Warehouse.
Bottom Up Approach (Ralph Kimball):
First gather each department's data, then cleanse and transform it, and load all the
individual departments' data into the enterprise DWH.
What is a Surrogate key?
A surrogate key is a system-generated substitute for the natural primary key.
Ex: A customer purchases different items in different locations, and for this situation
we have to maintain historical data.
By using a surrogate key we can introduce a new row in the DWH for the same natural key
to maintain historical data.
Which columns go to the fact table and which columns go to the dimension table?
Aggregate or calculated value columns go to the fact table, and descriptive detail
goes to the dimension table.
What is Data Mining?
Data Mining is the process of extracting hidden trends from within a data warehouse.
Ex: An insurance DWH can be used to mine data for the highest-risk people to insure
in a certain geographical area.
What is the difference between E-R Modeling and Dimension Modeling?
The basic difference is that E-R modeling has both a logical and a physical model,
whereas a dimensional model has only a physical model.
E-R modeling is used for normalizing the OLTP database design.
Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.
E-R modeling revolves around the entities and their relationships to capture the
overall process of the system.
Dimensional model/Multi-Dimensional Modeling revolves around Dimensions
(point of analysis) for decision making and not to capture the process.
What is the difference between two-tier architecture and three-tier architecture?
Two-tier architecture is client/server architecture, where a request to do some task
is sent to the server and the server responds by performing the task. A three-tier
or multi-tier architecture, by contrast, has a client, a server and a database: the client
request is sent to the server, and the server in turn sends the request to the database.
The database sends back the information/data required to the server, which in turn sends
it to the client.
1-tier Application: All the processing is done on one machine, and a number of
clients are attached to this machine (mainframe application).
2-tier Application: Clients and database are on different machines. Processing is
done at the client side. The application layer is on the clients.
3-tier Application: The client is partially thick; apart from that there are two more
layers, the application layer and the database layer.
What is Granularity?
The lowest level of information that will be stored in the fact table. Defining it
involves two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.
What is Hierarchy?
The specification of levels that represents the relationship between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is Year →
Quarter → Month → Day.
What is Attribute?
A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
A column in the database corresponds to a level in the dimension.
Slowly changing dimension
Dimensions that change over time are called Slowly Changing Dimensions. For
instance, a product price changes over time; People change their names for some reason;
Country and State names may change over time. These are a few examples of Slowly
Changing Dimensions since some changes are happening to them over a period of time.
Slowly Changing Dimensions are often categorized into three types, namely
Type 1, Type 2 and Type 3.
In a Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.
In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent
the new information. Therefore, both the original and the new record will be present. The
new record gets its own primary key, and we add two columns for the effective begin date
and end date.
In a Type 3 Slowly Changing Dimension, the original record is modified to reflect the
change. We add another column for the old value.
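A Type 2 change can be sketched as "close the current row, then insert a new row with its own key". The dim_customer table, the column names effective_from and effective_to, the use of NULL to mark the current row, and the helper apply_type2_change are all assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id    TEXT,                               -- natural key from the source
    city           TEXT,
    effective_from TEXT,
    effective_to   TEXT                                -- NULL marks the current row
)""")
conn.execute(
    "INSERT INTO dim_customer (customer_id, city, effective_from, effective_to) "
    "VALUES ('C1', 'Pune', '2023-01-01', NULL)"
)

def apply_type2_change(conn, customer_id, new_city, change_date):
    """Hypothetical helper: keep history by closing the old row and adding a new one."""
    conn.execute(
        "UPDATE dim_customer SET effective_to = ? "
        "WHERE customer_id = ? AND effective_to IS NULL",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, effective_from, effective_to) "
        "VALUES (?, ?, ?, NULL)",
        (customer_id, new_city, change_date),
    )

apply_type2_change(conn, "C1", "Delhi", "2024-06-01")
history = conn.execute(
    "SELECT customer_key, city, effective_from, effective_to "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()

print(history)
```

By contrast, a Type 1 change would be a single UPDATE of the city column, and a Type 3 change would copy the old city into an extra column before overwriting it.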
What is the meaning of stitched query?
Stitch queries send two separate queries to the data source and then merge the results locally.
What is the difference between a correlated sub query and a nested sub query?
A nested sub query is one in which the inner query does not refer to the outer query,
so it is evaluated only once and the outer query is evaluated from that result.
A correlated sub query is one in which the inner query refers to columns of the outer
query, so it is evaluated once for each row of the outer query.
E.g. a query using IN (SELECT ...) whose inner query does not reference the outer query
is a nested query.
A query using EXISTS (SELECT ...) whose inner query references an outer column is a
correlated query.
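The two kinds of sub query can be put side by side in a short sketch. The emp and dept_active tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (name TEXT, dept TEXT, salary REAL);
CREATE TABLE dept_active (dept TEXT);
INSERT INTO emp VALUES ('Asha', 'IT', 90), ('Ravi', 'IT', 70),
                       ('Mena', 'HR', 60), ('John', 'HR', 50);
INSERT INTO dept_active VALUES ('IT');
""")

# Nested (non-correlated): the inner query never mentions the outer table,
# so its result can be computed once and reused for every outer row.
nested = conn.execute("""
    SELECT name FROM emp
    WHERE dept IN (SELECT dept FROM dept_active)
    ORDER BY name
""").fetchall()

# Correlated: the inner query refers to e.dept from the outer query,
# so logically it is re-evaluated for each outer row.
correlated = conn.execute("""
    SELECT name FROM emp e
    WHERE salary > (SELECT AVG(salary) FROM emp i WHERE i.dept = e.dept)
    ORDER BY name
""").fetchall()

print(nested, correlated)
```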
What is the main difference between the IN and EXISTS clause in subqueries?
The main difference between the IN and EXISTS predicate in subquery is the way in
which the query gets executed.
IN -- The inner query is executed first and the list of values obtained as its result is
used by the outer query. The inner query is executed only once.
EXISTS -- The first row from the outer query is selected, then the inner query is executed
and the outer query uses its result for checking. This process of inner query
execution repeats as many times as there are outer query rows. That is, if there are
ten rows that can result from the outer query, the inner query is executed ten
times.
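Both predicates can express the same question, as the sketch below shows with hypothetical customers and orders tables. The evaluation order described above is the logical model; a database optimizer is free to choose a different execution plan for either form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Mena');
INSERT INTO orders VALUES (1), (1), (2);
""")

# IN: the inner query produces a list of customer ids once.
in_rows = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders)
    ORDER BY name
""").fetchall()

# EXISTS: the inner query is checked against each outer row via c.id.
exists_rows = conn.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name
""").fetchall()

print(in_rows, exists_rows)
```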
What is a Sub-query?
Sub query is a query whose return values are used in filtering conditions of the main
query.
When do you use WHERE clause and when do you use HAVING clause?
Where Clause :- Used to filter the records from the table before group by clause.
Having Clause :- Used to filter the grouped records after group By clause.
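A single query can use both clauses, which makes the order of filtering visible. The sales table and the thresholds below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10.0), ("East", 200.0), ("West", 50.0), ("West", 60.0), ("North", 5.0)],
)

rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 10          -- filters individual rows before grouping
    GROUP BY region
    HAVING SUM(amount) > 150    -- filters whole groups after grouping
    ORDER BY region
""").fetchall()

print(rows)
```

Here the North row is dropped by WHERE before any group exists, and the West group is dropped by HAVING because its total of 110 does not exceed 150.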
What is the difference between Data source and Database?
Database means any type of database like Oracle, DB2, Teradata, etc.
Data source means the place from which we are retrieving data; a data source may be a
database, Cognos EP series mappings, SAP BW mappings and so on.
What are the differences between stored procedures and triggers?
Stored procedures are compiled collections of programs or SQL statements that live in the
database. A stored procedure can access and modify data present in many tables. Also, a
stored procedure is not associated with any particular database object. Triggers, by
contrast, are event-driven special procedures which are attached to a specific database
object, say a table. Stored procedures are not run automatically; they have to be called
explicitly by the user. Triggers get executed when the particular event associated with
the trigger is fired. For example, consider a database with, say, 200 users where the last
modified timestamp needs to be updated every time the data is changed. To
ensure this, one may have a trigger on the insert or update event, so that whenever any
insert or update event on the table is fired the corresponding trigger gets activated and
updates the last modified timestamp column with the current time. Thus the main
difference between a stored procedure and a trigger is that a stored procedure's
program logic is executed on the database server explicitly at the user's request, whereas
a trigger, an event-driven procedure attached to a database object such as a table, is
fired automatically when its event occurs.
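The last-modified example can be sketched with a trigger in SQLite. SQLite has triggers but no stored procedures, so the explicit-call side is represented here by an ordinary Python function; the users table, the trigger name users_touch and the function rename_user are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
-- The trigger is attached to the users table and fires on its own
-- whenever the name column is updated.
CREATE TRIGGER users_touch AFTER UPDATE OF name ON users
BEGIN
    UPDATE users SET last_modified = datetime('now') WHERE id = NEW.id;
END;
INSERT INTO users (id, name) VALUES (1, 'asha');
""")

def rename_user(conn, user_id, new_name):
    """Stand-in for a stored procedure: it runs only when called explicitly."""
    conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))

before = conn.execute("SELECT last_modified FROM users WHERE id = 1").fetchone()[0]
rename_user(conn, 1, "Asha")  # explicit call; the trigger then fires automatically
after = conn.execute("SELECT last_modified FROM users WHERE id = 1").fetchone()[0]

print(before, after)
```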
What are stored procedures?
A stored procedure is a named collection of SQL statements and procedural logic that is
compiled, verified and stored in a server database. It is typically treated like any other
database object. Stored procedures accept input parameters so that a single procedure can
be used over the network by multiple clients using different input data. A single remote
message triggers the execution of a collection of stored SQL statements. The results are a
reduction of network traffic and better performance.
What is inline view?
The inline view is a construct in Oracle SQL where you can place a query in the SQL
FROM clause, just as if the query was a table name.
A common use for in-line views in Oracle SQL is to simplify complex queries by
removing join operations and condensing several separate queries into a single query.
Example: The best example of the in-line view is the common Oracle DBA script that is
used to show the amount of free space and used space within all Oracle table spaces.
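The shape of such a query can be sketched as two inline views joined in the FROM clause. This is not the Oracle DBA script itself: the segments and free_space tables are hypothetical stand-ins rather than Oracle's dictionary views, and sqlite3 is used only to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segments   (ts_name TEXT, used_mb REAL);
CREATE TABLE free_space (ts_name TEXT, free_mb REAL);
INSERT INTO segments   VALUES ('USERS', 10), ('USERS', 30), ('SYSTEM', 100);
INSERT INTO free_space VALUES ('USERS', 60), ('SYSTEM', 20), ('SYSTEM', 5);
""")

# Each parenthesized SELECT in the FROM clause is an inline view,
# used exactly as if it were a table name.
rows = conn.execute("""
    SELECT u.ts_name, u.used_mb, f.free_mb
    FROM (SELECT ts_name, SUM(used_mb) AS used_mb
          FROM segments GROUP BY ts_name) u
    JOIN (SELECT ts_name, SUM(free_mb) AS free_mb
          FROM free_space GROUP BY ts_name) f
      ON f.ts_name = u.ts_name
    ORDER BY u.ts_name
""").fetchall()

print(rows)
```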
What is correlated sub query?
A correlated sub query is a sub query that contains a reference to a table that also appears
in the outer query.
SELECT * FROM t1 WHERE column1 = ANY(SELECT column1 FROM t2 WHERE
t2.column2 = t1.column2);
Cardinality
A property of a relationship that is used to ensure that queries return the correct results.
Cardinality describes the association between two query subjects and is set at each end of
the relationship.
Cardinality is expressed by using the following notation:
0..1 (zero or one match)
1..1 (only one match required)
0..n (zero or more matches)
1..n (one or more matches required)
The first part of the notation specifies the minimum required matches that must exist
between tables: 0 indicates that finding a match is optional, and 1 indicates that at least
one row must match. The second part defines the maximum required matches (1=1,
n=many).
For example, A and B have an association with one another. The cardinality of 1..1 for
table A means that for each row in table B, there is only one row in table A. The
cardinality of 0..n for table B means that for each row in table A, there are zero or many
rows in table B.
Fan Trap (AKA Chasm Trap)
The use of data from two fact tables with a central dimension table within one query.
If one entry from the dimension table is used and fact table A has 5 facts for it, the
result is 5 rows in the resultant data set.
If the query is then linked to fact table B, which has 10 rows relating to the single
dimensional item, the resultant data set will be 50 rows. This happens because the result
set of the first two tables (5 rows) is multiplied by the entries in fact table B (10
rows), as there are now effectively 5 entries in the dimension table.
This error can be avoided by using a stitch query, either built into the reporting tool or
written with stitch SQL commands such as UNION or INTERSECT.
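The 5 x 10 = 50 row multiplication, and a stitch-style fix, can be reproduced with a small sketch. The dim_item, fact_a and fact_b tables and their amounts are hypothetical; the stitch here is done by running one query per fact table and merging the results locally in Python, which is one possible approach and not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_a (item_key INTEGER, amount REAL);
CREATE TABLE fact_b (item_key INTEGER, amount REAL);
INSERT INTO dim_item VALUES (1, 'Widget');
""")
conn.executemany("INSERT INTO fact_a VALUES (?, ?)", [(1, 1.0)] * 5)
conn.executemany("INSERT INTO fact_b VALUES (?, ?)", [(1, 2.0)] * 10)

# The trap: joining both fact tables through the dimension in one query
# pairs every fact_a row with every fact_b row.
joined_rows, inflated_a = conn.execute("""
    SELECT COUNT(*), SUM(a.amount)
    FROM dim_item d
    JOIN fact_a a ON a.item_key = d.item_key
    JOIN fact_b b ON b.item_key = d.item_key
""").fetchone()

# Stitch: query each fact table separately, then merge on the dimension key.
totals_a = dict(conn.execute("SELECT item_key, SUM(amount) FROM fact_a GROUP BY item_key"))
totals_b = dict(conn.execute("SELECT item_key, SUM(amount) FROM fact_b GROUP BY item_key"))
stitched = {k: (totals_a.get(k), totals_b.get(k)) for k in totals_a.keys() | totals_b.keys()}

print(joined_rows, inflated_a, stitched)
```

The single joined query reports a fact A total of 50 where the true total is 5; the stitched result keeps each total at its correct value.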
