
Q. What are non-additive facts?

A. Non-additive facts are facts in a fact table whose column values
cannot be aggregated (summed) to produce a meaningful result.

Q. What are non-additive facts in detail?

A. A fact may be a measure, a metric or a dollar value. Measures and
metrics such as heights, ratios and percentages are typically
non-additive facts.

A dollar value is an additive fact. If we want to find the amount for
a particular place over a particular period of time, we can add the
dollar amounts and arrive at the total amount.

An example of a non-additive fact is the height measured for 'citizens
by geographical location'. When we roll up 'city' data to 'state'
level, we should not add the heights of the citizens; instead we may
use the rows to derive a count.
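A minimal sketch of the distinction, using made-up rows (the state,
city, amounts and heights are illustrative assumptions): dollar amounts
roll up from city to state by summing, while heights are counted or
averaged instead.

```python
from collections import defaultdict

# Hypothetical city-level rows: (state, city, sales_dollars, citizen_height_cm)
rows = [
    ("CA", "Fresno", 100.0, 170.0),
    ("CA", "Fresno", 50.0, 180.0),
    ("CA", "Napa", 25.0, 160.0),
]

sales_by_state = defaultdict(float)   # additive fact: SUM is meaningful
heights_by_state = defaultdict(list)  # non-additive fact: keep values, do not SUM

for state, _city, sales, height in rows:
    sales_by_state[state] += sales
    heights_by_state[state].append(height)

# Roll up heights with COUNT and AVG rather than SUM.
height_count = {s: len(h) for s, h in heights_by_state.items()}
height_avg = {s: sum(h) / len(h) for s, h in heights_by_state.items()}
```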

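Q. Explain type 1, type 2 and type 3 slowly changing dimensions (SCD)
in detail.

A. Type 1: overwrite the old data; no history is kept.

Type 2: add a new row for each change, so current and historical data
are both kept.

Type 3: add a column for the previous value, so only the current and
the most recent prior data are kept.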

Q. What is active data warehousing?


A. An active data warehouse provides information that enables
decision-makers within an organization to manage customer
relationships nimbly, efficiently and proactively. Active data
warehousing is about integrating advanced decision support with
day-to-day, even minute-to-minute, decision making in a way that
increases the quality of customer touches, which encourages customer
loyalty and thus secures an organization's bottom line. The
marketplace is coming of age as we progress from first-generation
"passive" decision-support systems to current- and next-generation
"active" data warehouse implementations.

Q. Are OLAP databases called decision support systems? True or false?

A. True. OLAP databases are called decision support systems because
they help in analyzing the historical and present data of an
organization, and thereby help in making intelligent business
decisions.

Q. What is a snapshot?

A. It is a permanent local copy of the data in a report, which can be
used to create reports.

Q. What is the difference between data warehousing and business
intelligence?

A. Business Intelligence:
BI is a broad category of applications and technologies for gathering,
integrating, storing, analyzing and providing access to data to help
enterprise users make better business decisions. BI applications
include the activities of decision support systems, query and
reporting, online analytical processing, statistical analysis,
forecasting and data mining.

Data warehousing:
Data warehousing is the process of building data warehouses through
dimensional modeling and through extracting, cleaning, conforming and
delivering data. The resulting warehouses are subject-oriented,
time-variant and non-volatile.

Q. What is a factless fact table? Where have you used them?

A. A factless fact table is a fact table that has no measures.
It is used only to relate the elements of various dimensions.

Q. What is the difference between ODS and OLTP?

A. An ODS (operational data store) is a collection of tables created
in the data warehouse environment that maintains only current data.
OLTP systems maintain data only for transactions; they are designed
for recording the daily operations and transactions of a business.

Q. What is the difference between OLAP and data warehousing?

A. A data warehouse is the place where data is stored for analysis,
whereas OLAP is the process of analyzing the data, managing
aggregations, and partitioning information into cubes for in-depth
visualization.

Q. What are aggregate tables and aggregate fact tables?

A. An aggregate table contains summarized data. Materialized views are
one form of aggregate table.

For example, in sales we may store transactions only at the date
level. If we want a report such as sales by product per year, we
aggregate the date-level values into tables such as week_agg,
month_agg, quarter_agg and year_agg. To retrieve data from these
tables we use the @aggregate function.
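As a rough sketch of the idea, the following builds a yearly aggregate
table from date-level sales using Python's built-in sqlite3 module. The
table and column names (sales, year_agg, sale_year, total_amount) and
the rows are assumptions made for illustration; a real warehouse would
use its own schema and would refresh the aggregate during the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2001-03-01", "tea", 10.0),
        ("2001-07-15", "tea", 5.0),
        ("2002-01-20", "tea", 7.0),
    ],
)

# Build the yearly aggregate once; reports then read the much smaller table.
conn.execute(
    "CREATE TABLE year_agg (sale_year TEXT, product TEXT, total_amount REAL)"
)
conn.execute(
    """
    INSERT INTO year_agg
    SELECT substr(sale_date, 1, 4), product, SUM(amount)
    FROM sales
    GROUP BY substr(sale_date, 1, 4), product
    """
)

year_rows = conn.execute(
    "SELECT sale_year, product, total_amount FROM year_agg ORDER BY sale_year"
).fetchall()
```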

Q. What is the difference between E-R modeling and Dimensional
modeling?

A. ER modeling:
- focuses on making data efficient to process (insert, update,
delete)
- minimizes data redundancy (ideally to zero)

Dimensional modeling:
- focuses on making data efficient to retrieve
(for example, by reporting and analysis tools)
- tolerates considerable data redundancy
- consists of fact and dimension tables

Q. What is the definition of a data mart?

A. A data mart is a subset of the data warehouse. You can also think
of a data mart as holding the data of one subject area. For example,
consider an organization that has HR, Finance, Communications and
Corporate Service divisions. You can create a data mart for each
division. In this approach, the historical data is stored in the data
marts first and then exported to the data warehouse.

Q. In how many ways can data be purged from the cache?

A. In Informatica there is a 'rebuild cache' option. Use it to
rebuild the cache.

Q. What is the difference between choosing a multi-dimensional
database and a relational database?

A. Multi-dimensional database: OLAP (Online Analytical Processing)

Relational database: OLTP (Online Transaction Processing)

Q. Which approach is better and why? Loading data from data marts to
the data warehouse or vice versa?

A. The best way is to move data from the ODS to the data warehouse,
and from there to the data marts.

Q: What is a Data Warehouse?


A: A Data Warehouse is the "corporate memory". Academics will say it
is a subject oriented, point-in-time, integrated, time-variant, non-
volatile inquiry only collection of operational data. Typical relational
databases are designed for on-line transactional processing (OLTP) and
do not meet the requirements for effective on-line analytical
processing (OLAP). As a result, data warehouses are designed
differently than traditional relational databases.

Q: What is ETL?
A: ETL is the Data Warehouse acquisition processes of Extracting (E),
Transforming or Transporting (T) and Loading (L) data from source
systems into the data warehouse.

Q: What is the difference between a Data warehouse and a Data mart?
A: A data mart contains departmental data, and a data warehouse
contains enterprise-wide data.

Q: What is the difference between a W/H and an OLTP application?
A: OLTP databases are designed to pass the "ACID" tests (atomicity,
consistency, isolation and durability); they are normalized and carry
many integrity constraints. An OLTP application is update-intensive
(many small updates). Warehouses are time-referenced,
subject-oriented, non-volatile (read only) and integrated. Since a
data warehouse is not updated, these constraints are relaxed.

Q: What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
A: On-Line Analytical Processing (OLAP), Relational OLAP (uses an
RDBMS), Multi-dimensional OLAP (cube), Hybrid OLAP (ROLAP + MOLAP).
- ROLAP stands for Relational OLAP. Users see their data organized
in cubes with dimensions, but the data is really stored in a
Relational Database (RDBMS).
- MOLAP stands for Multidimensional OLAP. Users see their data
organized in cubes with dimensions, but the data is stored in a
Multi-dimensional database (MDBMS) like Oracle Express Server.
In a MOLAP system many queries have a finite answer, and
performance is usually critical and fast.
- HOLAP stands for Hybrid OLAP; it is a combination of both worlds.
Seagate Software's Holos is an example HOLAP environment. In a
HOLAP system one will find queries on aggregated data as well
as on detailed data.

Q: What is the difference between an ODS and a W/H?

A: An ODS is a staging area where we bring all OLTP data on a
real-time basis and put it in a de-normalized form. A W/H contains
data for a longer period and is non-volatile (read only) and
integrated in nature.

Q: What Oracle tools can be used to design and build a W/H?


A: Oracle Warehouse Builder.

Q: When should one use an MD-database (multi-dimensional database)
and not a relational one?
A: To develop an analytical application that allows users to slice
and dice measures against various contexts (dimensions).

Q: What is a star schema? Why does one design this way?

A: A star schema has only one fact table, containing all the
measures, and multiple dimension tables linked directly to the fact
table, containing the various contexts against which the measures
have been taken. This helps to address real-life analytical problems
by providing multidimensional cube views to the users.
- It allows for the highest level of flexibility of metadata
- Low maintenance as the data warehouse matures
- Best possible performance

Q: When should you use a STAR and when a SNOWFLAKE schema?
A: As a rule, avoid SNOWFLAKE and de-normalize the dimensions into a
STAR.

Q: What is the difference between Oracle Express and Oracle
Discoverer?
A: Express is an MD database and development environment.
Discoverer is an ad-hoc end-user query tool.

Q: What is the difference between View and Materialized View?

A: The data of a Materialized View is saved in a physical table, so
data access is fast because the table is read directly. A View
performs the joins on its underlying tables every time it is
referenced in a query.

Q: How can Oracle Materialized Views be used to speed up data
warehouse queries?
A: Using the “Query Rewrite” feature, Oracle may access data from
available Materialized Views instead of the base tables. This will
eliminate some table joins.

Q: What Oracle features can be used to optimize my Warehouse system?
A: Bitmap Index, Join Index, enabling “Query Rewrite” to use
Materialized Views, setting the parameter STAR_TRANSFORMATION_ENABLED
= TRUE, Partitioning, Parallel Query (parallel_max_servers > 0 and
the degree of the table set > 1), transportable tablespaces to
transfer data between Oracle databases, etc.

What are the perceptions to use ER and Normalization?

Q. What is ER model and Dimensional Model?

ER Model - Relational
Dimensional - one of the following:
- Star Schema (a central fact table with numeric data; all other
tables are linked to the central table; faster, but denormalised)
- Snowflake Schema (one fact table, with the dimension tables
normalized)
- Fact Constellation (different fact tables, combined from one
datamart to another)

What is Metadata?
Information about the domain and structure of the data warehouse

What are different types of Dimensional Modeling?

- Star Schema (a central fact table with numeric data; all other
tables are linked to the central table; faster, but denormalised)
- Snowflake Schema (one fact table, with the dimension tables
normalized)
- Fact Constellation (different fact tables, combined from one
datamart to another)

1. What is dimensional modelling? What is called a dimension?
What are the different types of dimensional modelling?
Have you done any ER modelling? If so, how does it differ from
dimensional modelling?
Which type do you prefer? Why wouldn't you use the other type?

2. What is snowflaking? Example?
Why do you use snowflaking? How is it different from star
organization?
What are the advantages or disadvantages of snowflaking?
What type of data organization do you prefer? Why?

4. What RDBMS are you most comfortable in?
How does it support data warehousing needs?

5. In data modelling, how do you implement a many-to-many
relationship with respect to E-R modelling?

6. Do you have any experience in data loading?
What tools or methods have you used for data loading?

10. Why do you use dimensional modelling instead of ER modelling for
data warehousing applications?

1) Erwin - Is it possible to reverse engineer two different schemas
into a single data model?

2) Suppose there is a star schema where a fact table has 3 dimension
tables and this system is in production. Is it possible to add more
dimension tables to the fact table? What is the impact on all the
stages?

Difference between Star & Snowflake Schema

Snowflaking is a star schema design technique that stores logical
attributes, usually of low cardinality, in separate tables using a
loose form of normalization. For example, you could snowflake the
gender of your customers in order to track changes to that attribute
if your customer dimension is too large to handle as an SCD.

The technique is not recommended if you are going to use OLAP tools
for your front end, because of speed issues.

Snowflaking allows for easy update and load of data, as redundancy of
data is avoided to some extent, but browsing capabilities are greatly
compromised. Sometimes, however, it becomes a necessary evil.

To add a little to this, snowflaking often becomes necessary when you
need data for which there is a one-to-many relationship with a
dimension table. Trying to consolidate this data into the dimension
table would necessarily lead to redundancy (this is a violation of
second normal form, which will produce a Cartesian product). This sort
of redundancy can cause misleading results in queries, since the count
of rows is artificially large (due to the Cartesian product). A simple
example of such a situation might be a "customer" dimension for which
there is a need to store multiple contacts. If the contact information
is brought into the customer table, there would be one row for each
contact (i.e., one for each customer/contact combination). In this
situation, it is better just to create a "contact" snowflake table
with a FK to the customer. In general, it is better to avoid
snowflaking if possible, but sometimes the consequences of avoiding it
are much worse.
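The customer/contact situation can be sketched as follows with Python's
built-in sqlite3 module. All table names, column names and rows here
are hypothetical. The sketch shows how a sale is repeated once per
contact as soon as the one-to-many contact data is joined in, which is
the row inflation described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    -- Snowflake table: many contacts per customer, FK back to customer.
    CREATE TABLE contact (
        contact_key INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES customer(customer_key),
        contact_name TEXT
    );
    CREATE TABLE sales_fact (customer_key INTEGER, amount REAL);

    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO contact VALUES (10, 1, 'Ann'), (11, 1, 'Bob');
    INSERT INTO sales_fact VALUES (1, 100.0);
    """
)

# Fact joined to the clean customer dimension: one row per sale.
correct_total = conn.execute(
    "SELECT SUM(f.amount) FROM sales_fact f "
    "JOIN customer c ON c.customer_key = f.customer_key"
).fetchone()[0]

# Joining through the one-to-many contact data repeats the sale once per
# contact, so the total is artificially large.
inflated_total = conn.execute(
    "SELECT SUM(f.amount) FROM sales_fact f "
    "JOIN contact k ON k.customer_key = f.customer_key"
).fetchone()[0]
```

Keeping contacts in their own table means ordinary sales queries never
touch them, and the contact join is used only when contact details are
actually needed.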

In a star schema, all your dimensions are linked directly to your fact
table. In a snowflake schema, on the other hand, dimensions may be
interlinked or may have one-to-many relationships with other tables.
This isn't a desirable situation, but you can make the best choice
once you have gathered all the requirements.

A snowflake is a design like a star, but with connecting tables among
the dimension tables, that is, relations between two dimensions.

3. Q: Which is better, Star or Snowflake?
A: Strict data warehousing rules would have you use a Star schema but
in reality most designs tend to become Snowflakes. They each have
their pros and cons but both are far better than trying to use a
transactional system third-normal form design.

4. Q: Why can’t I use a copy of my transactional system for my data
warehouse?
A: This is one of the absolute worst things you can do. A lot of people
initially go down this road because a tool vendor will support the idea
when making their sales pitch. Many of these attempts will even
experience success for a short period of time. It’s not until your data
sets grow and your business questions begin to be complex that this
design mistake will really come out to bite you.

Q. What are the responsibilities of a data warehouse
consultant/professional?

The basic responsibility of a data warehouse consultant is to ‘publish
the right data’.
Some of the other responsibilities of a data warehouse consultant are:
1. Understand the end users by their business area, job
responsibilities, and computer tolerance
2. Find out the decisions the end users want to make with the help
of the data warehouse
3. Identify the ‘best’ users who will make effective decisions using
the data warehouse
4. Find the potential new users and make them aware of the data
warehouse
5. Determine the grain of the data
6. Make the end user screens and applications much simpler and
more template driven

Q. Stars and Cubes (Polaris)

The star schema and OLAP cube are intimately related. Star schemas
are most appropriate for very large data sets. OLAP cubes are most
appropriate for smaller data sets where analytic tools can perform
complex data comparisons and calculations. In almost all OLAP cube
environments, it’s recommended that you originally source data into a
star schema structure, and then use wizards to transform the data into
the OLAP cube.

Q. What is the necessity of having dimensional modeling instead of
ER modeling?

Compared to entity/relation modeling, it's less rigorous (allowing the
designer more discretion in organizing the tables) but more practical
because it accommodates database complexity and improves
performance.

Q. Dimensions and Facts.

Dimensional modeling begins by dividing the world into measurements
and context. Measurements are usually numeric and taken repeatedly.
Numeric measurements are facts. Facts are always surrounded by
mostly textual context that's true at the moment the fact is recorded.
Facts are very specific, well-defined numeric attributes. By contrast,
the context surrounding the facts is open-ended and verbose. It's not
uncommon for the designer to add context to a set of facts partway
through the implementation.

Dimensional modeling divides the world of data into two major types:
Measurements and Descriptions of the context surrounding those
measurements. The measurements, which are typically numeric, are
stored in fact tables, and the descriptions of the context, which are
typically textual, are stored in the dimension tables.

A fact table in a pure star schema consists of multiple foreign keys,
each paired with a primary key in a dimension, together with the facts
containing the measurements.

Every foreign key in the fact table has a match to a unique primary key
in the respective dimension (referential integrity). This allows the
dimension table to possess primary keys that aren’t found in the fact
table. Therefore, a product dimension table might be paired with a
sales fact table in which some of the products are never sold.
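
A small sketch of this referential integrity rule, using Python's
built-in sqlite3 module. The tables product_dim and sales_fact and
their rows are assumptions for illustration. A dimension key with no
fact rows is legal (a product never sold), while a fact row pointing at
a missing dimension key is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales_fact (
        product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
        dollar_sales REAL
    );
    INSERT INTO product_dim VALUES (1, 'Soap'), (2, 'Towel');
    INSERT INTO sales_fact VALUES (1, 9.5);
    """
)

# Dimension keys with no matching fact row are allowed: products never sold.
never_sold = [
    name
    for (name,) in conn.execute(
        """
        SELECT p.product_name
        FROM product_dim p
        LEFT JOIN sales_fact f ON f.product_key = p.product_key
        WHERE f.product_key IS NULL
        """
    )
]

# The reverse is rejected: a fact row must point at an existing dimension key.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```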

Dimensional models are full-fledged relational models, where the fact
table is in third normal form and the dimension tables are in second
normal form.

The main difference between second and third normal form is that
repeated entries are removed from a second normal form table and
placed in their own “snowflake”. Thus the act of removing the context
from a fact record and creating dimension tables places the fact table
in third normal form.

E.g. for Fact tables: Sales, Cost, Profit
E.g. for Dimensions: Customer, Product, Store, Time

Q. What are Additive Facts? Or what is meant by Additive Fact?

Fact tables are usually very large, and we almost never fetch a single
record into our answer set. We fetch a very large number of records,
on which we then add, count, average, or take the min or max. The most
common of these is adding. Applications are simpler if they store
facts in an additive format as often as possible. Thus, in the
grocery example, we don’t need to store the unit price. We compute
the unit price by dividing the dollar sales by the unit sales whenever
necessary.
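
As an illustration of the grocery example, the sketch below stores only
the additive facts (dollar sales and unit sales) and derives the unit
price after summing. The products and figures are made up.

```python
# Hypothetical grocery fact rows: (product, dollar_sales, unit_sales)
fact_rows = [
    ("milk", 6.00, 3),
    ("milk", 4.00, 2),
    ("bread", 5.00, 2),
]

totals = {}
for product, dollars, units in fact_rows:
    d, u = totals.get(product, (0.0, 0))
    totals[product] = (d + dollars, u + units)

# Unit price is derived after summing; it is never stored or summed itself.
unit_price = {p: d / u for p, (d, u) in totals.items()}
```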

Q. What is meant by averaging over time?

Some facts, like bank balances and inventory levels, represent
intensities that are awkward to express in an additive format. We can
treat these semi-additive facts as if they were additive, but just
before presenting the results to the end user, we divide the answer by
the number of time periods to get the right result. This technique is
called averaging over time.
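
A minimal sketch of averaging over time, assuming made-up month-end
balances for two hypothetical accounts: the balances are summed as if
they were additive, and the sum is then divided by the number of time
periods.

```python
# Hypothetical month-end balances: (account, period, balance)
balances = [
    ("acct_1", "2002-01", 100.0),
    ("acct_1", "2002-02", 200.0),
    ("acct_1", "2002-03", 300.0),
    ("acct_2", "2002-01", 50.0),
    ("acct_2", "2002-02", 50.0),
    ("acct_2", "2002-03", 50.0),
]

periods = {period for _acct, period, _bal in balances}

# Adding across accounts is fine; adding across time overstates the balance...
summed_over_everything = sum(bal for _acct, _period, bal in balances)

# ...so divide by the number of time periods before presenting the result.
average_total_balance = summed_over_everything / len(periods)
```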

Q. What is a Conformed Dimension?

When the enterprise decides to create a set of common labels across
all the sources of data, the separate data mart teams (or a single
centralized team) must sit down to create master dimensions that
everyone will use for every data source. These master dimensions are
called Conformed Dimensions.
Two dimensions are conformed if the fields that you use as row
headers have the same domain.

Q. What is a Conformed Fact?

If the definitions of measurements (facts) are highly consistent, we
call them Conformed Facts.

Q. What are the 3 important fundamental themes in a data warehouse?

The 3 most important fundamental themes are:
1. Drilling Down
2. Drilling Across and
3. Handling Time

Q. What is meant by Drilling Down?

Drilling down means nothing more than “give me more detail”.
Drilling Down in a relational database means “adding a row header” to
an existing SELECT statement. For instance, if you are analyzing the
sales of products at a manufacturer level, the select list of the query
reads:
SELECT MANUFACTURER, SUM(SALES).
If you wish to drill down on the list of manufacturers to show the brand
sold, you add the BRAND row header:
SELECT MANUFACTURER, BRAND, SUM(SALES).
Now each manufacturer row expands into multiple rows listing all the
brands sold. This is the essence of drilling down.

We often call a row header a “grouping column” because everything in
the list that’s not aggregated with an operator such as SUM must be
mentioned in the SQL GROUP BY clause. So the GROUP BY clause in
the second query reads, GROUP BY MANUFACTURER, BRAND.
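
The two queries can be tried end to end with Python's built-in sqlite3
module. This is only a sketch: the table name sales_fact and its rows
are assumptions, since the text gives only the select lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (manufacturer TEXT, brand TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("Acme", "Alpha", 10.0), ("Acme", "Beta", 20.0), ("Zenith", "Zed", 5.0)],
)

# Manufacturer level: one row per manufacturer.
by_manufacturer = conn.execute(
    "SELECT manufacturer, SUM(sales) FROM sales_fact "
    "GROUP BY manufacturer ORDER BY manufacturer"
).fetchall()

# Drilling down: add BRAND to both the select list and the GROUP BY clause.
by_brand = conn.execute(
    "SELECT manufacturer, brand, SUM(sales) FROM sales_fact "
    "GROUP BY manufacturer, brand ORDER BY manufacturer, brand"
).fetchall()
```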

Q. What is meant by Drilling Across?

Drilling Across adds more data to an existing row. If drilling down is
requesting ever finer and more granular data from the same fact table,
then drilling across is the process of linking two or more fact tables
at the same granularity, or, in other words, tables with the same set
of grouping columns and dimensional constraints.

A drill across report can be created by using grouping columns that
apply to all the fact tables used in the report.

The new fact table called for in the drill-across operation must share
certain dimensions with the fact table in the original query. All fact
tables in a drill-across query must use conformed dimensions.
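
A drill-across can be sketched as two separate answer sets merged on
their shared row header. The fact names (shipments, inventory), the
product labels and the figures below are hypothetical; the point is
only that both sets are grouped by the same conformed attribute.

```python
# Answer sets from two separate fact-table queries, both grouped by the same
# conformed dimension attribute (product). The figures are made up.
shipments = {"soap": 120, "towel": 40}   # e.g. SUM(units_shipped) by product
inventory = {"soap": 300, "brush": 75}   # e.g. SUM(units_on_hand) by product

# Full outer join on the shared row header: each fact adds a column to the row.
drill_across = {
    product: (shipments.get(product), inventory.get(product))
    for product in sorted(set(shipments) | set(inventory))
}
```

Products missing from one answer set keep a None in that column, which
mirrors what an outer join of the two result sets would return.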

Q. What is the significance of handling time?

For example, when a customer moves from a property, we might want to
know:
1. Who the new customer is
2. When the old customer moved out
3. When the new customer moved in
4. How long the property was empty, etc.

Q. What is meant by Drilling Up?

If drilling down is adding grouping columns from the dimension tables,
then drilling up is subtracting grouping columns.

Q. What is meant by Drilling Around?

The final variant of drilling is drilling around a value circle. This
is similar to a linear value chain, but occurs in a data warehouse
where the related fact tables that share common dimensions are not
arranged in a linear order. The best example is from health care,
where as many as 10 separate entities are processing patient
encounters, and are sharing this information with one another.
E.g. a typical health care value circle with 10 separate entities
surrounding the patient.

When the common dimensions are conformed and the requested
grouping columns are drawn from dimensions that tie to all the fact
tables in a given report, you can generate really powerful drill around
reports by performing separate queries on each fact table and outer
joining the answer sets in the client tool.

Q. What are the important fields in a recommended Time dimension
table?

Time_key
Day_of_week
Day_number_in_month
Day_number_overall
Month
Month_number_overall
Quarter
Fiscal_period
Season
Holiday_flag
Weekday_flag
Last_day_in_month_flag
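
As a rough sketch, rows for such a table can be generated with Python's
standard library. The function name build_time_dimension is
hypothetical, and Fiscal_period, Season and Holiday_flag are left out
because they depend on business rules that the list above does not
define.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
MONTH_NAMES = ("January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December")

def build_time_dimension(start: date, days: int) -> list:
    """Return one row per calendar day with some of the fields listed above."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        last_day = calendar.monthrange(d.year, d.month)[1]
        rows.append(
            {
                "time_key": offset + 1,  # surrogate integer key
                "day_of_week": DAY_NAMES[d.weekday()],
                "day_number_in_month": d.day,
                "day_number_overall": offset + 1,
                "month": MONTH_NAMES[d.month - 1],
                "quarter": f"Q{(d.month - 1) // 3 + 1}",
                "weekday_flag": "Y" if d.weekday() < 5 else "N",
                "last_day_in_month_flag": "Y" if d.day == last_day else "N",
            }
        )
    return rows

time_dim = build_time_dimension(date(2002, 1, 30), 3)
```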

Q. Why have timestamp as a surrogate key rather than a real date?

The time stamp in a fact table should be a surrogate key instead of a
real date because:

- The rare timestamp that is inapplicable, corrupted, or hasn’t
happened yet needs a value that cannot be a real date
- Most end-user calendar navigation constraints, such as fiscal
periods, end-of-periods, holidays, day numbers and week
numbers aren’t supported by database timestamps
- Integer time keys take up much less disk space than full
dates

Q. Why have more than one fact table instead of a single fact
table?

We cannot combine all of the business processes into a single fact
table because:
- The separate fact tables in the value chain do not share all
the dimensions. You simply can’t put the customer ship-to
dimension on the finished goods inventory data
- Each fact table possesses different facts, and the fact table
records are recorded at different times along the value chain

Q. What is meant by Slowly Changing Dimensions and what are the
different types of SCDs? (Mascot)

Dimensions don’t change in predictable ways. Individual customers and
products evolve slowly and episodically. Some of the changes are true
physical changes. Customers change their addresses because they
move. A product is manufactured with different packaging. Other
changes are actually corrections of mistakes in the data. And finally,
some changes are changes in how we label a product or customer and
are more a matter of opinion than physical reality. We call these
variations Slowly Changing Dimensions (SCD).

The 3 fundamental choices for handling the slowly changing dimension
are:

- Overwrite the changed attribute, thereby destroying previous
history
e.g. useful when correcting an error
- Issue a new record for the customer, keeping the customer natural
key, but creating a new surrogate primary key
- Create an additional field in the existing customer record, and store
the old value of the attribute in the additional field. Overwrite the
original attribute field

A Type 1 SCD is an overwrite of a dimensional attribute. History is
definitely lost. We overwrite when we are correcting an error in the
data or when we truly don’t want to save history.

A Type 2 SCD creates a new dimension record and requires a
generalized or surrogate key for the dimension. We create surrogate
keys when a true physical change occurs in a dimension entity at a
specific point in time, such as the customer address change or the
product packaging change. We often add a timestamp and a reason code
in the dimension record to precisely describe the change.
The Type 2 SCD records changes of values of dimensional entity
attributes over time. The technique requires adding a new row to the
dimension each time there’s a change in the value of an attribute (or
group of attributes) and assigning a unique surrogate key to the new
row.

A Type 3 SCD adds a new field in the dimension record but does not
create a new record. We might change the designation of the
customer’s sales territory because we redraw the sales territory map,
or we arbitrarily change the category of the product from confectionery
to candy. In both cases, we augment the original dimension attribute
with an “old” attribute so we can switch between these alternate
realities.
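
The three types can be contrasted in a few lines of Python. This is a
sketch under stated assumptions, not a production loader: the function
names, the row layout (surrogate_key, natural_key, effective_date) and
the "old_" column prefix are all hypothetical choices made for
illustration.

```python
from datetime import date

def scd_type1(row: dict, attr: str, new_value) -> dict:
    """Type 1: overwrite in place; the old value is lost."""
    return {**row, attr: new_value}

def scd_type2(rows: list, natural_key, attr: str, new_value, changed_on: date) -> list:
    """Type 2: add a new row with a new surrogate key; old rows are kept."""
    current = [r for r in rows if r["natural_key"] == natural_key][-1]
    new_row = {
        **current,
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        attr: new_value,
        "effective_date": changed_on,
    }
    return rows + [new_row]

def scd_type3(row: dict, attr: str, new_value) -> dict:
    """Type 3: keep the prior value in an 'old_' column, then overwrite."""
    return {**row, f"old_{attr}": row[attr], attr: new_value}

customer = {
    "surrogate_key": 1,
    "natural_key": "C100",
    "address": "1 Elm St",
    "territory": "North",
    "effective_date": date(2001, 1, 1),
}

t1 = scd_type1(customer, "address", "1 Elm Street")   # error correction
t2 = scd_type2([customer], "C100", "address", "9 Oak Ave", date(2002, 6, 2))
t3 = scd_type3(customer, "territory", "Northeast")    # relabelled territory
```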

Q. What are the techniques for handling SCDs?

- Overwriting
- Creating another dimension record
- Creating a current value field

Q. What is a Surrogate Key and where do you use it? (Mascot)

A surrogate key is an artificial or synthetic key that is used as a
substitute for a natural key. It is just a unique identifier or number
for each row that can be used as the primary key of the table.

It is useful because the natural primary key (e.g. Customer Number in
the Customer table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME
which are stated as the primary keys (according to the business users)
but, not only can these change, indexing on a numerical value is
probably better and you could consider creating a surrogate key called,
say, AIRPORT_ID. This would be internal to the system and as far as the
client is concerned you may display only the AIRPORT_NAME.

Another benefit you can get from surrogate keys (SID) is in tracking
the SCD - Slowly Changing Dimension.

A classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business
Unit 'BU1' (that's what would be in your Employee Dimension). This
employee has a turnover allocated to him on the Business Unit 'BU1'
but on the 2nd of June the Employee 'E1' is moved from Business Unit
'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the
new Business Unit 'BU2' but the old one should belong to the Business
Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your
data warehouse, everything would be allocated to Business Unit 'BU2',
even what actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new
record for the Employee 'E1' in your Employee Dimension with a new
surrogate key.

This way, in your fact table, you have your old data (before 2nd of
June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd
of June) would take the SID of the employee 'E1' + 'BU2.'
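
The 'E1' example can be sketched with Python's built-in sqlite3 module.
The table names, column names and turnover figures are hypothetical;
only the employee code, the business units and the dates come from the
example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee_dim (
        employee_sid INTEGER PRIMARY KEY,  -- surrogate key
        employee_code TEXT,                -- natural key
        business_unit TEXT
    );
    CREATE TABLE turnover_fact (employee_sid INTEGER, fact_date TEXT, turnover REAL);

    INSERT INTO employee_dim VALUES (1, 'E1', 'BU1');  -- valid from 1 January 2002
    INSERT INTO employee_dim VALUES (2, 'E1', 'BU2');  -- new row added on 2 June 2002

    INSERT INTO turnover_fact VALUES (1, '2002-03-15', 100.0);  -- before the move
    INSERT INTO turnover_fact VALUES (2, '2002-07-01', 40.0);   -- after the move
    """
)

# Joining on the surrogate key keeps old turnover with BU1 and new with BU2.
turnover_by_bu = dict(
    conn.execute(
        """
        SELECT d.business_unit, SUM(f.turnover)
        FROM turnover_fact f
        JOIN employee_dim d ON d.employee_sid = f.employee_sid
        GROUP BY d.business_unit
        """
    ).fetchall()
)
```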

You could consider a Slowly Changing Dimension as an enlargement of
your natural key: the natural key of the Employee was Employee Code
'E1', but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the
difference with the natural key enlargement process is that you might
not have all parts of your new key within your fact table, so you might
not be able to do the join on the new enlarged key, and so you need
another id.

Every join between dimension tables and fact tables in a data
warehouse environment should be based on surrogate keys, not natural
keys.

Q. What is the necessity of having surrogate keys?

- Production may reuse keys that it has purged but that you are
still maintaining
- Production might legitimately overwrite some part of a
product description or a customer description with new values
but not change the product key or the customer key to a new
value. We might be wondering what to do about the revised
attribute values (slowly changing dimension crisis)
- Production may generalize its key format to handle some new
situation in the transaction system. E.g. the production keys
may change from integers to alphanumeric, or the 12-byte keys
you are used to may become 20-byte keys
- Acquisition of companies

Q. What are the advantages of using Surrogate Keys?

- We can save substantial storage space with integer valued
surrogate keys
- Eliminate administrative surprises coming from production
- Potentially adapt to big surprises like a merger or an acquisition
- Have a flexible mechanism for handling slowly changing
dimensions

Q. What are Factless Fact tables?

Fact tables which do not have any facts are called factless fact tables.
They may consist of nothing but keys.

There are two kinds of fact tables that do not have any facts at all.

The first type of factless fact table is a table that records an event.
Many event-tracking tables in dimensional data warehouses turn out to
be factless.
E.g. A student tracking system that detects each student attendance
event each day.

The second type of factless fact table is called a coverage table.
Coverage tables are frequently needed when a primary fact table in a
dimensional data warehouse is sparse.
E.g. A sales fact table that records the sales of products in stores on
particular days under each promotion condition. The sales fact table
does answer many interesting questions but cannot answer questions
about things that did not happen. For instance, it cannot answer the
question, “which products were on promotion that did not sell?”
because it contains only the records of products that did sell. In this
case the coverage table comes to the rescue. A record is placed in the
coverage table for each product in each store that is on promotion in
each time period.
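
A minimal sketch of how the coverage table answers that question,
assuming hypothetical (product, store, period) keys: subtracting the
keys that appear in the sales fact table from the keys in the coverage
table leaves the promoted products that did not sell.

```python
# Hypothetical keys: (product, store, period)
coverage = {            # factless coverage table: what was on promotion
    ("soap", "S1", "2002-W01"),
    ("towel", "S1", "2002-W01"),
    ("brush", "S2", "2002-W01"),
}
sales = {               # keys present in the sales fact table: what sold
    ("soap", "S1", "2002-W01"),
    ("brush", "S2", "2002-W01"),
}

# Set difference answers "which products were on promotion but did not sell?"
promoted_not_sold = sorted(coverage - sales)
```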

Q. What is a causal dimension?

A causal dimension is a kind of advisory dimension that should not
change the fundamental grain of a fact table.
E.g. why did the customer buy the product? It can be due to a
promotion, a sale, etc.

Q. What is meant by Drill Through? (Mascot)

Operating Data Source - directly connects to application database

Q. What is Operational Data Store? (Mascot)

Q. What is BI? And why do we need BI?

Business Intelligence is an ongoing process that uses various
integration packages to analyze data.

Q. What is Slicing and Dicing? How can we do it in Impromptu?

We cannot do it in Impromptu. It is done only in PowerPlay.

GENERAL

Q. Explain the Project. (Polaris)

Explain about the various projects (MIDAS2/VIP).


Why was MIDAS2 or VIP or SCI developed.

Q. What is the size of the database in your project? (Polaris)


Approximately 900GB.

Q. What is the daily data volume (in GB/records)? Or What is


the size of the data extracted in the extraction process?
(Polaris)

Q. How many Data marts are there in your project?

Q. How many Fact and Dimension tables are there in your


project?

Q. What is the size of Fact table in your project?

Q. How many dimension tables did you have in your project? Name
some dimensions (columns). (Mascot)

Q. Name some measures in your fact table? (Mascot)

Q. Why couldn’t you go for Snowflake schema? (Mascot)

Q. How many Measures have you created? (Mascot)

Q. How many Facts & Dimension Tables are there in your Project?
(Mascot)

Q. Have you created Datamarts? (Mascot)

Q. What is the difference between OLTP and OLAP?

OLAP - Online Analytical Processing. Mainly required for DSS; data is
denormalized, mostly non-volatile and highly indexed to improve query
response time.

OLTP - Online Transaction Processing. DML-heavy; data is highly normalized
to reduce deadlocks and increase concurrency.

Q. What is the difference between OLTP and data warehouse?

Operational System              Data Warehouse
------------------              --------------
Transaction Processing          Query Processing
Time Sensitive                  History Oriented
Operator View                   Managerial View
Organized by transactions       Organized by subject
(Order, Input, Inventory)       (Customer, Product)
Relatively smaller database     Large database size
Many concurrent users           Relatively few concurrent users
Volatile Data                   Non Volatile Data
Stores all data                 Stores relevant data
Not Flexible                    Flexible

Q. Explain the DW life cycle

Data warehouses can have many different types of life cycles with
independent data marts. The following is an example of a data
warehouse life cycle.
In the life cycle of this example, four important steps are involved.

Extraction - As a first step, heterogeneous data from different online
transaction processing systems is extracted. This data becomes the
data source for the data warehouse.
Cleansing/transformation - The source data is sent into the populating
systems where the data is cleansed, integrated, consolidated, secured
and stored in the corporate or central data warehouse.
Distribution - From the central data warehouse, data is distributed to
independent data marts specifically designed for the end user.
Analysis - From these data marts, data is sent to the end users who
access the data stored in the data mart depending upon their
requirement.

Q. What is the life cycle of DW?

Getting data from OLTP systems across different data sources
Analysis & staging - putting data in a staging layer: cleaning, purging,
assigning surrogate keys, SCM, dimensional modeling
Loading
Writing of metadata

Q. What are the different Reporting and ETL tools available in
the market?

Q. What is a data warehouse?

A data warehouse is a database designed to support a broad range of
decision tasks in a specific organization. It is usually batch updated
and structured for rapid online queries and managerial summaries.
Data warehouses contain large amounts of historical data, which is
derived from transaction data but can include data from other
sources as well. A data warehouse is designed for query and analysis
rather than for transaction processing.

It separates analysis workload from transaction workload and enables
an organization to consolidate data from several sources.

The term data warehousing is often used to describe the process of
creating, managing and using a data warehouse.

Q. What is a data mart?

A data mart is a selected part of the data warehouse which supports
specific decision support application requirements of a company’s
department or geographical region. It usually contains simple
replicates of warehouse partitions or data that has been further
summarized or derived from base warehouse data. Instead of running
ad hoc queries against a huge data warehouse, data marts allow the
efficient execution of predicted queries over a significantly smaller
database.
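
A rough sketch of a data mart as a summarized subset of the warehouse, using Python's sqlite3 module. The warehouse_sales and north_sales_mart tables and the sample values are assumptions made up for illustration; a real mart would normally live in its own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse table holding detailed sales for all regions.
    CREATE TABLE warehouse_sales (region TEXT, product TEXT, month TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('north', 'soap',  '2024-01', 10.0),
        ('north', 'soap',  '2024-01', 15.0),
        ('north', 'chips', '2024-02', 7.0),
        ('south', 'soap',  '2024-01', 99.0);

    -- The mart keeps one region only and pre-summarizes by product and month.
    CREATE TABLE north_sales_mart AS
        SELECT product, month, SUM(amount) AS total_amount
        FROM warehouse_sales
        WHERE region = 'north'
        GROUP BY product, month;
""")

mart_rows = conn.execute(
    "SELECT product, month, total_amount FROM north_sales_mart "
    "ORDER BY product, month"
).fetchall()
print(mart_rows)
```

Predicted queries for the northern region can then run against the small mart table instead of the full warehouse table.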

Q. How do I differentiate between a data warehouse and a data mart?
(KPIT Infotech Pune, Mascot)

A data warehouse is for very large databases (VLDBs) and a data mart
is for smaller databases. The difference lies in the scope of the things
with which they deal.
A data mart is an implementation of a data warehouse with a small
and more tightly restricted scope of data and data warehouse
functions. A data mart serves a single department or part of an
organization. In other words, the scope of a data mart is smaller than
the data warehouse. It is a data warehouse for a smaller group of end
users.

Q. What is the aim/objective of having a data warehouse? And who
needs a data warehouse? Or what is the use of Data Warehousing?
(Polaris)

Data warehousing technology comprises a set of new concepts and
tools which support the executives, managers and analysts with
information material for decision making.
The fundamental reason for building a data warehouse is to improve
the quality of information in the organization.
The main goal of data warehouse is to report and present the
information in a very user friendly form.

Q. What approach should be followed to create a Data Warehouse?

Top Down Approach (data warehouse first), Bottom Up (data marts first),
Enterprise Data Model (combines both)

Q. Explain the methodology of Data Warehousing? (Polaris)

Q. What are the important concerns of OLTP and DSS systems?

Q. What is the Architecture of a data warehouse?

A data warehouse system (DWS) comprises the data warehouse and all
components used for building, accessing and maintaining the DWH
(illustrated in Figure 1). The center of a data warehouse system is the
data warehouse itself. The data import and preparation component is
responsible for data acquisition. It includes all programs, applications
and legacy systems interfaces that are responsible for extracting data
from operational sources, preparing and loading it into the warehouse.
The access component includes all different applications (OLAP or data
mining applications) that make use of the information stored in the
warehouse.

Additionally, a metadata management component (not shown in Figure 1)
is responsible for the management, definition and access of all
different types of metadata. In general, metadata is defined as “data
about data” or “data describing the meaning of data”. In data
warehousing, there are various types of metadata, e.g., information
about the operational sources, the structure and semantics of the DWH
data, the tasks performed during the construction, the maintenance
and access of a DWH, etc. The need for metadata is well known.
Statements like “A data warehouse without adequate metadata is like
a filing cabinet stuffed with papers, but without any folders or labels”
characterize the situation. Thus, the quality of metadata and the
resulting quality of information gained using a data warehouse solution
are tightly linked.
Implementing a concrete DWS is a complex task comprising two major
phases. In the DWS configuration phase, a conceptual view of the
warehouse is first specified according to user requirements (data
warehouse design). Then, the involved data sources and the way data
will be extracted and loaded into the warehouse (data acquisition) are
determined. Finally, decisions about persistent storage of the
warehouse using database technology and the various ways data will
be accessed during analysis are made.

After the initial load (the first load of the DWH according to the DWH
configuration), during the DWS operation phase, warehouse data must
be regularly refreshed, i.e., modifications of operational data since the
last DWH refreshment must be propagated into the warehouse such
that data stored in the DWH reflect the state of the underlying
operational systems. Besides DWH refreshment, DWS operation
includes further tasks like archiving and purging of DWH data or DWH
monitoring.
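
The refresh step described above can be sketched as follows. This is a simplified illustration, not a description of any particular tool: it assumes every source row carries a hypothetical modified_at timestamp, it overwrites the warehouse copy rather than keeping history, and it does not handle deletions.

```python
from datetime import datetime

def refresh(warehouse, source_rows, last_refresh):
    """Apply source rows modified after last_refresh; return the new watermark."""
    watermark = last_refresh
    for row in source_rows:
        if row["modified_at"] > last_refresh:
            warehouse[row["id"]] = {"id": row["id"], "value": row["value"]}
            watermark = max(watermark, row["modified_at"])
    return watermark

source = [
    {"id": 1, "value": "a", "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "value": "b", "modified_at": datetime(2024, 1, 5)},
]
warehouse = {}
watermark = refresh(warehouse, source, datetime.min)   # initial load

# Operational changes made after the initial load.
source[0] = {"id": 1, "value": "a2", "modified_at": datetime(2024, 1, 9)}
source.append({"id": 3, "value": "c", "modified_at": datetime(2024, 1, 10)})
watermark = refresh(warehouse, source, watermark)      # incremental refresh

print(sorted(warehouse), watermark)
```

Only rows changed since the previous refresh are touched on the second call; row 2 is skipped because its timestamp is not newer than the stored watermark.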

Q. What are the functional requirements for a data warehouse?

A data warehouse must be able to support various types of information
applications.
Decision support processing is the principal type of information
application in a data warehouse, but the use of a data warehouse is
not restricted to a decision support system.
It is possible that each information application has its own set of
requirements in terms of data, the way that data is modeled, and the
way it is used.
The data warehouse is where these applications get their "consolidated
data."
A data warehouse must consolidate primitive data and it must provide
all facilities to derive information from it, as required by the end-users.
Detailed primitive data is of prime importance, but data volumes tend
to be big and users usually require information derived from the
primitive data. Data in a data warehouse must be organized such that
it can be analyzed or explored from different angles.

Analysis of the historical context (the time dimension) is of prime
importance.
Examples of other important contextual dimensions are geography,
organization, products, suppliers, customers, and so on.

Q. What are the characteristics of a data warehouse?

Data in a data warehouse is organized as subject oriented rather than
application oriented. It is designed and constructed as a non-volatile
store of business data, transactions and events. A data warehouse is a
logically integrated store of data originating from disparate operational
sources.
It is the only source for deriving information needed by the end users.
Several temporal modeling styles are usually used in different areas of
the data warehouse.

Q. What are the characteristics of the data in a data warehouse?

Data in the DWH is integrated from various, heterogeneous operational
systems (like database systems, flat files, etc.) and further external
data sources (like demographic and statistical databases, WWW, etc.).
Before the integration, structural and semantic differences have to be
reconciled, i.e., data have to be “homogenized” according to a uniform
data model. Furthermore, data values from operational systems have
to be cleaned in order to get correct data into the data warehouse.

The need to access historical data (i.e., histories of warehouse data
over a prolonged period of time) is one of the primary incentives for
adopting the data warehouse approach. Historical data are necessary
for business trend analysis which can be expressed in terms of
understanding the differences between several views of the real-time
data (e.g., profitability at the end of each month). Maintaining
historical data means that periodical snapshots of the corresponding
operational data are propagated and stored in the warehouse without
overriding previous warehouse states. However, the potential volume
of historical data and the associated storage costs must always be
considered in relation to their potential business benefits.

Furthermore, warehouse data is mostly non-volatile, i.e., access to the
DWH is typically read-oriented. Modifications of the warehouse data
take place only when modifications of the source data are propagated
into the warehouse.

Finally, a data warehouse usually contains additional data that is not
explicitly stored in the operational sources but is derived through some
process from operational data (also called derived data). For example,
operational sales data could be stored in several aggregation levels
(weekly, monthly, quarterly sales) in the warehouse.
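
As a small illustration of derived data, the sketch below rolls hypothetical daily sales amounts up to monthly and quarterly totals. The sample figures are invented, and a real warehouse would typically compute such aggregates during the load.

```python
from collections import defaultdict
from datetime import date

# Hypothetical operational sales records: (sale date, amount).
daily_sales = [
    (date(2024, 1, 15), 100.0),
    (date(2024, 2, 10), 50.0),
    (date(2024, 4, 2), 30.0),
    (date(2024, 4, 20), 20.0),
]

monthly = defaultdict(float)
quarterly = defaultdict(float)
for day, amount in daily_sales:
    monthly[(day.year, day.month)] += amount
    # Months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on.
    quarterly[(day.year, (day.month - 1) // 3 + 1)] += amount

print(dict(monthly))
print(dict(quarterly))
```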

Q. When should a company consider implementing a data warehouse?

Data warehouses, or a more focused database called a data mart,
should be considered when a significant number of potential users are
requesting access to a large amount of related historical information
for analysis and reporting purposes. So-called active or real-time data
warehouses can provide advanced decision support capabilities.

Q. What data is stored in a data warehouse?


In general, organized data about business transactions and business
operations is stored in a data warehouse. But, any data used to
manage a business or any type of data that has value to a business
should be evaluated for storage in the warehouse. Some static data
may be compiled for initial loading into the warehouse. Any data that
comes from
mainframe, client/server, or web-based systems can then be
periodically loaded into the warehouse. The idea behind a data
warehouse is to capture and maintain useful data in a central location.
Once data is organized, managers and analysts can use software tools
like OLAP to link different types of data together and potentially turn
that data into valuable information that can be used for a variety of
business decision support needs, including analysis, discovery,
reporting and planning.

Q. Database administrators (DBAs) have always said that having
non-normalized or de-normalized data is bad. Why is de-normalized data
now okay when it's used for Decision Support?

Normalization of a relational database for transaction processing
avoids processing anomalies and results in the most efficient use of
database storage. A data warehouse for Decision Support is not
intended to achieve these same goals. For Data-driven Decision
Support, the main concern is to provide information to the user as fast
as possible. Because of this, storing data in a de-normalized fashion,
including storing redundant data and pre-summarizing data, provides
the best retrieval results. Also, data warehouse data is usually static, so
anomalies will not occur from operations like adding, deleting or updating
a record or field.

Q. How often should data be loaded into a data warehouse from
transaction processing and other source systems?

It all depends on the needs of the users, how fast data changes and
the volume of information that is to be loaded into the data warehouse.
It is common to schedule daily, weekly or monthly dumps from
operational data stores during periods of low activity (for example, at
night or on weekends). The longer the gap between loads, the longer
the processing times for the load when it does run. A technical IS/IT
staffer should make some calculations and consult with potential users
to develop a schedule to load new data.

Q. What are the benefits of data warehousing?

Some of the potential benefits of putting data into a data warehouse
include:
1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one
view of the "truth";
3. Merging data from various source systems to create a more
comprehensive information source;
4. Lowering costs to create and distribute information and
reports;
5. Sharing data and allowing others to access and analyze the
data;
6. Encouraging and improving fact-based decision making.

Q. What are the limitations of data warehousing?

The major limitations associated with data warehousing are related to
user expectations, lack of data and poor data quality. Building a data
warehouse creates some unrealistic expectations that need to be
managed. A data warehouse doesn't meet all decision support needs.
If needed data is not currently collected, transaction systems need to
be altered to collect the data. If data quality is a problem, the problem
should be corrected in the source system before the data warehouse is
built. Software can provide only limited support for cleaning and
transforming data. Missing and inaccurate data cannot be "fixed"
using software. Historical data can be collected manually, coded and
"fixed", but at some point source systems need to provide quality data
that can be loaded into the data warehouse without manual clerical
intervention.

Q. How does my company get started with data warehousing?


Build one! The easiest way to get started with data warehousing is to
analyze some existing transaction processing systems and see what
type of historical trends and comparisons might be interesting to
examine to support decision making. See if there is a "real" user need
for integrating the data. If there is, then IS/IT staff can develop a data
model for a new schema and load it with some current data and start
creating a decision support data store using a database management
system (DBMS). Find some software for query and reporting and build
a decision support interface that's easy to use. Although the initial data
warehouse/data-driven DSS may seem to meet only limited needs, it is
a "first step". Start small and build more sophisticated systems based
upon experience and successes.

Q. What is the difference between OLTP database and data warehouse
database?

Q. Why should the OLTP database be different from the data warehouse
database?

• OLTP and data warehousing require two very differently
  configured systems
• Isolation of Production System from Business Intelligence System
• Significant and highly variable resource demands of the data
  warehouse
• Cost of disk space no longer a concern
• Production systems not designed for query processing

A data warehouse usually contains historical data that is derived from
transaction data, but it can include data from other sources. Having
separate databases separates the analysis workload from the transaction
workload and enables an organization to consolidate data from several
sources.

Q. What is the main difference between Data Warehousing and
Business Intelligence?

The differences are:

DW - is a way of storing data and creating information through
leveraging data marts. DM's are segments or categories of information
and/or data that are grouped together to provide 'information' into that
segment or category. DW does not require BI to work. Reporting tools
can generate reports from the DW.
BI - is the leveraging of DW to help make business decisions and
recommendations. Information and data rules engines are leveraged
here to help make these decisions along with statistical analysis tools
and data mining tools.

Q. What is data modeling?

Q. What are the different steps for data modeling?

Q. What are the data modeling tools you have used? (Polaris)

Q. What is a Physical data model?

During the physical design process, you convert the data gathered
during the logical design phase into a description of the physical
database, including tables and constraints.

Q. What is a Logical data model?

A logical design is a conceptual and abstract design. We do not deal
with the physical implementation details yet; we deal only with
defining the types of information that we need.
The process of logical design involves arranging data into a series of
logical relationships called entities and attributes.

Q. What are an Entity, Attribute and Relationship?

An entity represents a chunk of information. In relational databases, an
entity often maps to a table.

An attribute is a component of an entity and helps define the
uniqueness of the entity. In relational databases, an attribute maps to
a column.

The entities are linked together using relationships.

Q. What are the different types of Relationships?

Entity-Relationship.

Q. What is the difference between Cardinality and Nullability?

Q. What is Forward, Reverse and Re-engineering?

Q. What is meant by Normalization and De-normalization?


Q. What are the different forms of Normalization?

Q. What is an ETL or ETT? And what are the different types?

ETL is the data warehouse acquisition process of Extracting,
Transforming (or Transporting) and Loading data from source
systems into the data warehouse.
E.g. Oracle Warehouse Builder, Powermart.
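
The three ETL stages can be sketched without any ETL tool, using only the Python standard library. This is a toy example under stated assumptions: the inline CSV text stands in for a source system, and the sales table and cleaning rules are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export (hypothetical data).
SOURCE_CSV = """customer,amount
 alice ,10.50
BOB,3
,7.25
"""

def extract(text):
    """Read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Normalize names, convert amounts, and drop rows without a customer."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().lower()
        if not name:
            continue
        cleaned.append((name, float(row["amount"])))
    return cleaned

def load(conn, rows):
    """Insert the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the stages as separate functions mirrors how ETL tools let each step be scheduled and tested on its own.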

Q. Explain the Extraction process? (Polaris, Mascot)

Q. How do you extract data from different data sources explain with an
example? (Polaris)

Q. What are the reporting tools you have used? What is the difference
between them? (Polaris)

Q. How do you automate Extraction process? (Polaris)

Q. Without using an ETL tool, can you prepare and maintain a Data
Warehouse? (Polaris)

Q. How do you identify the changed records in operational data?
(Polaris)

Q. What is a Star Schema?


A star schema is a set of tables comprised of a single, central fact table
surrounded by de-normalized dimensions. Each dimension is
represented in a single table. Star schemas implement dimensional data
structures with de-normalized dimensions. A snowflake schema is an
alternative to a star schema. A star schema is a relational database
schema for representing multidimensional data. The data is stored in a
central fact
table, with one or more tables holding information on each dimension.
Dimensions have levels, and all levels are usually shown as columns in
each dimension table.
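
A minimal star schema sketch in Python with sqlite3. The table and column names (fact_sales, dim_date, dim_product) and the sample rows are assumptions chosen for illustration. Each dimension is a single de-normalized table whose levels (day, month, year) appear as columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each dimension is one de-normalized table; levels are columns.
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- The central fact table references every dimension by key.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL);

    INSERT INTO dim_date VALUES
        (1, '2024-01-15', '2024-01', 2024), (2, '2024-02-03', '2024-02', 2024);
    INSERT INTO dim_product VALUES
        (1, 'soap', 'household'), (2, 'chips', 'snacks');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 4.0), (2, 1, 6.0);
""")

sales_by_category = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

Every query needs at most one join per dimension, which is the main practical appeal of the star layout.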

Q. What is a Snowflake Schema?


A snowflake schema is a set of tables comprised of a single, central
fact table surrounded by normalized dimension hierarchies. Each
dimension level is represented in a table. Snowflake schema
implements dimensional data structures with fully normalized
dimensions. Star schema is an alternative to snowflake schema.

An example would be to break down the Time dimension and create
tables for each level: years, quarters, months, weeks, days… These
additional branches on the ERD create more of a snowflake shape than a
star.
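
The Time-dimension breakdown can be sketched like this, again with Python's sqlite3 module. Only three levels (month, quarter, year) are modeled to keep it short, and all table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per level of the Time hierarchy, each pointing to its parent.
    CREATE TABLE dim_year (year_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_quarter (
        quarter_key INTEGER PRIMARY KEY, quarter TEXT,
        year_key INTEGER REFERENCES dim_year(year_key));
    CREATE TABLE dim_month (
        month_key INTEGER PRIMARY KEY, month TEXT,
        quarter_key INTEGER REFERENCES dim_quarter(quarter_key));
    CREATE TABLE fact_sales (
        month_key INTEGER REFERENCES dim_month(month_key), amount REAL);

    INSERT INTO dim_year VALUES (1, 2024);
    INSERT INTO dim_quarter VALUES (1, 'Q1', 1), (2, 'Q2', 1);
    INSERT INTO dim_month VALUES (1, 'Jan', 1), (2, 'Feb', 1), (3, 'Apr', 2);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 5.0), (3, 2.0);
""")

# Rolling up to quarter level walks the chain month -> quarter -> year.
sales_by_quarter = conn.execute("""
    SELECT y.year, q.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_month AS m ON m.month_key = f.month_key
    JOIN dim_quarter AS q ON q.quarter_key = m.quarter_key
    JOIN dim_year AS y ON y.year_key = q.year_key
    GROUP BY y.year, q.quarter
    ORDER BY q.quarter
""").fetchall()
print(sales_by_quarter)
```

Compared with a star schema, the same roll-up needs more joins, which is the usual trade-off for the reduced redundancy of normalized dimensions.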

Q. What is Very Large Database?

Q. What are SMP and MPP?

SMP stands for symmetric multiprocessing; MPP stands for massively
parallel processing.

Q. What is data mining?

Data Mining is the process of automated extraction of predictive
information from large databases. It predicts future trends and finds
behaviour that the experts may miss as it lies beyond their
expectations. Data Mining is part of a larger process called knowledge
discovery; specifically, the step in which advanced statistical analysis
and modeling techniques are applied to the data to find useful patterns
and relationships.

Data mining can be defined as "a decision support process in which we
search for patterns of information in data." This search may be done
just by the user, i.e. just by performing queries, in which case it is quite
hard and in most of the cases not comprehensive enough to reveal
intricate patterns. Data mining uses sophisticated statistical analysis
and modeling techniques to uncover such patterns and relationships
hidden in organizational databases – patterns that ordinary methods
might miss. Once found, the information needs to be presented in a
suitable form, with graphs, reports, etc.

Q. What is an OLAP? (Mascot)

OLAP is software for manipulating multidimensional data from a variety
of sources. The data is often stored in a data warehouse. OLAP software
helps a user create queries, views, representations and reports. OLAP
tools can provide a "front-end" for a data-driven DSS.

On-Line Analytical Processing (OLAP) is a category of software
technology that enables analysts, managers and executives to
gain insight into data through fast, consistent, interactive
access to a wide variety of possible views of information that
has been transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the user.

OLAP functionality is characterized by dynamic multi-dimensional
analysis of consolidated enterprise data supporting end user
analytical and navigational activities.

Q. What are the Different types of OLAP's? What are their differences?
(Mascot)

OLAP - Desktop OLAP(Cognos), ROLAP, MOLAP(Oracle Discoverer)

ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical
Processing) applications.

ROLAP stands for Relational OLAP. Users see their data organized in
cubes with dimensions, but the data is really stored in a Relational
Database (RDBMS) like Oracle. Because the RDBMS stores data at a fine
grain level, response times are usually slow.

MOLAP stands for Multidimensional OLAP. Users see their data
organized in cubes with dimensions, but the data is stored in a
multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP
system many queries have a finite answer, performance is usually
critical, and response times are fast.

HOLAP stands for Hybrid OLAP; it is a combination of both worlds.
Seagate Software's “Holos” is an example HOLAP environment. In a
HOLAP system one will find queries on aggregated data as well as on
detailed data.

DOLAP stands for Desktop OLAP.

Q. What is the difference between data warehousing and OLAP?

The terms data warehousing and OLAP are often used interchangeably.
As the definitions suggest, warehousing refers to the organization and
storage of data from a variety of sources so that it can be analyzed
and retrieved easily. OLAP deals with the software and the process of
analyzing data, managing aggregations, and partitioning information
into cubes for in-depth analysis, retrieval and visualization. Some
vendors are replacing the term OLAP with the terms analytical
software and business intelligence.
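
To make the cube operations concrete, here is a small pure-Python sketch of slicing and rolling up a cube. The cube contents, the dimension names and the helper functions slice_cube and roll_up are all hypothetical; real OLAP servers implement these operations far more efficiently.

```python
from collections import defaultdict

# Cells of a small cube keyed by (product, region, month).
cube = {
    ("soap", "north", "Jan"): 10,
    ("soap", "south", "Jan"): 4,
    ("chips", "north", "Jan"): 7,
    ("chips", "north", "Feb"): 3,
}
DIMS = ("product", "region", "month")

def slice_cube(cells, dim, value):
    """Keep only the cells where one dimension equals a fixed value."""
    i = DIMS.index(dim)
    return {key: v for key, v in cells.items() if key[i] == value}

def roll_up(cells, keep):
    """Aggregate away every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, v in cells.items():
        totals[tuple(key[i] for i in idx)] += v
    return dict(totals)

north = slice_cube(cube, "region", "north")
by_product = roll_up(north, ["product"])
print(by_product)
```

Slicing fixes one dimension (region = north) and the roll-up then sums over month, leaving totals per product.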

Q. What are the facilities provided by data warehouse to analytical
users?

Q. What are the facilities provided by OLAP to analytical users?

Q. What is a Histogram? How to generate statistics?


Q. In Erwin what are the different types of models (Honeywell)

Q. Many suppliers, many products: model this scenario in Erwin.
How many tables are there and what do they contain? (Honeywell)

Q. What are the options available in Erwin Tool box (Honeywell)

Q. Aggregate navigation

Q. What are the Data Warehouse Center administration functions?

The functions of Visual Warehouse administration are:

Creating Data Warehouse Center security groups.
Defining Data Warehouse Center privileges for that group.
Registering Data Warehouse Center users.
Adding Data Warehouse Center users to security groups.
Registering data sources.
Registering warehouses (targets).
Creating subjects.
Registering agents.
Registering Data Warehouse Center programs.

Q. How do I set the log level higher for more detailed information
within Data Warehouse Center 7.2?

Within DWC, log level capability can be set from 0 to 4. There is a log
level 5, yet it cannot be turned on using the GUI, but must be turned
on manually. A command line trace can be used for any trace level,
and this is the only way to turn on a level 5 trace:

Go to start, programs, IBM DB2, command line processor.

Connect to the control database:


db2 => connect to Control_Database_name

Update the configuration table:


db2 => update iwh.configuration set value_int = 5 where name =
'TRACELVL' and (component = '<component name>')

Valid components are:

Logger trace = log
Agent trace = agent
Server trace = RTK
DDD = DDD
ODBC = VWOdbc

For multiple traces the format is:


db2 => update iwh.configuration set value_int = 5 where name =
'TRACELVL' and (component = '<component name>' or component =
'<component name>')

Reset the connection:


db2 => connect reset

Stop and restart the Warehouse server and logger.

Perform the failing operation.

Be sure to reset the trace level to 0 using the command line when you
are done:
db2 => update iwh.configuration set value_int = 0 where name =
'TRACELVL'
and (component = '<component name>')
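
If the statements above have to be typed often, a small helper can assemble them. The sketch below only builds the text of the UPDATE statement shown above for one or more components; the function name trace_update_sql is made up for this example, and the result still has to be pasted into the DB2 command line processor by hand.

```python
def trace_update_sql(level, components):
    """Build the iwh.configuration UPDATE shown above for the given components."""
    if not 0 <= level <= 5:
        raise ValueError("trace level must be between 0 and 5")
    if not components:
        raise ValueError("at least one component is required")
    predicate = " or ".join(f"component = '{name}'" for name in components)
    return (
        f"update iwh.configuration set value_int = {level} "
        f"where name = 'TRACELVL' and ({predicate})"
    )

# Turn on a level 5 trace for the logger and the agent, then reset it.
print(trace_update_sql(5, ["log", "agent"]))
print(trace_update_sql(0, ["log", "agent"]))
```

Generating the enable and reset statements from the same component list makes it harder to forget to reset one of the traces.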

When you run a trace, the Data Warehouse Center writes information
to text files. Data Warehouse Center programs that are called from
steps also write any trace information to this directory. These files are
located in the directory specified by the VWS_LOGGING environment
variable.

The default value of VWS_LOGGING is:

Windows and OS/2 = x:\sqllib\logging
UNIX = /var/IWH
AS/400 = /QIBM/UserData/IWH

For additional information, see basic logging function in the Data
Warehouse Center administration guide.

Q. What types of data sources does Data Warehouse Center support?

The Data Warehouse Center supports a wide variety of relational and
non-relational data sources. You can populate your Data Warehouse
Center warehouse with data from the following databases and files:
Any DB2 family database
Oracle
Sybase
Informix
Microsoft SQL Server
IBM DataJoiner
Multiple Virtual Storage (OS/390), Virtual Machine (VM), and local area
network (LAN) files
IMS and Virtual Storage Access Method (VSAM) (with DataJoiner Classic
Connect)

Q. What is the Data Warehouse Center control database?

When you install the warehouse server, the warehouse control
database that you specify during installation is initialized. Initialization
is the process in which the Data Warehouse Center creates the control
tables that are required to store Data Warehouse Center metadata. If
you have more than one warehouse control database, you can use the
Data Warehouse Center --> Control Database Management window to
initialize the second
warehouse control database. However, only one warehouse control
database can be active at a time.

Q. What databases need to be registered as system ODBC data
sources for the Data Warehouse Center?

The Data Warehouse Center databases that need to be registered as
system ODBC data sources are:
source databases
target databases
control databases

1. What was the original business problem that led you to do this
project?

Whether the consultant is being hired to gather requirements or
to customize an OLAP application, this question indicates that
she’s interested in the big picture. She’ll keep the answer in
mind as she does her work, which is a measure of quality
assurance.

2. Where are you in your current implementation process?

A consultant who asks this question knows not to make any
assumptions about how much progress you’ve made. She
probably also understands that you might be wrong. There are
plenty of clients who have begun application development
without having gathered requirements. Understanding where the
client thinks he is is just as important as understanding where he
wants to be. It also helps the consultant in making improvement
suggestions or recommendations for additional skills or
technologies.

3. How long do you see this position being filled by an external
resource?

While the question might seem self-serving at first, a good
consultant is ever mindful of his responsibility to render himself
dispensable over time. Your answer will give him a good idea of
how much time he has to perform the work as well as to cross
train permanent staff within your organization. A variation on this
question is: "Is there a dedicated person or group targeted for
knowledge transfer in this area?"

4. What deliverables do you expect from this engagement?

The consultant who doesn’t ask about deliverables is the
consultant who expects to sit around giving advice. Beware of
the "ivory tower" consultants, who are too light for heavy work
and too heavy for light work. Every consultant you talk to should
expect to produce some sort of deliverable, be it a requirements
document, a data model, HTML, a project plan, test procedures
or a mission statement.

5. Would you like to talk to a past client or two?

The fact that a consultant would offer references is testimony
that she knows her stuff. Many do not. Those consultants who
hide behind nondisclosures for not giving references should be
avoided. While it’s often valid to deny prospective clients work
samples because of confidentiality agreements, there’s no good
reason not to offer the name and phone number of someone who
will sing the consultant’s praises. Don’t be satisfied with a
reference for the entire firm. Many good firms can employ below-
average consultants. Ask to talk to someone who’s worked with
the person or team you’re considering. Once you’ve hired that
consultant and are happy with his work, offer to be a reference.
It comes around.