
A Definition of Data Warehousing

Market Overview:

The data warehousing market consists of tools, technologies, and methodologies that allow for the construction, usage, management, and maintenance of the hardware and software used for a data warehouse, as well as the actual data itself. Surveys indicate data warehousing will be the single largest IT initiative after completion of Y2K efforts. Data warehousing is currently a $28 billion market (source: Data Warehousing Institute), and we estimate 20% growth per annum through at least 2002.

Two of the pioneers in the field were Ralph Kimball and Bill Inmon.
Biographies of these two individuals have been provided, since many of
the terms discussed in this paper were coined and concepts defined by
them.

Biographical Information

Bill Inmon
Bill Inmon is universally recognized as the "father of the data warehouse."
He has over 26 years of database technology management experience and
data warehouse design expertise, and has published 36 books and more
than 350 articles in major computer journals. His books have been
translated into nine languages. He is known globally for his seminars on
developing data warehouses and has been a keynote speaker for every
major computing association. Before founding Pine Cone Systems, Bill was
a co-founder of Prism Solutions, Inc.

Ralph Kimball
Ralph Kimball was co-inventor of the Xerox Star workstation, the first
commercial product to use mice, icons, and windows. He was vice
president of applications at Metaphor Computer Systems, and founder and
CEO of Red Brick Systems. He has a Ph.D. from Stanford in electrical
engineering, specializing in man-machine systems. Ralph is a leading
proponent of the dimensional approach to designing large data
warehouses. He currently teaches data warehousing design skills to IT
groups, and helps selected clients with specific data warehouse designs.
Ralph is a columnist for Intelligent Enterprise magazine and has a
relationship with Sagent Technology, Inc., a data warehouse tool vendor.
His book "The Data Warehouse Toolkit" is widely recognized as the seminal
work on the subject.

In order to clear up some of the confusion that is rampant in the market, here are some definitions:

Data Warehouse:
The term "data warehouse" was coined by Bill Inmon in 1990, and he defined it as follows: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
He defined the terms in the sentence as follows:

• Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.

• Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

• Time-variant: All data in the data warehouse is identified with a particular time period.

• Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
(Source: "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1,
Number 1, 1995). This definition remains reasonably accurate almost
ten years later. However, a single-subject data warehouse is typically
referred to as a data mart, while data warehouses are generally
enterprise in scope. Also, data warehouses can be volatile. Due to the large amount of storage required for a data warehouse (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse. For instance, if three years of data are retained, then every month the oldest month will be "rolled off" the database and the newest month added.

Ralph Kimball provided a much simpler definition of a data warehouse. As stated in his book, "The Data Warehouse Toolkit" (page 310), a data warehouse is "a copy of transaction data specifically structured for query and analysis". This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.

DATA WAREHOUSING

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse and can involve a number of discrete technologies such as:

• Source System Identification: In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system, where the "day-to-day" information about the business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these legacy systems are not relational databases, so much effort is required to extract the appropriate data.

• Data Warehouse Design and Creation: This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema to be created and a great deal of interaction with the user community. The design is often iterative, and the model must be modified a number of times before it stabilizes. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, it cannot easily be changed.

• Data Acquisition: This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in a data warehousing project and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition then becomes an ongoing, scheduled process, executed to keep the warehouse current to a predetermined period in time (e.g. the warehouse is refreshed monthly).

• Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Technologies used in this area include replication servers, publish/subscribe, triggers and stored procedures, and database log analysis; a simple snapshot-comparison sketch appears after this list.

• Data Cleansing: This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high-level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must decide which form is correct, and the data cleansing tool will then change the others to match that rule (a minimal version of such a standardization rule is sketched after this list). This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files").

• Data Aggregation: This process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information which may be needed to answer a query is lost during aggregation. The tradeoff must be carefully weighed, because the decision cannot be undone without rebuilding and repopulating the warehouse. The safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme.
Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of "Business Intelligence".

Business Intelligence (BI):

A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields:
• Multi-dimensional Analysis Tools: Tools that allow the user to look
at the data from a number of different "angles". These tools often use a
multi-dimensional database referred to as a "cube".

• Query tools: Tools that allow the user to issue SQL (Structured Query
Language) queries against the warehouse and get a result set back.

• Data Mining Tools: Tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask, while data mining answers questions you didn't necessarily know to ask.

• Data Visualization Tools: Tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are making progress in this area using the Virtual Reality Modeling Language (VRML).
Metadata Management:
Throughout the entire process of identifying, acquiring, and querying the
data, metadata management takes place. Metadata is defined as "data
about data". An example is a column in a table. The datatype (for
instance a string or integer) of the column is one piece of metadata. The
name of the column is another. The actual value in the column for a
particular row is not metadata - it is data. Metadata is stored in a
Metadata Repository and provides extremely useful information to all of
the tools mentioned previously. Metadata management has developed
into an exacting science that can provide huge returns to an
organization. It can assist companies in analyzing the impact of changes
to database tables, tracking owners of individual data elements ("data
stewards"), and much more. It is also required to build the warehouse,
since the ETL tool needs to know the metadata attributes of the sources
and targets in order to "map" the data properly. The BI tools need the
metadata for similar reasons.
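As a concrete illustration of the column-level metadata described above (datatype, column name, description, and data steward), the following Python sketch models a couple of hypothetical repository entries and a trivial impact-analysis query. It is not the schema of any particular metadata repository product.

    # Hypothetical column-level metadata of the kind a repository stores:
    # the column name and datatype are metadata; the values in the rows are not.
    from dataclasses import dataclass

    @dataclass
    class ColumnMetadata:
        table: str
        column: str
        datatype: str        # e.g. "string" or "integer"
        description: str
        data_steward: str    # owner of the data element

    repository = [
        ColumnMetadata("customer", "customer_name", "string",
                       "Standardized legal name of the customer", "Jane Doe"),
        ColumnMetadata("sales_fact", "sales_amount", "decimal",
                       "Extended sale amount in USD", "John Smith"),
    ]

    # Impact analysis: which elements would a change to the "customer" table touch?
    affected = [m.column for m in repository if m.table == "customer"]
    print(affected)   # ['customer_name']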

Summary:
Data Warehousing is a complex field, with many vendors vying for market awareness. The complexity of the technology, the interactions between the various tools, and the high price points of the products require companies to perform a careful technology evaluation before embarking on a warehousing project. However, the potential for enormous returns on investment and competitive advantage makes data warehousing difficult to ignore.

Data Warehousing - For Better Business Decisions

Introduction

In today’s competitive global business environment, understanding and managing enterprise-wide information is crucial for making timely
decisions and responding to changing business conditions. Many
companies are realizing a business advantage by leveraging one of their
key assets - business data. There is a tremendous amount of data
generated by day-to-day business operational applications. In addition
there is valuable data available from external sources such as market
research organizations, independent surveys and quality testing labs.
Studies indicate that the amount of data in a given organization doubles
every five years. Data Warehousing has emerged as an increasingly
popular and powerful concept of applying information technology to turn
this huge island of data into meaningful information for better business
decisions. Meta Group, Inc., a leading consultant in the data
warehousing environment, suggests that over 90% of the Fortune 2000
businesses will put into place a data warehouse by the end of 1996.

What is Data Warehousing?


According to Bill Inmon, known as the father of Data Warehousing, a data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.

• Subject-oriented means that all relevant data about a subject is gathered and stored
as a single set in a useful format;
• Integrated refers to data being stored in a globally accepted fashion with consistent
naming conventions, measurements, encoding structures, and physical attributes, even
when the underlying operational systems store the data differently;
• Non-volatile means the data warehouse is read-only: data is loaded into the data
warehouse and accessed there;
• Time-variant data represents long-term data--from five to ten years, as opposed to the 30- to 60-day time periods of operational data.

Data warehousing is a concept: a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating in order to make better business decisions. Data warehousing is not just the data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information.

Data warehousing concepts


Operational / informational data:
Operational data is the data you use to run your business. This data is
what is typically stored, retrieved, and updated by your Online
Transactional Processing (OLTP) system. An OLTP system may be, for
example, a reservations system, an accounting application, or an order
entry application.
Informational data is created from the wealth of operational data that
exists in your business and some external data useful to analyze your
business. Informational data is what makes up a data warehouse.
Informational data is typically:

• Summarized operational data
• De-normalized and replicated data
• Infrequently updated from the operational systems
• Optimized for decision support applications
• Possibly "read only" (no updates allowed)
• Stored on separate systems to lessen impact on operational systems

OLAP / Multi-dimensional analysis:

Relational databases store data in a two-dimensional format: tables of data represented by rows and columns. Multi-dimensional analysis solutions, commonly referred to as On-Line Analytical Processing (OLAP) solutions, offer an extension to the relational model to provide a multi-dimensional view of the data. For example, in multi-dimensional analysis, data entities such as products, geographies, time periods, store locations, promotions and sales channels may all represent different dimensions. Multi-dimensional solutions provide the ability to:

• Analyze potentially large amounts of data with very fast response times
• "Slice and Dice" through the data, and drill down or roll up through various
dimensions as defined by the data structure
• Quickly identify trends or problem areas that would otherwise be missed

Multi-dimensional data structures can be implemented with multidimensional databases or extended RDBMSs. Relational databases can support this structure through specific database designs (schemas), such as the "star schema", intended for multi-dimensional analysis, and through highly indexed or summarized designs. These structures are sometimes referred to as relational OLAP (ROLAP)-based structures.
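To make the star schema idea concrete, the sketch below builds a tiny fact table joined to two dimension tables in an in-memory SQLite database and runs a ROLAP-style aggregate query against it. The table and column names are illustrative only, not a prescribed design.

    # Minimal star-schema sketch: one fact table keyed to two dimension tables.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        sale_date  TEXT,
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_store   VALUES (10, 'Boston'), (11, 'Chicago');
    INSERT INTO sales_fact  VALUES (1, 10, '1999-01-05', 19.99),
                                   (2, 10, '1999-01-06', 24.50),
                                   (1, 11, '1999-01-06', 19.99);
    """)

    # A ROLAP-style "slice": total sales by city and product.
    for row in conn.execute("""
        SELECT s.city, p.product_name, SUM(f.amount)
        FROM sales_fact f
        JOIN dim_store s   ON f.store_id = s.store_id
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY s.city, p.product_name
    """):
        print(row)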

Data Marts:
Data marts are workgroup or departmental warehouses, which are small in size, typically 10-50 GB. The data mart contains informational data that is departmentalized, tailored to the needs of a specific departmental workgroup. Data marts are less expensive and take less time to implement, with quick ROI. They are scalable to full data warehouses and at times are summarized subsets of more detailed, pre-existing data warehouses.
Metadata/Information Catalogue:
Metadata describes the data that is contained in the data warehouse (e.g.
Data elements and business-oriented description) as well as the source
of that data and the transformations or derivations that may have been
performed to create the data element.
Data Mining:
Data mining predicts future trends and behaviors, allowing businesses to
make proactive, knowledge driven decisions. Data mining is the process
of analyzing business data in the data warehouse to find unknown
patterns or rules of information that you can use to tailor business
operations. For instance, data mining can find patterns in your data to
answer questions like:

• What item purchased in a given transaction triggers the purchase of additional related
items?
• How do purchasing patterns change with store location?
• What items tend to be purchased using credit cards, cash, or check?
• How would the typical customer likely to purchase these items be described?
• Did the same customer purchase related items at another time?
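A toy sketch of the first question in the list above (which items trigger the purchase of related items): it simply counts how often pairs of items appear together in a handful of hypothetical transactions. Commercial data mining tools use much richer statistics (association rules with support and confidence measures), but the underlying idea is the same.

    # Count co-occurring item pairs across hypothetical market-basket transactions.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"bread", "milk", "butter"},
        {"bread", "butter"},
        {"milk", "diapers", "beer"},
        {"bread", "milk"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    print(pair_counts.most_common(3))   # e.g. [(('bread', 'butter'), 2), (('bread', 'milk'), 2), ...]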

Data Warehouse Implementation

Fig 1: Data Warehousing Architecture Model


The following components should be considered for a successful
implementation of a Data Warehousing solution:

• Open Data Warehousing architecture with common interfaces for product integration
• Data Modeling with ability to model star-schema and multi-dimensionality
• Extraction and Transformation/propagation tools to load the data warehouse
• Data warehouse database server
• Analysis/end-user tools: OLAP/multidimensional analysis, Report and query
• Tools to manage information about the warehouse (Metadata)
• Tools to manage the Data Warehouse environment

Transforming operational data into informational data:


Creating the informational data, that is, the data warehouse, from the
operational systems is a key part of the overall data warehousing
solution. Building the informational database is done with the use of
transformation or propagation tools. These tools not only move the data
from multiple operational systems, but also often manipulate the data
into a more appropriate format for the warehouse. This could mean:

• The creation of new fields that are derived from existing operational data
• Summarizing data to the most appropriate level needed for analysis
• Denormalizing the data for performance purposes
• Cleansing of the data to ensure that integrity is preserved.
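As a small illustration of the first two manipulations in the list above (deriving a new field and summarizing to the level needed for analysis), here is a minimal Python sketch; the record layout and field names are assumptions, not those of any particular tool.

    # Sketch of two transformations named above: deriving a new field and
    # summarizing detail rows to the level needed for analysis.
    from collections import defaultdict

    operational_rows = [
        {"order_id": 1, "region": "East", "qty": 3, "unit_price": 10.0},
        {"order_id": 2, "region": "East", "qty": 1, "unit_price": 25.0},
        {"order_id": 3, "region": "West", "qty": 2, "unit_price": 10.0},
    ]

    # Derived field: the extended amount does not exist in the source system.
    for row in operational_rows:
        row["extended_amount"] = row["qty"] * row["unit_price"]

    # Summarization: roll the detail up to one informational row per region.
    summary = defaultdict(float)
    for row in operational_rows:
        summary[row["region"]] += row["extended_amount"]

    print(dict(summary))   # {'East': 55.0, 'West': 20.0}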

Even with the use of automated tools, however, the time and costs required for data conversion are often significant. Bill Inmon has estimated that 80% of the time required to build a data warehouse is typically consumed by the conversion process.

Data warehouse database servers--the heart of the warehouse:


Once ready, data is loaded into a relational database management system (RDBMS), which acts as the data warehouse. Requirements for data warehouse database servers include performance, capacity, scalability, open interfaces, support for multiple data structures, an optimizer that supports star schemas, and bitmapped indexing. Popular data stores for data warehousing are relational databases such as Oracle, DB2 and Informix, or specialized data warehouse databases such as Red Brick and SAS.
To provide the level of performance needed for a data warehouse, an
RDBMS should provide capabilities for parallel processing - Symmetric
Multiprocessor (SMP) or Massively Parallel Processor (MPP) machines,
near-linear scalability, data partitioning, and system administration.
Data Warehousing Solutions - what is hot?

Solution Area: Product (Vendor)

Report and Query: Impromptu (Cognos); Brio Query (Brio Technology); Business Objects (Business Objects Inc.); Crystal Reports (Seagate Software)

OLAP / MD analysis: DSS Agent/Server (MicroStrategy); Decision Suite (Information Advantage); EssBase (Hyperion Solutions); Express Server (Oracle Corp.); PowerPlay (Cognos Corporation); Brio Enterprise (Brio Technology); Business Objects (Business Objects)

Data mining: Discovery Server (Pilot Software); Intelligent Miner (IBM); Darwin (Thinking Machines)

Data Modeling: ER/Win (Platinum)

Data extraction, transformation, load: Data Propagator (IBM); Info Pump (Platinum Technology); Integrity Data Re-Eng. (Vality Technology); Warehouse Manager (Prism Solutions); Power Mart (Informatica)

Databases for data warehousing: DB2 (IBM); Oracle Server (Oracle); MS SQL Server (Microsoft); Redbrick Warehouse (Red Brick Corp.); SAS System (SAS Institute); Teradata DBS (NCR)

Information catalogue: Data Guide (IBM); HP Intelligent Warehouse: Guide (Hewlett-Packard); Directory Manager (Prism Solutions)

Benefits of Data Warehousing


A well-designed and implemented data warehouse can be used to:

• Understand business trends and make better forecasting decisions
• Bring better products to market in a more timely manner
• Analyze daily sales information and make quick decisions that can significantly affect your company's performance

Data warehousing can be a key differentiator in many different industries. At present, some of the most popular data warehouse applications include:

• Sales and marketing analysis across all industries
• Inventory turn and product tracking in manufacturing
• Category management, vendor analysis, and marketing program effectiveness
analysis in retail
• Profitable lane or driver risk analysis in transportation
• Profitability analysis or risk assessment in banking
• Claims analysis or fraud detection in insurance

Conclusion
Data Warehousing provides the means to change raw data into
information for making effective business decisions--the emphasis on
information, not data. The data warehouse is the hub for decision
support data. A good data warehouse will... provide the RIGHT data... to
the RIGHT people... at the RIGHT time: RIGHT NOW! While the data warehouse organizes data for business analysis, the Internet has emerged as the standard for information sharing, so the future of data warehousing lies in its accessibility from the Internet. Successful implementation of a data warehouse requires a high-performance, scalable combination of hardware and software that can integrate easily with existing systems, so customers can use data warehouses to improve their decision-making--and their competitive advantage.

By - Anjaneyulu Marempudi

BY: WWW.PARETOANALYSTS.COM

What is a data warehouse?


A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis.
Table design, dimensions and organization should be consistent
throughout a data warehouse so that reports or queries across the data
warehouse are consistent. A data warehouse can also be viewed as a
database for historical data from different functions within a company.

What is a data mart?


A data mart is a segment of a data warehouse that can provide data for
reporting and analysis on a section, unit, department or operation in the
company, e.g. sales, payroll, production. Data marts are sometimes
complete individual data warehouses which are usually smaller than the
corporate data warehouse.
What are the benefits of data warehousing?

• Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
• The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases, which are primarily designed to handle lots of transactions.
• Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data can be compared against inventory data even if they were originally stored in different databases with different structures.
• Queries that would be complex in highly normalized databases can be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.
• Data warehousing is an efficient way to manage and report on data that comes from a variety of sources and is non-uniform and scattered throughout a company.
• Data warehousing is an efficient way to manage demand for lots of information from lots of users.
• Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.
What is OLAP?
OLAP stands for Online Analytical Processing.
It uses database tables (fact and dimension tables) to enable
multidimensional viewing, analysis and querying of large amounts of
data. E.g. OLAP technology could provide management with fast
answers to complex queries on their operational data or enable them to
analyze their company's historical data for trends and patterns.
What is OLTP?
OLTP stands for Online Transaction Processing.
OLTP uses normalized tables to quickly record large amounts of transactions while making sure that these updates of data occur in as few places as possible. Consequently, OLTP databases are designed for recording the daily operations and transactions of a business. E.g. a timecard system that supports a large production environment must successfully record a large number of updates during critical periods like lunch hour, breaks, startup and close of work.
What are dimensions?
Dimensions are categories by which summarized data can be viewed.
E.g. a profit summary in a fact table can be viewed by a Time dimension
(profit by month, quarter, year), Region dimension (profit by country,
state, city), Product dimension (profit for product1, product2).
What are fact tables?
A fact table is a table that contains summarized numerical and historical
data (facts) and a multipart index composed of foreign keys from the
primary keys of related dimension tables.
What are measures?
Measures are numeric data based on columns in a fact table. They are
the primary data which end users are interested in. E.g. a sales fact
table may contain a profit measure which represents profit on each sale.
What are aggregations?
Aggregations are precalculated numeric data. By calculating and storing
the answers to a query before users ask for it, the query processing time
can be reduced. This is key in providing fast query performance in OLAP.
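A minimal sketch of the idea: the monthly totals below are calculated once, ahead of time, so that a later query is answered by a lookup against the stored aggregate rather than a scan of the detail rows. The sample rows are hypothetical.

    # Precalculate monthly totals at load time; answer queries from the stored result.
    detail_rows = [
        ("1999-01", 120.0), ("1999-01", 80.0),
        ("1999-02", 200.0), ("1999-02", 50.0),
    ]

    monthly_totals = {}
    for month, amount in detail_rows:
        monthly_totals[month] = monthly_totals.get(month, 0.0) + amount

    # Query time: a simple lookup instead of a scan of every detail row.
    print(monthly_totals["1999-02"])   # 250.0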
What are cubes?
Cubes are data processing units composed of fact tables and dimensions
from the data warehouse. They provide multidimensional views of data,
querying and analytical capabilities to clients.
What is the PivotTable® Service?
This is the primary component that connects clients to the Microsoft® SQL Server™ 2000 Analysis Server. It also provides the capability for clients to create local offline cubes, using it as an OLAP server. PivotTable® Service does not have a user interface; clients using its services have to provide their own user interface.

What are offline OLAP cubes?


These are OLAP cubes created by clients, end users or third-party
applications accessing a data warehouse, relational database or OLAP
cube through the Microsoft® PivotTable® Service. E.g. Microsoft®
Excel™ is very popular as a client for creating offline local OLAP cubes
from relational databases for multidimensional analysis. These cubes
have to be maintained and managed by the end users who have to
manually refresh their data.

What are virtual cubes?


These are combinations of one or more real cubes and require no disk
space to store them. They store only the definitions and not the data of
the referenced source cubes. They are similar to views in relational
databases.
What are MOLAP cubes?
MOLAP Cubes: stands for Multidimensional OLAP. In MOLAP cubes the
data aggregations and a copy of the fact data are stored in a
multidimensional structure on the Analysis Server computer. It is best
when extra storage space is available on the Analysis Server computer
and the best query performance is desired. MOLAP local cubes contain
all the necessary data for calculating aggregates and can be used
offline. MOLAP cubes provide the fastest query response time and
performance but require additional storage space for the extra copy of
data from the fact table.
What are ROLAP cubes?
ROLAP Cubes: stands for Relational OLAP. In ROLAP cubes a copy of
data from the fact table is not made and the data aggregates are stored
in tables in the source relational database. A ROLAP cube is best when
there is limited space on the Analysis Server and query performance is
not very important. ROLAP local cubes contain the dimensions and cube
definitions but aggregates are calculated when they are needed. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.
What are HOLAP cubes?
HOLAP Cubes: stands for Hybrid OLAP. A HOLAP cube has a combination of the ROLAP and MOLAP cube characteristics. It does not
create a copy of the source data however, data aggregations are stored
in a multidimensional structure on the Analysis Server computer. HOLAP
cubes are best when storage space is limited but faster query responses
are needed.

What is the approximate size of a data warehouse?


You can estimate the approximate size of a data warehouse made up of
only fact and dimension tables by estimating the approximate size of the
fact tables and ignoring the sizes of the dimension tables.

To estimate the size of the fact table in bytes, multiply the size of a row by the number of rows in the fact table. (A more exact estimate would include the data types, indexes, page sizes, etc.) An estimate of the number of rows in the fact table is obtained by multiplying the number of transactions per hour by the number of hours in a typical work day, then multiplying the result by the number of days in a year, and finally multiplying this result by the number of years of transactions involved. Divide the resulting size in bytes by 1024 to convert to kilobytes, and by 1024 again to convert to megabytes.
E.g. A data warehouse will store facts about the help provided by a
company’s product support representatives. The fact table is made of up
of a composite key of 7 indexes (int data type) including the primary
key. The fact table also contains 1 measure of time (datetime data type)
and another measure of duration (int data type). 2000 product incidents
are recorded each hour in a relational database. A typical work day is 8
hours and support is provided for every day in the year. What will be the approximate size of this data warehouse in 5 years?

First calculate the approximate size of a row in bytes (int data type = 4
bytes, datetime data type = 8 bytes):

size of a row = size of all composite indexes (add the size of all indexes)
+ size of all measures (add the size of all measures).

Size of a row (bytes) = (4 * 7) + (8 + 4)
Size of a row (bytes) = 40 bytes

Number of rows in fact table = (number of transactions per hour) * (8 hours) * (365 days in a year)
Number of rows in fact table = (2000 product incidents per hour) * (8 hours) * (365 days in a year)
Number of rows in fact table = 2000 * 8 * 365
Number of rows in fact table = 5840000

Size of fact table (1 year) = (Number of rows in fact table) * (Size of a row)
Size of fact table (bytes per year) = 5840000 * 40
Size of fact table (bytes per year) = 233600000
Size of fact table (megabytes per year) = 233600000 / (1024 * 1024) = 222.78 MB

Size of fact table (in megabytes for 5 years) = (233600000 * 5) / (1024 * 1024)
Size of fact table (megabytes) = 1113.89 MB
Size of fact table (gigabytes) = 1113.89 / 1024
Size of fact table (gigabytes) = 1.09 GB
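The same estimate can be reproduced with a few lines of Python, following the figures used in the worked example above.

    # The fact-table size estimate worked above, reproduced as a short script.
    BYTES_INT, BYTES_DATETIME = 4, 8

    row_bytes = 7 * BYTES_INT + (BYTES_DATETIME + BYTES_INT)   # 7 key columns + 2 measures = 40 bytes
    rows_per_year = 2000 * 8 * 365                             # incidents/hour * hours/day * days/year
    bytes_per_year = rows_per_year * row_bytes

    five_year_mb = bytes_per_year * 5 / (1024 * 1024)
    print(round(five_year_mb, 2))            # ~1113.89 MB
    print(round(five_year_mb / 1024, 2))     # ~1.09 GB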

TERM DEFINITION

Access Control: Refers to mechanisms and policies that restrict access to computer resources.

Ad-Hoc Reporting: Unpredictable, unplanned access and manipulation of data.

Archive Services: Provide long-term off-line storage of data which must be retained for historic purposes. The services allow users to archive and retrieve data as needed to support the business processes. Automated processes may also archive data which has not been accessed for a specified period of time.

Atomic Database: A database of change records that, when applied in temporal order, will reconstruct in a target database an identical copy of a source database at a point in time.

Attribute: Used in Logical Data Modeling, an Attribute is any detail that serves to identify, describe, classify, quantify or provide the state of an entity. For example, the entity Employee may have the following attributes: Last Name, First Name, and Hire Date. Attributes are the general equivalent of physical columns in a table.

Audit Trail: A record showing who has accessed a computer system and what operations he or she has performed during a given period of time. Data that is available to trace system activity, usually update activity.

Best Practices Reports: Canned routines based on predefined parameters.

Change Tables: A set of tables that mirror an OLTP in structure, with the possible addition of auditing information. Not all OLTP tables will necessarily have associated change tables.

Data Architecture: A specific framework for managing data to enable the institution to build and maintain the strategic capabilities it needs to achieve its mission. The framework consists of a set of principles, standards, and models that describe how the data will be created, maintained, and protected. The framework focuses on improving effectiveness and reducing long-term costs and contains components that cover the full data life cycle from creation to retirement. An example is an ETL tool.

Database: Any collection of data.

Database Engine: The software that holds the database and executes the requests against that database. Oracle is an example of a Database Engine.

DataMart: A customized subset of data taken from the Data Warehouse. A DataMart is typically set up by a specific individual or department to support their particular needs.

Data Model: A graphical representation illustrating data-related business requirements in the context of a given application.

Data Replication: The process of copying and maintaining schema objects in multiple databases that make up a distributed database system. Replication can improve the performance and protect the availability of applications because alternate data access options exist.

DataStore: See Operational DataStore.

Data Warehouse: An enterprise-wide database. It is a read-only collection of data from any number of sources. It is usually refreshed from Operational DataStores, but may also receive data from OLTPs. It is also the likely source of data for a DSS.

Decision Support System (DSS): A complete process for allowing users to access data which they need to support their decision making process. This includes the database(s) holding the data, the software application which interfaces with the Database Engine, metadata, training, and support.

Degree: Shows how many instances of an entity can exist at one end of the relationship for each entity instance at the other end. A crow's foot shows a relationship degree of many and a single point represents a relationship degree of one.

Denormalization: Roughly the opposite of Normalization. In a denormalized database, some duplicated data storage is allowed. The benefits are quicker retrieval of data and a database structure that is easier for end users to understand and is thereby more conducive to ad-hoc queries.

Domain: A set of business validation rules, format constraints, and allowable values that apply to a group of attributes. For example, yes and no, or days of the week.

ETL: Signifies Extraction, Transformation, and Load. The tool extracts, transforms and loads data from data sources to data targets in a central repository. The data sources can be a database, file, or COBOL copybook, or any combination of the three. It is primarily used to move data from an OLTP to an ODS, or from an ODS to a DSS.

Entity: Used in Logical Data Modeling, an Entity is a thing of significance, either real or conceptual, about which the business or system being modeled needs to hold information. For example, if the business needs to process sales orders, an Entity to represent sales orders would be recorded. An Entity generally corresponds to a physical table. Also see Attribute.

Entity Relationship Diagram (ERD): Entity relationship modeling involves identifying the things of importance in an organization (entities), the properties of those things (attributes) and how they are related to one another (relationships). The resulting information model is independent of any data storage or access method.

Foreign Key: In a table, one or more columns whose values must match the values in the primary key of the referenced table. The columns in the foreign key typically reference the primary key of another table but may reference the same table. This mechanism allows two tables to be joined together.

Function Hierarchy Diagram: Displays all of the functional requirements of an application and their logical groupings. Shows the decomposition of functions ranging from the highest level or root to the lowest level or leaf required.

Metadata: "Data describing the data." This data provides information about a database, including descriptions of the tables and columns, as well as descriptions of the data stored within those tables and columns.

Methodology: Facilitates a repeatable, structured approach to defining requirements and developing business applications. A methodology tells you what to do and when. An example is "Develop a Data Movement process."

MI Operations & Production Control: Individuals filling this role are responsible for overseeing the 24-hour operation of assigned systems, directing the daily setup of customer jobs for assigned systems, and negotiating schedules for all systems in the area.

Normalization: A relational database design concept which eliminates duplication of data storage in a database. This is a crucial element of OLTP systems, which can suffer severe performance penalties if the database is not normalized.

Not Nullable: A mandatory attribute or column is marked as mandatory by making it Not Nullable. Not Nullable indicates that a valid value must be entered for each occurrence of the attribute or column. Null values are not allowed.

Null: A Null indicates the absence of a value. This is the equivalent of leaving a field empty. Columns marked as "Not Nullable" or "Not Null" may not have Nulls. A "blank" or a "space" is not the equivalent of a null and is handled very differently from a null. "Blanks" and "spaces" must be absolutely avoided.

On-Line Analytical Processing (OLAP): A software technology that transforms data into multidimensional views and that supports multidimensional data interaction, exploration, and analysis. SAS is an example of OLAP.

On-Line Transaction Processing (OLTP): An OLTP database is the database with read and write access. This is where transactions are actually entered, modified, and/or deleted. Due to performance considerations, read-only requests on the database may be routed to an Operational DataStore. Typically, an OLTP database is a "normalized" database.

Operational DataStore (ODS): An ODS is a read-only database containing operational data in support of a specific business need. It is updated on a frequent basis (weekly, daily, hourly, or even more often) and may be populated from one or more OLTP and/or ODS databases. Depending upon its refresh cycle and usage, the ODS may be normalized or denormalized.

Operational Reporting: Standardized, stable, repeatable reports, which are scheduled, that access and manipulate data on parameters which are predefined.

Optionality: The minimum number of entity instances that are possible at one end of the relationship for each entity instance at the other end. For example, a dashed line indicates an optional relationship end that is read as "may be"; a solid line indicates a mandatory relationship end that is read as "must be".

Oracle Replication: Builds data replication using Oracle-generated snapshot tables and snapshot logs.

Primary Key: While primarily referring to tables, Primary Keys can also pertain to entities. A Primary Key is the mandatory column or columns used to enforce the uniqueness of rows in a table. This is normally the most frequent means by which rows are accessed. Please note, however, that a column which is part of a Primary Key may not contain null values.

Process Model: A visual illustration representing organizational units, which consist of departments or groups within a business, responsible for a specific business activity. It is strongly suggested that the process model be used during analysis.

Purge: To systematically and permanently remove old and unneeded data. The term purge is stronger than delete. It is often possible to regain deleted objects by undeleting them, but purged objects are gone forever.

Relationship: A named, significant association between two entities. Each end of the relationship shows the degree of how the entities are related and the optionality.

Relational Database: This term refers to a database in which data is stored in multiple tables. These tables then "relate" to one another to make up the entire database. Queries can be run to "join" these related tables together.

Security: Refers to techniques for ensuring that data stored in a computer cannot be read or compromised. Protection provided to prevent unauthorized or accidental access or manipulation of a database.

Snapshot Tables: A point-in-time copy of table data originating from one or more master tables.

Strategy: A synonym for plan, which is defined as a scheme, program, or method worked out beforehand for the accomplishment of an objective. The strategy tells you how to do it, and the guidelines and/or techniques to use. An example is the naming standards developed for the open systems environment.

Table: A tabular view of data used to hold one or more columns of data. It is often the implementation of an entity.

Trigger: A stored procedure associated with a table that is automatically executed on one or more specified events affecting the table.

Unique Key: 1. Defines the attributes and relationships that uniquely identify the entity. 2. A column or columns which contain unique values for the rows of a table. A column in a Unique Key may contain a null; therefore, a Unique Key defined for an entity may not make a suitable Primary Key for a table.

The Basics

What is a data warehouse?


A data warehouse involves several processes that require several technology components. Batch and transaction processing data first has to be extracted from operational databases, then cleaned up to remove redundant data and fill in blank or missing fields, and organized into consistent formats. The data is then loaded
into a relational database. Business analysts can then dig into the data using data
access and reporting software including On-Line Analytical Processing (OLAP)
tools, statistical modeling tools, geographic information systems (GIS) and data
mining tools.

What are these different kinds of analyses?


They range from the most basic (query and reporting) to the more complex (OLAP
and statistical analysis) to the most complex (data mining). Basic queries and
reports are usually performed by functional managers who use pre-defined queries
to look up such things as average monthly sales, total regional expenses and daily
totals. OLAP and multi-dimensional analysis tools are designed more for business
analysts who need to look at data across multiple dimensions. These tools let them
drill down from summary data sets into the specific data underlying the summaries.
Statistical analysis tools provide summary information too and help determine the
degree of relationship between two factors, such as zip code and sales. Data
mining tools analyze very large data sets to highlight hidden patterns, such as what
items grocery shoppers buy as a pair.
What is a data warehouse used for?
Many things. Data warehouses are the basis for customer relationship
management systems because they can be used for consolidating customer data
and identifying areas of customer satisfaction and frustration. Warehouses are also
used for fraud detection, product repositioning analysis, profit center discovery and corporate asset management.


For retailers, a data warehouse can help identify customer demographic
characteristics, identify shopping patterns, and improve direct mailing responses.
For banks, it can assist in spotting credit card fraud, help identify the most profitable
customers, and highlight the most loyal customers.
Telecommunications firms use data warehousing to predict which customers are likeliest to switch and then target them with special incentives to stay.

Insurance companies use data warehousing for claims analysis to see which
procedures are claimed together and to identify patterns of risky customers.
Manufacturers can use data warehousing to compare costs of each of their product
lines over the last several years, determine which factors produced increases and
see what effect these increases had on overall margins.

Is it hard to set up a data warehouse?


Setting up a data warehouse isn’t easy. Just identifying where all a business’s data
comes from, how it gets entered into a system and where it is all stored can be
difficult, and setting up a data cleansing process is quite complicated. It all
depends on how large and complex the data collecting and storing operation is.
Large data warehousing projects take years and millions of dollars to implement.

Is there such a thing as a small data warehouse?


Yes. Some companies begin with a data mart, a scaled-down warehouse that
focuses on just one functional department area, such as finance. Data marts often
can be implemented in a couple of months and later be linked together into a
confederated warehouse.

What five questions should be asked in the data warehouse planning stage?
1. What data is needed to make business decisions?
2. Which business units will use it?
3. What kind of data analysis will be done?
4. How granular will the data be and what is the oldest data to be archived in it?
5. What are the security requirements?

What kind of staffing does a data warehouse require?


The technical project team for a data warehouse includes a project manager, a
data and system architect, database administrators, business application analysts,
data conversion specialists, and network support staff.

What are some of the factors that determine whether a data warehouse will be
successful?
Database design, end user training, the ongoing adjusting and tuning of
applications to meet user needs, and the system architecture and design.

Is data warehousing cutting edge or is everybody doing this?


Many analysts say that every Fortune 1000 company has some type of data
warehouse, and a survey conducted in 1999 by International Data Corp.
determined that close to 50% of ALL companies surveyed (large, medium, small
and tiny) are either using a data warehouse now or are in the planning stages of
building one.

Data Warehousing - What Is It?

Heralded as the solution to the management information dilemma, the term "data
warehouse" has become one of the most used and abused terms in the IT
vocabulary. But ask a variety of vendors and professionals for their vision of what a
data warehouse is and how it should be built, and the ambiguity of the term will
quickly become apparent.

To a number of people, a data warehouse is any collection of summarised data from various sources, structured and optimised for query access using OLAP (on-line
analytical processing) query tools. This view was originally propagated by the
vendors of OLAP tools. To others, a data warehouse is virtually any database
containing data from more than one source, collected for the purpose of providing
management information. This definition is neither helpful nor visionary, since such
databases have been a feature of decision support solutions since long before the
coining of the term "data warehouse".

The concept of "data warehousing" dates back at least to the mid-1980s, and
possibly earlier. In essence, it was intended to provide an architectural model for
the flow of data from operational systems to decision support environments. It
attempted to address the various problems associated with this flow, and the high
costs associated with it. In the absence of such an architecture, there usually
existed an enormous amount of redundancy in the delivery of management
information. In larger corporations it was typical for multiple decision support
projects to operate independently, each serving different users but often requiring
much of the same data. The process of gathering, cleaning and integrating data
from various sources, often legacy systems, was typically replicated for each
project. Moreover, legacy systems were frequently being revisited as new
requirements emerged, each requiring a subtly different view of the legacy data.

Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for legacy data. From here data could
be distributed to "retail stores" or "data marts" which were tailored for access by
decision support users. While the data warehouse was designed to manage the
bulk supply of data from its suppliers (e.g. operational systems), and to handle the
organization and storage of this data, the "retail stores" or "data marts" could be
focused on packaging and presenting selections of the data to end-users, often to
meet specialised needs.

Somewhere along the way this analogy and architectural vision was lost, often
manipulated by suppliers of decision support software tools. Data warehousing
"gurus" began to emerge at the end of the 80s, often themselves associated with
such companies. The architectural vision was frequently replaced by studies of how to design decision support databases. Suddenly the data warehouse had become the miracle cure for the decision support headache, and suppliers jostled for position in the burgeoning data warehousing marketplace.

Despite the recent association of the term "data
warehousing" with OLAP and multi-dimensional database technology, and the
insistence of some people that data warehouses must be based on a "star
schema" database structure, it is wise to restrict the use of such designs to data
marts. The use of a star schema or multi-dimensional / OLAP design for a data
warehouse can actually seriously compromise its value for a number of reasons:

a. such designs assume that all queries on the warehouse will be of a quantitative nature - i.e. queries on aggregated numeric data. This overlooks the fact that data warehouses can also offer enormous benefit as repositories of text-based or qualitative data - e.g. the provision of a 360° view of customers by collecting profile information from a range of sources;
b. such designs require the pre-aggregation of data in the data warehouse. In
doing so, and eliminating much of the original transactional data, much information
can be lost. If information requirements change, requiring alternative aggregations
from the transactional data, a star or multi-dimensional design will quickly become
obsolete. A normalised design, on the other hand, which accommodates
transactional level data would be able to support any number of alternative
aggregations. While capacity and/or performance constraints may preclude this as
an option for some data, the storage of low level transactional data in a data
warehouse should not be ruled out, as this is often the only way of ensuring
maximum flexibility to support future information needs;
c. optimised models such as star schemas are, in general, less flexible than
normalised designs. Changes to business rules or requirements are generally
more easily accommodated by normalised models.

Data marts provide the ideal solution to perhaps the most significant conflict in data
warehouse design - performance versus flexibility. In general, the more normalised
and flexible a warehouse data model is, the less well it performs when queried.
This is because queries against normalised designs typically require significantly
more table join operations than optimised designs. By directing all user queries to
data marts, and retaining a flexible model for the data warehouse, designers can
achieve flexibility and long term stability in the warehouse design as well as
optimal performance for user queries.

Why is it so expensive?
While the data warehousing concept in its various forms continues to attract
interest, many data warehousing projects are failing to deliver the benefits
expected of them, and many are proving to be excessively expensive to develop
and maintain. For this reason it is important to have a clear understanding of their
real benefit, and of how to realise this benefit at a cost which is acceptable to the
enterprise.

The costs of data warehousing projects are usually high. This is explained primarily
by the requirement to collect, "clean" and integrate data from different sources -
often legacy systems. Such exercises are inevitably labour-intensive and time-
consuming, but are essential to the success of the project - poorly integrated or low
quality data will deliver poor or worthless management information. The cost of
extracting, cleaning and integrating data represents 60-80% of the total cost of a
typical data warehousing project, or indeed any other decision support project.

Vendors who claim to offer fast, cheap data warehouse solutions should be asked to
explain how they are able to avoid these costs, and the likely quality of the results
of such solutions must be carefully considered. Such vendors typically place the
emphasis on tools as a solution to the management information problem – OLAP
tools, data integration technology, data extraction tools, graphical user query tools,
etc. Such tools resolve only a fraction of the management information problem,
and represent a small proportion of the cost of a successful data warehousing
project.

Focus on technology rather than data quality is a common failing among data
warehousing projects, and one which can fatally undermine any real business
benefit.

How can the cost be justified?

Given the high costs, it is difficult to justify a data warehousing project in terms of
short-term benefit. As a point solution to a specific management information need,
a data warehouse will often struggle to justify the associated investment. It is as a
long term delivery mechanism for ongoing management information needs
that data warehousing reaps significant benefits. But how can this be achieved?
Given the above facts about the loading of costs on data warehousing projects, it is
clear that focus must be on the reduction of the ongoing cost of data extraction,
cleaning and integration.

A number of years ago I conducted a study for a multi-billion dollar manufacturing and services organization. The purpose of the study was to identify why previous
data warehousing projects had failed to deliver the expected benefits, and to make
recommendations for how future projects could rectify this.

The study resulted in a number of significant findings, including the following:

1. 80% of the data used by the various data warehouses across the corporation
came from the same 20% of source systems.

2. Each new data warehousing project usually carried out its own process to extract,
clean and integrate data from the various sources, despite the fact that much of the
same data had been the subject of previous exercises of a similar nature.
3. The choice of data to be populated in the data warehouse was usually based on
needs of a specific group, with a particular set of information requirements. The
needs of other groups for the same data were rarely considered.

Experience of other organizations showed a very similar pattern to the above. From
these findings alone it is clear that there is scope for economies of scale when
planning data warehousing projects; if focus were to be placed initially on the 20%
of source systems which supplied 80% of the data to decision support systems,
then an initial project which simply warehouses "useful" data from these systems
would clearly yield cost benefits to future MIS projects requiring that data. Rather
than targeting a specific business process or function, benefits should be aimed at
the wider audience for decision support. Such a project would form an invaluable
foundation for an evolving data warehouse environment.
When building a data warehouse the use of multi-dimensional, star-schema or other
optimised designs should be strongly discouraged, in view of the inherent
inflexibilities in these approaches as outlined above. The use of a relational,
normalised model as the backbone of the warehouse will ensure maximum
flexibility to support future growth. If user query access is then strictly limited to
data marts, the data warehouse needs only to support periodic extracts to data
marts, rather than ad-hoc query access. Performance issues associated with these
extracts can be addressed in a number of ways - for example through the use of
staging areas (either temporary or permanent) where relational table structures are
pre-joined or "flattened" to support specific extract processes.

Once this initial project is complete, emphasis can be placed on the growth of the
warehouse as a global resource for unspecified future decision support needs,
rather than as a solution to specific requirements at a particular time. In
subsequent phases of the warehouse development, new data which is likely to
play a major role in future decision support needs should be carefully selected,
extracted and cleaned. It can then be stored alongside the existing data in the
warehouse, hence maximising its information potential. As new information needs
emerge, the cost of meeting them will be diminished due to the elimination of the
need to perform much of the costly
extraction, cleaning and integration functions usually associated with such
systems. Over time, this environment will grow to offer a permanent and invaluable
repository of integrated, enterprise-wide data for management information. This in
turn will lead to massively reduced time and cost to deliver new decision support
offerings, and hence to true cost justification. The effort required to achieve this
must not be underestimated, however. Identifying which data is "useful" requires a
great deal of experience and insight. The way in which the data is modelled in the
warehouse is absolutely critical - a poor data model can render a data warehouse
obsolete within months of implementation. The process used to identify, analyse
and clean data prior to loading it into the warehouse, and the attendant user
involvement, is critical to the success of the operation. Management of user
expectations is also critical. The skills required to achieve all of the above are
specialised.

Once in the warehouse, data can be distributed to any number of data marts for user
query access. These data marts can take any number of forms, from client-server
databases to desktop databases, OLAP cubes or even spreadsheets. The choice
of user query tools can be wide, and can reflect the preferences and experience of
the users concerned. The wide availability of such tools and their ease of
implementation should make this the cheapest part of the data warehouse
environment to implement. If data in the warehouse is well-structured and quality-
assured, then exporting it to new data marts should be a routine and low-cost
operation.
In summary, a data warehouse environment can offer enormous benefits to most
major organizations if approached in the correct way, and if distractions from the
main goal of delivering a flexible, long-term information delivery environment are
placed in perspective.

The Exploration Warehouse and Mart


By: Sand Technologies Systems

Introduction
Tables and Figures are not provided.

Over the course of the 1960s, 1970s, and 1980s, most medium-to-large businesses
successfully moved key operational aspects of their enterprises onto large
computing systems. The 1980s saw relational database technologies mature to the
point where they could play the central role in these systems. Naturally, the
requirements of operational systems, being substantial and unforgiving, forced
database vendors to focus development efforts almost exclusively on issues like
transaction speed, integrity, and reliability.

Unfortunately the methods employed to achieve transaction speed, integrity, and
reliability were completely contrary to the requirements of reporting and freeform
data inquiry. Indexing techniques, integrity checking, locking schemes, data
models and transaction logging impaired the ability to obtain information from
operational data stores. Moreover, operational data stores usually did not have
enough data on line to answer questions against data more than 90 or 180 days
old.

When business questions could be answered, it was not unusual to wait weeks for
answers. Sometimes, executives would be given the non-choice between stopping
the business and producing a particular report. They would also be confronted with
contradictory information from multiple systems. It seemed inconceivable that so
much time, money, and attention could be paid to technology only to have
relatively modest inquiries turned back.

The Data Warehouse

A handful of technologists, most notably Ralph Kimball, founder of Red Brick
Systems, and Bill Inmon, founder of Prism and author of Building the Data
Warehouse, foresaw the large-scale reporting and decision-support demand that
would follow the transactional systems binge. They pioneered and advocated the
data warehouse – the counterpart to operational systems. The last ten years have
seen widespread acceptance of the data warehouse concept and the consequent
growth of an entire industry.

The Promise of the Data Warehouse

The promise of the data warehouse is straightforward. As operational systems are
dedicated to recording information, the data warehouse is dedicated to returning
information to the enterprise. Operational systems run the business today. Analysis
of the data in the warehouse determines how the business is run tomorrow.

The State of the Data Warehouse

The premise of the data warehouse is that it is physically separate from operational
systems, and has a mission completely different from that of operational systems.
The virtue of a separation of systems is twofold. It ensures that the data
warehouse will not interfere with business operations, and it facilitates the
acquisition, reconciliation, and integration of data, not only from different
operational systems within the enterprise, but also from sources external to the
business.

A few technical challenges to successful data warehousing have been overcome
during the last several years. First, the workload of satisfying broad inquiry against
large volumes of data is fundamentally different from the workload of recording
business transactions. Successful on-line transaction processing (OLTP) requires
the ability to satisfy a large number of requests, numbering as high as millions per
day, each of which must access and manipulate a very small amount of data in a
large database. Successful inquiry or decision support system (DSS) processing
requires the ability to satisfy a modest-to-large number of requests, each of which
must access and process a large amount of data within a very large database.

First consider the transaction processing request example of consumers using their
retail credit cards to make purchases. There may be millions of these that occur
during a given day. There may be hundreds proceeding simultaneously at any
given time. Each one, however, involves locating a limited amount of account
information for a particular consumer and modifying it. The information is usually
measured in bytes.

Then consider the inquiry and reporting scenario. Someone in credit card
merchandising and marketing has a fairly simple question: Among new
cardholders, how many did we market to in a particular region, who match a
particular demographic and made purchases from a particular product line? This
single question involves accessing and comparing perhaps hundreds of millions of
records depending upon the size of the organization.
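
The contrast can be made concrete with two hypothetical statements (schemas and
names are invented for illustration, written here as Python string constants): the OLTP
request touches a single account row measured in bytes, while the DSS question scans
and aggregates a very large number of rows:

# Hypothetical OLTP request: locate and modify one cardholder's account record.
OLTP_REQUEST = """
UPDATE account
SET    balance = balance + :purchase_amount
WHERE  account_id = :account_id
"""

# Hypothetical DSS request: scan and aggregate millions of rows to answer the
# marketing question posed in the text.
DSS_REQUEST = """
SELECT COUNT(DISTINCT c.cardholder_id)
FROM   cardholder c
JOIN   purchase   p ON p.cardholder_id = c.cardholder_id
WHERE  c.first_card_date >= :campaign_start
AND    c.region              = :region
AND    c.demographic_segment = :segment
AND    p.product_line        = :product_line
"""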

Developments in hardware and software parallelism have reached a point where
they are acknowledged as indispensable in handling the latter type of workload.
Data warehousing is particularly dependent upon parallel processing technology
due to the large data volumes involved. Parallel processing is essentially a divide-
and-conquer approach to the problem, bringing many processors, memories, and
data buses to bear on any given request. Even the requirement of periodically
moving large volumes of data from operational systems into the data warehouse
depends largely on parallel data loading and index building in order to fit within
acceptable system down-time windows. Developments in large memory computer
configurations also help to satisfy the data warehouse workload. Even more
significantly, however, advances in indexing, compact data representation, and
data processing algorithms mean that fewer actual bytes of data are accessed and
manipulated to answer a given question. The concept of the data warehouse has
become accepted to the point that virtually all Global 2000 companies and most
medium-sized companies have a data warehouse development project underway
or are planning for one. The market for data warehouse-related hardware,
software, and services is measured in the tens of billions of dollars worldwide.

Even so, variants of the data warehouse have emerged to meet some of the
specialized real-world needs of companies and company departments everywhere.
For example, data warehouse satellites, called data marts, are deployed and
tailored to the needs of a specific audience. There is also the up-to-the-hour or up-
to-the-minute transactional warehouse hybrid, called the operational data store, for
companies that have a requirement for extremely fresh information.

What Data Warehouses and Data Marts Leave on the Table

Although the concept of the data warehouse is universally accepted, warehouses are
still hard to build, and they frequently leave some of the corporate information appetite
unsatisfied, even when deployed successfully. The basic mandate of the data
warehouse or the data mart is enormous: satisfy the information requirements of
an entire company or an entire department, regularly and in a timely way.

The technology used to build data warehouses and marts is antagonistic to this all-
purpose sensibility. Parallelism, indexing, clustering, and even novel storage
architectures like proprietary multi-dimensional data storage all come at a cost.
Effective parallelism and indexing depend extensively upon knowing in advance
what questions will be asked, or if not the specific questions, at least the form of
the question.

In practice, this means that data warehouses and even marts usually discourage
extraordinary lines of inquiry. Given complete freedom of interrogation, power
users will bring a data warehouse to its knees with their queries. That is, parallel
data striping and indexing suitable for one query may be ill suited to another.
Whenever the data warehouse is not indexed or tuned for a particular query the
physical resources of the system can be overwhelmed to the detriment of other
clients.

Most data warehouse and data mart end-users have modest information
requirements and keep businesses running by querying inside the lines. For the
most part it is this relatively large audience that data warehouses and marts end up
satisfying. In order to provide most of the users with timely service most of the
time, technical organizations take the defensive approach, prohibiting non-
standard (ad hoc) queries, or scheduling them only at odd hours. In this way they
inhibit the smaller number of business analysts in the organization — those most
likely to find breakthrough opportunities.

What these elite knowledge workers and business analysts desire most from their
mart or warehouse is the ability to go wherever their mind or intuition may take
them, exploring for patterns, relationships, and anomalies. This is how they
cultivate business knowledge.

Ultimately, the difference between information and knowledge is confidence;
confidence to act decisively. When decisions are pending, knowledge beats
information every time. And knowing depends on the ability to ask any question
and get fast answers from the corporate data, the ability to fortify a hunch, to
supplement a report, or question assumptions in real time. Here is the money that
most data warehouses and marts leave on the table.

Data Warehouse and the OLAP Data Mart Design

The data warehouse and data marts have different objectives, and their
design reflects these differences.

Data Warehouse Design

The data warehouse is optimized for load and data extraction performance,
flexibility, and data maintenance. The flexibility objective requires that the data be
an unbiased representation so that each business user can extract the portion of
the data he or she needs and apply the assumptions germane to his or her analysis.
The model is substantially based on the business rules; hence, it is developed by
transforming the business data model as needed to support these objectives.

Step 1: Select the data of interest. During the development of a data
warehouse, the analyst needs to determine the data that will be needed
to support strategic analysis. One of the questions to avoid asking a user
is "What data do you need?" The answer to this question is pretty
obvious -- all of it. Incorporating unneeded data into the data warehouse
increases the development time, adds a burden to the regular load
cycle, and wastes storage space due to the retention of history for the
data elements that are not used. The first step of the data warehouse
design is determining the elements that are needed. Sometimes the
answer is not very clear, and we must then ask whether or not we feel
there is a reasonable chance that the data element will be needed to
support strategic analysis.

There is an important reason that this is the first step in the process. It
defines our scope. We have many other steps to traverse, and by
eliminating the data we don't need, we will not be wasting our time
incorporating data elements that will be discarded later.

Step 2: Add time to the key. The data warehouse is a time-variant
data store. Unlike the operational systems that often overwrite the
historical view of the data with the current view, the data warehouse
maintains the history. The time dimension is needed to distinguish
multiple instances of the same item.

This is the second step in the process because it has the greatest impact
on the data model, and the data model is the foundation of our design.
Introducing the historical perspective means that attributes (e.g., last
name) could have multiple values over time, and we need to retain each
of them. To do this, we need to make each instance unique. Further, the
business rules change when we opt for the historical perspective. For
the operational system, we need to know the department to which a person is
assigned, and the person can only be assigned to one department at a time. In
the data warehouse, we have
history, and we must therefore deal with a person moving from one
department to another.
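
A minimal sketch of this step (names are hypothetical): the operational key alone is no
longer unique once history is kept, so an effective date is added to the warehouse key
and the same person can legitimately appear once per assignment:

# Operational key vs. warehouse key once time is added (hypothetical columns).
OPERATIONAL_KEY = ("person_id",)                   # current view only
WAREHOUSE_KEY   = ("person_id", "effective_date")  # one row per historical version

person_department_history = [
    # (person_id, effective_date, department) -- the same person appears twice
    (1001, "2001-01-15", "Accounting"),
    (1001, "2002-06-01", "Marketing"),
]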

Step 3: Add derived data. Some business measures are used
repeatedly. To ensure consistency, these terms should be standardized,
and the formulas for deriving them should be determined. (This is one of
the stewardship functions.)

Step 4: Determine the level of granularity. The fourth step entails
determining the level of granularity, or level of detail, in the data
warehouse. A retailer that is trying to determine which products sold
with other products needs to capture each sales transaction. A retailer
that is only concerned about product movement may be satisfied with a
daily summary of product sales. These are very different granularity
levels, and there is a substantial difference in the cost for developing
each of these warehouses.

Upon completion of the first four steps, the data warehouse design should
meet the business needs. The warehouse will have the needed data and
will store it in a way that gives the business users the flexibility to use it
to meet their needs.

Step 5: Summarize data. The warehouse is used for strategic analyses,
and often these entail data summaries. By summarizing the data, we are
in a better position to ensure consistency and reduce the need for
calculating the same summaries for delivery to multiple data marts.
Data summaries do not necessarily reduce storage costs -- they may
actually increase them. The impact on storage depends on whether we
still need the detailed data. If all of our analysis is on the monthly sales
figure, for example, then we may not need the details. If, however, we
want to delve into exceptions, we still need the details to support the
analysis. In such an instance, we retain the details and add another
table with the summary.

Step 6: Merge tables. Sometimes related entities can be compressed
into one. For example, if we have a customer file and want to relate
each customer to a metropolitan statistical area (MSA), instead of
relating the customer entity to an MSA entity, we could simply add the
MSA as a foreign key (fk) attribute of customer, as shown in Figure 4.

Figure 4 -- Merging tables.


Step 7: Create arrays. Sometimes arrays are appropriate in the data
warehouse, and this step creates them. For accounts receivable
analysis, for example, we need a few buckets of information (e.g.,
current debt, debt aged 1-30 days, debt aged 31-60 days, etc.). We
update all the buckets at one time, and we use all of these together. In
cases such as this, an array may be appropriate within the warehouse.
The criteria for using arrays are that we have a small number of
occurrences of the data, that it's always the same number of
occurrences, that the data is all available for insertion at the same time,
and that the data is used together.
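
A small sketch of the accounts receivable example (field names are hypothetical): the
aging buckets form a fixed-size array whose values arrive together and are used
together, which is what makes an array representation reasonable here:

# One accounts-receivable row with its aging buckets held as a small, fixed array.
ar_balance = {
    "customer_id": 5001,
    "as_of_date": "2002-03-31",
    "debt_buckets": [1200.00, 450.00, 75.00],  # current, 1-30 days, 31-60 days, ...
}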

Step 8: Segregate data. The last step is segregating data based on
stability and usage. The data warehouse contains history, and updates
are made by adding records with data changes. If there is a significant
difference in the data volatility, it often makes sense to partition data,
with the rapidly changing data in one table and the slowly changing data
in another.

These eight steps transform the business data model into a data model
for the data warehouse. Additional adjustments may be made to
improve performance.

OLAP Data Mart Design

The OLAP data mart is designed to meet the objectives of legibility,
response time, and data visualization. To meet these objectives, we
apply dimensional modeling techniques. The steps required to develop a
dimensional model (or star schema) follow.

Step 1: Distill the business questions. The first step is to identify the
business questions and separate the measurements of interest from the
constraints (dimensions). Measurements include sales quantity, sales
amount, customer count, etc. Constraints include the product hierarchy,
customer hierarchy, sales area hierarchy, time, etc. In performing this
step, we don't pay attention to the relationships among the constraints
-- we simply identify them. An easy way of separating the metrics from
the constraints is to ask business users to tell us the questions they will
be asking and then dissect the response. A sample response is that a
user needs to see monthly sales quantity and dollars by region, product,
customer group, and salesperson. The things the user wants to see are
the measures, and the way he or she wants to see them, as indicated by
the parameters following the word "by," are the constraints.
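
Dissecting the sample response above might look like the following (the lists are
illustrative, not exhaustive):

# The user's question: "monthly sales quantity and dollars by region, product,
# customer group, and salesperson" -- separated into measures and constraints.
measures    = ["sales_quantity", "sales_amount"]
constraints = ["time (month)", "region", "product", "customer_group", "salesperson"]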

Step 2: Refine the model. A major advantage of using a star schema is
its ability to facilitate moving up and down a hierarchy. This is
accomplished in the second step by combining, in a single dimension,
multiple levels of the hierarchy. An example of a hierarchy follows.

product --> product group --> product line

This is the major denormalization step in the process. By combining the
three levels of the hierarchy into a single table, each piece of data for
the product line is repeated for each product.
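
A small sketch of the denormalized product dimension (rows are invented): the three
hierarchy levels live in one table, so product group and product line values repeat on
every product row:

# Denormalized product dimension: product --> product group --> product line in one table.
product_dimension = [
    # (product_key, product_name, product_group, product_line)
    (1, "Road Bike 200", "Road Bikes", "Bicycles"),
    (2, "Road Bike 300", "Road Bikes", "Bicycles"),
    (3, "Trail Helmet",  "Helmets",    "Accessories"),
]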

Step 3: Add attributes to the dimensions. The dimensions represent
the business constraints. Often, users will want to have information
about the dimensions also available. For a product, for example, users
may want to know the weight, color, and size. These attributes are
added to the dimension table in this step.

Step 4: Ensure that the dimensions have good keys. The key of the
dimension table usually becomes part of the key of the fact table. To
perform this role efficiently, it needs to obey the rules of good keys,
and it needs to be relatively short. Since we are pulling data from the
data warehouse, the first criterion is usually already met. If the key is
too long, then it may be advisable to use a system-generated key to
replace it.

Step 5: Normalize dimensions. In the second step, we created
denormalized dimension tables. In this step, we examine those tables
and consider normalizing them. In general, we should not be normalizing
dimensions merely to save space. The space savings will typically be
outweighed by the increase in query complexity.

Step 6: Factor refreshment for the dimensions. History of the data
in the marts is maintained by including a date-oriented dimension.
Data in some of the other dimensions may also change over time, and if
we are interested in that history, we may need to create what is known
as a slowly changing dimension.
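
One common way to handle this, sketched below with hypothetical columns, is the
so-called "type 2" slowly changing dimension: when a tracked attribute changes, the
current row is closed off and a new row with a new surrogate key is added, so the fact
table can join to the version that was in effect at the time:

# "Type 2" slowly changing customer dimension (hypothetical data).
customer_dimension = [
    # (surrogate_key, customer_id, city, effective_date, expiry_date, current_flag)
    (10, "C-1001", "Boston",  "2000-01-01", "2001-08-31", "N"),
    (27, "C-1001", "Chicago", "2001-09-01", "9999-12-31", "Y"),  # current version
]
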
Step 7: Prototype and refine. The last step of the process recognizes
that, despite our best efforts, needs will be identified as people start
using the data marts. This step allocates time to have the business users
exercise the data mart, with the development team incorporating
appropriate modifications.

These seven steps create a star schema that can be used by the business
community if they are equipped with the appropriate tools. Additional
measures may also be undertaken to further improve performance.

Model Comparison

The data models for the data warehouse and data marts have both
similarities and differences, as shown in Table 1. Both reflect integrated
strategic data, include derived data, and typically include an element of
time. They differ in the degree of normalization and in the organization
philosophy, with the data warehouse being organized for stability and
usage and the data marts being organized for ease of use and response
time.

Table 1 -- Model Comparison

Business Model                  | Data Warehouse Model              | Data Mart Model
Normalized                      | Somewhat denormalized             | Highly denormalized
Integrated, subject oriented    | Integrated, subject oriented      | Integrated, subject oriented
Enterprise perspective          | Strategic view of enterprise data | Strategic view of enterprise data
May contain time element        | Contains time element             | Contains time element
No derived data                 | Some derived data                 | Some derived data
Organized around business rules | Organized for stability and usage | Organized for ease of use and access speed

DATA ACQUISITION

Data acquisition is the complex set of processes by which data is moved
from the operational systems to the data warehouse. It consists of
several discrete processes: capture, cleansing, integration,
transformation, and loading (see Figure 5).

Figure 5 -- Data acquisition.

Data Capture

The first step of this process is capture. During the capture process, we
determine which systems will be used for which data, understand
those systems, and extract the data from them.

Before we can pull data out of a source system, we must choose the
system to be used. Sometimes the decision is easy -- there's only one
universally accepted major data source. More often, however, we must
choose from among several candidates. In making this selection, the
following criteria should be considered:

Point-of-data origin. Traditionally, the best source of data is the system
in which it is initially entered. Sales information, for example, originates
in the point-of-sale system, which could be considered the best source of
that data. As the data flows through the rest of the operational
environment, changes may take place, and these changes could affect
the validity of the data of interest.

Completeness of the data. If we need data that originates in several
systems, an easier path may be to pick one source into which that data
is collected during normal operational processing. Selecting this source
simplifies both the data capture and the data integration, since the data
from a few sources is already integrated.
System reliability. Some systems are considered to be more reliable
than others. Systems that are implemented more recently typically
reflect more stringent validation rules, and these may provide a better
source of data for the data warehouse.

Data currency. If we want customer contact information, we may choose
to use the billing system rather than using the order entry system. The
billing system is more likely to receive corrections since it is a source of
revenue for the company.

Documentation availability. Some systems are better documented
than others. To the extent that the system is well documented, our data
capture activities (particularly our source system analysis activities) are
simplified.

Accessibility. The location and technology of the source system may
also affect our choice. Data that is locally available is often easier to
obtain than data that is remote and managed by a different computer
center. Similarly, data in a current technology is easier to gather than
data stored in a technology no longer commonly used in the company.

These are all rational reasons for selecting a particular source system.
There is another factor that should also be considered: politics. Selection
of the source system may be impacted by the faith the users will have in
the data, and some users may have preconceived notions concerning
the viability of some of the source systems.

Source System Analysis

The process used for understanding the operational systems is source
system analysis. Although the starting point is looking at the names of
the data elements, the analysis needs to go much deeper. Using the
data element name implies something about what is in the field. The
analyst should either locate the definition for the field in the system
documentation (if any is available) or create it based on information
gleaned from the system users and maintainers. Once the definition is
determined, the analyst needs to examine the data in the field to ensure
that all of it conforms to that definition. There are several conditions that
often lead to exceptions. For example:

• A field that was included in the system may no longer be needed, and when a new
field was needed, the programmer reused the existing field without changing the field name
(or the documentation).
• The original field only applies in some cases (e.g., residential customers), and the
programmer used the field to mean something else for other cases (e.g., commercial
customer).
• The original field (e.g., work order number) did not apply to a particular group, and
the programmer used the field to mean something else (e.g., vehicle number).

Once the definition of the field is known, the analyst needs to examine the
quality of the data with respect to its accuracy and completeness. The
accuracy examination entails looking at each individual field and
examining field dependencies. For example, if one field indicates an
insurance claim for pregnancy, the gender field should be "female."

A luxury we have in examining the entire data set is looking at the
demographics of the data. Although 11/11/11 is a valid birth date that
would pass system edit checks, if we find that 30% of the people in the
data set have that birthday, we may become suspicious about the data's
accuracy. Understanding the quality is extremely important. We use it
both to set the quality expectations and to determine the cleansing
requirements of the data acquisition process.

The completeness examination determines whether or not the field has a
data value when it is needed. As with the data accuracy, this is a
reflection of the quality of the data, and we use it to set expectations
and determine the data acquisition processing requirements.

Data Extraction

For the initial load of the data warehouse, we need to look at all the data
in the appropriate source systems. After the initial load, our processing
time and cost are significantly reduced if we can readily identify the
data that has changed. We can then restrict our processes to that data.
There are six basic methods for capturing data changes.

1. Use source system time-stamps. Operational systems often have time-stamps in each
record to indicate the last time it was updated. When such time-stamps are available, our
extract programs can select only the data that has changed since our last extract (a minimal
sketch of this technique follows this list). If the source system performs physical record
deletions, this approach cannot be used, since there won't be a record showing the deletion.
2. Read the database management system (DBMS) log. The DBMS log maintains
information about changes to the database. With some of the tools available today, we can
read this log to detect the data changes.
3. Modify the operational system. When the operational system does not have a way
of marking changes to the data, one approach is to change that system. This option is often
difficult to justify, since changes to the operational system may be costly and each change
introduces a risk with respect to that system's reliability. If this option is selected, it should
be done as a separate project and not be absorbed into the data warehouse development
effort.
4. Compare before and after images of the database. For older batch systems, backup
files can be captured and compared using utilities such as Comparex, Superc, or Syncsort.
The comparison process itself may be slow. Once the changed records are identified, these
can be extracted for further processing.
5. Create snapshots of the operational systems and use them in the load process.
This technique rarely applies to the data warehouse since it entails rebuilding the history.
6. Use database triggers from the operational system. If the operational system
employs a modern relational database management system, triggers can be used to update a
table with the changes. The major drawback of this technique is that it places an additional
burden on each transaction within the operational system.
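
A minimal sketch of the first technique, time-stamp-based extraction (table, column and
parameter names are hypothetical, and the DB-API parameter style may differ by driver):

# Select only rows changed since the previous extract run; note the caveat from
# point 1 above -- physically deleted source rows will not be picked up this way.
INCREMENTAL_EXTRACT_SQL = """
SELECT *
FROM   customer_master
WHERE  last_update_ts > :last_extract_ts
"""

def extract_changes(conn, last_extract_ts):
    """Return the changed rows using a DB-API connection (e.g. sqlite3 or cx_Oracle)."""
    cur = conn.cursor()
    cur.execute(INCREMENTAL_EXTRACT_SQL, {"last_extract_ts": last_extract_ts})
    return cur.fetchall()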

Applying a changed data capture technique can improve the data capture
process efficiency, but it is not always practical. It is, however,
something that needs to be researched in designing the data capture
logic.

Cleansing

We analyzed the quality of the source systems during source system
analysis. Now we need to do something about it. During the cleansing
process, we set the quality expectations and then incorporate the steps
needed to meet those expectations. In setting the quality expectations,
we need to balance the ideal situation (e.g., perfection) with the cost of
attaining it. The data warehouse is designed to support strategic
analysis, and data perfection is often unnecessary. This is another
important role for the data steward.

Once the data quality expectations are set, we need to use data cleansing
tools or develop algorithms to attain that quality level. One aspect of
quality that is not specific to the individual systems deals with data
integration. This will be addressed in the next section.

We have four fundamental choices in dealing with errors that we encounter
(a sketch illustrating the last two options follows this list):
1. Reject the record. The error in the source data may be so severe that we would need
to reject the entire record. For example, we may receive a record that is missing critical
data or for which critical data does not pass our validation rules.
2. Accept the error. Sometimes we can detect an error in the source system but
determine that, given our quality expectations, it is within our tolerance levels. When this
occurs, we may accept the record. Depending on the type of error and our interest in
tracking it, we may also issue an alert about the error.
3. Set a default value. We may be receiving data from multiple customer systems,
some of which may not have a value for a field such as customer type. If we know that
most of the customers in that system are of a particular type, we may be willing to accept
an error of misclassifying a few customers for the increased value of inserting a correct
customer type for most customers.
4. Correct the error. In the previous example, there may be other data that can be used
to identify the customer type. For example, if we know that small businesses rarely spend
more than $1 million with us and large businesses almost always spend more than $10
million, we may insert a customer type code based on the business volume. As with the
default value, the data may not be perfect, but it provides a more accurate picture of our
customer base.
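
A minimal sketch of options 3 and 4 (field names and dollar thresholds are invented to
mirror the example above):

def cleanse_customer_type(record):
    """Default or derive a missing customer type; flag the value as derived."""
    cleaned = dict(record)
    if not cleaned.get("customer_type"):
        volume = cleaned.get("annual_volume", 0)
        if volume > 10_000_000:
            cleaned["customer_type"] = "LARGE"    # option 4: correct from other data
        elif volume < 1_000_000:
            cleaned["customer_type"] = "SMALL"    # option 4: correct from other data
        else:
            cleaned["customer_type"] = "UNKNOWN"  # option 3: set a default value
        cleaned["type_was_derived"] = True        # record the source/warehouse mismatch
    return cleaned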

If either of the last two options is selected, we will have a mismatch
between the data in the operational sources and the data warehouse.
Having this difference is not necessarily bad -- what's important is that
we recognize that the difference exists. If we correct an error, we also
need to recognize that nothing has changed either in the business
process and source system that permitted the error to exist or in the
source data itself.

Actually, there's also a fifth option. After we evaluate the implications of
exercising one of the above four options, we may decide that none of
them is acceptable. When this happens, we will need to reexamine and
potentially change our quality expectations. When the quality
expectations change, we may also need to reevaluate whether or not
the data warehouse can still meet its intended objectives.

The data cleansing process may consist of a combination of automated
and manual processes. For example, the validation checks may detect
an error that we determined fell into the fourth category. If we can't
create an algorithm to fix the error condition, we may be faced with the
prospect of suspending the record and waiting until a person looks at
the record and makes the correction. A cleansing process that depends
on manual intervention requires a strong business commitment to make
the corrections. Otherwise, the data will sit in the suspense files for a
long time and will not be available to support strategic analysis.

Integration

Data integration merges data from multiple sources into a single,
enterprise-oriented view. First, we must recognize that duplicate
instances of the same item exist. Once we recognize that, we need to
merge the information from these multiple instances. With customer
data, for example, the same customer may exist in multiple files. We
could be faced with three customer records, as shown in Figure 6. On
the surface, these may or may not be instances of the same customer.
Our first challenge in data integration is to determine this.

Figure 6 -- Instances of the same customer.

If we're implementing a data warehouse to support CRM, this step is
crucial. We need to know the value of each of our customers, all the
products they own, and all the interactions we've had with them. Only
then can we devise a set of actions that will benefit the customer and be
profitable.

Fortunately, for customer data, there is a variety of data scrubbing, data
matching, and data householding tools available to help with data
integration.

Within the data acquisition process, we may need to create a table that
relates the customer identifier in the source system with the customer
identifier in the data warehouse. Once the customers are integrated, we
can use this table to relate the customer in the source system to the
data warehouse instance.
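
A minimal sketch of such a cross-reference (identifiers are invented): each source
system's customer identifier maps to the single integrated warehouse key.

# Key cross-reference: three source identifiers resolving to one warehouse customer.
customer_key_map = [
    # (source_system, source_customer_id, warehouse_customer_key)
    ("BILLING",     "B-20441", 7001),
    ("ORDER_ENTRY", "OE-9913", 7001),
    ("CRM",         "C-00577", 7001),
]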

The second part of the integration process entails merging the
information from the multiple sources together. This requires an
element-by-element decision by the business community and the
implementation of the logic by the IT community. Resolving the conflicts
that often arise in this process is another role of the data steward.

Transformation

The coding structures may differ among the source systems, and these
need to be transformed into a single structure for the data warehouse.
Also, the physical representation of the data may differ, and again, a
single approach is needed. These are two examples of data
transformation. In the first instance, the business community often needs to be
involved; the second instance is a technical decision, as long as the business needs
can be met.

Loading

The last step of the data acquisition process is the load. During this step,
the data is physically moved into the data warehouse and is available
for subsequent dissemination to the data marts. The data warehouse
load is a batch process and, with rare exception, consists of record
insertions. Due to the retention of history in the data warehouse, each
time changed data is brought in, a new record is appended rather than
overwriting an existing one.

Some factors to consider in designing the load process include the use of
a staging area to prepare the data for the load, making a backup copy of
the data being loaded, determining the sequence with which each of the
sources needs to be loaded, and within that, determining the sequence
in which the data itself needs to be loaded.

Data Warehouse Interview Questions

Informatica Group URL for Real Time Problems:
http://groups.yahoo.com/group/informaticadevelopment/

Ralph Kimball URLs:-
http://www.dbmsmag.com/9612d05.html
http://www.dbmsmag.com/9701d05.html
STAR SCHEMA:-
http://www.starlab.vub.ac.be/staff/robert/Information%20Systems/Halpin%203rd
%20ed/Infosys%20Ch1.pdf
ODS Design:-
http://www.compaq.nl/products/servers/alphaserver/pdf/SPD-ODS%20Service-V1.0.pdf
http://www.intelligententerprise.com/010613/warehouse1_1.shtml?database

1. Can two fact tables share the same dimension tables? How many dimension tables are
associated with one fact table in your project?
Ans: Yes.
2. What are ROLAP, MOLAP, and DOLAP?
Ans: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), and DOLAP
(Desktop OLAP). In these three OLAP
architectures, the interface to the analytic layer is typically the same; what is quite
different is how the data is physically stored.
In ROLAP, the premise is that the data should be stored in the relational model; that is,
OLAP capabilities are best provided directly against the relational database.
In MOLAP, the premise is that online analytical processing is best implemented by
storing the data multidimensionally; that is,
data must be stored multidimensionally in order to be viewed in a multidimensional
manner.
DOLAP is a variation that exists to provide portability for the OLAP user. It creates
multidimensional datasets that can be
transferred from server to desktop, requiring only the DOLAP software to exist on the
target system. This provides significant
advantages to portable computer users, such as salespeople who are frequently on the
road and do not have direct access to
their office server.

3. What is an MDDB? What is the difference between MDDBs and RDBMSs?
Ans: MDDB stands for multidimensional database. There are two primary technologies that are used for
storing the data used in OLAP applications.
These two technologies are multidimensional databases (MDDB) and relational
databases (RDBMS). The major difference
between MDDBs and RDBMSs is in how they store data. Relational databases store
their data in a series of tables and
columns. Multidimensional databases, on the other hand, store their data in large
multidimensional arrays.
For example, in an MDDB world, you might refer to a sales figure as Sales with Date,
Product, and Location coordinates of
12-1-2001, Car, and south, respectively.

Advantages of MDDB:
Retrieval is very fast because
1 The data corresponding to any combination of dimension members can be retrieved with a
single I/O.
2 Data is clustered compactly in a multidimensional array.
3 Values are calculated ahead of time.
4 The index is small and can therefore usually reside completely in memory.

Storage is very efficient because


1 The blocks contain only data.
2 A single index locates the block corresponding to a combination of sparse dimension
numbers.

4. What is MDB modeling and RDB Modeling?


Ans:

5. What is a mapplet and how do you create a mapplet?


Ans: A mapplet is a reusable object that represents a set of transformations. It allows you to
reuse transformation logic and can
contain as many transformations as you need.
Create a mapplet when you want to use a standardized set of transformation logic in
several mappings. For example, if you
have several fact tables that require a series of dimension keys, you can create a
mapplet containing a series of Lookup
transformations to find each dimension key. You can then use the mapplet in each fact
table mapping, rather than recreate the
same lookup logic in each mapping.
To create a new mapplet:
1. In the Mapplet Designer, choose Mapplets-Create Mapplet.
2. Enter a descriptive mapplet name.
The recommended naming convention for mapplets is mpltMappletName.
3. Click OK.
The Mapping Designer creates a new mapplet in the Mapplet Designer.
4. Choose Repository-Save.

6. What are transformations used for?

Ans: Transformations are the manipulation of data from how it appears in the source
system(s) into another form in the data warehouse or mart in a way that enhances or
simplifies its meaning. In short, you transform data into information.

This includes data merging, cleansing, and aggregation:

Data merging: the process of standardizing data types and fields. Suppose one source
system stores integer data as smallint whereas another stores similar data as decimal.
The data from the two source systems needs to be rationalized when moved into the
Oracle data format called NUMBER.
Cleansing: This involves identifying and correcting inconsistencies or inaccuracies:
- Eliminating inconsistencies in the data from multiple sources.
- Converting data from different systems into a single consistent data set suitable for
analysis.
- Meeting a standard for establishing data elements, codes, domains, formats and naming
conventions.
- Correcting data errors and filling in missing data values.
Aggregation: The process whereby multiple detailed values are combined into a single
summary value, typically summation numbers representing dollars spent or units sold.
- Generates summarized data for use in aggregate fact and dimension tables.
Data transformation is an interesting concept in that some transformation can occur
during the "extract" phase, some during the "transform" phase, or even - in limited
cases - during the "load" portion of the ETL process. The type of transformation function
you need will most often determine where it should be performed. Some transformation
functions could even be performed in more than one place, because many of the
transformations you will want to perform already exist in some form or another in more
than one of the three environments (source database or application, ETL tool, or the
target database).

7. What is the difference between OLTP & OLAP?

Ans: OLTP stands for Online Transaction Processing. This is a standard, normalized database
structure. OLTP is designed for
Transactions, which means that inserts, updates, and deletes must be fast. Imagine a call
center that takes orders. Call takers are continually taking calls and entering orders that
may contain numerous items. Each order and each item must be inserted into a database.
Since the performance of the database is critical, we want to maximize the speed of inserts
(and updates and deletes). To maximize performance, we typically try to hold as few
records in the database as possible.

OLAP stands for Online Analytical Processing. OLAP is a term that means many things to
many people. Here, we will use the term OLAP and Star Schema pretty much
interchangeably. We will assume that a star schema database is an OLAP system. (This is
not the same thing that Microsoft calls OLAP; they extend OLAP to mean the cube
structures built using their product, OLAP Services). Here, we will assume that any system
of read-only, historical, aggregated data is an OLAP system.

A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost
always used to support decision-making in the organization. That is why many data
warehouses are considered to be DSS (Decision-Support Systems).

Both a data warehouse and a data mart are storage mechanisms for read-only, historical,
aggregated data.
By read-only, we mean that the person looking at the data won’t be changing it. If a user
wants to look at yesterday's sales for a certain product, they should not have the ability to change
that number.
The "historical" part may just be a few minutes old, but usually it is at least a day old. A data
warehouse usually holds data that goes back a certain period in time, such as five years. In
contrast, standard OLTP systems usually only hold data as long as it is “current” or active.
An order table, for example, may move orders to an archive table once they have been
completed, shipped, and received by the customer.

When we say that data warehouses and data marts hold aggregated data, we need to stress
that there are many levels of aggregation in a typical data warehouse.

8. If the data source is in the form of an Excel spreadsheet, then how do you use it?
Ans: PowerMart and PowerCenter treat a Microsoft Excel source as a relational database,
not a flat file. Like relational sources,
the Designer uses ODBC to import a Microsoft Excel source. You do not need
database permissions to import Microsoft
Excel sources.
To import an Excel source definition, you need to complete the following tasks:
1 Install the Microsoft Excel ODBC driver on your system.
2 Create a Microsoft Excel ODBC data source for each source file in the ODBC 32-bit
Administrator.
3 Prepare Microsoft Excel spreadsheets by defining ranges and formatting columns of
numeric data.
4 Import the source definitions in the Designer.
Once you define ranges and format cells, you can import the ranges in the Designer. Ranges
display as source definitions
when you import the source.

9. Which databases are MDDBs and which are RDBMSs? Can you name some?

Ans: MDDB examples: Oracle Express Server (OES), Essbase by Hyperion Software, and
PowerPlay by Cognos.
RDBMS examples: Oracle, SQL Server, etc.

10. What are the modules/tools in Business Objects? Explain their purpose briefly.
Ans: BO Designer, Business Query for Excel, BO Reporter, InfoView, Explorer, WebIntelligence
(WEBI), BO Publisher, Broadcast Agent, and BO ZABO.
InfoView: IT portal entry into WebIntelligence & Business Objects. Base module required for
all options to view and refresh reports.
Reporter: Upgrade to create/modify reports on LAN or Web.
Explorer: Upgrade to perform OLAP processing on LAN or Web.
Designer: Creates the semantic layer between user and database.
Supervisor: Administers and controls access for groups of users.
WebIntelligence: Integrated query, reporting, and OLAP analysis over the Web.
Broadcast Agent: Used to schedule, run, publish, push, and broadcast pre-built reports and
spreadsheets, including event notification and response capabilities, event filtering, and
calendar-based notification, over the LAN, e-mail, pager, fax, Personal Digital Assistant
(PDA), Short Messaging Service (SMS), etc.
Set Analyzer: Applies set-based analysis to perform functions such as exclusions,
intersections, unions, and overlaps visually.
Developer Suite: Build packaged, analytical, or customized apps.

11. What are ad hoc queries and canned queries/reports? How do you create them?
Ans: The data warehouse will contain two types of query. There will be fixed queries that
are clearly defined and well understood, such as regular reports, canned queries
(standard reports) and common aggregations. There will also be ad hoc queries that are
unpredictable, both in quantity and frequency.

Ad Hoc Query: Ad hoc queries are the starting point for any analysis into a database. Any
business analyst wants to know what is inside the database. He then proceeds by
calculating totals, averages, maximum and minimum values for most attributes within the
database. These are the unpredictable elements of a data warehouse. It is exactly that ability to
run any query when desired and expect a reasonable response that makes the data warehouse
worthwhile, and makes the design such a significant challenge.
The end-user access tools are capable of automatically generating the database query that
answers any question posed by the user. The user will typically pose questions in terms
that they are familiar with (for example, sales by store last week); this is converted into
the database query by the access tool, which is aware of the structure of information within
the data warehouse.
Canned queries: Canned queries are predefined queries. In most instances, canned queries
contain prompts that allow you to customize the query for your specific needs. For
example, a prompt may ask you for a school, department, term, or section ID. In this
instance you would enter the name of the school, department or term, and the query will
retrieve the specified data from the warehouse. You can measure the resource requirements
of these queries, and the results can be used for capacity planning and for database design.
The main reason for using a canned query or report rather than creating your own is that your
chances of misinterpreting data or getting the wrong answer are reduced. You are assured
of getting the right data and the right answer.
12. How many fact tables and how many dimension tables did you have? Which table precedes
which?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
13. What is the difference between STAR SCHEMA & SNOW FLAKE SCHEMA?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp

14. Why did you choose STAR SCHEMA only? What are the benefits of STAR SCHEMA?
Ans: Because of its denormalized structure, i.e., the dimension tables are denormalized. The
first (and often only) reason to denormalize is speed. An OLTP structure is designed for
data inserts, updates, and deletes, but not for data retrieval. Therefore, we can often
squeeze some speed out of it by denormalizing some of the tables and having queries go
against fewer tables. These queries are faster because they perform fewer joins to retrieve
the same recordset (see the query sketch after the list of benefits below). Joins are also
confusing to many end users. By denormalizing, we can present the user with a view of
the data that is far easier for them to understand.

Benefits of STAR SCHEMA:


1 Far fewer Tables.
2 Designed for analysis across time.
3 Simplifies joins.
4 Less database space.
5 Supports “drilling” in reports.
6 Flexibility to meet business and technical needs.
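
As an illustration of the "fewer joins" point (table and column names are invented), a
typical star-schema query touches one fact table and a handful of denormalized
dimension tables:

# Illustrative star-schema query held as a Python string.
STAR_QUERY = """
SELECT d.month_name, p.product_line, st.region, SUM(f.sales_amount) AS total_sales
FROM   sales_fact  f
JOIN   date_dim    d  ON d.date_key    = f.date_key
JOIN   product_dim p  ON p.product_key = f.product_key
JOIN   store_dim   st ON st.store_key  = f.store_key
GROUP BY d.month_name, p.product_line, st.region
"""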

15. How do you load the data using Informatica?

Ans: Using a session.

16. (i) What is FTP? (ii) How do you connect to a remote machine? (iii) Is there another way
to use FTP without a special utility?
Ans: (i): The FTP (File Transfer Protocol) utility program is commonly used for copying
files to and from other computers. These computers may be at the same site or at different
sites thousands of miles apart. FTP is a general protocol that works on UNIX systems as
well as other non-UNIX systems.

(ii): Remote connect commands:


ftp machinename
ex: ftp 129.82.45.181 or ftp iesg
If the remote machine has been reached successfully, FTP responds by asking for a
login name and password. When you enter your own login name and password for the
remote machine, it returns a prompt like the one below
ftp>
and permits you access to your own home directory on the remote machine. You should be
able to move around in your own directory and to copy files to and from your local machine
using the FTP interface commands.
Note: You can set the mode of file transfer to ASCII (the default, which transmits seven bits
per character).
Use the ASCII mode with any of the following:
- Raw data (e.g. *.dat or *.txt, codebooks, or other plain text documents)
- SPSS Portable files.
- HTML files.
If you set the mode of file transfer to binary, all eight bits per byte are transmitted, which
provides less chance of a transmission error; binary mode must be used to transmit files
other than ASCII files.
For example use binary mode for the following types of files:
- SPSS System files
- SAS Dataset
- Graphic files (eg., *.gif, *.jpg, *.bmp, etc.)
- Microsoft Office documents (*.doc, *.xls, etc.)

(iii): Yes. If you are using Windows, you can access a text-based FTP utility from a DOS
prompt.
To do this, perform the following steps:
1. From the Start menu, choose Programs > MS-DOS Prompt.
2. Enter "ftp ftp.geocities.com". A prompt will appear.
(or)
Enter ftp to get the ftp prompt, then ftp> open hostname, e.g. ftp> open ftp.geocities.com
(this connects to the specified host).
3. Enter your Yahoo! GeoCities member name.
4. Enter your Yahoo! GeoCities password.
You can now use standard FTP commands to manage the files in your Yahoo! GeoCities
directory.

17. What command is used to transfer multiple files at a time using FTP?

Ans: mget ==> Copies multiple files from the remote machine to the local machine. You
will be prompted for a y/n answer before transferring each file. mget * copies all files in
the current remote directory to your current local directory, using the same file names.
mput ==> Copies multiple files from the local machine to the remote machine.
18. What is a Filter transformation? What options do you have in a Filter transformation?
Ans: The Filter transformation provides the means for filtering records in a mapping. You
pass all the rows from a source
transformation through the Filter transformation, then enter a filter condition for the
transformation. All ports in a Filter
transformation are input/output, and only records that meet the condition pass
through the Filter transformation.
Note: Discarded rows do not appear in the session log or reject files
To maximize session performance, include the Filter transformation as close to the
sources in the mapping as possible.
Rather than passing records you plan to discard through the mapping, you then
filter out unwanted data early in the
flow of data from sources to targets.

You cannot concatenate ports from more than one transformation into the Filter
transformation; the input ports for the filter
must come from a single transformation. Filter transformations exist within the flow of
the mapping and cannot be
unconnected. The Filter transformation does not allow setting output default
values.

19. What are the default sources supported by Informatica PowerMart?

Ans:
1 Relational tables, views, and synonyms.
2 Fixed-width and delimited flat files that do not contain binary data.
3 COBOL files.

20. When do you create the source definition? Can I use this source definition with any
transformation?
Ans: When working with a file that contains fixed-width binary data, you must create
the source definition.
The Designer displays the source definition as a table, consisting of names, datatypes,
and constraints. To use a source
definition in a mapping, connect a source definition to a Source Qualifier or
Normalizer transformation. The Informatica
Server uses these transformations to read the source data.

21. What are active and passive transformations?


Ans: Active and Passive Transformations
Transformations can be active or passive. An active transformation can change the
number of records passed through it. A
passive transformation never changes the record count. For example, the Filter
transformation removes rows that do not
meet the filter condition defined in the transformation.

Active transformations that might change the record count include the following:
1 Advanced External Procedure
2 Aggregator
3 Filter
4 Joiner
5 Normalizer
6 Rank
7 Source Qualifier
Note: If you use PowerConnect to access ERP sources, the ERP Source Qualifier is
also an active transformation.
/*
You can connect only one of these active transformations to the same
transformation or target, since the Informatica
Server cannot determine how to concatenate data from different sets of records with
different numbers of rows.
*/
Passive transformations that never change the record count include the following:
1 Lookup
2 Expression
3 External Procedure
4 Sequence Generator
5 Stored Procedure
6 Update Strategy

You can connect any number of these passive transformations, or connect one active
transformation with any number of
passive transformations, to the same transformation or target.

22. What are the staging area and work area?


Ans: Staging Area : -
- Holding Tables on DW Server.
- Loaded from Extract Process
- Input for Integration/Transformation
- May function as Work Areas
- Output to a work area or Fact Table
Work Area: -
- Temporary Tables
- Memory
23. What is metadata? (Please refer to the book Data Warehousing in the Real World, page
125.)
Ans: Defn: “Data About Data”
Metadata contains descriptive data for end users. In a data warehouse the term
metadata is used in a number of different
situations.
Metadata is used for:
1 Data transformation and load
2 Data management
3 Query management
Data transformation and load:
Metadata may be used during data transformation and load to describe the source data and
any changes that need to be made. The advantage of storing metadata about the data being
transformed is that as source data changes the changes can be captured in the metadata, and
transformation programs automatically regenerated.
For each source data field the following information is required:
Source Field:
1 Unique identifier (to avoid any confusion occurring between two fields of the same name from
different sources).
2 Name (local field name).
3 Type (storage type of data, such as character, integer, floating point, and so on).
4 Location
- system (the system it comes from, e.g. the Accounting system).
- object (the object that contains it, e.g. the Account table).
The destination field needs to be described in a similar way to the source:
Destination:
5 Unique identifier
6 Name
7 Type (database datatype, such as Char, Varchar, Number, and so on).
8 Table name (name of the table the field will be part of).

The other information that needs to be stored is the transformation or transformations
that need to be applied to turn the source data into the destination data:
Transformation:
9 Transformation(s)
- Name
- Language (name of the language that the transformation is written in).
- Module name
- Syntax
The Name is the unique identifier that differentiates this from any other similar
transformations.
The Language attribute contains the name of the language that the transformation is written
in.
The other attributes are module name and syntax. Generally these will be mutually
exclusive, with only one being defined. For simple transformations such as simple SQL
functions the syntax will be stored. For complex transformations the name of the module
that contains the code is stored instead.
Data management:
Metadata is required to describe the data as it resides in the data warehouse. This is needed by the
warehouse manager to allow it to track and control all data movements. Every object in the
database needs to be described.

Metadata is needed for all the following:


10 Tables
- Columns
- name
- type
11 Indexes
- Columns
- name
- type
12 Views
- Columns
- name
- type
13 Constraints
- name
- type
- table
- columns
Aggregation and partition information also needs to be stored in the metadata (for details, refer to
page 30).
Query generation:
Metadata is also required by the query manager to enable it to generate queries. The same
metadata that is used by the warehouse manager to describe the data in the data warehouse is
also required by the query manager.
The query manager will also generate metadata about the queries it has run. This metadata
can be used to build a history of all queries run and to generate a query profile for each user,
group of users, and the data warehouse as a whole.
The metadata that is required for each query is:
- query
- tables accessed
- columns accessed
- name
- reference identifier
- restrictions applied
- column name
- table name
- reference identifier
- restriction
- join Criteria applied
……
……
- aggregate functions used
……
……
- group by criteria
……
……
- sort criteria
……
……
- syntax
- execution plan
- resources
……
……

24. What UNIX flavours are you experienced with?


Ans: Solaris 2.5 / SunOS 5.5 (operating system)
Solaris 2.6 / SunOS 5.6 (operating system)
Solaris 2.8 / SunOS 5.8 (operating system)
AIX 4.0.3

SunOS    Solaris    Released    Supported platforms
5.5.1    2.5.1      May 96      sun4c, sun4m, sun4d, sun4u, x86, ppc
5.6      2.6        Aug. 97     sun4c, sun4m, sun4d, sun4u, x86
5.7      7          Oct. 98     sun4c, sun4m, sun4d, sun4u, x86
5.8      8          2000        sun4m, sun4d, sun4u, x86

25. What are the tasks that are done by Informatica Server?
Ans:The Informatica Server performs the following tasks:
1 Manages the scheduling and execution of sessions and batches
2 Executes sessions and batches
3 Verifies permissions and privileges
4 Interacts with the Server Manager and pmcmd.
The Informatica Server moves data from sources to targets based on metadata stored in a
repository. For instructions on how to move and transform data, the Informatica Server
reads a mapping (a type of metadata that includes transformations and source and target
definitions). Each mapping uses a session to define additional information and to optionally
override mapping-level options. You can group multiple sessions to run as a single unit,
known as a batch.

26. What are the two programs that communicate with the Informatica Server?
Ans: Informatica provides Server Manager and pmcmd programs to communicate with the
Informatica Server:
Server Manager. A client application used to create and manage sessions and batches, and
to monitor and stop the Informatica Server. You can use information provided through the
Server Manager to troubleshoot sessions and improve session performance.
pmcmd. A command-line program that allows you to start and stop sessions and batches,
stop the Informatica Server, and verify if the Informatica Server is running.
27. When do you reinitialize the aggregate cache?
Ans: Reinitializing the aggregate cache overwrites historical aggregate data with new
aggregate data. When you reinitialize the aggregate cache, instead of using the captured
changes in the source tables, you typically need to use the entire source table.
For example, you can reinitialize the aggregate cache if the source for a session
changes incrementally every day and
completely changes once a month. When you receive the new monthly source, you
might configure the session to reinitialize
the aggregate cache, truncate the existing target, and use the new source table during
the session.

/? Note: To be clarified when the Server Manager works for the following ?/

To reinitialize the aggregate cache:


1.In the Server Manager, open the session property sheet.
2.Click the Transformations tab.
3.Check Reinitialize Aggregate Cache.
4.Click OK three times to save your changes.
5.Run the session.

The Informatica Server creates a new aggregate cache, overwriting the existing aggregate
cache.
/? To be checked: steps 6 and 7 after a successful run of the session… ?/

6.After running the session, open the property sheet again.


7.Click the Data tab.
8.Clear Reinitialize Aggregate Cache.
9.Click OK.

28. (i) What is Target Load Order in Designer?


Ans: Target Load Order: - In the Designer, you can set the order in which the Informatica
Server sends records to various target
definitions in a mapping. This feature is crucial if you want to maintain referential
integrity when inserting, deleting, or updating
records in tables that have the primary key and foreign key constraints applied to them.
The Informatica Server writes data to
all the targets connected to the same Source Qualifier or Normalizer simultaneously, to
maximize performance.

28. (ii) What is the minimum condition that you need to meet in order to use the Target Load Order
option in the Designer?
Ans: You need to have multiple Source Qualifier transformations.
To specify the order in which the Informatica Server sends data to targets, create one
Source Qualifier or Normalizer
transformation for each target within a mapping. To set the target load order, you then
determine the order in which each
Source Qualifier sends data to connected targets in the mapping.
When a mapping includes a Joiner transformation, the Informatica Server sends all
records to targets connected to that
Joiner at the same time, regardless of the target load order.

28(iii). How do u set the Target load order?


Ans: To set the target load order:
1. Create a mapping that contains multiple Source Qualifier transformations.
2. After you complete the mapping, choose Mappings-Target Load Plan.
A dialog box lists all Source Qualifier transformations in the mapping, as well as the
targets that receive data from each
Source Qualifier.
3. Select a Source Qualifier from the list.
4. Click the Up and Down buttons to move the Source Qualifier within the load order.
5. Repeat steps 3 and 4 for any other Source Qualifiers you wish to reorder.
6. Click OK and Choose Repository-Save.

29. What can you do with the Repository Manager?


Ans: We can do the following tasks using the Repository Manager:
To create usernames, you must have one of the following privileges:
- Administer Repository privilege
- Super User privilege
To create a user group, you must have one of the following privileges:
- Administer Repository privilege
- Super User privilege
To assign or revoke privileges, you must have one of the following privileges:
- Administer Repository privilege
- Super User privilege
Note: You cannot change the privileges of the default user groups or the default repository
users.

30. What can you do with the Designer?


Ans: The Designer client application provides five tools to help you create mappings:
Source Analyzer. Use to import or create source definitions for flat file, Cobol, ERP, and
relational sources.
Warehouse Designer. Use to import or create target definitions.
Transformation Developer. Use to create reusable transformations.
Mapplet Designer. Use to create mapplets.
Mapping Designer. Use to create mappings.

Note: The Designer allows you to work with multiple tools at one time. You can also work
in multiple folders and repositories.

31. What are the different types of tracing levels you have in transformations?


Ans: Tracing levels in transformations:
Level                    Description
Terse                    Indicates when the Informatica Server initializes the session and its
                         components. Summarizes session results, but not at the level of
                         individual records.
Normal                   Includes initialization information as well as error messages and
                         notification of rejected data.
Verbose initialization   Includes all information provided with the Normal setting plus more
                         extensive information about initializing transformations in the session.
Verbose data             Includes all information provided with the Verbose initialization
                         setting, plus each row that passes into the mapping.

Note: By default, the tracing level for every transformation is Normal.

To add a slight performance boost, you can also set the tracing level to Terse, writing the
minimum of detail to the session log
when running a session containing the transformation.

31(i). What is the difference between a database, a data warehouse, and a data mart?
Ans: -- A database is an organized collection of information.
-- A data warehouse is a very large database with special sets of tools to extract and
cleanse data from operational systems
and to analyze data.
-- A data mart is a focused subset of a data warehouse that deals with a single area of
data and is organized for quick
analysis.

32. What is Data Mart, Data WareHouse and Decision Support System explain briefly?
Ans: Data Mart:
A data mart is a repository of data gathered from operational data and other sources that is
designed to serve a particular
community of knowledge workers. In scope, the data may derive from an enterprise-wide
database or data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in terms of analysis,
content, presentation, and ease-of-use. Users of a data mart can expect to have data
presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the
other in some form. However, most writers using the term seem to agree that the design of
a data mart tends to start from an analysis of user needs and that a data warehouse
tends to start from an analysis of what data already exists and how it can be collected
in such a way that the data can later be used. A data warehouse is a central aggregation
of data (which can be distributed physically); a data mart is a data repository that may
derive from a data warehouse or not and that emphasizes ease of access and usability for a
particular designed purpose. In general, a data warehouse tends to be a strategic but
somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an
immediate need.

Data Warehouse:
A data warehouse is a central repository for all or significant parts of the data that an
enterprise's various business systems collect. The term was coined by W. H. Inmon. IBM
sometimes uses the term "information warehouse."
Typically, a data warehouse is housed on an enterprise mainframe server. Data from various
online transaction processing (OLTP) applications and other sources is selectively
extracted and organized on the data warehouse database for use by analytical applications
and user queries. Data warehousing emphasizes the capture of data from diverse sources
for useful analysis and access, but does not generally start from the point-of-view of the
end user or knowledge worker who may need access to specialized, sometimes local
databases. The latter idea is known as the data mart.
Data mining, Web mining, and decision support systems (DSS) are three kinds of
applications that can make use of a data warehouse.

Decision Support System:


A decision support system (DSS) is a computer program application that analyzes business
data and presents it so that users can make business decisions more easily. It is an
"informational application" (in distinction to an "operational application" that collects the
data in the course of normal business operation).

Typical information that a decision support application might gather and present
would be:
Comparative sales figures between one week and the next
Projected revenue figures based on new product sales assumptions
The consequences of different decision alternatives, given past experience in a context that is
described

A decision support system may present information graphically and may include an expert
system or artificial intelligence (AI). It may be aimed at business executives or some other
group of knowledge workers.

33. What are the differences between heterogeneous and homogeneous sources?


Ans: Heterogeneous                                 Homogeneous
Stored in different schemas                        Common structure
Stored in different file or database types         Same database type
Spread across several countries                    Same data center
Different platform and hardware configuration      Same platform and hardware configuration

34. How do you use DDL commands in a PL/SQL block? For example, accept a table name from the user
and drop it if it exists; otherwise display a message.
Ans: To invoke DDL commands in PL/SQL blocks we have to use dynamic SQL; the
package used is DBMS_SQL.
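
A minimal sketch of the scenario in the question, assuming the table name is held in a hypothetical
variable p_table (all names are illustrative, not from the original document); it follows the
open/parse/execute/close steps listed in question 35:

DECLARE
  p_table  VARCHAR2(30) := 'TEST_TBL';  -- table name "accepted from the user" (illustrative)
  v_count  NUMBER;
  v_cursor INTEGER;
  v_rows   INTEGER;
BEGIN
  -- Check whether the table exists in the current schema
  SELECT COUNT(*) INTO v_count
    FROM user_tables
   WHERE table_name = UPPER(p_table);

  IF v_count = 0 THEN
    DBMS_OUTPUT.PUT_LINE('Table ' || p_table || ' does not exist.');
  ELSE
    v_cursor := DBMS_SQL.OPEN_CURSOR;                                    -- open cursor
    DBMS_SQL.PARSE(v_cursor, 'DROP TABLE ' || p_table, DBMS_SQL.NATIVE); -- parse (DDL runs at parse time)
    v_rows := DBMS_SQL.EXECUTE(v_cursor);                                -- execute (a no-op for DDL, kept for completeness)
    DBMS_SQL.CLOSE_CURSOR(v_cursor);                                     -- close cursor
    DBMS_OUTPUT.PUT_LINE('Table ' || p_table || ' dropped.');
  END IF;
END;
/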

35. What are the steps to work with dynamic SQL?


Ans: Open a dynamic cursor, parse the SQL statement, bind input variables (if any), execute the SQL
statement of the dynamic cursor, and close the cursor.
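
A hedged sketch of those steps using DBMS_SQL with a bind variable on a DML statement (the EMP
table and the :v_deptno bind are illustrative assumptions, not from the original document):

DECLARE
  v_cursor INTEGER;
  v_rows   INTEGER;
BEGIN
  v_cursor := DBMS_SQL.OPEN_CURSOR;                                   -- 1. open a dynamic cursor
  DBMS_SQL.PARSE(v_cursor,
                 'UPDATE emp SET sal = sal * 1.1 WHERE deptno = :v_deptno',
                 DBMS_SQL.NATIVE);                                    -- 2. parse the SQL statement
  DBMS_SQL.BIND_VARIABLE(v_cursor, ':v_deptno', 10);                  -- 3. bind input variables (if any)
  v_rows := DBMS_SQL.EXECUTE(v_cursor);                               -- 4. execute the statement
  DBMS_SQL.CLOSE_CURSOR(v_cursor);                                    -- 5. close the cursor
  DBMS_OUTPUT.PUT_LINE(v_rows || ' rows updated.');
END;
/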

36. Which package and procedure are used to find/check the free space available for database objects
like tables/procedures/views/synonyms…etc.?
Ans: The package ==> DBMS_SPACE
The procedure ==> UNUSED_SPACE
The table ==> DBA_OBJECTS

Note: See the script to find free space @ c:\informatica\tbl_free_space
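
A hedged sketch of calling DBMS_SPACE.UNUSED_SPACE (the SCOTT.EMP segment is an illustrative
assumption; the remaining arguments are OUT parameters filled in by the procedure):

DECLARE
  v_total_blocks  NUMBER;
  v_total_bytes   NUMBER;
  v_unused_blocks NUMBER;
  v_unused_bytes  NUMBER;
  v_file_id       NUMBER;
  v_block_id      NUMBER;
  v_last_block    NUMBER;
BEGIN
  DBMS_SPACE.UNUSED_SPACE(
    segment_owner             => 'SCOTT',   -- illustrative owner
    segment_name              => 'EMP',     -- illustrative segment
    segment_type              => 'TABLE',
    total_blocks              => v_total_blocks,
    total_bytes               => v_total_bytes,
    unused_blocks             => v_unused_blocks,
    unused_bytes              => v_unused_bytes,
    last_used_extent_file_id  => v_file_id,
    last_used_extent_block_id => v_block_id,
    last_used_block           => v_last_block);

  DBMS_OUTPUT.PUT_LINE('Unused bytes in SCOTT.EMP: ' || v_unused_bytes);
END;
/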

37. Does Informatica allow the load if EmpId is the primary key in the target table and the source data
has two rows with the same EmpId? If you use a lookup for the same situation, does it allow you to
load two rows or only one?
Ans: => No, it will not; it generates a primary key constraint violation (it loads one row).
=> Even with a lookup, no, if EmpId is the primary key.

38. If Ename is varchar2(40) from one source (Siebel) and Ename is char(100) from another source
(Oracle), and the target has Name varchar2(50), how does Informatica handle this situation?
How does Informatica handle string and numeric datatypes from sources?

39. How do you debug mappings? Where do you start?

40. How do you query the metadata tables for Informatica?

41(i). When do you use a connected lookup and when do you use an unconnected lookup?


Ans:
Connected Lookups : -
A connected Lookup transformation is part of the mapping data flow. With connected
lookups, you can have multiple return values. That is, you can pass multiple values from
the same row in the lookup table out of the Lookup transformation.
Common uses for connected lookups include:
=> Finding a name based on a number ex. Finding a Dname based on deptno
=> Finding a value based on a range of dates
=> Finding a value based on multiple conditions
Unconnected Lookups : -
An unconnected Lookup transformation exists separate from the data flow in the mapping.
You write an expression using
the :LKP reference qualifier to call the lookup within another transformation.

Some common uses for unconnected lookups include:


=> Testing the results of a lookup in an expression
=> Filtering records based on the lookup results
=> Marking records for update based on the result of a lookup (for example, updating slowly
changing dimension tables)
=> Calling the same lookup multiple times in one mapping

41(ii). What are the differences between connected lookups and unconnected lookups?
Ans: Although both types of lookups perform the same basic task, there are some
important differences:
---------------------------------------------------------------
Connected Lookup                             Unconnected Lookup
---------------------------------------------------------------
Part of the mapping data flow.               Separate from the mapping data flow.
Can return multiple values from the          Returns one value from each row.
same row.
You link the lookup/output ports to          You designate the return value with the
another transformation.                      Return port (R).
Supports default values.                     Does not support default values.
If there is no match for the lookup          If there is no match for the lookup
condition, the server returns the            condition, the server returns NULL.
default value for all output ports.
More visible. Shows the data passing         Less visible. You write an expression
in and out of the lookup.                    using :LKP to tell the server when to
                                             perform the lookup.
Cache includes all lookup columns used       Cache includes lookup/output ports in
in the mapping (that is, lookup table        the lookup condition and the
columns included in the lookup condition     lookup/return port.
and lookup table columns linked as
output ports to other transformations).
---------------------------------------------------------------

42. What do you need to concentrate on after getting an explain plan?


Ans: The three most significant columns in the plan table are named OPERATION, OPTIONS,
and OBJECT_NAME. For each step, these tell you which operation is going to be
performed and which object is the target of that operation.
Ex:-
**************************
TO USE EXPLAIN PLAN FOR A QRY...
**************************
SQL> EXPLAIN PLAN
2 SET STATEMENT_ID = 'PKAR02'
3 FOR
4 SELECT JOB,MAX(SAL)
5 FROM EMP
6 GROUP BY JOB
7 HAVING MAX(SAL) >= 5000;

Explained.

**************************
TO QUERY THE PLAN TABLE :-
**************************
SQL> SELECT RTRIM(ID)||' '||
2 LPAD(' ', 2*(LEVEL-1))||OPERATION
3 ||' '||OPTIONS
4 ||' '||OBJECT_NAME STEP_DESCRIPTION
5 FROM PLAN_TABLE
6 START WITH ID = 0 AND STATEMENT_ID = 'PKAR02'
7 CONNECT BY PRIOR ID = PARENT_ID
8 AND STATEMENT_ID = 'PKAR02'
9 ORDER BY ID;

STEP_DESCRIPTION
----------------------------------------------------
0 SELECT STATEMENT
1 FILTER
2 SORT GROUP BY
3 TABLE ACCESS FULL EMP

43. How are components interfaced in PeopleSoft?


Ans:

44. How do you do the analysis of an ETL process?


Ans:

==============================================================

45. What is Standard, Reusable Transformation and Mapplet?


Ans: Mappings contain two types of transformations, standard and reusable. Standard
transformations exist within a single mapping. You cannot reuse a standard transformation
you created in another mapping, nor can you create a shortcut to that transformation. However,
you often want to create transformations that perform common tasks, such as calculating the
average salary in a department. Since a standard transformation cannot be used by more than
one mapping, you would have to set up the same transformation each time you want to calculate
the average salary in a department. A reusable transformation, by contrast, is created once
(for example, in the Transformation Developer) and can be reused in any number of mappings.
Mapplet: A mapplet is a reusable object that represents a set of transformations. It
allows you to reuse transformation logic
and can contain as many transformations as you need. A mapplet can contain
transformations, reusable transformations, and
shortcuts to transformations.
46. How do u copy Mapping, Repository, Sessions?
Ans: To copy an object (such as a mapping or reusable transformation) from a shared folder,
press the Ctrl key and drag and drop
the mapping into the destination folder.

To copy a mapping from a non-shared folder, drag and drop the mapping into the
destination folder.
In both cases, the destination folder must be open with the related tool active.
For example, to copy a mapping, the Mapping Designer must be active. To copy a Source
Definition, the Source Analyzer must be active.

Copying Mapping:
1 To copy the mapping, open a workbook.
2 In the Navigator, click and drag the mapping slightly to the right, not dragging it to the
workbook.
3 When asked if you want to make a copy, click Yes, then enter a new name and click OK.
4 Choose Repository-Save.

Repository Copying: You can copy a repository from one database to another. You
use this feature before upgrading, to
preserve the original repository. Copying repositories provides a quick way to copy all
metadata you want to use as a basis for
a new repository.
If the database into which you plan to copy the repository contains an existing repository, the
Repository Manager deletes the existing repository. If you want to preserve the old
repository, cancel the copy. Then back up the existing repository before copying the new
repository.
To copy a repository, you must have one of the following privileges:
1 Administer Repository privilege
2 Super User privilege

To copy a repository:
1. In the Repository Manager, choose Repository-Copy Repository.
2. Select a repository you wish to copy, then enter the following information:
Copy Repository Field    Required/Optional   Description
----------------------   -----------------   --------------------------------------------------
Repository               Required            Name for the repository copy. Each repository name
                                             must be unique within the domain and should be
                                             easily distinguished from all other repositories.
Database Username        Required            Username required to connect to the database. This
                                             login must have the appropriate database
                                             permissions to create the repository.
Database Password        Required            Password associated with the database username.
                                             Must be in US-ASCII.
ODBC Data Source         Required            Data source used to connect to the database.
Native Connect String    Required            Connect string identifying the location of the
                                             database.
Code Page                Required            Character set associated with the repository. Must
                                             be a superset of the code page of the repository
                                             you want to copy.

If you are not connected to the repository you want to copy, the Repository Manager
asks you to log in.
3. Click OK.
4. If asked whether you want to delete existing repository data in the second
repository, click OK to delete it. Click Cancel to preserve the existing repository.

Copying Sessions:
In the Server Manager, you can copy stand-alone sessions within a folder, or copy sessions in
and out of batches.
To copy a session, you must have one of the following:
1 Create Sessions and Batches privilege with read and write permission
2 Super User privilege
To copy a session:
1. In the Server Manager, select the session you wish to copy.
2. Click the Copy Session button or choose Operations-Copy Session.
The Server Manager makes a copy of the session. The Informatica Server names the copy
after the original session, appending a number, such as session_name1.

47. What are shortcuts, and what is advantage?


Ans: Shortcuts allow you to use metadata across folders without making copies, ensuring
uniform metadata. A shortcut inherits all
properties of the object to which it points. Once you create a shortcut, you can
configure the shortcut name and description.

When the object the shortcut references changes, the shortcut inherits those changes.
By using a shortcut instead of a copy,
you ensure each use of the shortcut exactly matches the original object. For example, if
you have a shortcut to a target
definition, and you add a column to the definition, the shortcut automatically inherits the
additional column.

Shortcuts allow you to reuse an object without creating multiple objects in the
repository. For example, you use a source
definition in ten mappings in ten different folders. Instead of creating 10 copies of the
same source definition, one in each
folder, you can create 10 shortcuts to the original source definition.
You can create shortcuts to objects in shared folders. If you try to create a shortcut to a
non-shared folder, the Designer
creates a copy of the object instead.

You can create shortcuts to the following repository objects:


1 Source definitions
2 Reusable transformations
3 Mapplets
4 Mappings
5 Target definitions
6 Business components

You can create two types of shortcuts:


Local shortcut. A shortcut created in the same repository as the original object.
Global shortcut. A shortcut created in a local repository that references an object in a
global repository.

Advantages: One of the primary advantages of using a shortcut is maintenance. If you


need to change all instances of an
object, you can edit the original repository object. All shortcuts accessing the object
automatically inherit the changes.
Shortcuts have the following advantages over copied repository objects:
1 You can maintain a common repository object in a single location. If you need to edit the
object, all shortcuts immediately inherit the changes you make.
2 You can restrict repository users to a set of predefined metadata by asking users to
incorporate the shortcuts into their work instead of developing repository objects
independently.
3 You can develop complex mappings, mapplets, or reusable transformations, then reuse
them easily in other folders.
4 You can save space in your repository by keeping a single repository object and using
shortcuts to that object, instead of creating copies of the object in multiple folders or
multiple repositories.

48. What are Pre-session and Post-session Options?


(Please refer to the Help on using shell commands, post-session commands, and email.)
Ans: The Informatica Server can perform one or more shell commands before or after the
session runs. Shell commands are
operating system commands. You can use pre- or post- session shell commands, for
example, to delete a reject file or
session log, or to archive target files before the session begins.

The status of the shell command, whether it completed successfully or failed, appears in
the session log file.
To call a pre- or post-session shell command you must:
1. Use any valid UNIX command or shell script for UNIX servers, or any valid DOS
or batch file for Windows NT servers.
2. Configure the session to execute the pre- or post-session shell commands.

You can configure a session to stop if the Informatica Server encounters an error while
executing pre-session shell commands.

For example, you might use a shell command to copy a file from one directory to another.
For a Windows NT server, you would use the following shell command to copy the
SALES_ADJ file from the target directory, L, to the source, H:
copy L:\sales\sales_adj H:\marketing\

For a UNIX server, you would use the following command line to perform a similar
operation:
cp sales/sales_adj marketing/

Tip: Each shell command runs in the same environment (UNIX or Windows NT) as the
Informatica Server. Environment settings in one shell command script do not carry over to
other scripts. To run all shell commands in the same environment, call a single shell script
that in turn invokes other scripts.

49. What are Folder Versions?


Ans: In the Repository Manager, you can create different versions within a folder to help
you archive work in development. You can copy versions to other folders as well. When
you save a version, you save all metadata at a particular point in development. Later
versions contain new or modified metadata, reflecting work that you have completed since
the last version.

Maintaining different versions lets you revert to earlier work when needed. By
archiving the contents of a folder into a version each time you reach a development
landmark, you can access those versions if later edits prove unsuccessful.

You create a folder version after completing a version of a difficult mapping, then
continue working on the mapping. If you are unhappy with the results of subsequent work,
you can revert to the previous version, then create a new version to continue development.
Thus you keep the landmark version intact, but available for regression.

Note: You can only work within one version of a folder at a time.

50. How do you automate/schedule sessions/batches, and did you use any tool for automating
sessions/batches?
Ans: We scheduled our sessions/batches using the Server Manager.
You can either schedule a session to run at a given time or interval, or you can
manually start the session.
You need to have the Create Sessions and Batches privilege with read and execute permissions,
or the Super User privilege.
If you configure a batch to run only on demand, you cannot schedule it.

Note: We did not use any tool for automation process.

51. What are the differences between 4.7 and 5.1 versions?
Ans: New transformations were added, like the XML transformation and the MQ Series
transformation, and PowerMart and PowerCenter are the same from version 5.1 onward.

52. What procedures do you need to follow before moving mappings/sessions from
testing/development to production?
Ans:

53. How many values does it (the Informatica Server) return when it passes through a connected
lookup and an unconnected lookup?
Ans: A connected lookup can return multiple values, whereas an unconnected lookup returns
only one value, the return value.

54. What is the difference between PowerMart and PowerCenter in 4.7.2?


Ans: If You Are Using PowerCenter
PowerCenter allows you to register and run multiple Informatica Servers against the same
repository. Because you can run
these servers at the same time, you can distribute the repository session load across available
servers to improve overall
performance.
With PowerCenter, you receive all product functionality, including distributed metadata,
the ability to organize repositories into
a data mart domain and share metadata across repositories.
A PowerCenter license lets you create a single repository that you can configure as a
global repository, the core component
of a data warehouse.
If You Are Using PowerMart
This version of PowerMart includes all features except distributed metadata and multiple
registered servers. Also, the various
options available with PowerCenter (such as PowerCenter Integration Server for BW,
PowerConnect for IBM DB2,
PowerConnect for SAP R/3, and PowerConnect for PeopleSoft) are not available with
PowerMart.

55. What kind of modifications can you do/perform with each transformation?


Ans: Using transformations, you can modify data in the following ways:
------------------------------------------------------------   ------------------------
Task                                                            Transformation
------------------------------------------------------------   ------------------------
Calculate a value                                               Expression
Perform aggregate calculations                                  Aggregator
Modify text                                                     Expression
Filter records                                                  Filter, Source Qualifier
Order records queried by the Informatica Server                 Source Qualifier
Call a stored procedure                                         Stored Procedure
Call a procedure in a shared library or in the                  External Procedure
COM layer of Windows NT
Generate primary keys                                           Sequence Generator
Limit records to a top or bottom range                          Rank
Normalize records, including those read                         Normalizer
from COBOL sources
Look up values                                                  Lookup
Determine whether to insert, delete, update,                    Update Strategy
or reject records
Join records from different databases                           Joiner
or flat file systems

56. Expressions in transformations: explain briefly how you use them.


Ans: Expressions in Transformations
To transform data passing through a transformation, you can write an expression. The
most obvious examples of these are the
Expression and Aggregator transformations, which perform calculations on either
single values or an entire range of values
within a port. Transformations that use expressions include the following:
--------------------- ------------------------------------------
Transformation How It Uses Expressions
--------------------- ------------------------------------------
Expression Calculates the result of an expression for each row passing through the
transformation, using values from one or more ports.
Aggregator Calculates the result of an aggregate expression, such as a sum or average,
based on all data passing through a port or on groups within that data.
Filter Filters records based on a condition you enter using an
expression.
Rank Filters the top or bottom range of records, based on a condition you enter using an
expression.
Update Strategy Assigns a numeric code to each record based on an expression,
indicating whether the Informatica Server should use the information in the record to insert,
delete, or update the target.

In each transformation, you use the Expression Editor to enter the expression. The
Expression Editor supports the transformation language for building expressions. The
transformation language uses SQL-like functions, operators, and other components to build
the expression. For example, as in SQL, the transformation language includes the functions
COUNT and SUM. However, the PowerMart/PowerCenter transformation language
includes additional functions not found in SQL.

When you enter the expression, you can use values available through ports. For example, if
the transformation has two input ports representing a price and sales tax rate, you can
calculate the final sales tax using these two values. The ports used in the expression can
appear in the same transformation, or you can use output ports in other transformations.

57. In case a flat file (which comes through FTP as a source) has not arrived, what happens?
Where do you set this option?
Ans: You get a fatal error, which causes the server to fail/stop the session.
You can set the Event-Based Scheduling option in the session properties under the General tab
--> Advanced Options:
----------------------------  -------------------  ------------------
Event-Based                   Required/Optional    Description
----------------------------  -------------------  ------------------
Indicator File to Wait For    Optional             Required to use event-based scheduling. Enter
                                                   the indicator file (or directory and file)
                                                   whose arrival schedules the session. If you do
                                                   not enter a directory, the Informatica Server
                                                   assumes the file appears in the server variable
                                                   directory $PMRootDir.

58. What is the Test Load option and when do you use it in the Server Manager?
Ans: When testing sessions in development, you may not need to process the entire source.
If this is true, use the Test Load option (Session Properties --> General tab --> Target Options:
choose the Target Load option Normal (option button), check Test Load (check box), and set
No. of rows to test, e.g. 2000 (text box)).
You can also click the Start button.

59. What is the difference between SCD Type 2 and SGT?

60. Differences between 4.7 and 5.1?

61. Tuning Informatica Server for improving performance? Performance Issues?


Ans: See /* C:\pkar\Informatica\Performance Issues.doc */

62. What is Override Option? Which is better?

63. What will happen if you increase the buffer size?

64. What will happen if you increase the commit interval? And if you decrease the commit interval?

65. What kind of complex mappings did you build? And what sort of problems did you face?

66. If you have 10 mappings designed and you need to implement some changes (maybe in an


existing mapping, or a new mapping needs to be designed), how much time does it take,
from easier to complex?

67. Can you refresh the repository in 4.7 and 5.1? And can you also refresh pieces of the
repository (partially) in 4.7 and 5.1?

68. What is BI?


Ans: http://www.visionnet.com/bi/index.shtml

69. Benefits of BI?


Ans: http://www.visionnet.com/bi/bi-benefits.shtml

70. BI Faq
Ans: http://www.visionnet.com/bi/bi-faq.shtml

71. What is difference between data scrubbing and data cleansing?


Ans: Scrubbing data is the process of cleaning up the junk in legacy data and making it
accurate and useful for the next generations of automated systems. This is perhaps the most
difficult of all conversion activities. Very often, this is made more difficult when
the customer wants to make good data out of bad data. This is the dog work. It is also
the most important, and it cannot be done without the active participation of the user.
DATA CLEANING - a two-step process including DETECTION and then
CORRECTION of errors in a data set.
72. What is Metadata and Repository?
Ans:
Metadata. “Data about data” .
It contains descriptive data for end users.
Contains data that controls the ETL processing.
Contains data about the current state of the data warehouse.
ETL updates metadata, to provide the most current state.

Repository. The place where you store the metadata is called a repository. The more
sophisticated your repository, the more
complex and detailed metadata you can store in it. PowerMart and PowerCenter use a
relational database as the
repository.

73. SQL * LOADER?


Ans: http://download-
west.oracle.com/otndoc/oracle9i/901_doc/server.901/a90192/ch03.htm#1004678

74. Debugger in Mapping?

75. Do you have exposure to parameter passing in version 5.1?

76. What is the file name that you need to configure in UNIX while installing Informatica?

77. How do you select duplicate rows using Informatica, i.e., how do you use
Max(Rowid)/Min(Rowid) in Informatica?
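
For reference, a hedged sketch of the SQL pattern the question alludes to, against a hypothetical
EMP table keyed on EMPID (in practice this kind of query would typically go into a Source
Qualifier SQL override):

-- Select the duplicate rows: every row except the one with the highest ROWID per EMPID
SELECT *
  FROM emp e
 WHERE e.rowid NOT IN (SELECT MAX(e2.rowid)
                         FROM emp e2
                        GROUP BY e2.empid);

Using MIN(rowid) instead keeps a different single row per EMPID and treats the others as
duplicates.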

INFORMATICA QUESTIONS
----------------------------------------------------------
1. What are active and passive transformations?
Active trans: the number of records that come into the transformation is not
the same as the number of records output (e.g. Filter, Aggregator, Joiner, Normalizer, Router,
Rank, Source Qualifier (if a filter condition is used), Update Strategy).

Passive trans: the number of records output is the same as the number of records input
(e.g. Expression, Lookup (connected and unconnected), Input, Output, Sequence Generator,
Stored Procedure, XML Source Qualifier).

2. What is the difference between connected and unconnected transformations?


Connected: receives input from another transformation (a pipe is connected to
another transformation).
Unconnected: it is called from within another transformation; the piping is not done
directly (e.g. Lookup, Stored Procedure).
3. Explain how an unconnected transformation can improve performance over a connected
transformation.
With an unconnected lookup we can filter the number of records based on the lookup
result, mark records for update based on the lookup result, and call the same lookup
multiple times in one mapping.

4. What are the various caches available for the lookup transformation? Explain them.
The caches available for a lookup transformation are static, persistent, and dynamic; a lookup
can also be uncached.
Static: by default the server creates a static cache. It builds the cache when it processes
the first lookup request, and it queries the cache based on the lookup condition.
If the lookup is connected it returns the values through the lookup/output ports;
if it is unconnected it returns the value through the return port.
Persistent: by default Informatica uses a non-persistent cache; enable the persistent
cache property when required. Normally the server creates a cache file and deletes it at
the end of a session, and for subsequent sessions it rebuilds the cache files,
whereas when the persistent cache is enabled the cache file is saved to
disk and the same file is reused for subsequent sessions, by building the
memory cache from the saved cache files. Enable "recache from database"
if the lookup table has been changed.
Dynamic: when the target table is also the lookup table, a dynamic lookup cache is used.
As with the static cache, the dynamic lookup cache is built by the server; when the server
receives a new row (a row that is not in the cache) it inserts the row into
the cache. If it is an existing row (a row that is in the cache) it flags the row as
existing and does not insert the row into the cache.

5. What do you mean by incremental aggregation? Explain briefly.

6. How do you set DTM memory parameters like the default buffer block size,
index cache size, and data cache size, and what about source- and target-based commits?

7. What do you mean by event-based scheduling? What are the uses of the indicator file?
In event-based scheduling, the session gets started as soon as the specified
indicator file appears in the given directory local to the Informatica Server. Note that
the file is automatically deleted once the session starts. Keep in mind that the
session has to be either manually started or scheduled, but the session actually kicks
off only after the indicator file appears; until then the session will be in 'file wait'.

Uses of Indicator file:


a) Event based scheduling (as above)
b) When we use a flat file as the target and enable the indicator file option, we
get information (0 = INS, 1 = UPD, 2 = DEL, 3 = REJ) about all DML operations that occurred on
the target rows. The server names this file "target_name.ind" and stores it in the
same target file directory. This indicator file option has to be configured in the server setup
under the 'misc' option.

8. What do you mean by tracing level, and what are its types?
The amount of detail the server writes to the session log file during execution is
called the tracing level. The server writes row errors to the session log, including the
transformation in which the error occurred and the complete row data.
The levels are Terse, Normal, Verbose init, and Verbose data.

Terse: the server writes initialization information and error messages, and notifies of

rejected data.
Normal: the server writes initialization and status information, errors encountered, and
skipped rows due to transformation row errors, plus a summary of session results, but not at the
level of individual rows.
Verbose init: in addition to normal tracing, the server writes additional tracing details such
as the names of index and data files used and detailed transformation statistics.
Verbose data: in addition to verbose init tracing, the server writes additional details for
each row that passes into the mapping, notes where the server truncates
string data to fit the precision of a column, and provides detailed transformation
statistics.

9. What kind of tracing level is used in development and in the production environment?


In development, Verbose data is used; in production, Verbose init is used.

10. What is the difference between the Lookup and Joiner transformations?

Lookup                                       Joiner
Can use any operator (=, <, >, etc.)         Can use only '='.
Is used to look up a table                   Is used to join multiple tables.
Supports only relational sources             Supports heterogeneous sources.
Is a passive transformation                  Is an active transformation.
Rejected rows are available                  Discarded rows are not available
(in the reject file or log file).            (in either the log file or the reject file).
11. Explain surrogate keys.
12. Explain SCD and its types.
There are three types of slowly changing dimensions.
Type 1: It overwrites the existing dimension record with the changed data and inserts genuinely
new dimension rows; no history is kept.

Type 2A (version data): based on user-defined comparisons, both new and changed
dimension rows are inserted into the target. A changed dimension is tracked by
versioning the primary key and creating a version number for each dimension row in the
table. The highest version number represents the current data for the record.
Type 2B (flag mapping): here the target has a field called PM_Current_Flag
which holds the values 0 and 1. The value 1 represents the current record.
Type 2C (date range): here the target has two fields, namely Pm_Begin_Date and
Pm_End_Date. For each new and changed row the system date is inserted into
Pm_Begin_Date to represent the start of the effective date range. For each changed row the
server uses the system date to update the Pm_End_Date of the previous version to
represent the end of its effective date range. Each new row, and each row that is currently
in effect, has a null value in Pm_End_Date.

Type 3: here only the current and previous versions of the column data are kept in the
table. It maintains both values in the same row, in an additional field, namely
pm_previous_value.
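
A hedged SQL sketch of the Type 2C (date range) pattern described above, against hypothetical
DIM_CUSTOMER / STG_CUSTOMER tables and a DIM_CUSTOMER_SEQ sequence (all names are
illustrative assumptions, not Informatica-generated objects):

-- Step 1: close out the current version of any row whose attributes have changed
UPDATE dim_customer d
   SET d.pm_end_date = SYSDATE
 WHERE d.pm_end_date IS NULL
   AND EXISTS (SELECT 1 FROM stg_customer s
                WHERE s.cust_id = d.cust_id
                  AND s.cust_name <> d.cust_name);

-- Step 2: insert a new version (new or changed rows) with an open-ended date range
INSERT INTO dim_customer (cust_key, cust_id, cust_name, pm_begin_date, pm_end_date)
SELECT dim_customer_seq.NEXTVAL, s.cust_id, s.cust_name, SYSDATE, NULL
  FROM stg_customer s
 WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                    WHERE d.cust_id = s.cust_id
                      AND d.pm_end_date IS NULL
                      AND d.cust_name = s.cust_name);

In a PowerMart/PowerCenter mapping the same logic would normally be built with a Lookup on the
dimension, an Expression to compare attributes, and an Update Strategy, rather than hand-written
SQL.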

13.Explain Update Strategy trans?


It is used for Slowly changing dimension or for updating the target.
Update strategy can be used at mapping and at session level.
The constants used at the mapping level are DD_INSERT (0), DD_UPDATE (1),
DD_DELETE (2), and DD_REJECT (3). If an Update Strategy is used, Data Driven is the default
option for "Treat rows as" in the Server Manager, and in the target options only one of the
following can be selected: update as update, update as insert, or update else insert.

At session level, all rows are treated as insert, update, or delete depending on the
option selected for "Treat rows as". This has to match the target options.

In the target options only one of these can be selected: update as update or
update as insert. Delete has to be used separately.

14.Explain about mapping variables,parameters?


Mapping parameters and variables are used to make a mapping more flexible. The values
of mapping parameters and variables can be used in any transformation within that
mapping, and they can be reused in different sessions just by altering the mapping
parameter and variable values. Mapping parameter and variable values are set as
default values wherever required in the mapping, or we can set them from a file
called the parameter file.

15.Explain about Stored Procedure trans?


The Stored Procedure transformation is a passive transformation; it can be both connected and
unconnected. It is basically used for populating and maintaining databases. It is a set of
procedures executed to perform time-consuming, complicated SQL operations. It has the
facility to run at source pre/post-session load and also at target pre/post-session load.


Normally a stored procedure is used for dropping and recreating an index, checking the status
of the target database, determining whether enough space is available, and performing
specialized calculations. In an unconnected stored procedure we use a reserved variable called
PROC_RESULT, which is used as the output variable to get the output of the stored
procedure if it returns a single value. Otherwise, create as many local variables as there are
outputs from the stored procedure.

16.Explain about versioning of map?


Different versions can be created in Informatica based on the requirements.
Versions are used to maintain the maps and objects created earlier for
development purposes. This is useful to keep the changed maps and objects
separate.
Versions can be created from an active Designer via Repository - Save As.
It creates an entire set of folders, namely sources, targets, maps, mapplets, etc., with the new
version number.
Versions have three numbers, namely major, minor, and patch. Versioning has to be done
appropriately.
17.Explain about batch running?

18.Explain about external procedures trans?


External Procedure transformations operate outside the Designer interface. These
procedures function within a DLL (dynamic link library) or a UNIX shared library.
External Procedure transformations provide a wide range of options to extend the
functionality of normal transformations like Expression, Filter, etc., which may not
provide the exact functionality required by the user.

19.What is the use of Source Qualifier trans?


It is used for relational or flat file source definitions. A Source Qualifier can merge
more than one source; remember that the source types have to be the same.
A Source Qualifier can be used for joining the sources, filtering, sorting, and overriding the
SQL query.

20.What is sequence generator trans?


It is a passive transformation that generates numeric values, which can be used as unique
primary key values, to replace missing primary keys, or to cycle through a sequential
range of numbers. There is no input port, but there are two output ports, namely CURRVAL
and NEXTVAL.
The NEXTVAL port is connected to the input port of another transformation. CURRVAL
equals NEXTVAL plus the "Increment By" value in the Properties tab. CURRVAL
is not incremented if NEXTVAL is not connected to another transformation.
A Sequence Generator can be made reusable and the same sequence generator used in
multiple mappings.

21.What is XML source qualifier transform?


22. What is the Router transformation?
The Router transformation is similar to the Filter transformation; both allow you to test
conditions. In a Filter transformation the rows that do not meet the condition are
dropped. If we need to test the same input data based on multiple conditions, a Router
transformation can be used instead of using multiple Filter transformations.
In a Router transformation we can give one or more conditions to route the data.
The rows that do not meet any condition can also be routed to a group called
the default group.

23. What is a mapplet, and what are the advantages of mapplets?


A mapplet is a reusable object that represents a set of transformations. It is created in the
Mapplet Designer. Mapplets are created when we want to use a common set of transformation
logic in several mappings; we can avoid repetitive transformations across maps
by using mapplets. Normally we use Input/Output transformations to receive data
and to send data. Instead of an Input transformation, we can also use a Source Qualifier as the
input, but we cannot use a target definition for output. A mapplet can have multiple groups of
output ports; these can be used to connect different data flows in a mapping.

A mapplet cannot use more than one Input transformation.


A mapplet cannot use a Joiner, COBOL sources, a Normalizer, XML sources, non-reusable
Sequence Generators, or pre/post-session stored procedures.
The input/output ports in a mapplet don't have a datatype.
Based on the ports it is connected to, it assumes a datatype and displays it when used
in the Mapping Designer. We don't explicitly specify the datatype, but if there is a
mismatch, it will be a problem.

24.What do you mean by reusable trans? and how do you create?

A reusable transformation is a transformation that can be reused in multiple mappings. It can be
created in the


Transformation Developer, or you can convert an existing transformation by enabling the
Reusable option on the Transformation tab. This is an irreversible process. We can edit a
reusable transformation in the Transformation Developer. A note of caution: editing
a reusable transformation can invalidate the mappings that use the
reusable transformation.

25.What do you mean by Target Load Plan?


Setting the order in which the server sends the data to the various targets in a single
mapping is called the target load plan. This is used to increase performance for
inserts, updates, and deletes where referential integrity has to be maintained or is required
by the target. The server writes data to the various targets from a single Source Qualifier,
which enhances performance. The target load plan can be set by selecting the
Mappings menu from the Mapping Designer window.

26.What is business component?


Summary:
I'm preparing for some upcoming Data Warehouse job interviews. What are some of the
types of questions or topics I can expect?
Full Article:
Disclaimer: Contents are not reviewed for correctness and are not endorsed or recommended
by ITtoolbox or any vendor. FAQ contents include summarized information from the ITtoolbox
DW-Career discussion <http://Groups.ITtoolbox.com/archives/archives.asp?l=dw-career>
unless otherwise noted.

1. Adapted from response by TechnicalSQLUSA on Thursday, July


22, 2004 </groups/groups.asp?v=dw-career&i=514012>

Here are a few:
1. Tell me about cubes
2. Full process or incremental
3. Are you good with data cleansing?
4. How do you handle changing dimensions?
5. What is a star-schema?

2. Adapted from response by Mike on Thursday, July 22, 2004


</groups/groups.asp?v=dw-career&i=514652>

A few high level questions might include:


Talk about the Kimball vs. Inmon approaches.
Talk about the concepts of ODS and information factory.
Talk about challenges of real-time load processing vs. batch.

For Informatica:

Let them know which version you are familiar with as well as
what role. Informatica 7.x has divided the developer and
administrator roles.

You will most likely be asked specific questions for building


a mapping and workflow. Know what the difference is between
static and reusable objects for both. Be prepared to
demonstrate how to create a connection, source definition
(flat file and relational), use expression transformation,
lookups (connected and disconnected), aggregators,
normalizers, update strategies, how to modify source and
target sql overrides, etc.

For Erwin:

Know the difference between Logical and Physical models.


Know how to use the Reverse Engineer and Comparison features.
The dimension model feature is pretty weak, but you might
want to know how Erwin treats dimensional modeling.

Other topics:

Anything you know about RDBMS is worth discussing.

In Oracle, you can talk about referential integrity as it


applies to DW. Views and Materialized Views, Partitioning,
Bitmap Indexing (when to use), and any other specifics as
related to DW (for 10g there is the new Bitmap join Index).

Always, always offer details of your knowledge, and ask


questions to get the customer's perspectives (you do not want
to push Kimball concepts if the customer is hard-set on
Inmon).

Anything you can bring to the table regarding the customers


business systems (i.e. SAP, Peoplesoft, etc.) will help
separate you from the pack. Also anything you know about
business processes such as Order Fulfillment, Inventory
Analysis, Finance, etc. will also separate you.

3. Adapted from response by Shaquille on Wednesday, August


11, 2004 </groups/groups.asp?v=dw-career&i=527204>

Here are a few questions that might be posed:

Data Warehousing questions:


1) What is source qualifier?
2) Difference between DSS & OLTP?
3) Explain grouped cross tab?

4) Hierarchy of DWH?
5) How many repositories can we create in Informatica?
6) What is surrogate key?
7) What is difference between Mapplet and reusable
transformation?
8) What is aggregate awareness?
9) Explain reference cursor?
10) What are parallel queries and query hints?
11) DWH architecture?
12) What are cursors?
13) Advantages of denormalized data?
14) What is operational data source (ODS)?
15) What is meta data and system catalog?
16) What is factless fact schema?
17) What is a conformed dimension?
18) What is the capacity of power cube?
19) Difference between PowerPlay transformer and power play
reports?
20) What is IQD file?
21) What is Cognos script editor?
22) What is the difference between macros and prompts?
23) What is power play plug in?
24) Which kind of index is preferred in DWH?
25) What is hash partition?
26) What is DTM session?
27) How can you define a transformation? What are different
types of transformations in Informatica?
28) What is mapplet?
29) What is query panel?
30) What is a look up function? What is default
transformation for the look up function?
31) What is difference between a connected look up and
unconnected look up?
32) What is staging area?
33) What is data merging, data cleansing and sampling?
34) What is update strategy and what are the options for
update strategy?
35) OLAP architecture?
36) What is subject area?
37) Why do we use DSS database for OLAP tools?
Business Objects FAQ:
38) What is a universe?
39) Analysis in business objects?
40) Who launches the supervisor product in BO for first time?
41) How can you check the universe?
42) What are universe parameters?
43) Types of universes in business objects?
44) What is security domain in BO?
45) Where will you find the address of repository in BO?
46) What is broad cast agent?
47) In BO 4.1 version what is the alternative name for
broadcast agent?
48) What services the broadcast agent offers on the server
side?
49) How can you access your repository with different user
profiles?
50) How many built-in objects are created in BO repository?
51) What are alerters in BO?
52) What are different types of saving options in web
intelligence?

53) What is batch processing in BO?


54) How can you first report in BO by using the broadcast agent?
55) Can we take a report to Excel in BO?

