Market Overview:
Two of the pioneers in the field were Ralph Kimball and Bill Inmon. Brief
biographies of these two individuals are provided, since many of the terms and
concepts discussed in this paper were coined and defined by them.
Biographical Information
Bill Inmon
Bill Inmon is universally recognized as the "father of the data warehouse."
He has over 26 years of database technology management experience and
data warehouse design expertise, and has published 36 books and more
than 350 articles in major computer journals. His books have been
translated into nine languages. He is known globally for his seminars on
developing data warehouses and has been a keynote speaker for every
major computing association. Before founding Pine Cone Systems, Bill was
a co-founder of Prism Solutions, Inc.
Ralph Kimball
Ralph Kimball was co-inventor of the Xerox Star workstation, the first
commercial product to use mice, icons, and windows. He was vice
president of applications at Metaphor Computer Systems, and founder and
CEO of Red Brick Systems. He has a Ph.D. from Stanford in electrical
engineering, specializing in man-machine systems. Ralph is a leading
proponent of the dimensional approach to designing large data
warehouses. He currently teaches data warehousing design skills to IT
groups, and helps selected clients with specific data warehouse designs.
Ralph is a columnist for Intelligent Enterprise magazine and has a
relationship with Sagent Technology, Inc., a data warehouse tool vendor.
His book "The Data Warehouse Toolkit" is widely recognized as the seminal
work on the subject.
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990. He defined it as
follows: "A warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making
process".
He defined the terms in the sentence as follows:
DATA WAREHOUSING
• the business resides, and historical data for prior periods, which may be
contained in some form of "legacy" system. Often these legacy systems
are not relational databases, so much effort is required to extract the
appropriate data.
• Query tools: Tools that allow the user to issue SQL (Structured Query
Language) queries against the warehouse and get a result set back.
Summary:
Data Warehousing is a complex field, with many vendors vying for market
awareness. The complexity of the technology, the interactions between the
various tools, and the high price points of the products require companies to
perform a careful technology evaluation before embarking on a warehousing
project. However, the potential for
enormous returns on investment and competitive advantage make data
warehousing difficult to ignore.
Introduction
• Subject-oriented means that all relevant data about a subject is gathered and stored
as a single set in a useful format;
• Integrated refers to data being stored in a globally accepted fashion with consistent
naming conventions, measurements, encoding structures, and physical attributes, even
when the underlying operational systems store the data differently;
• Non-volatile means the data warehouse is read-only: data is loaded into the data
warehouse and accessed there;
• Time-variant data represents long-term data--from five to ten years as opposed to
the 30-to-60-day time frames of operational data.
• Analyze potentially large amounts of data with very fast response times
• "Slice and Dice" through the data, and drill down or roll up through various
dimensions as defined by the data structure
• Quickly identify trends or problem areas that would otherwise be missed
Data Marts:
Data marts are workgroup or departmental warehouses, which are small
in size, typically 10-50GB. The data mart contains informational data
that is departmentalized, tailored to the needs of the specific
departmental work group. Data marts are less expensive and take less
time to implement, with a quick ROI. They are scalable to full data
warehouses and are sometimes summarized subsets of more detailed,
pre-existing data warehouses.
Metadata/Information Catalogue:
Metadata describes the data contained in the data warehouse (e.g., data
elements and business-oriented descriptions), as well as the source of that
data and the transformations or derivations that may have been performed to
create each data element.
Data Mining:
Data mining predicts future trends and behaviors, allowing businesses to
make proactive, knowledge driven decisions. Data mining is the process
of analyzing business data in the data warehouse to find unknown
patterns or rules of information that you can use to tailor business
operations. For instance, data mining can find patterns in your data to
answer questions like the following (a brief sketch of the first question appears after the list):
• What item purchased in a given transaction triggers the purchase of additional related
items?
• How do purchasing patterns change with store location?
• What items tend to be purchased using credit cards, cash, or check?
• How would the typical customer likely to purchase these items be described?
• Did the same customer purchase related items at another time?
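As a rough illustration of the first question above, here is a minimal sketch (plain Python, with hypothetical transaction data) that counts how often pairs of items appear together in the same transaction; frequently co-occurring pairs suggest which items "trigger" related purchases. In a real project the transactions would be pulled from the warehouse rather than hard-coded.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; in practice these would be read from the warehouse.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items bought in the same transaction.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs hint at purchase-trigger relationships.
print(pair_counts.most_common(3))
```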
• Open Data Warehousing architecture with common interfaces for product integration
• Data Modeling with ability to model star-schema and multi-dimensionality
• Extraction and Transformation/propagation tools to load the data warehouse
• Data warehouse database server
• Analysis/end-user tools: OLAP/multidimensional analysis, report and query tools
• Tools to manage information about the warehouse (Metadata)
• Tools to manage the Data Warehouse environment
• The creation of new fields that are derived from existing operational data
• Summarizing data to the most appropriate level needed for analysis
• Denormalizing the data for performance purposes
• Cleansing of the data to ensure that integrity is preserved.
Even with the use of automated tools, however, the time and costs
required for data conversion are often significant. Bill Inmon has
estimated that 80% of the time required to build a data warehouse is
typically consumed by the conversion process.
Business Objects (Business Objects)
HP Intelligent Warehouse: Guide (Hewlett-Packard)
Directory Manager (Prism Solutions)
Conclusion
Data Warehousing provides the means to change raw data into
information for making effective business decisions--the emphasis is on
information, not data. The data warehouse is the hub for decision
support data. A good data warehouse will... provide the RIGHT data... to
the RIGHT people... at the RIGHT time: RIGHT NOW! While the data
warehouse organizes data for business analysis, the Internet has emerged
as the standard for information sharing, so the future of data
warehousing lies in its accessibility from the Internet. Successful
implementation of a data warehouse requires a high-performance,
scalable combination of hardware and software that can integrate
easily with existing systems, so customers can use data warehouses to
improve their decision-making--and their competitive advantage.
By - Anjaneyulu Marempudi
BY: WWW.PARETOANALYSTS.COM
To estimate the size of the fact table in bytes, multiply the size of a row
by the number of rows in the fact table. A more exact estimate would also
account for the data types, indexes, page sizes, etc. An estimate of the
number of rows in the fact table is obtained by multiplying the number
of transactions per hour by the number of hours in a typical work day,
multiplying that result by the number of days in a year, and finally
multiplying by the number of years of transactions involved. Divide the
resulting byte total by 1024 to convert to kilobytes, and by 1024 again to
convert to megabytes.
E.g. A data warehouse will store facts about the help provided by a
company's product support representatives. The fact table is made up of
a composite key of 7 indexes (int data type), including the primary
key. The fact table also contains 1 measure of time (datetime data type)
and another measure of duration (int data type). 2000 product incidents
are recorded each hour in a relational database. A typical work day is 8
hours and support is provided for every day in the year. What will be the
approximate size of this data warehouse in 5 years?
First calculate the approximate size of a row in bytes (int data type = 4
bytes, datetime data type = 8 bytes):
size of a row = size of all composite indexes (add the size of all indexes)
+ size of all measures (add the size of all measures).
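Working the example through (a quick check in plain Python, using only the figures given above; index and page overhead are ignored, as noted):

```python
# 7 composite-key columns (int, 4 bytes each) + 1 datetime measure (8 bytes)
# + 1 int measure (4 bytes)
row_size = 7 * 4 + 8 + 4          # 40 bytes per fact row

# 2000 incidents/hour * 8 hours/day * 365 days/year * 5 years
rows = 2000 * 8 * 365 * 5         # 29,200,000 rows

total_bytes = rows * row_size     # 1,168,000,000 bytes
print(total_bytes / 1024)         # ~1,140,625 KB
print(total_bytes / 1024 ** 2)    # ~1,114 MB, i.e. roughly 1.1 GB of fact data
```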
TERMS AND DEFINITIONS

Access Control: Refers to mechanisms and policies that restrict access to computer resources.

Ad-Hoc Reporting: Unpredictable, unplanned access and manipulation of data.

Archive Services: Provide long-term off-line storage of data which must be retained for historic purposes. The services allow users to archive and retrieve data as needed to support the business processes. Automated processes may also archive data which has not been accessed for a specified period of time.

Atomic Database: A database of change records that, when applied in temporal order, will reconstruct in a target database an identical copy of a source database at a point in time.

Attribute: Used in Logical Data Modeling, an Attribute is any detail that serves to identify, describe, classify, quantify, or provide the state of an entity. For example, the entity Employee may have the following attributes: Last Name, First Name, and Hire Date. Attributes are the general equivalent of physical columns in a table.

Audit Trail: A record showing who has accessed a computer system and what operations he or she has performed during a given period of time. Data that is available to trace system activity, usually update activity.

Best Practices Reports: Canned routines based on predefined parameters.

Change Tables: A set of tables that mirror an OLTP database in structure, with the possible addition of auditing information. Not all OLTP tables will necessarily have associated change tables.

Data Architecture: A specific framework for managing data to enable the institution to build and maintain the strategic capabilities it needs to achieve its mission. The framework consists of a set of principles, standards, and models that describe how the data will be created, maintained, and protected. It focuses on improving effectiveness and reducing long-term costs, and contains components that cover the full data life cycle from creation to retirement. An example is an ETL tool.

Database: Any collection of data.

Database Engine: The software that holds the database and executes the requests against that database. Oracle is an example of a Database Engine.

DataMart: A customized subset of data taken from the Data Warehouse. A DataMart is typically set up by a specific individual or department to support their particular needs.

Data Model: A graphical representation illustrating data-related business requirements in the context of a given application.

Data Replication: The process of copying and maintaining schema objects in the multiple databases that make up a distributed database system. Replication can improve the performance and protect the availability of applications because alternate data access options exist.

DataStore: See Operational DataStore.

Data Warehouse: An enterprise-wide database. It is a read-only collection of data from any number of sources. It is usually refreshed from Operational DataStores, but may also receive data from OLTPs. It is also the likely source of data for a DSS.

Decision Support System (DSS): A complete process for allowing users to access the data they need to support their decision-making process. This includes the database(s) holding the data, the software application which interfaces with the Database Engine, metadata, training, and support.

Degree: Shows how many instances of an entity can exist at one end of the relationship for each entity instance at the other end. Crow's feet show a relationship degree of many, and a single point represents a relationship degree of one.

Denormalization: Roughly the opposite of Normalization. In a denormalized database, some duplicated data storage is allowed. The benefits are quicker retrieval of data and a database structure that is easier for end-users to understand and is thereby more conducive to ad-hoc queries.

Domain: A set of business validation rules, format constraints, and allowable values that apply to a group of attributes. For example, yes and no, or days of the week.

ETL: Signifies Extraction, Transformation, and Load. The tool extracts, transforms, and loads data from data sources to data targets in a central repository. The data sources can be a database, a file, or a COBOL copybook, or any combination of the three. It is primarily used to move data from an OLTP to an ODS, or from an ODS to a DSS.

Entity: Used in Logical Data Modeling, an Entity is a thing of significance, either real or conceptual, about which the business or system being modeled needs to hold information. For example, if the business needs to process sales orders, an Entity to represent sales orders would be recorded. An Entity generally corresponds to a physical table. Also see Attribute.

Entity Relationship Diagram (ERD): Entity relationship modeling involves identifying the things of importance in an organization (entities), the properties of those things (attributes), and how they are related to one another (relationships). The resulting information model is independent of any data storage or access method.

Foreign Key: In a table, one or more columns whose values must match the values in the primary key of the referenced table. The columns in the foreign key typically reference the primary key of another table but may reference the same table. This mechanism allows two tables to be joined together.

Function Hierarchy Diagram: Displays all of the functional requirements of an application and their logical groupings. Shows the decomposition of functions ranging from the highest level, or root, to the lowest level, or leaf, required.

Metadata: "Data describing the data." Metadata provides information about a database, including descriptions of the tables and columns, as well as descriptions of the data stored within those tables and columns.

Methodology: Facilitates a repeatable, structured approach to defining requirements and developing business applications. A methodology tells you what to do and when. An example is "Develop a Data Movement process."

MI Operations & Production Control: Individuals filling this role are responsible for overseeing the 24-hour operation of assigned systems, directing the daily setup of customer jobs for assigned systems, and negotiating schedules for all systems in the area.

Normalization: A relational database design concept which eliminates duplication of data storage in a database. This is a crucial element of OLTP systems, which can suffer severe performance penalties if the database is not normalized.

Not Nullable: A mandatory attribute or column is marked as mandatory by making it Not Nullable. Not Nullable indicates that a valid value must be entered for each occurrence of the attribute or column. Null values are not allowed.

Null: A Null indicates the absence of a value. This is the equivalent of leaving a field empty. Columns marked as "Not Nullable" or "Not Null" may not contain Nulls. A "blank" or a "space" is not the equivalent of a null and is handled very differently from a null; "blanks" and "spaces" must be absolutely avoided.

On-Line Analytical Processing (OLAP): A software technology that transforms data into multidimensional views and that supports multidimensional data interaction, exploration, and analysis. SAS is an example of OLAP.

On-Line Transaction Processing (OLTP): An OLTP database is a database with read and write access. This is where transactions are actually entered, modified, and/or deleted. Due to performance considerations, read-only requests on the database may be routed to an Operational Data Store. Typically, an OLTP database is "normalized."

Operational DataStore (ODS): An ODS is a read-only database containing operational data in support of a specific business need. It is updated on a frequent basis (weekly, daily, hourly, or even more often) and may be populated from one or more OLTP and/or ODS databases. Depending upon its refresh cycle and usage, the ODS may be normalized or denormalized.

Operational Reporting: Standardized, stable, repeatable reports which are scheduled and which access and manipulate data based on predefined parameters.

Optionality: The minimum number of entity instances that are possible at one end of the relationship for each entity instance at the other end. For example, a dashed line indicates an optional relationship end that is read as "may be," while a solid line indicates a mandatory relationship end that is read as "must be."

Oracle Replication: Builds data replication using Oracle-generated snapshot tables and snapshot logs.

Primary Key: While primarily referring to tables, Primary Keys can also pertain to entities. A Primary Key is the mandatory column or columns used to enforce the uniqueness of rows in a table. This is normally the most frequent means by which rows are accessed. Please note, however, that a column which is part of a Primary Key may not contain null values.

Process Model: A visual illustration representing organizational units, which consist of departments or groups within a business, responsible for a specific business activity. It is strongly suggested that the process model be used during analysis.

Purge: To systematically and permanently remove old and unneeded data. The term purge is stronger than delete; it is often possible to regain deleted objects by undeleting them, but purged objects are gone forever.

Relationship: A named, significant association between two entities. Each end of the relationship shows the degree of how the entities are related and the optionality.

Relational Database: A database in which data is stored in multiple tables. These tables then "relate" to one another to make up the entire database. Queries can be run to "join" these related tables together.

Security: Refers to techniques for ensuring that data stored in a computer cannot be read or compromised. Protection provided to prevent unauthorized or accidental access to, or manipulation of, a database.

Snapshot Tables: A point-in-time copy of table data originating from one or more master tables.

Strategy: A synonym for plan, which is defined as a scheme, program, or method worked out beforehand for the accomplishment of an objective. The Strategy tells you how to do it, and the guidelines and/or techniques to use. An example is the naming standards developed for the open systems environment.

Table: A tabular view of data used to hold one or more columns of data. It is often the implementation of an entity.

Trigger: A stored procedure associated with a table that is automatically executed on one or more specified events affecting the table.

Unique Key: 1. Defines the attributes and relationships that uniquely identify the entity. 2. A column or columns which contain unique values for the rows of a table. A column in a Unique Key may contain a null; therefore, a Unique Key defined for an entity may not make a suitable Primary Key for a table.
The Basics
Insurance companies use data warehousing for claims analysis to see which
procedures are claimed together and to identify patterns of risky customers.
Manufacturers can use data warehousing to compare costs of each of their product
lines over the last several years, determine which factors produced increases and
see what effect these increases had on overall margins.
What five questions should be asked in the data warehouse planning stage?
1. What data is needed to make business decisions?
2. Which business units will use it?
3. What kind of data analysis will be done?
4. How granular will the data be and what is the oldest data to be archived in it?
5. What are the security requirements?
What are some of the factors that determine whether a data warehouse will be
successful?
Database design, end user training, the ongoing adjusting and tuning of
applications to meet user needs, and the system architecture and design.
Heralded as the solution to the management information dilemma, the term "data
warehouse" has become one of the most used and abused terms in the IT
vocabulary. But ask a variety of vendors and professionals for their vision of what a
data warehouse is and how it should be built, and the ambiguity of the term will
quickly become apparent.
The concept of "data warehousing" dates back at least to the mid-1980s, and
possibly earlier. In essence, it was intended to provide an architectural model for
the flow of data from operational systems to decision support environments. It
attempted to address the various problems associated with this flow, and the high
costs associated with it. In the absence of such an architecture, there usually
existed an enormous amount of redundancy in the delivery of management
information. In larger corporations it was typical for multiple decision support
projects to operate independently, each serving different users but often requiring
much of the same data. The process of gathering, cleaning and integrating data
from various sources, often legacy systems, was typically replicated for each
project. Moreover, legacy systems were frequently being revisited as new
requirements emerged, each requiring a subtly different view of the legacy data.
Somewhere along the way this analogy and architectural vision was lost, often
through manipulation by suppliers of decision support software tools. Data
warehousing "gurus" began to emerge at the end of the 80s, often themselves
associated with such companies. The architectural vision was frequently replaced
by studies of how to design decision support databases. Suddenly the data
warehouse had become the miracle cure for the decision support headache, and
suppliers jostled for position in the burgeoning data warehousing marketplace.
Despite the recent association of the term "data warehousing" with OLAP and
multi-dimensional database technology, and the insistence of some people that
data warehouses must be based on a "star schema" database structure, it is wise
to restrict the use of such designs to data marts. The use of a star schema or
multi-dimensional / OLAP design for a data warehouse can seriously compromise
its value for a number of reasons.
Data marts provide the ideal solution to perhaps the most significant conflict in data
warehouse design - performance versus flexibility. In general, the more normalised
and flexible a warehouse data model is, the less well it performs when queried.
This is because queries against normalised designs typically require significantly
more table join operations than optimised designs. By directing all user queries to
data marts, and retaining a flexible model for the data warehouse, designers can
achieve flexibility and long term stability in the warehouse design as well as
optimal performance for user queries.
Why is it so expensive?
While the data warehousing concept in its various forms continues to attract
interest, many data warehousing projects are failing to deliver the benefits
expected of them, and many are proving to be excessively expensive to develop
and maintain. For this reason it is important to have a clear understanding of their
real benefit, and of how to realise this benefit at a cost which is acceptable to the
enterprise.
The costs of data warehousing projects are usually high. This is explained primarily
by the requirement to collect, "clean" and integrate data from different sources -
often legacy systems. Such exercises are inevitably labour-intensive and time-
consuming, but are essential to the success of the project - poorly integrated or low
quality data will deliver poor or worthless management information. The cost of
extracting, cleaning and integrating data represents 60-80% of the total cost of a
typical data warehousing project, or indeed any other decision support project.
Vendors who claim to offer fast, cheap data warehouse solutions should be asked to
explain how they are able to avoid these costs, and the likely quality of the results
of such solutions must be carefully considered. Such vendors typically place the
emphasis on tools as a solution to the management information problem – OLAP
tools, data integration technology, data extraction tools, graphical user query tools,
etc. Such tools resolve only a fraction of the management information problem,
and represent a small proportion of the cost of a successful data warehousing
project.
Focus on technology rather than data quality is a common failing among data
warehousing projects, and one which can fatally undermine any real business
benefit.
Given the high costs, it is difficult to justify a data warehousing project in terms of
short-term benefit. As a point solution to a specific management information need,
a data warehouse will often struggle to justify the associated investment. It is as a
long term delivery mechanism for ongoing management information needs
that data warehousing reaps significant benefits. But how can this be achieved?
Given the above facts about the loading of costs on data warehousing projects, it is
clear that focus must be on the reduction of the ongoing cost of data extraction,
cleaning and integration.
1. 80% of the data used by the various data warehouses across the corporation
came from the same 20% of source systems.
2. Each new data warehousing project usually carried out its own process to extract,
clean and integrate data from the various sources, despite the fact that much of the
same data had been the subject of previous exercises of a similar nature.
3. The choice of data to be populated in the data warehouse was usually based on
needs of a specific group, with a particular set of information requirements. The
needs of other groups for the same data were rarely considered.
Experience of other organizations showed a very similar pattern to the above. From
these findings alone it is clear that there is scope for economies of scale when
planning data warehousing projects; if focus were placed initially on the 20% of
source systems which supply 80% of the data to decision support systems, then an
initial project which simply warehouses "useful" data from these systems would
clearly yield cost benefits to future MIS projects requiring that data. Rather than
targeting a specific business process or function, such a project should aim to
benefit the wider audience for decision support. It would form an invaluable
foundation for an evolving data warehouse environment.
When building a data warehouse the use of multi-dimensional, star-schema or other
optimised designs should be strongly discouraged, in view of the inherent
inflexibilities in these approaches as outlined above. The use of a relational,
normalised model as the backbone of the warehouse will ensure maximum
flexibility to support future growth. If user query access is then strictly limited to
data marts, the data warehouse needs only to support periodic extracts to data
marts, rather than ad-hoc query access. Performance issues associated with these
extracts can be addressed in a number of ways - for example through the use of
staging areas (either temporary or permanent) where relational table structures are
pre-joined or "flattened" to support specific extract processes.
Once this initial project is complete, emphasis can be placed on the growth of the
warehouse as a global resource for unspecified future decision support needs,
rather than as a solution to specific requirements at a particular time. In
subsequent phases of the warehouse development, new data which is likely to
play a major role in future decision support needs should be carefully selected,
extracted and cleaned. It can then be stored alongside the existing data in the
warehouse, hence maximising its information potential. As new information needs
emerge, the cost of meeting them
will be diminished due to the elimination of the need to perform much of the costly
extraction, cleaning and integration functions usually associated with such
systems. Over time, this environment will grow to offer a permanent and invaluable
repository of integrated, enterprise-wide data for management information. This in
turn will lead to massively reduced time and cost to deliver new decision support
offerings, and hence to true cost justification. The effort required to achieve this
must not be underestimated, however. Identifying which data is "useful" requires a
great deal of experience and insight. The way in which the data is modelled in the
warehouse is absolutely critical - a poor data model can render a data warehouse
obsolete within months of implementation. The process used to identify, analyse
and clean data prior to loading it into the warehouse, and the attendant user
involvement, is critical to the success of the operation. Management of user
expectations is also critical. The skills required to achieve all of the above are
specialised.
Once in the warehouse, data can be distributed to any number of data marts for user
query access. These data marts can take any number of forms, from client-server
databases to desktop databases, OLAP cubes or even spreadsheets. The choice
of user query tools can be wide, and can reflect the preferences and experience of
the users concerned. The wide availability of such tools and their ease of
implementation should make this the cheapest part of the data warehouse
environment to implement. If data in the warehouse is well-structured and quality-
assured, then exporting it to new data marts should be a routine and low-cost
operation.
In summary, a data warehouse environment can offer enormous benefits to most
major organizations if approached in the correct way, and if distractions from the
main goal of delivering a flexible, long-term information delivery environment are
placed in perspective.
Introduction
Over the course of the 1960s, 1970s, and 1980s, most medium-to-large businesses
successfully moved key operational aspects of their enterprises onto large
computing systems. The 1980s saw relational database technologies mature to the
point where they could play the central role in these systems. Naturally, the
requirements of operational systems, being substantial and unforgiving, forced
database vendors to focus development efforts almost exclusively on issues like
transaction speed, integrity, and reliability.
When business questions could be answered, it was not unusual to wait weeks for
answers. Sometimes, executives would be given the non-choice between stopping
the business and producing a particular report. They would also be confronted with
contradictory information from multiple systems. It seemed inconceivable that so
much time, money, and attention could be paid to technology only to have
relatively modest inquiries turned back.
The premise of the data warehouse is that it is physically separate from operational
systems, and has a mission completely different from that of operational systems.
The virtue of a separation of systems is twofold. It ensures that the data
warehouse will not interfere with business operations, and it facilitates the
acquisition, reconciliation, and integration of data, not only from different
operational systems within the enterprise, but also from sources external to the
business.
First consider the transaction processing request example of consumers using their
retail credit cards to make purchases. There may be millions of these that occur
during a given day. There may be hundreds proceeding simultaneously at any
given time. Each one, however, involves locating a limited amount of account
information for a particular consumer and modifying it. The information is usually
measured in bytes.
Then consider the inquiry and reporting scenario. Someone in credit card
merchandising and marketing has a fairly simple question: Among new
cardholders, how many did we market to in a particular region, who match a
particular demographic and made purchases from a particular product line? This
single question involves accessing and comparing perhaps hundreds of millions of
records depending upon the size of the organization.
depends largely on parallel data loading and index building in order to fit within
acceptable system down-time windows. Developments in large memory computer
configurations also help to satisfy the data warehouse workload. Even more
significantly, however, advances in indexing, compact data representation, and
data processing algorithms mean that fewer actual bytes of data are accessed and
manipulated to answer a given question. The concept of the data warehouse has
become accepted to the point that virtually all Global 2000 companies and most
medium-sized companies have a data warehouse development project underway
or are planning for one. The market for data warehouse-related hardware,
software, and services is measured in the tens of billions of dollars worldwide.
Even so, variants of the data warehouse have emerged to meet some of the
specialized real-world needs of companies and company departments everywhere.
For example, data warehouse satellites, called data marts, are deployed and
tailored to the needs of a specific audience. There is also the up-to-the-hour, or
up-to-the-minute, transactional warehouse hybrid, called the operational data store,
for companies that have a requirement for extremely fresh information.
Although the concept of the data warehouse is universally accepted, warehouses
are still hard to build, and they frequently leave some of the corporate information
appetite unsatisfied, even when deployed successfully. The basic mandate of the data
warehouse or the data mart is enormous: satisfy the information requirements of
an entire company or an entire department, regularly and in a timely way.
The technology used to build data warehouses and marts is antagonistic to this all-
purpose sensibility. Parallelism, indexing, clustering, and even novel storage
architectures like proprietary multi-dimensional data storage all come at a cost.
Effective parallelism and indexing depend extensively upon knowing in advance
what questions will be asked, or if not the specific questions, at least the form of
the question.
In practice, this means that data warehouses and even marts usually discourage
extraordinary lines of inquiry. Given complete freedom of interrogation, power
users will bring a data warehouse to its knees with their queries. That is, parallel
data striping and indexing suitable for one query may be ill suited to another.
Whenever the data warehouse is not indexed or tuned for a particular query the
physical resources of the system can be overwhelmed to the detriment of other
clients.
Most data warehouse and data mart end-users have modest information
requirements and keep businesses running by querying inside the lines. For the
most part it is this relatively large audience that data warehouses and marts end up
satisfying. In order to provide most of the users with timely service most of the
time, technical organizations take the defensive approach, prohibiting non-
standard (ad hoc) queries, or scheduling them only at odd hours. In this way they
inhibit the smaller number of business analysts in the organization — those most
likely to find breakthrough opportunities.
What these elite knowledge workers and business analysts desire most from their
mart or warehouse is the ability to go wherever their mind or intuition may take
them, exploring for patterns, relationships, and anomalies. This is how they
cultivate business knowledge.
The data warehouse and data marts have different objectives, and their
design reflects these differences.
There is an important reason that this is the first step in the process. It
defines our scope. We have many other steps to traverse, and by
eliminating the data we don't need, we will not be wasting our time
incorporating data elements that will be discarded later.
This is the second step in the process because it has the greatest impact
on the data model, and the data model is the foundation of our design.
Introducing the historical perspective means that attributes (e.g., last
name) could have multiple values over time, and we need to retain each
of them. To do this, we need to make each instance unique. Further, the
business rules change when we opt for the historical perspective. For
the operational system, we need to know the
Upon completion of the first four steps, the data warehouse design should
meet the business needs. The warehouse will have the needed data and
will store it in a way that provides flexibility to the business users to use
it to meet their needs.
These eight steps transform the business data model into a data model
for the data warehouse. Additional adjustments may be made to
improve performance.
Step 1: Distill the business questions. The first step is to identify the
business questions and separate the measurements of interest from the
constraints (dimensions). Measurements include sales quantity, sales
amount, customer count, etc. Constraints include the product hierarchy,
customer hierarchy, sales area hierarchy, time, etc. In performing this
step, we don't pay attention to the relationships among the constraints
-- we simply identify them. An easy way of separating the metrics from
the constraints is to ask business users to tell us the questions they will
be asking and then dissect the response. A sample response is that a
user needs to see monthly sales quantity and dollars by region, product,
customer group, and salesperson. The things the user wants to see are
the measures, and the way he or she wants to see them, as indicated by
the parameters following the word "by," are the constraints.
Step 4: Ensure that the dimensions have good keys. The key of the
dimension table usually becomes part of the key of the fact table. To
perform this role efficiently, it needs to obey the rules of good keys,
and it needs to be relatively short. Since we are pulling data from the
data warehouse, the first criterion is usually already met. If the key is
too long, it may be advisable to replace it with a system-generated
(surrogate) key.
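A minimal sketch of the system-generated (surrogate) key idea, in plain Python with hypothetical natural keys: each long natural key is assigned the next short integer, and that integer is what the fact table would reference.

```python
# Hypothetical example: map long natural keys to short surrogate keys.
surrogate_map = {}

def surrogate_for(natural_key):
    """Return the surrogate key for a natural key, assigning the next integer if it is new."""
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = len(surrogate_map) + 1
    return surrogate_map[natural_key]

print(surrogate_for("CUST-000123-NORTHEAST"))  # 1
print(surrogate_for("CUST-000456-SOUTHWEST"))  # 2
print(surrogate_for("CUST-000123-NORTHEAST"))  # 1 again -- same customer, same key
```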
DATA ACQUISITION
Data Capture
The first step of this process is capture. During the capture process, we
get to determine which systems will be used for which data, understand
those systems, and extract that data from them.
Before we can pull data out of a source system, we must choose the
system to be used. Sometimes the decision is easy -- there's only one
universally accepted major data source. More often, however, we must
choose from among several candidates. In making this selection, the
following criteria should be considered:
These are all rational reasons for selecting a particular source system.
There is another factor that should also be considered: politics. Selection
of the source system may be impacted by the faith the users will have in
the data, and some users may have preconceived notions concerning
the viability of some of the source systems.
• A field that was included in the system may no longer be needed, and when a new
field was needed, the programmer reused the existing field without changing the field name
(or the documentation).
• The original field only applies in some cases (e.g., residential customers), and the
programmer used the field to mean something else for other cases (e.g., commercial
customer).
• The original field (e.g., work order number) did not apply to a particular group, and
the programmer used the field to mean something else (e.g., vehicle number).
Once the definition of the field is known, the analyst needs to examine the
quality of the data with respect to its accuracy and completeness. The
accuracy examination entails looking at each individual field and
examining field dependencies. For example, if one field indicates an
insurance claim for pregnancy, the gender field should be "female."
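A minimal sketch of that kind of field-dependency check, in plain Python with a hypothetical record layout (the field names are illustrative only):

```python
# Hypothetical claim records extracted from a source system.
claims = [
    {"claim_id": 1, "diagnosis": "pregnancy", "gender": "F"},
    {"claim_id": 2, "diagnosis": "pregnancy", "gender": "M"},  # violates the dependency
    {"claim_id": 3, "diagnosis": "fracture",  "gender": "M"},
]

def dependency_errors(records):
    """Flag records where a pregnancy claim is not associated with a female patient."""
    return [r for r in records
            if r["diagnosis"] == "pregnancy" and r["gender"] != "F"]

for bad in dependency_errors(claims):
    print("Field dependency violation:", bad)
```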
Data Extraction
For the initial load of the data warehouse, we need to look at all the data
in the appropriate source systems. After the initial load, our processing
time and cost are significantly reduced if we can readily identify the
data that has changed. We can then restrict our processes to that data.
There are six basic methods for capturing data changes.
Applying a changed data capture technique can improve the data capture
process efficiency, but it is not always practical. It is, however,
something that needs to be researched in designing the data capture
logic.
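The six methods are not listed here, but one common changed-data-capture technique is snapshot comparison: keep the previous extract and diff it against the current one. A minimal sketch in plain Python, assuming each record carries a stable business key:

```python
# Hypothetical extracts keyed by a stable business key (customer id -> attributes).
yesterday = {"C1": ("Smith", "Denver"), "C2": ("Jones", "Austin")}
today     = {"C1": ("Smith", "Boulder"), "C2": ("Jones", "Austin"), "C3": ("Lee", "Reno")}

inserted = {k: v for k, v in today.items() if k not in yesterday}
changed  = {k: v for k, v in today.items() if k in yesterday and yesterday[k] != v}
deleted  = {k: v for k, v in yesterday.items() if k not in today}

print("inserted:", inserted)   # {'C3': ('Lee', 'Reno')}
print("changed: ", changed)    # {'C1': ('Smith', 'Boulder')}
print("deleted: ", deleted)    # {}
```

Only the inserted and changed rows then need to flow through the rest of the acquisition process.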
Cleansing
Once the data quality expectations are set, we need to use data cleansing
tools or develop algorithms to attain that quality level. One aspect of
quality that is not specific to the individual systems deals with data
integration. This will be addressed in the next section.
If either of the last two options is selected, we will have a mismatch
between the data in the operational sources and the data warehouse.
Having this difference is not necessarily bad -- what's important is that
we recognize that the difference exists. If we correct an error, we also
need to recognize that nothing has changed in the business process or
source system that permitted the error to exist, nor in the source system
data that still contains it.
Integration
Within the data acquisition process, we may need to create a table that
relates the customer identifier in the source system with the customer
identifier in the data warehouse. Once the customers are integrated, we
can use this table to relate the customer in the source system to the
data warehouse instance.
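A minimal sketch of such a cross-reference table, in plain Python with hypothetical identifiers: each (source system, source customer id) pair is mapped to the single warehouse customer key.

```python
# Hypothetical cross-reference: (source system, source customer id) -> warehouse key.
customer_xref = {
    ("billing", "B-10045"):       501,
    ("orders",  "ORD-CUST-7788"): 501,   # same customer under a different source id
    ("billing", "B-10099"):       502,
}

def warehouse_customer(source_system, source_id):
    """Translate a source-system customer identifier into the warehouse customer key."""
    return customer_xref.get((source_system, source_id))

print(warehouse_customer("orders", "ORD-CUST-7788"))   # 501
print(warehouse_customer("billing", "B-10045"))        # 501 -- the same warehouse customer
```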
Transformation
The coding structures may differ among the source systems, and these
need to be transformed into a single structure for the data warehouse.
Also, the physical representation of the data may differ, and again, a
single approach is needed. These are two examples of data
transformation. In the first instance, the
Loading
The last step of the data acquisition process is the load. During this step,
the data is physically moved into the data warehouse and is available
for subsequent dissemination to the data marts. The data warehouse
load is a batch process and, with rare exception, consists of record
insertions. Because history is retained in the data warehouse, each time
changed data is brought in, it is appended as a new record rather than
overwriting the existing one.
Some factors to consider in designing the load process include the use of
a staging area to prepare the data for the load, making a backup copy of
the data being loaded, determining the sequence with which each of the
sources needs to be loaded, and within that, determining the sequence
in which the data itself needs to be loaded.
Data Warehouse Interview Questions
1. Can two fact tables share the same dimension tables? How many dimension tables are
associated with one fact table in your project?
Ans: Yes.
2. What are ROLAP, MOLAP, and DOLAP?
Ans: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), and DOLAP
(Desktop OLAP). In these three OLAP
architectures, the interface to the analytic layer is typically the same; what is quite
different is how the data is physically stored.
In ROLAP, the premise is that the data should be stored in the relational model; that is,
OLAP capabilities are best provided directly against the relational database.
In MOLAP, the premise is that online analytical processing is best implemented by
storing the data multidimensionally; that is,
data must be stored multidimensionally in order to be viewed in a multidimensional
manner.
DOLAP, is a variation that exists to provide portability for the OLAP user. It creates
multidimensional datasets that can be
transferred from server to desktop, requiring only the DOLAP software to exist on the
target system. This provides significant
advantages to portable computer users, such as salespeople who are frequently on the
road and do not have direct access to
their office server.
3. What is an MDDB, and what is the difference between MDDBs and RDBMSs?
Ans: MDDB stands for Multidimensional Database. There are two primary technologies used
for storing the data in OLAP applications: multidimensional databases (MDDB) and
relational databases (RDBMS). The major difference between MDDBs and RDBMSs is in how
they store data. Relational databases store their data in a series of tables and columns.
Multidimensional databases, on the other hand, store their data in large
multidimensional arrays.
For example, in an MDDB world, you might refer to a sales figure as Sales with Date,
Product, and Location coordinates of
12-1-2001, Car, and south, respectively.
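A minimal sketch of that array idea, in plain Python with NumPy and hypothetical dimension members: the sales figure is addressed directly by its Date, Product, and Location coordinates instead of being searched for in table rows.

```python
import numpy as np

# Hypothetical dimension members; a member's position in its list is its coordinate.
dates     = ["12-1-2001", "12-2-2001"]
products  = ["Car", "Truck"]
locations = ["north", "south"]

# One cell for every (date, product, location) combination.
sales = np.zeros((len(dates), len(products), len(locations)))
sales[dates.index("12-1-2001"), products.index("Car"), locations.index("south")] = 15000.0

# Retrieval is a single array lookup -- no scan, no join.
print(sales[dates.index("12-1-2001"), products.index("Car"), locations.index("south")])
```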
Advantages of MDDB:
Retrieval is very fast because
1 The data corresponding to any combination of dimension members can be retrieved with a
single I/O.
2 Data is clustered compactly in a multidimensional array.
3 Values are calculated ahead of time.
4 The index is small and can therefore usually reside completely in memory.
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to
many people. Here, we will use the terms OLAP and star schema pretty much
interchangeably, and we will assume that a star schema database is an OLAP system. (This is
not the same thing that Microsoft calls OLAP; they extend OLAP to mean the cube
structures built using their product, OLAP Services.) Here, we will assume that any system
of read-only, historical, aggregated data is an OLAP system.
A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost
always used to support decision-making in the organization. That is why many data
warehouses are considered to be DSS (Decision-Support Systems).
Both a data warehouse and a data mart are storage mechanisms for read-only, historical,
aggregated data.
By read-only, we mean that the person looking at the data won’t be changing it. If a user
looks at yesterday’s sales for a certain product, they should not have the ability to change
that number.
The “historical” part may be just a few minutes old, but usually it is at least a day old. A data
warehouse usually holds data that goes back a certain period in time, such as five years. In
contrast, standard OLTP systems usually only hold data as long as it is “current” or active.
An order table, for example, may move orders to an archive table once they have been
completed, shipped, and received by the customer.
When we say that data warehouses and data marts hold aggregated data, we need to stress
that there are many levels of aggregation in a typical data warehouse.
8. If the data source is in the form of an Excel spreadsheet, how do you use it?
Ans: PowerMart and PowerCenter treat a Microsoft Excel source as a relational database,
not a flat file. Like relational sources, the Designer uses ODBC to import a Microsoft
Excel source. You do not need database permissions to import Microsoft Excel sources.
To import an Excel source definition, you need to complete the following tasks:
1 Install the Microsoft Excel ODBC driver on your system.
2 Create a Microsoft Excel ODBC data source for each source file in the ODBC 32-bit
Administrator.
3 Prepare Microsoft Excel spreadsheets by defining ranges and formatting columns of
numeric data.
4 Import the source definitions in the Designer.
Once you define ranges and format cells, you can import the ranges in the Designer. Ranges
display as source definitions
when you import the source.
10. What are the modules/tools in Business Objects? Explain their purpose briefly.
Ans: BO Designer, Business Query for Excel, BO Reporter, InfoView, Explorer,
WebIntelligence (WebI), BO Publisher, Broadcast Agent, and BO ZABO.
InfoView: IT portal entry into WebIntelligence & Business Objects.
Base module required for all options to view and refresh reports.
Reporter: Upgrade to create/modify reports on LAN or Web.
Explorer: Upgrade to perform OLAP processing on LAN or Web.
Designer: Creates semantic layer between user and database.
Supervisor: Administer and control access for group of users.
WebIntelligence: Integrated query, reporting, and OLAP analysis over the Web.
Broadcast Agent: Used to schedule, run, publish, push, and broadcast pre-built reports
and spreadsheets, including event notification and response capabilities, event
filtering, and calendar-based notification, over the LAN, e-mail, pager, fax,
Personal Digital Assistant (PDA), Short Messaging Service (SMS), etc.
Set Analyzer: Applies set-based analysis to perform functions such as exclusions,
intersections, unions, and overlaps visually.
Developer Suite: Builds packaged, analytical, or customized apps.
11. What are Ad hoc queries and Canned queries/reports, and how do you create them?
Ans: The data warehouse will contain two types of query. There will be fixed queries that
are clearly defined and well understood, such as regular reports, canned queries
(standard reports) and common aggregations. There will also be ad hoc queries that are
unpredictable, both in quantity and frequency.
Ad Hoc Query: Ad hoc queries are the starting point for any analysis into a database. Any
business analyst wants to know what is inside the database. He then proceeds by
calculating totals, averages, maximum and minimum values for most attributes within the
database. These are the unpredictable elements of a data warehouse. It is exactly the ability to
run any query when desired and expect a reasonable response that makes the data warehouse
worthwhile, and makes the design such a significant challenge.
The end-user access tools are capable of automatically generating the database query that
answers any Question posed by the user. The user will typically pose questions in terms
that they are familiar with (for example, sales by store last week); this is converted into
the database query by the access tool, which is aware of the structure of information within
the data warehouse.
Canned queries: Canned queries are predefined queries. In most instances, canned queries
contain prompts that allow you to customize the query for your specific needs. For
example, a prompt may ask you for a School, department, term, or section ID. In this
instance you would enter the name of the school, department, or term, and the query will
retrieve the specified data from the warehouse. You can measure resource requirements of
these queries, and the results can be used for capacity planning and for database design.
The main reason for using a canned query or report rather than creating your own is that your
chances of misinterpreting data or getting the wrong answer are reduced. You are assured
of getting the right data and the right answer.
12. How many fact tables and how many dimension tables did you use? Which table
precedes which?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
13. What is the difference between STAR SCHEMA & SNOW FLAKE SCHEMA?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
14. Why did you choose a STAR SCHEMA only? What are the benefits of a STAR SCHEMA?
Ans: Because of its denormalized structure, i.e., the dimension tables are denormalized.
Why denormalize? The first (and often only) answer is speed. An OLTP structure is designed
for data inserts, updates, and deletes, but not for data retrieval. Therefore, we can often
squeeze some speed out of it by denormalizing some of the tables and having queries go
against fewer tables. These queries are faster because they perform fewer joins to retrieve
the same recordset. Joins are also confusing to many end users. By denormalizing, we can
present the user with a view of the data that is far easier for them to understand.
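A rough illustration of the "fewer joins" point, using a hypothetical sales schema (the queries below are sketches held in Python strings, not taken from any particular project): the star-schema version touches the fact table plus one denormalized dimension, while the fully normalized version must walk a chain of tables to answer the same question.

```python
# Illustrative only: total sales by product category, hypothetical schema.

star_schema_query = """
SELECT d.category, SUM(f.sales_amount)
FROM   sales_fact  f
JOIN   product_dim d ON d.product_key = f.product_key   -- one join to a denormalized dimension
GROUP BY d.category
"""

normalized_query = """
SELECT c.category_name, SUM(ol.quantity * ol.unit_price)
FROM   order_line  ol
JOIN   orders      o ON o.order_id    = ol.order_id
JOIN   product     p ON p.product_id  = ol.product_id
JOIN   subcategory s ON s.subcat_id   = p.subcat_id
JOIN   category    c ON c.category_id = s.category_id    -- four joins for the same answer
GROUP BY c.category_name
"""
```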
16. (i) What is FTP? (ii) How do you connect to a remote host? (iii) Is there another way to
use FTP without a special utility?
Ans: (i): The FTP (File Transfer Protocol) utility program is commonly used for copying
files to and from other computers. These computers may be at the same site or at different
sites thousands of miles apart. FTP is a general protocol that works on UNIX systems as
well as other non-UNIX systems.
(iii): Yes. If you are using Windows, you can access a text-based FTP utility from a DOS
prompt.
To do this, perform the following steps:
1. From the Start menu, choose Programs, then MS-DOS Prompt.
2. Enter “ftp ftp.geocities.com.” A prompt will appear.
(or)
Enter ftp to get the ftp prompt, then ftp> open hostname, e.g. ftp> open ftp.geocities.com
(this connects to the specified host).
3. Enter your Yahoo! GeoCities member name.
4. Enter your Yahoo! GeoCities password.
You can now use standard FTP commands to manage the files in your Yahoo! GeoCities
directory.
You cannot concatenate ports from more than one transformation into the Filter
transformation; the input ports for the filter
must come from a single transformation. Filter transformations exist within the flow of
the mapping and cannot be
unconnected. The Filter transformation does not allow setting output default
values.
20. When do you create the Source Definition? Can I use this Source Definition with any
transformation?
Ans: When working with a file that contains fixed-width binary data, you must create
the source definition.
The Designer displays the source definition as a table, consisting of names, datatypes,
and constraints. To use a source
definition in a mapping, connect a source definition to a Source Qualifier or
Normalizer transformation. The Informatica
Server uses these transformations to read the source data.
Active transformations that might change the record count include the following:
1 Advanced External Procedure
2 Aggregator
3 Filter
4 Joiner
5 Normalizer
6 Rank
7 Source Qualifier
Note: If you use PowerConnect to access ERP sources, the ERP Source Qualifier is
also an active transformation.
Note: You can connect only one of these active transformations to the same
transformation or target, since the Informatica Server cannot determine how to
concatenate data from different sets of records with different numbers of rows.
Passive transformations that never change the record count include the following:
1 Lookup
2 Expression
3 External Procedure
4 Sequence Generator
5 Stored Procedure
6 Update Strategy
You can connect any number of these passive transformations, or connect one active
transformation with any number of
passive transformations, to the same transformation or target.
25. What tasks are performed by the Informatica Server?
Ans: The Informatica Server performs the following tasks:
1 Manages the scheduling and execution of sessions and batches
2 Executes sessions and batches
3 Verifies permissions and privileges
4 Interacts with the Server Manager and pmcmd.
The Informatica Server moves data from sources to targets based on metadata stored in a
repository. For instructions on how to move and transform data, the Informatica Server
reads a mapping (a type of metadata that includes transformations and source and target
definitions). Each mapping uses a session to define additional information and to optionally
override mapping-level options. You can group multiple sessions to run as a single unit,
known as a batch.
26. What are the two programs that communicate with the Informatica Server?
Ans: Informatica provides Server Manager and pmcmd programs to communicate with the
Informatica Server:
Server Manager. A client application used to create and manage sessions and batches, and
to monitor and stop the Informatica Server. You can use information provided through the
Server Manager to troubleshoot sessions and improve session performance.
pmcmd. A command-line program that allows you to start and stop sessions and batches,
stop the Informatica Server, and verify if the Informatica Server is running.
27. When do you reinitialize the Aggregate Cache?
Ans: Reinitializing the aggregate cache overwrites historical aggregate data with new
aggregate data. When you reinitialize the aggregate cache, instead of using the captured
changes in source tables, you typically need to use the entire source table.
For example, you can reinitialize the aggregate cache if the source for a session
changes incrementally every day and
completely changes once a month. When you receive the new monthly source, you
might configure the session to reinitialize
the aggregate cache, truncate the existing target, and use the new source table during
the session.
The Informatica Server creates a new aggregate cache, overwriting the existing aggregate
cache.
28. (ii) What are the minimum conditions you need to meet in order to use the Target Load
Order option in the Designer?
Ans: You need to have multiple Source Qualifier transformations.
To specify the order in which the Informatica Server sends data to targets, create one
Source Qualifier or Normalizer
transformation for each target within a mapping. To set the target load order, you then
determine the order in which each
Source Qualifier sends data to connected targets in the mapping.
When a mapping includes a Joiner transformation, the Informatica Server sends all
records to targets connected to that
Joiner at the same time, regardless of the target load order.
Note: The Designer allows you to work with multiple tools at one time. You can also work
in multiple folders and repositories.
To add a slight performance boost, you can also set the tracing level to Terse, writing the
minimum of detail to the session log
when running a session containing the transformation.
31(i). What is the difference between a database, a data warehouse and a data mart?
Ans: -- A database is an organized collection of information.
-- A data warehouse is a very large database with special sets of tools to extract and
cleanse data from operational systems
and to analyze data.
-- A data mart is a focused subset of a data warehouse that deals with a single area of
data and is organized for quick
analysis.
32. What are a Data Mart, a Data Warehouse and a Decision Support System? Explain briefly.
Ans: Data Mart:
A data mart is a repository of data gathered from operational data and other sources that is
designed to serve a particular
community of knowledge workers. In scope, the data may derive from an enterprise-wide
database or data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in terms of analysis,
content, presentation, and ease-of-use. Users of a data mart can expect to have data
presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the
other in some form. However, most writers using the term seem to agree that the design of
a data mart tends to start from an analysis of user needs and that a data warehouse
tends to start from an analysis of what data already exists and how it can be collected
in such a way that the data can later be used. A data warehouse is a central aggregation
of data (which can be distributed physically); a data mart is a data repository that may
derive from a data warehouse or not and that emphasizes ease of access and usability for a
particular designed purpose. In general, a data warehouse tends to be a strategic but
somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an
immediate need.
Data Warehouse:
A data warehouse is a central repository for all or significant parts of the data that an
enterprise's various business systems collect. The term was coined by W. H. Inmon. IBM
sometimes uses the term "information warehouse."
Typically, a data warehouse is housed on an enterprise mainframe server. Data from various
online transaction processing (OLTP) applications and other sources is selectively
extracted and organized on the data warehouse database for use by analytical applications
and user queries. Data warehousing emphasizes the capture of data from diverse sources
for useful analysis and access, but does not generally start from the point-of-view of the
end user or knowledge worker who may need access to specialized, sometimes local
databases. The latter idea is known as the data mart.
Data mining, Web mining, and decision support systems (DSS) are three kinds of
applications that can make use of a data warehouse.
Typical information that a decision support application might gather and present
would be:
Comparative sales figures between one week and the next
Projected revenue figures based on new product sales assumptions
The consequences of different decision alternatives, given past experience in a context that is
described
A decision support system may present information graphically and may include an expert
system or artificial intelligence (AI). It may be aimed at business executives or some other
group of knowledge workers.
34. How do you use DDL commands in a PL/SQL block, e.g., accept a table name from the user and
drop it if it exists, else display a message?
Ans: To invoke DDL commands in PL/SQL blocks we have to use dynamic SQL; the
package used is DBMS_SQL.
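A minimal sketch of this approach, assuming DBMS_OUTPUT is enabled and the table name is
supplied through a SQL*Plus substitution variable (the variable name and messages are
illustrative; on Oracle 8i and later, EXECUTE IMMEDIATE could be used instead of DBMS_SQL):

DECLARE
   v_table   VARCHAR2(30) := UPPER('&table_name');  -- table name accepted from the user
   v_count   NUMBER;
   v_cursor  INTEGER;
BEGIN
   -- check whether the table exists in the current schema
   SELECT COUNT(*) INTO v_count
     FROM user_tables
    WHERE table_name = v_table;

   IF v_count = 0 THEN
      DBMS_OUTPUT.PUT_LINE('Table ' || v_table || ' does not exist.');
   ELSE
      -- DDL parsed through DBMS_SQL is executed at parse time
      v_cursor := DBMS_SQL.OPEN_CURSOR;
      DBMS_SQL.PARSE(v_cursor, 'DROP TABLE ' || v_table, DBMS_SQL.NATIVE);
      DBMS_SQL.CLOSE_CURSOR(v_cursor);
      DBMS_OUTPUT.PUT_LINE('Table ' || v_table || ' dropped.');
   END IF;
END;
/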
36. Which package and procedure are used to find/check the free space available for database
objects like tables/procedures/views/synonyms, etc.?
Ans: The Package is DBMS_SPACE
The Procedure is UNUSED_SPACE
The Table is DBA_OBJECTS
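A minimal sketch of a call to DBMS_SPACE.UNUSED_SPACE (the owner, segment name and segment
type are illustrative; the OUT variables receive the space figures, and the exact signature
should be checked for your Oracle release):

DECLARE
   v_total_blocks   NUMBER;
   v_total_bytes    NUMBER;
   v_unused_blocks  NUMBER;
   v_unused_bytes   NUMBER;
   v_last_file_id   NUMBER;
   v_last_block_id  NUMBER;
   v_last_block     NUMBER;
BEGIN
   -- owner, segment name and segment type, followed by the OUT space figures
   DBMS_SPACE.UNUSED_SPACE('SCOTT', 'EMP', 'TABLE',
                           v_total_blocks, v_total_bytes,
                           v_unused_blocks, v_unused_bytes,
                           v_last_file_id, v_last_block_id, v_last_block);

   DBMS_OUTPUT.PUT_LINE('Unused bytes in SCOTT.EMP: ' || v_unused_bytes);
END;
/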
37. Does Informatica allow the load if EmpId is the primary key in the target table and the
source data has 2 rows with the same EmpId? If you use a lookup for the same situation, does it
allow loading 2 rows or only 1?
Ans: => No, it will not; it generates a primary key constraint violation (it loads 1 row).
=> Even with a lookup, no, if EmpId is the primary key.
38. If Ename is varchar2(40) from one source (Siebel) and Ename is char(100) from another
source (Oracle), and the target has Name varchar2(50), how does Informatica handle this
situation? How does Informatica handle string and number datatypes from sources?
41(ii). What are the differences between Connected Lookups and Unconnected Lookups?
Ans: Although both types of lookups perform the same basic task, there are some
important differences:
Connected Lookup:
- Part of the mapping data flow.
- Can return multiple values from the same row.
- You link the lookup/output ports to another transformation.
- Supports default values. If there is no match for the lookup condition, the server returns
  the default value for all output ports.
- More visible: shows the data passing in and out of the lookup.
- Cache includes all lookup columns used in the mapping (that is, lookup table columns
  included in the lookup condition and lookup table columns linked as output ports to other
  transformations).
Unconnected Lookup:
- Separate from the mapping data flow.
- Returns one value from each row.
- You designate the return value with the Return port (R).
- Does not support default values. If there is no match for the lookup condition, the server
  returns NULL.
- Less visible: you write an expression using :LKP to tell the server when to perform the
  lookup.
- Cache includes the lookup/output ports in the lookup condition and the lookup/return port.
**************************
TO QUERY THE PLAN TABLE :-
**************************
SQL> SELECT RTRIM(ID) || ' ' ||
            LPAD(' ', 2*(LEVEL-1)) || OPERATION
            || ' ' || OPTIONS
            || ' ' || OBJECT_NAME STEP_DESCRIPTION
       FROM PLAN_TABLE
      START WITH ID = 0 AND STATEMENT_ID = 'PKAR02'
    CONNECT BY PRIOR ID = PARENT_ID
        AND STATEMENT_ID = 'PKAR02'
      ORDER BY ID;
STEP_DESCRIPTION
----------------------------------------------------
0 SELECT STATEMENT
1 FILTER
2 SORT GROUP BY
3 TABLE ACCESS FULL EMP
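For context, PLAN_TABLE is populated beforehand by running EXPLAIN PLAN with the same
STATEMENT_ID. The original statement is not shown; a hypothetical query against the EMP demo
table that would yield a plan like the one above is:

EXPLAIN PLAN
   SET STATEMENT_ID = 'PKAR02'
   INTO PLAN_TABLE
   FOR
   SELECT deptno, SUM(sal)
     FROM emp
    GROUP BY deptno
   HAVING SUM(sal) > 10000;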
==============================================================
To copy a mapping from a non-shared folder, drag and drop the mapping into the
destination folder.
In both cases, the destination folder must be open with the related tool active.
For example, to copy a mapping, the Mapping Designer must be active. To copy a Source
Definition, the Source Analyzer must be active.
Copying Mapping:
1 To copy the mapping, open a workbook.
2 In the Navigator, click and drag the mapping slightly to the right, not dragging it to the
workbook.
3 When asked if you want to make a copy, click Yes, then enter a new name and click OK.
4 Choose Repository-Save.
Repository Copying: You can copy a repository from one database to another. You
use this feature before upgrading, to
preserve the original repository. Copying repositories provides a quick way to copy all
metadata you want to use as a basis for
a new repository.
If the database into which you plan to copy the repository contains an existing repository, the
Repository Manager deletes the existing repository. If you want to preserve the old
repository, cancel the copy. Then back up the existing repository before copying the new
repository.
To copy a repository, you must have one of the following privileges:
1 Administer Repository privilege
2 Super User privilege
To copy a repository:
1. In the Repository Manager, choose Repository-Copy Repository.
2. Select a repository you wish to copy, then enter the following information:
If you are not connected to the repository you want to copy, the Repository Manager
asks you to log in.
3. Click OK.
5. If asked whether you want to delete existing repository data in the second
repository, click OK to delete it. Click Cancel to preserve the existing repository.
Copying Sessions:
In the Server Manager, you can copy stand-alone sessions within a folder, or copy sessions in
and out of batches.
To copy a session, you must have one of the following:
1 Create Sessions and Batches privilege with read and write permission
2 Super User privilege
To copy a session:
1. In the Server Manager, select the session you wish to copy.
2. Click the Copy Session button or choose Operations-Copy Session.
The Server Manager makes a copy of the session. The Informatica Server names the copy
after the original session, appending a number, such as session_name1.
When the object the shortcut references changes, the shortcut inherits those changes.
By using a shortcut instead of a copy,
you ensure each use of the shortcut exactly matches the original object. For example, if
you have a shortcut to a target
definition, and you add a column to the definition, the shortcut automatically inherits the
additional column.
Shortcuts allow you to reuse an object without creating multiple objects in the
repository. For example, you use a source
definition in ten mappings in ten different folders. Instead of creating 10 copies of the
same source definition, one in each
folder, you can create 10 shortcuts to the original source definition.
You can create shortcuts to objects in shared folders. If you try to create a shortcut to a
non-shared folder, the Designer
creates a copy of the object instead.
The status of the shell command, whether it completed successfully or failed, appears in
the session log file.
To call a pre- or post-session shell command you must:
1. Use any valid UNIX command or shell script for UNIX servers, or any valid DOS
or batch file for Windows NT servers.
2. Configure the session to execute the pre- or post-session shell commands.
You can configure a session to stop if the Informatica Server encounters an error while
executing pre-session shell commands.
For example, you might use a shell command to copy a file from one directory to another.
For a Windows NT server you would use the following shell command to copy the
SALES_ADJ file from the target directory, L, to the source, H:
copy L:\sales\sales_adj H:\marketing\
For a UNIX server, you would use the following command line to perform a similar
operation:
cp sales/sales_adj marketing/
Tip: Each shell command runs in the same environment (UNIX or Windows NT) as the
Informatica Server. Environment settings in one shell command script do not carry over to
other scripts. To run all shell commands in the same environment, call a single shell script
that in turn invokes other scripts.
Maintaining different versions lets you revert to earlier work when needed. By
archiving the contents of a folder into a version each time you reach a development
landmark, you can access those versions if later edits prove unsuccessful.
You create a folder version after completing a version of a difficult mapping, then
continue working on the mapping. If you are unhappy with the results of subsequent work,
you can revert to the previous version, then create a new version to continue development.
Thus you keep the landmark version intact, but available for regression.
Note: You can only work within one version of a folder at a time.
50. How do you automate/schedule sessions/batches, and did you use any tool for automating
sessions/batches?
Ans: We scheduled our sessions/batches using the Server Manager.
You can either schedule a session to run at a given time or interval, or you can
manually start the session.
You need the Create Sessions and Batches privilege with Read and Execute permissions, or the
Super User privilege.
If you configure a batch to run only on demand, you cannot schedule it.
51. What are the differences between 4.7 and 5.1 versions?
Ans: New transformations were added, such as the XML transformation and the MQ Series
transformation, and PowerMart and PowerCenter are the same from version 5.1.
52. What are the procedures that you need to follow before moving mappings/sessions from
Testing/Development to Production?
Ans:
53. How many values does it (the Informatica Server) return when it passes through a Connected
Lookup and an Unconnected Lookup?
Ans: A Connected Lookup can return multiple values, whereas an Unconnected Lookup will
return only one value, the Return value.
In each transformation, you use the Expression Editor to enter the expression. The
Expression Editor supports the transformation language for building expressions. The
transformation language uses SQL-like functions, operators, and other components to build
the expression. For example, as in SQL, the transformation language includes the functions
COUNT and SUM. However, the PowerMart/PowerCenter transformation language
includes additional functions not found in SQL.
When you enter the expression, you can use values available through ports. For example, if
the transformation has two input ports representing a price and sales tax rate, you can
calculate the final sales tax using these two values. The ports used in the expression can
appear in the same transformation, or you can use output ports in other transformations.
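A minimal sketch of such an expression, assuming hypothetical input ports PRICE and TAX_RATE
and an output port TOTAL_TAX in an Expression transformation (IIF and ISNULL are functions of
the transformation language):

   IIF(ISNULL(TAX_RATE), 0, PRICE * TAX_RATE)

The expression is entered against the TOTAL_TAX output port in the Expression Editor.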
57. In the case of a flat file source (which comes through FTP), what happens if it has not
arrived? Where do you set this option?
Ans: You get a fatal error which causes the server to fail/stop the session.
You can set the Event-Based Scheduling option in Session Properties under the General tab -->
Advanced Options.
Option: Indicator File to Wait For (Optional)
Description: Required to use event-based scheduling. Enter the indicator file (or directory and
file) whose arrival schedules the session. If you do not enter a directory, the Informatica
Server assumes the file appears in the server variable directory $PMRootDir.
58. What is the Test Load option and when do you use it in the Server Manager?
Ans: When testing sessions in development, you may not need to process the entire source.
If this is true, use the Test Load option (Session Properties > General tab > Target Options:
choose the Target Load option Normal (option button), with Test Load checked (check box) and
the number of rows to test, e.g. 2000 (text box)).
You can also click the Start button.
64. What will happen if you increase the commit interval? And if you decrease the commit
interval?
65. What kind of complex mappings did you do? And what sort of problems did you face?
67. Can you refresh the Repository in 4.7 and 5.1? And can you refresh pieces (parts) of the
repository in 4.7 and 5.1?
70. BI FAQ
Ans: http://www.visionnet.com/bi/bi-faq.shtml
Repository. The place where you store the metadata is called a repository. The more
sophisticated your repository, the more
complex and detailed metadata you can store in it. PowerMart and PowerCenter use a
relational database as the
repository.
76. What is the filename which you need to configure in Unix while installing Informatica?
77. How do you select duplicate rows using Informatica, i.e., how do you use
Max(Rowid)/Min(Rowid) in Informatica?
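No answer is recorded above; a common approach (a sketch only, typically entered as a SQL
override in the Source Qualifier, with EMP/EMPNO as illustrative table and key names) uses
ROWID in a correlated subquery:

-- keep exactly one row per EMPNO (suppress duplicates)
SELECT a.empno, a.ename, a.sal
  FROM emp a
 WHERE a.rowid = (SELECT MAX(b.rowid)
                    FROM emp b
                   WHERE b.empno = a.empno);

-- return only the duplicate rows
SELECT a.empno, a.ename, a.sal
  FROM emp a
 WHERE a.rowid <> (SELECT MIN(b.rowid)
                     FROM emp b
                    WHERE b.empno = a.empno);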
INFORMATICA QUESTIONS
----------------------------------------------------------
1. What are active and passive transformations?
Active transformations: the number of records that comes in is not the same as the number of
records output (e.g. Filter, Aggregator, Joiner, Normalizer, Router, Rank, Source Qualifier
(if a filter condition is used), Update Strategy).
Passive transformations: the number of records output is the same as the number of records
input (e.g. Expression, Lookup (connected and unconnected), Input, Output, Sequence Generator,
Stored Procedure, XML Source Qualifier).
4. What are the various caches available for the Lookup transformation? Explain them.
The caches available for the Lookup transformation are Static, Persistent and Dynamic; a
lookup can also be uncached.
Static: By default the server creates a static cache. It builds the cache when it processes
   the first lookup request and queries the cache based on the lookup condition.
   If the lookup is connected it returns the values through the lookup/output ports;
   if it is unconnected it returns the value through the return port.
Persistent: By default Informatica uses a non-persistent cache. Enable the persistent
   cache when required; the lkp.cache.enable property must be checked to use a
   persistent cache. Normally the server creates a cache file and deletes it at the
   end of a session, and for consecutive sessions it rebuilds the cache files,
   whereas when persistent cache is enabled the cache file is saved to disk and
   reused for consecutive sessions by building the memory cache from the saved
   cache files. Enable "recache from database" if the lookup table has been changed.
Dynamic: When the target table is also the lookup table, a dynamic lookup cache is used.
   Like the static cache, the dynamic lookup cache builds the cache; when the server
   receives a new row (a row that is not in the cache) it inserts the row into
   the cache. If it is an existing row (a row that is in the cache) it flags the row as
   existing and does not insert the row into the cache.
6.How do you set DTM memory parameters like Default Buffer block size,
Idx Cache size, Dat Cache size and also about source and target based commit?
7.What do you mean by event-based scheduling? What are the uses of Indicator file?
In 'event-based scheduling', the session gets started as soon as the mentioned
indicator file appears in the said directory local to the Informatica server. Note that
the file will be automatically deleted once the session starts. Keep in mind that the
session has to be either manually started or scheduled. But the session actually kicks
off only after the indicator file appears; till then the session will be in 'file wait'.
8. What do you mean by tracing level, and what are its types?
The amount of detail the server writes to the session log file during execution is
called the tracing level. The server writes row errors to the session log, including the
transformation in which the error occurred and the complete row data.
The levels are Terse, Normal, Verbose init and Verbose data.
Terse: the server writes initialization information, error messages and notification of
rejected data.
Normal: the server writes initialization and status information, errors encountered and
rows skipped due to transformation row errors, and a summary of session results, but not at
the level of individual rows.
Verbose init: in addition to normal tracing, the server writes additional tracing details such
as the names of the index and data files used and detailed transformation statistics.
Verbose data: in addition to verbose init, the server writes additional tracing details for
each row that passes into the mapping, notes where the server truncates
string data to fit the precision of a column, and provides detailed transformation
statistics.
Differences between Lookup and Joiner:
Lookup: can use any operator (=, <, >, etc.). Joiner: can use only '='.
Lookup: is used to look up a table. Joiner: is used to join multiple tables.
Lookup: supports only relational sources. Joiner: supports heterogeneous sources.
Lookup: is a passive transformation. Joiner: is an active transformation.
Lookup: rejected rows are available (in the log file or reject file). Joiner: discarded rows
are not available.
11.Explain Surrogate Keys?
12.Explain SCD and its types?
There are 3 types of slowly changing dimensions.
Type 1: It overwrites the existing dimension and inserts the new dimension.
Type 2A (Version Data): Based on user-defined comparisons, both new and changed dimension
rows are inserted into the target. A changed dimension is tracked by versioning the primary
key and creating a version number for each dimension in the table. The highest version number
represents the current data of the record.
Type 2B (Flag Mapping): Here the target will have a field called PM_Current_Flag which will
have the value 0 or 1. The value 1 represents the current record.
Type 2C (Date Range): Here the target will have two fields, namely Pm_Begin_Date and
Pm_End_Date. For each new and changed row, the system date is inserted in Pm_Begin_Date to
represent the start of the effective date range. For each updated row, the server uses the
system date to update the previous row's Pm_End_Date to represent the end of its effective
date range. Each new row, and each newly inserted version of an existing row, will have a
null value in Pm_End_Date.
Type 3: Here it keeps only the current and previous versions of the column data in the
table. It maintains both values in the same row, in an additional field, namely
Pm_Previous_Value.
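A minimal SQL sketch of the Type 2C (Date Range) behaviour described above, assuming a
hypothetical CUSTOMER_DIM table with PM_BEGIN_DATE/PM_END_DATE columns and a surrogate key
sequence (in Informatica the equivalent logic is built with an Update Strategy and a Sequence
Generator rather than hand-written SQL):

-- expire the current version of a changed customer row
UPDATE customer_dim
   SET pm_end_date = SYSDATE
 WHERE customer_id = :in_customer_id
   AND pm_end_date IS NULL;

-- insert the new version with an open-ended date range
INSERT INTO customer_dim
   (customer_key, customer_id, customer_name, pm_begin_date, pm_end_date)
VALUES
   (customer_dim_seq.NEXTVAL, :in_customer_id, :in_customer_name, SYSDATE, NULL);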
At the session level, all rows are treated as Insert, Update or Delete depending on the
"Treat rows as" option selected. This has to match the target options.
In the target options, only one of Update as Update or Update as Insert can be selected;
Delete has to be used separately.
Here are a few:
1. Tell me about cubes
2. Full process or incremental
3. Are you good with data cleansing?
4. How do you handle changing dimensions?
5. What is a star-schema?
For Informatica:
Let them know which version you are familiar with as well as
what role. Informatica 7.x has divided the developer and
administrator roles.
For Erwin:
Other topics:
4) Hierarchy of DWH?
5) How many repositories can we create in Informatica?
6) What is surrogate key?
7) What is difference between Mapplet and reusable
transformation?
8) What is aggregate awareness?
9) Explain reference cursor?
10) What are parallel queries and query hints?
11) DWH architecture?
12) What are cursors?
13) Advantages of denormalized data?
14) What is operational data source (ODS)?
15) What is metadata and the system catalog?
16) What is factless fact schema?
17) What is a conformed dimension?
18) What is the capacity of power cube?
19) Difference between PowerPlay Transformer and PowerPlay
Reports?
20) What is IQD file?
21) What is Cognos script editor?
22) What is difference macros and prompts?
23) What is the PowerPlay plug-in?
24) Which kind of index is preferred in DWH?
25) What is hash partition?
26) What is DTM session?
27) How can you define a transformation? What are different
types of transformations in Informatica?
28) What is mapplet?
29) What is query panel?
30) What is a lookup function? What is the default
transformation for the lookup function?
31) What is the difference between a connected lookup and
an unconnected lookup?
32) What is staging area?
33) What is data merging, data cleansing and sampling?
34) What is update strategy and what are the options for
update strategy?
35) OLAP architecture?
36) What is subject area?
37) Why do we use DSS database for OLAP tools?
Business Objects FAQ:
38) What is a universe?
39) Analysis in business objects?
40) Who launches the supervisor product in BO for first time?
41) How can you check the universe?
42) What are universe parameters?
43) Types of universes in business objects?
44) What is security domain in BO?
45) Where will you find the address of repository in BO?
46) What is Broadcast Agent?
47) In BO 4.1 version what is the alternative name for
broadcast agent?
48) What services the broadcast agent offers on the server
side?
49) How can you access your repository with different user
profiles?
50) How many built-in objects are created in BO repository?
51) What are alerters in BO?
52) What are different types of saving options in web
intelligence?