Assignment Set- 1
1.(i) What do you understand by Business Intelligence System? What are the
different steps in order to deliver the Business Value through a BI System?
Ans. Business Intelligence (BI) is a generic term used to describe leveraging an organization's internal and
external data and information to make the best possible business decisions. The field of business
intelligence is very diverse and comprises the tools and technologies used to access and analyze
various types of business information. These tools gather and store the data and allow the user to view
and analyze the information from a wide variety of dimensions, thereby helping decision-makers
make better business decisions. Thus Business Intelligence (BI) systems and tools play a vital
role in enabling organizations to make improved decisions in the current cut-throat
competitive scenario.
In simple terms, Business Intelligence is an environment in which business users receive reliable,
consistent, meaningful and timely information. This data enables business users to conduct analyses
that yield an overall understanding of how the business has been performing, how it is performing now and
how it will perform in the near future. Also, BI tools monitor the financial and operational health of the
organization through the generation of various types of reports, alerts, alarms, key performance indicators
and dashboards.
Business intelligence tools are a type of application software designed to help in making better business
decisions. These tools aid in the analysis and presentation of data in a more meaningful way and so play
a key role in the strategic planning process of an organization. They provide business intelligence in
areas such as market research and segmentation, customer profiling, customer support, and profitability
analysis.
Various types of BI systems, viz. Decision Support Systems, Executive Information Systems (EIS),
Multidimensional Analysis software or OLAP (On-Line Analytical Processing) tools, and data mining tools,
are discussed further. Whatever the type, the business intelligence capability of a system is to
let its users slice and dice the information from their organization's numerous databases without having
to wait for their IT departments to develop complex queries and elicit answers.
Although it is possible to build BI systems without the benefit of a data warehouse, most such systems
are, in practice, an integral part of the user-facing end of the data warehouse. In fact, we can never think
of building a data warehouse without BI systems. That is the reason the words 'data warehouse' and
'business intelligence' are sometimes used interchangeably.
The manager of a BI system has to take care of the following steps in order to deliver the intended
business value:
Developing solid business sponsorship is the first step in starting a BI project. Your business sponsors
(it is generally good to have more than one) will take a lead role in determining the purpose, content, and
priorities of the system, and so the business sponsors are expected to have the following qualities:
1. Visionary – a sense for the value and potential of information, with clear, specific ideas as to how to apply it.
2. Resourceful – able to obtain the necessary resources and facilitate the organizational change that the BI system demands.
3. Reasonable – can temper enthusiasm with the understanding that a BI system takes time and resources to deliver.
This cannot be done unless the BI system development team understands business requirements at an
organizational level. Thus the process of understanding the organizational-level business requirements
is the next step in delivering business value.
The prioritization process is a planning meeting that involves the BI system development team, the
business sponsors, and other key senior managers across the organization. A Prioritization Grid can
be developed for the set of business processes identified in the previous step, plotting the feasibility
of each business process against the business value that the process is likely to generate, as sketched
below. Thus the output of this step is a prioritized list of business processes.
After getting a complete understanding of the business priorities, the BI system development team
revisits the project plan. The plan is now reworked based on the priority of the business processes
identified.
Based on the previous steps, the BI system development team now defines and documents the
project-level business requirements. These requirements act as guidelines while developing the
BI system.
(ii) Discuss the characteristics of a Data warehouse and analyze how these characteristics apply in
your organizations.
Ans. According to Bill Inmon, who is considered to be the father of data warehousing, the data in a
data warehouse has the following characteristics:
Subject oriented
The first feature of a DW is its orientation toward the major subjects of the organization instead of
applications. The subjects are categorized in such a way that the subject-wise collection of
information helps in decision-making. For example, the data in the data warehouse of an insurance
company can be organized by customer ID, customer name, premium, payment period, etc., rather
than by the individual applications that produced it.
Integrated
The data contained within the boundaries of the warehouse is integrated. This means that all
inconsistencies in the way the source applications encode data are resolved before the data enters the
data warehouse. For example, one application of an organization might code gender as
'm' and 'f' while another application might code the same attribute as '0' and '1'. If the data
is moved from the operational environment to the data warehouse environment without reconciling such
encodings, the result will be conflict.
Time variant
The data stored in a data warehouse is not just the current data. It is time-series data, as the data
warehouse is a place where data is accumulated periodically. This is in contrast to an
operational system, where the data in the databases is accurate as of the moment of access.
Non-volatile
The data in the data warehouse is non-volatile, which means the data is stored in a read-only format and
does not change over a period of time. This is the reason the data in a data warehouse forms a
stable, historical record of the business.
Keeping the above characteristics in view, a 'data warehouse' can be defined as a subject-oriented,
integrated, time-variant, and non-volatile collection of data that supports the decision-making
requirements of an organization.
A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. Typically, a data warehouse contains historical data derived from transaction
data, including data from various other data sources of an organization. The data in a data
warehouse has the following characteristics: subject-orientation, integration, non-volatility, and
time-variance. A minimal sketch of the time-variant, non-volatile behavior follows.
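This is a minimal Python sketch of the append-only, dated-snapshot behavior described above; the table layout, column names, and premium figures are hypothetical illustrations, not a prescribed design.

```python
from datetime import date

# A data warehouse table is append-only: each periodic load adds a
# snapshot tagged with its load date, and earlier rows are never updated.
warehouse_rows = []

def load_snapshot(operational_rows, snapshot_date):
    """Append today's operational data as a dated, read-only snapshot."""
    for row in operational_rows:
        warehouse_rows.append({**row, "snapshot_date": snapshot_date})

# Two monthly loads: the January premium remains queryable even after
# the operational system has been updated in February.
load_snapshot([{"customer_id": 101, "premium": 500.0}], date(2006, 1, 31))
load_snapshot([{"customer_id": 101, "premium": 525.0}], date(2006, 2, 28))

history = [r for r in warehouse_rows if r["customer_id"] == 101]
print(history)  # both rows retained -> time-variant and non-volatile
```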
2.(i) What do you understand by Data warehouse Meta Data? What is the use of
Metadata? How can you manage Metadata?
Ans. In simple terms, 'metadata' refers to "data about data." It is the information that describes, or
supplements, the main data. For example, the metadata of a digital camera includes the settings used for
the picture, such as exposure value or flash intensity. Here, metadata acts as additional information
and is not critical to the functions of the main data. In other cases, such as a Zip disk, metadata might
provide information regarding the write-protected status of the disk. In such a case, metadata is
essential for the proper functioning of the main product. So the value of metadata depends on the context
in which it is provided, and the ways that contextual information can be used. When data is made available,
the potential user (human or computer) must put the data into an existing model of knowledge, and may
ask questions to do so. For example, in the case of an image, metadata provides answers to many of
the questions like "When was the image taken?" and "Who is in the image?" In sophisticated data
systems, the metadata includes the contextual information surrounding the data and will also be very
sophisticated, capable of answering many questions that help understand the data. To sum up,
metadata can be defined as “the structured, encoded data that describe characteristics of
information-bearing entities to aid in the identification, discovery, assessment, and management of the
described entities.”
Use of Metadata
a. Metadata provides additional information to the users of the data it describes, and this information can
serve a wide range of purposes.
b. Metadata speeds up and enriches searching for resources. Search queries using metadata save users
from performing more complex filter operations manually. Also, web browsers, P2P applications and
media management software automatically download and locally cache metadata to improve the speed
at which files can be accessed, thereby making more information available.
c. Metadata is an important part of electronic discovery. Application and file system metadata derived from
electronic documents and files can serve as evidence in legal proceedings.
d. Some metadata is intended to enable variable content presentation. For example, if a picture has metadata
that indicates its most important region, the user can narrow the picture to that region and thus obtain the
details required.
e. Metadata can also be used to automate workflows. For example, if a software tool knows the content and
structure of data, it can convert it automatically and pass it to another tool as input, so that users need not
perform the conversion manually.
f. Metadata helps to bridge the semantic gap by explaining how data items are related and how
these relations can be evaluated automatically. For example, if a search engine understands that
"Aditya Kaashyap" was an "Indian Engineer", it can answer a search query on "Indian Engineers" with
a link to a web page about Aditya Kaashyap, although the exact words "Indian Engineers" never occur
on that page. This approach (called knowledge representation) is of special interest to the semantic web
and artificial intelligence.
To successfully develop and use metadata, you need to understand the following important issues:
a. You need to keep track of the entire metadata created even in the early phases of planning and designing.
It is not economical to start attaching metadata once the production process has been completed.
b. Metadata must adapt if the resource it describes changes. It should be merged when two resources are
merged.
c. It can be useful to keep metadata even after the resource it describes has been removed.
d. Metadata can be stored either internally (in the same file as the data) or externally (in a separate file).
Internal storage allows transferring metadata together with the data it describes, so the metadata is
always at hand and can be easily manipulated; however, this method creates high redundancy and does
not allow holding the metadata of many resources together. External storage allows bundling metadata,
for example in a database, for more efficient searching. There is no redundancy, but the metadata cannot
always be transferred together with the data, for example when streaming.
e. Storing the metadata in a human-readable format (such as XML) can be useful because users can
understand and edit it without specialized tools. However, such formats are not optimized for storage
capacity; it may be useful to store metadata in a binary, non-human-readable format instead to speed up
transfer and save space. A minimal sketch of the internal and external storage options follows.
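This Python sketch contrasts the two storage options using only the standard library; the file names, field names, and values are hypothetical, and a real system would of course use richer schemas.

```python
import json
import xml.etree.ElementTree as ET

# -- Internal storage: metadata travels in the same file as the data. --
doc = ET.Element("document")
meta = ET.SubElement(doc, "metadata")
ET.SubElement(meta, "author").text = "Aditya Kaashyap"
ET.SubElement(meta, "created").text = "2006-10-09"
ET.SubElement(doc, "body").text = "The main data lives here."
ET.ElementTree(doc).write("report.xml")   # human-readable and editable

# -- External storage: metadata is bundled separately, e.g. a catalog --
# -- file or database, which is more efficient to search across files. --
catalog = {"report.xml": {"author": "Aditya Kaashyap",
                          "created": "2006-10-09"}}
with open("catalog.json", "w") as f:
    json.dump(catalog, f, indent=2)
```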
Although the majority of computer professionals see metadata as a chance for better interoperability,
critics raise the following objections:
a. Metadata is subjective and depends on context. Two persons will attach different metadata to the same
resource due to their different points of view. Moreover, metadata can be misinterpreted due to its
dependency on context.
b. There is no end to metadata. For example, when annotating a soccer match with metadata, one can
describe the players and their actions. Others can also describe the advertisements in the
background and the clothes the players wear. So even for a simple resource the amount of possible
metadata is potentially unlimited.
c. There is no real need for metadata, as most of today's search engines already find text very efficiently.
(ii) What do you understand by ETL? What are the significances of ETL processes?
What are the ETL requirements and steps?
Ans. Most of the information contained in a data warehouse comes from the operational systems. But we all
know that the operational systems cannot be used to provide strategic information directly. So you need
to carefully understand what constitutes the difference between the data in the source operational systems
and the information in the data warehouse. It is the ETL functions that reshape the relevant data from the
source systems into useful information to be stored in the data warehouse. There would be no strategic
information in the data warehouse without the ETL functions.
The ETL functions act as the back-end processes that cover the extraction of data from the source
systems. They also include all the functions and procedures for changing the source data into the exact
formats and structures appropriate for storage in the data warehouse database. After the transformation
of the data, these processes physically move the data into the data warehouse
repository. After capturing the information, you cannot simply dump the data into the data warehouse. You
have to carefully subject the extracted data to all manner of transformations so that the data will be fit to
be stored and used in the data warehouse.
Let us try to understand the significance of the ETL function by taking an example. Suppose you
want to analyze and compare sales by store, product and month, but the sales data is spread
across various applications of your organization. You therefore have to bring the entire sales
details into the data warehouse database. You can do this by providing the sales and price in a fact table,
the products in a product dimension table, the stores in a store dimension table and the months in a
time dimension table. To do this, you need to extract the data from the respective source systems,
reconcile the variations in data representation among the source systems, transform the entire sales
details, and load the sales into the fact and dimension tables. Thus the execution of the ETL functions
is essential for building the data warehouse; a minimal sketch of these steps follows.
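The following is a minimal Python sketch of the extract-reconcile-load flow under stated assumptions: the source records, code mappings, and table names are hypothetical, and a real ETL job would read from the actual source databases rather than in-memory lists.

```python
# Hypothetical extracts from two source applications that represent the
# same product with different codes.
source_a = [{"prod": "1", "store": "S01", "month": "2006-10", "sales": 1200.0}]
source_b = [{"prod": "A", "store": "S02", "month": "2006-10", "sales": 800.0}]

# Transform: reconcile the differing code conventions into one standard.
code_map = {"1": "P100", "A": "P100"}    # both codes mean the same product

product_dim = {"P100": {"name": "Widget"}}                 # product dimension
store_dim   = {"S01": {"city": "Pune"}, "S02": {"city": "Delhi"}}
time_dim    = {"2006-10": {"year": 2006, "month": 10}}
fact_sales  = []                                           # sales fact table

def etl(rows):
    """Load reconciled source rows into the fact table."""
    for r in rows:
        fact_sales.append({"product_key": code_map[r["prod"]],
                           "store_key": r["store"],
                           "time_key": r["month"],
                           "sales": r["sales"]})

etl(source_a)
etl(source_b)
print(fact_sales)   # unified facts, ready for store/product/month analysis
```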
Also, the amount of time to be spent on performing the ETL functions is as much as 50-70% of the total
effort to be put into building a data warehouse. To extract the data, you have to know the time window
during each day in which data can be extracted from a specific source system without impacting the usage
of that system. Also, you need to determine the mechanism for capturing the changes to the data in each of
the relevant systems. Apart from the ETL functions, the building of a data warehouse includes
functions like data integration, data summarization and metadata updating. Figure 6.1 details these
functions.
3.(i) Describe briefly the Data Transformation process. What are the major types of
transformations? Describe them briefly.
Ans. You need to perform various types of transformation tasks before moving the extracted data from the
source systems into the data warehouse. The transformation of the data has to be done to agreed
standards, as the data comes from various source systems, and you also need to ensure that the
transformed data is meaningful and consistent. Irrespective of the complexity of the source systems, and
regardless of the extent of your data warehouse, most of the data transformation work breaks down into a
small set of basic tasks. By undertaking a combination of these basic tasks, you can perform the following
transformation functions:
Format Revisions
Format revisions include changes to the data types and lengths of individual fields. For instance, the
product package types in your source systems may be indicated by codes and names in which the fields
are numeric and text data types. Also, the lengths of the package-type field might vary from one source
system to another. Therefore, you can standardize the field lengths and change the data type to text in
order to keep the field uniform across systems.
Decoding of Fields
This type of transformation arises when you deal with multiple source systems and are bound to have the
same data items described by a plethora of field values. For instance, two products manufactured
by an organization might have been coded as 1 and 2 in one source system and as A and B
in another system. In such situations, you need to decode and standardize the codes before
loading the data into the data warehouse; otherwise there will be conflicts in the data analysis. A minimal
decoding sketch follows.
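Here is a short Python sketch of the decoding idea; the source system names and code tables are hypothetical, and in practice the decode tables would themselves be part of the warehouse metadata.

```python
# Hypothetical decode tables: each source system's codes are mapped to
# one standard value before loading into the warehouse.
GENDER_DECODE = {
    "billing": {"m": "MALE", "f": "FEMALE"},
    "claims":  {"0": "MALE", "1": "FEMALE"},
}

def decode_gender(source_system, raw_code):
    """Translate a source-specific code into the warehouse standard."""
    return GENDER_DECODE[source_system][raw_code.lower()]

# Both source conventions now yield the same standardized value.
assert decode_gender("billing", "m") == decode_gender("claims", "0")
```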
Calculated and Derived Values
You can maintain both calculated and derived types of data values in a typical data warehouse.
For instance, you can keep 'profit margin' (calculated as the difference between total sales and total
cost) as a calculated value along with the sales and cost amounts extracted from the sales system,
viz. sales volume, sales value, and operating cost estimates. Similarly, you may store derived values
such as average daily balances.
Splitting of Single Fields
You need to split larger single fields into components for improved understanding and better analysis.
For instance, traditional legacy systems store the name and address of a customer in one large text field.
Similarly, some systems store city, state, and zip code data together in a single field. But you need to
store these individual components separately in order to perform analysis using individual components
such as city, state, and zip code, as in the sketch below.
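A minimal Python sketch of field splitting; the "city, state zip" layout is a hypothetical format, and real legacy fields usually need more robust parsing rules than this.

```python
# Split a combined "city, state zip" text field into individual
# components so each can be stored and analyzed on its own.
def split_address(field):
    city, rest = field.split(",", 1)
    state, zip_code = rest.split()
    return {"city": city.strip(), "state": state, "zip": zip_code}

print(split_address("Springfield, IL 62704"))
# {'city': 'Springfield', 'state': 'IL', 'zip': '62704'}
```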
Merging of Information
This type of transformation deals with merging information available in various source systems into
a single entity. For instance, the product code and description may come from one data source, while
the relevant package types and the cost data may come from several other source systems. Here, merging
of information denotes combining the product code, description, package types, and cost into a single
entity.
Summing Up
In this type of transformation, summaries are created and loaded into the data warehouse instead
of the most granular level of data. For instance, a credit card company need not store each and
every transaction on each credit card in the data warehouse to analyze sales patterns. Instead, the
data can be summarized to the extent possible and the summary data stored instead of the most granular
data.
Character Set Conversion
In this type of data transformation, character sets are converted into an agreed standard character
set for textual data in the data warehouse. For instance, the source data will be in EBCDIC
(Extended Binary Coded Decimal Interchange Code) characters if you have mainframe legacy systems
as source systems. So you need to convert from the mainframe EBCDIC format to ASCII
(American Standard Code for Information Interchange) if a PC-based architecture is the choice for your
data warehouse environment.
Conversion of Units of Measurement
Use of a standard unit of measurement is one of the prerequisites in building a data warehouse. If your
company has overseas operations, you may have to convert the metrics accordingly so that the numbers
are all expressed in the same standard units.
Date/Time Conversion
Date and time representation is another important conversion. For example, the date of October 9, 2006
is written as 10/09/2006 in the U.S. format and as 09/10/2006 in the British format. These can be
standardized into a single agreed format, as in the sketch below.
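A small Python sketch of date standardization using only the standard library; the source-format labels are hypothetical, and a real job would record each source system's format in metadata.

```python
from datetime import datetime

# Hypothetical source formats: U.S. (MM/DD/YYYY) and British (DD/MM/YYYY)
# dates are standardized to one ISO format before loading.
def standardize(raw, source_format):
    fmt = "%m/%d/%Y" if source_format == "US" else "%d/%m/%Y"
    return datetime.strptime(raw, fmt).date().isoformat()

# Both representations of October 9, 2006 yield '2006-10-09'.
assert standardize("10/09/2006", "US") == standardize("09/10/2006", "UK")
```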
Key Restructuring
You have to come up with keys for the fact and dimension tables of the data warehouse based on
the keys in the extracted records, so you look at the primary keys of the extracted records while
extracting data from the input sources. For instance, the product code in an organization may be
structured to have an inherent meaning (the first letter describing the location code, the second letter
describing the machine code, etc.). If you use this product code as the primary key and a product is
later moved to another warehouse, the warehouse part of the product key will have to be changed when
moving the data. Therefore, avoid keys with built-in meanings while choosing keys for your data warehouse
database tables and transform such keys into generic keys generated by the system itself, as in the
sketch below.
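This is a minimal Python sketch of generating generic (surrogate) keys; the 'W1M3-0042' production-key format is hypothetical, and a real warehouse would persist the key map in a database table.

```python
import itertools

# Generic (surrogate) keys: system-generated integers replace the
# meaning-laden production keys, which are kept only as attributes.
_next_key = itertools.count(1)
key_map = {}    # production key -> surrogate key

def surrogate_key(production_key):
    """Return a stable, system-generated key for a production key."""
    if production_key not in key_map:
        key_map[production_key] = next(_next_key)
    return key_map[production_key]

# A 'W1M3...' style key can change when a product moves warehouses;
# the surrogate key assigned here never does.
print(surrogate_key("W1M3-0042"))   # 1
print(surrogate_key("W1M3-0042"))   # 1 (stable on repeat lookups)
```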
Deduplication
Some companies may maintain several records for a single customer, and these additional records result
in duplicates. Therefore, it is suggested to keep a single record for one customer and link all the
duplicates in the source systems to this single record in your data warehouse, as the sketch below
illustrates. This process is called deduplication.
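A minimal Python sketch of linking duplicates to one surviving record; the customer records and the phone-number match rule are hypothetical, and real deduplication uses far more sophisticated matching.

```python
# Hypothetical customer records from two source systems; duplicates
# are linked to a single surviving warehouse record.
records = [
    {"src_id": "A-17", "name": "R. Mehta", "phone": "9812345678"},
    {"src_id": "B-09", "name": "Rahul Mehta", "phone": "9812345678"},
]

survivors, links = {}, {}
for rec in records:
    match_key = rec["phone"]           # naive match rule for this sketch
    if match_key not in survivors:
        survivors[match_key] = rec     # first record becomes the survivor
    links[rec["src_id"]] = survivors[match_key]["src_id"]

print(links)   # {'A-17': 'A-17', 'B-09': 'A-17'} -- duplicates linked
```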
(ii) What do you understand by EIS? What are the significances of EIS?
Briefly describe the benefits of EIS.
Ans. Definition of an EIS
In simple terms, an EIS can be defined as a computer-based system intended to facilitate and support
the information and decision making needs of senior executives of an enterprise by providing easy
access to both internal and external information relevant to meeting the strategic goals of the organization.
These systems act as organization-wide Decision Support Systems that help top-level executives analyze,
compare, and highlight trends and patterns in important variables. Also, these systems emphasize
graphical displays and easy-to-use user interfaces, and offer strong reporting capabilities.
Significance of EIS
An EIS provides the summarized or detailed data of the strategic information at the convenience of the senior
executives of an organization. An EIS performs all these functions by constantly monitoring the internal and
external events and trends. For instance, an executive can use the EIS to view sales performance
categorized by product, region, month, etc. Similarly, the executive can also monitor the sales
performance of the organization’s competitors. Based on the snapshot provided by the EIS, the executive
can drill down into the organization’s data warehouse to display greater levels of detail and to explore
current and past data patterns and trends. This process can be continued till the executive reaches the
single-transaction level; thus the EIS provides the executive with information that explains a variance and
helps in deciding a course of action. The drill-down idea is sketched below.
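Here is a minimal Python sketch of the summary-to-detail drill-down an EIS supports; the transactions, regions, and amounts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales transactions; an executive drills down from a
# region-level summary to the individual transactions behind a variance.
transactions = [
    {"region": "West", "product": "P1", "month": "2006-10", "amount": 900.0},
    {"region": "West", "product": "P2", "month": "2006-10", "amount": 150.0},
    {"region": "East", "product": "P1", "month": "2006-10", "amount": 400.0},
]

def summarize(by):
    """Aggregate the transactions up to a single attribute."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t[by]] += t["amount"]
    return dict(totals)

print(summarize("region"))              # top-level EIS snapshot
print([t for t in transactions          # drill-down: West region,
       if t["region"] == "West"])       # transaction level
```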
The tools offered by EIS are programmed to provide canned reports or briefing books to top-level executives.
Today these tools allow ad-hoc querying against a multi-dimensional database, and most offer analytical
applications along functional lines such as sales or financial analysis. But an organizational EIS cannot
become a substitute for other forms of information technologies and computer-based systems viz.,
Management Information Systems (MIS), Transaction Processing Systems (TPS), and Decision Support
Systems (DSS).
Today, the application of an EIS is not only in typical corporate hierarchies, but also at personal computers
on a local area network. These systems now cross computer hardware platforms and integrate information
stored on mainframes, personal computer systems, and minicomputers. As some client service companies
adopt the latest enterprise information systems, executives can use their personal computers to get access
to the company’s data and decide which data are relevant for their decision making. This arrangement enables
all users to customize their access to the proper company’s data and provide relevant information to both
upper and lower levels in companies.
Benefits of an EIS
a. Provides tools to select, extract, filter, and track the critical information of the organization in an organized manner
b. Enables top-level executives to use the system with ease (extensive computer experience is not required)
c. Provides timely delivery of an organization-wide summary of information, highlighting major deviations from
expected performance
d. Provides a wide range of reports including status reports, trend analyses, drill-down investigations, and ad hoc
queries.
The organizational EIS is not a substitute for other information technologies and computer-based systems.
The other decision support systems are still vital in bringing relevant information to the various levels of
a modern organization. The EIS feeds off the various information systems within an organization for its
internal information needs and then attaches itself to the external sources as and when necessary to
meet its external information needs.
However, executives face the following limitations with the Executive Information Systems:
a. The cost of establishing an EIS is relatively high, so it may not be economically viable for small companies
b. Functions are limited and so the systems may not perform complex calculations
c. Depends on the other information technologies in the organization to gather the organization’s internal data
Assignment Set- 2
You have to ensure that the quality of the data loaded into the warehouse is appropriate.
There are two significant dimensions in understanding the quality of the data: intrinsic quality and realistic
quality. Here, 'intrinsic data quality' is the correctness or accuracy of the data, and 'realistic data quality'
is the value that correct data has in supporting the work of the business or organization.
To state simply, the ‘intrinsic data quality’ is the accuracy of the data. It is the degree to which data
accurately reflects the real-world object that the data represents. If all facts that an organization needs
to know about an entity are accurate, then that data has intrinsic quality.
Data that does not enable the organization to accomplish its mission has no quality, no matter how
accurate it is. Thus 'realistic data quality' comes into the picture. Realistic data quality is the degree of
utility and value the data has in supporting the organizational processes that accomplish organizational
objectives. Fundamentally, realistic data quality is the degree of satisfaction experienced by the knowledge
workers who use the data.
Concept of TQDM
Many business intelligence projects do not deliver to full potential for one reason: people
tend to see data quality as a one-time undertaking, a part of user acceptance testing (UAT). But it is very
important that data quality management be undertaken as a continuous improvement process. You
have to use an iterative approach, as detailed below, to achieve data quality:
Undertaking a commitment to the Data Quality Management process can be accomplished by establishing
a data quality management environment and by establishing the conditions that encourage coordination
between functional users and information system development professionals.
Functional users of legacy information systems know the data quality problems of the existing systems but
hardly know how to improve the quality of the existing data systematically. Information system developers,
on the other hand, know how to identify data quality problems but hardly know how to change the
functional requirements that drive the systematic improvement of data. Given the existing barriers to
communication, establishing the data quality environment requires the participation of both functional
users and information system administrators.
For each data quality analysis project selected, you may have to draft an initial plan that addresses the following items:
Task Summary
Task Description
Project Approach
Schedule
Resources
Each selected project then proceeds through an iterative cycle of activities:
a. Define
b. Measure
c. Analyze
d. Improve
The data quality project manager performs these activities with input from the functional users of the
data, system developers, and database administrators of the legacy and target database systems. A
minimal sketch of the 'Measure' activity follows.
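This Python sketch illustrates one way the 'Measure' step might score a batch of records against simple correctness rules; the rules, field names, and sample records are hypothetical.

```python
# A minimal 'Measure' step: score a batch of records against simple
# correctness rules (field present, value in a valid range/domain).
rules = {
    "premium": lambda v: isinstance(v, (int, float)) and v > 0,
    "gender":  lambda v: v in {"MALE", "FEMALE"},
}

def measure(records):
    """Return the fraction of rule checks that pass (intrinsic quality)."""
    checks = passed = 0
    for rec in records:
        for field, rule in rules.items():
            checks += 1
            passed += 1 if field in rec and rule(rec[field]) else 0
    return passed / checks if checks else 1.0

batch = [{"premium": 500.0, "gender": "MALE"},
         {"premium": -10.0, "gender": "X"}]
print(f"intrinsic quality score: {measure(batch):.0%}")   # 50%
```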
All stakeholders in the Data Quality Management process (functional users, program managers, developers,
and the Office of Data Management) are required to review the progress to determine whether the data
quality goals are being met.
IBM's Visual Warehouse combines the required data warehousing functions into a single product, and it can
be used to automate the process of bringing data together from heterogeneous sources into a central,
integrated, information-providing environment. It does not simply create a data warehouse or an
information database; it provides the processes to define, build, manage, monitor and maintain an
environment which provides information. Visual Warehouse can be managed either centrally or from the
workgroup environment. Therefore, business groups can meet their own information needs without
burdening information systems resources, and can enjoy the autonomy of their own data mart.
a. Visual Warehouse has the ability to extract and transform data from a wide range of heterogeneous data sources
(both internal and external to an enterprise), such as the DB2 family, Microsoft SQL Server, Oracle,
Sybase, Informix, and flat files (for example, from spreadsheets). On the basis of the metadata defined by the
administrative component of Visual Warehouse, the data from any of these sources can be extracted and
transformed. Also, the extraction process, which supports full refreshing of data, can run on demand or on an
automated schedule.
b. The transformed data can be placed in a data warehouse built on any of the DB2 UDB platforms (including DB2
for Windows NT, DB2 for AIX, DB2 for HP-UX, DB2 for Sun Solaris, DB2 for SCO, DB2 for SINIX, DB2 for OS/2,
DB2 for OS/400, and DB2 for OS/390) or in flat files. Visual Warehouse provides the flexibility and scalability
to populate any combination of the supported databases. Also, Visual Warehouse supports Oracle, Sybase,
and other relational databases as warehouse targets.
c. Once the data is in the target data warehouse, it can be accessed by a variety of end-user query tools.
These tools can be from IBM, such as Lotus Approach or QMF for Windows, or from any other vendor whose
products comply with the DB2 Client Application Enabler (CAE) or the Open Database Connectivity (ODBC)
interface, such as Business Objects and Cognos Impromptu. The data can also be browsed using any of the
standard web browsers.
The following are the twelve basic guidelines proposed by E.F. Codd for an OLAP system:
a. Multidimensional Conceptual View: The system should provide users with a multidimensional model
that corresponds to their view of the business and is intuitively analytical.
b. Transparency: These systems need to be part of an open system that supports heterogeneous data sources.
Also, the end-user should not need to be concerned about the details of data access or conversions.
c. Accessibility: The OLAP system should present the user with a single logical schema of the data. It has to map
its own logical schema to the heterogeneous physical data stores and perform any necessary transformations.
d. Consistent Reporting Performance: The users of the system should not experience any significant degradation
in reporting performance as the number of dimensions or the size of the database increases. Users need to
perceive consistent run time, response time, or machine utilization every time a given query is run.
e. Client/Server Architecture: The system has to conform to the principles of client/server architecture
for optimum performance, flexibility, adaptability, and interoperability. Also, the server component needs to be
intelligent enough that various clients can be attached with minimal effort and programming.
f. Generic Dimensionality: The system has to ensure that every data dimension is equivalent in both structure
and operational capabilities. We should be able to apply the function of one dimension to another too.
g. Dynamic Sparse Matrix Handling: This guideline is related to the idea of nulls in relational databases and to
the notion of compressing large files; a sparse matrix is one in which not every cell contains data. So the
system should be able to adapt its physical storage scheme to the degree of sparsity in the data.
h. Multi-user Support: Similar to EIS systems, OLAP systems need to support multiple concurrent users
while preserving data integrity and security.
i. Unrestricted Cross-dimensional operations: The OLAP system should have the ability to recognize
dimensional hierarchies and automatically perform roll-up and drill-down operations within a dimension or across
dimensions.
j. Intuitive data manipulation: The system should enable consolidation path reorientation (pivoting), drill-down
and roll-up, and other manipulations to be accomplished intuitively and directly via point-and-click and
drag-and-drop actions.
k. Flexible Reporting: The system should enable its users to arrange columns, rows, and cells in a manner that
facilitates easy manipulation, analysis, and synthesis of information.
l. Unlimited Dimensions and Aggregation Levels: The system is expected to accommodate at least fifteen
(and preferably twenty) data dimensions within a common analytical model, each with an unlimited number
of user-defined aggregation levels.
Later in 1995, Codd included the following six requirements in addition to the above twelve basic guidelines:
a. Drill-through to Detail Level: The system has to allow a smooth transition from the multidimensional,
pre-aggregated database to the detail record level of the source data warehouse repository.
b. Treatment of Non-normalized Data: The system should prohibit calculations made within it from being
written back to the source (non-normalized) systems.
c. Treatment of Missing Values: The system should be able to ignore missing values, irrespective of their source.
d. Incremental Database Refresh: The system has to provide for incremental refreshes of the extracted and
aggregated data.
e. SQL Interface: The OLAP system should have the ability to integrate into the existing enterprise
environment.
A minimal sketch of the roll-up and drill-down operations named in these guidelines follows.
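This Python sketch illustrates roll-up and drill-down over a tiny in-memory cube; the fact rows and dimension names are hypothetical, and a real OLAP engine would pre-aggregate and index such data.

```python
from collections import defaultdict

# Facts at the most granular (product, region, quarter) level.
facts = [
    ("P1", "West", "Q1", 100), ("P1", "West", "Q2", 120),
    ("P1", "East", "Q1",  80), ("P2", "West", "Q1",  60),
]

def roll_up(dims):
    """Aggregate the facts up to the requested subset of dimensions."""
    names = ("product", "region", "quarter")
    totals = defaultdict(int)
    for p, r, q, amount in facts:
        key = tuple(v for n, v in zip(names, (p, r, q)) if n in dims)
        totals[key] += amount
    return dict(totals)

print(roll_up({"product"}))            # rolled up across region & quarter
print(roll_up({"product", "region"}))  # drill-down adds the region detail
```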
By its simplest definition, data mining (DM) is the set of activities used to find new, hidden, or unexpected
patterns in data. It is the process of analyzing data from different perspectives and summarizing it into
useful information. Technically, the data mining process finds correlations and patterns among dozens of
fields in large databases.
In the past, decision support activities were based on the concept of verification. In this sense, a relational
database could be queried to provide dynamic answers to well-formed questions. The key issue in verification
is that it requires a great deal of prior knowledge on the part of the decision maker in order to verify a
suspected relationship through a query. In the 1990s, data warehouses with query and report tools assisted
users in retrieving the types of decision support information they needed. Later, OLAP tools came into use
to support multidimensional analysis.
Up to this point, the approach for obtaining information was mainly driven by the users. But the sheer
volume of data makes it impossible for anyone to use analysis and query tools alone to discern useful
patterns.
For instance, in marketing research analysis, it is practically impossible to go through all the possible
associations and gain insights by querying and drilling down into the data warehouse. You might really
need a technology that can learn from past associations and results, and predict future behavior of
customers.
To sustain a competitive edge, it is really valuable to have a tool that can accomplish the discovery of
knowledge by itself. Thus you require a data-driven approach rather than a user-driven one.
Using the information stored within a data warehouse, data mining techniques can provide solutions to
questions such as:
1. Which scrips/securities are likely to be more profitable during the next trading session?
2. What is the probability that an individual customer will respond to a particular promotion?
3. What is the likelihood that an individual customer will default or pay back on schedule?
These questions can be answered easily if the information hidden among the terabytes of data in your
data warehouse can be uncovered and summarized.
Another important DM technique is knowledge discovery in databases (KDD). Using a combination of
techniques, including statistical analysis, multidimensional analysis, intelligent agents, and data
visualization, KDD can discover highly useful, informative patterns within the data that can be used to
develop predictive models of behavior, as the closing sketch below illustrates.
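This final Python sketch shows the data-driven idea in miniature: past promotion outcomes are mined for response rates, which then predict the likelihood that a new customer responds. The age bands and outcomes are hypothetical, and real data mining would use far richer models than per-segment frequencies.

```python
from collections import defaultdict

# Past promotion results (hypothetical): (age_band, responded) pairs.
history = [("18-30", True), ("18-30", True), ("18-30", False),
           ("31-50", False), ("31-50", True), ("51+", False)]

# 'Learn' the response rate per segment from past associations ...
counts, hits = defaultdict(int), defaultdict(int)
for band, responded in history:
    counts[band] += 1
    hits[band] += responded          # True counts as 1

def response_probability(age_band):
    """... and predict the likelihood that a new customer responds."""
    return hits[age_band] / counts[age_band] if counts[age_band] else 0.0

print(f"P(response | 18-30) = {response_probability('18-30'):.2f}")  # 0.67
```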