
Question Paper

Data Warehousing and Data Mining (MB3G1IT) : October 2008


Section A : Basic Concepts (30 Marks)

• This section consists of questions with serial number 1 - 30.


• Answer all questions.
• Each question carries one mark.
• Maximum time for answering Section A is 30 Minutes.

1. Which of the following is not determined in the business requirements stage of the data warehouse delivery method?
(a) The logical model for information within the data warehouse
(b) The source systems that provide this data (i.e. mapping rules)
(c) The query profiles for the immediate requirement
(d) The capacity plan for hardware and infrastructure
(e) The business rules to be applied to data.
2. Which of the following is a system manager that performs backup and archiving of the data warehouse?

(a) Load manager


(b) Warehouse manager
(c) Query manager
(d) Database manager
(e) Event manager.
3. What is the load manager task that is implemented by stored procedure tools?

(a) Fast load


(b) Simple transformation
(c) Complex checking
(d) Job control
(e) Backup and archive.
4. Which of the following statements is/are true about the types of partitioning?

I. Vertical partitioning can take two forms: normalization and row splitting.
II. Before using a vertical partitioning there should not be any requirement to perform major join
operations between the two partitions.
III. In order to maximize the hardware partitioning, minimize the processing power available.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) Both (II) and (III) above.
5. Which of the following machines is a set of tightly coupled CPUs that share memory and disk?

(a) Symmetric multi-processing


(b) Massively multi-processing
(c) Segregated multi-processing
(d) Asymmetric multi-processing
(e) Multidimensional multi-processing.
6. Which Redundant Array of Inexpensive Disks (RAID) level has full mirroring with each disk duplexed?

(a) Level 1
(b) Level 2
(c) Level 3
(d) Level 4
(e) Level 5.

7. Which of the following statements is/are true?

I. Snowflake schema is a variant of star schema where each dimension can have its own dimensions.
II. Starflake schema is a logical structure that has a fact table in the center with dimension tables radiating
off of this central table.
III. Star schema is a hybrid structure that contains a mix of Starflake and snowflake schemas.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
8. In database sizing, what is the mathematical representation of ‘T’ (temporary space) given ‘n’ (the number of concurrent queries allowed) and ‘P’ (the size of the partition)?
(a) T = (2n + 1)P
(b) T = P / (2n + 1)
(c) T = (2n - 1)P
(d) T = (2n - 1) / P
(e) T = P / (2n - 1).
9. In data mining, which of the following states a statistical correlation between the occurrence of certain attributes in a database table?
(a) Association rules
(b) Query tools
(c) Visualization
(d) Case-based learning
(e) Genetic algorithms.
10. In data mining, learning tasks can be divided into
I. Classification tasks.
II. Knowledge engineering tasks.
III. Problem-solving tasks.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
11. Which of the following statements are true about the types of knowledge?

I. Hidden knowledge is the information that can be easily retrieved from databases using a query tool such
as Structured Query Language (SQL).
II. Shallow knowledge is the data that can be found relatively easily by using pattern recognition or
machine-learning algorithms.
III. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing
tools.
IV. Deep knowledge is the information that is stored in the database but can only be located if we have a
clue that tells us where to look.
(a) Both (I) and (II) above
(b) Both (II) and (III) above
(c) Both (III) and (IV) above
(d) (I), (II) and (III) above
(e) (II), (III) and (IV) above.

12. There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be:
I. Time-independent.
II. Non-volatile.
III. Subject oriented.
IV. Integrated.
(a) Both (I) and (II) above
(b) Both (II) and (III) above
(c) Both (III) and (IV) above
(d) (I), (II) and (III) above
(e) (II), (III) and (IV) above.
13. Which of the following statements is/are true about Online Analytical Processing (OLAP)?
I. OLAP tools do not learn; they create no new knowledge.
II. OLAP tools are more powerful than data mining.
III. OLAP tools cannot search for new solutions.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (I) and (III) above
(e) Both (II) and (III) above.
14. Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections.
II. Disconnections.
III. Data access.
IV. Data change.
(a) Both (I) and (III) above
(b) Both (II) and (III) above
(c) (I), (III) and (IV) above
(d) (II), (III) and (IV) above
(e) All (I), (II), (III) and (IV) above.
15. Which of the following produced the ‘Alexandria’ backup software package?
(a) HP
(b) Sequent
(c) IBM
(d) Epoch Systems
(e) Legato.
16. Which of the following statements is/are true about aggregations in data warehousing?
I. Aggregations are performed in order to speed up common queries.
II. Too few aggregations will lead to unacceptable operational costs.
III. Too many aggregations will lead to an overall lack of system performance.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (II) and (III) above
(e) All (I), (II) and (III) above.

17. Which of the following statements is/are true about the basic levels of testing the data warehouse?
I. All unit testing should be complete before any test plan is enacted.
II. In integration testing, the separate development units that make up a component of the data warehouse
application are tested to ensure that they work together.
III. In system testing, the whole data warehouse application is tested together.
(a) Only (I) above
(b) Only (II) above
(c) Both (I) and (II) above
(d) Both (II) and (III) above
(e) All (I), (II) and (III) above.
18. To execute each SQL statement, the RDBMS uses an optimizer to calculate the best strategy for performing that statement. There are a number of different ways of calculating such a strategy, but we can categorize optimizers generally as either rule-based or cost-based. Which of the following statements is/are false?
I. A rule-based optimizer uses known rules to perform the function.
II. A cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best
strategy for executing the SQL statement.
III. “Number of rows in the table” is generally collected by rule-based optimizer.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
19. In parallel technology, which of the following statements is/are true?
I. Data shipping is where a process requests for the data to be shipped to the location where the process is
running.
II. Function shipping is where the function to be performed is moved to the locale of the data.
III. Architectures which are designed for shared-nothing or distributed environments use data shipping
exclusively. They can achieve parallelism as long as the data is partitioned or distributed correctly.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
20. Which of the following is/are the common restriction(s) that may apply to the handling of views?
I. Restricted Data Manipulation Language (DML) operations.
II. Lost query optimization paths.
III. Restrictions on parallel processing of view projections.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
21. One petabyte is equal to
(a) 1024 terabytes
(b) 1024 gigabytes
(c) 1024 megabytes
(d) 1024 kilobytes
(e) 1024 bytes.

22. The formula for the construction of a genetic algorithm for the solution of a problem has the following steps. List the steps in the correct order.
I. Invent an artificial environment in the computer where the solutions can join in battle with each other.
Provide an objective rating to judge success or failure, in professional terms called a fitness function.
II. Develop ways in which possible solutions can be combined. Here the so-called cross-over operation, in
which the father’s and mother’s strings are simply cut and after changing, stuck together again, is very
popular. In reproduction, all kinds of mutation operators can be applied.
III. Devise a good, elegant coding of the problem in terms of strings of a limited alphabet.
IV. Provide a well-varied initial population and make the computer play ‘evolution’, by removing the bad
solutions from each generation and replacing them with progeny or mutations of good solutions. Stop
when a family of successful solutions has been produced.
(a) I, II, III and IV
(b) I, III, II and IV
(c) III, I, II and IV
(d) II, III, I and IV
(e) III, II, I and IV.
23. Which of the following produced the ADSTAR Distributed Storage Manager (ADSM) backup software package?
(a) HP
(b) Sequent
(c) IBM
(d) Epoch Systems
(e) Legato.
24. Which of the following does not belong to the stages in the Knowledge Discovery Process?
(a) Data selection
(b) Data encapsulation
(c) Cleaning
(d) Coding
(e) Reporting.
25. Which of the following is/are the application(s) of data mining?
I. Customer profiling.
II. CAPTAINS.
III. Reverse engineering.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.
26. Which of the following managers is not a part of the system managers in a data warehouse?
(a) Configuration manager
(b) Schedule manager
(c) Event manager
(d) Database manager
(e) Load manager.
27. In data mining, a group of similar objects that differ significantly from other objects is known as
(a) Filtering
(b) Clustering
(c) Coding
(d) Scattering
(e) Binding.

28. A perceptron with a simple three-layered network has ____________ as input units.
(a) Photo-receptors
(b) Associators
(c) Responders
(d) Acceptors
(e) Rejectors.
29. In which theory was the human brain described as a neural network?
(a) Shannon’s communication theory
(b) Kolmogorov complexity theory
(c) Rissanen theory
(d) Freud’s theory of psychodynamics
(e) Kohonen theory.
30. Which of the following is/are the task(s) maintained by the query manager?
I. Query syntax.
II. Query execution plan.
III. Query elapsed time.
(a) Only (I) above
(b) Only (II) above
(c) Only (III) above
(d) Both (I) and (II) above
(e) All (I), (II) and (III) above.

END OF SECTION A

Section B : Caselets (50 Marks)


• This section consists of questions with serial number 1 – 6.
• Answer all questions.
• Marks are indicated against each question.
• Detailed explanations should form part of your answer.
• Do not spend more than 110 - 120 minutes on Section B.

Caselet 1
Read the caselet carefully and answer the following questions:

1. “Braite selected Symphysis as the provider of choice, to create a roadmap for the solution, develop a scalable, robust and user-friendly framework, and deliver the product set.” In this context, explain the data warehousing architecture. (10 marks)
2. If you were a project manager at Braite, what metrics would you consider to help Braite meet its business goals for improved customer satisfaction and process improvement? Explain. (8 marks)
3. What might be the important characteristics of the proposed data warehouse? Also list the features of a data warehouse. (7 marks)
4. Do you think the Software Quality Assurance (SQA) process will play an important role in any data warehousing project? Explain. (10 marks)

Braite, a leading provider of software services to financial institutions, launched an initiative to enhance its application platform in order to provide better data
analytics for its customers. Braite partnered with Symphysis to architect and build
new Data Warehousing (DW) and Business Intelligence (BI) services for its
Business Intelligence Center (BIC). Over time, Symphysis has become Braite's
most strategic product development partner, using their global delivery model.
Braite faced real challenges in providing effective data analytics to its customers -
it supported several complex data sources residing in multiple application layers
within its products, and faced challenges in implementing business rules and
integrating data from disparate systems. These source systems included mainframe
systems, Oracle databases and even Excel spreadsheets. Braite also faced several
other challenges: it required manual data validation and data comparison processes;
it required manual controls over the credit card creation process; its system design
process suffered from unclear business requirements; it supported multiple
disparate data marts.
To address these challenges, Braite turned to an offshore IT services provider with
DW/BI experience and deep expertise in healthcare benefits software. Braite
selected Symphysis as the provider of choice, to assess the feasibility of a DW/BI
solution, as well as to create a roadmap for the solution, develop a scalable, robust
and user-friendly DW/BI framework, and deliver the product set.
In this project, Symphysis designed and executed the DW/BI architecture that has
become the cornerstone of Braite's data analytics service offerings, enhancing its
status as a global leader. Business Intelligence (BI) services focus on helping
clients in collecting and analyzing external and internal data to generate value for
their organizations.
Symphysis successfully architected, built and tested the DW/BI solution with the following key deliverables: scripts to automate existing manual processes involving file comparison, validation, quality checks and control card generation; change and configuration management best practices, leading to standardization of Braite's delivery process; robust Software Quality Assurance (SQA) processes to ensure high software quality, relying on unit testing and functional testing and resulting in reduced effort for business analysts; and metrics defined and collected at various steps in the software development process (from development to integration testing), in order to improve process methodology and provide the ability to enter benchmarks or goals for performance tracking. There are several data warehouse project management metrics worth considering, and these metrics have helped Braite to meet its business goals for improved customer satisfaction and process improvement.
In addition to full lifecycle product development, Symphysis also provides ongoing product support for Braite's existing applications.
Symphysis provided DW/BI solutions across a wide range of business functions,
including sales and service, relationship value management, customer information
systems, billing and online collections, operational data store, loans, deposits, voice
recognition, custom history, and ATM. Symphysis’s data mart optimization
framework enabled integration and consolidation of disparate data marts.
Symphysis's SQA strategy improved delivery deadlines in terms of acceptance and integration. Symphysis provided DW process improvements, ETL checklists and standardization. Braite achieved cost savings of 40% by using Symphysis's onsite/offshore delivery model and a scalable architecture enabling new data warehouse and business intelligence applications.
END OF CASELET 1

Caselet 2
Read the caselet carefully and answer the following questions:

5. Critically analyze the functions of the tools that the chairman of Trilog Brokerage Services (TBS) decided to implement in order to increase the efficiency of the organization. (7 marks)
6. Discuss the classification of the usage of tools against a data warehouse and also discuss the types of Online Analytical Processing (OLAP) tools. (8 marks)

Trilog Brokerage Services (TBS) is one of the oldest firms in India with a very
strong customer base. Many of its customers have more than one security holdings
and some even have more than 50 securities in their portfolios. And it has become
very difficult on the part of TBS to track and maintain which customer is

selling/buying which security and the amounts they have to receive or the amount
they have to pay to TBS.
It has found that information silos created are running contrary to the goal of the
business intelligence organization architecture: to ensure enterprise wide
informational content to the broadest audience. By utilizing the information
properly, it can enhance customer and supplier relationships, improve the
profitability of products and services, create worthwhile new offerings, better
manage risk, and pare expenses dramatically, among many other gains. TBS felt that it required a category of software tools that would help analyze the data stored in its database and help users analyze different dimensions of the data, such as time series and trend analysis views.
The chairman of TBS felt that Online Analytical Processing (OLAP) was the need
of the hour and decided to implement it immediately so that the processing part
would be reduced significantly, thereby increasing the efficiency of the
organization.
END OF CASELET 2

END OF SECTION B

Section C : Applied Theory (20 Marks)


• This section consists of questions with serial number 7 - 8.
• Answer all questions.
• Marks are indicated against each question.
• Do not spend more than 25 - 30 minutes on Section C.

7. What is a Neural Network? Discuss the various forms of Neural Networks. (10 marks)

8. Explain the various responsibilities of a query manager. (10 marks)


END OF SECTION C

END OF QUESTION PAPER

Suggested Answers
Data Warehousing and Data Mining (MB3G1IT) : October 2008
Section A : Basic Concepts
Answer Reason

1. D The capacity plan for hardware and infrastructure is not determined in the business requirements stage; it is identified in the technical blueprint stage.
2. B The warehouse manager is the system manager that performs backup and archiving of the data warehouse.
3. C Stored procedure tools implement complex checking.

4. D Vertical partitioning can take two forms: normalization and row splitting. Before using vertical partitioning there should not be any requirement to perform major join operations between the two partitions. In order to maximize the hardware partitioning, maximize the processing power available.
5. A A symmetric multi-processing machine is a set of tightly coupled CPUs that share memory and disk.
6. A Redundant Array of Inexpensive Disks (RAID) Level 1 has full mirroring with each disk duplexed.
7. A A snowflake schema is a variant of the star schema where each dimension can have its own dimensions. A star schema is a logical structure that has a fact table in the center with dimension tables radiating off of this central table. A starflake schema is a hybrid structure that contains a mix of star and snowflake schemas.
8. A In database sizing, if n is the number of concurrent queries allowed and P is the size of the partition, then the temporary space is set to T = (2n + 1)P.
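As a quick illustration of this sizing rule, the short Python sketch below computes T for a hypothetical number of concurrent queries and partition size (the figures are illustrative only):

```python
def temporary_space(n_concurrent_queries: int, partition_size_gb: float) -> float:
    """Temporary space T = (2n + 1) * P, as in the sizing rule above."""
    return (2 * n_concurrent_queries + 1) * partition_size_gb

# e.g. 10 concurrent queries against a 5 GB partition -> 105 GB of temporary space
print(temporary_space(10, 5))
```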
9. A Association rules state a statistical correlation between the occurrence of certain attributes in a database table.
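For illustration only, the sketch below counts the support and confidence of one such rule over a small, invented list of market-basket transactions (the items and the rule are assumptions, not taken from the paper):

```python
# Hypothetical transactions; each row is the set of items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Rule "bread => butter": support = P(bread and butter), confidence = P(butter | bread).
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / len(transactions)   # 3/5 = 0.6
confidence = both / bread            # 3/4 = 0.75
print(support, confidence)
```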
10. E Learning tasks can be divided into:
I. Classification tasks.
II. Knowledge engineering tasks.
III. Problem-solving tasks.
11. C Shallow knowledge is the information that can be easily retrieved from databases using a query tool such as Structured Query Language (SQL). Hidden knowledge is the data that can be found relatively easily by using pattern recognition or machine-learning algorithms. Multi-dimensional knowledge is the information that can be analyzed using online analytical processing tools. Deep knowledge is the information that is stored in the database but can only be located if we have a clue that tells us where to look.
12. E There are some specific rules that govern the basic structure of a data warehouse, namely that such a structure should be: time dependent, non-volatile, subject oriented and integrated.
13. D OLAP tools do not learn and create no new knowledge, and OLAP tools cannot search for new solutions; data mining is more powerful than OLAP.
14. E Auditing is a specific subset of security that is often mandated by organizations. As far as the data warehouse is concerned, the audit requirements can basically be categorized as:
I. Connections.
II. Disconnections.
III. Data access.
IV. Data change.
15. B The Alexandria backup software package was produced by Sequent.

16. A Aggregations are performed in order to speed up common queries. Too many aggregations will lead to unacceptable operational costs, and too few aggregations will lead to an overall lack of system performance.

17. E All unit testing should be complete before any test plan is enacted. In integration testing, the separate development units that make up a component of the data warehouse application are tested to ensure that they work together. In system testing, the whole data warehouse application is tested together.
18. C A rule-based optimizer uses known rules to perform the function, while a cost-based optimizer uses stored statistics about the tables and their indexes to calculate the best strategy for executing the SQL statement. “Number of rows in the table” is generally collected by the cost-based optimizer.
19. D Data shipping is where a process requests for the data to be shipped to the location where the process is running, and function shipping is where the function to be performed is moved to the locale of the data. Architectures which are designed for shared-nothing or distributed environments use function shipping exclusively; they can achieve parallelism as long as the data is partitioned or distributed correctly.
20. E The common restrictions that may apply to the handling of views are:
I. Restricted Data Manipulation Language (DML) operations.
II. Lost query optimization paths.
III. Restrictions on parallel processing of view projections.
21. A One petabyte is equal to 1024 terabytes.

22. C The formula for the construction of a genetic algorithm for the solution of a problem has the following steps:
I. Devise a good, elegant coding of the problem in terms of strings of a limited
alphabet.
II. Invent an artificial environment in the computer where the solutions can join in
battle with each other. Provide an objective rating to judge success or failure, in
professional terms called a fitness function.
III. Develop ways in which possible solutions can be combined. Here the so-called
cross-over operation, in which the father’s and mother’s strings are simply cut
and after changing, stuck together again, is very popular. In reproduction, all
kinds of mutation operators can be applied.
IV. Provide a well-varied initial population and make the computer play ‘evolution’,
by removing the bad solutions from each generation and replacing them with
progeny or mutations of good solutions. Stop when a family of successful
solutions has been produced.
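A minimal Python sketch of these four steps applied to a toy string-matching problem; the target string, population size and operators are illustrative assumptions rather than part of the model answer:

```python
import random

random.seed(0)
TARGET = "DATAMINING"                      # a toy problem, purely for illustration
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def fitness(s):                            # step II: the objective rating (fitness function)
    return sum(a == b for a, b in zip(s, TARGET))

def crossover(mother, father):             # step III: cut the parent strings and re-join them
    cut = random.randrange(1, len(TARGET))
    return mother[:cut] + father[cut:]

def mutate(s):                             # step III: a simple mutation operator
    i = random.randrange(len(TARGET))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

# Step I: solutions are coded as strings over a limited alphabet.
population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]

# Step IV: play 'evolution' - drop bad solutions, refill with progeny and mutations.
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    survivors = population[:50]
    population = survivors + [
        mutate(crossover(random.choice(survivors), random.choice(survivors)))
        for _ in range(150)
    ]

print(generation, max(population, key=fitness))
```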
23. C The ADSM backup software package was produced by IBM.

24. B Data encapsulation is not a stage in the knowledge discovery process.

25. E Customer profiling, CAPTAINS and reverse engineering are applications of data mining.
26. E Except for the load manager, all the other managers are part of the system managers in a data warehouse.
27. B A group of similar objects that differ significantly from other objects is known as clustering.
28. A A perceptron consists of a simple three-layered network with input units called photo-receptors.
29. D In Freud’s theory of psychodynamics, the human brain was described as a neural network.
30. E The tasks maintained by the query manager are:
I. Query syntax.
II. Query execution plan.
III. Query elapsed time.

Section B : Caselets
1. Architecture of a data warehouse

Load Manager Architecture
The architecture of a load manager is such that it performs the following operations:
1. Extract the data from the source system.
2. Fast-load the extracted data into a temporary data store.
3. Perform simple transformations into a structure similar to the one in the data warehouse.
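A minimal sketch of these three steps, using Python's sqlite3 module as a stand-in temporary data store and a small in-line extract (the table names and data are hypothetical):

```python
import sqlite3

source_rows = [                      # 1. data extracted from a source system
    {"date": "2008-10-01", "amount": 120.0},
    {"date": "2008-10-01", "amount": 80.0},
    {"date": "2008-10-02", "amount": 200.0},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_sales (sale_date TEXT, amount REAL)")

# 2. Fast-load the extract into the temporary data store.
conn.executemany("INSERT INTO stage_sales VALUES (:date, :amount)", source_rows)

# 3. Simple transformation into a structure similar to the warehouse fact table.
conn.execute("""CREATE TABLE fact_sales AS
                SELECT sale_date AS date_key, SUM(amount) AS total_amount
                FROM stage_sales GROUP BY sale_date""")

print(conn.execute("SELECT * FROM fact_sales").fetchall())
```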

[Figure: Load manager architecture]


Warehouse Manager Architecture
The architecture of a warehouse manager is such that it performs the following operations:
1. Analyze the data to perform consistency and referential integrity checks.
2. Transform and merge the source data in the temporary data store into the published data warehouse.
3. Create indexes, business views, partition views, business synonyms against the base data.
4. Generate denormalizations if appropriate.
5. Generate any new aggregations that may be required.
6. Update all existing aggregations.
7. Back up incrementally or totally the data within the data warehouse.
8. Archive data that has reached the end of its capture life.
In some cases, the warehouse manager also analyzes query profiles to determine which indexes
and aggregations are appropriate.

[Figure: Architecture of a warehouse manager]


Query Manager Architecture
The architecture of a query manager is such that it performs the following operations:
1. Direct queries to the appropriate table(s).
2. Schedule the execution of user queries.
The actual problem specified was a tight project schedule within which the solution had to be delivered. Field errors had to be reduced to a great extent, as the solution was for the company. The requirements needed to be defined very clearly, and there was a need for a scalable and reliable architecture and solution. A study was conducted on the company's current business information requirements and its current process of getting that information, and a business case was prepared for a data warehousing and business intelligence solution.

2. Metrics are essential in the assessment of software development quality. They may provide information about the development process itself and the yielded products. Metrics may be
grouped into Quality Areas, which define a perspective for metrics interpretation. The adoption
of a measurement program includes the definition of metrics that generate useful information.
To do so, organization’s goals have to be defined and analyzed, along with what the metrics are
expected to deliver. Metrics may be classified as direct and indirect. A direct metric is
independent of the measurement of any other. Indirect metrics, also referred to as derived
metrics, represent functions upon other metrics, direct or derived. Productivity (code size / programming time) is an example of a derived metric. The existence of a timely and accurate
capturing mechanism for direct metrics is critical in order to produce reliable results. Indicators
establish the quality factors defined in a measurement program. Metrics also have a number of
components, and for data warehousing can be broken down in the following manner:
Objects - the “themes” in the data warehouse environment which need to be assessed. Objects
can include business drivers, warehouse contents, refresh processes, accesses, and tools.
Subjects - things in the data warehouse to which we assign numbers, or a quantity. For
example, subjects include the cost or value of a specific warehouse activity, access frequency,
duration, and utilization.
Strata - a criterion for manipulating metric information. This might include day of the week,
specific tables accessed, location, time, or accesses by department.
These metric components may be combined to define an “application,” which states how the
information will be applied. For example: “When actual monthly refresh cost exceeds targeted
monthly refresh cost, the value of each data collection in the warehouse must be re-
established.”
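The short sketch below shows how a direct metric, a derived metric and the refresh-cost "application" quoted above might be evaluated; every name and figure in it is an illustrative assumption:

```python
direct_metrics = {                    # direct metrics: measured, not computed
    "code_size_loc": 12000,
    "programming_time_days": 60,
    "actual_refresh_cost": 5200.0,
    "targeted_refresh_cost": 5000.0,
}

# Derived (indirect) metric: productivity = code size / programming time.
productivity = direct_metrics["code_size_loc"] / direct_metrics["programming_time_days"]

# Application rule: re-establish collection value when actual refresh cost exceeds target.
revalue_needed = (direct_metrics["actual_refresh_cost"]
                  > direct_metrics["targeted_refresh_cost"])

print(productivity, revalue_needed)   # 200.0 lines/day, True
```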
There are several data warehouse project management metrics worth considering. The first
three are:
• Business Return On Investment (ROI)
The best metric to use is business return on investment. Is the business achieving bottom
line success (increased sales or decreased expenses) through the use of the data
warehouse? This focus will encourage the development team to work backwards to do the
right things day in and day out for the ultimate arbiter of success -- the bottom line.
• Data usage
The second best metric is data usage. You want to see the data warehouse used for its
intended purposes by the target users. The objective here is increasing numbers of users
and complexity of usage. With this focus, user statistics such as logins and query bands
are tracked.
• Data gathering and availability
The third best data warehouse metric category is data gathering and availability. Under
this focus, the data warehouse team becomes an internal data brokerage, serving up data
for the organization’s consumption. Success is measured in the availability of the data
more or less according to a service level agreement. These business metrics can be used to gauge success.
3. The important characteristics of a data warehouse are:
Time dependent: That is, containing information collected over time, which implies there must always be a connection between information in the warehouse and the time when it was entered. This is one of the most important aspects of the warehouse as it relates to data mining, because information can then be sourced according to period.
Non-volatile: That is, data in a data warehouse is never updated but used only for queries.
Thus such data can only be loaded from other databases such as the operational database. End-
users who want to update must use operational database, as only the latter can be updated,
changed, or deleted. This means that a data warehouse will always be filled with historical
data.
Subject oriented: That is, organized around the major subjects of the business rather than around the existing applications of the operational data. Not all the information in the operational database is useful for a data warehouse, since the data warehouse is designed specifically for decision support while the operational database contains information for day-to-day use.
Integrated: That is, it reflects the business information of the organization. In an operational
data environment we will find many types of information being used in a variety of
applications, and some applications will be using different names for the same entities.
However, in a data warehouse it is essential to integrate this information and make it
consistent; only one name must exist to describe each individual entity.
The following are the features of a data warehouse:
• A scalable information architecture that will allow the information base to be extended and
enhanced over time.
• Detailed analysis of member patterns, including trading, delivery and funds payment.
• Fraud detection and sequence of event analysis.
• Ease of reporting on voluminous historical data.
• Provision for ad hoc queries and reporting facilities to enhance the efficiency of
knowledge workers.
• Data mining to identify the co-relation between apparently independent entities.
4. Due to the principal role of data warehouses in making strategic decisions, data warehouse quality is crucial for organizations. The typical Quality Assurance (QA) activities aimed at ensuring both process and product quality at Braite include software testing, resulting in:
• Reduced development and maintenance costs
• Improved software products quality
• Reduced project cycle time
• Increased customer satisfaction
• Improved staff morale thanks to predictable results in stable conditions with less overtime/
crisis/turnover
Quality assurance means different things to different individuals. To some, QA means testing,
but quality cannot be tested at the end of a project. It must be built in as the solution is
conceived, evolves and is developed. To some, QA resources are the “process police” –
nitpickers insisting on 100% compliance with a defined development process methodology.
Rather, it is important to implement processes and controls that will really benefit the project.
Quality assurance consists of a planned and systematic pattern of the activities necessary to
provide confidence that a solution conforms to established requirements. Testing is just one of
those activities.
In the typical software QA methodology, the key tasks are:
• Articulate the development methodology for all to know
• Rigorously define and inspect the requirements
• Ensure that the requirements are testable
• Prioritize based on risk
• Create test plans
• Set up the test environment and data
• Execute test cases
• Document and manage defects and test results
• Gather metrics for management decisions
• Assess readiness to implement
Quality assurance (QA) in a data warehouse/business intelligence environment is a challenging
undertaking. For one thing, very little is written about business intelligence QA. Practitioners
within the business intelligence (BI) community appear to be more interested in discussing
data quality issues and data cleansing solutions. However, data quality does not make for BI
quality assurance, and practitioners within the software QA discipline focus almost exclusively
on application development efforts. They do not seem to appreciate the unique aspects of
quality assurance in a data warehouse/business intelligence environment.
An effective software QA should be ingrained within each DW/BI project. It should have the
following characteristics:
• QA goals and objectives should be defined from the outset of the project.
• The role of QA should be clearly defined within the project organization.
• The QA role needs to be staffed with talented resources, well trained in the techniques
needed to evaluate the data in the types of sources that will be used.

• QA processes should be embedded to provide a self-monitoring update cycle.
• QA activities are needed in the requirements, design, mapping and development project
phases.
5. Online Analytical Processing (OLAP) is a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of
multidimensional data. For example, it provides time series and trend analysis views. OLAP
often is used in data mining. The chief component of OLAP is the OLAP server, which sits
between a client and a Database Management Systems (DBMS). The OLAP server understands
how data is organized in the database and has special functions for analyzing the data. There
are OLAP servers available for nearly all the major database systems. OLAP (online analytical
processing) is a function of business intelligence software that enables a user to easily and
selectively extract and view data from different points of view. Designed for managers looking
to make sense of their information, OLAP tools structure data hierarchically – the way
managers think of their enterprises, but they also allow business analysts to rotate that data,
changing the relationships to get more detailed insight into corporate information. OLAP tools
are geared towards slicing and dicing of the data. As such, they require a strong metadata layer,
as well as front-end flexibility. Those are typically difficult features for any home-built systems
to achieve. The term ‘on-line analytic processing’ is used to distinguish the requirements of
reporting and analysis systems from those of transaction processing systems designed to run
day-to-day business operations. OLAP is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
The most common way to access a data mart or data warehouse is to run reports. Another very
popular approach is to use OLAP tools. To compare different types of reporting and analysis
interface, it is useful to classify reports along a spectrum of increasing flexibility and
decreasing ease of use:
Ad hoc queries, as the name suggests, are queries written by (or for) the end user as a one-off
exercise. The only limitations are the capabilities of the reporting tool and the data available.
Ad hoc reporting requires greater expertise, but need not involve programming, as most
modern reporting tools are able to generate SQL.
OLAP tools can be thought of as interactive reporting environments: they allow the user to
interact with a cube of data and create views that can be saved and reused as generic,
interactive reports. They are excellent for exploring summarised data, and some will allow the
user to drill through from the cube into the underlying database to view the individual
transaction details.
6. The usage of tools against a data warehouse can be classified into three broad categories:
i. Data dipping.
ii. Data mining.
iii. Data analysis.
Data dipping tools:
These are the basic business tools. They allow the generation of standard business reports.
They can perform basic analysis, answering standard business questions. As these tools are
relational they can also be used as data browsers and generally have reasonable drill-down
capabilities. Most of the tools will use metadata to isolate the user from the complexities of the
data warehouse and present a business friendly schema.
Data mining tools:
These are specialist tools designed for finding trends and patterns in the underlying data. These
tools use techniques such as artificial intelligence and neural networks to mine the data and
find connections that may not be immediately obvious. A data mining tool could be used to
find common behavioral trends in a business’s customers or to root out market segments by
grouping customers with common attributes.
Data analysis tools:
These are used to perform complex analysis of data. They will normally have a rich set of
analytic functions, which allow sophisticated analysis of the data. These tools are designed for
business analysis and will generally understand the common business metrics. Data analysis
tools can again be subdivided into two categories: Multidimensional Online Analytical
Processing (MOLAP) and Relational Online Analytical Processing (ROLAP).
Online Analytical Processing (OLAP) is a category of software tools that provides analysis of
data stored in a database. OLAP tools enable users to analyze different dimensions of

multidimensional data. For example, it provides time series and trend analysis views. OLAP is
a technology designed to provide superior performance for ad hoc business intelligence
queries. OLAP is designed to operate efficiently with data organized in accordance with the
common dimensional model used in data warehouses.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are performed when
the cube is built, it is not possible to include a large amount of data in the cube itself. This
is not to say that the data in the cube cannot be derived from a large amount of data.
Indeed, this is possible. But in this case, only summary-level information will be included
in the cube itself.
• Requires additional investment: Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, additional investments in human and capital resources are likely to be needed.
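The toy sketch below mimics the MOLAP idea: every aggregate over the (region, quarter) dimensions is pre-computed when the cube is built, so a query is just a dictionary look-up. The dimensions and figures are hypothetical:

```python
from itertools import product

facts = [                                    # hypothetical fact rows
    ("North", "Q1", 100), ("North", "Q2", 150),
    ("South", "Q1", 80),  ("South", "Q2", 120),
]

# Build the cube: pre-compute every aggregate, including the "ALL" roll-ups.
cube = {}
for region, quarter in product(["North", "South", "ALL"], ["Q1", "Q2", "ALL"]):
    cube[(region, quarter)] = sum(
        amount for r, q, amount in facts
        if region in (r, "ALL") and quarter in (q, "ALL")
    )

print(cube[("North", "ALL")])   # fast retrieval at query time: 250
print(cube[("ALL", "ALL")])     # 450
```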
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP’s slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
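A minimal sketch of that idea: each slice or dice chosen by the user simply adds another predicate to the SQL generated against the relational store (the sales_fact table and its columns are invented for illustration):

```python
def rolap_query(measures, slices):
    """Build the SQL for a slice/dice request; each slice adds a WHERE predicate."""
    where = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
    return (f"SELECT {', '.join(measures)} FROM sales_fact"
            + (f" WHERE {where}" if where else ""))

# Slicing on year, then dicing further by region, simply adds predicates:
print(rolap_query(["SUM(amount)"], {"year": "2008"}))
print(rolap_query(["SUM(amount)"], {"year": "2008", "region": "North"}))
```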
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself
places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, a relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on generating
SQL statements to query the relational database, and SQL statements do not fit all needs
(for example, it is difficult to perform complex calculations using SQL), ROLAP
technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have
mitigated this risk by building into the tool out-of-the-box complex functions as well as
the ability to allow users to define their own functions.

Section C: Applied Theory


7. Neural networks
Genetic algorithms derive their inspiration from biology while neural networks are modeled on the human brain. In Freud’s theory of psychodynamics, the human brain was described as a neural network, and recent investigations have corroborated this view. The human brain consists of a very large number of neurons, about 10^11, connected to each other via a huge number of so-called synapses. A single neuron is connected to other neurons by a couple of thousand of these synapses. Although neurons could be described as the simple building blocks of the brain, the human brain can handle very complex tasks despite this relative simplicity. This analogy therefore offers an interesting model for the creation of more complex
learning machines, and has led to the creation of so-called artificial neural networks. Such
networks can be built using special hardware but most are just software programs that can
operate on normal computers. Typically, a neural network consists of a set of nodes: input
nodes receive the input signals, output nodes give the output signals, and a potentially
unlimited number of intermediate layers contain the intermediate nodes. When using neural
networks we have to distinguish between two stages - the encoding stage in which the neural
network is trained to perform a certain task, and the decoding stage in which the network is
used to classify examples, make predictions, or execute whatever learning task is involved.
There are several different forms of neural network but we shall discuss only three of them
here:
• Perceptrons
• Back propagation networks
• Kohonen self-organizing map
In 1958 Frank Rosenblatt of the Cornell Aeronautical Laboratory built the so-called
perceptron, one of the first implementations of what would later be known as a neural
network. A perceptron consists of a simple three-layered network with input units called
photo-receptors, intermediate units called associators, and output units called responders. The
perceptron could learn simple categories and thus could be used to perform simple
classification tasks. Later, in 1969, Minsky and Papert showed that the class of problem that
could be solved by a machine with a perceptron architecture was very limited. It was only in
the 1980s that researchers began to develop neural networks with a more sophisticated
architecture that could overcome these difficulties. A major improvement was the introduction of hidden layers in the so-called back propagation networks.
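Before turning to back propagation, here is a minimal sketch of the classic perceptron learning rule described above, trained on a tiny, hypothetical linearly separable category (logical AND); it is illustrative only:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])               # the simple category to be learned

w = np.zeros(2)
b = 0.0
for _ in range(10):                      # repeat over the training set
    for x_i, t in zip(X, y):
        pred = 1 if x_i @ w + b > 0 else 0
        # Classic perceptron rule: adjust weights only when the answer is wrong.
        w += (t - pred) * x_i
        b += (t - pred)

print([1 if x_i @ w + b > 0 else 0 for x_i in X])   # -> [0, 0, 0, 1]
```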
A back propagation network not only has input and output nodes, but also a set of
intermediate layers with hidden nodes. In its initial stage a back propagation network has
random weightings on its synapses. When we train the network, we expose it to a training set
of input data. For each training instance, the actual output of the network is compared with
the desired output that would give a correct answer; if there is a difference between the
correct answer and the actual answer, the weightings of the individual nodes and synapses of
the network are adjusted. This process is repeated until the responses are more or less
accurate. Once the structure of the network stabilizes, the learning stage is over, and the
network is now trained and ready to categorize unknown input. Figure 1 represents a simple architecture of a neural network that can perform an analysis on part of our marketing database. The age attribute has been split into three age classes, each represented by a separate input node; house and car ownership also have an input node. There are four additional nodes identifying the four areas, so that in this way each input node corresponds to a simple yes-no decision. The same holds for the output nodes: each magazine has a node. It is
clear that this coding corresponds well with the information stored in the database.
The input nodes are wholly interconnected to the hidden nodes, and the hidden nodes are
wholly interconnected to the output nodes. In an untrained network the branches between the
nodes have equal weights. During the training stage the network receives examples of input
and output pairs corresponding to records in the database; and adapts the weights of the
different branches until all the inputs match the appropriate outputs.
In Figure 2 the network learns to recognize readers of the car magazine and comics. Figure
3 shows the internal state of the network after training. The configuration of the internal
nodes shows that there is a certain connection between the car magazine and comics readers.
However, the networks do not provide a rule to identify this association.
Back propagation networks are a great improvement on the perceptron architecture.
However, they also have disadvantages, one being that they need an extremely large training
set. Another problem of neural networks is that although they learn, they do not provide us
with a theory about what they have learned - they are simply black boxes that give answers
but provide no clear idea as to how they arrived at these answers.
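A minimal NumPy sketch of the training procedure described above, with random initial weights, a forward pass, comparison of the actual and desired outputs, and adjustment of the weights; the tiny data set and network size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 8))   # random initial weightings on the synapses
W2 = rng.normal(size=(8, 1))

for _ in range(10000):
    h = sigmoid(X @ W1)        # hidden layer
    out = sigmoid(h @ W2)      # actual output of the network

    # Difference between actual and desired output, propagated backwards;
    # the weights of the synapses are adjusted a little each time.
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ err_out
    W1 -= 0.5 * X.T @ err_hid

print(np.round(out, 2))        # responses should now be close to the desired outputs
```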
In 1981 Teuvo Kohonen demonstrated a completely different version of neural networks that
is currently known as Kohonen’s self-organizing maps. These neural networks can be seen as
the artificial counterparts of maps that exist in several places in the brain, such as visual
maps, maps of the spatial possibilities of limbs, and so on. A Kohonen self-organizing map is
a collection of neurons or units, each of which is connected to a small number of other units
called its neighbors. Most of the time, the Kohonen map is two-dimensional; each node or unit contains a vector that is related to the space whose structure we are investigating. In its initial setting, the self-organizing map has a random assignment of vectors to each unit. During the training stage, these vectors are incrementally adjusted to give a better coverage of the space. A natural way to visualize the process of training a self-organizing map is the
so-called Kohonen movie, which is a series of frames showing the positions of the vectors
and their connections with neighboring cells. The network resembles an elastic surface that is
pulled out over the sample space. Neural networks perform well on classification tasks and can be very useful in data mining.
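A much-simplified, one-dimensional Python sketch of this incremental adjustment, in which the best matching unit and its immediate neighbours are pulled towards each sample; the map size, data and neighbourhood are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

units = rng.random((10, 2))          # 10 units, each with a random initial 2-D vector
data = rng.random((500, 2))          # hypothetical samples from the space to be covered

for t, x in enumerate(data):
    lr = 0.5 * (1 - t / len(data))                            # learning rate shrinks over time
    bmu = int(np.argmin(np.linalg.norm(units - x, axis=1)))   # best matching unit
    # Pull the winner and its immediate neighbours a little towards the sample.
    for j in range(max(0, bmu - 1), min(len(units), bmu + 2)):
        units[j] += lr * (x - units[j])

print(np.round(units, 2))            # the unit vectors now spread over the sampled space
```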

[Figures 1, 2 and 3 are not reproduced here.]
8. The query manager has several distinct responsibilities. It is used to control the following:
• User access to the data
• Query scheduling
• Query monitoring
These areas are all very different in nature, and each area requires its own tools, bespoke
software and procedures. The query manager is one of the most bespoke pieces of software in
the data warehouse.
User access to the data: The query manager is the software interface between the users and
the data. It presents the data to the users in a form they understand. It also controls the user access to the data. In a data warehouse, the raw data will often be an amalgamation of data that needs to be tied together somehow; to achieve this, the raw data is often abstracted. Data in this raw format can often be difficult to interpret. This, coupled with the fact that data from a single logical table is often partitioned into multiple real tables, can make ad hoc querying of raw data difficult.
The query manager’s task is to address this problem by presenting a meaningful schema to
the users via a friendly front end. The query manager will at one end take in the user’s
requirements, and in the background using the metadata it will transform these requirements
into queries against the appropriate data.
Ideally all user access tools should work via the query manager. However as a number of
different tools are likely to be used, and the tools used are likely to change over time, it is
possible that not all tools will work directly via the query manager.
If users have access via tools that do not interface directly through the query manager, you should try setting up some form of indirect control by the query manager. Certainly, no large ad hoc queries should be allowed to be run other than through the query manager. It may be
possible to get the tool to dump the query request to a flat file, where the query manager can
pick it up. If queries do bypass the query manager, query statistics gathering will be less
accurate.
Query Scheduling: Scheduling of ad hoc queries is a responsibility of the query manager.
Simultaneous large ad hoc queries, if not controlled, can severely affect the performance of
any system: in particular if the queries are run using parallelism, where a single query can
potentially use all the CPU resource made available to it.
One aspect of query control that is glaringly visible by its absence is the ability to predict how long a query will take to complete.
Query monitoring: One of the main functions of the query manager is to monitor the
queries as they run. This is one of the reasons why all queries should be run via, or at least
notified to, the query manager. One of the keys to the successful usage of a data warehouse is the tuning of the ad hoc environment to meet the users' needs. To achieve this, query profiles of different groups of users need to be known. This can be achieved only if there are long-term statistics on the queries run by each user and the resources used by each query. The query execution plan needs to be stored along with the statistics of the resources used, and the query syntax used.
The query manager has to be capable of gathering these statistics, which should then be
stored in the database for later analysis. It should also maintain a query history. Every query
created or executed via the query manager should be logged. This allows query profiles to be
built up over time. This enables identification of frequently run queries, or types of queries.
These queries can then be tuned, possibly by adding new indexes or by creating new
aggregations.
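A minimal sketch of this monitoring responsibility, using sqlite3 as a stand-in warehouse: every query routed through the (hypothetical) query manager is logged with its syntax, execution plan and elapsed time. The table names and the use of EXPLAIN QUERY PLAN are illustrative assumptions, not a prescribed design:

```python
import sqlite3
import time

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, amount REAL)")
warehouse.execute("CREATE TABLE query_history (syntax TEXT, plan TEXT, elapsed REAL)")

def run_via_query_manager(sql):
    """Run a query and record its syntax, execution plan and elapsed time."""
    plan = str(warehouse.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
    start = time.perf_counter()
    result = warehouse.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    warehouse.execute("INSERT INTO query_history VALUES (?, ?, ?)", (sql, plan, elapsed))
    return result

run_via_query_manager("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(warehouse.execute("SELECT syntax, elapsed FROM query_history").fetchall())
```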
