• The extraction of information from source-databases should be efficient.
• The quality of data should be maintained (Figure 8.1).
• Suitable checks are required to ensure the quality of data after each refresh.
• The ODS is required to
→ satisfy integrity constraints. Ex: existential-integrity, referential-integrity.
→ take appropriate actions to deal with null values.
• The ODS is a read-only database, i.e. users shouldn't be allowed to update information.
• Populating an ODS involves an acquisition-process of extracting, transforming &
loading data from source systems. This process is called ETL.
(ETL = Extraction, Transformation and Loading).
• Before an ODS can go online, following 2 tasks must be completed:
i) Checking for anomalies &
ii) Testing for performance.
A-1
For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013
1b. What is ETL? Explain the steps in ETL. (07 Marks)
Ans:
ETL (EXTRACTION, TRANSFORMATION & LOADING)
• The ETL process consists of
→ data-extraction from source systems
→ data-transformation which includes data-cleaning &
→ data-loading in the ODS or the data-warehouse.
• Data cleaning deals with detecting & removing errors/inconsistencies from the data.
• Most often, the data is sourced from a variety of systems.
PROBLEMS TO BE SOLVED FOR BUILDING INTEGRATED-DATABASE
1) Instance Identity Problem
• The same customer may be represented slightly differently in different source-systems.
2) Data-errors
• Different types of data-errors include:
i) There may be some missing attribute-values.
ii) There may be duplicate records.
3) Record Linkage Problem
• This deals with problem of linking information from different databases that
relates to the same customer.
4) Semantic Integration Problem
• This deals with integration of information found in heterogeneous-OLTP &
legacy sources.
• For example,
→ Some of the sources may be relational.
→ Some sources may be in text documents.
→ Some data may be character strings or integers.
5) Data Integrity Problem
• This deals with issues like
i) referential integrity
ii) null values &
iii) domain of values.
STEPS IN DATA CLEANING
1) Parsing
• This involves locating and identifying individual data elements in the source files and then isolating these data elements in the target files.
2) Correcting
• Correcting the identified-components is based on sophisticated techniques
using mathematical algorithms.
• Correcting may involve use of other related information that may be available in the company.
3) Standardizing
• Business rules of the company are used to transform data to standard form.
• For ex, there might be rules on how name and address are to be represented.
4) Matching
• Much of the data extracted from a number of source-systems is likely to be
related. Such data needs to be matched.
5) Consolidating
• All corrected, standardized and matched data can now be consolidated to build
a single version of the company-data.
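Taken together, the five data-cleaning steps can be sketched on toy records; every name, rule, and record below is a made-up illustration, not part of the paper:

```python
# A minimal sketch of parsing, standardizing, and matching/consolidating.
# All records and business rules here are illustrative assumptions.

def parse(record):
    # Parsing: locate and isolate the individual data elements.
    name, city = [part.strip() for part in record.split(",")]
    return {"name": name, "city": city}

def standardize(rec):
    # Standardizing: a hypothetical business rule -- names in title case,
    # cities in upper case.
    return {"name": rec["name"].title(), "city": rec["city"].upper()}

def match_and_consolidate(records):
    # Matching + consolidating: records with the same standardized name
    # are assumed to refer to the same customer and are merged.
    merged = {}
    for rec in records:
        merged.setdefault(rec["name"], rec)
    return list(merged.values())

raw = ["john smith, mysore", "JOHN SMITH , Mysore", "Asha Rao, bangalore"]
clean = match_and_consolidate(standardize(parse(r)) for r in raw)
print(len(clean))  # 2 distinct customers remain
```

After consolidation, the two spellings of the same customer collapse into a single version of the record.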
1c. What are the guide lines for implementing the data-warehouse. (05 Marks)
Ans:
DW IMPLEMENTATION GUIDELINES
Build Incrementally
• Firstly, a data-mart will be built.
• Then, a number of other sections of the company will be built.
• Then, the company data-warehouse will be implemented in an iterative
manner.
• Finally, all data-marts extract information from the data-warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable research into following:
i) Expected-costs &
ii) Benefits of the project.
• The projects require inputs from many departments in the company.
• Therefore, the projects must be driven by someone who is capable of interacting with people in the company.
Senior Management Support
• The project calls for a sustained commitment from senior-management due to
i) The resource intensive nature of the projects.
ii) The time the projects can take to implement.
Ensure Quality
• Data-warehouse should be loaded with
i) Only cleaned data &
ii) Only quality data.
Corporate Strategy
• The project must fit with
i) corporate-strategy &
ii) business-objectives.
Business Plan
• All stakeholders must have a clear understanding of
i) Project plan
ii) Financial costs &
iii) Expected benefits.
Training
• The users must be trained to use the data-warehouse and to understand its capabilities.
2b. Explain the operation of data-cube with suitable examples. (08 Marks)
Ans:
ROLL-UP
• This is like zooming-out on the data-cube. (Figure 2.1a).
• This is required when the user needs further abstraction or less detail.
• Initially, the location-hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location-hierarchy from the level-of-city to the level-of-country.
Figure 2.1a: Roll-up operation
DRILL DOWN
• This is like zooming-in on the data. (Figure 2.1b).
• This is the reverse of roll-up.
• This is an appropriate operation
→ when the user needs further details or
→ when the user wants to partition more finely or
→ when the user wants to focus on some particular values of certain dimensions.
• This adds more details to the data.
• Initially, the time-hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level-of-quarter to the level-of-month.
Figure 2.1c: Pivot operation
SLICE & DICE
• These are operations for browsing the data in the cube.
• These operations allow ability to look at information from different viewpoints.
• A slice is a subset of cube corresponding to a single value for 1 or more members of
dimensions. (Figure 2.1d).
• A dice operation is done by performing a selection of 2 or more dimensions. (Figure 2.1e).
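The cube operations above can be imitated on a small array; the 3-D sales-cube, its dimension names, and the index choices below are illustrative assumptions:

```python
import numpy as np

# Hypothetical 3-D sales cube: dimensions (city, quarter, product).
cube = np.arange(24).reshape(2, 3, 4)   # 2 cities x 3 quarters x 4 products

# Slice: fix a single value of one dimension (here city index 0).
slice_city0 = cube[0, :, :]             # a 3x4 sub-cube

# Dice: select on 2 or more dimensions at once.
dice = cube[:, 0:2, 1:3]                # both cities, 2 quarters, 2 products

# Roll-up: aggregate away a dimension (ascend the hierarchy), here time.
rollup_over_time = cube.sum(axis=1)     # 2x4 totals per city and product

print(slice_city0.shape, dice.shape, rollup_over_time.shape)
```

Drill-down is the reverse direction: returning from the aggregated `rollup_over_time` view to the finer per-quarter values in `cube`.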
2c. Write short note on (08 Marks)
i) ROLAP
ii) MOLAP
iii) FASMI
iv) DATACUBE
Ans:(i) For answer, refer Solved Paper June-2014 Q.No.2b.
Ans:(ii) For answer, refer Solved Paper June-2014 Q.No.2b.
Ans:(iii) For answer, refer Solved Paper June-2015 Q.No.2a.
Ans:(iv) For answer, refer Solved Paper Dec-2014 Q.No.2a.
3a. Discuss the tasks of data-mining with suitable examples. (10 Marks)
Ans:
DATA-MINING
• Data-mining is the process of automatically discovering useful information in large data-repositories.
DATA-MINING TASKS
1) Predictive Modeling
• This refers to the task of building a model for the target-variable as a function of the explanatory-variables.
• The goal is to learn a model that minimizes the error between
i) Predicted values of target-variable and
ii) True values of target-variable (Figure 3.1).
• There are 2 types:
i) Classification: is used for discrete target-variables.
Ex: Predicting whether a web user will make a purchase at an online bookstore is a classification-task.
ii) Regression: is used for continuous target-variables.
Ex: Forecasting the future price of a stock is a regression-task.
2) Association Analysis
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis
We may discover the rule
{Diapers} → {Milk}
This suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) Finding groups of genes that have related functionality.
ii) Identifying web pages that are accessed together.
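The rule above can be checked numerically; the transactions below are made up for illustration:

```python
# Support and confidence of the rule {diapers} -> {milk}, computed over a
# small hypothetical set of market-basket transactions.
transactions = [
    {"bread", "diapers", "milk"},
    {"diapers", "milk", "beer"},
    {"bread", "milk"},
    {"diapers", "beer"},
    {"bread", "diapers", "milk", "beer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "milk"} <= t)
antecedent = sum(1 for t in transactions if "diapers" in t)

support = both / n              # fraction of all transactions with both items
confidence = both / antecedent  # fraction of diaper-buyers who also buy milk
print(support, confidence)      # 0.6 0.75
```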
3) Cluster Analysis
• This seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters.
• Useful applications:
i) To group sets of related customers.
ii) To find areas of the ocean that have a significant impact on Earth's climate.
• For example:
Collection of news articles in Table 1.2 shows
→ First 4 rows speak about the economy &
→ Last 2 rows speak about the health sector.
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies.
• The goal is to
i) Discover the real anomalies &
ii) Avoid falsely labeling normal objects as anomalous.
• Useful applications:
i) Credit card fraud detection &
ii) Network intrusion detection.
3b. Explain shortly any five data pre-processing approaches. (10 Marks)
Ans:
DATA PRE-PROCESSING
• Data pre-processing is a data-mining technique that involves transforming raw data into an
understandable format.
Q: Why is data pre-processing required?
• Data is often collected for unspecified applications.
• Data may have quality-problems that need to be addressed before applying a DM-technique.
For example: 1) Noise & outliers
2) Missing values &
3) Duplicate data.
• Therefore, preprocessing may be needed to make data more suitable for data-mining.
DATA PRE-PROCESSING APPROACHES
1. Aggregation
2. Dimensionality reduction
3. Variable transformation
4. Sampling
5. Feature subset selection
6. Discretization & binarization
7. Feature Creation
1) AGGREGATION
• This refers to combining 2 or more attributes into a single attribute.
For example, merging daily sales-figures to obtain monthly sales-figures.
Purpose
1) Data reduction: Smaller data-sets require less processing time & less
memory.
2) Aggregation can act as a change of scale by providing a high-level view of the data instead of a low-level view.
E.g. Cities aggregated into districts, states, countries, etc.
3) More “stable” data: Aggregated data tends to have less variability.
• Disadvantage: The potential loss of interesting details.
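A minimal sketch of the daily-to-monthly merging mentioned above (dates and amounts are invented):

```python
from collections import defaultdict

# Aggregation sketch: merging hypothetical daily sales-figures into
# monthly sales-figures (data reduction by a change of scale).
daily_sales = [
    ("2013-12-01", 120.0),
    ("2013-12-02", 80.0),
    ("2014-01-05", 200.0),
    ("2014-01-20", 50.0),
]

monthly = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]          # "YYYY-MM" -- the coarser level of the hierarchy
    monthly[month] += amount

print(dict(monthly))          # {'2013-12': 200.0, '2014-01': 250.0}
```

The daily detail is lost after the merge, which is exactly the disadvantage noted above.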
2) DIMENSIONALITY REDUCTION
• This refers to reducing the number of attributes under consideration.
• Many DM algorithms work poorly when the dimensionality of the data is high.
As a result, we get i) reduced classification accuracy & ii) poor quality clusters.
Purpose
• Avoid curse of dimensionality.
• May help to
→ eliminate irrelevant features & reduce noise
→ reduce the amount of time & memory required by DM algorithms &
→ allow the data to be more easily visualized.
3) VARIABLE TRANSFORMATION
• This refers to a transformation that is applied to all the values of a variable.
Ex: converting a floating point value to an absolute value.
• Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• For ex: If x is a variable, then transformations may be e^x, 1/x, log(x)
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• If x̄ is the mean of the attribute-values and sx is their standard deviation, then the transformation x' = (x − x̄)/sx creates a new variable that has a mean of 0 and a standard-deviation of 1.
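The transformation can be verified on a small set of values (the numbers are illustrative):

```python
import statistics

# Normalization sketch: x' = (x - mean) / std gives mean 0 and
# standard-deviation 1.
x = [2.0, 4.0, 6.0, 8.0]
mean = statistics.mean(x)            # 5.0
sd = statistics.pstdev(x)            # population standard deviation
x_new = [(v - mean) / sd for v in x]

print(round(statistics.mean(x_new), 10), round(statistics.pstdev(x_new), 10))
```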
4) SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is used for
i) Preliminary investigation of the data &
ii) Final data analysis.
• Q: Why sampling?
Ans: Obtaining & processing the entire set of "data of interest" is too expensive or time-consuming.
• Three sampling methods:
i) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 types:
a) Sampling without Replacement
• As each object is selected, it is removed from the population.
b) Sampling with Replacement
• Objects are not removed from the population as they are selected for the sample. The same object can be picked up more than once.
ii) Stratified Sampling
• This starts with pre-specified groups of objects.
• Equal numbers of objects are drawn from each group.
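The sampling schemes described above can be sketched as follows (the population, seed, and sample-sizes are illustrative):

```python
import random

# Sampling sketch: the schemes described above, on a toy population.
random.seed(42)                       # for repeatability of this illustration
population = list(range(100))

# Simple random sampling without replacement: each object leaves the pool.
without = random.sample(population, 10)

# Simple random sampling with replacement: the same object may repeat.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: equal numbers drawn from pre-specified groups.
strata = [population[:50], population[50:]]
stratified = [x for group in strata for x in random.sample(group, 5)]

print(len(without), len(with_repl), len(stratified))
```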
iii) Progressive Sampling
• This method starts with a small sample, and then increases the sample-size until a sample of sufficient size has been obtained.
5) FEATURE SUBSET SELECTION
• This reduces dimensionality by using only a subset of the features.
• Three standard approaches:
1) Embedded Approaches
• Feature selection occurs naturally as part of the DM algorithm.
2) Filter Approaches
• Features are selected before the DM algorithm is run.
3) Wrapper Approaches
• Use the DM algorithm as a black box to find the best subset of attributes.
7) FEATURE CREATION
• This creates new attributes that can capture the important information in a data-set much
more efficiently than the original attributes.
• Three general methods:
1) Feature Extraction
• Creation of new set of features from the original raw data.
2) Mapping Data to New Space
• A totally different view of data that can reveal important and interesting features.
3) Feature Construction
• Combining features to get better features than the original.
4a. Develop the Apriori Algorithm for generating frequent-itemsets. (08 Marks)
Ans:
APRIORI ALGORITHM FOR GENERATING FREQUENT-ITEMSETS
• Let Ck = set of candidate k-itemsets.
Let Fk = set of frequent k-itemsets.
• The algorithm initially makes a single pass over the data-set to determine the support of
each item.
After this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 & 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using frequent (k−1)-itemsets found in the previous iteration (step 5).
Candidate generation is implemented using a function called apriori-gen.
• To count the support of the candidates, the algorithm needs to make an additional pass over
the data-set (steps 6–10).
The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction 't'.
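The passes described above can be sketched in Python; the brute-force support counting and the simple join used for candidate generation below are illustrative simplifications, and the transactions are made up:

```python
# A compact sketch of Apriori frequent-itemset generation.
def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Brute-force support count: one pass over the data-set.
        return sum(1 for t in transactions if itemset <= t)

    # Pass 1: support of each item -> frequent 1-itemsets F1.
    items = {i for t in transactions for i in t}
    F = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = list(F)
    k = 2
    while F:
        # apriori-gen (simplified): join frequent (k-1)-itemsets to form Ck.
        candidates = {a | b for a in F for b in F if len(a | b) == k}
        # Additional pass: keep candidates whose support >= minsup.
        F = [c for c in candidates if support(c) >= minsup]
        frequent += F
        k += 1          # terminate when no new frequent-itemsets appear
    return set(frequent)

T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(T, minsup=3)
print(sorted(sorted(s) for s in result))
```

With minsup = 3, the three items and the three pairs are frequent, while {a, b, c} (support 2) is pruned.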
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than the minimum support (step 12).
• The algorithm terminates when there are no new frequent-itemsets generated (step 13).
Ans:
ASSOCIATION ANALYSIS
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis
We may discover the rule {Diapers} → {Milk},
which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) Finding groups of genes that have related functionality.
ii) Identifying web pages that are accessed together.
4c. Consider the transaction data-set:
Construct the FP tree by showing the trees separately after reading each
transaction. (08 Marks)
Ans:
Procedure:
1. A scan of T1 derives a list of frequent-items, ⟨(a:8), (b:5), (c:3), (d:1), ...⟩, in which items are ordered in frequency-descending order.
2. Then, the root of a tree is created and labeled with "null".
3. (a) For the first transaction, the first branch of the tree is constructed, with items listed in the descending order of frequent-items.
(b) For the third transaction (Figure 6.24iii).
→ since its (ordered) frequent-item list (a, c, d, e) shares a common prefix 'a' with the existing path (a, b):
→ the count of each node along the shared prefix is incremented by 1 and
→ three new nodes (c:1), (d:1), (e:1) are created and linked as a chain of children of (a:2).
(c) For the seventh transaction, since its frequent-item list contains only the item 'a', which shares only the node 'a' with the a-prefix subtree, a's count is incremented by 1.
(d) The above process is repeated for all the transactions.
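The construction procedure can be sketched as follows; the Node class, the minsup choice, and the sample transactions are illustrative assumptions, and no header-table or node-links are kept:

```python
from collections import Counter

# A minimal FP-tree construction sketch following the steps above.
class Node:
    def __init__(self, item, count=1):
        self.item, self.count, self.children = item, count, {}

def build_fp_tree(transactions, minsup=1):
    # Step 1: one scan to get the frequency-descending order of items.
    freq = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= minsup}
    # Step 2: create the root, labeled "null".
    root = Node(None, 0)
    # Step 3: insert each transaction along a shared-prefix path.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = Node(item) # new branch
            node = node.children[item]
    return root

T = [["a", "b"], ["a", "c", "d"], ["a"], ["b", "c"]]
tree = build_fp_tree(T)
print(tree.children["a"].count)   # 'a' heads three transactions -> 3
```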
5a. Explain Hunt's Algorithm and illustrate its working. (08 Marks)
Ans:
HUNT’S ALGORITHM
• A decision-tree is grown in a recursive fashion.
• Let Dt = set of training-records that are associated with node 't'.
Let y = {y1, y2, . . . , yc} be the class-labels.
• Hunt's algorithm is as follows:
Step 1:
• If all records in Dt belong to same class yt, then t is a leaf node labeled as yt.
Step 2:
• If Dt contains records that belong to more than one class, an attribute test-condition is selected to partition the records into smaller subsets.
• A child node is created for each outcome of the test-condition and the records
in Dt are distributed to the children based on the outcomes.
• The algorithm is then recursively applied to each child node.
EXPLANATION OF DECISION-TREE CONSTRUCTION
1) The initial tree for the classification problem contains a single node with class-label
Defaulted=No (Fig 4.7(a)).
2) The records are subsequently divided into smaller subsets based on the outcomes of
the Home Owner test-condition.
3) Hunt's algorithm is then applied recursively to each child of the root node.
4) The left child of the root is therefore a leaf node labeled Defaulted=No (Fig 4.7(b)).
5) For the right child, we need to continue applying the recursive step of Hunt's
algorithm until all the records belong to the same class.
TWO DESIGN ISSUES OF DECISION-TREE
1. How should the training-records be split?
The algorithm must provide
i) a method for specifying test-condition for different attribute-types.
ii) an objective-measure for evaluating goodness of each test-condition.
2. How should the splitting procedure stop?
A possible strategy is to continue expanding a node until either
i) All the records belong to the same class or
ii) All the records have identical attribute values.
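The two recursive steps can be sketched as follows; choosing the first unused attribute as the test-condition is a simplifying assumption (real implementations use an objective-measure such as Gini or entropy), and the records only loosely echo the Home Owner example:

```python
# A sketch of Hunt's algorithm on (attribute-dict, class-label) records.
def hunt(records, attributes):
    labels = {label for _, label in records}
    # Step 1 (plus a stopping rule): one class left, or no attributes
    # remain -> leaf node labeled with the (majority) class.
    if len(labels) == 1 or not attributes:
        return max(labels, key=lambda l: sum(1 for _, x in records if x == l))
    # Step 2: pick a test-condition, partition the records, create a
    # child per outcome, and recurse on each child.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for rec in records:
        children.setdefault(rec[0][attr], []).append(rec)
    return {attr: {value: hunt(subset, rest)
                   for value, subset in children.items()}}

records = [
    ({"home_owner": "yes"}, "No"),
    ({"home_owner": "no"}, "Yes"),
    ({"home_owner": "no"}, "No"),
    ({"home_owner": "no"}, "No"),
]
tree = hunt(records, ["home_owner"])
print(tree)
```

The "yes" child is pure and becomes a leaf immediately; the "no" child stops on the majority label once no attributes remain.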
5b. What is a rule-based classifier? Explain how a rule-based classifier works. (08 Marks)
Ans:
RULE-BASED CLASSIFICATION
• A rule-based classifier is a technique for classifying records.
• This uses a set of “if..then..” rules.
• The rule-set is represented as R = (r1 ∨ r2 ∨ . . . ∨ rk)
where R = rule-set
ri's = classification-rules
• General format of a rule: ri: (conditioni) → yi
where conditioni = conjunction of attribute tests (A1 op v1) ∧ (A2 op v2) ∧ . . . ∧ (Ak op vk)
yi = class-label
LHS = rule antecedent, containing a conjunction of attribute tests (Aj op vj)
RHS = rule consequent, containing the predicted class yi
op = logical operators such as =, ≠, <, >, ≤, ≥
• For ex: Rule R1 is
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
• Given a data-set D and a rule r: A → y, the quality of the rule can be evaluated using the following two measures:
i) Coverage is defined as the fraction of records in D that trigger the rule r.
ii) Accuracy is defined as fraction of records triggered by r whose class-labels are equal to y.
i.e. Coverage(r) = |A| / |D| and Accuracy(r) = |A ∩ y| / |A|
where |A| = no. of records that satisfy the rule antecedent.
|A ∩ y| = no. of records that satisfy both the antecedent and consequent.
|D| = total no. of records.
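The two measures can be computed directly; the records and the rule below are illustrative (patterned on rule R1):

```python
# Coverage and accuracy of a rule, computed on hypothetical records.
# Rule r: (Give Birth = no) AND (Can Fly = yes) -> Birds
D = [
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ({"give_birth": "no", "can_fly": "yes"}, "Reptiles"),
    ({"give_birth": "yes", "can_fly": "no"}, "Mammals"),
    ({"give_birth": "no", "can_fly": "no"}, "Fishes"),
]

def antecedent(x):
    return x["give_birth"] == "no" and x["can_fly"] == "yes"

A = [(x, y) for x, y in D if antecedent(x)]               # records triggering r
coverage = len(A) / len(D)                                # |A| / |D|
accuracy = sum(1 for _, y in A if y == "Birds") / len(A)  # |A ∩ y| / |A|
print(coverage, round(accuracy, 3))   # 0.6 0.667
```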
1) Mutually Exclusive Rules
• A classifier contains mutually exclusive rules if the rules are independent of each other.
• Every record is covered by at most one rule.
• In the above example,
→ lemur triggers only one rule, R3.
→ dogfish triggers no rule.
→ turtle triggers more than one rule, i.e. R4 & R5, so the rules are not mutually exclusive.
2) Exhaustive Rules
• Classifier has exhaustive coverage if it accounts for every possible combination of
attribute values.
• Each record is covered by at least one rule.
• In the above example,
→ lemur and turtle are covered, as each triggers at least one rule.
→ dogfish is not covered, as it does not trigger any rule; so the rule-set is not exhaustive.
5c. Write the algorithm for k-nearest neighbour classification. (04 Marks)
Ans:
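A sketch of the k-nearest-neighbour classification procedure; the Euclidean metric, the majority vote, and the sample records are assumptions for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    # 1. Compute the distance from the query point to every training-record.
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # 2. Take the k nearest records.
    nearest = ranked[:k]
    # 3. Assign the majority class-label among the k neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((0.9, 1.1), "A")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "A"
```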
6a. What is Bayes Theorem? Show how it is used for classification. (06 Marks)
Ans:
BAYES THEOREM
• Bayes theorem is a statistical principle for combining prior knowledge of the classes with new
evidence gathered from data.
• Let X and Y be a pair of random-variables.
• A conditional probability P(X=x | Y=y) is the probability that a random-variable will take on a
particular value given that the outcome for another random-variable is known.
• The Bayes theorem is given by
P(Y | X) = P(X | Y) P(Y) / P(X)
• The Bayes theorem can be used to solve the prediction problem.
• Two implementations of Bayesian methods are used:
1. Naive Bayes classifier &
2. Bayesian belief network.
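A small numeric check of the theorem (all probabilities below are invented for illustration):

```python
# Bayes theorem: P(Y|X) = P(X|Y) P(Y) / P(X), with P(X) obtained from
# the law of total probability.
p_y = 0.3                 # prior P(Y)
p_x_given_y = 0.8         # likelihood P(X|Y)
p_x_given_not_y = 0.1

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)   # = 0.31
p_y_given_x = p_x_given_y * p_y / p_x                   # posterior
print(round(p_y_given_x, 4))   # 0.7742
```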
NAIVE BAYES CLASSIFIER
• A naive Bayes classifier estimates the class-conditional probability by assuming that the
attributes are conditionally independent.
• The conditional independence assumption can be formally stated as follows:
P(X | Y = y) = ∏ P(Xi | Y = y), the product running over i = 1, . . . , d
where each attribute-set X = {X1, X2, . . . , Xd} consists of 'd' attributes.
Conditional Independence
• Let X, Y, and Z denote three sets of random-variables.
• The variables in X are said to be conditionally independent of Y, given Z, if the
following condition holds:
P(X | Y, Z) = P(X | Z)
• For a categorical attribute Xi, the conditional probability P(Xi =xi | Y= y) is estimated
according to the fraction of training instances in class y that take on a particular
attribute value xi.
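That fraction can be computed directly from toy training data (attribute values and labels are invented):

```python
# Estimating P(Xi = xi | Y = y) for a categorical attribute as the
# fraction of class-y training instances that take on value xi.
train = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"),
         ("sunny", "yes"), ("rainy", "no")]

def cond_prob(value, label):
    in_class = [x for x, y in train if y == label]
    return sum(1 for x in in_class if x == value) / len(in_class)

print(cond_prob("sunny", "yes"))   # 2 of 3 "yes" instances are sunny
```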
6b. Discuss methods for estimating predictive accuracy of classification. (10 Marks)
Ans:
PREDICTIVE ACCURACY
• This refers to the ability of the model to correctly predict the class-label of new or previously
unseen data.
• A confusion-matrix that summarizes the no. of instances predicted correctly or incorrectly by
a classification-model is shown in Table 5.6.
METHODS FOR ESTIMATING PREDICTIVE ACCURACY
1) Sensitivity
2) Specificity
3) Recall &
4) Precision
• Let True positive (TP) = no. of positive examples correctly predicted.
False negative (FN) = no. of positive examples wrongly predicted as negative.
False positive (FP) = no. of negative-examples wrongly predicted as positive.
True negative (TN) = no. of negative-examples correctly predicted.
• The true positive rate (TPR) or sensitivity is defined as the fraction of positive examples predicted correctly by the model,
i.e. TPR = TP / (TP + FN)
Similarly, the true negative rate (TNR) or specificity is defined as the fraction of negative-examples predicted correctly by the model,
i.e. TNR = TN / (TN + FP)
• Finally, the false positive rate (FPR) is the fraction of negative-examples predicted as a positive class,
i.e. FPR = FP / (TN + FP)
Similarly, the false negative rate (FNR) is the fraction of positive examples predicted as a negative class,
i.e. FNR = FN / (TP + FN)
• Recall and precision are two widely used metrics employed in applications where successful detection of one of the classes is considered more significant than detection of the other classes,
i.e. Precision p = TP / (TP + FP) and Recall r = TP / (TP + FN)
• Precision determines the fraction of records that actually turns out to be positive in the
group the classifier has declared as a positive class.
• Recall measures the fraction of positive examples correctly predicted by the classifier.
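The metrics above can be computed from a hypothetical confusion-matrix:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FN, FP, TN = 40, 10, 5, 45

tpr = TP / (TP + FN)        # sensitivity / recall
tnr = TN / (TN + FP)        # specificity
fpr = FP / (TN + FP)
fnr = FN / (TP + FN)
precision = TP / (TP + FP)

print(tpr, tnr, fpr, fnr, round(precision, 3))
```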
• Weighted accuracy measure is defined by the following equation:
Weighted accuracy = (w1·TP + w4·TN) / (w1·TP + w2·FP + w3·FN + w4·TN)
6c. What are the two approaches for extending the binary-classifiers to handle multi-class problems. (04 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.6b.
7a. List and explain four distance measures to compute the distance between a pair
of points and find out the distance between two objects represented by attribute
values (1,6,2,5,3) & (3,5,2,6,6) by using any 2 of the distance measures (08 Marks)
Ans:
1) EUCLIDEAN DISTANCE
• This metric is most commonly used to compute distances.
• The largest valued-attribute may dominate the distance.
• Requirement: The attributes should be properly scaled.
• This metric is more appropriate when the data is not standardized.
2) MANHATTAN DISTANCE
• In most cases, the result obtained by this measure is similar to that obtained by using the Euclidean distance.
• The largest valued attribute may dominate the distance.
3) CHEBYCHEV DISTANCE
• This metric is based on the maximum attribute difference.
Solution:
Given, (x1,x2,x3,x4,x5) = (1, 6, 2, 5, 3)
(y1,y2,y3,y4,y5) = (3, 5, 2, 6, 6)
Euclidean Distance is
D(x,y) = √((1−3)² + (6−5)² + (2−2)² + (5−6)² + (3−6)²) = √(4+1+0+1+9) = √15
= 3.872983
Manhattan Distance is
D(x,y) = |x1−y1| + |x2−y2| + |x3−y3| + |x4−y4| + |x5−y5| = 2+1+0+1+3
= 7
Chebychev Distance is
D(x,y) = max(|1−3|, |6−5|, |2−2|, |5−6|, |3−6|) = max(2, 1, 0, 1, 3)
= 3
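The three measures can be verified programmatically for the given objects:

```python
import math

# The three distance measures applied to the given objects.
x = (1, 6, 2, 5, 3)
y = (3, 5, 2, 6, 6)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))
chebychev = max(abs(a - b) for a, b in zip(x, y))

print(round(euclidean, 6), manhattan, chebychev)   # 3.872983 7 3
```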
7b. Explain the cluster analysis methods briefly. (08 Marks)
Ans:
CLUSTER ANALYSIS METHODS
Partitional Method
YS
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster (Figure 4.1a).
• The method obtains a single-level partition of objects.
• The analyst has to specify
i) Number of clusters (k) in advance and
ii) Starting seeds of the clusters.
• The analyst has to use an iterative approach in which he runs the method many times
→ specifying different numbers of clusters & different starting seeds &
→ then selecting the best solution.
• The method converges to a local minimum rather than the global minimum.
Hierarchical Methods
• A set of nested clusters is organized as a hierarchical tree (Figure 4.1b).
• Two types:
1. Agglomerative: This starts with each object in an individual cluster & then tries to merge similar clusters into larger and larger clusters.
2. Divisive: This starts with all objects in one cluster & then subdivides it into smaller and smaller clusters.
Density-based Method
• A cluster is a dense region of points, which is separated by low-density regions from other regions of high-density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist within a given distance of it.
A-19
For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013
7c. What are the features of cluster analysis? (04 Marks)
Ans:
DESIRED FEATURES OF CLUSTER ANALYSIS METHOD
Scalability
• Data-mining problems can be large.
• Therefore, a cluster-analysis method should be able to deal with large
problems gracefully.
• The method should be able to deal with datasets in which number of attributes
is large.
Only one Scan of the Dataset
• For large problems, data must be stored on disk.
• So, the cost of disk I/O becomes significant in solving the problem.
• Therefore, the method should not require more than one scan of the dataset on disk.
Ability to Stop & Resume
• For large dataset, cluster-analysis may require huge processor-time to
complete the task.
• Therefore, the task should be able to be stopped & then resumed as & when
required.
Minimal Input Parameters
• The method should not expect too much guidance from the data-mining
analyst.
• Therefore, the analyst should not be expected
→ to have domain knowledge of the data and
→ to possess insight into the clusters.
Robustness
• The method should not be sensitive to errors (noise), outliers & missing values in the data.
8. Write short note on the following: (20 Marks)
a. Text mining b. Spatial-databases mining
c. Mining temporal databases d. Web content mining
Ans (a):
TEXT MINING
• This is concerned with the extraction of information implicitly contained in a collection of documents.
• Text-collection lacks the imposed structure of a traditional database.
• The text expresses a vast range of information. (DM = Data-mining, TM = Text-Mining).
• The text encodes the information in a form that is difficult to decipher automatically.
• Traditional DM techniques are designed to operate on structured-databases.
• In structured-databases,
it is easy to define the set of items and
hence, it is easy to use the traditional DM techniques.
In a textual-database, identifying individual items (or terms) is a difficult task.
• TM techniques have to be developed to process the unstructured textual-data.
• The inherent nature of textual-data motivates the development of separate TM techniques.
For ex, its unstructured characteristics.
• Two approaches for text-mining:
1) Impose a structure on the textual-database and use any of the known DM
techniques meant for structured-databases.
2) Develop a very specific technique for mining that exploits the inherent
characteristics of textual-databases.
Ans (b):
SPATIAL-DATABASES MINING
• This refers to the extraction of knowledge, spatial relationships, or other interesting patterns
not explicitly stored in spatial-databases.
• Consider a map of the city of Mysore containing clusters of points. (Where each point marks
the location of a particular house).
We can mine varieties of information by identifying likely-relationships.
For ex, "the land-value of the cluster of residential area around 'Mysore Palace' is high".
Such information could be useful to investors, or prospective home buyers.
SPATIAL MINING TASKS
1) Spatial-characteristic Rule
• This is a general description of spatial-data.
• For example, a rule may describe the general price-ranges of houses in various geographic regions of a city.
4) Attribute-oriented Induction
• The concept hierarchies of spatial and non-spatial attributes can be used to
determine relationships between different attributes.
• For ex, one may be interested in a particular category of land-use patterns.
A built-up area may be a recreational facility or a residential complex.
Similarly, a recreational facility may be a cinema or a restaurant.
5) Aggregate Proximity Relationships
• This problem is concerned with relationships between spatial-clusters based on
spatial and non-spatial attributes.
• Given „n‟ input clusters, we want to associate the clusters with classes of
features.
• For example, clusters may be associated with educational institutions which, in turn, may comprise secondary schools and junior colleges or higher institutions.
Ans (c):
MINING TEMPORAL DATABASES
• This can be defined as non-trivial extraction of potentially-useful & previously-unrecorded
information with an implicit/explicit temporal-content, from large quantities of data.
• This has the capability to infer causal and temporal-proximity relationships.
FOUR TYPES OF TEMPORAL-DATA
1) Static
• Static-data are free of any temporal-reference.
• Inferences derived from static-data are also free of any temporality.
2) Sequences (Ordered Sequences of Events)
• There may not be any explicit reference to time.
• There exists a temporal-relationship between data-items.
• For example, market-basket transactions.
3) Timestamped
• The temporal-information is explicit.
• The relationship can be quantitative, in the sense that
→ we can say the exact temporal-distance between the data-elements &
→ we can say that one transaction occurred before another.
• For example, census data, land-use data etc.
• Inferences derived from this data can be temporal or non-temporal.
4) Fully Temporal
• The validity of the data-element is time-dependent.
• Inferences derived from this data are necessarily temporal.
Ans (d):
WEB CONTENT MINING
• This is the process of extracting useful information from the contents of web-documents.
• In recent years,
→ government information is gradually being placed on the web.
→ users access digital libraries from the web.
→ selecting the hardware & software tools.
• This step also involves consulting
→ with senior-management &
→ with the various stakeholders.
2) Hardware Integration
• Both hardware and software need to be put together by integrating
→ servers
→ storage devices &
→ client software tools.
3) Modeling
• This involves designing the warehouse schema and views.
4) Physical Modeling
• This involves designing
→ data-warehouse organization
→ data placement
→ data partitioning &
→ deciding on access methods & indexing.
• This may involve using a modeling tool if the data-warehouse is complex.
5) Sources
• This involves identifying and connecting the sources using gateways.
6) ETL
• This involves
→ identifying a suitable ETL tool vendor
→ purchasing the tool &
1c. Write the differences between OLTP and data-warehouse. (06 Marks)
Ans:
DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014
2a. Explain characteristics of OLAP & write comparison of OLTP & OLAP. (12 Marks)
Ans:
CHARACTERISTICS OF OLAP SYSTEMS
1) Users
• OLTP systems are designed for many office-workers, say 100-1000 users.
Whereas, OLAP systems are designed for few decision-makers.
2) Functions
• OLTP systems are mission-critical. They support the company's day-to-day
operations. They are mostly performance-driven.
Whereas, OLAP systems are management-critical. They support the
company's decision-functions using analytical-investigations.
3) Nature
• OLTP systems are designed to process one record at a time, for ex, a record
related to the customer.
Whereas, OLAP systems
→ involve queries to deal with many records at a time &
→ provide aggregate data to a manager.
4) Design
• OLTP systems are designed to be application-oriented.
Whereas, OLAP systems are designed to be subject-oriented.
5) Data
• OLTP systems view the operational-data as a collection of tables.
Whereas, OLAP systems view operational-information as a multidimensional model.
2b. Explain ROLAP & MOLAP. (08 Marks)
Ans:
ROLAP
• This uses relational or extended-relational DBMS to store & manage data of warehouse.
• This can be considered a bottom-up approach to OLAP.
• This is based on using a data-warehouse which is designed using a star schema.
• Data-warehouse provides multidimensional capabilities.
• In DW, data is represented in i) fact-table &
ii) dimension-table.
• The fact-table contains
→ one column for each dimension &
→ one column for each measure.
• Every row of the fact-table provides one fact.
• An OLAP tool is used to manipulate the data in the DW tables.
• OLAP tool
→ groups the fact-table to find aggregates &
→ uses some of the aggregates already computed to find new aggregates.
• Advantages:
1) More easily used with existing relational DBMS.
2) Data can be stored efficiently using tables.
3) Greater scalability.
• Disadvantage:
1) Poor query-performance.
• Some products are i) Oracle OLAP mode &
ii) OLAP Discoverer.
MOLAP
• This is based on using a multidimensional DBMS.
• The multidimensional DBMS is used to store & access data.
• This can be considered as a top-down approach to OLAP.
• This does not have a standard approach to storing and maintaining the data.
• This uses special-purpose file-indexes.
• The file-indexes store pre-computation of all aggregations in the data-cube.
• Advantages:
1) Implementation is efficient.
3a. Explain 4 types of attributes with statistical operations & examples. (06 Marks)
Ans:
3c. Two binary vectors are given below: (04 Marks)
X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
Calculate (i) SMC (ii) Jaccard similarity coefficient and (iii) Hamming distance.
Ans:
Solution:
Let X = (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (y1, y2, y3, y4, y5, y6, y7, y8, y9, y10) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
f11 = no. of attributes where x is 1 and y is 1 = 0
f00 = no. of attributes where x is 0 and y is 0 = 7
f01 = no. of attributes where x is 0 and y is 1 = 2
f10 = no. of attributes where x is 1 and y is 0 = 1
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
Jaccard coefficient = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
Hamming Distance is given by
D(x,y)= |x1-y1|+|x2-y2|+|x3-y3|+|x4-y4|+|x5-y5|+|x6-y6|+|x7-y7|+|x8-y8|+|x9-y9|+|x10-y10|
=3
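The three measures for the given vectors can be checked with a short Python sketch (the function names are illustrative, not from the text):

```python
def binary_counts(x, y):
    """Count the four agreement/disagreement cases for two binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
    f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
    return f11, f00, f10, f01

def smc(x, y):
    f11, f00, f10, f01 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f00 + f10 + f01)

def jaccard(x, y):
    f11, _, f10, f01 = binary_counts(x, y)
    return f11 / (f11 + f10 + f01) if (f11 + f10 + f01) else 0.0

def hamming(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(X, Y), jaccard(X, Y), hamming(X, Y))  # 0.7 0.0 3
```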
4a. Consider the following transaction data-set 'D' shows 9 transactions and list of
items using Apriori algorithm frequent-itemset minimum support = 2 (10 Marks)
Ans:
Step 1: Generating 1-itemset frequent-pattern.
Step 2: Generating 2-itemset frequent-pattern.
4b. For the following transaction data-set table, construct an FP tree and explain
stepwise for all the transactions. (10 Marks)
5a. Define classification. Draw a neat figure and explain general approach for solving
classification-model. (06 Marks)
Ans:
CLASSIFICATION
• Classification is the task of learning a target-function f that maps each attribute-set x to one
of the predefined class-labels y.
• The target-function is also known informally as a classification-model.
• A classification-model is useful for the following purposes:
1) Descriptive Modeling
• A classification-model can serve as an explanatory-tool to distinguish between objects
of different classes.
• For example, it is useful for biologists to have a descriptive model.
2) Predictive Modeling
• A classification-model can be used to predict the class-label of unknown-records.
• First, a training-set consisting of records whose class-labels are known must be provided.
• Evaluation of the model is based on the counts of test-records correctly and incorrectly
predicted by the model. These counts are tabulated in a confusion-matrix (Table 4.2).
• Each entry fij in matrix denotes the number of records from class i predicted to be of class j.
For instance, f01 is the number of records from class 0 incorrectly predicted as class 1.
• Accuracy is defined as:
Accuracy = (Number of correct predictions) / (Total number of predictions)
= (f11 + f00) / (f11 + f10 + f01 + f00)
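A minimal sketch of computing accuracy from a confusion-matrix, using the fij notation described above (the counts themselves are hypothetical):

```python
def accuracy(confusion):
    """confusion[i][j] = number of records of class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# hypothetical 2-class counts: rows = actual class, columns = predicted class
cm = [[50, 10],   # f00 = 50, f01 = 10
      [5, 35]]    # f10 = 5,  f11 = 35
print(accuracy(cm))  # 0.85
```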
5b. Mention the three impurity measures for selecting best splits. (04 Marks)
Ans:
• The three impurity measures are:
1) Entropy(t) = −Σ p(i|t) log2 p(i|t)
2) Gini(t) = 1 − Σ [p(i|t)]²
3) Classification error(t) = 1 − max [p(i|t)]
where p(i|t) = fraction of records that belong to class i at node t.
5c. Consider a training-set that contains 60 +ve examples and 100 -ve examples, for
each of the following candidate rules.
Rule r1: Covers 50 +ve examples and 5 -ve examples.
Rule r2: Covers 2 +ve examples and no -ve examples.
Determine which is the best and worst candidate rule according to,
i) Rule accuracy
ii) Likelihood ratio statistic.
iii) Laplace measure. (10 Marks)
Ans:
(i) Rule accuracy is given by: accuracy = (no. of +ve examples covered) / (total examples covered).
For r1: accuracy = 50/55 = 90.9%
For r2: accuracy = 2/2 = 100%
Therefore, r2 is the best and r1 is the worst candidate rule according to rule accuracy.
(ii) Likelihood ratio statistic is given by R = 2 Σ fi log2(fi/ei),
where fi = observed frequency of class i and ei = expected frequency of class i.
For r1:
The expected frequency for the positive-class is e+ = 55 × 60/160 = 20.625 and the
expected frequency for the negative class is e− = 55 × 100/160 = 34.375.
Therefore, the likelihood ratio is
R(r1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9
For r2:
The expected frequency for the positive-class is e+ = 2 × 60/160 = 0.75 and the
expected frequency for the negative class is e− = 2 × 100/160 = 1.25.
Therefore, the likelihood ratio is
R(r2) = 2 × [2 × log2(2/0.75) + 0] = 5.66
Therefore, r1 is the best and r2 is the worst candidate rule according to the likelihood ratio statistic.
(iii) Laplace measure is given by: Laplace = (f+ + 1) / (n + k),
where n = examples covered, f+ = +ve examples covered and k = number of classes.
For r1: Laplace = (50 + 1)/(55 + 2) = 0.8947
For r2: Laplace = (2 + 1)/(2 + 2) = 0.75
Therefore, r1 is the best and r2 is the worst candidate rule according to the Laplace measure.
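The three rule-evaluation measures can be computed with a short Python sketch (the function name and argument order are illustrative, not from the text):

```python
import math

def rule_stats(pos, neg, total_pos, total_neg, k=2):
    """Accuracy, likelihood ratio statistic and Laplace measure for one rule."""
    n = pos + neg                      # examples covered by the rule
    total = total_pos + total_neg
    acc = pos / n
    lr = 0.0
    for f, class_total in ((pos, total_pos), (neg, total_neg)):
        e = n * class_total / total    # expected frequency of the class
        if f > 0:
            lr += f * math.log2(f / e)
    lr *= 2
    laplace = (pos + 1) / (n + k)
    return acc, lr, laplace

print(rule_stats(50, 5, 60, 100))  # r1: accuracy ~0.909, R ~99.9, Laplace ~0.895
print(rule_stats(2, 0, 60, 100))   # r2: accuracy 1.0, R ~5.66, Laplace 0.75
```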
6a. For the given Confusion-matrix below for 3 classes. Find sensitivity & specificity
metrics to estimate predictive accuracy of classification methods. (10 Marks)
Let True positive (TP) = no. of positive-examples correctly predicted.
False negative (FN) = no. of positive-examples wrongly predicted as negative.
False positive (FP) = no. of negative-examples wrongly predicted as positive.
True negative (TN) = no. of negative-examples correctly predicted.
True positive rate (TPR) or sensitivity is given by TPR = TP / (TP + FN).
True negative rate (TNR) or specificity is given by TNR = TN / (TN + FP).
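A minimal sketch of the two metrics; the counts used here are hypothetical, obtained by treating one class of the multiclass problem as positive and the rest as negative:

```python
def sensitivity(tp, fn):
    """TPR: fraction of positive examples correctly predicted."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TNR: fraction of negative examples correctly predicted."""
    return tn / (tn + fp)

print(sensitivity(tp=40, fn=10))  # 0.8
print(specificity(tn=90, fp=10))  # 0.9
```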
6b. Explain with example the two approaches for extending the binary-classifiers to
handle multiclass problem. (10 Marks)
Ans:
TWO APPROACHES FOR EXTENDING THE BINARY-CLASSIFIERS
1) 1-r approach &
2) 1-1 approach
• Let Y = {y1, y2, . . . , yK} be set of classes of input-data
1) 1-r (one-against-rest) Approach
• This approach decomposes the multiclass-problem into K binary-problems.
• For each class yi Є Y, a binary-problem is created.
All instances that belong to yi are considered positive-examples.
The remaining instances are considered negative-examples.
A binary-classifier is then constructed to separate instances of class yi from the
rest of the classes.
2) 1-1 (one-against-one) Approach
• This approach constructs K(K − 1)/2 binary-classifiers.
• Each classifier is used to distinguish between a pair of classes, (yi, yj).
• Instances that do not belong to either yi or yj are ignored when constructing
the binary-classifier for (yi, yj).
• In both (1-r) and (1-1) approaches, a test-instance is classified by combining the predictions
made by the binary-classifiers.
• A voting-scheme is used to combine the predictions.
• The class that receives the highest number of votes is assigned to the test-instance.
• In the 1-r approach, if an instance is classified as negative, then all classes except for the
positive-class receive a vote.
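The 1-r voting scheme described above can be sketched as follows; the per-class classifiers here are hypothetical stand-ins returning +1/−1, not a real library API:

```python
from collections import Counter

def one_vs_rest_predict(classifiers, x, classes):
    """classifiers[i](x) -> +1 if x looks like classes[i], else -1."""
    votes = Counter()
    for yi, clf in zip(classes, classifiers):
        if clf(x) == +1:
            votes[yi] += 1              # the positive class gets the vote
        else:
            for yj in classes:          # every class except yi gets a vote
                if yj != yi:
                    votes[yj] += 1
    # the class with the highest number of votes is assigned to the instance
    return votes.most_common(1)[0][0]

# toy binary classifiers, one per class
classes = ["y1", "y2", "y3"]
clfs = [lambda x, c=c: +1 if x == c else -1 for c in classes]
print(one_vs_rest_predict(clfs, "y2", classes))  # y2
```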
• The first two rows in this table correspond to the pair of classes (yi, yj) chosen to
build the classifier.
• The last row represents the predicted class for the test-instance.
• After combining the predictions,
→ y1 and y4 each receive two votes &
7a. Explain K means clustering method and algorithm. (10 Marks)
Ans:
K-MEANS
• K means is a partitional method of cluster analysis.
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster.
• The method obtains a single-level partition of objects.
• This method can only be used if all the data-objects are located in the main memory.
• The method is called K-means since
each of the K clusters is represented by the mean of the objects (called the centroid) within it.
• The method is also called the centroid-method since
→ at each step, the centroid-point of each cluster is assumed to be known &
→ each of the remaining points are allocated to cluster whose centroid is closest to it.
K-MEANS ALGORITHM
1) Select the number of clusters=k. (Figure 7.1a).
2) Pick k seeds as centroids of k clusters. The seeds may be picked randomly.
3) Compute the Euclidean distance of each object in the dataset from each of the centroids.
4) Allocate each object to the cluster it is nearest to.
5) Compute the centroids of clusters.
6) Check if the stopping criterion has been met (i.e. cluster-membership is unchanged)
If yes, go to step 7.
If not, go to step 3.
7) One may decide
→ to stop at this stage or
→ to split a cluster or combine two clusters until a stopping criterion is met.
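The steps above can be sketched for 2-D points as follows; seeds are picked as the first k points rather than randomly, so the run is deterministic:

```python
import math

def kmeans(points, k, max_iter=100):
    centroids = points[:k]                       # step 2: pick k seeds
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # steps 3-4: allocate each object to the nearest centroid
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if assignment[i] != nearest:
                assignment[i], changed = nearest, True
        if not changed:                          # step 6: membership unchanged
            break
        # step 5: recompute the centroid of each cluster
        for c in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```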
Figure 7.1a
LIMITATIONS OF K MEANS
1) The results of the method depend strongly on the initial guesses of the seeds.
2) The method can be sensitive to outliers.
7b. What is Hierarchical clustering method? Explain the algorithms for computing
distances between clusters. (10 Marks)
Ans:
HIERARCHICAL METHODS
• A set of nested clusters is organized as a hierarchical tree. (Figure 7.1b).
• This approach allows clusters to be found at different levels of granularity.
Figure 7.1b
Figure 7.1c
AGGLOMERATIVE ALGORITHM
1) Allocate each point to a cluster of its own. Thus, we start with n clusters for n
objects.
2) Create a distance-matrix by computing distances between all pairs of clusters
(either using the single link metric or the complete link metric). Sort these
distances in ascending order.
3) Find the two clusters that have the smallest distance between them.
4) Remove the pair of clusters from the distance-matrix and merge them.
5) If there is only one cluster left then stop.
6) Compute all distances from the new cluster and update the distance-matrix
after the merger and go to step 3.
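The agglomerative steps can be sketched using the single link metric; instead of a threshold distance, this simplified version merges until a requested number of clusters remains:

```python
import math

def single_link(points, target_clusters):
    clusters = [[p] for p in points]          # step 1: one cluster per point
    while len(clusters) > target_clusters:
        # steps 2-3: find the pair of clusters with the smallest
        # single-link (minimum pairwise) distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # steps 4-6: merge the closest pair and continue
        clusters[i] += clusters.pop(j)
    return clusters

print(single_link([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```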
2) DIVISIVE APPROACH
• This method is basically a top-down approach.
• This method
→ starts with the whole dataset as one cluster
→ then proceeds to recursively divide the cluster into two sub-clusters and
→ continues until each cluster has only one object (Figure 7.1d).
• Two types are:
1) Monothetic: This splits a cluster using only one attribute at a time.
An attribute that has the most variation could be selected.
2) Polythetic: This splits a cluster using all of the attributes together.
Two clusters far apart could be built based on the distance between objects.
DIVISIVE ALGORITHM
1) Decide on a method of measuring the distance between 2 objects. Also,
decide a threshold distance.
2) Create a distance-matrix by computing distances between all pairs of objects
within the cluster. Sort these distances in ascending order.
3) Find the 2 objects that have the largest distance between them. They are the
most dissimilar objects.
4) If the distance between the 2 objects is smaller than the pre-specified
threshold and there is no other cluster that needs to be divided then stop,
otherwise continue.
5) Use the pair of objects as seeds of a K-means method to create 2 new
clusters.
6) If there is only one object in each cluster then stop otherwise continue with
step 2.
8. Write short notes on the following:
a. Web content mining
b. Text mining
c. Spatial-data-mining
d. Spatio-temporal data-mining (20 Marks)
Ans (d):
• Mining spatial-data is different from mining classical-data because:
i) Spatial-data is embedded in a continuous space.
Whereas, classical datasets are in discrete notions like transactions.
ii) Spatial-data are highly auto-correlated, so the common assumption of
independence of data samples in classical statistical analysis is generally false.
1b. Explain the guidelines for data-warehouse implementation. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.1c.
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014
2a. Why multidimensional views of data and data-cubes are used? With a neat
diagram, explain data-cube implementations. (10 Marks)
Ans:
DATA-CUBE
• Data-cube refers to multi-dimensional array of data.
• The data-cube is used to represent data along some measure-of-interest.
• Data-cubes allow us to look at complex data in a simple format.
• For ex (Fig 2.1a): A company might summarize financial-data to compare sales i) by product
ii) by date &
iii) by country.
Figure 2.1a: Data-cube of sales
DATA-CUBE IMPLEMENTATION
1) Pre-compute and Store All
• The more aggregates we are able to pre-compute, the better the query-performance.
• Data-cube products use following methods for pre-computing aggregates:
i) ROLAP (relational OLAP) ii) MOLAP (multidimensional OLAP)
2b. What are data-cube operations? Explain. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.2b.
3b. Why data preprocessing is required in DM? Explain various steps in data
preprocessing (06 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b.
Ans:
DATA-MINING APPLICATIONS
Prediction & Description
• Data-mining may be used to answer questions like
i) "Would this customer buy a product?" or
ii) "Is this customer likely to leave?"
• DM techniques may also be used for sales forecasting and analysis.
Relationship Marketing
• Customers have a lifetime value, not just the value of a single sale.
• Data-mining can be helpful for
i) Analyzing customer-profiles and improving direct marketing plans.
ii) Identifying critical issues that determine client-loyalty.
iii) Improving customer retention.
Customer Profiling
• This is the process of using the relevant- & available-information to
i) Describe the characteristics of a group of customers.
ii) Identify their discriminators from ordinary consumers.
iii) Identify drivers for their purchasing decisions.
TE
• This can help the company identify its most valuable customers
so that the company may differentiate their needs and values.
Outliers Identification & Detecting Fraud
• For this, examples include:
i) Identifying unusual expense claims by staff.
ii) Identifying anomalies in expenditure b/w similar units of the company.
4a. Explain FP - growth algorithm for discovering frequent-item sets. What are its
limitations? (08 Marks)
Ans:
FP - GROWTH ALGORITHM
• This algorithm
→ encodes the data-set using a compact data-structure called a FP-tree &
→ extracts frequent-itemsets directly from this structure (Figure 6.24).
• This finds all the frequent-itemsets ending with a particular suffix.
• This employs a divide-and-conquer strategy to split the problem into smaller subproblems.
• For example, suppose we are interested in finding all frequent-itemsets ending in e.
To do this, we must first check whether the itemset {e} itself is frequent.
If it is frequent, we consider subproblem of finding frequent-itemsets ending in
de, followed by ce, be, and ae.
• In turn, each of these subproblems is further decomposed into smaller subproblems.
• By merging the solutions obtained from the subproblems, all the frequent-itemsets ending in
e can be found (Figure 6.27).
4b. What is Apriori algorithm? How it is used to find frequent-item sets? Explain
briefly. (08 Marks)
Ans:
APRIORI ALGORITHM
• Apriori Theorem states:
“If an itemset is frequent, then all of its subsets must also be frequent.”
• Consider the following example.
Suppose {c, d, e} is a frequent-itemset then any transaction that contains {c, d, e} must
also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d} and {e} (Figure 6.3).
As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.
• Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
• Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets
generated by the algorithm is 4C2 = 6.
• Two of these six candidates, {Beer, Bread} and {Beer, Milk} are subsequently found to be
infrequent after computing their support values.
• The remaining 4 candidates are frequent, and thus will be used to generate candidate 3-itemsets.
• With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are
frequent (Figure 6.5).
• The only candidate that has this property is {Bread, Diapers, Milk}.
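The candidate generation and pruning described above can be sketched in Python; the transaction table is reconstructed from the standard market-basket example (treat the exact transactions as illustrative):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Apriori-style level-wise search with subset-based candidate pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        # count support and keep the frequent k-itemsets
        counted = [c for c in current
                   if sum(c <= set(t) for t in transactions) >= minsup]
        frequent += counted
        # generate (k+1)-candidates and keep only those whose
        # every k-subset is frequent (the Apriori principle)
        k += 1
        candidates = {a | b for a in counted for b in counted if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in counted for s in combinations(c, k - 1))]
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
result = frequent_itemsets(transactions, 3)
print(len(result))  # 8: four frequent 1-itemsets and four frequent 2-itemsets
```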
4c. List the measures used for evaluating association patterns. (04 Marks)
Ans:
• The measures used for evaluating association patterns include:
i) Support
ii) Confidence &
iii) Interest factor (lift).
5a. How decision-trees are used for classification? Explain decision-tree induction
algorithm for classification. (10 Marks)
Ans:
HUNT'S ALGORITHM
• A decision-tree is grown in a recursive fashion.
• Let Dt = set of training-records that are associated with node 't'.
Let y = {y1, y2, . . . , yc} be class-labels.
• Hunt's algorithm is as follows.
Step 1:
• If all records in Dt belong to same class yt, then t is a leaf node labeled as yt.
TE
Step 2:
• If Dt contains records that belong to more than one class, an attribute test
condition is selected to partition the records into smaller subsets.
DECISION-TREE ALGORITHM: TREEGROWTH
• The input to the algorithm consists of i) training-records E and ii) attribute-set F.
• The algorithm works by
i) Recursively selecting the best attribute to split the data (Step 7) and
ii) Expanding leaf nodes of tree (Steps 11 & 12) until stopping criterion is met (Step 1).
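The recursive skeleton described above can be sketched as follows; the best-attribute selection and stopping test are simplified stand-ins, not the textbook's exact pseudocode:

```python
from collections import Counter

def tree_growth(records, attrs):
    """records: list of (attribute_dict, label). Returns a nested dict tree."""
    labels = [y for _, y in records]
    # Step 1 stand-in: stop when all records share one class,
    # or no attributes remain to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    best = attrs[0]   # Step 7 stand-in: a real implementation picks the
                      # attribute giving the best impurity reduction
    tree = {}
    for v in {x[best] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[best] == v]
        # Steps 11-12: expand one child node per attribute value
        tree[v] = tree_growth(subset, [a for a in attrs if a != best])
    return {best: tree}

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rainy"}, "yes")]
print(tree_growth(data, ["outlook"]))
```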
• Boosting is an iterative procedure used to adaptively change the distribution of
training examples so that the base classifiers will focus on examples that are hard to
classify.
• Unlike bagging, boosting assigns a weight to each training example and may
adaptively change the weight at the end of each boosting round.
• The weights assigned to the training examples can be used in following ways:
1. They can be used as a sampling distribution to draw a set of bootstrap
samples from the original data.
2. They can be used by the base classifier to learn a model that is biased toward
higher-weight examples.
5c. Explain importance of evaluation criteria for classification methods. (05 Marks)
Ans:
• Predictive Accuracy: refers to the ability of the model to correctly predict the class-label of
new or previously unseen data.
• Speed: refers to the computation costs involved in generating and using the model.
Speed involves not just the time or computation cost of constructing a model (e.g. a
decision-tree), it also includes the time required to learn to use the model.
• Robustness: is the ability of the model to make correct predictions given noisy data or data
with missing values.
Most data obtained from a variety of sources has errors.
Therefore, the method should be able to deal with noise, outliers & missing
values gracefully.
• Scalability: refers to ability to construct the model efficiently given large amount of data.
Data-mining problems can be large and therefore the method should be able to deal
with large problems gracefully.
• Interpretability: refers to level of understanding & insight that is provided by the model.
An important task of a DM professional is to ensure that the results of data-mining are
understandable in the context of the problem being solved.
For example, in a decision-tree classification, it is desirable to find a decision-tree of
the right size and compactness with high accuracy.
6a. What are Bayesian classifiers? Explain Bayes' theorem. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.6a.
6b. How rule based classifiers are used for classification? Explain. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.5b.
7a. Explain K-means clustering algorithm. What are its limitations? (10 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.7a.
7b. How density based methods are used for clustering? Explain. (10 Marks)
Ans:
DENSITY-BASED METHODS
• A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist
within a given radius.
• Data that is not within such high-density clusters is regarded as outliers or noise.
• For example: DBSCAN (Density Based Spatial Clustering of Applications with Noise).
DBSCAN
• It requires 2 input parameters:
1) Size of the neighborhood (R) &
2) Minimum points in the neighborhood (N).
• The point-parameter N
→ determines the density of acceptable-clusters &
→ determines which objects will be labeled outliers or noise.
• The size-parameter R determines the size of the clusters found.
• If R is big enough, there will be one big cluster and no outliers.
If R is small, there will be small dense clusters and there might be many outliers.
• We define a number of terms (Figure 7.2):
1. Neighborhood: The neighborhood of an object y is defined as all the objects
that are within the radius R from y.
2. Core-object: An object y is called a core-object if there are N objects within
its neighborhood.
Figure 7.2: DBSCAN
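The two definitions can be demonstrated with a small sketch for 1-D points (the sample points and parameter values are hypothetical):

```python
import math

def neighborhood(y, points, R):
    """All objects within radius R of y (including y itself)."""
    return [p for p in points if math.dist(p, y) <= R]

def is_core(y, points, R, N):
    """y is a core-object if at least N objects lie in its neighborhood."""
    return len(neighborhood(y, points, R)) >= N

points = [(0,), (1,), (2,), (10,)]
print(is_core((1,), points, R=1.5, N=3))   # True
print(is_core((10,), points, R=1.5, N=3))  # False - an outlier candidate
```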
8a. What is web content mining? Explain. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.8d.
Ans (iii):
TEXT CLUSTERING
• Once the features of an unstructured-text are identified, text-clustering can be done.
• Text-clustering can be done by using any clustering technique.
For ex: Ward's minimum variance method.
• Ward's method is an agglomerative hierarchical clustering technique.
• Ward's method tends to generate very compact clusters.
• The following measures of dissimilarity between feature vectors can be used:
i) Euclidean metric or
ii) Hamming distance
• The clustering method begins with 'n' clusters, one for each text.
• At any stage, 2 clusters are merged to generate a new cluster based on the following
criterion:
1) ODS is the unified-operational view of the company.
ODS provides the managers improved access to important operational-data.
This view assists in better understanding of i) business & ii) customer.
2) ODS is more effective in generating current-reports without accessing OLTP.
3) ODS can shorten time required to implement a data-warehouse system.
• Different types of ODS:
1) The ODS can be used as a reporting-tool for administrative purposes.
The ODS is usually updated daily.
2) The ODS can be used to track more complex-information such as product-code &
location-code.
The ODS is usually updated hourly.
3) The ODS can be used to support CRM (Customer Relationship Management).
1b. List the major steps involved in the ETL process. (06 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.1b.
1c. Based on oracle, what are difference b/w OLTP & DW systems. (08 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.1c.
3) Shared
• The system is
→ accessed by few business-analysts &
→ used by thousands of users.
• Being a shared system, the OLAP software must provide adequate security for
i) confidentiality & ii) integrity.
• Concurrency-control is required if users are updating data in the database.
4) Multidimensional
• This is the basic requirement.
• OLAP software must provide a multidimensional conceptual-view of the data.
• A dimension has hierarchies that show parent/child relationships between the
members of dimensions.
• The multidimensional structure must allow hierarchies of parent/child
relationships.
DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015
5) Information
• The system should be able to handle a large amount of input-data.
• Two important critical factors:
i) The capacity of system to handle information &
ii) Integration of information with the data-warehouse.
• Because of multidimensional-view, data-cube operations like slice and dice can
be performed.
Accessibility (OLAP as a Mediator)
• The OLAP software should be sitting b/w
i) Data-sources &
ii) OLAP front-end.
Batch Extraction vs. Interpretive
• In large multidimensional databases, the system should provide
→ multidimensional-data staging plus
→ partial pre-calculation of aggregates.
Multi-user Support
• Being a shared system, the OLAP software should provide normal database
operations including retrieval, update, integrity and security.
Storing results of OLAP
• OLAP results-data should be kept separate from source-data.
• Read-write applications should not be implemented directly on live
transaction-data if source-systems are supplying information to the system
directly.
Extraction of Missing Values
• The system should distinguish missing-values from zero-values.
• If a distinction is not made, then the aggregates are computed incorrectly.
Uniform Reporting Performance
• Increasing the number of dimensions (or database-size) should not degrade
the reporting performance of the system.
2c. Describe the difference between ROLAP & MOLAP. (05 Marks)
Ans:
3a. What is data preprocessing? Explain various pre-processing tasks. (14 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b.
3b. Explain the following: (06 Marks)
i) Euclidean distance
ii) Simple matching coefficient
iii) Jaccard coefficient.
Ans (i):
EUCLIDEAN DISTANCE
• The Euclidean distance (D) between two points x and y is given by:
D(x, y) = √[(x1 − y1)² + (x2 − y2)² + . . . + (xn − yn)²]
Example: Calculate the Euclidean distance between:
x = (1, 6, 2, 5, 3) &
y = (3, 5, 2, 6, 6)
Solution:
Let (x1, x2, x3, x4, x5) = (1, 6, 2, 5, 3)
(y1, y2, y3, y4, y5) = (3, 5, 2, 6, 6)
YS
Euclidean Distance is calculated as follows:
D = √[(1 − 3)² + (6 − 5)² + (2 − 2)² + (5 − 6)² + (3 − 6)²] = √(4 + 1 + 0 + 1 + 9) = √15 = 3.873
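The worked example can be checked with a few lines of Python:

```python
import math

x = (1, 6, 2, 5, 3)
y = (3, 5, 2, 6, 6)
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print(round(d, 3))  # 3.873
```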
Ans (ii):
SIMPLE MATCHING COEFFICIENT SB
• SMC is used as a similarity coefficient.
• SMC is given by
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
where f00 = no. of attributes where x is 0 and y is 0
f01 = no. of attributes where x is 0 and y is 1
f10 = no. of attributes where x is 1 and y is 0
f11 = no. of attributes where x is 1 and y is 1
• This measure counts both presences and absences equally.
Ans (iii):
JACCARD COEFFICIENT
• The Jaccard coefficient is used to handle objects consisting of asymmetric binary attributes.
• The Jaccard coefficient (J) is given by
J = f11 / (f01 + f10 + f11)
Example: Calculate SMC and Jaccard Similarity Coefficients for the following two
binary vectors:
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) &
y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1)
Solution:
f11 = 0, f00 = 7, f01 = 2, f10 = 1
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
4c. Construct the FP tree for following data-set. Show the trees separately after
reading each transaction. (07 Marks)
CLASSIFICATION
• Classification is the task of learning a target-function ‘f’ that maps each attribute-set x to
one of the predefined class-labels y.
• The target-function is also known as a classification-model.
• A classification-model is useful for the following 2 purposes (Figure 4.3):
1) DESCRIPTIVE MODELING
• A classification-model can serve as an explanatory-tool to distinguish between objects
of different classes.
• For example, it is useful for biologists to have a descriptive model.
2) PREDICTIVE MODELING
• A classification-model can also be used to predict the class-label of unknown-records.
• A classification-model automatically assigns a class-label when presented with the
attribute-set of an unknown-record.
• Classification techniques are most suited for predicting or describing data-sets with
binary- or nominal-categories.
• They are less effective for ordinal categories because they do not consider the implicit
order among the categories.
Figure 4.3: General approach for building a classification-model
5b. Discuss the characteristics of decision-tree induction algorithms. (10 Marks)
Ans:
CHARACTERISTICS OF DT INDUCTION ALGORITHMS
1. Decision-tree induction is a non-parametric approach for building classification-
models.
2. Finding an optimal tree is an NP-complete problem.
Many DM algorithms employ a heuristic-based approach to guide their search in
the vast hypothesis space.
3. Techniques developed for constructing trees are computationally inexpensive i.e. it is
possible to quickly construct models even when the training-set size is very large.
Furthermore, once a tree has been built, classifying a test-record is extremely
fast, with a worst-case complexity of O(w)
where w = maximum depth of the tree.
4. Smaller-sized trees are relatively easy to interpret.
5. Trees provide an expressive representation for learning discrete-valued functions.
However, they do not generalize well to certain types of Boolean problems.
6. A subtree can be replicated multiple times in a tree (Figure 4.19). This makes the
tree more complex than necessary and perhaps more difficult to interpret.
Such replication can arise because the divide-and-conquer strategy solves each
subproblem (subtree) independently.
7. DT algorithms are quite robust to the presence of noise, especially when methods for
avoiding overfitting are employed.
8. The presence of redundant attributes does not affect the accuracy of trees.
An attribute is redundant if it is strongly correlated with another attribute in data.
9. At the leaf nodes, the number of records may be too small to make a statistically
significant decision about the class representation of the nodes. This is known as the
data fragmentation problem.
Solution: Disallow further splitting when the number of records falls below a
certain threshold.
10. The tree-growing procedure can be viewed as the process of partitioning the attribute-
space into disjoint regions until each region contains records of the same class.
The border between two neighboring regions of different classes is known as a
decision boundary.
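The O(w) classification cost from point 3 above can be sketched as a single root-to-leaf traversal; the two-level tree below is a hypothetical example in the style of the loan-default data-set:

```python
class DTNode:
    def __init__(self, attr=None, label=None):
        self.attr = attr       # attribute tested at an internal node
        self.label = label     # class-label at a leaf node
        self.branches = {}     # attribute value -> child DTNode

def classify(node, record):
    # follow exactly one root-to-leaf path; the loop runs at most
    # w times, where w is the maximum depth of the tree
    while node.label is None:
        node = node.branches[record[node.attr]]
    return node.label

# hypothetical tree: home_owner=yes -> no_default; otherwise test married
root = DTNode(attr="home_owner")
root.branches["yes"] = DTNode(label="no_default")
married = DTNode(attr="married")
married.branches["yes"] = DTNode(label="no_default")
married.branches["no"] = DTNode(label="default")
root.branches["no"] = married

print(classify(root, {"home_owner": "no", "married": "no"}))  # prints "default"
```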
5c. Explain sequential covering algorithm in rule-based classifier. (04 Marks)
Ans:
SEQUENTIAL COVERING ALGORITHM
• This is used to extract rules directly from data.
• This extracts the rules one class at a time for data-sets that contain more than 2 classes.
• The criterion for deciding which class should be generated first depends on:
i) Fraction of training-records that belong to a particular class or
ii) Cost of misclassifying records from a given class.
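A minimal sketch of sequential covering, assuming a hypothetical greedy Learn-One-Rule that picks the single best (attribute, value) condition; real rule learners grow conjunctive rules, but the loop structure is the same:

```python
def learn_one_rule(records, target):
    # greedy stand-in for Learn-One-Rule: choose the (attr, value)
    # condition whose covered records are purest for the target class
    best, best_acc = None, -1.0
    attrs = {k for r, _ in records for k in r}
    for a in attrs:
        for v in {r[a] for r, _ in records if a in r}:
            covered = [(r, y) for r, y in records if r.get(a) == v]
            if not covered:
                continue
            acc = sum(y == target for _, y in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best

def sequential_covering(records, classes):
    # extract rules one class at a time; after each rule is learned,
    # remove the records it covers and repeat
    rules, remaining = [], list(records)
    for target in classes:          # class ordering chosen by the caller
        while any(y == target for _, y in remaining):
            cond = learn_one_rule(remaining, target)
            if cond is None:
                break
            rules.append((cond, target))
            a, v = cond
            remaining = [(r, y) for r, y in remaining if r.get(a) != v]
    return rules

# hypothetical two-class data-set
records = [({"outlook": "sunny"}, "yes"),
           ({"outlook": "sunny"}, "yes"),
           ({"outlook": "rain"}, "no")]
print(sequential_covering(records, ["yes", "no"]))
```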
6a. List 5 criteria for evaluating classification methods. Explain briefly. (05 Marks)
Ans:
FIVE CRITERIA FOR EVALUATING CLASSIFICATION METHODS
1) Holdout method
2) Random Subsampling
3) Cross-Validation
4) Leave-one-out approach
5) Bootstrap
1) Holdout method
• The original data is divided into two disjoint sets: i) Training-set &
ii) Test-set.
• A classification-model is induced from the training-set.
• Performance of classification-model is evaluated on the test-set.
• The data is typically split either as
i) 50% for training and 50% for testing or
ii) 2/3 for training and 1/3 for testing.
• The accuracy of the classifier is estimated from the accuracy of the induced model on the test-set.
2) Random Subsampling
• The holdout method can be repeated several times to improve the estimation of a
classifier's performance.
• Limitation: It has no control over the number of times each record is used for testing
& training.
3) Cross-Validation
• In K-fold cross-validation, the available data is randomly divided into k-disjoint
subsets of approximately equal-size.
• One of the subsets is then used as the test-set.
Remaining (k – 1) sets are used for building the classifier.
• The test-set is used to estimate the accuracy.
• This is done repeatedly k times so that each subset is used as a test subset once.
4) Leave-one-out approach
• A special case of k-fold cross-validation method sets k = N, the size of the data-set.
• Each test-set contains only one record.
• Advantages:
1) Utilizes as much data as possible for training.
2) Test-sets are mutually exclusive & they effectively cover entire data-set.
• Two drawbacks:
1) Computationally expensive for large datasets.
2) Since each test-set contains only one record, the variance of the estimated
performance metric tends to be high.
5) Bootstrap
• In this method, the training-records are sampled with replacement i.e. a record
already chosen for training is put back, so that it is equally likely to be redrawn.
• A sample contains about 63.2% of the records in the original data.
• Records that are not included in the bootstrap sample become part of the test-set.
• The model induced from the training-set is applied to the test-set to obtain an
estimate of the accuracy of the bootstrap sample, εi.
• The sampling procedure is then repeated ‘b’ times to generate ‘b’ bootstrap samples.
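The k-fold cross-validation scheme described above can be sketched with a minimal partitioning helper (an illustrative utility, not from the text):

```python
import random

def k_fold_splits(n, k, seed=0):
    # randomly partition record indices 0..n-1 into k disjoint,
    # near-equal folds; each fold serves as the test-set exactly once
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))   # 8 training and 2 test records per fold
```

Setting k = n gives the leave-one-out special case, where every test-set holds a single record.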
6c. Consider the following training-set for predicting the loan default problem:
Find the conditional independence for the given training-set using Bayes theorem for
classification. (08 Marks)
Ans:
Solution:
• For each class yj, the class-conditional probability for a continuous attribute Xi is
estimated using a Gaussian distribution:
P(Xi = xi | yj) = (1/√(2πσij²)) · exp(−(xi − μij)² / (2σij²))
where μij = sample mean & σij² = sample variance of Xi for records of class yj.
• The sample mean and variance for the annual income attribute with respect to the class No are:
• Given a test-record with taxable income equal to $120K, we can compute its class-
conditional probability as follows:
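A sketch of this computation, assuming the sample mean 110 and variance 2975 for Annual Income given class No (the values used in the standard version of this example; the figure holding them did not survive here):

```python
import math

def gaussian_pdf(x, mean, var):
    # class-conditional density P(Xi = x | yj) for a continuous attribute,
    # matching the Gaussian formula above
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# assumed statistics: mean = 110, variance = 2975 for class No
p = gaussian_pdf(120, 110, 2975)
print(round(p, 4))   # ≈ 0.0072
```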
• Since there are three records that belong to the class Yes and seven records that belong to
the class No, P(Yes) = 0.3 and P(No) = 0.7.
• Using the information provided in Figure 5.10(b), the class-conditional probabilities can be
computed as follows:
7a. List & explain the desired features of cluster analysis. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.7c.
7b. Explain the K-means clustering algorithm with suitable example. (12 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.7a.
Ans (a): For answer, refer Solved Paper Dec-2013 Q.No.8d.
Ans (b): For answer, refer Solved Paper Dec-2014 Q.No.8b.iii.
Ans (c):
UNSTRUCTURED TEXT
• Unstructured-documents are free texts, such as news stories.
• Following features can be extracted to convert an unstructured-document to a structured
form:
Word Occurrences
• The vector-representation takes single words found in the training-set as
features. (Vector-representation = bag of words).
• Two types: i) Boolean features (whether a word occurs or not) &
ii) Frequency features (number of occurrences of a word).
• This representation ignores the sequence in which the words occur.
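The bag-of-words representation can be sketched as:

```python
from collections import Counter

def bag_of_words(text):
    # word-occurrence vector: one count per distinct word;
    # word order is discarded, only occurrences are kept
    return Counter(text.lower().split())

v = bag_of_words("The cat sat on the mat")
print(v)   # e.g. "the" occurs twice
```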
Stemming
• Stemming refers to the process of reducing words to their morphological roots
or stems.
• A stem is part of a word that is left after removing its prefixes & suffixes.
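A toy illustration of suffix removal (a hypothetical stripper, far simpler than real stemmers such as Porter's algorithm):

```python
def naive_stem(word):
    # remove one of a few common suffixes to approximate the stem;
    # the length check avoids reducing short words to nonsense
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("informing"))  # prints "inform"
```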
Parts of Speech (POS)
• Each word can be tagged with its part of speech.
• Common tags: noun, verb, adjective and adverb.
• Thus, we can assign a number 1, 2, 3 or 4, depending on whether the word is a
noun, verb, adjective or adverb respectively.
Ans (d):
TEMPORAL DATA-MINING TASKS
Temporal Association
• The association-rule discovery can be extended to temporal-association.
• Here, we attempt to discover temporal-associations between non-temporal
itemsets.
• We can say that: "70% of the readers who buy a DBMS book also buy a Data-
mining book after a semester".
Temporal Classification
• We can extend the concept of decision-tree construction on temporal-
attributes.
• For example, a rule can be: "The first case of malaria is normally reported
after the first pre-monsoon rain and during the months of May-August".
Trend Analysis
• The analysis of one or more time series of continuous data may show similar
trends i.e. similar shapes across the time axis.
• For example, "The deployment of the Android OS is increasingly becoming
popular in the Smartphone industry".
• Here, we are trying to find the relationships of change in one or more static-
attributes with respect to changes in the temporal-attributes.
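Detecting such a trend (a rising or falling shape across the time axis) can be sketched with a least-squares slope over the series:

```python
def trend_slope(series):
    # least-squares slope of the series against time t = 0, 1, ..., n-1;
    # a positive slope indicates a rising trend, a negative one a decline
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

print(trend_slope([1, 2, 3, 4]))  # prints 1.0 (steadily rising)
```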
Sequence Analysis
• Events occurring at different points in time may be related by causal
relationships.
• For example, an earlier event may appear to cause a later one.
• To discover causal relationships, sequences of events must be analyzed to
discover common patterns.
• This category includes
→ discovery of frequent events and
→ problem of event-prediction.
• Frequent sequence mining finds the frequent subsequences;
while event-prediction predicts the occurrences of events which are rare.