• The extraction of information from source-databases should be efficient.
• The quality of data should be maintained (Figure 8.1).
• Suitable checks are required to ensure the quality of data after each refresh.
• The ODS is required to
→ satisfy integrity constraints. Ex: existential-integrity, referential-integrity.
→ take appropriate actions to deal with null values.
• The ODS is a read-only database, i.e. users shouldn't be allowed to update information.
• Populating an ODS involves an acquisition-process of extracting, transforming &
loading data from source systems. This process is called ETL.
(ETL = Extraction, Transformation and Loading).
• Before an ODS can go online, following 2 tasks must be completed:
i) Checking for anomalies &
ii) Testing for performance.
A-1
For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013
1b. What is ETL? Explain the steps in ETL. (07 Marks)
Ans:
ETL (EXTRACTION, TRANSFORMATION & LOADING)
• The ETL process consists of
→ data-extraction from source systems
→ data-transformation which includes data-cleaning &
→ data-loading in the ODS or the data-warehouse.
• Data cleaning deals with detecting & removing errors/inconsistencies from the data.
• Most often, the data is sourced from a variety of systems.
PROBLEMS TO BE SOLVED FOR BUILDING INTEGRATED-DATABASE
1) Instance Identity Problem
• The same customer may be represented slightly differently in different source-systems.
2) Data-errors
• Different types of data-errors include:
i) There may be some missing attribute-values.
ii) There may be duplicate records.
3) Record Linkage Problem
• This deals with problem of linking information from different databases that
relates to the same customer.
4) Semantic Integration Problem
• This deals with integration of information found in heterogeneous-OLTP &
legacy sources.
• For example,
→ Some of the sources may be relational.
→ Some sources may be in text documents.
→ Some data may be character strings or integers.
5) Data Integrity Problem
• This deals with issues like
i) referential integrity
ii) null values &
iii) domain of values.
STEPS IN DATA CLEANING
1) Parsing
• This involves locating and identifying individual data elements in the source files and then isolating these data elements in the target files.
2) Correcting
• Correcting the identified-components is based on sophisticated techniques
using mathematical algorithms.
• Correcting may involve use of other related information that may be available in the company.
3) Standardizing
• Business rules of the company are used to transform data to standard form.
• For ex, there might be rules on how name and address are to be represented.
4) Matching
• Much of the data extracted from a number of source-systems is likely to be
related. Such data needs to be matched.
5) Consolidating
• All corrected, standardized and matched data can now be consolidated to build
a single version of the company-data.
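Taken together, the five data-cleaning steps can be sketched on toy records; every name, rule, and record below is a made-up illustration, not part of the paper:

```python
# A minimal sketch of parsing, standardizing, and matching/consolidating.
# All records and business rules here are illustrative assumptions.

def parse(record):
    # Parsing: locate and isolate the individual data elements.
    name, city = [part.strip() for part in record.split(",")]
    return {"name": name, "city": city}

def standardize(rec):
    # Standardizing: a hypothetical business rule -- names in title case,
    # cities in upper case.
    return {"name": rec["name"].title(), "city": rec["city"].upper()}

def match_and_consolidate(records):
    # Matching + consolidating: records with the same standardized name
    # are assumed to refer to the same customer and are merged.
    merged = {}
    for rec in records:
        merged.setdefault(rec["name"], rec)
    return list(merged.values())

raw = ["john smith, mysore", "JOHN SMITH , Mysore", "Asha Rao, bangalore"]
clean = match_and_consolidate(standardize(parse(r)) for r in raw)
print(len(clean))  # 2 distinct customers remain
```

After consolidation, the two spellings of the same customer collapse into a single version of the record.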
1c. What are the guide lines for implementing the data-warehouse. (05 Marks)
Ans:
DW IMPLEMENTATION GUIDELINES
Build Incrementally
• Firstly, a data-mart will be built.
• Then, a number of other sections of the company will be built.
• Then, the company data-warehouse will be implemented in an iterative
manner.
• Finally, all data-marts extract information from the data-warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable research into following:
i) Expected-costs &
ii) Benefits of the project.
• The projects require inputs from many departments in the company.
• Therefore, the projects must be driven by someone who is capable of interacting with people in the company.
Senior Management Support
• The project calls for a sustained commitment from senior-management due to
i) The resource intensive nature of the projects.
ii) The time the projects can take to implement.
Ensure Quality
• Data-warehouse should be loaded with
i) Only cleaned data &
ii) Only quality data.
Corporate Strategy
• The project must fit with
i) corporate-strategy &
ii) business-objectives.
Business Plan
• All stakeholders must have a clear understanding of
i) Project plan
ii) Financial costs &
iii) Expected benefits.
Training
• The users must be trained to use the data-warehouse and to understand its capabilities.
2b. Explain the operation of data-cube with suitable examples. (08 Marks)
Ans:
ROLL-UP
• This is like zooming-out on the data-cube. (Figure 2.1a).
• This is required when the user needs further abstraction or less detail.
• Initially, the location-hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location-hierarchy from the level-of-city to the level-of-country.
Figure 2.1a: Roll-up operation
DRILL DOWN
• This is like zooming-in on the data. (Figure 2.1b).
• This is the reverse of roll-up.
• This is an appropriate operation
→ when the user needs further details or
→ when the user wants to partition more finely or
→ when the user wants to focus on some particular values of certain dimensions.
• This adds more details to the data.
• Initially, the time-hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level-of-quarter to the level-of-month.
Figure 2.1c: Pivot operation
SLICE & DICE
• These are operations for browsing the data in the cube.
• These operations allow ability to look at information from different viewpoints.
• A slice is a subset of cube corresponding to a single value for 1 or more members of
dimensions. (Figure 2.1d).
• A dice operation is done by performing a selection of 2 or more dimensions. (Figure 2.1e).
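The cube operations above can be imitated on a small array; the 3-D sales-cube, its dimension names, and the index choices below are illustrative assumptions:

```python
import numpy as np

# Hypothetical 3-D sales cube: dimensions (city, quarter, product).
cube = np.arange(24).reshape(2, 3, 4)   # 2 cities x 3 quarters x 4 products

# Slice: fix a single value of one dimension (here city index 0).
slice_city0 = cube[0, :, :]             # a 3x4 sub-cube

# Dice: select on 2 or more dimensions at once.
dice = cube[:, 0:2, 1:3]                # both cities, 2 quarters, 2 products

# Roll-up: aggregate away a dimension (ascend the hierarchy), here time.
rollup_over_time = cube.sum(axis=1)     # 2x4 totals per city and product

print(slice_city0.shape, dice.shape, rollup_over_time.shape)
```

Drill-down is the reverse direction: returning from the aggregated `rollup_over_time` view to the finer per-quarter values in `cube`.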
2c. Write short note on (08 Marks)
i) ROLAP
ii) MOLAP
iii) FASMI
iv) DATACUBE
Ans:(i) For answer, refer Solved Paper June-2014 Q.No.2b.
Ans:(ii) For answer, refer Solved Paper June-2014 Q.No.2b.
Ans:(iii) For answer, refer Solved Paper June-2015 Q.No.2a.
Ans:(iv) For answer, refer Solved Paper Dec-2014 Q.No.2a.
3a. Discuss the tasks of data-mining with suitable examples. (10 Marks)
Ans:
DATA-MINING
• Data-mining is the process of automatically discovering useful information in large data-repositories.
DATA-MINING TASKS
1) Predictive Modeling
• This refers to the task of building a model for the target-variable as a function of the explanatory-variables.
• The goal is to learn a model that minimizes the error between
i) Predicted values of target-variable and
ii) True values of target-variable (Figure 3.1).
• There are 2 types:
i) Classification: is used for discrete target-variables.
Ex: Predicting whether a web user will make a purchase at an online bookstore is a classification-task.
ii) Regression: is used for continuous target-variables.
Ex: Forecasting the future price of a stock is a regression-task.
2) Association Analysis
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis
We may discover the rule
{Diapers} → {Milk}
This suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) Finding groups of genes that have related functionality.
ii) Identifying web pages that are accessed together.
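The rule above can be checked numerically; the transactions below are made up for illustration:

```python
# Support and confidence of the rule {diapers} -> {milk}, computed over a
# small hypothetical set of market-basket transactions.
transactions = [
    {"bread", "diapers", "milk"},
    {"diapers", "milk", "beer"},
    {"bread", "milk"},
    {"diapers", "beer"},
    {"bread", "diapers", "milk", "beer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "milk"} <= t)
antecedent = sum(1 for t in transactions if "diapers" in t)

support = both / n              # fraction of all transactions with both items
confidence = both / antecedent  # fraction of diaper-buyers who also buy milk
print(support, confidence)      # 0.6 0.75
```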
3) Cluster Analysis
• This seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters.
• Useful applications:
i) To group sets of related customers.
ii) To find areas of the ocean that have a significant impact on Earth's climate.
• For example:
Collection of news articles in Table 1.2 shows
→ First 4 rows speak about the economy &
→ Last 2 rows speak about the health sector.
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies.
• The goal is to
i) Discover the real anomalies &
ii) Avoid falsely labeling normal objects as anomalous.
• Useful applications:
i) Credit card fraud detection &
ii) Network intrusion detection.
3b. Explain shortly any five data pre-processing approaches. (10 Marks)
Ans:
DATA PRE-PROCESSING
• Data pre-processing is a data-mining technique that involves transforming raw data into an
understandable format.
Q: Why is data pre-processing required?
• Data is often collected for unspecified applications.
• Data may have quality-problems that need to be addressed before applying a DM-technique.
For example: 1) Noise & outliers
2) Missing values &
3) Duplicate data.
• Therefore, preprocessing may be needed to make data more suitable for data-mining.
DATA PRE-PROCESSING APPROACHES
1. Aggregation
2. Dimensionality reduction
3. Variable transformation
4. Sampling
5. Feature subset selection
6. Discretization & binarization
7. Feature Creation
1) AGGREGATION
• This refers to combining 2 or more attributes into a single attribute.
For example, merging daily sales-figures to obtain monthly sales-figures.
Purpose
1) Data reduction: Smaller data-sets require less processing time & less
memory.
2) Aggregation can act as a change of scale by providing a high-level view of the data instead of a low-level view.
E.g. Cities aggregated into districts, states, countries, etc.
3) More “stable” data: Aggregated data tends to have less variability.
• Disadvantage: The potential loss of interesting details.
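A minimal sketch of the daily-to-monthly merging mentioned above (dates and amounts are invented):

```python
from collections import defaultdict

# Aggregation sketch: merging hypothetical daily sales-figures into
# monthly sales-figures (data reduction by a change of scale).
daily_sales = [
    ("2013-12-01", 120.0),
    ("2013-12-02", 80.0),
    ("2014-01-05", 200.0),
    ("2014-01-20", 50.0),
]

monthly = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]          # "YYYY-MM" -- the coarser level of the hierarchy
    monthly[month] += amount

print(dict(monthly))          # {'2013-12': 200.0, '2014-01': 250.0}
```

The daily detail is lost after the merge, which is exactly the disadvantage noted above.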
2) DIMENSIONALITY REDUCTION
• This refers to reducing the number of attributes under consideration.
• Many DM algorithms work poorly when the dimensionality of the data is high.
As a result, we get i) reduced classification accuracy & ii) poor quality clusters.
Purpose
• Avoid curse of dimensionality.
• May help to
→ eliminate irrelevant features & reduce noise
→ reduce the amount of time & memory required by DM algorithms &
→ allow the data to be more easily visualized.
3) VARIABLE TRANSFORMATION
• This refers to a transformation that is applied to all the values of a variable.
Ex: converting a floating point value to an absolute value.
• Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• For ex: If x is a variable, then transformations may be e^x, 1/x, log(x)
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• If x̄ is the mean of the attribute-values and sx is their standard deviation, then the transformation x' = (x − x̄)/sx creates a new variable that has a mean of 0 and a standard-deviation of 1.
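The transformation can be verified on a small set of values (the numbers are illustrative):

```python
import statistics

# Normalization sketch: x' = (x - mean) / std gives mean 0 and
# standard-deviation 1.
x = [2.0, 4.0, 6.0, 8.0]
mean = statistics.mean(x)            # 5.0
sd = statistics.pstdev(x)            # population standard deviation
x_new = [(v - mean) / sd for v in x]

print(round(statistics.mean(x_new), 10), round(statistics.pstdev(x_new), 10))
```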
4) SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is used for
i) Preliminary investigation of the data &
ii) Final data analysis.
• Q: Why sampling?
Ans: Obtaining & processing the entire set of "data of interest" is too expensive or time-consuming.
• Three sampling methods:
i) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 types:
a) Sampling without Replacement
• As each object is selected, it is removed from the population.
b) Sampling with Replacement
• Objects are not removed from the population as they are selected for the sample. The same object can be picked up more than once.
ii) Stratified Sampling
• This starts with pre-specified groups of objects.
• Equal numbers of objects are drawn from each group.
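The sampling schemes described above can be sketched as follows (the population, seed, and sample-sizes are illustrative):

```python
import random

# Sampling sketch: the schemes described above, on a toy population.
random.seed(42)                       # for repeatability of this illustration
population = list(range(100))

# Simple random sampling without replacement: each object leaves the pool.
without = random.sample(population, 10)

# Simple random sampling with replacement: the same object may repeat.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: equal numbers drawn from pre-specified groups.
strata = [population[:50], population[50:]]
stratified = [x for group in strata for x in random.sample(group, 5)]

print(len(without), len(with_repl), len(stratified))
```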
iii) Progressive Sampling
• This method starts with a small sample, and then increases the sample-size until a sample of sufficient size has been obtained.
5) FEATURE SUBSET SELECTION
• This reduces dimensionality by using only a subset of the features.
• Three standard approaches:
1) Embedded Approaches
• Feature selection occurs naturally as part of the DM algorithm.
2) Filter Approaches
• Features are selected before the DM algorithm is run.
3) Wrapper Approaches
• Use the DM algorithm as a black box to find the best subset of attributes.
7) FEATURE CREATION
• This creates new attributes that can capture the important information in a data-set much
more efficiently than the original attributes.
• Three general methods:
1) Feature Extraction
• Creation of new set of features from the original raw data.
2) Mapping Data to New Space
• A totally different view of data that can reveal important and interesting features.
3) Feature Construction
• Combining features to get better features than the original.
4a. Develop the Apriori Algorithm for generating frequent-itemsets. (08 Marks)
Ans:
APRIORI ALGORITHM FOR GENERATING FREQUENT-ITEMSETS
• Let Ck = set of candidate k-itemsets.
Let Fk = set of frequent k-itemsets.
• The algorithm initially makes a single pass over the data-set to determine the support of
each item.
After this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 & 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using frequent (k−1)-itemsets found in the previous iteration (step 5).
Candidate generation is implemented using a function called apriori-gen.
• To count the support of the candidates, the algorithm needs to make an additional pass over
the data-set (steps 6–10).
The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction 't'.
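The passes described above can be sketched in Python; the brute-force support counting and the simple join used for candidate generation below are illustrative simplifications, and the transactions are made up:

```python
# A compact sketch of Apriori frequent-itemset generation.
def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Brute-force support count: one pass over the data-set.
        return sum(1 for t in transactions if itemset <= t)

    # Pass 1: support of each item -> frequent 1-itemsets F1.
    items = {i for t in transactions for i in t}
    F = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = list(F)
    k = 2
    while F:
        # apriori-gen (simplified): join frequent (k-1)-itemsets to form Ck.
        candidates = {a | b for a in F for b in F if len(a | b) == k}
        # Additional pass: keep candidates whose support >= minsup.
        F = [c for c in candidates if support(c) >= minsup]
        frequent += F
        k += 1          # terminate when no new frequent-itemsets appear
    return set(frequent)

T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(T, minsup=3)
print(sorted(sorted(s) for s in result))
```

With minsup = 3, the three items and the three pairs are frequent, while {a, b, c} (support 2) is pruned.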
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than the minimum support (step 12).
• The algorithm terminates when there are no new frequent-itemsets generated (step 13).
Ans:
ASSOCIATION ANALYSIS
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: Market-basket analysis
We may discover the rule {Diapers} → {Milk},
which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) Finding groups of genes that have related functionality.
ii) Identifying web pages that are accessed together.
4c. Consider the transaction data-set:
Construct the FP tree by showing the trees separately after reading each
transaction. (08 Marks)
Ans:
Procedure:
1. A scan of T1 derives a list of frequent-items, ⟨(a:8), (b:5), (c:3), (d:1), ...⟩, in which items are ordered in frequency-descending order.
2. Then, the root of a tree is created and labeled with "null".
3. (a) For the first transaction, the first branch of the tree is constructed, with items listed in the descending order of frequent-items.
(b) For the third transaction (Figure 6.24iii).
→ since its (ordered) frequent-item list (a, c, d, e) shares a common prefix 'a' with the existing path (a, b):
→ the count of each node along the shared prefix is incremented by 1 and
→ three new nodes (c:1), (d:1), (e:1) are created and linked as a chain of children of (a:2).
(c) For the seventh transaction, since its frequent-item list contains only the item 'a', which shares only the node 'a' with the a-prefix subtree, a's count is incremented by 1.
(d) The above process is repeated for all the transactions.
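The construction procedure can be sketched as follows; the Node class, the minsup choice, and the sample transactions are illustrative assumptions, and no header-table or node-links are kept:

```python
from collections import Counter

# A minimal FP-tree construction sketch following the steps above.
class Node:
    def __init__(self, item, count=1):
        self.item, self.count, self.children = item, count, {}

def build_fp_tree(transactions, minsup=1):
    # Step 1: one scan to get the frequency-descending order of items.
    freq = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= minsup}
    # Step 2: create the root, labeled "null".
    root = Node(None, 0)
    # Step 3: insert each transaction along a shared-prefix path.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = Node(item) # new branch
            node = node.children[item]
    return root

T = [["a", "b"], ["a", "c", "d"], ["a"], ["b", "c"]]
tree = build_fp_tree(T)
print(tree.children["a"].count)   # 'a' heads three transactions -> 3
```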
5a. Explain Hunt's Algorithm and illustrate its working. (08 Marks)
Ans:
HUNT’S ALGORITHM
• A decision-tree is grown in a recursive fashion.
• Let Dt = set of training-records that are associated with node 't'.
Let y = {y1, y2, . . . , yc} be the class-labels.
• Hunt's algorithm is as follows:
Step 1:
• If all records in Dt belong to same class yt, then t is a leaf node labeled as yt.
Step 2:
• If Dt contains records that belong to more than one class, an attribute test-condition is selected to partition the records into smaller subsets.
• A child node is created for each outcome of the test-condition and the records
in Dt are distributed to the children based on the outcomes.
• The algorithm is then recursively applied to each child node.
EXPLANATION OF DECISION-TREE CONSTRUCTION
1) The initial tree for the classification problem contains a single node with class-label
Defaulted=No (Fig 4.7(a)).
2) The records are subsequently divided into smaller subsets based on the outcomes of
the Home Owner test-condition.
3) Hunt's algorithm is then applied recursively to each child of the root node.
4) The left child of the root is therefore a leaf node labeled Defaulted=No (Fig 4.7(b)).
5) For the right child, we need to continue applying the recursive step of Hunt's
algorithm until all the records belong to the same class.
TWO DESIGN ISSUES OF DECISION-TREE
1. How should the training-records be split?
The algorithm must provide
i) a method for specifying test-condition for different attribute-types.
ii) an objective-measure for evaluating goodness of each test-condition.
2. How should the splitting procedure stop?
A possible strategy is to continue expanding a node until either
i) All the records belong to the same class or
ii) All the records have identical attribute values.
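The two recursive steps can be sketched as follows; choosing the first unused attribute as the test-condition is a simplifying assumption (real implementations use an objective-measure such as Gini or entropy), and the records only loosely echo the Home Owner example:

```python
# A sketch of Hunt's algorithm on (attribute-dict, class-label) records.
def hunt(records, attributes):
    labels = {label for _, label in records}
    # Step 1 (plus a stopping rule): one class left, or no attributes
    # remain -> leaf node labeled with the (majority) class.
    if len(labels) == 1 or not attributes:
        return max(labels, key=lambda l: sum(1 for _, x in records if x == l))
    # Step 2: pick a test-condition, partition the records, create a
    # child per outcome, and recurse on each child.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for rec in records:
        children.setdefault(rec[0][attr], []).append(rec)
    return {attr: {value: hunt(subset, rest)
                   for value, subset in children.items()}}

records = [
    ({"home_owner": "yes"}, "No"),
    ({"home_owner": "no"}, "Yes"),
    ({"home_owner": "no"}, "No"),
    ({"home_owner": "no"}, "No"),
]
tree = hunt(records, ["home_owner"])
print(tree)
```

The "yes" child is pure and becomes a leaf immediately; the "no" child stops on the majority label once no attributes remain.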
5b. What is a rule-based classifier? Explain how a rule-based classifier works. (08 Marks)
Ans:
RULE-BASED CLASSIFICATION
• A rule-based classifier is a technique for classifying records.
• This uses a set of “if..then..” rules.
• The rule-set is represented as R = (r1 ∨ r2 ∨ . . . ∨ rk)
where R = rule-set
ri's = classification-rules
• General format of a rule: ri: (conditioni) → yi
where conditioni = conjunction of attribute tests (A1 op v1) ∧ (A2 op v2) ∧ . . . ∧ (Ak op vk)
yi = class-label
LHS = rule antecedent, containing a conjunction of attribute tests (Aj op vj)
RHS = rule consequent, containing the predicted class yi
op = logical operators such as =, ≠, <, >, ≤, ≥
• For ex: Rule R1 is
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
• Given a data-set D and a rule r: A → y, the quality of the rule can be evaluated using the following two measures:
i) Coverage is defined as the fraction of records in D that trigger the rule r.
ii) Accuracy is defined as fraction of records triggered by r whose class-labels are equal to y.
i.e. Coverage(r) = |A| / |D| and Accuracy(r) = |A ∩ y| / |A|
where |A| = no. of records that satisfy the rule antecedent.
|A ∩ y| = no. of records that satisfy both the antecedent and consequent.
|D| = total no. of records.
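The two measures can be computed directly; the records and the rule below are illustrative (patterned on rule R1):

```python
# Coverage and accuracy of a rule, computed on hypothetical records.
# Rule r: (Give Birth = no) AND (Can Fly = yes) -> Birds
D = [
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ({"give_birth": "no", "can_fly": "yes"}, "Reptiles"),
    ({"give_birth": "yes", "can_fly": "no"}, "Mammals"),
    ({"give_birth": "no", "can_fly": "no"}, "Fishes"),
]

def antecedent(x):
    return x["give_birth"] == "no" and x["can_fly"] == "yes"

A = [(x, y) for x, y in D if antecedent(x)]               # records triggering r
coverage = len(A) / len(D)                                # |A| / |D|
accuracy = sum(1 for _, y in A if y == "Birds") / len(A)  # |A ∩ y| / |A|
print(coverage, round(accuracy, 3))   # 0.6 0.667
```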
1) Mutually Exclusive Rules
• A classifier contains mutually exclusive rules if the rules are independent of each other.
• Every record is covered by at most one rule.
• In the above example,
→ lemur triggers only one rule, R3.
→ dogfish triggers no rule.
→ turtle triggers more than one rule, i.e. R4 & R5, so the rules are not mutually exclusive.
2) Exhaustive Rules
• Classifier has exhaustive coverage if it accounts for every possible combination of
attribute values.
• Each record is covered by at least one rule.
• In the above example,
→ lemur and turtle are covered, as each triggers at least one rule.
→ dogfish is not covered, as it does not trigger any rule; so the rule-set is not exhaustive.
5c. Write the algorithm for k-nearest neighbour classification. (04 Marks)
Ans:
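A sketch of the k-nearest-neighbour classification procedure; the Euclidean metric, the majority vote, and the sample records are assumptions for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    # 1. Compute the distance from the query point to every training-record.
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # 2. Take the k nearest records.
    nearest = ranked[:k]
    # 3. Assign the majority class-label among the k neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((0.9, 1.1), "A")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "A"
```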
6a. What is Bayes Theorem? Show how it is used for classification. (06 Marks)
Ans:
BAYES THEOREM
• Bayes theorem is a statistical principle for combining prior knowledge of the classes with new
evidence gathered from data.
• Let X and Y be a pair of random-variables.
• A conditional probability P(X=x | Y=y) is the probability that a random-variable will take on a
particular value given that the outcome for another random-variable is known.
• The Bayes theorem is given by
P(Y | X) = P(X | Y) P(Y) / P(X)
• The Bayes theorem can be used to solve the prediction problem.
• Two implementations of Bayesian methods are used:
1. Naive Bayes classifier &
2. Bayesian belief network.
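A small numeric check of the theorem (all probabilities below are invented for illustration):

```python
# Bayes theorem: P(Y|X) = P(X|Y) P(Y) / P(X), with P(X) obtained from
# the law of total probability.
p_y = 0.3                 # prior P(Y)
p_x_given_y = 0.8         # likelihood P(X|Y)
p_x_given_not_y = 0.1

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)   # = 0.31
p_y_given_x = p_x_given_y * p_y / p_x                   # posterior
print(round(p_y_given_x, 4))   # 0.7742
```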
NAIVE BAYES CLASSIFIER
• A naive Bayes classifier estimates the class-conditional probability by assuming that the
attributes are conditionally independent.
• The conditional independence assumption can be formally stated as follows:
P(X | Y = y) = ∏ P(Xi | Y = y), the product running over i = 1, . . . , d
where each attribute-set X = {X1, X2, . . . , Xd} consists of 'd' attributes.
Conditional Independence
• Let X, Y, and Z denote three sets of random-variables.
• The variables in X are said to be conditionally independent of Y, given Z, if the
following condition holds:
P(X | Y, Z) = P(X | Z)
• For a categorical attribute Xi, the conditional probability P(Xi =xi | Y= y) is estimated
according to the fraction of training instances in class y that take on a particular
attribute value xi.
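That fraction can be computed directly from toy training data (attribute values and labels are invented):

```python
# Estimating P(Xi = xi | Y = y) for a categorical attribute as the
# fraction of class-y training instances that take on value xi.
train = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"),
         ("sunny", "yes"), ("rainy", "no")]

def cond_prob(value, label):
    in_class = [x for x, y in train if y == label]
    return sum(1 for x in in_class if x == value) / len(in_class)

print(cond_prob("sunny", "yes"))   # 2 of 3 "yes" instances are sunny
```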
6b. Discuss methods for estimating predictive accuracy of classification. (10 Marks)
Ans:
PREDICTIVE ACCURACY
• This refers to the ability of the model to correctly predict the class-label of new or previously
unseen data.
• A confusion-matrix that summarizes the no. of instances predicted correctly or incorrectly by
a classification-model is shown in Table 5.6.
METHODS FOR ESTIMATING PREDICTIVE ACCURACY
1) Sensitivity
2) Specificity
3) Recall &
4) Precision
• Let True positive (TP) = no. of positive examples correctly predicted.
False negative (FN) = no. of positive examples wrongly predicted as negative.
False positive (FP) = no. of negative-examples wrongly predicted as positive.
True negative (TN) = no. of negative-examples correctly predicted.
• The true positive rate (TPR) or sensitivity is defined as the fraction of positive examples predicted correctly by the model,
i.e. TPR = TP / (TP + FN)
Similarly, the true negative rate (TNR) or specificity is defined as the fraction of negative-examples predicted correctly by the model,
i.e. TNR = TN / (TN + FP)
• Finally, the false positive rate (FPR) is the fraction of negative-examples predicted as a positive class,
i.e. FPR = FP / (TN + FP)
Similarly, the false negative rate (FNR) is the fraction of positive examples predicted as a negative class,
i.e. FNR = FN / (TP + FN)
• Recall and precision are two widely used metrics employed in applications where successful detection of one of the classes is considered more significant than detection of the other classes,
i.e. Precision p = TP / (TP + FP) and Recall r = TP / (TP + FN)
• Precision determines the fraction of records that actually turns out to be positive in the
group the classifier has declared as a positive class.
• Recall measures the fraction of positive examples correctly predicted by the classifier.
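The metrics above can be computed from a hypothetical confusion-matrix:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FN, FP, TN = 40, 10, 5, 45

tpr = TP / (TP + FN)        # sensitivity / recall
tnr = TN / (TN + FP)        # specificity
fpr = FP / (TN + FP)
fnr = FN / (TP + FN)
precision = TP / (TP + FP)

print(tpr, tnr, fpr, fnr, round(precision, 3))
```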
• Weighted accuracy measure is defined by the following equation:
Weighted accuracy = (w1·TP + w4·TN) / (w1·TP + w2·FP + w3·FN + w4·TN)
6c. What are the two approaches for extending the binary-classifiers to handle multi-class problems. (04 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.6b.
7a. List and explain four distance measures to compute the distance between a pair
of points and find out the distance between two objects represented by attribute
values (1,6,2,5,3) & (3,5,2,6,6) by using any 2 of the distance measures (08 Marks)
Ans:
1) EUCLIDEAN DISTANCE
• This metric is most commonly used to compute distances.
• The largest valued-attribute may dominate the distance.
• Requirement: The attributes should be properly scaled.
• This metric is more appropriate when the data is not standardized.
2) MANHATTAN DISTANCE
• In most cases, the result obtained by this measure is similar to that obtained by using the Euclidean distance.
• The largest valued attribute may dominate the distance.
3) CHEBYCHEV DISTANCE
• This metric is based on the maximum attribute difference.
Solution:
Given, (x1,x2,x3,x4,x5) = (1, 6, 2, 5, 3)
(y1,y2,y3,y4,y5) = (3, 5, 2, 6, 6)
Euclidean Distance is
D(x,y) = √((1−3)² + (6−5)² + (2−2)² + (5−6)² + (3−6)²) = √(4+1+0+1+9) = √15
= 3.872983
Manhattan Distance is
D(x,y) = |x1−y1| + |x2−y2| + |x3−y3| + |x4−y4| + |x5−y5| = 2+1+0+1+3
= 7
Chebychev Distance is
D(x,y) = max(|1−3|, |6−5|, |2−2|, |5−6|, |3−6|) = max(2, 1, 0, 1, 3)
= 3
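The three measures can be verified programmatically for the given objects:

```python
import math

# The three distance measures applied to the given objects.
x = (1, 6, 2, 5, 3)
y = (3, 5, 2, 6, 6)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))
chebychev = max(abs(a - b) for a, b in zip(x, y))

print(round(euclidean, 6), manhattan, chebychev)   # 3.872983 7 3
```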
7b. Explain the cluster analysis methods briefly. (08 Marks)
Ans:
CLUSTER ANALYSIS METHODS
Partitional Method
YS
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster (Figure 4.1a).
• The method obtains a single-level partition of objects.
• The analyst has to specify
i) Number of clusters (k) in advance and
ii) Starting seeds of the clusters.
• The analyst has to use an iterative approach in which he runs the method many times
→ specifying different numbers of clusters & different starting seeds &
→ then selecting the best solution.
• The method converges to a local minimum rather than the global minimum.
Hierarchical Methods
• A set of nested clusters is organized as a hierarchical tree (Figure 4.1b).
• Two types:
1. Agglomerative: This starts with each object in an individual cluster & then tries to merge similar clusters into larger and larger clusters.
2. Divisive: This starts with all objects in one cluster & then subdivides it into smaller and smaller clusters.
Density-based Method
• A cluster is a dense region of points, which is separated by low-density regions from other regions of high-density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist within a given distance of it.
A-19
For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2013
7c. What are the features of cluster analysis? (04 Marks)
Ans:
DESIRED FEATURES OF CLUSTER ANALYSIS METHOD
Scalability
• Data-mining problems can be large.
• Therefore, a cluster-analysis method should be able to deal with large
problems gracefully.
• The method should be able to deal with datasets in which number of attributes
is large.
Only one Scan of the Dataset
• For large problems, data must be stored on disk.
• So, the cost of disk I/O becomes significant in solving the problem.
• Therefore, the method should not require more than one scan of the dataset on disk.
Ability to Stop & Resume
• For large dataset, cluster-analysis may require huge processor-time to
complete the task.
• Therefore, the task should be able to be stopped & then resumed as & when
required.
Minimal Input Parameters
• The method should not expect too much guidance from the data-mining
analyst.
• Therefore, the analyst should not be expected
→ to have domain knowledge of the data and
→ to possess insight into the clusters.
Robustness
• The method should not be sensitive to errors (noise), outliers & missing values in the data.
8. Write short note on the following: (20 Marks)
a. Text mining b. Spatial-databases mining
c. Mining temporal databases d. Web content mining
Ans (a):
TEXT MINING
• This is concerned with the extraction of information implicitly contained in a collection of documents.
• Text-collection lacks the imposed structure of a traditional database.
• The text expresses a vast range of information. (DM = Data-mining, TM = Text-Mining).
• The text encodes the information in a form that is difficult to decipher automatically.
• Traditional DM techniques are designed to operate on structured-databases.
• In structured-databases,
it is easy to define the set of items and
hence, it is easy to use the traditional DM techniques.
In a textual-database, identifying individual items (or terms) is a difficult task.
• TM techniques have to be developed to process the unstructured textual-data.
• The inherent nature of textual-data motivates the development of separate TM techniques.
For ex, its unstructured characteristics.
• Two approaches for text-mining:
1) Impose a structure on the textual-database and use any of the known DM
techniques meant for structured-databases.
2) Develop a very specific technique for mining that exploits the inherent
characteristics of textual-databases.
Ans (b):
SPATIAL-DATABASES MINING
• This refers to the extraction of knowledge, spatial relationships, or other interesting patterns
not explicitly stored in spatial-databases.
• Consider a map of the city of Mysore containing clusters of points. (Where each point marks
the location of a particular house).
We can mine varieties of information by identifying likely-relationships.
For ex, "the land-value of the cluster of residential area around 'Mysore Palace' is high".
Such information could be useful to investors, or prospective home buyers.
SPATIAL MINING TASKS
1) Spatial-characteristic Rule
• This is a general description of spatial-data.
• For example, a rule may describe the general price-ranges of houses in various geographic regions of a city.
4) Attribute-oriented Induction
• The concept hierarchies of spatial and non-spatial attributes can be used to
determine relationships between different attributes.
• For ex, one may be interested in a particular category of land-use patterns.
A built-up area may be a recreational facility or a residential complex.
Similarly, a recreational facility may be a cinema or a restaurant.
5) Aggregate Proximity Relationships
• This problem is concerned with relationships between spatial-clusters based on
spatial and non-spatial attributes.
• Given „n‟ input clusters, we want to associate the clusters with classes of
features.
• For example, clusters may be associated with educational institutions which, in turn, may comprise secondary schools and junior colleges or higher institutions.
Ans (c):
MINING TEMPORAL DATABASES
• This can be defined as non-trivial extraction of potentially-useful & previously-unrecorded
information with an implicit/explicit temporal-content, from large quantities of data.
• This has the capability to infer causal and temporal-proximity relationships.
FOUR TYPES OF TEMPORAL-DATA
1) Static
• Static-data are free of any temporal-reference.
• Inferences derived from static-data are also free of any temporality.
2) Sequences (Ordered Sequences of Events)
• There may not be any explicit reference to time.
• There exists a temporal-relationship between data-items.
• For example, market-basket transactions.
3) Timestamped
• The temporal-information is explicit.
• The relationship can be quantitative, in the sense that
→ we can say the exact temporal-distance between the data-elements &
→ we can say that one transaction occurred before another.
• For example, census data, land-use data etc.
• Inferences derived from this data can be temporal or non-temporal.
4) Fully Temporal
• The validity of the data-element is time-dependent.
• Inferences derived from this data are necessarily temporal.
Ans (d):
WEB CONTENT MINING
• This is the process of extracting useful information from the contents of web-documents.
• In recent years,
→ government information is gradually being placed on the web.
→ users access digital libraries from the web.
→ selecting the hardware & software tools.
• This step also involves consulting
→ with senior-management &
→ with the various stakeholders.
2) Hardware Integration
• Both hardware and software need to be put together by integrating
→ servers
→ storage devices &
→ client software tools.
3) Modeling
• This involves designing the warehouse schema and views.
4) Physical Modeling
• This involves designing
→ data-warehouse organization
→ data placement
→ data partitioning &
→ deciding on access methods & indexing.
• This may involve using a modeling tool if the data-warehouse is complex.
5) Sources
• This involves identifying and connecting the sources using gateways.
6) ETL
• This involves
→ identifying a suitable ETL tool vendor
→ purchasing the tool &
1c. Write the differences between OLTP and data-warehouse. (06 Marks)
Ans:
DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2014
2a. Explain characteristics of OLAP & write comparison of OLTP & OLAP. (12 Marks)
Ans:
CHARACTERISTICS OF OLAP SYSTEMS
1) Users
• OLTP systems are designed for many office-workers, say 100-1000 users.
Whereas, OLAP systems are designed for few decision-makers.
2) Functions
• OLTP systems are mission-critical. They support the company's day-to-day
operations. They are mostly performance-driven.
Whereas, OLAP systems are management-critical. They support the
company's decision-functions using analytical-investigations.
3) Nature
• OLTP systems are designed to process one record at a time, for ex, a record
related to the customer.
Whereas, OLAP systems
→ involve queries to deal with many records at a time &
→ provide aggregate data to a manager.
4) Design
• OLTP systems are designed to be application-oriented.
Whereas, OLAP systems are designed to be subject-oriented.
5) Data
• OLTP systems view the operational-data as a collection of tables.
Whereas, OLAP systems view operational-information as a multidimensional model.
2b. Explain ROLAP & MOLAP. (08 Marks)
Ans:
ROLAP
• This uses relational or extended-relational DBMS to store & manage data of warehouse.
• This can be considered a bottom-up approach to OLAP.
• This is based on using a data-warehouse which is designed using a star schema.
• Data-warehouse provides multidimensional capabilities.
• In DW, data is represented in i) fact-table &
ii) dimension-table.
• The fact-table contains
→ one column for each dimension &
→ one column for each measure.
• Every row of the fact-table provides one fact.
• An OLAP tool is used to manipulate the data in the DW tables.
• OLAP tool
→ groups the fact-table to find aggregates &
→ uses some of the aggregates already computed to find new aggregates.
• Advantages:
1) More easily used with existing relational DBMS.
2) Data can be stored efficiently using tables.
3) Greater scalability.
• Disadvantage:
1) Poor query-performance.
• Some products are i) Oracle OLAP mode &
ii) OLAP Discoverer.
MOLAP
• This is based on using a multidimensional DBMS.
• The multidimensional DBMS is used to store & access data.
• This can be considered as a top-down approach to OLAP.
• This does not have a standard approach to storing and maintaining the data.
• This uses special-purpose file-indexes.
• The file-indexes store pre-computation of all aggregations in the data-cube.
• Advantages:
1) Implementation is efficient.
3a. Explain 4 types of attributes with statistical operations & examples. (06 Marks)
Ans:
3c. Two binary vectors are given below: (04 Marks)
X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
Calculate (i) SMC (ii) Jaccard similarity coefficient and (iii) Hamming distance.
Ans:
Solution:
Let X = (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10) = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (y1, y2, y3, y4, y5, y6, y7, y8, y9, y10) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
f11 = no. of attributes where x is 1 and y is 1 = 0
f00 = no. of attributes where x is 0 and y is 0 = 7
f01 = no. of attributes where x is 0 and y is 1 = 2
f10 = no. of attributes where x is 1 and y is 0 = 1
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
Jaccard coefficient = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
Hamming Distance is given by
D(x,y)= |x1-y1|+|x2-y2|+|x3-y3|+|x4-y4|+|x5-y5|+|x6-y6|+|x7-y7|+|x8-y8|+|x9-y9|+|x10-y10|
=3
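The three measures for the given vectors can be checked with a short Python sketch (the function names are illustrative, not from the text):

```python
def binary_counts(x, y):
    """Count the four agreement/disagreement cases for two binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))
    f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
    f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
    f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
    return f11, f00, f10, f01

def smc(x, y):
    f11, f00, f10, f01 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f00 + f10 + f01)

def jaccard(x, y):
    f11, _, f10, f01 = binary_counts(x, y)
    return f11 / (f11 + f10 + f01) if (f11 + f10 + f01) else 0.0

def hamming(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

X = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(X, Y), jaccard(X, Y), hamming(X, Y))  # 0.7 0.0 3
```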
4a. Consider the following transaction data-set 'D' shows 9 transactions and list of
items using Apriori algorithm frequent-itemset minimum support = 2 (10 Marks)
Ans:
Step 1: Generating 1-itemset frequent-pattern.
Step 2: Generating 2-itemset frequent-pattern.
4b. For the following transaction data-set table, construct an FP tree and explain
stepwise for all the transactions. (10 Marks)
5a. Define classification. Draw a neat figure and explain general approach for solving
classification-model. (06 Marks)
Ans:
CLASSIFICATION
• Classification is the task of learning a target-function f that maps each attribute-set x to one
of the predefined class-labels y.
• The target-function is also known informally as a classification-model.
• A classification-model is useful for the following purposes:
1) Descriptive Modeling
• A classification-model can serve as an explanatory-tool to distinguish between objects
of different classes.
• For example, it is useful for biologists to have a descriptive model.
2) Predictive Modeling
• A classification-model can be used to predict the class-label of unknown-records.
• First, a training-set consisting of records whose class-labels are known must be provided.
• Evaluation of the model is based on the counts of test-records correctly and incorrectly
predicted by the model. These counts are tabulated in a confusion-matrix (Table 4.2).
• Each entry fij in matrix denotes the number of records from class i predicted to be of class j.
For instance, f01 is the number of records from class 0 incorrectly predicted as class 1.
• Accuracy is defined as:
Accuracy = (Number of correct predictions) / (Total number of predictions)
= (f11 + f00) / (f11 + f10 + f01 + f00)
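A minimal sketch of computing accuracy from a confusion-matrix, using the fij notation described above (the counts themselves are hypothetical):

```python
def accuracy(confusion):
    """confusion[i][j] = number of records of class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# hypothetical 2-class counts: rows = actual class, columns = predicted class
cm = [[50, 10],   # f00 = 50, f01 = 10
      [5, 35]]    # f10 = 5,  f11 = 35
print(accuracy(cm))  # 0.85
```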
5b. Mention the three impurity measures for selecting best splits. (04 Marks)
Ans:
• The three impurity measures are:
1) Entropy(t) = −Σ p(i|t) log2 p(i|t)
2) Gini(t) = 1 − Σ [p(i|t)]²
3) Classification error(t) = 1 − max [p(i|t)]
where p(i|t) = fraction of records that belong to class i at node t.
5c. Consider a training-set that contains 60 +ve examples and 100 -ve examples, for
each of the following candidate rules.
Rule r1: Covers 50 +ve examples and 5 -ve examples.
Rule r2: Covers 2 +ve examples and no -ve examples.
Determine which is the best and worst candidate rule according to,
i) Rule accuracy
ii) Likelihood ratio statistic.
iii) Laplace measure. (10 Marks)
Ans:
(i) Rule accuracy is given by: accuracy = (no. of +ve examples covered) / (total examples covered).
For r1: accuracy = 50/55 = 90.9%
For r2: accuracy = 2/2 = 100%
Therefore, r2 is the best and r1 is the worst candidate rule according to rule accuracy.
(ii) Likelihood ratio statistic is given by R = 2 Σ fi log2(fi/ei),
where fi = observed frequency of class i and ei = expected frequency of class i.
For r1:
The expected frequency for the positive-class is e+ = 55 × 60/160 = 20.625 and the
expected frequency for the negative class is e− = 55 × 100/160 = 34.375.
Therefore, the likelihood ratio is
R(r1) = 2 × [50 × log2(50/20.625) + 5 × log2(5/34.375)] = 99.9
For r2:
The expected frequency for the positive-class is e+ = 2 × 60/160 = 0.75 and the
expected frequency for the negative class is e− = 2 × 100/160 = 1.25.
Therefore, the likelihood ratio is
R(r2) = 2 × [2 × log2(2/0.75) + 0] = 5.66
Therefore, r1 is the best and r2 is the worst candidate rule according to the likelihood ratio statistic.
(iii) Laplace measure is given by: Laplace = (f+ + 1) / (n + k),
where n = examples covered, f+ = +ve examples covered and k = number of classes.
For r1: Laplace = (50 + 1)/(55 + 2) = 0.8947
For r2: Laplace = (2 + 1)/(2 + 2) = 0.75
Therefore, r1 is the best and r2 is the worst candidate rule according to the Laplace measure.
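The three rule-evaluation measures can be computed with a short Python sketch (the function name and argument order are illustrative, not from the text):

```python
import math

def rule_stats(pos, neg, total_pos, total_neg, k=2):
    """Accuracy, likelihood ratio statistic and Laplace measure for one rule."""
    n = pos + neg                      # examples covered by the rule
    total = total_pos + total_neg
    acc = pos / n
    lr = 0.0
    for f, class_total in ((pos, total_pos), (neg, total_neg)):
        e = n * class_total / total    # expected frequency of the class
        if f > 0:
            lr += f * math.log2(f / e)
    lr *= 2
    laplace = (pos + 1) / (n + k)
    return acc, lr, laplace

print(rule_stats(50, 5, 60, 100))  # r1: accuracy ~0.909, R ~99.9, Laplace ~0.895
print(rule_stats(2, 0, 60, 100))   # r2: accuracy 1.0, R ~5.66, Laplace 0.75
```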
6a. For the given Confusion-matrix below for 3 classes. Find sensitivity & specificity
metrics to estimate predictive accuracy of classification methods. (10 Marks)
Let True positive (TP) = no. of positive-examples correctly predicted.
False negative (FN) = no. of positive-examples wrongly predicted as negative.
False positive (FP) = no. of negative-examples wrongly predicted as positive.
True negative (TN) = no. of negative-examples correctly predicted.
True positive rate (TPR) or sensitivity is given by TPR = TP / (TP + FN).
True negative rate (TNR) or specificity is given by TNR = TN / (TN + FP).
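A minimal sketch of the two metrics; the counts used here are hypothetical, obtained by treating one class of the multiclass problem as positive and the rest as negative:

```python
def sensitivity(tp, fn):
    """TPR: fraction of positive examples correctly predicted."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TNR: fraction of negative examples correctly predicted."""
    return tn / (tn + fp)

print(sensitivity(tp=40, fn=10))  # 0.8
print(specificity(tn=90, fp=10))  # 0.9
```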
6b. Explain with example the two approaches for extending the binary-classifiers to
handle multiclass problem. (10 Marks)
Ans:
TWO APPROACHES FOR EXTENDING THE BINARY-CLASSIFIERS
1) 1-r approach &
2) 1-1 approach
• Let Y = {y1, y2, . . . , yK} be set of classes of input-data
1) 1-r (one-against-rest) Approach
• This approach decomposes the multiclass-problem into K binary-problems.
• For each class yi Є Y, a binary-problem is created.
All instances that belong to yi are considered positive-examples.
The remaining instances are considered negative-examples.
A binary-classifier is then constructed to separate instances of class yi from the
rest of the classes.
2) 1-1 (one-against-one) Approach
• This approach constructs K(K − 1)/2 binary-classifiers.
• Each classifier is used to distinguish between a pair of classes, (yi, yj).
• Instances that do not belong to either yi or yj are ignored when constructing
the binary-classifier for (yi, yj).
• In both (1-r) and (1-1) approaches, a test-instance is classified by combining the predictions
made by the binary-classifiers.
• A voting-scheme is used to combine the predictions.
• The class that receives the highest number of votes is assigned to the test-instance.
• In the 1-r approach, if an instance is classified as negative, then all classes except for the
positive-class receive a vote.
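The 1-r voting scheme described above can be sketched as follows; the per-class classifiers here are hypothetical stand-ins returning +1/−1, not a real library API:

```python
from collections import Counter

def one_vs_rest_predict(classifiers, x, classes):
    """classifiers[i](x) -> +1 if x looks like classes[i], else -1."""
    votes = Counter()
    for yi, clf in zip(classes, classifiers):
        if clf(x) == +1:
            votes[yi] += 1              # the positive class gets the vote
        else:
            for yj in classes:          # every class except yi gets a vote
                if yj != yi:
                    votes[yj] += 1
    # the class with the highest number of votes is assigned to the instance
    return votes.most_common(1)[0][0]

# toy binary classifiers, one per class
classes = ["y1", "y2", "y3"]
clfs = [lambda x, c=c: +1 if x == c else -1 for c in classes]
print(one_vs_rest_predict(clfs, "y2", classes))  # y2
```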
• The first two rows in this table correspond to the pair of classes (yi, yj) chosen to
build the classifier.
• The last row represents the predicted class for the test-instance.
• After combining the predictions,
→ y1 and y4 each receive two votes &
7a. Explain K means clustering method and algorithm. (10 Marks)
Ans:
K-MEANS
• K means is a partitional method of cluster analysis.
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster.
• The method obtains a single-level partition of objects.
• This method can only be used if all the data-objects are located in the main memory.
• The method is called K-means since
each of the K clusters is represented by the mean of the objects (called the centroid) within it.
• The method is also called the centroid-method since
→ at each step, the centroid-point of each cluster is assumed to be known &
→ each of the remaining points are allocated to cluster whose centroid is closest to it.
K-MEANS ALGORITHM
1) Select the number of clusters=k. (Figure 7.1a).
2) Pick k seeds as centroids of k clusters. The seeds may be picked randomly.
3) Compute the Euclidean distance of each object in the dataset from each of the centroids.
4) Allocate each object to the cluster it is nearest to.
5) Compute the centroids of clusters.
6) Check if the stopping criterion has been met (i.e. cluster-membership is unchanged)
If yes, go to step 7.
If not, go to step 3.
7) One may decide
→ to stop at this stage or
→ to split a cluster or combine two clusters until a stopping criterion is met.
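The steps above can be sketched for 2-D points as follows; seeds are picked as the first k points rather than randomly, so the run is deterministic:

```python
import math

def kmeans(points, k, max_iter=100):
    centroids = points[:k]                       # step 2: pick k seeds
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # steps 3-4: allocate each object to the nearest centroid
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            if assignment[i] != nearest:
                assignment[i], changed = nearest, True
        if not changed:                          # step 6: membership unchanged
            break
        # step 5: recompute the centroid of each cluster
        for c in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```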
Figure 7.1a
LIMITATIONS OF K MEANS
1) The results of the method depend strongly on the initial guesses of the seeds.
2) The method can be sensitive to outliers.
7b. What is Hierarchical clustering method? Explain the algorithms for computing
distances between clusters. (10 Marks)
Ans:
HIERARCHICAL METHODS
• A set of nested clusters is organized as a hierarchical tree. (Figure 7.1b).
• This approach allows clusters to be found at different levels of granularity.
Figure 7.1b
Figure 7.1c
AGGLOMERATIVE ALGORITHM
1) Allocate each point to a cluster of its own. Thus, we start with n clusters for n
objects.
2) Create a distance-matrix by computing distances between all pairs of clusters
(either using the single link metric or the complete link metric). Sort these
distances in ascending order.
3) Find the two clusters that have the smallest distance between them.
4) Remove the pair of clusters from the distance-matrix and merge them.
5) If there is only one cluster left then stop.
6) Compute all distances from the new cluster and update the distance-matrix
after the merger and go to step 3.
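The agglomerative steps can be sketched using the single link metric; instead of a threshold distance, this simplified version merges until a requested number of clusters remains:

```python
import math

def single_link(points, target_clusters):
    clusters = [[p] for p in points]          # step 1: one cluster per point
    while len(clusters) > target_clusters:
        # steps 2-3: find the pair of clusters with the smallest
        # single-link (minimum pairwise) distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # steps 4-6: merge the closest pair and continue
        clusters[i] += clusters.pop(j)
    return clusters

print(single_link([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```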
2) DIVISIVE APPROACH
• This method is basically a top-down approach.
• This method
→ starts with the whole dataset as one cluster
→ then proceeds to recursively divide the cluster into two sub-clusters and
→ continues until each cluster has only one object (Figure 7.1d).
• Two types are:
1) Monothetic: This splits a cluster using only one attribute at a time.
An attribute that has the most variation could be selected.
2) Polythetic: This splits a cluster using all of the attributes together.
Two clusters far apart could be built based on the distance between objects.
DIVISIVE ALGORITHM
1) Decide on a method of measuring the distance between 2 objects. Also,
decide a threshold distance.
2) Create a distance-matrix by computing distances between all pairs of objects
within the cluster. Sort these distances in ascending order.
3) Find the 2 objects that have the largest distance between them. They are the
most dissimilar objects.
4) If the distance between the 2 objects is smaller than the pre-specified
threshold and there is no other cluster that needs to be divided then stop,
otherwise continue.
5) Use the pair of objects as seeds of a K-means method to create 2 new
clusters.
6) If there is only one object in each cluster then stop otherwise continue with
step 2.
8. Write short notes on the following:
a. Web content mining
b. Text mining
c. Spatial-data-mining
d. Spatio-temporal data-mining (20 Marks)
Ans (d):
• Mining spatial-data is different from mining classical-data because:
i) Spatial-data is embedded in a continuous space.
Whereas, classical datasets are in discrete notions like transactions.
ii) Spatial-data are highly auto-correlated, so the common assumption of
independence of data samples in classical statistical analysis is generally false.
1b. Explain the guidelines for data-warehouse implementation. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.1c.
DATA WAREHOUSING & DATA MINING SOLVED PAPER DEC-2014
2a. Why multidimensional views of data and data-cubes are used? With a neat
diagram, explain data-cube implementations. (10 Marks)
Ans:
DATA-CUBE
• Data-cube refers to multi-dimensional array of data.
• The data-cube is used to represent data along some measure-of-interest.
• Data-cubes allow us to look at complex data in a simple format.
• For ex (Fig 2.1a): A company might summarize financial-data to compare sales i) by product
ii) by date &
iii) by country.
Figure 2.1a: Data-cube of sales
DATA-CUBE IMPLEMENTATION
1) Pre-compute and Store All
• The more aggregates we are able to pre-compute, the better the query-performance.
• Data-cube products use following methods for pre-computing aggregates:
i) ROLAP (relational OLAP) ii) MOLAP (multidimensional OLAP)
2b. What are data-cube operations? Explain. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.2b.
3b. Why data preprocessing is required in DM? Explain various steps in data
preprocessing (06 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b.
Ans:
DATA-MINING APPLICATIONS
Prediction & Description
• Data-mining may be used to answer questions like
i) "Would this customer buy a product?" or
ii) "Is this customer likely to leave?"
• DM techniques may also be used for sales forecasting and analysis.
Relationship Marketing
• Customers have a lifetime value, not just the value of a single sale.
• Data-mining can be helpful for
i) Analyzing customer-profiles and improving direct marketing plans.
ii) Identifying critical issues that determine client-loyalty.
iii) Improving customer retention.
Customer Profiling
• This is the process of using the relevant- & available-information to
i) Describe the characteristics of a group of customers.
ii) Identify their discriminators from ordinary consumers.
iii) Identify drivers for their purchasing decisions.
TE
• This can help the company identify its most valuable customers
so that the company may differentiate their needs and values.
Outliers Identification & Detecting Fraud
• For this, examples include:
i) Identifying unusual expense claims by staff.
ii) Identifying anomalies in expenditure b/w similar units of the company.
4a. Explain FP - growth algorithm for discovering frequent-item sets. What are its
limitations? (08 Marks)
Ans:
FP - GROWTH ALGORITHM
• This algorithm
→ encodes the data-set using a compact data-structure called a FP-tree &
→ extracts frequent-itemsets directly from this structure (Figure 6.24).
• This finds all the frequent-itemsets ending with a particular suffix.
• This employs a divide-and-conquer strategy to split the problem into smaller subproblems.
• For example, suppose we are interested in finding all frequent-itemsets ending in e.
To do this, we must first check whether the itemset {e} itself is frequent.
If it is frequent, we consider subproblem of finding frequent-itemsets ending in
de, followed by ce, be, and ae.
• In turn, each of these subproblems is further decomposed into smaller subproblems.
• By merging the solutions obtained from the subproblems, all the frequent-itemsets ending in
e can be found (Figure 6.27).
4b. What is Apriori algorithm? How it is used to find frequent-item sets? Explain
briefly. (08 Marks)
Ans:
APRIORI ALGORITHM
• Apriori Theorem states:
“If an itemset is frequent, then all of its subsets must also be frequent.”
• Consider the following example.
Suppose {c, d, e} is a frequent-itemset then any transaction that contains {c, d, e} must
also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d} and {e} (Figure 6.3).
As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.
• Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
• Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets
generated by the algorithm is 4C2 = 6.
• Two of these six candidates, {Beer, Bread} and {Beer, Milk} are subsequently found to be
infrequent after computing their support values.
• The remaining 4 candidates are frequent, and thus will be used to generate candidate 3-itemsets.
• With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are
frequent (Figure 6.5).
• The only candidate that has this property is {Bread, Diapers, Milk}.
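The candidate generation and pruning described above can be sketched in Python; the transaction table is reconstructed from the standard market-basket example (treat the exact transactions as illustrative):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Apriori-style level-wise search with subset-based candidate pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        # count support and keep the frequent k-itemsets
        counted = [c for c in current
                   if sum(c <= set(t) for t in transactions) >= minsup]
        frequent += counted
        # generate (k+1)-candidates and keep only those whose
        # every k-subset is frequent (the Apriori principle)
        k += 1
        candidates = {a | b for a in counted for b in counted if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in counted for s in combinations(c, k - 1))]
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
result = frequent_itemsets(transactions, 3)
print(len(result))  # 8: four frequent 1-itemsets and four frequent 2-itemsets
```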
4c. List the measures used for evaluating association patterns. (04 Marks)
Ans:
• The measures used for evaluating association patterns include:
i) Support
ii) Confidence &
iii) Interest factor (lift).
5a. How decision-trees are used for classification? Explain decision-tree induction
algorithm for classification. (10 Marks)
Ans:
HUNT'S ALGORITHM
• A decision-tree is grown in a recursive fashion.
• Let Dt = set of training-records that are associated with node 't'.
Let y = {y1, y2, . . . , yc} be class-labels.
• Hunt's algorithm is as follows.
Step 1:
• If all records in Dt belong to same class yt, then t is a leaf node labeled as yt.
TE
Step 2:
• If Dt contains records that belong to more than one class, an attribute test
condition is selected to partition the records into smaller subsets.
DECISION-TREE ALGORITHM: TREEGROWTH
• The input to the algorithm consists of i) training-records E and ii) attribute-set F.
• The algorithm works by
i) Recursively selecting the best attribute to split the data (Step 7) and
ii) Expanding leaf nodes of tree (Steps 11 & 12) until stopping criterion is met (Step 1).
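The recursive skeleton described above can be sketched as follows; the best-attribute selection and stopping test are simplified stand-ins, not the textbook's exact pseudocode:

```python
from collections import Counter

def tree_growth(records, attrs):
    """records: list of (attribute_dict, label). Returns a nested dict tree."""
    labels = [y for _, y in records]
    # Step 1 stand-in: stop when all records share one class,
    # or no attributes remain to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    best = attrs[0]   # Step 7 stand-in: a real implementation picks the
                      # attribute giving the best impurity reduction
    tree = {}
    for v in {x[best] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[best] == v]
        # Steps 11-12: expand one child node per attribute value
        tree[v] = tree_growth(subset, [a for a in attrs if a != best])
    return {best: tree}

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rainy"}, "yes")]
print(tree_growth(data, ["outlook"]))
```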
• Boosting is an iterative procedure used to adaptively change the distribution of
training examples so that the base classifiers will focus on examples that are hard to
classify.
• Unlike bagging, boosting assigns a weight to each training example and may
adaptively change the weight at the end of each boosting round.
• The weights assigned to the training examples can be used in following ways:
1. They can be used as a sampling distribution to draw a set of bootstrap
samples from the original data.
2. They can be used by the base classifier to learn a model that is biased toward
higher-weight examples.
5c. Explain importance of evaluation criteria for classification methods. (05 Marks)
Ans:
• Predictive Accuracy: refers to the ability of the model to correctly predict the class-label of
new or previously unseen data.
• Speed: refers to the computation costs involved in generating and using the model.
Speed involves not just the time or computation cost of constructing a model (e.g. a
decision-tree), it also includes the time required to learn to use the model.
• Robustness: is the ability of the model to make correct predictions given noisy data or data
with missing values.
Most data obtained from a variety of sources has errors.
Therefore, the method should be able to deal with noise, outliers & missing
values gracefully.
• Scalability: refers to ability to construct the model efficiently given large amount of data.
Data-mining problems can be large and therefore the method should be able to deal
with large problems gracefully.
• Interpretability: refers to level of understanding & insight that is provided by the model.
An important task of a DM professional is to ensure that the results of data-mining are
understandable in the context of the problem being solved.
For example, in a decision-tree classification, it is desirable to find a decision-tree of
the right size and compactness with high accuracy.
6a. What are Bayesian classifiers? Explain Bayes' theorem. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.6a.
6b. How rule based classifiers are used for classification? Explain. (10 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.5b.
7a. Explain K-means clustering algorithm. What are its limitations? (10 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.7a.
7b. How density based methods are used for clustering? Explain. (10 Marks)
Ans:
DENSITY-BASED METHODS
• A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist
within a given radius.
• Data that is not within such high-density clusters is regarded as outliers or noise.
• For example: DBSCAN (Density Based Spatial Clustering of Applications with Noise).
DBSCAN
• It requires 2 input parameters:
1) Size of the neighborhood (R) &
2) Minimum points in the neighborhood (N).
• The point-parameter N
→ determines the density of acceptable-clusters &
→ determines which objects will be labeled outliers or noise.
• The size-parameter R determines the size of the clusters found.
• If R is big enough, there will be one big cluster and no outliers.
If R is small, there will be small dense clusters and there might be many outliers.
• We define a number of terms (Figure 7.2):
1. Neighborhood: The neighborhood of an object y is defined as all the objects
that are within the radius R from y.
2. Core-object: An object y is called a core-object if there are N objects within
its neighborhood.
Figure 7.2: DBSCAN
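The two definitions can be demonstrated with a small sketch for 1-D points (the sample points and parameter values are hypothetical):

```python
import math

def neighborhood(y, points, R):
    """All objects within radius R of y (including y itself)."""
    return [p for p in points if math.dist(p, y) <= R]

def is_core(y, points, R, N):
    """y is a core-object if at least N objects lie in its neighborhood."""
    return len(neighborhood(y, points, R)) >= N

points = [(0,), (1,), (2,), (10,)]
print(is_core((1,), points, R=1.5, N=3))   # True
print(is_core((10,), points, R=1.5, N=3))  # False - an outlier candidate
```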
8a. What is web content mining? Explain. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.8d.
Ans (iii):
TEXT CLUSTERING
• Once the features of an unstructured-text are identified, text-clustering can be done.
• Text-clustering can be done by using any clustering technique.
For ex: Ward's minimum variance method.
• Ward's method is an agglomerative hierarchical clustering technique.
• Ward's method tends to generate very compact clusters.
• The following measures of dissimilarity between feature vectors can be used:
i) Euclidean metric or
ii) Hamming distance
• The clustering method begins with 'n' clusters, one for each text.
• At any stage, 2 clusters are merged to generate a new cluster based on the following
criterion:
1) ODS is the unified-operational view of the company.
ODS provides the managers improved access to important operational-data.
This view assists in better understanding of i) business & ii) customer.
2) ODS is more effective in generating current-reports without accessing OLTP.
3) ODS can shorten time required to implement a data-warehouse system.
• Different types of ODS:
1) The ODS can be used as a reporting-tool for administrative purposes.
The ODS is usually updated daily.
2) The ODS can be used to track more complex-information such as product-code &
location-code.
The ODS is usually updated hourly.
3) The ODS can be used to support CRM (Customer Relationship Management).
1b. List the major steps involved in the ETL process. (06 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.1b.
1c. Based on oracle, what are difference b/w OLTP & DW systems. (08 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.1c.
3) Shared
• The system is
→ accessed by few business-analysts &
→ used by thousands of users.
• Being a shared system, the OLAP software must provide adequate security for
i) confidentiality & ii) integrity.
• Concurrency-control is required if users are updating data in the database.
4) Multidimensional
• This is the basic requirement.
• OLAP software must provide a multidimensional conceptual-view of the data.
• A dimension has hierarchies that show parent/child relationships between the
members of dimensions.
• The multidimensional structure must allow hierarchies of parent/child
relationships.
DATA WAREHOUSING & DATA MINING SOLVED PAPER JUNE-2015
5) Information
• The system should be able to handle a large amount of input-data.
• Two important critical factors:
i) The capacity of system to handle information &
ii) Integration of information with the data-warehouse.
• Because of multidimensional-view, data-cube operations like slice and dice can
be performed.
Accessibility (OLAP as a Mediator)
• The OLAP software should be sitting b/w
i) Data-sources &
ii) OLAP front-end.
Batch Extraction vs. Interpretive
• In large multidimensional databases, the system should provide
→ multidimensional-data staging plus
→ partial pre-calculation of aggregates.
Multi-user Support
• Being a shared system, the OLAP software should provide normal database
operations including retrieval, update, integrity and security.
Storing results of OLAP
• OLAP results-data should be kept separate from source-data.
• Read-write applications should not be implemented directly on live
transaction-data if source-systems are supplying information to the system
directly.
Extraction of Missing Values
• The system should distinguish missing-values from zero-values.
• If a distinction is not made, then the aggregates are computed incorrectly.
Uniform Reporting Performance
• Increasing the number of dimensions (or database-size) should not degrade
the reporting performance of the system.
2c. Describe the difference between ROLAP & MOLAP. (05 Marks)
Ans:
3a. What is data preprocessing? Explain various pre-processing tasks. (14 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.3b.
3b. Explain the following: (06 Marks)
i) Euclidean distance
ii) Simple matching coefficient
iii) Jaccard coefficient.
Ans (i):
EUCLIDEAN DISTANCE
• The Euclidean distance (D) between two points x and y is given by:
D(x, y) = √[(x1 − y1)² + (x2 − y2)² + . . . + (xn − yn)²]
Example: Calculate the Euclidean distance between:
x = (1, 6, 2, 5, 3) &
y = (3, 5, 2, 6, 6)
Solution:
Let (x1, x2, x3, x4, x5) = (1, 6, 2, 5, 3)
(y1, y2, y3, y4, y5) = (3, 5, 2, 6, 6)
YS
Euclidean Distance is calculated as follows:
D = √[(1 − 3)² + (6 − 5)² + (2 − 2)² + (5 − 6)² + (3 − 6)²] = √(4 + 1 + 0 + 1 + 9) = √15 = 3.873
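The worked example can be checked with a few lines of Python:

```python
import math

x = (1, 6, 2, 5, 3)
y = (3, 5, 2, 6, 6)
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print(round(d, 3))  # 3.873
```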
Ans (ii):
SIMPLE MATCHING COEFFICIENT SB
• SMC is used as a similarity coefficient.
• SMC is given by
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
where f00 = no. of attributes where x is 0 and y is 0
f01 = no. of attributes where x is 0 and y is 1
f10 = no. of attributes where x is 1 and y is 0
f11 = no. of attributes where x is 1 and y is 1
• This measure counts both presences and absences equally.
Ans (iii):
JACCARD COEFFICIENT
• The Jaccard coefficient is used to handle objects consisting of asymmetric binary attributes.
• The Jaccard coefficient (J) is given by
J = f11 / (f01 + f10 + f11)
Example: Calculate SMC and Jaccard Similarity Coefficients for the following two
binary vectors:
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) &
y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1)
Solution:
f11 = 0, f00 = 7, f01 = 2, f10 = 1
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
4c. Construct the FP tree for following data-set. Show the trees separately after
reading each transaction. (07 Marks)
CLASSIFICATION
• Classification is the task of learning a target-function ‘f’ that maps each attribute-set x to
one of the predefined class-labels y.
• The target-function is also known as a classification-model.
• A classification-model is useful for the following 2 purposes (Figure 4.3):
1) DESCRIPTIVE MODELING
• A classification-model can serve as an explanatory-tool to distinguish between objects
of different classes.
• For example, it is useful for biologists to have a descriptive model.
2) PREDICTIVE MODELING
• A classification-model can also be used to predict the class-label of unknown-records.
• A classification-model automatically assigns a class-label when presented with the
attribute-set of an unknown-record.
• Classification techniques are most suited for predicting or describing data-sets with
binary- or nominal-categories.
• They are less effective for ordinal categories because they do not consider the implicit
order among the categories.
Figure 4.3: General approach for building a classification-model
5b. Discuss the characteristics of decision-tree induction algorithms. (10 Marks)
Ans:
CHARACTERISTICS OF DT INDUCTION ALGORITHMS
1. Decision-tree induction is a non-parametric approach for building classification-
models.
2. Finding an optimal tree is an NP-complete problem.
Many DM algorithms employ a heuristic-based approach to guide their search in
the vast hypothesis space.
3. Techniques developed for constructing trees are computationally inexpensive i.e. it is
possible to quickly construct models even when the training-set size is very large.
Furthermore, once a tree has been built, classifying a test-record is extremely
fast, with a worst-case complexity of O(w)
where w = maximum depth of the tree.
4. Smaller-sized trees are relatively easy to interpret.
5. Trees provide an expressive representation for learning discrete-valued functions.
However, they do not generalize well to certain types of Boolean problems.
6. A subtree can be replicated multiple times in a tree (Figure 4.19). This makes the
tree more complex than necessary and perhaps more difficult to interpret.
Such replication can arise because the divide-and-conquer strategy solves each
subproblem (subtree) independently.
7. DT algorithms are quite robust to the presence of noise, especially when methods for
avoiding overfitting are employed.
8. The presence of redundant attributes does not affect the accuracy of trees.
An attribute is redundant if it is strongly correlated with another attribute in data.
9. At the leaf nodes, the number of records may be too small to make a statistically
significant decision about the class representation of the nodes. This is known as the
data fragmentation problem.
Solution: Disallow further splitting when the number of records falls below a
certain threshold.
10. The tree-growing procedure can be viewed as the process of partitioning the attribute-
space into disjoint regions until each region contains records of the same class.
The border between two neighboring regions of different classes is known as a
decision boundary.
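The O(w) classification cost from point 3 above can be sketched as a single root-to-leaf traversal; the two-level tree below is a hypothetical example in the style of the loan-default data-set:

```python
class DTNode:
    def __init__(self, attr=None, label=None):
        self.attr = attr       # attribute tested at an internal node
        self.label = label     # class-label at a leaf node
        self.branches = {}     # attribute value -> child DTNode

def classify(node, record):
    # follow exactly one root-to-leaf path; the loop runs at most
    # w times, where w is the maximum depth of the tree
    while node.label is None:
        node = node.branches[record[node.attr]]
    return node.label

# hypothetical tree: home_owner=yes -> no_default; otherwise test married
root = DTNode(attr="home_owner")
root.branches["yes"] = DTNode(label="no_default")
married = DTNode(attr="married")
married.branches["yes"] = DTNode(label="no_default")
married.branches["no"] = DTNode(label="default")
root.branches["no"] = married

print(classify(root, {"home_owner": "no", "married": "no"}))  # prints "default"
```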
5c. Explain sequential covering algorithm in rule-based classifier. (04 Marks)
Ans:
SEQUENTIAL COVERING ALGORITHM
• This is used to extract rules directly from data.
• This extracts the rules one class at a time for data-sets that contain more than 2 classes.
• The criterion for deciding which class should be generated first depends on:
i) Fraction of training-records that belong to a particular class or
ii) Cost of misclassifying records from a given class.
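A minimal sketch of sequential covering, assuming a hypothetical greedy Learn-One-Rule that picks the single best (attribute, value) condition; real rule learners grow conjunctive rules, but the loop structure is the same:

```python
def learn_one_rule(records, target):
    # greedy stand-in for Learn-One-Rule: choose the (attr, value)
    # condition whose covered records are purest for the target class
    best, best_acc = None, -1.0
    attrs = {k for r, _ in records for k in r}
    for a in attrs:
        for v in {r[a] for r, _ in records if a in r}:
            covered = [(r, y) for r, y in records if r.get(a) == v]
            if not covered:
                continue
            acc = sum(y == target for _, y in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best

def sequential_covering(records, classes):
    # extract rules one class at a time; after each rule is learned,
    # remove the records it covers and repeat
    rules, remaining = [], list(records)
    for target in classes:          # class ordering chosen by the caller
        while any(y == target for _, y in remaining):
            cond = learn_one_rule(remaining, target)
            if cond is None:
                break
            rules.append((cond, target))
            a, v = cond
            remaining = [(r, y) for r, y in remaining if r.get(a) != v]
    return rules

# hypothetical two-class data-set
records = [({"outlook": "sunny"}, "yes"),
           ({"outlook": "sunny"}, "yes"),
           ({"outlook": "rain"}, "no")]
print(sequential_covering(records, ["yes", "no"]))
```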
6a. List 5 criteria for evaluating classification methods. Explain briefly. (05 Marks)
Ans:
FIVE CRITERIA FOR EVALUATING CLASSIFICATION METHODS
1) Holdout method
2) Random Subsampling
3) Cross-Validation
4) Leave-one-out approach
5) Bootstrap
1) Holdout method
• The original data is divided into two disjoint sets: i) Training-set &
ii) Test-set.
• A classification-model is induced from the training-set.
• Performance of classification-model is evaluated on the test-set.
• The data is typically split either as
i) 50% for training and 50% for testing or
ii) 2/3 for training and 1/3 for testing.
• The accuracy of the classifier is estimated from the accuracy of the induced model on the test-set.
2) Random Subsampling
• The holdout method can be repeated several times to improve the estimation of a
classifier's performance.
• Limitation: It has no control over the number of times each record is used for testing
& training.
3) Cross-Validation
• In K-fold cross-validation, the available data is randomly divided into k-disjoint
subsets of approximately equal-size.
• One of the subsets is then used as the test-set.
Remaining (k – 1) sets are used for building the classifier.
• The test-set is used to estimate the accuracy.
• This is done repeatedly k times so that each subset is used as a test subset once.
4) Leave-one-out approach
• A special case of k-fold cross-validation method sets k = N, the size of the data-set.
• Each test-set contains only one record.
• Advantages:
1) Utilizes as much data as possible for training.
2) Test-sets are mutually exclusive & they effectively cover entire data-set.
• Two drawbacks:
1) Computationally expensive for large datasets.
2) Since each test-set contains only one record, the variance of the estimated
performance metric tends to be high.
5) Bootstrap
• In this method, the training-records are sampled with replacement i.e. a record
already chosen for training is put back, so that it is equally likely to be redrawn.
• A sample contains about 63.2% of the records in the original data.
• Records that are not included in the bootstrap sample become part of the test-set.
• The model induced from the training-set is applied to the test-set to obtain an
estimate of the accuracy of the bootstrap sample, εi.
• The sampling procedure is then repeated ‘b’ times to generate ‘b’ bootstrap samples.
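The k-fold cross-validation scheme described above can be sketched with a minimal partitioning helper (an illustrative utility, not from the text):

```python
import random

def k_fold_splits(n, k, seed=0):
    # randomly partition record indices 0..n-1 into k disjoint,
    # near-equal folds; each fold serves as the test-set exactly once
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))   # 8 training and 2 test records per fold
```

Setting k = n gives the leave-one-out special case, where every test-set holds a single record.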
6c. Consider the following training-set for predicting the loan default problem:
Find the conditional independence for the given training-set using Bayes theorem for
classification. (08 Marks)
Ans:
Solution:
• For each class yj, the class-conditional probability for a continuous attribute Xi is
estimated using a Gaussian distribution:
P(Xi = xi | yj) = (1/√(2πσij²)) · exp(−(xi − μij)² / (2σij²))
where μij = sample mean & σij² = sample variance of Xi for records of class yj.
• The sample mean and variance for the annual income attribute with respect to the class No are:
• Given a test-record with taxable income equal to $120K, we can compute its class-
conditional probability as follows:
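A sketch of this computation, assuming the sample mean 110 and variance 2975 for Annual Income given class No (the values used in the standard version of this example; the figure holding them did not survive here):

```python
import math

def gaussian_pdf(x, mean, var):
    # class-conditional density P(Xi = x | yj) for a continuous attribute,
    # matching the Gaussian formula above
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# assumed statistics: mean = 110, variance = 2975 for class No
p = gaussian_pdf(120, 110, 2975)
print(round(p, 4))   # ≈ 0.0072
```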
• Since there are three records that belong to the class Yes and seven records that belong to
the class No, P(Yes) = 0.3 and P(No) = 0.7.
• Using the information provided in Figure 5.10(b), the class-conditional probabilities can be
computed as follows:
7a. List & explain the desired features of cluster analysis. (08 Marks)
Ans: For answer, refer Solved Paper Dec-2013 Q.No.7c.
7b. Explain the K-means clustering algorithm with suitable example. (12 Marks)
Ans: For answer, refer Solved Paper June-2014 Q.No.7a.
Ans (a): For answer, refer Solved Paper Dec-2013 Q.No.8d.
Ans (b): For answer, refer Solved Paper Dec-2014 Q.No.8b.iii.
Ans (c):
UNSTRUCTURED TEXT
• Unstructured-documents are free texts, such as news stories.
• Following features can be extracted to convert an unstructured-document to a structured
form:
Word Occurrences
• The vector-representation takes single words found in the training-set as
features. (Vector-representation = bag of words).
• Two types: i) Boolean features (whether a word occurs or not) &
ii) Frequency features (number of occurrences of a word).
• This representation ignores the sequence in which the words occur.
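The bag-of-words representation can be sketched as:

```python
from collections import Counter

def bag_of_words(text):
    # word-occurrence vector: one count per distinct word;
    # word order is discarded, only occurrences are kept
    return Counter(text.lower().split())

v = bag_of_words("The cat sat on the mat")
print(v)   # e.g. "the" occurs twice
```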
Stemming
• Stemming refers to the process of reducing words to their morphological roots
or stems.
• A stem is part of a word that is left after removing its prefixes & suffixes.
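A toy illustration of suffix removal (a hypothetical stripper, far simpler than real stemmers such as Porter's algorithm):

```python
def naive_stem(word):
    # remove one of a few common suffixes to approximate the stem;
    # the length check avoids reducing short words to nonsense
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("informing"))  # prints "inform"
```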
Parts of Speech (POS)
• Each word can be tagged with its part of speech.
• Common tags: noun, verb, adjective and adverb.
• Thus, we can assign a number 1, 2, 3 or 4, depending on whether the word is a
noun, verb, adjective or adverb respectively.
Ans (d):
TEMPORAL DATA-MINING TASKS
Temporal Association
• The association-rule discovery can be extended to temporal-association.
• Here, we attempt to discover temporal-associations between non-temporal
itemsets.
• We can say that: "70% of the readers who buy a DBMS book also buy a Data-
mining book after a semester".
Temporal Classification
• We can extend the concept of decision-tree construction on temporal-
attributes.
• For example, a rule can be: "The first case of malaria is normally reported
after the first pre-monsoon rain and during the months of May-August".
Trend Analysis
• The analysis of one or more time series of continuous data may show similar
trends i.e. similar shapes across the time axis.
• For example, "The deployment of the Android OS is increasingly becoming
popular in the Smartphone industry".
• Here, we are trying to find the relationships of change in one or more static-
attributes with respect to changes in the temporal-attributes.
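Detecting such a trend (a rising or falling shape across the time axis) can be sketched with a least-squares slope over the series:

```python
def trend_slope(series):
    # least-squares slope of the series against time t = 0, 1, ..., n-1;
    # a positive slope indicates a rising trend, a negative one a decline
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

print(trend_slope([1, 2, 3, 4]))  # prints 1.0 (steadily rising)
```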
Sequence Analysis
• Events occurring at different points in time may be related by causal
relationships.
• For example, an earlier event may appear to cause a later one.
• To discover causal relationships, sequences of events must be analyzed to
discover common patterns.
• This category includes
→ discovery of frequent events and
→ problem of event-prediction.
• Frequent sequence mining finds the frequent subsequences;
while event-prediction predicts the occurrences of events which are rare.