
Data mining and data warehousing

1- Discuss the typical OLAP operations with an example.


Ans.- OLAP operations-The analyst can understand the meaning contained in the databases
using multi-dimensional analysis. By aligning the data content with the analyst's mental model,
the chances of confusion and erroneous interpretations are reduced. The analyst can navigate
through the database and screen for a particular subset of the data, changing the data's
orientations and defining analytical calculations.[6] The user-initiated process of navigating by
calling for page displays interactively, through the specification of slices (via rotations) and drill
down/up, is sometimes called "slice and dice". Common operations include slice and dice, drill
down, roll up, and pivot.

Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or
more members of the dimensions not in the subset.[6] The picture shows a slicing operation: The
sales figures of all sales regions and all product categories of the company in the year 2004 are
"sliced" out of the data cube.

Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two
consecutive slices).[7] The picture shows a dicing operation: The new cube shows the sales
figures of a limited number of product categories, while the time and region dimensions cover the
same range as before.

Drill Down/Up: Drilling down or up is a specific analytical technique whereby the user navigates
among levels of data ranging from the most summarized (up) to the most detailed (down).[6] The
picture shows a drilling operation: There is a better understanding of the sales figures of the
product category "Outdoor-Schutzausrüstung", since you can now see the sales figures for the
individual products of this category.

Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions.
To do this, a computational relationship or formula might be defined.[6]

Pivot: This operation is also called the rotate operation. It rotates the data in order to provide an
alternative presentation of the data - the report or page display takes a different dimensional
orientation.[6] The picture shows a pivoting operation: The whole cube is rotated, giving another
perspective on the data.
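
As a rough sketch of how these operations can look in practice, the following Python fragment applies slice, dice, roll-up, drill-down and pivot to a small invented sales table using the pandas library; all column names and figures here are illustrative and are not taken from the pictures referred to above.

    import pandas as pd

    # Illustrative sales data (regions, categories, years and figures are invented).
    sales = pd.DataFrame({
        "region":   ["North", "North", "South", "South", "North", "South"],
        "category": ["Outdoor", "Indoor", "Outdoor", "Indoor", "Outdoor", "Indoor"],
        "year":     [2003, 2003, 2003, 2004, 2004, 2004],
        "sales":    [100, 80, 120, 90, 110, 95],
    })

    # Slice: fix one dimension to a single value (here the year 2004).
    slice_2004 = sales[sales["year"] == 2004]

    # Dice: restrict several dimensions to subsets of their values.
    dice = sales[sales["year"].isin([2003, 2004]) & (sales["category"] == "Outdoor")]

    # Roll-up: aggregate away a dimension (total sales per region and year).
    rollup = sales.groupby(["region", "year"])["sales"].sum()

    # Drill-down is the inverse: return to the finer grouping that includes category.
    drilldown = sales.groupby(["region", "year", "category"])["sales"].sum()

    # Pivot (rotate): present regions as rows and years as columns.
    pivot = sales.pivot_table(index="region", columns="year", values="sales", aggfunc="sum")
    print(pivot)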

2- Discuss how computations can be performed efficiently on data cubes.

Ans.- Users of decision support systems often see data in the form of data cubes. The cube is
used to represent data along some measure of interest. Although called a "cube", it can be 2-
dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in
the database and the cells in the data cube represent the measure of interest. For example, they
could contain a count for the number of times that attribute combination occurs in the database,
or the minimum, maximum, sum or average value of some attribute. Queries are performed on
the cube to retrieve decision support information. Example: We have a database that contains
transaction information relating company sales of a part to a customer at a store location. The
data cube formed from this database is a 3-dimensional representation, with each cell (p,c,s) of
the cube representing a combination of values from part, customer and store-location. A sample
data cube for this combination is shown in Figure 1. The content of each cell is the count of the
number of times that specific combination of values occurs together in the database. Cells that
appear blank in fact have a value of zero. The cube can then be used to retrieve information
within the database about, for example, which store should be given a certain part to sell in order
to make the greatest sales.
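
A minimal sketch of such a count cube, assuming invented part, customer and store-location values rather than the actual data of Figure 1:

    from collections import Counter

    # Invented transaction records: (part, customer, store-location).
    transactions = [
        ("p1", "c1", "Vancouver"),
        ("p1", "c1", "Vancouver"),
        ("p2", "c1", "Toronto"),
        ("p1", "c2", "Toronto"),
        ("p2", "c2", "Toronto"),
    ]

    # Each cell (p, c, s) holds the number of times that combination occurs in the
    # database; combinations that never occur are the "blank" (zero) cells.
    cube = Counter(transactions)
    print(cube[("p1", "c1", "Vancouver")])  # 2
    print(cube[("p3", "c9", "Halifax")])    # 0 (blank cell)

    # A decision-support style query: in which store does part "p1" sell most often?
    per_store = Counter()
    for (part, customer, store), count in cube.items():
        if part == "p1":
            per_store[store] += count
    print(per_store.most_common(1))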

Rollup or summarization of the data cube can be done by traversing upwards through a concept
hierarchy. A concept hierarchy maps a set of low level concepts to higher level, more general
concepts. It can be used to summarize information in the data cube. As the values are combined,
cardinalities shrink and the cube gets smaller. Generalizing can be thought of as computing some
of the summary total cells that contain ANYs, and storing those in favour of the original cells. To
reduce the size of the data cube, we can summarize the data by computing the cube at a higher
level in the concept hierarchy. A non-summarized cube would be computed at the lowest level,
for example, the province level in Figure 2(a). If we compute the cube at the second level, there
are only six categories, B.C., Prairies, Ont., Que., Maritimes and Nfld., and the data cube will be
much smaller. Figure 3 shows a sample generalization of the Province attribute for those
provinces that can be grouped under the concept Prairies and those that can be grouped under the
concept Maritimes. For example, for Sask., the province, or location name, changes to Prairies,
but the other attribute values remain unchanged because they are not summarized at this point.
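
A small sketch of this roll-up along a concept hierarchy, assuming an invented subset of the province-to-region mapping described above and invented cell counts:

    from collections import Counter

    # Concept hierarchy: low-level provinces map to the higher-level concepts
    # Prairies and Maritimes (only part of the hierarchy is sketched here).
    hierarchy = {
        "Sask.": "Prairies", "Alta.": "Prairies", "Man.": "Prairies",
        "N.B.": "Maritimes", "N.S.": "Maritimes", "P.E.I.": "Maritimes",
    }

    # Low-level cells of the cube: (part, customer, province) -> count.
    cells = {
        ("p1", "c1", "Sask."): 3,
        ("p1", "c1", "Alta."): 2,
        ("p1", "c2", "N.S."): 4,
    }

    # Roll up: replace each province by its region and merge the counts,
    # so cardinalities shrink and the cube gets smaller.
    rolled_up = Counter()
    for (part, customer, province), count in cells.items():
        region = hierarchy.get(province, province)  # unchanged if not in the hierarchy
        rolled_up[(part, customer, region)] += count

    print(rolled_up[("p1", "c1", "Prairies")])  # 5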

3- Write short notes on data warehouse meta data.


Ans.- The first image most people have of the data warehouse is a large collection of
historical, integrated data. While that image is correct in many regards, there is another very
important element of the data warehouse that is vital - metadata. Metadata is data about data.
Metadata has been around as long as there have been programs and data that the programs
operate on. Figure 1 shows metadata in a simple form. While metadata is not new, the role of
metadata and its importance in the face of the data warehouse certainly is new. For years the
information technology professional has worked in the same environment as metadata, but in
many ways has paid little attention to metadata. The information professional has spent a life
dedicated to process and functional analysis, user requirements, maintenance, architectures,
and the like. The role of metadata has been passive at best in this milieu. But metadata plays
a very different role in the data warehouse. Relegating metadata to a backwater, passive role in
the data warehouse environment is to defeat the purpose of the data warehouse. Metadata plays a
very active and important part in the data warehouse environment. The reason why metadata
plays such an important and active role in the data warehouse environment is apparent when
contrasting the operational environment to the data warehouse environment insofar as the
user community is concerned. The information technology professional is the primary
community involved in the usage of operational development and maintenance facilities. It is
expected that the information technology community is computer literate and able to find
its way around systems. The community served by the data warehouse is a very different
community. The data warehouse serves the DSS analysis community. It is anticipated that the
DSS analysis community is not computer literate. Instead the expectation is that the DSS
analysis community is a businessperson community first, and a technology community
second. Simply from the standpoint of who needs help the most in terms of finding one's way
around data and systems, it is assumed that the DSS analysis community requires a much more
formal and intensive level of support than the information technology community. For this
reason alone, the formal establishment and ongoing support of metadata become
important in the data warehouse environment.
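
Purely as an illustration of "data about data" (the field names below are assumptions, not a standard), one entry of warehouse metadata might be represented as follows:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class TableMetadata:
        """Describes a warehouse table for the DSS analyst: where it came from and how it was built."""
        table_name: str
        source_system: str    # operational system the data was extracted from
        transformation: str   # how the source fields were mapped and cleansed
        refresh_date: date    # when the table was last loaded
        owner: str

    entry = TableMetadata(
        table_name="monthly_sales",
        source_system="order_entry_oltp",
        transformation="sum of order_amount grouped by month and region",
        refresh_date=date(2004, 1, 31),
        owner="finance",
    )
    print(entry)
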
4- Explain various methods of data cleaning in detail.

Ans.- Data cleansing, data cleaning, or data scrubbing is the process of detecting and
correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used
mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant,
etc. parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing,
a data set will be consistent with other similar data sets in the system. The inconsistencies
detected or removed may have been originally caused by user entry errors, by corruption in
transmission or storage, or by different data dictionary definitions of similar entities in different
stores. Data cleansing differs from data validation in that validation almost invariably means data
is rejected from the system at entry and is performed at entry time, rather than on batches of data.

Popular methods used


Parsing: Parsing in data cleansing is performed for the detection of syntax errors. A
parser decides whether a string of data is acceptable within the allowed data specification.
This is similar to the way a parser works with grammars and languages.
Data transformation: Data transformation allows the mapping of the data from its given
format into the format expected by the appropriate application. This includes value
conversions or translation functions, as well as normalizing numeric values to conform to
minimum and maximum values.
Duplicate elimination: Duplicate detection requires an algorithm for determining
whether data contains duplicate representations of the same entity. Usually, data is sorted by
a key that would bring duplicate entries closer together for faster identification.
Statistical methods: By analyzing the data using the values of mean, standard
deviation, range, or clustering algorithms, it is possible for an expert to find values that are
unexpected and thus erroneous. Although the correction of such data is difficult since the true
value is not known, it can be resolved by setting the values to an average or other statistical
value. Statistical methods can also be used to handle missing values, which can be replaced
by one or more plausible values, usually obtained by extensive data augmentation
algorithms.
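
A minimal sketch of some of these methods in plain Python; the records, validation rule and outlier threshold are invented for illustration.

    import re
    import statistics

    records = [
        {"id": 1, "email": "a@example.com", "amount": 100.0},
        {"id": 2, "email": "not-an-email",  "amount": 110.0},   # fails parsing
        {"id": 3, "email": "b@example.com", "amount": 95.0},
        {"id": 3, "email": "b@example.com", "amount": 95.0},    # duplicate of id 3
        {"id": 4, "email": "c@example.com", "amount": 105.0},
        {"id": 5, "email": "d@example.com", "amount": 98.0},
        {"id": 6, "email": "e@example.com", "amount": 102.0},
        {"id": 7, "email": "f@example.com", "amount": 97.0},
        {"id": 8, "email": "g@example.com", "amount": 103.0},
        {"id": 9, "email": "h@example.com", "amount": 9000.0},  # suspicious value
    ]

    # Parsing: keep only records whose email matches the allowed specification.
    email_spec = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    parsed = [r for r in records if email_spec.match(r["email"])]

    # Duplicate elimination: sort by a key so duplicates are adjacent, keep the first.
    seen, deduped = set(), []
    for r in sorted(parsed, key=lambda r: r["id"]):
        if r["id"] not in seen:
            seen.add(r["id"])
            deduped.append(r)

    # Statistical method: values more than two standard deviations from the mean
    # are treated as erroneous and replaced by the mean (or flagged for review).
    amounts = [r["amount"] for r in deduped]
    mean, stdev = statistics.mean(amounts), statistics.stdev(amounts)
    for r in deduped:
        if stdev and abs(r["amount"] - mean) > 2 * stdev:
            r["amount"] = mean

    print(deduped)
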
5- Give an account on data mining Query language.
Ans.- Since the first definition of the Knowledge Discovery in Databases (KDD) domain
in (Piatetsky-Shapiro and Frawley, 1991), many techniques have been proposed to support
these "From Data to Knowledge" complex interactive and iterative processes. In practice,
knowledge elicitation is based on some extracted and materialized (collections of) patterns,
which can be global (e.g., decision trees) or local (e.g., itemsets, association rules). Real-life
KDD processes imply complex pre-processing manipulations (e.g., to clean the data), several
extraction steps with different parameters and types of patterns (e.g., feature construction by
means of constrained itemsets followed by a classifying phase, association rule mining for
different threshold values and different objective measures of interestingness), and
post-processing manipulations (e.g., elimination of redundancy in extracted patterns,
crossing-over operations between patterns and data, like the search for transactions which are
exceptions to frequent and valid association rules, or the selection of misclassified examples
with a decision tree). Looking for a tighter integration between data and the patterns which
hold in the data, Imielinski and Mannila have proposed in (Imielinski and Mannila, 1996) the
concept of the inductive database (IDB). In an IDB, ordinary queries can be used to access and
manipulate data, while inductive queries can be used to generate (mine), manipulate, and apply
patterns. KDD becomes an extended querying process in which the analyst can control the
whole process, since he/she specifies the data and/or patterns of interest. Therefore, the quest
for query languages for IDBs is an interesting goal. It is actually a long-term goal, since we
still do not know which are the relevant primitives for Data Mining. In some sense, we still
lack a well-accepted set of primitives.
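
As a rough sketch of the inductive database idea (not of any proposed query language), frequent itemsets can first be materialized from transactions and then queried like ordinary data; the transactions and thresholds below are invented.

    from collections import Counter
    from itertools import combinations

    # Invented transaction data: each transaction is a set of items.
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk", "butter"},
    ]

    # Extraction step: materialize all itemsets of size 1 or 2 with their support.
    counts = Counter()
    for t in transactions:
        for size in (1, 2):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    patterns = {itemset: n / len(transactions) for itemset, n in counts.items()}

    # "Inductive query": manipulate the materialized patterns like ordinary data,
    # e.g. ask for the itemsets containing "bread" with support of at least 0.5.
    answer = {s: sup for s, sup in patterns.items() if "bread" in s and sup >= 0.5}
    print(answer)
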
6- How is Attribute-Oriented Induction implemented? Explain in detail.
7- Write and explain the algorithm for mining frequent item sets without candidate
generation. Give relevant example.
8- Discuss the approaches for mining multi level association rules from the transactional
databases. Give relevant example.
