
Class Concept Description: Characterization and Discrimination

Data entries can be associated with classes or concepts. For example, in the AllElectronics store, classes
of items for sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders.
It can be useful to describe individual classes and concepts in summarized, concise, and yet precise
terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions
can be derived using
1. Data Characterization/Generalization/Summarization
2. Data Discrimination
3. Both data characterization and discrimination.
1. Data Characterization/Generalization/Summarization
Data characterization is a summarization of the general characteristics or features of a target class of
data. The data corresponding to the user-specified class are typically collected by a database query.
For example, to study the characteristics of software products whose sales increased by 10% in the last
year, the data related to such products can be collected by executing an SQL query. There are several
methods for effective data summarization and characterization.
Several methods to achieve Data Characterization
I. Simple data summaries based on statistical measures and plots
II. The data cube-based OLAP roll-up operation to perform user-controlled data summarization
along a specified dimension.
III. An attribute-oriented induction technique (without step-by-step user interaction)
The output of data characterization can be presented in various forms. Examples include pie charts, bar
charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
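The roll-up operation listed as method II can be illustrated with a minimal sketch. The sales records and the city-to-country hierarchy below are made-up illustration data, not taken from the AllElectronics example, and `roll_up` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, product_type, units_sold).
sales = [
    ("Vancouver", "software", 120),
    ("Toronto", "software", 80),
    ("Seattle", "software", 200),
    ("Vancouver", "printer", 30),
]

# Hypothetical concept hierarchy for the location dimension: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}

def roll_up(facts, hierarchy):
    """Climb the location dimension one level and re-aggregate the measure."""
    totals = defaultdict(int)
    for city, product, units in facts:
        totals[(hierarchy[city], product)] += units
    return dict(totals)

summary = roll_up(sales, city_to_country)
```

Here `summary` holds one total per (country, product type) pair, which is the kind of summarized view that a crosstab or bar chart would then present.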
2. Data Discrimination
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
For example, the user may like to compare the general features of software products whose sales
increased by 10% in the last year with those whose sales decreased by at least 30% during the same
period.
Discrimination descriptions should include comparative measures that help distinguish between the target
and contrasting classes. Discrimination descriptions expressed in rule form are referred to as
discriminant rules.
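One simple comparative measure is the relative frequency of each generalized attribute value within each class. The sketch below assumes a single attribute (a price range) and invented class contents; it is only one possible way to contrast a target class with a contrasting class.

```python
from collections import Counter

# Hypothetical generalized tuples: one price-range label per software product.
target = ["low", "low", "mid", "low"]          # sales increased by 10%
contrasting = ["high", "mid", "high", "high"]  # sales decreased by at least 30%

def distribution(values):
    """Relative frequency of each attribute value within one class."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def compare(target_values, contrasting_values):
    """Pair up the two class distributions value by value."""
    t = distribution(target_values)
    c = distribution(contrasting_values)
    return {v: (t.get(v, 0.0), c.get(v, 0.0)) for v in sorted(set(t) | set(c))}

comparison = compare(target, contrasting)
```

Each entry of `comparison` is a (target share, contrasting share) pair; values whose two shares differ strongly are the ones that distinguish the classes.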
Attribute Oriented Induction for Data Characterization
Proposed in 1989 (KDD '89 workshop)
Not confined to categorical data nor particular measures.
How it is done?
o Collect the task-relevant data (initial relation) using a relational database query
o Perform generalization by attribute removal or attribute generalization.
o Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts.
o Interactive presentation with users.
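The generalization and aggregation steps above can be sketched as follows. The tiny initial relation and both concept hierarchies are hypothetical, and the choice to drop `name` while generalizing the other two attributes is hard-coded for brevity rather than decided by thresholds.

```python
from collections import Counter

# Hypothetical initial relation: (name, major, birth_place).
initial_relation = [
    ("Jim", "CS", "Vancouver"),
    ("Scott", "CS", "Montreal"),
    ("Laura", "Physics", "Seattle"),
    ("Anne", "Math", "Toronto"),
]

# Hypothetical concept hierarchies used for attribute generalization.
major_to_field = {"CS": "Science", "Physics": "Science", "Math": "Science"}
city_to_country = {"Vancouver": "Canada", "Montreal": "Canada",
                   "Toronto": "Canada", "Seattle": "USA"}

def induce(relation):
    """Drop `name`, generalize the other attributes, merge identical tuples."""
    generalized = (
        (major_to_field[major], city_to_country[city])  # name is removed
        for _name, major, city in relation
    )
    return Counter(generalized)  # generalized tuple -> accumulated count

generalized_relation = induce(initial_relation)
```

The four initial tuples collapse into two generalized tuples, each carrying the count of the original tuples it covers.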
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no
generalization operator on A, or (2) A's higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of
generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typical 2-8, specified/default.
Generalized relation threshold control: control the final relation/rule size
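The removal and generalization rules can be read as a small decision procedure. The sketch below assumes that "a large set of distinct values" means more distinct values than the attribute threshold; the function name `decide` and its parameters are hypothetical.

```python
def decide(distinct_values, threshold, has_generalization_operator,
           covered_by_other_attributes=False):
    """Return 'keep', 'remove', or 'generalize' for one attribute."""
    if len(set(distinct_values)) <= threshold:
        return "keep"        # few enough distinct values already
    if not has_generalization_operator or covered_by_other_attributes:
        return "remove"      # attribute-removal rule, cases (1) and (2)
    return "generalize"      # attribute-generalization rule

# Hypothetical attributes with an attribute threshold of 3.
name_decision = decide(["Jim", "Scott", "Laura", "Anne"], 3, False)
major_decision = decide(["CS", "Physics", "Math", "Biology"], 3, True)
gender_decision = decide(["M", "F"], 3, False)
```

Under these assumptions `name` is removed because it has no generalization operator, `major` is generalized, and `gender` is kept unchanged.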
Example:
DMQL: Describe general characteristics of graduate students in the Big-University database
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in ('Msc', 'MBA', 'PhD')
Class Characterization: An %&ample
Initial Relation
Prime Generalized Relation
Presentation of Generalized Results
Generalized relation:
o Relations where some or all attributes are generalized, with counts or other aggregation
values accumulated.
Cross tabulation:
o Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
o Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
o Mapping generalized result into characteristic rules with quantitative information
associated with it.
t-weight:
o Interesting measure that describes the typicality of
   - each disjunct in the rule
   - each tuple in the corresponding generalized relation
o t_weight = count(q_a) / (count(q_1) + ... + count(q_n)), where
   - n = number of tuples for target class in the generalized relation
   - q_1, ..., q_n = tuples for target class in generalized relation
   - q_a is in q_1, ..., q_n
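As a worked illustration of the t-weight, the sketch below assumes the usual definition, in which the t-weight of a generalized tuple q_a is count(q_a) divided by the sum of the counts of all generalized tuples q_1, ..., q_n of the target class. The relation contents are invented.

```python
def t_weights(generalized_relation):
    """Map each generalized tuple q_a to count(q_a) / sum_i count(q_i)."""
    total = sum(generalized_relation.values())
    return {q: count / total for q, count in generalized_relation.items()}

# Hypothetical generalized relation for the target class: tuple -> count.
relation = {
    ("Science", "Canada"): 16,
    ("Science", "Foreign"): 14,
    ("Engineering", "Canada"): 10,
}
weights = t_weights(relation)
```

The weights sum to 1 over the target class, so each one can be attached to the matching disjunct of a quantitative characteristic rule as a percentage.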
Presentation: Generalized Relation
Presentation: Crosstab
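A generalized relation with two attributes can be pivoted into a crosstab with row and column totals. The following is a minimal sketch on invented counts; `crosstab` is a hypothetical helper, and it assumes no attribute value is itself called "total".

```python
from collections import defaultdict

def crosstab(generalized_relation):
    """Pivot {(row_value, col_value): count} into nested dicts with totals."""
    table = defaultdict(lambda: defaultdict(int))
    for (row, col), count in generalized_relation.items():
        table[row][col] += count
        table[row]["total"] += count          # row total
        table["total"][col] += count          # column total
        table["total"]["total"] += count      # grand total
    return {row: dict(cols) for row, cols in table.items()}

# Hypothetical generalized relation: (gender, birth_region) -> count.
relation = {
    ("M", "Canada"): 16, ("M", "Foreign"): 14,
    ("F", "Canada"): 10, ("F", "Foreign"): 22,
}
table = crosstab(relation)
```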
Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
o Facilitate efficient drill-down analysis
o May increase the response time
o A balanced solution: precomputation of "subprime" relation
Use a predefined & precomputed data cube
o Construct a data cube beforehand
o Facilitate not only the attribute-oriented induction, but also attribute relevance analysis,
dicing, slicing, roll-up and drill-down
o Cost of cube computation and the nontrivial storage overhead
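The storage overhead of a precomputed cube is easy to see in a naive sketch that materializes a count for every subset of dimensions. This is a brute-force illustration on invented rows, not an efficient cube computation algorithm; `build_cube` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

def build_cube(rows, dimensions):
    """Precompute counts for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cube[dims] = Counter(tuple(row[d] for d in dims) for row in rows)
    return cube

# Hypothetical generalized student rows.
rows = [
    {"gender": "M", "major": "Science", "country": "Canada"},
    {"gender": "F", "major": "Science", "country": "Canada"},
    {"gender": "M", "major": "Engineering", "country": "USA"},
]
cube = build_cube(rows, ("gender", "major", "country"))
```

With three dimensions the cube already holds 2^3 = 8 cuboids. Roll-up and drill-down then become lookups in a coarser or finer cuboid, at the price of computing and storing all of them beforehand.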
Characterization vs. OLAP
Similarity:
o Presentation of data summarization at multiple levels of abstraction.
o Interactive drilling, pivoting, slicing and dicing.
Differences:
o Automated desired level allocation.
o Dimension relevance analysis and ranking when there are many relevant dimensions.
o Sophisticated typing on dimensions and measures.
o Analytical characterization: data dispersion analysis.
Analytical Characterization/Attribute Relevance Analysis
In reality, data contain many attributes, but not all of them are important, so we have to find the
important attributes for analysis.
The following decisions must be made:
Which dimensions should be included?
How high a level of generalization?
Automatic vs. interactive
Reduce the number of attributes; easy-to-understand patterns
There are various ways to achieve this, such as:
Statistical method for preprocessing data
o Filter out irrelevant or weakly relevant attributes
o Retain or rank the relevant attributes
Relevance related to dimensions and levels
Analytical characterization, analytical comparison
Procedure for Attribute Relevance Analysis
Data Collection
Analytical Generalization
o Use information gain analysis (e.g., entropy or other measures) to identify highly relevant
dimensions and levels.
Relevance Analysis
o Sort and select the most relevant dimensions and levels.
Attribute-oriented Induction for class description
o On selected dimension/level
OLAP operations (e.g. drilling, slicing) on relevance rules
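The analytical generalization and relevance analysis steps can be sketched with entropy-based information gain. The rows, attribute names, and class attribute below are invented; the ranking simply sorts attributes by their gain with respect to the class.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, class_attribute):
    """Entropy of the class minus its expected entropy after the split."""
    labels = [row[class_attribute] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_attribute])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def rank_attributes(rows, attributes, class_attribute):
    """Sort attributes from most to least relevant by information gain."""
    gains = {a: information_gain(rows, a, class_attribute) for a in attributes}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Hypothetical task-relevant data with `status` as the class attribute.
rows = [
    {"major": "Science", "gender": "M", "status": "graduate"},
    {"major": "Science", "gender": "F", "status": "graduate"},
    {"major": "Business", "gender": "M", "status": "undergraduate"},
    {"major": "Business", "gender": "F", "status": "undergraduate"},
]
ranking = rank_attributes(rows, ["gender", "major"], "status")
```

In this toy data `major` separates the classes perfectly while `gender` carries no information, so `major` is ranked first and `gender` would be a candidate for filtering out.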
Quantitative relevance measure determines the classifying power of an attribute within a set of data.
Methods
Information gain (ID3)
Gain ratio (C4.5)
Gini index
χ² contingency table statistics
Uncertainty coefficient
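Two of the listed measures, gain ratio and the Gini index, can be computed with a few lines of code. The sketch below uses the standard textbook definitions as an assumption (gain ratio as information gain divided by split information, Gini index as one minus the sum of squared class proportions) and invented (attribute value, class label) pairs.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def gini(labels):
    """Gini index of a label list: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split(pairs):
    """Group class labels by attribute value."""
    groups = {}
    for value, label in pairs:
        groups.setdefault(value, []).append(label)
    return groups

def gain_ratio(pairs):
    """Information gain divided by the split information of the attribute."""
    labels = [label for _value, label in pairs]
    groups = split(pairs)
    remainder = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
    split_info = entropy([value for value, _label in pairs])
    return (entropy(labels) - remainder) / split_info if split_info else 0.0

def gini_reduction(pairs):
    """Gini index of the class minus its weighted Gini index after the split."""
    labels = [label for _value, label in pairs]
    groups = split(pairs)
    weighted = sum(len(g) / len(pairs) * gini(g) for g in groups.values())
    return gini(labels) - weighted

# Hypothetical (major, status) pairs.
pairs = [("Science", "graduate"), ("Science", "graduate"),
         ("Business", "undergraduate"), ("Business", "undergraduate")]
ratio = gain_ratio(pairs)
reduction = gini_reduction(pairs)
```

Both measures reach their maximum for this data because the attribute splits the two classes perfectly; an attribute with a single value yields a gain ratio of 0 by the guard on `split_info`.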