Sie sind auf Seite 1von 4

Mining Class Comparisons

v Comparison: Comparing two or more classes.


v Method:
– Partition the set of relevant data into the target class and the contrasting
class(es)
– Generalize both classes to the same high level concepts
– Compare tuples with the same high level descriptions
– Present for every tuple its description and two measures:
u support - distribution within single class
u comparison - distribution between classes
– Highlight the tuples with strong discriminant features
v Relevance Analysis:
– Find attributes (features) which best distinguish different classes.
Example: Analytical comparison
v Task
– Compare graduate and undergraduate students using discriminant rule.
– DMQL query

use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student

v 1. Data collection
– target and contrasting classes

v 2. Attribute relevance analysis


– remove attributes name, gender, major, phone#

v 3. Synchronous generalization
– controlled by user-specified dimension thresholds
– prime target and contrasting class(es) relations/cuboids
v 4. Drill down, roll up and other OLAP operations on target and contrasting classes
to adjust levels of abstractions of resulting description

v 5. Presentation
– as generalized relations, crosstabs, bar charts, pie charts, or rules
– contrasting measures to reflect comparison between target and contrasting
classes
u e.g. count%
Example: Analytical
comparison (4)
Birth_country Age_range
Canada 20-25
Canada 25-30
Prime generalized relation for the target class: Graduate students

Birth_country
Canada Age_range
Over_30
…Canada 15-20

Canada
Other 15-20
Over_30
Prime generalized relation for the contrasting class: Undergraduate students

discriminant

Status …Birth_country Age_range … Count


Gpa
Graduate
Canada 25-30
Canada
Undergraduate Canada
25-30
25-30
Good 90
Good 210

… …
Other Over_30
Measuring the Central
Tendency
1 n
x= ∑ xi
n i =1 n

∑w x i i
x= i =1
n

∑w
Mean
i
i =1

Weighted arithmetic mean


Median: A holistic measure
Middle value if odd number of values, or average of
n / 2 − (∑ f )l
the middle two values otherwise
median = L + (
f
1 )c
median

estimated by interpolation
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:

mean − mode = 3 × (mean − median)


Measuring the Dispersion of
Data

Quartiles, outliers and boxplots


Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is
marked, whiskers, and plot outlier individually
Outlier: usually, a value higher/lower than 1.5 x IQR
Variance and standard deviation
Variance s2: (algebraic, scalable computation)
1 n 1 n 2 1 n
s2 = ∑ ( xi − x ) 2 = [∑ xi − (∑ xi ) 2 ]
Standard deviation s is the square root of variance s2
n − 1 i =1 n − 1 i =1 n i =1

Das könnte Ihnen auch gefallen