
DATA MINING / IT0467

UNIT I

An Introduction on Data Mining and Preprocessing

December 26, 2012    Data Mining: Concepts and
Chapter 1. Introduction

Motivation: Why data mining?
What is data mining?
Data mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Top-10 most popular data mining algorithms
Major issues in data mining
Overview of the course

Why Data Mining?

The explosive growth of data: from terabytes to petabytes
  Data collection and data availability
    Automated data collection tools, database systems, the Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks
    Science: remote sensing, bioinformatics, scientific simulation
    Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

What Is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"?
  Simple search and query processing
  (Deductive) expert systems

Knowledge Discovery (KDD) Process

Data mining: the core of the knowledge discovery process

[Figure: the KDD process, from bottom to top: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top]
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Why Not Traditional Data Analysis?

Tremendous amount of data
  Algorithms must be highly scalable to handle, e.g., terabytes of data
High dimensionality of data
  A microarray may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks, and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text, and Web data
  Software programs, scientific simulations
New and sophisticated applications

Multi-Dimensional View of Data Mining

Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes

General functionality
  Descriptive data mining
  Predictive data mining
Different views lead to different classifications
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications
  Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
  Data streams and sensor data
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structured data, graphs, social networks, and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia databases
  Text databases
  The World Wide Web

Data Mining Functionalities

Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
  Diaper -> Beer [0.5%, 75%] (correlation or causality?)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
    E.g., classify countries based on climate, or classify cars based on gas mileage
  Predict some unknown or missing numerical values
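
The two numbers attached to the Diaper/Beer rule are conventionally read as [support, confidence]. A minimal sketch of how both could be computed, assuming transactions are Python sets; the function name and the toy transactions are illustrative and not taken from the slides:

```python
def rule_support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n_both = sum(1 for t in transactions if antecedent | consequent <= t)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    support = n_both / len(transactions)   # fraction of all transactions containing both sides
    confidence = n_both / n_ante           # fraction of antecedent transactions that also contain the consequent
    return support, confidence

# Hypothetical market-basket data, for illustration only.
baskets = [
    {"diaper", "beer"}, {"diaper", "beer", "milk"}, {"diaper", "beer"},
    {"diaper", "milk"}, {"milk"}, {"bread"}, {"bread", "milk"}, {"beer"},
]
support, confidence = rule_support_confidence(baskets, {"diaper"}, {"beer"})
```

A high confidence only says the items co-occur; it does not settle the slide's question of correlation versus causality.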

Data Mining Functionalities (2)

Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection and rare-event analysis
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera -> large SD memory
  Periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses

Major Issues in Data Mining

Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed, and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
  Domain-specific data mining and invisible data mining
  Protection of data security, integrity, and privacy

Why Data Mining Query Language?

Automated vs. query-driven?
  Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many but uninteresting
Data mining should be an interactive process
  The user directs what is to be mined
Users must be provided with a set of primitives to be used to communicate with the data mining system
Incorporating these primitives in a data mining query language gives
  More flexible user interaction
  A foundation for the design of graphical user interfaces
  Standardization of the data mining industry and practice

Primitives that Define a Data Mining Task

Task-relevant data
  Database or data warehouse name
  Database tables or data warehouse cubes
  Condition for data selection
  Relevant attributes or dimensions
  Data grouping criteria
Type of knowledge to be mined
  Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns

DMQL: A Data Mining Query Language

Motivation
  A DMQL can provide the ability to support ad-hoc and interactive data mining
  By providing a standardized language like SQL
    Hope to achieve an effect similar to the one SQL has had on relational databases
    Foundation for system development and evolution
    Facilitate information exchange, technology transfer, commercialization, and wide acceptance
Design
  DMQL is designed with the primitives described earlier

An Example Query in DMQL

Integration of Data Mining and Data Warehousing

Coupling of data mining systems, DBMS, and data warehouse systems
  No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining of data
  Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
  Characterized classification, first clustering and then association

Coupling Data Mining with DB/DW Systems

No coupling: flat file processing, not recommended
Loose coupling
  Fetching data from DB/DW
Semi-tight coupling: enhanced DM performance
  Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
  DM is smoothly integrated into a DB/DW system; the mining query is optimized based on the mining query, indexing, query processing methods, etc.

Architecture: Typical Data Mining System

[Figure: layered architecture, from top to bottom]
  Graphical User Interface
  Pattern Evaluation
  Data Mining Engine
  Database or Data Warehouse Server
  (data cleaning, integration, and selection)
  Database, Data Warehouse, World-Wide Web, Other Info Repositories
A Knowledge Base is drawn beside the Pattern Evaluation and Data Mining Engine layers.

Chapter: Data Preprocessing

Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?

Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=" "
  noisy: containing errors or outliers
    e.g., Salary="-10"
  inconsistent: containing discrepancies in codes or names
    e.g., Age="42", Birthday="03/07/1997"
    e.g., was rating "1, 2, 3", now rating "A, B, C"
    e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
  "Not applicable" data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modifying some linked data)
Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results!
  Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics
  A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Value added
  Interpretability
  Accessibility
Broad categories:
  Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing

Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
  Integration of multiple databases, data cubes, or files
Data transformation
  Normalization and aggregation
Data reduction
  Obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization
  Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing


Mining Data Descriptive Characteristics

Motivation
  To better understand the data: central tendency, variation, and spread
Data dispersion characteristics
  Median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
  Data dispersion: analyzed with multiple granularities of precision
  Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
  Folding measures into numerical dimensions
  Boxplot or quantile analysis on the transformed cube

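Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):
  \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(sample)}, \qquad \mu = \frac{\sum x}{N} \quad \text{(population)} \]
  Weighted arithmetic mean:
  \[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
  Trimmed mean: chopping extreme values
Median: a holistic measure
  Middle value if there is an odd number of values, or the average of the middle two values otherwise
  Estimated by interpolation (for grouped data):
  \[ \text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c \]
  where \( L_1 \) is the lower boundary of the median interval, \( n \) is the number of values, \( (\sum f)_l \) is the sum of the frequencies of all intervals below the median interval, \( f_{\text{median}} \) is the frequency of the median interval, and \( c \) is its width
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula:
  \[ \text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median}) \]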
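
The measures above can be tried out with a short sketch. It assumes plain Python lists and the standard `statistics` module; the helper names and the sample values are illustrative and not part of the slides:

```python
from collections import Counter
from statistics import mean, median

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])

def modes(values):
    """All most frequent values (one for unimodal data, two for bimodal, ...)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median: L1 + ((n/2 - freq_below) / freq_median) * width."""
    return lower + (n / 2 - freq_below) / freq_median * width

# Hypothetical salaries in thousands of dollars, for illustration only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = (mean(salaries), median(salaries), modes(salaries), trimmed_mean(salaries, 0.1))
```

On this sample the single large value pulls the mean above the median, which is the situation the trimmed mean is meant to dampen.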

Symmetric vs. Skewed Data

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]

Measuring the Dispersion of Data

Quartiles, outliers, and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Interquartile range: IQR = Q3 - Q1
  Five-number summary: min, Q1, M, Q3, max
  Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
  Variance (algebraic, scalable computation):
  \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]
  \[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]
  Standard deviation s (or σ) is the square root of the variance s² (or σ²)
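
A minimal sketch of these dispersion measures. It assumes Python 3.8+ for `statistics.quantiles` and picks the "inclusive" quartile method, which is only one of several conventions; the function names and data are illustrative:

```python
from statistics import quantiles, variance

def five_number_summary(values):
    """min, Q1, median, Q3, max (quartiles by the 'inclusive' method)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def sample_variance_one_pass(values):
    """s^2 from running sums only: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = total = total_sq = 0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

# Hypothetical prices; 70 is added as an obvious outlier.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 70]
summary = five_number_summary(prices)
flagged = iqr_outliers(prices)
```

The one-pass form is what makes variance "scalable": it needs only three running totals, so partial sums from different partitions can be merged. With floating-point data it can lose precision when the mean is large relative to the spread.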


Data Cleaning

Importance
  "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  "Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration

Missing Data

Data is not always available
  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
  equipment malfunction
  data that was inconsistent with other recorded data and thus deleted
  data not entered due to misunderstanding
  certain data not being considered important at the time of entry
  history or changes of the data not being registered
Missing data may need to be inferred

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious and often infeasible
Fill it in automatically with
  a global constant: e.g., "unknown" (in effect a new class?!)
  the attribute mean
  the attribute mean for all samples belonging to the same class: smarter
  the most probable value: inference-based, such as a Bayesian formula or a decision tree
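
Three of the automatic fill-in strategies listed above can be sketched as follows. The sketch assumes rows are dictionaries with `None` marking a missing value; the function names, attribute names, and records are hypothetical:

```python
def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values with the mean of `attr` within the same class.

    Assumes every class has at least one known value for `attr`.
    """
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[label], []).append(r[attr])
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]

# Hypothetical customer records (income in thousands).
customers = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]
by_class = fill_with_class_mean(customers, "income", "risk")
```

The class-conditional version fills the two gaps with different values (one per class), which is why the slide calls it the smarter choice.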

Noisy Data

Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitations
  inconsistency in naming conventions
Other data problems that require data cleaning
  duplicate records
  incomplete data
  inconsistent data

How to Handle Noisy Data?

Binning
  First sort the data and partition it into (equal-frequency) bins
  Then smooth by bin means, by bin medians, by bin boundaries, etc.
Regression
  Smooth by fitting the data to regression functions
Clustering
  Detect and remove outliers
Combined computer and human inspection
  Detect suspicious values and have a human check them (e.g., to deal with possible outliers)

Simple Discretization Methods: Binning

Equal-width (distance) partitioning
  Divides the range into N intervals of equal size: uniform grid
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  The most straightforward, but outliers may dominate the presentation
  Skewed data is not handled well
Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky
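
The two partitioning schemes can be compared with a small sketch. It assumes numeric values with max > min; the function names and the price list are illustrative:

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value falls into the last bin rather than a bin of its own.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)   # intervals of width 10 starting at 4
by_depth = equal_depth_bins(prices, 3)   # four values per bin
```

With equal width, the bin sizes follow the data density (3, 3, and 6 values here); with equal depth, the interval widths vary instead.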

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
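
The smoothing steps of this example can be reproduced with a short sketch. The function names are illustrative; means are rounded to whole dollars to match the figures shown, and a value equally close to both boundaries is assigned to the lower one, which is an assumption the slide does not specify:

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# The three equal-frequency bins of the sorted price data.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
```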

Regression

[Figure: scatter plot with the fitted regression line y = x + 1; axes x and y, with a sample point (X1, Y1) marked]

Cluster Analysis

Data Cleaning as a Process

Data discrepancy detection
  Use metadata (e.g., domain, range, dependency, distribution)
  Check field overloading
  Check uniqueness rule, consecutive rule, and null rule
  Use commercial tools
    Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
  Data migration tools: allow transformations to be specified
  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
  Iterative and interactive (e.g., Potter's Wheel)


Data Integration

Data integration:
  Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
  Integrate metadata from different sources
Entity identification problem:
  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources are different
  Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases
  Object identification: the same attribute or object may have different names in different databases
  Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

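Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson's product-moment coefficient):

  \[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B} \]

  where \( n \) is the number of tuples, \( \bar{A} \) and \( \bar{B} \) are the respective means of A and B, \( \sigma_A \) and \( \sigma_B \) are the respective standard deviations of A and B, and \( \sum (AB) \) is the sum of the AB cross-product.
If \( r_{A,B} > 0 \), A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
\( r_{A,B} = 0 \): uncorrelated (no linear relationship; this alone does not establish independence); \( r_{A,B} < 0 \): negatively correlated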
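
A minimal sketch of the coefficient, written to follow the second form of the formula. It assumes two equal-length numeric lists and uses sample standard deviations (n - 1 in the denominator) so that the factors match; the function name and data are illustrative:

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson's product-moment coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations, matching the (n - 1) factor in the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))   # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Hypothetical attribute values.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
r = pearson_r(hours, score)
```

A coefficient of zero only rules out a linear relationship: for `a = [-2, -1, 0, 1, 2]` and `b = [4, 1, 0, 1, 4]` the function returns 0 even though b is fully determined by a.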

Correlation Analysis (Categorical Data)

χ² (chi-square) test:

  \[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
  The number of hospitals and the number of car thefts in a city are correlated
  Both are causally linked to a third variable: population
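
The statistic can be computed from a contingency table as sketched below. Expected counts are derived from the row and column totals under the assumption that the two attributes are unrelated; the function name and the counts are hypothetical:

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows are one attribute's values, columns the other's.
table = [[250, 200],
         [50, 1000]]
stat = chi_square(table)
```

Turning the statistic into a decision additionally needs the degrees of freedom, (rows - 1) x (columns - 1), and a significance level, which are not computed here.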

Data Transformation

Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range
  min-max normalization
  z-score normalization
  normalization by decimal scaling
Attribute/feature construction
  New attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization: to [new_min_A, new_max_A]

  \[ v' = \frac{v - min_A}{max_A - min_A} \, (new\_max_A - new\_min_A) + new\_min_A \]

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

  \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} \, (1.0 - 0) + 0 = 0.716 \]

Z-score normalization (μ: mean, σ: standard deviation):

  \[ v' = \frac{v - \mu_A}{\sigma_A} \]

  Ex. Let μ = 54,000 and σ = 16,000. Then

  \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

Normalization by decimal scaling:

  \[ v' = \frac{v}{10^j} \]

  where j is the smallest integer such that max(|v'|) < 1
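
The three normalizations can be checked against the worked figures (73,600 in an income range of 12,000 to 98,000, with mean 54,000 and standard deviation 16,000) using a short sketch; the function names are illustrative:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled_income = min_max(73_600, 12_000, 98_000)
standardized_income = z_score(73_600, 54_000, 16_000)
scaled, j = decimal_scaling([-986, 917])   # hypothetical values
```

Min-max needs the true minimum and maximum and is sensitive to outliers; the z-score is the usual alternative when those bounds are unknown.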

Data Reduction Strategies

Why data reduction?
  A database/data warehouse may store terabytes of data
  Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
  Data cube aggregation
  Dimensionality reduction, e.g., remove unimportant attributes
  Data compression
  Numerosity reduction, e.g., fit data into models
  Discretization and concept hierarchy generation

Data Cube Aggregation

The lowest level of a data cube (base cuboid)
  The aggregated data for an individual entity of interest
  E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
  Further reduce the size of the data to deal with
Reference appropriate levels
  Use the smallest representation that is enough to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible

Attribute Subset Selection

Feature selection (i.e., attribute subset selection):
  Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods (due to the exponential number of choices):
  Stepwise forward selection
  Stepwise backward elimination
  Combining forward selection and backward elimination
  Decision-tree induction

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with the test A4? at the root; its two branches lead to the tests A1? and A6?, each of which ends in the leaves Class 1 and Class 2]

Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods

There are 2^d possible feature subsets of d features
Several heuristic feature selection methods:
  Best single features under the feature independence assumption: choose by significance tests
  Best step-wise feature selection:
    The best single feature is picked first
    Then the next best feature conditioned on the first, ...
  Step-wise feature elimination:
    Repeatedly eliminate the worst feature
  Best combined feature selection and elimination
  Optimal branch and bound:
    Use feature elimination and backtracking

Data Compression

String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless
  But only limited manipulation is possible without expansion
Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Time sequences are not audio
  Typically short and vary slowly with time

Data Compression

[Figure: Original Data is reduced to Compressed Data; a lossless method recovers the Original Data exactly, otherwise only an approximation is recovered (Original Data Approximated)]

Dimensionality Reduction: Principal Component Analysis (PCA)

Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
Steps
  Normalize the input data: each attribute falls within the same range
  Compute k orthonormal (unit) vectors, i.e., the principal components
  Each input data vector is a linear combination of the k principal component vectors
  The principal components are sorted in order of decreasing "significance" or strength
  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
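
The steps above can be illustrated in the smallest possible setting, reducing 2-D points to 1-D. This is a sketch under simplifying assumptions: the data is only mean-centered (not rescaled to a common range), and the principal axes come from the closed-form rotation angle for a 2x2 covariance matrix, which does not generalize to more dimensions. All names and data are illustrative:

```python
from math import atan2, cos, sin

def pca_2d(points):
    """Return (mean, pc1, pc2, var1, var2) for 2-D points, strongest component first."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the direction of maximum variance.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    c, s = cos(theta), sin(theta)
    pc1, pc2 = (c, s), (-s, c)               # orthonormal unit vectors
    var1 = sxx * c * c + 2 * sxy * c * s + syy * s * s
    var2 = sxx * s * s - 2 * sxy * c * s + syy * c * c
    return (mx, my), pc1, pc2, var1, var2

def reduce_to_1d(points, mean, pc1):
    """Keep only the coordinate along the strongest component."""
    return [(x - mean[0]) * pc1[0] + (y - mean[1]) * pc1[1] for x, y in points]

def reconstruct(scores, mean, pc1):
    """Approximate the original points from the 1-D scores."""
    return [(mean[0] + t * pc1[0], mean[1] + t * pc1[1]) for t in scores]

# Hypothetical, nearly collinear data.
points = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]
mean, pc1, pc2, var1, var2 = pca_2d(points)
approx = reconstruct(reduce_to_1d(points, mean, pc1), mean, pc1)
```

When `var2` is small relative to `var1`, dropping the second component loses little, which is the "eliminate the weak components" step. For real data with many dimensions a linear-algebra library would be the practical choice.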

Principal Component Analysis

[Figure: data plotted against the original axes X1 and X2, with the principal component axes Y1 and Y2 overlaid]

Data Reduction Method (1): Regression and Log-Linear Models

Linear regression: data are modeled to fit a straight line
  Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models

Linear regression: Y = wX + b
  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
  Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables
  Probability: \( p(a, b, c, d) = \alpha_{ab} \, \beta_{ac} \, \chi_{ad} \, \delta_{bcd} \)
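
For the linear case Y = wX + b, the least-squares estimates have a closed form. A minimal sketch, assuming two equal-length numeric lists with at least two distinct X values; the function name and data are illustrative:

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the model Y = w * X + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)   # assumes the xs are not all equal
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical points lying exactly on y = x + 1.
xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]
w, b = fit_line(xs, ys)
```

As a data reduction method, only the two coefficients need to be stored in place of the points, at the cost of losing whatever the line does not capture.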

Data Reduction Method (2): Histograms

Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
  Equal-width: equal bucket range
  Equal-frequency (or equal-depth)
  V-optimal: the histogram with the least variance (a weighted sum of the original values that each bucket represents)
  MaxDiff: set a bucket boundary between each of the β - 1 pairs of adjacent values with the largest differences, where β is the number of buckets

[Figure: histogram; horizontal axis from 10,000 to 90,000, vertical axis from 0 to 40]

Data Reduction Method (3): Clustering

Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
Can be very effective if the data is clustered but not if the data is "smeared"
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 7

Data Reduction Method (4): Sampling

Sampling: obtaining a small sample s to represent the whole data set N
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
  Simple random sampling may perform very poorly in the presence of skew
Develop adaptive sampling methods
  Stratified sampling:
    Approximate the percentage of each class (or subpopulation of interest) in the overall database
    Used in conjunction with skewed data
Note: sampling may not reduce database I/Os (data is read a page at a time)
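
The sampling schemes can be sketched with the standard `random` module. The function names, the (group, id) record layout, and the rule of keeping at least one record per stratum are assumptions made for illustration:

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample of size s WITHOUT replacement."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample of size s WITH replacement (repeats are possible)."""
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every stratum defined by key(row)."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))   # keep at least one row per stratum
        sample.extend(rng.sample(rows, k))
    return sample

# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
rng = random.Random(0)                            # fixed seed for repeatability
sample = stratified_sample(customers, key=lambda row: row[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 records could easily miss the small "senior" group altogether; the stratified version keeps its share of the sample fixed.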

Sampling: with or without Replacement

[Figure: raw data and samples drawn from it with and without replacement]

Discretization

Three types of attributes:
  Nominal: values from an unordered set, e.g., color, profession
  Ordinal: values from an ordered set, e.g., military or academic rank
  Continuous: numeric values, e.g., integers or real numbers
Discretization:
  Divide the range of a continuous attribute into intervals
  Some classification algorithms only accept categorical attributes
  Reduce data size by discretization
  Prepare for further analysis

Discretization and Concept Hierarchy

Discretization
  Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  Interval labels can then be used to replace actual data values
  Supervised vs. unsupervised
  Split (top-down) vs. merge (bottom-up)
  Discretization can be performed recursively on an attribute
Concept hierarchy formation
  Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data

Typical methods (all of them can be applied recursively):
  Binning (covered above)
    Top-down split, unsupervised
  Histogram analysis (covered above)
    Top-down split, unsupervised
  Clustering analysis (covered above)
    Either top-down split or bottom-up merge, unsupervised
  Entropy-based discretization: supervised, top-down split
  Interval merging by χ² analysis: supervised (it uses class labels), bottom-up merge
  Segmentation by natural partitioning: top-down split, unsupervised
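
Entropy-based discretization is the supervised, top-down method in the list: it picks the split point that leaves the class labels in the two resulting intervals as pure as possible. A minimal sketch of one split step, assuming numeric values with one class label each; the function names and data are illustrative, and the stopping criterion used for recursive splitting is not covered:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, expected_info) of the binary split with the lowest
    size-weighted entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_threshold = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_info = info
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_info

# Hypothetical ages with a yes/no class label.
ages = [22, 25, 30, 41, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, info = best_split(ages, buys)
```

Applying `best_split` again to each resulting interval gives the recursive, top-down behavior named in the list.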

Example of 3-4-5 Rule

Step 1: profit values: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 to $2,000) is split into 3 intervals:
  (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
Step 4: the overall range becomes (-$400 to $5,000), and each top-level interval is split again:
  (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
  (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
  ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
  ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)

Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
  {Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
  E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Exceptions, e.g., weekday, month, quarter, year

  country              15 distinct values
  province_or_state    365 distinct values
  city                 3,567 distinct values
  street               674,339 distinct values
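
The distinct-value heuristic amounts to counting and sorting. A minimal sketch, assuming rows are dictionaries; the function name and the address records are hypothetical:

```python
def hierarchy_by_distinct_counts(rows, attributes):
    """Order attributes from the top of the hierarchy (fewest distinct values)
    to the bottom (most distinct values)."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a]), counts

# Hypothetical address records.
addresses = [
    {"street": "1 Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "3 Elm St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "4 Pine Rd", "city": "Madison", "state": "Wisconsin", "country": "USA"},
    {"street": "5 Lake Dr", "city": "Milwaukee", "state": "Wisconsin", "country": "USA"},
]
levels, counts = hierarchy_by_distinct_counts(addresses, ["street", "city", "state", "country"])
```

As the exceptions on the slide indicate, the result is only a suggestion: for weekday (7 values), month (12), and quarter (4) the count-based ordering would be wrong, so the output should be reviewed before use.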



Summary

Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
Descriptive data summarization is needed for quality data preprocessing
Data preparation includes
  Data cleaning and data integration
  Data reduction and feature selection
  Discretization
A lot of methods have been developed, but data preprocessing is still an active area of research

Review Questions

How is a data warehouse different from a database? How are they similar?
List the five primitives for specifying a data mining task.
State the data mining functionalities.
List the classifications of data mining systems.
Write a note on data mining query languages.
Describe the steps involved in data mining when viewed as a process of knowledge discovery.
State the various kinds of frequent patterns.
Give an example of a multidimensional association rule.
State the need for outlier analysis.
Are all of the patterns interesting? Justify.
What are the possible integration schemes for integrating a data mining system with a database or data warehouse system?

Bibliography

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.

UNIT II

Closed Patterns and Max-Patterns

A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains \( \binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30} \) sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
A closed pattern is a lossless compression of frequent patterns
  Reducing the number of patterns and rules

Closed Patterns and Max-Patterns

Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}
Min_sup = 1.
What is the set of closed itemsets?
<a1, ..., a100>: 1
<a1, ..., a50>: 2
What is the set of max-patterns?
<a1, ..., a100>: 1
What is the set of all patterns?
!! (every nonempty subset of {a1, ..., a100}, i.e., 2^100 - 1 itemsets)
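
The exercise can be checked by brute force on a scaled-down analogue. The sketch below is only an illustration under stated assumptions: the database is shrunk to {<a1, ..., a4>, <a1, a2>} so that enumeration is feasible, and the helper names `frequent_itemsets` and `closed_and_max` are hypothetical, not part of the slides.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Enumerate every itemset and keep those with support >= min_sup (brute force)."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(1 for t in db if itemset <= t)
            if support >= min_sup:
                result[itemset] = support
    return result

def closed_and_max(freq):
    """Split frequent itemsets into closed itemsets and max-patterns."""
    closed, maximal = {}, {}
    for x, sup in freq.items():
        supersets = [y for y in freq if x < y]  # frequent proper supersets of x
        if all(freq[y] != sup for y in supersets):
            closed[x] = sup      # no superset has the same support
        if not supersets:
            maximal[x] = sup     # no frequent superset at all
    return closed, maximal

# Scaled-down analogue of the exercise (assumption: 4 and 2 items instead of 100 and 50)
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = frequent_itemsets(db, min_sup=1)
closed, maximal = closed_and_max(freq)
print(len(freq), closed, maximal)
```

With four items there are 2^4 - 1 = 15 frequent itemsets, but only two closed itemsets and one max-pattern, mirroring the answer pattern of the exercise.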

Chapter 5: Mining Frequent Patterns, Association and Correlations

Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

Scalable Methods for Mining Frequent Patterns

The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant @VLDB'94)
Freq. pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm: Zaki & Hsiao @SDM'02)

Apriori: A Candidate Generation-and-Test Approach

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB'94, Mannila, et al. @KDD'94)
Method:
Initially, scan DB once to get frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example

Sup_min = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (itemsets): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 (itemset: sup): {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2 (itemset: sup): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
C3 (itemsets): {B, C, E}
3rd scan, L3 (itemset: sup): {B, C, E}: 2

The Apriori Algorithm

Pseudo-code:
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
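
The pseudo-code can be turned into a small runnable sketch. The version below is a minimal illustration, not a tuned implementation: it assumes transactions fit in memory, counts support by a plain subset test rather than a hash tree, and uses a simple pairwise-union join. The function name `apriori` and the variable names are choices made here for illustration.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items to obtain L1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        prev = set(level)
        # Generate C(k+1): unions of two frequent k-itemsets that have size k+1,
        # kept only if every k-subset is frequent (Apriori pruning)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Scan the database once to count the candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# The four-transaction database TDB from the worked example, min_sup = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this database the sketch finds nine frequent itemsets, the largest being {B, C, E} with support 2.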

Important Details of Apriori

How to generate candidates?
Step 1: self-joining L_k
Step 2: pruning
How to count supports of candidates?
Example of candidate generation
L_3 = {abc, abd, acd, ace, bcd}
Self-joining: L_3 * L_3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L_3
C_4 = {abcd}

How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
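
The join and prune steps above can be sketched directly on ordered tuples. This is an illustrative sketch under the slide's assumption that items within each itemset are listed in a fixed order; the function name `apriori_gen` is chosen here for convenience.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build C_k from L_{k-1}; itemsets are sorted tuples of equal length."""
    prev = set(prev_level)
    # Step 1: self-join. p and q agree on the first k-2 items
    # and the last item of p precedes the last item of q.
    joined = {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Step 2: prune. Drop c if any (k-1)-subset of c is not in L_{k-1}.
    return {
        c for c in joined
        if all(s in prev for s in combinations(c, len(c) - 1))
    }

# Example from the slides: L3 = {abc, abd, acd, ace, bcd}
L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
C4 = apriori_gen(L3)
print(C4)
```

The join produces abcd and acde; acde is pruned because ade is not in L3, leaving C4 = {abcd}.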

How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
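
Example: Counting Supports of Candidates

[Figure: hash tree of candidate 3-itemsets with the subset function applied to transaction 1 2 3 5 6. Each interior node hashes items into three branches: {1, 4, 7}, {2, 5, 8}, {3, 6, 9}. The transaction is split as 1 + 2 3 5 6, then 1 2 + 3 5 6 and 1 3 + 5 6. Leaf itemsets: 1 4 5; 1 2 4; 4 5 7; 1 2 5; 4 5 8; 1 5 9; 1 3 6; 2 3 4; 5 6 7; 3 4 5; 3 5 6; 3 5 7; 6 8 9; 3 6 7; 3 6 8.]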

Efficient Implementation of Apriori in SQL

Hard to get good performance out of pure SQL (SQL-92) based approaches alone
Make use of object-relational extensions like UDFs, BLOBs, table functions etc.
Get orders of magnitude improvement
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98

Challenges of Frequent Pattern Mining

Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB'95

Sampling for Frequent Patterns

Select a sample of original database, mine frequent patterns within sample using Apriori
Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules. In VLDB'96

Bottleneck of Frequent-Pattern Mining

Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset i1 i2 ... i100
# of scans: 100
# of candidates:
$$\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$$
Bottleneck: candidate generation and test
Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items
"abc" is a frequent pattern
Get all transactions having "abc": DB|abc
"d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

Construct FP-tree from a Transaction Database

min_support = 3

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again, construct FP-tree

F-list = f-c-a-b-m-p

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

FP-tree (node = item:count, children indented under their parent):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Benefits of the FP-tree Structure

Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info: infrequent items are gone
Items in frequency descending order: the more frequently occurring, the more likely to be shared
Never be larger than the original database (not counting node-links and the count field)
For Connect-4 DB, compression ratio could be over 100

Find Patterns Having p From p's Conditional Database

Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

[Figure: the FP-tree and header table built with min_support = 3 and f-list f-c-a-b-m-p]

Conditional pattern bases
Item   Cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
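
The conditional pattern bases in the table can be reproduced with a small FP-tree sketch. The code below is illustrative only: it assumes the transactions have already been filtered and ordered by the f-list f-c-a-b-m-p (as in the "ordered frequent items" column), and the class and function names are hypothetical.

```python
class Node:
    """One FP-tree node: an item, its count, a parent link and child links."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction as a path, sharing common prefixes."""
    root = Node(None, None)
    header = {}  # item -> list of nodes carrying that item (the node-links)
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(header, item):
    """Collect the prefix path of every node for `item`, weighted by node count."""
    base = {}
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            prefix = "".join(reversed(path))
            base[prefix] = base.get(prefix, 0) + node.count
    return base

# Ordered frequent items of TIDs 100-500 (min_support = 3)
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(ordered)
for item in "cabmp":
    print(item, conditional_pattern_base(header, item))
```

Each conditional pattern base is read off by walking from the item's nodes up to the root, which is why no candidate itemsets need to be generated.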


Mining Various Kinds of Association Rules

Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns

Mining Multiple-Level Association Rules

Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

[Figure: uniform support vs. reduced support on a two-level item hierarchy]
Level 1: Milk [support = 10%]; min_sup = 5% under both settings
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]; min_sup = 5% under uniform support, min_sup = 3% under reduced support

Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items.
Example
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

Mining Multi-Dimensional Association

Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values; data cube approach
Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97)
one-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell @KDD'99):
Sex = female => Wage: mean = $7/hr (overall mean = $9)

Quantitative Association Rules

Proposed by Lent, Swami and Widom ICDE'97
Numeric attributes are dynamically discretized
Such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
Example
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Mining Other Interesting Patterns

Flexible support constraints (Wang et al. @VLDB'02)
Some items (e.g., diamond) may occur rarely but are valuable
Customized sup_min specification and application
Top-K closed frequent patterns (Han, et al. @ICDM'02)
Hard to specify sup_min, but top-k with length_min is more desirable
Dynamically raise sup_min in FP-tree construction and mining, and select most promising path to mine

Interestingness Measure: Correlations (Lift)

play basketball ⇒ eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
Measure of dependent/correlated events: lift
$$\mathrm{lift} = \frac{P(A \cup B)}{P(A)\,P(B)}$$
where P(A ∪ B) is the probability that a transaction contains both A and B

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

$$\mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000)(3750/5000)} = 0.89$$
$$\mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000)(1250/5000)} = 1.33$$
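
The two lift values can be recomputed from the contingency counts. The helper below is a minimal sketch; the function name `lift` and its count-based signature are choices made here for illustration.

```python
def lift(n_both, n_a, n_b, n_total):
    """lift = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    p_both = n_both / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_both / (p_a * p_b)

# Counts from the basketball/cereal table (5000 students in total)
lift_b_c = lift(n_both=2000, n_a=3000, n_b=3750, n_total=5000)
lift_b_not_c = lift(n_both=1000, n_a=3000, n_b=1250, n_total=5000)
print(round(lift_b_c, 2), round(lift_b_not_c, 2))
```

A lift below 1 (basketball and cereal) indicates negative correlation, while a lift above 1 (basketball and not cereal) indicates positive correlation.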

Constraint-based (Query-Directed) Mining

Finding all the patterns in a database autonomously? Unrealistic!
The patterns could be too many but not focused!
Data mining should be an interactive process
User directs what to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
User flexibility: provides constraints on what to be mined
System optimization: explores such constraints for efficient mining

Constraints in Data Mining

Knowledge type constraint:
classification, association, etc.
Data constraint, using SQL-like queries
find product pairs sold together in stores in Chicago in Dec. '02
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence ≥ 60%

Constrained Mining vs. Constraint-Based Search

Constrained mining vs. constraint-based search/reasoning
Both are aimed at reducing search space
Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
Constraint-pushing vs. heuristic search
It is an interesting research problem on how to integrate them
Constrained mining vs. query processing in DBMS
Database query processing requires finding all answers
Constrained pattern mining shares a similar philosophy as pushing selections deeply in query processing

The Apriori Algorithm: Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D, C1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (itemsets): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D, C2 (itemset: sup): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (itemset: sup): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3 (itemsets): {2 3 5}
Scan D, L3 (itemset: sup): {2 3 5}: 2

Naïve Algorithm: Apriori + Constraint

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D, C1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (itemset: sup.): {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (itemsets): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D, C2 (itemset: sup): {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (itemset: sup): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3 (itemsets): {2 3 5}
Scan D, L3 (itemset: sup): {2 3 5}: 2

Constraint: Sum{S.price} < 5

Frequent-Pattern Mining: Summary

Frequent pattern mining: an important task in data mining
Scalable frequent pattern mining methods
Apriori (candidate generation & test)
Projection-based (FP-growth, CLOSET+, ...)
Vertical format approach (CHARM, ...)
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications

Cluster Analysis

1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary

What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?

A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
There is a separate "quality" function that measures the "goodness" of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Requirements of Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability

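Data Structures

Data matrix (two modes)
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode)
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$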

Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Interval-valued variables

Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$
Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using mean absolute deviation is more robust than using standard deviation
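
The standardization can be sketched for a single variable f as follows. This is a minimal illustration that assumes the variable is not constant (so s_f is nonzero); the function name `standardize` is chosen here.

```python
def standardize(values):
    """z-scores of one variable using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

z = standardize([2, 4, 6, 8])
print(z)
```

For the values 2, 4, 6, 8 the mean is 5 and the mean absolute deviation is 2. Because deviations are not squared, a single outlier inflates s_f less than it would inflate the standard deviation.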

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
$$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
Properties
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
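
The Minkowski family, with its Manhattan (q = 1) and Euclidean (q = 2) special cases, can be sketched in a few lines. The function name `minkowski` and the sample points are illustrative assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional objects for a positive integer q."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j, k = (1, 2), (4, 6), (0, 0)
manhattan = minkowski(i, j, 1)   # |1-4| + |2-6|
euclidean = minkowski(i, j, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)

# Spot-check of the listed properties on these three points
print(minkowski(i, i, 2) == 0.0)
print(minkowski(i, j, 2) == minkowski(j, i, 2))
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))
```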

Binary Variables

A contingency table for binary data

                 Object j
                 1        0        sum
Object i   1     a        b        a + b
           0     c        d        c + d
           sum   a + c    b + d    p

Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{a}{a + b + c}$$

Dissimilarity between Binary Variables

Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
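
The three dissimilarities can be recomputed from the table. The sketch below encodes Y and P as 1 and N as 0 and uses only the six asymmetric attributes (gender is left out); the function and variable names are illustrative.

```python
def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (b + c) / (a + b + c); negative matches (0, 0) are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

def encode(row):
    """Map Y/P to 1 and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in row]

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = encode(["Y", "N", "P", "N", "N", "N"])
mary = encode(["Y", "N", "P", "N", "P", "N"])
jim = encode(["Y", "P", "N", "N", "N", "N"])

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```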

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states

Ordinal Variables

An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace x_{if} by their rank r_{if} ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
Methods:
treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
apply logarithmic transformation
y_{if} = log(x_{if})
treat them as continuous ordinal data and treat their rank as interval-scaled

Variables of Mixed Types

A database may contain all the six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute ranks r_{if} and
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
and treat z_{if} as interval-scaled

Vector Objects

Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
A variant: Tanimoto coefficient
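
The slide names the two measures without giving formulas. The sketch below assumes the commonly used definitions, cosine(x, y) = x·y / (|x| |y|) and Tanimoto(x, y) = x·y / (x·x + y·y - x·y); these definitions, the function names, and the sample vectors are assumptions made here, not taken from the slide.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """Tanimoto coefficient: shared weight relative to total weight minus the overlap."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

x, y = (3, 4), (4, 3)
cos_xy = cosine(x, y)      # 24 / (5 * 5)
tan_xy = tanimoto(x, y)    # 24 / (25 + 25 - 24)
print(round(cos_xy, 3), round(tan_xy, 3))
```

Both measures equal 1 for identical vectors; for binary vectors the Tanimoto coefficient reduces to the Jaccard coefficient.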

Major Clustering Approaches (I)

Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE

Major Clustering Approaches (II)

Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t., min sum of squared distance
$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen'67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when no more new assignment
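
The loop can be sketched in plain Python. This is a minimal illustration with assumptions stated up front: it starts from caller-supplied initial centers rather than an initial partition, uses squared Euclidean distance, and leaves a center unchanged if its cluster becomes empty. The function name `kmeans` and the sample points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center
        new_assignment = [
            min(range(len(centers)), key=lambda c: squared_distance(pt, centers[c]))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # no more new assignment
        assignment = new_assignment
        # Recompute each center as the mean point of its cluster
        for c in range(len(centers)):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(points, initial_centers=[(1, 1), (8, 8)])
print(centers, assignment)
```

Each pass costs one distance computation per object per cluster, which is where the O(tkn) running time comes from.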

The K-Means Clustering Method

Example
[Figure: k-means with K = 2 on a 10 × 10 scatter plot. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the cluster means again.]

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms
Weakness
Applicable only when mean is defined, then what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method

A few variants of the k-means which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes (Huang'98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!
Since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots on 0-10 axes]
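
A one-dimensional toy cluster shows the difference between the two reference points. The values and the function names below are illustrative assumptions; the medoid is taken as the object with the smallest total absolute distance to all objects in the cluster.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """The cluster member with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4, 100]   # 100 plays the role of an outlier
print(mean(cluster), medoid(cluster))
```

The single outlier drags the mean to 22, far from every ordinary object, while the medoid stays at 3.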

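Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: five objects a, b, c, d, e. Agglomerative (AGNES), Step 0 to Step 4: a and b merge into ab, d and e merge into de, c and de merge into cde, and finally ab and cde merge into abcde. Divisive (DIANA) runs in the opposite direction, from abcde at its Step 0 down to the single objects at its Step 4.]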

Density-Based Clustering Methods

Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & D. Keim (KDD'98)
CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

Density-Based Clustering: Basic Concepts

Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to N_Eps(q)
core point condition: |N_Eps(q)| >= MinPts
[Figure: point p inside the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm]
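
These definitions translate almost literally into code. The sketch below is illustrative: it uses Euclidean distance via `math.dist` (Python 3.8+), counts a point as a member of its own neighbourhood, and the function names and sample points are assumptions made here.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): all points of the data set within distance eps of p."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)

points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.4, 0)]
eps, min_pts = 1.0, 5
print(is_core(points, (0, 0), eps, min_pts))
print(directly_density_reachable(points, (0.5, 0), (0, 0), eps, min_pts))
print(directly_density_reachable(points, (0.5, 0), (1.4, 0), eps, min_pts))
```

The last call returns False even though the two points are within Eps of each other: (1.4, 0) has too few neighbours to be a core point, so direct density-reachability is not symmetric.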

Grid-Based Clustering Method

Using multi-resolution grid data structure
Several interesting methods
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
A multi-resolution clustering approach using wavelet method
CLIQUE: Agrawal, et al. (SIGMOD'98)
On high-dimensional data (thus put in the section of clustering high-dimensional data)

Model-Based Clustering

What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption: Data are generated by a mixture of underlying probability distributions
Typical methods
Statistical approach
EM (Expectation maximization), AutoClass
Machine learning approach
COBWEB, CLASSIT
Neural network approach
SOM (Self-Organizing Feature Map)

Self-Organizing Feature Map (SOM)

SOMs, also called topological ordered maps, or Kohonen Self-Organizing Feature Maps (KSOMs)
It maps all the points in a high-dimensional source space into a 2 to 3-d target space, s.t., the distance and proximity relationship (i.e., topology) are preserved as much as possible
Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
Clustering is performed by having several units competing for the current object
The unit whose weight vector is closest to the current object wins
The winner and its neighbors learn by having their weights adjusted
SOMs are believed to resemble processing that can occur in the brain
Useful for visualizing high-dimensional data in 2- or 3-D space

Clustering High-Dimensional Data

Clustering high-dimensional data
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measure becomes meaningless due to equi-distance
Clusters may exist only in some subspaces
Methods
Feature transformation: only effective if most dimensions are relevant
PCA & SVD useful only when features are highly correlated/redundant
Feature selection: wrapper or filter approaches
useful to find a subspace where the data have nice clusters
Subspace-clustering: find clusters in all the possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering

CLIQUE (CLustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
CLIQUE can be considered both density-based and grid-based
    It partitions each dimension into the same number of equal-length intervals
    It partitions an m-dimensional data space into non-overlapping rectangular units
    A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter (the density threshold)
    A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters
    Determine dense units in all subspaces of interest
    Determine connected dense units in all subspaces of interest
Generate a minimal description for the clusters
    Determine maximal regions that cover a cluster of connected dense units for each cluster
    Determine a minimal cover for each cluster
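
The first two steps (counting points per cell, then using the Apriori principle to prune higher-dimensional candidates) can be sketched as follows. This is a simplified illustration limited to 1-D and 2-D subspaces, not the published algorithm; the function name `dense_units`, the sample points, and the threshold value are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, lo, hi, threshold):
    """Return {subspace dims: set of dense cell indices} for 1-D and 2-D subspaces."""
    width = (hi - lo) / n_intervals
    def cell(v):
        # Index of the equal-length interval containing v.
        return min(int((v - lo) / width), n_intervals - 1)
    min_count = threshold * len(points)  # a unit is dense if its count exceeds this
    n_dims = len(points[0])
    result = {}
    for d in range(n_dims):
        counts = Counter((cell(p[d]),) for p in points)
        dense = {c for c, k in counts.items() if k > min_count}
        if dense:
            result[(d,)] = dense
    # Apriori principle: a 2-D unit can be dense only if both of its
    # 1-D projections are dense, so other candidates are skipped.
    for d1, d2 in combinations(range(n_dims), 2):
        if (d1,) not in result or (d2,) not in result:
            continue
        counts = Counter((cell(p[d1]), cell(p[d2])) for p in points)
        dense = {c for c, k in counts.items()
                 if k > min_count
                 and (c[0],) in result[(d1,)] and (c[1],) in result[(d2,)]}
        if dense:
            result[(d1, d2)] = dense
    return result

# Made-up data: four points close together plus two scattered points.
pts = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (1.9, 1.1), (5.0, 9.0), (9.0, 3.0)]
units = dense_units(pts, n_intervals=5, lo=0.0, hi=10.0, threshold=0.5)
```

Connecting adjacent dense units into clusters and generating minimal descriptions are left out of this sketch.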

[Figure: CLIQUE example. Two panels plot Salary (10,000) and Vacation (week) against age, with age from 20 to 60 and the vertical axes from 0 to 7; the density threshold is 3. A further panel marks the age range 30 to 50 against Vacation.]

Strengths and Weaknesses of CLIQUE

Strengths
    Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
    Insensitive to the order of records in the input and does not presume any canonical data distribution
    Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
Weakness
    The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Why Constraint-Based Cluster Analysis?

Need user feedback: users know their applications best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters

What Is Outlier Discovery?

What are outliers?
    Objects that are considerably dissimilar from the remainder of the data
    Example from sports: Michael Jordan, Wayne Gretzky, ...
Problem: define and find outliers in large data sets
Applications:
    Credit card fraud detection
    Telecom fraud detection
    Customer segmentation
    Medical analysis

Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests depending on
    the data distribution
    the distribution parameters (e.g., mean, variance)
    the number of expected outliers
Drawbacks
    Most tests are for a single attribute
    In many cases, the data distribution may not be known

Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods
    We need multi-dimensional analysis without knowing the data distribution
Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
Algorithms for mining distance-based outliers
    Index-based algorithm
    Nested-loop algorithm
    Cell-based algorithm
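
The DB(p, D) definition above can be checked directly with a naive nested loop. The sketch below is a quadratic-time illustration of the definition only, not the optimized nested-loop algorithm from the literature; the function name `db_outliers` and the sample data are assumptions.

```python
import math

def db_outliers(points, p, d):
    """Return the objects O for which at least a fraction p of all
    objects lie at a Euclidean distance greater than d from O."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if j != i and math.dist(o, q) > d)
        if far >= p * n:
            outliers.append(o)
    return outliers

# Made-up data: four nearby points and one isolated point.
data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (8.0, 8.0)]
found = db_outliers(data, p=0.75, d=3.0)
```

Only the isolated point has at least 75% of the data farther than 3.0 away, so it is the single object reported.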

Summary

Cluster analysis groups objects based on their similarity and has wide applications
Measures of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
There are still many open research issues in cluster analysis

Review Questions

State the need for market basket analysis.
What are the two conditions that make an association rule interesting?
State the two-step process of association rule mining.
Define the Apriori property.
List the techniques to improve the efficiency of Apriori.
What is cluster analysis?
Give the typical requirements of clustering in data mining.
What is the difference between symmetric and asymmetric binary variables?
State the types of data in cluster analysis.

Bibliography

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.

UNIT III

Classification and Prediction

Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary

Classification vs. Prediction

Classification
    Predicts categorical class labels (discrete or nominal)
    Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Prediction
    Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
    Credit approval
    Target marketing
    Medical diagnosis
    Fraud detection

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
    Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    The set of tuples used for model construction is the training set
    The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
    Estimate the accuracy of the model
        The known label of each test sample is compared with the classified result from the model
        The accuracy rate is the percentage of test set samples that are correctly classified by the model
        The test set must be independent of the training set; otherwise the estimate is overly optimistic because the model may overfit the training data
    If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction

The training data are fed to a classification algorithm, which produces the classifier (model).

Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Process (2): Using the Model in Prediction

The classifier is first evaluated on the testing data and then applied to unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?
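
The two steps can be traced with the tenure example: the rule learned from the training data is scored on the testing data and then applied to the unseen tuple. The sketch below hard-codes the rule from the slide; the function name `predict_tenured` is an assumption for illustration.

```python
def predict_tenured(rank, years):
    """Classifier from the slide: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy: share of test tuples whose predicted label matches the known label.
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)

# Unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
```

On this test set the rule misclassifies Merlisa (7 years but not tenured), so the estimated accuracy is 3 out of 4, and Jeff is predicted to be tenured.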

Supervised vs. Unsupervised Learning

Supervised learning (classification)
    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    New data are classified based on the training set
Unsupervised learning (clustering)
    The class labels of the training data are unknown
    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues: Data Preparation

Data cleaning
    Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
    Remove irrelevant or redundant attributes
Data transformation
    Generalize and/or normalize data

Issues: Evaluating Classification Methods

Accuracy
    Classifier accuracy: predicting class labels
    Predictor accuracy: guessing values of predicted attributes
Speed
    Time to construct the model (training time)
    Time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
    Understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Output: A Decision Tree for buys_computer

age?
    <=30: student?
        no: no
        yes: yes
    31..40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
    The tree is constructed in a top-down, recursive, divide-and-conquer manner
    At the start, all the training examples are at the root
    Attributes are categorical (if continuous-valued, they are discretized in advance)
    Examples are partitioned recursively based on selected attributes
    Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
    All samples for a given node belong to the same class
    There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
    There are no samples left
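
To make the attribute-selection step concrete, the sketch below computes the information gain of each attribute on the buys_computer training dataset and picks the attribute for the root. It covers only this one selection step, not the full recursive algorithm; the helper names `entropy` and `info_gain` are assumptions for illustration.

```python
import math
from collections import Counter

COLUMNS = ["age", "income", "student", "credit_rating"]
# buys_computer training data; the last field is the class label.
ROWS = [line.split() for line in """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Entropy before the split minus the weighted entropy after splitting on col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder

gains = {name: info_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
```

The attribute age has the largest gain (about 0.247 bits), which is why it appears at the root of the tree for this dataset.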

Classification in Large Databases

Classification: a classical problem extensively studied by statisticians and machine learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
    Relatively faster learning speed (than other classification methods)
    Convertible to simple and easy-to-understand classification rules
    Can use SQL queries for accessing databases
    Comparable classification accuracy with other methods

Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al.'97)
Classification at primitive concept levels
    E.g., precise temperature, humidity, outlook, etc.
    Low-level concepts, scattered classes, bushy classification trees
    Semantic interpretation problems
Cube-based multi-level classification
    Relevance analysis at multiple levels
    Information-gain analysis with dimension + level

Bayesian Classification: Why?

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Performance: a simple Bayesian classifier, the naive Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics

Let X be a data sample ("evidence"): its class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
    E.g., X will buy a computer, regardless of age, income, ...
P(X): the probability that the sample data is observed
P(X|H) (likelihood, i.e., the posteriori probability of X conditioned on H): the probability of observing the sample X, given that the hypothesis holds
    E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayesian Theorem

Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

Informally, this can be written as
    posteriori = likelihood x prior / evidence
Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost

Towards the Naive Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n)
Suppose there are m classes C_1, C_2, ..., C_m
Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
This can be derived from Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

Since P(X) is constant for all classes, only

$$P(X \mid C_i)\, P(C_i)$$

needs to be maximized

Naive Bayesian Classifier: Training Dataset

Classes:
    C1: buys_computer = 'yes'
    C2: buys_computer = 'no'
Data sample:
    X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Naive Bayesian Classifier: An Example

P(C_i):
    P(buys_computer = "yes") = 9/14 = 0.643
    P(buys_computer = "no") = 5/14 = 0.357

Compute P(X|C_i) for each class:
    P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
    P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
    P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
    P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
    P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
    P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
    P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
    P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|C_i):
    P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
    P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|C_i) * P(C_i):
    P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
    P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007

Therefore, X belongs to class (buys_computer = "yes")
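
The arithmetic of this example can be reproduced by counting directly from the training dataset. The sketch below is a bare-bones illustration without smoothing for zero counts; the function name `naive_bayes_scores` is an assumption.

```python
COLUMNS = ["age", "income", "student", "credit_rating"]
# buys_computer training data; the last field is the class label.
ROWS = [line.split() for line in """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]

def naive_bayes_scores(rows, sample):
    """Return P(X|Ci) * P(Ci) for every class Ci, assuming
    class-conditional independence of the attributes."""
    scores = {}
    for label in sorted({r[-1] for r in rows}):
        in_class = [r for r in rows if r[-1] == label]
        score = len(in_class) / len(rows)  # prior P(Ci)
        for name, value in sample.items():
            idx = COLUMNS.index(name)
            matches = sum(1 for r in in_class if r[idx] == value)
            score *= matches / len(in_class)  # P(x_k | Ci)
        scores[label] = score
    return scores

sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
scores = naive_bayes_scores(ROWS, sample)
prediction = max(scores, key=scores.get)
```

Rounded to three decimals the two scores are 0.028 and 0.007, matching the slide, so the sample is assigned to buys_computer = yes.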

Naive Bayesian Classifier: Comments

Advantages
    Easy to implement
    Good results obtained in most of the cases
Disadvantages
    Assumption: class conditional independence, therefore loss of accuracy
    Practically, dependencies exist among variables
        E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
        Dependencies among these cannot be modeled by a naive Bayesian classifier
How to deal with these dependencies?
    Bayesian belief networks

Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
    Represents dependency among the variables
    Gives a specification of the joint probability distribution
Nodes: random variables
Links: dependency
Example graph with nodes X, Y, Z, and P:
    X and Y are the parents of Z, and Y is the parent of P
    No dependency between Z and P
    Has no loops or cycles

Bayesian Belief Network: An Example

[Figure: belief network with the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

The conditional probability table (CPT) for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of its parents

Derivation of the probability of a particular combination of values of X from the CPT:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(Y_i))$$

Training Bayesian Networks

Several scenarios:
    Given both the network structure and all variables observable: learn only the CPTs
    Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
    Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
    Unknown structure, all hidden variables: no good algorithms known for this purpose
Ref.: D. Heckerman, Bayesian networks for data mining

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules
    R: IF age = youth AND student = yes THEN buys_computer = yes
    Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
    n_covers = # of tuples covered by R
    n_correct = # of tuples correctly classified by R
    coverage(R) = n_covers / |D|   /* D: training data set */
    accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed
    Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., the most attribute tests)
    Class-based ordering: decreasing order of prevalence or misclassification cost per class
    Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
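
As a worked instance of the coverage and accuracy formulas, the sketch below evaluates rule R on the buys_computer training dataset, treating age = youth as the value "<=30". The function name `rule_quality` and its argument layout are assumptions for illustration.

```python
COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
# buys_computer training data as a list of attribute dictionaries.
ROWS = [dict(zip(COLUMNS, line.split())) for line in """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]

def rule_quality(rows, antecedent, class_attr, class_value):
    """Return (coverage, accuracy) of the rule IF antecedent THEN class_attr = class_value."""
    covered = [r for r in rows if all(r[a] == v for a, v in antecedent.items())]
    n_covers = len(covered)
    n_correct = sum(1 for r in covered if r[class_attr] == class_value)
    coverage = n_covers / len(rows)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# R: IF age = youth AND student = yes THEN buys_computer = yes
coverage, accuracy = rule_quality(
    ROWS, {"age": "<=30", "student": "yes"}, "buys_computer", "yes")
```

R covers 2 of the 14 tuples and classifies both correctly, giving a coverage of about 14.3% and an accuracy of 100%.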

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

[Figure: the buys_computer decision tree, with age? at the root (branches <=30, 31..40, >40) and the subtrees student? and credit rating?]

Example: rule extraction from our buys_computer decision tree
    IF age = young AND student = no THEN buys_computer = no
    IF age = young AND student = yes THEN buys_computer = yes
    IF age = mid-age THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = no
    IF age = old AND credit_rating = fair THEN buys_computer = yes

Rule Extraction from the Training Data

Sequential covering algorithm: extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class C_i will cover many tuples of C_i but none (or few) of the tuples of other classes
Steps:
    Rules are learned one at a time
    Each time a rule is learned, the tuples covered by the rule are removed
    The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Compare with decision-tree induction, which learns a set of rules simultaneously

Classification: A Mathematical Mapping

Classification:
    Predicts categorical class labels
E.g., personal homepage classification
    x_i = (x_1, x_2, x_3, ...), y_i = +1 or -1
    x_1: # of occurrences of the word "homepage"
    x_2: # of occurrences of the word "welcome"
Mathematically
    $x \in X = \mathbb{R}^n$, $y \in Y = \{+1, -1\}$
    We want a function $f: X \rightarrow Y$

Linear Classification

Binary classification problem
The data above the red line belong to class 'x'
The data below the red line belong to class 'o'
Examples: SVM, Perceptron, Probabilistic Classifiers

[Figure: scatter plot of points labeled 'x' and 'o', separated by a red line]

Discriminative Classifiers

Advantages
    Prediction accuracy is generally high
        As compared to Bayesian methods in general
    Robust: works when training examples contain errors
    Fast evaluation of the learned target function
        Bayesian networks are normally slow
Criticism
    Long training time
    Difficult to understand the learned function (weights)
        Bayesian networks can be used easily for pattern discovery
    Not easy to incorporate domain knowledge
        For Bayesian methods this is easy, in the form of priors on the data or distributions

Classification by Backpropagation

Backpropagation: a neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: a set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier

Weakness
    Long training time
    Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
    Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
    High tolerance to noisy data
    Ability to classify untrained patterns
    Well suited for continuous-valued inputs and outputs
    Successful on a wide array of real-world data
    Algorithms are inherently parallel
    Techniques have recently been developed for the extraction of rules from trained neural networks

A Neuron (= a perceptron)

[Figure: the inputs x_0, ..., x_n (input vector x) are multiplied by the weights w_0, ..., w_n (weight vector w) and combined into a weighted sum together with a bias term \mu_k; an activation function f then produces the output y]

For example:

$$y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$$

The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping
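
The mapping from input vector to output can be written in a few lines. In this sketch the weights, bias, and inputs are made-up values, the function name `neuron_output` is an assumption, and sign(0) is arbitrarily treated as +1.

```python
def neuron_output(x, w, bias):
    """Weighted sum of the inputs plus a bias, passed through a sign activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if weighted_sum >= 0 else -1  # assumption: sign(0) is taken as +1

weights = [0.5, -0.4]   # illustrative weight vector w
bias = -0.1             # illustrative bias term
y_pos = neuron_output([2.0, 1.0], weights, bias)  # weighted sum is positive
y_neg = neuron_output([0.0, 1.0], weights, bias)  # weighted sum is negative
```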

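A Multi-Layer Feed-Forward Neural Network

[Figure: the input vector X enters the input layer, which is connected through the weights w_ij to the hidden layer, which in turn feeds the output layer that emits the output vector]

Net input and output of unit j:

$$I_j = \sum_i w_{ij} O_i + \theta_j$$

$$O_j = \frac{1}{1 + e^{-I_j}}$$

Error of an output-layer unit j with target value T_j:

$$Err_j = O_j (1 - O_j)(T_j - O_j)$$

Error of a hidden-layer unit j, summing over the units k of the next layer:

$$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$$

Weight and bias updates with learning rate (l):

$$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$$

$$\theta_j = \theta_j + (l)\, Err_j$$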

How Does a Multi-Layer Neural Network Work?

The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one is used
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function

Defining a Network Topology

First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
One input unit per domain value, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
Steps
    Initialize weights (to small random #s) and biases in the network
    Propagate the inputs forward (by applying the activation function)
    Backpropagate the error (by updating weights and biases)
    Terminating condition (when the error is very small, etc.)
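
One pass through the forward, backward, and update steps can be sketched for a network with a single hidden layer and one sigmoid output unit. The function name `train_step`, the 3-2-1 topology, the initial weights and biases, and the learning rate of 0.9 are all illustrative assumptions; in practice the weights would start as small random numbers.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, w_hidden, b_hidden, w_out, b_out, rate):
    """One backpropagation step; updates the weight lists in place and
    returns (output before the update, output error, new output bias)."""
    n_hidden = len(b_hidden)
    # Propagate the inputs forward.
    hidden = [sigmoid(sum(w_hidden[i][j] * x[i] for i in range(len(x))) + b_hidden[j])
              for j in range(n_hidden)]
    out = sigmoid(sum(w_out[j] * hidden[j] for j in range(n_hidden)) + b_out)
    # Backpropagate the error (using the weights from before the update).
    err_out = out * (1 - out) * (target - out)
    err_hidden = [hidden[j] * (1 - hidden[j]) * err_out * w_out[j]
                  for j in range(n_hidden)]
    # Update weights and biases.
    for j in range(n_hidden):
        w_out[j] += rate * err_out * hidden[j]
        b_hidden[j] += rate * err_hidden[j]
        for i in range(len(x)):
            w_hidden[i][j] += rate * err_hidden[j] * x[i]
    return out, err_out, b_out + rate * err_out

# Illustrative starting values (assumptions, not from the slides).
w_hidden = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]  # w_hidden[i][j]: input i -> hidden j
b_hidden = [-0.4, 0.2]
w_out = [-0.3, -0.2]
b_out = 0.1
x, target = [1.0, 0.0, 1.0], 1.0

out1, err1, b_out = train_step(x, target, w_hidden, b_hidden, w_out, b_out, rate=0.9)
out2, _, b_out = train_step(x, target, w_hidden, b_hidden, w_out, b_out, rate=0.9)
```

After the first update the network's output for the same tuple moves toward the target value, which is the behaviour the terminating condition monitors.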

Associative Classification

Associative classification
    Association rules are generated and analyzed for use in classification
    Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
    Classification: based on evaluating a set of rules in the form of
        p_1 ^ p_2 ^ ... ^ p_l -> "A_class = C" (conf, sup)
Why effective?
    It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
    In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
    Mine possible association rules in the form of
        cond-set (a set of attribute-value pairs) -> class label
    Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
    Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
    Generation of predictive rules (FOIL-like analysis)
    High efficiency, accuracy similar to CMAR
RCBT (Mining top-k covering rule groups for gene expression data, Cong et al., SIGMOD'05)
    Explore high-dimensional classification, using top-k rule groups
    Achieve high classification accuracy and high run-time efficiency

The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X_1, X_2)
The target function could be discrete- or real-valued
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: a query point x_q surrounded by training examples labeled + and -, next to the corresponding Voronoi diagram]
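
For a discrete-valued target function, the k-NN rule above amounts to a sort and a majority vote. The following is a minimal sketch with made-up 2-D training points; the function name `knn_classify` is an assumption, and ties are broken by whichever label the vote counter encounters first.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Return the most common label among the k training examples
    nearest to the query point (Euclidean distance)."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up training examples: (point, label).
training = [
    ((1.0, 1.0), "+"), ((1.0, 2.0), "+"), ((2.0, 1.0), "+"),
    ((6.0, 6.0), "-"), ((7.0, 6.0), "-"), ((6.0, 7.0), "-"),
]
label_a = knn_classify(training, (2.0, 2.0), k=3)
label_b = knn_classify(training, (6.0, 5.0), k=3)
```

No model is built in advance; all the work happens at classification time, which is why such methods are called lazy learners.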

What Is Prediction?

(Numerical) prediction is similar to classification
    Construct a model
    Use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
    Classification refers to predicting categorical class labels
    Prediction models continuous-valued functions
Major method for prediction: regression
    Model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
    Linear and multiple regression
    Nonlinear regression
    Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression

Linear regression: involves a response variable y and a single predictor variable x
    y = w_0 + w_1 x
    where w_0 (y-intercept) and w_1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line

$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$

Multiple linear regression: involves more than one predictor variable
    Training data is of the form (X_1, y_1), (X_2, y_2), ..., (X_|D|, y_|D|)
    E.g., for 2-D data, we may have: y = w_0 + w_1 x_1 + w_2 x_2
    Solvable by an extension of the least squares method or by using SAS, S-Plus
    Many nonlinear functions can be transformed into the above
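
The least squares estimates for the single-predictor case translate directly into code. The sketch below uses made-up data that lie exactly on the line y = 1 + 2x, so the recovered coefficients are easy to check; the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
w0, w1 = fit_line(xs, ys)
prediction = w0 + w1 * 6.0  # predicted response for x = 6
```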

Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
    y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:
    y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
Other functions, such as power functions, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., a sum of exponential terms)
    It is possible to obtain least squares estimates through extensive calculation on more complex formulae

Other Regression-Based Models

Generalized linear model:
    Foundation on which linear regression can be applied to modeling categorical response variables
    Variance of y is a function of the mean value of y, not a constant
    Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
    Poisson regression: models data that exhibit a Poisson distribution
Log-linear models (for categorical data):
    Approximate discrete multidimensional probability distributions
    Also useful for data compression and smoothing
Regression trees and model trees
    Trees to predict continuous values rather than class labels

Regression Trees and Model Trees

Regression tree: proposed in the CART system (Breiman et al. 1984)
    CART: Classification And Regression Trees
    Each leaf stores a continuous-valued prediction
    It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
    Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
    A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

Predictive Modeling in Multidimensional Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
    Minimal generalization
    Attribute relevance analysis
    Generalized linear model construction
    Prediction
Determine the major factors which influence the prediction
    Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis

Boosting

Analogy: consult several doctors and combine their weighted diagnoses, with each weight assigned based on previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to pay more attention to the training tuples that were misclassified by $M_i$
The final $M^*$ combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
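
The slide leaves the exact weight update open. As one possible instantiation, the sketch below assumes the AdaBoost-style rule: correctly classified tuples have their weight multiplied by error / (1 - error), the weights are renormalized, and the classifier's vote is log((1 - error) / error). The function name `adaboost_round` and the toy data are illustrative assumptions.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost-style round (an assumed instantiation of boosting).

    weights: current tuple weights, summing to 1
    correct: one boolean per tuple, True if classifier M_i got it right
    Returns the updated tuple weights and the vote weight of M_i.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    factor = error / (1.0 - error)
    # Shrink the weight of correctly classified tuples, then renormalize,
    # so misclassified tuples get relatively more attention from M_{i+1}.
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    new_weights = [w / total for w in updated]
    vote = math.log((1.0 - error) / error)
    return new_weights, vote


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the last tuple was misclassified
new_weights, vote = adaboost_round(weights, correct)
print(new_weights, vote)
```

After this round the single misclassified tuple carries half of the total weight, which is what steers the next classifier toward it.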
Classification and Prediction

What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by backpropagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary

Summary (I)

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, support vector machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms and rough set and fuzzy set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

Summary (II)

Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Significance tests and ROC curves are useful for model selection
There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic
No single method has been found to be superior over all others for all data sets
Issues such as accuracy, training time, robustness, interpretability, and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method

Review Questions

How does classification work?
How is prediction different from classification?
Define data cleaning.
List the criteria involved in comparing and evaluating classification and prediction methods.
What is a Bayesian classifier?
State Bayes' theorem.
Define backpropagation. How does it work?
Explain rule pruning.
What if we would like to predict a continuous value, rather than a categorical label?
Explain linear regression.
Explain polynomial regression.
Give a note on the bootstrap method.
What is boosting? State why it may improve the accuracy of decision tree induction.

Bibliography

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.

UNIT IV

Data Mining: Concepts and Techniques

Mining Stream, Time-Series, and Sequence Data

Mining data streams

Mining time-series data

Mining sequence patterns in transactional databases

Mining sequence patterns in biological data

Mining Data Streams

What is stream data? Why stream data systems?

Stream data management systems: issues and solutions

Stream data cube and multidimensional OLAP analysis

Stream frequent pattern analysis

Stream classification

Stream cluster analysis

Research issues

Characteristics of Data Streams

Data streams
Data streams: continuous, ordered, changing, fast, huge amount
Traditional DBMS: data stored in finite, persistent data sets

Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing; requires fast, real-time response
The data stream model captures many of today's data processing needs
Random access is expensive: single-scan algorithms (can only have one look)
Store only a summary of the data seen thus far
Most stream data are at a fairly low level or multidimensional in nature, and need multi-level and multidimensional processing

Stream Data Applications

Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even if saved, random access is too expensive)

DBMS versus DSMS

DBMS                                   DSMS
Persistent relations                   Transient streams
One-time queries                       Continuous queries
Random access                          Sequential access
"Unbounded" disk store                 Bounded main memory
Only current state matters             Historical data is important
No real-time services                  Real-time requirements
Relatively low update rate             Possibly multi-GB arrival rate
Data at any granularity                Data at fine granularity
Assume precise data                    Data stale/imprecise
Access plan determined by query        Unpredictable/variable data arrival
processor, physical DB design          and characteristics

Ack.: from Motwani's PODS tutorial slides
Architecture: Stream Query Processing

[Figure: multiple streams flow into the stream query processor of an SDMS (Stream Data Management System). The user/application issues a continuous query and receives results. The processor works with a scratch space (main memory and/or disk).]
Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, ordered streams
Main-memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multidimensional processing and data mining
Most stream data are at a low level or multidimensional in nature

Processing Stream Queries

Query types
One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
Predefined query vs. ad-hoc query (issued online)
Unbounded memory requirements
For real-time response, a main-memory algorithm should be used
The memory requirement is unbounded if a tuple may have to be joined with future tuples
Approximate query answering
With bounded memory, it is not always possible to produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets, etc.
Methodologies for Stream Data Processing

Major challenges
Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
Synopses (trade-off between accuracy and storage)
Use a synopsis data structure, much smaller ($O(\log^k N)$ space) than the base data set ($O(N)$ space)
Compute an approximate answer within a small error range (factor $\varepsilon$ of the actual answer)
Major methods
Random sampling
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
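
Random sampling, the first method listed, can be done in one pass with bounded memory. The sketch below assumes reservoir sampling, a standard technique the slide does not name explicitly. The function name `reservoir_sample` and the fixed seed are illustrative choices.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream in one pass.

    Only k items are stored at any time, regardless of stream length.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:                  # happens with probability k / (index + 1)
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=5)
print(sample)
```

The sample is a synopsis in the sense used above: its size is fixed at k while the stream may grow without bound.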
Stream Data Mining vs. Stream Querying
Stream mining: a more challenging task in many cases
It shares most of the difficulties with stream querying
But it often requires less "precision", e.g., no join, grouping, or sorting
Patterns are hidden and more general than in querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multidimensional online analysis of streams
Mining outliers and unusual patterns in stream data
Clustering data streams
Classification of stream data
Challenges for Mining Dynamics in Data Streams

Most stream data are at a fairly low level or multidimensional in nature: needs multi-level/multidimensional (ML/MD) processing
Analysis requirements
Multidimensional trends and unusual patterns
Capturing important changes at multiple dimensions/levels
Fast, real-time detection and response
Comparison with data cubes: similarities and differences

Stream (data) cube or stream OLAP: is this feasible?
Can we implement it efficiently?

A Stream Cube Architecture

A tilted time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User: watches at the o-layer and occasionally needs to drill down to the m-layer
Partial materialization of stream cubes
Full materialization: too space- and time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by "partial"?

Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream applications
e.g., network intrusion mining (Dokas et al. '02)
Mining precise frequent patterns in stream data: unrealistic
Even storing them in a compressed form, such as an FP-tree, is unrealistic
How to mine frequent patterns with good approximation?
Approximate frequent patterns (Manku & Motwani, VLDB'02)
Keep only current frequent patterns? Then no changes can be detected
Mining the evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
Use a tilted time window frame
Mining evolution and dramatic changes of frequent patterns
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

Mining Approximate Frequent Patterns

Mining precise frequent patterns in stream data: unrealistic
Even storing them in a compressed form, such as an FP-tree, is unrealistic
Approximate answers are often sufficient (e.g., trend/pattern analysis)
Example: a router is interested in all flows
whose frequency is at least 1% ($\sigma$) of the entire traffic stream seen so far,
and feels that an error of 1/10 of $\sigma$ ($\varepsilon = 0.1\%$) is comfortable
How to mine frequent patterns with good approximation?
Lossy Counting Algorithm (Manku & Motwani, VLDB'02)
Major idea: do not trace an item until it becomes frequent
Advantage: guaranteed error bound
Disadvantage: keeps a large set of traces
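
A compact sketch of lossy counting for single items follows. It rests on assumptions beyond the slide: the stream is cut into buckets of width ceil(1/ε), each tracked item stores its count plus the largest possible undercount, and low-count entries are pruned at bucket boundaries. The names `lossy_count` and `frequent_items` and the synthetic stream are illustrative.

```python
import math


def lossy_count(stream, epsilon):
    """Approximate item counts in one pass with error at most epsilon * N."""
    width = math.ceil(1 / epsilon)   # bucket width
    entries = {}                     # item -> [count, max_undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:           # bucket boundary: prune rare entries
            stale = [k for k, (c, d) in entries.items() if c + d <= bucket]
            for key in stale:
                del entries[key]
    return entries, n


def frequent_items(entries, n, sigma, epsilon):
    """Report items whose stored count is at least (sigma - epsilon) * n."""
    return {k for k, (c, _) in entries.items() if c >= (sigma - epsilon) * n}


# Synthetic stream: "a" is 30% of the traffic, "b" is 15%, the rest are unique.
stream = []
for i in range(1000):
    if i % 10 in (0, 1, 2):
        stream.append("a")
    elif i % 20 in (3, 4, 5):
        stream.append("b")
    else:
        stream.append(f"u{i}")

entries, n = lossy_count(stream, epsilon=0.01)
result = frequent_items(entries, n, sigma=0.1, epsilon=0.01)
print(sorted(result))
```

Every item whose true frequency is at least σN is reported, and no item below (σ - ε)N is reported, which is the guaranteed error bound listed as the advantage above.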
Classification for Dynamic Data Streams
Decision tree induction for stream data classification
VFDT (Very Fast Decision Tree) / CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)
Is a decision tree good for modeling fast-changing data, e.g., stock market analysis?
Other stream classification methods
Instead of decision trees, consider other models
Naive Bayesian
Ensemble (Wang, Fan, Yu, Han. KDD'03)
K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD'04)
Tilted time framework, incremental updating, dynamic maintenance, and model construction
Comparing models to find changes
Hoeffding Tree

With high probability, classifies tuples the same way as a tree learned from the full data
Only uses a small sample
Based on the Hoeffding bound principle
Hoeffding bound (additive Chernoff bound)
r: random variable
R: range of r
n: number of independent observations
The mean of r is at least $r_{avg} - \varepsilon$, with probability $1 - \delta$, where

$$\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
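Hoeffding Tree Algorithm

Hoeffding tree input
S: sequence of examples
X: attributes
G(): evaluation function
δ: desired accuracy (one minus the confidence of the Hoeffding bound)
Hoeffding tree algorithm
for each example in S
    retrieve G(Xa) and G(Xb)    // the two highest G(Xi)
    if (G(Xa) - G(Xb) > ε)      // ε is the Hoeffding bound for the examples seen so far
        split on Xa
        recurse to next node
        break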
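
The split test in the loop above depends only on the Hoeffding bound. The sketch below shows that single step, not a full tree learner. The function names `hoeffding_bound` and `should_split` and the numeric values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best, g_second, value_range, delta, n):
    """Split once the gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


# Assumed values: evaluation function with range R = 1 and delta = 1e-7.
eps_small_sample = hoeffding_bound(1.0, 1e-7, 1_000)
eps_large_sample = hoeffding_bound(1.0, 1e-7, 100_000)
print(eps_small_sample, eps_large_sample)

# A gap of 0.05 is not yet convincing after 1,000 examples,
# but it is after 100,000 examples.
print(should_split(0.30, 0.25, 1.0, 1e-7, 1_000))
print(should_split(0.30, 0.25, 1.0, 1e-7, 100_000))
```

Because ε shrinks as n grows, a node with two nearly tied attributes may wait a long time before splitting.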

Decision Tree Induction with Data Streams

[Figure: a decision tree grown incrementally from a data stream. The first tree tests Packets > 10 (yes/no) and then Protocol = http. After more stream data arrives, the tree also tests Bytes > 60K and Protocol = ftp.]

Ack.: from Gehrke's SIGMOD tutorial slides


Hoeffding Tree: Strengths and Weaknesses

Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Makes class predictions in parallel
New examples are added as they come
Weaknesses
Could spend a lot of time with ties
Memory used with tree expansion
Number of candidate attributes

Ensemble of Classifiers Algorithm

H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", KDD'03.
Method (derived from the ensemble idea in classification)
train K classifiers from K chunks
for each subsequent chunk
    train a new classifier
    test the other classifiers against the chunk
    assign a weight to each classifier
    select the top K classifiers

Clustering Data Streams [GMMO01]

Based on the k-median method
Data stream points come from a metric space
Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized
Constant-factor approximation algorithm
In small space, a simple two-step algorithm:
1. For each set of M records, Si, find O(k) centers in S1, ..., Sl
   Local clustering: assign each point in Si to its closest center
2. Let S' be the centers for S1, ..., Sl, with each center weighted by the number of points assigned to it
   Cluster S' to find k centers

Hierarchical Clustering Tree

[Figure: a tree with the data points at the bottom, level-i medians above them, and level-(i+1) medians at the top.]
Hierarchical Tree and Drawbacks

Method:
Maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1) medians, each with weight equal to the sum of the weights of the intermediate medians assigned to it
Drawbacks:
Low quality for evolving data streams (registers only k centers)
Limited functionality in discovering and exploring clusters over different portions of the stream over time

Summary: Stream Data Mining

Stream data mining: a rich and ongoing research field
Current research focus in the database community:
DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual patterns
Effectiveness, efficiency and scalability: lots of open problems
Our philosophy on stream data analysis and mining
A multidimensional stream analysis framework
Time is a special dimension: tilted time frame
What to compute and what to save? Critical layers
Partial materialization and precomputation
Mining the dynamics of stream data
Mining Time-Series Data

Time-Series and Sequential Pattern Mining

Regression and trend analysis: a statistical approach
Similarity search in time-series analysis
Sequential pattern mining
Markov chain
Hidden Markov model

Mining Time-Series Data

Time-series database
Consists of sequences of values or events changing with time
Data is recorded at regular intervals
Characteristic time-series components
Trend, cycle, seasonal, irregular
Applications
Financial: stock price, inflation
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation
Categories of Time-Series Movements

Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time
Cyclic movements or cyclic variations: long-term oscillations about a trend line or curve
e.g., business cycles, which may or may not be periodic
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
Irregular or random movements
Time-series analysis: decomposition of a time series into these four basic movements
Additive model: $TS = T + C + S + I$
Multiplicative model: $TS = T \times C \times S \times I$
Estimation of Trend Curve

The freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining
The least-squares method
Find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points
The moving-average method
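
The moving-average method is only named on this slide. As a minimal sketch, assuming a simple unweighted moving average of order n (the helper name `moving_average` and the sample series are illustrative), each output value is the mean of n consecutive observations, which smooths out short-term fluctuations.

```python
def moving_average(values, order):
    """Return the simple moving average of the given order."""
    if order < 1 or order > len(values):
        raise ValueError("order must be between 1 and len(values)")
    return [
        sum(values[i:i + order]) / order
        for i in range(len(values) - order + 1)
    ]


series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
smoothed = moving_average(series, order=3)
print(smoothed)
```

A larger order gives a smoother curve but loses more points at the ends of the series and reacts more slowly to real changes in the trend.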

Trend Discovery in Time Series (1): Estimation of Seasonal Variations

Seasonal index
Set of numbers showing the relative values of a variable during the months of the year
E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months
Deseasonalized data
Data adjusted for seasonal variations for better trend and cyclic analysis
Divide the original monthly data by the seasonal index numbers for the corresponding months
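
The two definitions above translate directly into code. This is a sketch under simplifying assumptions: the index is estimated from a single year of made-up monthly sales, whereas in practice it would be averaged over several years. The names `seasonal_indexes` and `deseasonalize` are illustrative.

```python
def seasonal_indexes(monthly_values):
    """Seasonal index: each month as a percentage of the monthly average."""
    average = sum(monthly_values) / len(monthly_values)
    return [100.0 * value / average for value in monthly_values]


def deseasonalize(monthly_values, indexes):
    """Divide each month by its seasonal index (expressed as a fraction)."""
    return [value * 100.0 / index for value, index in zip(monthly_values, indexes)]


# Hypothetical sales for January..December; the monthly average is 100.
sales = [90, 70, 100, 100, 100, 100, 100, 100, 100, 80, 120, 140]
indexes = seasonal_indexes(sales)
print(indexes[9:])  # October, November, December

# A later year with the same seasonal shape but 10% higher overall sales.
next_year = [value * 1.1 for value in sales]
adjusted = deseasonalize(next_year, indexes)
print([round(v, 6) for v in adjusted])
```

After deseasonalizing, the second year shows a flat level of about 110 per month, so the 10% growth is visible without the seasonal swings.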

Trend Discovery in Time Series (2)

Estimation of cyclic variations
If (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic variations
With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality
Similarity Search in Time-Series Analysis

A normal database query finds exact matches
Similarity search finds data sequences that differ only slightly from the given query sequence
Two categories of similarity queries
Whole matching: find a sequence that is similar to the query sequence
Subsequence matching: find all pairs of similar sequences
Typical applications
Financial market
Market basket data analysis
Scientific databases
Medical diagnosis
Data Transformation

Many techniques for signal analysis require the data to be in the frequency domain
Usually data-independent transformations are used
The transformation matrix is determined a priori
Discrete Fourier transform (DFT)
Discrete wavelet transform (DWT)
The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
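
The distance-preservation property can be checked numerically. The sketch below assumes the orthonormal (1/sqrt(n)-scaled) form of the DFT, which is the scaling under which distances match exactly. The function names and the two sample signals are illustrative.

```python
import cmath
import math


def dft(signal):
    """Orthonormal discrete Fourier transform (scaled by 1/sqrt(n))."""
    n = len(signal)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        for k in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 1.5]
y = [0.5, 1.0, 1.5, -2.0, 2.0, 1.0, 0.0, 0.5]

fx, fy = dft(x), dft(y)
d_time = euclidean(x, y)
d_freq = euclidean(fx, fy)
d_first3 = euclidean(fx[:3], fy[:3])  # distance on the first 3 coefficients only
print(d_time, d_freq, d_first3)
```

Because the distance over a subset of coefficients can never exceed the full distance, indexing only the first few coefficients may return false positives but no false dismissals, which is why such transforms are useful for similarity search.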

Mining Sequence Patterns in Transactional Databases

Sequence Databases & Sequential Patterns

Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
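
The subsequence relation and the support count from this slide can be verified with a short script. This is a sketch: the helper names `parse`, `is_subsequence`, and `support` are illustrative, and the parser assumes single-character items written in the slide's angle-bracket notation.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of item sets (elements)."""
    body = text.strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if each pattern element is contained, in order, in some element."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1  # greedy earliest match is sufficient for containment
    return pos == len(pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


database = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

print(is_subsequence(parse("<a(bc)dc>"), database[10]))
print(support(parse("<(ab)c>"), database))
```

With min_sup = 2, the pattern <(ab)c> qualifies because it is contained in sequences 10 and 30.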
Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
Sequential Pattern Mining Algorithms

Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns, ICDE'95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

The Apriori Property of Sequential Patterns

A basic property: Apriori (Agrawal & Srikant'94)
If a sequence S is not frequent,
then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, and so are <hab> and <(ah)b>

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2

The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001
A vertical-format sequential pattern mining method
A sequence database is mapped to a large set of entries of the form Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation

Mining Sequence Patterns in Biological Data

Mining Sequence Patterns in Biological Data

A brief introduction to biology and bioinformatics

Alignment of biological sequences

Hidden Markov model for biological sequence analysis

Summary

Biology Fundamentals (1): DNA Structure

DNA: helix-shaped molecule whose constituents are two parallel strands of nucleotides
Nucleotides (bases): Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
DNA is usually represented by sequences of these four nucleotides
This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's, and vice versa

Biology Fundamentals (2): Genes

Gene: contiguous subparts of single-strand DNA that are templates for producing proteins. Genes can appear on either of the DNA strands.
Chromosomes: compact chains of coiled DNA
Genome: the set of all genes in a given organism.
Noncoding part: the function of DNA material between genes is largely unknown. Certain intergenic regions of DNA are known to play a major role in cell regulation (which controls the production of proteins and their possible interactions with DNA).

Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
Biology Fundamentals (3): Transcription

Proteins: produced from DNA using 3 operations or transformations: transcription, splicing and translation
In eukaryotes (cells with a nucleus): genes are only a minute part of the total DNA
In prokaryotes (cells without a nucleus): the phase of splicing does not occur (no pre-RNA is generated)
DNA is capable of replicating itself (DNA polymerase)
Central dogma: the capability of DNA for replication and for undergoing the three (or two) transformations
Genes are transcribed into pre-RNA by a complex ensemble of molecules (RNA polymerase). During transcription T is substituted by the letter U (for uracil).
Pre-RNA can be represented by alternations of sequence segments called exons and introns. The exons represent the parts of pre-RNA that will be expressed, i.e., translated into proteins.

Biology Fundamentals (4): Proteins
Splicing (by the spliceosome, an ensemble of proteins): concatenates the exons and excises the introns to form mRNA (or simply RNA)
Translation (by ribosomes, an ensemble of RNA and proteins)
Repeatedly considers a triplet of consecutive nucleotides (called a codon) in RNA and produces one corresponding amino acid
In RNA, there is one special codon called the start codon and a few others called stop codons
An Open Reading Frame (ORF): a sequence of codons starting with a start codon and ending with a stop codon. The ORF is thus a sequence of nucleotides that is used by the ribosome to produce the sequence of amino acids that makes up a protein.
There are basically 20 amino acids (A, L, V, S, ...), but in certain rare situations others can be added to that list.
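
The ORF definition can be made concrete with a small scanner. The slide does not name the codons; the sketch below uses the standard genetic code, in which AUG is the start codon and UAA, UAG, and UGA are the stop codons. The function name `find_orf` and the sample RNA string are illustrative assumptions.

```python
START_CODON = "AUG"                  # standard start codon
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard stop codons


def find_orf(rna):
    """Return the codons of the first ORF in the RNA string, or None.

    Reading starts at the first start codon and proceeds triplet by
    triplet until a stop codon is met in the same reading frame.
    """
    start = rna.find(START_CODON)
    if start == -1:
        return None
    codons = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            return codons
    return None  # no in-frame stop codon: not a complete ORF


orf = find_orf("GGAUGGCCUUUUAAGG")
print(orf)
```

This only inspects the first start codon on one strand; a real gene finder would examine every start position and all reading frames on both strands.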

Biological Information: From Genes to Proteins

[Figure: gene (DNA) -> transcription -> RNA -> translation -> protein -> protein folding. The stages are labeled with the fields that study them: genomics, molecular biology, structural biology, biophysics.]

Biology Fundamentals (5): 3D Structure

Since there are 64 different codons and 20 amino acids, the table look-up for translating each codon into an amino acid is redundant: multiple codons can produce the same amino acid
The table used by nature to perform translation is called the genetic code
Due to the redundancy of the genetic code, certain nucleotide changes in DNA may not alter the resulting protein
Once a protein is produced, it folds into a unique structure in 3D space, with 3 types of components: helices, sheets and coils
The secondary structure of a protein is its sequence of amino acids, annotated to distinguish the boundary of each component
The tertiary structure is its 3D representation

Biological Data Available

The vast majority of data are sequences of symbols (nucleotides: genomic data, but also a good amount on amino acids)
Next in volume: microarray experiments and also protein array data
Comparably small: 3D structures of proteins (PDB)
NCBI (National Center for Biotechnology Information) server:
Total 26B bp: 3B bp human genome, then several bacteria (e.g., E. coli), higher organisms: yeast, worm, fruit fly, mouse, and plants
The largest known gene has ~20 million bp and the largest protein consists of ~34k amino acids
PDB has a catalogue of only 45k proteins, specified by their 3D structure (i.e., there is a need to infer protein shape from sequence data)

Bioinformatics

Computational management and analysis of biological information
Interdisciplinary field (molecular biology, statistics, computer science, genomics, genetics, databases, chemistry, radiology)
Bioinformatics vs. computational biology (the latter focuses more on algorithm correctness, complexity and other themes central to theoretical CS)

[Figure: bioinformatics and its subareas: genomics, functional genomics, proteomics, structural bioinformatics.]
Data Mining & Bioinformatics: Why?

Many biological processes are not well understood
Biological knowledge is highly complex, imprecise, descriptive, and experimental
Biological data is abundant and information-rich
Genomics & proteomics data (sequences), microarray and protein arrays, protein database (PDB), bio-testing data
Huge data banks, rich literature, openly accessible
Largest and richest scientific data sets in the world
Mining: gain biological insight (data/information -> knowledge)
Mining for correlations, linkages between disease and gene sequences, protein networks, classification, clustering, outliers, ...
Find correlations among linkages in literature and heterogeneous databases

Data Mining & Bioinformatics: How (1)

Data integration: handling heterogeneous, distributed bio-data
Build Web-based, interchangeable, integrated, multidimensional genome databases
Data cleaning and data integration methods become crucial
Mining correlated information across multiple databases itself becomes a data mining task
Typical studies: mining database structures, information extraction from data, reference reconciliation, document classification, clustering and correlation discovery algorithms, ...

Data Mining & Bioinformatics: How (2)

Mastery and exploration of existing data mining tools
Genomics, proteomics, and functional genomics (functional networks of genes and proteins)
What are the current bioinformatics tools aiming for?
Inferring a protein's shape and function from a given sequence of amino acids
Finding all the genes and proteins in a given genome
Determining sites in the protein structure where drug molecules can be attached

Data Mining & Bioinformatics: How (3)
Research and development of new tools for bioinformatics
Similarity search and comparison between classes of genes (e.g., diseased and healthy) by finding and comparing frequent patterns
Identify sequential patterns that play roles in various diseases
New clustering and classification methods for microarray data and protein-array data analysis
Mining, indexing and similarity search in sequential and structured (e.g., graph and network) data sets
Path analysis: linking genes/proteins to different disease development stages
Develop pharmaceutical interventions that target the different stages separately
High-dimensional analysis and OLAP mining
Visualization tools and genetic/proteomic data analysis

Algorithms Used in Bioinformatics

Comparing sequences: comparing large numbers of long sequences, allowing insertions/deletions/mutations of symbols
Constructing evolutionary (phylogenetic) trees: comparing sequences of different organisms, and building trees based on their degree of similarity (evolution)
Detecting patterns in sequences
Search for genes in DNA or subcomponents of a sequence of amino acids
Determining 3D structures from sequences
E.g., infer RNA shape from its sequence and protein shape from its amino acid sequence
Inferring cell regulation:
Cell modeling from experimental (say, microarray) data
Determining protein function and metabolic pathways: interpret human annotations for protein function and develop graph databases that can be queried
Assembling DNA fragments (provided by sequencing machines)
Using script languages: scripts on the Web to analyze data and applications
Comparing Sequences

All living organisms are related through evolution
Alignment: lining up sequences to achieve the maximal level of identity
Two sequences are homologous if they share a common ancestor
Sequences to be compared: either nucleotides (DNA/RNA) or amino acids (proteins)
Nucleotides: aligned symbols match if they are identical
Amino acids: aligned symbols match if they are identical, or if one can be derived from the other by substitutions that are likely to occur in nature
Local vs. global alignments: local means only portions of the sequences are aligned; global means aligning over the entire length of the sequences
Use a gap "-" to indicate that it is preferable not to align two symbols
Percent identity: ratio between the number of columns containing identical symbols and the number of symbols in the longest sequence
Score of alignment: summing up the matches and counting gaps as negative

Sequence Alignment: Problem Definition

Goal:
Given two or more input sequences
Identify similar sequences with long conserved subsequences
Method:
Use substitution matrices (probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions)
Optimal alignment problem: NP-hard
Heuristic methods to find good alignments

Pairwise Sequence Alignment

Example: HEAGAWGHEE and PAWHEAE

  Alignment 1:  HEAGAWGHE-E     Alignment 2:  HEAGAWGHE-E
                P-A--W-HEAE                   --P-AW-HEAE

Which one is better? Scoring alignments
To compare two sequence alignments, calculate a score
  PAM (Percent Accepted Mutation) or BLOSUM (Blocks Substitution Matrix) substitution matrices: score matches and mismatches, taking amino acid substitution into account
  Gap penalty: initiating a gap
  Gap extension penalty: extending a gap

Pairwise Sequence Alignment: Scoring Matrix

       A    E    G    H    W
  A    5   -1    0   -2   -3
  E   -1    6   -3    0   -3
  H   -2    0   -2   10   -3
  P   -1   -1   -2   -2   -4
  W   -3   -3   -3   -3   15

Gap penalty: 8
Gap extension: 8
(every gap column therefore contributes -8)

Score of the alignment
  HEAGAWGHE-E
  --P-AW-HEAE
taken column by column, with all five gap columns counted:
  (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: calculate the score for
  HEAGAWGHE-E
  P-A--W-HEAE
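
The scoring rule on this slide can be sketched in a few lines of Python. This is an illustration only: the names `SCORES`, `GAP`, `alignment_score`, and `percent_identity` are invented for the example, the matrix entries are copied from the table on the slide, and every gap column is charged the same penalty because the slide gives the same value for opening and extending a gap.

```python
# Substitution scores from the slide, indexed as SCORES[row][column]:
# rows are symbols of the second sequence, columns symbols of the first.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # assumption: opening and extending a gap cost the same


def alignment_score(top, bottom):
    """Sum the column scores of two aligned strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for t, b in zip(top, bottom):
        if t == "-" or b == "-":
            total += GAP           # a gap in either row
        else:
            total += SCORES[b][t]  # substitution (or match) score
    return total


def percent_identity(top, bottom):
    """Identical columns divided by the length of the longest sequence."""
    identical = sum(1 for t, b in zip(top, bottom) if t == b and t != "-")
    longest = max(len(top.replace("-", "")), len(bottom.replace("-", "")))
    return identical / longest


if __name__ == "__main__":
    top, bottom = "HEAGAWGHE-E", "P-A--W-HEAE"
    print(alignment_score(top, bottom))   # 0
    print(percent_identity(top, bottom))  # 0.5
```

With a real PAM or BLOSUM matrix the lookup table would cover all 20 amino acids, and an affine gap model would charge the first gap column differently from the following ones.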

Heuristic Alignment Algorithms

Motivation: the complexity of exact alignment algorithms is O(nm)
  Current protein DB: 100 million base pairs
  Matching each sequence with a 1,000 base pair query takes about 3 hours!
Heuristic algorithms aim at speeding up the search at the price of possibly missing the best-scoring alignment
Two well-known programs
  BLAST: Basic Local Alignment Search Tool
  FASTA: Fast Alignment Tool
Both find high-scoring local alignments between a query sequence and a target database
Basic idea: first locate high-scoring short stretches and then extend them
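
For reference, the exact O(nm) computation that these heuristics try to avoid can be sketched as a dynamic program. The sketch below computes a global (Needleman-Wunsch style) alignment score, not the local alignments that BLAST and FASTA report, and its scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for readability rather than anything taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b.

    Fills an (len(a)+1) x (len(b)+1) table row by row, so the running time
    is O(n*m); only the previous row is kept in memory.
    """
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # aligning "" against b[:j]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m          # aligning a[:i] against ""
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(
                diagonal,           # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] against a gap
                curr[j - 1] + gap,  # b[j-1] against a gap
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(global_alignment_score("HEAGAWGHEE", "PAWHEAE"))
```

Running this table-filling step against every sequence in a large database is what makes exact search slow, which is the motivation given above for seeding on short high-scoring stretches first.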

Motivation for Markov Models in Computational Biology

There are many cases in which we would like to represent the statistical regularities of some class of sequences
  genes
  various regulatory sites in DNA (e.g., where RNA polymerase and transcription factors bind)
  proteins in a given family
Markov models are well suited to this type of task

A Markov Chain Model

Transition probabilities
  $\Pr(x_i = a \mid x_{i-1} = g) = 0.16$
  $\Pr(x_i = c \mid x_{i-1} = g) = 0.34$
  $\Pr(x_i = g \mid x_{i-1} = g) = 0.38$
  $\Pr(x_i = t \mid x_{i-1} = g) = 0.12$

  $\sum_{x_i} \Pr(x_i \mid x_{i-1} = g) = 1$

Definition of Markov Chain Model

A Markov chain model is defined by
  a set of states
    some states emit symbols
    other states (e.g., the begin state) are silent
  a set of transitions with associated probabilities
    the transitions emanating from a given state define a distribution over the possible next states

Markov Chain Models: Properties

Given some sequence x of length L, we can ask how probable the sequence is given our model
For any probabilistic model of sequences, we can write this probability as
  $\Pr(x) = \Pr(x_L, x_{L-1}, \ldots, x_1)$
  $\quad = \Pr(x_L \mid x_{L-1}, \ldots, x_1) \Pr(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots \Pr(x_1)$
Key property of a (first-order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$
  $\Pr(x) = \Pr(x_L \mid x_{L-1}) \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1) \Pr(x_1)$
  $\quad = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$

The Probability of a Sequence for a Markov Chain Model

$\Pr(cggt) = \Pr(c) \Pr(g \mid c) \Pr(g \mid g) \Pr(t \mid g)$
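
The product above is straightforward to evaluate once a model is fixed. In the following sketch only the transition row for state g uses the probabilities quoted on the earlier slide; the other rows and the initial distribution are uniform placeholders invented for the example, as are the names `TRANSITIONS`, `INITIAL`, and `sequence_probability`.

```python
UNIFORM = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

# Pr(next symbol | current symbol). Only the "g" row comes from the slides;
# the remaining rows are placeholder values for illustration.
TRANSITIONS = {
    "a": dict(UNIFORM),
    "c": dict(UNIFORM),
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},
    "t": dict(UNIFORM),
}
INITIAL = dict(UNIFORM)  # placeholder for Pr(x_1)


def sequence_probability(seq, initial=INITIAL, transitions=TRANSITIONS):
    """Pr(x) = Pr(x_1) * prod_{i>=2} Pr(x_i | x_{i-1}) for a non-empty seq."""
    prob = initial[seq[0]]
    for previous, current in zip(seq, seq[1:]):
        prob *= transitions[previous][current]
    return prob


if __name__ == "__main__":
    # Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g) = 0.25 * 0.25 * 0.38 * 0.12
    print(sequence_probability("cggt"))
```

For long sequences the product underflows quickly, so practical implementations usually sum log-probabilities instead.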

Algorithms for Learning & Prediction

Learning
  correct path known for each training sequence -> simple maximum likelihood or Bayesian estimation
  correct path not known -> Forward-Backward algorithm + ML or Bayesian estimation
Classification
  simple Markov model -> calculate probability of sequence along single path for each model
  hidden Markov model -> Forward algorithm to calculate probability of sequence along all paths for each model
Segmentation
  hidden Markov model -> Viterbi algorithm to find most probable path for sequence
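
A compact sketch may help contrast the Forward and Viterbi algorithms named above: the first sums over all state paths, the second keeps only the best one. The two-state model below (states "A" and "B", symbols "x" and "y") and all of its probabilities are invented for illustration and do not come from the slides.

```python
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.2, "y": 0.8}}


def forward(obs):
    """Probability of the observations, summed over all state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        f = {
            s: EMIT[s][symbol] * sum(f[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(f.values())


def viterbi(obs):
    """Most probable state path for the observations, with its probability."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    back = []  # one dict of back-pointers per step after the first
    for symbol in obs[1:]:
        pointers, nxt = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            pointers[s] = best_prev
            nxt[s] = v[best_prev] * TRANS[best_prev][s] * EMIT[s][symbol]
        back.append(pointers)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    print(forward("xy"))  # total probability over all paths
    print(viterbi("xy"))  # best single path and its probability
```

The two functions differ only in replacing the sum over predecessor states with a maximum plus a back-pointer, which is why they share the same O(L * |states|^2) cost.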

Summary: Mining Biological Data

Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences (sequences of nucleotides or amino acids)
Biosequence analysis can be partitioned into two essential tasks: pairwise sequence alignment and multiple sequence alignment
Dynamic programming and fast heuristic tools (notably, BLAST) have been popularly used for sequence alignments
Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state
  Given a sequence of symbols, x, the forward algorithm finds the probability of obtaining x in the model
  The Viterbi algorithm finds the most probable path (corresponding to x) through the model
  The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences

Graph Mining

Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
  Graph Indexing
  Similarity Search
  Classification and Clustering
Summary

Why Graph Mining?

Graphs are ubiquitous
  Chemical compounds (cheminformatics)
  Protein structures, biological pathways/networks (bioinformatics)
  Program control flow, traffic flow, and workflow analysis
  XML databases, Web, and social network analysis
A graph is a general model
  Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
  Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2D/3D)
Complexity of algorithms: many problems are of high complexity

Graph, Graph, Everywhere

[Figures: aspirin (chemical compound); yeast protein interaction network, from H. Jeong et al., Nature 411, 41 (2001); Internet; co-author network]

Graph Pattern Mining

Frequent subgraphs
  A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
Applications of graph pattern mining
  Mining biochemical structures
  Program control flow analysis
  Mining XML structures or Web communities
  Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Graph Mining Algorithms

Incomplete beam search: greedy (Subdue)
Inductive logic programming (WARMR)
Graph theory-based approaches
  Apriori-based approach
  Pattern-growth approach

SUBDUE (Holder et al., KDD'94)

Start with single vertices
Expand best substructures with a new edge
Limit the number of best substructures
  Substructures are evaluated based on their ability to compress input graphs
  Using minimum description length (DL)
  Best substructure S in graph G minimizes: DL(S) + DL(G\S)
Terminate when no new substructure is discovered

Properties of Graph Mining Algorithms

Search order
  breadth vs. depth
Generation of candidate subgraphs
  apriori vs. pattern growth
Elimination of duplicate subgraphs
  passive vs. active
Support calculation
  embedding store or not
Discover order of patterns
  path -> tree -> graph

Apriori-Based Approach

[Figure: frequent k-edge subgraphs (G1, G2, ..., Gn) are joined (JOIN) to form candidate (k+1)-edge graphs]

Apriori-Based, Breadth-First Search

Methodology: breadth-first search, joining two graphs
AGM (Inokuchi et al., PKDD'00)
  generates new graphs with one more node
FSG (Kuramochi and Karypis, ICDM'01)
  generates new graphs with one more edge

Graph Pattern Explosion Problem

If a graph is frequent, all of its subgraphs are frequent (the Apriori property)
An n-edge frequent graph may have $2^n$ subgraphs
Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

Constrained Patterns

Density
Diameter
Connectivity
Degree
  Min, Max, Avg

Constraint-Based Graph Pattern Mining

Highly connected subgraphs in a large graph usually are not artifacts (group, functionality)
Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph

Graph Clustering

Graph similarity measure
Feature-based similarity measure
  Each graph is represented as a feature vector
  The similarity is defined by the distance of their corresponding vectors
  Frequent subgraphs can be used as features
Structure-based similarity measure
  Maximal common subgraph
  Graph edit distance: insertion, deletion, and relabel
Graph alignment distance

Graph Classification

Local structure-based approach
  Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph pattern-based approach
  Subgraph patterns from domain knowledge
  Subgraph patterns from data mining
Kernel-based approach
  Random walk (Gärtner '02; Kashima et al. '02, ICML'03; Mahé et al., ICML'04)
  Optimal local assignment (Fröhlich et al., ICML'05)
Boosting (Kudo et al., NIPS'04)

Graph Pattern-Based Classification

Subgraph patterns from domain knowledge
  Molecular descriptors
Subgraph patterns from data mining
General idea
  Each graph is represented as a feature vector $x = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the frequency of the i-th pattern in that graph
  Each vector is associated with a class label
  Classify these vectors in a vector space
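
The general idea can be sketched as follows, under strong simplifying assumptions. The patterns here are the simplest possible subgraphs, single labeled edges, because counting occurrences of larger patterns requires subgraph isomorphism testing; the 1-nearest-neighbour rule stands in for any vector-space classifier; and the graphs, class labels, and function names are all hypothetical.

```python
import math


def edge_pattern(label_u, label_v):
    """Canonical form of an undirected labeled edge, e.g. ('C', 'O')."""
    return tuple(sorted((label_u, label_v)))


def feature_vector(graph, patterns):
    """x_i = number of occurrences of the i-th edge pattern in the graph.

    A graph is a pair (labels, edges): labels maps vertex id -> label and
    edges is a list of (u, v) vertex-id pairs.
    """
    labels, edges = graph
    counts = dict.fromkeys(patterns, 0)
    for u, v in edges:
        pattern = edge_pattern(labels[u], labels[v])
        if pattern in counts:
            counts[pattern] += 1
    return [counts[p] for p in patterns]


def classify(graph, training, patterns):
    """Label of the training graph whose feature vector is closest."""
    x = feature_vector(graph, patterns)
    _, label = min(
        ((math.dist(x, feature_vector(g, patterns)), y) for g, y in training),
        key=lambda pair: pair[0],
    )
    return label


if __name__ == "__main__":
    patterns = [("C", "C"), ("C", "O")]
    g1 = ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])  # two C-O edges
    g2 = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2)])  # two C-C edges
    training = [(g1, "active"), (g2, "inactive")]      # made-up class labels
    query = ({0: "C", 1: "O", 2: "C", 3: "C"}, [(0, 1), (1, 2), (2, 3)])
    print(feature_vector(query, patterns), classify(query, training, patterns))
```

In practice the pattern list would come from a frequent subgraph miner or from domain knowledge such as molecular descriptors, and the resulting vectors would be handed to a standard classifier.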

Graph Search

Querying graph databases:
  Given a graph database and a query graph, find all the graphs containing this query graph

[Figure: a query graph and a graph database]

Scalability Issue

Sequential scan
  Disk I/Os
  Subgraph isomorphism testing
An indexing mechanism is needed
  DayLight: Daylight.com (commercial)
  GraphGrep: Dennis Shasha et al., PODS'02
  Grace: Srinath Srinivasa et al., ICDE'03

Summary: Graph Mining

Graph mining has wide applications
Frequent and closed subgraph mining methods
  gSpan and CloseGraph: pattern-growth, depth-first search approach
Graph indexing techniques
  Frequent and discriminative subgraphs are high-quality indexing features
Similarity search in graph databases
  Indexing and feature-based matching
Further development and application exploration

Social Network Analysis

Social Network Introduction
Statistics and Probability Theory
Models of Social Network Generation
Networks in Biological System
Mining on Social Network
Summary

Complex systems
Made of many non-identical elements connected by diverse interactions.

NETWORK

Natural Networks and Universality

Consider many kinds of networks:
  social, technological, business, economic, content, ...
These networks tend to share certain informal properties:
  large scale; continual growth
  distributed, organic growth: vertices decide who to link to
  interaction restricted to links
  mixture of local and long-distance connections
  abstract notions of distance: geographical, content, social, ...
Do natural networks share more quantitative universals?
What would these universals be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis

Some Interesting Quantities

Connected components:
  how many, and how large?
Network diameter:
  maximum (worst-case) or average?
  exclude infinite distances? (disconnected components)
  the small-world phenomenon
Clustering:
  to what extent do links tend to cluster locally?
  what is the balance between local and long-distance connections?
  what roles do the two types of links play?
Degree distribution:
  what is the typical degree in the network?
  what is the overall distribution?
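
Each of these quantities can be computed directly on a small graph. The sketch below assumes an undirected graph stored as a dict that maps every vertex to the set of its neighbours; the function names and the toy network are made up for the example. The diameter is the worst-case variant and ignores infinite distances between disconnected components.

```python
from collections import Counter, deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_components(adj):
    """List of vertex sets, one per connected component."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            component = set(bfs_distances(adj, v))
            seen |= component
            components.append(component)
    return components


def diameter(adj):
    """Longest finite shortest-path distance (unreachable pairs ignored)."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


def average_clustering(adj):
    """Mean fraction of a vertex's neighbour pairs that are themselves linked."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k >= 2:
            linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
            total += linked / (k * (k - 1) / 2)
    return total / len(adj)


def degree_distribution(adj):
    """Counter mapping degree -> number of vertices with that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


if __name__ == "__main__":
    # A triangle 0-1-2 with a pendant vertex 3, plus a separate edge 4-5.
    toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: {5}, 5: {4}}
    print(sorted(len(c) for c in connected_components(toy)))
    print(diameter(toy), average_clustering(toy), degree_distribution(toy))
```

Running one breadth-first search per vertex costs O(V * (V + E)), which is fine for a toy graph but is usually replaced by sampling on networks of Web scale.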

A Canonical Natural Network Has

Few connected components:
  often only 1 or a small number, independent of network size
Small diameter:
  often a constant independent of network size (like 6)
  or perhaps growing only logarithmically with network size, or even shrinking?
  typically exclude infinite distances
A high degree of clustering:
  considerably more so than for a random network
  in tension with small diameter
A heavy-tailed degree distribution:
  a small but reliable number of high-degree vertices
  often of power-law form

Probabilistic Models of Networks

All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
  size of network generated
  average degree of a vertex
  fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
  with high probability, diameter is small
  on average, degree distribution has heavy tail
Thus, we're going to need some basic statistics and probability theory

World Wide Web

Nodes: WWW documents
Links: URL links
800 million documents (S. Lawrence, 1999)
ROBOT: collects all URLs found in a document and follows them recursively

R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999)
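
World Wide Web

Expected result:
  $\langle k \rangle \sim 6$
  $P(k = 500) \sim 10^{-99}$
  $N_{WWW} \sim 10^{9}$
  $N(k = 500) \sim 10^{-90}$
Real result:
  $P_{out}(k) \sim k^{-\gamma_{out}}$ with $\gamma_{out} = 2.45$
  $P_{in}(k) \sim k^{-\gamma_{in}}$ with $\gamma_{in} = 2.1$
  $P(k = 500) \sim 10^{-6}$
  $N_{WWW} \sim 10^{9}$
  $N(k = 500) \sim 10^{3}$

J. Kleinberg et al., Proceedings of the ICCC (1999)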

World Wide Web

[Figure: example network with 7 nodes; shortest path lengths $l_{15} = 2$ (path 1-2-5) and $l_{17} = 4$ (path 1-3-4-6-7); what is the average $\langle l \rangle$?]
Finite size scaling: create a network with N nodes with $P_{in}(k)$ and $P_{out}(k)$
  $\langle l \rangle = 0.35 + 2.06 \log(N)$
19 degrees of separation, based on 800 million webpages [S. Lawrence et al., Nature (99)]
[Figure: $\langle l \rangle$ as a function of N, with data labeled nd.edu (R. Albert et al., Nature (99)) and IBM (A. Broder et al., WWW9 (00))]
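
The "19 degrees of separation" figure follows from plugging the size of the Web into the fitted formula. The short check below assumes the logarithm is base 10, which the slide does not state explicitly, and uses the 800 million document estimate quoted earlier; the function name is invented for the example.

```python
import math


def average_path_length(n_documents):
    """Fitted average shortest-path length, assuming a base-10 logarithm."""
    return 0.35 + 2.06 * math.log10(n_documents)


if __name__ == "__main__":
    estimate = average_path_length(800_000_000)
    print(round(estimate, 2), round(estimate))  # 18.69 19
```

Because the dependence on N is logarithmic, a tenfold increase in the number of documents adds only about two clicks to the average distance.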

What does that mean?

[Figure: Poisson distribution (exponential network) vs. power-law distribution (scale-free network)]


Scale-free Networks

The number of nodes (N) is not fixed
  Networks continuously expand by the addition of new nodes
    WWW: addition of new nodes
    Citation: publication of new papers
The attachment is not uniform
  A node is linked with higher probability to a node that already has a large number of links
    WWW: new documents link to well-known sites (CNN, Yahoo, Google)
    Citation: well-cited papers are more likely to be cited again
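
The two ingredients above, growth and non-uniform (preferential) attachment, can be combined into a small generator. This is a sketch in the spirit of the Barabasi-Albert model rather than a faithful reproduction of any published procedure: the seed clique, the parameter `m` (links added per new node), and the function names are assumptions made for the example.

```python
import random
from collections import Counter


def preferential_attachment(n, m=2, seed=0):
    """Grow a graph to n nodes; each new node links to m existing nodes
    chosen with probability proportional to their current degree."""
    if n < m + 1:
        raise ValueError("need at least m + 1 nodes")
    rng = random.Random(seed)
    # Growth starts from a small clique so every node has degree >= m.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform
    # draw from `stubs` is a degree-proportional draw over nodes.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((new, target))
            stubs.extend((new, target))
    return edges


def degrees(edges):
    """Counter mapping node -> degree."""
    return Counter(v for edge in edges for v in edge)


if __name__ == "__main__":
    deg = degrees(preferential_attachment(2000, m=2, seed=42))
    print("largest degrees:", sorted(deg.values(), reverse=True)[:5])
    print("median degree:", sorted(deg.values())[len(deg) // 2])
```

On a typical run a handful of early nodes end up with degrees far above the median, which is the heavy-tailed behaviour the slide describes; the exact numbers depend on the random seed.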

Case 1: Internet Backbone

Nodes: computers, routers
Links: physical lines
(Faloutsos, Faloutsos and Faloutsos, 1999)

Information on the Social Network

Heterogeneous, multi-relational data represented as a graph or network
  Nodes are objects
    May have different kinds of objects
    Objects have attributes
    Objects may have labels or classes
  Edges are links
    May have different kinds of links
    Links may have attributes
    Links may be directed, are not required to be binary
Links represent relationships and interactions between objects: rich content for mining

What is New for Link Mining Here

Traditional machine learning and data mining approaches assume:
  A random sample of homogeneous objects from a single relation
Real-world data sets:
  Multi-relational, heterogeneous and semi-structured
Link Mining
  Newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

A Taxonomy of Common Link Mining Tasks

Object-Related Tasks
  Link-based object ranking
  Link-based object classification
  Object clustering (group detection)
  Object identification (entity resolution)
Link-Related Tasks
  Link prediction
Graph-Related Tasks
  Subgraph discovery
  Graph classification
  Generative model for graphs

What Is a Link in Link Mining?

Link: relationship among data
Two kinds of linked networks
  homogeneous vs. heterogeneous
Homogeneous networks
  Single object type and single link type
  Single model social networks (e.g., friends)
  WWW: a collection of linked Web pages
Heterogeneous networks
  Multiple object and link types
  Medical network: patients, doctors, disease, contacts, treatments
  Bibliographic network: publications, authors, venues

PageRank: Capturing Page Popularity (Brin & Page '98)

Intuitions
  Links are like citations in literature
  A page that is cited often can be expected to be more useful in general
PageRank is essentially citation counting, but improves over simple counting
  Consider indirect citations (being cited by a highly cited paper counts a lot)
  Smoothing of citations (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)

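The PageRank Algorithm (Brin & Page '98)

Random surfing model: at any page,
  with prob. $\alpha$, randomly jump to a page
  with prob. $(1 - \alpha)$, randomly pick a link to follow

Transition matrix for the example graph with pages d1, d2, d3, d4 (row i holds the probabilities of following a link out of $d_i$):
  $M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$

  $p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p_t(d_k)$
  The second term is the same as $\alpha / N$ (why?)
Stationary (stable) distribution, so we ignore time:
  $p(d_i) = \sum_{k=1}^{N} \left[ \frac{\alpha}{N} + (1 - \alpha)\, m_{ki} \right] p(d_k)$
  $\vec{p} = (\alpha I + (1 - \alpha) M)^T \vec{p}$, where $I_{ij} = 1/N$
Initial value $p(d) = 1/N$; iterate until convergence
Essentially an eigenvector problem.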
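
The iteration can be sketched directly from the update rule. The matrix below is the four-page transition matrix from the slide; the jump probability of 0.15, the convergence tolerance, and the function name are assumptions, since the slide does not fix a value for the jump probability.

```python
# Transition matrix from the slide: M[j][i] is the probability of following
# a link from page d_(j+1) to page d_(i+1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def pagerank(m, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for p(d_i) = alpha/N + (1 - alpha) * sum_j m[j][i] p(d_j).

    The random-jump term is written as alpha/N because the scores sum to 1
    as long as every row of m sums to 1.
    """
    n = len(m)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        nxt = [
            alpha / n + (1 - alpha) * sum(m[j][i] * p[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    for page, score in zip(("d1", "d2", "d3", "d4"), pagerank(M)):
        print(page, round(score, 4))
```

With these settings d1 receives the highest score, d2 comes second, and d3 and d4 tie, because both are reached only through d1's two out-links.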

Link Prediction

Predict whether a link exists between two entities, based on attributes and other observed links
Applications
  Web: predict if there will be a link between two pages
  Citation: predicting if a paper will cite another paper
  Epidemics: predicting who a patient's contacts are
Methods
  Often viewed as a binary classification problem
  Local conditional probability model, based on structural and attribute features
  Difficulty: sparseness of existing links
  Collective prediction, e.g., Markov random field model

Multirelational Data Mining

Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary

Outline

Theme: Knowledge is power, but knowledge is hidden in massive links
Starting with PageRank and HITS
CrossMine: Classification of multi-relations by link analysis
CrossClus: Clustering over multi-relations by user guidance
More recent work and conclusions

Traditional Data Mining

Work on single flat relations
[Figure: the Doctor, Patient, and Contact relations are flattened into a single relation]
Lose information of linkages and relationships
Cannot utilize information of database structures or schemas

Multi-Relational Data Mining (MRDM)

Motivation
  Most structured data are stored in relational databases
  MRDM can utilize linkage and structural information
Knowledge discovery in multi-relational environments
  Multi-relational rules
  Multi-relational clustering
  Multi-relational classification
  Multi-relational linkage analysis

Applications of MRDM

e-Commerce: discovering patterns involving customers, products, manufacturers, ...
Bioinformatics/Medical databases: discovering patterns involving genes, patients, diseases, ...
Networking security: discovering patterns involving hosts, connections, services, ...
Many other relational data sources
  Example: Evidence Extraction and Link Discovery (EELD): a DARPA-funded project that emphasizes multi-relational and multi-database linkage analysis

Importance of Multi-relational Classification (from EELD Program Description)

The objective of the EELD Program is to research, develop, demonstrate, and transition critical technology that will enable significant improvement in our ability to detect asymmetric threats, e.g., a loosely organized terrorist group.
Patterns of activity that, in isolation, are of limited significance but, when combined, are indicative of potential threats, will need to be learned.
Addressing these threats can only be accomplished by developing a new level of autonomic information surveillance and analysis to extract, discover, and link together sparse evidence from vast amounts of data sources, in different formats and with differing types and degrees of structure, to represent and evaluate the significance of the related evidence, and to learn patterns to guide the extraction, discovery, linkage and evaluation processes.

MRDM Approaches

Inductive Logic Programming (ILP)
  Find models that are coherent with background knowledge
Multi-relational Clustering Analysis
  Clustering objects with multi-relational information
Probabilistic Relational Models
  Model cross-relational probabilistic distributions
Efficient Multi-Relational Classification
  The CrossMine Approach [Yin et al., 2004]

Inductive Logic Programming (ILP)

Find a hypothesis that is consistent with background knowledge (training data)
  FOIL, Golem, Progol, TILDE, ...
Background knowledge
  Relations (predicates), Tuples (ground facts)

Training examples            Background knowledge
Daughter(mary, ann)   +      Parent(ann, mary)    Female(ann)
Daughter(eve, tom)    +      Parent(ann, tom)     Female(mary)
Daughter(tom, ann)    -      Parent(tom, eve)     Female(eve)
Daughter(eve, ann)    -      Parent(tom, ian)

Inductive Logic Programming (ILP)

Hypothesis
  The hypothesis is usually a set of rules, which can predict certain attributes in certain relations
  Daughter(X, Y) <- female(X), parent(Y, X)
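
Whether this hypothesis is consistent with the training examples can be checked mechanically. The sketch below encodes the ground facts from the previous slide as Python sets and treats the two training examples that carry no "+" mark as negative, which is an assumption about the original table; all names in the code are invented for the example.

```python
# Background knowledge (ground facts) from the slide.
PARENT = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
FEMALE = {"ann", "mary", "eve"}

# Training examples: (X, Y) -> is Daughter(X, Y) a positive example?
# Assumption: the two examples without a "+" on the slide are negative.
EXAMPLES = {
    ("mary", "ann"): True,
    ("eve", "tom"): True,
    ("tom", "ann"): False,
    ("eve", "ann"): False,
}


def daughter(x, y):
    """The hypothesis Daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in FEMALE and (y, x) in PARENT


def is_consistent(rule, examples):
    """True if the rule covers every positive and no negative example."""
    return all(rule(x, y) == label for (x, y), label in examples.items())


if __name__ == "__main__":
    print(is_consistent(daughter, EXAMPLES))  # True
```

An ILP system such as FOIL searches for a rule that passes this kind of check, instead of being handed one, by adding literals to the rule body one at a time.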

Automatically Classifying Objects Using Multiple Relations

Why not convert multiple relational data into a single table by joins?
  Relational databases are designed by domain experts via semantic modeling (e.g., E-R modeling)
  Indiscriminate joins may lose some essential information
  One universal relation may not be appealing in terms of efficiency, scalability and semantics preservation
Our approach to multi-relational classification:
  Automatically classifying objects using multiple relations

An Example: Loan Applications

[Figure: an applicant applies for a loan; to decide whether to approve or not, ask the backend database]

The Backend Database

Target relation: Loan. Each tuple has a class label, indicating whether a loan is paid on time.

Loan: loan-id, account-id, date, amount, duration, payment
Account: account-id, district-id, frequency, date
Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
Order: order-id, account-id, bank-to, account-to, amount, type
Card: card-id, disp-id, type, issue-date
Disposition: disp-id, account-id, client-id
Client: client-id, birth-date, gender, district-id
District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96

How to make decisions on loan applications?

Roadmap

Motivation
Rule-based Classification
Tuple ID Propagation
Rule Generation
Negative Tuple Sampling
Performance Study

Rule-based Classification

Applicant: ever bought a house, live in Chicago -> Approve!
Applicant: just apply for a credit card -> Reject

Rule Generation

Search for good predicates across multiple relations

Loan Applications
Loan ID   Account ID   Amount   Duration   Decision
1         124          1000     12         Yes
2         124          4000     12         Yes
3         108          10000    24         No
4         45           12000    36         No

Accounts
Account ID   Frequency   Open date   District ID
128          monthly     02/27/96    61820
108          weekly      09/23/95    61820
45           monthly     12/09/94    61801
67           weekly      01/01/95    61822

[Figure: Applicants #1 to #4 linked to the Loan Applications relation, which connects to Accounts, Orders, Districts, and other relations]

Previous Approaches

Inductive Logic Programming (ILP)
  To build a rule
    Repeatedly find the best predicate
    To evaluate a predicate on relation R, first join the target relation with R
  Not scalable because
    Huge search space (numerous candidate predicates)
    Not efficient to evaluate each predicate
  To evaluate a predicate
    Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, monthly, ?)
    first join the loan relation with the account relation
CrossMine is more scalable and more than one hundred times faster on datasets with reasonable sizes

Rule Generation

To generate a rule
  while (true)
    find the best predicate p
    if foilgain(p) > threshold then add p to current rule
    else break

[Figure: positive and negative examples covered by the successively refined rules A3=1; A3=1 && A1=2; A3=1 && A1=2 && A8=5]
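
The loop above can be made concrete on a single table. The slide does not define foilgain, so the sketch assumes the standard FOIL gain, $P_1 \cdot (\log_2 \frac{P_1}{P_1 + N_1} - \log_2 \frac{P_0}{P_0 + N_0})$, where $P_0, N_0$ and $P_1, N_1$ are the positive and negative examples covered before and after adding the predicate. Predicates are simple attribute = value tests; the function names, the threshold, and the toy data are hypothetical, and the multi-relational search of CrossMine is not modeled.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL gain of a predicate that narrows coverage from (p0, n0) to (p1, n1)."""
    if p1 == 0:
        return 0.0
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)


def build_rule(examples, candidates, threshold=0.5, max_length=3):
    """Greedily add the predicate with the highest gain while it beats threshold.

    examples: list of (attribute dict, bool label) pairs.
    candidates: list of (attribute, value) equality predicates.
    """
    rule, covered = [], list(examples)
    while len(rule) < max_length:
        p0 = sum(1 for _, label in covered if label)
        n0 = len(covered) - p0
        if p0 == 0:
            break
        best, best_gain = None, threshold
        for attr, value in candidates:
            if (attr, value) in rule:
                continue
            subset = [(x, y) for x, y in covered if x.get(attr) == value]
            p1 = sum(1 for _, label in subset if label)
            gain = foil_gain(p0, n0, p1, len(subset) - p1)
            if gain > best_gain:
                best, best_gain = (attr, value), gain
        if best is None:  # no predicate has enough gain
            break
        rule.append(best)
        covered = [(x, y) for x, y in covered if x.get(best[0]) == best[1]]
    return rule


if __name__ == "__main__":
    data = (
        [({"A3": 1, "A1": 2}, True)] * 3
        + [({"A3": 1, "A1": 0}, False)] * 2
        + [({"A3": 0, "A1": 2}, False)] * 3
    )
    print(build_rule(data, [("A3", 1), ("A1", 2)]))
```

On this toy data the first predicate chosen is A3 = 1, and A1 = 2 is added next because it removes the remaining negative examples.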

Rule Generation

Start from the target relation
  Only the target relation is active
Repeat
  Search in all active relations
  Search in all relations joinable to active relations
  Add the best predicate to the current rule
  Set the involved relation to active
Until
  The best predicate does not have enough gain
  Current rule is too long

Rule Generation: Example

[Figure: the backend database schema (Account, District, Loan, Card, Transaction, Disposition, Order, Client) with Loan as the target relation, annotated with the first predicate, the second predicate, and the range of search; the best predicate is added to the rule at each step]


Look-one-ahead in Rule Generation

Two types of relations: Entity and Relationship
Often cannot find useful predicates on relations of relationship
[Figure: a relationship relation joined to the target relation offers no good predicate]
Solution of CrossMine:
  When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity.

Multi-Relational and Multi-DB Mining

Classification over multiple relations in databases
Clustering over multi-relations by User Guidance
Mining across multi-relational databases
Mining across multiple heterogeneous data and information repositories
Summary

Motivation 1: Multi-Relational Clustering

[Figure: database schema; Student is the target of clustering]
  Professor: name, office, position
  Work-In: person, group
  Group: name, area
  Open-course: course, semester, instructor
  Course: course-id, name, area
  Advise: professor, student, degree
  Publish: author, title
  Publication: title, year, conf
  Student: name, office, position
  Register: student, course, semester, unit, grade
Traditional clustering works on a single table
Most data is semantically linked with multiple relations
Thus we need information in multiple relations

Motivation 2: User-Guided Clustering

[Figure: database schema (Work-In, Professor, Open-course, Course, Group, Advise, Publish, Publication, Student, Register) with Student as the target of clustering and one attribute marked as the user hint]
User usually has a goal of clustering, e.g., clustering students by research area
User specifies his clustering goal to CrossClus

Comparing with Classification

User-specified feature (in the form of attribute) is used as a hint, not class labels
  The attribute may contain too many or too few distinct values
    E.g., a user may want to cluster students into 20 clusters instead of 3
  Additional features need to be included in cluster analysis
[Figure: the user hint shown against all tuples for clustering]

Comparing with Semi-supervised Clustering

Semi-supervised clustering [Wagstaff et al. '01; Xing et al. '02]
  User provides a training set consisting of similar and dissimilar pairs of objects
User-guided clustering
  User specifies an attribute as a hint, and more relevant features are found for clustering
[Figure: semi-supervised clustering vs. user-guided clustering, each shown over all tuples for clustering]

Semi-supervised Clustering

Much information (in multiple relations) is needed to judge whether two tuples are similar
A user may not be able to provide a good training set
It is much easier for a user to specify an attribute as a hint, such as a student's research area
[Figure: tuples to be compared (Tom Smith, SC1211, TA; Jane Chang, BI205, RA) and the user hint]

Searching for Pertinent Features

Different features convey different aspects of information
  Research area: research group area, conferences of papers, advisor
  Demographic info: permanent address, nationality
  Academic performances: GPA, GRE score, number of papers
Features conveying the same aspect of information usually cluster objects in more similar ways
  research group areas vs. conferences of publications
Given a user-specified feature
  Find pertinent features by computing feature similarity

Heuristic Search for Pertinent Features

Overall procedure
  1. Start from the user-specified feature
  2. Search in the neighborhood of existing pertinent features
  3. Expand the search range gradually
[Figure: database schema with the user hint and the target of clustering (Student); search steps 1 and 2 are marked on neighboring relations]
Tuple ID propagation [Yin et al. '04] is used to create multi-relational features
  IDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple

Roadmap

1. Overview
2. Feature Pertinence
3. Searching for Features
4. Clustering
5. Experimental Results

Clustering with Multi-Relational Features

Given a set of L pertinent features $f_1, \ldots, f_L$, the similarity between two objects is
  $\mathrm{sim}(t_1, t_2) = \sum_{i=1}^{L} \mathrm{sim}_{f_i}(t_1, t_2) \cdot f_i.\mathrm{weight}$
  Weight of a feature is determined in feature search by its similarity with other pertinent features
For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han '94]

How to Measure Similarity between Clusters?

Single-link (highest similarity between points in two clusters)?
  No, because references to different objects can be connected.
Complete-link (minimum similarity between them)?
  No, because references to the same object may be weakly connected.
Average-link (average similarity between points in two clusters)?
  A better measure

Clustering Procedure

Procedure
  Initialization: use each reference as a cluster
  Keep finding and merging the most similar pair of clusters
  Until no pair of clusters is similar enough
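
The procedure can be sketched with the average-link measure preferred on the previous slide. This naive version recomputes every pairwise cluster similarity in each round; the similarity function, the stopping threshold `min_sim`, and the toy data are assumptions made for the example, and references are represented by plain numbers.

```python
from itertools import combinations


def average_link(c1, c2, sim):
    """Average similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(items, sim, min_sim):
    """Merge the most similar pair of clusters until none reaches min_sim."""
    clusters = [[item] for item in items]  # each reference starts alone
    while len(clusters) > 1:
        (i, j), best = max(
            (
                ((i, j), average_link(clusters[i], clusters[j], sim))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if best < min_sim:  # no pair of clusters is similar enough
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]     # safe because i < j
    return clusters


if __name__ == "__main__":
    def closeness(a, b):
        return 1.0 / (1.0 + abs(a - b))

    print(agglomerate([1, 2, 3, 10, 11], closeness, min_sim=0.3))
    # [[1, 2, 3], [10, 11]]
```

Recomputing all pairs after every merge is what becomes expensive when clusters grow large, which is the cost that incremental similarity computation is meant to avoid.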

Efficient Computation

In agglomerative hierarchical clustering, one needs to repeatedly compute similarity between clusters
  When merging clusters C1 and C2 into C3, we need to compute the similarity between C3 and any other cluster
  Very expensive when clusters are large
We invent methods to compute similarity incrementally
  Neighborhood similarity
  Random walk probability

December26, DataMining:Conceptsand 394


2012 h
Multi-Relational Data Mining

Classification over multiple relations in databases
Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Mining across multiple heterogeneous data and information repositories
Summary

Summary
Knowledge is power, but knowledge is hidden in massive links
More stories than Web page rank and search
CrossMine: Classification of multi-relations by link analysis
CrossClus: Clustering over multi-relations by user guidance
LinkClus: Efficient clustering by exploring the power law distribution
Distinct: Distinguishing objects with identical names by link analysis
Much more to be explored!

Review Questions

State the importance of the sliding window model in analyzing stream data.
Write a note on data stream management systems (DSMS).
State the difference between a one-time query and a continuous query.
How does the lossy counting algorithm find frequent items?
Give a note on stream query processing.
What is a time-series database?
Define sequential pattern mining.
What is periodicity analysis?
Distinguish between full periodic patterns and partial periodic patterns.
State the Markov chain model.
State the importance of synopses in the context of stream data.
State the need for biological sequence analysis.
Discuss constraint-based mining.
What is a social network?
Briefly outline multi-relational data mining.

Bibliography

Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber

Mining Object, Spatial, and Multimedia Data

Data Mining: Principles and Algorithms, 12/26/2012

Mining Object, Spatial and Multimedia Data

Mining object data sets
Mining spatial databases and data warehouses
Spatial DBMS
Spatial Data Warehousing
Spatial Data Mining
Spatiotemporal Data Mining
Mining multimedia data
Summary

Mining Complex Data Objects: Generalization of Structured Data

Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
E.g., hobby = {tennis, hockey, chess, violin, PC_games} generalizes to {sports, music, e_games}
List-valued or sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization

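A minimal sketch of generalizing set-valued and sequence-valued attributes through a concept hierarchy. The hierarchy below is hypothetical: the slide gives only the input and output sets, so the individual mappings (in particular where chess belongs) are assumptions, as is the choice to merge adjacent repeats in the sequence case.

```python
# Hypothetical concept hierarchy: leaf value -> higher-level concept.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",      # assumption: the slide does not say where chess goes
    "violin": "music",
    "PC_games": "e_games",
}


def generalize_set(values, hierarchy):
    """Replace each value by its higher-level concept; duplicates collapse."""
    return {hierarchy.get(v, v) for v in values}


def generalize_sequence(values, hierarchy):
    """Same mapping, but the order of elements is preserved.

    Adjacent repeats of the same concept are merged (an assumption).
    """
    out = []
    for v in values:
        g = hierarchy.get(v, v)
        if not out or out[-1] != g:
            out.append(g)
    return out


hobby = {"tennis", "hockey", "chess", "violin", "PC_games"}
general_hobby = generalize_set(hobby, HOBBY_HIERARCHY)
general_seq = generalize_sequence(
    ["tennis", "hockey", "violin", "tennis"], HOBBY_HIERARCHY
)
```
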
Generalizing Spatial and Multimedia Data

Spatial data:
Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
Requires the merging of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in the segment
Summarize its style: based on its tone, tempo, or the major musical instruments played

Generalizing Object Data

Object identifier
Generalize to the lowest level of class in the class/subclass hierarchies
Class composition hierarchies
Generalize only those closely related in semantics to the current one
Construction and mining of object cubes
Extend the attribute-oriented induction method
Apply a sequence of class-based generalization operators on different attributes
Continue until getting a small number of generalized objects that can be summarized concisely in high-level terms
Implementation
Examine each attribute, generalize it to simple-valued data
Construct a multidimensional data cube (object cube)
Problem: it is not always desirable to generalize a set of values to single-valued data

Ex.: Plan Mining by Divide and Conquer

Plan: a sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, dtime, atime, airline, price, seat>
Plan mining: extraction of important or significant generalized (sequential) patterns from a plan base (a large collection of plans)
E.g., discover travel patterns in an air flight database, or find significant patterns from the sequences of actions in the repair of automobiles
Method
Attribute-oriented induction on sequence data
A generalized travel plan: <small-big*-small>
Divide & conquer: Mine characteristics for each subsequence
E.g., big*: same airline, small-big: nearby region

A Travel Database for Plan Mining

Example: Mining a travel plan base
Travel plan table
plan#  action#  departure  depart_time  arrival  arrival_time  airline
1      1        ALB        800          JFK      900           TWA
1      2        JFK        1000         ORD      1230          UA
1      3        ORD        1300         LAX      1600          UA
1      4        LAX        1710         SAN      1800          DAL
2      1        SPI        900          ORD      950           AA
...

Airport info table
airport_code  city  state  region  airport_size
...

Multidimensional Analysis

A multi-D model for the plan base
Strategy
Generalize the plan base in different directions
Look for sequential patterns in the generalized plans
Derive high-level plans

What Is a Spatial Database System?

Geometric, geographic or spatial data: space-related data
Example: Geographic space (2-D abstraction of earth surface), VLSI design, model of human brain, 3-D space representing the arrangement of chains of protein molecules.
Spatial database system vs. image database systems.
Image database system: handles digital raster images (e.g., satellite sensing, computer tomography); may also contain techniques for object analysis and extraction from images and some spatial database functionality.
Spatial (geometric, geographic) database system: handles objects in space that have identity and well-defined extents, locations, and relationships.

GIS (Geographic Information System)

Analysis and visualization of geographic data
Common analysis functions of GIS
Search (thematic search, search by region)
Location analysis (buffer, corridor, overlay)
Terrain analysis (slope/aspect, drainage network)
Flow analysis (connectivity, shortest path)
Distribution (nearest neighbor, proximity, change detection)
Spatial analysis/statistics (pattern, centrality, similarity, topology)
Measurements (distance, perimeter, shape, adjacency, direction)

Spatial DBMS (SDBMS)

An SDBMS is a software system that
supports spatial data models, spatial ADTs, and a query language supporting them
supports spatial indexing, efficient spatial operations, and query optimization
can work with an underlying DBMS
Examples
Oracle Spatial Data Cartridge
ESRI Spatial Data Engine

Modeling Spatial Objects

What needs to be represented?
Two important alternative views
Single objects: distinct entities arranged in space, each of which has its own geometric description
modeling cities, forests, rivers
Spatially related collection of objects: describe space itself (about every point in space)
modeling land use, partition of a country into districts

Modeling Single Objects: Point, Line and Region

Point: location only but not extent
Line (or a curve usually represented by a polyline, a sequence of line segments):
moving through space, or connections in space (roads, rivers, cables, etc.)
Region:
Something having extent in 2-D space (country, lake, park). It may have a hole or consist of several disjoint pieces.

Modeling Spatially Related Collection of Objects

Modeling spatially related collections of objects: plane partitions and networks.
A partition: a set of region objects that are required to be disjoint (e.g., a thematic map). There often exist pairs of objects with a common boundary (adjacency relationship).
A network: a graph embedded into the plane, consisting of a set of point objects, forming its nodes, and a set of line objects describing the geometry of the edges, e.g., highways, rivers, power supply lines.
Other interesting spatially related collections of objects: nested partitions, or a digital terrain (elevation) model.

Spatial Data Types and Models

Field-based model: raster data
framework: partitioning of space
Object-based model: vector model
point, line, polygon, objects, attributes

[Figure (a): map of forest stands on the square from (0,0) to (4,4): Pine in the upper half, Fir in the lower left, Oak in the lower right]

(b) Object viewpoint of forest stands
Area-ID  Dominant Tree Species  Area/Boundary
FS1      Pine                   [(0,2),(4,2),(4,4),(0,4)]
FS2      Fir                    [(0,0),(2,0),(2,2),(0,2)]
FS3      Oak                    [(2,0),(4,0),(4,2),(2,2)]

(c) Field viewpoint of forest stands
$$f(x,y) = \begin{cases} \text{"Pine"}, & 0 \le x \le 4,\ 2 \le y \le 4 \\ \text{"Fir"}, & 0 \le x \le 2,\ 0 \le y \le 2 \\ \text{"Oak"}, & 2 \le x \le 4,\ 0 \le y \le 2 \end{cases}$$

Spatial Query Language
Spatial query language
Spatial data types, e.g. point, line segment, polygon, ...
Spatial operations, e.g. overlap, distance, nearest neighbor, ...
Callable from a query language (e.g. SQL3) of the underlying DBMS
SELECT S.name
FROM Senator S
WHERE S.district.Area() > 300
Standards
SQL3 (a.k.a. SQL 1999) is a standard for query languages
OGIS is a standard for spatial data types and operators
Both standards enjoy wide support in industry

Query Processing

Efficient algorithms to answer spatial queries
Common strategy: filter and refine
Filter: Query region overlaps with MBRs (minimum bounding rectangles) of B, C, D
Refine: Query region overlaps with B, C
[Figure: data objects A, B, C, D with their MBRs and the query region; the filter step keeps B, C, D and the refine step keeps B, C]

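The filter-and-refine strategy can be sketched as below. To keep the exact geometry test short, data objects are modeled as discs; that choice, the class names, and the sample coordinates are assumptions for illustration and do not reproduce the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle given by lower-left and upper-right corners."""
    xl: float
    yl: float
    xu: float
    yu: float

    def overlaps(self, other: "Rect") -> bool:
        return (self.xl <= other.xu and other.xl <= self.xu
                and self.yl <= other.yu and other.yl <= self.yu)


@dataclass(frozen=True)
class Disc:
    """Stand-in for a data object with exact geometry (assumption: a disc)."""
    name: str
    cx: float
    cy: float
    r: float

    def mbr(self) -> Rect:
        return Rect(self.cx - self.r, self.cy - self.r,
                    self.cx + self.r, self.cy + self.r)

    def intersects(self, q: Rect) -> bool:
        # Exact test: distance from the centre to the nearest point of q.
        nx = min(max(self.cx, q.xl), q.xu)
        ny = min(max(self.cy, q.yl), q.yu)
        return (nx - self.cx) ** 2 + (ny - self.cy) ** 2 <= self.r ** 2


def filter_and_refine(objects, query):
    candidates = [o for o in objects if o.mbr().overlaps(query)]  # filter
    answers = [o for o in candidates if o.intersects(query)]      # refine
    return candidates, answers


query = Rect(0, 0, 4, 4)
objects = [Disc("A", 10, 10, 1), Disc("B", 3, 3, 1), Disc("D", 5, 5, 1.2)]
candidates, answers = filter_and_refine(objects, query)
```

Object D passes the cheap MBR test because its bounding square touches the query corner, but the exact test rejects it; this is the kind of false hit the refine step is meant to remove.
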
Join Query Processing

Determining intersecting rectangles
Plane sweep algorithm
Plane sweep filter identifies 5 intersections for the refinement step
[Figure: (a) rectangles R1-R4 and S1-S3 with a vertical sweep line; (b) a rectangle T with lower-left corner (T.xl, T.yl) and upper-right corner (T.xu, T.yu); (c) the sweep order R4, S2, S1, R1, S3, R2, R3]

File Organization and Indices

SDBMS: The data set is in secondary storage, e.g. disk
Space-filling curves: an ordering on the locations in a multi-dimensional space
Linearize a multi-dimensional space
Helps search efficiently

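As one concrete space-filling curve, the Z-order (Morton) curve linearizes 2-D grid cells by interleaving the bits of their coordinates. The slide does not name a particular curve, so choosing Z-order here is an assumption; the function name and the `bits` parameter are likewise illustrative.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Order the cells of a 2 x 2 grid along the curve.
cells = [(x, y) for y in range(2) for x in range(2)]
ordered = sorted(cells, key=lambda c: z_order(*c))
```

Once every location has such a one-dimensional key, an ordinary B-tree on the key can index the spatial data.
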
File Organization and Indices

Spatial indexing
B-tree works on spatial data with a space-filling curve
R-tree: height-balanced extension of the B+ tree
Objects are represented as MBRs
provides better performance
[Figure: an R-tree with root entries A, B, C over leaf entries d, e, f, g, h, i, j, shown next to the corresponding nested MBRs]

Spatial Query Optimization

A spatial operation can be processed using different strategies
The computation cost of each strategy depends on many parameters
Query optimization is the process of
ordering operations in a query and
selecting an efficient strategy for each operation
based on the details of a given dataset

Spatial Data Warehousing

Spatial data warehouse: integrated, subject-oriented, time-variant, and nonvolatile spatial data repository
Spatial data integration: a big issue
Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
Geo-specific formats (geographic vs. equal area projection, etc.)
Spatial data cube: multidimensional spatial database
Both dimensions and measures may contain spatial components

Dimensions and Measures in Spatial Data Warehouse

Dimensions
nonspatial
e.g. "25-30 degrees" generalizes to "hot" (both are strings)
spatial-to-nonspatial
e.g. Seattle generalizes to description "Pacific Northwest" (as a string)
spatial-to-spatial
e.g. Seattle generalizes to Pacific Northwest (as a spatial region)

Measures
numerical (e.g. monthly revenue of a region)
distributive (e.g. count, sum)
algebraic (e.g. average)
holistic (e.g. median, rank)
spatial
collection of spatial pointers (e.g. pointers to all regions with temperature of 25-30 degrees in July)

Spatial Association Analysis

Spatial association rule: A ⇒ B [s%, c%]
A and B are sets of spatial or non-spatial predicates
Topological relations: intersects, overlaps, disjoint, etc.
Spatial orientations: left_of, west_of, under, etc.
Distance information: close_to, within_distance, etc.
s% is the support and c% is the confidence of the rule
Examples
1) is_a(x, large_town) ^ intersect(x, highway) ⇒ adjacent_to(x, water) [7%, 85%]
2) What kinds of objects are typically located close to golf courses?

Progressive Refinement Mining of Spatial Association Rules
Hierarchy of spatial relationships:
g_close_to: near_by, touch, intersect, contain, etc.
First search for a rough relationship and then refine it
Two-step mining of spatial association:
Step 1: Rough spatial computation (as a filter)
Using MBR or R-tree for rough estimation
Step 2: Detailed spatial algorithm (as refinement)
Apply only to those objects which have passed the rough spatial association test (no less than min_support)

Spatial Autocorrelation

Spatial data tends to be highly self-correlated
Example: neighborhood, temperature
Items in traditional data are independent of each other, whereas properties of locations in a map are often auto-correlated.
First law of geography:
"Everything is related to everything, but nearby things are more related than distant things."

Spatial Classification

Methods in classification
Decision-tree classification, naïve Bayesian classifier + boosting, neural network, logistic regression, etc.
Association-based multi-dimensional classification
Example: classifying house value based on proximity to lakes, highways, mountains, etc.
Assuming learning samples are independent of each other
Spatial autocorrelation violates this assumption!
Popular spatial classification methods
Spatial auto-regression (SAR)
Markov random field (MRF)

Spatial Auto-Regression

Linear regression
$$Y = X\beta + \varepsilon$$
Spatial autoregressive regression (SAR)
$$Y = \rho W Y + X\beta + \varepsilon$$
W: neighborhood matrix
$\rho$ models the strength of spatial dependencies
$\varepsilon$: error vector
The estimates of $\rho$ and $\beta$ can be derived using maximum likelihood theory or Bayesian statistics

Markov Random Field Based Bayesian Classifiers

Bayesian classifiers
MRF
A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighborhood matrix) is called a Markov random field.
$$\Pr(C_i \mid X, L_i) = \frac{\Pr(X \mid C_i, L_i)\,\Pr(C_i \mid L_i)}{\Pr(X)}$$
L_i denotes the set of labels in the neighborhood of s_i, excluding the labels at s_i
Pr(C_i | L_i) can be estimated from training data by examining the ratios of the frequencies of class labels to the total number of locations
Pr(X | C_i, L_i) can be estimated using kernel functions from the observed values in the training data set

Spatial Trend Analysis

Function
Detect changes and trends along a spatial dimension
Study the trend of non-spatial or spatial data changing with space
Application examples
Observe the trend of changes of the climate or vegetation with increasing distance from an ocean
Crime rate or unemployment rate change with regard to city geo-distribution

Spatial Cluster Analysis

Mining clusters: k-means, k-medoids, hierarchical, density-based, etc.
Analysis of distinct features of the clusters

Constraints-Based Clustering

Constraints on individual objects
Simple selection of relevant objects before clustering
Clustering parameters as constraints
K-means, density-based: radius, min # of points
Constraints specified on clusters using SQL aggregates
Sum of the profits in each cluster > $1 million
Constraints imposed by physical obstacles
Clustering with obstructed distance

Constrained Clustering: Planning ATM Locations

[Figure: two panels, "Spatial data with obstacles" and "Clustering without taking obstacles into consideration"; labels shown: River, Mountain, clusters C1-C4]

Spatial Outlier Detection

Outlier
Global outliers: observations which are inconsistent with the rest of the data
Spatial outliers: a local instability of non-spatial attributes
Spatial outlier detection
Graphical tests
Variogram clouds
Moran scatterplots
Quantitative tests
Scatterplots
Spatial statistic Z(S(x))
Quantitative tests are more accurate than graphical tests

Similarity Search in Multimedia Data

Description-based retrieval systems
Build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation
Labor-intensive if performed manually
Results are typically of poor quality if automated
Content-based retrieval systems
Support retrieval based on the image content, such as color histogram, texture, shape, objects, and wavelet transforms

Queries in Content-Based Retrieval Systems

Image sample-based queries
Find all of the images that are similar to the given image sample
Compare the feature vector (signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database
Image feature specification queries
Specify or sketch image features like color, texture, or shape, which are translated into a feature vector
Match the feature vector with the feature vectors of the images in the database

Approaches Based on Image Signature

Color histogram-based signature
The signature includes color histograms based on the color composition of an image regardless of its scale or orientation
No information about shape, location, or texture
Two images with similar color composition may contain very different shapes or textures, and thus could be completely unrelated in semantics
Multifeature composed signature
Define different distance functions for color, shape, location, and texture, and subsequently combine them to derive the overall result

Wavelet Analysis

Wavelet-based signature
Use the dominant wavelet coefficients of an image as its signature
Wavelets capture shape, texture, and location information in a single unified framework
Improves efficiency and reduces the need for providing multiple search primitives
May fail to identify images containing similar objects that are in different locations.

One Signature for the Entire Image?

WALRUS: [NRS99] by Natsev, Rastogi, and Shim
Similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other
Wavelet-based signature with region-based granularity
Define regions by clustering signatures of windows of varying sizes within the image
Signature of a region is the centroid of the cluster
Similarity is defined in terms of the fraction of the area of the two images covered by matching pairs of regions from the two images

Multidimensional Analysis of Multimedia Data

Multimedia data cube
Design and construction similar to that of traditional data cubes from relational data
Contains additional dimensions and measures for multimedia information, such as color, texture, and shape
The database does not store images but their descriptors
Feature descriptor: a set of vectors for each visual characteristic
Color vector: contains the color histogram
MFC (Most Frequent Color) vector: five color centroids
MFO (Most Frequent Orientation) vector: five edge orientation centroids
Layout descriptor: contains a color layout vector and an edge layout vector

Multi-Dimensional Search in Multimedia Databases

Multi-Dimensional Analysis in Multimedia Databases

[Figure: color histogram and texture layout views]

Mining Multimedia Databases

Refining or combining searches
Search for "blue sky" (top layout grid is blue)
Search for "blue sky and green meadows" (top layout grid is blue and bottom is green)
Search for "airplane in blue sky" (top layout grid is blue and keyword = "airplane")

Mining Multimedia Databases
The Data Cube and the Sub-Space Measurements

[Figure: data cube over colour (RED, WHITE, BLUE), format (JPEG, GIF) and size, with group-by, cross-tab and sum aggregates: by colour, by format, by size, by format & colour, by format & size, by colour & size]
Attributes listed in the figure:
Format of image
Duration
Colors
Textures
Keywords
Size
Width
Height
Internet domain of image
Internet domain of parent pages
Image popularity

Mining Multimedia Databases in

Classification in MultiMediaMiner

Mining Associations in Multimedia Data

Special features:
Need # of occurrences besides Boolean existence, e.g.,
"Two red squares and one blue circle" implies theme "air-show"
Need spatial relationships
Blue on top of white squared object is associated with brown bottom
Need multi-resolution and progressive refinement mining
It is expensive to explore detailed associations among objects at high resolution
It is crucial to ensure the completeness of search at multi-resolution space

Mining Multimedia Databases

Spatial relationships from layout
property P1 on-top-of property P2
property P1 next-to property P2
Different resolution hierarchy

Mining Multimedia Databases

From coarse to fine resolution mining

Challenge: Curse of Dimensionality

Difficult to implement a data cube efficiently given a large number of dimensions, especially serious in the case of multimedia data cubes
Many of these attributes are set-oriented instead of single-valued
Restricting the number of dimensions may lead to the modeling of an image at a rather rough, limited, and imprecise scale
More research is needed to strike a balance between efficiency and power of representation

Summary

Mining object data needs feature/attribute-based generalization methods
Spatial, spatiotemporal and multimedia data mining is one of the important research frontiers in data mining, with broad applications
Spatial data warehousing, OLAP and mining facilitate multidimensional spatial analysis and finding spatial associations, classifications and trends
Multimedia data mining needs content-based retrieval and similarity search integrated with mining methods

Mining Text and Web Data

Text mining, natural language processing and information extraction: An Introduction
Text categorization methods
Mining Web linkage structures
Summary

Mining Text Data: An Introduction
Data Mining / Knowledge Discovery
Kinds of data: structured data, multimedia, free text, hypertext

Structured data
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years
)

Multimedia
Loans($200K, [map], ...)

Free text
Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.

Hypertext
<a href>Frank Rizzo</a> Bought <a href>this home</a> from <a href>Lake View Real Estate</a> In <b>1992</b>. <p>...

Bag-of-Tokens Approaches

Documents:
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or ...

Feature extraction produces token sets:
nation 5
civil 1
war 2
men 2
died 4
people 5
Liberty 1
God 1
...

Loses all order-specific information!
Severely limits context!

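A bag-of-tokens representation can be sketched with a token counter. The tokenizer here (lowercasing and a simple regular expression) is an assumption, and the counts come only from the excerpt quoted on the slide, so they are smaller than the slide's counts for the full speech.

```python
import re
from collections import Counter


def bag_of_tokens(text: str) -> Counter:
    """Lowercase the text and count each token; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


excerpt = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation, or"
)
bag = bag_of_tokens(excerpt)

# Two sentences with opposite meanings map to the same bag.
same_bag = bag_of_tokens("dog chases boy") == bag_of_tokens("boy chases dog")
```

The last line illustrates the slide's warning: once only counts are kept, "dog chases boy" and "boy chases dog" are indistinguishable.
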
Natural Language Processing

Example sentence: "A dog is chasing a boy on the playground"
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase ("A dog"), Complex Verb ("is chasing"), Noun Phrase ("a boy"), Prep Phrase ("on the playground"), combined into Verb Phrases and a Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: with the rule Scared(x) if Chasing(_,x,_)., derive Scared(b1)
Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

General NLP: Too Difficult!

Word-level ambiguity
"design" can be a noun or a verb (ambiguous POS)
"root" has multiple meanings (ambiguous sense)
Syntactic ambiguity
"natural language processing" (modification)
"A man saw a boy with a telescope." (PP attachment)
Anaphora resolution
"John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
Presupposition
"He has quit smoking." implies that he smoked before.
Humans rely on context to interpret (when possible).
This context may extend beyond a given document!
(Taken from ChengXiang Zhai, CS 397cxz, Fall 2003)

Shallow Linguistics

Progress on useful sub-goals:
English lexicon
Part-of-speech tagging
Word sense disambiguation
Phrase detection / parsing

WordNet

An extensive lexical network for the English language
Contains over 138,838 words.
Several graphs, one for each part-of-speech.
Synsets (synonym sets), each defining a semantic sense.
Relationship information (antonym, hyponym, meronym, ...)
Downloadable for free (UNIX, Windows)
Expanding to other languages (Global WordNet Association)
Funded >$3 million, mainly government (translation interest)
Founder George Miller, National Medal of Science, 1991.
[Figure: synonym and antonym links among watery, moist, wet, damp and parched, dry, arid, anhydrous]

Part-of-Speech Tagging

Training data (annotated text):
"This sentence serves as an example of annotated text ..."
Det N V1 P Det N P V2 N

POS tagger input: "This is a new sentence."
POS tagger output: This/Det is/Aux a/Det new/Adj sentence/N

Pick the most likely tag sequence.
$$p(w_1,\dots,w_k,t_1,\dots,t_k) = \begin{cases} p(t_1 \mid w_1)\cdots p(t_k \mid w_k)\,p(w_1)\cdots p(w_k) & \text{independent assignment (most common tag)} \\ \prod_{i=1}^{k} p(w_i \mid t_i)\,p(t_i \mid t_{i-1}) & \text{partial dependency (HMM)} \end{cases}$$
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)

Word Sense Disambiguation

"The difficulties of computational linguistics are rooted in ambiguity." (sense of "rooted"?)
Tags of the neighboring words: N Aux V P N
Supervised learning
Features:
Neighboring POS tags (N Aux V P N)
Neighboring words (linguistics are rooted in ambiguity)
Stemmed form (root)
Dictionary/thesaurus entries of neighboring words
High co-occurrence words (plant, tree, origin, ...)
Other senses of the word within the discourse
Algorithms:
Rule-based learning (e.g. IG guided)
Statistical learning (e.g. naïve Bayes)
Unsupervised learning (e.g. nearest neighbor)

Parsing
Choose the most likely parse tree

Probabilistic CFG
Grammar
S → NP VP 1.0
NP → Det BNP 0.3
NP → BNP 0.4
NP → NP PP 0.3
BNP → N
VP → V
VP → Aux V NP
VP → VP PP
PP → P NP 1.0
...
Lexicon
V → chasing 0.01
Aux → is
N → dog 0.003
N → boy
N → playground
Det → the
Det → a
P → on
...

[Figure: two parse trees for "A dog is chasing a boy on the playground". In one, the PP "on the playground" attaches to the VP (probability of this tree = 0.000015); in the other, it attaches to the NP "a boy" (probability of this tree = 0.000011)]
(Adapted from ChengXiang Zhai, CS 397cxz, Fall 2003)

Obstacles

Ambiguity
"A man saw a boy with a telescope."
Computational intensity
Imposes a context horizon.
Text mining NLP approach:
1. Locate promising fragments using fast IR methods (bag-of-tokens).
2. Only apply slow NLP techniques to promising fragments.

Summary: Shallow NLP

However, shallow NLP techniques are feasible and useful:
Lexicon: machine-understandable linguistic knowledge
possible senses, definitions, synonyms, antonyms, type-of, etc.
POS tagging: limit ambiguity (word/POS), entity extraction
"...research interests include text mining as well as bioinformatics." (text mining: NP; bioinformatics: N)
WSD: stem/synonym/hyponym matches (doc and query)
Query: "Foreign cars"; Document: "I'm selling a 1976 Jaguar"
Parsing: logical view of information (inference?, translation?)
"A man saw a boy with a telescope."
Even without complete NLP, any additional knowledge extracted from text data can only be beneficial.
Ingenuity will determine the applications.

Mining Text and Web Data

Text mining, natural language processing and information extraction: An Introduction
Text information system and information retrieval
Text categorization methods
Mining Web linkage structures
Summary

Text Databases and IR

Text databases (document databases)
Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
Data stored is usually semi-structured
Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval
A field developed in parallel with database systems
Information is organized into (a large number of) documents
Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Information Retrieval

Typical IR systems
Online library catalogs
Online document management systems
Information retrieval vs. database systems
Some DB problems are not present in IR, e.g., update, transaction management, complex objects
Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Basic Measures for Text Retrieval

[Figure: Venn diagram of all documents, showing the relevant set, the retrieved set, and their intersection (relevant & retrieved)]

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
$$\mathrm{precision} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Retrieved}\}|}$$
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
$$\mathrm{recall} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Relevant}\}|}$$

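Both measures reduce to set arithmetic, as the following sketch shows. The document ids are made up, and returning 0.0 when a denominator is empty is a convention assumed here, not something the slides specify.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall = precision_recall({1, 2, 3, 4}, {3, 4, 5})
```

In this example two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were retrieved (recall 1/2).
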
Information Retrieval Techniques

Basic concepts
A document can be described by a set of representative keywords called index terms.
Different index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, tf-idf)
DBMS analogy
Index terms → attributes
Weights → attribute values

Information Retrieval Techniques

Index terms (attribute) selection:
Stop list
Word stem
Index terms weighting methods
Terms × documents frequency matrices
Information retrieval models:
Boolean model
Vector model
Probabilistic model

Boolean Model

Consider that index terms are either present or absent in a document
As a result, the index term weights are assumed to be all binary
A query is composed of index terms linked by three connectives: not, and, and or
e.g.: car and repair, plane or airplane
The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query

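A minimal sketch of Boolean matching, where a document is reduced to the set of index terms it contains. The nested-tuple query representation and the sample documents are assumptions made for this example.

```python
def matches(query, doc_terms):
    """Evaluate a Boolean query against a set of index terms.

    A query is either a term (str) or a tuple whose first element is the
    connective: ("not", q), ("and", q1, q2, ...) or ("or", q1, q2, ...).
    """
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "not":
        return not matches(args[0], doc_terms)
    if op == "and":
        return all(matches(a, doc_terms) for a in args)
    if op == "or":
        return any(matches(a, doc_terms) for a in args)
    raise ValueError(f"unknown connective: {op!r}")


# Hypothetical documents, each given as its set of index terms.
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"car", "rental"},
    "d3": {"airplane", "repair"},
}
query = ("and", "repair", ("or", "car", "airplane"), ("not", "rental"))
relevant = sorted(d for d, terms in docs.items() if matches(query, terms))
```

Each document is either returned or not; the model produces no ranking among the matches.
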
Keyword-Based Retrieval

A document is represented by a string, which can be identified by a set of keywords
Queries may use expressions of keywords
E.g., car and repair shop, tea or coffee, DBMS but not Oracle
Queries and retrieval should consider synonyms, e.g., repair and maintenance
Major difficulties of the model
Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
Polysemy: The same keyword may mean different things in different contexts, e.g., mining

Similarity-Based Retrieval in Text Data

Finds similar documents based on a set of common keywords
Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc.
Basic techniques
Stop list
Set of words that are deemed irrelevant, even though they may appear frequently
E.g., a, the, of, for, to, with, etc.
Stop lists may vary when document set varies

Similarity-Based Retrieval in Text Data

Word stem
Several words are small syntactic variants of each other since they share a common word stem
E.g., drug, drugs, drugged
A term frequency table
Each entry frequent_table(i, j) = # of occurrences of the word t_i in document d_j
Usually, the ratio instead of the absolute number of occurrences is used
Similarity metrics: measure the closeness of a document to a query (a set of keywords)
Relative term occurrences
Cosine distance: $\mathrm{sim}(v_1, v_2) = \dfrac{v_1 \cdot v_2}{|v_1|\,|v_2|}$
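The pieces above (stop list, relative term frequencies, cosine measure) can be combined in a few lines. This is a sketch under simplifying assumptions: tokens are split on whitespace, no stemming is applied, and the stop list is just the handful of example words; `term_vector` and `cosine` are illustrative names.

```python
import math
from collections import Counter

# Illustrative stop list; a real one depends on the document set.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}

def term_vector(text):
    """Relative term frequencies of a text after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_LIST]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}

def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

doc = term_vector("the repair of the car")
query = term_vector("car repair")
unrelated = term_vector("tea with coffee")
```

After stop-word removal, the document and the query contain the same two terms in equal proportion, so their cosine is 1, while the unrelated text shares no term with the document and scores 0.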
Indexing Techniques

Inverted index
Maintains two hash- or B+-tree indexed tables:
document_table: a set of document records <doc_id, postings_list>
term_table: a set of term records <term, postings_list>
Answer query: Find all docs associated with one or a set of terms
+ easy to implement
- does not handle synonymy and polysemy well, and posting lists could be too long (storage could be very large)
Signature file
Associate a signature with each document
A signature is a representation of an ordered list of terms that describe the document
Order is obtained by frequency analysis, stemming and stop lists
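A minimal in-memory sketch of the inverted index follows. It uses Python dicts in place of hash- or B+-tree indexed tables, and whitespace tokenization in place of real preprocessing; the helper names and the sample documents are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term_table (term -> doc ids) and document_table (doc id -> terms)."""
    term_table = defaultdict(set)
    document_table = {}
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].add(doc_id)
    return term_table, document_table

def docs_with_all(term_table, terms):
    """Answer a query: ids of documents that contain every given term."""
    postings = [term_table.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "car repair shop", 2: "car insurance", 3: "bike repair"}
term_table, document_table = build_inverted_index(docs)
result = docs_with_all(term_table, ["car", "repair"])
```

A multi-term query reduces to intersecting posting lists, which is why long posting lists are the main cost of this structure.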

Vector Space Model

Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection.
The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors that represent them, using measures such as the Euclidean distance or the cosine of the angle between these two vectors.

Probabilistic Model

Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set)
Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made
This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents
An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set

Types of Text Data Mining

Keyword-based association analysis
Automatic document classification
Similarity detection
Cluster documents by a common author
Cluster documents containing information from a common source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual patterns
Hypertext analysis
Patterns in anchors/links
Anchor text correlations with linked objects

Keyword-Based Association Analysis

Motivation
Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them
Association Analysis Process
Preprocess the text data by parsing, stemming, removing stop words, etc.
Invoke association mining algorithms
Consider each document as a transaction
View a set of keywords in the document as a set of items in the transaction
Term-level association mining
No need for human effort in tagging documents
The number of meaningless results and the execution time are greatly reduced
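Treating each document as a transaction, the simplest form of this analysis counts keyword pairs that co-occur in enough documents. The sketch below covers only frequent pairs rather than a full association mining algorithm such as Apriori; the function name and the sample keyword lists are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(documents, min_support):
    """Keyword pairs that co-occur in at least min_support documents.

    Each document is one transaction whose items are its distinct keywords.
    """
    pair_counts = Counter()
    for keywords in documents:
        for pair in combinations(sorted(set(keywords)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical keyword sets, assumed to be already parsed and stemmed.
documents = [
    ["data", "mining", "text"],
    ["data", "mining", "web"],
    ["text", "retrieval"],
]
pairs = frequent_pairs(documents, min_support=2)
```

Sorting the keywords before pairing ensures that the same pair is always counted under one key.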

Text Classification

Motivation
Automatic classification for the large number of online text documents (Web pages, emails, corporate intranets, etc.)
Classification Process
Data preprocessing
Definition of training set and test sets
Creation of the classification model using the selected classification algorithm
Classification model validation
Classification of new/unknown text documents
Text document classification differs from the classification of relational data
Document databases are not structured according to attribute-value pairs

Text Classification (2)

Classification Algorithms:
Support Vector Machines
K-Nearest Neighbors
Naive Bayes
Neural Networks
Decision Trees
Association rule-based
Boosting

Document Clustering

Motivation
Automatically group related documents based on their contents
No predetermined training sets or taxonomies
Generate a taxonomy at runtime
Clustering Process
Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
Hierarchical clustering: compute similarities applying clustering algorithms.
Model-based clustering (neural network approach): clusters are represented by exemplars (e.g., SOM)
Text Categorization

Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard classification (supervised learning) problem
[Figure: a categorization system assigns documents to categories such as Sports, Business, Education, and Science]
Applications

News article classification
Automatic email filtering
Webpage classification
Word sense disambiguation

Categorization Methods
Manual: Typically rule-based
Does not scale up (labor-intensive, rule inconsistency)
May be appropriate for special data on a particular domain
Automatic: Typically exploiting machine learning techniques
Vector space model based
Prototype-based (Rocchio)
K-nearest neighbor (KNN)
Decision tree (learn rules)
Neural networks (learn non-linear classifier)
Support Vector Machines (SVM)
Probabilistic or generative model based
Naive Bayes classifier
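Of the vector space methods listed, K-nearest neighbor is the easiest to sketch: a new document takes the majority label of its k most similar training documents. The example below uses raw term counts and cosine similarity; the function names, the training texts, and the choice k = 3 are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(v1, v2):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(doc, training, k=3):
    """Label doc by majority vote among its k most similar training documents."""
    vec = Counter(doc.lower().split())
    ranked = sorted(
        training,
        key=lambda item: cosine(vec, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples.
training = [
    ("team wins football match", "Sports"),
    ("football player scores goal", "Sports"),
    ("stock market falls", "Business"),
    ("company reports market profit", "Business"),
]
label = knn_classify("football match tonight", training, k=3)
```

No model is built in advance; all the work happens at classification time, which is why KNN becomes expensive on large training sets.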

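How to Measure Similarity?

Given two documents $d_1 = (w_{11}, \dots, w_{1n})$ and $d_2 = (w_{21}, \dots, w_{2n})$

Similarity definition
dot product: $\mathrm{sim}(d_1, d_2) = \sum_{i=1}^{n} w_{1i} w_{2i}$

normalized dot product (or cosine): $\mathrm{sim}(d_1, d_2) = \dfrac{\sum_{i=1}^{n} w_{1i} w_{2i}}{\sqrt{\sum_{i=1}^{n} w_{1i}^2}\,\sqrt{\sum_{i=1}^{n} w_{2i}^2}}$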

Illustrative Example

To whom is newdoc more similar?

Each entry is the term count, with the weight (count × IDF) in parentheses.

              text    mining  travel  map     search  engine  govern  president  congress
IDF (faked)   2.4     4.5     2.8     3.3     2.1     5.4     2.2     3.2        4.3
doc1          2(4.8)  1(4.5)                  1(2.1)  1(5.4)
doc2          1(2.4)          2(5.6)  1(3.3)
doc3                                                          1(2.2)  1(3.2)     1(4.3)
newdoc        1(2.4)  1(4.5)

Sim(newdoc, doc1) = 4.8*2.4 + 4.5*4.5
Sim(newdoc, doc2) = 2.4*2.4
Sim(newdoc, doc3) = 0
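The three similarities in this example can be checked with a short script. The counts and the (faked) IDF values are taken from the example; the dict layout and the helper names `weights` and `dot` are assumptions made for illustration.

```python
IDF = {"text": 2.4, "mining": 4.5, "travel": 2.8, "map": 3.3, "search": 2.1,
       "engine": 5.4, "govern": 2.2, "president": 3.2, "congress": 4.3}

COUNTS = {
    "doc1": {"text": 2, "mining": 1, "search": 1, "engine": 1},
    "doc2": {"text": 1, "travel": 2, "map": 1},
    "doc3": {"govern": 1, "president": 1, "congress": 1},
    "newdoc": {"text": 1, "mining": 1},
}

def weights(counts):
    """Weight of each term: count times IDF."""
    return {term: count * IDF[term] for term, count in counts.items()}

def dot(v1, v2):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v2.get(term, 0.0) for term, w in v1.items())

vectors = {name: weights(counts) for name, counts in COUNTS.items()}
sims = {name: dot(vectors["newdoc"], vectors[name])
        for name in ("doc1", "doc2", "doc3")}
closest = max(sims, key=sims.get)
```

Under the plain dot product, newdoc is closest to doc1, because they share both "text" and "mining", while doc3 shares no term with newdoc at all.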



Categorization Methods
Vector space model
KNN
Decision tree
Neural network
Support vector machine
Probabilistic model
Naive Bayes classifier
Many, many others and variants exist [F.S.02]
e.g., Bim, Nb, Ind, Swap-1, LLSF, Widrow-Hoff, Rocchio, Gis-W, ...
Evaluation (cont.)
Benchmarks
Classic: Reuters collection
A set of newswire stories classified under categories related to economics.
Effectiveness
Difficulties of strict comparison
different parameter settings
different split (or selection) between training and testing
various optimizations
However, widely recognized
Best: Boosting-based committee classifier & SVM
Worst: Naive Bayes classifier
Need to consider other factors, especially efficiency
Summary: Text Categorization

Wide application domain
Comparable effectiveness to professionals
Manual TC is not 100% and unlikely to improve substantially.
A.T.C. is growing at a steady pace
Prospects and extensions
Very noisy text, such as text from O.C.R.
Speech transcripts

Research Problems in Text Mining

Google: what is the next step?
How to find the pages that approximately match sophisticated documents, with incorporation of user profiles or preferences?
Look back at Google: inverted indices
Construction of indices for the sophisticated documents, with incorporation of user profiles or preferences
Similarity search of such pages using such indices

Mining Text and Web Data

Text mining, natural language processing and information extraction: An Introduction
Text categorization methods
Mining Web linkage structures
Based on the slides by Deng Cai

Summary

Outline

Background on Web Search

VIPS (VIsion-based Page Segmentation)

Block-based Web Search

Block-based Link Analysis

Web Image Search & Clustering
Search Engine: Two Rank Functions

[Figure: search engine architecture. Components shown: Web Pages, Web Page Parser, Indexer, Anchor Text Generator, Web Graph Constructor, Inverted Index, Term Dictionary (Lexicon), Forward Index, Meta Data, Forward Link, Backward Link (Anchor Text), URL Dictionary, Web Topology Graph. Search uses two rank functions: Relevance Ranking (similarity based on content or text) and Importance Ranking (Link Analysis, ranking based on link structure analysis).]
Relevance Ranking
Inverted index
A data structure for supporting text queries
like the index in a book

[Figure: indexing turns documents on disks into an inverted index that maps each term to the documents containing it]
aalborg    3452, 11437, ...
arm        4, 19, 29, 98, 143, ...
armada     145, 457, 789, ...
armadillo  678, 2134, 3970, ...
armani     90, 256, 372, 511, ...
...
zz         602, 1189, 3209, ...
The PageRank Algorithm

Basic idea
significance of a page is determined by the significance of the pages linking to it
More precisely:
Link graph: adjacency matrix $A$, with $A_{ij} = 1$ if page $i$ links to page $j$, and $A_{ij} = 0$ otherwise
Constructs a probability transition matrix $M$ by renormalizing each row of $A$ to sum to 1
Treat the web graph as a Markov chain (random surfer) with transition matrix $\varepsilon U + (1 - \varepsilon) M$, where $U_{ij} = 1/n$ for all $i, j$
The vector of PageRank scores $p$ is then defined to be the stationary distribution of this Markov chain. Equivalently, $p$ is the principal right eigenvector of the transition matrix $(\varepsilon U + (1 - \varepsilon) M)^T$:
$$(\varepsilon U + (1 - \varepsilon) M)^T p = p$$
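The stationary distribution can be approximated by power iteration: start from a uniform vector and repeatedly apply the transposed transition matrix. The sketch below assumes every page has at least one out-link and uses a mixing weight of 0.15 for the uniform matrix U; that value, the parameter name `eps`, and the three-page graph are illustrative assumptions.

```python
def pagerank(links, n, eps=0.15, iterations=100):
    """Power iteration for p = (eps*U + (1 - eps)*M)^T p.

    links maps each page (0..n-1) to the list of pages it links to.
    Assumes every page has at least one out-link (no dangling pages).
    """
    p = [1.0 / n] * n
    for _ in range(iterations):
        # Uniform jump term: eps * U^T p equals eps/n per page since sum(p) = 1.
        nxt = [eps / n] * n
        for i, targets in links.items():
            # Row i of M spreads p[i] evenly over the out-links of page i.
            share = (1.0 - eps) * p[i] / len(targets)
            for j in targets:
                nxt[j] += share
        p = nxt
    return p

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
scores = pagerank(links, n=3)
```

Page 2 receives links from both other pages and ends up with the highest score; page 0 inherits most of that score through the single link from page 2.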
Layout Structure
Compared to plain text, a web page is a 2D presentation
Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD

TEXT BODY (with position and font
type): The International Atomic Energy
Agency has concluded that Iran has
secretly produced small amounts of
nuclear materials including low enriched
uranium and plutonium that could be used
to develop nuclear weapons according to a
confidential report obtained by CNN
Hyperlink:
URL: http://www.cnn.com/...
Anchor Text: Al Qaeda
Image:
URL: http://www.cnn.com/image/...
Alt & Caption: Iran nuclear
Anchor Text: CNN Homepage News

Web Page Block: Better Information Unit

Web Page Blocks

Importance = Low

Importance = Med

Importance = High

Motivation for VIPS (VIsion-based Page Segmentation)

Problems of treating a web page as an atomic unit
Web page usually contains not only pure content
Noise: navigation, decoration, interaction, ...
Multiple topics
Different parts of a page are not equally important
Web page has internal structure
Two-dimensional logical structure & visual layout presentation
> Free text document
< Structured document
Layout: the 3rd dimension of a Web page
1st dimension: content
2nd dimension: hyperlink
Is DOM a Good Representation of Page Structure?

Page segmentation using DOM
Extract structural tags such as P, TABLE, UL, TITLE, H1~H6, etc.
DOM is more related to content display, and does not necessarily reflect semantic structure
How about XML?
A long way to go to replace HTML

VIPS Algorithm
Motivation:
In many cases, topics can be distinguished with visual clues, such as position, distance, font, color, etc.
Goal:
Extract the semantic structure of a web page based on its visual presentation.
Procedure:
Top-down partition the web page based on the separators
Result
A tree structure, each node in the tree corresponds to a block in the page.
Each node will be assigned a value (Degree of Coherence) to indicate how coherent the content in the block is, based on visual perception.
Each block will be assigned an importance value
Hierarchy or flat

VIPS: An Example

A hierarchical structure of layout blocks
A Degree of Coherence (DoC) is defined for each block
Shows the intra-coherence of the block
DoC of a child block must be no less than its parent's
The Permitted Degree of Coherence (PDoC) can be predefined to achieve different granularities for the content structure
The segmentation will stop only when every block's DoC is no less than PDoC
The smaller the PDoC, the coarser the content structure would be
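The stopping rule can be illustrated on a toy block tree. This sketch does not implement VIPS itself (no visual separators are detected); it only shows how PDoC controls granularity once a block tree with DoC values exists. The dict layout, block names, and DoC values are made up for illustration.

```python
def segment(block, pdoc):
    """Return the names of the leaf blocks of a top-down segmentation.

    block is a dict with "name", "doc" (Degree of Coherence) and optional
    "children". A block is split further only while its DoC is below pdoc.
    """
    children = block.get("children", [])
    if block["doc"] >= pdoc or not children:
        return [block["name"]]
    leaves = []
    for child in children:
        leaves.extend(segment(child, pdoc))
    return leaves

# Hypothetical page; each child's DoC is no less than its parent's.
page = {"name": "page", "doc": 1, "children": [
    {"name": "header", "doc": 8},
    {"name": "body", "doc": 4, "children": [
        {"name": "article", "doc": 9},
        {"name": "sidebar", "doc": 7},
    ]},
]}
coarse = segment(page, pdoc=3)
fine = segment(page, pdoc=6)
```

With PDoC = 3 the body block is coherent enough to stay whole; raising PDoC to 6 forces it to be split into its two children.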

Block-based Web Search

Index block instead of whole page
Block retrieval
Combining DocRank and BlockRank
Block query expansion
Select expansion term from relevant blocks

A Sample of User Browsing Behavior

ImageRank

[Figure panels: Relevance Ranking, Importance Ranking, Combined Ranking]

ImageRank vs. PageRank

Dataset
26.5 million web pages
11.6 million images
Query set
45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100 results returned by the system (iFind)
Ranking method

$$s(x) = \alpha \cdot \mathrm{rank}_{\mathrm{importance}}(x) + (1 - \alpha) \cdot \mathrm{rank}_{\mathrm{relevance}}(x)$$
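The ranking method is a weighted combination of an importance score and a relevance score. The sketch below assumes both are normalized scores where higher is better, and names the mixing weight `alpha`; the items and their scores are made up.

```python
def combined_score(importance, relevance, alpha=0.25):
    """Weighted mix: alpha * importance + (1 - alpha) * relevance."""
    return alpha * importance + (1.0 - alpha) * relevance

def rank(items, alpha=0.25):
    """Sort (name, importance, relevance) tuples by combined score, best first."""
    return sorted(
        items,
        key=lambda item: combined_score(item[1], item[2], alpha),
        reverse=True,
    )

# Hypothetical images with (importance, relevance) scores in [0, 1].
items = [("a", 0.9, 0.2), ("b", 0.3, 0.8), ("c", 0.5, 0.5)]
order = [name for name, _, _ in rank(items)]
importance_only = [name for name, _, _ in rank(items, alpha=1.0)]
```

With a small weight on importance, the highly relevant image "b" comes first; with the weight set to 1 the order follows importance alone.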

ImageRank vs. PageRank

Image search accuracy using ImageRank and PageRank. Both of them achieved their best results at α = 0.25.
Example on Image Clustering & Embedding

1710 JPG images in 1287 pages are crawled within the website
http://www.yahooligans.com/content/animals/

Six categories: Fish, Reptile, Mammal, Bird, Amphibian, Insect


Web Image Search Result Presentation

Figure 1. Top 8 returns of query pluto in Google's image search engine (a) and AltaVista's image search engine (b)

Two different topics in the search result
A possible solution:
Cluster search results into different semantic groups

Three Kinds of WWW Image Representation

Visual Feature Based Representation
Traditional CBIR

Textual Feature Based Representation
Surrounding text in image block

Link Graph Based Representation
Image graph embedding

Hierarchical Clustering

Clustering based on three representations
Visual feature
Hard to reflect the semantic meaning
Textual feature
Semantic
Sometimes the surrounding text is too little
Link graph:
Semantic
Many disconnected subgraphs (too many clusters)
Two Steps:
Using texts and link information to get semantic clusters
For each cluster, using visual feature to reorganize the images to facilitate user's browsing
Our System
Dataset
26.5 million web pages
http://dir.yahoo.com/Arts/Visual_Arts/Photography/Museums_and_Galleries/
11.6 million images
Filter images whose ratio between width and height is greater than 5 or smaller than 1/5
Removed images whose width and height are both smaller than 60 pixels
Analyze pages and index images
VIPS: Pages -> Blocks
Surrounding texts used to index images
An illustrative example
Query: Pluto
Top 500 results
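The two image filters in the dataset description translate into a small predicate. This is a sketch with an assumed function name; it takes width and height in pixels and assumes both are positive.

```python
def keep_image(width, height):
    """Return True if an image passes both size filters.

    Drops images whose width/height ratio is greater than 5 or smaller
    than 1/5, and images whose width and height are both under 60 pixels.
    """
    ratio = width / height
    if ratio > 5 or ratio < 1 / 5:
        return False  # extreme aspect ratio
    if width < 60 and height < 60:
        return False  # both dimensions under 60 pixels
    return True
```

For example, `keep_image(400, 300)` is kept, while `keep_image(600, 100)` (ratio 6) and `keep_image(50, 50)` are filtered out.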
Clustering Using Visual Feature

Figure 5. Five clusters of search results of query pluto using low-level visual feature. Each row is a cluster.
From the perspectives of color and texture, the clustering results are quite good. Different clusters have different colors and textures. However, from a semantic perspective, these clusters make little sense.
Clustering Using Textual Feature

Figure 6. The Eigengap curve with k for the pluto case using textual representation

Figure 7. Six clusters of search results of query pluto using textual feature. Each row is a cluster

Six semantic categories are correctly identified if we choose k = 6.

Summary

More improvement on web search can be made by mining webpage layout structure
Leverage visual cues for web information analysis & information extraction
Demos:
http://www.ews.uiuc.edu/~dengcai2
Papers
VIPS demo & dll

Review Questions

Define spatial data mining.
What is document rank in the context of text mining?
Can we construct a spatial data warehouse?
List the two types of measures in a spatial data cube.
List the two types of multimedia indexing and retrieval systems.
Give a note on the multimedia data cube.
What is information retrieval?
List the methods for information retrieval.
What is meant by an authoritative web page?
What is web usage mining?

Bibliography

Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
