
Data Mining

Practical Machine Learning Tools and Techniques
Slides for Chapter 1 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

What's it all about?

Data vs. information
Data mining and machine learning
Structural descriptions
  Rules: classification and association
  Decision trees
Datasets
  Weather, contact lens, CPU performance, labor negotiation data, soybean classification
Fielded applications
  Ranking web pages, loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis
Generalization as search
Data mining and ethics

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)

Data vs. information

Society produces huge amounts of data
  Sources: business, science, medicine, economics, geography, environment, sports, ...
Potentially valuable resource
Raw data is useless: need techniques to automatically extract information from it
  Data: recorded facts
  Information: patterns underlying the data


Information is crucial

Example 1: in vitro fertilization
  Given: embryos described by 60 features
  Problem: selection of embryos that will survive
  Data: historical records of embryos and outcome
Example 2: cow culling
  Given: cows described by 700 features
  Problem: selection of cows that should be culled
  Data: historical records and farmers' decisions


Data mining

Extracting
  implicit,
  previously unknown,
  potentially useful
information from data
Needed: programs that detect patterns and regularities in the data
Strong patterns ⇒ good predictions
  Problem 1: most patterns are not interesting
  Problem 2: patterns may be inexact (or spurious)
  Problem 3: data may be garbled or missing

Machine learning techniques

Algorithms for acquiring structural descriptions from examples
Structural descriptions represent patterns explicitly
  Can be used to predict outcome in new situation
  Can be used to understand and explain how prediction is derived (may be even more important)
Methods originate from artificial intelligence, statistics, and research on databases


Structural descriptions

Example: if-then rules

If tear production rate = reduced
   then recommendation = none
Otherwise, if age = young and astigmatic = no
   then recommendation = soft

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard


Can machines really learn?

Definitions of "learning" from dictionary:
  To get knowledge of by study, experience, or being taught
  To become aware by information or from observation
    Difficult to measure
  To commit to memory
  To be informed of, ascertain; to receive instruction
    Trivial for computers
Operational definition:
  Things learn when they change their behavior in a way that makes them perform better in the future.
    Does a slipper learn?
Does learning imply intention?

The weather problem

Conditions for playing a certain game

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
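
The five weather rules can be read as an ordered list in which the first matching rule decides the outcome. The following is a minimal Python sketch under that assumption; the first-match semantics and the names `RULES`, `SAMPLES` and `classify` are illustrative choices, not something the slides specify. It checks the rules against the four sample rows of the weather table.

```python
# Each rule: (conditions that must all hold, predicted value of "play").
# Assumption: rules are tried top to bottom and the first match wins.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),  # "none of the above": an empty condition matches everything
]

def classify(instance):
    """Return the prediction of the first rule whose conditions all match."""
    for conditions, prediction in RULES:
        if all(instance[attr] == value for attr, value in conditions.items()):
            return prediction
    raise ValueError("no rule matched")  # unreachable while a default rule exists

# The four sample rows of the weather table, with their recorded "play" value.
SAMPLES = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False}, "yes"),
    ({"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False}, "yes"),
]

predictions = [classify(instance) for instance, _ in SAMPLES]
print(predictions)
```

On these four rows the predictions agree with the recorded outcomes; the temperature attribute is never consulted by any rule.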


Ross Quinlan
Machine learning researcher from 1970s
University of Sydney, Australia
1986 "Induction of decision trees", ML Journal
1993 C4.5: Programs for machine learning. Morgan Kaufmann
199? Started


Classification vs. association rules

Classification rule:
predicts value of a given attribute (the classification of an example)

If outlook = sunny and humidity = high
   then play = no

Association rule:
predicts value of arbitrary attribute (or combination)

If temperature = cool then humidity = normal
If humidity = normal and windy = false
   then play = yes
If outlook = sunny and play = no
   then humidity = high
If windy = false and play = no
   then outlook = sunny and humidity = high


Weather data with mixed attributes

Some attributes have numeric values

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes


The contact lenses data

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None


A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
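
One way to check the claim that this rule set is complete and correct is to run it over all 24 instances of the contact lenses data. The sketch below assumes "complete" means every instance is matched by at least one rule, and "correct" means no matching rule disagrees with the recorded recommendation. The names `DATA`, `RULES` and `fired` are illustrative.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Recommended lenses for the 24 attribute combinations, in the same order
# as the contact lenses table (tear rate varies fastest, age slowest).
LABELS = [
    "none", "soft", "none", "hard", "none", "soft", "none", "hard",
    "none", "soft", "none", "hard", "none", "soft", "none", "none",
    "none", "none", "none", "hard", "none", "soft", "none", "none",
]

FIELDS = ("age", "prescription", "astigmatic", "tear_rate")
DATA = [
    (dict(zip(FIELDS, combo)), label)
    for combo, label in zip(product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES), LABELS)
]

# The nine rules of the slide as (conditions, recommendation) pairs.
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]

def fired(instance):
    """Set of recommendations made by all rules whose conditions match."""
    return {rec for cond, rec in RULES
            if all(instance[attr] == value for attr, value in cond.items())}

complete = all(fired(x) for x, _ in DATA)        # every instance is covered
correct = all(fired(x) <= {y} for x, y in DATA)  # no rule contradicts the data
print(len(DATA), complete, correct)
```

Because every matching rule is checked rather than only the first, the result does not depend on the order in which the rules are listed.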


A decision tree for this problem


Classifying iris flowers

      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...


Predicting CPU performance

Example: 209 different computer configurations

      Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels       Performance
      MYCT             MMIN     MMAX     CACH        CHMIN  CHMAX   PRP
1     125              256      6000     256         16     128     198
2     29               8000     32000    32          8      32      269
208   480              512      8000     32          0      0       67
209   480              1000     4000     0           0      0       45

Linear regression function
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
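
To make the regression function concrete, the sketch below evaluates it for the first configuration listed in the table (MYCT 125, MMIN 256, MMAX 6000, CACH 256, CHMIN 16, CHMAX 128, recorded PRP 198). The coefficients are copied from the function as printed; the names `COEFFICIENTS` and `predict_prp` are illustrative.

```python
# Coefficients copied from the linear regression function on the slide.
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489, "MMIN": 0.0153, "MMAX": 0.0056,
    "CACH": 0.6410, "CHMIN": -0.2700, "CHMAX": 1.480,
}

def predict_prp(config):
    """Weighted sum of the six attributes plus the intercept."""
    return INTERCEPT + sum(w * config[name] for name, w in COEFFICIENTS.items())

# First configuration from the table; its recorded PRP is 198.
first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
predicted = predict_prp(first)
print(round(predicted, 1))  # about 336.9
```

For this one configuration the printed coefficients give roughly 336.9 against a recorded 198. A single instance says little about the overall fit, which would have to be judged over all 209 configurations.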


Data from labor negotiations

Attribute                        Type                        1     2     3     40
Duration                         (Number of years)           1     2     3     2
Wage increase first year         Percentage                  2%    4%    4.3%  4.5
Wage increase second year        Percentage                  ?     5%    4.4%  4.0
Wage increase third year         Percentage                  ?     ?     ?     ?
Cost of living adjustment        {none, tcf, tc}             none  tcf   ?     none
Working hours per week           (Number of hours)           28    35    38    40
Pension                          {none, ret-allw, emplcntr}  none  ?     ?     ?
Standby pay                      Percentage                  ?     13%   ?     ?
Shift-work supplement            Percentage                  ?     5%    4%    4
Education allowance              {yes, no}                   yes   ?     ?     ?
Statutory holidays               (Number of days)            11    15    12    12
Vacation                         {below-avg, avg, gen}       avg   gen   gen   avg
Long-term disability assistance  {yes, no}                   no    ?     ?     yes
Dental plan contribution         {none, half, full}          none  ?     full  full
Bereavement assistance           {yes, no}                   no    ?     ?     yes
Health plan contribution         {none, half, full}          none  ?     full  half
Acceptability of contract        {good, bad}                 bad   good  good  good

Decision trees for the labor data


Soybean classification

             Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
Fruit        Condition of fruit pods                    Normal
             Fruit spots              5                 ?
Leaf         Condition                2                 Abnormal
             Leaf spot size           3                 ?
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
Root         Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker


The role of domain knowledge

If leaf condition is normal
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then
diagnosis is rhizoctonia root rot
If leaf malformation is absent
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then
diagnosis is rhizoctonia root rot

But in this domain, "leaf condition is normal" implies "leaf malformation is absent"!

Fielded applications

The result of learning, or the learning method itself, is deployed in practical applications
  Processing loan applications
  Screening images for oil slicks
  Electricity supply forecasting
  Diagnosis of machine faults
  Marketing and sales
  Separating crude oil and natural gas
  Reducing banding in rotogravure printing
  Finding appropriate technicians for telephone faults
  Scientific applications: biology, astronomy, chemistry
  Automatic selection of TV programs
  Monitoring intensive care patients

Processing loan applications (American Express)

Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
  No! Borderline cases are most active customers


Enter machine learning

1000 training examples of borderline cases
20 attributes:
  age
  years with current employer
  years at current address
  years with the bank
  other credit cards possessed, ...
Learned rules: correct on 70% of cases
  human experts only 50%
Rules could be used to explain decisions to customers

Screening images

Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel


Enter machine learning

Extract dark regions from normalized image
Attributes:
  size of region
  shape, area
  intensity
  sharpness and jaggedness of boundaries
  proximity of other regions
  info about background
Constraints:
  Few training examples: oil slicks are rare!
  Unbalanced data: most dark regions aren't slicks
  Regions from same image form a batch
  Requirement: adjustable false-alarm rate

Load forecasting

Electricity supply companies need forecast of future demand for power
Forecasts of min/max load for each hour ⇒ significant savings
Given: manually constructed load model that assumes "normal" climatic conditions
Problem: adjust for weather conditions
Static model consists of:
  base load for the year
  load periodicity over the year
  effect of holidays

Enter machine learning

Prediction corrected using "most similar" days
Attributes:
  temperature
  humidity
  wind speed
  cloud cover readings
  plus difference between actual load and predicted load
Average difference among three "most similar" days added to static model
Linear regression coefficients form attribute weights in similarity function
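
The correction step described on this slide can be sketched as follows. This is a minimal illustration, not the fielded system: the weighted absolute-difference distance, the function name `corrected_forecast`, and all numbers (weights, history, loads) are assumptions made up for the example. In the application described, the weights would come from linear regression coefficients.

```python
def corrected_forecast(static_prediction, today, history, weights, k=3):
    """Add the mean residual of the k most similar past days to the static model.

    today   -- dict of weather attributes for the day being forecast
    history -- list of (attributes, actual_load - static_load) pairs
    weights -- per-attribute weights of the (assumed) distance function
    """
    def distance(a, b):
        return sum(w * abs(a[name] - b[name]) for name, w in weights.items())

    nearest = sorted(history, key=lambda day: distance(today, day[0]))[:k]
    adjustment = sum(residual for _, residual in nearest) / len(nearest)
    return static_prediction + adjustment

# Hypothetical numbers, for illustration only.
WEIGHTS = {"temperature": 1.0, "humidity": 0.2, "wind_speed": 0.5, "cloud_cover": 0.3}
HISTORY = [
    ({"temperature": 30, "humidity": 60, "wind_speed": 10, "cloud_cover": 2}, 40.0),
    ({"temperature": 29, "humidity": 65, "wind_speed": 12, "cloud_cover": 3}, 50.0),
    ({"temperature": 31, "humidity": 55, "wind_speed": 8, "cloud_cover": 1}, 30.0),
    ({"temperature": 10, "humidity": 80, "wind_speed": 25, "cloud_cover": 8}, -60.0),
]
TODAY = {"temperature": 30, "humidity": 62, "wind_speed": 11, "cloud_cover": 2}

forecast = corrected_forecast(1000.0, TODAY, HISTORY, WEIGHTS)
print(forecast)
```

With these made-up values the three warm days are the nearest neighbors, so their mean residual of 40 is added to the static prediction of 1000.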

Diagnosis of machine faults

Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/handcrafted rules


Enter machine learning

Available: 600 faults with expert's diagnosis
~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed handcrafted ones


Marketing and sales I

Companies precisely record massive amounts of marketing and sales data
Applications:
  Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
  Special offers: identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)


Marketing and sales II

Market basket analysis
  Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
Historical analysis of purchasing patterns
Identifying prospective customers
  Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
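
As a small illustration of market basket analysis, the sketch below counts how often pairs of items occur together in a transaction and keeps the pairs that reach a minimum count. It is a brute-force toy, not a full association rule learner; the transactions, the threshold, and the name `frequent_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout transactions, for illustration only.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "bread", "chips", "milk"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each pair one canonical (alphabetical) form.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(TRANSACTIONS, min_support=2)
print(pairs)
```

Counting every pair in every basket grows quadratically with basket size, so practical association learners prune the candidates rather than enumerating them all.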


Machine learning and statistics

Historical difference (grossly oversimplified):
  Statistics: testing hypotheses
  Machine learning: finding the right hypothesis
But: huge overlap
  Decision trees (C4.5 and CART)
  Nearest-neighbor methods
Today: perspectives have converged
  Most ML algorithms employ statistical techniques


Statisticians

Sir Ronald Aylmer Fisher
Born: 17 Feb 1890, London, England
Died: 29 July 1962, Adelaide, Australia
Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology

Leo Breiman
Developed decision trees
1984 Classification and Regression Trees. Wadsworth.


Generalization as search

Inductive learning: find a concept description that fits the data
Example: rule sets as description language
  Enormous, but finite, search space
Simple solution:
  enumerate the concept space
  eliminate descriptions that do not fit examples
  surviving descriptions contain target concept
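
The enumerate-and-eliminate idea can be sketched for single rules on the weather problem. The sketch assumes that a rule is a conjunction of attribute tests (with `None` meaning "don't care") plus a predicted class, and that a rule "fits" when it never fires on an example of a different class. Both are illustrative simplifications, as are the names `VALUES`, `covers` and `fits`. The four examples are the sample rows of the weather table.

```python
from itertools import product

# Attribute values of the weather problem; None stands for "don't care".
VALUES = {
    "outlook": ["sunny", "overcast", "rainy", None],
    "temperature": ["hot", "mild", "cool", None],
    "humidity": ["high", "normal", None],
    "windy": [True, False, None],
}
CLASSES = ["yes", "no"]

# Four training examples taken from the weather table: (attributes, play).
EXAMPLES = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False}, "yes"),
    ({"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False}, "yes"),
]

def covers(conditions, instance):
    """True if every non-wildcard test of the rule holds for the instance."""
    return all(v is None or instance[a] == v for a, v in conditions.items())

def fits(rule, examples):
    """Assumed notion of fit: the rule never fires on an example of another class."""
    conditions, label = rule
    return all(label == y for x, y in examples if covers(conditions, x))

# Enumerate every candidate rule, then eliminate those contradicted by the data.
candidates = [
    (dict(zip(VALUES, combo)), label)
    for combo in product(*VALUES.values())
    for label in CLASSES
]
survivors = [rule for rule in candidates if fits(rule, EXAMPLES)]
print(len(candidates), len(survivors))
```

Under this definition a rule that covers no example survives vacuously, which is one reason why many descriptions typically remain after elimination.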


Enumerating the concept space

Search space for weather problem
  4 x 4 x 3 x 3 x 2 = 288 possible combinations
  With 14 rules ⇒ 2.7 x 10^34 possible rule sets
Other practical problems:
  More than one description may survive
  No description may survive
    Language is unable to describe target concept
    or data contains noise
Another view of generalization as search: hill-climbing in description space according to pre-specified matching criterion
  Most practical algorithms use heuristic search that cannot guarantee to find the optimum solution
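
The two search-space figures for the weather problem can be reproduced with a few lines of arithmetic. The sketch assumes each attribute contributes its number of values plus one "don't care" option, and that a rule set is an ordered choice of 14 rules with repetition allowed, which is the reading that yields the quoted figure.

```python
# Values per attribute in the weather problem, plus one "don't care" option each.
options = {"outlook": 3 + 1, "temperature": 3 + 1, "humidity": 2 + 1, "windy": 2 + 1}
classes = 2  # play = yes / no

rules = classes
for n in options.values():
    rules *= n

rule_sets = rules ** 14  # 14 rules per set, order and repeats counted
print(rules, f"{rule_sets:.1e}")
```

This prints 288 and 2.7e+34, matching the numbers on the slide.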

Bias

Important decisions in learning systems:
  Concept description language
  Order in which the space is searched
  Way that overfitting to the particular training data is avoided
These form the "bias" of the search:
  Language bias
  Search bias
  Overfitting-avoidance bias


Language bias

Important question:
  is language universal or does it restrict what can be learned?
Universal language can express arbitrary subsets of examples
If language includes logical or ("disjunction"), it is universal
Example: rule sets
Domain knowledge can be used to exclude some concept descriptions a priori from the search


Search bias

Search heuristic
  "Greedy" search: performing the best single step
  "Beam search": keeping several alternatives
Direction of search
  General-to-specific
    E.g. specializing a rule by adding conditions
  Specific-to-general
    E.g. generalizing an individual instance into a rule
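
A greedy, general-to-specific search can be sketched as follows: start from the empty rule, which covers everything, and repeatedly add the single condition that gives the highest accuracy for the target class on the covered examples. This is an illustrative sketch rather than any specific published algorithm. The accuracy criterion, the tie-breaking (first best candidate wins) and the function names are assumptions, and the examples are the four sample rows of the weather table.

```python
# Four training examples from the weather table: (attributes, play).
EXAMPLES = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False}, "yes"),
    ({"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False}, "yes"),
]

def covered(rule, examples):
    """Examples whose attributes satisfy every condition of the rule."""
    return [(x, y) for x, y in examples if all(x[a] == v for a, v in rule.items())]

def accuracy(rule, examples, target):
    """Fraction of covered examples that belong to the target class."""
    hits = covered(rule, examples)
    return sum(y == target for _, y in hits) / len(hits) if hits else 0.0

def greedy_rule(examples, target):
    """Specialize the empty rule one condition at a time (general to specific)."""
    rule = {}
    while accuracy(rule, examples, target) < 1.0:
        candidates = [
            {**rule, attr: value}
            for x, _ in covered(rule, examples)
            for attr, value in x.items()
            if attr not in rule
        ]
        if not candidates:
            break  # nothing left to add; the rule stays impure
        # Greedy step: commit to the single best-looking extension.
        rule = max(candidates, key=lambda r: accuracy(r, examples, target))
    return rule

rule = greedy_rule(EXAMPLES, "yes")
print(rule)
```

A beam search would differ only in the last step of the loop: instead of committing to one extension it would carry the several best candidates forward to the next round.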


Overfitting-avoidance bias

Can be seen as a form of search bias
Modified evaluation criterion
  E.g. balancing simplicity and number of errors
Modified search strategy
  E.g. pruning (simplifying a description)
    Pre-pruning: stops at a simple description before search proceeds to an overly complex one
    Post-pruning: generates a complex description first and simplifies it afterwards


Data mining and ethics I

Ethical issues arise in practical applications
Anonymizing data is difficult
  85% of Americans can be identified from just zip code, birth date and sex
Data mining often used to discriminate
  E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
Ethical situation depends on application
  E.g. same information ok in medical application
Attributes may contain problematic information
  E.g. area code may correlate with race

Data mining and ethics II

Important questions:
  Who is permitted access to the data?
  For what purpose was the data collected?
  What kind of conclusions can be legitimately drawn from it?
Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?
