Practical Machine Learning Tools and Techniques
Slides for Chapter 1 of Data Mining by I. H. Witten, E. Frank and M. A. Hall
What's it all about?
Data vs information
Data mining and machine learning
Structural descriptions
  Rules: classification and association
  Decision trees
Datasets
  Weather, contact lens, CPU performance, labor negotiation data, soybean classification
Fielded applications
  Ranking web pages, loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis
Generalization as search
Data mining and ethics
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1)
Data vs. information
Society produces huge amounts of data
  Sources: business, science, medicine, economics, geography, environment, sports, ...
Potentially valuable resource
Raw data is useless: need techniques to automatically extract information from it
  Data: recorded facts
  Information: patterns underlying the data
Information is crucial
Example 1: in vitro fertilization
  Given: embryos described by 60 features
  Problem: selection of embryos that will survive
  Data: historical records of embryos and outcome
Example 2: cow culling
  Given: cows described by 700 features
  Problem: selection of cows that should be culled
  Data: historical records and farmers' decisions
Data mining
Extracting implicit, previously unknown, potentially useful information from data
Needed: programs that detect patterns and regularities in the data
Strong patterns => good predictions
  Problem 1: most patterns are not interesting
  Problem 2: patterns may be inexact (or spurious)
  Problem 3: data may be garbled or missing
Machine learning techniques
Algorithms for acquiring structural descriptions from examples
Structural descriptions represent patterns explicitly
  Can be used to predict outcome in new situation
  Can be used to understand and explain how prediction is derived (may be even more important)
Methods originate from artificial intelligence, statistics, and research on databases
Structural descriptions
Example: if-then rules
If tear production rate = reduced
then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
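The two rules above form an ordered list: the second is tried only when the first does not fire. A minimal Python sketch of that behaviour follows, applied to the four rows of the table. The dictionary keys (`age`, `prescription`, `astigmatic`, `tear_rate`) are names chosen for this illustration, and returning `None` when no rule fires is an assumption of the sketch, not something the slides specify.

```python
def recommend(example):
    """Apply the two if-then rules in order; None means no rule fired."""
    if example["tear_rate"] == "reduced":
        return "none"
    if example["age"] == "young" and example["astigmatic"] == "no":
        return "soft"
    return None  # not covered by the two rules shown on the slide


# The four rows of the table, using illustrative key names.
rows = [
    {"age": "young", "prescription": "myope", "astigmatic": "no", "tear_rate": "reduced"},
    {"age": "young", "prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"},
    {"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "reduced"},
    {"age": "presbyopic", "prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"},
]

predictions = [recommend(row) for row in rows]
print(predictions)
```

The first three rows get the recommendation listed in the table. The last row (Hard) is not covered, which shows that two rules alone do not describe the data completely.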
Can machines really learn?
Definitions of learning from dictionary:
  To get knowledge of by study, experience, or being taught
  To become aware by information or from observation
  To commit to memory
  To be informed of, ascertain; to receive instruction
  Trivial for computers
Operational definition:
  Things learn when they change their behavior in a way that makes them perform better in the future.
  Difficult to measure
Does a slipper learn?
Does learning imply intention?
The weather problem
Conditions for playing a certain game
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
Ross Quinlan
Machine learning researcher from 1970s
University of Sydney, Australia
1986 Induction of decision trees, ML Journal
1993 C4.5: Programs for machine learning. Morgan Kaufmann
199? Started
Classification vs. association rules
Classification rule: predicts value of a given attribute (the classification of an example)
If outlook = sunny and humidity = high
then play = no
Association rule: predicts value of arbitrary attribute (or combination)
If temperature = cool then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
then outlook = sunny and humidity = high
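Both kinds of rule can be checked against data in the same way: count the rows that match the "if" part, then count how many of those also match the "then" part. The sketch below does this for one classification rule and one association rule from this slide, using only the four weather rows shown earlier in these slides. The function name `evaluate` and the dictionary layout are choices made for this illustration.

```python
# The four rows of the weather excerpt from the earlier slide.
weather = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
]


def evaluate(rule_if, rule_then, data):
    """Return (covered, correct): rows matching the antecedent,
    and how many of those also match the consequent."""
    covered = [row for row in data if all(row[k] == v for k, v in rule_if.items())]
    correct = [row for row in covered if all(row[k] == v for k, v in rule_then.items())]
    return len(covered), len(correct)


# Classification rule: the consequent is always the class attribute "play".
classification = evaluate({"outlook": "sunny", "humidity": "high"}, {"play": "no"}, weather)

# Association rule: the consequent may involve any attributes, here two at once.
association = evaluate({"windy": False, "play": "no"},
                       {"outlook": "sunny", "humidity": "high"}, weather)

print(classification, association)
```

On this small excerpt the classification rule covers two rows and is right on both, while the association rule covers a single row and is right on it. The only structural difference between the two is which attributes are allowed on the right-hand side.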
Weather data with mixed attributes
Some attributes have numeric values
Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
and tear production rate = normal then recommendation = hard
If age = pre-presbyopic
and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
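The claim that this rule set is complete and correct can be checked mechanically. In the sketch below, "complete" is taken to mean that every one of the 24 instances is matched by at least one rule, and "correct" that every matching rule predicts the recorded recommendation. These two definitions, and all identifiers in the code, are assumptions made for this illustration. The labels are listed in the row order of the contact lenses table.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]
FIELDS = ("age", "prescription", "astigmatic", "tear_rate")

# All 24 attribute combinations; tear rate varies fastest, age slowest.
DATA = [dict(zip(FIELDS, combo))
        for combo in product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES)]

# Recommended lenses in the same order as DATA.
LABELS = [
    "none", "soft", "none", "hard", "none", "soft", "none", "hard",  # young
    "none", "soft", "none", "hard", "none", "soft", "none", "none",  # pre-presbyopic
    "none", "none", "none", "hard", "none", "soft", "none", "none",  # presbyopic
]

# The nine rules from the slide as (conditions, recommendation) pairs.
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def predictions(example):
    """Set of recommendations made by all rules whose conditions match."""
    return {cls for cond, cls in RULES
            if all(example[k] == v for k, v in cond.items())}


complete = all(predictions(ex) for ex in DATA)
correct = all(predictions(ex) <= {label} for ex, label in zip(DATA, LABELS))
print(complete, correct)
```

Several instances are matched by more than one rule. The check passes because overlapping rules always agree on the recommendation.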
A decision tree for this problem
Classifying iris flowers
     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
...
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
...
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
...
Predicting CPU performance
Example: 209 different computer configurations
     Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels      Performance
     MYCT             MMIN    MMAX      CACH        CHMIN  CHMAX  PRP
1    125              256     6000      256         16     128    198
2    29               8000    32000     32          8      32     269
...
208  480              512     8000      32          0      0      67
209  480              1000    4000      0           0      0      45
Linear regression function
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
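The regression function can be evaluated directly. The sketch below plugs in the first configuration from the table (MYCT 125, MMIN 256, MMAX 6000, CACH 256, CHMIN 16, CHMAX 128); the function and variable names are chosen for this illustration.

```python
# Coefficients copied from the linear regression function on the slide.
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(config):
    """Weighted sum of the six attributes plus the intercept."""
    return INTERCEPT + sum(COEFFICIENTS[name] * config[name] for name in COEFFICIENTS)


first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
predicted = predict_prp(first)
print(round(predicted, 1))
```

For this configuration the function gives roughly 336.9, against a recorded PRP of 198. The function is fitted to the whole dataset, so it need not match any single row closely.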
Data from labor negotiations
Attribute                        Type                           1     2     3     ...  40
Duration                         (Number of years)              1     2     3          2
Wage increase first year         Percentage                     2%    4%    4.3%       4.5
Wage increase second year        Percentage                     ?     5%    4.4%       4.0
Wage increase third year         Percentage                     ?     ?     ?          ?
Cost of living adjustment        {none, tcf, tc}                none  tcf   ?          none
Working hours per week           (Number of hours)              28    35    38         40
Pension                          {none, ret-allw, empl-cntr}    none  ?     ?          ?
Standby pay                      Percentage                     ?     13%   ?          ?
Shift-work supplement            Percentage                     ?     5%    4%         4
Education allowance              {yes, no}                      yes   ?     ?          ?
Statutory holidays               (Number of days)               11    15    12         12
Vacation                         {below-avg, avg, gen}          avg   gen   gen        avg
Long-term disability assistance  {yes, no}                      no    ?     ?          yes
Dental plan contribution         {none, half, full}             none  ?     full       full
Bereavement assistance           {yes, no}                      no    ?     ?          yes
Health plan contribution         {none, half, full}             none  ?     full       half
Acceptability of contract        {good, bad}                    bad   good  good       good
Decision trees for the labor data
Soybean classification
             Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
Fruit        Condition of fruit pods                    Normal
             Fruit spots              5                 ?
Leaf         Condition                2                 Abnormal
             Leaf spot size           3                 ?
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
Root         Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker
The role of domain knowledge
If leaf condition is normal
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then
diagnosis is rhizoctonia root rot
If leaf malformation is absent
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then
diagnosis is rhizoctonia root rot
But in this domain, "leaf condition is normal" implies "leaf malformation is absent"!
Fielded applications
The result of learning, or the learning method itself, is deployed in practical applications
  Processing loan applications
  Screening images for oil slicks
  Electricity supply forecasting
  Diagnosis of machine faults
  Marketing and sales
  Separating crude oil and natural gas
  Reducing banding in rotogravure printing
  Finding appropriate technicians for telephone faults
  Scientific applications: biology, astronomy, chemistry
  Automatic selection of TV programs
  Monitoring intensive care patients
Processing loan applications (American Express)
Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
  No! Borderline cases are most active customers
Enter machine learning
1000 training examples of borderline cases
20 attributes:
  age
  years with current employer
  years at current address
  years with the bank
  other credit cards possessed, ...
Learned rules: correct on 70% of cases
  human experts only 50%
Rules could be used to explain decisions to customers
Screening images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
Enter machine learning
Extract dark regions from normalized image
Attributes:
  size of region
  shape, area
  intensity
  sharpness and jaggedness of boundaries
  proximity of other regions
  info about background
Constraints:
  Few training examples: oil slicks are rare!
  Unbalanced data: most dark regions aren't slicks
  Regions from same image form a batch
  Requirement: adjustable false-alarm rate
Load forecasting
Electricity supply companies need forecast of future demand for power
Forecasts of min/max load for each hour => significant savings
Given: manually constructed load model that assumes normal climatic conditions
Problem: adjust for weather conditions
Static model consists of:
  base load for the year
  load periodicity over the year
  effect of holidays
Enter machine learning
Prediction corrected using most similar days
Attributes:
  temperature
  humidity
  wind speed
  cloud cover readings
  plus difference between actual load and predicted load
Average difference among three most similar days added to static model
Linear regression coefficients form attribute weights in similarity function
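The correction step described above can be sketched as a weighted nearest-neighbour lookup. Everything concrete in the code is hypothetical: the readings, the loads, the weighted Euclidean distance, and the unit weights are placeholders, since the slides only say that the weights come from linear regression coefficients and do not give the similarity function itself.

```python
from math import sqrt


def distance(a, b, weights):
    """Weighted Euclidean distance between two weather readings (an assumed similarity measure)."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def corrected_forecast(static_load, today, history, weights, k=3):
    """Add the mean (actual - predicted) difference of the k most similar days
    to the static model's prediction."""
    nearest = sorted(history, key=lambda day: distance(today, day["weather"], weights))[:k]
    correction = sum(day["actual"] - day["predicted"] for day in nearest) / len(nearest)
    return static_load + correction


# Hypothetical past days: (temperature, humidity, wind speed, cloud cover) plus loads.
history = [
    {"weather": (31, 62, 9, 2), "actual": 1050, "predicted": 1000},
    {"weather": (29, 58, 11, 3), "actual": 1020, "predicted": 1000},
    {"weather": (30, 61, 10, 2), "actual": 1080, "predicted": 1050},
    {"weather": (10, 90, 30, 8), "actual": 900, "predicted": 1000},
    {"weather": (12, 85, 25, 7), "actual": 950, "predicted": 1000},
]
weights = (1.0, 1.0, 1.0, 1.0)  # placeholder weights, not fitted by regression

forecast = corrected_forecast(1000, (30, 60, 10, 2), history, weights)
print(round(forecast, 2))
```

With these made-up numbers the three closest days have differences of 30, 50 and 20, so the static prediction of 1000 is raised by their mean, about 33.3.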
Diagnosis of machine faults
Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/hand-crafted rules
Enter machine learning
Available: 600 faults with expert's diagnosis
  ~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed hand-crafted ones
Marketing and sales I
Companies precisely record massive amounts of marketing and sales data
Applications:
  Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies)
  Special offers: identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)
Marketing and sales II
Market basket analysis
  Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
Historical analysis of purchasing patterns
Identifying prospective customers
  Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
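The simplest form of market basket analysis counts how often pairs of items appear in the same transaction and keeps the pairs seen often enough. The sketch below does this for five hypothetical checkout baskets; the items, the threshold and the function name are invented for illustration and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def frequent_pairs(baskets, min_count):
    """Count co-occurring item pairs and keep those seen in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


pairs = frequent_pairs(transactions, min_count=3)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Four pairs reach the threshold of three baskets here. Association rule learners extend this idea to larger item groups and to rules with a stated accuracy.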
Machine learning and statistics
Historical difference (grossly oversimplified):
  Statistics: testing hypotheses
  Machine learning: finding the right hypothesis
But: huge overlap
  Decision trees (C4.5 and CART)
  Nearest-neighbor methods
Today: perspectives have converged
  Most ML algorithms employ statistical techniques
Statisticians
Sir Ronald Aylmer Fisher
  Born: 17 Feb 1890 London, England
  Died: 29 July 1962 Adelaide, Australia
  Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology
Leo Breiman
  Developed decision trees
  1984 Classification and Regression Trees. Wadsworth.
Generalization as search
Inductive learning: find a concept description that fits the data
Example: rule sets as description language
  Enormous, but finite, search space
Simple solution:
  enumerate the concept space
  eliminate descriptions that do not fit examples
  surviving descriptions contain target concept
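The enumerate-and-eliminate procedure above can be tried on a very small concept space. The sketch below assumes that a description is a single rule in which each weather attribute is either tested against one value or left untested (`None`), followed by a predicted class. A rule is taken to fit the examples if it is right on every example it covers. Both assumptions, and all names, are choices made for this illustration; the examples are the four weather rows shown earlier in these slides.

```python
from itertools import product

# Each attribute is tested against one of its values, or not tested at all (None).
OUTLOOK = ["sunny", "overcast", "rainy", None]
TEMPERATURE = ["hot", "mild", "cool", None]
HUMIDITY = ["high", "normal", None]
WINDY = [True, False, None]
PLAY = ["yes", "no"]

# (outlook, temperature, humidity, windy) -> play, from the weather excerpt.
examples = [
    (("sunny", "hot", "high", False), "no"),
    (("sunny", "hot", "high", True), "no"),
    (("overcast", "hot", "high", False), "yes"),
    (("rainy", "mild", "normal", False), "yes"),
]


def covers(conditions, instance):
    """A rule covers an instance if every tested attribute matches."""
    return all(c is None or c == v for c, v in zip(conditions, instance))


def fits(rule, data):
    """A rule fits if it predicts the right class for every instance it covers."""
    *conditions, predicted = rule
    return all(predicted == label
               for instance, label in data if covers(conditions, instance))


# Step 1: enumerate the concept space. Step 2: eliminate rules that do not fit.
concept_space = list(product(OUTLOOK, TEMPERATURE, HUMIDITY, WINDY, PLAY))
survivors = [rule for rule in concept_space if fits(rule, examples)]
print(len(concept_space), len(survivors))
```

Many rules survive on four examples, including ones that cover no example at all. This is the situation the slides describe as more than one description surviving.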
Enumerating the concept space
Search space for weather problem
  4 x 4 x 3 x 3 x 2 = 288 possible combinations
  With 14 rules => 2.7 x 10^34 possible rule sets
Other practical problems:
  More than one description may survive
  No description may survive
    Language is unable to describe target concept
    or data contains noise
Another view of generalization as search: hill-climbing in description space according to pre-specified matching criterion
  Most practical algorithms use heuristic search that cannot guarantee to find the optimum solution
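The two counts for the weather problem are easy to reproduce. The figure for rule sets matches 288 raised to the 14th power, which is how the sketch below computes it; the variable names are illustrative.

```python
# Number of distinct single rules: the product of the per-attribute choices.
rules = 4 * 4 * 3 * 3 * 2

# Number of ways to pick a rule for each of 14 slots.
rule_sets = rules ** 14

print(rules)
print(f"{rule_sets:.1e}")
```

The size of the second number is the reason exhaustive enumeration is impractical even for a problem this small.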
Bias
Important decisions in learning systems:
  Concept description language
  Order in which the space is searched
  Way that overfitting to the particular training data is avoided
These form the bias of the search:
  Language bias
  Search bias
  Overfitting-avoidance bias
Language bias
Important question: is language universal or does it restrict what can be learned?
Universal language can express arbitrary subsets of examples
If language includes logical or (disjunction), it is universal
Example: rule sets
Domain knowledge can be used to exclude some concept descriptions a priori from the search
Search bias
Search heuristic
  Greedy search: performing the best single step
  Beam search: keeping several alternatives
Direction of search
  General-to-specific
    E.g. specializing a rule by adding conditions
  Specific-to-general
    E.g. generalizing an individual instance into a rule
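Greedy, general-to-specific search can be illustrated by growing one rule: start with no conditions, and at each step add the single condition that most improves accuracy on the target class. The sketch below is a simplified illustration under those assumptions, not the algorithm of any particular learning system. It uses the four weather rows shown earlier in these slides, breaks ties by the number of correctly covered rows, and all names are invented here.

```python
weather = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def covered(conditions, data):
    """Rows that satisfy every condition of the rule."""
    return [row for row in data if all(row[a] == v for a, v in conditions.items())]


def score(conditions, target, data):
    """(accuracy, hits) of the rule 'if conditions then play = target'."""
    rows = covered(conditions, data)
    hits = sum(row["play"] == target for row in rows)
    return (hits / len(rows) if rows else 0.0), hits


def greedy_rule(target, data):
    """Specialize the empty rule one condition at a time, always taking the best single step."""
    conditions = {}
    while score(conditions, target, data)[0] < 1.0:
        candidates = [
            {**conditions, attr: row[attr]}
            for attr in ATTRIBUTES if attr not in conditions
            for row in covered(conditions, data)
        ]
        if not candidates:  # every attribute is already tested
            break
        conditions = max(candidates, key=lambda c: score(c, target, data))
    return conditions


rule = greedy_rule("no", weather)
print(rule)
```

On these four rows a single step is enough: the search settles on outlook = sunny, which covers both "no" rows and nothing else. A beam search would instead keep several of the best candidates at each step rather than only one.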
Overfitting-avoidance bias
Can be seen as a form of search bias
Modified evaluation criterion
  E.g. balancing simplicity and number of errors
Modified search strategy
  E.g. pruning (simplifying a description)
    Pre-pruning: stops at a simple description before search proceeds to an overly complex one
    Post-pruning: generates a complex description first and simplifies it afterwards
Data mining and ethics I
Ethical issues arise in practical applications
Anonymizing data is difficult
  85% of Americans can be identified from just zip code, birth date and sex
Data mining often used to discriminate
Ethical situation depends on application
  E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
  E.g. same information ok in medical application
Attributes may contain problematic information
  E.g. area code may correlate with race
Data mining and ethics II
Important questions:
  Who is permitted access to the data?
  For what purpose was the data collected?
  What kind of conclusions can be legitimately drawn from it?
Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?