Professional documents
Cultural documents
Today
The Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
partitioned evenly across 20 different newsgroups; we are only using a subset (6 newsgroups)
(Screenshot of an example post from a computers newsgroup; the visible headers show the subject and the author.)
In article <C5vzHF.D5K@cbnews.cb.att.com>, lvc@cbnews.cb.att.com (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.
A reply:

Of course. When the catbox begines to smell, simply transfer its contents into the potted plant in the foyer. "Why Hillary! Your government smells so... FRESH!"

-cdt@rocket.sw.stratus.com OR cdt@vos.stratus.com
--If you believe that I speak for my company, write today for my special Investors' Packet...
- Used for education, research and applications
- Complements the book "Data Mining" by Witten & Frank
Main Features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 15 attribute/subset evaluators + 10 search algorithms for feature selection
- 3 algorithms for finding association rules
- 3 graphical user interfaces
- The Explorer (exploratory data analysis)
- The Experimenter (experimental environment)
- The KnowledgeFlow (new process-model-inspired interface)
Extend/modify WEKA
- BioWeka: extension library for knowledge discovery in biology
- WekaMetal: meta-learning extension to WEKA
- Weka-Parallel: parallel processing for WEKA
- Grid Weka: grid computing using WEKA
- Weka-CG: computational genetics tool library
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from what we use:
Attribute: feature
Relation: collection of examples (a dataset)
Instance: a single example
Class: category
Learning algorithms:
Naive Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.
Meta-classifiers:
- cannot be used alone; always combined with a base learning algorithm
- examples: boosting, bagging, etc.
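As a sketch of how a meta-classifier such as bagging wraps a base learner (function names here are illustrative, not WEKA's API):

```python
import random

def bag_train(train_fn, data, n_models=10, seed=0):
    """Bagging: train n_models copies of a base learner, each on a
    bootstrap resample (sampling with replacement) of the training data.
    train_fn(sample) must return a callable model."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        models.append(train_fn(sample))
    return models

def bag_predict(models, x):
    """Combine the base models' predictions by majority vote."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)
```

The base learner is deliberately left abstract: any classifier that can be retrained on a resampled dataset fits this scheme.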
Choosing a classifier
- displays synopsis and options
- outputs additional information
- numeric-to-nominal conversion by discretization
(Classifier-output screenshot; callouts mark the accuracy figure and a different/easy class.)
Confusion matrix
Contains information about the actual and the predicted classification. All measures can be derived from it:

               predicted
                -     +
    actual -    a     b
    actual +    c     d

- accuracy: (a+d)/(a+b+c+d)
- recall: d/(c+d) => R
- precision: d/(b+d) => P
- F-measure: 2PR/(P+R)
- false positive (FP) rate: b/(a+b)
- true negative (TN) rate: a/(a+b)
- false negative (FN) rate: c/(c+d)
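The measures above translate directly into code; a small sketch (the function name and dictionary keys are ours):

```python
def binary_metrics(a, b, c, d):
    """Derive the standard measures from a 2x2 confusion matrix with
    rows = actual, columns = predicted:
        a = true negatives,  b = false positives,
        c = false negatives, d = true positives."""
    accuracy  = (a + d) / (a + b + c + d)
    recall    = d / (c + d)          # R
    precision = d / (b + d)          # P
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "F": f_measure,
        "FP rate": b / (a + b),
        "TN rate": a / (a + b),
        "FN rate": c / (c + d),
    }
```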
Predictions Output
Probability distribution for a wrongly classified example: the classifier predicted class 1 instead of 3. Naive Bayes makes incorrect conditional independence assumptions and is typically over-confident in its predictions, regardless of whether they are correct.
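The over-confidence is easy to reproduce: under the (false) independence assumption the per-feature likelihood ratios multiply, so many weakly informative features push the posterior to an extreme. A toy illustration, not WEKA code:

```python
def nb_posterior(likelihood_ratios):
    """Posterior probability of class 1 (vs class 2, equal priors) under
    naive Bayes, given per-feature likelihood ratios p(f|c1)/p(f|c2).
    The ratios multiply across features, so modest evidence compounds."""
    odds = 1.0
    for r in likelihood_ratios:
        odds *= r
    return odds / (1.0 + odds)
```

Ten features each favouring class 1 by only 2:1 already yield a posterior above 0.999; correlated features are effectively counted multiple times.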
Error Visualization
evaluation method:
information gain, chi-squared, etc.
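Of these evaluation methods, information gain is the simplest to spell out: it is the entropy of the class minus the expected class entropy after splitting on the feature. A minimal sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a nominal feature with respect to the class:
    H(class) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

A perfectly predictive binary feature scores 1 bit; an irrelevant one scores 0.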
misc.forsale rec.sport.hockey
comp.graphics
misc.forsale rec.sport.hockey
comp.graphics
???
2-Way Interactions
(Diagram: circles for the importance of feature A and feature B with respect to class C; their overlap is the feature correlation.)
(Diagram: overlapping importances of features A, B and C with respect to the class.)
3-Way Interaction: what is common to A, B and C together, and cannot be inferred from pairs of features.
Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's
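This 3-way interaction can be quantified as interaction information, computable directly from joint entropies; a sketch using the sign convention where synergy (e.g. XOR) is positive:

```python
from collections import Counter
from math import log2

def H(xs):
    """Shannon entropy (bits) of a sequence of hashable outcomes."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def interaction_information(a, b, c):
    """3-way interaction I(A;B;C) = I(A;B|C) - I(A;B): information that
    A, B and C share jointly but that no pair carries on its own."""
    return (H(list(zip(a, b))) + H(list(zip(a, c))) + H(list(zip(b, c)))
            - H(a) - H(b) - H(c) - H(list(zip(a, b, c))))
```

For XOR-style data (C = A xor B) the pairwise terms carry nothing, yet the three variables together carry a full bit, so the interaction information is +1.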
Search
- Exhaustive/complete (enumeration, branch & bound)
- Heuristic (sequential forward/backward selection)
- Stochastic (generate and evaluate)
- Generation/evaluation of individual features or of subsets
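The heuristic sequential-forward strategy is simple enough to sketch in a few lines (the function names and the merit function are illustrative):

```python
def forward_selection(features, evaluate, k):
    """Sequential forward search: greedily add the feature that most
    improves evaluate(subset) until k features are selected.
    evaluate() is any subset-merit function, e.g. cross-validated
    accuracy of a classifier trained on that subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is the mirror image: start from the full set and greedily drop the feature whose removal hurts the merit least.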
misc.forsale
rec.sport.hockey
comp.graphics
All we can do from this tab is save the buffer to a text file. Not very useful... But we can also perform feature selection during the preprocessing step (see the following slides).
(Screenshot callouts: higher accuracy with the selected 21 attributes.)
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo
Performing Experiments
The Experimenter makes it easy to compare the performance of different learning schemes. Problem types:
- classification
- regression
It can also iterate over different parameter settings. Significance testing is built in!
Slide adapted from Eibe Frank's
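The built-in significance testing compares matched per-run scores of two schemes; the core statistic is a paired t-test (the Experimenter actually applies a corrected variant of it). A plain-Python sketch of the uncorrected statistic:

```python
from math import sqrt

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over matched per-run scores of two schemes.
    A large |t| relative to the t distribution with n-1 degrees of
    freedom indicates a significant difference between the schemes."""
    diffs = [x - y for x, y in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)
```

The correction used by WEKA adjusts the variance for the overlap between cross-validation folds, which the naive statistic ignores.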
Experiments Setup
(A sequence of screenshots walks through the setup: choosing the datasets, configuring the algorithms and runs, and reading off the accuracy results.)
Experiments: Excel
Results are output to a CSV file, which can be read in Excel!
Numerical attributes:
  age numeric
  cholesterol numeric
Nominal attributes:
  sex { female, male }
  chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
  exercise_induced_angina { no, yes }
  class { present, not_present }
Other attribute types: String, Date
Missing values (written as "?")
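Put together, declarations like the ones above form a complete ARFF file; the values in the @data rows below are made up for illustration, with "?" marking a missing value:

```
@relation heart_disease

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63, male, typ_angina, 233, no, present
41, female, atyp_angina, ?, yes, not_present
```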
We have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
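Sparse rows like these are easy to generate; a small helper (the function name is ours, not a WEKA API):

```python
def sparse_arff_row(values):
    """Render one instance in sparse ARFF form: only attributes with
    non-default values appear, as `index value` pairs sorted by index.
    values maps attribute index -> value (already ARFF-quoted if needed)."""
    inner = ", ".join(f"{i} {v}" for i, v in sorted(values.items()))
    return "{" + inner + "}"
```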
- Features are weighted by their frequency within the document
- Produces a sparse ARFF file to be used by WEKA
IDF: log(N/n_i)
- n_i: number of documents containing term i
- N: total number of documents
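Combining this IDF with the within-document term frequencies gives the usual tf-idf weighting; a minimal sketch (natural log here, since the base only rescales the weights):

```python
from math import log

def tf_idf(docs):
    """docs: list of {term: raw count} dicts, one per document.
    Weights each term by tf * idf with idf = log(N / n_i), where n_i is
    the number of documents containing term i."""
    N = len(docs)
    df = {}
    for doc in docs:            # document frequency of each term
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * log(N / df[t]) for t, tf in doc.items()}
            for doc in docs]
```

A term occurring in every document gets weight 0, however frequent it is locally, which is exactly the behaviour wanted for stopword-like terms.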
ARFF file
Assignment
Due November 13. Work individually on this one.
The objective is to use the training set to get the best features and learning model you can. FREEZE this, then run one time only on the test set. This is a realistic way to see how well your algorithm does on unseen data.
Next Time
Machine learning algorithms