
SIMS 290-2: Applied Natural Language Processing

Marti Hearst October 30, 2006


Some slides by Preslav Nakov and Eibe Frank

Today
- The 20 Newsgroups Text Collection
- WEKA: Explorer
- WEKA: Experimenter
- Python Interface to WEKA

20 Newsgroups Data Set

http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
- approximately 20,000 newsgroup documents
  - 19,997 originally; 18,828 without duplicates
- partitioned evenly across 20 different newsgroups
- we are only using a subset (6 newsgroups)

Some categories are strongly related (and thus hard to discriminate):
- computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
- recreation: rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
- politics: talk.politics.misc, talk.politics.guns, talk.politics.mideast
- science: sci.crypt, sci.electronics, sci.med, sci.space
- religion: talk.religion.misc, alt.atheism, soc.religion.christian
- misc: misc.forsale

Sample Posting: talk.politics.guns

[from]    From: cdt@sw.stratus.com (C. D. Tavares)
[subject] Subject: Re: Congress to review ATF's status

[quoted article]
In article <C5vzHF.D5K@cbnews.cb.att.com>, lvc@cbnews.cb.att.com (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

[reply]
Of course. When the catbox begins to smell, simply transfer its contents
into the potted plant in the foyer. "Why Hillary! Your government smells
so... FRESH!"

[signature]
-cdt@rocket.sw.stratus.com --If you believe that I speak for my company, OR
cdt@vos.stratus.com write today for my special Investors' Packet...

Headers, quoted text, and signatures need special handling during feature extraction.

The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo

WEKA: The Bird

Copyright: Martin Kramer (mkramer@wxs.nl), University of Waikato, New Zealand

Slide adapted from Eibe Frank's

WEKA: The Software

- Waikato Environment for Knowledge Analysis
- Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
  - released under the GPL
- Support for the whole process of experimental data mining:
  - preparation of input data
  - statistical evaluation of learning schemes
  - visualization of the input data and of the result of learning
- Used for education, research and applications
- Complements the book Data Mining by Witten & Frank

Slide by Eibe Frank

Main Features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 15 attribute/subset evaluators + 10 search algorithms for feature selection
- 3 algorithms for finding association rules
- 3 graphical user interfaces:
  - the Explorer (exploratory data analysis)
  - the Experimenter (experimental environment)
  - the KnowledgeFlow (new process-model-inspired interface)

Slide by Eibe Frank

Projects Based on WEKA

Incorporate/wrap WEKA:
- GRB Tool Shed: a tool to aid gamma-ray burst research
- YALE: facility for large-scale ML experiments
- GATE: NLP workbench with a WEKA interface
- Judge: document clustering and classification

Extend/modify WEKA:
- BioWeka: extension library for knowledge discovery in biology
- WekaMetal: meta-learning extension to WEKA
- Weka-Parallel: parallel processing for WEKA
- Grid Weka: grid computing using WEKA
- Weka-CG: computational genetics tool library

Slide by Eibe Frank

The WEKA Project Today (2006)

- Funding for the next two years
- Goal of the project remains the same
- People:
  - 6 staff
  - 2 postdocs
  - 3 PhD students
  - 3 MSc students
  - 2 research programmers

Slide by Eibe Frank


WEKA: The Software Toolkit

http://www.cs.waikato.ac.nz/ml/weka
http://sourceforge.net/projects/weka/

- Machine learning/data mining software in Java, released under the GNU GPL
- Used for research, education and applications
- Complements the book Data Mining by Witten & Frank
- Main features:
  - data pre-processing tools
  - learning algorithms
  - evaluation methods
  - graphical interface (incl. data visualization)
  - environment for comparing learning algorithms
Slide adapted from Eibe Frank's

WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from the ones we use:
- Attribute: feature
- Relation: collection of examples (a data set)
- Instance: a single example
- Class: category

WEKA GUI Chooser

Start WEKA with:

    java -Xmx1000M -jar weka.jar

(the -Xmx1000M flag raises the Java heap limit to 1000 MB, useful for larger data sets)

Slide adapted from Eibe Frank's


Our Toy Example

We demonstrate WEKA on a simple example:
- 3 categories from the Newsgroups collection: misc.forsale, rec.sport.hockey, comp.graphics
- 20 documents per category
- features:
  - words, converted to lowercase
  - frequency of 2 or more required
  - stopwords removed
Slide adapted from Eibe Frank's

Explorer: Pre-Processing the Data

WEKA can import data from:
- files: ARFF, CSV, C4.5, binary
- a URL
- an SQL database (using JDBC)

Pre-processing tools (filters) are used for discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

Slide adapted from Eibe Frank's


The Preprocessing Tab

[Screenshot, slide 16: the Explorer's Preprocess tab. Callouts: the Preprocessing and Classification tabs; filter selection; statistical attribute selection; manual attribute selection; the list of attributes (the last one is the class variable); statistics about the values of the selected attribute; the frequency and categories of the selected attribute.]

Explorer: Building Classifiers

Classifiers in WEKA are models for:
- classification (predict a nominal class)
- regression (predict a numeric quantity)

Learning algorithms: Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptrons, logistic regression, etc.

Meta-classifiers:
- cannot be used alone; always combined with a learning algorithm
- examples: boosting, bagging, etc.

Slide adapted from Eibe Frank's

The Classification Tab

[Screenshot, slide 18. Callouts:]
- Choice of classifier
- Cross-validation: split the data into e.g. 10 folds; 10 times, train on 9 folds and test on the remaining one (sketched below)
- The class attribute: the attribute whose value is to be predicted from the values of the remaining ones. By default it is the last attribute; here (in our toy example) it is named "class".
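To make the fold arithmetic concrete, here is a minimal Python sketch of how k-fold splitting works (illustrative only; WEKA handles this internally and additionally stratifies the folds by class):

    def cross_validation_folds(n_examples, k=10):
        """Yield (train_indices, test_indices) pairs for k-fold CV."""
        indices = list(range(n_examples))
        for i in range(k):
            test = indices[i::k]                         # every k-th example, offset i
            train = [j for j in indices if j % k != i]   # the other k-1 folds
            yield train, test

    # 10 folds over 60 examples: each fold tests on 6 and trains on 54
    for train, test in cross_validation_folds(60):
        assert len(test) == 6 and len(train) == 54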

Choosing a Classifier

[Screenshots, slides 19-23: selecting a classifier and editing its options. Callouts: clicking the classifier name displays a synopsis and its options; one option outputs additional information; another performs numeric-to-nominal conversion by discretization; the kernel-estimator option: False uses a single Gaussian, True uses kernel estimators (better).]

[Screenshot, slide 24: the classifier output. Callouts: accuracy; a different/easy class; all the other numbers can be obtained from the confusion matrix (next slide).]

Confusion Matrix
Contains information about the actual and the predicted classification. All measures can be derived from it:

                  predicted
                   -     +
      true    -    a     b
              +    c     d

- accuracy: (a+d)/(a+b+c+d)
- recall: d/(c+d) => R
- precision: d/(b+d) => P
- F-measure: 2PR/(P+R)
- false positive (FP) rate: b/(a+b)
- true negative (TN) rate: a/(a+b)
- false negative (FN) rate: c/(c+d)

These extend to more than 2 classes: see the previous lecture slides for details.
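To sanity-check these formulas, here is a minimal Python sketch (not part of WEKA; the names a, b, c, d follow the matrix above):

    def confusion_measures(a, b, c, d):
        """Derive the standard measures from a 2x2 confusion matrix:
        a = true -, predicted -    b = true -, predicted +
        c = true +, predicted -    d = true +, predicted +"""
        recall = d / (c + d)                    # R
        precision = d / (b + d)                 # P
        return {
            "accuracy": (a + d) / (a + b + c + d),
            "recall": recall,
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
            "fp_rate": b / (a + b),
        }

    # e.g. 50 true negatives, 10 false positives, 5 false negatives, 35 true positives
    print(confusion_measures(50, 10, 5, 35))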

Predictions Output

[Screenshot, slide 26: the option that outputs the probability distribution for each example.]

Predictions Output

[Screenshot, slide 27: the probability distribution for a misclassified example: predicted class 1 instead of 3.] Naïve Bayes makes incorrect conditional-independence assumptions and is typically over-confident in its predictions, whether they are correct or not.

Error Visualization

[Screenshots, slides 28-29: the error-visualization plot. Little squares designate errors; the axes show the example number.]

Running on a Test Set

[Screenshot, slide 30: evaluating the trained classifier on a supplied test set.]

Explorer: Attribute Selection

Find which attributes are the most predictive ones. Two parts:
- search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
- evaluation method: information gain, chi-squared, etc.

Very flexible: WEKA allows (almost) arbitrary combinations of these two.

Slide adapted from Eibe Frank's
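To make the most common evaluation method concrete, here is a minimal Python sketch of information gain for a single binary feature (just the formula, not WEKA's implementation):

    import math

    def entropy(counts):
        """Entropy (in bits) of a class-count distribution."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(with_feature, without_feature):
        """Class counts among documents that contain the feature
        vs. class counts among documents that do not."""
        n1, n0 = sum(with_feature), sum(without_feature)
        n = n1 + n0
        prior = entropy([w + o for w, o in zip(with_feature, without_feature)])
        conditional = (n1 / n) * entropy(with_feature) + (n0 / n) * entropy(without_feature)
        return prior - conditional

    # e.g. a word in 18 of 20 hockey documents but only 1 of 40 others
    print(information_gain([18, 1], [2, 39]))   # about 0.63 bits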

Individual Features Ranking

[Screenshots, slides 32-34: features ranked by the evaluator. Callouts mark words indicative of misc.forsale, rec.sport.hockey, and comp.graphics, the random number seed, and one feature ("???") whose class association is not obvious.]

Feature Interactions (Advanced)

[Venn diagram, slide 35: 2-way interactions. The circles are feature A, feature B, and the category C; regions show the importance of feature A, the importance of feature B, and the feature correlation.]

Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Interactions (Advanced)

[Venn diagram, slide 36: a 3-way interaction is what is common to A, B and C together, and cannot be inferred from pairs of features.]

Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Subsets Selection

Problem illustration: the space of all subsets between the empty set and the full set; plain enumeration quickly becomes infeasible.

Search:
- exhaustive/complete (enumeration, branch & bound)
- heuristic (sequential forward/backward selection; sketched below)
- stochastic (generate and evaluate)
- individual features or subsets generation/evaluation

Slide adapted from Guozhu Dong's
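As an illustration of the heuristic strategy, here is a minimal Python sketch of greedy sequential forward selection (evaluate is a stand-in for any subset-scoring function, e.g. cross-validated accuracy):

    def forward_selection(features, evaluate):
        """Greedily add the feature that most improves the subset score."""
        selected, best_score = [], float("-inf")
        improved = True
        while improved:
            improved = False
            for f in features:
                if f in selected:
                    continue
                score = evaluate(selected + [f])
                if score > best_score:
                    best_score, best_feature = score, f
                    improved = True
            if improved:
                selected.append(best_feature)
        return selected

    # Toy scorer that rewards {"puck", "sale"} and penalizes subset size
    target = {"puck", "sale"}
    score = lambda subset: len(target & set(subset)) - 0.1 * len(subset)
    print(forward_selection(["the", "puck", "sale", "pixel"], score))   # ['puck', 'sale']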


Feature Subsets Selection

[Screenshots, slides 38-39: subset selection over misc.forsale, rec.sport.hockey and comp.graphics; 17,309 subsets considered, 21 attributes selected.]

Saving the Selected Features

All we can do from this tab is save the output buffer to a text file, which is not very useful. But we can also perform feature selection during the pre-processing step (see the following slides).

Feature Selection During Pre-Processing

[Screenshots, slides 41-44: applying attribute selection as a pre-processing filter. Before: 679 attributes, i.e. 678 + 1 for the class. After: just 22 attributes remain, i.e. 21 + 1 for the class.]

Run Naïve Bayes With the 21 Features

[Screenshot, slide 45: results with the 21 selected attributes. Callout: higher accuracy.]

(AGAIN) Naïve Bayes With All Features

[Repeated screenshot, slide 46: results with all 679 attributes, for comparison. Callouts: accuracy; different/easy class.]

Some Important Algorithms

WEKA has weird naming for some algorithms. Here are some translations:
- Naïve Bayes: weka.classifiers.bayes.NaiveBayes
- Perceptron: weka.classifiers.functions.VotedPerceptron
- Decision tree: weka.classifiers.trees.J48
- Support vector machines: weka.classifiers.functions.SMO
- k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms; e.g. the classic Naïve Bayes seems to be missing. A good alternative is the Multinomial Naïve Bayes model.
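These classifiers can also be invoked from the command line, in the same style as the GUI launch command shown earlier. A minimal sketch (train.arff and test.arff are hypothetical files, e.g. produced by the Python interface described later; -t names the training file and -T the test file, and omitting -T makes WEKA report cross-validation results on the training data):

    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff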


The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo

Performing Experiments
The Experimenter makes it easy to compare the performance of different learning schemes.
- Problems: classification, regression
- Results: written to a file or a database
- Evaluation options: cross-validation, learning curve, hold-out
- Can also iterate over different parameter settings
- Significance testing built in!
Slide adapted from Eibe Frank's

Experiments Setup

[Screenshots, slides 50-57: setting up and running an experiment. Callouts: select the datasets; send results to a CSV file, which can be opened in Excel; select the algorithms. In the results analysis: accuracy per scheme; SVM is the best and the decision tree is the worst; SVM is statistically better than Naïve Bayes, and the decision tree is statistically worse than Naïve Bayes.]

Experiments: Excel

Results are output into a CSV file, which can be read into Excel.

The Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA

WEKA File Format: ARFF

    @relation heart-disease-simplified

    @attribute age numeric                    % numeric attribute
    @attribute sex { female, male }           % nominal attribute
    @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
    @attribute cholesterol numeric
    @attribute exercise_induced_angina { no, yes }
    @attribute class { present, not_present } % other attribute types: String, Date

    @data
    63,male,typ_angina,233,no,not_present
    67,male,asympt,286,yes,present
    67,male,asympt,229,yes,present
    38,female,non_anginal,?,no,not_present    % "?" marks a missing value
    ...

Slide adapted from Eibe Frank's

WEKA File Format: Sparse ARFF

- Value 0 is not represented explicitly
- Same header (i.e. the @relation and @attribute tags); only the @data section differs
- Instead of

      @data
      0, X, 0, Y, "class A"
      0, 0, W, 0, "class B"

  we have

      @data
      {1 X, 3 Y, 4 "class A"}
      {2 W, 4 "class B"}

  (each entry is a 0-based attribute index followed by its value)

- This is especially useful for textual data (why?)
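To make the format concrete, here is a minimal Python sketch of emitting one document as a sparse @data row (a hypothetical helper, not the course's actual interface code):

    def sparse_arff_row(doc_counts, vocab, label):
        """Render one document as a sparse ARFF @data row.
        doc_counts: dict mapping word -> frequency in this document
        vocab: list of feature words; feature i is attribute i, and the
        class label is the last attribute (index len(vocab))."""
        entries = [(i, doc_counts[w]) for i, w in enumerate(vocab) if w in doc_counts]
        entries.append((len(vocab), '"%s"' % label))
        return "{" + ", ".join("%d %s" % (i, v) for i, v in entries) + "}"

    vocab = ["puck", "sale", "polygon"]
    print(sparse_arff_row({"puck": 2}, vocab, "rec.sport.hockey"))
    # -> {0 2, 3 "rec.sport.hockey"}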


Python Interface to WEKA

This is just to get you started.
- Assumes the newsgroups collection
- Extracts simple features
  - currently just single-word features
- Uses a simple tokenizer which removes punctuation, uses a stoplist, and lowercases the words (a sketch of such a tokenizer follows)
- Includes filtering code
  - currently eliminates numbers
- Features are weighted by frequency within the document
- Produces a sparse ARFF file to be used by WEKA
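A minimal Python sketch of the kind of tokenizer described above (illustrative only; the actual course code differs, and a real stoplist is much longer):

    import re

    STOPLIST = {"the", "a", "of", "and", "to", "in"}   # short illustrative stoplist

    def tokenize(text):
        """Lowercase, strip punctuation, drop stopwords and numbers."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return [w for w in words if w not in STOPLIST and not w.isdigit()]

    print(tokenize("For Sale: 2 hockey tickets, $30 each!"))
    # -> ['for', 'sale', 'hockey', 'tickets', 'each']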

Python Interface to WEKA

Allows you to specify:
- which directory to read files from
- which newsgroups to use
- the number of documents for training from each newsgroup
- the number of features to retain

Python Interface to WEKA

Things to (optionally) add or change:
- an option to not use stopwords
- an option to retain capitalization
- a regular expression pattern a feature should match
- other non-word-based features
- morphological normalization
- a minimum threshold for the number of times a term must occur before it is counted as a feature
- tf.idf weighting on terms
- your idea goes here

Python Interface to WEKA

tf.idf weight of term i in document j: t_ij × log(N / n_i)

- TF: t_ij, the frequency of term i in document j
  - this is how features are currently weighted
- IDF: log(N / n_i)
  - n_i: the number of documents containing term i
  - N: the total number of documents
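As a concrete version of the formula, here is a minimal Python sketch (hypothetical helper names; the course code currently uses the raw t_ij only):

    import math

    def tfidf_weights(docs):
        """docs: list of dicts mapping term -> raw frequency t_ij.
        Returns a parallel list of dicts with tf.idf weights."""
        N = len(docs)
        doc_freq = {}                  # n_i: number of documents containing term i
        for doc in docs:
            for term in doc:
                doc_freq[term] = doc_freq.get(term, 0) + 1
        return [{term: tf * math.log(N / doc_freq[term]) for term, tf in doc.items()}
                for doc in docs]

    # A term occurring in every document gets weight log(N/N) = 0
    print(tfidf_weights([{"the": 5, "puck": 2}, {"the": 3, "sale": 1}]))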


Python Weka Code

[Screenshots, slides 66-73: a walkthrough of the Python feature-extraction code.]

ARFF File

[Screenshots, slides 74-75: the generated sparse ARFF file.]

Assignment
Due November 13. Work individually on this one. The objective is to use the training set to get the best features and learning model you can. FREEZE these; then run one time only on the test set. This is a realistic way to see how well your algorithm does on unseen data.

Next Time
Machine learning algorithms

