
SIMS 290-2: Applied Natural Language Processing

Marti Hearst October 30, 2006


Some slides by Preslav Nakov and Eibe Frank

Today
- The 20 Newsgroups Text Collection
- WEKA: Explorer
- WEKA: Experimenter
- Python Interface to WEKA

20 Newsgroups Data Set

http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Source: originally collected by Ken Lang
Content and structure:
- approximately 20,000 newsgroup documents
  - 19,997 originally; 18,828 without duplicates
- partitioned evenly across 20 different newsgroups
- we are only using a subset (6 newsgroups)

Some categories are strongly related (and thus hard to discriminate):
- computers: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
- recreation: rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
- politics: talk.politics.misc, talk.politics.guns, talk.politics.mideast
- science: sci.crypt, sci.electronics, sci.med, sci.space
- religion: talk.religion.misc, alt.atheism, soc.religion.christian
- misc: misc.forsale

Sample Posting: talk.politics.guns

[from]    From: cdt@sw.stratus.com (C. D. Tavares)
[subject] Subject: Re: Congress to review ATF's status

[quoted article]
In article <C5vzHF.D5K@cbnews.cb.att.com>, lvc@cbnews.cb.att.com (Larry Cipriani) writes:
> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

[reply]
Of course. When the catbox begins to smell, simply transfer its contents
into the potted plant in the foyer. "Why Hillary! Your government smells
so... FRESH!"

[signature]
-cdt@rocket.sw.stratus.com --If you believe that I speak for my company, OR
cdt@vos.stratus.com write today for my special Investors' Packet...

Headers, quoted text, and signatures need special handling during feature extraction.

The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo

WEKA: The Bird

Copyright: Martin Kramer (mkramer@wxs.nl), University of Waikato, New Zealand

Slide adapted from Eibe Frank's

WEKA: The Software

- Waikato Environment for Knowledge Analysis
- Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java
  - released under the GPL
- Support for the whole process of experimental data mining:
  - preparation of input data
  - statistical evaluation of learning schemes
  - visualization of the input data and of the result of learning
- Used for education, research and applications
- Complements the book Data Mining by Witten & Frank

Slide by Eibe Frank

Main Features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 15 attribute/subset evaluators + 10 search algorithms for feature selection
- 3 algorithms for finding association rules
- 3 graphical user interfaces:
  - the Explorer (exploratory data analysis)
  - the Experimenter (experimental environment)
  - the KnowledgeFlow (new process-model-inspired interface)

Slide by Eibe Frank

Projects Based on WEKA

Incorporate/wrap WEKA:
- GRB Tool Shed: a tool to aid gamma-ray burst research
- YALE: facility for large-scale ML experiments
- GATE: NLP workbench with a WEKA interface
- Judge: document clustering and classification

Extend/modify WEKA:
- BioWeka: extension library for knowledge discovery in biology
- WekaMetal: meta-learning extension to WEKA
- Weka-Parallel: parallel processing for WEKA
- Grid Weka: grid computing using WEKA
- Weka-CG: computational genetics tool library

Slide by Eibe Frank

The WEKA Project Today (2006)

- Funding for the next two years
- Goal of the project remains the same
- People:
  - 6 staff
  - 2 postdocs
  - 3 PhD students
  - 3 MSc students
  - 2 research programmers

Slide by Eibe Frank


WEKA: The Software Toolkit

http://www.cs.waikato.ac.nz/ml/weka
http://sourceforge.net/projects/weka/

- Machine learning/data mining software in Java, released under the GNU GPL
- Used for research, education and applications
- Complements the book Data Mining by Witten & Frank
- Main features:
  - data pre-processing tools
  - learning algorithms
  - evaluation methods
  - graphical interface (incl. data visualization)
  - environment for comparing learning algorithms
Slide adapted from Eibe Frank's

WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA, which may differ from the ones we use:
- Attribute: feature
- Relation: collection of examples (a data set)
- Instance: a single example
- Class: category

WEKA GUI Chooser

Start WEKA with:

    java -Xmx1000M -jar weka.jar

(the -Xmx1000M flag raises the Java heap limit to 1000 MB, useful for larger data sets)

Slide adapted from Eibe Frank's


Our Toy Example

We demonstrate WEKA on a simple example:
- 3 categories from the Newsgroups collection: misc.forsale, rec.sport.hockey, comp.graphics
- 20 documents per category
- features:
  - words, converted to lowercase
  - frequency of 2 or more required
  - stopwords removed
Slide adapted from Eibe Frank's

Explorer: Pre-Processing the Data

WEKA can import data from:
- files: ARFF, CSV, C4.5, binary
- a URL
- an SQL database (using JDBC)

Pre-processing tools (filters) are used for discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

Slide adapted from Eibe Frank's


The Preprocessing Tab

[Screenshot, slide 16: the Explorer's Preprocess tab. Callouts: the Preprocessing and Classification tabs; filter selection; statistical attribute selection; manual attribute selection; the list of attributes (the last one is the class variable); statistics about the values of the selected attribute; the frequency and categories of the selected attribute.]

Explorer: Building Classifiers

Classifiers in WEKA are models for:
- classification (predict a nominal class)
- regression (predict a numeric quantity)

Learning algorithms: Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptrons, logistic regression, etc.

Meta-classifiers:
- cannot be used alone; always combined with a learning algorithm
- examples: boosting, bagging, etc.

Slide adapted from Eibe Frank's

The Classification Tab

[Screenshot, slide 18. Callouts:]
- Choice of classifier
- Cross-validation: split the data into e.g. 10 folds; 10 times, train on 9 folds and test on the remaining one (sketched below)
- The class attribute: the attribute whose value is to be predicted from the values of the remaining ones. By default it is the last attribute; here (in our toy example) it is named "class".
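To make the fold arithmetic concrete, here is a minimal Python sketch of how k-fold splitting works (illustrative only; WEKA handles this internally and additionally stratifies the folds by class):

    def cross_validation_folds(n_examples, k=10):
        """Yield (train_indices, test_indices) pairs for k-fold CV."""
        indices = list(range(n_examples))
        for i in range(k):
            test = indices[i::k]                         # every k-th example, offset i
            train = [j for j in indices if j % k != i]   # the other k-1 folds
            yield train, test

    # 10 folds over 60 examples: each fold tests on 6 and trains on 54
    for train, test in cross_validation_folds(60):
        assert len(test) == 6 and len(train) == 54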

Choosing a Classifier

[Screenshots, slides 19-23: selecting a classifier and editing its options. Callouts: clicking the classifier name displays a synopsis and its options; one option outputs additional information; another performs numeric-to-nominal conversion by discretization; the kernel-estimator option: False uses a single Gaussian, True uses kernel estimators (better).]

[Screenshot, slide 24: the classifier output. Callouts: accuracy; a different/easy class; all the other numbers can be obtained from the confusion matrix (next slide).]

Confusion Matrix
Contains information about the actual and the predicted classification. All measures can be derived from it:

                  predicted
                   -     +
      true    -    a     b
              +    c     d

- accuracy: (a+d)/(a+b+c+d)
- recall: d/(c+d) => R
- precision: d/(b+d) => P
- F-measure: 2PR/(P+R)
- false positive (FP) rate: b/(a+b)
- true negative (TN) rate: a/(a+b)
- false negative (FN) rate: c/(c+d)

These extend to more than 2 classes: see the previous lecture slides for details.
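To sanity-check these formulas, here is a minimal Python sketch (not part of WEKA; the names a, b, c, d follow the matrix above):

    def confusion_measures(a, b, c, d):
        """Derive the standard measures from a 2x2 confusion matrix:
        a = true -, predicted -    b = true -, predicted +
        c = true +, predicted -    d = true +, predicted +"""
        recall = d / (c + d)                    # R
        precision = d / (b + d)                 # P
        return {
            "accuracy": (a + d) / (a + b + c + d),
            "recall": recall,
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
            "fp_rate": b / (a + b),
        }

    # e.g. 50 true negatives, 10 false positives, 5 false negatives, 35 true positives
    print(confusion_measures(50, 10, 5, 35))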

Predictions Output

[Screenshot, slide 26: the option that outputs the probability distribution for each example.]

Predictions Output

[Screenshot, slide 27: the probability distribution for a misclassified example: predicted class 1 instead of 3.] Naïve Bayes makes incorrect conditional-independence assumptions and is typically over-confident in its predictions, whether they are correct or not.

Error Visualization

[Screenshots, slides 28-29: the error-visualization plot. Little squares designate errors; the axes show the example number.]

Running on a Test Set

[Screenshot, slide 30: evaluating the trained classifier on a supplied test set.]

Explorer: Attribute Selection

Find which attributes are the most predictive ones. Two parts:
- search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
- evaluation method: information gain, chi-squared, etc.

Very flexible: WEKA allows (almost) arbitrary combinations of these two.

Slide adapted from Eibe Frank's
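To make the most common evaluation method concrete, here is a minimal Python sketch of information gain for a single binary feature (just the formula, not WEKA's implementation):

    import math

    def entropy(counts):
        """Entropy (in bits) of a class-count distribution."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def information_gain(with_feature, without_feature):
        """Class counts among documents that contain the feature
        vs. class counts among documents that do not."""
        n1, n0 = sum(with_feature), sum(without_feature)
        n = n1 + n0
        prior = entropy([w + o for w, o in zip(with_feature, without_feature)])
        conditional = (n1 / n) * entropy(with_feature) + (n0 / n) * entropy(without_feature)
        return prior - conditional

    # e.g. a word in 18 of 20 hockey documents but only 1 of 40 others
    print(information_gain([18, 1], [2, 39]))   # about 0.63 bits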

Individual Features Ranking

[Screenshots, slides 32-34: features ranked by the evaluator. Callouts mark words indicative of misc.forsale, rec.sport.hockey, and comp.graphics, the random number seed, and one feature ("???") whose class association is not obvious.]

Feature Interactions (Advanced)

[Venn diagram, slide 35: 2-way interactions. The circles are feature A, feature B, and the category C; regions show the importance of feature A, the importance of feature B, and the feature correlation.]

Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Interactions (Advanced)

[Venn diagram, slide 36: a 3-way interaction is what is common to A, B and C together, and cannot be inferred from pairs of features.]

Slide adapted from Jakulin, Bratko, Smrke, Demšar and Zupan's

Feature Subsets Selection

Problem illustration: the space of all subsets between the empty set and the full set; plain enumeration quickly becomes infeasible.

Search:
- exhaustive/complete (enumeration, branch & bound)
- heuristic (sequential forward/backward selection; sketched below)
- stochastic (generate and evaluate)
- individual features or subsets generation/evaluation

Slide adapted from Guozhu Dong's
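As an illustration of the heuristic strategy, here is a minimal Python sketch of greedy sequential forward selection (evaluate is a stand-in for any subset-scoring function, e.g. cross-validated accuracy):

    def forward_selection(features, evaluate):
        """Greedily add the feature that most improves the subset score."""
        selected, best_score = [], float("-inf")
        improved = True
        while improved:
            improved = False
            for f in features:
                if f in selected:
                    continue
                score = evaluate(selected + [f])
                if score > best_score:
                    best_score, best_feature = score, f
                    improved = True
            if improved:
                selected.append(best_feature)
        return selected

    # Toy scorer that rewards {"puck", "sale"} and penalizes subset size
    target = {"puck", "sale"}
    score = lambda subset: len(target & set(subset)) - 0.1 * len(subset)
    print(forward_selection(["the", "puck", "sale", "pixel"], score))   # ['puck', 'sale']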


Feature Subsets Selection

[Screenshots, slides 38-39: subset selection over misc.forsale, rec.sport.hockey and comp.graphics; 17,309 subsets considered, 21 attributes selected.]

Saving the Selected Features

All we can do from this tab is save the output buffer to a text file, which is not very useful. But we can also perform feature selection during the pre-processing step (see the following slides).

Feature Selection During Pre-Processing

[Screenshots, slides 41-44: applying attribute selection as a pre-processing filter. Before: 679 attributes, i.e. 678 + 1 for the class. After: just 22 attributes remain, i.e. 21 + 1 for the class.]

Run Naïve Bayes With the 21 Features

[Screenshot, slide 45: results with the 21 selected attributes. Callout: higher accuracy.]

(AGAIN) Naïve Bayes With All Features

[Repeated screenshot, slide 46: results with all 679 attributes, for comparison. Callouts: accuracy; different/easy class.]

Some Important Algorithms

WEKA has weird naming for some algorithms. Here are some translations:
- Naïve Bayes: weka.classifiers.bayes.NaiveBayes
- Perceptron: weka.classifiers.functions.VotedPerceptron
- Decision tree: weka.classifiers.trees.J48
- Support vector machines: weka.classifiers.functions.SMO
- k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms; e.g. the classic Naïve Bayes seems to be missing. A good alternative is the Multinomial Naïve Bayes model.
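These classifiers can also be invoked from the command line, in the same style as the GUI launch command shown earlier. A minimal sketch (train.arff and test.arff are hypothetical files, e.g. produced by the Python interface described later; -t names the training file and -T the test file, and omitting -T makes WEKA report cross-validation results on the training data):

    java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff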


The 20 Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA
WEKA: Real-time Demo

Performing Experiments
The Experimenter makes it easy to compare the performance of different learning schemes.
- Problems: classification, regression
- Results: written to a file or a database
- Evaluation options: cross-validation, learning curve, hold-out
- Can also iterate over different parameter settings
- Significance testing built in!
Slide adapted from Eibe Frank's

Experiments Setup

[Screenshots, slides 50-57: setting up and running an experiment. Callouts: select the datasets; send results to a CSV file, which can be opened in Excel; select the algorithms. In the results analysis: accuracy per scheme; SVM is the best and the decision tree is the worst; SVM is statistically better than Naïve Bayes, and the decision tree is statistically worse than Naïve Bayes.]

Experiments: Excel

Results are output into a CSV file, which can be read into Excel.

The Newsgroups Text Collection
WEKA: Explorer
WEKA: Experimenter
Python Interface to WEKA

WEKA File Format: ARFF

    @relation heart-disease-simplified

    @attribute age numeric                    % numeric attribute
    @attribute sex { female, male }           % nominal attribute
    @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
    @attribute cholesterol numeric
    @attribute exercise_induced_angina { no, yes }
    @attribute class { present, not_present } % other attribute types: String, Date

    @data
    63,male,typ_angina,233,no,not_present
    67,male,asympt,286,yes,present
    67,male,asympt,229,yes,present
    38,female,non_anginal,?,no,not_present    % "?" marks a missing value
    ...

Slide adapted from Eibe Frank's

WEKA File Format: Sparse ARFF

- Value 0 is not represented explicitly
- Same header (i.e. the @relation and @attribute tags); only the @data section differs
- Instead of

      @data
      0, X, 0, Y, "class A"
      0, 0, W, 0, "class B"

  we have

      @data
      {1 X, 3 Y, 4 "class A"}
      {2 W, 4 "class B"}

  (each entry is a 0-based attribute index followed by its value)

- This is especially useful for textual data (why?)
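To make the format concrete, here is a minimal Python sketch of emitting one document as a sparse @data row (a hypothetical helper, not the course's actual interface code):

    def sparse_arff_row(doc_counts, vocab, label):
        """Render one document as a sparse ARFF @data row.
        doc_counts: dict mapping word -> frequency in this document
        vocab: list of feature words; feature i is attribute i, and the
        class label is the last attribute (index len(vocab))."""
        entries = [(i, doc_counts[w]) for i, w in enumerate(vocab) if w in doc_counts]
        entries.append((len(vocab), '"%s"' % label))
        return "{" + ", ".join("%d %s" % (i, v) for i, v in entries) + "}"

    vocab = ["puck", "sale", "polygon"]
    print(sparse_arff_row({"puck": 2}, vocab, "rec.sport.hockey"))
    # -> {0 2, 3 "rec.sport.hockey"}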


Python Interface to WEKA

This is just to get you started.
- Assumes the newsgroups collection
- Extracts simple features
  - currently just single-word features
- Uses a simple tokenizer which removes punctuation, uses a stoplist, and lowercases the words (a sketch of such a tokenizer follows)
- Includes filtering code
  - currently eliminates numbers
- Features are weighted by frequency within the document
- Produces a sparse ARFF file to be used by WEKA
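A minimal Python sketch of the kind of tokenizer described above (illustrative only; the actual course code differs, and a real stoplist is much longer):

    import re

    STOPLIST = {"the", "a", "of", "and", "to", "in"}   # short illustrative stoplist

    def tokenize(text):
        """Lowercase, strip punctuation, drop stopwords and numbers."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return [w for w in words if w not in STOPLIST and not w.isdigit()]

    print(tokenize("For Sale: 2 hockey tickets, $30 each!"))
    # -> ['for', 'sale', 'hockey', 'tickets', 'each']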

Python Interface to WEKA

Allows you to specify:
- which directory to read files from
- which newsgroups to use
- the number of documents for training from each newsgroup
- the number of features to retain

Python Interface to WEKA

Things to (optionally) add or change:
- an option to not use stopwords
- an option to retain capitalization
- a regular expression pattern a feature should match
- other non-word-based features
- morphological normalization
- a minimum threshold for the number of times a term must occur before it is counted as a feature
- tf.idf weighting on terms
- your idea goes here

Python Interface to WEKA

tf.idf weight of term i in document j: t_ij × log(N / n_i)

- TF: t_ij, the frequency of term i in document j
  - this is how features are currently weighted
- IDF: log(N / n_i)
  - n_i: the number of documents containing term i
  - N: the total number of documents
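As a concrete version of the formula, here is a minimal Python sketch (hypothetical helper names; the course code currently uses the raw t_ij only):

    import math

    def tfidf_weights(docs):
        """docs: list of dicts mapping term -> raw frequency t_ij.
        Returns a parallel list of dicts with tf.idf weights."""
        N = len(docs)
        doc_freq = {}                  # n_i: number of documents containing term i
        for doc in docs:
            for term in doc:
                doc_freq[term] = doc_freq.get(term, 0) + 1
        return [{term: tf * math.log(N / doc_freq[term]) for term, tf in doc.items()}
                for doc in docs]

    # A term occurring in every document gets weight log(N/N) = 0
    print(tfidf_weights([{"the": 5, "puck": 2}, {"the": 3, "sale": 1}]))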


Python Weka Code

[Screenshots, slides 66-73: a walkthrough of the Python feature-extraction code.]

ARFF File

[Screenshots, slides 74-75: the generated sparse ARFF file.]

Assignment
Due November 13. Work individually on this one. The objective is to use the training set to get the best features and learning model you can. FREEZE these; then run one time only on the test set. This is a realistic way to see how well your algorithm does on unseen data.

Next Time
Machine learning algorithms

