
MAT003 Project

Natural Language Processing and Statistics


A study of the Naive Bayes Classifier

Adra Marc

MAT003 Project - NLP and Statistics

Introduction
A large amount of data is available on the internet
Information extraction: data mining
Classification: a subdivision of NLP problems


What is a Classifier?
Text → Class

A function that outputs a class for each input text
Each text is represented as a vector of features
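As a rough illustration of "text represented as a vector of features", the sketch below maps a text to word counts over a fixed vocabulary. The tokenizer, the function names, and the example vocabulary are assumptions made for this example; they are not taken from the project.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def to_feature_vector(text, vocabulary):
    """Map a text to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and text, for illustration only.
vocabulary = ["cheap", "meeting", "offer", "report"]
vector = to_feature_vector("Cheap offer cheap deal", vocabulary)
print(vector)  # [2, 0, 1, 0]
```

A classifier is then any function that takes such a vector and returns a class label.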


Naive Bayes Classifier

Supervised learning
Uses prior information: class probabilities are computed from a pre-classified sample
Relies on independence between features, which is not verified in reality
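The role of the independence assumption can be written out as follows. The symbols are assumptions introduced here: c is a class and f_1, ..., f_n are the features of a text. Bayes' rule gives the first line, and assuming the features are independent given the class turns the joint likelihood into a product.

```latex
\begin{align*}
P(c \mid f_1, \dots, f_n) &\propto P(c)\, P(f_1, \dots, f_n \mid c) \\
&= P(c) \prod_{i=1}^{n} P(f_i \mid c) \quad \text{(independence assumption)} \\
\hat{c} &= \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
\end{align*}
```

Both P(c) and P(f_i | c) are estimated from the pre-classified sample.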

Two different types

Multinomial Bayes classifier: word frequency
Multivariate Bernoulli: word presence

The multinomial model performs better
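A minimal sketch of how the two event models differ in practice: the multinomial score uses how often each word occurs, while the Bernoulli score uses only whether each vocabulary word is present or absent. The toy corpus, the function names, and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, vocabulary):
    """Estimate log priors and per-class word statistics from labelled token lists."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)  # multinomial: total occurrences per class
    doc_counts = defaultdict(Counter)   # Bernoulli: documents containing the word
    for tokens, label in zip(docs, labels):
        word_counts[label].update(t for t in tokens if t in vocabulary)
        doc_counts[label].update(set(tokens) & set(vocabulary))
    log_prior = {c: math.log(n_docs[c] / len(docs)) for c in classes}
    return classes, n_docs, word_counts, doc_counts, log_prior

def multinomial_score(tokens, c, model, vocabulary):
    """Log score using word frequency: each occurrence contributes a factor."""
    _, _, word_counts, _, log_prior = model
    total = sum(word_counts[c].values())
    score = log_prior[c]
    for t in tokens:
        if t in vocabulary:
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((word_counts[c][t] + 1) / (total + len(vocabulary)))
    return score

def bernoulli_score(tokens, c, model, vocabulary):
    """Log score using word presence: absent words contribute as well."""
    _, n_docs, _, doc_counts, log_prior = model
    present = set(tokens)
    score = log_prior[c]
    for w in vocabulary:
        p = (doc_counts[c][w] + 1) / (n_docs[c] + 2)  # smoothed P(word present | c)
        score += math.log(p if w in present else 1 - p)
    return score

def classify(tokens, model, vocabulary, score_fn):
    """Return the class with the highest log score."""
    return max(model[0], key=lambda c: score_fn(tokens, c, model, vocabulary))

# Hypothetical toy corpus, for illustration only.
docs = [["cheap", "offer", "offer"], ["cheap", "cheap", "deal"],
        ["meeting", "report"], ["report", "meeting", "agenda"]]
labels = ["spam", "spam", "ham", "ham"]
vocabulary = ["agenda", "cheap", "deal", "meeting", "offer", "report"]
model = train(docs, labels, vocabulary)
print(classify(["cheap", "offer"], model, vocabulary, multinomial_score))  # spam
print(classify(["cheap", "offer"], model, vocabulary, bernoulli_score))    # spam
```

On a corpus this small both models agree; the difference only shows on longer texts where repeated words carry information.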


Performance: how to assess it?

Evaluated by precision and recall:
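For a given class, the standard definitions are stated below. TP, FP and FN are notation assumed here for the number of true positives, false positives and false negatives.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

Precision is the share of texts assigned to the class that really belong to it; recall is the share of texts of the class that were found.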

And by micro- and macro-averaged F1:
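F1 is the harmonic mean of precision and recall. A minimal sketch of the two averages, assuming per-class counts of true positives, false positives and false negatives are available: macro F1 averages the per-class scores, micro F1 pools the counts first. The category names and counts below are hypothetical.

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class):
    """Average the per-class F1 scores: every class weighs the same."""
    return sum(f1(*counts) for counts in per_class.values()) / len(per_class)

def micro_f1(per_class):
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(counts[0] for counts in per_class.values())
    fp = sum(counts[1] for counts in per_class.values())
    fn = sum(counts[2] for counts in per_class.values())
    return f1(tp, fp, fn)

# Hypothetical (TP, FP, FN) counts: one frequent class, one rare class.
per_class = {"earn": (90, 10, 10), "grain": (1, 1, 9)}
print(round(macro_f1(per_class), 3))  # 0.533
print(round(micro_f1(per_class), 3))  # 0.858
```

Macro F1 is pulled down by the rare class, while micro F1 is dominated by the frequent one, which is why both are usually reported.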


Performance
Study by Yiming Yang on the Reuters datasets
Good but not outstanding performance
Still surprising for a classifier whose assumptions do not hold


To summarize

Learns from a corpus of texts
Optimal if we can assume independence between features
Good performance, but it can be improved

How can we improve this performance?



Empirical Rules

Use the shape of the text to classify it
Requires substantial analysis of the texts to classify
Efficient for spam filtering


Poisson distribution and feature weighting

The frequency of occurrence of a word follows a Poisson distribution
Features are weighted with mutual information
10% improvement
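To make the weighting idea concrete, the sketch below computes the mutual information between the presence of a word and membership in a class from four document counts. It illustrates the quantity only and does not reproduce the weighting scheme or the Poisson model of the study; the function name and argument names are assumptions.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between word presence and class membership.

    n11: documents in the class that contain the word
    n10: documents outside the class that contain the word
    n01: documents in the class that do not contain the word
    n00: documents outside the class that do not contain the word
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, marginal count of the word value, marginal count of the class value)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, word_marginal, class_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += (joint / n) * math.log2(n * joint / (word_marginal * class_marginal))
    return mi

print(mutual_information(25, 25, 25, 25))  # 0.0, the word says nothing about the class
print(mutual_information(50, 0, 0, 50))    # 1.0, the word determines the class
```

A word with high mutual information is a strong indicator of the class and would receive a larger weight.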


Bayesian Networks
[Figure: Bayesian network over the features F1, F2, F3 and F4]

Models relations between features
Reduced model
Increases the performance
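In general terms, a Bayesian network replaces the fully independent likelihood with one in which each feature may depend on a few other features. The notation is an assumption made here: C is the class and Pa(F_i) is the set of features that F_i depends on in the network.

```latex
\[
P(F_1, \dots, F_n \mid C) = \prod_{i=1}^{n} P\bigl(F_i \mid \mathrm{Pa}(F_i),\, C\bigr)
\]
```

Naive Bayes is the special case in which every Pa(F_i) is empty; keeping the parent sets small is what keeps the model reduced.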


Bayesian Online Perceptron


Find separating hyperplanes
Minimise the distance of misclassified items from the decision boundary
Better performance than SVM

[Figure: items plotted against Feature 1 and Feature 2]
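The idea of correcting a hyperplane on misclassified items can be sketched with a plain online perceptron. This is the classical algorithm, not the Bayesian variant named on the slide, and the data points, function names and parameters are assumptions made for this example.

```python
def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Plain online perceptron; labels must be -1 or +1.

    The hyperplane (weights, bias) is updated only when an item is
    misclassified, which moves the boundary towards that item.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # wrong side of (or on) the boundary
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def predict(x, weights, bias):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical, linearly separable points in the (Feature 1, Feature 2) plane.
samples = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(samples, labels)
print([predict(x, weights, bias) for x in samples])  # [1, 1, -1, -1]
```

On separable data like this the updates stop once every item is on the correct side of the boundary.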


Conclusion
Good baseline performance of the Naive Bayes Classifier
Simple to implement and to use
Wide range of possible improvements
Remains convenient for simple uses


Thank you

Any Questions?