
A True Positive is when both the predicted instance and the actual instance are positive.

SVM is a

supervised learning algorithm.

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In
other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane
which categorizes new examples.

Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?

lemmatization

Lemmatization takes a careful approach to removing inflections. It does not simply chop off inflections, but
instead relies on a lexical knowledge base like WordNet to obtain the correct base forms of words.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What command should be given to tokenize a sentence into words?

from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(sentence)


Which of the given hyperparameter(s), when increased, may cause random forest to overfit the data?

Depth of Tree

Usually, increasing the depth of the trees will cause overfitting. Learning rate is not a hyperparameter in
random forest. Increasing the number of trees does not cause overfitting; it generally makes the ensemble's predictions more stable.

Lemmatization offers better precision than stemming

In these examples, it outperforms the Porter stemmer. But lemmatization has limits. For example, Porter
stems both happiness and happy to happi, while WordNet lemmatizes the two words to themselves. In
general, lemmatization offers better precision than stemming, but at the expense of recall.
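
A minimal sketch comparing the two behaviors (this assumes NLTK is installed and the WordNet corpus has been downloaded via nltk.download('wordnet')):

#Comparing Porter stemming with WordNet lemmatization on a few words.
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['happiness', 'happy', 'studies']:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))

#Porter maps both 'happiness' and 'happy' to 'happi', while the lemmatizer
#keeps them as they are ('studies' is lemmatized to 'study').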

Can we consider sentiment classification as a text classification problem?

Yes

Which of the following is not a preprocessing method used for unstructured data classification?

confusion_matrix

In document classification, each document has to be converted from full text to a document vector.

Cross-validation causes over-fitting.

False. Cross-validation is a model validation technique; it helps detect over-fitting rather than cause it.

Which one of the following is not a classification technique?

StratifiedShuffleSplit

Here we have the types of classification algorithms in Machine Learning:


Linear Classifiers: Logistic Regression, Naive Bayes Classifier

Support Vector Machines

Decision Trees

Boosted Trees

Random Forest

Neural Networks

Nearest Neighbor

High classification accuracy always indicates a good classifier.

False. Simple accuracy does not reveal the types of errors the classifier makes.

Inverse Document Frequency is used in the term document matrix.

TF and IDF use matrix representations.

The fit(X, y) method is used to

Train the Classifier

Clustering is a supervised classification.

Clustering tries to group a set of objects and find whether there is some relationship between the objects. In the
context of machine learning, classification is supervised learning and clustering is unsupervised learning.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What does the command sentiment_analysis_data['label'].value_counts() return?

Return a Series containing counts of unique values.
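
As a toy illustration (the counts below are from a made-up Series, not the actual dataset):

import pandas as pd

labels = pd.Series(['ham', 'ham', 'spam', 'ham'])

print(labels.value_counts())
#ham     3
#spam    1
#dtype: int64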

Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document.
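
A minimal sketch of this counting with scikit-learn's CountVectorizer on two toy sentences (note that get_feature_names() was renamed get_feature_names_out() in newer scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['free entry win win', 'call me now']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names()) #the learned vocabulary
print(counts.toarray()) #per-document word counts; 'win' is counted twice in the first document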

In Supervised learning, class labels of the training samples are

Known

Choose the correct sequence from the following:

Data Analysis -> Preprocessing -> Model Building -> Predict


Why this course?
This course gives you a practical experience for solving unstructured text classification problems. If you're wondering why you need unstructured text:

"80% of business relevant information originates in the unstructured form, primarily text," says Seth Grimes, a leading analytics strategy consultant.

What Would You Need to Follow Along?


Have a basic understanding of machine learning concepts.

Try out the code snippets given for the case study.

Refer to the links to gain an in-depth understanding of other machine learning techniques.

Programming is usually taught by examples - Niklaus Wirth

Unstructured data, as the name suggests, does not have a structured format and may contain data such as dates, numbers or facts.

This results in irregularities and ambiguities which make it difficult to understand using traditional programs when compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Source: Wikipedia.
A few examples of unstructured data are:

Emails

Word Processing Files

PDF files

Spreadsheets

Digital Images

Video

Audio

Social Media Posts, etc.

Problem Description
Let us understand unstructured data classification through the following case study:

SMS Spam Detection:

In our day-to-day lives, we receive a large number of spam/junk messages either in the form of text (SMS) or e-mails. It is important to filter these spam messages since they are not truthful or trustworthy.

In this case study, we apply various machine learning algorithms to categorize the messages depending on whether they are spam or not.

Your Playground...
Note: In case you don't find any of the required packages in our Katacoda hands-on environment, you can do the following to install the needed ones on the same.

For example: If nltk is the required package, then type the following command on the terminal to download it in our hands-on environment.

pip install nltk --target=./.


For NLTK, you have a few other dependent packages. You can perform the following steps to download them:

Open the Python terminal in the command prompt (type python).

Type import nltk

Type nltk.download()

Type d for download

Type all to download all dependent packages of NLTK.

Setup Your Local Machine


To run the code locally:

Install Python 2.7+ on your machine.

Install the required packages - Pandas, Sklearn, Numpy (use pip install).

Use any IDE (PyCharm, Spyder etc.) for trying out the code snippets.

Note: You can find brief descriptions of the python packages here.

Dataset Download
The dataset is available at the SMS Spam dataset link.

Open the terminal and type the following command to download.

curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv

This command downloads the data and saves it as dataset.csv.


Dataset Description
The dataset contains a collection of SMS messages, each labeled as spam or ham.

The following is a description of our dataset:

No. of Classes: 2 (Spam / Ham)

No. of Attributes: 2

No. of Instances: 5574

Data Loading
To start with data loading, import the required python package and load the downloaded CSV file.

The data can be stored as a dataframe for easy data manipulation/analysis. Pandas is one of the most widely used libraries for this.

import pandas as pd

import csv

#Data Loading

messages = [line.rstrip() for line in open('dataset.csv')]

print(len(messages))

#Appending column headers

messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE,names=["label", "message"])

As you can see, our dataset has 2 columns without any headers.

This code snippet reads the data using pandas and labels the column names as label and message.

Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to derive useful information from the given data for making decisions.

In this section, we will analyze the dataset in terms of size, headers, data summary and sample data.

You can see the dataset size using:

data_size=messages.shape

print(data_size)

Column names can be viewed by:

messages_col_names=list(messages.columns)

print(messages_col_names)

To understand aggregate statistics easily, use the following command:

print(messages.groupby('label').describe())

To see sample data, use the following command:

print(messages.head(3))

Target Identification
Target is the class/category to which you will assign the data.

In this case, you aim to identify whether the message is spam or not.

By observing the columns, the label column has values Spam or Ham. We can call this case study a Binary Classification, since it has only two possible outcomes.

#Identifying the outcome/target variable.

message_target=messages['label']

print(message_target)

Tokenization is a method to split a sentence/string into substrings. These substrings are called tokens.
In Natural Language Processing (NLP), tokenization is the initial step in preprocessing. Splitting a sentence into tokens helps to remove unwanted information in the raw text such as white spaces, line breaks and so on.

import nltk

from nltk.tokenize import word_tokenize

def split_tokens(message):
    message = message.lower()
    message = unicode(message, 'utf8') #convert bytes into proper unicode
    word_tokens = word_tokenize(message)
    return word_tokens

messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']), axis=1)


Lemmatization
Lemmatization is a method to convert a word into its base/root form.

Lemmatizer removes affixes of the words present in its dictionary.

from nltk.stem.wordnet import WordNetLemmatizer

def split_into_lemmas(message):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in message:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma

messages['lemmatized_message'] = messages.apply(lambda row: split_into_lemmas(row['tokenized_message']), axis=1)

print('Tokenized message:',messages['tokenized_message'][11])

print('Lemmatized message:',messages['lemmatized_message'][11])


Stop Word Removal
Stop words are common words that do not add any relevance for classification (e.g. “the”, “a”, “an”, “in” etc.). Hence, it is essential to remove these words.

from nltk.corpus import stopwords

def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    filtered_sentence = ' '.join([word for word in message if word not in stop_words])
    return filtered_sentence

messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']), axis=1)

Training_data=pd.Series(list(messages['preprocessed_message']))

Training_label=pd.Series(list(messages['label']))

Why is Feature Extraction Important?


To perform machine learning on text documents, you first need to turn the text content into numerical feature vectors.

In Python, you have a few feature extraction packages available under sklearn.

We will be looking into a few specific ones used for unstructured data.


Bag Of Words (BOW)
Bag of Words (BOW) is one of the most widely used methods for generating features in Natural Language Processing.

Representing/Transforming a text into a bag of words helps to identify various measures to characterize the text.

Predominantly used for calculating the term(word) frequency or the number of times a term occurs in a document/sentence.

It can be used as a feature for training the classifier.


Term Document Matrix
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of terms in a collection of documents.

In a TDM, the rows represent documents and columns represent the terms.

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1.0/len(Training_label), max_df=0.7) #float division so min_df is not truncated to 0 in Python 2

Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)

message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
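
A quick, optional sanity check of the resulting matrix (again, get_feature_names() is get_feature_names_out() on newer scikit-learn releases):

#Inspecting the term document matrix
print(message_data_TDM.shape) #(number of messages, vocabulary size)
print(tf_vectorizer.get_feature_names()[:10]) #a few terms from the learned vocabulary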


Term Frequency Inverse Document Frequency (TFIDF)
In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is expressed by Inverse Document Frequency (IDF). Roughly, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t (scikit-learn adds some smoothing on top of this).
IDF diminishes the weight of the most commonly occurring words and increases the weight of rare words.

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1.0/len(Training_label), max_df=0.7) #float division so min_df is not truncated to 0 in Python 2

Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)

message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)

Let's take the TDM matrix for further evaluation. You can also try out the same using the TFIDF matrix.


Classification Algorithms
There are various algorithms to solve classification problems. The code to try out a few of these algorithms will be presented in the upcoming cards.

We will discuss the following :

Decision Tree Classifier

Stochastic Gradient Descent Classifier

Support Vector Machine Classifier

Random Forest Classifier

Note: The explanation for these algorithms is given in the Machine Learning Axioms course. Refer to the course for further details.

How Does a Classifier Work?
The following are the steps involved in building a classification model:

Initialize the classifier to be used.

Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit the model (training) for the given train data X and train label y.

Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.

Evaluate the classifier model - score(X, y) returns the score for the given test data X and test label y.

Train and Test Data


The code snippet provided here is for partitioning the data into train and test sets for building the classifier model. This split will be used to explain classification algorithms.

#Splitting the data for training and testing
from sklearn.cross_validation import train_test_split

train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=.1)

Decision Tree Classification
It is one of the commonly used classification techniques for performing binary as well as multi-class classification.

The decision tree model predicts the class/target by learning simple decision rules from the features of the data.

#Creating a decision tree classifier model
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

#Model training
classifier = classifier.fit(train_data, train_label)

#After being fitted, the model can then be used to predict the output.
message_predicted_target = classifier.predict(test_data)

score = classifier.score(test_data, test_label)

print('Decision Tree Classifier : ', score)

Stochastic Gradient Descent Classifier
It is used for large-scale learning.

It supports different loss functions and penalties for classification.

seed=7

from sklearn.linear_model import SGDClassifier

classifier = SGDClassifier(loss='modified_huber', shuffle=True,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)


print('SGD classifier : ',score)


Support Vector Machine
Support Vector Machine (SVM) is effective in high-dimensional spaces.

It is effective in cases where the number of dimensions is greater than the number of samples.

It works well with a clear margin of separation.

from sklearn.svm import SVC

classifier = SVC(kernel="linear", C=0.025,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)

print('SVM Classifier : ',score)


Random Forest Classifier
It controls overfitting.

Here, a random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)

print('Random Forest Classifier : ',score)


Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results. So algorithm/model tuning is essential to find out the best model.

For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=5, n_estimators=15, max_features=60,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score=classifier.score(test_data, test_label)

print('Random Forest classification after model tuning',score)

Refer to the scikit-learn tutorials, try changing the parameters of other classifiers, and analyze the results.
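
As a sketch of a more systematic search, here is a minimal grid search over the same two parameters. The grid values are illustrative; RandomForestClassifier, seed, train_data and train_label are as defined above, and GridSearchCV lives in sklearn.grid_search on the older scikit-learn assumed by this course and in sklearn.model_selection on newer releases.

#A minimal grid-search sketch for tuning the random forest (illustrative grid values).
try:
    from sklearn.model_selection import GridSearchCV #newer scikit-learn
except ImportError:
    from sklearn.grid_search import GridSearchCV #older scikit-learn

param_grid = {'n_estimators': [10, 15, 30], 'max_features': [10, 30, 60]}

grid = GridSearchCV(RandomForestClassifier(max_depth=5, random_state=seed), param_grid)
grid.fit(train_data, train_label)

print(grid.best_params_, grid.best_score_)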


Partitioning the Data


It is a methodological mistake to test and train on the same dataset. This is because the classifier would fail to predict correctly for any unseen data. This could result in overfitting.

To avoid this problem,

Split the data into a train set, a validation set and a test set.

Training Set: The data used to train the classifier.

Validation Set: The data used to tune the classifier model parameters, i.e., to understand how well the model has been trained (a part of the training data).

Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).

This will help you know the efficiency of your model.
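
A minimal sketch of such a three-way split using two calls to train_test_split (the proportions are just an illustration; train_test_split, message_data_TDM and Training_label are as defined earlier):

#First hold out 10% of the data as the test set,
#then hold out 10% of the remainder as the validation set.
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.1)

train_data, val_data, train_label, val_label = train_test_split(
    train_data, train_label, test_size=0.1)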

Cross Validation
Cross validation is a model validation technique to evaluate the performance of a model on unseen data (validation set).

Testing accuracy from cross validation is a better estimate of performance on unseen data than training accuracy.

Points to remember:

Cross validation gives high variance if the testing set and training set are not drawn from the same population.

Allowing training data to be included in testing data will not give actual performance results.

In cross validation, the number of samples used for training the model is reduced, and the results depend on the choice of the pair of training and testing sets.

You can refer to the various CV approaches here.
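
A minimal sketch using cross_val_score (found in sklearn.cross_validation on the older scikit-learn assumed here and in sklearn.model_selection on newer releases; DecisionTreeClassifier, message_data_TDM and Training_label are as defined earlier):

try:
    from sklearn.model_selection import cross_val_score #newer scikit-learn
except ImportError:
    from sklearn.cross_validation import cross_val_score #older scikit-learn

#5-fold cross validation of a decision tree on the TDM features
scores = cross_val_score(DecisionTreeClassifier(), message_data_TDM, Training_label, cv=5)

print(scores, scores.mean())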


Stratified Shuffle Split


The StratifiedShuffleSplit splits the data randomly while preserving the percentage of samples of each class.

StratifiedShuffleSplit would suit our case study, as the dataset has a class imbalance which can be seen from the following code snippet:

seed=7

from sklearn.cross_validation import StratifiedShuffleSplit

#creating cross validation object with 10% test size

cross_val = StratifiedShuffleSplit(Training_label, 1, test_size=0.1, random_state=seed)

test_size=0.1 denotes that 10% of the dataset is used for testing.


Stratified Shuffle Split Contd...


This selection is then used to split the data into test and train sets.

from sklearn.neighbors import KNeighborsClassifier

from sklearn.multiclass import OneVsRestClassifier

from sklearn import svm

classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
]

for clf in classifiers:
    score = 0
    for train_index, test_index in cross_val:
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        y_train, y_test = Training_label[train_index], Training_label[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    print(score)

The above code evaluates a collection of classifiers using cross validation. It helps to select the best classifier based on the cross validation scores. The classifier with the highest score can be used for building the classification model.

Note: You may add or remove classifiers based on the requirement.


Classification Accuracy
The classification accuracy is defined as the percentage of correct predictions.

from sklearn.metrics import accuracy_score

print('Accuracy Score',accuracy_score(test_label,message_predicted_target))

classifier = classifier.fit(train_data, train_label)

score=classifier.score(test_data, test_label)

test_label.value_counts()

This simple classification accuracy will not tell us the types of errors our classifier makes.

It is an easy metric to compute, but it does not reveal the underlying distribution of response values.

Confusion Matrix
It is a technique to evaluate the performance of a classifier.

It depicts the performance in a tabular form that has 2 dimensions namely “actual” and “predicted” sets of data.

The rows and columns of the table show the count of false positives, false negatives, true positives and true negatives.

from sklearn.metrics import confusion_matrix

print('Confusion Matrix',confusion_matrix(test_label,message_predicted_target))

The first parameter shows true values and the second parameter shows predicted values.

Confusion Matrix

Consider the confusion matrix for a two-class classifier.

In the table,

TP (True Positive) - The number of correct predictions that the occurrence is positive

FP (False Positive) - The number of incorrect predictions that the occurrence is positive

FN (False Negative) - The number of incorrect predictions that the occurrence is negative

TN (True Negative)- The number of correct predictions that the occurrence is negative

TOTAL - The total number of occurrences
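
A toy worked example of reading these counts off a scikit-learn confusion matrix (the values are illustrative):

from sklearn.metrics import confusion_matrix

y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham', 'ham']

#Rows are actual classes and columns are predicted classes, in the order given by labels.
print(confusion_matrix(y_true, y_pred, labels=['ham', 'spam']))
#[[2 1]  -> 2 TN (ham predicted as ham), 1 FP (ham predicted as spam)
# [1 1]] -> 1 FN (spam predicted as ham), 1 TP (spam predicted as spam)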


Plotting Confusion Matrix
To evaluate the quality of output, it is always better to plot and analyze the results.

For our case study, we have plotted the confusion matrix of the Decision Tree Classifier.

A sketch of a function for plotting the confusion matrix is given below.
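
A minimal plotting sketch using matplotlib (assumed to be installed; confusion_matrix, test_label and message_predicted_target are as defined earlier, and the class order follows the sorted labels):

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix'):
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], horizontalalignment='center')
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

cm = confusion_matrix(test_label, message_predicted_target)
plot_confusion_matrix(cm, classes=['ham', 'spam'])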


Classification Report
The classification_report function shows a text report with the commonly used classification metrics.

from sklearn.metrics import classification_report


target_names = ['ham', 'spam'] #names must follow the sorted order of the labels, which is ['ham', 'spam']

print(classification_report(test_label, message_predicted_target, target_names=target_names))

Precision

When a positive value is predicted, how often is the prediction correct? Precision = TP / (TP + FP).

Recall

It is the true positive rate. When the actual value is positive, how often is the prediction correct? Recall = TP / (TP + FN).

To know more about model evaluation, check this link.


Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries specific to Java, Ruby, etc.

You can find the reference link here:

NLP Libraries
