
A True Positive is when both the predicted instance and the actual instance are positive.

SVM is a

supervised learning algorithm.

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In
other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane
which categorizes new examples.

Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?

lemmatization

Lemmatization takes a careful approach to removing inflections. It does not simply chop off inflections, but
instead relies on a lexical knowledge base like WordNet to obtain the correct base forms of words.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What command should be given to tokenize a sentence into words?

from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(sentence)


Which of the given hyperparameter(s), when increased, may cause random forest to overfit the data?

Depth of Tree

Usually, increasing the depth of the trees will cause overfitting. Learning rate is not a hyperparameter in
random forest. Increasing the number of trees does not cause overfitting; it generally makes the ensemble's predictions more stable.

Lemmatization offers better precision than stemming

In these examples, it outperforms the Porter stemmer. But lemmatization has limits. For example, Porter
stems both happiness and happy to happi, while WordNet lemmatizes the two words to themselves. In
general, lemmatization offers better precision than stemming, but at the expense of recall.
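
A minimal sketch comparing the two behaviors (this assumes NLTK is installed and the WordNet corpus has been downloaded via nltk.download('wordnet')):

#Comparing Porter stemming with WordNet lemmatization on a few words.
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['happiness', 'happy', 'studies']:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))

#Porter maps both 'happiness' and 'happy' to 'happi', while the lemmatizer
#keeps them as they are ('studies' is lemmatized to 'study').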

Can we consider sentiment classification as a text classification problem?

Yes

Which of the following is not a preprocessing method used for unstructured data classification?

confusion_matrix

In document classification, each document has to be converted from full text to a document vector.

Cross-validation causes over-fitting.

False. Cross-validation is a model validation technique; it helps detect over-fitting rather than cause it.

Which one of the following is not a classification technique?

StratifiedShuffleSplit

Here we have the types of classification algorithms in Machine Learning:


Linear Classifiers: Logistic Regression, Naive Bayes Classifier

Support Vector Machines

Decision Trees

Boosted Trees

Random Forest

Neural Networks

Nearest Neighbor

High classification accuracy always indicates a good classifier.

False. Simple accuracy does not reveal the types of errors the classifier makes.

Inverse Document Frequency is used in the term document matrix.

TF and IDF use matrix representations.

The fit(X, y) method is used to

Train the Classifier

Clustering is a supervised classification.

Clustering tries to group a set of objects and find whether there is some relationship between the objects. In the
context of machine learning, classification is supervised learning and clustering is unsupervised learning.

a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the variable.

b) Give the column names as 'label' and 'message'

c) Try out the code snippets and answer the questions.

What does the command sentiment_analysis_data['label'].value_counts() return?

Return a Series containing counts of unique values.
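
As a toy illustration (the counts below are from a made-up Series, not the actual dataset):

import pandas as pd

labels = pd.Series(['ham', 'ham', 'spam', 'ham'])

print(labels.value_counts())
#ham     3
#spam    1
#dtype: int64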

Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document.
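
A minimal sketch of this counting with scikit-learn's CountVectorizer on two toy sentences (note that get_feature_names() was renamed get_feature_names_out() in newer scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['free entry win win', 'call me now']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names()) #the learned vocabulary
print(counts.toarray()) #per-document word counts; 'win' is counted twice in the first document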

In Supervised learning, class labels of the training samples are

Known

Choose the correct sequence from the following:

Data Analysis -> Preprocessing -> Model Building -> Predict


Why this course?
This course gives you a practical experience for solving unstructured text classification problems. If you're wondering why you need unstructured text:

"80% of business relevant information originates in the unstructured form, primarily text," says Seth Grimes, a leading analytics strategy consultant.

What Would You Need to Follow Along?


Have a basic understanding of machine learning concepts.

Try out the code snippets given for the case study.

Refer to the links to gain an in-depth understanding of other machine learning techniques.

Programming is usually taught by examples - Niklaus Wirth

Unstructured data, as the name suggests, does not have a structured format and may contain data such as dates, numbers or facts.

This results in irregularities and ambiguities which make it difficult to understand using traditional programs when compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Source: Wikipedia.
A few examples of unstructured data are:

Emails

Word Processing Files

PDF files

Spreadsheets

Digital Images

Video

Audio

Social Media Posts, etc.

Problem Description
Let us understand unstructured data classification through the following case study:

SMS Spam Detection:

In our day-to-day lives, we receive a large number of spam/junk messages either in the form of text (SMS) or e-mails. It is important to filter these spam messages since they are not truthful or trustworthy.

In this case study, we apply various machine learning algorithms to categorize the messages depending on whether they are spam or not.

Your Playground...
Note: In case you don't find any of the required packages in our Katacoda hands-on environment, you can do the following to install the needed ones on the same.

For example: If nltk is the required package, then type the following command on the terminal to download it in our hands-on environment.

pip install nltk --target=./.


For NLTK, you have a few other dependent packages. You can perform the following steps to download them:

Open the Python terminal in the command prompt (type python).

Type import nltk

Type nltk.download()

Type d for download

Type all to download all dependent packages of NLTK.

Setup Your Local Machine


To run the code locally:

Install Python 2.7+ on your machine.

Install the required packages - Pandas, Sklearn, Numpy (use pip install).

Use any IDE (PyCharm, Spyder etc.) for trying out the code snippets.

Note: You can find brief descriptions of the python packages here.

Dataset Download
The dataset is available at the SMS Spam dataset link.

Open the terminal and type the following command to download.

curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv

This command downloads the data and saves it as dataset.csv.


Dataset Description
The dataset contains a collection of SMS messages, each labeled as spam or ham.

The following is a description of our dataset:

No. of Classes: 2 (Spam / Ham)

No. of Attributes: 2

No. of Instances: 5574

Data Loading
To start with data loading, import the required python package and load the downloaded CSV file.

The data can be stored as a dataframe for easy data manipulation/analysis. Pandas is one of the most widely used libraries for this.

import pandas as pd

import csv

#Data Loading

messages = [line.rstrip() for line in open('dataset.csv')]

print(len(messages))

#Appending column headers

messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE,names=["label", "message"])

As you can see, our dataset has 2 columns without any headers.

This code snippet reads the data using pandas and labels the column names as label and message.

Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to derive useful information from the given data for making decisions.

In this section, we will analyze the dataset in terms of size, headers, data summary and sample data.

You can see the dataset size using:

data_size=messages.shape

print(data_size)

Column names can be viewed by:

messages_col_names=list(messages.columns)

print(messages_col_names)

To understand aggregate statistics easily, use the following command:

print(messages.groupby('label').describe())

To see sample data, use the following command:

print(messages.head(3))

Target Identification
Target is the class/category to which you will assign the data.

In this case, you aim to identify whether the message is spam or not.

By observing the columns, the label column has values Spam or Ham. We can call this case study a Binary Classification, since it has only two possible outcomes.

#Identifying the outcome/target variable.

message_target=messages['label']

print(message_target)

Tokenization is a method to split a sentence/string into substrings. These substrings are called tokens.
In Natural Language Processing (NLP), tokenization is the initial step in preprocessing. Splitting a sentence into tokens helps to remove unwanted information in the raw text such as white spaces, line breaks and so on.

import nltk

from nltk.tokenize import word_tokenize

def split_tokens(message):
    message = message.lower()
    message = unicode(message, 'utf8') #convert bytes into proper unicode
    word_tokens = word_tokenize(message)
    return word_tokens

messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']), axis=1)


Lemmatization
Lemmatization is a method to convert a word into its base/root form.

Lemmatizer removes affixes of the words present in its dictionary.

from nltk.stem.wordnet import WordNetLemmatizer

def split_into_lemmas(message):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in message:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma

messages['lemmatized_message'] = messages.apply(lambda row: split_into_lemmas(row['tokenized_message']), axis=1)

print('Tokenized message:',messages['tokenized_message'][11])

print('Lemmatized message:',messages['lemmatized_message'][11])


Stop Word Removal
Stop words are common words that do not add any relevance for classification (e.g. “the”, “a”, “an”, “in” etc.). Hence, it is essential to remove these words.

from nltk.corpus import stopwords

def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    filtered_sentence = ' '.join([word for word in message if word not in stop_words])
    return filtered_sentence

messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']), axis=1)

Training_data=pd.Series(list(messages['preprocessed_message']))

Training_label=pd.Series(list(messages['label']))

Why is Feature Extraction Important?


To perform machine learning on text documents, you first need to turn the text content into numerical feature vectors.

In Python, you have a few feature extraction packages available under sklearn.

We will be looking into a few specific ones used for unstructured data.


Bag Of Words (BOW)
Bag of Words (BOW) is one of the most widely used methods for generating features in Natural Language Processing.

Representing/Transforming a text into a bag of words helps to identify various measures to characterize the text.

Predominantly used for calculating the term(word) frequency or the number of times a term occurs in a document/sentence.

It can be used as a feature for training the classifier.


Term Document Matrix
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of terms in a collection of documents.

In a TDM, the rows represent documents and columns represent the terms.

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1.0/len(Training_label), max_df=0.7) #float division so min_df is not truncated to 0 in Python 2

Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)

message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
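
A quick, optional sanity check of the resulting matrix (again, get_feature_names() is get_feature_names_out() on newer scikit-learn releases):

#Inspecting the term document matrix
print(message_data_TDM.shape) #(number of messages, vocabulary size)
print(tf_vectorizer.get_feature_names()[:10]) #a few terms from the learned vocabulary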


Term Frequency Inverse Document Frequency (TFIDF)
In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is expressed by Inverse Document Frequency (IDF). Roughly, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t (scikit-learn adds some smoothing on top of this).
IDF diminishes the weight of the most commonly occurring words and increases the weight of rare words.

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1.0/len(Training_label), max_df=0.7) #float division so min_df is not truncated to 0 in Python 2

Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)

message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)

Let's take the TDM matrix for further evaluation. You can also try out the same using the TFIDF matrix.


Classification Algorithms
There are various algorithms to solve classification problems. The code to try out a few of these algorithms will be presented in the upcoming cards.

We will discuss the following :

Decision Tree Classifier

Stochastic Gradient Descent Classifier

Support Vector Machine Classifier

Random Forest Classifier

Note: The explanation for these algorithms is given in the Machine Learning Axioms course. Refer to the course for further details.

How Does a Classifier Work?
The following are the steps involved in building a classification model:

Initialize the classifier to be used.

Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit the model (training) for the given train data X and train label y.

Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.

Evaluate the classifier model - score(X, y) returns the score for the given test data X and test label y.

Train and Test Data


The code snippet provided here is for partitioning the data into train and test sets for building the classifier model. This split will be used to explain classification algorithms.

#Splitting the data for training and testing
from sklearn.cross_validation import train_test_split

train_data, test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=.1)

Decision Tree Classification
It is one of the commonly used classification techniques for performing binary as well as multi-class classification.

The decision tree model predicts the class/target by learning simple decision rules from the features of the data.

#Creating a decision tree classifier model
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()

#Model training
classifier = classifier.fit(train_data, train_label)

#After being fitted, the model can then be used to predict the output.
message_predicted_target = classifier.predict(test_data)

score = classifier.score(test_data, test_label)

print('Decision Tree Classifier : ', score)

Stochastic Gradient Descent Classifier
It is used for large-scale learning.

It supports different loss functions and penalties for classification.

seed=7

from sklearn.linear_model import SGDClassifier

classifier = SGDClassifier(loss='modified_huber', shuffle=True,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)


print('SGD classifier : ',score)


Support Vector Machine
Support Vector Machine (SVM) is effective in high-dimensional spaces.

It is effective in cases where the number of dimensions is greater than the number of samples.

It works well with a clear margin of separation.

from sklearn.svm import SVC

classifier = SVC(kernel="linear", C=0.025,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)

print('SVM Classifier : ',score)


Random Forest Classifier
It controls overfitting.

Here, a random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy.

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score = classifier.score(test_data, test_label)

print('Random Forest Classifier : ',score)


Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results. So algorithm/model tuning is essential to find out the best model.

For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=5, n_estimators=15, max_features=60,random_state=seed)

classifier = classifier.fit(train_data, train_label)

score=classifier.score(test_data, test_label)

print('Random Forest classification after model tuning',score)

Refer to the scikit-learn tutorials, try changing the parameters of other classifiers, and analyze the results.
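
As a sketch of a more systematic search, here is a minimal grid search over the same two parameters. The grid values are illustrative; RandomForestClassifier, seed, train_data and train_label are as defined above, and GridSearchCV lives in sklearn.grid_search on the older scikit-learn assumed by this course and in sklearn.model_selection on newer releases.

#A minimal grid-search sketch for tuning the random forest (illustrative grid values).
try:
    from sklearn.model_selection import GridSearchCV #newer scikit-learn
except ImportError:
    from sklearn.grid_search import GridSearchCV #older scikit-learn

param_grid = {'n_estimators': [10, 15, 30], 'max_features': [10, 30, 60]}

grid = GridSearchCV(RandomForestClassifier(max_depth=5, random_state=seed), param_grid)
grid.fit(train_data, train_label)

print(grid.best_params_, grid.best_score_)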


Partitioning the Data


It is a methodological mistake to test and train on the same dataset. This is because the classifier would fail to predict correctly for any unseen data. This could result in overfitting.

To avoid this problem,

Split the data into a train set, a validation set and a test set.

Training Set: The data used to train the classifier.

Validation Set: The data used to tune the classifier model parameters, i.e., to understand how well the model has been trained (a part of the training data).

Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).

This will help you know the efficiency of your model.
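
A minimal sketch of such a three-way split using two calls to train_test_split (the proportions are just an illustration; train_test_split, message_data_TDM and Training_label are as defined earlier):

#First hold out 10% of the data as the test set,
#then hold out 10% of the remainder as the validation set.
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.1)

train_data, val_data, train_label, val_label = train_test_split(
    train_data, train_label, test_size=0.1)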

Cross Validation
Cross validation is a model validation technique to evaluate the performance of a model on unseen data (validation set).

Testing accuracy from cross validation is a better estimate of performance on unseen data than training accuracy.

Points to remember:

Cross validation gives high variance if the testing set and training set are not drawn from the same population.

Allowing training data to be included in testing data will not give actual performance results.

In cross validation, the number of samples used for training the model is reduced, and the results depend on the choice of the pair of training and testing sets.

You can refer to the various CV approaches here.
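
A minimal sketch using cross_val_score (found in sklearn.cross_validation on the older scikit-learn assumed here and in sklearn.model_selection on newer releases; DecisionTreeClassifier, message_data_TDM and Training_label are as defined earlier):

try:
    from sklearn.model_selection import cross_val_score #newer scikit-learn
except ImportError:
    from sklearn.cross_validation import cross_val_score #older scikit-learn

#5-fold cross validation of a decision tree on the TDM features
scores = cross_val_score(DecisionTreeClassifier(), message_data_TDM, Training_label, cv=5)

print(scores, scores.mean())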


Stratified Shuffle Split


The StratifiedShuffleSplit splits the data randomly while preserving the percentage of samples of each class.

StratifiedShuffleSplit would suit our case study, as the dataset has a class imbalance which can be seen from the following code snippet:

seed=7

from sklearn.cross_validation import StratifiedShuffleSplit

#creating cross validation object with 10% test size

cross_val = StratifiedShuffleSplit(Training_label, 1, test_size=0.1, random_state=seed)

test_size=0.1 denotes that 10% of the dataset is used for testing.


Stratified Shuffle Split Contd...


This selection is then used to split the data into test and train sets.

from sklearn.neighbors import KNeighborsClassifier

from sklearn.multiclass import OneVsRestClassifier

from sklearn import svm

classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
]

for clf in classifiers:
    score = 0
    for train_index, test_index in cross_val:
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        y_train, y_test = Training_label[train_index], Training_label[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    print(score)

The above code evaluates a collection of classifiers using cross validation. It helps to select the best classifier based on the cross validation scores. The classifier with the highest score can be used for building the classification model.

Note: You may add or remove classifiers based on the requirement.


Classification Accuracy
The classification accuracy is defined as the percentage of correct predictions.

from sklearn.metrics import accuracy_score

print('Accuracy Score',accuracy_score(test_label,message_predicted_target))

classifier = classifier.fit(train_data, train_label)

score=classifier.score(test_data, test_label)

test_label.value_counts()

This simple classification accuracy will not tell us the types of errors our classifier makes.

It is an easy metric to compute, but it does not reveal the underlying distribution of response values.

Confusion Matrix
It is a technique to evaluate the performance of a classifier.

It depicts the performance in a tabular form that has 2 dimensions namely “actual” and “predicted” sets of data.

The rows and columns of the table show the count of false positives, false negatives, true positives and true negatives.

from sklearn.metrics import confusion_matrix

print('Confusion Matrix',confusion_matrix(test_label,message_predicted_target))

The first parameter shows true values and the second parameter shows predicted values.

Confusion Matrix

Consider the confusion matrix for a two-class classifier.

In the table,

TP (True Positive) - The number of correct predictions that the occurrence is positive

FP (False Positive) - The number of incorrect predictions that the occurrence is positive

FN (False Negative) - The number of incorrect predictions that the occurrence is negative

TN (True Negative)- The number of correct predictions that the occurrence is negative

TOTAL - The total number of occurrences
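
A toy worked example of reading these counts off a scikit-learn confusion matrix (the values are illustrative):

from sklearn.metrics import confusion_matrix

y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham', 'ham']

#Rows are actual classes and columns are predicted classes, in the order given by labels.
print(confusion_matrix(y_true, y_pred, labels=['ham', 'spam']))
#[[2 1]  -> 2 TN (ham predicted as ham), 1 FP (ham predicted as spam)
# [1 1]] -> 1 FN (spam predicted as ham), 1 TP (spam predicted as spam)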


Plotting Confusion Matrix
To evaluate the quality of output, it is always better to plot and analyze the results.

For our case study, we have plotted the confusion matrix of the Decision Tree Classifier.

A sketch of a function for plotting the confusion matrix is given below.
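
A minimal plotting sketch using matplotlib (assumed to be installed; confusion_matrix, test_label and message_predicted_target are as defined earlier, and the class order follows the sorted labels):

import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix'):
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], horizontalalignment='center')
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

cm = confusion_matrix(test_label, message_predicted_target)
plot_confusion_matrix(cm, classes=['ham', 'spam'])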


Classification Report
The classification_report function shows a text report with the commonly used classification metrics.

from sklearn.metrics import classification_report


target_names = ['ham', 'spam'] #names must follow the sorted order of the labels, which is ['ham', 'spam']

print(classification_report(test_label, message_predicted_target, target_names=target_names))

Precision

When a positive value is predicted, how often is the prediction correct? Precision = TP / (TP + FP).

Recall

It is the true positive rate. When the actual value is positive, how often is the prediction correct? Recall = TP / (TP + FN).

To know more about model evaluation, check this link.


Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries specific to Java, Ruby, etc.

You can find the reference link here:

NLP Libraries
