SVM is a discriminative classifier.
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In
other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane
which categorizes new examples.
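The definition above can be sketched with scikit-learn's SVC. This is a toy illustration on hand-made 2-D points (the data and parameters are assumptions, not part of the case study): a linear kernel learns a separating hyperplane from labeled examples and then categorizes new points.

```python
from sklearn.svm import SVC

# Two tiny, linearly separable 2-D classes (toy data)
X = [[0, 0], [1, 1], [0, 1], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear')
clf.fit(X, y)  # supervised learning from labeled training data

# The learned hyperplane categorizes new, unseen examples
print(clf.predict([[0.5, 0.5], [8.5, 8.5]]))
```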
Which NLP technique uses lexical knowledge base to obtain the correct base form of the words?
lemmatization
Lemmatization tries to take a similarly careful approach to removing inflections. Lemmatization does not
simply chop off inflections, but instead relies on a lexical knowledge base like WordNet to obtain the correct
base forms of words.
Depth of Tree
Usually, increasing the depth of the tree will cause overfitting. Learning rate is not a hyperparameter in random forest. Increasing the number of trees will cause underfitting.
In these examples, it outperforms the Porter stemmer. But lemmatization has limits. For example, Porter
stems both happiness and happy to happi, while WordNet lemmatizes the two words to themselves. ... In
general, lemmatization offers better precision than stemming, but at the expense of recall.
Yes
Which of the following is not a preprocessing method used for unstructured data classification?
confusion_matrix
In document classification, each document has to be converted from full text to a document vector.
StratifiedShuffleSplit
Decision Trees
Boosted Trees
Random Forest
Neural Networks
Nearest Neighbor
Clustering tries to group a set of objects and find whether there is some relationship between the objects. In the
context of machine learning, classification is supervised learning and clustering is unsupervised learning.
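The supervised/unsupervised distinction can be sketched on toy points, using scikit-learn's KNeighborsClassifier and KMeans as stand-ins for the two families (data and parameters assumed for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam']

# Classification (supervised): learns from the labels y
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2]]))

# Clustering (unsupervised): groups the same points without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```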
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it into a variable.
Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document.
Known
"80% of business-relevant information originates in unstructured form, primarily text," says Seth Grimes, a leading analytics strategist.
Try out the code snippets given for the case study.
Refer the links to gain an in-depth understanding of other machine learning techniques.
Unstructured data, as the name suggests, does not have a structured format and may contain data such as dates, numbers or facts.
This results in irregularities and ambiguities which make it difficult to understand using traditional programs, compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
Source: Wikipedia.
A few examples of unstructured data are:
Emails
PDF files
Spreadsheets
Digital Images
Video
Audio
Problem Description
Let us understand unstructured data classification through the following case study:
In our day-to-day lives, we receive a large number of spam/junk messages either in the form of text (SMS) or e-mails. It is important to filter these spam messages since they are not truthful or trustworthy.
In this case study, we apply various machine learning algorithms to categorize the messages depending on whether they are spam or ham.
Your Playground...
Note: In case you don't find any of the required packages in our Katacoda hands-on environment, you can do the following to download them:
For example, if nltk is the required package, type the following command in our hands-on environment to download it:
Type nltk.download()
Use any IDE (PyCharm, Spyder etc.) for trying out the code snippets.
Note: You can find brief descriptions of the python packages here.
Dataset Download
The dataset is available at the SMS Spam dataset link.
curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv
No. of attributes: 2
Data Loading
To start with data loading, import the required python package and load the downloaded CSV file.
The data can be stored as a dataframe for easy data manipulation/analysis. Pandas is one of the most widely used libraries for this.
import pandas as pd

#Data Loading: read the downloaded file (tab-separated, assumed) into a dataframe
messages = pd.read_csv('dataset.csv', sep='\t', names=['label', 'message'])
print(len(messages))
As you can see, our dataset has 2 columns without any headers.
This code snippet reads the data using pandas and labels the column names as label and message.
Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to derive useful information from the given data for making decisions.
In this section, we will analyze the dataset in terms of size and headers, and view a data summary and sample data.
data_size=messages.shape
print(data_size)
messages_col_names=list(messages.columns)
print(messages_col_names)
print(messages.groupby('label').describe())
print(messages.head(3))
Target Identification
Target is the class/category to which you will assign the data.
In this case, you aim to identify whether the message is spam or not.
By observing the columns, the label column has values spam or ham. We can call this case study a Binary Classification, since there are only two classes.
message_target=messages['label']
print(message_target)
Tokenization is a method to split a sentence/string into substrings. These substrings are called tokens.
In Natural Language Processing (NLP), tokenization is the initial step in preprocessing. Splitting a sentence into tokens helps to process each word of the text individually.
import nltk
from nltk.tokenize import word_tokenize

def split_tokens(message):
    message = message.lower()
    word_tokens = word_tokenize(message)
    return word_tokens
Lemmatization
Lemmatization is a method to convert a word into its base/root form.
from nltk.stem import WordNetLemmatizer

def split_into_lemmas(message):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in message:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma
print('Tokenized message:',messages['tokenized_message'][11])
print('Lemmatized message:',messages['lemmatized_message'][11])
from nltk.corpus import stopwords

def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    filtered_sentence = []
    for word in message:
        if word not in stop_words:
            filtered_sentence.append(word)
    return filtered_sentence

messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']), axis=1)
Training_data = pd.Series(list(messages['preprocessed_message']))
Training_label = pd.Series(list(messages['label']))
We will be looking into a few specific ones used for unstructured data.
Bag Of Words(BOW)
Bag of Words (BOW) is one of the most widely used methods for generating features in Natural Language Processing.
Representing/Transforming a text into a bag of words helps to identify various measures to characterize the text.
It is predominantly used for calculating the term (word) frequency, i.e., the number of times a term occurs in a document/sentence.
In a TDM, the rows represent documents and columns represent the terms.
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer()  # term-frequency vectorizer (default parameters assumed; tune as needed)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
TF-IDF (Term Frequency-Inverse Document Frequency) weighs how important a term is in a collection of documents.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()  # default parameters assumed; tune as needed
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)
Let's take the TDM matrix for further evaluation. You can also try out the same using TFIDF matrix.
Classification Algorithms
There are various algorithms to solve classification problems. The code to try out a few of these algorithms is presented below.
Note: The explanations for these algorithms are given in the Machine Learning Axioms course. Refer to the course for further details.
Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit the model (training) to the given train data X and train label y.
Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.
Evaluate the classifier model - score(X, y) returns the score for the given test data X and test label y.
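The fit/predict/score workflow described above can be sketched on toy data (values and the decision tree choice are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[0], [1], [8], [9]]
y_train = ['ham', 'ham', 'spam', 'spam']

clf = DecisionTreeClassifier(random_state=7)
clf.fit(X_train, y_train)                       # train on labeled data
print(clf.predict([[2], [7]]))                  # predict unlabeled observations
print(clf.score([[0], [9]], ['ham', 'spam']))   # evaluate on test data
```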
The decision tree model predicts the class/target by learning simple decision rules from the features of the data.
from sklearn.tree import DecisionTreeClassifier

seed = 7
classifier = DecisionTreeClassifier(random_state=seed)
classifier = classifier.fit(train_data, train_label) #After being fitted, the model can then be used to predict the output.
message_predicted_target = classifier.predict(test_data)
It is effective in cases where the number of dimensions is greater than the number of samples.
Here, a random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
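The averaging idea can be sketched with scikit-learn's RandomForestClassifier on toy data (values assumed): the forest holds many individual trees, each trained on a bootstrap sub-sample, and combines their votes.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0], [1], [2], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

# 50 trees, each fitted on a bootstrap sub-sample of the data
forest = RandomForestClassifier(n_estimators=50, random_state=7)
forest.fit(X, y)

print(forest.predict([[1], [9]]))  # combined (averaged) prediction
print(len(forest.estimators_))     # the individual decision trees
```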
Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results, so algorithm/model tuning is essential to find the best model.
For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).
score=classifier.score(test_data, test_label)
Refer scikit-learn tutorials and try to change the parameters of other classifiers and analyze the results.
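One way to search over n_estimators and max_features systematically is scikit-learn's GridSearchCV; this is a sketch on toy data with an assumed parameter grid, not the case-study tuning itself:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = [[0], [1], [2], [3], [8], [9], [10], [11]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hypothetical grid over the two parameters mentioned above
param_grid = {'n_estimators': [5, 20], 'max_features': [None, 1]}
search = GridSearchCV(RandomForestClassifier(random_state=7), param_grid, cv=2)
search.fit(X, y)

print(search.best_params_)  # the best-scoring parameter combination
```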
Split the data into a train set, a validation set and a test set.
Training Set: The data used to train the classifier model. A model evaluated only on its own training data may look accurate yet fail to predict correctly for unseen data; this could result in overfitting.
Validation Set: The data used to tune the classifier model parameters, i.e., to understand how well the model has been trained.
Testing Set: The data used to evaluate the performance of the classifier (unseen data by the classifier).
Cross Validation
Cross validation is a model validation technique to evaluate the performance of a model on unseen data (validation set).
It is a better estimate to evaluate testing accuracy than training accuracy on unseen data.
Points to remember:
Cross validation gives high variance if the testing set and training set are not drawn from the same population.
Allowing training data to be included in testing data will not give actual performance results.
In cross validation, the number of samples used for training the model is reduced, and the results depend on the choice of the pair of training and testing sets.
You can refer to the various CV approaches here.
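The idea of evaluating on held-out folds can be sketched with scikit-learn's cross_val_score (toy data and a decision tree assumed): each fold's score is computed on data the model never saw during training.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3], [8], [9], [10], [11]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# 4 folds: train on 6 samples, score on the 2 held-out samples each time
scores = cross_val_score(DecisionTreeClassifier(random_state=7), X, y, cv=4)
print(scores)         # one accuracy value per fold
print(scores.mean())  # averaged estimate on unseen data
```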
StratifiedShuffleSplit would suit our case study, as the dataset has a class imbalance, which can be seen from the following code snippet.
seed=7
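A sketch of why StratifiedShuffleSplit suits imbalanced data (toy labels assumed): every random split preserves the class ratio in both the training and the test fold.

```python
from sklearn.model_selection import StratifiedShuffleSplit

y = ['ham'] * 8 + ['spam'] * 2   # imbalanced, like the SMS dataset
X = [[i] for i in range(10)]

seed = 7
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=seed)
for train_idx, test_idx in sss.split(X, y):
    # every test fold keeps the 80/20 ham-to-spam ratio
    print([y[i] for i in test_idx])
```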
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
]

for clf in classifiers:
    score = 0
    # assuming cross_val is the StratifiedShuffleSplit object defined earlier
    for train_index, test_index in cross_val.split(message_data_TDM, Training_label):
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        y_train, y_test = Training_label[train_index], Training_label[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    print(score)
The above code runs several classifiers under cross validation. It helps to select the best classifier based on the cross-validation score.
Classification Accuracy
The classification accuracy is defined as the percentage of correct predictions.
from sklearn.metrics import accuracy_score

print('Accuracy Score', accuracy_score(test_label, message_predicted_target))
score=classifier.score(test_data, test_label)
test_label.value_counts()
This simple classification accuracy does not tell us the types of errors made by our classifier.
It is an easy metric, but it does not reveal the underlying distribution of the response values.
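A toy sketch of why accuracy alone can mislead on imbalanced data (hypothetical labels, not the case-study results): a classifier that never predicts spam still scores 90%.

```python
from sklearn.metrics import accuracy_score

true_labels = ['ham'] * 9 + ['spam']
lazy_predictions = ['ham'] * 10     # a classifier that never predicts spam

# 90% accurate, yet it catches zero spam messages
print(accuracy_score(true_labels, lazy_predictions))
```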
Confusion Matrix
It is a technique to evaluate the performance of a classifier.
It depicts the performance in a tabular form that has 2 dimensions namely “actual” and “predicted” sets of data.
The rows and columns of the table show the count of false positives, false negatives, true positives and true negatives.
from sklearn.metrics import confusion_matrix

print('Confusion Matrix', confusion_matrix(test_label, message_predicted_target))
The first parameter shows true values and the second parameter shows predicted values.
Confusion Matrix
The following describes the confusion matrix for a two-class classifier.
In the table,
TP (True Positive) - The number of correct predictions that the occurrence is positive
FP (False Positive) - The number of incorrect predictions that the occurrence is positive
FN (False Negative) - The number of incorrect predictions that the occurrence is negative
TN (True Negative)- The number of correct predictions that the occurrence is negative
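The TP/FP/FN/TN counts above can be read off a small worked example (hypothetical labels assumed, with rows as actual classes and columns as predicted classes):

```python
from sklearn.metrics import confusion_matrix

true_labels = ['spam', 'spam', 'ham', 'ham', 'ham', 'spam']
predicted =   ['spam', 'ham',  'ham', 'spam', 'ham', 'spam']

# With labels=['spam', 'ham']: row 0 = actual spam, column 0 = predicted spam
print(confusion_matrix(true_labels, predicted, labels=['spam', 'ham']))
```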
For our case study, we have plotted the confusion matrix of the Decision Tree Classifier.
Classification Report
The classification_report function shows a text report with the commonly used classification metrics.
Precision
When the classifier predicts a positive value, how often is the prediction correct?
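Precision can be checked by hand on a small example (hypothetical labels assumed): of the three 'spam' predictions below, two are actually spam, so precision for the spam class is 2/3.

```python
from sklearn.metrics import classification_report, precision_score

true_labels = ['spam', 'ham', 'spam', 'ham', 'ham']
predicted =   ['spam', 'spam', 'spam', 'ham', 'ham']

# Of all 'spam' predictions, how many were actually spam? (2 of 3)
print(precision_score(true_labels, predicted, pos_label='spam'))
print(classification_report(true_labels, predicted))
```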
Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries specific to Java, Ruby, etc.
NLP Libraries