
Contents

1. Abstract
2. Introduction
1.2.1. Sentiment Analysis
1.3.1. Documents Collection
1.3.2. Pre-Processing
1.3.3. Indexing
1.3.4. Feature Selection
1.3.5. Text Classification
1.3.6. Performance Evaluations
1.3.7. Technology Used
1.3.8. Dataset Sources
1.2.2. Coding
1.2.3. Feature Generation using Bag of Words
1.3.9. Split Train and Test Set
1.3.10. Model Building and Evaluation
1.2.4. Conclusion
3. Future Work
1. Abstract
In today's highly competitive business world, companies need to know customer opinion about the
products and services they provide in order to remain competitive. This project focuses on sentiment
analysis using text classification. Text classification is the task of automatically sorting a set of
documents into categories from a predefined set. This project uses text classification algorithms (the
Naïve Bayes classifier and the vector space model, VSM) to classify documents into different
categories, trained on a sentiment analysis dataset.
KEYWORDS: Text Classification, Naive Bayes Classifier, Vector Space Model

2. Introduction
1.2.1. Sentiment Analysis
Thanks to NLP and related fields, business companies today want to know and understand customer
satisfaction, complaints, strengths, weaknesses, and gaps, in order to stay strong competitors. By
knowing, for example, what went wrong with their latest product or what users and the general public
think about its latest feature, companies can maintain their quality and work to fill the gaps and to
eliminate or reduce negative factors.

Quantifying customers' content, ideas, beliefs, and opinions is known as sentiment analysis. Users'
online posts, blogs, tweets, and product feedback help businesses understand their target audience and
innovate in products and services. Sentiment analysis helps in understanding people in a better and
more accurate way. It is not limited to marketing; it can also be applied in politics, research, and
security.

Human communication is not limited to words alone; it is more than words. Sentiment is a
combination of words, tone, and writing style. As data analysts, it is important to understand what a
sentiment really means and to classify customer messages as positive, negative, or neutral.
There are two main approaches to performing sentiment analysis.
Lexicon-based: count the number of positive and negative words in a given text; whichever count is
larger determines the sentiment of the text (a small illustrative sketch follows this list).
Machine learning based: build a classification model trained on a pre-labelled dataset of positive,
negative, and neutral examples.
In this project we use the second approach (the machine learning based approach).
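To make the lexicon-based approach concrete, here is a minimal sketch in Python, assuming two tiny
hypothetical word lists (a real system would use a full sentiment lexicon):

# Minimal sketch of the lexicon-based approach with tiny, hypothetical word lists.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "sad"}

def lexicon_sentiment(text):
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great movie"))   # positive
print(lexicon_sentiment("the plot was bad and sad"))  # negative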

1.3.1. Documents Collection


This is the first step of the classification process, in which the dataset of documents is collected.

1.3.2. Pre-Processing
Pre-processing converts the text documents into a clean word format. The documents prepared for the
next step of text classification are represented by a large number of features. The steps commonly
taken are:
Tokenization: a document is treated as a string and then partitioned into a list of tokens.
Removing stop words: stop words such as "the", "a", and "and" occur frequently but carry little
meaning, so these insignificant words are removed.
Stemming: a stemming algorithm conflates tokens of different word forms into a common canonical
(root) form, e.g. "connection" to "connect", "computing" to "compute".
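A minimal sketch of these three steps, assuming the NLTK library with its punkt tokenizer models and
English stop-word list already downloaded via nltk.download():

# Pre-processing sketch with NLTK: tokenization, stop-word removal, stemming.
# Assumes nltk.download('punkt') and nltk.download('stopwords') have been run.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

text = "The connection to the server keeps computing results"
tokens = nltk.word_tokenize(text.lower())            # tokenization
tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
tokens = [stemmer.stem(t) for t in tokens]           # stemming
print(tokens)  # e.g. ['connect', 'server', 'keep', 'comput', 'result']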

1.3.3. Indexing
Document representation is a pre-processing technique used to reduce the complexity of the
documents and make them easier to handle: each document has to be transformed from its full-text
version into a document vector. Perhaps the most commonly used representation is the vector space
model (as introduced in the SMART system), in which documents are represented by vectors of
words. A collection of documents is then represented by a word-by-document matrix.
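As an illustrative sketch (not the project's own code), a small collection can be turned into such a
word-by-document matrix with scikit-learn, and the resulting document vectors can then be compared
in this vector space, for example with cosine similarity:

# Sketch: represent documents as word-count vectors (vector space model)
# and compare them pairwise with cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the movie was great", "the movie was terrible", "a great film"]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)        # word-by-document matrix
print(vectorizer.get_feature_names_out())      # vocabulary (recent scikit-learn versions)
print(cosine_similarity(matrix))               # pairwise document similarity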

1.3.4. Feature Selection


After pre-processing and indexing, the next important step of text classification is feature selection,
which constructs the vector space and improves the scalability, efficiency, and accuracy of a text
classifier. The main idea of feature selection (FS) is to select a subset of the features of the original
documents. FS is performed by keeping the words with the highest score according to a predetermined
measure of word importance.
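One widely used importance measure is the chi-squared statistic; the sketch below applies
scikit-learn's SelectKBest to a toy corpus (the documents, labels, and value of k are illustrative
assumptions, not the project's configuration):

# Sketch: keep the k highest-scoring features according to the chi-squared measure.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great acting and story", "boring plot and weak acting",
        "great story", "weak and boring film"]
labels = [1, 0, 1, 0]                      # toy sentiment labels

X = CountVectorizer().fit_transform(docs)  # full feature space
selector = SelectKBest(chi2, k=3)          # keep only the 3 best-scoring words
X_reduced = selector.fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)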

1.3.5. Text Classification


Text classification is one of the important tasks of text mining. It is a supervised approach that
identifies the category, or class, of a given text such as a blog post, book, web page, news article, or
tweet. It has various applications in today's computing world, such as spam detection, task
categorization in CRM services, categorizing products on e-retailer websites, classifying the content
of websites for a search engine, and analyzing the sentiment of customer feedback.
Figure 1.1: Model of the text classification process

1.3.6. Performance Evaluations


This is the last stage of text classification, in which the evaluation of text classifiers is typically
conducted experimentally rather than analytically. Experimental evaluation usually concentrates not
on efficiency but on effectiveness, i.e. the classifier's capability of taking the right categorization
decisions. Many measures have been used, such as precision, recall, fallout, error, and accuracy.
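Such measures can be computed with scikit-learn; a small sketch with toy labels for illustration:

# Sketch: computing accuracy, precision, and recall for a classifier's predictions.
from sklearn import metrics

y_true = [0, 1, 2, 2, 1, 0]   # gold labels (toy example)
y_pred = [0, 1, 2, 1, 1, 0]   # labels predicted by the classifier

print("Accuracy :", metrics.accuracy_score(y_true, y_pred))
print("Precision:", metrics.precision_score(y_true, y_pred, average='macro'))
print("Recall   :", metrics.recall_score(y_true, y_pred, average='macro'))
print(metrics.classification_report(y_true, y_pred))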

1.3.7. Technology used


This project uses:
- Anaconda Navigator 1.9.7
- the Windows 10 operating system
- a personal laptop

1.3.8. Dataset Sources


The movie review dataset was collected from GitHub.
The dataset is a tab-separated file with four columns:
- PhraseId,
- SentenceId,
- Phrase, and
- Sentiment.
The data has five sentiment labels:
- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive
Here we build a model to classify these sentiment labels; sample rows of the dataset are shown in Table 1.1.
    PhraseId  SentenceId  Phrase                                              Sentiment
0   1         1           A series of escapades demonstrating the adage ...  1
1   2         1           A series of escapades demonstrating the adage ...  2
2   3         1           A series                                            2
3   4         1           A                                                   2
4   5         1           series                                              2
Table 1.1: Sample rows of the movie review dataset

1.2.2. Coding
First, place the dataset in the required location, C:/Users/abiyotad/movies.tsv.

Figure 1.2: Screenshot of the dataset location

The next step is to load the movie training dataset and import the pandas library, using the following
Python code in a Jupyter Notebook.
# Import pandas and load the tab-separated training data
import pandas as pd
data = pd.read_csv('train.tsv', sep='\t')
data.head()   # preview the first rows of the DataFrame

Figure 1.3: Screenshot of the sample code used in this project
The following output is generated:

Table 1.2: Output of data.head()


Figure 1.4: Data overview and sentiment value counts
Figure 1.5: Sentiment distribution shown as a graph
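The outputs shown in Figures 1.4 and 1.5 can be reproduced with code along the following lines (a
sketch that assumes the data DataFrame loaded above):

# Sketch: inspect the sentiment label distribution and plot it (as in Figures 1.4 and 1.5).
import matplotlib.pyplot as plt

print(data['Sentiment'].value_counts())   # number of phrases per sentiment label
data['Sentiment'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Sentiment label')
plt.ylabel('Number of phrases')
plt.show()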

1.2.3. Feature Generation using Bag of Words


In our dataset we have a set of texts and their respective labels. Since this text cannot be used
directly, we need to convert it into numbers, or vectors of numbers. The bag-of-words (BoW) model
is the simplest way of extracting features from text. BoW converts text into a matrix of word
occurrences within each document; the model is only concerned with whether given words occur in
the document, not with their order.
Example: consider three documents:
Doc1: I love Ethiopia.
Doc2: I hate China's food.
Doc3: China's food is my favorite and passion.
Now we can create a matrix of documents and words by counting the occurrence of each word in the
given documents. This matrix is known as the Document-Term Matrix (DTM).
        I  love  Ethiopia  hate  China's  food  is  my  favorite  and  passion
Doc1    1  1     1         0     0        0     0   0   0         0    0
Doc2    1  0     0         1     1        1     0   0   0         0    0
Doc3    0  0     0         0     1        1     1   1   1         1    1
Table 1.3: Document-Term Matrix (DTM)
This example uses single-word features. Features can also be combinations of two or more words,
called bigram or trigram models; the general approach is called the n-gram model. In this project we
generate the document-term matrix using scikit-learn's CountVectorizer.

Figure 1.6: Screenshot of the document-term matrix generated with scikit-learn's CountVectorizer
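The code behind Figure 1.6 is likely along these lines (a sketch; the CountVectorizer settings shown
are assumptions rather than the project's exact parameters):

# Sketch: build the document-term matrix from the Phrase column with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(lowercase=True, stop_words='english')   # assumed settings
text_counts = cv.fit_transform(data['Phrase'])               # sparse document-term matrix
print(text_counts.shape)   # (number of phrases, vocabulary size)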

1.3.9. Split train and test set


To understand model performance, it is a good strategy to divide the dataset into a training set and a
test set.
Let's split the dataset using the function train_test_split(). We need to pass three parameters: the
features, the target, and the test set size. Additionally, we can pass random_state to make the random
split reproducible.
Figure 1.7: Splitting the dataset into a training set and a test set
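Figure 1.7 corresponds to code of roughly this form (the test size and random_state values are
illustrative assumptions):

# Sketch: split the features and labels into training and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    text_counts, data['Sentiment'],   # features and target from the steps above
    test_size=0.3,                    # hold out 30% of the data for testing (illustrative)
    random_state=1)                   # fixed seed so the split is reproducible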

1.3.10. Model Building and Evaluation


Let's build the text classification model on the bag-of-words features. First, import the MultinomialNB
module and create a multinomial Naive Bayes classifier object using the MultinomialNB() function.
Then fit the model on the training set using fit() and perform prediction on the test set using predict().

Figure 1.8: Text classification using the CountVectorizer (BoW) features
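A sketch of the corresponding code (the accuracy figure reported below comes from the project's own
run; this shows only the general pattern):

# Sketch: train a Multinomial Naive Bayes classifier on the bag-of-words features
# and measure its accuracy on the held-out test set.
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB()             # create the classifier
clf.fit(X_train, y_train)         # fit it on the training set
predicted = clf.predict(X_test)   # predict labels for the test set
print("MultinomialNB accuracy:", metrics.accuracy_score(y_test, predicted))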


We obtained a classification accuracy of 60.49% using the CountVectorizer (BoW) features, which is
not considered good accuracy and needs to be improved.

1.2.4. Conclusion
Text classification is an important application area in text mining because manually classifying
millions of text documents is an expensive and time-consuming task. Therefore, an automatic text
classifier is constructed from pre-classified sample documents; its accuracy and time efficiency are
much better than those of manual text classification. The less noisy the input to the classifier, the
better the results obtained. In our study, we applied different text classification models, namely the
vector space model (VSM) and the Naive Bayes (NB) classifier, using a dataset from GitHub.
3. Future Work
In the future we recommend using better algorithms and comparing the efficiency of the algorithms
above, as well as comparing supervised text classification algorithms with semi-supervised and
unsupervised algorithms, in order to find the most efficient one. In future work (our thesis) we would
like to carry out a text classification project with local companies and a local dataset in a local
language. We would also like to use deep learning to improve the accuracy.