1. Abstract
2. Introduction
1.2.1. Sentiment Analysis
1.3.1. Documents Collection
1.3.2. Pre-Processing
1.3.3. Indexing
1.3.4. Feature Selection
1.3.5. Text Classification
1.3.6. Performance Evaluations
1.3.7. Technology used
1.3.8. Dataset Sources
1.2.2. Coding
1.2.3. Feature Generation using Bag of Words
1.3.9. Split train and test set
1.3.10. Model Building and Evaluation
1.2.4. Conclusion
3. Future Work
4. References
1. Abstract
In the highly competitive world of business, companies that want to stay competitive
need to know their customers' opinions about the products and services they provide. This
project focuses on sentiment analysis using text classification. Text classification is the
task of automatically sorting a set of documents into categories from a predefined set. This project
uses text classification algorithms (Naïve Bayes and the vector space model, VSM) that classify
documents into categories and are trained on one dataset (sentiment analysis with three
categories).
KEYWORDS: Text Classification, Naive Bayes Classifier, Vector Space Model
2. Introduction
1.2.1. Sentiment Analysis
Thanks to NLP and related fields, businesses today can learn about their customers'
satisfaction and complaints, and about their own strengths, weaknesses, and gaps, in order to
remain strong competitors. Knowing, for example, what went wrong with their latest product, or
what users and the general public think about the latest feature, lets companies maintain
quality and work to close gaps and to eliminate or reduce negative factors.
Quantifying customers' content, ideas, beliefs, and opinions is known as sentiment analysis. Users'
online posts, blogs, tweets, and product feedback help businesses reach their target audience and
innovate in products and services. Sentiment analysis helps in understanding people in a better and
more accurate way. It is not limited to marketing; it can also be used in politics,
research, and security.
Human communication is not limited to words; it is more than words. Sentiment is a
combination of words, tone, and writing style. For a data analyst, it is important to understand
what a sentiment really means and to classify each customer message as Positive, Negative,
or Neutral.
There are two main approaches to sentiment analysis.
Lexicon-based: count the number of positive and negative words in the given text; the larger count
determines the sentiment of the text.
Machine learning based: develop a classification model that is trained on a pre-labelled
dataset of positive, negative, and neutral examples.
In this project we use the second approach (machine learning based).
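For contrast with the approach chosen for this project, the lexicon-based idea can be sketched in a few lines of Python. This is only an illustration: the two word lists, the function name lexicon_sentiment, and the choice to return "neutral" on a tie are all assumptions made for the example, not part of the project.

```python
# Hypothetical mini-lexicons; a real lexicon would be far larger.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate"}


def lexicon_sentiment(text):
    """Label a text by comparing counts of positive and negative words."""
    tokens = text.lower().split()
    positive_count = sum(token in POSITIVE_WORDS for token in tokens)
    negative_count = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive_count > negative_count:
        return "positive"
    if negative_count > positive_count:
        return "negative"
    # Assumption: equal counts (including zero matches) are treated as neutral.
    return "neutral"


print(lexicon_sentiment("great plot but terrible terrible acting"))
```

A sketch like this ignores negation, tone, and context, which is one reason a trained classifier may be preferred.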
1.3.2. Pre-Processing
Pre-processing is the first step; it converts the text documents into a clean word
format. The documents prepared for the next step in text classification are represented by a
large number of features. The steps commonly taken are:
Tokenization: A document is treated as a string and then partitioned into a list of tokens.
Removing stop words: Stop words such as “the”, “a”, and “and” occur frequently but carry little
meaning, so these insignificant words are removed.
Stemming: A stemming algorithm converts different word forms into a similar
canonical form. This step conflates tokens to their root form, e.g. connection to connect,
computing to compute.
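The three steps above can be sketched with the Python standard library alone. The stop-word set and the suffix list below are deliberately tiny and hypothetical, and the suffix stripper is a crude stand-in for a real stemming algorithm such as Porter's, so its output roots (for example "comput") will not always be dictionary words.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

# Hypothetical suffix list; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ion", "ed", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop frequently occurring words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(token) for token in remove_stop_words(tokenize(text))]


print(preprocess("The connection and the computing of reviews"))
# ['connect', 'comput', 'review']
```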
1.3.3. Indexing
Document representation is a pre-processing technique used to reduce the
complexity of documents and make them easier to handle: each document has to be transformed
from its full-text version into a document vector. Perhaps the most commonly used document
representation is the vector space model (SMART), in which documents are
represented by vectors of words. Usually, a collection of documents is represented
by a word-by-document matrix.
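As a minimal sketch of this representation, the snippet below builds a small count matrix by hand, with one row per document and one column per vocabulary word. The three example documents are invented for illustration, and raw term counts are only one possible weighting; other schemes could be used instead.

```python
from collections import Counter

# Hypothetical toy collection, already pre-processed into plain words.
docs = ["good movie", "bad movie", "good good plot"]

tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Each document becomes a vector of term counts over the vocabulary.
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)
```

Here each row is a document vector; transposing the matrix gives the word-by-document layout mentioned above.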
1.2.2. Coding
First, place the dataset in the required location, C:/Users/abiyotad/movies.tvs
The next step is to import the pandas library and load the movies training dataset, using the
following Python code in a Jupyter Notebook.
# Import pandas
import pandas as pd
# Load the tab-separated training set into a DataFrame
data = pd.read_csv('train.tsv', sep='\t')
# Display the first five rows
data.head()
Figure 1.3 shows a screenshot of the sample code used in this project.
The following output is generated:
Figure 1.6: Screenshot of the document-term matrix generated using scikit-learn's
CountVectorizer
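Since the figure itself is not reproduced here, the following sketch suggests how a document-term matrix from CountVectorizer might feed a Naive Bayes classifier. It is not the project's actual code: the four training sentences and their labels are invented, MultinomialNB is assumed as the Naive Bayes variant, and the train/test split is left out because the toy corpus is too small for one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the real training data.
train_texts = [
    "a great and enjoyable movie",
    "great acting and a great story",
    "a boring and terrible movie",
    "terrible acting and a boring story",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag of words: learn the vocabulary and build the document-term matrix.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train a multinomial Naive Bayes model on the word counts.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New text must be transformed with the same fitted vocabulary.
X_new = vectorizer.transform(["great story", "boring movie"])
predictions = model.predict(X_new)

print(X_train.shape)      # (documents, vocabulary size)
print(list(predictions))
```

By default CountVectorizer ignores single-character tokens such as "a", so the vocabulary holds only the longer words.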
1.2.4. Conclusion
Text classification is an important application area in text mining because classifying millions
of text documents manually is expensive and time-consuming. Therefore, an automatic text
classifier is constructed from pre-classified sample documents; its accuracy and time efficiency
are much better than those of manual text classification. The less noisy the input data,
the better the classifier's results. In our study, we applied two text classifier models, namely the
vector space model for text classification (VSM) and the Naive Bayes classifier model (NB), and we
used a GitHub dataset.
3. Future Work
In the future we recommend using better algorithms and comparing their efficiency with that of
the algorithms above, as well as comparing supervised text classification algorithms with
semi-supervised and unsupervised algorithms, in order to find the most efficient algorithm. In
future work (in our thesis) we would like to recommend a text classification project using local
companies and a local dataset in a local language. We would also advise using deep learning to
improve accuracy.
4. References
[1] Karlijn Willems, "Jupyter Notebook Tutorial: The Definitive Guide," November 12, 2019,
https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook