
Vietnamese Text Classification and Sentiment Based on Convolutional Neural Network


1st Huyen Pham Thi
National Laboratory for Securing Information
Ha Noi, Viet Nam
phamhuyenmta87@gmail.com or ORCID 0000-0001-7222-5071

2nd Anh Luu Duc
National Laboratory for Securing Information
Ha Noi, Viet Nam
luuducanh05051995@gmail.com

3rd Tuan Nguyen The
National Laboratory for Securing Information
Ha Noi, Viet Nam
nguyenthetuan2011@gmail.com

Abstract— Text classification and sentiment analysis are among the main applications of Natural Language Processing (NLP) and an attractive area for researchers. Most practical text classification and sentiment analysis systems are implemented for English text. For Vietnamese, several works based on machine learning models have been introduced. Recently, however, Convolutional Neural Networks (CNNs) have achieved strong performance on practical NLP problems. In this paper, we propose an implementation of Vietnamese text classification and sentiment analysis based on a CNN model. We built Vietnamese text datasets for training and testing. Moreover, a word embedding model created by Google is applied in this work. The results show that the CNN model can be applied to both Vietnamese text classification and sentiment analysis. Moreover, text classification based on the CNN model obtains 94.45% accuracy, which is higher than that obtained with a neural network.

I. INTRODUCTION
Natural Language Processing (NLP) is a computational technique for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing to an era in which millions of webpages can be processed in less than a second [1, 2]. NLP enables computers to perform a wide range of natural-language-related tasks at all levels, ranging from parsing and part-of-speech (POS) tagging to machine translation and dialogue systems.

Deep learning architectures and algorithms have already made impressive advances in fields such as computer vision, pattern recognition and NLP. For decades, machine learning approaches targeting NLP problems have been based on shallow models (e.g., SVM and logistic regression) trained on very high dimensional and sparse features. In the last few years, neural networks based on dense vector representations have been producing superior results on various NLP tasks. This trend was sparked by the success of word embeddings [3, 4] and deep learning methods [5]. Deep learning enables multi-level automatic feature representation learning. In contrast, traditional machine learning based NLP systems rely heavily on hand-crafted features.

Recently, CNNs have achieved high performance on several NLP problems such as text classification and sentiment analysis. CNNs are used to classify text or analyze sentiment at multiple levels, from sentence to paragraph [2]. However, research on NLP problems for Vietnamese remains limited. Several works have applied machine learning algorithms to these problems. In this paper, we apply a deep learning algorithm, the CNN, and investigate its performance for Vietnamese language processing. The rest of the paper is organized as follows: Section II reviews related work on text classification and sentiment analysis and on Vietnamese corpora. Section III presents the proposed Vietnamese text classification and sentiment analysis based on CNN. Section IV gives the implementation results and comparison. Section V concludes the paper.

II. RELATED WORK
A. Text classification and sentiment
Text classification is used to automatically assign documents to one or more predefined categories or classes. In recent years, sentiment analysis has become a hot topic of scientific and market research in the fields of NLP and machine learning. The most common use of sentiment analysis is to classify a text into a class. Depending on the dataset and the purpose, sentiment classification can be a binary (positive or negative), 3-class (positive, neutral, or negative) or multi-class problem.

[Figure 1. Training: .txt files and labels, pre-processing, feature extraction, machine learning/deep learning. Testing: .txt file, pre-processing, feature extraction, trained model, label.]
Fig. 1. Diagram for text classification/text sentiment.

Fig. 1 shows the process of text classification/text sentiment analysis. This process includes a training phase and a testing phase. The training phase generates the model that is used to predict labels for new data.

Nowadays, machine learning methods such as Naive Bayes (NB) [6] and Support Vector Machine (SVM) [7] are widely used for classification in a variety of domains. In machine learning, we can treat this problem as a multiclass classification problem. Basically, automatic text classification methods use a predefined corpus for training. We extract features for each of the text categories in the corpus. Then we apply a mathematical model, a classifier, which estimates the similarity between texts based on their features and predicts the category of a new text. Recently, deep learning methods based on neural networks outperform
machine learning methods in terms of performance. Thus, deep learning methods based on neural networks (NN), CNNs, and recurrent neural networks (RNN) have been applied to solve NLP problems. In [8-10], text classification and text sentiment analysis using CNNs have been introduced from the sentence level to the paragraph level.

Recently, word embeddings have been exploited for sentence classification using CNN architectures. Kalchbrenner et al. [11] proposed a CNN architecture with multiple convolution layers, positing latent, dense and low-dimensional word vectors (initialized to random values) as inputs. Kim [10] defined a one-layer CNN architecture that performed comparably. This model uses pre-trained word vectors as inputs, which may be treated as static or non-static. In the former approach, word vectors are treated as fixed inputs, while in the latter they are 'tuned' for a specific task. Their focus was on classification of longer texts, rather than sentences (but of course the model can be used for sentence classification).

B. Vietnamese corpus
For English, text classification has achieved satisfactory results using standard corpora such as Reuters and 20 Newsgroups, with accuracy ranging from 80% to 93% [12]. However, the Vietnamese datasets are very restricted and small. Each topic includes only 50 to 100 news article files. Also, they are not publicly available for independent research [13]. In this work, we built Vietnamese datasets covering 4 topics (Bien Dong, Dat Dai, Lanh Dao and Bieu Tinh), each with 100 text files, to train the Vietnamese text classifier. For sentiment analysis, we built our own dataset of negative and positive texts crawled from social networks such as Facebook, mind, and Twitter. Each category has 100 text files. Before training, all texts are cleaned by removing stopwords and normalizing the text.

III. PROPOSED TEXT CLASSIFICATION AND SENTIMENT BASED ON CONVOLUTIONAL NEURAL NETWORK
A. Word embedding
A word embedding is a learned representation for text where words that have similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text. The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process using document statistics.

An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification. It requires that document text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.

In this work, we apply a pre-trained Global word embedding with 100 dimensions that is freely available.

B. CNN architecture
In this work, we propose to use a CNN architecture to build the predictive model. Since the CNN works with 3-dimensional data, the dataset is prepared so that it has a 3-dimensional shape.

After cleaning all data, each word in the text is represented by a word vector using the pre-trained word embedding from Google. As a result, each document is represented by a 2-dimensional matrix where the first dimension indexes the words in the text and the second dimension is the word2vec vector of each word. Finally, the dataset is pre-processed into 3-dimensional data, where the third dimension is the number of texts in the dataset. Moreover, padding is used to bring all texts in the dataset to the same length.

The CNN architecture designed for text classification includes 1 embedding layer, 2 Conv1D layers, 2 MaxPooling1D layers, and 2 Dense layers with softmax activation function. The parameters for training are loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'], epochs = 25.

For sentiment analysis, the CNN architecture contains 1 embedding layer, 1 Conv1D layer, 1 MaxPooling1D layer, and 2 Dense layers with sigmoid activation function. The parameters for training are loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'], epochs = 25.

IV. RESULT ANALYSIS
Table 1. Accuracy comparison result
                     NN        CNN
4 topic dataset      91.68%    94.45%
Sentiment dataset    -         90.14%

[Figure 2. Flow diagram with the elements: Website, Collected data, MySQL Database, Get data (unclassified), Pre-processing, Trained model, Labels, Update, Get data (classified), Check manually, Training data-v1, Training model, Trained model_v1, Trained model_v2.]
Fig. 2. Applying the trained model to process new data

To evaluate the performance of the proposed model, we used our own datasets. Table 1 shows the training results for text classification and text sentiment analysis. On the 4 topic dataset, the CNN model reaches 94.45% accuracy, compared to 91.68% for the neural network model. Furthermore, to take advantage of the CNN model, we also train it for sentiment analysis and achieve 90.14% accuracy.

Fig. 2 shows how the trained model is applied to process new data crawled from websites. The output label of the
processed data is fed back to the database for display in the web interface.

V. CONCLUSION
In this paper, a process for Vietnamese text classification and text sentiment analysis based on a CNN model has been proposed. The implementation results show that the CNN model and the word embedding from Google can be used to solve both text classification and text sentiment analysis. Using the CNN model for text classification provides a higher accuracy than the neural network.

ACKNOWLEDGMENT
This work was supported by the National Laboratory for Securing Information.

REFERENCES
[1] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent Trends in Deep Learning based Natural Language Processing”, arXiv:1708.02709v6, Aug. 2018.
[2] E. Cambria and B. White, “Jumping NLP curves: A review of natural language processing research,” IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014.
[3] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, 2010, p. 3.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
[5] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts et al., “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the conference on empirical methods in natural language processing (EMNLP), 2013, pp. 1631–1642.
[6] H. Shimodaira, “Text classification using naive Bayes,” Learning and Data Note 7 (2014): 1-9.
[7] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. of ECML-98, 10th European Conference on Machine Learning, No. 1398, pp. 137–142.
[8] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, 2015.
[9] Y. Zhang and B. C. Wallace, “A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification”, arXiv:1510.03820v4, Apr. 2016.
[10] Y. Kim, “Convolutional Neural Networks for Sentence Classification”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, 25-29 Oct. 2014, Doha, Qatar.
[11] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A Convolutional neural network for modelling sentences”, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655–665, Baltimore, Maryland, June 2014.
[12] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.
[13] H. Nguyen, H. Nguyen, T. Vu, N. Tran, and K. Hoang, “Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese”, Proceedings of 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future, 2006.
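
As a supplementary illustration of the data preparation described in Section III-B (word vectors per token, padding to a common length, and stacking the texts into 3-dimensional data), the following is a minimal sketch and not the implementation used in this work. Everything named in it is an assumption made for illustration: the toy embedding table `TOY_EMBEDDINGS` stands in for the pre-trained embedding, the vectors have 4 dimensions instead of 100, tokens are split on whitespace, unknown words and padding positions are filled with zero vectors, and the array is laid out as (texts, words, embedding dimension).

```python
from typing import Dict, List

EMBED_DIM = 4  # illustrative only; the paper uses 100-dimensional vectors

# Hypothetical toy embedding table standing in for a pre-trained model.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "bien": [0.1, 0.2, 0.3, 0.4],
    "dong": [0.5, 0.1, 0.0, 0.2],
    "dat": [0.3, 0.3, 0.1, 0.0],
    "dai": [0.0, 0.4, 0.2, 0.1],
}
ZERO_VECTOR = [0.0] * EMBED_DIM  # used for padding and for unknown words


def text_to_matrix(text: str, max_len: int) -> List[List[float]]:
    """Map one text to a (max_len x EMBED_DIM) matrix, truncating or padding."""
    # Whitespace tokenization is an assumption; keep at most max_len tokens.
    tokens = text.lower().split()[:max_len]
    rows = [list(TOY_EMBEDDINGS.get(tok, ZERO_VECTOR)) for tok in tokens]
    # Pad short texts with zero vectors so every matrix has max_len rows.
    rows += [list(ZERO_VECTOR) for _ in range(max_len - len(rows))]
    return rows


def build_dataset(texts: List[str], max_len: int) -> List[List[List[float]]]:
    """Stack per-text matrices into a (num_texts x max_len x EMBED_DIM) array."""
    return [text_to_matrix(t, max_len) for t in texts]


if __name__ == "__main__":
    data = build_dataset(["Bien Dong", "dat dai bien dong xyz"], max_len=3)
    # Print the three dimensions: number of texts, words per text, vector size.
    print(len(data), len(data[0]), len(data[0][0]))
```

In this sketch the first text is padded from two tokens to three rows, while the second is truncated from five tokens to three, so both end up with the same shape before being passed to a convolutional model.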
