Abstract—Text classification and sentiment analysis are among the main applications of Natural Language Processing (NLP) and an attractive area for researchers. Most practical text classification and sentiment systems are implemented for English text. For Vietnamese, several works have been introduced using machine learning models. However, Convolutional Neural Networks (CNNs) have recently achieved strong performance on practical NLP problems. In this paper, we propose an implementation of Vietnamese text classification and sentiment analysis based on a CNN model. We built Vietnamese text datasets for training and testing. Moreover, a word embedding model created by Google is applied in this work. The results show that the CNN model can be applied to both Vietnamese text classification and sentiment analysis. Moreover, text classification based on the CNN model obtains 94.45% accuracy, which is higher than using a neural network.

I. INTRODUCTION

Natural Language Processing (NLP) is a computational technique for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing to the era in which millions of webpages can be processed in less than a second [1, 2]. NLP enables computers to perform a wide range of natural language related tasks at all levels, ranging from parsing and part-of-speech (POS) tagging to machine translation and dialogue systems.

Deep learning architectures and algorithms have already made impressive advances in fields such as computer vision.

Fig. 1. Text processing pipeline: in the training phase, .txt files are pre-processed, features are extracted, and a machine learning/deep learning model is fit with training labels; in the prediction phase, new .txt files are pre-processed, features are extracted, and the trained model outputs a label.

… Vietnamese language processing. The rest of the paper is organized as follows: Section II reviews related work on text classification, sentiment analysis, and the Vietnamese corpus. Section III presents the proposed Vietnamese text classification and sentiment analysis based on CNN. In Section IV, implementation results and a comparison are given. The conclusion is given in Section V.

II. RELATED WORK

A. Text classification and sentiment

Text classification is used to automatically assign documents to one or more predefined categories or classes. In recent years, sentiment analysis has become a trending topic of scientific and market research in the fields of Natural Language Processing (NLP) and machine learning. The most common use of sentiment analysis is to classify a text into a class. Depending on the dataset and the goal, sentiment classification can be a binary (positive or negative), 3-class (positive, neutral, or negative), or multi-class problem.
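The class granularities above can be illustrated with a small sketch. The thresholds and function names here are hypothetical illustrations, not the paper's method: a real-valued polarity score is mapped to binary or 3-class labels.

```python
def to_binary(score):
    # Hypothetical rule: non-negative scores count as positive.
    return "positive" if score >= 0.0 else "negative"

def to_three_class(score, neutral_band=0.25):
    # Hypothetical rule: scores close to zero are treated as neutral.
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

print(to_binary(0.7), to_three_class(0.1))  # positive neutral
```

A multi-class formulation would simply enlarge the label set in the same way.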
III. PROPOSED TEXT CLASSIFICATION AND SENTIMENT BASED ON CONVOLUTIONAL NEURAL NETWORK

A. Word embedding

A word embedding is a learned representation for text in which words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text. The learning process is either joint with a neural network model on some task, such as document classification, or an unsupervised process using document statistics.
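As a concrete illustration of such a lookup, the sketch below builds a small, randomly initialized embedding matrix (the toy Vietnamese vocabulary and the dimension of 8 are illustrative assumptions, not the paper's configuration) and shows that multiplying a one-hot vector by the matrix is the same as selecting a row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fixed-size vocabulary; real systems use tens of thousands of words.
vocab = {"tin": 0, "tức": 1, "thể": 2, "thao": 3}
embedding_dim = 8  # in practice often 50, 100, or 300

# Vectors start as small random numbers and would be refined during training.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

word_index = vocab["thể"]

# One-hot encoding of the word ...
one_hot = np.zeros(len(vocab))
one_hot[word_index] = 1.0

# ... so the matrix product reduces to a row lookup.
via_one_hot = one_hot @ embedding_matrix
via_lookup = embedding_matrix[word_index]

assert np.allclose(via_one_hot, via_lookup)
print(via_lookup.shape)  # (8,)
```

This equivalence is why embedding layers are implemented as table lookups rather than explicit one-hot matrix multiplications.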
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification. It requires that the document text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network.

IV. IMPLEMENTATION RESULTS AND COMPARISON

TABLE I. TRAINING RESULTS FOR TEXT CLASSIFICATION AND SENTIMENT

Dataset            Neural network   CNN
4 topic dataset    91.68%           94.45%
Sentiment dataset  -                90.14%

To evaluate the performance of the proposed model, we used our own datasets. Table I shows the training results for text classification and text sentiment. It is clear that the CNN model provides higher accuracy than the neural network model. Furthermore, to take advantage of the CNN model, we perform training for sentiment analysis and achieve 90.14% accuracy.

Fig. 2. Applying the trained model to process new data (unclassified data crawled from websites is pre-processed and labeled by the trained model, with the results updated in a MySQL database; manually checked classified data is used to retrain successive model versions, trained model_v1, model_v2, …).

Fig. 2 shows how the trained model is applied to process new data crawled from websites. The output label of the processed data is fed back to the database for further display in the web interface.
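The deployment loop of Fig. 2 can be sketched roughly as follows. The crawler input, the keyword classifier, and the in-memory dictionary standing in for the database are all hypothetical simplifications (the paper uses a trained CNN and MySQL); the sketch only shows the shape of the data flow.

```python
# Minimal sketch of the Fig. 2 loop: fetch -> pre-process -> classify -> store.

def preprocess(text):
    # Lowercase and split on whitespace; real pre-processing would also
    # handle Vietnamese word segmentation.
    return text.lower().split()

def classify(tokens):
    # Hypothetical keyword stub standing in for the trained CNN's prediction.
    sports_words = {"bóng", "đá", "thể", "thao"}
    return "sports" if sports_words & set(tokens) else "other"

def process_new_documents(docs, database):
    # Each labeled document is written back to the database for display.
    for doc_id, text in docs.items():
        database[doc_id] = classify(preprocess(text))
    return database

db = {}
crawled = {1: "Bóng đá Việt Nam", 2: "Giá vàng hôm nay"}
print(process_new_documents(crawled, db))  # {1: 'sports', 2: 'other'}
```

In the deployed system, the stored labels would in turn be reviewed and folded into the next training set, producing the successive model versions shown in the figure.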
V. CONCLUSION

In this paper, a process for text classification and text sentiment analysis for Vietnamese based on a CNN model has been proposed. The implementation results show that the CNN model and the word embedding from Google can be used to solve both text classification and text sentiment analysis. Using the CNN model for text classification provides higher accuracy than the neural network.

ACKNOWLEDGMENT

The proposed work has been supported by the National Laboratory for Securing Information.

REFERENCES

[1] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing," arXiv:1708.02709v6, Aug. 2018.
[2] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Computational Intelligence Magazine, vol. 9, no. 2, pp. 48–57, 2014.
[3] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, 2010, p. 3.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[5] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 1631, 2013, p. 1642.
[6] H. Shimodaira, "Text classification using naive Bayes," Learning and Data Note 7, 2014, pp. 1–9.
[7] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proc. of ECML-98, 10th European Conference on Machine Learning, no. 1398, 1998, pp. 137–142.
[8] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015.
[9] Y. Zhang and B. C. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," arXiv:1510.03820v4, Apr. 2016.
[10] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1746–1751.
[11] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, Jun. 2014, pp. 655–665.
[12] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.
[13] H. Nguyen, H. Nguyen, T. Vu, N. Tran, and K. Hoang, "Internet and genetics-algorithm-based text categorization for documents in Vietnamese," in Proceedings of the 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future, 2006.