Proceedings of Recent Advances in Natural Language Processing, pages 1380–1387,
Varna, Bulgaria, Sep 2–4, 2019.
https://doi.org/10.26615/978-954-452-056-4_158
on short-text classification, especially for Turkish. In Section 3, the data set creation and the steps of tweet preprocessing are explained in detail. Section 4 describes the proposed Transformer Encoder model and its usage. The experimental results and error analysis are presented in Section 5. Finally, Section 6 includes the conclusion and future work.

2 Related Work

Traditional text classification methods based on the BoW (Bag of Words) model (Harris, 1954) suffer from high dimensionality and sparse feature sets, particularly in short texts. Other limitations of BoW-based models are that semantic features are not captured, the positions of the words in the text are not considered, and the words are assumed to be independent of each other.

In order to overcome the weaknesses of BoW, Latent Dirichlet Allocation (LDA) (Blei et al., 2003), where the terms are mapped to distributional representations based on latent topics within documents, has been used for building classifiers that deal with short and sparse text (Phan et al., 2008; Song et al., 2014).

One of the most commonly used algorithms for short text classification is Naive Bayes, which is based on word occurrences and class priors. For example, Kiritchenko & Matwin (2011) used this algorithm on an email dataset, and Kim et al. (2006) offered powerful techniques for improving text classification with Naive Bayes. Sriram et al. (2010) extracted different features from short texts and used these with Naive Bayes to classify them.

Support Vector Machine (SVM) is another commonly used algorithm in short text classification studies. Pointing to the weaknesses of the BoW approach, different kernels have been developed for SVM, such as semantic kernels that use TF-IDF (Salton & Buckley, 1988) and its variants that apply different term weighting functions to the term incidence matrix. Wang & Manning (2012) showed that Naive Bayes obtains higher scores than SVM for short-text sentiment classification tasks and that the combination of SVM and Naive Bayes outperforms either model alone on some of the datasets they used. More relevant to our study, Lee et al. (2011) used SVM for text-based classification of trending topics under 18 classes and reached 61.76% accuracy. They obtained 65.36% accuracy with Multinomial Naive Bayes.

The advances of deep learning in NLP have led to the use of Artificial Neural Network (ANN) based models in recent studies. Convolutional Neural Networks (CNNs) and Dynamic CNNs have been applied to text classification, and promising results have been obtained (Kim, 2014; Kalchbrenner et al., 2014). In a recent study, Le & Mikolov (2014) successfully added sequential information by using Recurrent and Convolutional Neural Networks for sequential short-text classification.

In this study, we address the task of Turkish short text classification. Turkish is an agglutinative language, which may cause the same word to map to different features when it takes different inflectional affixes or suffixes. Possessive pronouns, tenses, and auxiliary verbs can be encoded as affixes of words. For this reason, a word may have many different forms, and this poses challenges for classical term weighting methods to construct strong relations between a text and its topic. Most prior work on Turkish short text classification is on sentiment analysis. Demirci (2014) used Naive Bayes, SVM, and k Nearest Neighbor (kNN) for emotion analysis on Turkish tweets and obtained accuracy levels of up to 69.9%. Similarly, Yelmen et al. (2018) showed that SVM reaches 80% accuracy for sentiment classification of Turkish tweets about GSM operators, whereas an Artificial Neural Network (ANN) results in slightly lower performance. Türkmenoğlu & Tantuğ (2014) compared lexicon-based and machine learning based sentiment analysis on tweet and movie review datasets and concluded that the machine learning based algorithms SVM, Naive Bayes, and decision trees achieve better scores. Yıldırım et al. (2014) combined lexicon and machine learning based methods to improve sentiment analysis of Turkish tweets.

Unlike prior studies on Turkish short text classification, which address the task of sentiment analysis, we tackle the task of topic-based tweet classification and create a Turkish tweet data set for nine different topics. We propose using the Transformer Encoder architecture for Turkish short-text classification. The Transformer Encoder is a recently proposed model that offers a better understanding of language structure by preserving the semantic values and meanings of word sequences (Vaswani et al., 2017). Furthermore, the positions of the
terms in the text are taken into account and the terms are not assumed to be independent of each other, by using a contextual word embedding representation.

3 Dataset

In this section, we briefly explain how we created the dataset and the main steps of data preprocessing, as well as the importance of lemmatization for Turkish short-text classification.

3.1 Dataset Creation

The dataset includes 164,549 Turkish tweets, written by 74 different users until February 2019. These tweets were retrieved from the profiles of users who are known experts in their areas and mostly write tweets on the subjects of their expertise. The tweets of each user are automatically labelled with the topic of expertise of the user. The dataset contains nine topics, namely politics, economics & investment, health, technology & informatics, history, literature & film, sports, education & personal growth, and magazine. The selection of topics was made in a similar way to how news sites categorise their content.

We observed that some of the tweets of a user could be mislabeled, since a user may also tweet about topics different from his/her area of expertise, which results in noise. In order to ensure that most of the tweets are related to their assigned topics, we selected a random sample of tweets and manually checked the percentage of correctly labelled ones. The percentage of correctly labelled tweets was around 80%. That is, the training dataset contains around 20% noise (i.e., incorrectly labeled tweets). We randomly selected 10% of the dataset as a test set. In order to report reliable results, we manually verified and corrected the labels of the test set.

3.2 Data Preprocessing

Preprocessing Turkish sentences is quite different from preprocessing English ones, since Turkish is an agglutinative language and we may encounter words in many different forms. In addition, tweets come with their own difficulties when they are used in natural language processing. They contain hashtags, mentions, emojis, and links, which make tokenization difficult. To overcome these challenges, we follow three main preprocessing steps, as described in the following subsections.

3.2.1 Data Cleaning

Users use ad hoc patterns in tweets, such as hashtags, mentions, links to websites they want to refer to, and emojis. In our work, we are only interested in the lexical terms in tweets. Hence, we drop the hashtags, mentions, links, emojis, numbers, and punctuation. On the other hand, hashtags can be quite useful when deciding the topic of a tweet. There are studies on splitting hashtags into meaningful words in English (Çelebi & Özgür, 2018). However, this is left as open future work for Turkish hashtags.

Some tweets contain only a response to another tweet, such as "greetings", "thank you", "okay", or "congratulations" for good news. When we examine the tweets that have fewer than 4 words, nearly all of them are examples of such tweets. These tweets are removed from the dataset, since they are not about any specific topic.

In addition, when we examined the tweets whose retweet and like statistics are significantly different from the others, we discovered that those tweets are mostly retweets of campaigns or calls for help. For this reason, we filtered this type of tweet from the dataset.

3.2.2 Language Identification

On Twitter, users sometimes write and retweet tweets in languages different from their native language. When we analysed our dataset, we observed that the languages used by the users are Turkish, English, Arabic, French, and German. For filtering out non-Turkish tweets, we used the language identification model of Lui & Baldwin (2012).

After the dataset cleaning and language-based filtering steps, the training set contains 119,778 tweets and the test set contains 3050 tweets.

3.2.3 Lemmatization

The goal of lemmatization is to find the lemmas of words by removing the prefixes, suffixes, and other types of affixes based on morphological analysis. This process is harder for the Turkish language, since it has more types of affixes than English.

In the Naive Bayes model, the frequency of a word is prominent for scoring the importance of the word for each class. Therefore, finding the correct lemmas of the words is a useful transformation for classifying tweets correctly. Lemmatization is also important for the SVM classifier, in order to compute the similarity between the instances in
the training set, as well as the similarity between the support vectors and the test instances in the classification step, more accurately. Yıldırım et al. (2014) showed the positive impact of morphological analysis on the sentiment analysis of Turkish tweets. Therefore, in order to increase the success of the Naive Bayes and SVM models, we use a lemmatization model (Sak et al., 2008) trained on nearly one million Turkish sentences. An example tweet and its lemmatized version are shown in Figure 1. The effects of lemmatization on the success of Turkish tweet classification are presented in Section 5.

Moreover, in order to account for the word order in text, positional embeddings (Vaswani et al., 2017) are used and combined with the word embeddings. The input embeddings are fed into the Transformer Encoder by matching each word in the input tweet with the longest token in the vocabulary of the pretrained contextual word embeddings. An example from the dataset is shown in Figure 2.
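The input-embedding step described above can be sketched in a few lines. The following is only an illustrative sketch under our own assumptions — the helper names, the toy vocabulary, and the greedy longest-prefix lookup are ours, not the authors' code — combining word vectors with the sinusoidal positional encodings of Vaswani et al. (2017):

```python
import math

def longest_token_match(word, vocab):
    """Return the longest prefix of `word` found in `vocab`
    (a hypothetical stand-in for lookup in a pretrained vocabulary)."""
    for end in range(len(word), 0, -1):
        if word[:end] in vocab:
            return word[:end]
    return "[UNK]"  # assumed out-of-vocabulary marker

def positional_encoding(position, d_model):
    """Sinusoidal positional embedding (Vaswani et al., 2017):
    sin for even dimensions, cos for odd dimensions."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def embed_tweet(words, vocab, word_emb, d_model):
    """Map each word to the longest matching vocabulary token,
    then add its word embedding and positional embedding elementwise,
    producing the input sequence for a Transformer Encoder."""
    out = []
    for pos, w in enumerate(words):
        tok = longest_token_match(w, vocab)
        vec = word_emb.get(tok, [0.0] * d_model)
        pe = positional_encoding(pos, d_model)
        out.append([v + p for v, p in zip(vec, pe)])
    return out
```

In the actual model, the vocabulary and the vectors would come from the pretrained contextual word embeddings rather than a hand-built dictionary.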
Model                                    Accuracy  Precision  Recall  F1-Score
Random Forest (Baseline)                     65.0       64.9    64.5      64.5
Random Forest + Lemmatization                71.1       70.9    70.9      70.7
Random Forest + Lemmatization + TF-IDF       69.9       69.8    69.6      69.6
Naive Bayes                                  82.7       82.4    82.4      82.6
Naive Bayes + Lemmatization                  85.0       84.8    84.7      84.9
Naive Bayes + Lemmatization + TF-IDF         85.4       85.3    85.4      85.2
SVM                                          78.6       78.2    78.3      78.1
SVM + Lemmatization                          80.3       80.3    80.2      80.1
SVM + Lemmatization + TF-IDF                 84.4       84.2    84.3      84.2
Transformer Encoder + Input Embeddings       89.4       89.3    89.4      89.3

Table 1: Comparison of the models' scores over the manually labeled test set.
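To make the Naive Bayes baseline of Table 1 concrete, the core of a multinomial Naive Bayes classifier over (lemmatized) tokens can be written with the standard library alone. This is a minimal sketch of the general technique, not the authors' implementation; the function names and the toy labels below are our own:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs.
    Returns class priors, per-class token counts, and the vocabulary."""
    priors = Counter(label for _, label in docs)
    counts = defaultdict(Counter)  # label -> token frequency counts
    vocab = set()
    for tokens, label in docs:
        counts[label].update(tokens)
        vocab.update(tokens)
    return priors, counts, vocab

def predict_nb(tokens, priors, counts, vocab):
    """Score each class by log prior plus Laplace-smoothed
    log likelihoods of the tokens; return the best class."""
    n_docs = sum(priors.values())
    best, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(counts[label].values())
        score = math.log(prior / n_docs)
        for t in tokens:
            score += math.log((counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

The lemmatization and TF-IDF variants in Table 1 change only what is counted (lemmas instead of surface forms) and how tokens are weighted, not this basic scoring scheme.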
tecture is more effective than using morphological analysis and TF-IDF based term weighting with the traditional machine learning algorithms Random Forest, Naive Bayes, and SVM for Turkish short text classification.

label is literature & film. As another misclassified example, the topic of the tweet "The Lausanne Treaty debate should be discussed in public." is predicted as politics, whereas this tweet is in the scope of recent history.
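The precision, recall, and F1 columns of Table 1 are macro-averaged over the nine topic classes. For reference, a minimal stdlib sketch of that computation (our illustration, not the paper's evaluation code):

```python
def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over the label set.
    gold, pred: equal-length lists of class labels."""
    labels = set(gold) | set(pred)
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Macro averaging weights every class equally, so the scores are not dominated by the most frequent topics.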
Conference on Computer and Information Science (ICIS), (pp. 461–466).

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Çelebi, A. & Özgür, A. (2018). Segmenting hashtags and analyzing their grammatical structure. Journal of the Association for Information Science and Technology, 69.

Demirci, S. (2014). Emotion analysis on Turkish tweets. Master's thesis, Middle East Technical University.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (pp. 4171–4186), Minneapolis, Minnesota. Association for Computational Linguistics.

Harris, Z. S. (1954). Distributional structure. WORD, 10(2-3), 146–162.

Kadhim, A. I. (2019). Survey on supervised machine learning techniques for automatic text classification. Artificial Intelligence Review, 52(1), 273–292.

Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (pp. 655–665), Baltimore, Maryland. Association for Computational Linguistics.

Kim, S. B., Han, K. S., Rim, H. C., & Myaeng, S. H. (2006). Some effective techniques for Naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11), 1457–1466.

Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (pp. 1746–1751), Doha, Qatar. Association for Computational Linguistics.

Kiritchenko, S. & Matwin, S. (2011). Email classification with co-training. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research, CASCON '11, (pp. 301–312). IBM Corp.

Le, Q. & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, (pp. II-1188–II-1196). JMLR.org.

Lee, K., Palsetia, D., Narayanan, R., Patwary, M., Agrawal, A., & Choudhary, A. (2011). Twitter trending topic classification. In 2011 11th IEEE International Conference on Data Mining Workshops.

Logeswaran, L. & Lee, H. (2018). An efficient framework for learning sentence representations. CoRR, abs/1803.02893.

Lui, M. & Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, (pp. 25–30), Stroudsburg, PA, USA. Association for Computational Linguistics.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, (pp. 3111–3119).

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL.

Phan, X., Nguyen, L., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, (pp. 91–100), New York, NY, USA. ACM.

Riquelme, F. & González-Cantergiani, P. (2016). Measuring user influence on Twitter: A survey. Information Processing & Management, 52(5), 949–975.

Sak, H., Güngör, T., & Saraçlar, M. (2008). Turkish language resources: Morphological parser, morphological disambiguator and web corpus. In Nordström, B. & Ranta, A. (Eds.), Advances in Natural Language Processing, (pp. 417–427), Berlin, Heidelberg. Springer Berlin Heidelberg.

Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.

Selvaperumal, P. & Suruliandi, A. (2014). A short message classification algorithm for tweet classification. In 2014 International Conference on Recent Trends in Information Technology, (pp. 1–3).

Song, G., Yunming, Y., Xiaolin, D., Huang, X., & Shifu, B. (2014). Short text classification: A survey. Journal of Multimedia, 9(5), 635.

Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., & Demirbas, M. (2010). Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
Taksa, I., Zelikovitz, S., & Spink, A. (2007). Using web search logs to identify query classification terms. International Journal of Web Information Systems, 3, 315–327.

Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Bulletin.

Türkmenoğlu, C. & Tantuğ, A. C. (2014). Sentiment analysis in Turkish media. In Proceedings of Workshop on Issues of Sentiment Discovery and Opinion Mining, International Conference on Machine Learning (ICML).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, (pp. 5998–6008).

Wang, S. & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL '12, (pp. 90–94), Stroudsburg, PA, USA. Association for Computational Linguistics.

Yelmen, İ., Zontul, M., Kaynar, O., & Sönmez, F. (2018). A novel hybrid approach for sentiment classification of Turkish tweets for GSM operators. International Journal of Circuits, Systems and Signal Processing.

Yıldırım, E., Çetin, F. S., Eryiğit, G., & Temel, T. (2014). The impact of NLP on Turkish sentiment analysis. Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, 7, 43–51.