Effects of Topic Relevance on Recommendation Model in Thai Social Media Data

Article in Journal of Convergence Information Technology (JCIT), Volume 8, Number 13 · August 2013


Effects of Topic Relevance on Recommendation Model in Thai Social Media Data

1 Nichakorn Pankong, *2 Somchai Prakancharoen, 3 Marut Buranarach
1 Faculty of Information Technology, King Mongkut's University of Technology North Bangkok,
Bangkok, Thailand, nichakorn.tsu@gmail.com
*2 Faculty of Applied Science, King Mongkut's University of Technology North Bangkok,
Bangkok, Thailand, spk@kmutnb.ac.th
3 National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand,
marut.bur@nectec.or.th

Abstract
The increasing number of online social networks (OSN) and the growing popularity of social
network sites such as Facebook and Twitter are generating an enormous amount of OSN data. As a
result, the complexity of implicit and explicit social networks poses challenges for the analysis of
social network information streams. This paper proposes a topical relevance analysis algorithm for
social media data in the Thai language, which recommends user interest groups based on users' posts
on social media sites. The development comprises five stages: data collection, data pre-processing,
data modeling, data classification and evaluation of the recommendation model. Three classification
algorithms were applied to analyze content in four relevant topic categories: entertainment, smart
phones, financial and sport. The study was conducted using data from social network sites with both
symmetric and asymmetric relationships, i.e. Facebook and Twitter respectively. The predictive model
based on the Support Vector Machine (SVM) showed the best performance of the three. Furthermore,
this study investigates factors influencing the recommendation model. The results showed eight
leading parameters for precision and ten for recall that influence classification performance over
data from both Twitter and Facebook. The experimental results showed promising performance in
recommending user groups based on user content in social networks, although the site with
asymmetric relationships exhibits lower average parameter values overall than the site with
symmetric relationships.

Keywords: Semantic Social Network Analysis, Social Media Data Classification, User Group
Recommendation Model.

1. Introduction
The development of Web 2.0 technologies has caused the number of Online Social Networks
(OSN) to increase dramatically, particularly with the evolution of smart phone technology, which
allows users to use social media anywhere and at any time. This has led to a rise in the number of
people involved in social media networking [1-2]. Social network sites such as Facebook, Twitter and
Google Plus are rapidly growing in popularity. Moreover, traditional industries have adopted OSN as
a new media channel for marketing. OSN applications now include business promotion, customer
service, communication and target marketing.
Recommendation models may be applied to social network information streams by analyzing
user attitudes, interaction, content relevance and activity relationships. For example, the network
of users and the content of timeline posts can be analyzed to discover user interest groups through
an interest group categorization mechanism [2-3][5][15][17]. Recently, Twitter and Facebook have
employed proactive recommendation of user interest groups via a classification mechanism based on
content relevance. OSN data sharing platforms such as the Twitter API and the Facebook API have
been deployed to promote further OSN content analysis. In addition, user activities can be inferred
from implicit relationships on social networks [4][6][15].
An abundance of information can leave users with too little attention to filter information and
social relationships. Put another way, information filtering becomes more challenging because of the
overloaded information stream. Therefore, users need to discover beneficial content in their domains
and relevant user relationships, such as friends or topics matching their interests, on OSN [3][6].
For example, on Twitter, timeline posts about a certain topic can be discovered by keyword search.
However, a Twitter post typically contains fewer words than posts on other OSN, which may result in
a large number of matching posts [18]. Consequently, the analysis of implicit and explicit
relationships on social networks is a valuable line of investigation. Topic and keyword analysis can
serve as a basis for text summarization, topic relevance analysis and filtering [7][16].
In addition, some OSNs have realized the value of controlling information and have restricted the
once open flow of information on the internet. A Facebook profile, for example, requires permission
for information access and communication [19-20]. This has created two types of user relationship in
social networks: symmetric and asymmetric, sometimes called 'friend' and 'follower' relationships
respectively. In an OSN of the asymmetric type, such as Twitter, when a user posts a message on the
timeline, it is displayed on both the author's URL and those of the followers without requiring the
user's permission.
To further investigate these issues, this study proposes a recommendation model built on Twitter
and Facebook data. The model aims to predict relevant topics in order to recommend a user interest
group for a particular social network user. Specifically, it proactively recommends user interest
groups by content relevance, based on the content in the user's timeline on Twitter or Facebook.
This rests on the assumption that information extracted from the user timeline can indicate the user
category by content relevance, which can in turn be used to predict targeted groups of users
interested in the same field across social network sites. In summary, this study investigates:
(a) whether the recommendation model can help to find topic relevance from content in the users'
timelines on Twitter and Facebook;
(b) which factors influence the accuracy of the predicted relevant topics.
To achieve these goals, the study proceeded as follows. For the data sources, a recommendation
algorithm was designed and applied over the data streams collected from the Twitter and Facebook
API platforms. Twitter and Facebook were chosen because both are among the most widely used
platforms and provide large collections of users' activities. Both sites also share a common model
of social network data. In addition, the two sites represent two different types of social networks,
i.e. those with symmetric and asymmetric user relationships.
For data collection, a semantic social network analysis (SSNA) data collection system was
developed, which collected data streams and relationships among the users from both sites. Data
processing for social data analysis has three phases: data pre-processing, data modeling and data
classification, the last based on a supervised learning technique to classify the content categories.
Finally, the study evaluated factors influencing the recommendation performance over the data in
four categories: entertainment, smart phones, financial and sport. The studied factors included
social network site, classification algorithm and content properties such as the frequency and ratio
of unique words, unknown words, and duplicated words within and across posts and categories. The
results can provide guidelines for selecting content based on the factors that yield better
recommendation performance.

2. Related Works

2.1. Semantic social network analysis

A survey by Chelmis and Prasanna [2] reviews the field of social network analysis and provides a
taxonomic categorization of semantic social network analysis and the impact of text analysis. The
user-centered semantic social analysis of Lee et al. [1] represents entities and relationships with
an ontology to enhance the retrieval performance of OSN. The ontology is used to enhance the
browsing method and to compare the closeness between user accounts in a social network. Wang et
al. [5] filtered social data by (1) loading activity data from different social networking sites
through the API of each site and (2) managing the user's profile and friends by blending and
grouping friends. This approach to social data integration and recommendation uses semantic
technology and represents knowledge with an ontology, which usually captures various properties of
the user's activities and user relationships. Pankong et al. [4] propose a semantic approach for
social network analysis to aid the organization of social data from different social networks, i.e.
Twitter and Facebook. Typically, one of the necessary factors to be studied is textual content,
which helps extract information from user activities. Additionally, there is some overlap between
the sets of a user's friends on different social networking sites. In a comparison of text
classification algorithms, the MP algorithm showed higher classification accuracy than the KNN
algorithm [8]. Singh and Joshi [9] used co-occurrences of words and semantic relationships between
vectors of blog posts to classify blog content. An alternative approach applies social network
relationship analysis: text classification uses matrix operations to calculate keyword weights, and
the predictive algorithm applies the vector space model to text classification, with performance
evaluated by accuracy comparison.

2.2. Recommendation System

The recommendation system of Chen et al. [3] recommends interesting content to users on Twitter.
The study implemented a discovery feature that recommends topics of interest by exploring the design
space of information stream recommenders. Chao et al. [11] developed an automatic tag
recommendation scheme based on a text mining approach. It first clustered the dataset and its tags
separately using the self-organizing map algorithm to obtain two feature maps, which reveal the
relationships between URLs and tags. Yang and Lee [12] performed sentiment analysis on social
network media: if a text segment contains emotional or opinionated content, the method determines
its polarity.

3. Research Methodology

3.1. Facebook API & Twitter API

Figure 1. SSNA ER-Diagram showing the data relationship in Facebook and Twitter APIs

Social network analysis (SNA) experiments typically begin with acquiring data from social network
data streams. Therefore, this study designed and developed an SSNA data collection system and
applied it to the Facebook API and the Twitter API, both of which are application platforms for
extracting data streams and relationships among users. The data was collected from a total of 200
different user accounts on Facebook and Twitter. Each user account has an average of 100 friends on
Facebook, and each user URL has 100 followers or followed users on Twitter. On both sites, the
users have an average of 50 posts. The language of the posts is mostly Thai. Moreover, the topic
content is at least 10 characters long on Twitter and at least 30 characters long on Facebook. The
SSNA system architecture was designed and developed on the Facebook and Twitter API platforms.
Specifically, this research integrates the data from both APIs for data collection and semantic
social network analysis. This allows a comparison of effects between the symmetric social
relationship site, i.e. Facebook, and the asymmetric one, i.e. Twitter [4]. Thus, the SSNA ER
diagram is designed to support the collection of data streams and relationships among users, as
shown in Figure 1.

3.2. Data Analysis Approach

As shown in Figure 2, the semantic social analysis approach has five phases: data collecting, data
pre-processing, machine learning, data modeling and evaluation. Firstly, the dataset was collected
by the SSNA data collection system and stored in the SSNA database. Secondly, to prepare the data
for further analysis, a pre-processing technique was applied to extract the Thai text content by
keyword indicator extraction. The keyword extraction steps are as follows. The SSNA dataset was
converted into an XML file. The BEST (Benchmark for Enhancing the Standard of Thai language
processing) corpora (see footnote 1), which provide a large-scale Thai word corpus, were used
together with the TLexs word segmentation system, which segments Thai words using Conditional
Random Fields (CRF) to predict sequences and keywords from the dataset [10].
This study used the TLexs web service to determine unknown words and select keyword indicators
from topics. Thirdly, the experiment aimed to classify the content by its keywords into four content
categories: entertainment, smart phones, financial and sport. To achieve that, the keywords were
weighted and then converted to the ".arff" file format for use in the WEKA software. Subsequently,
the study applied predictive classification models using several machine learning algorithms, namely
SVM, Naïve Bayes and Bayes Net, and selected the algorithm that produced the best classification
results in terms of precision and recall. Fourthly, to use content-based models, the SSNA dataset
was examined for topic relevance in the timeline, and an analysis algorithm based on social media
type (symmetric vs. asymmetric) and content parameters was applied. Finally, the effects of the
topic relevance parameters on classification performance were examined by analysis of variance
(ANOVA).
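
The weighting and formatting step can be illustrated with a minimal Python sketch. It assumes raw term frequency as the keyword weight (the weighting scheme is not specified above) and uses hypothetical names (`to_arff`, `kw_0`, `ssna_topics`) and placeholder English keywords; it is not the SSNA system's implementation. It emits the usual ARFF layout of a relation name, one numeric attribute per keyword, a nominal class attribute and a data section.

```python
from collections import Counter


def to_arff(posts, relation="ssna_topics"):
    """Build ARFF text from (keywords, category) pairs.

    Assumption: the weight of a keyword is its raw term frequency in the post.
    """
    vocab = sorted({kw for keywords, _ in posts for kw in keywords})
    classes = sorted({category for _, category in posts})

    lines = [f"@relation {relation}", ""]
    # Generic attribute names avoid quoting issues with Thai text or spaces.
    for index in range(len(vocab)):
        lines.append(f"@attribute kw_{index} numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]

    for keywords, category in posts:
        counts = Counter(keywords)
        row = [str(counts.get(kw, 0)) for kw in vocab]
        lines.append(",".join(row + [category]))
    return "\n".join(lines)


# Hypothetical, already segmented posts.
posts = [
    (["bank", "loan", "bank"], "financial"),
    (["goal", "match"], "sport"),
]
arff_text = to_arff(posts)
print(arff_text)
```

Naming the attributes `kw_0`, `kw_1`, ... keeps the file valid regardless of the characters in the keywords; a separate lookup table from index to keyword would be needed to interpret the columns.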

[Figure 2 flow diagram. Panels: 1. Data Collecting (Facebook user and Twitter user data stored in the
SSNA database); 2. Data Pre-Processing (Tlex.CRFPP web service and Tlex corpus; keyword selection,
unknown word removal, term weighting, indicator extraction and formatting of *.category and
*.keyword files into *.arff; identified classes: Financial, Entertainment, Smart Phone, Sport);
3. Machine Learning (Naïve Bayes, SVM and Bayes Net producing the predicted result); 4. Data
Modeling (design space analysis, parameter identification); 5. Evaluation (analysis of variance of
the parameters).]

Figure 2. Steps of conducting the semantic social network analysis approach

1 http://www.nectec.or.th/corpus/


3.3. Designing Recommendation Model

3.3.1. SSNA dataset

The SSNA dataset contains data about user activities, relationships and ratings from users. This
research identified 4 user domains: entertainment, smart phones, financial and sport. Within these
domains, 10 timeline content classes were defined: banking, investment, stock exchange, movies,
music, television, promotion, functional, accessories and sport. A dataset sample of 10,000 topics
was filtered and narrowed down to 4,833 topics that fitted the 4 classes on Facebook and Twitter.
The scope of content in the SSNA dataset is shown in Table 1.

Table 1. The Scope of Content in the SSNA dataset

Parameters        Content    Unique Keyword     Unknown Word       Total Keyword
Facebook
  Financial           748     2,502   25%        7,327   75%        9,829   100%
  Entertainment       636     1,375   36%        2,483   64%        3,858   100%
  Smart Phone         799     1,329   34%        2,542   66%        3,871   100%
  Sport               397       857   23%        2,878   77%        3,735   100%
  Total             2,580     6,063   28%       15,230   72%       21,293   100%
Twitter
  Financial           910     2,506   57%        1,855   43%        4,361   100%
  Entertainment       424     1,407   20%        5,730   80%        7,137   100%
  Smart Phone         498     1,294   38%        2,071   62%        3,365   100%
  Sport               421       997   40%        1,485   60%        2,482   100%
  Total             2,253     7,237   36%       11,141   64%       17,345   100%

3.3.2. Dimension Combination

Table 2 illustrates the full design space. The recommendation model is studied in four dimensions:
social media type, content domain, text classification algorithm and content parameters. The
dimensions are analyzed for their effects on the predicted topic relevance. Specifically, the study
assesses how the social media type, keyword classification algorithm, topic category and average
values of the content parameters may affect classification performance. In total, the studied
dimensions yield 264 combinations (2 x 4 x 3 x 11).

Table 2. The Design Space of the Predictive Topic Relevance Model

Dimension                  Space   Description
Social Media Type          2       Facebook, Twitter
Content Domains            4       Financial (Fnc), Entertainment (Ent), Smart Phone (Smp), Sport (Sp)
Classification Algorithm   3       Support Vector Machine, Naïve Bayes, Bayes Net
Content Parameters         11      Length; Number of extracted keyword; Number of keyword;
                                   Unique keyword; Unknown word; Unknown word ratio;
                                   Duplicate keyword; Duplicate keyword ratio;
                                   Single-category keywords; Joint-category keywords;
                                   Difference single-joint category keyword
Dimension Combination              (2 x 4 x 3 x 11) = 264
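
As a quick check of the size of the design space, the four dimensions can be enumerated as a Cartesian product. The sketch below is illustrative only; the short parameter labels follow the abbreviations used for the content parameters in this paper, and the variable names are hypothetical.

```python
from itertools import product

SITES = ["Facebook", "Twitter"]
DOMAINS = ["Fnc", "Ent", "Smp", "Sp"]
ALGORITHMS = ["SVM", "Naive Bayes", "Bayes Net"]
PARAMETERS = ["L", "NEK", "NK", "UK", "UW", "UWR", "DK", "DKR", "SCK", "KJC", "DJR"]

# Every (site, domain, algorithm, parameter) combination in the design space.
design_space = list(product(SITES, DOMAINS, ALGORITHMS, PARAMETERS))
print(len(design_space))
```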


3.3.3. Topical Analysis of Timeline

The topical analysis, or topic relevance analysis, of the user timeline is illustrated in Figure 3.
Our approach distinguishes two kinds of social media sites based on user relationship type:
asymmetric and symmetric. For example, Twitter is a social media site with asymmetric relationships,
i.e. one-way relationships in which the connected users may or may not be friends. Facebook, on the
other hand, has symmetric relationships, i.e. two-way relationships in which the connected users
must be friends: a pair of users is required to have a friend relationship for activity interaction
[4]. One goal of the study is to assess whether these differences influence the effectiveness of
content analysis in predicting user interest groups.

Figure 3. An Example Topical Analysis Based on Users, Topics, Keywords and Social Media Types

3.3.4. Content Classification

This study focuses on the timeline content as a whole rather than on the content posted by
individual users in isolation. For example, a conversation on a timeline may be shared by two users
on Twitter and by only one user on Facebook. Moreover, content can belong to multiple content
categories, and keywords can appear in multiple classes and OSN, as shown in Figure 3. Therefore, to
analyze the effects of content relevance and the relationships between topics, we focus on topical
content rather than on each individual user's information stream. Consequently, this research
studies the topics of user timelines shared on social media by 101 users. The content can be
examined along several dimensions, including the OSN type: symmetric or asymmetric. We then applied
three classification algorithms to predict the topic category: SVM, Naïve Bayes and Bayes Net. We
focus only on the 4 content classes of entertainment, smart phones, financial and sport. The total
combination of the studied dimensions is shown in Table 2.
The SSNA dataset fitted the 4 targeted classes on both Facebook and Twitter. In this part, the aim
was to evaluate the performance of the recommendation models in predicting relevant topics based on
Facebook and Twitter timelines. All algorithms were run in the WEKA data mining tool, version 3.7.
Each classification algorithm was evaluated by 10-fold cross validation over N categories using a
set of labeled timeline posts. The results are classifiers that associate a group of keywords on
Facebook (Fb) or Twitter (Tw) with a group of one or more category labels Ti, as shown in
Equations 1-2.

$$Fb(K) = \{T_1, \cdots, T_N\} \qquad (1)$$

$$Tw(K) = \{T_1, \cdots, T_N\} \qquad (2)$$

3.3.5. Parameter Identifications

In addition, this research defined 11 content parameters as the scope of analysis. Specifically, the
effects of these parameters on the topical analysis of the defined content categories in the
classification model were investigated. In this study, Ui denotes a user on Facebook or Twitter, Li
denotes the length of a topic in characters, and Ki denotes the keywords that can be extracted from
the topic. In this part, the aim was to evaluate the effects of these parameters on the performance
of the recommendation models in predicting relevant topics based on Facebook and Twitter timelines.
The definitions of the studied content parameters are summarized in Table 3.

Table 3. Definitions of the Content Parameters

Number of extracted keyword
  Equation: $NEK_{ci} = \sum_{i=0}^{n} K_i$
  Definition: The total number of keywords in the content.

Number of keyword
  Equation: $NK_{ci} = \sum_{i=0}^{n} K_i - \sum_{i=0}^{n} UN_i$
  Definition: The number of keywords that can be extracted from the corpus and are not in the
  unknown words file.

Unique keyword
  Equation: $UK_{ci} = \sum_{i=0}^{n} K_i - \sum_{i=0}^{n} UN_i - \sum_{i=0}^{n} DK_i$
  Definition: The number of keywords after removing duplicate keywords in the content.

Unknown word
  Equation: $UW_{ci} = \sum_{i=0}^{n} UN_i \not\subset \sum_{i=0}^{n} UK_i$
  Definition: A term that is a symbol character or a set of keywords, has a term frequency of more
  than 100 words and is not in the corpus.

Unknown word ratio
  Equation: $UWR_{ci} = \dfrac{\sum_{i=0}^{n} UW_i}{\sum_{i=0}^{n} NK_i}$
  Definition: The proportion of unknown keywords to the number of keywords in the content.

Duplicate keyword
  Equation: $DK_{ci} = \left\{ \sum_{i=0}^{n} K_i,\ K_i > 1 \right\}$
  Definition: The number of keywords that have a term frequency > 1 in the content.

Duplicate keyword ratio
  Equation: $DKR_{ci} = \dfrac{\sum_{i=0}^{n} DK_i}{\sum_{i=0}^{n} UK_i}$
  Definition: The proportion of duplicate keywords to the number of unique keywords in the topic.

Single-category keywords
  Equation: $SCK_{ci} = \left\{ \sum_{i=0}^{n} UK_{i,t_n} - \sum_{i=0}^{n} KJC_{i,t_n},\ \sum_{i=0}^{n} NK_{ci} \ni \sum_{i=0}^{n} UW_i \right\}$
  Definition: The number of keywords in each of the user's posts and categories that are not used
  in other categories.

Joint-category keywords
  Equation: $KJC_{ci} = \left\{ \sum_{i=0}^{n} UK_{i,t_n} - \sum_{i=0}^{n} SCK_{i,t_n},\ \sum_{i=0}^{n} NK_{ci} \ni \sum_{i=0}^{n} UW_i \right\}$
  Definition: The number of keywords in each of the user's posts and categories that are used not
  only in one content category but also in other categories.

Difference single-joint category keyword
  Equation: $DJR_{ci} = \begin{cases} \sum_{i=0}^{n} SCK_i - \sum_{i=0}^{n} KJC_i, & SCK_i > KJC_i \\ \sum_{i=0}^{n} KJC_i - \sum_{i=0}^{n} SCK_i, & SCK_i < KJC_i \end{cases}$
  Definition: If the number of single-category keywords is larger than the number of joint-category
  keywords then SCKi - KJCi, otherwise KJCi - SCKi.
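
One possible reading of the definitions in Table 3 is sketched below for a single post. Several choices are assumptions made for illustration rather than the study's implementation: duplicate keywords are counted as surplus occurrences so that NK - DK equals the number of distinct known keywords, every token outside the corpus is treated as an unknown word (the term-frequency threshold is ignored), and the function and argument names are hypothetical.

```python
from collections import Counter


def content_parameters(text, tokens, corpus, category, category_keywords):
    """Compute the 11 content parameters for one segmented post."""
    known = [t for t in tokens if t in corpus]
    unknown = [t for t in tokens if t not in corpus]
    counts = Counter(known)

    nek = len(tokens)                            # all extracted keywords
    nk = len(known)                              # keywords found in the corpus
    dk = sum(c - 1 for c in counts.values())     # surplus (repeated) occurrences
    uk = nk - dk                                 # distinct known keywords
    uw = len(unknown)                            # tokens not in the corpus

    # Keywords that also occur in the keyword set of any other category.
    other = set()
    for cat, keywords in category_keywords.items():
        if cat != category:
            other |= set(keywords)
    kjc = sum(1 for kw in counts if kw in other)
    sck = uk - kjc

    return {
        "L": len(text), "NEK": nek, "NK": nk, "UK": uk, "UW": uw,
        "UWR": uw / nk if nk else 0.0,
        "DK": dk,
        "DKR": dk / uk if uk else 0.0,
        "SCK": sck, "KJC": kjc, "DJR": abs(sck - kjc),
    }


# Hypothetical example with placeholder English tokens.
corpus = {"bank", "loan", "rate", "goal"}
category_keywords = {"Fnc": {"bank", "loan", "rate"}, "Sp": {"goal", "rate"}}
tokens = ["bank", "loan", "bank", "rate", "xyz"]
params = content_parameters(" ".join(tokens), tokens, corpus, "Fnc", category_keywords)
print(params)
```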

3.3.6. Analysis of Parameter Effects on Classification Performance

To evaluate the factors influencing the performance of category prediction, analysis of variance
(ANOVA) models were applied with class sections nested within category groups. To assess
differences in parameters among the predictive results of each content category group, four classes
of results were used as the dependent variables: true positive (TP), true negative (TN), false
positive (FP) and false negative (FN). The classification result of each content item was defined as
True or False. If the predicted result is True, the result has the value "1"; the predicted category
is labeled "TP" and the other categories "TN". If the predicted result is False, the result has the
value "0"; the correct category of the content is labeled "FN", the predicted category "FP" and the
other categories "TN", as shown in Table 4.


Given the results, each of the 11 studied parameters was compared between the content in each of
the four groups: TP, TN, FP and FN. The null hypothesis in ANOVA is that the means of each
parameter over the content in these four groups are equal. The null hypothesis [13] of this study
can therefore be written as:
H0: μTP = μTN = μFP = μFN; the mean of the topic parameter does not differ across the four groups
of classification results.
The alternative hypothesis is defined as:
H1: The mean of the topic parameter differs in at least one group.
The null hypothesis is rejected when the p-value is less than 0.05, in which case a post-hoc
analysis is conducted. A significant difference in the parameter values between the TP and FP groups
of a category implies that the parameter has an effect on precision. A significant difference in the
parameter values between the TP and FN groups of a category implies that the parameter has an
effect on recall.
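
The test statistic behind this hypothesis can be sketched in a few lines. The code below computes the F statistic of a plain one-way ANOVA over the four result groups; it is a simplification of the nested design described above, the parameter values are invented for illustration, and the p-value (obtained from the F distribution with the returned degrees of freedom) is left to a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of non-empty numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values of one content parameter in the TP, TN, FP and FN groups.
tp, tn, fp, fn = [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]
f_stat, df_b, df_w = one_way_anova_f([tp, tn, fp, fn])
print(f_stat, df_b, df_w)
```

A large F relative to the F(df_between, df_within) distribution leads to rejecting H0, after which the TP/FP and TP/FN group pairs are compared in the post-hoc step.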

Table 4. The Analysis of Parameter Effects on Classification Performance using ANOVA

             Classification Performance      Dependent Variables      Independent Variables
Content_id   Classes   Predicted   Result    Fnc   Ent   Smp   Sp     Content Parameters
1            Ent       Ent         1         TN    TP    TN    TN     L1 … DJRc1
2            Ent       Sp          0         TN    FN    TN    FP
3            Ent       Fnc         0         FP    FN    TN    TN
4            Ent       Smp         0         TN    FN    FP    TN
5            Fnc       Fnc         1         TP    TN    TN    TN
6            Fnc       Smp         0         FN    TN    FP    TN
…            Fnc       Ent         0         FN    FP    TN    TN
n            Fnc       Sp          0         FN    TN    TN    FP     Ln … DJRcn
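
The labeling rule of Table 4, and the precision and recall values derived from it, can be expressed as a short sketch. The function names are hypothetical, and the example pairs simply reuse the (class, predicted) combinations listed in the table; the resulting figures are illustrative and are not results of the study.

```python
CATEGORIES = ["Fnc", "Ent", "Smp", "Sp"]


def label_outcomes(actual, predicted, categories=CATEGORIES):
    """Label each category as TP, TN, FP or FN for one classified content item."""
    labels = {}
    for cat in categories:
        if actual == predicted:
            labels[cat] = "TP" if cat == actual else "TN"
        elif cat == actual:
            labels[cat] = "FN"      # the correct class was missed
        elif cat == predicted:
            labels[cat] = "FP"      # a wrong class was predicted
        else:
            labels[cat] = "TN"
    return labels


def precision_recall(pairs, category):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) for one category."""
    tally = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in pairs:
        tally[label_outcomes(actual, predicted)[category]] += 1
    tp, fp, fn = tally["TP"], tally["FP"], tally["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


pairs = [("Ent", "Ent"), ("Ent", "Sp"), ("Ent", "Fnc"), ("Ent", "Smp"),
         ("Fnc", "Fnc"), ("Fnc", "Smp"), ("Fnc", "Ent"), ("Fnc", "Sp")]
print(label_outcomes("Ent", "Sp"))
print(precision_recall(pairs, "Ent"))
```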

4. Experimental Result and Discussion

4.1. Performance Comparison of Text Classification Models

Figure 4. Performance Comparison of Text Classification Models

The results of the experiment are presented in Figure 4, which compares the classification
techniques SVM, Naïve Bayes and Bayes Net for content classification. SVM with 10-fold cross
validation gives the best overall performance in the studied domains for both Facebook and Twitter
data. For Facebook data, SVM gives the highest recall for the categories financial (0.89),
entertainment (0.83) and smart phone (0.84), and the highest precision for financial (0.89) and
entertainment (0.92). Bayes Net gives the highest precision for smart phone (0.78) and the highest
recall for sport (0.81), and Naïve Bayes gives the highest precision for sport (0.87). For Twitter
data, SVM gives the highest recall for financial (0.93), smart phone (0.84) and sport (0.83), and
the highest precision for entertainment (0.88), smart phone (0.91) and sport (0.95). Naïve Bayes
gives the highest recall for entertainment (0.79) and the highest precision for financial (0.93).

4.2. Analysis Results of Parameter Effects on Classification Performance

Table 5. The Average Values of Content Parameters for Facebook and Twitter Data

              Facebook                              Twitter
Parameter     Fnc     Ent     Smp     Sp      Avg.     Fnc     Ent     Smp     Sp      Avg.     Significance markers (in row order)
Precision #   0.89    0.92    0.74    0.82    0.84     0.78    0.88    0.91    0.95    0.88
Recall !      0.89    0.83    0.84    0.72    0.82     0.93    0.73    0.84    0.83    0.83
Li            98.99   94.89   96.00   98.61   97.12    86.89   89.21   89.29   88.09   88.37    #! ! # #! #!
NEKci         24.49   21.76   21.87   23.60   22.93    20.17   21.16   21.21   20.69   20.81    #! #! ! ! ! # #
NKci          12.40   11.01   11.16   11.92   11.62    10.86   10.99   11.07   10.97   10.97    #! #! ! ! #! #
UKci          11.51   11.16   10.45   10.33   10.86    10.23   10.05   10.21   10.13   10.12    #! #! ! ! # #! #! ! #!
UWci          10.09   10.74   10.71   12.68   11.56    9.31    10.17   10.15   9.73    9.84     # #! #! ! #! #! #! #! # #!
UWRci         0.31    0.32    0.31    0.33    0.32     0.30    0.31    0.31    0.30    0.30     #! #! #! #! #! #! # #
DKci          0.88    0.05    0.70    0.76    0.60     0.81    0.89    0.86    0.83    0.85     #! #! #! # #! #!
DKRci         0.06    0.69    0.05    0.05    0.21     0.06    0.07    0.07    0.07    0.07     #! #! #! #! #! #! #! # #! #!
SCKci         3.60    3.45    3.26    5.63    4.04     3.89    3.68    3.87    3.86    3.81     #! #! #! #! #! #! #! #! # #!
KJCci         6.29    7.75    7.50    8.79    7.58     7.29    7.31    7.20    7.11    7.16     #! #! ! ! ! #! #! #! #! #!
DJRci         2.69    4.30    4.24    3.16    3.55     3.40    3.63    0.33    3.25    3.41     #! #! ! ! ! #! ! #! !

# Precision significant, p < 0.05
! Recall significant, p < 0.05

The parameters of the SSNA dataset that influenced prediction performance were studied using
ANOVA, a collection of statistical models used to investigate differences among group means. The
study was conducted separately for the Twitter and Facebook content, to investigate whether
different types of OSN have different factors influencing prediction performance. The experimental
results show whether a content parameter differs significantly among the result groups (p < 0.05).
A post-hoc analysis was then conducted to test whether the parameter has an effect on precision or
recall.
The ANOVA test was conducted separately for each category. In determining the overall results, a
parameter found to have a significant effect on precision for at least three out of four categories
is considered influential on precision. Similarly, a parameter found to have a significant effect on
recall for at least three out of four categories is considered influential on recall. The results
are shown in Table 5.
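
The "at least three out of four categories" rule amounts to a simple count over per-category significance flags. The sketch below uses a hypothetical function name and invented flags purely to show the rule; the flags do not reproduce the markers of Table 5.

```python
def influential_parameters(significant, min_categories=3):
    """Return the parameters that are significant in at least min_categories categories.

    significant maps each parameter to a {category: bool} dictionary.
    """
    return [
        parameter
        for parameter, by_category in significant.items()
        if sum(by_category.values()) >= min_categories
    ]


# Invented flags for two parameters across the four categories.
precision_flags = {
    "UW": {"Fnc": True, "Ent": True, "Smp": True, "Sp": False},
    "NK": {"Fnc": True, "Ent": True, "Smp": False, "Sp": False},
}
print(influential_parameters(precision_flags))
```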
The experimental results show that overall precision on Facebook was affected by four content
parameters: (1) unknown word, (2) unknown word ratio, (3) duplicate keyword ratio and (4)
single-category keywords. Specifically, content with fewer unknown words and a lower unknown word
ratio has better overall precision, whereas content with a higher duplicate keyword ratio and more
single-category keywords has better overall precision. Overall recall on Facebook was affected by
nine content parameters: (1) number of extracted keyword, (2) number of keyword, (3) unique keyword,
(4) unknown word, (5) unknown word ratio, (6) duplicate keyword ratio, (7) single-category keywords,
(8) joint-category keywords and (9) difference single-joint category keyword. Content with fewer
unknown words and joint-category keywords has better overall recall, as does content with more
unique keywords and single-category keywords.

In addition, overall precision on Twitter was affected by seven content parameters: (1) unique
keyword, (2) unknown word, (3) unknown word ratio, (4) duplicate keyword, (5) duplicate keyword
ratio, (6) single-category keywords, and (7) joint-category keywords. Contents with a lower unknown
word ratio and fewer joint-category keywords have better overall precision, whereas contents with a
higher duplicate keyword ratio and more single-category keywords have better overall precision.
Moreover, overall recall on Twitter was affected by seven content parameters: (1) unique keyword,
(2) unknown word, (3) duplicate keyword, (4) duplicate keyword ratio, (5) single-category keywords,
(6) joint-category keywords, and (7) difference between single-category and joint-category
keywords. In particular, contents with fewer unknown words and fewer duplicate keywords have better
overall recall, while contents with more unique keywords and more single-category keywords have
better overall recall.

4.3. Discussion

The influence of content on prediction performance is shown in Figure 4. Unlike Twitter, which
limits the number of characters allowed per post, Facebook places no limit on the length of a
post. However, performance was lower on Facebook than on Twitter for some categories; for example,
recall was low for the sport category. This can be explained by the fact that the sport category
has the fewest unique keywords of all categories, i.e. 857 unique words out of 3,725 keywords, and
its contents contain many unknown words (77%). Thus, the performance is lower because of the lower
number of unique words and the higher number of unknown words. Similarly, the SVM algorithm has low
recall on Twitter in the entertainment category. This category has 1,407 unique keywords out of a
total of 7,137 keywords, and its contents contain many unknown words (80%), as shown in Table 1.
Thus, the performance is again lower because of the lower number of unique words and the higher
number of unknown words.
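To make the keyword counts quoted above easier to compare, the share of unique keywords can be computed directly from the reported figures. The sketch below is only arithmetic on those figures; the helper name `share` and the idea of expressing the counts as a percentage are illustrative assumptions, not a measure defined in the study.

```python
def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# (unique keywords, total keywords) as reported in the discussion above.
reported = {
    "Facebook / sport": (857, 3725),
    "Twitter / entertainment": (1407, 7137),
}

for label, (unique, total) in reported.items():
    print(f"{label}: {share(unique, total)}% of keywords are unique")
```

In both cases the unique keywords make up less than a quarter of all keywords in the category.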
In addition, a preliminary investigation of the content parameters influencing prediction
performance was carried out. The results show low recall in the sport category on Facebook, at 72%.
This can be explained by the fact that contents in the sport category have the highest average
number of keywords shared with other categories (joint-category keywords, KJC), at 8.79 words per
content. For example, the largest number of joint keywords, 1,202 words, occurs between the sport
and smart phone categories. These factors contributed to the lower recall for the sport category
and the lower precision for the smart phone category (74%) on Facebook. Similarly, on Twitter, the
results show low recall in the entertainment category, at 73%. Contents in the entertainment
category have the highest average number of joint-category keywords (KJC), at 7.31 words per
content, and the entertainment category shares 768 joint keywords with the financial category.
These factors contributed to the lower recall for the entertainment category and the lower
precision for the financial category (78%) on Twitter.
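The argument above, that keywords shared between two categories lower recall for one category and precision for the other, can be illustrated with a small confusion matrix. This is a sketch with hypothetical counts, not the study's data; the function name `precision_recall` and the matrix layout are assumptions made for illustration.

```python
def precision_recall(confusion, category):
    """Compute (precision, recall) for one category.

    `confusion[actual][predicted]` is the number of contents whose true
    category is `actual` and whose predicted category is `predicted`.
    """
    true_positive = confusion[category][category]
    predicted_total = sum(row[category] for row in confusion.values())
    actual_total = sum(confusion[category].values())
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical counts: 20 sport contents are misclassified as smart phone,
# for example because they contain many keywords shared by both categories.
confusion = {
    "sport": {"sport": 80, "smart phone": 20},
    "smart phone": {"sport": 0, "smart phone": 100},
}

for name in confusion:
    p, r = precision_recall(confusion, name)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

In this illustration the misclassified sport contents reduce recall for sport and, because they are counted among the smart phone predictions, reduce precision for smart phone, which mirrors the pattern described for both sites.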

5. Conclusion and Future Work


In this research, the aim was to design and develop a semantic social network analysis framework to
recommend user interest groups on social media. To this end, a semantic social network analysis was
conducted, with an application that predicts users' topics of interest from content relevance.
Combining the different dimensions yields different performance for different algorithms when
predicting topic relevance. In the social media type dimension, Twitter and Facebook have similar
overall classification performance. In the topic classification model dimension, SVM had the best
overall effectiveness in text classification: on Facebook, precision averaged 84% and recall
averaged 83%, while on Twitter precision averaged 88% and recall averaged 83%. In addition, the
study of influencing factors was conducted using analysis of variance to assess the impact of the
parameters
on precision and recall. Seven parameters were found to affect precision: (1) unknown word,
(2) unknown word ratio, (3) duplicate keyword, (4) duplicate keyword ratio, (5) single-category
keywords, (6) joint-category keywords, and (7) difference between single-category and
joint-category keywords. Likewise, ten parameters are among the leading factors affecting recall:
(1) number of extracted keywords, (2) number of keywords, (3) unique keyword, (4) unknown word, (5)
unknown word ratio, (6) duplicate keyword, (7) duplicate keyword ratio, (8) single-category
keywords, (9) joint-category keywords, and (10) difference between single-category and
joint-category keywords. Furthermore, in the user domain dimension, the financial category has the
best overall performance on both social media sites.
To summarize, the experimental results show promising performance when applying relevant content
analysis in social networks to recommend interested user groups. Specifically, the results of
combining the 264 dimensions of the recommendation model can be summarized as follows. (1) For both
social media types, applying the SVM prediction model increases the overall effectiveness of text
classification. (2) The financial category had the best overall recall on both Twitter and
Facebook. The sport category had the best precision on Twitter, and the entertainment category had
the best overall precision on Facebook. (3) On Facebook, four leading parameters influenced
precision and nine parameters influenced recall. On Twitter, seven parameters influenced precision
and seven parameters influenced recall.
This work focused on building a recommendation model based on OSN content analysis. Social
relationship analysis approaches that can be combined with the recommendation model from this study
will be investigated in future research. Specifically, the plan is to improve the recommendation
results using social network analysis techniques. For instance, the topic model can be influenced
by measures of the extracted implicit and explicit relationships among social network users.
Furthermore, the authors are also interested in investigating a recommendation model that is
suitable across different social media types.

6. Acknowledgements

This work is supported by King Mongkut's University of Technology North Bangkok. The
corresponding author would also like to thank the Office of the Higher Education Commission in
Thailand and the National Electronics and Computer Technology Center for their support.

7. References
[1] K.S. Lee, M. Hong, J. Jung, G. Jo, “Building a Semantic Social Network Based on Interpersonal
Relationships”, In 2012 Third FTRA International Conference on Mobile, Ubiquitous, and
Intelligent Computing, pp. 90-9, 2012.
[2] C. Chelmis, V. K. Prasanna, “Social Networking Analysis: A State of the Art and the Effect of
Semantics”, In 2011 IEEE Third Int’l Conference on Privacy, Security, Risk and Trust and 2011
IEEE Third Int'l Conference on Social Computing, pp. 531-536, 2011.
[3] J. Chen, R. Nairn, L. Nelson, “Short and tweet: experiments on recommending content from
information streams”, In Proceedings of the SIGCHI Conference on Human Factors in Computing
System, pp. 1185-1194, 2010.
[4] N. Pankong, S. Prakancharoen, M. Buranarach, “A combined semantic social network analysis
framework to integrate social media data”, In Knowledge and Smart Technology (KST), pp. 37-
42, 2012.
[5] Y. Wang, J. Zhang, J. Vassileva, “A User-Centric Approach for Social Data Integration and
Recommendation”, In 2010 3rd International Conference on Human-Centric Computing, pp. 1-8,
2010.
[6] N. Pankong, S. Prakancharoen, “Combining Algorithms for Recommendation System on
Twitter”, Advanced Materials Research, vol. 403-408, pp. 3688-3692, 2011.
[7] J. Jang, J. Choi, G. Jang, S. Myaeng, “Semantic Social Networks Constructed by Topical Aspects
of Conversations: An Explorative Study”, In ICWSM, pp. 487-490, 2012.
[8] Jiang Zhong, Lin Su, Qigan Sun, “A Novel Text Classification Algorithm Based on Matrix
Projection Method”, AISS: Advances in Information Sciences and Service Sciences, vol. 5, no. 7,
pp. 427-435, 2013.
[9] A. K. Singh, R. C. Joshi, “Semantic tagging and classification of blogs”, In 2010 International
Conference on Computer and Communication Technology (ICCCT), pp. 455-459, 2010.
[10] C. Haruechaiyasak, W. Jitkrittum, C. Sangkeettrakarn, C. Damrongrat, “Implementing News
Article Category Browsing Based on Text Categorization Technique”, In Web Intelligence and
Intelligent Agent Technology, IEEE/WIC/ACM International Conference, vol.3, pp. 143-146,
2008.
[11] Chao, Chih-Yang, Yen, Chia-Sung, Yang, Shih-Chun, Ting, I-Hsien, “The Relationship of Social
Network and the Organizational Justice Strategies in Campus Media News Gathering”, JDCTA,
vol. 6, no. 17, pp. 612-619, 2012.
[12] H.-C. Yang, C.-H. Lee, “Mining tag semantics for social tag recommendation”, In 2011 IEEE
International Conference on Granular Computing, pp. 760-766, 2011.
[13] H. J. Seltman, “Experimental Design and Analysis”, Statistics, vol. D, no. 2, pp. 171-189, 2012.
[14] J. Scott, “Social Network Analysis”, A Handbook, vol. 3, no. 5. Sage Publications, p. 208, 2000.
[15] H. Wu, I. Ting, K. Wang, “Combining social network analysis and web mining techniques to
discover interest groups in the blog space”, In Innovative Computing Information and Control
ICICIC 2009 Fourth International Conference, pp. 1180-1183, 2009.
[16] M. H. Haggag, “Keyword Extraction using Semantic Analysis”, International Journal of
Computer Applications, vol. 61, no. 1, pp. 1-6, 2013.
[17] Z. Chu, S. Gianvecchio, H. Wang, “Who is Tweeting on Twitter: Human, Bot, or Cyborg?”, In
ACM, vol. 61, no. 5, pp. 21-30, 2010.
[18] K. Tao, F. Abel, C. Hauff, G. Houben, “What makes a tweet relevant for a topic?”, In 2nd
Workshop on Making Sense of Microposts, pp. 49-56, 2012.
[19] B. Smith, N. Millington, “Inside the Walled Garden - Social networking in ESL”, In
Ritsumeikan Center for Asia Pacific Studies, pp. 179-183, 2012.
[20] S. Bortoli, T. Palpanas, P. Bouquet, “Pulling Down The Walled Garden: Towards A Paradigm
For Decentralized Social Network Management”, In Proceedings of the IADIS International
Conference WBC 2009 (part of MCCSIS 2009), pp. 35-42, 2009.
