Information Development
112
The Author(s) 2014
Reprints and permission:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0266666914528523
idv.sagepub.com
Esra Kahya-Ozyirmidokuz
Erciyes University, Turkey
Abstract
The large amounts of Facebook social network data that are generated and collected need to be analyzed to obtain valuable decision-making information about shopping firms in Turkey. In addition, analyzing social network data from outside the firms has become a critical business need for firms that actively use Facebook. To gain a competitive advantage, firms must translate social media texts into something more quantitative in order to extract information. In this study, web text mining techniques are used to determine the Facebook patterns of popular online shopping firms. For this purpose, the web URLs of 200 popular Turkish companies are used. Web text mining through natural language processing techniques is examined, and similarity analysis and clustering are performed. Consequently, the clusters of the Facebook websites and the relationships and similarities among the firms are obtained.
Keywords
text mining, web mining, Facebook, knowledge discovery in databases, data mining, information extraction,
online shopping, Turkey
important to analyze social web data in order to differentiate in the global competition.
Firms use social media to build business, sell products and support the economy. Social media websites
(Facebook, LinkedIn, Google, Twitter, etc.) help
consumers to decide what to buy. A business can
compare its social media data with the social media
data of its competitors to gain perspective on its own
performance. The comparison could help a business
to identify weaknesses, find new opportunities and
adjust its social media strategy. One of the main techniques used in social media competitive analysis is
Corresponding author:
Esra Kahya-Ozyirmidokuz, PhD, Assistant Professor, Erciyes
University, Kayseri Vocational College, Computer Technologies
and Programming Department, Erciyes University, Talas, Kayseri,
Turkey. Tel (W): 9 (0) 352 437 49 15.
Email: esrakahya@erciyes.edu.tr
Literature review
One area that is finding increased interest is the analysis of web documents. Roussinov and Zhao (2003)
demonstrate how the World Wide Web can be mined
in a fully automated manner to discover the semantic
similarity relationships among the concepts surfaced
during an electronic brainstorming session, and thus
improve the accuracy of their application. Mladenic
and Grobelnik (2003) describe feature subset selection used in text learning and give a brief overview
of feature subset selection which is commonly used
in machine learning. They collected data from Yahoo,
a large text hierarchy of web documents. They used
machine learning methods to analyse them.
Thelwall, Thelwall and Fairclough (2006) apply
web issue analysis to the topic of nurse prescribing.
They used web TM. They collected data via URL
analysis to identify common direct and indirect
sources of web information. The results are presented
in descriptive form and a qualitative analysis is used
to argue that new information has been found. Chau
and Chen (2008) propose a machine-learning-based
approach that combines web content analysis and web
structure analysis. They represent each web page by a
set of content-based and link-based features, which
can be used as the input for various machine learning
algorithms.
Carullo, Binaghi, and Gallo (2009) propose a novel
heuristic online document clustering model that can
be specialized with a variety of text-oriented similarity measures. An experimental evaluation of the proposed model is conducted in the e-commerce domain.
Li and Wu (2010) study online forums hotspot detection and forecast using sentiment analysis and TM
approaches. First, they create an algorithm to automatically analyze the emotional polarity of a text and
to obtain a value for each piece of text. Second, this
algorithm is combined with k-means clustering (k is
the number of clusters) and support vector machine
(SVM) to develop an unsupervised TM approach.
They use the proposed TM approach to group the forums into various clusters, with the center of each cluster representing a hotspot forum within the current
time span. De Maziere and Van Hulle (2011) used
TM to cluster a 7,000-document inventory in order
to evaluate the impact of EU-funded research in the
social sciences and humanities on EU policies. They
Kahya-Ozyirmidokuz: Analyzing unstructured Facebook social network data through web text mining
compared two visualization techniques, multidimensional scaling and the self-organizing map, and
one of the latter's derivatives, the U-matrix.
There are several examples of web/TM approaches.
Song and Shepperd (2006) view the topology of a
website as a directed graph. They use a user's access
information on all URLs of a website as features to
characterize the user, and all users' access information on a URL as features to characterize the
URL. The user clusters and web page clusters are
discovered by both vector analysis and fuzzy set
theory-based methods. The frequent access paths are
recognized based on web page clusters and take into
account the underlying structure of a website. Soibelman et al. (2008) present text-based, web-based,
image-based, and network-based construction data
as examples of their vision for possible advanced
data management and analysis on these unstructured
types of data. Several required steps essential for
achieving this goal are presented and discussed,
including: problem investigation to define relevant
issues; data preparation to remove features not carrying useful information; data representation to transform data sources into new precise and concise
descriptions; and proper data analysis for information search or for extraction of new knowledge.
Wong and Lam (2009) developed an unsupervised
learning framework which can jointly extract information and conduct feature mining from a set of web
pages across different sites. Their approach allows
tight interaction between the tasks of extraction and
mining. Engel, Whitney, and Cramer (2010) developed algorithms to find and characterize changes in
topic within text streams.
Schedl, Widmer, Knees and Pohle (2011) developed novel techniques and refined existing
ones in order to automatically gather information
about music artists and bands. After the searching,
retrieval, and indexing of web pages that are related
to a music artist or band, web content mining and
music information retrieval techniques are applied
to capture the following categories of information:
similarities between music artists or bands, prototypicality of an artist or a band for a genre, descriptive
properties of an artist or a band, band members and
instrumentation, and images of album cover artwork.
Hu and Liu (2012) present one real-world application
to further illustrate how to utilize text analytics
to solve problems in social media applications.
Thorleuchter and Poel (2012) introduced a new
web mining approach that enables automated
Experimental design
it: this is the aim of web mining (WM), the application of statistics and DM to web browsing data. WM
covers a range of methods from the simple counting
of visits to a page to the modeling of users' movements on the site (Tuffery, 2011).
Using WM, we can do the following (Tuffery, 2011):
Text processing
When the analysis of texts and their themes is
completed, we can filter the themes or terms to be
examined. We can use a statistical criterion
(selecting terms and themes by their frequency), a
semantic criterion (centered on a given subject), or a
corpus (identifying offensive words to avoid and their
derivations, in order to clean up a document). With
the statistical criterion, we can use a number of
weighting rules, for example, preferring terms which
appear frequently but in few texts (weight = frequency
of the term / number of texts containing it). These
terms, having been disambiguated, labeled, lemmatized (the algorithmic process of determining the
lemma, or base form, for a given word), grouped and selected, are
then treated with DM methods, with the individuals
(in the statistical sense) being the texts or documents
(e.g. mails) and the characters of the individuals
(their variables) being the themes or terms in the
documents. Thus we can produce lexical tables in
which each cell $C_{ij}$ is the number of occurrences
of term j (or an indicator of presence/absence) in document i, to which the conventional statistical
\[
\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
\]
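The inverse document frequency above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the RapidMiner configuration used in the study: the function names `idf` and `tf_idf`, the use of the natural logarithm, the raw-count term frequency, and the toy documents are all hypothetical choices made here.

```python
import math
from collections import Counter


def idf(term, documents):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # The formula is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / containing)


def tf_idf(term, doc, documents):
    """Raw term count in one document multiplied by the term's idf."""
    return Counter(doc)[term] * idf(term, documents)


# Toy corpus of already tokenized documents (hypothetical data).
docs = [
    ["kargo", "indirim", "kampanya"],
    ["indirim", "moda"],
    ["kargo", "indirim", "teslimat"],
]

print(idf("indirim", docs))  # occurs in every document, so the weight is zero
print(idf("moda", docs))     # occurs in one of three documents
```

A term that appears in every document receives a weight of zero, which matches the intuition that such a term does not help distinguish one firm's page from another.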
1. numberofwords = 20959, say = 0
2. while line i = 1 to numberofwords
3. read Word
4. if Word length = 1 then remove the line and
say = say + 1
5. if Word is in Stop table (including ve, and,
ancak, ama, mı, mi, de, da, belki etc.)
remove the line and say = say + 1
6. if Word is in WebCode table (including http,
www, com, aspx, asp etc.) remove the line
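The filtering steps listed above can be sketched in Python as follows. This is a minimal sketch rather than the tool used in the study: the function name `filter_words`, the lowercasing step, and the choice to count every removal (including web-code removals) are assumptions, and the two word sets contain only the examples named in the listing, not the full tables.

```python
# Example entries only; the full Stop and WebCode tables are not given here.
STOP_WORDS = {"ve", "and", "ancak", "ama", "mı", "mi", "de", "da", "belki"}
WEB_CODES = {"http", "www", "com", "aspx", "asp"}


def filter_words(words):
    """Return (kept_words, removed_count) after the three filters."""
    kept = []
    removed = 0  # plays the role of the counter "say" in the listing
    for word in words:
        token = word.strip().lower()
        # Assumption: empty tokens are dropped together with
        # single-character ones.
        if len(token) <= 1 or token in STOP_WORDS or token in WEB_CODES:
            removed += 1
            continue
        kept.append(token)
    return kept, removed


kept, removed = filter_words(
    ["Kampanya", "ve", "a", "www", "indirim", "http", "belki"]
)
print(kept, removed)
```

Using sets for the two tables keeps each membership check constant-time, which matters little for about 21,000 words but scales to larger crawls.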
Stemming is the process of reducing words to their
stem, the base word or almost-word that provides
the meaning without additional grammatical information. The purpose of stemming is to better capture the
content of a document (Berry and Linoff, 2011). Porter's stemming algorithm, proposed in 1980, is one of the most popular
stemming methods. Many
modifications and enhancements have been made and
Clustering
Clustering is a fundamental data analysis task that has
numerous applications in many disciplines. Clustering
can be broadly defined as the process of partitioning a
dataset into groups, or clusters, so that elements of the
same cluster are more similar to each other than to
\[
D(X_i, X_j) = \sqrt{\sum_{k} (X_{ik} - X_{jk})^2} \tag{3}
\]
which is a particular case, with p = 2, of the Minkowski metric shown in equation 4:
\[
D_p(X_i, X_j) = \left( \sum_{k} |X_{ik} - X_{jk}|^p \right)^{1/p} \tag{4}
\]
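As a small worked illustration of equations 3 and 4, the following Python sketch computes the Minkowski distance between two numeric vectors and recovers the Euclidean distance when p = 2. The function name `minkowski_distance` and the example vectors are hypothetical and are not taken from the study.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance D_p(x, y) = (sum_k |x_k - y_k|^p)^(1/p)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two toy term-frequency vectors (hypothetical data).
x_i = [0, 0]
x_j = [3, 4]

print(minkowski_distance(x_i, x_j, p=2))  # Euclidean case, equation 3
print(minkowski_distance(x_i, x_j, p=1))  # city-block case
```

With p = 2 the example yields the familiar 3-4-5 right triangle result, and with p = 1 it yields the sum of the absolute coordinate differences.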
Table 1. Examples of similarity analysis outputs.

Facebook URL-1 id number   Facebook URL-2 id number   Similarity ratio
29                         27                         1
29                         32                         0
1                          25                         0.237
6                          5                          0.125
21                         16                         0.021
21                         19                         0.051
22                         13                         0.074
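A pairwise similarity ratio between 0 and 1, such as those in Table 1, could for instance be produced by cosine similarity over term-count vectors. The text here does not state which similarity measure was used, so cosine similarity is an assumption made purely for illustration. The function name `cosine_similarity` and the token lists below are likewise hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


page_1 = ["indirim", "kargo", "kampanya"]
page_2 = ["indirim", "kargo", "kampanya"]
page_3 = ["moda", "ayakkabi"]

print(cosine_similarity(page_1, page_2))  # identical vocabularies
print(cosine_similarity(page_1, page_3))  # no shared terms
```

Under this measure, two pages with identical term counts score approximately 1 and two pages with no shared terms score 0, which is at least consistent with the extreme values shown in the first two rows of Table 1.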
Conclusions
With the rapid growth of online social media services,
companies need to analyze the impact of social media
in order to differentiate themselves in global
competition. Facebook plays a very important
role in online purchasing in Turkey. Web TM provides
an effective way to meet firms' diverse information
needs. The analyzed information is used to make business
decisions, and firms may use it to develop strategies for
increasing their profits. In addition, this research is also
important for the web designs of firms.
In this study, a TM model was developed to extract
useful clusters from the Facebook websites of online
shopping firms in Turkey. The online shopping firms'
Facebook websites that are indexed in Google were
analyzed. The RapidMiner web mining tool was used
to collect the data. The data were then transformed into a collection of documents by generating a document for
each record. In addition, the content of the
Facebook web documents was analyzed for filtering, summarization and grouping words into meaningful units.
Tokenization, transformation, filtering and
stemming processes were carried out. Similarity analysis was used to determine which websites are similar.
Graphs and tables were produced. Additionally, words
which are suitable for use on websites were detected.
Alternative TM techniques using a tool developed
for preprocessing Turkish texts can be studied in
future research to compare various approaches and
to implement this framework.
fans       comments
1464031    37501
1462243    27472
1308049    38862
932839     29001
866163     3785
723368     13329
419067     10491
301989     2125
274966     30301
253164     30101
232904     9573
211887     14202
186355     8635
151297     3724
150864     1132
130142     474
129321     380
124819     1207
120546     2393
119399     1608
References
Berry MJA and Linoff GS (2011) DM techniques for marketing, sales, and customer relationship management,
3rd edn. New York: Wiley.
Carullo M, Binaghi E and Gallo I (2009) An online document clustering technique for short web contents. Pattern Recognition Letters 30: 870-876.
Chau M and Chen H (2008) A machine learning approach
to web page filtering using content and structure analysis. Decision Support Systems 44: 482-494.
Choi M and Kim H (2013) Social relation extraction from
texts using a support-vector-machine-based dependency
trigram kernel. Information Processing and Management 49: 303-311.
De Maziere PA and Van Hulle MM (2011) A clustering
study of a 7000 EU document inventory using MDS and
SOM. Expert Systems with Applications 38: 8835-8849.
Ding Y and Fu X (2012) The research of text mining based
on self-organizing maps. 2012 International Workshop
on Information and Electronics Engineering (IWIEE).
Procedia Engineering 29: 537-541.
Engel D, Whitney P and Cramer N (2010) Events and
trends in text streams. In: Berry MW and Kogan J (eds)
TM: applications and theory. Wiley, pp. 165-182.
Feldman R and Sanger J (2007) The TM handbook:
advanced approaches in analyzing unstructured data.
Cambridge University Press.
Han J and Kamber M (2006) DM: concepts and techniques. 2nd edn. Elsevier Science and Technology.
He W, Zha S and Li L (2013) Social media competitive
analysis and TM: A case study in the pizza industry.
International Journal of Information Management 33(3):
464-472.
Hu X and Liu H (2012) Text analytics in social media. In:
Aggarwal CC and Zhai CX (eds.) Mining text data.
Springer, pp. 385-414.
Jivani AG (2011) A comparative study of stemming algorithms. International Journal of Computer Technology
and Applications 2(6): 1930-1938.
Last M and Kandel A (2010) NATO science for peace and
security: information and communication security, web
intelligence and security: advances in data and text mining techniques for detecting and preventing terrorist
activities on the web. IOS Press.
Li N and Wu DD (2010) Using text mining and sentiment
analysis for online forums hotspot detection and forecast. Decision Support Systems 48: 354-368.
Manning C and Schutze H (1999) Foundations of statistical
natural language processing. Cambridge: MIT Press.
Miner G, Delen D, Elder J, Fast A, Hill T, Nisbet B, et al.
(2012) Practical TM and statistical analysis for nonstructured text data applications. Elsevier Inc.
Mladenic D and Grobelnik M (2003) Feature selection on
hierarchy of web documents. Decision Support Systems
35: 45-87.
Myatt GJ and Johnson WP (2009) Making sense of data III:
a practical guide to data visualization, advanced DM
methods, and applications. Canada: Wiley.
Puretskiy AA, Shutt GL and Berry MW (2010) Survey
of text visualization techniques. In: Berry MW and
Kogan J (eds.) TM: applications and theory. Wiley,
pp. 105-128.
Roussinov D and Zhao JL (2003) Automatic discovery of
similarity relationships through web mining. Decision
Support Systems 35: 149-166.
Schedl M, Widmer G, Knees P and Pohle T (2011) A music
information system automatically generated via web
content mining techniques. Information Processing and
Management 47: 426-439.
Soibelman L, Wu J, Caldas C, Brilakis I and Lin K-Y (2008)
Management and analysis of unstructured construction
data types. Advanced Engineering Informatics 22: 15-27.
Song Q and Shepperd M (2006) Mining web browsing patterns for E-commerce. Computers in Industry 57:
622-630.
Spaeth A and Desmarais MC (2013) Combining collaborative filtering and text similarity for expert profile recommendations in social websites, user modeling, adaptation,
and personalization. In: 21st International Conference
UMAP 2013 (eds. S Carberry, S Weibelzahl, A Micarelli
and G Semeraro), Rome, Italy, June 10-14, Rome: Proceedings, 7899, pp. 178-189.
Su Z, Kogan J and Nicholas C (2010) Constrained clustering
with k-means type algorithms. In: Berry MW and Kogan J
(eds.) TM: applications and theory. Wiley, pp. 81-104.
Thelwall M, Thelwall S and Fairclough R (2006) Automated
web issue analysis: A nurse prescribing case study. Information Processing and Management 42: 1471-1483.
Thorleuchter D and Van den Poel D (2012) Predicting
e-commerce company success by mining the text of its
publicly-accessible website. Expert Systems with Applications 39: 13026-13034.
Tuffery S (2011) DM and statistics for decision making.
Wiley.
Tunalı V and Bilgin TT (2012) PRETO: A high-performance
TM tool for preprocessing Turkish texts. In: 13th International Conference on Computer Systems and Technologies (CompSysTech) Ruse, Bulgaria: 22-23 June,
pp. 134-140.
Weiss SM, Indurkhya N, Zhang T and Damerau F (2010)
TM: predictive methods for analyzing unstructured
information. USA: Springer.
Wong T-L and Lam W (2009) An unsupervised method for
joint information extraction and feature mining across
different websites. Data & Knowledge Engineering
68: 107-125.
About the author
Esra Kahya-Ozyirmidokuz is an Assistant Professor at
Erciyes University Kayseri Vocational College, Turkey.
She received her BS in control and computer engineering
from Erciyes University in 1999, and her MS and PhD in
production and marketing from Erciyes University in
2003 and 2009, respectively. Her research interests are
in Internet programming, animation design, programming,
fuzzy logic, and knowledge management systems. Contact:
Erciyes University, Kayseri Vocational College, Computer
Technologies and Programming Department, Erciyes University, Talas, Kayseri, Turkey, Tel (W): 9 (0) 352 437 49
15. Email: esrakahya@erciyes.edu.tr