
Article

Analyzing unstructured Facebook social network data through web text mining: a study of online shopping firms in Turkey

Information Development 1–12
© The Author(s) 2014
Reprints and permission: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0266666914528523
idv.sagepub.com

Esra Kahya-Özyirmidokuz
Erciyes University, Turkey

Abstract
The large amounts of Facebook social network data which are generated and collected need to be analyzed for valuable decision-making information about shopping firms in Turkey. In addition, analyzing social network data from outside the firm has become a critical business need for firms which actively use Facebook. To gain a competitive advantage, firms must translate social media texts into something more quantitative in order to extract information. In this study, web text mining techniques are used to determine popular online shopping firms' Facebook patterns. For this purpose, 200 popular Turkish companies' web URLs are used. Web text mining through natural language processing techniques is examined, and similarity analysis and clustering are performed. Consequently, the clusters of the Facebook websites and the relationships and similarities of the firms are obtained.
Keywords
text mining, web mining, Facebook, knowledge discovery in databases, data mining, information extraction,
online shopping, Turkey

Facebook plays a very important role in online purchasing in Turkey.


Introduction
The web seems to be too huge for effective data warehousing and data mining (DM). The size of the web is
in the order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place
most of their publicly accessible information on the web.
It is barely possible to set up a data warehouse to replicate, store, or integrate all of the data on the web (Han and Kamber, 2006). Applying text mining (TM) techniques to web data is extremely useful for finding interesting, previously unknown hidden patterns in massive datasets, which can then be defined more precisely. Furthermore, it is believed that applying TM to social media data can yield interesting findings on human behavior and human interaction (He, Zha and Li, 2013).
Online shopping has become very popular with the rapid growth of the Internet in recent years in Turkey. This popularity is intensifying the competition. As the competition becomes fiercer, it is vitally important to analyze social web data in order to stand out in the global competition.
Firms use social media to build business, sell products and support the economy. Social media websites (Facebook, LinkedIn, Google, Twitter, etc.) help consumers to decide what to buy. A business can compare its social media data with the social media data of its competitors to gain perspective on its performance. The comparison can help a business to identify weaknesses, find new opportunities and adjust its social media strategy. One of the main techniques used in social media competitive analysis is TM, which provides capabilities to analyze a large amount of complex textual data on social media.

Corresponding author:
Esra Kahya-Özyirmidokuz, PhD, Assistant Professor, Erciyes University, Kayseri Vocational College, Computer Technologies and Programming Department, Talas, Kayseri, Turkey. Tel (W): +90 (0) 352 437 49 15. Email: esrakahya@erciyes.edu.tr

Traditionally, TM focuses on analyzing an organization's
internal textual data. As web applications and social
media become increasingly prevalent, using TM to
analyze textual data from outside the organization
becomes a critical business need and is expected to
provide richer analysis and better support for decision
makers. The recent big data trend also demonstrates how important it is for organizations to develop the capability to collect, store and analyze both internal and external data, with the purpose of harvesting information for decision making and strategic planning (He, Zha and Li, 2013).
With the rapid growth of Internet technology,
which banishes the geographical barrier between customers and vendors, online shopping has become
increasingly common. Vendors set up their own websites providing basic information for a product and
typically highlight certain important product features
for marketing purposes. Customers can shop over the
Internet by browsing different vendor websites to
learn more about the products they desire (Wong and
Lam, 2009). Consumers decide where to spend their
time and energy according to the content of websites.
Website content determines whether a consumer buys a firm's product or searches for alternatives. Thus, a firm's website content must be highly readable, and each word must carry purpose and meaning. Superfluous words risk boring customers, so a website must have an optimum word count. To keep website texts fresh and of high quality, there should be no redundant words. Consequently, firms need to know exactly which words
should be considered top priority in a website. TM
helps to select the most important words in a website
and helps to find the relationships between them. This
research is also important for the web designs of
firms.
This paper describes an approach to analyzing web data with TM techniques. The aim of this study is to determine popular Turkish online shopping firms' patterns by analyzing Facebook website data. The methods used to perform the analysis are data preprocessing, information extraction, and finally the application of clustering tools. In contrast to previous work, in this study Turkish social online shopping firms' web data are analyzed.

The paper is organized as follows: the next section presents the literature review. Section III provides a brief overview of web and text mining. Section IV presents the knowledge discovery process with the clustering phase. Section V ends the paper with a brief conclusion.

Literature review
One area that is finding increased interest is the analysis of web documents. Roussinov and Zhao (2003)
demonstrate how the World Wide Web can be mined
in a fully automated manner to discover the semantic
similarity relationships among the concepts surfaced
during an electronic brainstorming session, and thus
improve the accuracy of their application. Mladenic and Grobelnik (2003) describe feature subset selection used in text learning and give a brief overview of feature subset selection methods commonly used in machine learning. They collected data from Yahoo, a large text hierarchy of web documents, and analyzed them with machine learning methods.
Thelwall, Thelwall and Fairclough (2006) apply web issue analysis to the topic of nurse prescribing. They used web TM and collected data via URL analysis to identify common direct and indirect sources of web information. The results are presented in descriptive form, and a qualitative analysis is used to argue that new information has been found. Chau
and Chen (2008) propose a machine-learning-based
approach that combines web content analysis and web
structure analysis. They represent each web page by a
set of content-based and link-based features, which
can be used as the input for various machine learning
algorithms.
Carullo, Binaghi, and Gallo (2009) propose a novel
heuristic online document clustering model that can
be specialized with a variety of text-oriented similarity measures. An experimental evaluation of the proposed model is conducted in the e-commerce domain.
Li and Wu (2010) study online forums hotspot detection and forecast using sentiment analysis and TM
approaches. First, they create an algorithm to automatically analyze the emotional polarity of a text and
to obtain a value for each piece of text. Second, this
algorithm is combined with k-means clustering (k is
the number of clusters) and support vector machine
(SVM) to develop an unsupervised TM approach.
They use the proposed TM approach to group the forums into various clusters, with the center of each cluster representing a hotspot forum within the current
time span. De Maziere and Van Hulle (2011) used
TM to cluster a 7,000-document inventory in order
to evaluate the impact of EU-funded research in the
social sciences and humanities on EU policies. They compared two visualization techniques, multidimensional scaling and the self-organizing map, and one of the latter's derivatives, the U-matrix.
There are several examples of web/TM approaches.
Song and Shepperd (2006) view the topology of a website as a directed graph, and use a user's access information on all URLs of a website as features to characterize the user, and all users' access information on a URL as features to characterize the
URL. The user clusters and web page clusters are
discovered by both vector analysis and fuzzy set
theory-based methods. The frequent access paths are
recognized based on web page clusters and take into
account the underlying structure of a website. Soibelman et al. (2008) present text-based, web-based,
image-based, and network-based construction data
as examples of their vision for possible advanced
data management and analysis on these unstructured
types of data. Several required steps essential for
achieving this goal are presented and discussed,
including: problem investigation to define relevant
issues; data preparation to remove features not carrying useful information; data representation to transform data sources into new precise and concise
descriptions; and proper data analysis for information search or for extraction of new knowledge.
Wong and Lam (2009) developed an unsupervised
learning framework which can jointly extract information and conduct feature mining from a set of web
pages across different sites. Their approach allows
tight interaction between the tasks of extraction and
mining. Engel, Whitney, and Cramer (2010) developed algorithms to find and characterize changes in
topic within text streams.
Schedl, Widmer, Knees and Pohle (2011) developed novel techniques as well as refining existing
ones in order to automatically gather information
about music artists and bands. After the searching,
retrieval, and indexing of web pages that are related
to a music artist or band, web content mining and
music information retrieval techniques are applied
to capture the following categories of information:
similarities between music artists or bands, prototypicality of an artist or a band for a genre, descriptive
properties of an artist or a band, band members and
instrumentation, and images of album cover artwork.
Hu and Liu (2012) present one real-world application
to further illustrate how to utilize text analytics
to solve problems in social media applications.
Thorleuchter and Van den Poel (2012) introduced a new web mining approach that enables automated identification of new technological ideas, extracted from Internet sources, that are able to solve a given problem. It adapts and combines several existing approaches
from the literature: approaches that extract new technological ideas from a user given text, approaches that
investigate the different idea characteristics in different
technical domains, and multi-language web mining
approaches. Choi and Kim (2013) propose a social relation extraction system based on dependency kernels of
SVMs. The system first classifies input sentences into
two groups: one with social relations and the other without social relations. The system then extracts relation
names from sentences in the group with social relations.
To effectively perform these two processes, they
designed new SVM kernels, referred to as dependency
trigram kernels. Spaeth and Desmarais (2013) recently explored TM to improve classical collaborative filtering methods for a site aimed at matching people who are looking for expert advice on a specific topic. They compare results from an LSA (Latent Semantic Analysis)-based text similarity analysis, a simple user-user collaborative filter, and a combination of both methods, for a knowledge-sharing website.
There are many approaches to TM, which can be
classified from different perspectives, based on the
inputs taken in the TM system and the DM tasks to
be performed (Han and Kamber, 2006). In the literature, k-nearest neighbors, naive Bayes classifiers, SVMs, logit boost, latent semantic indexing, radial basis function networks, and Kohonen Self-Organizing Maps are among the methods used.

Mining unstructured data


Text comes from many different sources, such as the following (Berry and Linoff, 2011):

• E-mails sent by customers
• Notes entered by customer service representatives, doctors, nurses, garage mechanics, and other concerned people
• Transcriptions (voice-to-text translation) of customer service calls
• Comments on websites
• Newspaper and magazine articles
• Professional reports

A firm, for example, may have a large database of unstructured text, such as customer comments. Unstructured information is generally text-based, but may contain numbers, dates, etc. These text data need to be analyzed to extract valuable information.

Furthermore, the modeling of text data is completely different from that of quantitative data.
DM is a business process for exploring large
amounts of data to discover meaningful patterns
and rules. Text is data, so it should be useful for
DM purposes (Berry and Linoff, 2011). TM is an
emerging technology that attempts to extract meaningful information from unstructured textual data.
It is an extension of DM to textual data. To glean
useful information from a large number of textual documents quickly, it has become imperative to use automated computer techniques. TM is focused on finding
useful models, trends, patterns, or rules from unstructured textual data such as text files, HTML files, chat
messages and emails. As an automated technique,
TM can be used to efficiently and systematically
identify, extract, manage, integrate, and exploit
knowledge from texts. Unlike traditional content
analysis, TM is mainly data driven and its main purpose is to automatically identify hidden patterns or
trends in the data and then create an interpretation
or models that explain interesting patterns and trends
in the textual data (He, Zha and Li, 2013).
TM is the set of techniques and methods used for the
automatic processing of natural language text data
available in reasonably large quantities in the form of
computer files, with the aim of extracting and structuring their contents and themes, for the purposes of rapid
(non-literary) analysis, the discovery of hidden data, or
automatic decision making. It is different from stylometry, which studies the style of texts in order to identify an author or date the work, but it has much in
common with lexicometry or lexical statistics (linguistic statistics or quantitative linguistics); indeed, it is
an extension of the latter science using advanced methods of multidimensional statistics (Tuffery, 2011).
In short: TM = lexicometry + DM.

Like DM, TM originated partly in response to the


huge volume of text data created and diffused in our
society, and partly for the purpose of quasi-generalized
input and storage of these data in computer systems. It
also owes its acceptance to developments in statistical
and data processing tools whose power has increased
greatly in recent years (Tuffery, 2011).
As in DM, there are two types of methods in TM.
Descriptive methods can be used to search for themes
dealt with in a set (corpus) of documents, without
knowing these themes in advance. Predictive methods
find rules for automatically assigning a document to

one of a number of predefined themes. This may be done, for example, for the purpose of automatically forwarding a letter to the appropriate department. The corpus analyzed must meet the following conditions (Tuffery, 2011):

• It must be in a data processing format
• It must include a minimum number of texts
• It must be sufficiently comprehensible and coherent
• There must not be too many different themes in each text
• It must avoid, as far as possible, the use of innuendo, irony and antiphrasis

TM is also used to discover hidden information (descriptive methods), for example new research fields (in the field of patents), or information to be added to marketing databases on customers' areas of interest and plans. It can even be used by a business wishing to communicate with its customers in the vocabulary that they use, and to adapt its marketing presentations to each customer segment. It can be used in search engines on the web. Finally, TM is an aid to decision making (predictive methods), for example in automatic mail routing, email filtering and news filtering. The discovery of hidden information and decision making are mainly classed as forms of information retrieval, while quick analysis is a form of information extraction. Information extraction is a search for specific information in the documents, without any comparison of the documents, taking the order and proximity of words into account to discriminate between different statements which have identical keywords. It is only concerned with themes related to the target database. Information extraction starts with natural-language data and uses them to build up a structured database. It is a matter of scanning the natural language text to detect words or phrases corresponding to each field of the database. The analysis is local. In one sense, information extraction is a more complex process, because it requires the use of lexical and morpho-syntactic analysis to recognize the constituents of the text, their nature and their relationships (Tuffery, 2011).

Web text mining

Web TM is a new issue in the knowledge discovery research field. It is aimed at helping people to discover knowledge from large quantities of semi-structured or unstructured text on the web. Similar to DM approaches, web TM requires a technique that can automatically analyze data and extract relationships or patterns from large collections of web text by using specific algorithms. A retrieval system finds documents that help the users satisfy their information needs (Ding and Fu, 2012).

Just as market baskets provide useful information on the associations of products that have been bought, so that a store can adjust its stocks, the analysis of a web user's movements on a website can supply valuable information to those who know how to use it: this is the aim of web mining (WM), the application of statistics and DM to web browsing data. WM covers a range of methods, from the simple counting of visits to a page to the modeling of users' movements on the site (Tuffery, 2011).

Using WM, we can do the following (Tuffery, 2011):

• Optimize browsing on a site by analyzing the behavior of users, in order to maximize their convenience, increase the number of pages viewed and enhance the impact of links and advertising banners.
• Identify the focus of interest, and therefore the expectations, of users visiting the site.
• Improve the business's knowledge of the customers who log on under their own names, by matching their browsing data with their personal data held by the site owner.

Effective DM operations are predicated on sophisticated data preprocessing methodologies. In fact, TM is arguably so dependent on the various preprocessing techniques that infer or extract structured representations from raw unstructured data sources, or do both, that one might even say TM is to a degree defined by these elaborate preparatory techniques. Certainly, very different preprocessing techniques are required to prepare raw unstructured data for TM than those traditionally encountered in knowledge discovery operations aimed at preparing structured data sources for classic DM applications. A large variety of TM preprocessing techniques exist. All in some way attempt to structure documents and, by extension, document collections. Quite commonly, different preprocessing techniques are used in tandem to create structured document representations from raw textual data. As a result, some typical combinations of techniques have evolved in preparing unstructured data for TM (Feldman and Sanger, 2007).
Experimental design

The aim of this research is to analyze popular Turkish online shopping firms' patterns by analyzing web data. Figure 1 shows the main process of the study.

Figure 1. The main process.

Two hundred popular companies' Facebook websites are analyzed; these were selected from the top 200 Turkish results of the Google search engine (www.google.com.tr). The search term was 'shopping websites'. The web and Facebook URLs of these firms were collected; 18 shopping firms were then eliminated from the database because they did not have Facebook websites. In addition, the numbers of fans and comments were collected manually. Statistics on fans and comments can be seen in Figure 2, and the numbers of fans and comments of the top firms can be seen in Appendix A. Fifty-six firms which did not have any Facebook fans were eliminated from the URL dataset. The new database contains 76 firms' website URLs.

Figure 2. Facebook websites statistics.

Text data are collected automatically from the firms' Facebook websites with the RapidMiner web mining tool. The Get Pages operator is used to store the web data of a given link in a new attribute; it retrieves the pages whose URLs are contained in the input dataset. The data are then transformed into a collection of documents by generating a document for each record in the new database.
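Outside RapidMiner, the same retrieval step can be approximated in a few lines of Python. The sketch below is illustrative only; the file name facebook_urls.txt and the helper get_pages are hypothetical stand-ins for the operator described above:

```python
# Minimal sketch of the page-retrieval step (hypothetical, not the paper's tooling).
import urllib.request

def get_pages(urls):
    """Fetch the raw HTML of each URL, loosely mirroring RapidMiner's Get Pages."""
    documents = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                documents[url] = response.read().decode("utf-8", errors="replace")
        except OSError as err:          # network errors, bad URLs, timeouts
            print(f"skipped {url}: {err}")
    return documents

with open("facebook_urls.txt", encoding="utf-8") as f:   # one URL per line
    urls = [line.strip() for line in f if line.strip()]
docs = get_pages(urls)                  # one document per firm, as in the study
```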

Text processing
When the analysis of texts and their themes is completed, we can filter the themes or terms to be examined. We can use either a statistical criterion (selecting terms and themes by their frequency), a semantic criterion (centered on a given subject), or a corpus criterion (identifying offensive words to avoid, and their derivations, in order to clean up a document). With the statistical criterion, we can use a number of weighting rules, for example preferring terms which appear frequently but in few texts (weight = frequency of the term / number of texts containing it). These terms, having been disambiguated, labeled, lemmatized (algorithmically reduced to their base form), grouped and selected, are then treated with DM methods, with the individuals (in the statistical sense) being the texts or documents (e.g. mails) and the characters of the individuals (their variables) being the themes or terms in the documents. Thus we can produce lexical tables in which each cell C_ij is the number of occurrences of term j (or an indicator of presence/absence) in document i, to which the conventional statistical methods are applied. The cell C_ij can also be the number of occurrences of term j in the set of documents relating to customer i (letters, reports of interviews, etc.). These tables can be processed by correspondence analysis, which simplifies the problem by reducing the initial variables (corresponding to the terms), often present in very large numbers (several thousand) even though the preliminary transformations may have decreased their number, to about a hundred factors (which no longer correspond to terms: this is the drawback of the method). At the end of this transformation, continuous variables will have been substituted for the initial discrete variables, and conventional data analysis techniques can be used: classification, clustering, etc. (Tuffery, 2011).
The first step in analyzing text using statistical algorithms is to convert the text to numbers (Engel, Whitney and Cramer, 2010). There are some obvious questions to ask first. What are the most common words in the text? (Usually function words such as determiners, prepositions and complementizers.) How many words are in the text? This question can be interpreted in two ways. The question about the sheer length of the text is answered by asking how many word tokens there are; the question of how many different words appear in the text is answered by counting word types. In general, one can talk about tokens, the individual occurrences of something, and types, the different things present. One can also calculate the ratio of tokens to types, which is simply the average frequency with which each type is used. Many words occur only once in a given text; such words are referred to as hapax legomena, Greek for 'read only once'. Even beyond these words, note that the vast majority of word types occur extremely infrequently. Nevertheless, very rare words make up a considerable proportion of the text (Manning and Schutze, 1999).
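As a toy illustration of tokens, types and hapax legomena, consider the following sketch; the sample string is invented for the example:

```python
# Token/type counts and hapax legomena for one (invented) document.
from collections import Counter
import re

sample = "kampanya indirim kargo kampanya indirim kampanya"  # hypothetical text
tokens = re.findall(r"\w+", sample.lower())
type_counts = Counter(tokens)
hapax = [w for w, n in type_counts.items() if n == 1]  # words read only once

print(len(tokens))                    # 6 word tokens
print(len(type_counts))               # 3 word types
print(len(tokens) / len(type_counts)) # average frequency per type: 2.0
print(hapax)                          # ['kargo'], a hapax legomenon
```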
Words are attributes and documents are examples, and together these form a sample of data (Weiss, Indurkhya, Zhang and Damerau, 2010). The words in the database must be examined to understand the solutions. When we look at the database, the words 'share' and 'comment' are the most used, because of Facebook's structure.
Text generally requires more preparation than the structured data used for other types of DM. After identifying the source of the text, the biggest challenge is to parse it into words and phrases (Berry and Linoff, 2011). In order to obtain all the words that are used in a given text, a tokenization process is required. Figure 3 shows the document processing step in RapidMiner.

Figure 3. Document processing with RapidMiner.

Tokenization involves breaking up the units of text into individual words or tokens. This process can take many forms, depending on the language being analyzed (Miner, Delen, Elder, Fast, Hill, Nisbet et al., 2012). In addition, in order to reduce the size and the dimensionality of the documents, the documents can be reduced by filtering and stemming methods.

Term Frequency-Inverse Document Frequency (TF-IDF) weighting approach

In the text preprocessing process, RapidMiner is used for the Tokenization and Transform Cases steps and generates word vectors from each text object. TF-IDF is used for vector creation in the document processing step. The Transform Cases step turns all letters to lower case.
As seen in equations (1) and (2), TF-IDF is a numerical statistic which reflects how important a word is to a document in a collection:

$$\mathrm{tf}(t,d) = \frac{f(t,d)}{\max\{f(w,d) : w \in d\}} \qquad (1)$$

$$\mathrm{idf}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|} \qquad (2)$$
After this step, we obtain TF-IDF scores for 20,959 words, with attribute name, total occurrences and document occurrences. RapidMiner outputs two datasets, the process documents set and the process documents wordlist. Their attributes are, respectively:

{row no., text, link, URL, response-code, response-message, content-type, content-length, date, last-modified, expires, title, language, description, keywords, robots, id, and the words which are used in the documents},

{word, attribute name, total occurrences, document occurrences}.
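A minimal sketch of equations (1) and (2) written out in Python may help; the three tokenized documents are invented for illustration, and the code assumes the queried term occurs in at least one document (otherwise the idf denominator would be zero):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Equation (1): raw count scaled by the most frequent term in the document."""
    counts = Counter(doc_tokens)
    return counts[term] / max(counts.values())

def idf(term, corpus):
    """Equation (2): log of total documents over documents containing the term."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

corpus = [["ayakkabi", "indirim", "indirim"],   # hypothetical tokenized documents
          ["ayakkabi", "kargo"],
          ["indirim", "kupon", "kupon"]]
weight = tf("indirim", corpus[0]) * idf("indirim", corpus)
print(weight)   # TF-IDF weight of 'indirim' in the first document
```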

Filter stop-words and stemming

Text preprocessing aims to make the input documents more consistent, to facilitate the text representation which is necessary for most text analytic tasks. Traditional text preprocessing methods include stop-word removal and stemming. Stop-word removal eliminates words using a stop-word list, in which the words are considered more general and meaningless; stemming reduces inflected (or sometimes derived) words to their stem, base or root form (Hu and Liu, 2012).

Stop-words refer to words that have little meaning; here, meaning refers to the ability to differentiate between documents. Stop-word lists can be found on the web, developed manually, or even derived automatically. In the last case, this is a search for words that appear throughout the corpus, independent of any relationship to the documents. Although stop-words are often common words (for example, 'of', 'through', 'and'), less common examples include 'nevertheless', 'differently', and 'furthering' (Berry and Linoff, 2011).
RapidMiner does not support filtering Turkish stop-words. Algorithm 1, the algorithm for filtering Turkish stop-words, uses a stop-words database of 220 Turkish stop-words such as 'icin', 'acaba', 'yoksa' and 'ama'. In addition, HTML codes are also removed from the database, as can be seen in step 6 of Algorithm 1. Active Server Pages (ASP) with VBScript and SQL Server are used.
Algorithm 1. Filtering stop-words

1. numberofwords ← 20959, say ← 0
2. for i ← 1 to numberofwords
3.   read Word
4.   if Word length = 1 then remove the line and say ← say + 1
5.   if Word is in Stop table (including 've', 'ancak', 'ama', 'mı', 'mi', 'de', 'da', 'belki', etc.) then remove the line and say ← say + 1
6.   if Word is in WebCode table (including 'http', 'www', 'com', 'aspx', 'asp', etc.) then remove the line
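For readers who do not use ASP and SQL Server, the following Python sketch mirrors the logic of Algorithm 1; the stop-word and web-code sets shown are small excerpts (the paper's Stop table holds 220 words), so treat them as placeholders:

```python
# Python sketch of Algorithm 1; the sets are illustrative excerpts only.
turkish_stopwords = {"ve", "ancak", "ama", "mi", "de", "da", "belki",
                     "icin", "acaba", "yoksa"}     # the paper's table has 220 entries
web_codes = {"http", "www", "com", "aspx", "asp"}  # leftover HTML/URL fragments

def filter_words(words):
    kept, removed = [], 0                # 'removed' plays the role of the 'say' counter
    for w in words:
        if len(w) <= 1 or w in turkish_stopwords:
            removed += 1                 # steps 4-5: drop one-letter and stop-words
        elif w in web_codes:
            pass                         # step 6: drop web-code tokens (not counted)
        else:
            kept.append(w)
    return kept, removed

kept, removed = filter_words(["ve", "indirim", "www", "a", "kargo"])
print(kept, removed)                     # ['indirim', 'kargo'] 2
```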
Stemming is the process of reducing words to their stem, the base word or almost-word that carries the meaning without additional grammatical information. The purpose of stemming is to better capture the content of a document (Berry and Linoff, 2011). Porter's stemming algorithm, proposed in 1980, is one of the most popular stemming methods, and many modifications and enhancements of the basic algorithm have been made and suggested. Porter designed a detailed stemming framework known as Snowball, whose main purpose is to allow programmers to develop their own stemmers for other character sets or languages (Jivani, 2011). Currently there is a Snowball implementation for the Turkish language. Tunalı and Bilgin (2012) conclude that there is no significant evidence that stemming always improves the quality of clustering for texts in Turkish. However, when stemming is used, the dimensionality of the document-term matrix decreases dramatically without adversely affecting clustering performance. In this paper, stemming is done with the RapidMiner program using the Turkish Snowball algorithm.
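The Turkish Snowball stemmer is also available in open-source form outside RapidMiner; here is a sketch assuming the Python snowballstemmer package (an assumed substitute, not the tool used in the paper):

```python
# Stemming sketch, assuming the 'snowballstemmer' package, which ships a
# Turkish stemmer built in Porter's Snowball framework.
import snowballstemmer

stemmer = snowballstemmer.stemmer("turkish")
words = ["kitaplar", "indirimler", "kuponlar"]  # invented sample tokens
print(stemmer.stemWords(words))                 # stems; exact forms depend on the algorithm
```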
It has been shown that the quality of information can be improved, when full automation is not necessary, by additional manual filtering to remove anomalies, or to classify random samples (Thelwall, Thelwall and Fairclough, 2006). Because RapidMiner takes Facebook documents and breaks them down into their word components, some wrongly written words in Facebook are added to the database. For example the words 'detay' and 'detylar', which have the same meaning, are merged in the database. We must be careful while doing this, because sometimes the root of a word represents a trademark. Additionally, because the tokenize step splits the text of a document into a sequence of tokens, some words are broken up incorrectly. For example the word 'Coca-Cola' is divided into 'Coca' and 'Cola', and the word 't-shirt' is broken up as 't' and 'shirt'. In such cases these problems are corrected and the words are merged manually. Additionally, seasonal data stacks may occur: in May, for example, the frequency of the word 'mother' will be high because of Mother's Day.

We manually remove 560 redundant words which are meaningless in Turkish from the database. Consequently, we extract 946 top keywords. A keyword extraction process, in which a limited number of words is selected to capture the purpose of the whole document, is carried out; it provides a compact description of a Facebook website's content. These words will be a firm's top priority on a website.

Clustering
Clustering is a fundamental data analysis task that has
numerous applications in many disciplines. Clustering
can be broadly defined as the process of partitioning a
dataset into groups, or clusters, so that elements of the
same cluster are more similar to each other than to elements of different clusters (Su, Kogan and Nicholas, 2010). When needing to make sense of a large set of
data, the data can be broken down into smaller groups
of observations that share something in common.
Knowing the contents of these groups helps in understanding the entire data set. Clustering is a widely used
and flexible approach to analyzing data, in which
observations are automatically organized into groups
(Myatt and Johnson, 2009).
Document clustering is one of the most crucial
techniques for organizing documents in an unsupervised manner. When documents are represented as
term vectors, clustering methods can be applied. If our
goal is prediction, then similarity measures, as embodied in information retrieval and nearest-neighbor
methods, are usually not the first choice. They are
unable to simplify a solution by directly learning from
labeled examples. Yet, when we discuss document
clustering, similarity measures assume a prominent
role. Here the data are unlabeled and similarity of
documents defines the characteristics for assigning
labels. So, to cluster documents, it is natural to reexamine the information retrieval techniques and their
integral similarity measures (Weiss, Indurkhya, Zhang
and Damerau, 2010).
A good clustering should group together similar
objects and separate dissimilar ones. Therefore, the
clustering quality function is usually specified in
terms of a similarity function between objects. In
fact, the exact definition of a clustering quality
function is rarely needed for clustering algorithms
because the computational difficulty of the task makes
it unfeasible to attempt to solve it exactly. Therefore,
it is sufficient for the algorithms to know the similarity function and the basic requirement that similar objects belong to the same clusters and dissimilar objects to separate ones. A similarity function takes a pair of objects and produces a real value that is a measure of the objects' proximity. To do so, the function must be able to compare the internal structure of the objects (Feldman and Sanger, 2007).
The most popular metric is the usual Euclidean distance, shown in equation (3):

$$D(X_i, X_j) = \sqrt{\sum_k (X_{ik} - X_{jk})^2} \qquad (3)$$

which is the particular case, with $p = 2$, of the Minkowski metric shown in equation (4):

$$D_p(X_i, X_j) = \left(\sum_k |X_{ik} - X_{jk}|^p\right)^{1/p} \qquad (4)$$

Table 1. Examples of similarity analysis outputs.

Facebook URL-1 id number | Facebook URL-2 id number | Similarity ratio
29 | 27 | 1
29 | 32 | 0
 1 | 25 | 0.237
 6 |  5 | 0.125
21 | 16 | 0.021
21 | 19 | 0.051
22 | 13 | 0.074

For text document clustering, the cosine similarity measure given in equation (5) is the most common:

$$\mathrm{Sim}(X_i, X_j) = X'_i \cdot X'_j = \sum_k X'_{ik} X'_{jk} \qquad (5)$$

where $x' = x / \|x\|$ is the normalized vector. There are many other possible similarity measures suitable for particular purposes (Feldman and Sanger, 2007).
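Written out from equation (5), cosine similarity is simply the dot product of the two length-normalized vectors; a small self-contained sketch with invented TF-IDF vectors:

```python
import math

def cosine_similarity(x, y):
    """Equation (5): dot product of the L2-normalized vectors x' and y'."""
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)

# Two hypothetical TF-IDF vectors over the same wordlist:
print(cosine_similarity([0.4, 0.0, 0.1], [0.2, 0.3, 0.0]))  # in [0, 1] for TF-IDF inputs
```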
RapidMiner is used to measure and understand the
similarity of each website to every other website. The
measure to use for calculating similarity can be specified through the parameters. Four types of measures
are provided: mixed measures, nominal measures,
numerical measures and Bregman divergences.
Cosine similarity, which is a numerical measure, is
used in the analysis.
Relationship extraction is the identification of the link between two entities within a sentence (Last and Kandel, 2010). The Data to Similarity Data operator is used to calculate the similarity among all examples of the dataset; examples are even compared to themselves. The similarity measures of the Turkish Facebook websites were computed, and some examples of the similarity analysis outputs are illustrated in Table 1. We can see from Table 1 that there is very little (0.051) similarity between the two biggest Turkish computer and technology shopping firms. Figure 4 (a, b) shows the similarity graphs in different forms. A k-distance plot displays, for a given value of k, the distances from all points to the kth nearest neighbor, sorted and plotted; Figure 5 shows the k-distance plot for k = 3. In addition, the similarity graphs of the websites for every word used in the documents are obtained.
Turkish shopping websites are clustered with the k-means clustering algorithm. Two clusters are detected, with 12 and 63 items. Figure 6 shows the centroid plot view of the two clusters. These clusters contain the firms' website URLs, each with its documents. RapidMiner calculates the Cluster Number Index, which is 0.973. The common feature of the Cluster-1 website documents is that they have nearly the same texts.
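The clustering step can also be sketched outside RapidMiner, for example with scikit-learn's KMeans; the random matrix below is a hypothetical stand-in for the firms-by-keywords TF-IDF table, so its cluster sizes will not match the 12 and 63 reported above:

```python
# k-means sketch of the clustering step (the paper uses RapidMiner's k-means).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tfidf_matrix = rng.random((76, 946))   # stand-in for 76 firms x 946 keywords
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf_matrix)
print(np.bincount(kmeans.labels_))     # items per cluster
```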
Many TM procedures produce unlabeled textual
results (e.g. groups of interrelated terms that describe
features contained in the original input dataset). In order
to draw potentially useful conclusions, further interpretation of these results is necessary. This often requires a
great deal of time and effort on the part of human analysts. Visual postprocessing tools tailored for specific
TM packages can therefore greatly facilitate the analysis process (Puretskiy, Shutt and Berry, 2010).
A performance operator, which can be used to derive a performance measure (in the form of a performance vector) from the dataset, is applied. In conclusion, the performance of the similarity model is 75 percent and the Cluster Number Index is 0.973.

Conclusions
With the rapid growth of online social media services, it is necessary to analyze the impact of social media on companies in order to stand out in the global competition. Facebook plays a very important role in online purchasing in Turkey. Web TM provides an effective way to meet firms' diverse information needs. The information obtained from the analysis can be used to make business decisions, and firms may develop strategies for enhancing their profits on this basis. In addition, this research is also important for the web designs of firms.

In this study a TM model was developed to extract useful clusters from the Facebook websites of online shopping firms in Turkey. The online shopping firms' Facebook websites which are indexed in Google were analyzed. The RapidMiner web mining tool was used to collect data, and the data were then transformed into a collection of documents by generating a document for each record. In addition, analysis of the content of the Facebook web documents for filtering, summarization and grouping words into meaningful units was implemented. Tokenization, case transformation, filtering and stemming processes were carried out. Similarity analysis was used to determine similar websites, and graphs and tables were obtained. Additionally, words which are suitable for use on websites were detected.

Alternative TM techniques, using a tool developed for preprocessing Turkish texts, can be studied in future research to compare various approaches and to implement this framework.

Figure 4. Similarity graphs with RapidMiner.

Figure 5. k-distance plot graph with RapidMiner.

Figure 6. Centroid plot view of clusters.

Appendix A: Top 20 firms' Facebook values

Id number | Fans | Comments
1  | 1464031 | 37501
2  | 1462243 | 27472
3  | 1308049 | 38862
4  | 932839  | 29001
5  | 866163  | 3785
6  | 723368  | 13329
7  | 419067  | 10491
8  | 301989  | 2125
9  | 274966  | 30301
10 | 253164  | 30101
11 | 232904  | 9573
12 | 211887  | 14202
13 | 186355  | 8635
14 | 151297  | 3724
15 | 150864  | 1132
16 | 130142  | 474
17 | 129321  | 380
18 | 124819  | 1207
19 | 120546  | 2393
20 | 119399  | 1608

References

Berry MJA and Linoff GS (2011) Data mining techniques: for marketing, sales, and customer relationship management, 3rd edn. New York: Wiley.

Carullo M, Binaghi E and Gallo I (2009) An online document clustering technique for short web contents. Pattern Recognition Letters 30: 870–876.

Chau M and Chen H (2008) A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems 44: 482–494.

Choi M and Kim H (2013) Social relation extraction from texts using a support-vector-machine-based dependency trigram kernel. Information Processing and Management 49: 303–311.

De Maziere PA and Van Hulle MM (2011) A clustering study of a 7000 EU document inventory using MDS and SOM. Expert Systems with Applications 38: 8835–8849.

Ding Y and Fu X (2012) The research of text mining based on self-organizing maps. 2012 International Workshop on Information and Electronics Engineering (IWIEE). Procedia Engineering 29: 537–541.

Engel D, Whitney P and Cramer N (2010) Events and trends in text streams. In: Berry MW and Kogan J (eds) Text mining: applications and theory. Wiley, pp. 165–182.

Feldman R and Sanger J (2007) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press.

Han J and Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Elsevier Science and Technology.

He W, Zha S and Li L (2013) Social media competitive analysis and text mining: a case study in the pizza industry. International Journal of Information Management 33(3): 464–472.

Hu X and Liu H (2012) Text analytics in social media. In: Aggarwal CC and Zhai CX (eds) Mining text data. Springer, pp. 385–414.

Jivani AG (2011) A comparative study of stemming algorithms. International Journal of Computer Technology and Applications 2(6): 1930–1938.

Last M and Kandel A (2010) Web intelligence and security: advances in data and text mining techniques for detecting and preventing terrorist activities on the web. NATO Science for Peace and Security Series: Information and Communication Security. IOS Press.

Li N and Wu DD (2010) Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems 48: 354–368.

Manning C and Schutze H (1999) Foundations of statistical natural language processing. Cambridge: MIT Press.

Miner G, Delen D, Elder J, Fast A, Hill T, Nisbet B, et al. (2012) Practical text mining and statistical analysis for non-structured text data applications. Elsevier Inc.

Mladenic D and Grobelnik M (2003) Feature selection on hierarchy of web documents. Decision Support Systems 35: 45–87.

Myatt GJ and Johnson WP (2009) Making sense of data III: a practical guide to data visualization, advanced data mining methods, and applications. Canada: Wiley.

Puretskiy AA, Shutt GL and Berry MW (2010) Survey of text visualization techniques. In: Berry MW and Kogan J (eds) Text mining: applications and theory. Wiley, pp. 105–128.

Roussinov D and Zhao JL (2003) Automatic discovery of similarity relationships through web mining. Decision Support Systems 35: 149–166.

Schedl M, Widmer G, Knees P and Pohle T (2011) A music information system automatically generated via web content mining techniques. Information Processing and Management 47: 426–439.

Soibelman L, Wu J, Caldas C, Brilakis I and Lin K-Y (2008) Management and analysis of unstructured construction data types. Advanced Engineering Informatics 22: 15–27.

Song Q and Shepperd M (2006) Mining web browsing patterns for E-commerce. Computers in Industry 57: 622–630.

Spaeth A and Desmarais MC (2013) Combining collaborative filtering and text similarity for expert profile recommendations in social websites. In: User Modeling, Adaptation, and Personalization: 21st International Conference, UMAP 2013 (eds S Carberry, S Weibelzahl, A Micarelli and G Semeraro), Rome, Italy, 10–14 June 2013. Lecture Notes in Computer Science 7899, pp. 178–189.

Su Z, Kogan J and Nicholas C (2010) Constrained clustering with k-means type algorithms. In: Berry MW and Kogan J (eds) Text mining: applications and theory. Wiley, pp. 81–104.

Thelwall M, Thelwall S and Fairclough R (2006) Automated web issue analysis: a nurse prescribing case study. Information Processing and Management 42: 1471–1483.

Thorleuchter D and Van den Poel D (2012) Predicting e-commerce company success by mining the text of its publicly-accessible website. Expert Systems with Applications 39: 13026–13034.

Tuffery S (2011) Data mining and statistics for decision making. Wiley.

Tunalı V and Bilgin TT (2012) PRETO: a high-performance text mining tool for preprocessing Turkish texts. In: 13th International Conference on Computer Systems and Technologies (CompSysTech), Ruse, Bulgaria, 22–23 June, pp. 134–140.

Weiss SM, Indurkhya N, Zhang T and Damerau F (2010) Text mining: predictive methods for analyzing unstructured information. USA: Springer.

Wong T-L and Lam W (2009) An unsupervised method for joint information extraction and feature mining across different websites. Data & Knowledge Engineering 68: 107–125.
About the author
Esra Kahya-Özyirmidokuz is an Assistant Professor at Erciyes University Kayseri Vocational College, Turkey. She received her BS in control and computer engineering from Erciyes University in 1999, and her MS and PhD in production and marketing from Erciyes University in 2003 and 2009, respectively. Her research interests are in Internet programming, animation design, programming, fuzzy logic, and knowledge management systems. Contact: Erciyes University, Kayseri Vocational College, Computer Technologies and Programming Department, Talas, Kayseri, Turkey. Tel (W): +90 (0) 352 437 49 15. Email: esrakahya@erciyes.edu.tr
