
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS 1

Synthetic Social Media Data Generation


Yalin E. Sagduyu, Senior Member, IEEE, Alexander Grushin, and Yi Shi, Senior Member, IEEE

Abstract— This paper presents a novel system, synthetic high-fidelity social media data generator (SHIELD), for generating synthetic social media data. SHIELD jointly generates time-varying, directed, and weighted interaction graph structures and topic-driven text features similar to the input social media data. A synthetic interaction graph is generated by a social network model to minimize the distance to the real graph and is enhanced by adding various patterns, such as anomalies and information cascades, interaction types, and temporal dynamics. A synthetic text generator based on the n-gram Markov model is trained under each topic identified by topic modeling. Synthetic text and graph structures are combined through the assignment of synthetic social media entities. Extensive performance evaluation via graph and text analysis is provided to demonstrate the statistical fidelity of large-scale synthetic data generated by SHIELD. A data evaluation exercise with human participants is executed to identify how difficult it is for a human to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Experimental results followed by a statistical significance analysis showed that human participants cannot reliably distinguish between real and synthetic tweets.

Index Terms— Computational test, dK-2 distance, experiment, graph, graph fitting, interactions, latent Dirichlet allocation (LDA), social media, social network, synthetic data, text, topic modeling, topics.

Manuscript received September 11, 2017; revised March 23, 2018 and June 7, 2018; accepted June 10, 2018. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract W31P4Q-13-C-0055. The views, opinions, and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. (Corresponding author: Yi Shi.)
The authors are with Intelligent Automation, Inc., Rockville, MD 20855 USA (e-mail: ysagduyu@i-a-i.com; agrushin@i-a-i.com; yshi@i-a-i.com).
Digital Object Identifier 10.1109/TCSS.2018.2854668
2329-924X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

SOCIAL media has been growing at a fast pace, with various microblogging and social networking services generating large-scale data. As an example, Twitter has around 336 million monthly active users (reported for the first quarter of 2018 [1]). Social media provides rich data feeds that can be utilized for different purposes, such as marketing and social event prediction. It has changed the way that people interact with each other and respond to events. However, the full capture and understanding of social media is largely missing due to the growing number and type of data feeds with complex interactions among social network actors. In this context, social media analytics (including graph, machine learning, and natural language processing algorithms) play a critical role in extracting, capturing, and assessing information in social media.

In turn, research and development progress in social media analytics depends on the availability of social media data to model and analyze social behavior. Existing small and static data sets cannot keep up with growing social media feeds. Social media data are available through either public application programming interfaces (APIs) or paid data services. However, there are privacy limitations on collecting, sharing, or distributing new social media data (such as Facebook or Twitter data). These limitations may slow down the progress in social network research and social media analytics by granting data access only to a small group of researchers and preventing full data disclosure that would be necessary to verify and further improve the results reported in research findings. Therefore, it is important to generate large-scale synthetic data sets that reflect real-world data sets in terms of statistical or application properties and can be shared with others without privacy concerns because of their anonymized nature. In addition, synthetic data can be used for generating and analyzing the large-scale and high-fidelity behavior of networks of social bots (automated social media posting programs, such as spammers) and for marketing, advertisement, and generating buzz in social media. The synthetic data needed for these purposes should be large-scale, automatically generated, and high fidelity in terms of similarity to real data.

A. Novel Contributions

This paper presents the synthetic high-fidelity social media data generator (SHIELD) system for generating large volumes of synthetic data in social media, such as from microblogging (e.g., Twitter) or social networking (e.g., Facebook) services. SHIELD integrates synthetic graph structures and synthetic text features (while preserving different graph and text patterns, such as anomalies, temporal dynamics, information cascades, and topics) to produce large-scale and anonymized social media data that is similar to input social media (in terms of statistical and application-level properties). SHIELD generates the underlying graph structure and the text content jointly by fitting to the graph, text, statistical, and temporal characteristics of the input data generated by real social media, such as data from microblogging or social networking services. Computational tests demonstrate that SHIELD can generate high-fidelity data that is statistically similar to real data (although realizations are widely different), whereas a data evaluation exercise with human participants demonstrates that participants cannot reliably distinguish synthetic data generated by SHIELD from real data.

Our novel contributions are given as follows.
1) To generate synthetic graph structures, a social interaction graph is generated from social interactions in the given social media data set [2]. Users and other social media entities (such as hashtags in Twitter) are
assigned to vertices in the interaction graph, and social media interactions between two entities (e.g., mention, reply, and retweet in Twitter) are assigned to edges. Then, a synthetic graph similar to the given interaction graph is generated by minimizing the distance (statistical or application-dependent) between the input and synthetic graphs. Synthetic patterns (such as anomalies in the form of high-degree vertices, hubs, and cliques) are added to the synthetic graph. The synthetic graph is generated with multiple connected components and attributes (such as group memberships) and can be fitted to composite social network models and temporal dynamics in the input data.
2) The textual content is extracted from the given social media data set and synthetically generated by dividing data based on topics, training an n-gram model (Markov chain) for each topic, sampling text data from the models, and filtering synthetic social media posts with identical text or with grammar mistakes. Social media entities (e.g., hashtags and hyperlinks in tweets) are added with the same distributions as in the input data.
3) Large-scale synthetic social media data sets are generated by combining the synthetic graph and synthetic textual content. Graph and textual properties of social media entities (such as hashtags in Twitter) in the input interaction graph and the synthetic graph are matched with each other, and the text is selected for the most dominant topic assigned to some localized subset of vertices in the synthetic graph. Synthetic data generation is scaled up beyond the size of input social media data by sampling vertices and edges according to the fitted social network model and sampling text and social media entities according to their statistical properties in the input social media data. Synthetic data generation is parallelized by dividing the input social media data into multiple sets by the posting time, generating synthetic social media for each set, and combining data sets through common usernames and social media text entities.
4) We performed extensive tests to evaluate the synthetic data generated by SHIELD as follows.
   a) First, we performed computational experiments where we collected a large set of data from Twitter, generated a large corresponding set of synthetic tweets, and compared the two data sets in terms of graph and text properties. The analysis showed that synthetic graph and text structures are statistically similar to their real counterparts, but that their realizations are different.
   b) Second, we carried out a data evaluation exercise with human participants to measure how difficult it is for the participants to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Participants are given a set of tweets and asked to determine whether each tweet was generated by a human (real) or a machine (synthetic). After classifying the tweets, participants complete a brief follow-up survey about which features they used to decide whether tweets are real or synthetic. We received a Human Research Protection Office (HRPO) Concurrence Memorandum with the exempt determination for the protocol “SHIELD Data Collection Plan.” Experimental results followed by the statistical significance analysis showed that human participants cannot reliably distinguish between real and synthetic tweets.

SHIELD is particularly useful in generating data to be used for various purposes, including developing and testing new social media analytics, generating or analyzing social bot network behavior and campaigns in social media, and sharing test data with others without privacy concerns.

B. Related Work

Synthetic graphs can be generated according to various social network models, such as Forest Fire [3], Nearest Neighbor [4], Random Walk [4], Barabasi–Albert [5], Octopus [6], Kronecker [7] and their tensor follow-ups [8], Random Typing [9], dK-2 [10] and dK-2.5 [11], or by other graph generation algorithms [12], [13]. For social network models, graph parameters (in parametric models) can be selected to make the synthetic graph statistically similar to a real graph, as discussed in [14].

On the other hand, synthetic text can be generated separately via natural language generation processes such as n-gram models [15]. Recently, machine learning, e.g., the generative adversarial network (GAN) [16], has been applied to build generator models for synthetic data, such as images [17], topological features of graphs [18], and text [19]. However, the scalability and the need for fine-tuning the underlying deep neural networks for the generator and discriminator models in the GAN remain challenges.

From a more practical application perspective, there have been efforts to generate chat bots [20], [21] that can produce synthetic natural language for a conversation. For example, a chat bot is developed in [20] for the Twitter social network for entertainment and viral advertising, and an agent model is developed in [21] to generate humorous sentences and recognize humorous expressions introduced by the user during the dialog. Chat bots have been used for many applications, such as speech recognition [22], authorship recognition [23], and generating conversations [24]. SHIELD provides a systematic and scalable solution to generate synthetic data that can be used to drive not only individual but also coordinated groups of chat bots. SHIELD’s major advantage is jointly generating synthetic graph and text features along with social media entities that involve highly complex structures similar (statistically and at the application level) to the real social media data.

The rest of this paper is organized as follows. Section II describes the SHIELD system that consists of synthetic graph generation (see Section II-A), synthetic text generation (see Section II-B), and combination of synthetic text and graph structures (see Section II-C). Section III provides a computational fidelity analysis of synthetic graph (see Section III-B)
and text (see Section III-C) properties. Section IV provides experimentation results based on a data evaluation exercise with human participants. Section V concludes this paper.

II. SYSTEM DESCRIPTION

Fig. 1 shows three main steps of SHIELD in generating synthetic social media data: Step 1a generates synthetic graph structures, Step 1b generates synthetic text, and Step 2 combines synthetic text and graph structures (through synthetic social media entities). Step 1a and Step 1b can be run in parallel, followed by Step 2. We will discuss each step in detail in Sections II-A–II-C. Each step applies state-of-the-art techniques with analytical foundations. SHIELD combines these steps in a unique way to jointly generate synthetic graph and text data for social media. SHIELD can be applied to social media data from different platforms. While we use Facebook and Twitter data for numerical results, the underlying graph and text generation procedures of SHIELD are generic enough to extend to different platforms.

Fig. 1. Main steps of SHIELD.

A. Synthetic Graph Generation

Methods for generating synthetic graph structures $G'$ for social media are shown in Fig. 2.

Fig. 2. Step 1a: generate synthetic graph structures.

1) Interaction Graph Generation: An interaction graph $G$ is extracted from the input (real) social media data. An interaction graph represents how social network actors interact with each other [25], [26]. Entities and their interactions in social media are identified, and an interaction graph is built with a vertex set $V$, including entities, an edge set $E$ representing interactions, and an attribute set $A$, which includes both vertex (entity) attributes and edge (interaction) attributes. This framework can be applied to any social media platform. In the following, we provide examples.
1) Twitter: The vertex set $V$ includes users and hashtags, while the edge set $E$ includes replies, retweets, and mentions (of both users and hashtags).
2) Instagram: The vertex set $V$ includes users and tags, while the edge set $E$ includes post comments and mentions in post captions or post comments.
3) Facebook: The vertex set $V$ includes users, while the edge set $E$ includes wall posts and photograph comments.
In the attribute set $A$ for each platform, user and hashtag/tag information, such as language and location for a user, is defined as vertex attributes, and edge attributes are defined by the number of interactions between two vertices.

In this section, we will use Facebook data (corresponding to a graph of 10k vertices) in all the presented examples to demonstrate the effectiveness of individual graph generation components of SHIELD. In this data set, interactions correspond to wall posts or photograph comment activities (captured from Facebook for the New York region [14]) and build the edges in the interaction graph. We will switch to a larger Twitter data set (corresponding to a graph of more than 75k vertices) in Section III when we evaluate the overall system performance.

2) Synthetic Graph Generation With Distance Minimization: A synthetic interaction graph $G'$ is generated via some social graph model, minimizing the distance between the real and synthetic interaction graphs. SHIELD uses various social graph models that can generate a large graph with a small number of parameters.
1) Graph Generation With Statistical Distance: SHIELD generates the synthetic graph by minimizing the statistical graph distance from the input graph, e.g., the dK-2 distance [10] between joint degree distributions. The dK-2 distance has been used in [14] to calibrate synthetic graph generation with respect to input real graphs. Under the dK-2 distance, the number of edges ($n_{ij}$ for the real graph and $n'_{ij}$ for the synthetic graph) is identified connecting a vertex with degree $i$ to a vertex with degree $j$. The distance is then defined as $\mathrm{dist}(G, G'(p)) = \left(\sum_{i,j} (n_{ij} - n'_{ij})^2\right)^{1/2}$. The dK-2 distance refers to the difference of joint degree distributions and provides more fidelity than other measures listed in Table I, as it captures higher order statistics. SHIELD supports both parametric and nonparametric social network models.
   a) Parametric Models: The parametric models include Forest Fire [3], Nearest Neighbor [4], Random Walk [4], Barabasi–Albert [5], Octopus [6], and Kronecker [8]. For example, the Forest Fire model has two parameters (connection probability and average number of iterations per vertex) to select. The parameters are tuned such that the distance $\mathrm{dist}(G, G'(p))$ between the real and synthetic interaction graphs is minimized over
parameters $p$, i.e., the synthetic graph $G'(p^*)$ is generated with the optimal selection of parameters $p^* = \arg\min_p \mathrm{dist}(G, G'(p))$.

TABLE I. STATISTICAL FIDELITY EVALUATION
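The dK-2 distance and the parameter selection $p^* = \arg\min_p \mathrm{dist}(G, G'(p))$ can be illustrated with a minimal sketch. It assumes undirected graphs given as lists of vertex pairs without duplicate edges. The generator `toy_preferential_attachment` and the helper `fit_parameter` are hypothetical stand-ins introduced only for this illustration; they are not the Forest Fire or Nearest Neighbor models, and a plain search over a candidate list is assumed as the tuning procedure.

```python
import math
import random
from collections import Counter


def joint_degree_counts(edges):
    """Count edges n_ij joining a degree-i vertex and a degree-j vertex (i <= j)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter()
    for u, v in edges:
        i, j = sorted((degree[u], degree[v]))
        counts[(i, j)] += 1
    return counts


def dk2_distance(real_edges, synthetic_edges):
    """dist(G, G') = (sum_{i,j} (n_ij - n'_ij)^2)^(1/2)."""
    n_real = joint_degree_counts(real_edges)
    n_syn = joint_degree_counts(synthetic_edges)
    keys = set(n_real) | set(n_syn)  # a Counter returns 0 for missing keys
    return math.sqrt(sum((n_real[k] - n_syn[k]) ** 2 for k in keys))


def toy_preferential_attachment(num_vertices, edges_per_vertex, seed=0):
    """Hypothetical one-parameter generator (assumption, not a model from the paper)."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # each vertex is listed once per incident edge
    for new in range(2, num_vertices):
        targets = set()
        while len(targets) < min(edges_per_vertex, new):
            targets.add(rng.choice(endpoints))  # degree-proportional choice
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def fit_parameter(real_edges, num_vertices, candidates):
    """Return (p*, distance), where p* minimizes the dK-2 distance over candidates."""
    scored = [
        (dk2_distance(real_edges, toy_preferential_attachment(num_vertices, p)), p)
        for p in candidates
    ]
    best_distance, best_p = min(scored)
    return best_p, best_distance
```

In this sketch, each candidate parameter requires generating a full synthetic graph before its distance can be measured, which is the cost that dominates the fitting time.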
b) Nonparametric Models: There are also nonpara-
metric graph models, such as the dK-2 model [10]
that minimizes the dK-2 distance or the dK-2.5
model [11] that minimizes the dK-2 distance while
aiming to preserve the clustering coefficient. These
models can be readily applied without a need to
select parameters.
Ultimately, the best social network model (with the best parameters, if any) that minimizes the distance is selected, and the synthetic interaction graph is generated according to the selected social network model (with the best set of parameters, if any).

Example: Table I evaluates different graph statistics (including the dK-2 distance) for various social network models and compares the performance of these social network models in terms of graph properties. Nearest Neighbor achieves the smallest dK-2 distance (708) among all the parametric models.

2) Graph Generation With Application-Level Distance: SHIELD can also generate the synthetic graph by minimizing the distance in terms of application performance from the input graph. For example, consider influence spread [28] for an application-based distance. The number of vertices ($n_k$ for the real graph and $n'_k$ for the synthetic graph) that can be influenced by some given number of seeds (say $k$) is computed. The distance is then defined as $\mathrm{dist}(G, G'(p)) = |n_k - n'_k|$.

Example: Consider influence spread which starts with a number of seeds that are selected from the vertices with the highest degrees. A vertex is influenced if the ratio of the influenced neighbors exceeds a fixed threshold. Suppose the influence threshold is 0.5, and 30 seeds (vertices) are used. The number of influenced vertices is shown in Fig. 3 for different social network models. The performance under the influence and dK-2 distances is shown in Table II.

Fig. 3. Influence spread performance under different social network models.

3) Graph Generation According to Composite Social Network Models: A synthetic graph can be generated in SHIELD by fitting to composite social network models, which are models that combine multiple social network models. A synthetic graph can be generated by two graph models with a parameter $r \in (0, 1)$ as follows. A synthetic graph with $r \times n$ vertices (smaller than the target output graph), where $n$ is the number of vertices in the input graph, is generated by the first model. Then, the second graph model is used to add more vertices and edges to this small graph such that a synthetic graph with a total of $n$ vertices is obtained. This approach can be readily extended to cascades of network models.

Example: Fig. 4 shows the dK-2 distance when Nearest Neighbor and Forest Fire are combined with different fractions. The distance is reduced to 679 when, first, the Nearest Neighbor graph is generated with $r = 0.8$ and then the Forest Fire graph is added.

3) Generation of Patterns in Synthetic Graphs: A synthetic graph with patterns is generated in SHIELD. Social network models typically provide smoothed graph output, which might not capture various patterns, such as anomalies. After we generate a synthetic interaction graph, we further add these patterns in the synthetic interaction graph. Here, an anomaly corresponds to a local structure that is significantly different from other parts of a graph, e.g., a high-degree vertex or a high-degree vertex with low-degree neighbors, such as a hub or a large-size clique. Such a structure can exist in the real interaction graph but may not be generated by the social graph models. SHIELD identifies patterns in the real graph analyzed and then generates similar structures in the synthetic interaction graph as follows.
1) All vertices in these patterns and all associated edges are removed from the input graph. Denote the graph of removed patterns as $\tilde{G}$ and the remaining graph as $\hat{G}$. Each vertex $i$ in $\tilde{G}$ has a vertex property $c_i$, which is the number of edges connecting vertex $i$ and a vertex in $\hat{G}$.
2) A synthetic graph is generated for graph $\hat{G}$.
3) $\tilde{G}$ is reattached to this synthetic graph by adding $c_i$ edges for each vertex $i$.

TABLE II. APPLICATION-LEVEL FIDELITY EVALUATION

Fig. 4. Graph generation under a composite social network model.

For high-degree anomalies, such as hubs, the following alternative procedure can also be applied.
1) For each vertex $v$ in the real graph that has a degree greater than a threshold, one of the higher degree vertices $v'$ in the synthetic graph is selected.
2) Some topologically proximate, but nonneighboring vertices are rewired to become neighbors of $v'$, until the degree of $v'$ becomes close to the degree of $v$.
This procedure preserves the number of edges and the connectivity of the graph while generating high-degree vertices.

Example (Consider Two Types of Patterns):
1) Degree Anomaly: Any vertex with degree greater than $t$.
2) Hub Anomaly: Any vertex with greater than $t_1$ neighbors, each with degree less than or equal to $t_2$.
Consider three alternative methods. Method 1 fits to the input graph (irrespective of patterns/anomalies), whereas Method 2 fits to the input graph and then adds patterns. Method 3 (used by SHIELD) removes patterns from the input graph, fits to the remaining graph, and then adds the patterns to the synthetic graph. Table III shows the dK-2 distance and the number of patterns in the input graph and the synthetic graph (generated with Nearest Neighbor) for $t = 100$, $t_1 = 20$, and $t_2 = 5$. Method 1 cannot match patterns, while Methods 2 and 3 preserve the patterns. In addition, Method 3 reduces the dK-2 distance.

4) Generation of Multiple Connected Components in Synthetic Graphs: A synthetic graph with multiple connected components is generated in SHIELD. In general, a giant connected component can be found in a social graph, and remaining components are relatively smaller [29]. Some social graph models generate one connected graph while the input (real) graph may be disconnected. To preserve the same connectivity as in the real graph, the entire synthetic graph should not be generated at once. Instead, one or more of the largest components is fitted, while the remainder of the input graph is returned unchanged. For example, components may be fitted one by one in nonincreasing order of component size (the number of vertices in a component) and terminated once most (e.g., 90%) of the input graph is fitted.

5) Generation of Attributes in Synthetic Graphs: A synthetic graph with attributes is generated in SHIELD. Social graph models, such as Nearest Neighbor and Forest Fire, typically generate a synthetic graph by adding vertices one by one and adding associated edges when a vertex is added to the graph. Vertex attributes, such as group membership, are then assigned based on the group distribution of vertices in the input data, and the edge attributes are added based on the joint group distribution in the input data. That is, suppose in the input data, $p_i$ is the number of vertices in group $i$ divided by the number of all vertices. Then, a new vertex in the synthetic graph is assigned as a member of group $i$ with probability $p_i$. Suppose a vertex in group $i$ is added in the synthetic graph and, given that one vertex of an edge is in group $i$ in the input data, $p_{ij}$ is the number of edges with another vertex in group $j$ divided by the number of all edges with a vertex in group $i$. Then, with probability $p_{ij}$, the newly added vertex is connected to a vertex in group $j$.

Example: Consider the simple method that generates first the vertices and edges, and then separately generates attributes according to prior distributions of attributes. Suppose the vertex attribute is 0 or 1, depending on whether the number of friends of a user is greater than 200 or not. Define the ratio of edges $a_{ij}$ for the real graph and $a'_{ij}$ for the synthetic graph, connecting a vertex with attribute $i$ to a vertex with attribute $j$. The attribute distance is defined as $\mathrm{dist}(G, G'(p)) = \left(\sum_{i,j} (a_{ij} - a'_{ij})^2\right)^{1/2}$ and computed as 0.904769 for a synthetic graph generated with Nearest Neighbor. On the other hand, by jointly generating graph interactions and attributes (as carried out in SHIELD), this distance is reduced to 0.175193.

6) Generation of Interaction Types in Synthetic Graphs: The edges of the synthetic graph are probabilistically labeled as representing interaction types (e.g., retweets, replies, and mentions in Twitter) and their numbers. Define $a = \mathrm{tail}(e)$ and $b = \mathrm{head}(e)$ as the tail and head of edge $e = (a, b)$, and $\deg(x)$ as vertex $x$’s degree. Define $P_{\mathrm{cat}(e),\deg(\mathrm{tail}(e)),\deg(\mathrm{head}(e))}(w_1, w_2, \ldots, w_n)$ as the joint distribution of weights $w_1, w_2, \ldots, w_n$ (corresponding to the number of interactions of type $1, 2, \ldots, n$) conditioned on the edge being of category $\mathrm{cat}(e)$ (e.g., user–user versus user-hashtag) with degrees $\deg(\mathrm{tail}(e))$ and $\deg(\mathrm{head}(e))$. This distribution is computed from the input graph $G$, and we use
it to probabilistically assign edge weights $w_1, w_2, \ldots, w_n$ to edges in the synthetic graph $G'$. If the synthetic graph $G'$ has an edge with degrees $x'$ and $y'$ for which no distribution exists in the real graph $G$, then we instead use an existing joint distribution $P_{\mathrm{cat}(e),x,y}$ to assign weights, where $(x, y)$ is selected to be as close as possible to $(x', y')$.

TABLE III. GENERATION OF SYNTHETIC PATTERNS

TABLE IV. TEMPORAL GRAPH GENERATION

Fig. 5. Run time as a function of graph compression size m.
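The assignment of interaction-type weights, including the fallback to the closest observed degree pair, can be sketched as follows. This is a minimal illustration under stated assumptions: the empirical joint distribution is stored as the list of weight vectors observed for each (category, tail degree, head degree) key, and "as close as possible" is interpreted as the smallest squared Euclidean distance between degree pairs within the same category. The function names and the input tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def build_weight_table(real_edges):
    """Group observed weight vectors by (category, tail degree, head degree).

    real_edges: iterable of (category, deg_tail, deg_head, weights), where
    weights is a tuple (w_1, ..., w_n) of per-type interaction counts.
    """
    table = defaultdict(list)
    for category, deg_tail, deg_head, weights in real_edges:
        table[(category, deg_tail, deg_head)].append(tuple(weights))
    return table


def sample_weights(table, category, deg_tail, deg_head, rng):
    """Draw (w_1, ..., w_n) for a synthetic edge from the empirical distribution."""
    key = (category, deg_tail, deg_head)
    if key not in table:
        # Fallback: use the nearest observed (x, y) within the same category.
        # The squared Euclidean metric is an assumption of this sketch.
        same_category = [k for k in table if k[0] == category]
        if not same_category:
            raise KeyError(f"no edges of category {category!r} in the input graph")
        key = min(
            same_category,
            key=lambda k: ((k[1] - deg_tail) ** 2 + (k[2] - deg_head) ** 2, k),
        )
    # Sampling uniformly from the observed vectors reproduces their frequencies.
    return rng.choice(table[key])
```

Storing the observed vectors directly, rather than a normalized probability table, keeps the joint dependence among $w_1, \ldots, w_n$ without any further modeling choice.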
7) Generation of Temporal Dynamics in Synthetic Graphs: A synthetic graph with temporal dynamics is generated in SHIELD. Denote $G_1, G_2, \ldots$ as a sequence of input graphs. There are overlapping structures among these graphs, i.e., $n_{i,i+1}$ is the number of vertices in $G_i \cap G_{i+1}$. A sequence of synthetic graphs is generated by Method 4 in SHIELD as follows.
1) Generate the first synthetic graph $S_1$ for $G_1$.
2) Given $S_i$ and $n_{i,i+1}$, randomly keep $n_{i,i+1}$ vertices in $S_i$ and edges among them.
3) Apply a graph model to add more vertices and edges to this small graph such that a synthetic graph $S_{i+1}$ with the same number of vertices as $G_{i+1}$ is obtained.
Method 4 of SHIELD preserves temporal correlation in terms of Jaccard index of edge sets across various time instances. The Jaccard index across intervals $i$ and $i+1$ is $|E_i \cap E_{i+1}| / |E_i \cup E_{i+1}|$, where $|E|$ denotes the size of set $E$.

Example: Table IV shows the dK-2 distance and the Jaccard index (averaged over time instances) of synthetic graphs generated by Method 4 and the alternative method (Method 5) that fits each graph $G_1, G_2, \ldots$ independently. Time is partitioned into three equal intervals, and Nearest Neighbor is used to generate synthetic graphs for each partition. SHIELD reduces the dK-2 distance and preserves the temporal correlation in terms of average Jaccard index.

8) Complexity Reduction in Synthetic Graph Generation: The time complexity of graph fitting is mainly caused by the need to generate the same size graph as the original graph for each set of parameters and then measure the distance between the two graphs. To reduce the time complexity of graph fitting, a smaller synthetic graph can be generated, and the distance between that graph and the input graph can be measured by an extended dK-2 distance defined as follows. Suppose two graphs $G$ and $\hat{G}$ are compared, where the number of vertices in $G$ and $\hat{G}$ is $n$ and $m$, respectively, and $n \ge m$. For $i, j \in [1, m]$, denote $e_{ij}$ as the number of edges between a vertex with degree $i$ and a vertex with degree $j$ in $G$ and $\hat{e}_{ij}$ as the number of edges between a vertex with degree $i$ and a vertex with degree $j$ in $\hat{G}$. Define $p_{ij} = e_{ij} / \sum_{i=1}^{m} \sum_{j=i}^{m} e_{ij}$ and $\hat{p}_{ij} = \hat{e}_{ij} / \sum_{i=1}^{m} \sum_{j=i}^{m} \hat{e}_{ij}$. The new dK-2 distance $\left(\sum_{i=1}^{m} \sum_{j=i}^{m} (p_{ij} - \hat{p}_{ij})^2\right)^{1/2}$ is scaled up by multiplying with the number of edges in graph $G$.

Example: We apply this new distance definition to synthetic graphs generated from the input graph with $n = 10\,000$ vertices and study how the run time and dK-2 distance change with $m$. The results are reported in Figs. 5 and 6 that show the run time and the dK-2 distance, respectively, as a function of graph compression size $m$. Fig. 5 shows that by using a small $m$ value, the run time for both models can be significantly decreased. On the other hand, the dK-2 distance achieved by the fitted graph does not increase much when we decrease $m$ from 10 000 to 5000, as shown in Fig. 6. In this case, the distance increases only by 16% for Nearest Neighbor and by 26% for Forest Fire, while the run time is reduced by 59% for Nearest Neighbor and by 47% for Forest Fire. We also note that during the graph fitting process, it is not the absolute dK-2 distance that matters, but how this distance changes depending on which parameter values are used (with the distance-minimizing parameters ultimately selected).

9) Synthetic Graph Generation From Sampled Data: It is possible that the input graph is incomplete, i.e., there may be missing vertices or edges. One potential reason is that data collection through public APIs has strict rate limitations [30]. For example, keyword search through the Twitter public API allows 180 requests per 15 min for a user or 400 requests per minute for an app. Another option is to use Twitter Gardenhose, which provides only 10% of all Twitter data, compared with Twitter Firehose, which provides 100% of all Twitter data. This raises
the question of whether it is sufficient to use a limited resource as input and how representative these input data are when we generate synthetic graphs out of it. To answer this question, we construct the following experiment with 10k tweets and compare two scenarios in the following.
1) Take 100% of available tweets as input, synthetically generate a graph, and compare it with the input graph.
2) Take 10% of available tweets as input, synthetically generate a graph corresponding to the 100% data size, and compare it with the input graph.
Results show that the change in the dK-2 distance between the two scenarios is less than 3% (we observed small changes in other data sets), implying that it is feasible to generate a synthetic graph from the sampled data with high fidelity.

Fig. 6. dK-2 distance as a function of graph compression size m.

B. Synthetic Text Generation
Methods for generating synthetic text for social media are shown in Fig. 7.

Fig. 7. Step 1b: generate synthetic text.

1) Topic Modeling and Markov Chain Construction for Text Distributions: The textual content of social media is modeled as a set of posts P, where each post P_i = (w_1, w_2, . . . , w_m) ∈ P is a sequence of terms w_k. Here, w_k is an ordinary word or a special social media entity, such as a username, hashtag, or hyperlink. Before the textual content of synthetic social media posts P′ is generated, real social media posts P are classified according to the dominant topic T(P_i) of the post P_i. Topics are identified by applying the latent Dirichlet allocation (LDA) algorithm [31] to the text of the real posts after removing stopwords (stopwords are common words such as “the” in English that are considered to be not necessarily significant for natural language processing purposes). Other topic modeling techniques, such as latent semantic analysis [32] and probabilistic latent semantic analysis [33], may be applied here as an alternative. LDA models each post P_i ∈ P as a mixture of some fixed set of topics {T_1, T_2, . . . , T_n}, where each topic is a vector in the space of the terms w_k. The dominant topic T(P_i) is T_{d_i}, where d_i = argmax_j p_{ij} and p_{ij} is the probability of topic T_j for post P_i. All posts where some topic T_d dominates (in the mixture) are placed into the same class d. Synthetic textual content is then generated separately for each class/topic d.
To generate content for posts belonging to some class d, part-of-speech tagging [34] is first applied to find the grammatical category g(w_k) for each term w_k in the post. These categories include both standard parts-of-speech, such as noun, verb, and adjective, as well as social media entities, such as usernames, hashtags, and hyperlinks.
From this tagged set of posts, a Markov chain M_d (more specifically, an n-gram model) is constructed. For example, the posts “I like dogs” and “I like cats” will result in the following second-order Markov chain (3-gram).
1) α ⇒ [I/pronoun, I/pronoun].
2) (α, I/pronoun) ⇒ [like/verb, like/verb].
3) (I/pronoun, like/verb) ⇒ [dogs/noun, cats/noun].
4) (I/pronoun, like/verb) ⇒ [dogs/noun, cats/noun].
5) (like/verb, dogs/noun) ⇒ ω.
6) (like/verb, cats/noun) ⇒ ω.
Here, α and ω indicate the beginning and the end of a post, respectively.
2) Generation of Synthetic Text From Markov Chains According to Underlying Topics: From each Markov chain M_d, synthetic posts P′_i = (w′_1, w′_2, . . . , w′_m) are stochastically generated, where each w′_k in P′_i is a term from the original set of posts P. Notably, the word frequencies and transition probabilities in the synthetic posts reflect the word frequencies and transition probabilities in the real posts. Furthermore, because generation is performed separately for each LDA topic T_d, the resulting text is less likely to contain incoherent sentences that begin with one topic and end with another. Finally, because part-of-speech tagging is performed, the generation process is more likely to preserve the grammatical rules inherent in the real posts. To further improve the quality of synthetic tweet text (grammar and originality), the following postprocessing steps are applied.
1) Any synthetic post P′_i whose grammar significantly violates the grammatical rules found in the original set of posts P is automatically identified and removed.
2) The synthetic posts and real posts are ensured to contain distinct text, by removing any synthetic post whose textual content is too similar to the textual content of some real post. A synthetic post P′_i can be identified to be too similar to a real post P_i if the sequence of terms in P′_i (excluding social media entities, such as usernames, hashtags, and hyperlinks) is contained as a subsequence within the terms of P_i.
An alternative method to generate synthetic data is to apply the GAN, as discussed in Section I.
Synthetic text generation has high complexity and is the bottleneck of synthetic social media data generation. For instance, starting from 100k real tweets, it takes approximately 3 to 4 times longer to build the Markov chains and generate synthetic text from them compared with generating a synthetic interaction graph (with hubs). We will discuss how to reduce complexity in Section II-C.
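The Markov chain construction and generation steps of Section II-B can be illustrated with a minimal sketch. It is not the SHIELD implementation: the function names, the `<alpha>`/`<omega>` marker strings, and the representation of a tagged term as a (word, tag) pair are assumptions made for illustration, and the part-of-speech tags are taken as given rather than computed. The start state is padded with two begin markers, which differs slightly from the variable-length start states in the example above, and the similarity filter treats "subsequence" as a contiguous run of terms, in line with the “cold today” example used later in the paper.

```python
import random
from collections import defaultdict

# Hypothetical stand-ins for the begin (alpha) and end (omega) markers.
START, END = "<alpha>", "<omega>"


def build_chain(posts, order=2):
    """Map each state (the previous `order` tokens) to the observed next tokens.

    Each post is a list of (word, tag) pairs; tags are assumed to be given.
    Repeated entries are kept so that sampling follows empirical frequencies.
    """
    chain = defaultdict(list)
    for post in posts:
        state = (START,) * order
        for token in list(post) + [END]:
            chain[state].append(token)
            state = state[1:] + (token,)
    return chain


def generate(chain, order=2, rng=random, max_len=50):
    """Stochastically generate one post by walking the chain from the start state."""
    state = (START,) * order
    post = []
    while len(post) < max_len:
        token = rng.choice(chain[state])
        if token == END:
            break
        post.append(token)
        state = state[1:] + (token,)
    return post


def is_contiguous_subsequence(candidate, real):
    """Return True if `candidate` appears as a contiguous run of terms in `real`."""
    candidate, real = list(candidate), list(real)
    n = len(candidate)
    return n > 0 and any(
        real[i:i + n] == candidate for i in range(len(real) - n + 1)
    )


if __name__ == "__main__":
    real_posts = [
        [("I", "pronoun"), ("like", "verb"), ("dogs", "noun")],
        [("I", "pronoun"), ("like", "verb"), ("cats", "noun")],
    ]
    chain = build_chain(real_posts)
    synthetic = generate(chain)
    # With only two real posts, every generated post duplicates a real one,
    # so the originality filter would remove it.
    too_similar = any(is_contiguous_subsequence(synthetic, p) for p in real_posts)
    print(synthetic, too_similar)
```

In this toy corpus every generated post is rejected by the filter, which shows why a large input set and overgeneration are needed before the originality check leaves a useful number of posts. One chain would be built per LDA topic class d.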
At this point, the usernames, hashtags, and hyperlinks in the synthetic posts P′ are the same ones as in the real posts P. Hyperlinks can be randomly replaced with other hyperlinks that exist in some other set of social media posts Q different from P but thematically related. For example, for Twitter, hyperlinks are generated by collecting a separate set of real tweets Q through Twitter API [35], using keywords similar to those that were used to obtain the input data set P. The generation of synthetic users and hashtags is described in the next section.

C. Combination of Synthetic Graph and Text Structures
Methods for generating synthetic social media entities and combining the synthetic graph with the synthetic text are shown in Fig. 8.

Fig. 8. Step 2: combine synthetic text and graph structures (through synthetic social media entities).

1) Generation of Synthetic Social Media Entities: The generated synthetic social media posts P′ are connected with the synthetic graph G′ by replacing social media entities (such as usernames and hashtags) in posts P′_i ∈ P′ with the labels from the vertices V′ and edges E′ of G′. It is therefore first necessary to generate these labels. If v′ ∈ G′ is a user vertex, then it is assigned a random string of characters u(v′) as the synthetic username. If v′ is a hashtag vertex, then it is assigned a synthetic hashtag as follows. For each hashtag vertex v in the real graph, a procedure attempts to find a hashtag vertex v′ in the synthetic graph, such that v and v′ have a similar degree. Then, the vertex v′ is labeled with the most common dominant topic T(v′) = T_d among all real posts P that mention v (recall that each post has been labeled with its dominant topic T_d via LDA). The hashtag name h(v′) for v′ is generated by randomly choosing and combining one or two LDA terms from this topic, where the probability of choosing a particular term w_k is proportional to q_{d,k}, which we define as the probability of term w_k for topic T(v′) = T_d (as computed by LDA).
2) Generation of Synthetic Social Media Posts: A fully synthetic post is generated as follows. For each iteration, some synthetic edge e = (u′, v′) ∈ E′ is selected. One vertex of e (without loss of generality, call it u′) is the user who created the post. Now, let S′(u′) be a subset of vertices that are adjacent to u′, randomly drawn from the distribution of social media entities in the real set of posts P (for example, if posts in P contain two user mentions on the average, then the average number of user vertices in S′(u′) will be 2). A post P′_i ∈ P′ is selected whose entities match the entities in S′(u′) as closely as possible. For example, if S′(u′) contains one user vertex s_1 and two hashtag vertices s_2 and s_3, then we select P′_i such that one of its terms (call it w_1) is a username and two of its terms (w_2 and w_3) are hashtag names. Then, terms in P′_i are replaced with vertex labels as appropriate, i.e., w_1 is replaced with u(s_1), and w_2 and w_3 are replaced with h(s_2) and h(s_3), respectively.
The real usernames and hashtags in post P′_i are thus replaced with synthetic usernames and hashtags, yielding a fully synthetic post. The above-mentioned process is then repeated for different edges of G′ to generate a final set of fully synthetic posts. Note that after each iteration, the selected post P′_i is removed from P′, to avoid unnecessary duplication in the generated posts. In order to be able to select an appropriate post P′_i (with username counts and hashtag counts that match the user vertex counts and hashtag vertex counts in S′(u′), respectively) in every iteration, P′ is typically overgenerated.
Notably, in the above-mentioned procedure, for cases where S′(u′) contains hashtag vertices, an attempt is made to select P′_i such that its dominant topic T(P′_i) matches the most common topic found among the hashtag vertices. In the above-mentioned example, P′_i is picked, such that T(P′_i) = T(s_2) or T(P′_i) = T(s_3) (if the hashtag vertices have different topics assigned to them, then the tie is broken arbitrarily). Because topics T(v′) were assigned to vertices based on their degree (as described earlier), the synthetic hashtag assignment process thus bridges content and topology. Furthermore, if the edge (u′, s_1) is a reply (i.e., u′ replies to s_1), then we also generate the original post of a reply, as described earlier, but using a subset of vertices S′(s_1) and selecting a post whose dominant topic equals that of the reply (i.e., the reply and the original post should have the same LDA topic). Finally, if (u′, s_1) is a retweet, then we simply replicate the generated retweet to generate the original tweet.
3) Generation of Synthetic Information Cascades: Synthetic forms of information cascades are added. As an example, information cascades can be defined for Twitter as follows. If a user u_1 replies to another user u_2 and mentions a hashtag #h, then user u_1 is influenced by user u_2 on the topic of #h. For a given topic, we may find some long influence paths in a social graph. However, in a synthetic graph, there may be only a few influence paths. Long influence paths can be identified in the input graph, and then, the number of long influence paths in the synthetic graph can be checked to see how many influence paths should be added. An influence path can be added by adding influenced users one by one. For synthetic Twitter data, suppose that the goal is to add a user u_1 influenced by user u_2 on the topic of #h. We then need to add “@u_2” at the beginning of one of the tweets from u_1 and add “#h” in this tweet.
4) Generation of Other Synthetic Social Media Attributes: Other synthetic attributes, such as locations and geo-coordinates, can be generated for social media posts, based on
the correlation between locations of real posts (where available) and LDA topics of corresponding posts. For example, consider the generation of synthetic locations. The coupling between content and locations is preserved since the subjects of the tweets are correlated in general with locations. Given a number of real tweets with locations and synthetic tweets without locations, the synthetic location generator executes the following steps to generate the locations for the synthetic tweets.
1) Clean all synthetic and real tweets by removing hashtags, mentions, and hyperlinks (URLs).
2) Apply LDA to all tweets and compute the topic per tweet distribution (namely the LDA scores).
3) For each tweet, find the tweet’s dominant topic (as defined in Section II-B1).
4) Collect the location information (latitude and longitude) for real tweets.
5) Divide the real tweets’ locations into small geographic bins and compute the probability of the real tweet location falling in one of those bins, given a dominant topic.
6) Generate synthetic tweets’ locations in small geographic bins for a given dominant topic, from the probability distribution defined in the previous step.
Real tweets with no location are assigned location (0, 0). This particular (0, 0) location is assigned to synthetic tweets with the same ratio as for input tweets.
5) Scalable Generation of Synthetic Social Media Data: To improve scalability, input tweets are divided into multiple sets T_1, T_2, . . . by the posting time. The above-mentioned generation procedure is applied to obtain a set of synthetic tweets S_i for each set T_i. Then, user names and hashtags are updated in S_{i+1} based on S_i and the correlation between T_{i+1} and T_i (measured by the Jaccard index |T_i ∩ T_{i+1}|/|T_i ∪ T_{i+1}|) such that the same time correlation can be observed among synthetic and original sets. Synthetic data generation can be parallelized by dividing the input social media data into multiple sets by the posting time, generating synthetic social media for each set, and combining data sets through common usernames and social media text entities.

III. COMPUTATIONAL EVALUATION OF SYNTHETIC SOCIAL MEDIA DATA
In this section, we describe a computational experiment where we collected a large set of Twitter data, generated a large set of synthetic tweets, and compared the two data sets in terms of graph and text properties. In the following, we describe the details of the data collection and generation procedure and present the comparison measures and results.

A. Experimental Procedures
To generate synthetic tweets, real tweets were first collected on the following keywords: “mlb,” “baseball,” “rockies,” and “mets” using Twitter API [35]. The collection was done at the end of July 2016. The first 120 000 collected tweets were used. These tweets were split into 12 subsets of 10 000 tweets; synthetic tweet generation was performed separately for each subset; the resultant synthetic tweets were then combined into a full set of synthetic tweets, from which a full synthetic interaction graph was extracted. We used the following Forest Fire, Nearest Neighbor, and dK-2.5 algorithms for fitting the graph.
1) Forest Fire [3] reproduces dynamic ranges in graph density and represents increasing density and decreasing diameter over time. The graph grows with each new vertex connecting to (“burning”) existing vertices. After a new vertex connects to an existing vertex, it randomly connects to some of the neighbors of the vertex. There are two parameters p_1 and p_2; once a vertex is burned, its neighbor will be burned with probability p_1, and this process is limited to its p_2-hop neighbors.
2) Nearest Neighbor [4] emulates similarity and transitivity-based growth. The model is based on the observation that two people sharing a common friend are more likely to become friends. Each new vertex added to the graph is connected to a random existing vertex. Additionally, random pairs of 2-hop neighbors around the new vertex are connected. There are two parameters p_1 and p_2. The probability p_1 determines at each step if a new vertex is added or if a pair of 2-hop neighbors is connected. To reduce the power law exponent, each time a new vertex is added, we also connect p_2 pairs of existing vertices randomly chosen from the graph.
3) dK-2.5 [11] is an extension to the dK-2 method [10], which matches joint degree distributions p_{x,y} (the probability that a vertex with degree x is connected to a vertex with degree y) of the synthetic graph to those of the real graph. dK-2 achieves only small clustering coefficient. dK-2.5 improves the clustering coefficient by rewiring (adding and removing) edges while keeping joint degree distributions. There is no parameter associated with dK-2.5.
Note that while the parameter p_2 in Forest Fire and Nearest Neighbor algorithms is discrete, dK-2 distance minimization can be further refined by instead using a floating number x.y, where x is the integer part of a decimal number p_2 and y is the fractional part. Then, with probability y, p_2 is chosen as x + 1 and with probability 1 − y, p_2 is chosen as x.
For the parametric models, the objective for graph fitting is to minimize the dK-2 distance. Each connected component was fitted separately, with the smallest components not fitted. Hubs were added through the vertex rewiring procedure described in Section II-A3, which was applied to every vertex with degree 50 or greater. For each subset of real tweets, text was generated along four LDA topics, with 400 000 posts produced (overgenerated before tying them to the graph). These posts were combined with the synthetic graph to generate 30 000 synthetic tweets. Prior to combining the subsets of synthetic tweets, we automatically filtered out those that contained profane or potentially offensive language, or those that were too similar to some real tweet.

B. Synthetic Graph Properties
For each subset i of 10 000 real tweets, we compared the real interaction graph G_i with the synthetic fitted graph G′_i; the
same was done for the real interaction graph G corresponding to the entire set of 120 000 real tweets and the synthetic interaction graph G′ that was extracted from the combined set of synthetic tweets. In comparing each pair of graphs (G_i, G′_i), our goal is twofold: 1) we wish to show that statistically, G_i and G′_i have similar properties and 2) in terms of concrete topological features, we wish to show that the graphs are different (i.e., are significantly nonisomorphic).
For the statistical comparison, we use the aforementioned dK-2 distance dist(G_i, G′_i) as the primary measure. We also count the number of vertices N(G_i) and N(G′_i) in the real and synthetic graphs, respectively, as well as the number of edges E(G_i) and E(G′_i) in these graphs. Table V provides these measures for the Forest Fire, Nearest Neighbor, and dK-2.5 fitting algorithms; the last row provides them for the full 120 000 tweet real and synthetic graphs, while the previous rows correspond to each subset of 10 000 tweets. Table VI gives the best graph fitting parameters found by the algorithm for the largest component in the graph (note that the dK-2.5 algorithm has no parameters).

TABLE V. STATISTICAL GRAPH COMPARISON MEASURES

TABLE VI. OPTIMAL PARAMETER VALUES

By design, for Forest Fire and Nearest Neighbor, each 10 000 tweet synthetic graph has the same number of vertices as the corresponding real graph; while the number of edges is not the same for G_i and G′_i, the number is reasonably close, implying a similar average degree; dK-2 distances, while nonzero, are relatively small (compared with the total number of edges), suggesting statistical similarity between real and synthetic graphs. On the other hand, for the dK-2.5 algorithm, the number of edges does not change during graph generation, while the number of vertices does. By design, dK-2 distances are 0 for all 10k tweet graph pairs, though the distance is positive for the two full graphs, due to the effects of the tweet combination procedure.
It is important to note that the statistical differences do not necessarily imply that the graphs have very different concrete topologies; for example, consider a complete graph of k vertices (with an edge between each pair of vertices), and another complete graph of k − 1 vertices. The dK-2 distance between the two graphs will be very high; however, the graphs will have k − 1 vertices and (k − 1)(k − 2)/2 edges in common; in fact, the latter graph will be isomorphic to a subgraph of the former. To determine the “amount” of isomorphism between two graphs G_i and G′_i, we must find a bijection m from the vertices of G_i to the vertices of G′_i, such that we maximize the number of edges (u, v) in G_i that have a corresponding edge (m(u), m(v)) in G′_i. Unfortunately, finding this bijection is a difficult problem; although some approximations have been developed, at present, no known algorithm exists that will solve it optimally in a tractable amount of time for large graphs. We thus measure the concrete difference between the graphs empirically over a number of trials, where each trial potentially uses a different bijection, which it creates with a different set of random choices, as follows.
1) For both G_i and G′_i, order the vertices from “highest” to “lowest,” according to the following criteria. First, define deg_0(v) = deg(v) to be the degree of vertex v, and deg_k(v) to be the degree of v’s kth neighbor, where the neighbors are ordered by degree, from highest to lowest (with ties broken randomly). In other words, when k = 1, deg_k(v) is the degree of v’s highest degree neighbor, and when k = deg(v), deg_k(v) is the degree of v’s lowest degree neighbor. Now, the order of two vertices u and v is determined by the following algorithm.
a) For k from 0 to deg(u), do the following.
i) If deg_k(u) > deg_k(v), exit and return u > v.
ii) If deg_k(u) < deg_k(v), exit and return u < v.
b) If the loop executed all deg(u) + 1 iterations without exiting, return u = v and the order of the two vertices is determined randomly during the sorting process.
2) Given the ordered (according to the above-mentioned criteria) lists of vertices V(G_i) and V(G′_i) from G_i and G′_i, respectively, randomly remove N(G_i) − N(G′_i) vertices from V(G_i) if N(G_i) > N(G′_i) (i.e., if |V(G_i)| > |V(G′_i)|), to make the number of vertices equal; similarly, if N(G′_i) > N(G_i), then randomly remove N(G′_i) − N(G_i) vertices from V(G′_i).
3) Define a bijection m using the following algorithm.
a) For k from 1 to |V(G_i)|, do the following.
i) Define m(V(G_i)[k]) = V(G′_i)[k].
4) Count the number C of edges (u, v) in G_i that have a corresponding edge (m(u), m(v)) in G′_i, and compute the edge match score M(G_i, G′_i) = 2C/(E(G_i) + E(G′_i)).
We compute the expected edge match score E[M(G_i, G′_i)] between G_i and G′_i over 100 trials. However, because the bijections (used in each trial) are likely not optimal, a low score does not necessarily imply a low number of matching edges. Thus, for comparison, we also computed E[M(G_i, G_i)] and E[M(G′_i, G′_i)]. All these measures are shown in Table VII for Forest Fire, Nearest Neighbor, and dK-2.5. The measures E[M(G_i, G_i)] and E[M(G′_i, G′_i)], though not equal to 1 (which would be the case with an optimal bijection, since a graph is isomorphic to itself), are much higher than E[M(G_i, G′_i)]; in all cases, the difference is statistically significant, according to a two-sample two-tail t-test assuming unequal variances (p < 0.01). This empirically suggests that G_i and G′_i are significantly nonisomorphic. Note that 100 trials were used, which should be sufficient if the random function is, indeed, random. However, the random function provided in a programming language is not truly random, and its output is determined by seed numbers. We used the pseudorandom generator of Python that uses the Mersenne Twister as the core generator and produces 53-bit precision floats with a period of 2^19937 − 1 [36]. While a long period is not a guarantee of quality in a random number generator, it is long enough to provide diverse random numbers to SHIELD.

TABLE VII. GRAPH REALIZATION COMPARISON MEASURES

C. Synthetic Text Properties
In addition to comparing the graphs, we also compared the textual content of each subset of 10 000 real tweet posts P with textual content from the corresponding set of synthetic tweet posts P′ and performed a similar comparison for the full set of 120 000 real tweets and the combined set of synthetic tweets. Again, our goal is to demonstrate both statistical similarity and a difference in concrete features. For the former, we use the well-known cosine similarity, which we compute as follows.
1) Remove all user mentions, hashtags, and hyperlinks from each tweet in P and P′.
2) For P and P′, define vectors T(P) and T(P′), respectively, of term frequency counts (where a term is a word or any other string that is separated by whitespace but does not contain whitespace within it). It is ensured that T(P) and T(P′) have the same length (corresponding to the size of the vocabulary common to both sets of tweets), and that for any k, T(P)[k] and T(P′)[k] correspond to the frequency count for the same term.
3) Compute the cosine similarity as dcos(P, P′) = T(P) × T(P′)/(|T(P)||T(P′)|), where × denotes the dot (inner) product of the two vectors.
As Table VIII shows, the cosine similarity values are relatively large for text (where the underlying graph is synthetically generated with Forest Fire, Nearest Neighbor, and dK-2.5). This is not surprising, given that a Markov chain is used to generate the synthetic text; since Markov chains are designed to replicate the statistical distribution of n-grams (sequences of n terms), they also replicate the statistical distribution of individual terms. At the same time, as discussed in Section II-B, we wish to avoid situations where the sequence of terms (which comprises a synthetic tweet) is identical to the sequence of terms comprising some real tweet or appears as a subsequence within it (e.g., if there is a real tweet that says “It is cold today,” we wish to avoid synthetic tweets,
such as “cold today” or “It is cold”). As Table VIII shows, the proportion R(P, P′) of such tweets is quite small, and as we stated earlier, they are filtered out and do not appear in the final set of tweets.

TABLE VIII. TEXT COMPARISON MEASURES

Next, we evaluate the semantic features of synthetic data in terms of sentiments. For that purpose, we classify tweets into two different classes (sentiments) “positive” and “negative.” We use word distributions (i.e., the number of times a given word appears in each tweet) as features. Training data are collected through associations with emoticons in tweets according to the procedure outlined in [37]. We use the Naive Bayes Classifier for illustration purposes (more advanced classifiers can be used for sentiment analysis and the class “neutral” could be added). We compare sentiments of real tweets with those of synthetic tweets generated with the Nearest Neighbor method; 63% of real tweets are positive, whereas 55% of synthetic tweets are positive. This implies that the synthetic data generation maintains the general trend of sentiments.
Importantly, the ultimate test of synthetic text quality is whether it appears indistinguishable (from real text) to humans. In Section IV, we demonstrate that this is, indeed, often the case.

IV. DATA EVALUATION EXERCISE WITH HUMAN PARTICIPANTS AS A “MINI TURING TEST”
The underlying goal of the experiment is to evaluate the content generation capabilities of the SHIELD system, which generates the synthetic Twitter data. The capabilities are effective if it is difficult for a human to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Thus, our experiment can be viewed as a simplified, noninteractive version of the Turing test, where participants are asked to determine whether a given tweet was generated by a human or a machine. We prepared the plan/protocol for a data evaluation exercise with human participants, obtained the necessary approvals (we received an HRPO Concurrence Memorandum with the exempt determination for the Protocol, “SHIELD Data Collection Plan”), recruited participants, collected consent forms, and executed the experiment.

A. Methods
1) Data: For our experiment, we drew from two data sets: one real and one synthetic. The data sets are described in Section III-A; the synthetic data set was generated with the Forest Fire algorithm. For the purposes of a data evaluation exercise with human participants, we anonymized the tweets by replacing user, hashtag, and URL mentions with the generic strings @user, #hashtag, and http://url, respectively. We additionally converted all alphabetical characters to lowercase and removed all retweets. After the two data sets were processed in this manner, 50 tweets were randomly selected from the real data set, and 50 were selected from the synthetic data set, ensuring that no chosen tweet contains offensive content or too little content. The two subsets were combined and shuffled, and each participant was then presented with the same list of 100 tweets.
2) Participants: 25 Intelligent Automation, Inc. employees were selected as the participants for the experiment. They have college and/or graduate degrees, are either native or fluent English speakers, and include frequent, occasional, and non-Twitter users.
3) Experiment: Each participant was sent the list of tweets (as generated earlier) via e-mail and asked to determine whether each tweet was generated by a human (real) or a machine (synthetic), according to his/her best judgment. The participants were not told the proportion of real versus synthetic tweets in the list. No strict deadline or time limit was imposed, although 10–15 min was typically given as a guideline for the amount of time needed to process the list of tweets. Results were collected from all participants, and for each participant, we computed the false negative rate (the fraction of synthetic tweets that were identified as real by the participant) and the false positive rate (the fraction of real tweets that the participant identified as synthetic). After rating the tweets, each participant was asked to complete a brief follow-up survey on the features (listed in the first column
of Tables X and XI) that he/she used to determine whether a tweet is real or synthetic and to state his/her level of experience with Twitter. Again, each participant was able to complete the survey at his/her convenience, with no hard time limit imposed.

B. Results

1) Individual Classification Performance: In Table IX, we present the overall mean false negative and false positive error rates, along with standard deviations, while in Fig. 9, we show how these error rates are distributed across all participants. Both Fig. 9 and Table IX suggest that it is much more common for a participant to misclassify a synthetic tweet as real than the other way around. A two-sample, two-tail pairwise t-test shows that the difference is significant (p = 1.5 × 10^-2; this is the probability of observing a difference at least this large if there were no true difference). This validates our choice of real tweets, as it indicates that most errors arise not because the real tweets are of low quality (and appear to be synthetic), but rather because of SHIELD’s content generation capabilities (such that generated tweets appear to be real). We see that, on average, SHIELD is able to deceive participants 52.7% of the time. Using a one-sample, one-tail t-test, we can claim that this misclassification rate is significantly greater than 44% (p = 1.5 × 10^-2), i.e., it is likely that more than 44% of SHIELD’s tweets will be classified as real by the average participant.

TABLE IX
CLASSIFICATION ERROR PER PARTICIPANT

Fig. 9. Distribution of classification error by participant.

2) Group Classification Performance: Given the above-obtained results, it is natural to ask whether different participants tend to misclassify different subsets of synthetic tweets, or whether certain tweets are much more susceptible to misclassification than others. In the latter case, such synthetic tweets would be difficult to detect even through “wisdom of the crowd” approaches, where multiple participants vote on whether a given tweet is real or synthetic. This approach was applied for the detection of synthetic social media profiles (Facebook and Renren) in [27].

To determine whether highly misclassified tweets exist, we “invert” our previous analysis: instead of gauging how well individual participants perform on the provided list of tweets, we measure how well each tweet is classified by the entire set of participants. In Fig. 10, we show how misclassification rates are distributed across all tweets. As expected, synthetic tweets tend to have much higher classification errors. Most notably, 28 of the 50 synthetic tweets (i.e., 56%) present a problem for the majority. A one-sample, one-tail significance test for a population proportion shows that, in general, if the 25 participants were to collectively classify other tweets, the misclassification rate (where 13 or more participants make an error) will be greater than 41% (p = 1.6 × 10^-2).

Fig. 10. Distribution of classification error by tweet.

Synthetic tweets that are classified as synthetic and real by the largest number of participants are as follows.

1) Synthetic tweet “ihip mlb baseball mouse ny new york yankees players names 100% cotton cool a http://url1 via @user1” is classified as synthetic by 88% of participants.
2) Tweet “@user2 make it a rocket launch instead of a game homie i keep it real.httr” is synthetic. However, 88% of participants classified it as real.

We capture temporal dynamics first as cascades in graph models and then through the assignment of the same topic to replies that capture temporal dynamics in text sequences. For a retweet, we ensure that its content is the same as that of the original tweet. Therefore, temporal text dynamics are captured through topics in both replies and retweets. Some examples of synthetic retweets and replies are given in the following.

1) “RT @user4: @user2 make it a rocket launch instead of a game homie i keep it real.httr” is a retweet of the above-mentioned second tweet.
2) “RE @user5: angels 2 red sox vs kansas city royals baseball card 0q3 http://url1” is a reply to “30 minutes to smoke em if you want to go to another baseball game with aluminum bats.”

3) Classification Criteria: As mentioned earlier, we conducted a survey to determine what tweet features are typically used by participants to determine whether a tweet is
real or synthetic. We analyzed the survey results to determine not only which features tend to be used more or less frequently but also whether a given feature is beneficial or detrimental to accurate classification. Specifically, in Table X (Table XI), the second column (Number of Participants) reports the number of participants who said that the feature is indicative of real (synthetic) tweets, while the last two columns report the false negative rate (FN) and the false positive rate (FP), averaged only across those individuals who selected the feature. For example, 20 participants stated that they considered tweet readability when making their decisions; on average, these 20 participants misclassified 52.9% of the synthetic tweets as real and 36.7% of the real tweets as synthetic; at the same time, in Table XI, three participants stated that tweet readability is, in fact, indicative of synthetic tweets. In some cases, these averages (outlined in bold in Tables X and XI) are lower than the averages over all participants (52.7% and 38.4%, as reported in Table IX).

TABLE X
FREQUENCY AND UTILITY OF TWEET FEATURES ASSOCIATED WITH REAL TWEETS

TABLE XI
FREQUENCY AND UTILITY OF TWEET FEATURES ASSOCIATED WITH SYNTHETIC TWEETS

For the purposes of detecting synthetic tweets (i.e., reducing the false negative rate), several features appear to be useful (false negative error rate under 40%): similarity of the tweet to other tweets in the data set, the presence of common idioms/phrases, short tweet length, and descriptions of past experiences (e.g., of what the user did). However, because few participants used these features in a beneficial way (e.g., only three considered similarity of the tweet to other tweets in the data set to be indicative of real tweets), and because the benefit was limited (false negative error rates were still over 30% in all cases), we cannot claim that the difference in performance is statistically significant. Thus, we cannot conclusively show that any feature stands out very strongly as a clear indicator of whether a tweet is real or synthetic. Even if we use all these features to build a classifier, the overall predictive power is uncertain and may be small.

As a complement to the above-mentioned analysis, we considered a feature that was not present in the survey, namely, the sentiment of a tweet (positive versus negative). We manually classified the tweets according to sentiment, with 33 tweets labeled as negative and the remaining 67 labeled as positive. We further determined that 16 of the negative tweets were real, while 17 were synthetic; in other words, sentiment distributions were very similar among real and synthetic tweets. Furthermore, we sorted the tweets by classification error (see Fig. 10) and split them into two halves. We then found that 17 of the negative tweets were in the lower half, while 16 were in the upper half; thus, positive and negative tweets presented approximately the same level of difficulty for the human participants. From these results, we can conclude that tweet sentiment is not a reliable feature for distinguishing between real and synthetic tweets.

Finally, we performed an analysis to determine whether experience with Twitter had an effect on the participants’ performance; results are shown in Table XII. Although the participants who use Twitter perform somewhat better at
identifying synthetic tweets than the participants who do not use it at all, the difference is not significant at the 5% level (p = 0.10 for detecting synthetic tweets and p = 0.19 for detecting real tweets, under a two-sample, two-tail t-test assuming unequal variances). Even nonusers of Twitter are often exposed to tweets, e.g., in news articles, which may explain why Twitter users did not have a significant advantage when performing the experiment.

TABLE XII
TWITTER EXPERIENCE AND ITS EFFECT ON CLASSIFICATION

V. CONCLUSION

This paper presented the SHIELD system, which generates synthetic social media data, including rich text and graph structures and dynamics. SHIELD jointly generates time-varying, directed and weighted interaction graph structures and topic-driven text features similar to the input social media data. Various properties, such as anomalies, temporal dynamics, and information cascades, are added synthetically to match those in real data. Computational results and a data evaluation exercise with human participants show that SHIELD generates high-fidelity synthetic data (statistically similar to real data and difficult for humans to distinguish from real data). The synthetic data generated by SHIELD does not duplicate real data and is suitable for evaluating new social media analytics while avoiding the violation of social media users’ privacy.

ACKNOWLEDGMENT

The authors would like to thank the Defense Advanced Research Projects Agency for their support. They would also like to thank Prof. B. Zhao from the University of Chicago for his guidance on graph models and initial code for graph fitting. Distribution Statement A (Approved for Public Release, Distribution Unlimited).

REFERENCES

[1] Number of Monthly Active Twitter Users. Accessed: Jul. 18, 2018. [Online]. Available: https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users
[2] J. Jiang et al., “Understanding latent interactions in online social networks,” ACM Trans. Web, vol. 7, no. 4, pp. 1–39, 2013.
[3] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: Densification laws, shrinking diameters and possible explanations,” in Proc. ACM SIGKDD, 2005, pp. 177–187.
[4] A. Vazquez, “Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 67, p. 056104, Jun. 2003.
[5] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[6] H. Inaltekin, M. Chiang, and H. V. Poor, “Delay of social search on small-world graphs,” J. Math. Sociol., vol. 38, no. 1, pp. 1–46, 2014.
[7] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos, “Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication,” in Proc. PKDD, Porto, Portugal, 2005, pp. 133–145.
[8] L. Akoglu, M. McGlohon, and C. Faloutsos, “RTM: Laws and a recursive generator for weighted time-evolving graphs,” in Proc. 8th IEEE Int. Conf. Data Mining, Dec. 2008, pp. 701–706.
[9] L. Akoglu and C. Faloutsos, “RTG: A recursive realistic graph generator using random typing,” in Proc. PKDD, 2009, pp. 13–28.
[10] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat, “Systematic topology analysis and generation using degree correlations,” in Proc. SIGCOMM, 2006, pp. 135–146.
[11] M. Gjoka, M. Kurant, and A. Markopoulou, “2.5K-graphs: From sampling to generation,” in Proc. IEEE INFOCOM, Turin, Italy, Apr. 2013, pp. 1968–1976.
[12] D. F. Nettleton, “A synthetic data generator for online social network graphs,” Soc. Netw. Anal. Mining, vol. 6, p. 44, Dec. 2016.
[13] P. J. Lin et al., “Development of a synthetic data set generator for building and testing information discovery systems,” in Proc. IEEE ITNG, Las Vegas, NV, USA, Apr. 2006, pp. 707–712.
[14] A. Sala, L. Cao, C. Wilson, R. Zablit, H. Zheng, and B. Y. Zhao, “Measurement-calibrated graph models for social network experiments,” in Proc. WWW, Raleigh, NC, USA, 2010, pp. 861–870.
[15] E. Reiter and R. Dale, Building Natural Language Generation Systems. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[16] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[17] V. Dumoulin et al. (2016). “Adversarially learned inference.” [Online]. Available: https://arxiv.org/abs/1606.00704
[18] W. Liu, P.-Y. Chen, H. Cooper, M. H. Oh, S. Yeung, and T. Suzumura. (2017). “Can GAN learn topological features of a graph?” [Online]. Available: https://arxiv.org/abs/1707.06197
[19] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville. (2017). “Adversarial generation of natural language.” [Online]. Available: https://arxiv.org/abs/1705.10929
[20] S. M. Rodrigo and J. G. F. Abraham, “Development and implementation of a chat bot in a social network,” in Proc. IEEE ITNG, Apr. 2012, pp. 751–755.
[21] G. Pilato, A. Augello, G. Vassallo, and S. Gaglio, “EHeBby: An evocative humorist chat-bot,” J. Mobile Inf. Syst., vol. 4, no. 3, pp. 165–181, 2008.
[22] A. Santangelo, A. Augello, A. Gentile, G. Pilato, and S. Gaglio, “A chat-bot based multimodal virtual guide for cultural heritage tours,” in Proc. PSC, 2006, pp. 114–120.
[23] N. Ali, M. Hindi, and R. V. Yampolskiy, “Evaluation of authorship attribution software on a chat bot corpus,” in Proc. IEEE ICAT, Oct. 2011, pp. 1–6.
[24] T. Holtgraves and T.-L. Han, “A procedure for studying online conversational processing using a chat bot,” Behav. Res. Methods, vol. 39, no. 1, pp. 156–163, 2007.
[25] C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao, “User interactions in social networks and their implications,” in Proc. EuroSys, 2009, pp. 205–218.
[26] Z. Yang, J. Xue, C. Wilson, B. Y. Zhao, and Y. Dai, “Uncovering user interaction dynamics in online social networks,” in Proc. ICWSM, 2015, pp. 698–701.
[27] G. Wang et al., “Social Turing tests: Crowdsourcing sybil detection,” in Proc. NDSS, 2012.
[28] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. SIGKDD, 2003, pp. 137–146.
[29] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in Proc. WWW, 2010, pp. 591–600.
[30] Twitter API Rate Limits. Accessed: Jul. 18, 2018. [Online]. Available: https://dev.twitter.com/rest/public/rate-limits
[31] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, nos. 4–5, pp. 993–1022, 2003.
[32] S. C. Deerwester et al., “Computer information retrieval using latent semantic structure,” U.S. Patent 4 839 853 A, Jun. 13, 1989.
[33] T. Hofmann, “Probabilistic latent semantic analysis,” in Proc. Conf. Uncertainty Artif. Intell., Jul. 1999, pp. 289–296.
[34] D. Jurafsky and J. H. Martin, Speech and Language Processing. Englewood Cliffs, NJ, USA: Prentice Hall, 2000.
[35] Twitter API. Accessed: Jul. 18, 2018. [Online]. Available: https://dev.twitter.com/rest/public
[36] Random Function. Accessed: Jul. 18, 2018. [Online]. Available: https://docs.python.org/3.1/library/random.html
[37] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” Stanford Univ., Stanford, CA, USA, Tech. Rep. CS224N, vol. 1, p. 12, 2009.

Yalin E. Sagduyu (S’02–M’08–SM’15) received the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, MD, USA.
He is currently the Associate Director of Networks and Security at Intelligent Automation, Inc., Rockville, MD, USA. His current research interests include machine learning, graph analytics, networks, and communications.
Dr. Sagduyu chaired workshops at the ACM MobiCom, the IEEE CNS, and the IEEE ICNP, served as the Track Chair at the IEEE PIMRC 2014, and served in the Organizing Committee of the IEEE GLOBECOM 2016.

Alexander Grushin received the Ph.D. degree in computer science from the University of Maryland, College Park, MD, USA.
He was with Quantum Leap Innovations, Newark, DE, USA, where he developed and evaluated optimization algorithms for constrained problems. He is currently a Senior Research Scientist at Intelligent Automation, Inc., Rockville, MD, USA, where he has been involved in machine learning, network science, optimization, multiagent systems, social media analytics, air traffic analysis and modeling, and the control of robot swarms.

Yi Shi (S’02–M’08–SM’13) received the Ph.D. degree from Virginia Tech, Blacksburg, VA, USA.
He is currently a Senior Research Scientist at Intelligent Automation, Inc., Rockville, MD, USA. His current research interests include algorithm design, optimization, machine learning, communication networks, and social networks.
Dr. Shi was a recipient of the IEEE INFOCOM 2008 Best Paper Award and the IEEE INFOCOM 2011 Best Paper Award Runner-Up. He has been a TPC Chair for IEEE and ACM workshops. He has been an Editor of the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS.
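
Supplementary sketch for the group classification analysis (Section B, item 2). The paper reports that 28 of the 50 synthetic tweets were misclassified by a majority of the 25 participants and that the majority misclassification rate is significantly greater than 41% (p = 1.6 × 10^-2). The Python sketch below shows one way such numbers could be reproduced. It assumes that the "one-sample, one-tail significance test for a population proportion" is the usual normal-approximation z-test, which the paper does not state explicitly. The function names and the toy error counts are hypothetical and are not taken from SHIELD.

```python
import math


def majority_misclassified(error_counts, n_participants):
    """Count tweets that a strict majority of participants got wrong.

    error_counts[i] is the number of participants who misclassified tweet i.
    With 25 participants, a strict majority means 13 or more errors.
    """
    return sum(1 for errors in error_counts if 2 * errors > n_participants)


def one_sided_proportion_test(successes, n, p0):
    """Upper-tail z-test of H0: p <= p0 against H1: p > p0.

    Assumption: normal approximation to the binomial, with the standard
    error evaluated at the null proportion p0.
    Returns the z statistic and the one-sided p-value.
    """
    p_hat = successes / n
    std_err = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / std_err
    # Upper-tail probability of a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Hypothetical per-tweet error counts (not from the paper): only the first
# two tweets reach the 13-of-25 majority threshold.
toy_error_counts = [22, 13, 12, 3]
toy_majority = majority_misclassified(toy_error_counts, n_participants=25)

# Figures reported in the paper: 28 of 50 synthetic tweets, null value 41%.
z_stat, p_val = one_sided_proportion_test(successes=28, n=50, p0=0.41)
print(f"majority-misclassified toy tweets: {toy_majority}")
print(f"z = {z_stat:.3f}, one-sided p = {p_val:.4f}")
```

Under the stated assumption, the test statistic is about 2.16 and the one-sided p-value is about 0.016, which is in line with the reported p = 1.6 × 10^-2. An exact binomial test would be a reasonable alternative for a sample of only 50 tweets and may give a slightly different value.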
