Abstract: This paper presents a novel system, synthetic high-fidelity social media data generator (SHIELD), for generating the synthetic social media data. SHIELD jointly generates time-varying, directed and weighted interaction graph structures and topic-driven text features similar to the input social media data. A synthetic interaction graph is generated by a social network model to minimize the distance to real graph and is enhanced by adding various patterns, such as anomalies and information cascades, interaction types, and temporal dynamics. A synthetic text generator based on the n-gram Markov model is trained under each topic identified by topic modeling. Synthetic text and graph structures are combined through the assignment of synthetic social media entities. Extensive performance evaluation via a graph and text analysis is provided to demonstrate the statistical fidelity of large-scale synthetic data generated by SHIELD. A data evaluation exercise with human participants is executed to identify how difficult it is for a human to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Experimental results followed by a statistical significance analysis showed that human participants cannot reliably distinguish between real and synthetic tweets.

Index Terms: Computational test, dK-2 distance, experiment, graph, graph fitting, interactions, latent Dirichlet allocation (LDA), social media, social network, synthetic data, text, topic modeling, topics.

Manuscript received September 11, 2017; revised March 23, 2018 and June 7, 2018; accepted June 10, 2018. This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract W31P4Q-13-C-0055. The views, opinions, and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. (Corresponding author: Yi Shi.)
The authors are with Intelligent Automation, Inc., Rockville, MD 20855 USA (e-mail: ysagduyu@i-a-i.com; agrushin@i-a-i.com; yshi@i-a-i.com).
Digital Object Identifier 10.1109/TCSS.2018.2854668
2329-924X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

Social media has been growing at a fast pace, with various microblogging and social networking services generating large-scale data. As an example, Twitter has around 336 million monthly active users (reported for the first quarter of 2018 [1]). Social media provides rich data feeds that can be utilized for different purposes, such as marketing and social event prediction. It has changed the way that people interact with each other and respond to events. However, the full capture and understanding of social media is largely missing due to the growing number and type of data feeds with complex interactions among social network actors. In this context, social media analytics (including graph, machine learning, and natural language processing algorithms) play a critical role in extracting, capturing, and assessing information in social media.

In turn, research and development progress in social media analytics depends on the availability of social media data to model and analyze the social behavior. Existing small and static data sets cannot keep up with growing social media feeds. Social media data are available through either public application programming interfaces (APIs) or paid data services. However, there are privacy limitations on collecting, sharing, or distributing new social media data (such as Facebook or Twitter data). These limitations may slow down the progress in social network research and social media analytics by granting data access only to a small group of researchers and preventing full data disclosure that would be necessary to verify and further improve the results reported in research findings. Therefore, it is important to generate large-scale synthetic data sets that reflect real-world data sets in terms of statistical or application properties and can be shared with others without privacy concerns because of their anonymized nature. In addition, synthetic data can be used for generating and analyzing the large-scale and high-fidelity behavior of networks of social bots (automated social media posting programs, such as spammers) and for marketing, advertisement, and generating buzz in social media. These synthetic data that are needed for these purposes should be large-scale, automatically generated, and high fidelity in terms of similarity to real data.

A. Novel Contributions

This paper presents the synthetic high-fidelity social media data generator (SHIELD) system for generating large volumes of synthetic data in social media, such as from microblogging (e.g., Twitter) or social networking (e.g., Facebook) services. SHIELD integrates synthetic graph structures and synthetic text features (while preserving different graph and text patterns, such as anomalies, temporal dynamics, information cascades, and topics) to produce large-scale and anonymized social media data that is similar to input social media (in terms of statistical and application-level properties). SHIELD generates the underlying graph structure and the text content jointly by fitting to the graph, text, statistical, and temporal characteristics of the input data generated by real social media, such as data from microblogging or social networking services. Computational tests demonstrate that SHIELD can generate high-fidelity data that is statistically similar to real data (although realizations are widely different), whereas a data evaluation exercise with human participants demonstrates that participants cannot reliably distinguish synthetic data generated by SHIELD from real data.

Our novel contributions are given as follows.
1) To generate synthetic graph structures, a social interaction graph is generated from social interactions in the given social media data set [2]. Users and other social media entities (such as hashtags in Twitter) are
assigned to vertices in the interaction graph and social media interactions between two entities (e.g., mention, reply, and retweet in Twitter) are assigned to edges. Then, a synthetic graph similar to the given interaction graph is generated by minimizing the distance (statistical or application-dependent) between the input and synthetic graphs. Synthetic patterns (such as anomalies in the form of high-degree vertices, hubs, and cliques) are added to the synthetic graph. The synthetic graph is generated with multiple connected components and attributes (such as group memberships) and can be fitted to composite social network models and temporal dynamics in the input data.
2) The textual content is extracted from the given social media data set and synthetically generated by dividing data based on topics, training an n-gram model (Markov chain) for each topic, sampling text data from the models, and filtering synthetic social media posts with identical text or with grammar mistakes. Social media entities (e.g., hashtags and hyperlinks in tweets) are added with the same distributions as in the input data.
3) Large-scale synthetic social media data sets are generated by combining the synthetic graph and synthetic textual content. Graph and textual properties of social media entities (such as hashtags in Twitter) in the input interaction graph and the synthetic graph are matched with each other, and the text is selected for the most dominant topic assigned to some localized subset of vertices in the synthetic graph. Synthetic data generation is scaled up beyond the size of input social media data by sampling vertices and edges according to the fitted social network model and sampling text and social media entities according to their statistical properties in the input social media data. Synthetic data generation is parallelized by dividing the input social media data into multiple sets by the posting time, generating synthetic social media for each set, and combining data sets through common usernames and social media text entities.
4) We performed extensive tests to evaluate the synthetic data generated by SHIELD as follows.
   a) First, we performed computational experiments where we collected a large set of data from Twitter, generated a large corresponding set of synthetic tweets, and compared the two data sets in terms of graph and text properties. The analysis showed that synthetic graph and text structures are statistically similar to their real counterparts, but that their realizations are different.
   b) Second, we carried out a data evaluation exercise with human participants to measure how difficult it is for the participants to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Participants are given a set of tweets and asked to determine whether each tweet was generated by a human (real) or a machine (synthetic). After classifying the tweets, participants complete a brief follow-up survey about which features they used to decide whether tweets are real or synthetic. We received a Human Research Protection Office (HRPO) Concurrence Memorandum with the exempt determination for the protocol “SHIELD Data Collection Plan.” Experimental results followed by the statistical significance analysis showed that human participants cannot reliably distinguish between real and synthetic tweets.

SHIELD is particularly useful in generating data to be used for various purposes, including developing and testing new social media analytics, generating or analyzing social bot network behavior and campaigns in social media, and sharing test data with others without privacy concerns.

B. Related Work

Synthetic graphs can be generated according to various social network models, such as Forest Fire [3], Nearest Neighbor [4], Random Walk [4], Barabasi–Albert [5], Octopus [6], Kronecker [7] and their tensor follow-ups [8], Random Typing [9], dK-2 [10] and dK-2.5 [11], or by other graph generation algorithms [12], [13]. For social network models, graph parameters (in parametric models) can be selected to make the synthetic graph statistically similar to a real graph, as discussed in [14].

On the other hand, synthetic text can be generated separately via natural language generation processes such as n-gram models [15]. Recently, machine learning, e.g., the generative adversarial network (GAN) [16], has been applied to build generator models for synthetic data, such as images [17], topological features of graphs [18], and text [19]. However, the scalability and the need for fine-tuning the underlying deep neural networks for the generator and discriminator models in the GAN remain as challenges.

From a more practical application perspective, there have been efforts to generate chat bots [20], [21] that can produce synthetic natural language for a conversation. For example, a chat bot is developed in [20] for the Twitter social network for entertainment and viral advertising, and an agent model is developed in [21] to generate humorous sentences and recognize humoristic expressions introduced by the user during the dialog. Chat bots have been used for many applications, such as speech recognition [22], authorship recognition [23], and generating conversations [24]. SHIELD provides a systematic and scalable solution to generate synthetic data that can be used to drive not only individual but also coordinated groups of chat bots. SHIELD’s major advantage is jointly generating synthetic graph and text features along with social media entities that involve highly complex structures similar (statistically and at the application level) to the real social media data.

The rest of this paper is organized as follows. Section II describes the SHIELD system that consists of synthetic graph generation (see Section II-A), synthetic text generation (see Section II-B), and combination of synthetic text and graph structures (see Section II-C). Section III provides a computational fidelity analysis of synthetic graph (see Section III-B)
and text (see Section III-C) properties. Section IV provides experimentation results based on a data evaluation exercise with human participants. Section V concludes this paper.

II. SYSTEM DESCRIPTION

Fig. 1 shows three main steps of SHIELD in generating synthetic social media data: Step 1a generates synthetic graph structures, Step 1b generates synthetic text, and Step 2 combines synthetic text and graph structures (through synthetic social media entities). Step 1a and Step 1b can be run in parallel, followed by Step 2. We will discuss each step in detail in Sections II-A–II-C. Each step applies state-of-the-art techniques with analytical foundations. SHIELD combines these steps in a unique way to jointly generate synthetic graph and text data for social media. SHIELD can be applied to social media data from different platforms. While we use Facebook and Twitter data for numerical results, the underlying graph and text generation procedures of SHIELD are generic enough to extend to different platforms.

A. Synthetic Graph Generation

Methods for generating synthetic graph structures $G'$ for social media are shown in Fig. 2.

Fig. 2. Step 1a: generate synthetic graph structures.

1) Interaction Graph Generation: An interaction graph $G$ is extracted from the input (real) social media data. An interaction graph represents how social network actors interact with each other [25], [26]. Entities and their interactions in social media are identified, and an interaction graph is built with a vertex set $V$, including entities, an edge set $E$ representing interactions, and an attribute set $A$, which includes both vertex (entity) attributes and edge (interaction) attributes. This framework can be applied to any social media platform. In the following, we provide examples.
1) Twitter: The vertex set $V$ includes users and hashtags, while the edge set $E$ includes replies, retweets, and mentions (of both users and hashtags).
2) Instagram: The vertex set $V$ includes users and tags, while the edge set $E$ includes post comments and mentions in post captions or post comments.
3) Facebook: The vertex set $V$ includes users, while the edge set $E$ includes wall posts and photograph comments.
In the attribute set $A$ for each platform, user and hashtag/tag information, such as language and location for a user, is defined as vertex attributes, and edge attributes are defined by the number of interactions between two vertices.

In this section, we will use Facebook data (corresponding to a graph of 10k vertices) in all the presented examples to demonstrate the effectiveness of individual graph generation components of SHIELD. In this data set, interactions correspond to wall posts or photograph comment activities (captured from Facebook for the New York region [14]) and build the edges in the interaction graph. We will switch to a larger Twitter data set (corresponding to a graph of more than 75k vertices) in Section III when we evaluate the overall system performance.

2) Synthetic Graph Generation With Distance Minimization: A synthetic interaction graph $G'$ is generated via some social graph model, minimizing the distance between the real and synthetic interaction graphs. SHIELD uses various social graph models that can generate a large graph with a small number of parameters.
1) Graph Generation With Statistical Distance: SHIELD generates the synthetic graph by minimizing the statistical graph distance from the input graph, e.g., the dK-2 distance [10] between joint degree distributions. The dK-2 distance has been used in [14] to calibrate synthetic graph generation with respect to input real graphs. Under the dK-2 distance, the number of edges ($n_{ij}$ for the real graph and $n'_{ij}$ for the synthetic graph) is identified connecting a vertex with degree $i$ to a vertex with degree $j$. The distance is then defined as $\mathrm{dist}(G, G'(p)) = \left( \sum_{i,j} (n_{ij} - n'_{ij})^2 \right)^{1/2}$. The dK-2 distance refers to the difference of joint degree distributions and provides more fidelity than other measures listed in Table I, as it captures higher order statistics. SHIELD supports both parametric and nonparametric social network models.
   a) Parametric Models: The parametric models include Forest Fire [3], Nearest Neighbor [4], Random Walk [4], Barabasi–Albert [5], Octopus [6], and Kronecker [8]. For example, the Forest Fire model has two parameters (connection probability and average number of iterations per vertex) to select. The parameters are tuned such that the distance $\mathrm{dist}(G, G'(p))$ between the real and synthetic interaction graphs is minimized over
TABLE I. STATISTICAL FIDELITY EVALUATION
TABLE II. APPLICATION-LEVEL FIDELITY EVALUATION
TABLE III. GENERATION OF SYNTHETIC PATTERNS
TABLE IV. TEMPORAL GRAPH GENERATION
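
As an illustration of the dK-2 distance that Section II-A uses as the primary statistical fidelity measure for graph fitting, the following minimal Python sketch counts edges by the degree pair of their endpoints and takes the Euclidean distance between the two count tables. It is a sketch under stated assumptions, not SHIELD's implementation: graphs are assumed to be simple and undirected and given as edge lists, each unordered degree pair $(i, j)$ with $i \le j$ is counted once, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def joint_degree_counts(edges):
    """Return n_ij: the number of edges joining a degree-i and a degree-j vertex.

    Assumption: `edges` lists each undirected edge once, with no self-loops.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter()
    for u, v in edges:
        # Store the degree pair in sorted order so (i, j) and (j, i) coincide.
        i, j = sorted((degree[u], degree[v]))
        counts[(i, j)] += 1
    return counts


def dk2_distance(real_edges, synthetic_edges):
    """Euclidean distance between the joint degree counts of two graphs."""
    n_real = joint_degree_counts(real_edges)
    n_syn = joint_degree_counts(synthetic_edges)
    pairs = set(n_real) | set(n_syn)  # a Counter returns 0 for missing pairs
    return sqrt(sum((n_real[p] - n_syn[p]) ** 2 for p in pairs))


# Usage: a 3-vertex path versus a triangle.
path = [("a", "b"), ("b", "c")]
triangle = [(1, 2), (2, 3), (1, 3)]
distance = dk2_distance(path, triangle)
```

In a parametric fit, a function of this kind would be evaluated for each candidate parameter setting of the graph model, and the setting with the smallest distance would be kept.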
the correlation between locations of real posts (where available) and LDA topics of corresponding posts. For example, consider the generation of synthetic locations. The coupling between content and locations is preserved since the subjects of the tweets are correlated in general with locations. Given a number of real tweets with locations and synthetic tweets without locations, the synthetic location generator executes the following steps to generate the locations for the synthetic tweets.
1) Clean all synthetic and real tweets by removing hashtags, mentions, and hyperlinks (URLs).
2) Apply LDA to all tweets and compute the topic per tweet distribution (namely the LDA scores).
3) For each tweet, find the tweet’s dominant topic (as defined in Section II-B1).
4) Collect the location information (latitude and longitude) for real tweets.
5) Divide the real tweets’ locations into small geographic bins and compute the probability of the real tweet location falling in one of those bins, given a dominant topic.
6) Generate synthetic tweets’ locations in small geographic bins for a given dominant topic, from the probability distribution defined in the previous step.
Real tweets with no location are assigned location (0, 0). This particular (0, 0) location is assigned to synthetic tweets with the same ratio as for input tweets.

5) Scalable Generation of Synthetic Social Media Data: To improve scalability, input tweets are divided into multiple sets $T_1, T_2, \ldots$ by the posting time. The above-mentioned generation procedure is applied to obtain a set of synthetic tweets $S_i$ for each set $T_i$. Then, user names and hashtags are updated in $S_{i+1}$ based on $S_i$ and the correlation between $T_{i+1}$ and $T_i$ (measured by the Jaccard index $|T_i \cap T_{i+1}| / |T_i \cup T_{i+1}|$) such that the same time correlation can be observed among synthetic and original sets. Synthetic data generation can be parallelized by dividing the input social media data into multiple sets by the posting time, generating synthetic social media for each set, and combining data sets through common usernames and social media text entities.

III. COMPUTATIONAL EVALUATION OF SYNTHETIC SOCIAL MEDIA DATA

In this section, we describe a computational experiment where we collected a large set of Twitter data, generated a large set of synthetic tweets, and compared the two data sets in terms of graph and text properties. In the following, we describe the details of the data collection and generation procedure and present the comparison measures and results.

A. Experimental Procedures

To generate synthetic tweets, real tweets were first collected on the following keywords: “mlb,” “baseball,” “rockies,” and “mets” using Twitter API [35]. The collection was done at the end of July 2016. The first 120 000 collected tweets were used. These tweets were split into 12 subsets of 10 000 tweets; synthetic tweet generation was performed separately for each subset; the resultant synthetic tweets were then combined into a full set of synthetic tweets, from which a full synthetic interaction graph was extracted. We used the following Forest Fire, Nearest Neighbor, and dK-2.5 algorithms for fitting the graph.
1) Forest Fire [3] reproduces dynamic ranges in graph density and represents increasing density and decreasing diameter over time. The graph grows with each new vertex connecting to (“burning”) existing vertices. After a new vertex connects to an existing vertex, it randomly connects to some of the neighbors of the vertex. There are two parameters $p_1$ and $p_2$. Once a vertex is burned, its neighbor will be burned with probability $p_1$, and this process is limited to its $p_2$-hop neighbors.
2) Nearest Neighbor [4] emulates similarity and transitivity-based growth. The model is based on the observation that two people sharing a common friend are more likely to become friends. Each new vertex added to the graph is connected to a random existing vertex. Additionally, random pairs of 2-hop neighbors around the new vertex are connected. There are two parameters $p_1$ and $p_2$. The probability $p_1$ determines at each step if a new vertex is added or if a pair of 2-hop neighbors is connected. To reduce the power law exponent, each time a new vertex is added, we also connect $p_2$ pairs of existing vertices randomly chosen from the graph.
3) dK-2.5 [11] is an extension to the dK-2 method [10], which matches joint degree distributions $p_{x,y}$ (the probability that a vertex with degree $x$ is connected to a vertex with degree $y$) of the synthetic graph to those of the real graph. dK-2 achieves only a small clustering coefficient. dK-2.5 improves the clustering coefficient by rewiring (adding and removing) edges while keeping joint degree distributions. There is no parameter associated with dK-2.5.

Note that while the parameter $p_2$ in Forest Fire and Nearest Neighbor algorithms is discrete, dK-2 distance minimization can be further refined by instead using a floating-point number $x.y$, where $x$ is the integer part of a decimal number $p_2$ and $y$ is the fractional part. Then, with probability $y$, $p_2$ is chosen as $x + 1$ and with probability $1 - y$, $p_2$ is chosen as $x$.

For the parametric models, the objective for graph fitting is to minimize the dK-2 distance. Each connected component was fitted separately, with the smallest components not fitted. Hubs were added through the vertex rewiring procedure described in Section II-A3, which was applied to every vertex with degree 50 or greater. For each subset of real tweets, text was generated along four LDA topics, with 400 000 posts produced (overgenerated before tying them to the graph). These posts were combined with the synthetic graph to generate 30 000 synthetic tweets. Prior to combining the subsets of synthetic tweets, we automatically filtered out those that contained profane or potentially offensive language, or those that were too similar to some real tweet.

B. Synthetic Graph Properties

For each subset $i$ of 10 000 real tweets, we compared the real interaction graph $G_i$ with the synthetic fitted graph $G'_i$; the
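
The randomized rounding of a fractional $p_2$ described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, $p_2$ is assumed to be nonnegative, and the random source is passed in so that runs can be reproduced.

```python
import math
import random


def sample_integer_parameter(p2, rng=random):
    """Round a fractional p2 = x.y to x + 1 with probability y, else to x."""
    x = math.floor(p2)   # integer part (p2 assumed nonnegative)
    y = p2 - x           # fractional part, in [0, 1)
    return x + 1 if rng.random() < y else x


# Usage: the long-run average of the sampled integers approaches p2 itself.
rng = random.Random(0)
samples = [sample_integer_parameter(2.25, rng) for _ in range(20000)]
mean_value = sum(samples) / len(samples)
```

Because the expected value of the sampled integer equals the fractional $p_2$, a search over real-valued $p_2$ can explore settings that lie between two consecutive integer values.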
same was done for the real interaction graph $G$ corresponding to the entire set of 120 000 real tweets and the synthetic interaction graph $G'$ that was extracted from the combined set of synthetic tweets. In comparing each pair of graphs $(G_i, G'_i)$, our goal is twofold: 1) we wish to show that statistically, $G_i$ and $G'_i$ have similar properties and 2) in terms of concrete topological features, we wish to show that the graphs are different (i.e., are significantly nonisomorphic).

For the statistical comparison, we use the aforementioned dK-2 distance $\mathrm{dist}(G_i, G'_i)$ as the primary measure. We also count the number of vertices $N(G_i)$ and $N(G'_i)$ in the real and synthetic graphs, respectively, as well as the number of edges $E(G_i)$ and $E(G'_i)$ in these graphs. Table V provides these measures for the Forest Fire, Nearest Neighbor, and dK-2.5 fitting algorithms; the last row provides them for the full 120 000 tweet real and synthetic graphs, while the previous rows correspond to each subset of 10 000 tweets. Table VI gives the best graph fitting parameters found by the algorithm for the largest component in the graph (note that the dK-2.5 algorithm has no parameters).

TABLE V. STATISTICAL GRAPH COMPARISON MEASURES
TABLE VI. OPTIMAL PARAMETER VALUES

By design, for Forest Fire and Nearest Neighbor, each 10 000 tweet synthetic graph has the same number of vertices as the corresponding real graph; while the number of edges is not the same for $G_i$ and $G'_i$, the number is reasonably close, implying a similar average degree; dK-2 distances, while nonzero, are relatively small (compared with the total number of edges), suggesting statistical similarity between real and synthetic graphs. On the other hand, for the dK-2.5 algorithm, the number of edges does not change during graph generation, while the number of vertices does. By design, dK-2 distances are 0 for all 10k tweet graph pairs, though the distance is positive for the two full graphs, due to the effects of the tweet combination procedure.

It is important to note that the statistical differences do not necessarily imply that the graphs have very different concrete topologies; for example, consider a complete graph of $k$ vertices (with an edge between each pair of vertices), and another complete graph of $k - 1$ vertices. The dK-2 distance between the two graphs will be very high; however, the graphs will have $k - 1$ vertices and $(k - 1)(k - 2)/2$ edges in common; in fact, the latter graph will be isomorphic to a subgraph of the former. To determine the “amount” of isomorphism between two graphs $G_i$ and $G'_i$, we must find a bijection $m$ from the vertices of $G_i$ to the vertices of $G'_i$, such that we maximize the number of edges $(u, v)$ in $G_i$ that have a corresponding edge $(m(u), m(v))$ in $G'_i$. Unfortunately, finding this bijection is a difficult problem; although some approximations have been developed, at present, no known algorithm exists that will solve it optimally in a tractable amount of time for large graphs. We thus measure the concrete difference between the graphs empirically over a number of trials, where each trial potentially uses a different bijection, which it creates with a different set of random choices, as follows.
1) For both $G_i$ and $G'_i$, order the vertices from “highest” to “lowest,” according to the following criteria. First, define $\deg_0(v) = \deg(v)$ to be the degree of vertex $v$, and $\deg_k(v)$ to be the degree of $v$’s $k$th neighbor, where the neighbors are ordered by degree, from highest to lowest (with ties broken randomly). In other words, when $k = 1$, $\deg_k(v)$ is the degree of $v$’s highest degree neighbor, and when $k = \deg(v)$, $\deg_k(v)$ is the degree of $v$’s lowest degree neighbor. Now, the order of two vertices $u$ and $v$ is determined by the following algorithm.
   a) For $k$ from 0 to $\deg(u)$, do the following.
      i) If $\deg_k(u) > \deg_k(v)$, exit and return $u > v$.
      ii) If $\deg_k(u) < \deg_k(v)$, exit and return $u < v$.
   b) If the loop executed all $\deg(u) + 1$ iterations without exiting, return $u = v$ and the order of the two vertices is determined randomly during the sorting process.
2) Given the ordered (according to the above-mentioned criteria) lists of vertices $V(G_i)$ and $V(G'_i)$ from $G_i$ and $G'_i$, respectively, randomly remove $N(G_i) - N(G'_i)$ vertices from $V(G_i)$ if $N(G_i) > N(G'_i)$ (i.e., if $|V(G_i)| > |V(G'_i)|$), to make the number of vertices equal; similarly, if $N(G'_i) > N(G_i)$, then randomly remove $N(G'_i) - N(G_i)$ vertices from $V(G'_i)$.
3) Define a bijection $m$ using the following algorithm.
   a) For $k$ from 1 to $|V(G_i)|$, do the following.
      i) Define $m(V(G_i)[k]) = V(G'_i)[k]$.
4) Count the number $C$ of edges $(u, v)$ in $G_i$ that have a corresponding edge $(m(u), m(v))$ in $G'_i$, and compute the edge match score $M(G_i, G'_i) = 2C / (E(G_i) + E(G'_i))$.

We compute the expected edge match score $E[M(G_i, G'_i)]$ between $G_i$ and $G'_i$ over 100 trials. However, because the bijections (used in each trial) are likely not optimal, a low score does not necessarily imply a low number of matching edges. Thus, for comparison, we also computed $E[M(G_i, G_i)]$ and $E[M(G'_i, G'_i)]$. All these measures are shown in Table VII for Forest Fire, Nearest Neighbor, and dK-2.5. The measures $E[M(G_i, G_i)]$ and $E[M(G'_i, G'_i)]$, though not equal to 1 (which would be the case with an optimal bijection, since a graph is isomorphic to itself), are much higher than $E[M(G_i, G'_i)]$; in all cases, the difference is statistically significant, according to a two-sample two-tail t-test assuming unequal variances ($p < 0.01$). This empirically suggests that $G_i$ and $G'_i$ are significantly nonisomorphic. Note that 100 trials were used, which should be sufficient if the random function is, indeed, random. However, the random function provided in a programming language is not truly random, and its output is determined by seed numbers. We used the pseudorandom generator of Python that uses the Mersenne Twister as the core generator and produces 53-bit precision floats with a period of $2^{19937} - 1$ [36]. While a long period is not a guarantee of quality in a random number generator, it is long enough to provide diverse random numbers to SHIELD.

TABLE VII. GRAPH REALIZATION COMPARISON MEASURES

C. Synthetic Text Properties

In addition to comparing the graphs, we also compared the textual content of each subset of 10 000 real tweet posts $P$ with textual content from the corresponding set of synthetic tweet posts $P'$ and performed a similar comparison for the full set of 120 000 real tweets and the combined set of synthetic tweets. Again, our goal is to demonstrate both statistical similarity and a difference in concrete features. For the former, we use the well-known cosine similarity, which we compute as follows.
1) Remove all user mentions, hashtags, and hyperlinks from each tweet in $P$ and $P'$.
2) For $P$ and $P'$, define vectors $T(P)$ and $T(P')$, respectively, of term frequency counts (where a term is a word or any other string that is separated by whitespace but does not contain whitespace within it). It is ensured that $T(P)$ and $T(P')$ have the same length (corresponding to the size of the vocabulary common to both sets of tweets), and that for any $k$, $T(P)[k]$ and $T(P')[k]$ correspond to the frequency count for the same term.
3) Compute the cosine similarity as $d_{\cos}(P, P') = T(P) \cdot T(P') / (|T(P)|\,|T(P')|)$, where $\cdot$ is the inner (dot) product of the two vectors.

As Table VIII shows, the cosine similarity values are relatively large for text (where the underlying graph is synthetically generated with Forest Fire, Nearest Neighbor, and dK-2.5). This is not surprising, given that a Markov chain is used to generate the synthetic text; since Markov chains are designed to replicate the statistical distribution of n-grams (sequences of n terms), they also replicate the statistical distribution of individual terms. At the same time, as discussed in Section II-B, we wish to avoid situations where the sequence of terms (which comprises a synthetic tweet) is identical to the sequence of terms comprising some real tweet or appears as a subsequence within it (e.g., if there is a real tweet that says “It is cold today,” we wish to avoid synthetic tweets,
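
The vertex ordering, pairing, and edge counting steps of the edge match score can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the code used for the reported results: graphs are simple undirected edge lists without isolated vertices, the comparison of $\deg_0, \deg_1, \ldots$ is implemented as a lexicographic comparison of tuples, random tie-breaking is obtained by shuffling before a stable sort, and all function names are hypothetical.

```python
import random


def adjacency(edges):
    """Build an adjacency map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def ordered_vertices(adj, rng):
    """Order vertices from highest to lowest by (deg_0, deg_1, deg_2, ...)."""
    def key(v):
        neighbor_degrees = sorted((len(adj[w]) for w in adj[v]), reverse=True)
        return (len(adj[v]), *neighbor_degrees)

    vertices = list(adj)
    rng.shuffle(vertices)                 # random order among tied vertices
    vertices.sort(key=key, reverse=True)  # stable sort keeps the shuffled ties
    return vertices


def edge_match_score(edges_a, edges_b, rng):
    """One trial of M = 2C / (E_a + E_b) with a rank-by-rank bijection."""
    order_a = ordered_vertices(adjacency(edges_a), rng)
    order_b = ordered_vertices(adjacency(edges_b), rng)
    n = min(len(order_a), len(order_b))

    def trim(order):
        # Randomly drop surplus vertices while preserving the ranking.
        keep = sorted(rng.sample(range(len(order)), n))
        return [order[i] for i in keep]

    mapping = dict(zip(trim(order_a), trim(order_b)))
    edge_set_b = {frozenset(e) for e in edges_b}
    matches = sum(
        1
        for u, v in edges_a
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in edge_set_b
    )
    return 2 * matches / (len(edges_a) + len(edges_b))


def expected_edge_match(edges_a, edges_b, trials=100, seed=0):
    """Average the edge match score over independent random trials."""
    rng = random.Random(seed)
    scores = [edge_match_score(edges_a, edges_b, rng) for _ in range(trials)]
    return sum(scores) / trials


# Usage: a triangle compared with a 3-vertex path, and a star with itself.
triangle = [(1, 2), (2, 3), (1, 3)]
path = [("a", "b"), ("b", "c")]
star = [(0, 1), (0, 2), (0, 3)]
score_triangle_path = expected_edge_match(triangle, path)
score_star_star = expected_edge_match(star, [("c", "x"), ("c", "y"), ("c", "z")])
```

Passing a seeded `random.Random` instance makes every trial reproducible, which is convenient when the same graphs are scored again for a significance test.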
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE VIII
TEXT COMPARISON MEASURES
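As an illustration of the two kinds of measures that Table VIII reports, the following is a minimal Python sketch and not the authors' implementation. It assumes whitespace tokenization with lowercase matching, and it assumes that the overlap proportion is taken relative to the number of synthetic tweets; all function names are hypothetical.

```python
import math
from collections import Counter


def term_counts(tweets):
    """Count how often each term occurs across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())  # assumption: whitespace tokens
    return counts


def cosine_similarity(real_tweets, synthetic_tweets):
    """Cosine similarity of two term-frequency vectors aligned by term."""
    real = term_counts(real_tweets)
    synth = term_counts(synthetic_tweets)
    vocab = sorted(set(real) | set(synth))  # same index k means same term
    u = [real[t] for t in vocab]
    v = [synth[t] for t in vocab]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_contained(synthetic, real):
    """True if the synthetic term sequence appears contiguously in the real one."""
    s, r = synthetic.lower().split(), real.lower().split()
    n = len(s)
    return n > 0 and any(r[i:i + n] == s for i in range(len(r) - n + 1))


def overlap_ratio(real_tweets, synthetic_tweets):
    """Fraction of synthetic tweets that equal, or sit inside, some real tweet."""
    if not synthetic_tweets:
        return 0.0
    hits = sum(
        any(is_contained(s, r) for r in real_tweets) for s in synthetic_tweets
    )
    return hits / len(synthetic_tweets)
```

With a real tweet "It is cold today", `is_contained("cold today", "It is cold today")` is true, so such a synthetic tweet would be counted in the overlap ratio and then filtered out.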
such as “cold today” or “It is cold”). As Table VIII shows, the proportion $R(P, P')$ of such tweets is quite small, and as we stated earlier, they are filtered out and do not appear in the final set of tweets.
Next, we evaluate the semantic features of synthetic data in terms of sentiments. For that purpose, we classify tweets into two different classes (sentiments): “positive” and “negative.” We use word distributions (i.e., the number of times a given word appears in each tweet) as features. Training data are collected through associations with emoticons in tweets according to the procedure outlined in [37]. We use the Naive Bayes Classifier for illustration purposes (more advanced classifiers can be used for sentiment analysis and the class “neutral” could be added). We compare sentiments of real tweets with those of synthetic tweets generated with the Nearest Neighbor method; 63% of real tweets are positive, whereas 55% of synthetic tweets are positive. This suggests that the synthetic data generation maintains the general trend of sentiments.
Importantly, the ultimate test of synthetic text quality is whether it appears indistinguishable (from real text) to humans. In Section IV, we demonstrate that this is, indeed, often the case.
IV. DATA EVALUATION EXERCISE WITH HUMAN PARTICIPANTS AS A “MINI TURING TEST”
The underlying goal of the experiment is to evaluate the content generation capabilities of the SHIELD system, which generates synthetic Twitter data. The capabilities are effective if it is difficult for a human to distinguish between tweets that were generated by SHIELD and tweets that were posted by real users. Thus, our experiment can be viewed as a simplified, noninteractive version of the Turing test, where participants are asked to determine whether a given tweet was generated by a human or a machine. We prepared the plan/protocol for a data evaluation exercise with human participants, obtained the necessary approvals (we received an HRPO Concurrence Memorandum with the exempt determination for the Protocol, “SHIELD Data Collection Plan”), recruited participants, collected consent forms, and executed the experiment.
A. Methods
1) Data: For our experiment, we drew from two data sets: one real and one synthetic. The data sets are described in Section III-A; the synthetic data set was generated with the Forest Fire algorithm. For the purposes of a data evaluation exercise with human participants, we anonymized the tweets by replacing user, hashtag, and URL mentions with the generic strings @user, #hashtag, and http://url, respectively. We additionally converted all alphabetical characters to lowercase and removed all retweets. After the two data sets were processed in this manner, 50 tweets were randomly selected from the real data set, and 50 were selected from the synthetic data set, ensuring that no chosen tweet contained offensive content or too little content. The two subsets were combined and shuffled, and each participant was then presented with the same list of 100 tweets.
2) Participants: Twenty-five employees of Intelligent Automation, Inc. were selected as the participants for the experiment. They have college and/or graduate degrees, are either native or fluent English speakers, and include frequent, occasional, and non-Twitter users.
3) Experiment: Each participant was sent the list of tweets (as generated earlier) via e-mail and asked to determine whether each tweet was generated by a human (real) or a machine (synthetic), according to his/her best judgment. The participants were not told the proportion of real versus synthetic tweets in the list. No strict deadline or time limit was imposed, although 10–15 min was typically given as a guideline for the amount of time needed to process the list of tweets. Results were collected from all participants, and for each participant, we computed the false negative rate (the fraction of synthetic tweets that the participant identified as real) and the false positive rate (the fraction of real tweets that the participant identified as synthetic). After rating the tweets, each participant was asked to complete a brief follow-up survey on the features (listed in the first column
TABLE IX
CLASSIFICATION ERROR PER PARTICIPANT
TABLE X
FREQUENCY AND UTILITY OF TWEET FEATURES ASSOCIATED WITH REAL TWEETS
TABLE XI
FREQUENCY AND UTILITY OF TWEET FEATURES ASSOCIATED WITH SYNTHETIC TWEETS
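The error rates summarized in Tables IX to XI can be illustrated with a minimal Python sketch. It is only an assumption about how such rates might be computed, not the authors' code: the label strings, the dictionary layout of the responses, and the function names are all hypothetical.

```python
from statistics import mean


def error_rates(truth, answers):
    """Return (FN, FP) for one participant.

    truth and answers are equal-length lists of "real" / "synthetic" labels.
    FN: share of synthetic tweets judged real.
    FP: share of real tweets judged synthetic.
    Assumes the list contains at least one tweet of each kind.
    """
    synth = [a for t, a in zip(truth, answers) if t == "synthetic"]
    real = [a for t, a in zip(truth, answers) if t == "real"]
    fn = synth.count("real") / len(synth)
    fp = real.count("synthetic") / len(real)
    return fn, fp


def feature_averages(truth, responses, selected_by):
    """Average FN and FP over only the participants who selected a feature."""
    rates = [error_rates(truth, responses[p]) for p in selected_by]
    return mean(r[0] for r in rates), mean(r[1] for r in rates)
```

Passing every participant as `selected_by` gives overall averages of the kind reported per table, while passing only the participants who named a given feature in the survey gives the feature-specific FN and FP columns.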
real or synthetic. We analyzed the survey results to determine not only which features tend to be used more or less frequently but also whether a given feature is beneficial or detrimental to accurate classification. Specifically, in Table X (Table XI), the second column (Number of Participants) reports the number of participants who said that the feature is indicative of real (synthetic) tweets, while the last two columns report the false negative rate (FN) and the false positive rate (FP), averaged only across those individuals who selected the feature. For example, 20 participants stated that they considered tweet readability when making their decisions; on average, these 20 participants misclassified 52.9% of the synthetic tweets as real and 36.7% of the real tweets as synthetic; at the same time, in Table XI, three participants stated that tweet readability is, in fact, indicative of synthetic tweets. In some cases, these averages (outlined in bold in Tables X and XI) are lower than the averages over all participants (52.7% and 38.4%, as reported in Table IX).
For the purposes of detecting synthetic tweets (i.e., reducing the false negative rate), several features appear to be useful (false negative error rate under 40%): similarity of the tweet to other tweets in the data set, the presence of common idioms/phrases, short tweet length, and descriptions of past experiences (e.g., of what the user did). However, because few participants used these features in a beneficial way (e.g., only three considered similarity of the tweet to other tweets in the data set to be indicative of real tweets), and because the benefit was limited (false negative error rates were still over 30% in all cases), we cannot claim that the difference in performance is statistically significant. Thus, we cannot conclusively show that any feature stands out very strongly as a clear indicator of whether a tweet is real or synthetic. Even if we use all these features to build a classifier, the overall predictive power is uncertain and may be small.
As a complement to the above-mentioned analysis, we considered a feature that was not present in the survey, namely, the sentiment of a tweet (positive versus negative). We manually classified the tweets according to sentiment, with 33 tweets labeled as negative and the remaining 67 labeled as positive. We further determined that 16 of the negative tweets were real, while 17 were synthetic; in other words, sentiment distributions were very similar among real and synthetic tweets. Furthermore, we sorted the tweets by classification error (see Fig. 10) and split them into two halves. We then found that 17 of the negative tweets were in the lower half, while 16 were in the upper half; thus, positive and negative tweets presented approximately the same level of difficulty for the human participants. From these results, we can conclude that tweet sentiment is not a reliable feature for distinguishing between real and synthetic tweets.
Finally, we performed an analysis to determine whether experience with Twitter had an effect on the participants’ performance; results are shown in Table XII. Although the participants who use Twitter perform somewhat better at
TABLE XII
TWITTER EXPERIENCE AND ITS EFFECT ON CLASSIFICATION
identifying synthetic tweets than the participants who do not use it at all, the difference is not highly significant (p = 0.10 for detecting synthetic tweets and p = 0.19 for detecting real tweets, under a two-sample, two-tail t-test assuming unequal variances). Even nonusers of Twitter are often exposed to tweets, e.g., in news articles, which may explain why Twitter users did not have a significant advantage when performing the experiment.
V. CONCLUSION
This paper presented the SHIELD system, which generates synthetic social media data, including rich text and graph structures and dynamics. SHIELD jointly generates time-varying, directed and weighted interaction graph structures and topic-driven text features similar to the input social media data. Various properties, such as anomalies, temporal dynamics, and information cascades, are added synthetically to match those in real data. Computational results and a data evaluation exercise with human participants show that SHIELD generates high-fidelity synthetic data (statistically similar to real data and difficult for humans to distinguish from real data). The synthetic data generated by SHIELD does not duplicate real data and is suitable for evaluating new social media analytics while avoiding the violation of social media users’ privacy.
ACKNOWLEDGMENT
The authors would like to thank the Defense Advanced Research Projects Agency for their support. They would also like to thank Prof. B. Zhao from the University of Chicago for his guidance on graph models and initial code for graph fitting. Distribution Statement A (Approved for Public Release, Distribution Unlimited).
REFERENCES
[1] Number of Monthly Active Twitter Users. Accessed: Jul. 18, 2018. [Online]. Available: https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users
[2] J. Jiang et al., “Understanding latent interactions in online social networks,” ACM Trans. Web, vol. 7, no. 4, pp. 1–39, 2013.
[3] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: Densification laws, shrinking diameters and possible explanations,” in Proc. ACM SIGKDD, 2005, pp. 177–187.
[4] A. Vazquez, “Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 67, p. 056104, Jun. 2003.
[5] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[6] H. Inaltekin, M. Chiang, and H. V. Poor, “Delay of social search on small-world graphs,” J. Math. Sociol., vol. 38, no. 1, pp. 1–46, 2014.
[7] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos, “Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication,” in Proc. PKDD, Porto, Portugal, 2005, pp. 133–145.
[8] L. Akoglu, M. McGlohon, and C. Faloutsos, “RTM: Laws and a recursive generator for weighted time-evolving graphs,” in Proc. 8th IEEE Int. Conf. Data Mining, Dec. 2008, pp. 701–706.
[9] L. Akoglu and C. Faloutsos, “RTG: A recursive realistic graph generator using random typing,” in Proc. PKDD, 2009, pp. 13–28.
[10] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat, “Systematic topology analysis and generation using degree correlations,” in Proc. SIGCOMM, 2006, pp. 135–146.
[11] M. Gjoka, M. Kurant, and A. Markopoulou, “2.5K-graphs: From sampling to generation,” in Proc. IEEE INFOCOM, Turin, Italy, Apr. 2013, pp. 1968–1976.
[12] D. F. Nettleton, “A synthetic data generator for online social network graphs,” Soc. Netw. Anal. Mining, vol. 6, p. 44, Dec. 2016.
[13] P. J. Lin et al., “Development of a synthetic data set generator for building and testing information discovery systems,” in Proc. IEEE ITNG, Las Vegas, NV, USA, Apr. 2006, pp. 707–712.
[14] A. Sala, L. Cao, C. Wilson, R. Zablit, H. Zheng, and B. Y. Zhao, “Measurement-calibrated graph models for social network experiments,” in Proc. WWW, Raleigh, NC, USA, 2010, pp. 861–870.
[15] E. Reiter and R. Dale, Building Natural Language Generation Systems. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[16] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[17] V. Dumoulin et al. (2016). “Adversarially learned inference.” [Online]. Available: https://arxiv.org/abs/1606.00704
[18] W. Liu, P.-Y. Chen, H. Cooper, M. H. Oh, S. Yeung, and T. Suzumura. (2017). “Can GAN learn topological features of a graph?” [Online]. Available: https://arxiv.org/abs/1707.06197
[19] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville. (2017). “Adversarial generation of natural language.” [Online]. Available: https://arxiv.org/abs/1705.10929
[20] S. M. Rodrigo and J. G. F. Abraham, “Development and implementation of a chat bot in a social network,” in Proc. IEEE ITNG, Apr. 2012, pp. 751–755.
[21] G. Pilato, A. Augello, G. Vassallo, and S. Gaglio, “EHeBby: An evocative humorist chat-bot,” J. Mobile Inf. Syst., vol. 4, no. 3, pp. 165–181, 2008.
[22] A. Santangelo, A. Augello, A. Gentile, G. Pilato, and S. Gaglio, “A chat-bot based multimodal virtual guide for cultural heritage tours,” in Proc. PSC, 2006, pp. 114–120.
[23] N. Ali, M. Hindi, and R. V. Yampolskiy, “Evaluation of authorship attribution software on a chat bot corpus,” in Proc. IEEE ICAT, Oct. 2011, pp. 1–6.
[24] T. Holtgraves and T.-L. Han, “A procedure for studying online conversational processing using a chat bot,” Behav. Res. Methods, vol. 39, no. 1, pp. 156–163, 2007.
[25] C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao, “User interactions in social networks and their implications,” in Proc. EuroSys, 2009, pp. 205–218.
[26] Z. Yang, J. Xue, C. Wilson, B. Y. Zhao, and Y. Dai, “Uncovering user interaction dynamics in online social networks,” in Proc. ICWSM, 2015, pp. 698–701.
[27] G. Wang et al., “Social Turing tests: Crowdsourcing sybil detection,” in Proc. NDSS, 2012.
[28] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. SIGKDD, 2003, pp. 137–146.
[29] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in Proc. WWW, 2010, pp. 591–600.
[30] Twitter API Rate Limits. Accessed: Jul. 18, 2018. [Online]. Available: https://dev.twitter.com/rest/public/rate-limits
[31] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, nos. 4–5, pp. 993–1022, 2003.
[32] S. C. Deerwester et al., “Computer information retrieval using latent semantic structure,” U.S. Patent 4 839 853 A, Jun. 13, 1989.
[33] T. Hofmann, “Probabilistic latent semantic analysis,” in Proc. Conf. Uncertainty Artif. Intell., Jul. 1999, pp. 289–296.
[34] D. Jurafsky and J. H. Martin, Speech and Language Processing. Englewood Cliffs, NJ, USA: Prentice Hall, 2000.
[35] Twitter API. Accessed: Jul. 18, 2018. [Online]. Available: https://dev.twitter.com/rest/public
[36] Random Function. Accessed: Jul. 18, 2018. [Online]. Available: https://docs.python.org/3.1/library/random.html
[37] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” Stanford Univ., Stanford, CA, USA, Tech. Rep. CS224N, vol. 1, p. 12, 2009.
Yalin E. Sagduyu (S’02–M’08–SM’15) received the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, MD, USA.
He is currently the Associate Director of Networks and Security at Intelligent Automation, Inc., Rockville, MD, USA. His current research interests include machine learning, graph analytics, networks, and communications.
Dr. Sagduyu chaired workshops at the ACM MobiCom, the IEEE CNS, and the IEEE ICNP, served as the Track Chair at the IEEE PIMRC 2014, and served in the Organizing Committee of the IEEE GLOBECOM 2016.
Yi Shi (S’02–M’08–SM’13) received the Ph.D. degree from Virginia Tech, Blacksburg, VA, USA.
He is currently a Senior Research Scientist at Intelligent Automation, Inc., Rockville, MD, USA. His current research interests include algorithm design, optimization, machine learning, communication networks, and social networks.
Dr. Shi was a recipient of the IEEE INFOCOM 2008 Best Paper Award and the IEEE INFOCOM 2011 Best Paper Award Runner-Up. He has been a TPC Chair for IEEE and ACM workshops. He has been an Editor of the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS.