Introduction
In section 1.1, we introduce the problem of link prediction for social networks. In section
1.2, we introduce the word embedding model word2vec used to perform NLP tasks on
text documents. In section 1.3, we summarize our work using word2vec as a basis for
social network link prediction.
1.1 The Link Prediction Problem
First formalized by Liben-Nowell and Kleinberg, the link prediction problem asks if it is
possible to predict interactions between users in the near future, given a snapshot of a
social network (4). This problem is of interest to online social networking sites such as
Facebook, LinkedIn, and Twitter, which seek to recommend new social connections to
their users. Many approaches have been taken to devise appropriate algorithms for this
purpose that are both accurate and scalable (3, 9).
While these algorithms can depend on sophisticated sources of data external to the
network itself, such as GPS information (6), we seek to find a simpler way to predict social
connections. Following in the footsteps of Liben-Nowell and Kleinberg, we suggest that
good algorithms for link prediction can be constructed using network topology alone (4).
1.2 Word Embeddings
We turn to another area of computer science to develop a novel method of link prediction:
natural language processing (NLP). NLP is concerned with enabling computers to extract
meaning from input in the form of human language. Word embeddings are a type of model
used in NLP to map words to vectors in such a way that similar words have similar vector
representations. Compared to other NLP methods, they have the advantage that they
do not require expensive annotation. We note that this advantage of extracting meaning
from text documents without added annotation is analogous to our goal of predicting
links from a social network graph without external information.
One group of models used to produce word embeddings is word2vec. word2vec's Skip-gram and continuous-bag-of-words (CBOW) models are shallow neural networks that take in a large body of text and learn vector representations of words (5, 7). They preserve
semantic and syntactic patterns by mapping words with similar contexts to vectors located
close together in the vector space. The Skip-gram model attempts to use the current word
to learn surrounding context words. The CBOW model attempts to learn the current word
using surrounding context words. The CBOW model sacrifices semantic and syntactic
accuracy of words in order to compute distributed representations more efficiently. As we
wanted to perform a variety of tests on large data sets, we opted to use the CBOW model
for our experiments so we could benefit from its increased efficiency.
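The difference between the two models can be made concrete by looking at the training pairs each one generates; the sketch below is our own illustration of the (context, target) structure, not word2vec's internals:

```python
def training_pairs(tokens, window=2):
    """Generate (context, target) pairs as CBOW sees them, and
    (target, context_word) pairs as Skip-gram sees them."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((tuple(context), target))   # context predicts the word
        for c in context:
            skipgram.append((target, c))        # the word predicts each context word
    return cbow, skipgram

cbow, sg = training_pairs(["the", "dog", "barks"], window=1)
# CBOW pairs: (('dog',), 'the'), (('the', 'barks'), 'dog'), (('dog',), 'barks')
```

CBOW averages a whole context into one prediction per word, which is why it trains faster; Skip-gram makes one prediction per (word, context word) pair.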
1.3 Our Contributions
Related Work
Liben-Nowell and Kleinberg first formalized the problem of link prediction for social networks and suggested that information about future user interactions could be extracted
from graph topology alone (4). A number of approaches to link prediction have since
emerged. Al Hasan and Zaki survey techniques including feature-based classification,
kernel-based methods, matrix factorization, and probabilistic graphical models (3). Our
work bears the greatest resemblance to techniques that use similarity scores to predict
links, such as those described by Srilatha and Manjula (9).
Other recent work has looked at adapting word2vec for new purposes. Barkan and
Koenigstein developed item2vec for the purpose of item-based collaborative filtering (1).
Xue, Fu, and Shaobin developed a clustering model based on word2vec focusing on user
tags on the Chinese microblogging website Sina Weibo (10).
word2vec's CBOW model takes in a document as input and attempts to predict a word
given its context. The context of a word is defined as the set of words within the sliding
window of a given size surrounding the word. To allow the word2vec model to learn
vector representations of nodes in a social network graph, we need to define an analogous
context for a node in a graph.
We propose the following method to find the context of a node in a social network
graph:
1. Take a random walk with random restarts on the social network graph, and write
these nodes in order to a file.
2. Now the context of a node is defined within the file the same way it is for a word
within a document: a sliding window surrounding each node. A nice property that
emerges is that nodes that are fewer hops away in the random walk will remain in
the sliding window for more iterations.
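Step 1 above might be sketched as follows; the adjacency-list format, the restart-to-the-start-node behavior, and the walk length are our assumptions for illustration:

```python
import random

def random_walk_document(adj, start, restart_prob=0.005, length=1000, rng=random):
    """Walk the directed graph, restarting at `start` with probability
    `restart_prob` (or when stuck at a node with no out-edges), and
    return the visited nodes in order -- one 'document' of node 'words'."""
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = adj.get(current, [])
        if not neighbors or rng.random() < restart_prob:
            current = start                 # random restart
        else:
            current = rng.choice(neighbors)
        walk.append(current)
    return walk

# Toy directed graph as an adjacency list
adj = {"a": ["b"], "b": ["c"], "c": ["a"]}
doc = random_walk_document(adj, "a", restart_prob=0.0, length=6)
# → ['a', 'b', 'c', 'a', 'b', 'c']
```

Writing many such walks to a file yields a corpus of "documents" that word2vec can consume unmodified.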
[Figure: a toy directed graph whose nodes are the words Gabbi, Tiffany, walks, the, dog, and ferret.]
The following sentence is generated by a random walk along this graph. Note that because of the sliding window aspect of word2vec, it does not matter if we generate sensical sentences in their entirety, as long as the words within a given sliding window are sensical.

Gabbi walks the dog

One context for the word "walks" with a window size of 4 is "Gabbi", "the", "dog".
Note that there are many possible contexts that could be generated in this graph, similar to how there are many contexts in which the word "walks" could appear in a document. Thus a context for a given node g in a directed graph can be any set of nodes that includes g in a random walk along the graph.
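The sliding-window context applies identically to a node sequence and to a sentence; a minimal sketch, where reading "window size" as the number of positions taken on each side is our own interpretation:

```python
def context(tokens, index, window):
    """Return the words within `window` positions of tokens[index],
    excluding the word itself -- its sliding-window context."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

sentence = ["Gabbi", "walks", "the", "dog"]
context(sentence, 1, 4)  # → ['Gabbi', 'the', 'dog']
```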
Experiments
We run experiments on two real world data sets: a directed Twitter network, and a Yelp
data set.
4.1 Data Sets
Twitter
The majority of our experiments are conducted on Twitter data from Stanford University's SNAP collection¹. For this data set, we consider the problem of link prediction for directed
graphs, which is not as well-studied as link prediction for undirected graphs (8).
Twitter is a directed network where users have followers and other users they are
following. Each node in the Twitter graph is a user. There is an edge to a user from
each one of their followers, and from a user to each user they are following. Edges are
unweighted. The data set provided by SNAP is structured as a list of edges in the graph,
which we preprocess to put in the form of an adjacency list. There are 1,768,149 edges in
the Twitter graph and 81,306 nodes.
Yelp
We ran additional experiments on data provided by the Yelp Dataset Challenge². For this data set, we cluster businesses based on patterns in how users leave tips³ on Yelp.
4.2 Setup
Below we describe the setup for our experiments on link prediction, as well as the setup
for our additional experiments with clustering on the Yelp data set.
1. See https://snap.stanford.edu/data/#socnets.
2. See https://www.yelp.com/dataset_challenge/dataset.
3. Note that a tip on Yelp refers to users writing a few sentences of inside information for other users to read, not to be confused with a monetary sum given to restaurant staff.
Link Prediction
To test link prediction, we sparsify the social network graph by discarding at random some
percentage of the edges. Then we translate the sparsified graphs into input for word2vec,
as described in Section 3. We use the gensim library to generate word2vec models. The
parameters for the word2vec models were chosen from initial experiments on our test
data. A description of these parameters and our justification for choosing them can be
found in Appendix A.
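The sparsification step can be sketched as follows; the edge-list representation matches the SNAP format described above, while the function itself is our illustration:

```python
import random

def sparsify(edges, keep_fraction, rng=random):
    """Keep each directed edge independently with probability
    `keep_fraction`, discarding the rest at random."""
    return [e for e in edges if rng.random() < keep_fraction]

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
sparse = sparsify(edges, 0.7, random.Random(0))
```

The sparsified edge list is then turned into an adjacency list and walked as in Section 3.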
We test to see if the word2vec similarity metric is a good link predictor as follows: We consider each node g and its corresponding vector representation v. Let v1, ..., v10 be the 10 vectors most similar to v, with corresponding nodes g1, ..., g10. Let a → b denote an edge from a to b. Then we consider each of the 10 edges g → g1, g → g2, ..., g → g10 to be a predicted edge. We call a predicted edge correct if the edge is in the full graph.
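This procedure can be sketched independently of the embedding itself; here the `most_similar` mapping stands in for the 10 nearest neighbors a trained word2vec model (e.g. gensim's `most_similar`) would return:

```python
def prediction_accuracy(most_similar, full_edges, k=10):
    """For each node g, predict the edges g -> g1, ..., g -> gk to its k
    most similar nodes, and return the fraction present in the full graph."""
    predicted = [(g, h) for g, sims in most_similar.items() for h in sims[:k]]
    correct = sum(1 for e in predicted if e in full_edges)
    return correct / len(predicted)

# Toy example: 2 of the 4 predicted edges exist in the full graph
most_similar = {"a": ["b", "c"], "b": ["a", "d"]}
full_edges = {("a", "b"), ("b", "d"), ("c", "a")}
prediction_accuracy(most_similar, full_edges, k=2)  # → 0.5
```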
We evaluate our results by comparing against a random baseline⁴. Our random baseline model iteratively chooses two nodes in the sparse graph, and adds the edge from the first node to the second node to its set of predicted edges. It marks the edge correct if it exists in the full graph. It continues until it has predicted the same number of edges as our word2vec model.
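The random baseline can be sketched in the same style; the uniform node choices follow the description above:

```python
import random

def random_baseline_accuracy(nodes, full_edges, n_predictions, rng=random):
    """Predict `n_predictions` edges by repeatedly picking two nodes at
    random, and return the fraction that exist in the full graph."""
    correct = 0
    for _ in range(n_predictions):
        a, b = rng.choice(nodes), rng.choice(nodes)
        if (a, b) in full_edges:
            correct += 1
    return correct / n_predictions
```

On a graph as sparse as the Twitter network, a uniformly random pair is very unlikely to be a real edge, which is what makes this a weak but informative baseline.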
Clustering
To test clustering, we preprocess the Yelp data set to form a graph of businesses. Businesses are connected by weighted edges denoting the number of users who had tipped (written a short review of) both businesses. We translate this graph into input for
4. Ideally, we would also compare against other link prediction models. We attempted to carry this out using linkpred, an open-source library for link prediction, but ran into some problems (i.e., computers crashing multiple times), so we had to relegate such comparisons to future work.
word2vec, as described in Section 3, and use the gensim library to generate models.
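Building the weighted business graph from tips might look like the following; the `tips_by_user` mapping is an assumed preprocessing step, not part of the Yelp data format:

```python
from collections import Counter
from itertools import combinations

def cotip_graph(tips_by_user):
    """Weight each business pair by the number of users who tipped both."""
    weights = Counter()
    for businesses in tips_by_user.values():
        for a, b in combinations(sorted(set(businesses)), 2):
            weights[(a, b)] += 1
    return weights

tips_by_user = {"u1": ["cafe", "bar"], "u2": ["cafe", "bar", "gym"]}
cotip_graph(tips_by_user)[("bar", "cafe")]  # → 2
```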
We hoped to discover ground-truth communities by clustering over our vector representations of businesses. Unfortunately, we were unsuccessful: our results for experiments
with the Yelp data set are hence relegated to Appendix B.
4.3 Results
We varied three sets of parameters for link prediction on the Twitter data set:
1. The percentage of edges kept during sparsification is chosen to be 70%, 80%, or
90%.
2. The probability of a random restart when generating algorithm input is chosen to
be .01, .005, or .001.
3. The number of documents created when generating algorithm input is chosen to be 100,000, 500,000, or 1,000,000.
Our link prediction results are promising. The best performance came from the 70%
sparse graph with 1,000,000 documents, which achieved approximately 10% accuracy in
predicting links that later appear in the full graph. Figure 1 below shows a comparison of
the link prediction accuracy for the various sparsified graphs for all numbers of documents
and restart probabilities.
[Figure 1: link prediction accuracy for the sparsified graphs, for all numbers of documents and restart probabilities.]
The results in Figure 1 are interesting in that the probability of restarting a random
walk at every given node, which corresponds to the length of the documents generated, is
less important for prediction accuracy than the actual number of documents generated.
For the 70% sparsified graph, the prediction accuracy increases 5-fold from 100,000 to 1,000,000 documents. This implies that longer documents are not as useful to our algorithm as sheer volume of documents alone. We hypothesize that this may be because longer random walks on directed graphs are more likely to get stuck in cycles, producing information that is repeated and therefore less valuable. However, this hypothesis
warrants more exploration.
Figure 1 shows that the 70% sparsified graph outperforms the 80% and 90% sparsified
graphs in every scenario tested. We hypothesize that this result occurs simply because in
the 70% sparsified graph, there are more opportunities to be correct. That is, for each
node of each graph, we add 10 edges to our set of predicted edges based on the top 10
most similar nodes outputted by person2vec. In the 70% graph, it is more likely that at
least 1 of these 10 edges is correct. This hypothesis is supported by the accuracy of the
random baseline in sparser graphs: the sparser the graph, the better the random baseline
performed. Table 1 summarizes the average random baseline performance against the
average person2vec performance over all parameters for each sparse graph. It is clear that
as sparsity increases, the accuracy of the random predictor increases as well, supporting
our hypothesis.
Table 2 below summarizes the accuracy of our person2vec predictions for the 70% sparsified graph versus the random baseline. We opted to include these results in table form as our predictive results were orders of magnitude better than the random baseline. At its best, person2vec has an approximately 10% chance of predicting a
link that later appears, which significantly outperforms the random baseline, but warrants
comparison with other methods of link prediction on directed graphs (see footnote 4 and
Section 6).
It is also worth noting that the generation of both the text files from random walks on the Twitter graph and the word2vec models was reasonably efficient. It took on the order of tens of minutes to transform all of the social network graphs into word2vec input, and
References
1. Oren Barkan and Noam Koenigstein. Item2vec: Neural item embedding for collaborative filtering. CoRR, abs/1603.04259, 2016.
2. Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-Based Clustering Based on Hierarchical Density Estimates, pages 160-172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
3. Mohammad Al Hasan and Mohammed J. Zaki. A survey of link prediction in social networks. pages 243-275, 2011.
4. David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, pages 556-559, New York, NY, USA, 2003. ACM.
5. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR,
abs/1310.4546, 2013.
6. Huy Pham, Cyrus Shahabi, and Yan Liu. EBM: An entropy-based model to infer social strength from spatiotemporal data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 265-276, New York, NY, USA, 2013. ACM.
7. Xin Rong. word2vec parameter learning explained. CoRR, abs/1411.2738, 2014.
8. Daniel Schall. Link prediction in directed social networks. Social Network Analysis
and Mining, 4(1):157, 2014.
9. Pulipati Srilatha and Ramakrishnan Manjula. Similarity index based link prediction algorithms in social networks: A survey. Journal of Telecommunications and Information Technology, (2):87-94, 2016.
10. Bai Xue, Chen Fu, and Zhan Shaobin. A new clustering model based on word2vec mining on Sina Weibo users' tags. International Journal of Grid Distribution Computing, 7(3), 2014.
Appendix A: word2vec Parameters
word2vec takes in a series of parameters that can significantly alter the output. The
parameters that we set in our experiments for the CBOW word2vec model on the Yelp
and Twitter data sets were as follows:
size: Size represents the dimensionality of the vectors outputted. We set size to be
100 for each vector. We felt that this was an appropriate size given the number of
nodes in our data sets, and it is also a standard size to use for word2vec vectors on
smaller data sets.
min-count: This parameter was slightly different from the standard word2vec min-count parameter, which is usually set to be 5-15. Min-count denotes the minimum number of times a word must appear in order for us to train on it. We set our min-count to be 100, since this seemed to significantly reduce the amount of noise in our initial experiments with the data set. This parameter is one that we plan on tuning and exploring more in future work.
window: Window represents the size of the sliding window over the text, from which context is generated. Our window size was set to 5, which is a standard setting for word2vec and also seemed appropriate for denoting the context of both businesses (on Yelp) and people (on Twitter).
Appendix B: Yelp
For the Yelp data set, we learned vector representations of businesses as described in
Section 4. We started by examining the vector representation output for businesses, and
had a feeling our results were doomed when "Advanced Auto Parts" was considered to be most similar to "Beer Nutz Bottle Shop" and "Crazy Mocha Coffee". We were not completely discouraged just yet, as we had hope that clusters over the vector representations of businesses would correspond to latitude and longitude locations rather than
completely discouraged just yet, as we had hope that clustering over the vector representations of businesses would correspond to latitude and longitude locations rather than
type of product. This would be intuitive, as customers who visit two businesses and post
tips on Yelp about them are likely to live in the area.
Our hopes were unfortunately crushed. We initially tried clustering using a Gaussian Mixture Model, but tuning the parameters was time-consuming and yielded no promising results: the clusters did not correlate with any known ground-truth communities such as
as location, business type, or star rating. We then tried clustering using hdbscan, an
automated clustering algorithm with an open-source Python implementation (2). hdbscan
is a high-performance clustering library that requires little to no parameter tuning. Our
results using hdbscan also showed no interesting correlations.
In an attempt to understand why our clusters did not correspond to latitude and
longitude, we visualized our vector representations of businesses by using Principal Component Analysis (PCA) to reduce the vectors to two dimensions and plot the results. We
offer the plot of the latitude and longitude of each business in Figure 2 and the plot of the PCA-reduced vectors in Figure 3. While distinct clusters are apparent in the latitude-longitude plot, the PCA-reduced vector representations do not seem to separate out into discrete clusters as we would hope.
We conclude that the vector representations of businesses constructed using common
reviews were not useful for finding ground-truth communities, perhaps due to the heterogeneity of user business patronage and exceptional (either good or bad) reviews being
overrepresented in our dataset. In future work we would enjoy exploring a different means of generating edges between businesses to see if we could produce more meaningful results. These include (1) generating similarities in reviewers instead of businesses, using businesses visited in common as edge weight, and (2) weighting using star rating so that a 5-star reviewed restaurant and a 1-star reviewed restaurant from the same user should be negatively correlated.
Figure 2: Latitude and Longitude Business Plot