
person2vec: Distributed Representations of People in Vector Space for Link Prediction


Lydia Goldberg, Gabriela Merz, Tiffany Wu
December 9, 2016

Word embeddings are used in natural language processing to map words to vectors in such a way that similar words have similar vector representations. Since computing vector similarity is a well-studied problem, these embeddings make it possible to answer questions about relationships between words in a document. We consider whether it is possible to adapt this model to answer questions about the relationships between people in a social network: that is, can we map people in a social network to vectors in such a way that similar people have similar vector representations? We present person2vec, an algorithm for representing people in vector space based on the word embedding model word2vec. person2vec manipulates social network graph data to fit within the word2vec framework and outputs vector representations of each node in the social network graph, allowing for link prediction through the word2vec similarity score. Our method significantly outperforms a random baseline for link prediction.

1 Introduction

In Section 1.1, we introduce the problem of link prediction for social networks. In Section 1.2, we introduce word2vec, a word embedding model used to perform NLP tasks on text documents. In Section 1.3, we summarize our work using word2vec as a basis for social network link prediction.

1.1 Link Prediction for Social Networks

First formalized by Liben-Nowell and Kleinberg, the link prediction problem asks if it is
possible to predict interactions between users in the near future, given a snapshot of a
social network (4). This problem is of interest to online social networking sites such as
Facebook, LinkedIn, and Twitter, which seek to recommend new social connections to
their users. Many approaches have been taken to devise appropriate algorithms for this
purpose that are both accurate and scalable (3, 9).
While these algorithms can depend on sophisticated sources of data external to the
network itself, such as GPS information (6), we seek to find a simpler way to predict social
connections. Following in the footsteps of Liben-Nowell and Kleinberg, we suggest that
good algorithms for link prediction can be constructed using network topology alone (4).

1.2 Word Embeddings and word2vec

We turn to another area of computer science to develop a novel method of link prediction:
natural language processing (NLP). NLP is concerned with enabling computers to extract
meaning from input in the form of human language. Word embeddings are a type of model
used in NLP to map words to vectors in such a way that similar words have similar vector
representations. Compared to other NLP methods, they have the advantage that they
do not require expensive annotation. We note that this advantage of extracting meaning
from text documents without added annotation is analogous to our goal of predicting
links from a social network graph without external information.
One group of models used to produce word embeddings is word2vec. word2vec's Skip-gram and continuous bag-of-words (CBOW) models are shallow neural networks that take in a large body of text and learn vector representations of words (5, 7). They preserve
semantic and syntactic patterns by mapping words with similar contexts to vectors located
close together in the vector space. The Skip-gram model attempts to use the current word
to learn surrounding context words. The CBOW model attempts to learn the current word
using surrounding context words. The CBOW model sacrifices semantic and syntactic
accuracy of words in order to compute distributed representations more efficiently. As we
wanted to perform a variety of tests on large data sets, we opted to use the CBOW model
for our experiments so we could benefit from its increased efficiency.

1.3 Our Contributions

In this paper, we introduce person2vec, a model-based approach for link prediction in social networks based on word2vec. In Section 2, we describe related work. In Section 3, we explain how we transform social network graphs to be compatible with the word2vec model. In Section 4, we describe our experimental setup for testing person2vec on real
social network data. In Section 5, we discuss the results of our experiments, which show
that person2vec significantly outperforms a random baseline. In Section 6, we discuss
our conclusions and future work.

2 Related Work

Liben-Nowell and Kleinberg first formalized the problem of link prediction for social networks and suggested that information about future user interactions could be extracted
from graph topology alone (4). A number of approaches to link prediction have since
emerged. Al Hasan and Zaki survey techniques including feature-based classification,
kernel-based methods, matrix factorization, and probabilistic graphical models (3). Our
work bears the greatest resemblance to techniques that use similarity scores to predict
links, such as those described by Srilatha and Manjula (9).
Other recent work has looked at adapting word2vec for new purposes. Barkan and
Koenigstein developed item2vec for the purpose of item-based collaborative filtering (1).
Xue, Fu, and Shaobin developed a clustering model based on word2vec focusing on user
tags on the Chinese microblogging website Sina Weibo (10).

3 Transforming Social Network Graphs

word2vec's CBOW model takes in a document as input and attempts to predict a word given its context. The context of a word is defined as the set of words within a sliding window of a given size surrounding the word. To allow the word2vec model to learn
vector representations of nodes in a social network graph, we need to define an analogous
context for a node in a graph.
We propose the following method to find the context of a node in a social network
graph:
1. Take a random walk with random restarts on the social network graph, and write the visited nodes, in order, to a file.
2. The context of a node is then defined within the file the same way it is for a word within a document: a sliding window surrounding each node. A nice property that emerges is that nodes that are fewer hops apart in the random walk co-occur in the sliding window more often. (A sketch of this procedure is given below.)
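
As a rough sketch of this procedure (our reading of it, not the authors' released code; treating a restart as ending the current document, and a dead end as ending the walk, are assumptions on our part), a walk generator could look like the following, with node identifiers emitted as strings so that word2vec can treat them as words:

```python
import random

def generate_walk_documents(adj, num_docs, restart_prob, seed=0):
    """Produce one 'document' (a list of node-id strings) per random walk.

    adj: adjacency list, mapping each node id to a list of its out-neighbours.
    A walk starts at a uniformly random node and, at each step, either stops
    (a 'restart', with probability restart_prob) or follows a random out-edge;
    expected document length is roughly 1 / restart_prob.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    documents = []
    for _ in range(num_docs):
        current = rng.choice(nodes)
        walk = [str(current)]                     # node ids become the "words"
        while adj[current] and rng.random() > restart_prob:
            current = rng.choice(adj[current])    # follow a random out-edge
            walk.append(str(current))
        documents.append(walk)
    return documents
```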

We will provide motivation for this definition by considering a model of language in which every word is a node in a graph, and there is an edge between pairs of words that can be joined to form a valid phrase. Imagine the following language, in which there are only 7 words:
[Figure: a graph whose nodes are the 7 words Lydia, Gabbi, Tiffany, walks, the, dog, and ferret; edges connect pairs of words that can be joined to form a valid phrase.]

The following sentence is generated by a random walk along this graph. Note that because of the sliding-window aspect of word2vec, it does not matter whether we generate sensible sentences in their entirety, as long as the words within a given sliding window make sense together.

    Gabbi walks the dog

One context for the word "walks" with a window size of 4 is "Gabbi", "the", "dog". Note that there are many possible contexts that could be generated in this graph, just as there are many contexts in which the word "walks" could appear in a document. Thus a context for a given node g in a directed graph can be any window of nodes surrounding g in a random walk along the graph.
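
To make the sliding-window notion concrete, here is a toy illustration (ours, using the common convention of keeping w words on either side of the centre word):

```python
sentence = ["Gabbi", "walks", "the", "dog"]
w = 2  # number of words kept on each side of the centre word

for i, word in enumerate(sentence):
    context = sentence[max(0, i - w):i] + sentence[i + 1:i + 1 + w]
    print(word, "->", context)
# e.g. walks -> ['Gabbi', 'the', 'dog'], the context given above
```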

4 Experiments

We run experiments on two real world data sets: a directed Twitter network, and a Yelp
data set.

4.1 Data Sets

Twitter
The majority of our experiments are conducted on Twitter data from Stanford University's SNAP collection [1]. For this data set, we consider the problem of link prediction for directed
graphs, which is not as well-studied as link prediction for undirected graphs (8).
Twitter is a directed network where users have followers and other users they are
following. Each node in the Twitter graph is a user. There is an edge to a user from
each one of their followers, and from a user to each user they are following. Edges are
unweighted. The data set provided by SNAP is structured as a list of edges in the graph,
which we preprocess to put in the form of an adjacency list. There are 1,768,149 edges in
the Twitter graph and 81,306 nodes.
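
A minimal version of this preprocessing step could look as follows (assuming one whitespace-separated "source target" pair per line, the typical layout of SNAP edge lists; the file name below is a placeholder):

```python
from collections import defaultdict

def load_adjacency_list(path):
    """Read an edge list ('source target' per line) into an adjacency list
    mapping each node id to the list of nodes it points to."""
    adj = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or line.startswith("#"):
                continue                      # skip blanks and comment lines
            source, target = parts[0], parts[1]
            adj[source].append(target)
            adj.setdefault(target, [])        # keep nodes with no out-edges
    return dict(adj)

# adj = load_adjacency_list("twitter_edges.txt")  # placeholder file name
```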
Yelp
We ran additional experiments on data provided by the Yelp Dataset Challenge [2]. For this data set, we cluster businesses based on patterns in how users leave tips [3] on Yelp.

[1] See https://snap.stanford.edu/data/#socnets.
[2] See https://www.yelp.com/dataset_challenge/dataset.
[3] Note that a "tip" on Yelp refers to users writing a few sentences of inside information for other users to read, not to be confused with a monetary sum given to restaurant staff.

4.2 Setup

Below we describe the setup for our experiments on link prediction, as well as the setup
for our additional experiments with clustering on the Yelp data set.

Link Prediction
To test link prediction, we sparsify the social network graph by discarding at random some
percentage of the edges. Then we translate the sparsified graphs into input for word2vec,
as described in Section 3. We use the gensim library to generate word2vec models. The
parameters for the word2vec models were chosen from initial experiments on our test
data. A description of these parameters and our justification for choosing them can be
found in Appendix A.
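
A sketch of the sparsification step is given below (uniform sampling without replacement is our assumption; the held-out edges are what a correct prediction must recover):

```python
import random

def sparsify(edges, keep_fraction, seed=0):
    """Keep roughly keep_fraction of the edges (e.g. 0.7, 0.8, or 0.9) and
    return both the sparsified edge list and the held-out edges."""
    rng = random.Random(seed)
    kept = rng.sample(edges, int(len(edges) * keep_fraction))
    held_out = set(edges) - set(kept)
    return kept, held_out
```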
We test to see if the word2vec similarity metric is a good link predictor as follows. We consider each node g and its corresponding vector representation v. Let v1, ..., v10 be the 10 vectors most similar to v, with corresponding nodes g1, ..., g10. Let a → b denote an edge from a to b. Then we consider each of the 10 edges g → g1, g → g2, ..., g → g10 to be a predicted edge. We call a predicted edge correct if the edge is in the full graph.
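
A sketch of this evaluation loop on top of a trained gensim model is shown below. The most_similar query is standard gensim functionality, though its exact attribute path has shifted across gensim versions; skipping nodes that are absent from the model's vocabulary is our addition.

```python
def score_link_predictions(model, nodes, full_edges, topn=10):
    """For every node g, predict edges to the topn nodes whose vectors are most
    similar to g's vector, and count how many exist in the full graph.

    model: trained gensim Word2Vec model whose vocabulary is node-id strings.
    full_edges: set of (source, target) pairs from the full (unsparsified) graph.
    """
    predicted = correct = 0
    for g in nodes:
        try:
            neighbours = model.wv.most_similar(str(g), topn=topn)
        except KeyError:                     # node pruned by min_count or never walked
            continue
        for g_i, _similarity in neighbours:
            predicted += 1
            if (str(g), g_i) in full_edges:  # predicted edge is in the full graph
                correct += 1
    return correct / predicted if predicted else 0.0
```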
We evaluate our results by comparing against a random baseline [4]. Our random baseline model iteratively chooses two nodes in the sparse graph and adds the edge from the first node to the second node to its set of predicted edges. It marks the edge correct if it exists in the full graph. It continues until it has predicted the same number of edges as our word2vec model.

[4] Ideally, we would also compare against other link prediction models. We attempted to carry this out using linkpred, an open-source library for link prediction, but ran into problems (our computers crashed multiple times), so we had to relegate such comparisons to future work.
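
The random baseline can be sketched in the same style (uniform sampling of ordered node pairs is our assumption):

```python
import random

def random_baseline(nodes, full_edges, num_predictions, seed=0):
    """Predict num_predictions edges by picking ordered node pairs uniformly at
    random from the node list, and report the fraction found in the full graph."""
    rng = random.Random(seed)
    correct = sum(
        (rng.choice(nodes), rng.choice(nodes)) in full_edges
        for _ in range(num_predictions)
    )
    return correct / num_predictions
```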
Clustering
To test clustering, we preprocess the Yelp data set to form a graph of businesses. Businesses are connected by weighted edges denoting the number of users who had tipped (written a short review of) both businesses. We translate this graph into input for word2vec, as described in Section 3, and use the gensim library to generate models.
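
A rough sketch of how such a business graph could be assembled from per-user tip histories (the input format below is our assumption; the Yelp data actually ships as JSON records containing user and business identifiers):

```python
from collections import Counter
from itertools import combinations

def build_business_graph(tips_by_user):
    """Build a weighted, undirected business graph from tip histories.

    tips_by_user: dict mapping a user id to the set of business ids they
    have tipped. Each edge weight counts the users who tipped both businesses.
    """
    weights = Counter()
    for businesses in tips_by_user.values():
        for b1, b2 in combinations(sorted(businesses), 2):
            weights[(b1, b2)] += 1
    return weights  # {(business_a, business_b): number of common tippers}
```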
We hoped to discover ground-truth communities by clustering over our vector representations of businesses. Unfortunately, we were unsuccessful: our results for experiments
with the Yelp data set are hence relegated to Appendix B.

4.3 Variation of Parameters for Link Prediction

We varied three sets of parameters for link prediction on the Twitter data set:
1. The percentage of edges kept during sparsification is chosen to be 70%, 80%, or
90%.
2. The probability of a random restart when generating algorithm input is chosen to
be .01, .005, or .001.
3. The number of documents created when generating algorithm input is chosen to be 100,000, 500,000, or 1,000,000.

5 Results and Discussion

Our link prediction results are promising. The best performance came from the 70%
sparse graph with 1,000,000 documents, which achieved approximately 10% accuracy in
predicting links that later appear in the full graph. Figure 1 below shows a comparison of
the link prediction accuracy for the various sparsified graphs for all numbers of documents
and restart probabilities.

Figure 1: Link prediction accuracy for the various sparsified graphs across all document counts and restart probabilities.

The results in Figure 1 are interesting in that the probability of restarting a random
walk at every given node, which corresponds to the length of the documents generated, is
less important for prediction accuracy than the actual number of documents generated.
For the 70% sparsified graph, the prediction accuracy increases 5-fold from 100,000 to
1,000,000 documents. This implies that longer documents are not as useful to our algorithm as sheer volume of documents alone. We hypothesize that this may be because
longer random walks on directed graphs are more likely to get stuck in cycles, producing information that is repeated and therefore less valuable. However, this hypothesis
warrants more exploration.
Figure 1 shows that the 70% sparsified graph outperforms the 80% and 90% sparsified
graphs in every scenario tested. We hypothesize that this result occurs simply because in
the 70% sparsified graph, there are more opportunities to be correct. That is, for each
node of each graph, we add 10 edges to our set of predicted edges based on the top 10
most similar nodes outputted by person2vec. In the 70% graph, it is more likely that at
least 1 of these 10 edges is correct. This hypothesis is supported by the accuracy of the
random baseline in sparser graphs: the sparser the graph, the better the random baseline
performed. Table 1 summarizes the average random baseline performance against the
average person2vec performance over all parameters for each sparse graph. It is clear that
as sparsity increases, the accuracy of the random predictor increases as well, supporting
our hypothesis.
Table 2 below summarizes the accuracy of our person2vec predictions for the 70%
sparsified graph versus the random baseline. We opted to include these results in table
form, as our predictive results were several orders of magnitude better than the random baseline. At its best, person2vec has an approximately 10% chance of predicting a
link that later appears, which significantly outperforms the random baseline, but warrants
comparison with other methods of link prediction on directed graphs (see footnote 4 and
Section 6).
It is also worth noting that the generation of both the text files from random walks on the Twitter graph and the word2vec models was reasonably efficient. It took on the order of tens of minutes to transform all of the social network graphs into word2vec input, and on the order of tens of minutes to generate all models. Once the models were generated, link predictions were returned almost instantaneously when queried. We leave a rigorous time analysis of model generation and link prediction for future work, and refer the reader to Mikolov et al. (5) for a time analysis of word2vec models on regular word documents.
Table 1: Comparison of Average Performance Across Parameters

    Sparsity      70%         80%         90%
    Random        0.01372%    0.00571%    0.0023%
    person2vec    7.2475%     4.7448%     2.3235%
Table 2: Sparse Graph (70%) Comparison of Link Prediction Accuracy

    Restart Probability   # of Docs    person2vec accuracy   random accuracy
    .01                   100,000      2.506862%             0.009149%
    .005                  100,000      2.796967%             0.025274%
    .001                  100,000      3.009188%             0.030628%
    .01                   500,000      8.752303%             0.009213%
    .005                  500,000      8.888202%             0.011130%
    .001                  500,000      8.986142%             0.008291%
    .01                   1,000,000    10.116963%            0.010985%
    .005                  1,000,000    10.054969%            0.009882%
    .001                  1,000,000    10.116348%            0.008950%

We observe that person2vec link predictions improve drastically with the number of
random walks per graph, but seem to have no correlation with the restart probability of
a random walk for each number of documents. This result corresponds with Mikolov et
al.'s findings regarding their distributed representations of words in vector space, namely
that training on more documents resulted in better vector representations of words (5).


6 Conclusions and Future Work

In this paper we introduced person2vec, an algorithm for representing people in vector space. We used random walks with random restarts to transform social network graphs
into input for the word embedding model word2vec. We generated models using sparsified
graphs, and made link predictions based on vector similarity. Our person2vec models
significantly outperformed a random baseline for all sparsified graphs.
Future work will involve comparing person2vec link prediction on directed graphs against other baselines, such as the methods implemented in linkpred, which we were unable to run due to insufficient computational resources for our graph of roughly 80,000 Twitter nodes.
Additionally, as model generation ended up being relatively fast, we also plan to test
our results using the alternative word2vec model, Skip-gram. Skip-gram was shown to
preserve semantic and syntactic accuracy of words more effectively than CBOW, and we
have high hopes that this would imply better vector representations of people and therefore
higher link prediction accuracy. While work in this paper focused on tuning parameters
regarding social network graph data, future work includes tuning the parameters given
to word2vec such as sliding window size and dimensionality of vector representations,
now that we better understand the relationship between the random walks and the vector
representations.
Overall, person2vec is a promising mechanism for link prediction that relies only
on graph topology to make its predictions. It is reasonably quick to implement, and
as new connections are made, person2vec vectors can easily be updated to reflect the
change. person2vec offers an efficient, reliable, and flexible method for link prediction
that significantly outperforms a random baseline.


References
1. Oren Barkan and Noam Koenigstein. Item2vec: Neural item embedding for collaborative filtering. CoRR, abs/1603.04259, 2016.
2. Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-Based Clustering Based on Hierarchical Density Estimates, pages 160-172. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
3. Mohammad Al Hasan and Mohammed J. Zaki. A survey of link prediction in social networks, pages 243-275, 2011.
4. David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, pages 556-559, New York, NY, USA, 2003. ACM.
5. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR,
abs/1310.4546, 2013.
6. Huy Pham, Cyrus Shahabi, and Yan Liu. EBM: An entropy-based model to infer social strength from spatiotemporal data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 265-276, New York, NY, USA, 2013. ACM.
7. Xin Rong. word2vec parameter learning explained. CoRR, abs/1411.2738, 2014.
8. Daniel Schall. Link prediction in directed social networks. Social Network Analysis
and Mining, 4(1):157, 2014.


9. Pulipati Srilatha and Ramakrishnan Manjula. Similarity index based link prediction algorithms in social networks: A survey. Journal of Telecommunications and Information Technology, (2):87-94, 2016.
10. Bai Xue, Chen Fu, and Zhan Shaobin. A new clustering model based on word2vec mining on Sina Weibo users' tags. International Journal of Grid Distribution Computing, 7(3), 2014.


Appendix

A word2vec Parameters

word2vec takes in a series of parameters that can significantly alter the output. The
parameters that we set in our experiments for the CBOW word2vec model on the Yelp
and Twitter data sets were as follows:
size: Size represents the dimensionality of the output vectors. We set size to 100 for each vector. We felt that this was an appropriate size given the number of nodes in our data sets, and it is also a standard size for word2vec vectors on smaller data sets.
min-count: Min-count denotes the minimum number of times a word must appear in order for us to train on it. We set our min-count to 100, which differs from the standard word2vec setting of 5-15, since this seemed to significantly reduce the amount of noise in our initial experiments with the data set. This parameter is one that we plan on tuning and exploring more in future work.
window: Window represents the size of the sliding window over the text, from which context is generated. Our window size was set to 5, which is both a standard setting for word2vec and a size that seemed appropriate for denoting the context of both businesses (on Yelp) and people (on Twitter).
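
Put together, a gensim call corresponding to these settings looks roughly like the sketch below (parameter names follow the gensim releases contemporary with this work; newer releases rename size to vector_size, and the output path is a placeholder):

```python
from gensim.models import Word2Vec

# documents: the random-walk "sentences" of node-id strings from Section 3
model = Word2Vec(
    documents,
    size=100,        # dimensionality of the learned vectors
    window=5,        # sliding-window size defining a node's context
    min_count=100,   # ignore nodes appearing fewer than 100 times in the walks
    sg=0,            # 0 selects the CBOW architecture
)
model.save("person2vec.model")  # placeholder output path
```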

B Yelp

For the Yelp data set, we learned vector representations of businesses as described in
Section 4. We started by examining the vector representation output for businesses, and
had a feeling our results were doomed when "Advanced Auto Parts" was considered to be most similar to "Beer Nutz Bottle Shop" and "Crazy Mocha Coffee". We were not
completely discouraged just yet, as we had hope that clustering over the vector representations of businesses would correspond to latitude and longitude locations rather than
type of product. This would be intuitive, as customers who visit two businesses and post
tips on Yelp about them are likely to live in the area.
Our hopes were unfortunately crushed. We initially tried clustering using a Gaussian
Mixture Model, but tuning the parameters was time-consuming and yielded no promising results: the clusters did not correlate with any known ground-truth communities such
as location, business type, or star rating. We then tried clustering using hdbscan, an
automated clustering algorithm with an open-source Python implementation (2). hdbscan
is a high-performance clustering library that requires little to no parameter tuning. Our
results using hdbscan also showed no interesting correlations.
In an attempt to understand why our clusters did not correspond to latitude and
longitude, we visualized our vector representations of businesses by using Principal Component Analysis (PCA) to reduce the vectors to two dimensions and plot the results. We
offer the plot of the latitude and longitude of each business in Figure 2 and the plot of the
PCA-reduced vectors in Figure 3. While distinct clusters are apparent in the latitude-longitude plot, the PCA-reduced vector representations do not seem to separate out into discrete clusters as we would hope.
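
For reference, the two analyses described above can be sketched as follows (assuming the scikit-learn and hdbscan packages; min_cluster_size is an illustrative value, not the setting we used):

```python
import hdbscan
from sklearn.decomposition import PCA

def cluster_and_project(vectors, min_cluster_size=15):
    """Cluster business vectors (a 2-D array, one row per business) with
    HDBSCAN and project them to two dimensions with PCA for plotting.
    A label of -1 marks points that HDBSCAN treats as noise."""
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(vectors)
    projected = PCA(n_components=2).fit_transform(vectors)
    return labels, projected
```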
We conclude that the vector representations of businesses constructed using common
reviews were not useful for finding ground-truth communities, perhaps due to the heterogeneity of user business patronage and exceptional (either good or bad) reviews being
overrepresented in our dataset. In future work we would enjoy exploring a different means
of generating edges between businesses to see if we could produce more meaningful results. These include (1) generating similarities between reviewers instead of businesses, using businesses visited in common as edge weights, and (2) weighting using star rating, so that a 5-star reviewed restaurant and a 1-star reviewed restaurant from the same user would be negatively correlated.
Figure 2: Latitude and Longitude Business Plot

Figure 3: PCA Reduced Vector Representations of Businesses

