
2014 3rd International Conference on Eco-friendly Computing and Communication Systems

Future Collaboration Prediction in Co-authorship Network


Roopashree N
Post Graduate Student
Department of CSE, BMS College of Engineering
Bangalore, India
roops.gowda@gmail.com

Umadevi V
Associate Professor
Department of CSE, BMS College of Engineering
Bangalore, India
umav.77@gmail.com

Abstract: Social networking is in widespread use today.
Co-authorship networks, which capture research collaborations,
are an important class of social networks. Research
collaborations often yield good results, but organizing a
research group is a tedious task. Every researcher wants to
collaborate with the experts who best complement his or her own
expertise. Although abundant research has been conducted on
finding future collaborators or links, few methods identify
effective relationships among researchers. In this article, we
propose a method that makes link predictions in co-authorship
networks using a supervised approach. The model extracts
features from the network's node and topological structure that
can be good indicators of future collaborations. The proposed
method was evaluated on a synthetic network as well as a real
co-authorship network, NetScience. Our experiments corroborated
the approach and demonstrated the efficiency of the method.

Keywords: SVM; Co-authorship network; Future collaboration

I. Introduction

Social Network Analysis (SNA) has evolved into one of the key
research areas and has attracted a considerable amount of
attention in recent years. It mainly focuses on relationships
between individuals, also referred to as social entities. A
social network can be defined as a network of interactions whose
nodes represent people or other entities embedded in a social
context, and whose edges signify the interaction, collaboration,
or influence between entities, driven by mutual interests
intrinsic to the group.

In general, social networks are extremely rich in content, and
they contain a very large amount of linkage data which can be
leveraged for analysis. The linkage data constitutes the graph
structure of the social network. The availability of massive
amounts of data has given a new impetus towards a scientific and
statistically robust study in the field of social networks. This
data-centric thrust has led to a significant amount of research,
which has been unique in its statistical and computational focus
in analyzing large amounts of online social network data.

SNA helps to identify highly peripheral people who essentially
represent untapped expertise and thus underutilized resources
for the group.

Social networks are highly dynamic; they grow and change quickly
over time through the addition of new edges, signifying the
appearance of new connections in the underlying social
structure. Understanding the mechanisms that drive this
volatility is a fundamental and complex question that is still
not well understood. However, an important problem that can be
addressed is predicting future associations and the factors
driving those associations. This problem is known as link
prediction, which is a key research direction within social
network analysis.

Prediction of imminent links in a co-authorship graph is an
important research direction, since it is conceptually and
structurally identical to the realistic problem of a social
network in which the scientists in a community interact to
achieve a common goal.

More formally, the link prediction task can be formulated as
follows (based upon the definition in Liben-Nowell and Kleinberg
[1]): Given is a social network $G(V, E)$ in which an edge
$e = (u, v) \in E$ represents some form of interaction between
its endpoints at a particular time $t(e)$. Multiple interactions
can be recorded by parallel edges or by using a complex
timestamp for an edge. For times $t \le t'$, we assume that
$G[t, t']$ denotes the subgraph of $G$ restricted to the edges
with timestamps between $t$ and $t'$. In a supervised training
setup for link prediction, we can choose a training interval
$[t_0, t_0']$ and a test interval $[t_1, t_1']$ where
$t_0' < t_1$. The link prediction task is then to output a list
of edges not present in $G[t_0, t_0']$ that are predicted to
appear in the network $G[t_1, t_1']$.

The prime goal of this work is to propose an efficient technique
for designing a link predictor in networks where nodes represent
researchers and links represent collaborations. This goal raises
the following challenges to be addressed:
978-1-4799-7002-5/14 $31.00 2014 IEEE
DOI 10.1109/Eco-friendly.2014.45
10.1109/ICECCS.2014.45


To extract the node and topological features and use them in combination to obtain a better result.
Link prediction datasets are characterized by a large imbalance in the distribution of class labels, i.e., the number of existing edges is often much smaller than the number of node pairs known to be unconnected.
It is vital that the proposed method computes efficiently when scaled to a large network consisting of a large number of nodes and edges.
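
The formal task definition given earlier (a training interval, a later test interval, and a list of edges absent from the first but present in the second) can be illustrated with a minimal sketch. The function names, the toy timestamps, and the choice to restrict candidate pairs to nodes already seen in the training interval are assumptions made for illustration; the paper does not state how its snapshots were built.

```python
from itertools import combinations


def split_by_time(timed_edges, train, test):
    """Return (train_edges, new_edges) for closed intervals train and test.

    timed_edges: iterable of (u, v, timestamp) triples.
    new_edges: edges seen in the test interval but not in the training one.
    """
    def in_window(t, window):
        lo, hi = window
        return lo <= t <= hi

    def norm(u, v):
        # Undirected edge: store endpoints in sorted order.
        return (u, v) if u <= v else (v, u)

    train_edges = {norm(u, v) for u, v, t in timed_edges if in_window(t, train)}
    test_edges = {norm(u, v) for u, v, t in timed_edges if in_window(t, test)}
    return train_edges, test_edges - train_edges


def candidate_pairs(train_edges):
    """Unlinked node pairs among nodes present in the training graph (assumption)."""
    nodes = sorted({n for edge in train_edges for n in edge})
    return [pair for pair in combinations(nodes, 2) if pair not in train_edges]


# Hypothetical toy data: (author, author, year of joint paper).
timed_edges = [
    ("a", "b", 2001),
    ("b", "c", 2002),
    ("a", "c", 2004),
    ("c", "d", 2005),
    ("a", "b", 2005),
]
train_edges, new_edges = split_by_time(timed_edges, (2001, 2003), (2004, 2006))
pairs = candidate_pairs(train_edges)
# Label 1 if the pair becomes linked in the test interval, 0 otherwise.
labels = {pair: (1 if pair in new_edges else 0) for pair in pairs}
```

In this toy case the pair ("c", "d") is a new edge but is not a candidate, because "d" does not appear in the training interval; a different policy for unseen nodes would change the candidate list.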

II. Related Work

A. Background
The earliest and most basic link prediction model, which works
explicitly on a social network, was proposed by Liben-Nowell and
Kleinberg [1]. Every vertex in the graph represents a person,
and an edge between two vertices represents an interaction
between those persons. Multiplicity of interactions can be
modelled explicitly by allowing parallel edges or by adopting a
suitable weighting scheme for the edges. The learning paradigm
in this setup typically extracts the similarity between a pair
of vertices using various graph-based similarity metrics and
uses the ranking of the similarity scores to predict a link
between two vertices. The authors concentrate mostly on the
performance of various graph-based similarity metrics for the
link prediction task.

Recent methods and techniques were surveyed by Al Hasan and Zaki
[2]. The survey covers a variety of link prediction techniques,
ranging from feature-based classification and kernel-based
methods to matrix factorization and probabilistic graphical
models. These methods vary with respect to model complexity,
prediction performance, scalability, and generalization ability.
The survey considers the traditional (non-Bayesian) models,
which extract a set of features to train a binary classification
model. Al Hasan et al. also presented another work on link
prediction using supervised learning [3], in which many features
were identified and their effectiveness was evaluated. They also
compare different classes of supervised learning algorithms in
terms of their performance metrics. That work addresses how to
construct a dataset for a machine learning algorithm. The
selected features were based on both node and structural
attributes, which resulted in improved accuracy. They
experimented on two co-authorship datasets using most of the
well-known supervised algorithms and reported a ranking of the
features. Their results suggest that a small set of features can
yield good performance.

According to Narang et al. [4], a link prediction heuristic
should take into account not only how close two nodes are in a
network, but also their ability to send and receive information
or to influence each other. This ability is determined by the
nature of the flow taking place on the network, i.e., the
process by which information is transmitted from one node to
another. How easily two nodes can interact with or influence
each other therefore also depends on the nature of the flow that
mediates their interactions. They show that different types of
flows ultimately lead to different notions of network proximity.
They measure the performance of different heuristics on the
missing link prediction task in a variety of real-world social,
technological, and biological networks. They show that
heuristics based on random-walk-type processes outperform the
popular Adamic-Adar and common-neighbors heuristics in many
networks. While the newly defined heuristics did not beat
existing ones in the missing link prediction task, the work
motivated these heuristics in terms of a flow-based framework.

The effects of social influence and homophily were considered by
Gong et al. [5], who suggest that both network structure and
node attribute information should inform the tasks of link
prediction and node attribute inference. They used the SAN
(Social-Attribute Network) framework with several leading
supervised and unsupervised link prediction algorithms and
demonstrated performance improvement for each algorithm on both
link prediction and attribute inference. They made the novel
observation that attribute inference can help inform link
prediction, i.e., link prediction accuracy is improved by first
inferring missing attributes. They comprehensively evaluate
these algorithms and compare them with other existing algorithms
using a novel, large-scale Google+ dataset, which they make
publicly available.

Another challenge in the link prediction problem is to combine
effectively the information from network structure with rich
node and edge attribute data. Backstrom and Leskovec [6]
developed an algorithm based on supervised random walks that
combines the information from the network structure with edge
attribute information. The algorithm assigns strengths to edges
in the network such that a random walker is more likely to visit
the nodes to which new links will be created in the future.
Their approach outperformed state-of-the-art unsupervised
approaches as well as approaches based on feature extraction.

Previous works proposed in the literature for link prediction,
whether supervised or unsupervised, have used large feature
sets. An approach with a minimal number of features can improve
the performance of the algorithm. In this work we propose a
supervised learning approach that uses a minimal number of
features for link prediction in social networks.

Support Vector Machine (SVM) is one of the supervised learning
approaches that can be applied for prediction. LibSVM [7], a
library for SVMs, was used for link prediction in this work.

III. Proposed Method for Link Prediction

The system architecture of our approach for future collaboration
prediction in co-authorship networks is shown in Fig. 1. The
main tasks of this approach are as follows:
Constructing an adjacency matrix from the dataset
Extraction of features
Feature set construction
Building the training model using SVM
Testing the model
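
As a rough illustration of the first task in the list, the sketch below turns a list of author pairs into a symmetric 0/1 adjacency matrix. The function name, the author names, and the decision to ignore self-pairs are assumptions for illustration only; the paper does not describe its implementation at this level.

```python
def adjacency_matrix(edge_pairs):
    """Build a symmetric 0/1 adjacency matrix from (author, author) pairs.

    Returns the sorted author list (row/column order) and the matrix.
    """
    authors = sorted({a for pair in edge_pairs for a in pair})
    index = {a: i for i, a in enumerate(authors)}
    n = len(authors)
    matrix = [[0] * n for _ in range(n)]
    for a, b in edge_pairs:
        if a == b:
            continue  # assumption: a pair of identical authors adds no link
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = 1  # co-authorship is undirected
    return authors, matrix


# Hypothetical edge pairs, one per observed collaboration.
edge_pairs = [("ann", "bob"), ("bob", "cid"), ("ann", "cid"), ("cid", "dee")]
authors, A = adjacency_matrix(edge_pairs)
```

A dense matrix is convenient for small networks such as the 10-node synthetic graph; for larger networks a sparse neighbor-set representation would use far less memory.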


Fig. 1. System architecture of the proposed method for link prediction

A. Construction of adjacency matrix

Generally, the dataset contains information in the form of edge
pairs representing the collaborations between authors. The
network in the problem space can be mapped to a graph
represented as an adjacency matrix in the solution space.
Formally, link prediction takes as input a partially observed
graph $G \in \{0, 1, ?\}^{n \times n}$ over $n$ nodes, where 0
denotes a known non-existing link, 1 denotes a known present
link, and ? denotes an unknown link. Our goal is to make
predictions for the unknown links.

B. Feature Extraction
A multitude of topological features can be used for a pair of
nodes. In this paper, the features documented in [2] were chosen
for co-author relationship prediction.

1. Node Neighborhood based Features

Common neighbors. Common neighbors is a measure that considers
the intersection of the neighbor sets of two nodes u and v. The
idea of using the number of common neighbors follows from the
network transitivity property. As the number of common neighbors
increases, the likelihood that the two nodes will be linked
becomes higher.

$$\text{Common neighbors}(u, v) = |\Gamma(u) \cap \Gamma(v)|$$

where $\Gamma(u)$ denotes the set of neighbors of node u,
$\Gamma(u) \cap \Gamma(v)$ denotes the set of common neighbors
of nodes u and v, and $|\Gamma(u) \cap \Gamma(v)|$ denotes the
cardinality of the set of common neighbors.

Jaccard's coefficient. Jaccard's coefficient is a normalized
measure of common neighbors. It calculates the ratio of common
neighbors to all the neighbors of the two nodes, and can also be
used to compare the similarity and diversity of the neighbor
sets.

$$\text{Jaccard coefficient}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$$

Adamic/Adar. Adamic/Adar, a weighted version of common
neighbors, assigns greater weight to neighbors that are not
shared with many others. This means the contribution of a common
neighbor to the score is weighted in proportion to the rarity of
the neighbor.

$$\text{Adamic/Adar}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$$

2. Vertex feature Aggregation

Preferential attachment. Preferential attachment was introduced
to explain the power-law degree distribution in complex
real-world networks. The preferential attachment concept is akin
to the well-known "rich get richer" model. It means a node with
a higher degree is more likely to acquire more links in the
future, i.e., nodes with higher degree attract more of the links
that are newly introduced to the network.

$$\text{Preferential attachment score}(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|$$

3. Path based features

Shortest path. The fact that the friend of a friend can become a
friend suggests that the path distance between two nodes in a
social network can influence the formation of a link between
them. The shorter the distance, the more likely a link is to
form between the nodes.

Katz. It is a variant of the shortest path distance, but works
better than the former for link prediction. Katz defines a
measure that sums over all paths between two nodes, damped
exponentially by length so that short paths count more heavily.

$$\text{Katz}(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\text{paths}^{\langle l \rangle}_{u,v}|$$

where $\text{paths}^{\langle l \rangle}_{u,v}$ is the set of all
paths of length l from u to v. Katz generally works much better
than the shortest path since it is based on the ensemble of all
paths between the nodes u and v. The parameter $\beta$
($\le 1$) can be used to regularize this feature. A small value
of $\beta$ considers only the shorter paths, in which case this
feature behaves much like the features that are based on the
node neighborhood.
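
The six pairwise features named in this section (common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment, shortest path, and Katz) can be computed from an adjacency matrix as in the following minimal sketch. It is not the authors' implementation. Two points are assumptions: the natural logarithm is used in Adamic/Adar, and Katz is truncated at `max_len` and counts walks through powers of the adjacency matrix, which is the usual way the infinite sum is approximated. All function names and the 4-node example graph are hypothetical.

```python
import math
from collections import deque


def neighbor_sets(adj):
    """Map each node index to the set of its neighbor indices."""
    return [{j for j, x in enumerate(row) if x} for row in adj]


def common_neighbors(nb, u, v):
    return len(nb[u] & nb[v])


def jaccard(nb, u, v):
    union = nb[u] | nb[v]
    return len(nb[u] & nb[v]) / len(union) if union else 0.0


def adamic_adar(nb, u, v):
    # A common neighbor z of two distinct nodes has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(nb[z])) for z in nb[u] & nb[v])


def preferential_attachment(nb, u, v):
    return len(nb[u]) * len(nb[v])


def shortest_path(nb, u, v):
    """Breadth-first search hop count from u to v; None if unreachable."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in nb[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


def katz(adj, u, v, beta=0.05, max_len=5):
    """Truncated Katz score: sum over l of beta**l * (walks of length l)."""
    n = len(adj)
    power = [row[:] for row in adj]  # holds A**l, starting with l = 1
    score = 0.0
    for l in range(1, max_len + 1):
        score += beta ** l * power[u][v]
        power = [[sum(power[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return score


# Hypothetical 4-node graph: edges 0-1, 0-2, 1-2, 2-3.
adj = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
nb = neighbor_sets(adj)
u, v = 0, 3  # an unlinked pair whose features we want
features = [
    common_neighbors(nb, u, v),
    jaccard(nb, u, v),
    adamic_adar(nb, u, v),
    preferential_attachment(nb, u, v),
    shortest_path(nb, u, v),
    katz(adj, u, v, beta=0.1, max_len=3),
]
```

For the pair (0, 3) the only common neighbor is node 2, so the neighborhood features are small, while the path features record that the two nodes are two hops apart. The repeated matrix product makes this Katz sketch cubic in the number of nodes per step, so it is suited only to small graphs.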


C. Feature Set Construction

For link prediction, each data point corresponds to a pair of
vertices with a label denoting their link status, i.e., 1 if a
link exists and 0 otherwise, so the chosen features should
represent some form of proximity between the pair of vertices.

The class labels actually used are -1 and +1, where -1 denotes
the non-existence of a link (label 0 above) and +1 denotes the
existence of a link. Once the features are calculated for all
node pairs in the graph, a feature vector consisting of all the
feature-based scores and the class label is obtained for each
node pair. A sample feature set representation is shown in
Table 1.

TABLE 1: FEATURE VECTOR REPRESENTATION

The edge-pairs column lists the edge pairs obtained from the
dataset. The remaining columns contain the class label and the
extracted feature values for each edge pair.

D. Building link predictive model

The constructed feature set was used to train the model. A
fraction of the feature vectors, 70% of the total, was used for
training. The feature set is given as input to the SVM in order
to obtain the prediction model. For our experiment, LibSVM [7]
was used: it takes the constructed feature set as input and
outputs a predictive model containing the attributes and other
information required for prediction.

E. Testing the model

The trained model obtained from the SVM is then tested for its
performance. The remaining fraction of the feature vectors, 30%
of the total, which was held out from training, was used for
testing. SVM testing outputs a set of predicted labels.

IV. Results

Observations were made on two datasets: synthetic data and
NetScience [8] data. Results were evaluated by four performance
metrics, namely accuracy, precision, recall, and F1-score.

I. Synthetic Data

A graph with a 10-node structure was considered to determine the
performance of the proposed approach.

A. Characteristics of Synthetic dataset

The characteristics of the synthetic 10-node network are as
follows.

Number of nodes          10
Number of edge-pairs     16

TABLE 2: CHARACTERISTICS OF DATASET CONTAINING 10 NODES

Fig. 2. Graph of 10-node network

Fig. 2 shows the graph of the synthetic data. The link
prediction experiment was conducted on the synthetic data and
the results were observed. Based on the results, the performance
measures were calculated.

Metrics      Values
Accuracy     93%
Precision    93%
Recall       93%
F1-score     0.9333

TABLE 3: PERFORMANCE MEASURE OBTAINED FOR SYNTHETIC DATA

Table 3 shows the performance metric values obtained for the
synthetic data.

II. Real-time Dataset

In this experiment, the real dataset was obtained from
NetScience. The NetScience data is the co-authorship network of
scientists working on network theory and experiment, as compiled
by M. Newman in May 2006 [8].
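
The evaluation protocol used for both datasets (70% of the feature vectors for training, 30% for testing, and four metrics computed from the predicted labels) can be sketched as follows. The classifier itself is left out: `y_pred` is a hypothetical stand-in for the labels an SVM would output. Shuffling with a fixed seed before the split is an assumption, since the paper does not say how samples were assigned; rounding the 70% cut down is chosen because it matches the training and testing counts the paper reports for the NetScience experiments.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle (seeded, an assumption) and split 70% / 30%, rounding down."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Sample counts of the balanced NetScience experiment: 2743 + 2743 pairs.
train, test = split_70_30(range(5486))

# Hypothetical true and predicted labels for eight test pairs.
y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]
scores = evaluate(y_true, y_pred)
```

With heavily imbalanced labels, accuracy alone can look high even when few links are found, which is why precision, recall, and F1 are reported alongside it.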


A. Characteristics of NetScience dataset

Characteristics       Values
Number of nodes       1588
Number of edges       2743

TABLE 4: CHARACTERISTICS OF NETSCIENCE DATA

The data was divided into several smaller datasets based on the
number of positive and negative classes, and three experiments
were conducted. In order to check for the effect of class
imbalance, a small variation in the number of positive and
negative classes was considered.

Experiment 1 was conducted with an equal number of positive and
negative classes. Table 5 shows the number of training and test
samples considered for Experiment 1.

Characteristics               Values
Number of Positive classes    2743
Number of Negative classes    2743
Number of Training data       3840
Number of Testing data        1646

TABLE 5: SAMPLE STATISTICS FOR EXPERIMENT 1

Experiment 2 was conducted with more positive and fewer negative
classes. Table 6 shows the number of training and test samples
considered for Experiment 2.

Characteristics               Values
Number of Positive classes    2743
Number of Negative classes    1371
Number of Training data       2879
Number of Testing data        1235

TABLE 6: SAMPLE STATISTICS FOR EXPERIMENT 2

Experiment 3 was conducted with fewer positive and more negative
classes. Table 7 shows the number of training and test samples
considered for Experiment 3.

Characteristics               Values
Number of Positive classes    2743
Number of Negative classes    5486
Number of Training data       5760
Number of Testing data        2469

TABLE 7: SAMPLE STATISTICS FOR EXPERIMENT 3

Table 8 shows the performance metric values obtained for the
three experiments on NetScience data.

Metrics          Experiment 1   Experiment 2   Experiment 3
Accuracy (%)     99.6           99.6           99.8
Precision (%)    99.6           99.5           99.5
Recall (%)       99.7           99             99
F1-Score (%)     99.7           99.2           99.2

TABLE 8: PERFORMANCE MEASURE OBTAINED FOR NETSCIENCE DATA

The supervised SVM algorithm performed well in co-author
relationship prediction with a limited number of features.
Collaborations were easier to predict for authors with a higher
degree of collaboration than for less productive authors, in
terms of all four evaluation measures.

V. Conclusion

In this work, the classical problem of link prediction was
considered, in which we predict the edges, given a snapshot of a
social network, that are most likely to occur in the future.
There have been numerous research attempts to address the
problem of link prediction using supervised learning methods.
However, the knowledge gained was not sufficient for accurate
link prediction. Links or future collaborations can be predicted
accurately by selecting appropriate features that extract
network-related information. The features selected for this work
included node and topological features. In this work, an
approach using a limited feature set was proposed for building a
link prediction model in co-authorship networks. The proposed
approach was tested on synthetic data and real network data.


In conclusion, the selection of appropriate features was helpful
in predicting links with better results. We observed that the
samples selected for testing must be balanced, with appropriate
numbers of positive and negative classes. The proposed approach
could be used to identify latent yet potentially successful
collaborations, which would facilitate the development of
research collaborations.
References
[1] Liben-Nowell, David, and Jon Kleinberg. "The link-prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.
[2] Al Hasan, Mohammad, and Mohammed J. Zaki. "A survey of link prediction in social networks." Social Network Data Analytics. Springer US, 2011. 243-275.
[3] Al Hasan, Mohammad, et al. "Link prediction using supervised learning." SDM06: Workshop on Link Analysis, Counter-terrorism and Security. 2006.
[4] Narang, Kanika, Kristina Lerman, and Ponnurangam Kumaraguru. "Network flows and the link prediction problem." Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM, 2013.
[5] Gong, Neil Zhenqiang, et al. "Jointly predicting links and inferring attributes using a social-attribute network (SAN)." arXiv preprint arXiv:1112.3265 (2011).
[6] Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.
[7] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[8] NetScience dataset: M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
