
curso: Protect Yourself from Curse of Attribute Inference

A social network privacy-analyzer


Eunsu Ryu

Yao Rong

Jie Li

Dept. of Computer Science


Duke University, Durham, NC, USA

{er40, yao.rong}@duke.edu

{jieli, ashwin}@cs.duke.edu

Ashwin Machanavajjhala

Dept. of Electrical & Comp. Engg

Duke University, Durham, NC, USA

ABSTRACT

While social networking platforms allow users to control how their private information is shared, recent research has shown that a user's sensitive attribute can be inferred based on friendship links and group memberships, even when the attribute value is not shared with anyone else. Thus, existing access control mechanisms are unable to protect against such privacy breaches.
Our research goal is to develop tools that help a user Alice be aware of privacy breaches via attribute inference. In this paper, we specifically focus on two problems: (a) whether Alice's sensitive attribute can be inferred based on public information in Alice's neighborhood, and (b) whether making Alice's sensitive attribute public leads to the disclosure of sensitive information of another user Bob in Alice's neighborhood. We propose three algorithms to detect the aforementioned privacy breaches. We limit our scope to the one-hop neighbors of Alice, that is, information that is visible to an app that can be executed on behalf of Alice. Our results indicate that analyzing local networks is sufficient to extract a significant amount of information about most users.

Categories and Subject Descriptors

H.2 [Database Management]: Data mining

General Terms
Algorithms, Security

Keywords

social networks, attribute inference

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DBSocial '13, New York, NY, USA
Copyright 2013 ACM 978-1-2191-4 ...$15.00.

1. INTRODUCTION

Social networks have gained wide popularity over the past decade. While the unprecedented success of the social networking industry has established an attractive ecosystem with advertisements and social gaming, the increasing volume of personal information shared in social networks has raised serious privacy concerns [2, 7, 9, 13]. For instance, social networks have been criticized for leaking user privacy [6], and advertisers take advantage of social networks to collect information about users.
As a remedy, social networking companies allow users to hide a portion of their profiles, or to select specific groups of friends with whom to share sensitive information. Unfortunately, it has been shown that this approach does little to protect users from privacy breaches. Recent work [5, 13] demonstrates that it is still possible to infer sensitive user attributes with embarrassingly high accuracy using only friendship and group information. The fact that every user publishes different parts of their profile implies that private information is present, can be learned, and is even shared subconsciously in the social network. For instance, while Alice may want to keep the fact that she can speak Mandarin private, if all of her friends publicize the fact that they speak Mandarin, then one might infer with high probability that Alice also speaks Mandarin.
Therefore, the access control mechanisms provided by social networks cannot protect against such privacy breaches, and in fact lull users into a false sense of privacy. It is thus important to raise awareness amongst social network users about the possibility of the aforementioned attribute inference attacks. Our research goal is to build tools that can execute on behalf of a user, detect potential attribute inference attacks, and warn the user so that they can make an informed decision. In this paper, we present some initial work toward this goal.
Contributions: In this paper we focus on two concrete problems: (a) whether a user Alice's sensitive information can be inferred based on public attributes of her friends, and (b) whether making Alice's attribute value publicly accessible results in the disclosure of a private attribute value for another user Bob in Alice's neighborhood. While the former problem directly affects Alice's privacy, the latter problem may inform a conscientious friend of Bob about Bob's privacy disclosure. We depart from prior work in the following way: rather than analyzing the risks of attribute inference using large global networks that contain millions of users and thousands of attributes [5, 12, 13], we focus our attention on only the one-hop neighborhood of Alice. Not only does this allow our algorithms to be very efficient, but focusing on the immediate neighborhood also helps in building tools (future work) that run on behalf of the user by leveraging the information in the social network that is accessible to the user (via APIs). Using the entire social network would require

these tools to be built in cooperation with the social network platform.
We propose three models for detecting and quantifying the aforementioned privacy breach scenarios. Our methods are evaluated on real social network data. Our results indicate that analyzing the one-hop neighborhood is sufficient to infer the values of private attributes for a significant fraction of users. However, it should be noted that the proposed framework is unable to prove that Alice's profile is free of attribute inference, as the adversary can potentially have more information than is available to our tool.
Outline: The remainder of the paper is organized as follows. In Section 2, we discuss related work. Section 3 introduces notation and describes the problem formulation. Section 4 presents three novel models for attribute inference along with inference methods. In Section 5 we discuss experiments and performance evaluations of our models, and we conclude in Section 6.

2. RELATED WORK

Many recent research publications analyze social network structures to infer hidden attributes. Zheleva & Getoor [13] use a large Facebook network to show that friendships and group memberships contain sufficient information to learn end-user hidden attributes with astounding accuracy. Other work along similar lines [7, 9] uses friendships and attribute information to evaluate the risk of privacy breaches. Backstrom et al. [2] also show active attacks through which adversaries may re-identify and learn sensitive information from anonymized social networks.
Many models have been proposed for inferring links and attributes in social networks. In [1, 5, 12], the authors use social-attribute networks (SANs) to jointly infer latent attributes as well as friendships.
The attribute inference problem can be formulated as a clustering problem [3], a matrix factorization problem [8], or a regression problem. The Restricted Boltzmann Machine (RBM) introduced in [11] also has an interesting application in inferring latent features such as hidden attributes. Using RBMs for social network attribute inference may be an interesting direction for future research.

3. PROBLEM FORMULATION

Figure 1 shows the local social network around an end-user named Alice. Alice has a set of public attributes accessible by all of her friends. She has a hidden attribute, say, her ability to speak Mandarin. She wants to find out, to the best of her knowledge, whether an adversary can guess this information based on her public attributes. She also wants to know whether publicizing this attribute would breach any of her friends' privacy. Specifically, based on the structure of her local network, Alice wants to:

Task 1: determine whether her secret can be guessed from public information.
Task 2: determine whether publicizing her hidden attribute would breach her friends' privacy.

Figure 1: Local network around an end-user Alice. Alice only has access to her local network.

Formally, consider a local network $L = (V, E)$ around a user (Alice). This network is a graph that consists of Alice and her friends; each node $i \in V$ represents a user while the edges in $E$ model friendships. Let $N$ denote the number of Alice's friends so that $|V| = N + 1$, and $i = 0$ represents Alice herself. Assume each user $i$ has $M$ binary attributes/features $\{x_{ij}\}_{j=1}^{M}$, where $x_{ij} \in \{0, 1\}$. For some user-attribute pairs $(i, j)$, $x_{ij}$ is public, while for others, $x_{ij}$ is hidden. If $x_{ij} = 1$, then we say that user $i$ has a positive $j$th attribute/feature. Conversely, $x_{ij} = 0$ means that user $i$ has a negative $j$th attribute/feature. If $x_{ij}$ is not known (missing), then user $i$ has a hidden (or private) $j$th attribute/feature. Under this construction, Alice has access to $T = (L, X)$ for all public $x_{ij}$'s.
Suppose Alice has a hidden attribute $x_{0m} \in \{0, 1\}$. We seek to:

Quantify the amount of information about $x_{0m}$ that can be inferred by adversaries based on the structure of $T$ (for Task 1).
Quantify the gain of information about the attribute $m$ across the network $L$ induced by publicizing $x_{0m}$ (for Task 2).

Specifically, we seek to design an estimator/predictor $f(x_{ij})$ for a hidden attribute $j$ of user $i$. Based on $f(x_{ij})$, we compute an error function $E(x_{ij})$ that evaluates the goodness of our estimation. We shall design $E(x_{ij})$ to have a low value if:

An adversary can guess $x_{0m}$ with high accuracy (for Task 1).
Alice breaches her friends' privacy by publicizing $x_{0m}$ (for Task 2).

For $\epsilon > 0$, we declare that privacy is breached at level $\epsilon$ if $E(x_{ij}) < \epsilon$.

3.1 Deviation and Error Metrics

In this section we formulate our error function $E(x_{ij})$ associated with the estimator $f(x_{ij})$. We first define the deviation function $g(x_{ij}) = g(i, j)$,

$$g(i, j) = |x_{ij} - f(x_{ij})| \qquad (1)$$

as the residual in approximating $x_{ij}$ with $f(x_{ij})$.

Suppose Alice knows the value of her hidden attribute $x_{0m} \in \{0, 1\}$. For Task 1, we declare that $x_{0m}$ is breached at level $\epsilon$ if it is possible to infer $x_{0m}$ up to an error $\epsilon$ based on Alice's local network $L$:

$$E_0(m) \triangleq g(0, m) = |x_{0m} - f(x_{0m})| < \epsilon. \qquad (2)$$

Since $x_{0m}$ is either 0 or 1, another interpretation of $E_0(m)$ is that an adversary can use $f(x_{0m})$ to correctly guess $x_{0m}$ with probability $1 - E_0(m)$.
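
As an illustration of the Task 1 check, the deviation and the breach test can be sketched in a few lines of Python. This is only a minimal sketch: the function names `deviation` and `is_breached`, the parameter name `epsilon` for the breach level, and the example numbers are assumptions made for this illustration and are not part of our tool.

```python
def deviation(x: int, f: float) -> float:
    """Deviation |x - f(x)| between a binary attribute value and its estimate."""
    if x not in (0, 1):
        raise ValueError("attribute values are binary")
    if not 0.0 <= f <= 1.0:
        raise ValueError("estimates are expected to lie in [0, 1]")
    return abs(x - f)


def is_breached(x0m: int, f_x0m: float, epsilon: float) -> bool:
    """Task 1 check: the attribute is breached when its deviation is below epsilon."""
    return deviation(x0m, f_x0m) < epsilon


# Hypothetical example: Alice's hidden attribute is 1 and an estimator outputs 0.9.
error = deviation(1, 0.9)
print("deviation:", round(error, 3))
print("breached at level 0.2:", is_breached(1, 0.9, 0.2))
print("probability of a correct guess:", round(1 - error, 3))
```

A small deviation corresponds to a confident and correct estimate, which is why a value below the chosen level is reported as a breach.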


For Task 2, we say that the $m$th attribute is breached at level $\epsilon$ due to $x_{0m}$ if the deviation after publicizing $x_{0m}$ is on average lower than that before publicizing $x_{0m}$. Let $\Omega_m = \{i \mid x_{im} \text{ is public}\}$ denote the set of Alice's neighbors whose $m$th attribute value is public. Then we have

$$E(m) \triangleq \frac{1}{|\Omega_m|} \sum_{i \in \Omega_m} \left[ g(i, m) - g'(i, m) \right] < \epsilon. \qquad (3)$$

Here $g(i, m) = g(x_{im})$ is the deviation without $x_{0m}$, while $g'(i, m) \triangleq g(x_{im} \mid x_{0m})$ is the deviation given $x_{0m}$. With the error functions defined as above, we now turn our attention to the design of the estimator function $f(x_{ij})$ to infer attribute values.

4. ATTRIBUTE INFERENCE

In this section, we present three techniques for inferring private attribute values using a user's one-hop neighborhood in a social network. Before we describe our algorithms, we describe the social-attribute network model and the utility metrics that will be used in our algorithms.

4.1 Social-Attribute Network Model

We adopt the notion of Social-Attribute Networks (SAN) [12]. A SAN can be constructed by augmenting the original network $L$ with $M$ distinct nodes corresponding to $M$ attributes. The original nodes corresponding to the users are called social nodes, and the new nodes representing the attributes are called attribute nodes. An undirected link between user $i$ and attribute $j$ is formed if $x_{ij}$ is public (positive or negative). Figure 2 shows an example of a SAN (from [5]).

4.2.1 Importance of a friend

We define the importance of user $i'$ to user $i$ as:

$$u(i, i') = \frac{1}{\log |G_{i'}|} \sum_{t \in I_i^+ \cap I_{i'}^+} \frac{1}{\log |I_t|}. \qquad (4)$$

Note that $t$ runs through all the common friends and attributes between $i$ and $i'$. $|I_t|$ is the number of users connected to the user/attribute $t$, signifying the popularity of $t$. $|G_{i'}|$ is the total number of friends associated with $i'$. The logarithm $\log |I_t|$ is inspired by the Adamic-Adar (AA) notion defined in [7], in which popular friends and attributes are considered less significant. The multiplicative factor $\frac{1}{\log |G_{i'}|}$ takes into account the local nature of our network $L$ by further reducing the significance of a social node with a large number of friends (e.g., celebrities). $u(i, i')$ quantifies the significance of user $i'$ to user $i$, and is larger if $i$ and $i'$ share more friends (or attributes) in common.

4.2.2 Value of an attribute

Define $v(j, i)$ as the value of attribute $j$ to user $i$:

$$v(j, i) = \sum_{t \in I_i \cap I_j} \frac{1}{\log |I_t^+|}. \qquad (5)$$

Observe that $t$ runs through the friends of user $i$ with (positive) feature $j$. $|I_t^+|$ is the number of friends and positive features associated with user $t$. As in the case of $u(i, i')$, we downplay the significance of high-degree social nodes. This utility function is designed so that feature $j$ is more significant to user $i$ if more of $i$'s friends have feature $j$.
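
To make the two utility functions concrete, the following Python sketch computes the importance of a friend and the value of an attribute on a toy local network. It is a sketch under stated assumptions rather than our implementation: the dictionaries `friends` and `positive`, the helper names, and the toy data are hypothetical; only positive public attributes are tracked; and the friend count of $i'$ is clamped to at least 2 so that its logarithm is never zero, which is a choice made only for this example.

```python
import math

# Toy local network (hypothetical data); friendships are symmetric.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"alice"},
}
# Public positive attributes of each user (hypothetical attribute names).
positive = {
    "alice": {"duke"},
    "bob": {"duke", "mandarin"},
    "carol": {"mandarin"},
    "dave": set(),
}


def connected_users(t):
    """Number of users linked to node t, where t is a user or an attribute."""
    if t in friends:
        return len(friends[t])
    return sum(1 for attrs in positive.values() if t in attrs)


def plus_neighbors(i):
    """Friends and positive attributes of user i."""
    return friends[i] | positive[i]


def importance(i, i2):
    """Importance of user i2 to user i: shared nodes weighted by 1/log(popularity)."""
    common = plus_neighbors(i) & plus_neighbors(i2)
    total = sum(1.0 / math.log(connected_users(t)) for t in common)
    # Assumption for this sketch: clamp the friend count to 2 to avoid log(1) = 0.
    return total / math.log(max(len(friends[i2]), 2))


def value(j, i):
    """Value of attribute j to user i: friends holding j, weighted by 1/log(degree)."""
    holders = [t for t in friends[i] if j in positive[t]]
    return sum(1.0 / math.log(len(plus_neighbors(t))) for t in holders)


print(importance("alice", "bob"), importance("alice", "dave"))
print(value("mandarin", "alice"))
```

In this toy network Bob shares a friend and an attribute with Alice, so his importance to her is positive, while Dave shares nothing with her and receives an importance of zero.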

Figure 2: Example of a simple SAN model (from [5])

The plus sign between a social node $u_i$ and an attribute node $j$ means $x_{ij} = 1$, while a minus sign signifies $x_{ij} = 0$. The mutex links tie a set of mutually exclusive attributes together so that no two mutually exclusive attributes are selected simultaneously.
Based on the above description of a SAN, we define a number of useful sets that will be used in this section. As before, we let $i$ represent a user, and $j$ an attribute.

4.2.3 Power of an attribute

The power of an attribute $j$ (having value $z$) to user $i$ is:

$$w_z(i, j) = \sum_{t \in I_i \cap F_j^z} u(i, t). \qquad (6)$$

Here $t$ runs through all the friends of user $i$ having value $z$ for feature $j$ (i.e., $x_{tj} = z$). The power of $x_{ij} = z$ is obtained by adding up the importance of all the friends $i'$ of user $i$ having $x_{i'j} = z$. For example, the power of the ability to speak Chinese to Alice is computed by summing the importance of all of her Chinese-speaking friends.
We define the relative power of attribute $j$ to user $i$:

$$w_{ij} = w_1(i, j) - w_0(i, j).$$

Note that $w_{ij} > 0$ if and only if $x_{ij} = 1$ has more power/significance to user $i$ than does $x_{ij} = 0$. $w_{ij} = 0$ means $x_{ij} = 1$ possesses equal importance to $x_{ij} = 0$.
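
The power and relative power can be sketched as follows, assuming the importance of each friend has already been computed. The function names, the dictionary layout (`friends`, `values`, `importance`), and the numbers are hypothetical and chosen only for illustration; friends whose value for the attribute is hidden are simply skipped.

```python
def power(i, j, z, friends, values, importance):
    """Total importance of the friends of i whose public value for attribute j equals z."""
    return sum(importance[(i, t)] for t in friends[i] if values.get((t, j)) == z)


def relative_power(i, j, friends, values, importance):
    """Power of value 1 minus power of value 0 for attribute j and user i."""
    return (power(i, j, 1, friends, values, importance)
            - power(i, j, 0, friends, values, importance))


# Hypothetical toy data: Alice's friends, their public values, and precomputed importances.
friends = {"alice": ["bob", "carol", "dave", "erin"]}
values = {("bob", "mandarin"): 1, ("carol", "mandarin"): 1, ("dave", "mandarin"): 0}
importance = {
    ("alice", "bob"): 1.5,
    ("alice", "carol"): 0.5,
    ("alice", "dave"): 1.0,
    ("alice", "erin"): 2.0,  # Erin's value is hidden, so she does not contribute
}

w = relative_power("alice", "mandarin", friends, values, importance)
print(w)  # positive here, so value 1 carries more power than value 0 for Alice
```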
Now we present three designs of the estimator f (xij ).

$I_l$ = {all users connected to user/attribute $l$}.
$I_i^+$ = {all friends and positive attributes of user $i$}.
$F_j^z$ = {all users with feature $j$ having value $z$}.
$G_i$ = {all friends of user $i$ in the network}.
$M_i = \{m \mid x_{im} \text{ is public}\}$.

4.2 Utility Functions

Here, we define three utility functions as weighted sums of common neighbors with lower weights on popular nodes.

4.3 Deterministic Algorithm

For a hidden attribute $x_{ij}$ of interest, we compute the relative power $w_{ij}$ as defined in (7):

$$w_{ij} = w_1(i, j) - w_0(i, j). \qquad (7)$$

Since $w_{ij}$ can be any real number, we map it onto $(0, 1)$ to construct the estimator $f(x_{ij})$:

$$f(x_{ij}) = h(w_{ij}) = \frac{1}{1 + \exp(-w_{ij})}, \qquad (8)$$

where $h(y) = 1/(1 + \exp(-y))$ is the sigmoid function. We say that $x_{ij} = 1$ is more likely than $x_{ij} = 0$ if $x_{ij} = 1$ has


more power relative to $x_{ij} = 0$ (i.e., $w_{ij} > 0$). Conversely, $x_{ij} = 0$ is more likely than $x_{ij} = 1$ if $w_{ij} < 0$. In short, $w_{ij} > 0$ implies that $x_{ij} = 1$ is a better guess than $x_{ij} = 0$.

4.4 Logistic Regression

Next, we use logistic regression to model an adversary trying to learn a sensitive attribute $x_{im}$ associated with user $i$ and a given feature $m$. Since $x_{im}$ takes on binary values and is currently hidden, we model

$$\Pr[x_{im} = 1] = h\Big( \sum_{i' \neq i} \big( u(i, i') + v(m, i') \big) \beta_{i'} \Big) \qquad (9)$$

using the utility functions $u(i, i')$ and $v(j, i')$ defined in (4) and (5). To learn the coefficients $\beta_{i'}$, we use a regularized maximum likelihood approach with an $\ell_1$ penalty on $\beta$. Specifically, we minimize $\ell(\beta)$ defined as

$$\ell(\beta) = -\sum_{(i, m)} \big[ x_{im} \log h(y_{im}) + (1 - x_{im}) \log(1 - h(y_{im})) \big] + \lambda \|\beta\|_1. \qquad (10)$$

We may use known algorithms (such as gradient methods) to solve the above optimization problem in $\beta$. Once the coefficients $\hat{\beta}_{i'}$ are estimated, we may construct $f(x_{im})$ by simply computing the predictive value $\hat{y}_{im}$,

$$\hat{y}_{im} = \sum_{i' \neq i} \big( u(i, i') + v(m, i') \big) \hat{\beta}_{i'},$$

and taking the sigmoid transformation:

$$f(x_{im}) = \Pr(\hat{x}_{im} = 1) = h(\hat{y}_{im}).$$

4.5 Matrix Factorization

Here we use a Bayesian model to construct the estimator $f(x_{ij})$. For all public $x_{ij}$'s, we compute the relative power as in (7),

$$w_{ij} = w_1(i, j) - w_0(i, j), \qquad w_z(i, j) = \sum_{t \in I_i \cap F_j^z} u(i, t),$$

and organize them in an $(N + 1) \times M$ array $W = [W_{ij}]_{ij} \in \mathbb{R}^{(N+1) \times M}$. Observe that the matrix $W$ has missing values. Our goal is to estimate those missing entries $W_{ij}$ associated with the hidden attribute of interest $x_{ij}$.
We first assume that the matrix $W$ can be represented as the inner product of two latent matrices $D$ and $S$ plus some noise:

$$W = D^T S + E. \qquad (11)$$

Specifically, we assume

$$W_{ij} \sim \mathcal{N}(d_i^T s_j, \alpha^{-1}), \quad \alpha \sim \mathrm{gamma}(a, b),$$
$$d_{ki} \sim \mathcal{N}(0, 1), \quad s_{kj} \sim \mathcal{N}(0, \tau^{-1}),$$
$$\tau \sim \mathrm{gamma}(c, d), \quad K \sim \mathrm{Uniform}(1, \ldots, K_{\max}).$$

Let $\Theta = (\alpha, \tau, K)$. We use the Integrated Nested Laplace Approximation (INLA) [10] to approximate the posterior predictive distribution $p(W_{ij} \mid W)$. First, approximate the marginal posterior $p(\Theta \mid W)$ by

$$p(\Theta \mid W) \approx \tilde{p}(\Theta \mid W) \propto \left. \frac{p(W, D, S, \Theta)}{p_G(D, S \mid W, \Theta)} \right|_{(D, S) = (D^*, S^*)},$$

where $p_G$ is the Gaussian approximation to $p(D, S \mid W, \Theta)$ with mode at $(D^*, S^*)$. These modes can be approximated by stochastic gradient descent on $\log p(D, S \mid W, \Theta)$. For simplicity, we employ

$$p_G(D, S \mid W, \Theta) = \prod_i \mathcal{N}(d_i \mid d_i^*, H_{d_i}^{-1}) \prod_j \mathcal{N}(s_j \mid s_j^*, G_{s_j}^{-1}),$$

where the precisions $H_{d_i}$ and $G_{s_j}$ are the Hessians evaluated at the modes. We can use $\tilde{p}(\Theta \mid W)$ to approximate $p(W_{ij} \mid W)$ for all public $(i, j)$ by approximating the integral

$$p(W_{ij} \mid W) = \int p(W_{ij} \mid d_i, s_j, \Theta)\, p(d_i, s_j \mid W, \Theta)\, \tilde{p}(\Theta \mid W)\, dd_i\, ds_j\, d\Theta,$$

from which we can approximate the expectation $E(W_{ij} \mid W)$. We can then construct the estimator $f(x_{ij})$ by

$$f(x_{ij}) = h(E(W_{ij} \mid W)).$$

5. EXPERIMENTS

5.1 Datasets

5.1.1 Google+ dataset

The Google+ dataset introduced in [5] contains the social and attribute links (SAN) of roughly 5200 users collected separately at three different times of the year 2012. The authors use the education and employment profiles of the targets to construct a vocabulary of attributes. For analysis, we use education and employment attribute values that belong to more than five users.

5.1.2 UCI Facebook data

The Facebook sampling dataset collected at UCI [4] contains the network of nearly one million unique users, their network IDs and their privacy settings. Each person can have zero, one or multiple network IDs, and exactly four privacy settings: add as friend, photo thumbnail, view friends, send message. As more than 90% of users use the default privacy settings (all enabled), we pre-process the dataset to minimize the number of overlapping attributes that do not add much information about the identity of the users. We can also regard nodes with an exceptionally large number of friends as the sensitive attributes and test how well our model predicts these links.

5.1.3 Duke Facebook data

We created a new Facebook dataset corresponding to profiles of students at Duke University. We crawled Facebook pages of Duke students, and retrieved attributes such as gender, education, employment, and likes. We use employment as our sensitive attribute.
Duke Online phonebook is a service available to all Duke students, staff and faculty, which returns a comprehensive set of attributes about Duke affiliates. We use data from the Duke phonebook as ground truth (when the Facebook profiles of Duke students are hidden) and use it to verify the quality of detecting attribute inference using our algorithms.
Table 1 shows the summary of datasets.

Dataset    Nodes    Attributes       Domain Size
Google+    5200     School, Work     275
UCI FB     984K     Popular Nodes    367
Duke FB    1475     Work             69

Table 1: Summary of datasets


Figure 5: Scatter plot of degree of the user versus inference error for Task 1 using the Matrix algorithm for the Duke Facebook dataset

5.2 Experimental Setup

In order to test the performance of our proposed approaches, we evaluate the prediction/inference accuracy for each of the three algorithms on held-out test data. Specifically, we randomly take out some public attributes (ground truth), then run our algorithms to reconstruct these values assuming that they are hidden. For the Duke dataset, we use ground truth from the online phonebook when available. For Task 1, if $M_0$ is the set of binary attributes on which we run attribute inference, the average prediction accuracy $A_0$ is computed as follows:

$$A_0 = 1 - \frac{1}{|M_0|} \sum_{m \in M_0} |x_{0m} - f(x_{0m})|. \qquad (12)$$

For Task 2, we compute the improvement defined as

$$\Delta B(m) = B'(m) - B(m), \qquad (13)$$

$$B(m) = 1 - \frac{1}{|\Omega_m|} \sum_{i \in \Omega_m} |x_{im} - f(x_{im})|, \qquad (14)$$

$$B'(m) = 1 - \frac{1}{|\Omega'_m|} \sum_{i \in \Omega'_m} |x_{im} - f(x_{im} \mid x_{0m})|, \qquad (15)$$

where $B(m)$ is the fraction of correctly predicted instances without the knowledge of $x_{0m}$, and $B'(m) = B(m \mid x_{0m})$ is the same metric computed after $x_{0m}$ is publicized.
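
Both evaluation metrics reduce to one minus a mean absolute deviation, which the following Python sketch illustrates. The helper name `mean_accuracy` and all numbers are hypothetical, and for simplicity the sketch evaluates the friends' attribute on the same set of friends before and after Alice publicizes her value.

```python
def mean_accuracy(truth, estimate):
    """One minus the mean absolute deviation over the keys of `truth`."""
    if not truth:
        raise ValueError("need at least one held-out value")
    return 1.0 - sum(abs(x - estimate[k]) for k, x in truth.items()) / len(truth)


# Task 1: hypothetical held-out values of Alice's attributes and their estimates.
alice_truth = {"mandarin": 1, "duke": 1, "acme": 0}
alice_estimate = {"mandarin": 0.75, "duke": 0.5, "acme": 0.25}
a0 = mean_accuracy(alice_truth, alice_estimate)

# Task 2: hypothetical friends' values of one attribute, estimated before and
# after Alice publicizes her own value of that attribute.
friend_truth = {"bob": 1, "carol": 0}
before = {"bob": 0.5, "carol": 0.5}
after = {"bob": 0.75, "carol": 0.25}
delta_b = mean_accuracy(friend_truth, after) - mean_accuracy(friend_truth, before)

print(a0, delta_b)
```

A positive improvement indicates that publicizing Alice's value makes her friends' values easier to infer.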

Figure 6: Scatter plot of degree of the user versus the number of friends with attribute for the Duke Facebook dataset

5.3 Algorithms

We evaluate the following algorithms:

Det: The deterministic method
Log: Logistic regression based inference
Mat: Matrix factorization using INLA
Maj: Majority vote, for baseline
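
To contrast the simplest of these methods with the baseline, the sketch below places a Det-style estimate next to a majority vote. It is illustrative only: the exact form of the majority-vote baseline is not spelled out in this paper, so the tie-breaking rule (returning 0.5) and the unweighted vote are assumptions, as are the function names and the toy inputs.

```python
import math


def sigmoid(y):
    """Sigmoid map onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def det_estimate(friend_values, friend_importance):
    """Det-style estimate: sigmoid of the importance-weighted vote for 1 minus that for 0."""
    w1 = sum(friend_importance[t] for t, x in friend_values.items() if x == 1)
    w0 = sum(friend_importance[t] for t, x in friend_values.items() if x == 0)
    return sigmoid(w1 - w0)


def maj_estimate(friend_values):
    """Assumed majority-vote baseline: unweighted vote over friends; ties give 0.5."""
    ones = sum(1 for x in friend_values.values() if x == 1)
    zeros = sum(1 for x in friend_values.values() if x == 0)
    if ones == zeros:
        return 0.5
    return 1.0 if ones > zeros else 0.0


# Hypothetical inputs: one important friend has value 1, two minor friends have value 0.
friend_values = {"bob": 1, "carol": 0, "dave": 0}
friend_importance = {"bob": 3.0, "carol": 0.5, "dave": 0.5}

print(det_estimate(friend_values, friend_importance))
print(maj_estimate(friend_values))
```

In this toy case the two methods disagree: the weighted estimate follows the single important friend, whereas the unweighted vote follows the two less important ones.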

5.4 Results

We now present our evaluation results. Table 2 shows the average prediction accuracy of inferring Alice's hidden attributes. Higher accuracy means that the estimation is in general accurate. In each of the datasets, the prediction accuracy is averaged over 20 different users (acting as Alice). We can see that all algorithms have about the same performance on all the datasets. Table 3 shows the improvement in prediction accuracy after publicizing a given hidden attribute $x_{0m}$ for Alice.

Method    Google+          UCI FB           Duke FB
Det       .6844 ± .1068    .7490 ± .1233    .7511 ± .0965
Log       .7635 ± .0788    .6812 ± .1381    .7186 ± .0611
Mat       .8073 ± .0917    .7249 ± .1192    .7401 ± .0824
Maj       .5082 ± .1385    .5201 ± .1305    .6257 ± .0717

Table 2: Average prediction accuracy

Method    Google+          UCI FB           Duke FB
Det       .0217 ± .0079    .0091 ± .0038    .0419 ± .0064
Log       .0225 ± .0062    .0057 ± .0027    .0327 ± .0071
Mat       .0334 ± .0093    .0048 ± .0016    .0648 ± .0108
Maj       .0119 ± .0027    .0021 ± .0035    .0196 ± .0163

Table 3: Improvement induced by making $x_{0m}$ public

We demonstrate our results in greater detail in Figures 3 and 4. Figure 3 plots on the x-axis the inference error $\epsilon$, and on the y-axis the fraction of users with inference errors less than the threshold $\epsilon$ for each of the three datasets used in Task 1. We see, for instance, that for about 20% of the users the inference error is less than 20%, and for more than 75% of the users the inference error is less than 50% (that is, we can do better than random guessing for more than 75% of the users). To further investigate our algorithms, we also plotted the inference error versus the degree of the user for Task 1 (Figure 5). We can see that as the degree of the user increases, the inference error also increases. This can be explained by the fact that users with higher degree tend to have friends who are more diverse, and thus inferring their sensitive attribute is harder using our algorithms. Studying whether this result holds fundamentally for all inference algorithms is an interesting direction for future work.
In Figure 6, we plot the fraction of neighbors with an attribute value against the degree of users. As the number of friends increases, a diverse set of attribute values is observed in Alice's neighborhood. Hence, the prevalence of the target attribute value decreases, and attribute inference could give higher errors for high-degree nodes.
Figure 4 plots the fraction of users that experience accuracy improvement of at least $\epsilon$ after Alice publicizes her hidden attribute $x_{0m}$.
In summary, our results indicate that even the local network can give a reasonable estimate of hidden attributes: information content in social networks is densely clustered.

Figure 3: Fraction of users with inference errors less than threshold $\epsilon$ for each of the three datasets used in Task 1

Figure 4: Fraction of users that experience accuracy improvement of at least $\epsilon$ after Alice publicizes her hidden attribute $x_{0m}$.
6. CONCLUSION

Social networks are vulnerable to privacy attacks. Since users publish different parts of their profiles, adversaries are capable of inferring their sensitive attributes by exploiting the link structure of the social network. Since sharing even a small, seemingly benign chunk of personal information may be detrimental to privacy, it is important to analyze the risk of publicizing hidden attributes.
Though there has been recent interest in analyzing the privacy risks in social networks, the current trend seems to be toward the use of large networks. However, for an end-user with access to only his/her one-hop neighbors, using such global networks is impractical. Thus we proposed three ways of making the best use of the information locally available to individual end-users.
Throughout the paper, we answered two questions for an end-user Alice:

Task 1: determine whether her secret can be guessed from her public information.
Task 2: determine whether publicizing her hidden attribute would breach her friends' privacy.

We presented three novel schemes to answer the above two questions. While the proposed framework is not able to prove that Alice's profile is free of attribute inference, our results indicate that in some cases, even the local network can give a reasonable estimate of hidden attributes, and thus can be used to warn individuals of such privacy breaches.

7. REFERENCES

[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211-230, 2001.
[2] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, Mar. 2003.
[4] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in facebook: a case study of unbiased sampling of osns. In INFOCOM, 2010.
[5] N. Z. Gong, A. Talwalkar, L. W. Mackey, L. Huang, E. C. R. Shin, E. Stefanov, E. Shi, and D. Song. Predicting links and inferring attributes using a social-attribute network (san). CoRR, abs/1112.3265, 2011.
[6] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. In WPES, 2005.
[7] J. He, W. W. Chu, and Z. V. Liu. Inferring privacy information from social networks. In ISI, 2006.
[8] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, Aug. 2009.
[9] J. Lindamood, R. Heatherly, M. Kantarcioglu, and B. Thuraisingham. Inferring private information using social network data. In WWW, 2009.
[10] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J. Royal Stat. Soc., Series B, 2009.
[11] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative filtering. In ICML, 2007.
[12] Z. Yin, M. Gupta, T. Weninger, and J. Han. Linkrec: a unified framework for link recommendation with user attributes and graph structure. In WWW, 2010.
[13] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In WWW, 2009.
