
Detecting Data Leakage

Panagiotis Papadimitriou
papadimitriou@stanford.edu
Hector Garcia-Molina
hector@cs.stanford.edu

Leakage Problem
Stanford Infolab 2
[Figure: profiles of Kathryn, Jeremy, Sarah and Mark (e.g. Name: Mark, Sex: Male; Name: Sarah, Sex: Female), applications App. U_1 and App. U_2, and other sources, e.g. Sarah's network.]
Outline
Problem Description
Guilt Models
Pr{U_1 leaked data} = 0.7
Pr{U_2 leaked data} = 0.2
Distribution Strategies
Problem Entities
Entity                                 Dataset
Distributor: Facebook                  T: set of all Facebook profiles
Agents: Facebook apps U_1, ..., U_n    R_1, ..., R_n, where R_i is the set of profiles of people who have added application U_i
Leaker                                 S: set of leaked profiles
Agents' Data Requests
Sample: e.g. 100 profiles of Stanford people
Explicit: e.g. all people who added the application (the example used so far), or all Stanford profiles
Guilt Models (1/3)
[Figure: a leaked profile may come from other sources, e.g. Sarah's network, with probability p.]
p: posterior probability that a leaked profile comes from other sources
Guilty agent: an agent who leaks at least one profile
Pr{G_i|S}: probability that agent U_i is guilty, given the leaked set of profiles S
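
One way to make Pr{G_i|S} concrete is sketched below. The slide does not state a formula, so the sketch assumes a specific model: each leaked profile independently comes from other sources with probability p, and is otherwise equally likely to have been leaked by any agent that holds it. The function name `guilt_probability`, the agent names, and the allocations are hypothetical.

```python
def guilt_probability(agent, allocations, leaked, p):
    """Estimate Pr{G_i|S} for `agent` under the assumed independent-leak model.

    allocations: dict mapping agent name -> set of profiles it received (R_i)
    leaked:      set of leaked profiles (S)
    p:           probability that a leaked profile came from other sources
    """
    not_guilty = 1.0
    for profile in leaked & allocations[agent]:
        # Number of agents that hold this leaked profile.
        holders = sum(1 for r in allocations.values() if profile in r)
        # Assumption: this agent leaked the profile with probability (1 - p) / holders.
        not_guilty *= 1.0 - (1.0 - p) / holders
    return 1.0 - not_guilty


# Hypothetical example: two apps, two leaked profiles.
allocations = {"U1": {"Sarah", "Mark"}, "U2": {"Mark"}}
leaked = {"Sarah", "Mark"}
g1 = guilt_probability("U1", allocations, leaked, p=0.2)
g2 = guilt_probability("U2", allocations, leaked, p=0.2)
```

In this example U1 holds both leaked profiles and is the only holder of one of them, so its estimate is higher than U2's.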
Guilt Models (2/3)
Agents leak each of their data items independently
or
Agents leak all their data items OR nothing
[Figure: four leak scenarios with probabilities (1-p)^2, (1-p)p, p(1-p) and p^2.]
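
The four probabilities on this slide can be read as the outcomes for two leaked profiles under the independent model, where each profile comes from other sources with probability p. That reading is an assumption, as is the simplification that a single agent holds both profiles. The function name `scenario_probabilities` is hypothetical.

```python
from itertools import product


def scenario_probabilities(p, n_items=2):
    """Probability of every assignment of leaked items to 'agent' or 'other'."""
    probs = {}
    for sources in product(("agent", "other"), repeat=n_items):
        pr = 1.0
        for source in sources:
            # Each item independently comes from other sources with probability p.
            pr *= p if source == "other" else 1.0 - p
        probs[sources] = pr
    return probs


probs = scenario_probabilities(p=0.3)
# The agent is guilty unless both items came from other sources.
guilty = 1.0 - probs[("other", "other")]
```

Under these assumptions the four scenario probabilities sum to 1, and the agent is guilty in every scenario except the one with probability p^2.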
Guilt Models (3/3)
[Figure: two plots, "Independently" and "NOT Independently", each with axes Pr{G_1} and Pr{G_2}.]
The Distributor's Objective (1/2)
[Figure: agents U_1, ..., U_4 with their sets R_1, ..., R_4 and the leaked set S.]
Pr{G_1|S} >> Pr{G_2|S}
Pr{G_1|S} >> Pr{G_4|S}
The Distributor's Objective (2/2)
To achieve his objective, the distributor has to distribute sets R_1, ..., R_n that minimize

\[ \sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \qquad i, j = 1, \dots, n \]

Intuition: Minimizing data sharing among agents makes the leaked data reveal the guilty agents.
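
The objective on this slide can be evaluated directly for a candidate distribution. The sketch below assumes the objective is the sum, over agents, of each agent's total overlap with every other agent divided by the size of its own set. The function name `sum_objective` and the two allocations are hypothetical, and every R_i is assumed to be non-empty.

```python
def sum_objective(allocations):
    """Sum over agents i of (1 / |R_i|) * sum over j != i of |R_i intersect R_j|."""
    total = 0.0
    for i, r_i in allocations.items():
        # Total number of items agent i shares with each other agent.
        shared = sum(len(r_i & r_j) for j, r_j in allocations.items() if j != i)
        total += shared / len(r_i)
    return total


# Hypothetical allocations for two agents.
disjoint = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Sarah", "Mark"}}
overlapping = {"U1": {"Kathryn", "Jeremy"}, "U2": {"Jeremy", "Sarah"}}
score_disjoint = sum_objective(disjoint)
score_overlapping = sum_objective(overlapping)
```

The disjoint allocation scores lower than the overlapping one, which matches the intuition that less sharing makes the leaker easier to identify.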

Distribution Strategies: Sample (1/4)
Set T has four profiles: Kathryn, Jeremy, Sarah and Mark
There are 4 agents: U_1, U_2, U_3 and U_4
Each agent requests a sample of any 2 profiles of T for a market survey
Distribution Strategies: Sample (2/4)
Poor
Minimize \[ \sum_{i \neq j} |R_i \cap R_j| \]
[Figure: two example distributions of profiles to agents U_1, ..., U_4.]
Distribution Strategies: Sample (3/4)
Optimal Distribution
Avoid full overlaps and minimize
\[ \sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j| \]
[Figure: example distribution of profiles to agents U_1, ..., U_4.]
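
To see why full overlaps matter beyond the overlap total, the sketch below compares two hypothetical distributions for the four-profile, four-agent example. These allocations are illustrative and are not necessarily the ones drawn on the slides. The helper names `total_overlap` and `full_overlap_pairs` are assumptions.

```python
from itertools import combinations


def total_overlap(allocations):
    """Sum of |R_i intersect R_j| over all ordered pairs with i != j."""
    return sum(
        len(allocations[i] & allocations[j])
        for i in allocations
        for j in allocations
        if i != j
    )


def full_overlap_pairs(allocations):
    """Pairs of agents that received exactly the same set."""
    return [
        (i, j)
        for i, j in combinations(sorted(allocations), 2)
        if allocations[i] == allocations[j]
    ]


# Hypothetical distributions: each agent gets 2 of the 4 profiles.
with_full_overlaps = {
    "U1": {"Kathryn", "Jeremy"},
    "U2": {"Kathryn", "Jeremy"},
    "U3": {"Sarah", "Mark"},
    "U4": {"Sarah", "Mark"},
}
without_full_overlaps = {
    "U1": {"Kathryn", "Jeremy"},
    "U2": {"Sarah", "Mark"},
    "U3": {"Kathryn", "Sarah"},
    "U4": {"Jeremy", "Mark"},
}
```

Both distributions have the same total overlap, but only the first contains agents with identical sets. If such a set leaks, its holders cannot be told apart, which is why the slide asks to avoid full overlaps in addition to minimizing the objective.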
Distribution Strategies: Sample (4/4)
Distribution Strategies
Sample Data Requests
The distributor has the freedom to select the data items to provide the agents with.
General Idea: Provide agents with sets of data that are as disjoint as possible.
Problem: There are cases where the distributed data must overlap, e.g., |R_1| + ... + |R_n| > |T|.
Explicit Data Requests
The distributor must provide agents with the data they request.
General Idea: Add fake data to the distributed data to minimize the overlap of the distributed data.
Problem: Agents can collude and identify fake data.
NOT COVERED in this talk
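
For sample data requests, the idea of handing out sets that are as disjoint as possible can be sketched with a simple greedy rule. This is an illustration only, not the allocation algorithm from the talk. The function names `allocate_samples` and `overlap_is_unavoidable` are hypothetical, and each request is assumed to ask for no more items than T contains.

```python
def overlap_is_unavoidable(items, request_sizes):
    """True when the requested sizes add up to more than |T|."""
    return sum(request_sizes.values()) > len(items)


def allocate_samples(items, request_sizes):
    """Greedy sketch: give each agent the items currently held by the fewest agents."""
    held_by = {item: 0 for item in items}
    allocations = {}
    for agent, size in request_sizes.items():
        # sorted() is stable, so ties are broken by the order of `items`.
        chosen = sorted(items, key=lambda item: held_by[item])[:size]
        for item in chosen:
            held_by[item] += 1
        allocations[agent] = set(chosen)
    return allocations


profiles = ["Kathryn", "Jeremy", "Sarah", "Mark"]
two_agents = allocate_samples(profiles, {"U1": 2, "U2": 2})
four_requests = {"U1": 2, "U2": 2, "U3": 2, "U4": 2}
four_agents = allocate_samples(profiles, four_requests)
```

With two agents the sets come out disjoint. With four agents the requests exceed |T|, so overlap is unavoidable and the greedy rule only spreads it evenly. Because ties are broken by list order, this rule can still hand two agents identical sets, so it does not by itself avoid full overlaps.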
Conclusions
Data Leakage
Modeled as a maximum likelihood problem
Data distribution strategies that help identify the guilty agents
Thank You!
