
Efficient Closeness Privacy Metric for Data Publishing
P. Rajesh, 210CS2001

Need Of Privacy with Utility


• Governments, corporations, and individuals collect digital information.
• Data publishing is driven by mutual benefits and regulations.
• There is demand for exchange and publication of data among various parties.

Example:
  - Netflix, a popular online movie rental service, recently published a data set containing movie ratings of 500,000 subscribers, in a drive to improve the accuracy of movie recommendations based on personal preferences [New York Times, 2006].
  - Licensed hospitals in California are required to submit specific demographic data on every patient discharged from their facility [Carlisle et al. 2007].

Need Of Privacy with Utility


• Published data (original data) contains sensitive information about individuals.
• Such data publications violate individual privacy.
• Published data is framed by policies, guidelines, and agreements.
• These lead to excessive data distortion or insufficient protection.
• A method is needed for publishing useful information while preserving data privacy; this is called Privacy-Preserving Data Publishing.
• Privacy-preserving data publishing protects the data while still providing useful information.

Ladder of Privacy Models

• k-Anonymity
• l-Diversity
• t-Closeness
• (n, t)-Closeness

Basic words in Tables


SSN  Name  Job       Sex     Age  Disease
101  Mr.A  Engineer  Male    35   Hepatitis
102  Mr.B  Engineer  Male    38   Hepatitis
209  Mr.C  Lawyer    Male    38   HIV
340  Mr.D  Writer    Female  30   Flu
789  Mr.E  Writer    Female  30   HIV
123  Mr.G  Dancer    Female  30   HIV
657  Mr.H  Dancer    Female  30   HIV

• Explicit Identifier: explicitly identifies the record owner, e.g. Name, SSN
• Quasi-Identifier: potentially identifies the record owner, e.g. sex, age, zip code
• Sensitive Attribute: person-specific information, e.g. Disease, Salary

Concerning Disclosures
• Identity Disclosure
  An attacker can identify a subject or respondent from the released data.
• Attribute Disclosure
  Confidential information about a data subject is revealed.

k-Anonymity
Let RT(A1, ..., An) be a table and QI_RT be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[QI_RT] appears with at least k occurrences in RT[QI_RT].

Drawbacks
• k-anonymity protects against identity disclosure but not attribute disclosure.
• Homogeneity attack
• Background knowledge attack
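The k-anonymity definition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch, assuming the example table from the "Basic words in Tables" slide with explicit identifiers removed and (sex, age) taken as the quasi-identifier; the function names are illustrative and not part of the presentation.

```python
from collections import Counter

# Rows from the example table: (job, sex, age, disease).
ROWS = [
    ("Engineer", "Male", 35, "Hepatitis"),
    ("Engineer", "Male", 38, "Hepatitis"),
    ("Lawyer", "Male", 38, "HIV"),
    ("Writer", "Female", 30, "Flu"),
    ("Writer", "Female", 30, "HIV"),
    ("Dancer", "Female", 30, "HIV"),
    ("Dancer", "Female", 30, "HIV"),
]


def anonymity_level(rows, qi_indices):
    """Return the largest k for which the rows are k-anonymous.

    qi_indices selects the columns assumed to form the quasi-identifier.
    """
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return min(counts.values())


def is_k_anonymous(rows, qi_indices, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return anonymity_level(rows, qi_indices) >= k


SEX_AGE = (1, 2)  # assumed quasi-identifier: sex and age
print(anonymity_level(ROWS, SEX_AGE))    # (Male, 35) occurs once, so k is 1
print(is_k_anonymous(ROWS, SEX_AGE, 2))  # False
```

Because the combination (Male, 35) is unique, the raw table is only 1-anonymous on (sex, age); generalizing age would be needed to raise k.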

l-Diversity
An equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.

Drawbacks
• The l-diversity principle goes beyond k-anonymity in protecting against attribute disclosure.
• Achieving it is difficult.
• It is insufficient to prevent attribute disclosure.
• Similarity attack
• Skewness attack
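As an illustration of the l-diversity definition, the sketch below counts distinct sensitive values per equivalence class. It assumes the simplest reading of "well-represented", namely distinct values; other readings (such as entropy-based ones) exist and are not shown. The data is the example table from the earlier slide and the function name is illustrative.

```python
from collections import defaultdict

# Rows from the example table: (sex, age, disease).
ROWS = [
    ("Male", 35, "Hepatitis"),
    ("Male", 38, "Hepatitis"),
    ("Male", 38, "HIV"),
    ("Female", 30, "Flu"),
    ("Female", 30, "HIV"),
    ("Female", 30, "HIV"),
    ("Female", 30, "HIV"),
]


def distinct_diversity(rows, qi_indices, sensitive_index):
    """Return the largest l such that every equivalence class holds
    at least l distinct sensitive values (distinct l-diversity)."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in qi_indices)
        classes[key].add(row[sensitive_index])
    return min(len(values) for values in classes.values())


print(distinct_diversity(ROWS, (0, 1), 2))  # classes on (sex, age)
print(distinct_diversity(ROWS, (0,), 2))    # classes on sex only
```

Grouping by sex alone gives two distinct diseases per class, yet three of the four records in the female class share HIV. Counting distinct values does not capture that kind of skew.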

t-Closeness
An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

Drawbacks
• The distance between two probability distributions is measured using the Earth Mover's Distance (EMD) method.
• EMD does not satisfy the distance metric property of probability scaling.
• Loss of information, i.e. utility is lower.
• The extension of t-closeness is (n, t)-closeness.
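The t-closeness definition can be sketched as a check over equivalence classes. The example below assumes a categorical sensitive attribute with equal ground distance between values, in which case the Earth Mover's Distance reduces to half the L1 distance between the two distributions. The distance function is passed as a parameter so that another measure can be substituted; all names are illustrative.

```python
from collections import Counter, defaultdict

# Rows from the example table: (sex, age, disease).
ROWS = [
    ("Male", 35, "Hepatitis"),
    ("Male", 38, "Hepatitis"),
    ("Male", 38, "HIV"),
    ("Female", 30, "Flu"),
    ("Female", 30, "HIV"),
    ("Female", 30, "HIV"),
    ("Female", 30, "HIV"),
]


def distribution(values):
    """Empirical probability of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def equal_distance_emd(p, q):
    """EMD for categorical values with equal ground distance:
    half the sum of absolute probability differences."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def closeness_level(rows, qi_indices, sensitive_index,
                    distance=equal_distance_emd):
    """Smallest t for which the table satisfies t-closeness."""
    overall = distribution([row[sensitive_index] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in qi_indices)
        classes[key].append(row[sensitive_index])
    return max(distance(distribution(values), overall)
               for values in classes.values())


t = closeness_level(ROWS, (0, 1), 2)
print(round(t, 4))  # largest class-to-table distance on (sex, age)
```

A table satisfies t-closeness for a chosen threshold exactly when `closeness_level` returns a value no greater than that threshold.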

(n, t)-Closeness
• Extends t-closeness with a different distance metric.
• t-closeness and (n, t)-closeness protect against attribute disclosure.

Drawbacks
• Does not deal with identity disclosure.
• It is desirable to use (n, t)-closeness together with k-anonymity.
• Possibility of:
  - Homogeneity attacks
  - Background knowledge attacks
  - Similarity attacks

Need of New Model


• Privacy is important in publishing data.
• t-closeness achieves privacy.
• (n, t)-closeness also achieves privacy by maintaining the same t-closeness metric.
• Both compromise on either privacy or utility.
• The trade-off between privacy and utility should be good.
• Data is preserved using the t-closeness metric.
• To provide t-closeness, the distance between two probability distributions is measured.
• The distance measure determines the relationship between privacy and utility.
• Deploying another distance metric may give a chance of a more efficient model.

Design of New Model

Bhattacharyya Distance Measure

• The distance between two probability distributions is measured using the Bhattacharyya distance metric.
• The Bhattacharyya measure satisfies all distance metric properties.
• Formula: one form for a discrete attribute and one for a continuous attribute, where p(x) and q(x) are the probabilities of x in one equivalence class and in the whole table, respectively.
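For reference, the Bhattacharyya coefficient of two distributions is commonly defined as BC(p, q) = Σ_x sqrt(p(x) q(x)) for a discrete attribute and BC(p, q) = ∫ sqrt(p(x) q(x)) dx for a continuous one. Two distances are usually derived from it: D_B = -ln BC(p, q), and the Hellinger form sqrt(1 - BC(p, q)), which is the variant that obeys the triangle inequality. Which form the presentation uses is an assumption here, so the sketch below computes both for the discrete case. The example distributions come from the table on the earlier slide, with the class (Female, 30) compared against the whole table.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Sum over x of sqrt(p(x) * q(x)) for discrete distributions
    given as dicts mapping value -> probability."""
    support = set(p) | set(q)
    return sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)


def bhattacharyya_distance(p, q):
    """-ln(BC); infinite when the distributions do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)


def hellinger_distance(p, q):
    """sqrt(1 - BC); clamped at 0 to absorb floating-point rounding."""
    bc = bhattacharyya_coefficient(p, q)
    return math.sqrt(max(0.0, 1.0 - bc))


# p: disease distribution in the equivalence class (Female, 30).
# q: disease distribution in the whole example table.
p = {"Flu": 1 / 4, "HIV": 3 / 4}
q = {"Hepatitis": 2 / 7, "HIV": 4 / 7, "Flu": 1 / 7}

print(bhattacharyya_coefficient(p, q))
print(bhattacharyya_distance(p, q))
print(hellinger_distance(p, q))
```

Either function can be passed as the distance in a closeness check, since both are zero for identical distributions and grow as the class distribution drifts from the table distribution.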

Bhattacharyya vs. (n, t) Distance Measure

• The distance between two probability distributions was computed using the Bhattacharyya distance method and the (n, t) distance method.

Experimental Results
• D1: distance value using the Bhattacharyya method
• D2: distance value using the (n, t) method

Variations with (n,t) closeness Distance


• The Bhattacharyya distance is comparatively close to the (n, t) distance method.
• Anonymization hides information.

Experimental Results
• Anonymization is inverse to utility.
• The amount of anonymization is lower.
• The loss of information is very small.
• Utility is better than with (n, t)-closeness.

Dealing With Identity Disclosure


• Identity disclosure is one of the serious privacy concerns.
• k-anonymity is a well-known model for identity disclosure.
• The problem with k-anonymity is that an intruder can learn confidential information about individuals in the anonymized data.
• The solution is the Data Reconstruction Approach.

Data Reconstruction Approach


• Two main steps:
  1) Randomization with discretized median
  2) Swapping with second-order distribution

Randomization with Discretized Median


• Randomization means adding noise to column data.
• X is the original data.
• Y is the noise data.
• Z is the randomized data, i.e. Z = X + Y.
• The original table has attribute X with values (x1, x2, x3, ..., xn), where n is the number of records.
• Y is the noise data with values (y1, y2, ..., yk), where yi is the i-th group median and k is the number of discretized groups.
• Z is (z1, z2, z3, ..., zn).

Computing Noise

Original table (Age) and anonymized table (Masked Age); the other columns are the same in both:

ID  Age  Masked Age  Marital Status  BP       Blood Type  Test Result
1   23   28          Never married   75/120   O           Negative
2   23   28          Never married   66/113   O           Positive
3   27   28          Never married   74/115   A           Negative
4   28   28          Never married   77/128   AB          Negative
5   32   28          Married         72/125   B           Negative
6   33   28          Married         93/147   O           Negative
7   35   40          Married         75/124   AB          Positive
8   37   40          Divorced        95/142   O           Negative
9   40   40          Widow(er)       88/146   A           Positive
10  43   40          Divorced        110/155  O           Positive
11  45   40          Married         90/140   O           Positive
12  45   40          Divorced        104/145  B           Positive

Age attribute values are divided into two groups, and each value is masked by its group median:
• age <= 34, with median = 28
• age > 34, with median = 40
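The masking step in this example can be sketched as follows. This is a minimal illustration, assuming groups are defined by upper-bound cut points (here a single cut at 34) and that each value is replaced by the exact median of its group. The exact medians of the two age groups are 27.5 and 41.5; the slide's table shows the nearby integer values 28 and 40, and since the rounding rule is not stated, the sketch keeps the exact medians. The function name and `cut_points` parameter are illustrative.

```python
from statistics import median

# Age column of the original table.
AGES = [23, 23, 27, 28, 32, 33, 35, 37, 40, 43, 45, 45]


def mask_with_group_median(values, cut_points):
    """Replace each value by the median of its discretized group.

    cut_points are assumed to be ascending upper bounds: a value v
    belongs to the first group whose bound satisfies v <= bound, or
    to a final open-ended group if it exceeds every bound.
    """
    def group_of(v):
        for index, bound in enumerate(cut_points):
            if v <= bound:
                return index
        return len(cut_points)

    groups = {}
    for v in values:
        groups.setdefault(group_of(v), []).append(v)
    medians = {g: median(members) for g, members in groups.items()}
    return [medians[group_of(v)] for v in values]


masked = mask_with_group_median(AGES, cut_points=[34])
print(masked)
```

With a single cut point the column collapses to two distinct values, so every masked age is shared by six records.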

Experimental Results


Swapping
• Swapping with second-order frequency distribution
• Algorithm
  Input: the set of all frequency tables Ft, 0 <= t <= 2
  Output: an N x V database D consistent with Ft

  For i = 1 to N Do
      Choose(i, 1, f0(1));
      For j = 2 to V Do
          p2 = 0;
          For k = 1 to j-1 Do
              p1 = f1(i, k, j);
              p2 = f2(p1, p2);
          End
      End
  End

Algorithm terms
• Choose(i, j, p) sets Dij = 0 with probability p, and to 1 otherwise.
• There are obvious choices for the selection of f0, f1, f2, and f3. Notably:
  - f1(i, j, k) = ... if F1[vj = Di,j] ..., and ... otherwise
  - f2(p1, p2) = p1 + p2
  - f3(p2, n) = p2 / (n - 1)
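The generation loop can be sketched in Python for binary data. Several parts are assumptions rather than details taken from the slides: the slide's pseudocode does not show a Choose call for columns j >= 2, so the sketch assumes the inner loop is followed by Choose(i, j, f3(p2, j)); f1 is assumed to be the conditional frequency of v_j = 0 given the already generated value of v_k, falling back to the marginal frequency when that value was never observed; and f0(j) is assumed to be the marginal frequency of v_j = 0. The helper names `frequency_tables` and `generate` are illustrative.

```python
import random


def frequency_tables(data):
    """First- and second-order frequency tables of a binary dataset."""
    n_cols = len(data[0])
    first = {}   # (column, value) -> count
    second = {}  # (column_k, value_k, column_j, value_j) -> count
    for row in data:
        for k in range(n_cols):
            first[(k, row[k])] = first.get((k, row[k]), 0) + 1
            for j in range(n_cols):
                key = (k, row[k], j, row[j])
                second[key] = second.get(key, 0) + 1
    return first, second


def generate(data, n_rows, rng):
    """Generate n_rows binary records from the frequency tables of data."""
    n_total = len(data)
    n_cols = len(data[0])
    first, second = frequency_tables(data)

    def choose(p):
        # Choose: 0 with probability p, 1 otherwise.
        return 0 if rng.random() < p else 1

    def f0(j):
        # Assumed: marginal frequency of v_j = 0.
        return first.get((j, 0), 0) / n_total

    def f1(row, k, j):
        # Assumed: frequency of v_j = 0 among records with v_k = row[k].
        seen = first.get((k, row[k]), 0)
        if seen:
            return second.get((k, row[k], j, 0), 0) / seen
        return f0(j)

    result = []
    for _ in range(n_rows):
        row = [choose(f0(0))]
        for j in range(1, n_cols):
            p2 = 0.0
            for k in range(j):
                p2 = p2 + f1(row, k, j)  # f2(p1, p2) = p1 + p2
            row.append(choose(p2 / j))   # f3: average over the j earlier columns
        result.append(row)
    return result


ORIGINAL = [[0, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
for record in generate(ORIGINAL, 5, random.Random(0)):
    print(record)
```

Indices are 0-based here, so the divisor n - 1 from f3 becomes j, the number of columns already generated for the current record.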

Future Work
• Design a clustering-based Efficient Closeness algorithm with overlapping equivalence classes.
• Apply Efficient Closeness in social networking to prevent neighbourhood attacks.

Conclusion
• The trade-off between privacy and utility should be good when publishing data, considering identity disclosure and attribute disclosure; both can be addressed by a closeness metric using the Bhattacharyya distance method together with the Data Reconstruction Approach.
• Existing privacy models were unable to maintain good utility, while our approach achieved it, and it is therefore named the Efficient Closeness Privacy Metric for Publishing Data.


Road Map
Period                Activity
June - September      Literature survey and learning the R programming language
September - October   Proposal of model, theoretical computations
October - November    Practical implementation of Bhattacharyya distance, randomization, comparison of distance methods
December - January    Implementation of swapping with second-order distribution
January - February    Evaluation of methods using various parameters
February - March      Thesis report
March - April         Review of thesis

Thank you!
Seminar presentation under the guidance of Prof. K Satya Babu
