for
Data Publishing
P. Rajesh, 210CS2001
& Individuals
• Publishing data driven by mutual benefits & regulations
• Demand for exchange & publication of data among various parties
Example:
• Netflix, a popular online movie rental service, recently published a data set containing movie ratings of 500,000 subscribers, in a drive to improve the accuracy of movie recommendations based on personal preferences [New York Times, 2006].
• Licensed hospitals in California are required to submit specific demographic data on every patient discharged from their facility [Carlisle et al. 2007].
about Individuals
• Such data publications violate individual privacy
• Published data is governed by policies, guidelines, & agreements, which can lead to excessive data distortion or insufficient protection
• A method is needed for publishing useful information while preserving data privacy; this is called Privacy-Preserving Data Publishing
• Privacy-preserving data publishing aims to protect the data while still providing useful information
• t-Closeness
• (n, t)-Closeness
, SSN
• Quasi-Identifier: potentially identifies the record owner (sex, age, zip code)
• Sensitive Attribute: person-specific information (disease, salary)
Concerning Disclosures
• Identity Disclosure: the attacker can identify a subject or respondent from the released data
• Attribute Disclosure: confidential information about a data subject is revealed
SSN   Name   Job        Sex      Age   Disease
101   Mr.A   Engineer   Male     35    Hepatitis
102   Mr.B   Engineer   Male     38    Hepatitis
209   Mr.C   Lawyer     Male     38    HIV
340   Mr.D   Writer     Female   30    Flu
789   Mr.E   Writer     Female   30    HIV
123   Mr.G   Dancer     Female   30    HIV
657   Mr.H   Dancer     Female   30    HIV
k-Anonymity
Let RT(A1, ..., An) be a table and QI_RT be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[QI_RT] appears with at least k occurrences in RT[QI_RT].
Drawbacks
• k-anonymity protects against identity disclosure but not attribute disclosure
• Homogeneity attack
• Background knowledge attack
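The k-anonymity definition above can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch, not part of the original slides; the function names and the dictionary-based record layout are assumptions made for illustration, and the rows mirror the example table with explicit identifiers removed.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifier):
    """Count how often each combination of quasi-identifier values occurs."""
    return Counter(tuple(r[a] for a in quasi_identifier) for r in records)

def is_k_anonymous(records, quasi_identifier, k):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = equivalence_class_sizes(records, quasi_identifier)
    return all(n >= k for n in sizes.values())

# Rows mirror the example table (hypothetical field names).
rows = [
    {"job": "Engineer", "sex": "Male", "age": 35, "disease": "Hepatitis"},
    {"job": "Engineer", "sex": "Male", "age": 38, "disease": "Hepatitis"},
    {"job": "Lawyer", "sex": "Male", "age": 38, "disease": "HIV"},
    {"job": "Writer", "sex": "Female", "age": 30, "disease": "Flu"},
    {"job": "Writer", "sex": "Female", "age": 30, "disease": "HIV"},
    {"job": "Dancer", "sex": "Female", "age": 30, "disease": "HIV"},
    {"job": "Dancer", "sex": "Female", "age": 30, "disease": "HIV"},
]

print(is_k_anonymous(rows, ("sex",), 3))               # True: 3 males, 4 females
print(is_k_anonymous(rows, ("job", "sex", "age"), 2))  # False: some combinations are unique
```

With only sex as the quasi-identifier the table is 3-anonymous, while the full combination of job, sex and age leaves several records unique, which is the situation that generalization is meant to remove.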
l-Diversity
An equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.
Drawbacks
against attribute disclosure
• Achieving l-diversity is difficult
• Insufficient to prevent attribute disclosure
• Similarity attack
• Skewness attack
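"Well-represented" can be interpreted in several ways. The sketch below assumes the simplest reading, distinct l-diversity, where an equivalence class needs at least l different sensitive values. The function name and record layout are hypothetical, and the rows mirror the example table.

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_identifier, sensitive):
    """Return the largest l such that every equivalence class has
    at least l distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifier)
        classes[key].add(r[sensitive])
    return min(len(values) for values in classes.values())

# (job, sex, disease) taken from the example table.
rows = [
    {"job": "Engineer", "sex": "Male", "disease": "Hepatitis"},
    {"job": "Engineer", "sex": "Male", "disease": "Hepatitis"},
    {"job": "Lawyer", "sex": "Male", "disease": "HIV"},
    {"job": "Writer", "sex": "Female", "disease": "Flu"},
    {"job": "Writer", "sex": "Female", "disease": "HIV"},
    {"job": "Dancer", "sex": "Female", "disease": "HIV"},
    {"job": "Dancer", "sex": "Female", "disease": "HIV"},
]

print(distinct_l_diversity(rows, ("sex",), "disease"))  # 2
print(distinct_l_diversity(rows, ("job",), "disease"))  # 1
```

Grouping by job gives l = 1 because both engineers have Hepatitis, which is exactly the homogeneity attack: knowing the job alone reveals the disease.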
t-Closeness
An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.
Drawbacks
• The distance between two probability distributions is measured using the earth mover's distance (EMD) method
• EMD does not satisfy the distance metric property
Probability scaling
• Loss of information, i.e., utility is lower
• The extension of t-closeness is (n, t)-closeness
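To make the t-closeness test concrete, the sketch below computes the distance between each equivalence class and the whole table for a categorical sensitive attribute. It assumes the equal-distance ground metric, under which EMD reduces to half the sum of absolute differences between the two distributions; other ground distances (for example, for numerical or hierarchical attributes) would need a different computation. All names are hypothetical and the rows mirror the example table.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def equal_distance_emd(p, q):
    """EMD under the equal ground distance: half the L1 distance."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def max_class_distance(records, quasi_identifier, sensitive):
    """Largest distance between any equivalence class and the whole table.
    The table has t-closeness exactly when this value is <= t."""
    overall = distribution([r[sensitive] for r in records])
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifier)].append(r[sensitive])
    return max(equal_distance_emd(distribution(v), overall)
               for v in classes.values())

rows = [
    {"sex": "Male", "disease": "Hepatitis"},
    {"sex": "Male", "disease": "Hepatitis"},
    {"sex": "Male", "disease": "HIV"},
    {"sex": "Female", "disease": "Flu"},
    {"sex": "Female", "disease": "HIV"},
    {"sex": "Female", "disease": "HIV"},
    {"sex": "Female", "disease": "HIV"},
]

d = max_class_distance(rows, ("sex",), "disease")
print(round(d, 3))  # 0.381, so the grouping has t-closeness only for t >= 0.381
```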
(n, t)-Closeness
• Extends t-closeness with a different distance metric
• t-closeness & (n, t)-closeness protect against attribute disclosure
Drawbacks
• Does not deal with identity disclosure
• It is desirable to use (n, t)-closeness together with k-anonymity
• Possibility of
  • Homogeneity
  • Background knowledge
  • Similarity attacks
closeness metric
• Compromise with privacy or utility
• The trade-off between privacy and utility should be good
• Preserving data using the t-closeness metric
• To provide t-closeness, the distance between two probability distributions is measured
• The distance measure links privacy and utility
• Deploying another distance metric may give a more efficient model
Bhattacharyya distance metric
• Bhattacharyya satisfies all distance metric properties
• Formula:
for Discrete Attribute for Continuous Attribute
p(x), q(x) are the probabilities of x in one equivalence class and in the whole table, respectively
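Since the formula is not reproduced in the text, the sketch below assumes the standard definitions for a discrete attribute: the Bhattacharyya coefficient BC(p, q) = sum over x of sqrt(p(x) q(x)), the distance -ln BC(p, q), and the related form sqrt(1 - BC(p, q)). Which of the two forms the presentation used is not stated, so both are shown; the function names are hypothetical. For a continuous attribute the sum would be replaced by an integral over the densities.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Sum of sqrt(p(x) * q(x)) over all values x (1.0 means identical)."""
    keys = set(p) | set(q)
    return sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in keys)

def bhattacharyya_distance(p, q):
    """-ln of the coefficient; infinite when the supports do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return -math.log(bc) if bc > 0 else math.inf

def bhattacharyya_sqrt_form(p, q):
    """sqrt(1 - coefficient), a bounded variant that lies in [0, 1]."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

# p: disease distribution in one equivalence class; q: in the whole table.
p = {"Hepatitis": 2 / 3, "HIV": 1 / 3}
q = {"Hepatitis": 2 / 7, "HIV": 4 / 7, "Flu": 1 / 7}
print(bhattacharyya_distance(p, q), bhattacharyya_sqrt_form(p, q))
```

Either value can then be compared against the threshold t in the same way as the EMD-based distance.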
Bhattacharyya distance method and (n, t) distance method
Experimental Results
• D1: distance value using the Bhattacharyya method
• D2: distance value using the (n, t) method
method
• Anonymization hides the confidential information of individuals in the anonymized data.
• The solution is a Data Reconstruction Approach, i.e., Z = X + Y
where n is the number of records
• Y is the noise data (y1, y2, ..., yk), where yi is the ith group median and k is the number of discretized groups
• Z is (z1, z2, z3, ..., zn)
Computing Noise
Original table

ID   Age   Marital Status   BP        Test Result
1    23    Never married    75/120    Negative
2    23    Never married    66/113    Positive
3    27    Never married    74/115    Negative
4    28    Never married    77/128    Negative
5    32    Married          72/125    Negative
6    33    Married          93/147    Negative
7    35    Married          75/124    Positive
8    37    Divorced         95/142    Negative
9    40    Widow(er)        88/146    Positive
10   43    Divorced         110/155   Positive
11   45    Married          90/140    Positive
12   45    Divorced         104/145   Positive

The Age attribute values are divided into two groups and each value is masked by its group median: age <= 34 with median = 28, and age > 34 with median = 40.

Anonymized table

ID   Age   Marital Status   BP        Test Result
1    28    Never married    75/120    Negative
2    28    Never married    66/113    Positive
3    28    Never married    74/115    Negative
4    28    Never married    77/128    Negative
5    28    Married          72/125    Negative
6    28    Married          93/147    Negative
7    40    Married          75/124    Positive
8    40    Divorced         95/142    Negative
9    40    Widow(er)        88/146    Positive
10   40    Divorced         110/155   Positive
11   40    Married          90/140    Positive
12   40    Divorced         104/145   Positive
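The masking step for the Age column can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's implementation: the function name and the cut-point parameter are hypothetical. Python's statistics.median averages the two middle values of an even-sized group, so it yields 27.5 and 41.5 for these data, whereas the presentation reports 28 and 40, presumably from a different median convention or rounding.

```python
from statistics import median

def mask_with_group_median(values, cut_points):
    """Discretize values into groups at the given cut points
    (value <= cut goes into that group) and replace every value
    by the median of its group."""
    def group_of(v):
        for g, cut in enumerate(cut_points):
            if v <= cut:
                return g
        return len(cut_points)

    groups = {}
    for v in values:
        groups.setdefault(group_of(v), []).append(v)
    medians = {g: median(members) for g, members in groups.items()}
    return [medians[group_of(v)] for v in values]

ages = [23, 23, 27, 28, 32, 33, 35, 37, 40, 43, 45, 45]
masked = mask_with_group_median(ages, cut_points=[34])
print(masked)  # six values of 27.5 followed by six values of 41.5
```

Swapping in statistics.median_low or statistics.median_high would keep the masked value equal to an age that actually occurs in the group.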
Experimental Results
Swapping
• Swapping with second-order frequency distribution
• Algorithm
Input: the set of all frequency tables Ft, 0 <= t <= 2
Output: an N x V database D consistent with Ft
For i = 1 to N Do
    Choose(i, 1, f0(1));
    For j = 2 to V Do
        p2 = 0;
        For k = 1 to j-1 Do
            p1 = f1(i, k, j);
            p2 = f2(p1, p2);
        End
        Choose(i, j, f3(p2, j));
    End
End
Algorithm terms
• Choose(i, j, p) sets Dij = 0 with probability p, and to 1 otherwise.
• There are obvious choices for the selection of f0, f1, f2, and f3. Notably,
  f1(i, j, k) = ... if F1[vj = Di,j] ..., ... otherwise
• f2(p1, p2) = p1 + p2
• f3(p2, n) = p2 / (n - 1)
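The generation loop can be sketched in Python for a binary database. The choices of f0 and f1 below are assumptions made for illustration, since their exact definitions are not fully legible here: f0(j) is taken as the fraction of records with attribute j equal to 0, and f1 as the fraction of records with attribute j equal to 0 among those that share the already generated value of attribute k (falling back to f0 when no such record exists). f2 and f3 follow the definitions listed above, with 0-based column indices.

```python
import random

def frequency_tables(data):
    """First- and second-order counts of a binary database."""
    first, second = {}, {}
    for row in data:
        for j, b in enumerate(row):
            first[(j, b)] = first.get((j, b), 0) + 1
            for k in range(j):
                key = (k, row[k], j, b)
                second[key] = second.get(key, 0) + 1
    return first, second

def generate_database(data, rng):
    """Generate a new N x V binary database whose low-order statistics
    approximate those of `data` (sketch with assumed f0 and f1)."""
    n_rows, n_cols = len(data), len(data[0])
    first, second = frequency_tables(data)

    def f0(j):
        return first.get((j, 0), 0) / n_rows

    def f1(row, k, j):
        denom = first.get((k, row[k]), 0)
        if denom == 0:
            return f0(j)  # assumed fallback
        return second.get((k, row[k], j, 0), 0) / denom

    def choose(row, p):
        # Append 0 with probability p, and 1 otherwise.
        row.append(0 if rng.random() < p else 1)

    out = []
    for _ in range(n_rows):
        row = []
        choose(row, f0(0))
        for j in range(1, n_cols):
            p2 = 0.0
            for k in range(j):
                p2 = p2 + f1(row, k, j)  # f2(p1, p2) = p1 + p2
            choose(row, p2 / j)          # f3: average over the j earlier columns
        out.append(row)
    return out

source = [[0, 0, 1], [0, 1, 1], [0, 1, 1], [0, 0, 1]]
print(generate_database(source, random.Random(0)))
```

The output only matches the frequency tables in expectation; an exact match would need an additional adjustment step that this sketch does not attempt.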
Future Work
• Design a clustering-based Efficient Closeness algorithm with overlapping equivalence classes.
• Apply Efficient Closeness in social networking to prevent the neighbourhood attack.
Conclusion
• The trade-off between privacy & utility should be good when publishing data, considering identity disclosure and attribute disclosure, which can be addressed by a closeness metric using the Bhattacharyya distance method together with the data reconstruction approach.
• Existing privacy models were unable to maintain good utility, while our approach achieved it and therefore became the Efficient Closeness Privacy Metric for Publishing Data.
References
Fung et al., Privacy-Preserving Data Publishing: A Survey of Recent Developments, ACM Computing Surveys, Vol. 42(4), June 2010.
Charu C. Aggarwal, Jian Pei, Bo Zhang, On privacy preservation against adversarial data mining, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA, 2006.
Charu C. Aggarwal, Philips, Privacy preserving data mining: models and algorithms, Springer, Communications in Computer and Information Science, 6th international joint conference, July 2009.
L. Sweeney, k-anonymity: a model for protecting privacy, International Journal on Uncertainty,
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M., l-diversity. In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE), 2007.
publishing, IEEE, Knowledge and Data Engineering, Vol. 22(7), July 2007, pp. 943-956.
Reiss, S. P., Practical data-swapping: The first steps. ACM Trans. Datab. Syst. 9, 1, 20-37, 1984.
K. G. Derpanis, The Bhattacharyya Measure, Mendeley, Computer, Volume 1, Issue 4, Pages 1990-1992, 2008.
Road Map
Period: June - September, September - October, October - November
Activity:
• Literature survey and learning the R programming language
• Proposal of model, theoretical computations
• Practical implementations of Bhattacharyya, randomization, comparison of distance methods
• Swapping with 2nd-order distribution implementation
• Evaluation of methods using various parameters
• Thesis report
• Review of thesis
Thank you!
Seminar Presentation under the Guidance of