Privacy and Security Aspects of Data Mining

Proceedings of a Workshop held in Conjunction with 2005 IEEE International Conference on Data Mining
Houston, USA, November 27, 2005

Edited by

Stan Matwin
University of Ottawa (Canada)

LiWu Chang
Naval Research Laboratory (USA)

Rebecca N. Wright
Stevens Institute of Technology (USA)

Justin Zhan
University of Ottawa (Canada)

ISBN 0-9738918-9-0

Table of Contents

Foreword

What is Privacy? Critical Steps for Privacy-Preserving Data Mining
Chris Clifton (Purdue University)

An Adaptable Perturbation Model of Privacy Preserving Data Mining
Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu & Latifur Khan (University of Texas at Dallas)

A Robust Data-Obfuscation Approach for Privacy Preservation of Clustered Data
Rupa Parameswaran & Douglas Blough (Georgia Institute of Technology)

Implementing Privacy-Preserving Bayesian-Net Discovery for Vertically Partitioned Data
Onur Kardes (Stevens Institute of Technology), Raphael S. Ryger (Yale University), Rebecca N. Wright (Stevens Institute of Technology) & Joan Feigenbaum (Yale University)

Collaborative Recommendation Vulnerability to Focused Bias Injection Attacks
Robin Burke, Bamshad Mobasher, Runa Bhaumik & Chad Williams (DePaul University)

Secure K-Means Clustering Algorithm for Distributed Databases
Raj Bhatnagar, Ahmed Khedr & Amit Sinha (University of Cincinnati)

Generating Cryptographic Keys from Face Images While Preserving Biometric Secrecy
Alwyn Goh (Corentix Technologies), Yip Wai Kuan, David Ling & Andrew Jin (Multimedia University)

Foreword
Privacy and security of data mining has become an active research area in recent years. Broadly, it addresses how to utilize confidential data for data mining purposes without revealing the actual confidential data values to the data miners. The goal of this workshop is to bring together researchers who have studied different aspects of this topic in order to discuss issues of privacy and security in data mining, synergize different views of techniques and policies, and explore future research directions. This proceedings contains seven papers: one invited paper and six regular papers. Each regular paper received on average three critical reviews. Authors of accepted papers were invited to present them at the workshop.

We would like to thank the authors, our invited speaker Chris Clifton, the program committee, and the external reviewers for contributing to the success of this workshop. Finally, we would like to thank the ICDM workshop organizer, Pawan Lingras, for his overall help in the organization of PDSM 2005.

Workshop Co-Organizers:
Stan Matwin (University of Ottawa, Canada)
LiWu Chang (Naval Research Laboratory, USA)
Rebecca N. Wright (Stevens Institute of Technology, USA)
Justin Zhan (University of Ottawa, Canada)

Program Committee: Elisa Bertino (Purdue University), Chris Clifton (Purdue University), Ping Chen (University of Houston-Downtown), Steve Fienberg (Carnegie Mellon University), Tom Goldring (National Security Agency), Philippe Golle (Palo Alto Research Center), Sushil Jajodia (George Mason University), Helger Lipmaa (Cybernetica AS and University of Tartu, Estonia), Taneli Mielikäinen (University of Helsinki, Finland), Ira Moskowitz (Naval Research Laboratory), Kobbi Nissim (Ben-Gurion University, Israel), Jerry Reiter (Duke University), Pierangela Samarati (Università degli Studi di Milano, Italy), Aleksandra Slavkovic (Penn State), Jaideep Srivastava (University of Minnesota), Bhavani Thuraisingham (University of Texas at Dallas), Jaideep Vaidya (Rutgers University), Vassilis Verykios (University of Thessaly, Greece)

External Reviewers: Anya Kim (Naval Research Laboratory), Murat Kantarcioglu (University of Texas at Dallas)

Copyright of the cover photo of these proceedings: Photohome.com


What is Privacy? Critical Steps for Privacy-Preserving Data Mining


Chris Clifton
Purdue University, Department of Computer Science
250 North University Street, West Lafayette, Indiana 47907-2066, USA
clifton@cs.purdue.edu

Abstract
Privacy-preserving data mining has generated many research successes, but as yet little real-world impact. One problem is that we do not yet have accepted definitions of privacy (legal, social, or technical) that apply to privacy-preserving data mining. This paper discusses this issue and surveys work on the topic. In spite of this problem, there are real-world scenarios that can be addressed by today's technology; the paper concludes with a discussion of such areas and the research needed to make technology transfer happen.

In five short years, the research community has developed numerous technical solutions for privacy-preserving data mining. What path should the community follow to bring these solutions to adoption? What technical challenges must be solved before adoption? I claim we still face one key technical challenge: we do not yet have a coherent definition of privacy that satisfies both technical and societal concerns. In spite of this, we have an opportunity to begin technology transfer, moving solutions into practice in areas without hard privacy constraints. This will establish credibility for the technology, speeding adoption when we do have a solid definition of privacy.

With so many published papers in privacy-preserving data mining, how can I say we don't have a definition for privacy? The problem is that we have several, but none by themselves satisfy legal and societal norms. A dictionary definition of privacy that is relevant to data mining is "freedom from unauthorized intrusion" [16]. "Unauthorized" is easy to understand, which leaves us with "freedom from intrusion". What constitutes intrusion? To understand this question, let us first look at legal definitions of privacy. Most privacy laws (e.g., the European Community privacy guidelines [8] or the U.S. healthcare laws [9])

only apply to individually identifiable data. Combining "intrusion" and "individually identifiable" leads to a standard by which to judge privacy-preserving data mining: a privacy-preserving data mining technique must ensure that any information disclosed

1. cannot be traced to an individual, or
2. does not constitute an intrusion.

Formal definitions for both these items are an open challenge. We could assume that any data that does not give us completely accurate knowledge about a specific individual meets these criteria. This is unlikely to satisfy either privacy advocates or courts. At the other extreme, we could consider any improvement in our knowledge about an individual to be an intrusion. The latter is particularly likely to cause a problem for data mining, as the goal is to improve our knowledge. Even though the target is often groups of individuals, knowing more about a group does increase our knowledge about the individuals in the group. The answer, and the technical challenge, is measures for both the knowledge gained and our ability to relate it to a particular individual.

For our research community to truly have the impact we seek, we must develop legally and socially defensible measures of privacy. Our solutions must be proven to meet these measures, guaranteeing that information disclosed (including data mining results) does not reveal private information beyond the measures. Existing work is weak in this respect. Secure multiparty computation based approaches (those that follow the approach in Lindell and Pinkas's seminal paper [13, 14]) state what is and is not disclosed; typically, they state that only the data mining result is disclosed. This says nothing about the potential privacy impact of that result. Works based on randomization (as in Agrawal and Srikant's seminal paper [2]) have developed a plethora of measures, but none cleanly addresses both individual identifiability and intrusiveness.

In this paper/talk I review measures for both individual identifiability and knowledge gain. In the process, I point out shortcomings of those measures and try to identify promising research directions. We do not need such research to be completed and accepted by the privacy and legal community before moving ahead with technology transfer; Section 3 concludes with a discussion of viable application areas where privacy-preserving data mining can provide benefit today.

Table 1. Excerpt from a table of census data, U.S. Census Bureau
(Block Group 1, Census Tract 1, District of Columbia, District of Columbia)

Total:                    9
  Owner occupied:         3
    1-person household:   2
    2-person household:   1
    ...
  Renter occupied:        6
    1-person household:   3
    2-person household:   2
    ...
1 Individual Identifiability

The U.S. Healthcare Information Portability and Accountability Act (HIPAA) defines individually nonidentifiable data as data that "does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual" [10]. This requires showing that the risk of identifying an individual in disclosed data is very small. Note that the analysis must be based not only on the disclosed data, but also on other easily available information. For example, Sweeney demonstrated that (commonly disclosed) anonymous medical data could be linked with (publicly available) voter registration records on birth date, gender, and postal code to give a name and address for many of the medical records [20]. That the individual is not identifiable in the data alone is not sufficient; joining the data with other sources must not enable identification.

One proposal to address this problem is k-anonymity [19, 20]. K-anonymity alters identifying information so that identification is only to a group of k, not to an individual. A key concept is the notion of a quasi-identifier: information that can be used to link a record to an individual. With respect to the HIPAA definition, a quasi-identifier would be data that could link to reasonably available information. The HIPAA regulations actually give a list of presumed quasi-identifiers; if these items are removed, data is (legally) considered not to be individually identifiable. The definition of k-anonymity states that for any value of a quasi-identifier, there must be at least k records with the same quasi-identifier. This ensures that an attempt to identify an individual will result in at least k records that could apply to the individual. Assuming that the privacy-sensitive data (e.g., medical diagnoses) are not the same for all k records, this throws uncertainty into any knowledge about an individual. The uncertainty lowers the risk that the knowledge constitutes an intrusion.

The idea that knowledge that applies to a group rather than a specific individual does not violate privacy is legally defensible.

Census bureaus have used this approach as a means of protecting privacy. Census data is typically published as contingency tables: counts of individuals meeting a particular criterion (see Table 1). Aggregates that reflect a large enough number of households are not considered privacy sensitive. However, when cells list only a few individuals (as in Table 1), combining the data with other tables may reveal private information. For example, if we know that all owner-occupied two-person households have salaries over $40,000, and of the nine multiracial households only one has a salary over $40,000, we can determine that the single multiracial individual in an owner-occupied two-person household makes over $40,000. Since race and household size can often be observed, and home ownership status is publicly available (in most of the U.S.), this would result in disclosure of an individual salary.

Several methods are used to combat this. The data used to generate Table 1 uses introduction of noise; the Census Bureau warns that "statistical procedures have been applied that introduce some uncertainty into data for small geographic areas with small population groups." Other techniques include cell suppression, in which counts smaller than a threshold are not reported at all, and generalization, where cells with small counts are merged (e.g., changing Table 1 so that it doesn't distinguish between owner-occupied and renter-occupied housing). Generalization and suppression are also common techniques for k-anonymity.

This work gives us one metric that applies to privacy-preserving data mining: if we can demonstrate that disclosures from a technique (including the results) generalize to large enough groups of individuals, then the size of the group can be used as a metric for privacy protection. The size-of-group standard may be easily met for some techniques; e.g., pruning approaches for decision trees may already generalize outcomes that apply to only small groups, and association rule support counts provide a clear group size.
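A minimal sketch of cell suppression and generalization on a contingency table follows (Python; the counts, threshold, and coarsening rule are illustrative assumptions in the spirit of Table 1, not the Census Bureau's actual procedure):

```python
from collections import Counter

def suppress_cells(table, threshold):
    """Primary suppression: blank out any count below the threshold."""
    return {cell: (count if count >= threshold else None)  # None = suppressed
            for cell, count in table.items()}

def generalize(table, coarsen):
    """Merge cells under a coarsening function, e.g. dropping the tenure dimension."""
    merged = Counter()
    for cell, count in table.items():
        merged[coarsen(cell)] += count
    return dict(merged)

# Hypothetical counts in the spirit of Table 1.
counts = {("owner", "1-person"): 2, ("owner", "2-person"): 1,
          ("renter", "1-person"): 3, ("renter", "2-person"): 2}
print(suppress_cells(counts, threshold=3))
# Generalization: stop distinguishing owner- from renter-occupied housing.
print(generalize(counts, lambda cell: cell[1]))  # {'1-person': 5, '2-person': 3}
```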

There have been several other techniques developed by the official statistics (census research) community to mitigate the risk of individual identification. These include generalization (e.g., limiting geographic detail), top/bottom coding (e.g., reporting a salary only as "greater than $100,000"), and data swapping (taking two records and swapping their values for one attribute). These techniques introduce uncertainty into the data, thus limiting the confidence in attempts to identify an individual in the data. They have been used to create Public Use Microdata Sets: data sets that appear to be an actual sample of census data. Because the data is only a sample, and these techniques have been applied, a match with a real individual is unlikely. Even if an apparent match is found, it is likely that this match in the quasi-identifier was actually created from some other individual through the data perturbation techniques. Knowing this, an adversary trying to compromise privacy can have little confidence that the matching data really applies to the targeted individual.

Metrics for evaluating such techniques look at both privacy and the value of the data. Determining the value of the data is based on preservation of univariate and covariate statistics. Privacy is based on the percentage of individuals that a particularly well-equipped adversary could identify. The assumptions are that the adversary:

1. knows that some individuals are almost certainly in the sample (e.g., 600-1000 for a sample of 1500 individuals),
2. knows that the sample comes from a restricted set of individuals (e.g., 20,000),
3. has a good estimate (although with some uncertainty) of the non-sensitive values (quasi-identifiers) for the target individuals, and
4. has a reasonable estimate of the sensitive values (e.g., within 10%).

The metric is based on the number of individuals the adversary is able to correctly and confidently identify. In [17], identification rates of 13% are considered acceptably low. Note that this is an extremely well-informed adversary; in practice rates would be much lower. This experimental approach could be used to determine the ability of a well-informed adversary to identify individuals based on privacy-preserving data mining approaches. However, it is not amenable to a simple, one-size-fits-all standard: as demonstrated in [17], applying this approach demands considerable understanding of the particular domain and the privacy risks associated with that domain.

A metric presented in [6] tries to formalize this concept of distinguishability in a more general form than k-anonymity. The idea is that we should be unable to learn a classifier that distinguishes between individuals with high probability. The specific metric proposed was:

Definition 1 [6]. Two records that belong to different individuals $I_1, I_2$ are p-indistinguishable given data X if for every polynomial-time function $f : I \to \{0, 1\}$,

$$\left|\Pr\{f(I_1) = 1 \mid X\} - \Pr\{f(I_2) = 1 \mid X\}\right| \le p,$$

where $0 < p < 1$.

Note the similarity to k-anonymity. A related definition is given in [4], which defines isolation based on the ability of an adversary to single out an individual y in a dataset using a query q:

Definition 2 [4]. Let y be any RDB point, and let $\delta_y = \|q - y\|_2$. We say that q (c, t)-isolates y iff $B(q, c\delta_y)$ contains fewer than t points in the RDB, that is, $|B(q, c\delta_y) \cap \mathrm{RDB}| < t$.

The idea is that if y has at least t close neighbors, then anonymity (and privacy) is preserved. "Close" is determined by both a privacy threshold c and how close the adversary's guess q is to the actual point y. With c = 0, or if the adversary knows the location of y, this is equivalent to k-anonymity. However, if an adversary has less information about y, the anonymizing neighbors need not be as close. (A small illustration of this test appears at the end of this section.) The paper also gives several sanitization algorithms that meet the (c, t)-isolation standard. Perhaps most relevant to our discussion, they show how to relate the definition to adversaries of different strengths: the ability of an adversary that generates a region as an estimate and that of an adversary that gives an estimate as a query point are essentially equivalent with respect to isolation. The ability to show equivalence of different metrics and approaches will go a long way toward establishing the efficacy of privacy-preserving data mining methods.

Another unsolved problem for privacy-preserving data mining is the cumulative effect of multiple disclosures. While building a single model may meet the standard, multiple data mining models in combination may enable deducing individual information. This is closely related to the multiple-table problem of census release, or the statistical disclosure limitation problem. Statistical disclosure limitation has been a topic of considerable study; readers interested in addressing the problem for data mining are urged to delve further into the statistical disclosure limitation literature [7, 22, 21].
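The (c, t)-isolation test of Definition 2 is straightforward to state in code. The sketch below (Python; the toy points and parameter values are hypothetical) counts how many database points fall inside the ball $B(q, c\delta_y)$:

```python
import math

def isolates(q, y, points, c, t):
    """True if query point q (c,t)-isolates y in `points`: the ball
    B(q, c*delta_y) around q contains fewer than t database points,
    where delta_y = ||q - y||_2 (Definition 2 of [4])."""
    delta_y = math.dist(q, y)
    inside = sum(1 for p in points if math.dist(q, p) <= c * delta_y)
    return inside < t

# Toy 2-D database; y has several close neighbors, so a nearby guess q
# does not isolate it for t = 3 (privacy is preserved).
rdb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
y = (0.0, 0.0)
q = (0.2, 0.2)
print(isolates(q, y, rdb, c=2.0, t=3))  # False: 3 points fall inside the ball
```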

2 Is Disclosure Intrusive?
One problem with methods for preventing individual identification is that anonymity does not necessarily prevent linking sensitive information to an individual. This is particularly relevant to privacy-preserving data mining. Association rules provide a good example: an association rule supported by 50 individuals with 100% confidence would not violate a 50-anonymity standard, but the consequent of the rule, if sensitive, would become known for all 50 individuals. An alternative concept, l-diversity, was introduced in [15]. This measures the range of sensitive values that can be ascribed to any k anonymous individuals. While a start, we really need to go beyond this: it isn't just linking to the data, it is determining how intrusive that linkage is.

Developing definitions for intrusiveness is much more challenging than individual identifiability. Release of some types of data, such as date of birth, poses only a minor annoyance by itself. But in conjunction with other information, date of birth can be used for identity theft, an unquestionable intrusion. Intrusiveness must be evaluated independently for each domain, making general approaches difficult. What we can do is measure the amount or quality of information about a privacy-sensitive attribute that is revealed to an adversary. There have been several proposals in this respect, in addition to the l-diversity mentioned above. A few of these proposals are listed below.

Bounded knowledge. Just as we can measure the ability of an adversary to identify an individual, we can measure the ability of an adversary to estimate a sensitive value. One such measure is given by [1], who propose using the differential entropy h(A) of a random variable A. Their metric for privacy is $2^{h(A)}$. Specifically, if we add noise from a random variable A, the privacy is

$$\Pi(A) = 2^{-\int_{\Omega_A} f_A(a) \log_2 f_A(a)\, da},$$

where $\Omega_A$ is the domain of A. There is a nice intuition behind this measure: the privacy is 0 if the exact value is known, and a value known to fall within a range of width a (where the estimate is a uniform distribution over that range) has $\Pi(A) = a$. The authors go further to introduce a definition that describes the amount of information an adversary gains from the disclosure caused by data mining, conditional privacy:

$$\Pi(A \mid B) = 2^{-\int_{\Omega_{A,B}} f_{A,B}(a,b) \log_2 f_{A \mid B=b}(a)\, da\, db}.$$

This was applied to noise addition to a dataset in [1]. However, the same metric can be applied to disclosures other than of the source data (although calculating the metric may be a challenge). A similar approach is taken in [5], where conditional entropy was used to evaluate disclosure from secure distributed protocols. Assuming a uniform distribution of data, they are able to calculate the conditional entropy resulting from execution of a protocol (in particular, a set of linear equations that combine random noise and real data). Using this, they analyze several scalar product protocols based on adding noise to a system of linear equations, then later factoring out the noise. The protocols result in sharing the noisy data; the technique of [5] enables evaluating the expected change in entropy resulting from the shared noisy data. While perhaps not directly applicable to all privacy-preserving data mining, the technique shows another way of calculating the information gained.
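As an illustration of the bounded-knowledge metric, the following sketch (Python/NumPy; the bin count and sample sizes are arbitrary choices of this illustration) estimates $\Pi(A) = 2^{h(A)}$ from samples via a histogram approximation of the differential entropy. The closed forms, $\Pi = 1$ for unit-width uniform noise and $\Pi = \sqrt{2\pi e}\,\sigma \approx 4.13\sigma$ for Gaussian noise, provide a sanity check:

```python
import numpy as np

def privacy_metric(samples, bins=200):
    """Estimate Pi(A) = 2^h(A), with h(A) = -integral f(a) log2 f(a) da,
    using a histogram density estimate and a Riemann sum."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    width = edges[1] - edges[0]
    p = hist[hist > 0]
    h = -np.sum(p * np.log2(p) * width)  # approximate differential entropy (bits)
    return 2.0 ** h

rng = np.random.default_rng(0)
# Uniform noise of width 1 should give Pi(A) close to 1.
print(privacy_metric(rng.uniform(-0.5, 0.5, 100_000)))
# Gaussian noise: closed form is sqrt(2*pi*e) * sigma, about 4.13 for sigma = 1.
print(privacy_metric(rng.normal(0.0, 1.0, 100_000)))
```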

Need to know. While tough to incorporate in a metric, the reason for disclosing information matters. Privacy laws generally allow disclosure for permitted purposes; e.g., the European Union privacy guidelines specifically allow disclosure for government use or to carry out a transaction requested by the individual [8]:

  Member States shall provide that personal data may be processed only if: (a) the data subject has unambiguously given his consent; or (b) processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract; or ...

This principle can be applied to data mining as well: disclose only the data actually needed to perform the desired task. In Section 3 we discuss an application where need to know may well be sufficient.

Protected from disclosure. Sometimes disclosure of certain data is specifically proscribed. Recently there has been work in privacy-aware languages for specifying and enforcing authorization policy [3, 12]. These approaches add release and use policies to basic access control mechanisms. Given such specifications, we are able to determine if specific values must be protected from disclosure. With these policies, we may find that any knowledge about proscribed data is deemed too sensitive to reveal. The challenge is showing that data mining outcomes (knowledge) do not reveal proscribed knowledge, even in combination with external information.

A first cut at such indirect disclosure was given in [11]. They explored ways to analyze a classifier to determine if it discloses sensitive data. Their work assumed that the disclosure was a black-box classifier: the adversary could classify instances, but not look inside the classifier. A key insight of this work was to divide data into three classes: Sensitive data, Public data, and data that is Unknown to the adversary. The basic metric used was the Bayes classification error rate. Assume we have data $(x_1, x_2, \ldots, x_n)$ that we want to classify into m classes $\{0, 1, \ldots, m-1\}$. For any classifier

$$C : x_i \mapsto C(x_i) \in \{0, 1, \ldots, m-1\}, \quad i = 1, 2, \ldots, n,$$

we define the classifier accuracy for C as

$$\sum_{i=0}^{m-1} \Pr\{C(x) \ne i \mid z = i\}\,\Pr\{z = i\}.$$
As an example, assume we have n samples $X = (x_1, x_2, \ldots, x_n)$ from a two-point Gaussian mixture $(1-\epsilon)N(0,1) + \epsilon N(\mu,1)$. We generate a sensitive data set $Z = (z_1, z_2, \ldots, z_n)$ where $z_i = 0$ if $x_i$ is sampled from $N(0,1)$ and $z_i = 1$ if $x_i$ is sampled from $N(\mu,1)$. For this simple classification problem, notice that out of the n samples there are roughly $\epsilon n$ samples from $N(\mu,1)$ and $(1-\epsilon)n$ from $N(0,1)$. The total number of misclassified samples can be approximated by

$$n(1-\epsilon)\Pr\{C(x) = 1 \mid z = 0\} + n\epsilon\,\Pr\{C(x) = 0 \mid z = 1\};$$

dividing by n, we get the fraction of misclassified samples:

$$(1-\epsilon)\Pr\{C(x) = 1 \mid z = 0\} + \epsilon\,\Pr\{C(x) = 0 \mid z = 1\}.$$

The metric gives the overall probability that any sample is misclassified by C. Notice that this is an overall measure, not a measure for a particular value of x.

Several problems were analyzed in [11]. The obvious case is the example above: the classifier returns sensitive data. Another issue arises when the classifier takes both public and unknown data as input. If we assume that all of the training instances are known to the adversary (public and sensitive, but not unknown, values), the classifier $C(P, U) \to S$ gives the adversary no additional knowledge about the sensitive values. But if the training data is unknown to the adversary, the classifier C does reveal sensitive data, even though the adversary does not have complete information as input to the classifier. Another issue is the potential privacy violation of a classifier that takes public data and discloses non-sensitive data to the adversary. While not in itself a privacy violation (no sensitive data is revealed), such a classifier could enable the adversary to deduce sensitive information. An experimental approach to evaluate this possibility is given in [11].

The issue of unknown versus public data is complicated by the fact that publicly available records already contain considerable information that many would consider private. If the private data revealed by a data mining process is already publicly available, does this pose a privacy risk? If the ease of access to that data is increased (e.g., available on the internet versus in person at a city hall), then the answer is yes. But if the data disclosed through data mining is as hard to obtain as the publicly available records, it isn't clear that the data mining poses a privacy threat.
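Returning to the Gaussian mixture example above, the following sketch (Python/NumPy; the parameter values $\epsilon = 0.3$ and $\mu = 2$ are arbitrary) empirically estimates the misclassification fraction for the Bayes-optimal threshold rule:

```python
import numpy as np

def misclassification_rate(eps, mu, n=200_000, seed=1):
    """Empirical fraction of samples misclassified by the Bayes-optimal
    threshold rule for the mixture (1-eps)*N(0,1) + eps*N(mu,1)."""
    rng = np.random.default_rng(seed)
    z = rng.random(n) < eps                 # true (sensitive) class labels
    x = rng.normal(np.where(z, mu, 0.0), 1.0)
    # Predict class 1 when eps*phi(x-mu) > (1-eps)*phi(x), i.e. when x
    # exceeds the log-likelihood-ratio threshold (assumes mu > 0).
    thresh = mu / 2 + np.log((1 - eps) / eps) / mu
    c = x > thresh
    return np.mean(c != z)                  # the error-rate metric above

# A higher error rate means the adversary learns less about the labels z.
print(misclassification_rate(eps=0.3, mu=2.0))
```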

This leads to an idea for measuring intrusiveness: the metric for privacy should be based on the cost of performing an intrusion. This cost can involve many factors, allowing us to incorporate many of the ideas above:

Individual identifiability: The cost of an intrusion is raised by the number of attacks that must be made to harm the targeted individual. This cost can be measured, and is not necessarily linear. For example, a newspaper publishing "Senator X was spotted writing graffiti on the Capitol Dome", if correct, would be a valid news story; but "One of Senators X, Y, or Z was ..." would be an invitation to a libel suit.

Quality of disclosed data: Any of the measures of the quality of information can be related to cost. For example, a marketer may know that only individuals with high disposable income are able to respond to a sales call. If only half of the individuals who appear to have a high income actually do, the response rate will be cut in half. In addition, the number of individuals not reached will double, perhaps making a less intrusive marketing campaign more cost-effective.

Availability of information: The cost of obtaining publicly available information is becoming easier to estimate; most web telephone directories include advertisements for such public-records searches. This makes it easier to compare with the cost of performing an equivalent intrusion based on privacy-preserving data mining outcomes. If privacy-preserving data mining is a more expensive way to violate privacy, then it is reasonable to argue that it is sufficient.

3 Are we Ready for Technology Transfer?


While our field has considerable research remaining before we will satisfy dedicated privacy advocates, or even privacy legislation, there are areas where the technology has value today. The key is to distinguish between privacy and confidentiality. Most techniques developed in this community have strong statements regarding the confidentiality provided. This is easier to measure than privacy, and for many applications it is sufficient. I now describe some of these application areas, and why I feel our technology is ready for them.

3.1 Authorization vs. Need to Know: Protecting Data is Costly

While data is rightly viewed as a valuable asset, concern over privacy is increasing the cost of maintaining that asset. Some of these costs are legislative: California SB1386 requires use of data encryption and, perhaps more importantly, requires notification of individuals whose data is disclosed; typical estimates for the total cost of mailing a first-class letter (not just postage) are over one dollar. Microsoft was fined for violating Spanish privacy laws. Other costs are contractual: CardSystems was terminated by Visa and American Express after having credit card information stolen, potentially killing the company [18]. Public relations costs can be enormous; ChoicePoint stock lost 20% of its value in the month following their disclosure of information theft.

The CardSystems example is particularly relevant. The data kept by CardSystems consisted of transactions which were not completed for a variety of reasons. This data "was stored for research purposes in order to determine why these transactions did not successfully complete" [18]. This is clearly a case where privacy-preserving data mining techniques would have provided value: methods that enabled the research without the potential for disclosing the data would have prevented the problem. The key is that in these situations, the companies were authorized to see the information (although in the case of CardSystems, not to retain it). Because such authorization exists, replacing existing analysis methods with privacy-preserving methods would not need further approval from a privacy standpoint. Instead, companies would need to be convinced only of the cost/benefit: the cost of implementing privacy-preserving techniques versus the cost of protecting (or the potential cost of failing to protect) the data.

3.2 Corporate Secrecy

Companies today must be flexible to survive. Efficient use of resources demands that companies work with multiple suppliers and customers. Competition and open markets have fragmented utility infrastructure. While this has a significant benefit in terms of reduced prices and rewarding efficiency, it poses a challenge to global optimization and planning. At one time, an auto manufacturer controlled the entire production process and could plan accordingly; now parts come from many suppliers, and a disruption anywhere in the chain can cause havoc. A single failure in the utility grid can cause wide-scale blackouts. Many of these problems could be addressed through global data analysis. However, the data needed is often proprietary; companies are justifiably concerned about competitors (or even partners) misusing the data. The issues, and goals, are in many ways similar to privacy concerns. The key difference is that the individuals whose privacy could be violated are a few corporations, rather than millions of people. With private data about people, the data mining benefit rarely goes to the individuals concerned; it instead goes to the entity doing the data mining. Convincing people that privacy-preserving data mining is okay, when they see no value from it, is a challenge. However, the beneficiaries of data sharing in a corporate global planning scenario are often the corporations themselves. Convincing a few companies that the benefit is worth the risk is feasible.

3.3 Anti-trust

Competing corporations often gain benefit from collaborating. Positive examples include shared research on new products, such as fuel cell vehicle technology in the U.S. automotive industry, or industry-wide evaluation of safety data. However, collaboration could also have negative effects, such as (illegal) price-fixing cartels. As a result, any collaboration between competitors is open to scrutiny. Privacy-preserving data mining has a potential application in this arena. Anti-trust generally comes down to the question: does it benefit the consumer? The results of the collaboration (e.g., data showing which technologies give the best safety improvements in practice) may benefit the consumer. But the data used to generate the results (e.g., costs to produce various safety technologies) could also be used for illegal purposes. Even if not used for these purposes, the opportunity to violate antitrust law could lead to intense and expensive scrutiny. Privacy-preserving techniques can prevent the disclosure of the underlying data; the companies need only show the consumer benefit of the result.

4 What Next?

This paper lays out two new directions for our research community. One is to better understand and define privacy; this is a challenging research question for the long term. The second is areas where our technology is nearly ready for application. While research challenges will arise as we study these applications, the potential exists for rapid technology transfer. One benefit of pursuing technology transfer today is to build public trust in our technology. We can use the applications in Section 3 to demonstrate that the technology is effective in improving confidentiality while still allowing beneficial use of data. Once we resolve the issues of what individual privacy really means in the context of privacy-preserving data mining, this public trust will enable more rapid acceptance of the technology. This will move us more rapidly toward the real win: enabling research that is today unimaginable because of the risk to privacy.
References
[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247-255, Santa Barbara, California, May 21-23 2001. ACM.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, TX, May 14-19 2000. ACM.
[3] C. A. Ardagna, E. Damiani, S. D. C. di Vimercati, and P. Samarati. Towards privacy-enhanced authorization policies and languages. In Proceedings of the 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, pages 16-27, Storrs, Connecticut, Aug. 7-10 2005.
[4] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Toward privacy in public databases. In Theory of Cryptography Conference, Cambridge, MA, Feb. 9-12 2005.
[5] Y.-T. Chiang, D.-W. Wang, C.-J. Liau, and T.-S. Hsu. Secrecy of two-party secure computation. In Proceedings of the 19th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, pages 114-123, Storrs, Connecticut, Aug. 7-10 2005.
[6] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In H. Kargupta, A. Joshi, and K. Sivakumar, editors, National Science Foundation Workshop on Next Generation Data Mining, pages 126-133, Baltimore, MD, Nov. 1-3 2002.
[7] A. Dobra and S. E. Fienberg. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proceedings of the National Academy of Sciences, 97:11885-11892, 2000.
[8] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities, No I.(281):31-50, Oct. 24 1995.
[9] Standard for privacy of individually identifiable health information. Federal Register, 67(157):53181-53273, Aug. 14 2002.
[10] Standard for privacy of individually identifiable health information. Technical report, U.S. Department of Health and Human Services Office for Civil Rights, Aug. 2003.
[11] M. Kantarcioglu, J. Jin, and C. Clifton. When do data mining results violate privacy? In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 599-604, Seattle, WA, Aug. 22-25 2004.
[12] K. LeFevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, and D. J. DeWitt. Limiting disclosure in hippocratic databases. In Proceedings of the 30th International Conference on Very Large Data Bases, pages 108-119, Toronto, Canada, Aug. 31-Sept. 3 2004. Morgan Kaufmann Publishers Inc.
[13] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, Aug. 20-24 2000.
[14] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177-206, 2002.
[15] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, Georgia, Apr. 2006.
[16] Merriam-Webster online dictionary.
[17] R. A. Moore, Jr. Controlled data-swapping techniques for masking public use microdata sets. Statistical Research Division Report Series RR 96-04, U.S. Bureau of the Census, Washington, DC, 1996.
[18] J. M. Perry. Statement of John M. Perry, President and CEO, CardSystems Solutions, Inc., before the United States House of Representatives Subcommittee on Oversight and Investigations of the Committee on Financial Services. http://financialservices.house.gov/hearings.asp?formmode=detail&hearing=407&comm=4, July 21 2005.
[19] P. Samarati. Protecting respondents' privacy in microdata release. IEEE Trans. Knowledge and Data Eng., 13(6):1010-1027, Nov./Dec. 2001.
[20] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
[21] L. Wang, D. Wijesekera, and S. Jajodia. Cardinality-based inference control in data cubes. Journal of Computer Security, 12(5):655-692, 2005.
[22] L. Willenborg and T. de Waal. Elements of Statistical Disclosure Control, volume 155 of Lecture Notes in Statistics. Springer-Verlag, New York, NY, 2001.

An Adaptable Perturbation Model of Privacy Preserving Data Mining


Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu and Latifur Khan
Computer Science Department, University of Texas at Dallas
{liliu, bhavani.thuraisingham, muratk, lkhan}@utdallas.edu

Abstract

Randomization and perturbation are two very important techniques in privacy preserving data mining. Loss of information versus preservation of privacy is always a trade-off. Furthermore, an approach that uses random matrix properties has recently posed a challenge to perturbation-based techniques. The question is: can perturbation-based techniques still protect privacy? In order to find the answer to this question, we scrutinize two different approaches: one proposed by Agrawal et al. using Bayes density functions, and the other proposed by Kargupta et al. using random matrices. We set up simulation experiments to study these two approaches, and we have obtained some interesting results and made some observations. We propose a modified version of Agrawal et al.'s algorithm, which reconstructs the original distribution from the perturbed distribution rather than using the perturbed data. Furthermore, under the same conditions, using the random matrix filter approach we failed to obtain the original distribution. We give a hypothesis to explain this observation. Based on this hypothesis, we propose an adaptable perturbation model, which accounts for the diversity of information sensitivity.

1. Introduction
Randomization and perturbation are two very important techniques in privacy preserving data mining and have been extensively studied in recent years. Several approaches have been developed in this field [1-4], and many of them are based on perturbation and randomization techniques. One particular approach, developed by Kargupta et al. [10], has challenged the perturbation techniques: can such techniques ensure any level of privacy? To determine the merit of this claim, we simulate two of the well-known algorithms: one proposed by Agrawal et al. in [1], and the other proposed by Kargupta et al. in [10]. In our experiments we examine both algorithms in great detail, and we find that certain assumptions and conditions have significant effects on the results. By changing the setup conditions of the experiments, we find some interesting results. The main question is: can we use less information and still get similar results?

Based on our observations from these experiments, we propose a modified version of Agrawal et al.'s algorithm. This modified algorithm reconstructs the original distribution from the perturbed distribution rather than from the perturbed dataset. We have, however, failed to reconstruct the original distribution under the same conditions for the random matrix technique proposed by Kargupta et al. in [10], and we state a hypothesis as to why this may be the case. This paper proposes several possible reasons for the negative result of the experiments. We are now exploring ways to develop a mathematical proof of the hypothesis, which we expect to give in a future paper.

Based on the modified Agrawal et al. algorithm, we propose an adaptable perturbation model. In this model, we first define different privacy levels, with each privacy level corresponding to a perturbation level. A tunable parameter in our model lets us perturb data at different levels by adjusting its value.

The paper is organized as follows. Section 2 discusses related work. Section 3 briefly describes the two algorithms proposed by Agrawal and Kargupta, respectively, as well as our simulation experiments. Section 4 discusses our observations on these two approaches and the modified Agrawal et al. approach. In Section 5, we describe our proposed adaptable perturbation model and our experimental results. Section 6 discusses directions and future work.

2. Related Work
Several approaches to privacy preserving data mining have been developed in recent years. These approaches can be classified into two main categories: those based on perturbation and randomization techniques [1-4], and those based on secure multi-party computation (SMC) [5-9]. The approach proposed by Kargupta et al. in [10] poses a challenge to the perturbation- and randomization-based approaches. It claims that such approaches may lose information while also failing to provide privacy when introducing random noise to the data. By using random matrix properties, Kargupta et al. successfully separate the data from the random noise and subsequently disclose the original data. Several approaches [5-9] fall into the second category (i.e., multi-party computation), but they all incur very high computation costs. Furthermore, these multi-party computation based approaches assume that each party uses the same data scheme, and thereby work only in a homogeneous environment. Heterogeneity, where different parties use different schemes, is a major issue that we need to tackle in the future.

To explore the challenge raised in [10] by Kargupta et al., we propose a modified version of the Agrawal et al. algorithm which needs only the distribution of the perturbed data and the distribution of the introduced noise to reconstruct the original distribution. The existing perturbation-based approaches apply privacy preserving techniques to the dataset without considering the different sensitivity levels of the private information, thus leading to more loss of information and/or loss of privacy. At present there is no metric for privacy, although some initial ideas are given in [11], and privacy is hard to measure: for example, how much protection do the privacy preserving strategies give? To address this issue, we propose an adaptable model in this paper. Our approach applies different perturbation strategies according to the privacy sensitivity levels, thus minimizing the loss of information.

3. Algorithms and Simulation Experiments

As stated earlier, perturbation is one of the major techniques used to preserve privacy. This strategy is based on introducing noise while not significantly changing the information content of the original data; data mining techniques are subsequently applied to the perturbed data [1]. Loss of information versus privacy preservation is always a trade-off. The extent to which we perturb the original data will dramatically affect the mining result and subsequently contribute to the potential risk of privacy disclosure. The random perturbation technique was first introduced by Agrawal et al. in [1]. Since then, this technique and its variations [2, 3, 4] have been widely used in privacy preserving data mining. Recently, research in signal processing (see, for example, [10]) has offered many filters to remove white noise. Furthermore, the signal processing method discussed in [10] poses a challenge to the random perturbation technique. The question is: do random perturbation techniques sufficiently preserve the privacy of the information? The key point is that the randomness does seem to have structure, and this may be used to compromise the original dataset rather than just the distribution of the original data. The best way to thoroughly understand and investigate this technique is to conduct simulation experiments. We have simulated the approaches proposed in [1] and [10]. For completeness and clarity, we briefly describe the two algorithms below. We also discuss our simulation results, and then summarize our observations of the two approaches.

3.1 The Problem

This problem is referred to as the reconstruction problem, originally defined in [1] by Agrawal et al. as follows. Let $x_1, x_2, \ldots, x_n$ be the original values of a one-dimensional distribution, realizations of n independent identically distributed (iid) random variables, each with the same distribution X. Let $y_1, y_2, \ldots, y_n$ be the random values used to distort the data; $y_i$ is the realization of $Y_i$, iid, each with the same distribution Y. In the experiments, y is a random value drawn from a uniform or Gaussian distribution, as below:

Uniform distribution: the random variable has a uniform distribution over an interval $[-\alpha, \alpha]$. The mean of the random variable is 0.

Gaussian distribution: the random variable has a normal distribution with mean $\mu = 0$ and standard deviation $\sigma$.

Given the perturbed data $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$ and the cumulative probability distribution $F_Y$ of the noise, estimate the probability distribution $F_X$ of the original data.
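To set up this problem concretely, here is a minimal sketch (Python/NumPy) generating original data X, noise Y, and perturbed data W; the triangular range [-1, 1] with mode 0 is an assumption of this illustration, while the noise parameters follow the experimental description:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Original data X: triangular distribution (assumed support [-1, 1], mode 0).
x = rng.triangular(-1.0, 0.0, 1.0, n)

# Two zero-mean noise models Y: uniform on [-alpha, alpha], and Gaussian.
alpha, sigma = 0.5, 0.25
y_uniform = rng.uniform(-alpha, alpha, n)
y_gauss = rng.normal(0.0, sigma, n)

# The data miner sees only the perturbed values w_i = x_i + y_i.
w = x + y_gauss
```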

3.2 Agrawal et al.'s Algorithm Using Bayes' Theorem on Density Functions

Using Bayes' theorem, given the probability distribution $F_Y$ and the perturbed values $w_i = x_i + y_i$, the estimated density function is

$$f_X'(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int f_Y(w_i - z)\, f_X(z)\, dz}.$$

Given a large number of samples, this would be equal to the real density function. Since $f_X$ is unknown, the algorithm iterates:

$$f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^j(a)}{\int f_Y(w_i - z)\, f_X^j(z)\, dz}.$$

Initially $f_X^0$ is the uniform distribution; the iteration is performed until the stopping criterion is met.

Figure 1. Agrawal's Reconstruction Algorithm
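A compact sketch of the Figure 1 iteration for Gaussian noise follows (Python with NumPy/SciPy). The discretization over bin midpoints, the heuristic range estimate, and the L1 stopping rule are implementation choices of this sketch, not prescribed by [1]:

```python
import numpy as np
from scipy.stats import norm

def reconstruct(w, sigma, bins=40, tol=0.01, max_iter=500):
    """Iterative Bayes reconstruction of the original density f_X from
    perturbed samples w = x + y, with Gaussian noise f_Y = N(0, sigma^2).
    Returns bin midpoints and the estimated density over those bins."""
    lo, hi = w.min() + 3 * sigma, w.max() - 3 * sigma   # rough range of X
    mids = np.linspace(lo, hi, bins)
    width = mids[1] - mids[0]
    fx = np.full(bins, 1.0 / (hi - lo))                 # start from uniform f_X
    # Precompute f_Y(w_i - a) for every sample i and bin midpoint a.
    fy = norm.pdf(w[:, None] - mids[None, :], scale=sigma)
    for _ in range(max_iter):
        denom = (fy * fx).sum(axis=1) * width           # int f_Y(w_i - z) f_X(z) dz
        fx_new = (fy * fx / denom[:, None]).mean(axis=0)
        converged = np.abs(fx_new - fx).sum() * width < tol
        fx = fx_new
        if converged:
            break
    return mids, fx

# Usage with the perturbed data from the previous sketch:
# mids, fx_hat = reconstruct(w, sigma=0.25)
```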

3.3 Kargupta et al.'s Algorithm Using Random Matrices

Kargupta et al.'s algorithm is based on properties of random matrices. We have the original data in matrix form, denoted by U; as in the problem defined in Section 3.1, we add noise data, a random matrix denoted by V. V satisfies the following white-noise conditions:

1) i.i.d. (independent identically distributed) entries,
2) mean = 0,
3) unit variance.

The covariance matrix of V, $\mathrm{Cov} = \frac{1}{m} V^T V$, follows Wigner's semi-circle law, which gives the distribution of the eigenvalues of Cov. From it we can calculate the upper and lower bounds of the eigenvalues, denoted $\lambda_{\max}$ and $\lambda_{\min}$ respectively:

$$\lambda_{\min} = \left(1 - 1/\sqrt{q}\right)^2, \qquad \lambda_{\max} = \left(1 + 1/\sqrt{q}\right)^2,$$

where the matrix dimensions are m and n, with $q = m/n \ge 1$.

Recall the problem defined in Section 3.1. The noise we have introduced satisfies the conditions above, so, using steps similar to those in Section 3.1, we can obtain the perturbed data matrix Q = U + V. Given the perturbed data matrix Q and the cumulative probability distribution $F_Y$ of the noise (the element $v_{ij}$ of V is a realization of $Y_i$, iid, each with the same distribution $F_Y$), we estimate the probability distribution $F_X$ of the original data. Let

$$U^T U = U_U S_U U_U^T, \qquad Q^T Q = U_Q S_Q U_Q^T, \qquad V^T V = U_V S_V U_V^T, \qquad S_Q \approx S_U + S_V.$$

We can use the upper and lower bounds $\lambda_{\max}$ and $\lambda_{\min}$ of the eigenvalues of the covariance matrix of V to separate the data eigenvalues from the noise eigenvalues in $S_Q$, map the actual data and the noise data to the columns corresponding to the eigenvectors, and eventually obtain the estimated matrix P, our estimate of the original dataset. (For more details of the algorithm we refer to [10].)
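The following simplified sketch is in the spirit of [10] (the sigma parameter, the covariance normalization, and the rank-1 toy signal are assumptions of this illustration, not the paper's exact procedure). It filters out eigen-components of Q's covariance whose eigenvalues fall inside the noise band predicted by the bounds above:

```python
import numpy as np

def spectral_filter(Q, sigma=1.0):
    """Estimate U from Q = U + V by projecting out eigen-directions whose
    covariance eigenvalues lie inside the band predicted for white noise."""
    m, n = Q.shape
    q = m / n                                      # aspect ratio, q >= 1
    lam_min = sigma**2 * (1 - 1 / np.sqrt(q))**2
    lam_max = sigma**2 * (1 + 1 / np.sqrt(q))**2
    cov = (Q.T @ Q) / m                            # n x n sample covariance
    vals, vecs = np.linalg.eigh(cov)
    signal = (vals < lam_min) | (vals > lam_max)   # outside the noise band
    P = vecs[:, signal] @ vecs[:, signal].T        # projector onto signal space
    return Q @ P                                   # noise-filtered estimate of U

# Toy check: a low-rank signal plus unit-variance white noise.
rng = np.random.default_rng(0)
m, n = 200, 50
U = np.outer(rng.normal(size=m), rng.normal(size=n))  # rank-1 "data"
V = rng.normal(size=(m, n))                           # white noise
U_hat = spectral_filter(U + V, sigma=1.0)
print(np.linalg.norm(U_hat - U) / np.linalg.norm(U))  # relative error well below 1
```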

3.4 Simulation Experiments

As defined in Section 3.1, the original dataset is denoted X and the noise dataset Y. Y can have:

Gaussian distribution: a normal distribution with mean $\mu = 0$ and standard deviation $\sigma$; $\sigma$ varies (e.g., 0.25, 0.3, or 0.4) to see how it affects the result.

Uniform distribution: a uniform distribution over an interval $[-\alpha, \alpha]$, e.g., [-0.5, 0.5], with mean 0.

We use Gaussian distribution noise in both algorithms, and uniform distribution noise in Agrawal et al.'s algorithm for comparison.

The variance $\sigma^2$ is important for Gaussian-distributed noise; it can dramatically affect the results. In [10], Kargupta et al. define the Signal-to-Noise Ratio (SNR) to quantify the relative amount of noise added to the actual data:

$$\mathrm{SNR} = \frac{\text{Variance of Actual Data}}{\text{Noise Variance}}.$$

In our experiments, the standard deviation of the actual data is 0.3522. With noise standard deviation 0.25, SNR = 1.9848; with 0.30, SNR = 1.3784; with 0.40, SNR = 0.7753. In [10], Kargupta et al. state that when the SNR is above 1.3, the algorithm produces good results. Our experimental results confirm this statement.

4. Our Observations and Modified Algorithm

4.1 Experiment Results and Observations

The two algorithms solve the same problem: given perturbed data and some information about the random noise, they reconstruct the distribution of the original data. In our simulation experiments, we use the same setup as described in [1] and [10]. We use 10K records of the X dataset, which has a triangular distribution. In the simulation experiments of the Agrawal et al. algorithm, we add both uniform and Gaussian distribution noise (the Y dataset) to obtain the W dataset as defined in Section 3.1: $w_1, w_2, \ldots, w_n$ ($w_i = x_i + y_i$) are n iid samples with cumulative distribution function $F_W$. From these 10K iid samples and the cumulative distribution function $F_Y$, we reconstruct $F_X$. For Agrawal et al.'s approach, we also need the range of the original dataset, which can be derived from the W dataset and $F_Y$.

In the simulation experiment of Kargupta et al.'s approach, we use exactly the same dataset W, obtained by adding Gaussian distribution noise Y with $\mu = 0$ and standard deviation $\sigma$ = 0.25, 0.30, and 0.40 for comparison. According to the description in [10], we divide the 10K one-dimensional dataset W into 50 groups, so each group has 200 records. Using 200 as the column number and 50 as the row number, we convert the W dataset into a 200 by 50 matrix Q. We can also convert the one-dimensional dataset X into a 200 by 50 matrix U, and the noise dataset Y into a matrix V of the same dimensions, using the same approach.

We noticed that the condition Q = U + V has a significant effect on the results. That is, the dataset X maps to matrix U and the dataset W maps to matrix Q, and the two mappings have to be the same: when the X dataset is divided into 50 groups of 200 records each, Q should have the same divisions. The 200 records in each group do not have to be in the same order, and the 50 groups do not have to be in the same order either, but the membership of each group must maintain the mapping. In matrix terms: if we convert X into U first and then add the random matrix V to U to obtain Q, then Q is a 200-column by 50-row matrix. If we shuffle the values inside each row, or switch the 50 rows with each other, or both, then Kargupta et al.'s algorithm can still reconstruct the distribution of the original dataset X. But if we shuffle the values of a column, the reconstruction fails.

Our simulation results are shown in Figures 2 through 7. Figures 2 and 3 show the results of the Agrawal et al. algorithm applied to the dataset perturbed by uniform and Gaussian distribution noise, respectively; the reconstructed distribution is very close to the distribution of the original dataset. Figures 4, 5, and 6 show the results of the Kargupta et al. algorithm with the mapping Q = U + V maintained, while Figure 7 shows the result when the mapping no longer exists. Figures 4 and 5 closely match the original distribution, Figure 6 shows more variation from the original distribution, and Figure 7 completely fails to reconstruct it. In Section 4.3, we give our analysis of the negative result in Figure 7.

The variance $\sigma^2$ of the Gaussian noise dramatically affects the results. Figures 4, 5, and 6 show the results of the same algorithm and original dataset using noise standard deviations $\sigma$ = 0.25, 0.30, and 0.40, respectively. From Figures 4 and 5 we can see that the algorithm reconstructs the distribution very successfully; from Figure 6 we can see that the result is not very good. This confirms Kargupta et al.'s statement that the algorithm performs well when the Signal-to-Noise Ratio (SNR) is above 1.3: the SNRs in the experiments are 1.9848, 1.3784, and 0.7753 for noise standard deviations 0.25, 0.30, and 0.40, respectively.

Figure 2. Agrawal et al. algorithm with uniform distribution noise (10K records, 40 intervals, 1% stopping criterion).

Figure 3. Agrawal et al. algorithm with Gaussian distribution noise (10K records, 40 intervals, 1% stopping criterion).

Figure 4. Kargupta et al. algorithm with Gaussian distribution noise, std = 0.25.

Figure 5. Kargupta et al. algorithm with Gaussian distribution noise, std = 0.30.

Figure 6. Kargupta et al. algorithm with Gaussian distribution noise, std = 0.40.

Figure 7. Kargupta et al. algorithm fails when the division mapping relationship between X and U, and between W and Q, cannot be kept.

(Each plot shows the original data, the perturbed data, and the reconstructed distribution.)

After examining the two approaches, we summarize below the information needed by each algorithm to reconstruct the original distribution.

Table 1. Comparison of conditions for the Agrawal et al. algorithm and the Kargupta et al. algorithm

                 Agrawal et al. algorithm [1]        Kargupta et al. algorithm [10]
Perturbed data   Perturbed dataset W                 Perturbed dataset W
Noise data       Probability distribution F_Y        Probability distribution F_Y;
                                                     variance of the distribution
Other            Range of the original dataset*      The noise Y has to satisfy the
                                                     random matrix conditions;
                                                     a division mapping relationship
                                                     between X and U and between
                                                     W and Q;
                                                     ratio of columns to rows of
                                                     the matrix

(* This can be derived from the W dataset and F_Y.)

4.2 Modified Agrawal et al. Algorithm

Based on the simulation studies, Kargupta et al.'s algorithm [10] gives the better results. Furthermore, they claim that if the results yield a point-to-point estimation of the original dataset, and not just its distribution, this may mean a privacy disclosure. The question is: how can we overcome this problem with Agrawal et al.'s algorithm?

Based on our observations discussed in the previous section, we propose a modified version of Agrawal et al.'s algorithm to answer this question: we reconstruct the original distribution $F_X$ from the probability distribution $F_W$ and the probability distribution $F_Y$.

Table 2. Comparison of conditions for the modified algorithm and the Agrawal et al. algorithm

                 Modified algorithm                  Agrawal et al. algorithm [1]
Perturbed data   Probability distribution F_W        Perturbed dataset W
Noise data       Probability distribution F_Y        Probability distribution F_Y
Other            Range of the original dataset       Range of the original dataset
                 (derivable from F_W and F_Y)        (derivable from the W dataset
                                                     and F_Y)

Let w1, w2, ..., wn be the perturbed samples of n independent, identically distributed (i.i.d.) random variables, each with the same cumulative distribution function FW. We are given only FW, not the values w1, w2, ..., wn. FW is specified by the number of records that fall into certain intervals: let I1, I2, ..., Ip be intervals of equal width, where p is the total number of intervals, and let N1, N2, ..., Np be the numbers of values that fall into I1, I2, ..., Ip. FW is thus given in terms of the counts N1, N2, ..., Np in the corresponding intervals I1, I2, ..., Ip.

First, from each interval we generate the corresponding number of values, assuming that within each small interval the distribution is uniform; this yields sub-datasets W1, W2, ..., Wp, which combined form the dataset W of n records. We then draw n times from W, each record having the same probability, and thus obtain n sample records with cumulative distribution function FW, denoted W'.

Using the same dataset X and W, we set up the experiment for the modified Agrawal et al. algorithm as follows:
1. FW is given in terms of 40 intervals.
2. Create a 30k-record dataset W using FW.
3. Draw 10k times from W, each record having the same probability, to obtain a 10k-sample dataset denoted W'.
4. Use Bayes' theorem to estimate the density function FX of the original dataset from W'; the algorithm is shown in figure 1.

In figures 8A and 8B we can see that the results are very good. The original dataset has a triangular distribution; the perturbed data obtained from FW has a totally different distribution; and the reconstructed distribution of the original dataset is very close to the original one. Comparing this with the results obtained by Agrawal et al.'s original approach, shown in figure 3, we see that the results are very similar. There is a stopping threshold in the algorithm: figure 8A shows the result with the threshold set to 0.5%, and figure 8B with the threshold set to 0.25%. We can see that a smaller threshold gives a better result; that is, computing more iterations yields a more accurate estimate of the original distribution.

Using the same dataset X and W, we set up the experiment for Kargupta et al.'s algorithm as follows:
1. FW is given in terms of 1000 intervals.
2. Create a 30k-record dataset W using FW.
3. Draw 10k times from W, each record having the same probability, to obtain a 10k-sample dataset denoted W'.
4. Apply the Kargupta et al. algorithm to dataset W'; the result is shown in figure 9.
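As a concrete illustration of this sampling step, the following is a minimal sketch (Python with NumPy; the 40-interval grid, the 30k/10k sizes, and the uniform-within-interval assumption come from the description above, while the stand-in counts and all names are our own):

import numpy as np

rng = np.random.default_rng(1)

def sample_from_fw(counts, edges, size, rng):
    # Draw `size` values from FW, given as interval counts, assuming a
    # uniform distribution inside each small interval.
    probs = np.asarray(counts, float) / np.sum(counts)
    idx = rng.choice(len(counts), size=size, p=probs)   # choose an interval Ii
    return rng.uniform(edges[idx], edges[idx + 1])      # choose a point inside it

edges = np.linspace(-1.0, 1.5, 41)                      # p = 40 equal-width intervals
counts = np.histogram(rng.triangular(-1, 0, 1, 10_000), bins=edges)[0]  # stand-in FW

W = sample_from_fw(counts, edges, 30_000, rng)          # step 2: 30k-record dataset W
W_prime = rng.choice(W, size=10_000, replace=True)      # step 3: 10k samples W'
# W_prime is then fed to the Bayes-theorem density estimation of step 4.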


[Figure 8A. Modified Agrawal et al. algorithm on constructed data (40 intervals, 0.5% stopping criterion): the original data distribution is reconstructed from the perturbed data distribution. Series: Original Data, Constructed Perturbed Data, Result with Modified Algorithm.]


[Figure 8B. Modified Agrawal et al. algorithm on constructed data (40 intervals, 0.25% stopping criterion): the original data distribution is reconstructed from the perturbed data distribution.]

[Figure 9. Modified Kargupta et al. algorithm with 1000 intervals: the algorithm fails to reconstruct the original data distribution from the perturbed data distribution.]

From figure 9, we can see that Kargupta et al.'s algorithm fails to reconstruct the original distribution. That is, the original distribution has a triangular shape, and the perturbed dataset obtained from FW looks similar to those in figures 8A and 8B, but the reconstructed distribution shows a high peak that is not even remotely close to the original distribution. Here we set the interval number to 1000, compared with the 40 intervals used in the modified Agrawal et al. algorithm; although using 1000 intervals is much better than using 40, Kargupta et al.'s algorithm still fails. We analyze the possible reasons and give a hypothesis in the next section.

4.3 Analysis of the Negative Result in the Modified Kargupta et al. Approach

In the original Kargupta et al. approach, we have U + V = Q. By estimating upper and lower bounds on the eigenvalues of the covariance matrix of V, denoted λ_max and λ_min respectively, they use these bounds to separate the columns corresponding to the data U from the columns corresponding to the noise V in the perturbed data matrix Q. Although we build the sample matrix Q' row by row to keep the mapping, we still introduce a bias, denoted T, so that U + V + T = Q'. We can still calculate the bounds λ_max and λ_min to separate V from Q', but now we only have the matrix Q' instead of Q with which to map the actual data and the noise data to the columns corresponding to the eigenvectors, eventually obtaining the estimated matrix P, our estimate of the original dataset. Q' carries a small bias T; this T does not change the variance of (V + T) much, but it significantly affects the eigenvalues, and this makes a significant difference to the result. From another point of view, by adding this T we preserve more privacy and thus overcome the challenge that the random matrix approach poses to perturbation techniques. We are still working on a sound proof of this negative result. In the future we may find more issues with the random matrix approach, and we may also find improved approaches that make it work under new conditions.
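The sensitivity of the eigenvalue separation to the bias T can be probed numerically. The following sketch (Python with NumPy; the matrix shape, noise scale, and bias scale are arbitrary choices of ours, not taken from the paper) compares the variance of V and V + T with the shift in the extreme eigenvalues of their covariance matrices, the quantities on which the λ_max/λ_min separation bound relies:

import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 50
V = rng.normal(0.0, 0.30, (n, m))        # noise matrix
T = rng.normal(0.0, 0.03, (n, m))        # small bias introduced by resampling

def extreme_eigs(M):
    lam = np.linalg.eigvalsh(np.cov(M, rowvar=False))
    return lam[0], lam[-1]               # (lambda_min, lambda_max)

print("variances:        ", V.var(), (V + T).var())        # nearly identical
print("eigenvalue bounds:", extreme_eigs(V), extreme_eigs(V + T))
# Even a variance change of about 1% moves lambda_min and lambda_max, and the
# column separation of Q' into data and noise parts depends on these bounds.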

5. An Adaptable Perturbation Model

5.1 Privacy Level, Perturbation and Accuracy

As we have mentioned before, perturbation techniques always face a trade-off between information loss and privacy preservation. The question is: to what extent can we perturb the original data without dramatically affecting the mining results, while still avoiding potential privacy disclosure? In the real world, the sensitivity of private information varies.


We hereby propose an adaptable perturbation model. We classify private information into different privacy levels, and different perturbation techniques are applied to the different levels. The model has three main characteristics: privacy level, perturbation, and accuracy (shown in figure 10). The more sensitive a field, the higher its privacy level. A higher privacy level requires a stronger perturbation to hide more data, but this also leads to a greater loss of information: the higher the perturbation, the lower the accuracy. Accuracy here means the closeness of the mining results obtained from the perturbed dataset to those obtained from the original dataset. Note that the accuracy in the figure is an objective accuracy; that is, we try to find suitable data mining tools to achieve this objective accuracy.

[Figure 10. Privacy level, perturbation, and accuracy.]

The advantage of our model is its flexibility and diversity. This enables us to preserve privacy properly while keeping the loss of information to a minimum. For example, a party with a dataset of multiple features, several of which are sensitive, can choose a different privacy level for each feature.

5.2 Adaptable Perturbation Model

In section 4.2, we described how to obtain samples from a given cumulative probability distribution. We have noticed that the interval width is a very important parameter: we can achieve our adaptability goal simply by adjusting it. When the privacy level is high, we choose a larger interval width; for a low privacy level, we choose a smaller one. Notice that when the interval width shrinks to a very small number, e.g. 10^-12, the dataset W' becomes very close to the dataset W. The value type of W' need not be the same as that of W. For example, if W is age, with integer values, W' may contain real numbers that fall into the interval I in which the corresponding W value lies. The reason is that we are interested in the distribution of W', not in its exact values. This type flexibility has two advantages: first, it helps us hide the original values of W; second, it helps us keep the same or a similar distribution as W.
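A minimal sketch of this adaptable step follows (Python with NumPy); the mapping from privacy level to interval width below is an invented example policy, not a calibration from the paper, and all names are hypothetical:

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical policy: a larger interval width means a coarser histogram and
# therefore a stronger perturbation, as described above.
WIDTH_BY_LEVEL = {"low": 0.01, "medium": 0.1, "high": 0.5}

def perturb_field(w, level, rng):
    # Release W' for one field: locate each value's interval at the chosen
    # width, then draw a replacement uniformly within that interval.
    width = WIDTH_BY_LEVEL[level]
    lo = np.floor(w.min() / width) * width
    idx = np.floor((w - lo) / width)
    return lo + (idx + rng.uniform(0.0, 1.0, w.shape)) * width

w = rng.normal(40.0, 10.0, 1_000)                     # e.g. an age-like field W
for level in ("low", "medium", "high"):
    w_prime = perturb_field(w, level, rng)
    print(level, float(np.abs(w_prime - w).mean()))   # distortion grows with level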

5.3 Experiment Results

Our experiments show that, knowing only the distribution of the perturbed data, the Bayes density-function approach still works well. One thing should be mentioned here: the algorithm uses Bayes' theorem with an interval-based estimate of the density function; denote this interval number IA. In Agrawal et al.'s [1] experiment, the interval number for the density function is set to 40, and for comparison we also set IA = 40 in our experiments. If the interval number used to construct the perturbed data, denoted IB, is higher than IA, there is no significant effect on the results; but if IB is less than IA, increasing IB up to IA makes a difference. In our experiments IA is 40, so we set IB to 20, 40, and 1000; the results are shown in figure 11, figure 8B, and figure 12, respectively. From the figures we can see that when the perturbed data is constructed with 40 intervals, the result (shown in figure 8B) has no distinct change compared with the result for 1000 intervals (shown in figure 12), and is similar to the result using the perturbed dataset itself in Agrawal et al.'s approach (shown in figure 3). Compared with these, the result for IB = 20 (figure 11) is less accurate. The random matrix approach, by contrast, no longer seems workable when only the distribution of the perturbed dataset is known: when we increase the interval number to 1000 and 10,000, the results are quite bad (shown in figure 9). This may be because the approach is based on random matrix features and thus depends on eigenvalues, which in turn may depend heavily on the actual values in the matrix. We are working on a mathematical proof of our observations.

5.4 Stopping Criterion

There is a threshold in our modified Agrawal et al. algorithm, the same as in their original algorithm [1]. This threshold is the minimum difference between two successive iterations and is used as the stopping criterion: when the required accuracy is met, the computation stops. The threshold significantly affects the results. Figures 8A and 8B show two experiments with exactly the same setup except for the threshold: figure 8A shows the result with the threshold set to 0.5%, and figure 8B with the threshold set to 0.25%. We can see that a smaller threshold gives a better result; that is, computing more iterations yields a more accurate estimate of the original distribution.
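For concreteness, here is a minimal discretized sketch of the reconstruction loop with this stopping criterion (Python with NumPy). The update follows the iterative Bayesian reconstruction of Agrawal and Srikant [1]; the grid, the noise model, and the threshold value are our own illustrative choices.

import numpy as np

def reconstruct(w, noise_pdf, grid, threshold=0.0025, max_iter=500):
    # Estimate the density of X on `grid` from perturbed samples w = x + y,
    # stopping when the estimate changes by less than `threshold` (L1 norm).
    fx = np.full(len(grid), 1.0 / len(grid))        # uniform initial estimate
    like = noise_pdf(w[:, None] - grid[None, :])    # like[i, a] = fY(w_i - a)
    for _ in range(max_iter):
        post = like * fx
        post /= post.sum(axis=1, keepdims=True)     # Bayes posterior per sample
        fx_new = post.mean(axis=0)                  # updated density estimate
        if np.abs(fx_new - fx).sum() < threshold:   # stopping criterion
            return fx_new
        fx = fx_new
    return fx

rng = np.random.default_rng(4)
x = rng.triangular(-1.0, 0.0, 1.0, 10_000)          # triangular original data
w = x + rng.uniform(-0.5, 0.5, x.shape)             # uniform additive noise
grid = np.linspace(-1.5, 1.5, 40)                   # 40 reconstruction intervals
fx_hat = reconstruct(w, lambda z: ((z >= -0.5) & (z < 0.5)).astype(float), grid)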
[Figure 11. Modified Agrawal et al. algorithm on data constructed with 20 intervals (0.25% stopping criterion): reconstruction of the original data distribution from the perturbed data distribution.]

[Figure 12. Modified Agrawal et al. algorithm on data constructed with 1000 intervals (0.25% stopping criterion): reconstruction of the original data distribution from the perturbed data distribution.]


6. Conclusions and Future Work

In this paper, we propose a modified approach that needs only the perturbed data distribution to reconstruct the original distribution. In the work reported here, we use only the experimental datasets that are also used in [1] and [10]; in the future we would like to apply our modified algorithm to real datasets. We would also like to improve our approach by combining it with other results to obtain more accurate reconstructions; for example, we plan to apply our method to the approach proposed in [2]. In this paper we only illustrate how to reconstruct the original distribution from the perturbed data distribution; in the future, we will also perform data mining tasks on the reconstructed data, e.g. association rule mining, decision trees, or clustering. The adaptable perturbation model presented here can also be applied to other data structures. The concept of adaptable privacy levels can be extended to include secure multi-party computation (SMC) approaches and distributed datasets. In the future, a metric for privacy needs to be defined in more detail. Knowing the perturbation level, we should label the perturbed data with how much it can be trusted; we then have an idea of how much to trust the results obtained from the perturbed data. Such a trust label will be very useful when multiple parties share their data. In this paper we propose an adaptable perturbation model that uses less information to ensure privacy, overcoming the challenge posed in [10]. Furthermore, the flexibility of the perturbation model minimizes information loss. Information security plays a very important role in almost every aspect of data and application management, and privacy is an important aspect of it. Data mining with privacy preservation is vital for the wired world of today and of the future.


References:
[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, TX, May 14-19, 2000. ACM.
[2] D. Agrawal and C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247-255, Santa Barbara, California, USA, May 21-23, 2001.
[3] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th Conference on Knowledge Discovery and Data Mining (KDD'02), 2002.
[4] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Aug. 20-23, 2002.
[5] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving distributed data mining. SIGKDD Explorations, 4(2):28-34, December 2002.
[6] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2002.
[7] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24, 2000.
[8] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26, 2002.
[9] J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206-215, 2003.
[10] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In IEEE ICDM, 2003.
[11] B. Thuraisingham. Privacy constraint processing in a privacy-enhanced database management system. To appear in Data and Knowledge Engineering Journal, 2005.


A Robust Data-obfuscation Approach for Privacy Preservation of Clustered Data


Rupa Parameswaran and Douglas M. Blough
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA
[rupa,dblough]@ece.gatech.edu

Abstract
Privacy is defined as freedom from unauthorized intrusion. The availability of personal information through online databases, such as government records, medical records, and voter lists, poses a threat to personal privacy. Intelligent search engines and data mining techniques further exacerbate the problem of privacy by simplifying the access and retrieval of personal records. Data Obfuscation (DO) techniques distort data in order to hide information; one application area for DO is privacy preservation. Many data obfuscation techniques have been suggested and implemented for privacy-preserving data mining applications. However, existing approaches are either not robust to privacy attacks or they do not preserve data clusters, thereby making it difficult to apply data mining techniques. The absence of a standard for measuring the privacy provided by the various data obfuscation techniques makes it hard to compare their robustness. The main contributions of this paper are (1) to propose a data obfuscation technique, called Nearest Neighbor Data Substitution (NeNDS), that has strong privacy-preserving properties and maintains data clusters; (2) to define a property called Reversibility for the categorization and comparison of data obfuscation techniques in terms of their resilience to reverse engineering; and (3) to formally prove that cluster-preserving geometric transformations by themselves are extremely easy to reverse engineer.

1. Introduction

The concern over the privacy of personal and sensitive information has led to the implementation of several techniques for hiding, obfuscating, and encrypting sensitive information in databases. The need for privacy has led to the development of several data obfuscation (DO) techniques that provide privacy preservation at the cost of information loss. Most of these techniques cater to specific domains and perform well for a limited set of applications. In the absence of a standard for classifying DO techniques, comparison and performance analysis of the different techniques is not straightforward. The domain of interest in this research is data mining. Many data mining applications involve learning through cluster analysis. The term Usability refers to the usefulness of the transformed data; in this paper, usability is measured in terms of the preservation of the inherent clustering of the original data. The need for an obfuscation technique that preserves privacy as well as the usability of the transformed data has motivated the design, development, and preliminary performance analysis of a robust, cluster-retaining DO technique in this research. The paper proposes the use of the Reversibility property as a measure of privacy preservation. The privacy provided by the proposed data obfuscation technique, NeNDS, is evaluated and compared with other obfuscation techniques with respect to its reversibility and usability. The main contribution of this paper is the design, development, and analysis of the proposed DO technique NeNDS, as well as a hybrid geometrically transformed version called GT-NeNDS. The motivation for the choice of the DO technique and the description of the proposed technique are provided in Section 3. The definition of the Reversibility property, the classification of different transformation techniques based on reversibility, and the evaluation of existing DO techniques are provided in Section 4. An experimental analysis of NeNDS is carried out in Section 6 to study its cluster-preserving characteristics.

2. Motivation and Related Work

The abundance of information available online has resulted in the loss of individual privacy [5]. Several methods have been proposed and implemented for the privacy preservation of sensitive data sets. DO techniques [4] range from encryption-based techniques [1, 16] to geometric transformation schemes [12, 13]. In the case of encryption-based DO techniques, the data is unusable in encrypted form, and the decryption key for obtaining the original data is provided only to a limited set of users.


For several applications, it is necessary to provide different levels of precision of data based on the type of user requesting access. Data encryption does not provide this capability, as the data is either usable in its original form or completely unusable. Hence, for trend analysis and for statistical and inference-based computations over data sets, encryption-based security schemes add complexity without much benefit in terms of privacy. Geometric transformation schemes, on the other hand, are extremely vulnerable to privacy breaches and provide very little privacy. Other existing techniques include Data Randomization [2], Data Anonymization [17, 9, 19], and Data Swapping [15]. Data Randomization and Data Anonymization perform obfuscation by modifying the original data and do not address cluster preservation. They are also vulnerable to the notion of privacy breaches proposed in [6], which describes a privacy breach as the revelation of any property of the original data in the obfuscated data. One technique that proposes to preserve usability while preserving privacy is Geometric Transformation [12, 13]. While this technique does involve modifying the data, the inter-relations of the data elements within the data sets and across the fields are maintained even after obfuscation; however, geometric-transformation-based DO is very weak in terms of privacy preservation and unsuitable for use in sensitive databases. The concept of data swapping was first proposed in [15]. This technique intelligently swaps entries within a single field in a set of records so that the individual record entries are unmatched. The reflective nature of data swapping, however, makes it vulnerable to reversal.

The requirement of preserving privacy as well as the usability of sensitive data has led to the proposal and development of a robust DO technique called Nearest Neighbor Data Substitution (NeNDS). A hybrid version of NeNDS is also proposed here, called GT-NeNDS, which provides stronger privacy by combining geometric transformations with NeNDS.

The attack model for data obfuscation is different from the attack model for encryption-based security techniques, but no common standard has yet been established for DO; each proposed obfuscation technique uses a different form of comparison to assess the effectiveness of the approach. Existing work on the privacy analysis of DO techniques has primarily considered a model where the attacker correlates obfuscated responses with data from other publicly accessible databases in order to reveal the sensitive information of interest. In this work, we consider a model where the attacker uses side channels to obtain some partial information about the process used to obfuscate the data and/or some of the original data items themselves. The attacker can then use this partial information to attempt to reverse engineer the entire data set. To motivate this new attack model, we give two concrete examples where partial information can be revealed. In the first example, the

database is temporarily left open, i.e., without an obfuscation mechanism in place. Before the situation is detected, an attacker can access some unobfuscated data records. Clearly, if the database is extremely large and the problem is discovered quickly, only a small percentage of the database will be revealed in its original unobfuscated form. Such situations can occur due to programming errors, soft failures, or configuration (human) errors. The second example is a large distributed database, e.g. that of an international corporation with many data sites throughout the world. In this situation, an inside attacker will be able to access the unobfuscated information from one data site and might use it to try to reveal information from the remaining sites.

[Figure 1. Attack models for analysis: (a) the existing model, in which the attacker compares and correlates the obfuscated database with public (unobfuscated) data to cause a privacy breach; (b) the proposed model, in which the attacker additionally draws on a partial knowledge bank of process information and of data obtained through data leaks.]

One useful byproduct of this model is a measure of the robustness of a data obfuscation technique, namely the percentage of the unobfuscated data set that an attacker must know in order to learn the entire set. Using this new measure, we are able to demonstrate that many well-known data obfuscation techniques are highly vulnerable to reverse engineering through the unintentional release of only a small percentage of the unobfuscated data set. We also propose to use the amount of information required for reverse engineering as a measure of privacy preservation under this attack model.

3. Proposed Data Obfuscation Technique


This section provides a detailed description of the proposed DO technique, Nearest Neighbor Data Substitution (NeNDS). Applications of the proposed technique lie in sensitive databases that require data protection without loss of information content. Examples of such applications are medical records, as well as microdata released by the Census Bureau, where the privacy of individuals is as important as the correctness of the data provided to the end user [10]. The data substitution


technique proposed here preserves privacy by permuting elements among groups of data items that are close to each other. Data substitution is performed individually for each field (dataset) in the database, and each field is permuted independently of the rest. NeNDS can be used to transform any data set that has some notion of distance among its elements; in other words, any dataset that forms a metric space can be transformed using NeNDS.

3.1. Nearest Neighbor Data-Substitution - NeNDS


NeNDS is a lossless DO technique that preserves the privacy of individual data elements by substituting each of them with one of its neighbors in the metric space. A set of neighboring data elements are grouped together to form a neighborhood. The minimum number of neighbors in a neighborhood is specified by the parameter c, where 1 ≤ c ≤ N − 1 and N is the size of the data set. The minimum size of a neighborhood is thus c + 1, so that each data element in a neighborhood has at least c neighbors; hence the number of neighborhoods in a data set is given by NH_size = N/(c + 1). In the case c = 1, each neighborhood contains at least two neighbors, which in some cases reduces the substitution technique to data swapping; the reflective nature of data swapping makes it vulnerable to privacy breaches when some elements of the original data set are known a priori. In order to strengthen the privacy-preserving capability of NeNDS, c is therefore set to be greater than 1. The NeNDS process is explained here with an example database. Each field in the database is treated individually and NeNDS-transformed independently of the other fields. Let D_in represent the original database of m attributes and n records, and D_out the NeNDS-transformed database. The obfuscation technique performs substitutions on data items that lie close to each other within a single attribute field, so that the correlation of the data across the different attributes is not destroyed.
Table 1. Original database D_in

Age   Salary   Location
35    75,000   LA
37    80,000   NY
40    78,000   SJC
42    95,000   SFO

Table 2. NeNDS-transformed database D_out

Age   Salary   Location
40    80,000   LA
35    95,000   NY
42    75,000   SJC
37    78,000   SFO

The example in Table 1 shows a database D_in with 3 fields and 4 records. The transformed database D_out, in which two of the fields, Age and Salary, are transformed by NeNDS, is shown in Table 2. A simple substitution is applied to the entire dataset with c = 3 and NH_size = 1. The elements of each field are permuted independently such that all elements are displaced from their original positions; this is achieved using Algorithm 3.1. It can be observed that each record in Table 2 differs from the corresponding record in Table 1, but the actual values in the database are unchanged. For larger databases, each dataset is first divided into neighborhoods (of size at least c + 1), and each neighborhood is transformed using NeNDS. The substitution process in NeNDS determines, for each field, an optimal permutation subject to the following conditions: (1) no two elements in the neighborhood are simply swapped with each other, (2) every element is displaced from its original position, and (3) substitution is not performed among identical elements. These three conditions ensure that each element in the transformed dataset differs from the original. The restriction on swapping makes the transformed data robust to partial reversibility, which is a shortcoming of data swapping techniques. The cluster preservation of the data set depends strongly on the chosen value of c. Reducing the size of the neighborhoods, however, leaves fewer elements in each neighborhood, which fails to provide the necessary privacy protection. Selecting a small value of c results in a highly cluster-preserving database but limited privacy protection, while a large value of c might render the database less clustered as a result of substitution among neighbors that are farther away. The selection of c is specific to the nature of the database and the amount of protection required; the effect of c on the computation time and on the cluster preservation property is evaluated in Section 6. The neighborhoods created are likely to be of different sizes, depending on the number of identical elements in each. The algorithm uses a tree-traversal approach to obtain an optimum substitution pattern. The nodes of the tree correspond to the elements of a single data set, with the first element as the root of the tree. The children are ordered from left to right based on their nearness to the parent node, and the distance between parent and child is given along the edge connecting them. A depth-first search (DFS) is used to traverse the tree, and a maximum-edge-cost counter C_ME is maintained for each path being probed. An optimum substitution pattern is one with the least cost C_ME; the substitution corresponding to the chosen path is the permutation used to replace the original data set. Algorithm 3.1 shows the working of NeNDS: D_in is the input database with m attributes (fields) and n records, and the number of neighbors in each neighborhood, c, is the input parameter of the algorithm. Each individual dataset of the original and transformed database is denoted D_in^i and D_out^i, respectively, where i ∈ [1, m]. Each dataset is divided into NH_size neighborhoods, denoted NH_j, j ∈ [1, NH_size]. The recursive CreateTree procedure is then invoked to build a c-ary tree for each NH_j.


The procedure Ancestors(Tree, NH) returns all the ancestors of a specified node, and the procedure Identical(Parent, NH) returns all the entries in NH that are identical to the parent of the specified node. ChildrenTree holds the set of valid children of the parent node in Tree. The populated tree is then assigned to the variable Tree_j in Algorithm 3.1. All paths in Tree_j whose length equals the size of the neighborhood are candidates for substitution. The maximum edge distance C_ME is determined for each candidate path, and the procedure min(CandidateSet) identifies the path with the smallest C_ME as the optimum substitution pattern; this path is then assigned to NH_j'. The datasets (D_out^1, D_out^2, ..., D_out^m) form the transformed database D_out, where D_out^i = (NH_1', NH_2', ..., NH_NHsize'). NeNDS can be performed on any data set in which the elements are related by some notion of distance and can be expressed as a metric space; the algorithm is run for each field of the database that forms a metric space. A brute-force analysis of the DFS-based algorithm for finding the substitution pattern indicates an exponential order of complexity. However, the heuristic branch-and-bound implementation reduces the complexity to a much smaller value in practice, as indicated by the successful completion of NeNDS even for large data sets.

Algorithm 3.1: NeNDS(c)
1. For each i ∈ [1, m] do
   (a) NH_size = N/(c + 1)
   (b) D_in^i = (NH_1, NH_2, ..., NH_NHsize)
   (c) For each NH_j ∈ D_in^i do
       i.   Tree_j = CreateTree(NH_j, 0, NH_size)
       ii.  d_j = depth(Tree_j)
       iii. For each path_k in Tree_j of length d_j − 1:
            CandidateSet = CandidateSet + (path_k)
       iv.  NH_j' = min(CandidateSet)
2. D_out^i = (NH_1', NH_2', ..., NH_NHsize')

CreateTree(NH, Tree, Size)
1. If Tree = 0 then Tree = NH[0]
2. If NH = 0 then return Tree
3. ChildrenTree = NH − Ancestors(Tree) − Identical(Parent, NH)
4. Child(Tree) = Sort(ChildrenTree)
5. Tree = Child(Tree)

NeNDS ensures a robust framework for data mining applications by preserving all the information content needed for cluster preservation and by providing a secure, privacy-preserving framework for drawing inferences on the data. However, as NeNDS preserves the original values of the data even after transformation, it is still vulnerable to privacy breaches in the sense of [6]. This type of privacy breach may be unacceptable in highly sensitive databases. Section 3.3 presents a hybrid version of NeNDS that retains all the favorable characteristics of NeNDS while overcoming this shortcoming.
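A simplified Python sketch of the substitution step for a single field is given below. It is our own illustration: neighborhoods are formed by sorting, and each one is substituted by a single cyclic shift, which satisfies conditions (1) and (2) greedily instead of running the exact minimum-C_ME tree search of Algorithm 3.1.

import numpy as np

def nends_field(values, c):
    # Obfuscate one numeric field: split it into neighborhoods of roughly
    # c + 1 nearby values and replace each value by one of its neighbors.
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)            # nearby values are adjacent when sorted
    out = values.copy()
    size = c + 1
    for start in range(0, len(values), size):
        nh = order[start:start + size]    # indices of one neighborhood
        if len(nh) < 2:
            continue                      # leftover singleton: leave as is
        # One cycle over the whole neighborhood displaces every element and
        # contains no two-element swaps (conditions 1 and 2 of NeNDS).
        # Condition 3, avoiding substitution among identical values, would
        # need extra handling and is omitted in this sketch.
        out[nh] = values[np.roll(nh, 1)]
    return out

rng = np.random.default_rng(5)
ages = rng.integers(20, 60, 12)
print(ages)
print(nends_field(ages, c=3))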

3.2. Geometric Transformation Technique


An overview of the geometric-transformation-based DO proposed in [12, 13] is given here. This approach is of interest in data mining applications due to its inherent cluster preservation property; hence it will be used as a benchmark to evaluate the cluster retention capability of NeNDS. Transformations such as rotation, scaling, and translation are used to distort the data [7]. With geometric transformations, any pair of numerical fields in the database is interpreted as a two-dimensional space, and the coordinates of the data items are distorted by a geometric transformation; the approach scales to three or more dimensions without loss of generality. The database is denoted D_{d,n}, where d is the number of attributes and n the number of records or entries in the database. Translation, scaling, and rotation can all be implemented using matrix multiplication: each of the three transformations can be represented by the equation [X' Y']^T = A [X Y]^T + B, where A and B are the transformation matrices, (X, Y) is the original data, and (X', Y') is the result of the transformation. From this description it can be observed that each data set is distorted by the same amount relative to the placement of the individual elements in the set; in this way the clusters are maintained during obfuscation.
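A minimal sketch of such a transformation for one pair of numeric fields (Python with NumPy; the angle and offset are arbitrary illustrative parameters):

import numpy as np

def geometric_obfuscate(xy, theta_deg, offset):
    # [X' Y']^T = A [X Y]^T + B with A a rotation and B a translation.
    t = np.radians(theta_deg)
    A = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return xy @ A.T + np.asarray(offset, dtype=float)

rng = np.random.default_rng(6)
xy = rng.normal(size=(1_000, 2)) + rng.choice([0.0, 5.0], size=(1_000, 1))
xy_obf = geometric_obfuscate(xy, theta_deg=89.9, offset=(10.0, -3.0))

# The map is an isometry: pairwise distances, and hence clusters, are preserved.
print(np.isclose(np.linalg.norm(xy[0] - xy[1]),
                 np.linalg.norm(xy_obf[0] - xy_obf[1])))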


3.3. A Hybrid Data Substitution Approach


In this section, we propose a hybrid version of NeNDS. In this approach, termed GT-NeNDS, the data sets are first geometrically transformed and then operated upon by NeNDS; NeNDS provides a privacy-preserving wrapper on the geometrically transformed data. Transformation functions such as rotation and translation are isometric in nature, thereby preserving the cluster information of the data sets and retaining the nearest-neighbor information for the substitution step. In this way, the data can be transformed to a form suitable for use by a third-party analyst. The two-step transformation yields transformed data that preserve clustering information but bear no resemblance to the original database. As a result, GT-NeNDS is also robust to the notion of privacy breaches proposed in [6], making it a suitable candidate for privacy-preserving data mining.
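Putting the two steps together, a GT-NeNDS sketch for one numeric field might look as follows (again our own simplified, self-contained illustration: a 1-D affine map stands in for the geometric step, followed by the cyclic-shift substitution used in the NeNDS sketch above):

import numpy as np

def gt_nends_field(values, c, scale=1.3, offset=7.0):
    # Geometric step: an invertible affine transform of the field.
    v = np.asarray(values, dtype=float) * scale + offset
    # NeNDS step: nearest-neighbor substitution on the transformed values.
    order = np.argsort(v)
    out = v.copy()
    size = c + 1
    for start in range(0, len(v), size):
        nh = order[start:start + size]
        if len(nh) >= 2:
            out[nh] = v[np.roll(nh, 1)]
    return out

rng = np.random.default_rng(7)
salaries = rng.normal(80_000.0, 10_000.0, 10)
print(gt_nends_field(salaries, c=3))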




4. Reversibility - A Standard for Classification


The proposed DO technique, NeNDS, was described in Section 3.1. The absence of a standard for measuring and comparing the privacy provided by different DO techniques makes it difficult to evaluate their performance. The term Reversibility denotes the property of a DO technique that dictates the ease or difficulty of reverse engineering the obfuscated data. This property, first proposed in [4], specifies how robust a given obfuscation technique is in terms of hiding sensitive data; it is an indicator of the robustness of its privacy preservation. Cryptanalysis is used for analyzing the security provided by encryption-based techniques [18]. Since encryption is a deterministic and reversible process, cryptanalysis assumes the transformation to be deterministic as well as reversible; DO techniques have no such restriction and therefore require a new standard for analysis. An obfuscation technique that can be reversed with knowledge of the process is known as a process-reversible transformation function. Process-reversible DO techniques are analyzed with respect to their vulnerability to complete reversal given partial or complete a priori knowledge of the process used for DO. Process reversibility is sub-classified into the following categories.

1. Partial knowledge reversibility: a transformation function exhibiting this property can be reverse engineered with knowledge of either some of the original data entries, or a combination of some original entries and some information about the process used. The level of difficulty of the reversal depends on the DO technique. Obfuscation techniques that involve a one-to-one mapping between the original and the transformed data are vulnerable to partial knowledge reversibility. The reversibility analysis for linear and non-linear one-to-one transformations is provided in Section 5.

2. Random number reversibility: the original data set can be reverse engineered with knowledge of the process, the pseudo-random-number generator (PRNG), and the seed. Most obfuscation techniques invoke PRNGs to generate random sequences, and the robustness of DO techniques exhibiting this property lies in protecting the PRNG sequence. As long as the random seed and the sequence are unknown to the attacker, the obfuscated data is robust to reversal; once this information is revealed and the obfuscation process is known, the entire data set is compromised. Transformations that fall under this category cannot be analyzed using cryptanalysis due to

their non-deterministic nature.

Obfuscation techniques that result in a non-invertible transformation exhibit Irreversibility. For some of these techniques a maximum-likelihood reversibility estimate can be made, which provides an estimate of the confidence with which a guess can be made about the original data; cryptanalysis fails to account for such transformations as well. Irreversible techniques entail an inherent loss of information. Lossy compression techniques and data generalization techniques, which make it impossible to exactly recover the original data, fall under this category.

5. Reversibility Analysis
Section 4 provides a classification of transformation functions based on their reversibility property. Random data perturbation techniques are hard to reverse because they exhibit random number reversibility. Geometric transformations, being linear one-to-one transformations, can be reversed with knowledge of a finite number of original records. NeNDS involves a non-linear one-to-one transformation and hence can also be reversed with knowledge of a sufficient number of original records. In this section, we derive the minimum number of original records required to reverse engineer data obfuscated using geometric transformations and NeNDS.

5.1. Analysis of Geometric Transformations


Geometric transformations fall under the category of linear transformation functions. These functions are the DO techniques most vulnerable to partial reversibility. A cryptanalysis of linear geometric transformations shows them to be weak even against ciphertext-only attacks: knowledge of the type of obfuscation technique used results in an immediate reversal of the data. The linearity of this data obfuscation technique preserves the clustered nature of the data, but it also results in weak privacy protection. The assumption made here is that the attacker knows that the DO process is a linear transformation. In this case, we prove that for a database with d × n entries, where d is the number of attributes and n is the number of records, knowledge of only d + 1 linearly independent records of the original matrix is sufficient to uniquely determine the linear transformation. Once the transformation matrix is obtained, all the original data entries for which obfuscated values are available are compromised. Therefore the geometric transformations of [12, 13], being instances of linear transformation functions, are compromised by the knowledge of d + 1 linearly independent records of the original data [11].
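The reversal can be made concrete with a short sketch (Python with NumPy; our own illustration): given d + 1 affinely independent (original, obfuscated) record pairs, the transformation, and with it every original record, is recovered by solving a linear system.

import numpy as np

rng = np.random.default_rng(8)
d, n = 3, 1_000
A = rng.normal(size=(d, d))                     # secret linear part
B = rng.normal(size=d)                          # secret translation
X = rng.normal(size=(n, d))                     # original records
Y = X @ A.T + B                                 # obfuscated database

k = d + 1                                       # attacker knows d + 1 records
X_known = np.hstack([X[:k], np.ones((k, 1))])   # augment with 1s for the offset
M, *_ = np.linalg.lstsq(X_known, Y[:k], rcond=None)
A_hat, B_hat = M[:d].T, M[d]

# With A and B recovered, every original record is compromised.
X_rec = (Y - B_hat) @ np.linalg.inv(A_hat).T
print(np.allclose(X_rec, X))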


5.2. NeNDS versus Data Swapping


Data swapping as well as NeNDS fall under the category of non-linear bijective transformations. For this type of transformation, reversibility depends on the minimum number of records r that suffice for complete reverse engineering. In the case of data swapping, the minimum value of r is half the number of elements in the data set: for each element of the data set known a priori, the corresponding element involved in the swap is revealed. In the case of NeNDS, complete reversal of the entire data set requires knowledge of at least r = c distinct data elements for each neighborhood, where c + 1 is the minimum size of a neighborhood; even partial reversal of a neighborhood requires knowledge of c of its elements. The fraction c_i/(c_i + 1) determines the ease of reversal of a specific neighborhood i having exactly c_i + 1 elements. The robustness of the proposed obfuscation technique increases with larger values of c. For the case c = 1, data substitution reduces to data swapping, and the complexity of reversal drops to 1/2. For all values of c > 1, NeNDS provides a more robust DO, since complete reversal requires knowledge of at least (1/NH) Σ_{i=1}^{NH} c_i/(c_i + 1) of the data, where NH is the number of neighborhoods. This shows that the reversibility resistance of NeNDS is stronger than that of data swapping, making it a favorable candidate for use in public databases as well as Census records, where unmodified data values are favored. GT-NeNDS is a combination of a geometric transformation and NeNDS; hence the fraction of the original data required for complete reversal is greater than or equal to that required for NeNDS, along with added robustness to the notion of privacy breaches [6].

6. Experimental Results
This section provides an experimental analysis of the cluster-preserving performance of NeNDS. Geometric transformations are inherently cluster preserving and are therefore used as a benchmark for evaluating the performance of NeNDS with respect to cluster preservation. The datasets used for the performance analysis are obtained from the UCI Knowledge Discovery Archive [21] as well as from an open-source synthetic data generator [8]. The experiments are performed using the clustering toolbox in Matlab, and cluster analysis is performed by k-means, a partition-based clustering technique [3]: k-means takes as input the number of clusters k and selects k centroids in the data space representing the k cluster centers. A quantitative analysis of cluster preservation is carried out using the Misclassification Error (MCE), a measure of the percentage of legitimate data points that are not well grouped in the transformed data set.

The expression for the MCE, M_E [20], is M_E = (1/N) Σ_{i=1}^{k} (|Cluster_i(X)| − |Cluster_i(X')|), where N is the total number of records in the data set X (X ∈ D_{k,n}), k is the number of clusters into which the data are grouped, and |Cluster_i(Y)| is the number of points of Y in cluster i.

The selection of the number of neighborhoods is an important factor in NeNDS. The number of neighborhoods NH is expressed as N/(c + 1), where c + 1 is the neighborhood size and N is the size of the data set. The effect of the number of neighborhoods on the computation time of NeNDS and on the misclassification error after clustering is shown in figure 2. The data set size is 3000, and the X-axis represents the number of neighborhoods, in [1, 32]. In figure 2(a) it is observed that the computation time decreases exponentially as the number of neighborhoods increases; an increase in the number of neighborhoods leads to smaller neighborhoods, which results in an exponential decrease in the time taken for the tree-based search. The graph in figure 2(b) shows the variation of MCE% with the number of neighborhoods: the misclassification error is maximal for a single neighborhood and minimal when the number of neighborhoods corresponds to the inherent clustering degree of the data, which is 10 for this data set. The misclassification error increases slightly when the number of neighborhoods is increased beyond the inherent clustering degree. The actual MCE% values are very small, differing by only 0.02% between the extreme values in the figure; hence the effect of the number of neighborhoods on MCE% is almost negligible. The computation time for NeNDS depends on the number of neighborhoods and is better for smaller neighborhoods. The branch-and-bound search technique used by NeNDS reduces the computation time significantly compared to an exponential-time brute-force search, as evidenced by the fact that, even for a very large neighborhood size (3000 records), NeNDS yielded a solution within 1,700 seconds. Furthermore, figure 2 shows that the selection of the number of neighborhoods does not significantly affect the accuracy of the result; the neighborhood size can therefore be chosen based on the level of privacy required for the database. In order to evaluate the worst-case performance of NeNDS, the experimental evaluation in the rest of this section is carried out for a single neighborhood of size N.

Figure 3 shows the performance of the NeNDS-transformed data with respect to a rotational transformation. The data used for these graphs are generated using the synthetic data generator; D1, D2, and D3 represent the Salary, Commission, and Age fields of the synthetic database. The DO results are displayed for grouping parameter values of 2 (default), 5, and 15.
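The MCE measure can be computed, for instance, as in the sketch below (Python with scikit-learn's KMeans as one possible clustering backend; matching clusters by sorted size and taking absolute differences is our simplified reading of the definition above):

import numpy as np
from sklearn.cluster import KMeans

def mce_percent(X, X_obf, k, seed=0):
    # Misclassification error: per-cluster size differences between the
    # original and obfuscated data, with clusters matched by sorted size.
    def sizes(D):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(D)
        return np.sort(np.bincount(labels, minlength=k))
    return 100.0 * np.abs(sizes(X) - sizes(X_obf)).sum() / len(X)

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, 0.5, (300, 2)) for m in (0.0, 5.0, 10.0)])
X_obf = X + rng.normal(0.0, 0.1, X.shape)       # a mild perturbation
print(f"MCE% = {mce_percent(X, X_obf, k=3):.3f}")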


[Figure 2. Effect of varying the number of neighborhoods (10 clusters, 3000 records): (a) computation time (seconds) versus number of neighborhoods; (b) misclassification error percentage (MCE%) versus number of neighborhoods.]

The output clustering parameter C_qu in this case is the same as the inherent clustering of the original data. The angle of rotation between attributes D1 and D2 is 89.9 degrees, and between D1 and D3 it is 35.4 degrees. The database is inherently grouped into 5 clusters.

[Figure 3. Comparison of cluster preservation for attributes D2 and D3 (5 clusters): (a) original data, (b) rotated data, (c) NeNDS-transformed data.]

In this figure, a comparison of the clustering behavior with respect to attributes D2 and D3 is shown. As the number of clusters is small, the neighborhoods remain intact and the clustering error is very small. It can be observed that the rotational transformation has changed the shape of the clusters slightly, but the cluster strengths remain the same. The misclassification error would be minimal across the entire database only if all the attributes on which clustering is performed were rotated by the same angle; this, however, would weaken the privacy preservation capability of the transformation. NeNDS, on the other hand, performs consistently in all cases. A detailed experimental evaluation of the performance of NeNDS is provided in [14].

Table 3 summarizes the misclassification error for the different DO techniques. Two sets of experiments are performed for Random Data Perturbation (RDP), denoted RDP low and RDP high; the noise vectors (mean, var) are (0.0, 1) for RDP low and (0.0, 100) for RDP high. The angle of rotation for the rotation-based geometric transformation is 89.4 degrees. The value of c for NeNDS is kept at N − 1 in order to compare the worst-case performance of the algorithm. The size of the database used for comparison is N = 10,000, and the inherent clustering factor is C_in = 10. The error percentages resulting from k-means are used in the table, which provides a comparison of MCE as a percentage.
Table 3. Comparison of MCE %

Clusters   RDP low   RDP high   Rot. const   Rot. var   NeNDS (c=N-1)   GT-NeNDS (c=N-1)
2          0.0       10.1       0.0          0.0        0.0             0.0
3          0.03      25.02      0.05         0.10       0.08            0.11
5          0.10      36.1       0.08         0.17       0.11            0.13
10         0.21      40.5       0.18         0.24       0.20            0.22
20         0.25      40.5       0.40         0.45       1.60            2.18


It is observed that RDP low yields a very low MCE in all cases, because the amount of noise added is extremely small. RDP high performs poorly for all cluster sizes, whereas the other obfuscation techniques are comparable. Although rotation provides a smaller MCE percentage, its vulnerability to reverse engineering makes it unusable for DO of sensitive data. The two rotation columns show the performance of the algorithm for a constant rotation angle over the entire database and for a different angle selected for each transformation; the data obtained for the rotational transformation assume a 2-D rotation. The performance of NeNDS and GT-NeNDS is observed to be almost as good as that of the rotational transformation, and their robust privacy preservation capability makes them more suitable candidates for data protection. The performance of the obfuscation techniques degrades if the number of clusters requested is chosen much larger than the inherent clustering of the data, as can be noted in the case of 20 clusters, twice the value of C_in; the loss of information in this case is a necessary condition for privacy preservation, in order to prevent individual records from being exposed. The results indicate that NeNDS and GT-NeNDS yield cluster-preserving obfuscated data that are difficult to reverse engineer.





7. Conclusion
Table 4. Performance of DO techniques

Technique        Disp.      Reversibility        Clustering
RDP low          Very Low   Difficult            Good
RDP high         High       Difficult            Poor
Data Swapping    Low        Partial (1/2)        Good
NeNDS            Low        Partial (c/(c+1))    Good
Geometric        High       Easy                 Good
GT-NeNDS         High       Difficult            Good

The main contributions of this paper are: (1) the proposal of a robust DO technique for clustered data, (2) the definition of a standard for the classification of DO techniques, and (3) the demonstration of the weak privacy provided by existing obfuscation techniques such as linear transformations and data swapping. Table 4 provides a comparison of NeNDS and GT-NeNDS with existing DO techniques with respect to three parameters: displacement, reversibility, and cluster preservation. The first two parameters indicate the strength of privacy provided by the DO technique, while the third parameter is an indicator of the usability of the DO. Displacement is the average value of the MCE (MCE_Avg). A robust DO technique is one with high displacement that is difficult to reverse engineer and has good cluster preservation. Random data perturbation is difficult to reverse engineer, but the other two parameters depend on the amount of noise added: a large offset provides more displacement and better privacy but results in poor cluster preservation, while a small offset preserves clustering but yields data with very small displacement, leaving them vulnerable. Data swapping and NeNDS provide small displacements, dependent on the nature of the dataset, but are cluster preserving; NeNDS is more difficult to reverse than data swapping, which makes it the more robust technique. Geometric transformations displace the data substantially and also preserve clustering, but they are extremely easy to reverse, which makes them unsuitable for sensitive databases. GT-NeNDS provides cluster preservation and high displacement, and is also difficult to reverse, thereby proving to be a robust DO approach for the privacy preservation of clustered data.

References
[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order Preserving Encryption for Numeric Data. In Proc. of the ACM Special Interest Group on Management of Data, pages 563-574, Paris, France, June 2004. ACM Press.

[2] R. Agrawal and S. Ramakrishnan. Privacy-Preserving Data Mining. In ACM Special Interest Group on Management of Data, pages 439-450, 2000.
[3] http://www-2.cs.cmu.edu/awm/tutorials/kmeans.html.
[4] D. Bakken, R. Parameswaran, and D. Blough. Data Obfuscation: Anonymity and Desensitization of Usable Data Sets. IEEE Security and Privacy, 2(6):34-41, Nov-Dec 2004.
[5] D. Denning and M. Schwartz. The Tracker: A Threat to Statistical Database Security. ACM Transactions on Database Systems, 4:76-96, 1979.
[6] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. In Principles of Database Systems, San Diego, CA, June 2003.
[7] R. Gonzalez and R. Woods. Digital Image Processing. Addison-Wesley Publishing Company, 1992.
[8] http://www.almaden.ibm.com/software/quest/resources/datasets/syndata.html.
[9] W. Klosgen. Anonymization Techniques for Knowledge Discovery in Databases. In Proc. of the First International Conference on Knowledge Discovery and Data Mining, pages 186-191, Montreal, Canada, Aug 1995.
[10] R. Moore. Controlled Data-swapping Techniques for Masking Public Use Microdata Sets. SRD Report RR 96-04, U.S. Bureau of the Census, 1996.
[11] D. Moursund. Chebyshev Solution of n+1 Linear Equations in n Unknowns. Journal of the ACM, 12:383-387, July 1965.
[12] S. Oliveira and O. Zaiane. Privacy Preserving Clustering by Data Transformation. In Proc. of the 18th Brazilian Symposium on Databases, pages 304-318, Manaus, Brazil, Oct 2003.
[13] S. Oliveira and O. Zaiane. Achieving Privacy Preservation When Sharing Data for Clustering. In Workshop on Secure Data Management, in conjunction with VLDB 2004, Toronto, Canada, Aug 2004. Springer Verlag LNCS 3178.
[14] R. Parameswaran and D. Blough. An Investigation of the Cluster Preservation Property of NeNDS. Technical report, Georgia Institute of Technology, 2005.
[15] S. P. Reiss. Practical Data-swapping: The First Steps. ACM Transactions on Database Systems, 9:20-37, Mar 1984.
[16] R. Rivest, L. Adleman, and M. Dertouzos. On Data Banks and Privacy Homomorphisms. In R. A. DeMillo et al., editors, Foundations of Secure Computation, pages 169-179. Academic Press, 1978.
[17] P. Samarati. Protecting Respondents' Privacy in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6), 2001.
[18] W. Stallings. Network Security Essentials. Prentice Hall, 2000.
[19] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
[20] G. Toussaint. Bibliography on Estimation of Misclassification. IEEE Transactions on Information Theory, 20(4):472-479, July 1974.
[21] http://kdd.ics.uci.edu/.


Implementing Privacy-Preserving Bayesian-Net Discovery for Vertically Partitioned Data


Onur Kardes
Stevens Institute of Technology onur@cs.stevens.edu

Raphael S. Ryger
Yale University ryger@cs.yale.edu

Rebecca N. Wright
Stevens Institute of Technology rwright@cs.stevens.edu

Joan Feigenbaum
Yale University feigenbaum@cs.yale.edu

Abstract
The great potential of data mining in a networked world cannot be realized without acceptable guarantees that private information will be protected. In theory, general cryptographic protocols for secure multiparty computation enable data mining with privacy preservation that is optimal with respect to the desired end results. However, the performance expense of such general protocols is prohibitive when the technology is applied naively to non-trivial databases. The gap between theory and practice in cryptographic approaches is being narrowed, in part, by the introduction of problem-specific secure computation protocols. We describe our implementation of the recent Yang-Wright secure protocol for Bayes-net discovery in vertically partitioned data. Our development occasions the proposal of a general coordination architecture for the assembly of modularly described, complex protocols from independently implemented and tested subprotocol building blocks, which should facilitate future similar implementation efforts.

1. Introduction
The discovery of Bayesian networks in large bodies of personal data (medical, racial, ethnic, educational, financial, criminal, etc.) may be the key to scientific progress of great immediate benefit, informing public policy and even leading to breakthroughs in the understanding of underlying mechanisms. As simultaneous access to such varied data
Supported in part by NSF grant 0331584 and by the Wireless Network Security Center (WiNSeC) at Stevens Institute of Technology. Supported in part by ONR grant N00014-01-1-0795 and by US-Israel BSF grant 2002065. Supported in part by NSF grants 0219018, 0331548, and 0428422 and by ONR grants N00014-01-1-0795 and N00014-04-1-0725.

dispersed across separately maintained databases is becoming increasingly feasible technically, privacy concerns regarding this accessibility are forcing its curtailment in practice through ethical and legal hindrances; hence the efforts toward privacy-preserving data mining. Among the approaches in this area, the theory of secure multiparty computation [2] offers strong assurances of privacy preservation with no compromise of accuracy, although typically at a performance penalty that can only be mitigated through ingenuity in its application to particular problems.

Devising a protocol to achieve some desired accuracy, privacy, and performance characteristics is one step; implementing it is another. Implementation reveals possible theoretical gaps in the protocol design while raising new issues of software maintainability, deployability, and usability. Understandably, there is little incentive to implement a complicated protocol that will surely be impractical in its performance, which is why so little of secure multiparty computation theory has been implemented. However, available computing resources have become more powerful, bringing previously impractical protocols into new consideration and occasioning the development of software such as the Fairplay system [4], which implements the general two-party Yao protocol ([11], discussed in Section 1.2 below). At the same time, problem-specific secure protocols promising much better performance than general approaches have been proposed (notably, in the data-mining setting, the protocols of Lindell and Pinkas [3]), further encouraging interest in the prospect of practical implementation. Yang and Wright have recently presented just such a problem-specific secure protocol [8, 9], indeed posing, along with significant challenges, an invitation to implementors. The few points at which the protocol depends on general secure two-party computation involve very confined tasks, and the Fairplay system is now available to address


these. Our implementation of this protocol is the subject of this paper.

Yang and Wright adapt the K2 heuristic algorithm of Cooper and Herskovits for Bayes-net discovery [1] to the very relevant scenario of a logical database partitioned vertically (dividing up the non-key fields for the same logical records) between two parties who must not learn each other's data beyond what follows from the output of the protocol. To accomplish their adaptation, they first transform the K2 scoring function and then invoke several cryptographic technologies to compute and compare scores securely. We sketch the original K2 algorithm (Section 1.1) and introduce the cryptographic technologies that will be brought to bear (Section 1.2). We then describe the synthesis of these elements in the design of the Yang-Wright protocol (Section 2). With this background, we turn to the issues that arise in implementing this protocol (Section 3). Beyond the implementation issues relating to the specific subprotocols needed by the Yang-Wright protocol (Section 3.3), our experience in this project leads us to some broad observations and a development approach, including a subprotocol coordination framework (discussed in Section 3.1), applicable to complex protocols in general.

1.1. The K2 algorithm


Given a database table (the rows viewed as records representing entities, and the columns, each defining a record field, corresponding to attributes of the entities), the statistical relationships among the attribute values, generalized over the entire body of data, may be partially represented by a Bayesian network. Each node of the network represents an attribute; we speak of nodes and attributes interchangeably. The directed arcs incoming to any given node come from parent nodes, whose values are viewed as predictive of the values of the given node. This predictiveness is specified by a conditional probability table for the values of the given node, keyed by the possible joint value assignments to its parent nodes. The Bayes-net structure is the network without the nodes' conditional probability tables. Given the database and a Bayes-net structure, the conditional probability tables are determined. However, different choices of Bayes-net structure can produce Bayes nets, all accurate, that vary greatly in their predictive usefulness. Clearly, conditional probability tables that are sharply modal are best; those that mirror the unconditional attribute-value probabilities are unenlightening. The primary challenge, then, is to choose an optimal Bayes-net structure, i.e., an optimal identification of parent nodes for each node.

The K2 algorithm for Bayes-net structure discovery involves two main elements: a scoring function for candidate parent-node sets; and a greedy heuristic to constrain the combinatorics of exhaustively searching the space of candidate parent sets for each node and scoring each candidate set. The heuristic itself has two aspects, as we will outline presently: a fixed constraint on the size of the increments to candidate parent sets between rounds of exploration of the candidate-parent-set space; and a configurable constraint on the allowable size of candidate parent sets for eligibility for consideration. (A constraint on the set size does, of course, constrain the set-increment size. Cooper and Herskovits, however, have a more stringent constraint on the set-increment size in mind.) Without the heuristic, the scoring function could be the basis of an exponential-time algorithm to find the optimal (as must be precisely defined) Bayes net for the data. The scoring function always applies to a node and a candidate set of possible Bayes-net parents. The search for an optimally scoring parent set is conducted for each node entirely independently.

The search heuristic operates as follows. We begin with a linear ordering of all the nodes such that all directed arcs in the sought Bayes net will be consistent with this linear order. (The availability of this linear ordering is a major assumption.) For each node, we build its Bayes-net parent set incrementally from the nodes that precede it in the linear order, beginning with the empty set, always adding from among the unused candidates a single node (the heuristic fixed increment) that most improves the score, and aborting the incrementing if the parent set has grown to the stipulated maximum size (the configurable heuristic parameter) or if no single added node does improve the score. Note that it is perfectly possible that two nodes added at once would improve the score even though no single node can be added to improve the score. The heuristic fixed increment size amounts to a gamble that this is not so in the case at hand. This observation suggests a natural algorithm extension that we implement, as discussed in Section 3.2. In virtue of the fixed candidate-parent-set increment size between rounds, the running time of the discovery algorithm, as measured in applications of the scoring function, goes from being exponential to polynomial in the number of nodes.

Now, what of the computation of the scoring function itself? The scoring function for a node i and a candidate parent set $\pi$ looks like this:

$$\prod_{j=1}^{q} \frac{(d_i - 1)!}{(d_i - 1 + \alpha_j)!} \prod_{k=1}^{d_i} \alpha_{ijk}!$$

where j indexes the q possible value assignments to the nodes in $\pi$, k indexes the $d_i$ possible value assignments to node i itself, $\alpha_j$ counts the records in the database matching value assignment j to the nodes in $\pi$, and $\alpha_{ijk}$ counts the records that match value assignment j to the nodes in $\pi$ and additionally match value assignment k to node i itself. Noting that the outer product ranges over all value assignments to the nodes in the candidate parent set, we see that


it is here that the K2 heuristic needs to constrain the size of parent-set candidates to be considered, to avoid worst-case growth of the outer iteration count that would be exponential in the total number of nodes. It is suggestive and economical to grasp the operand of the outer product in the scoring function as precisely the inverse of the multinomial coefficient

$$\binom{d_i - 1 + \alpha_j}{(d_i - 1),\ \alpha_{ij1},\ \ldots,\ \alpha_{ijd_i}}$$

where the notation $\binom{n}{r,s,\ldots}$, with $n = r + s + \ldots$, denotes the number of combinations of n things taken exhaustively in bins respectively of sizes $r, s, \ldots$ (The usual "choose" notation, $\binom{n}{r}$, coincides with $\binom{n}{r,(n-r)}$.) It is easily seen that this expression is smallest, and its inverse biggest (but always the reciprocal of an integer!), when the bin sizes are in a sharply modal distribution. A sharply modal distribution in the bin sizes (the α-parameters) of the outer-product operand in the K2 scoring expression translates directly into a sharply modal distribution in row j of the conditional probability table for node i, which is just what we would like, as we have observed.

The α-parameters in the arguments to the factorial function represent counts of records matching partial field-value specifications, hence they may be as large as the number of records in the database. This is of little practical concern in ordinary computation. Even for a database of 100 million records, an approach as crude as looking up factorial values in a table would be feasible (if not recommended). On the other hand, in secure computation the practical options are much more limited, and so the approach to these factorials in the scoring function is at the heart of the Yang-Wright proposal.
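To make the score and the greedy search concrete, here is a minimal, non-private sketch in Python. It computes the K2 score in log space with exact log-factorials (via lgamma), so the potentially huge α-counts pose no numeric difficulty. The data layout (records as tuples of small integers, with d mapping each node to its number of values) and all names are our own illustration, not code from the paper or from Cooper and Herskovits.

```python
# A minimal, non-private sketch of K2 scoring and greedy search.
# Log space with exact log-factorials keeps large alpha-counts harmless.
from itertools import product
from math import lgamma

def lfact(n):
    return lgamma(n + 1)          # ln(n!)

def log_score(data, i, parents, d):
    """ln of prod_j [ (d_i-1)! / (d_i-1+alpha_j)! * prod_k alpha_ijk! ]"""
    total = 0.0
    # j ranges over all joint value assignments to the candidate parents;
    # with no parents there is a single, empty assignment.
    for assignment in product(*(range(d[p]) for p in parents)):
        rows = [r for r in data
                if all(r[p] == v for p, v in zip(parents, assignment))]
        alpha_j = len(rows)
        total += lfact(d[i] - 1) - lfact(d[i] - 1 + alpha_j)
        for k in range(d[i]):
            total += lfact(sum(1 for r in rows if r[i] == k))
    return total

def k2_parents(data, i, predecessors, d, max_parents):
    """Greedy single-node increments, as in Cooper and Herskovits."""
    parents, best = [], log_score(data, i, [], d)
    while len(parents) < max_parents:
        scored = [(log_score(data, i, parents + [c], d), c)
                  for c in predecessors if c not in parents]
        if not scored or max(scored)[0] <= best:
            break                 # no single added node improves the score
        best, c = max(scored)
        parents.append(c)
    return parents
```

Working in log space also previews the Yang-Wright move discussed below: sums of logarithms of factorials are exactly what the secure protocol goes on to approximate.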

1.2. Cryptographic tools


The Yao protocol for general secure two-party computation [10, 11] (see [5] for a detailed account) is the protocol that first demonstrated that general secure multiparty computation was possible. It requires the function to be computed to be represented as a Boolean circuit. A Boolean circuit is distinctive in doing the same computational work regardless of its inputs, a feature essential to the disguising of its inputs. This is wasteful when the amount of computation needed for different inputs differs significantly. Thus, the Yao protocol is inappropriate even just for the exact computation of factorials for arguments that range from very small to very large, as in the K2 scoring function. On the other hand, the generality of the Yao protocol allows it to be a fallback option when no specialized protocol has been devised and the task is small. The Yang-Wright protocol resorts to episodes of the Yao protocol in this capacity at three points. One instance, within the Lindell-Pinkas ln x subprotocol, is mentioned at the end of this section. The other two are described in Section 2. The implementation we use is the recent Fairplay system [4], which provides two facilities: (1) a Boolean-circuit generator that takes a high-level algorithmic description as input; and (2), taking a Boolean circuit as input, run-time software for the two parties that will engage in the protocol. The Fairplay circuit generator, currently implemented in Java, is resource-hungry itself and, more important, may produce circuits that could be significantly optimized to the benefit of protocol performance, so it is often best to develop a custom circuit generator and then use Fairplay to run the protocol. Our implementation uses custom circuit generators for two of the three Yao-protocol episodes.

The Yang-Wright protocol requires an encryption scheme with the following additive homomorphic property, where E encrypts, and suppressing the details of the ring(s) in which the operations take place:

$$E(m_1 + m_2) = E(m_1) \cdot E(m_2)$$

More accurately, it being essential that different encryptions of the same plaintext be possible, we need the following property, where D decrypts and $r_1, r_2$ are random values:

$$D(E(m_1, r_1) \cdot E(m_2, r_2)) = m_1 + m_2$$

The scheme used is one proposed by Paillier [6], as implemented using OpenSSL library functions by Subramaniam, Wright and Yang [7].

The α-parameters for the scoring, as said, represent counts of records matching partial value assignments to the database fields. With the database vertically partitioned as in the Yang-Wright setting, the logical records to be matched and counted span the local records of the parties. The fields to which values are assigned for matching are, in the general case, partitioned between the two parties. The count of matching records, then, is the scalar product of two bit vectors, the match vectors, each marking by 1-bits the matching local records (the local portions of the logical records) held by one of the parties. It is supremely important to keep these scalar products, the α-parameters essential to the scoring and hence to the whole computation, secret from both parties! Revealing them can be tantamount to revealing to one party the field values held by the other party for a particular logical record. Accordingly, a secure protocol is needed to compute the scalar product of binary vectors, leaving additive shares of the result, rather than the result itself, with the two parties for further computation.

Yang and Wright use a simple scheme based on additive homomorphic encryption. Let Alice and Bob be the two parties, where both can encrypt but Alice alone possesses the decryption key. Alice sends Bob a bit-wise encryption of her match vector.

Bob multiplies just those bit encryptions submitted by Alice that correspond to matching local records in his own data (or 1-bits in his match vector). (Bob may need to pad his response time to disguise his computation time, which will be proportional to his 1-bit count.) Bob further multiplies this product by the encryption of a random r and returns the result to Alice. Alice decrypts it to get her additive share of the scalar product, while Bob holds r (in the appropriate modulus) as his share.
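The following toy sketch illustrates these mechanics. The Paillier arithmetic here uses hard-coded demonstration primes and is in no way secure; it merely stands in for the OpenSSL-based implementation of [7]. The class and function names are ours, and we adopt the sign convention that Bob's share is the negation of his blinding value, so that the two shares sum to the scalar product.

```python
# Toy demonstration of the homomorphic scalar-product sharing above.
# Demo-sized primes, no security; illustrative only.
import math
import random

class ToyPaillier:
    def __init__(self, p=293, q=433):                  # demo primes only
        self.n, self.n2 = p * q, (p * q) ** 2
        self.g = self.n + 1                            # standard g = n + 1
        self.lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self.mu = pow(self._L(pow(self.g, self.lam, self.n2)), -1, self.n)

    def _L(self, u):                                   # L(u) = (u - 1) / n
        return (u - 1) // self.n

    def _unit(self):                                   # random r coprime to n
        while True:
            r = random.randrange(1, self.n)
            if math.gcd(r, self.n) == 1:
                return r

    def encrypt(self, m):
        return (pow(self.g, m % self.n, self.n2)
                * pow(self._unit(), self.n, self.n2)) % self.n2

    def decrypt(self, c):
        return (self._L(pow(c, self.lam, self.n2)) * self.mu) % self.n

def shared_scalar_product(alice_bits, bob_bits):
    """Additive shares (mod n) of the scalar product of two bit vectors."""
    pk = ToyPaillier()
    enc = [pk.encrypt(b) for b in alice_bits]          # Alice's match vector
    acc = pk.encrypt(0)
    for e, b in zip(enc, bob_bits):                    # Bob homomorphically
        if b == 1:                                     # sums Alice's bits at
            acc = (acc * e) % pk.n2                    # his 1-positions...
    r = random.randrange(pk.n)
    acc = (acc * pk.encrypt(r)) % pk.n2                # ...then blinds with r
    alice_share = pk.decrypt(acc)                      # = product + r (mod n)
    bob_share = (-r) % pk.n                            # shares sum to product
    return alice_share, bob_share, pk.n

a, b, n = shared_scalar_product([1, 0, 1, 1], [1, 1, 0, 1])
assert (a + b) % n == 2                                # two common 1-positions
```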
Oblivious polynomial evaluation is a basic cryptographic protocol component we have to implement to serve within the secure multiplication and secure natural-logarithm protocols, both of which are needed in the Yang-Wright version of the K2 scoring. In oblivious polynomial evaluation, one party has a secret polynomial and another party has a secret argument. Neither party may learn the other's secret, yet the argument holder is to learn the value of the polynomial at his argument. If the secret polynomial is $\sum_{i=0}^{k} a_i x^i$ and the secret argument is $b$, then, given an additive homomorphic encryption scheme for which the argument holder has the decryption key, the argument holder may send the polynomial holder the vector of encryptions of powers of his argument, $\{E(b^i, r_i)\}_{i=0}^{k}$; the polynomial holder can then compute and return

$$\prod_{i=0}^{k} \left(E(b^i, r_i)\right)^{a_i}$$

which the argument holder can decrypt to yield the value of the polynomial at the argument.

A secure multiplication protocol will be needed that takes additive shares of the factors and leaves additive shares of the product with the respective parties. This may be accomplished easily through two oblivious polynomial evaluations, as shown in [3].

The last major protocol building block we need is a secure protocol taking additive shares of x as party inputs and leaving the parties with additive shares of ln x. Such a protocol is provided by Lindell and Pinkas [3] and is our most intricate building block. It begins with an episode of Yao-protocol computation establishing the logarithm approximately, then proceeds to an oblivious polynomial evaluation to compute some number of terms of a Taylor expansion to reduce the initial error. The result emerges scaled up by a publicly known factor to retain precision while always computing in integers.
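Under the same caveats, the oblivious polynomial evaluation can be sketched by reusing the ToyPaillier class from the previous example (the real protocol additionally rerandomizes the returned ciphertext and runs at cryptographic parameter sizes):

```python
# Oblivious polynomial evaluation with the ToyPaillier class above
# (illustrative only).
pk = ToyPaillier()
coeffs = [5, 3, 2]               # polynomial holder's secret: 5 + 3x + 2x^2
b = 7                            # argument holder's secret argument

# Argument holder sends encryptions of the powers b^0, ..., b^k.
enc_powers = [pk.encrypt(pow(b, i)) for i in range(len(coeffs))]

# Polynomial holder raises each E(b^i) to a_i and multiplies; by the
# additive homomorphism the result encrypts sum_i a_i * b^i.
c = pk.encrypt(0)
for a_i, e in zip(coeffs, enc_powers):
    c = (c * pow(e, a_i, pk.n2)) % pk.n2

assert pk.decrypt(c) == 5 + 3 * 7 + 2 * 49           # = 124 (mod n)
```

One standard route from here to secure multiplication, following the general approach of [3], rests on the expansion $xy = (x_A + x_B)(y_A + y_B)$: the terms $x_A y_A$ and $x_B y_B$ are locally computable, and each cross term can be left in additively shared form by one oblivious evaluation of a blinded linear polynomial (e.g., $Q(z) = y_B z - r$ evaluated at $x_A$).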

2. The Yang-Wright protocol


Returning to the K2 scoring function, how can a factorial be computed securely from secret shares of the argument? We have already observed that a Yao-style computation of a Boolean circuit, whether it carries out the multiplication or looks up the result in a large table, would not be practical, because too large a circuit would be required, necessarily to be traversed entirely for every factorial invocation.

The Yang-Wright protocol addresses this problem by replacing each factorial in the scoring function with a Stirling approximation: for $n \geq 1$,

$$n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n$$

Next, since scores are important only in their ordering, the natural logarithm of the entire Stirling-approximated expression is taken as the scoring function to implement securely. The transformed function is then amenable to secure computation from additive shares of the α-parameters using the Lindell-Pinkas protocols just mentioned.

The protocol has a superstructure that involves no private information and tracks the K2 algorithm almost precisely. The one difference is that scores cannot appear in the clear for comparison at the top level, as they are too revealing of private data. Instead, the top level is aware only of the chosen best score-improving increment, if there is one, to the parent set being grown. All else that is known at the top level of the protocol is considered safely disclosable to both parties: the parent set that has already been established for a node; which additional node is being scored; and which of all the additional nodes tried, if any, has been chosen for inclusion in the parent set. This information is considered disclosable on the premise that the parties could reconstruct these stages of the progress of the algorithm from the end result anyway.2

The parties are particularly not to know the α-parameters that feed the scoring. This means that the α-parameters must be computed cooperatively by the parties so as not only to hide each other's inputs, which are the identities of their respective matching private records, but also to disguise the α-parameter outcomes, which are the counts of matching logical (cross-party) records, as well. This is accomplished by the secure scalar product protocol described above.

The transformation of the scoring as described (Stirling approximation, then natural log) leaves a gap in the specification in failing to attend to the cases where $\alpha_j$ or $\alpha_{ijk}$ is 0. The Stirling approximation formula does not apply in this case. The 0 value here means simply that no record matches a partial field-value configuration being considered, which is a perfectly normal state of affairs. Clearly, this case must be handled differently. The challenge is that the parties must not realize that different handling has been triggered. A tip that, in some instance, the value of $\alpha_{ijk}$ is 0 can be tantamount to specific information regarding the value of a field held by the other party for a specific record.
2 Strictly, the parties could not necessarily reconstruct in which round of consideration each parent node was included, so the Yang-Wright protocol, by revealing the progress of the growth of the parent sets, even without revealing complete score-based orderings (let alone scores themselves) within the rounds of parent-set-increment consideration, presumably does reveal some non-result-implied private information. This leak should be small. Remedying it would be extremely expensive.
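As a quick, illustrative sanity check on the approximation, one can compare the exact $\ln n!$ with its Stirling replacement:

```python
# Exact ln(n!) via lgamma against ln of sqrt(2*pi*n) * (n/e)^n. The
# worst relative error in n! is at n = 1, where the approximation
# undershoots by roughly 8%.
from math import lgamma, log, pi

def ln_fact_stirling(n):
    return 0.5 * log(2 * pi * n) + n * (log(n) - 1)

for n in (1, 2, 10, 100, 10**6):
    print(n, lgamma(n + 1), ln_fact_stirling(n))
```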


Two resolutions for this problem come to mind, both involving interpolation of an additional Yao-protocol episode and adjustments to the algebraic manipulations of the transformed scoring formula. The first resolution invokes a Yao episode immediately after returning from the scalar product computation that yields $\alpha_{ijk}$. Observing that 0! = 1!, securely check whether the shares of the scalar product are shares of 0; if they are, replace them with shares of 1; if not, reshare the sum of the shares. The weakness of this resolution is that in the common 0 case it proceeds to calculate the Stirling approximation of 1!, which is the least accurate instance of this approximation, undershooting by 8%. Whereas a value configuration not instantiated in the candidate parent set would, in original K2 scoring, either not be considered at all or put a 1 to no effect in the product (which should translate to a clean 0 in the summation of the logarithms), the promotion of 0 to 1 in this secure version introduces additional error. The alternative resolution is to invoke a similar Yao episode, but to do so late, after all the protocol for securely computing (an approximation of) the ln of (the Stirling approximation of) $\alpha_{ijk}!$ has run, possibly quite inappropriately. We feed the corrective Yao episode both the newly obtained shares of the result and the original shares of $\alpha_{ijk}$, as well as random values from the two parties. If the protocol finds that we started with $\alpha_{ijk}$ shares that are shares of 0, it returns new shares of 0; otherwise, it returns new shares of the elaborately computed result.

A Yao protocol is again used when deciding which candidate parent node most improves the score of the parent-node set. The inputs are vectors of score shares from the two parties. The output is the index of the (first) best score.

3. Implementation
The original presentation [8] of the Yang-Wright protocol addressed only binary data in the database fields. We currently implement that version, although almost no change is necessary to implement the more recent, general version [9].

3.1. The coordination architecture


Our implementation takes the unusual course of positing a seemingly extraneous role of coordinator for distributed computations. When considering security in distributed computations, we often do imagine, at least for theoretical comparison, an added role of trusted party. By definition, such an added party can be depended upon by the principal parties to compute and communicate as required, and particularly to refrain from communicating any more than is required (and so it can be resorted to straightforwardly

for a benchmark, ideal solution to any secure-multiparty-computation problem). In contrast, the coordinator we are envisioning is a party that, at least in this role, assumes absolutely no responsibility toward the principal parties whose protocol activity it coordinates, whether in computation, communication, or discretion with confidential information. On the contrary, the principal parties who may have privacy concerns should think that information that has reached the coordinator has become public thereby; in a sense, the coordinator may be taken to represent the public at large to them. The advantage in having a trusted party, if one can be found, is clear. Why would we add a party that is not to be trusted?

There is a definite change in orientation here as to who is responsible for what, and it turns on appreciating, in the first place, that as we move from theorizing about a large, intricate protocol to implementing it, we are moving squarely into the realm of software engineering. We need modularity not only in the design but also in the coding and testing, for all the reasons that apply in developing any software. We need the modules to know as little as possible of the world outside themselves, interfacing with each other minimally. The more outside awareness an individual code module has, the more difficult it is to keep it up to date and deployed to the agents running it as changes occur elsewhere in the code, whether in the way of enhancements or bug fixes.

Now, consider what it takes to run the Yang-Wright protocol. The protocol involves several subprotocols, each of which runs many times in the course of a single run of the overall protocol. As it happens, this protocol is completely synchronous. There is no indeterminacy in the sequence of the communication, so two non-faulty parties cannot have doubts as to who is waiting for whose message and where they stand in the protocol. This means that if the database owners both know the entire protocol, with no version discrepancies, and if they know exactly when a run is to commence, and if no additional messages could possibly appear on the channel between them (for whatever reason), and if no messages between the parties get dropped, then they should be able to step through the whole protocol and ultimately both output the computed Bayes-net structure. On one hand, this is the manner of execution envisioned during the design of the protocol at the theoretical level. On the other hand, each of the run-time assumptions just enumerated introduces fragility that is unacceptable in real deployed software.

Instead of requiring that the database owners themselves know the whole protocol, we envision them as being willing and able to run the particular needed subprotocols on cue as discrete services available to certain requestors. Note that this notion of service is more elaborate than the one referred to in common client/server terminology. We are imagining a service provided not by a single party but by multiple parties


who are expected, when cued, to engage each other in some protocol in order to return the sought result, each party, or at least some quorum, reporting back to the client. This is not directly supported in our networking infrastructure, and so it must be built on top of the common client/single-server model. Given that all the subprotocols in question are acceptable to all the principal parties with respect to their privacy concerns, this leaves it up to any interested party authorized to request their services to invoke any of those subprotocols in any order to achieve whatever larger end, or fail to achieve it. That shifts the onus of getting the overall protocol right to some one party, the coordinator, that is cuing the principal parties to engage in subprotocols and is then somehow processing the results. We can imagine that the coordinator is the party that has the primary interest in the result, or may have a secondary interest in virtue of receiving payment for provision of the result to the party with primary interest, or may have a farther removed interest still, of course.

If the database owners are to offer discrete lower-level services, they should be able to do so in support of multiple concurrent runs of larger protocols. Otherwise, a long-running higher-level protocol will either be subject to interference by other requests to the database owners or else, if a locking scheme is used, the long-running protocol would monopolize services that should really be a general resource. Interleaving multiple provisions of the discrete services, possibly involving interaction with some of the same peer parties and possibly even on behalf of the same client party, entails keeping the messages associated with the different concurrent conversations properly sorted out. (Note that this is not a concern over possible information leakage across multiple concurrent conversations carried on by the same parties, but rather the more basic requirement that all the communication involved, in the first place, be attributable to the distinct conversations.) The challenge here is entirely familiar from basic networking, wherein order is maintained in the cloud of message exchanges through reliance on metainformation in successive layers of message wrappers.

Our coordination protocol requires similar metainformation. Parties need to know who their peers are (practically, domain names or IP addresses and port numbers) for each requested multiparty computation episode. Parties need to know which type of computation to engage in, and they need the inputs to the computation. The individual episode of the computation needs an identifier assigned to it and passed around throughout. The coordination protocol must accommodate error propagation or non-propagation. In the broader realm of multiparty computation, the coordinator for a subprotocol may need to implement timeouts for individual parties and appropriate quorum and consistency threshold checks for the available party responses to afford promised correctness

and robustness to the overall protocol. This suggests that, as we bring more of distributed-computation theory into practice, enhancement of a general coordination protocol will be an ongoing development project.

The present Yang-Wright implementation relies on a basic coordination protocol implemented in a library of Perl functions. Thus, the highest level of our implementation is expressed in a relatively small amount of Perl code in the coordinator. Most coordination housekeeping is done inside the coordination-library functions, allowing the code to read much like the pseudo-code for the original K2 protocol. In fact, the coordinator code can be used unchanged to run the original K2 against a single physical database, the original K2 against a logical database vertically partitioned among any number of parties with no privacy concern, or the Yang-Wright privacy-preserving K2 variant. (To make the coordinator code this general, we do need it to deviate slightly from the K2 pseudo-code. We keep parent-set scores and any intermediate values from their computation down at the principal-party code level. Because in the privacy-preserving case these values must not be public, we keep them from the coordinator in all cases for uniformity. The parties, on cue, determine best-score parent sets between themselves, either securely or not, and report them to the coordinator.)

The principal parties similarly run a thin layer of Perl code which interacts with the coordinator through invocation of a coordination-library function. It is at this level that the Yang-Wright security proposals are brought in. The parties provide the various requested services to the coordinator by engaging in secure subprotocols between themselves, whereas they could satisfy the coordinator equally by engaging in the simpler insecure subprotocols. This is appropriate. We imagine that the database owners are the ones that have the interest, intrinsic or however induced, in privacy; else privacy is pretty hopeless! The coordinator just wants the answers. The principal parties run their ends of each subprotocol as command-line-invokable processes. (The overhead of process invocation is not significant, especially when running secure subprotocols, as it is completely dominated by the computation to be done by at least one of the parties in each subprotocol episode.) This means that all the subprotocol code is available for direct testing and use from the command line, just as is the overall protocol.

Nothing in the separation of roles in the coordination architecture precludes an agency acting both as coordinator of a multiparty computation and as one (or more) of the parties whose interactive episodes are being coordinated. In the case of the Yang-Wright protocol, one of the private-data owners may also run the coordinator process, on the same machine that runs the party process or on another machine.
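By way of illustration, the episode metainformation discussed above might be rendered as follows; the field names and values here are hypothetical, and in particular this is not the wire format of our Perl coordination library.

```python
# Illustrative rendering of per-episode coordination metainformation.
from dataclasses import dataclass, field

@dataclass
class EpisodeRequest:
    episode_id: str      # keeps concurrent conversations distinguishable
    protocol: str        # which subprotocol the parties should engage in
    peers: list          # (host, port) of every party in this episode
    client: tuple        # where the result is to be reported
    params: dict = field(default_factory=dict)   # subprotocol inputs

req = EpisodeRequest(
    episode_id="bayes-run-17/score-003",
    protocol="scalar_product",
    peers=[("alice.example.org", 9001), ("bob.example.org", 9002)],
    client=("coordinator.example.org", 9000),
)
```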


We have said that the coordinator is not to be entrusted with private data. The data owners must in any case, wherever the coordinator is to run, assure themselves that the code they run as parties proper, in communicating with the coordinator process, not pass it revealing information. (This is in addition to having faith in the security of the subprotocols that they run with each other without the coordinator's intervention.) A data owner assuming the coordinator role as well must also be assured that the coordinator code not somehow access the local private data without mediation of the party process. From this perspective, there is an advantage to running the coordinator process physically removed from the data-accessing party process. On the other hand, scrutiny of the coordinator code (which should generally, as in this case, be relatively simple) should allay any concern over possible rogue data access.

3.2. An extension to K2
Our coordinator code, which we said is not specific to the Yang-Wright security-oriented transformation of K2, actually implements an extension to K2 itself, allowing further tuning of its complexity-controlling heuristic. The desirability of the enhancement became evident in the course of our experimentation. If, for example, two binary fields in the database are in themselves uniformly randomly distributed and the two fields are independent of each other, and if a third field is completely determined by the first two fields, being 1 if the first two fields match and 0 if not, then the K2 algorithm is likely not to discover any Bayesian structure in the three data fields. This is because the first two fields are tried only one at a time as members of a prospective Bayes parent set for the third field, but neither field is kept because, in itself, it is completely unpredictive of the value of the third field. We add a parameter which we call interactivity. It controls the maximum number of nodes to be tried together in incrementing the parent set for a node. The idea is that it may be the interaction of several antecedent nodes, rather than any one in itself, that is predictive of the consequent node in question. The original K2 algorithm is obtained as the special case of interactivity parameter 1. Raising this parameter improves the analytical capability of the algorithm at the expense of raising the degree of its polynomial time complexity. This enhancement is mentioned here as a feature of the present implementation; it is orthogonal to the Yang-Wright security enhancements to K2 and their special implementation issues, which are our main focus.
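The change to the search is confined to the candidate-increment generator of the greedy loop. A sketch, in the terms of the k2_parents routine of Section 1.1 (our own rendering of the extension, not the Perl code):

```python
# Instead of single nodes, all non-empty subsets of unused predecessors
# up to the interactivity size are tried; interactivity = 1 recovers K2.
from itertools import combinations

def increments(unused_candidates, interactivity):
    for size in range(1, interactivity + 1):
        yield from combinations(unused_candidates, size)

# With interactivity = 2, the XOR-like third field of the example above
# can acquire both of its parents in a single increment.
```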

3.3. Subprotocol issues


The coordination architecture addresses only the generic needs of subprotocol marshaling. Specific difficulties arise in ensuring that subprotocols nest and tile properly with respect to their data interchange. For instance, choices of moduli and bit lengths in different subprotocols must match up acceptably. Within the Lindell-Pinkas computation of shares of ln x from shares of x, there is an oblivious polynomial evaluation for which we use a Paillier homomorphic encryption scheme. The homomorphism entails that additions and multiplications of plaintext numbers modulo some product of primes pq are carried out, respectively, as multiplications and exponentiations modulo $(pq)^2$ of their encryptions. For Lindell-Pinkas, however, the polynomial is to be computed obliviously modulo $|F|$, the size of F, the latter prescribed to be a field. The requirement for a Paillier plaintext modulus and the requirement for a Lindell-Pinkas modulus, then, would appear to be incompatible. Fortunately, examining the Lindell-Pinkas polynomial specification, we see that we need multiplicative inverses only of powers of 2. These inverses would exist modulo the Paillier pq, so we simply proceed letting the Paillier plaintext ring serve for the Lindell-Pinkas computation, dropping the field requirement.

The polynomial-computation space needed within the Lindell-Pinkas natural-log protocol, hence the Paillier pq for the homomorphic-encryption plaintext space, must be much larger than the space of allowable inputs for the protocol, to avoid loss of information in the polynomial evaluation. The space of inputs, in turn, must accommodate all α-parameters, which may be as large as the number of records in the database. Collecting the various requirements, then, where s is the size in records of the database being mined, we need an integer N, a count of Taylor terms k for Lindell-Pinkas, and primes p, q for Paillier that will determine also the ring (not field) F for Lindell-Pinkas, such that

$$s < 2^N \quad \text{and} \quad 2^{(N+2)k} \leq |F| = pq$$

We give two examples of the interplay of these conditions. To accommodate databases of up to 8,000 records and go to four Taylor terms, we can use a Lindell-Pinkas ring of size approximately $2^{60}$. To accommodate databases of up to 8,000,000 records and go to five Taylor terms, we need a Lindell-Pinkas ring of size $2^{125}$. As usual, more bits strengthen the cryptography, but incur a performance penalty, particularly evident in our case in the episodes of Fairplay-implemented Yao protocol.
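These conditions are easy to check mechanically. The helper below (ours, purely illustrative, assuming the reconstruction of the inequalities above) reproduces the two quoted examples:

```python
# Minimal check of the parameter conditions: s < 2^N and |F| = pq of
# at least (N + 2) * k bits.
def required_ring_bits(s, k):
    N = s.bit_length()            # smallest N with s < 2^N
    return (N + 2) * k

assert required_ring_bits(8_000, 4) == 60        # ring of size ~2^60
assert required_ring_bits(8_000_000, 5) == 125   # ring of size ~2^125
```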


As mentioned, the natural logarithm delivered in shared fashion by the Lindell-Pinkas protocol is scaled up by a large factor in order to preserve precision in the integer output. It is not practical to have the parties engage in a secure protocol episode, for each logarithm computed, to replace their shares with new shares, scaled back down, of the correct logarithm value. The Yang-Wright score expression involves many privately shared terms, each of which involves a Lindell-Pinkas-computed logarithm and would thus be scaled up, and terms that are public with no scale-up. Since scores are important only in their comparison to other scores, we need only multiply the public terms in each score by the scaling factor to scale up the entire system of scores, preserving their comparisons.

4. Some experimental results


The running time of the Yang-Wright protocol depends on the number of fields in the database, the number of records in the database, and the size of the computational spaces chosen to accommodate the database size, as reflected in the bit lengths of Paillier keys and computation values and in the input-wire and gate counts of Yao Boolean circuits. Figures 1 and 2 show how the program modules behave with different key lengths and numbers of records.

Figure 1. Key length and total running time

Figure 2. Database size and scalar product time

The total running time is dominated by the Lindell-Pinkas secure computation of ln x (Figure 3). The secure ln x computation itself is dominated by its initial Fairplay Yao episode, which is far more expensive than the oblivious polynomial evaluation (Figure 4). Not surprisingly, then, increasing the number of Taylor terms computed in the oblivious polynomial evaluation has little impact on the ln x running time (Figure 5).

Figure 3. Distribution of overall running time

Figure 4. Distribution of running time in secure ln x computation

Figure 5. Terms and oblivious polynomial evaluation time

4.1. Visualization

The implementation provides a graphical user interface allowing the user to view the progress of the algorithm. The partitioning of the nodes between the two parties is indicated by a color code, Alice's nodes appearing in pink, Bob's in blue. The nodes carry text labels derived from a configuration table. At any time, the node whose parent set is currently being determined is highlighted. A parent-child arrow appears in red while it is being considered; the arrow disappears if the candidate parent node is rejected (in that iteration); it turns blue and becomes permanent if the candidate is accepted. For development purposes, presumably against test data, the user may click on a node to display its conditional probability table (data not properly revealed in the structure-learning-only version of the Yang-Wright protocol implemented here).


5. Conclusion

This project, implementing the Yang-Wright secure Bayes-net discovery protocol, attempts to negotiate a range of issues involved in turning a theoretical cryptographic protocol into usable privacy-preserving data-mining software. Considerable success has been achieved in understanding how such software might be deployed and invoked on demand. We have identified various specific and general pitfalls, easily missed in early design, when assembling a complex cryptographic protocol out of building blocks. We come away with a methodology for building complex protocols (particularly privacy-preserving protocols) and software implementing a coordination framework that will be reusable in future projects translating complex theoretical protocols to practice. The coordination framework itself is an arena for further development.

With respect to privacy-preserving discovery of Bayes-net structure via secure multiparty computation, performance remains a central concern. Our implementation currently takes over two hours to determine the Bayes-net structure in a six-node database of 1,000 records partitioned between two parties, three nodes each, computing with a key length of 512 bits on a Pentium 4 processor with 1 GB of RAM. On one hand, this seems very slow. On the other hand, the present technology already can render research of Bayes-net structure feasible against private data. What can be learned thereby may well justify the computational resources and the wait for the results, even at this level of performance. This is an initial report on the implementation work. There is every reason to believe that further work will yield optimizations at various levels and that performance will improve.

References

[1] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.
[2] O. Goldreich. Foundations of Cryptography, Volume II: Basic Applications. Cambridge, 2004. Chapter 7, General Cryptographic Protocols.
[3] Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177–206, 2002.
[4] D. Malkhi, N. Nisan, B. Pinkas, and Y. Sella. Fairplay: a secure two-party computation system. In Proc. of the 13th Symposium on Security, pages 287–302. Usenix, 2004.
[5] M. Naor, B. Pinkas, and R. Sumner. Privacy preserving auctions and mechanism design. In Proc. of the 1st Conference on Electronic Commerce (EC), pages 129–139. ACM, 1999.
[6] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology, EUROCRYPT '99, volume 1592 of Lecture Notes in Computer Science, pages 223–238. Springer-Verlag, 1999.
[7] H. Subramaniam, R. N. Wright, and Z. Yang. Experimental analysis of privacy-preserving statistics computation. In Proc. of the VLDB Workshop on Secure Data Management (SDM), volume 3178 of Lecture Notes in Computer Science, pages 55–66. Springer-Verlag, 2004.
[8] Z. Yang and R. N. Wright. Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In Proc. of the 10th SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 713–718. ACM, 2004.
[9] Z. Yang and R. N. Wright. Privacy-preserving computation of Bayesian networks on vertically partitioned data. Manuscript, http://www.cs.stevens.edu/~rwright/Publications/bayes.ps, 2005.
[10] A. C. Yao. Protocols for secure computation. In Proc. of the 23rd Symposium on Foundations of Computer Science (FOCS), pages 160–164. IEEE, 1982.
[11] A. C. Yao. How to generate and exchange secrets. In Proc. of the 27th Symposium on Foundations of Computer Science (FOCS), pages 162–167. IEEE, 1986.

Collaborative Recommendation Vulnerability To Focused Bias Injection Attacks


Robin Burke, Bamshad Mobasher, Runa Bhaumik, Chad Williams
Center for Web Intelligence, DePaul University
School of Computer Science, Telecommunication, and Information Systems
Chicago, Illinois, USA
{rburke, mobasher, rbhaumik, cwilli43}@cs.depaul.edu

Abstract
Significant vulnerabilities have recently been identified in collaborative recommender systems. Attackers who cannot be readily distinguished from ordinary users may inject biased data in an attempt to force the system to adapt in a manner advantageous to them. Researchers have studied simple attack models and their impact on a system's population of users. In this paper, we examine attacks that concentrate on a targeted set of users with similar tastes, biasing the system's responses to these users. Not only are such attacks more pragmatically beneficial for the attacker (since a particular item can be pushed to those most likely to buy it), but as we show, such attacks are also highly effective against both user-based and item-based algorithms. As a result, an attacker can mount such a segmented attack with little knowledge of the specific system being targeted and with strong likelihood of success.

1. Introduction
Recent research has begun to examine the vulnerabilities and robustness of different recommendation techniques, such as collaborative filtering, in the face of what has been termed "shilling" attacks [2, 1, 5, 6]. Our preferred term is profile injection attacks, since promoting a particular product is only one way such an attack might be used. In a profile injection attack, an attacker interacts with the recommender system to build within it a number of profiles associated with fictitious identities, with the aim of biasing the system's output.

It is easy to see why collaborative recommendation is vulnerable to profile injection attacks. A user-based collaborative recommendation algorithm collects user profiles, which are assumed to represent the preferences of many different individuals, and makes recommendations by finding peers with like profiles. If the profile database contains biased data (many profiles all of which rate a certain item highly, for example), these biased profiles may be considered peers for genuine users and result in biased recommendations. This is precisely the effect found in [5] and [6].

Researchers who have examined this phenomenon have concentrated on broad attack models, whose profiles contain ratings across the spectrum of available objects, and have measured their results by looking at how all of the users of the system are affected in the aggregate. However, it is a basic truism of marketing that the best way to increase the impact of a promotional activity is to target one's effort to those already predisposed towards one's product. In other words, it is likely that an attacker wishing to promote a particular product will be interested not in how often it is recommended to all users, but how often it is recommended to likely buyers. If the attacker can successfully target the appropriate market segment, the relatively minor marginal utility to be gained by pushing the product to any one of the out-of-segment users may be outweighed by the increased possibility of detection that such a move entails. A rational attack strategy is therefore a segmented one: push the product to the high-probability purchasers.

This paper examines a particular attack model that we call the segmented attack, in which the attacker concentrates on a set of items of similar content that have high visibility, the Harry Potter series being a good example in the book domain. It is certainly the case that these books are highly popular and widely read; it would follow that they would be rated by many users of a collaborative system. Users who enjoy these books are likely to share some characteristics: they may be children or parents who have an interest in exciting fantasy stories involving magic. These facts are general knowledge about the book domain, readily available outside of any particular recommender system. The segmented attack model is designed to push an item to a targeted group of users with known or easily predicted preferences. Profiles are inserted that maximize the similarity between the pushed item and items preferred by the group. We show that the segmented attack is both effective and practical against user-based and item-based collaborative algorithms.

The paper is organized as follows. In Section 2 we provide a general framework for profile injection attacks against collaborative systems, and we present the details of our proposed segmented attack model. Section 3 includes some background information and the specific details of the user-based and item-based recommendation algorithms used in our experiments. In Section 4 we describe our evaluation methodology, including two evaluation metrics we have used to determine the effectiveness of the segmented attack against each algorithm. We then present our experimental results, with a detailed analysis of the proposed segmented attack model, and show its effectiveness against both user-based and item-based algorithms.

This research was supported in part by the National Science Foundation Cyber Trust program under Grant IIS-0430303.


2. Attack Models
A profile injection attack against a collaborative recommender system consists of a set of attack profiles (biased profile data associated with fictitious user identities) and a target item, the item that the attacker wishes the system to recommend more highly (a push attack), or wishes to prevent the system from recommending (a nuke attack). We concentrate on push attacks in this paper.

An attack model is an approach to constructing attack profiles, based on knowledge about the recommender system, its rating database, its products, and/or its users. The general form of a push attack profile is depicted in Figure 1. Each attack profile consists of an m-dimensional vector of ratings, where m is less than or equal to the total number of items in the system. The rating given to the pushed item is $r_{max}$, the maximum allowable rating value within the target recommender system. The ratings $r_1$ through $r_{m-1}$ are assigned to the corresponding items according to the specific attack model. Each attack model has its own strategy for selecting the items for the attack profile and assigning ratings to them. In the remainder of this section, we provide a detailed example that will help illustrate the vulnerability of collaborative recommendation algorithms, and will serve as a motivation for the formal description of the attack models that follow.

Figure 1. The general form of a push attack profile.

2.1 An Example

Consider, as an example, a recommender system that identifies books that users might like to read using a user-based collaborative algorithm [3]. A user profile in this hypothetical system might consist of that user's ratings (on a scale of 1-5, with 1 being the lowest) on various books. Alice, having built up a profile from previous visits, returns to the system for new recommendations. Figure 2 shows Alice's profile along with those of seven genuine users. An attacker, Eve, has inserted attack profiles (Attack1-3) into the system, all of which give high ratings to her book, labeled Item6. If the system is using a standard user-based collaborative approach, then the predicted rating for Alice on Item6 will be obtained by finding the closest neighbors to Alice. Without the attack profiles, the most similar user to Alice, using correlation-based similarity, would be User6. The prediction associated with Item6 would be 2, essentially stating that Item6 is likely to be disliked by Alice. After the attack, however, the Attack1 profile is the most similar one to Alice, and would yield a predicted rating of 5 for Item6, the opposite of what would have been predicted without the attack. (Of course, a real implementation would use more than a single neighbor for prediction, but the same principle applies with a larger number of neighbors.) So, Eve's attack is successful and Alice will get Item6 as a recommendation, regardless of whether this is really the best suggestion for her. She may find the suggestion inappropriate, or worse, she may take the system's advice, buy the book, and then be disappointed by the delivered product.

On the other hand, if a system is using an item-based collaborative filtering approach, then the predicted rating for Item6 will be determined by comparing the rating vector for Item6 with those of the other items. This algorithm does not lend itself to an attack as obvious as the previous one, since Eve does not have control over ratings given by other users to any given item. However, Eve can make a successful attack more likely with a small amount of knowledge about the ratings distributions for some items. In the example of Figure 2, for instance, Eve knows that Item1 is a popular item among a significant group of users to which Alice also belongs.


Figure 2. An example of a push attack favoring the target item Item6.

By designing the attack profiles so that high ratings are associated with both Item1 and Item6, Eve can attempt to increase the similarity of these two items, resulting in a higher likelihood that Alice (and the rest of the targeted group) will receive Item6 as a recommendation. Indeed, as the example portrays, such an attack is successful regardless of whether the system is using an item-based or a user-based algorithm. This latter observation illustrates the motivation behind the attack model we introduce and analyze in this paper, namely the segmented attack.

2.2 The Segmented Attack

Prior work on recommender system stability has examined primarily three attacks. The sampling attack from [6] is primarily of theoretical interest, as it requires the attacker to have access to the ratings database itself. The random attack [5] forms profiles by associating a positive rating for the target item with random values for the other items. The average attack [5] assumes that the attacker knows the average rating for each item in the database and assigns values randomly distributed around this average, except for the target item. This attack has been found to be effective against user-based collaborative recommendation algorithms, but less so against item-based recommendation.

Each of these prior attack models makes the implicit assumption that the attacker is interested in promoting the pushed item to every user in the system. However, suppose that Eve in our previous example had written a fantasy book for children. She would no doubt prefer that her book be recommended to buyers who had expressed an interest in this genre, for example buyers of Harry Potter books, rather than buyers of books on Java programming or motorcycle repair. Eve would rightly expect that the fantasy book buyer segment of the market would be more likely to respond to a recommendation for her book than others.

We can frame this intuition as a question of utility. We assume that the attacker has a particular item i that she wants recommended more highly because she has a personal stake in the success of this product. The attacker receives some positive utility or profit $p_i$ each time i is purchased. In biasing the recommender system, the attacker hopes to increase the probability that purchases will happen, but of course not every user to whom a recommendation is made will actually purchase. Let us denote the event that a recommendation of product i is made to a user u by $R_{u,i}$, and the event that a user buys an item by $B_{u,i}$. The probability that a user will purchase i if it is recommended we can describe as a conditional probability: $P(B_{u,i}|R_{u,i})$. Over all users U that visit the system over some time period, the expected profit would be

$$P = \sum_{u \in U} p_i \cdot P(R_{u,i}) \cdot P(B_{u,i}|R_{u,i})$$

The attacker of a recommender system hopes to increase her profit by increasing $P(R_{u,i})$, the probability that the system will recommend the item to a given user. However, preferences for most consumer items are not uniformly distributed over the population of buyers. For many products, there will be users (like Harry Potter buyers) who would be susceptible to following a recommendation for a related item (another fantasy book for children) and others who would not. In other words, there will be some segment of users S that are distinguished from the rest of the user population $N = U \setminus S$ by being likely recommendation followers:

$$\forall s \in S,\ \forall n \in N:\ P(B_{s,i}|R_{s,i}) \gg P(B_{n,i}|R_{n,i})$$

Let us consider an extreme case of a niche market in which $P(B_{n,i}|R_{n,i})$ is zero. The only customers worth recommending to are those in the segment S; everyone else will ignore the recommendation. It is in the attacker's interest to make sure that the attacker's item is recommended to the segment users; it does not matter what happens to the rest of the population. The attacker will be interested only in manipulating the quantity $P(R_{s,i})$. In other words, the quantity that matters to an attacker may not be the overall impact of an attack, but rather its impact on a segment of the market distinguished as likely buyers. This may even be true if $P(B_{n,i}|R_{n,i}) > 0$, because these out-of-segment buyers contribute relatively little to the expected utility compared to the in-segment ones.

Obviously, the maximum P is realized when every single user gets the pushed item as a recommendation, that is, when $P(R_{u,i}) = 1$. That may not be a realistic goal. Only a very large attack (perhaps with a number of biased profiles equal to or greater than the size of the original profile database) would be able to ensure such an effect, and such an attack would be likely to be noticed by a site's operators. In addition, the ubiquity of the pushed item would be noticed by users for whom it is not a good match: buyers of motorcycle repair books suddenly getting recommendations for children's fantasy titles might complain, and the complaints would form a detectable pattern. This increased risk of detection is a cost associated with large attack sizes. Therefore, it is rational for the attacker to focus solely on the in-segment users to the extent that this is possible.

We define a segment as a set of users with shared strong favorable preferences for a set of segment items (such as Harry Potter books in our example). Let $S_I$ be the set of items that define a target segment. To target the users in the segment, we construct profiles with high ratings for the items in the set $S_I$ and low ratings for other items; a construction sketch follows Figure 3. These profiles will match users who also have a strong preference for the items in $S_I$. See Figure 3. An attacker like Eve only needs to identify books that are similar to the one she wants to push, and relatively popular, in order to generate the attack.

Figure 3. General form of the Segmented Attack.
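A minimal sketch of this profile construction, under our own assumptions about data layout and filler-item selection (the form follows Figure 3; sampling filler items at random is our illustrative choice):

```python
# Segment items and the pushed item get the maximum rating; a sample of
# other ("filler") items gets the minimum rating.
import random

def segmented_attack_profile(target_item, segment_items, all_items,
                             n_filler, r_max=5, r_min=1):
    profile = {item: r_max for item in segment_items}
    profile[target_item] = r_max
    fillers = [i for i in all_items
               if i != target_item and i not in segment_items]
    for item in random.sample(fillers, n_filler):
        profile[item] = r_min
    return profile

# e.g., one attack profile pushing "EvesBook" to the fantasy segment:
# segmented_attack_profile("EvesBook", ["HP1", "HP2", "HP3"], catalog, 20)
```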
3. Recommendation Algorithms
This paper reports on results for two of the most commonlyused collaborative algorithms: user-based and item-based collaborative recommendation using nearest-neighbor techniques [3, 7]. In each case, the algorithm assumes there is a single user / item pair for which a prediction is sought in our experiments this is generally the pushed item, since we are primarily interested in the impact that attacks have on this item. The standard collaborative ltering algorithm is based on user-to-user similarity [3]. This k NN algorithm operates by selecting the k most similar users to the target user, and formulates a prediction by combining the preferences of these users. k NN is widely used and reasonably accurate. The similarity between the target user, u, and a neighbor, v , can be calculated by the Pearsons correlation coefcient dened below: (ru,i r u ) (rv,i r v ) simu,v =
i I

where $I$ is the set of all items that can be rated, $r_{u,i}$ and $r_{v,i}$ are the ratings of item $i$ by the target user $u$ and a neighbor $v$, respectively, and $\bar{r}_u$ and $\bar{r}_v$ are the averages of the ratings of $u$ and $v$ over $I$, respectively. Once similarities are calculated, the most similar users are selected. In our implementation, we have used a value of 20 for the neighborhood size $k$. We also filter out all neighbors with a similarity of less than 0.1 to prevent predictions being based on very distant or negative correlations. Once the most similar users are identified, we use the following formula to compute the prediction for an item $i$ for target user $u$:

$$p_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} sim_{u,v} \, (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |sim_{u,v}|}$$

where $V$ is the set of the $k$ similar users who have rated item $i$, $r_{v,i}$ is the rating of item $i$ by neighbor $v$, $\bar{r}_u$ is the average rating of the target user over all rated items, and $sim_{u,v}$ is the mean-adjusted Pearson correlation described above. Item-based collaborative filtering works by comparing items based on their pattern of ratings across users. Again, a nearest-neighbor approach can be used, but here a more common approach is the adjusted cosine similarity measure introduced by [7]:

$$sim_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_u)^2} \; \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_u)^2}}$$
where $r_{u,i}$ represents the rating of user $u$ on item $i$, and $\bar{r}_u$ is the average of user $u$'s ratings as before. In this measure, all user profiles are normalized by subtracting the user's mean rating. When items are compared, the ratings given by each user to those items are combined into vectors and the similarity between them is calculated as the vector cosine. After computing the similarity between items we select a set of the $k$ most similar items to the target item and generate a predicted value:

$$p_{u,i} = \frac{\sum_{j \in J} r_{u,j} \, sim_{i,j}}{\sum_{j \in J} sim_{i,j}}$$
where $J$ is the set of the $k$ most similar items, $r_{u,j}$ is the user's rating of item $j$, and $sim_{i,j}$ is the adjusted cosine similarity between items $i$ and $j$. The user's own ratings of similar items are used to extrapolate the prediction for the target item. We consider a neighborhood of size 20 and ignore items with negative similarity.
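The two prediction schemes above can be summarized in a compact sketch. This is a minimal re-implementation for illustration, assuming a dense ratings matrix with NaN for unrated cells; details the paper does not specify (tie-breaking, handling of empty neighborhoods) are our own choices:

import numpy as np

def pearson_sim(ratings, u, v):
    # ratings: 2-D array, rows = users, cols = items, np.nan = unrated
    both = ~np.isnan(ratings[u]) & ~np.isnan(ratings[v])
    if not both.any():
        return 0.0
    ru = ratings[u][both] - np.nanmean(ratings[u])
    rv = ratings[v][both] - np.nanmean(ratings[v])
    denom = np.sqrt((ru ** 2).sum()) * np.sqrt((rv ** 2).sum())
    return (ru * rv).sum() / denom if denom else 0.0

def predict_user_based(ratings, u, i, k=20, min_sim=0.1):
    # kNN prediction: mean rating of u plus weighted neighbor deviations
    sims = [(pearson_sim(ratings, u, v), v)
            for v in range(len(ratings))
            if v != u and not np.isnan(ratings[v, i])]
    neighbors = sorted([s for s in sims if s[0] >= min_sim], reverse=True)[:k]
    if not neighbors:
        return np.nanmean(ratings[u])
    num = sum(s * (ratings[v, i] - np.nanmean(ratings[v])) for s, v in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return np.nanmean(ratings[u]) + num / den

def adjusted_cosine(ratings, i, j):
    # all profiles are mean-centered before comparing the item columns
    centered = ratings - np.nanmean(ratings, axis=1, keepdims=True)
    both = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
    ci, cj = centered[both, i], centered[both, j]
    denom = np.sqrt((ci ** 2).sum()) * np.sqrt((cj ** 2).sum())
    return (ci * cj).sum() / denom if denom else 0.0

def predict_item_based(ratings, u, i, k=20):
    # negative similarities are ignored, as in the text
    sims = [(adjusted_cosine(ratings, i, j), j)
            for j in range(ratings.shape[1])
            if j != i and not np.isnan(ratings[u, j])]
    neighbors = sorted([s for s in sims if s[0] > 0], reverse=True)[:k]
    if not neighbors:
        return np.nanmean(ratings[u])
    return (sum(s * ratings[u, j] for s, j in neighbors)
            / sum(s for s, _ in neighbors))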

4. Experiments
In our experiments we use the publicly available MovieLens 100K dataset (http://www.cs.umn.edu/research/GroupLens/data/). This dataset consists of 100,000 ratings on 1682 movies by 943 users. All ratings are integer values between one and five, where one is the lowest (disliked) and five is the highest (most liked). Our data includes all the users who have rated at least 20 movies.

4.1 Methodology
There has been considerable research in the area of recommender systems evaluation [4]. Some of these concepts can also be applied to the evaluation of the security of recommender systems, but in evaluating security we are interested not in raw performance, but rather in the change in performance induced by an attack. The metrics of stability and robustness were introduced in [6]. Our interest is along the lines of stability: how the attack changes the system's ratings for the pushed item. More generally, we are interested in measuring the effectiveness of an attack - the win for the attacker. The desired outcome for the attacker in a push attack is that the pushed item be more likely to be recommended after the attack than before. In the experiments reported below, we follow the lead of [6] in measuring stability via prediction shift, the change in the predicted rating for the target item after the attack. However, we also measure hit ratio, the average likelihood that a top-N recommender will recommend the pushed item [7].

Average prediction shift is defined as follows. Let $U$ and $I$ be the sets of target users and items, respectively. For each user-item pair $(u, i)$ the prediction shift, denoted by $\Delta_{u,i}$, can be measured as $\Delta_{u,i} = p'_{u,i} - p_{u,i}$, where $p'$ represents the prediction after the attack and $p$ the prediction before. A positive value means that the attack has succeeded in making the pushed item more positively rated. The average prediction shift for an item $i$ can be computed by averaging $\Delta_{u,i}$ over all users, and an overall average can be generated by picking a number of different items to attack and averaging over them. We chose 50 movies at random from the MovieLens data, being careful that this set of target items mirrored the distribution of the data as a whole. Note that a strong prediction shift is not a guarantee that an item will be recommended: it is possible that other items' scores are affected by an attack as well, or that the target item scores so low to begin with that even a significant shift does not promote it to recommended status. Prediction shift is a good rough indicator of the success of an attack, but it does not get at our notion of a win: increased probability of recommendation. In order to measure the benefit of the attack from the attacker's point of view, we use the notion of the hit ratio. The idea is to establish a window of size N at the top of the recommendation list. We count a success (a hit) if the pushed movie shows up in this window. Let $R_u$ be the set of top-N recommendations for user $u$. For each push attack on item $i$, the value of a recommendation hit for user $u$, denoted by $H_{u,i}$, is 1 if $i \in R_u$ and 0 otherwise. We define the hit ratio as the number of hits across all users in the test set divided by the number of users in the test set:

$$HitRatio_i = \frac{\sum_{u \in U} H_{u,i}}{|U|}$$

The average hit ratio can then be calculated as the sum of the hit ratios for attacks on each item $i$ across all items, divided by the number of items. For the segmented attack, we investigated two market segments: one defined by Harrison Ford's action movies and one by popular horror films. Recall that the segmented attack is constructed by identifying a set $SI$ of segment items; the attacked users are the ones who have rated those items highly. In the Harrison Ford segment, the movies were Star Wars, Return of the Jedi, Indiana Jones and the Last Crusade, and Raiders of the Lost Ark. In the Horror segment, the movies were Alien, Psycho, The Shining, Jaws, and The Birds. (This list was generated from on-line sources of the popular horror films: http://www.imdb.com/chart/horror and http://www.filmsite.org/a100thrillers1.html.)
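Both metrics reduce to a few lines of code. The sketch below assumes the predictions and top-N lists have already been computed for the test users:

def prediction_shift(pred_before, pred_after):
    # Average change in the pushed item's predicted rating over the
    # attacked (user, item) pairs; positive values favor the attacker.
    deltas = [after - before for before, after in zip(pred_before, pred_after)]
    return sum(deltas) / len(deltas)

def hit_ratio(top_n_lists, pushed_item):
    # Fraction of test users whose top-N list contains the pushed item.
    hits = sum(1 for top_n in top_n_lists if pushed_item in top_n)
    return hits / len(top_n_lists)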

For the Harrison Ford segment, we chose those users who had given the top rating (5) to all four movies. From this set, we chose 50 users at random. For the Horror movie segment, we chose those users who had given above-average scores (4 or 5) to any three of the five movies. For this set of five movies, we selected all combinations of three movies that had the support of at least 50 users, chose 50 of those users randomly, and averaged the results. For all the attacks, we generated a number of attack profiles, inserted them into the system database, and then generated predictions. We measure the size of an attack as a percentage of the pre-attack user count. There are approximately 1000 users in the database, so an attack size of 1% corresponds to 10 attack profiles added to the system.
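A sketch of the in-segment user selection just described; the matrix layout and the NaN handling are our assumptions:

import numpy as np

def in_segment_users(ratings, segment_items, min_rating, min_count):
    # ratings: users x items matrix with np.nan for unrated cells.
    # min_count = len(segment_items) with min_rating = 5 reproduces the
    # Harrison Ford rule; min_count = 3 with min_rating = 4 the Horror rule.
    seg = np.nan_to_num(ratings[:, segment_items], nan=0.0)
    qualifies = seg >= min_rating
    return np.where(qualifies.sum(axis=1) >= min_count)[0]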

4.2 Experimental Results
Figure 4. Prediction Shift results for the Harrison Ford segment. User-based algorithm.

If we evaluate the segmented attack based on its average impact on all users, there is nothing remarkable. The attack has an effect but does not approach the numbers reached by the average attack, the most effective attack we had previously studied [1]. However, we must recall our market segment assumption: namely, that recommendations made to in-segment users are much more useful to the attacker than recommendations to other users. Our focus must therefore be on the in-segment users, those users who have rated the segment movies highly and presumably are desirable customers for pushed items that are similar: an attacker using the Harrison Ford segment might be interested in pushing a new movie featuring the star in an action role. The intuition behind the segmented attack is borne out in Figure 4. The figure shows prediction shift results for the Harrison Ford segment, comparing all users with in-segment users. The in-segment prediction shift is slightly stronger for the segmented attack than for the average attack. Note also that the segmented attack requires considerably less knowledge of the ratings distribution in the system than the average attack requires. ([1] discusses the question of limited-knowledge attacks in greater detail.) The hit ratio results are shown in Figure 5 for a 1% attack at different values of N. These results show that even an attack as small as 1% on the user-based algorithm can have a major impact on the hit ratio. (The hit ratio prior to the attack is very small, about 1% at N = 10 and less than 5% even with an N of 50.) It is also interesting that although the overall user base is not affected as much as the in-segment users, the shift is still very large, more than a whole point on the rating scale with a 1% attack, and with the target movie showing up in the top five more than 40% of the time. The most likely reason for this is that some of the movies in this segment (such as Star Wars and Raiders of the Lost Ark) were rated highly by a majority of users in the database.

Figure 5. Hit Ratio results for the Harrison Ford segment. User-based algorithm.

Essentially, there is not that much difference between the in-segment users and the rest of the user base with respect to these movies, no doubt having to do with the characteristics of the population using the MovieLens system at the time the ratings were collected. The benefit of the segmented attack is considerably more striking in the item-based case, shown in Figures 6 and 7. Lam and Riedl concluded, based on their results with the random and average attacks, that item-based algorithms were more robust than user-based ones [5]. However, as the figures show, the segmented attack works well against the item-based algorithm. The reason has to do with profile construction. Since the segmented attack assigns maximum ratings to both the segment items $SI$ and the target item, the similarity between these items and the target item is increased.

Figure 6. Prediction Shift results for the Harrison Ford segment. Item-based algorithm.

Figure 8. Prediction Shift results for the Horror Movie segment. User-Based algorithm.

Figure 7. Hit Ratio results for the Harrison Ford segment. Item-based algorithm.

Figure 9. Prediction Shift results for the Horror Movie segment. Item-based algorithm.

The low ratings given to the other items make them more distant. If the segment items are in the algorithm's prediction neighborhood for the target item, they will boost the recommendation scores, since these are items that an in-segment user will have rated highly. In the case of the Horror movie segment, the movies were selected from on-line sources as the best movies of their type, but none of them is as broadly popular as Star Wars; these movies represent more of a market niche. Figure 8 shows a result similar to that seen with prediction shift for the Harrison Ford segment against the user-based algorithm. Figure 9 indicates the focused manner in which this attack homes in on its target audience when the item-based algorithm is attacked. The general population is barely affected by the injected profiles, but there is a sizable prediction shift for in-segment users.
The hit ratio results for this user segment are depicted in Figures 10 and 11 and are similar to those already seen. These results also point out an interesting difference between the user-based and item-based algorithms. While, in both cases, the attack has a dramatic impact on the in-segment users, the overall impact of the segmented attack on the whole user group is more pronounced in the case of the user-based algorithm. Another way in which the item-based algorithm shows robustness is with respect to profile size. In the segmented attack, the items that are not in the $SI$ set (see Figure 3) are given low values. In our initial experiments with the attack, all such movies were used in the attack profile. We define this as a profile size of 100%. However, this means that each attack profile must be very large, perhaps unrealistically so. We experimented with decreasing the number of non-favorite items rated in each attack profile, leaving these items unrated. Interestingly, there is a peak at a low value (about 3%, or about 50 movies) when the user-based algorithm is attacked. It is this 3% profile version of the attack that was used in the experimental results shown above. As Figure 12 shows, the item-based algorithm has no such peak: the prediction shift increases monotonically for larger profile sizes. Item-based recommendation would therefore appear to have an additional advantage over user-based: an attacker must build larger profiles to be successful.

Figure 10. Hit Ratio results for the Horror Movie segment. User-Based algorithm.

Figure 12. Comparing item-based and user-based algorithms at different profile sizes.

Figure 11. Hit Ratio results for the Horror Movie segment. Item-based algorithm.

5. Conclusions
Previous research has examined profile injection attacks against recommender systems that are broad in their construction and impact. Of these, the average attack has been found to be most effective. From a cost-benefit point of view, however, such attacks are sub-optimal: they require a significant degree of system-specific knowledge to mount, and they push items to users who may not be likely purchasers. In addition, they are not effective against item-based implementations. In this paper, we introduce the segmented attack, a profile injection attack that associates the pushed item with a small number of popular items of similar type. As our results show, the attack does well at ensuring that the pushed item will be recommended to those users that are its target market. It is effective against item-based recommendation algorithms to a degree that broader attacks are not, and has no requirement for system-specific ratings-distribution data.
References
[1] R. Burke, B. Mobasher, and R. Bhaumik. Limited knowledge shilling attacks in collaborative filtering systems. In Proceedings of the 3rd IJCAI Workshop on Intelligent Techniques for Personalization, Edinburgh, Scotland, August 2005.
[2] R. Burke, B. Mobasher, R. Zabicki, and R. Bhaumik. Identifying attack models for secure recommendation. In Beyond Personalization: A Workshop on the Next Generation of Recommender Systems, San Diego, California, January 2005.
[3] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd ACM Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA, August 1999.
[4] J. Herlocker, J. Konstan, L. G. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5-53, 2004.
[5] S. Lam and J. Riedl. Shilling recommender systems for fun and profit. In Proceedings of the 13th International WWW Conference, New York, May 2004.

[6] M. O'Mahony, N. Hurley, N. Kushmerick, and G. Silvestre. Collaborative recommendation: A robustness analysis. ACM Transactions on Internet Technology, 4(4):344-377, 2004.
[7] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001.


Secure K-Means Clustering Algorithm for Distributed Databases


Raj Bhatnagar, Ahmed Khedr, Amit Sinha ECECS Department University of Cincinnati Cincinnati, OH 45221, USA Email: [rbhatnag,akhedr,sinhaam]@ececs.uc.edu

Abstract—The emerging knowledge and data environments consist of networked databases that are geographically distributed across the globe. It is neither feasible nor secure to move data from these nodes to a single site to perform the desired computation with the collective data of a few networked databases. We need algorithms that work by a mixture of local computations and communication, performing the global computation with minimal and secure movement of data and related information across the network. Security and privacy concerns require that no data tuples be exchanged on the network, where the security risk is the largest. In this paper we present a decomposable version of the popular K-Means clustering algorithm that works in this desired manner with a set of networked databases. We show that it is possible to perform the global computation in a reasonably secure manner for either horizontally partitioned or vertically partitioned databases. The computation is completed by exchanging only a few local summaries among the databases, and there is never any need to exchange individual data tuples among the networked sites. An empirical and analytical validation of our results is also presented in the paper.

I. INTRODUCTION
The clustering problem aims at partitioning a collection of data points into groups such that each data point is, in terms of some distance metric, relatively close to the other data points in its own group and is farther from the data points in other groups. Clustering algorithms have been studied extensively [1], [2], [8], [14], [26] in the context of single-processor or parallel-processor systems, and there has been relatively recent work [13], [11], [16] on mining distributed databases while preserving data security and privacy during the exchange of information among the nodes. The work in [11] argues that a parallel algorithm running on widely separated nodes, with very high inter-node communication costs, is more efficient than performing clustering on databases that are moved to a central site before the clustering is done. Our work described here presents a more efficient algorithm that has a significantly smaller communication cost and more data security. The work of Kargupta et al. [16] depends on sampling the databases, and the accuracy of its results depends on the extent and nature of the sampling. We seek to get a result very close to the actual cluster centers with a k-means algorithm without transferring any tuple among the database sites. The work in [13], [15] describes algorithms for the case when records are distributed among a few sites such that each site contains a key to identify some individual and a set of attributes that does not overlap with the attribute set at any other database site. Our algorithm is applicable to the most general situations in which existing distributed databases can cooperate for k-means clustering, by properly accounting for any number of shared attributes and without the restriction that there be only one record at each site for each individual. For example, in the case of a set of databases related

to credit card customers, the customer information database may contain only one record per person, the billing database in a different city may contain one bill per person per month, and the charges database may contain a variable number of transactions per person per month. The global database in such a case would be the implicitly defined cross product of all the relational tables, facilitated by the set of those attributes that are shared by at least two databases.
Distributed Data Sources: Computing situations are beginning to emerge in networked environments that require data from a number of geographically distributed sites to be mined or processed simultaneously. A number of geographically distributed databases together form an implicitly Joined global dataset that contains all the data relevant for mining or other computational tasks. For example, a data mining task may require simultaneous processing of data, parts of which may exist in a census database, labor statistics databases, health databases, and employment-related databases. Each of these is a huge and live database that resides on a different networked site in a different city. One cannot hope to easily move all these databases to a single computer site, then merge or join them, and then execute a data mining algorithm on the tuples of the huge resulting database. It would be desirable to have algorithms that let the individual databases reside at their own sites and work with an imagined implicit join of the databases. These algorithms decompose themselves into, possibly a series of, localized sub-computations such that each sub-computation can be performed locally at a single database site. The decomposition process must be such that: (i) it is mathematically equivalent to the global computation to be performed; (ii) it is efficient (communication is as little as possible); and (iii) it preserves the security and integrity of the data in each local database, both while the data resides in the system and while a summary is exchanged over the network. Let us say $DB_1, DB_2, \ldots, DB_n$ are the local databases and $DB_G$ is the implicit global database formed by Merging (in the case of horizontal partitioning) or Join-ing (in the case of vertical partitioning) all the participating local $DB_i$s. Let us say a result $R$ is obtained by applying a function $f$ (or running an algorithm) on $DB_G$, that is, $R = f(DB_G)$. In our case, $DB_G$, the global database, cannot be made explicit and is known only implicitly in terms of the explicit components $DB_1, DB_2, \ldots, DB_n$. The implementation of $f$ can now be redesigned by a functionally equivalent formulation: $R = C(f_1(DB_1), f_2(DB_2), \ldots, f_n(DB_n))$. That is, a local computation $f_i$ is performed at site $i$ using the database $DB_i$, and the results of these local computations are aggregated using the composition operation $C$. In case $f$ is not directly

decomposable in this manner, we may write a sequence of steps $f^{(1)}, \ldots, f^{(m)}$ such that the $m$-step algorithm is equivalent to computing $f$, and then decompose each of the steps. Our notion of data privacy requires that when the local results $f_i(DB_i)$ are exchanged over the network, even if they are captured by someone, they should not enable reconstruction of any single tuple residing in any of the participating databases. This is facilitated, partly, by the network intruder's lack of knowledge of the composition operation $C$. And with this constraint, any of the participating sites should be able to determine the cluster centers for the collective data. In this paper we present one such decomposable version of the k-means clustering algorithm. The function to be computed is the same set of clusters that would be produced by the k-means algorithm if the data at the networked databases were to be collected at one site. In [3], [4] exact decompositions have been presented for inducing decision trees and for computing covariance matrices from distributed databases. The spirit is similar for the k-means clustering algorithm, but the decomposition is not exact. However, the final results are a close approximation of the results that may be obtained by merging the local databases. A coordinator site that seeks to compute the global results of the algorithm first determines all the databases and sites that should be involved in a clustering task and then communicates to them requests for the results of some computations performed locally at each site. Only the results of these local computations are transmitted to the coordinator site, followed by new requests for the results of more local computations, until the global computation is completed and the results are obtained at the coordinator site.
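A minimal sketch of this decompose-and-aggregate pattern; the names decomposed_compute, local_fns, and aggregate are illustrative and not an interface defined by the paper:

from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")   # type of a local summary f_i(DB_i)
R = TypeVar("R")   # type of the global result

def decomposed_compute(local_fns: Iterable[Callable[[], T]],
                       aggregate: Callable[[List[T]], R]) -> R:
    # Each f_i runs at its own site against DB_i and returns only a
    # summary; the coordinator sees summaries, never raw tuples.
    summaries = [f() for f in local_fns]
    return aggregate(summaries)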

II. NATURE OF DATA DISTRIBUTION
Depending on the sets of attributes contained in each $DB_i$, there are two primary ways in which the databases may be seen as forming an implicit global dataset $DB_G$.
Horizontally Partitioned Datasets: This is a partitioning of $DB_G$ into components $DB_1, \ldots, DB_n$ such that each $DB_i$ contains tuples consisting of an identical set of attributes, but a distinct set of data tuples resides at each site. Each $DB_i$ resides on a different site of the network, and the tuples contained in all the $DB_i$s, taken together, constitute the complete dataset $DB_G$.
Vertically Partitioned Datasets: In this case each component $DB_i$ consists of tuples formed with a different set of attributes, but each $DB_i$ may share some attributes with those of some other databases $DB_j$. Each $DB_i$ may also contain some attributes that are unique to the local site and are not shared with a database at any other site. In effect, each $DB_i$ is a projection of the implicit global $DB_G$. Vertically partitioned datasets require computations to be performed in the $DB_i$s, but without ever making the implicit Join of all the $DB_i$s explicit. The decomposed algorithm must appropriately account for all the shared attributes that would have played a role in enumerating the tuples of the Join-ed $DB_G$, if it were to be made explicit. This formulation models more general circumstances than the case of a single key and non-overlapping attribute sets for single records distributed at various sites [13], [15]. Our target is to enable the participation of databases that were designed independently and may have arbitrary overlap of attribute sets with the other databases they have to collaborate with. The database for which the k-means clustering is performed is the implicit cross product of the relations stored at the distributed sites.
Handling Exception Tuples: It is possible that not all tuples generated by Join-ing the local databases are consistent data tuples, and hence some consistency constraints may be specified to exclude the tuples satisfying these constraints from the implicit Join during the decomposed computations. For example, a database at one site may store tuples with attributes $A$ and $B$ and a database at another site may store tuples formed with attributes $B$ and $C$. A Join of the two databases will produce all those tuples in the $(A, B, C)$ space that are consistent with the two projections. However, some of these tuples may not be consistent with the domain knowledge and the constraints of the implicit 3-D space. We define Exclusions to be a set of domain-knowledge-based constraints that help us prune the database produced by the Join of all the local databases. For example, the set may contain the condition that the values $A = a_1$ and $C = c_1$ are inconsistent and no tuple in $DB_G$ should have this combination of values. This information can then be used to eliminate from the implicit Join all those tuples that satisfy the conditions included in the set Exclusions. Therefore, the implicit set with which we want to perform the clustering task is given as:

$$DB_G = (DB_1 \Join DB_2 \Join \cdots \Join DB_n) - T_{Exc} \quad (1)$$

where $T_{Exc}$ is the set of all tuples that are consistent with the conditions in the set Exclusions. Our decomposable algorithm is designed to perform computations with a $DB_G$ that takes care of the conditions in the Exclusions set and excludes from the implicit Join those tuples that are known to be inconsistent with the domain knowledge.

III. RELEVANT RESEARCH
Parallel implementations of pattern analysis algorithms take advantage of the high performance of multiprocessor computer systems [18], [19], [21], [24] and work by transferring data from one processor to the other. Rasmussen and Willett [21] discuss a parallel implementation of the single-link clustering method on an SIMD array processor. Li and Fang [18] describe parallel partitioning clustering (the k-means clustering algorithm) and parallel hierarchical clustering (the single-link clustering algorithm) on an $n$-site hypercube and an $n$-site butterfly. Olson [19] has described several implementations of hierarchical clustering algorithms on CRCW PRAM and on butterfly networks or trees. All these parallel clustering algorithms are tailored for situations in which: (i) it is assumed that all data resides in the main memory or distributed shared memory of a set of closely connected processors; (ii) they need a large number of closely connected processors at a single site that can access the shared memory to achieve reasonable performance; and (iii) inter-processor communication is extremely fast and involves reading data in the main or the shared memory.

Our algorithm is tailored for very different situations in which we don't have closely connected processors. There are multiple processors, but they are independent and reside at geographically distant sites, and communication among them may be many orders of magnitude slower than the rates of inter-processor communication in a set of closely coupled processors. This is the case in many real-life scenarios. Therefore, our formalism implements a methodology wherein each site works as much as possible with its own local data and then communicates with others at the level of local results of some partial computations. Distributed knowledge discovery work (Chan and Stolfo [7]; Lam and Segre [17]; Yamanishi [25]; Crestana and Soparkar [9]; Tumer and Ghosh [23]; Grossman et al. [12]; Park et al. [20]) merges the computation with communication, but either at the raw data level or at the local final-result level. The former is highly insecure, and the latter adds a high level of noise to the global results. Our past work [3], [4] and the algorithm presented here work by exchanging summaries at an intermediate level so as to preserve the data privacy and also reduce the amount of error. Vaidya and Clifton in [13] present a method for k-means clustering when different sites contain different attributes for a common set of entities. The security requirements to be met in this algorithm demand that a site can know only its own attribute-value pairs when they become parts of the global cluster centers. Our algorithm computes global cluster centers at any one of the participating sites. Kargupta et al. in [16] have presented an algorithm for clustering high-dimensional heterogeneous data using a distributed principal component analysis (PCA) technique. In their approach [16] partial results (principal eigenvectors) are computed at each site and transmitted to a central site along with a number of data tuples corresponding to each eigenvector. The error between the actual global result and the computed result decreases as the number of tuples transmitted from each site to the central site increases. We seek to decompose the computations of the k-means algorithm in such a way that no actual data tuples need to be transmitted among the sites and the error is kept low even by transmitting intermediate results only.

IV. DECOMPOSITION OF THE K-MEANS ALGORITHM

The total clustering error for a clustering of $N$ data points into $K$ clusters is defined as follows:

$$\text{TotalError} = \sum_{j=1}^{K} \sum_{x \in C_j} \| x - c_j \| \quad (2)$$

where $x$ is a data point in the cluster $C_j$ and $c_j$ is the center of that cluster. It has been shown [22] that the clusters and cluster centers computed by a k-means algorithm are located in such a way that they minimize the magnitude of the square of the TotalError quantity as defined above, even though it may be a local minimum they settle down in. Our decomposable version of k-means is guided by the goal to minimize a close approximation of this same error for the networked databases, and thus mimic the behavior of the k-means algorithm run on the implicit $DB_G$.
An outline of our k-means algorithm for the vertically partitioned databases is as follows:
1) Each site computes local cluster centers for its lower-dimensional data space and sends them to the coordinating site. This is elaborated in section IV-B below.
2) The central site performs a cross-product of the various local cluster centers in smaller-dimensional spaces received from the local sites to generate larger-dimensional (full dimensions/attributes of $DB_G$) globalized cluster-center candidates. This is elaborated in section IV-C below.
3) The central site then runs a k-means algorithm on these globalized cluster-center candidates to form the desired number of final clusters, trying to minimize an estimate of the quantity TotalError mentioned in equation 2 above. The main algorithm is explained in section IV-E. This algorithm needs the populations of data points around the potential cluster centers, and the way to compute these is elaborated in section IV-D below.

A. Managing the Implicit Join
If an explicit $DB_G$ were to be generated from the $DB_i$s, the process would have been mediated by the attributes shared among the $DB_i$s. Let us say the set of attributes contained in the relation $DB_i$ is represented as $A_i$. For a pair of databases $DB_i$ and $DB_j$ the corresponding sets of attributes $A_i$ and $A_j$ may have a set of shared attributes given by $S_{ij} = A_i \cap A_j$. For a vertically partitioned $DB_G$, $S_{ij}$ will be a subset of $A_i$ and $A_j$, and for a horizontally partitioned $DB_G$ we will have $S_{ij} = A_i = A_j$. To facilitate clustering in the implicitly described $DB_G$, we define a set $S$ that is the union of all the intersection sets defined above, that is, $S = \bigcup_{i \neq j} S_{ij}$. The set $S$, in effect, contains all those attributes that occur in more than one $DB_i$. We define a relation called Shareds on the attributes in the set $S$. This relation, Shareds, contains tuples corresponding to all possible combinations of values for the attributes in $S$. The relation Shareds would have mediated the creation of the explicit $DB_G$, if it were attempted, and is used by us in a very similar role.

B. Clustering at Local Sites
The following steps are executed at each site $i$ on its local database $DB_i$. The site receives the value of the global number of clusters, $K$, from the coordinating site and then determines the number of local clusters, $k_i$, that it should generate with the local database $DB_i$. Typically, we select $k_i$ to be some constant multiple of $K$; the constant depends on the total number of points in $DB_i$ and on $K$, the number of final clusters. In the final composition step, the local cluster centers will be used as representatives of the data at the local sites. Using the k-means clustering algorithm, locally cluster the data in the database $DB_i$ into $k_i$ clusters.

Send the following information to the central coordinating site: the set of local cluster centers $\{c_{i1}, c_{i2}, \ldots, c_{ik_i}\}$, where $c_{ij}$ is the center of the $j$-th cluster at site $i$; and the radius $r_{ij}$ of the local cluster centered at $c_{ij}$, which is the maximum of the distances between the cluster center and the points in the cluster.

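A sketch of the local phase under these definitions, using a plain Lloyd-style k-means; the function name and the decision to also return the cluster sizes are our assumptions:

import numpy as np

def local_summaries(points, k_local, iters=50, seed=0):
    # Lloyd's k-means on the local data; only centers, radii, and
    # counts leave the site -- never the tuples themselves.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k_local, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k_local):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    radii = np.array([np.linalg.norm(points[labels == j] - centers[j],
                                     axis=1).max() if (labels == j).any() else 0.0
                      for j in range(k_local)])
    counts = np.bincount(labels, minlength=k_local)
    return centers, radii, counts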
C. Globalizing Local Cluster Centers
A cross product of the cluster centers from the local sites is computed at the coordinating site to get the globalized cluster centers. The cross product of the lower-dimensional points from the local sites results in global full-dimensional vectors. We consider here an example in which Site 1 has a database containing attributes $A$ and $B$ and Site 2 has a database containing attributes $B$ and $C$. A snippet of the data is shown in Table I. Together, the two sites describe an implicit database containing attributes $A$, $B$, and $C$. Figure 1 shows the data points of Site 1 in the $A$-$B$ plane and Figure 2 shows the data points of Site 2 in the $B$-$C$ plane. After the local clustering phase Site 1 returns cluster centers that lie in the $A$-$B$ plane and Site 2 returns cluster centers that lie in the $B$-$C$ plane. Figure 3 shows the cross product of the points in the two planes, resulting in points in the 3-D space, the implicit dataset $DB_G$. Candidates for the global cluster centers can be generated by performing a cross-product of the cluster-center points from the two distinct planes. Some globalized cluster centers may not have many data points around them in the implicit $DB_G$ because, as explained in section II above, the set Exclusions may have eliminated many tuples from $DB_G$. Figure 4 shows the reduced set of points (from those in Figure 3) due to the conditions in the Exclusions set.
Fig. 3. The Cross product of the Local Databases

Fig. 4. The Cross Product Database after excluding the Exception Data

One problem in determining the globalized centers relates to deciding when two values of a shared attribute, in this case $B$, may be considered identical for the purpose of performing a cross product of the two sets of local cluster centers. That is, should a value of 3.5 for $B$ from Site 1 be considered the same as a value of 3.2 for $B$ from Site 2 as part of a cluster center coordinate? Cluster centers in a k-means algorithm adjust with every iteration and may be somewhat different for different sets of tuples representing the same underlying process. Therefore, we need to match the values of shared attributes only approximately when determining the globalized cluster centers. Once the globalized cluster centers are processed and adjusted in future iterations, the impact of the approximation here is automatically eliminated.
TABLE I. Snippet of Data at Site 1 and 2

Site 1 cluster centers    Site 2 cluster centers
A      B                  B      C
2.0    3.5                3.2    19.0
18.0   6.7                18.0   8.2
5.0    17.9               6.0    12.5
Fig. 1. Database at Local Site 1
Fig. 2. Database at Local Site 2

So, for each value of a shared attribute, we create a window around it, and whenever two windows overlap we consider the values to be identical for the purpose of the cross product. For example, in the table shown above Site 1 has a cluster center at $(A = 2.0, B = 3.5)$ and Site 2 has a cluster center at $(B = 3.2, C = 19.0)$. If we replace the values of $B$ throughout with windows, then the windows around 3.5 and 3.2 overlap, and we can generate a global cluster center at $(A = 2.0, B = 3.5, C = 19.0)$ by choosing either of the two values as the candidate value for $B$. These points are only candidates for the global cluster centers and will be automatically adjusted for better accuracy in later iterations; they do not have to be computed exactly at this stage. The coordinating site executes this procedure to determine the globalized centers. Table II shows the globalized cluster centers generated for the data shown in Table I.
TABLE II. Globalized Cluster Centers

A      B      C
2.0    3.5    19.0
18.0   6.7    12.5
18.0   6.7    30.4
5.0    17.9   8.2

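A sketch of the window-based cross product described above; the window half-width is an assumed parameter, since the paper does not fix its value:

import itertools

def globalize(centers_by_site, window=0.5):
    # centers_by_site: one list of centers per site, each center a dict
    # of attribute -> value, e.g. [[{'A': 2.0, 'B': 3.5}, ...],
    #                              [{'B': 3.2, 'C': 19.0}, ...]]
    candidates = []
    for combo in itertools.product(*centers_by_site):
        merged, ok = {}, True
        for center in combo:
            for attr, val in center.items():
                # windows [v - w, v + w] overlap iff |v1 - v2| <= 2w
                if attr in merged and abs(merged[attr] - val) > 2 * window:
                    ok = False
                    break
                merged.setdefault(attr, val)
            if not ok:
                break
        if ok:
            candidates.append(merged)
    return candidates

With the Table I centers and a window of 0.5, the Site 1 center (A=2.0, B=3.5) and the Site 2 center (B=3.2, C=19.0) merge into the candidate (A=2.0, B=3.5, C=19.0), matching the first row of Table II.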
D. Populations around Globalized Cluster Centers
The clustering algorithm presented in the later subsections of this paper requires that we compute the number of data points in the implicit data space that are within some fixed distance of a global cluster center, when the $DB_G$ is known only implicitly at the coordinating site. We have presented in [3] a decomposable algorithm for counting tuples that match some conditions in an implicit $DB_G$. This algorithm takes some counts from the explicit local $DB_i$s that constitute the $DB_G$ and sends them to the coordinating site. This count of tuples can also satisfy conditions [3] imposed by the Exclusions set. For each globalized cluster center we find the population of data points around it in $DB_G$. For an implicitly stated set of tuples the counting process proceeds in such a way that each decomposed part can be sent as a request to an explicit database site and the responses composed to reconstruct the counts. The decomposition for obtaining the count for each globalized cluster-center candidate $g$ is as follows:

$$Pop(g) = \sum_{t \in Shareds} \; \prod_{i=1}^{n} count_i(g, t) \quad (3)$$

where $n$ is the number of participating database sites ($DB_i$s), $t$ (tup-in-shrd) is a tuple selected by the coordinating site from the relation Shareds, as defined in section IV-A above, and sent as part of the request to each local site; and $count_i(g, t)$ is the count of those tuples in $DB_i$ that meet the following conditions:
1) The values of the shared attributes in the local tuple are the same as in the tuple $t$ selected from the relation Shareds.
2) The attribute values in the tuple do not violate any of the conditions in the Exclusions set; and
3) The value of each attribute in the counted tuple at the local site has a distance from the value of the same attribute in $g$ that is less than some threshold radius value $\rho$.
In effect, we are counting the number of data points in $DB_G$ that are within the hypersphere of radius $\rho$ centered at the point $g$. A hypersphere of radius $\rho$ in an n-dimensional space, when projected to an (n-1)-dimensional space, retains the same radius, just as a sphere of radius $\rho$ in 3 dimensions, when projected into 2 dimensions, becomes a circle of the same radius $\rho$. Therefore, it is justified to use the same value of $\rho$ at each local database.

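The decomposed count of equation (3) can be sketched as follows; the site interface count_of is hypothetical, standing in for the count-request message described in the text:

def population(g, sites, shareds, rho):
    # Sum-of-products of equation (3): for every Shareds tuple t the
    # coordinator collects one count per site and multiplies them; no
    # data tuple ever leaves a site. 'count_of' is a hypothetical stub
    # for the count request evaluating conditions 1)-3) locally.
    total = 0
    for t in shareds:
        prod = 1
        for site in sites:
            prod *= site.count_of(g, t, rho)
            if prod == 0:
                break
        total += prod
    return total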
(4) where is the data point from and belongs to globalized cluster center , belongs to the nal cluster and is the center of the nal cluster. The total error can be minimized by individually minimizing both the quantities. The rst quantity in equation 4 is less than where is the point counted in the globalized cluster center and is its radius. This is the worst case scenario, when all the points are on the periphery of the hypersphere. However, minimizing this quantity requires that we use the smallest feasible value for the radius while computing the population around each . At the same time we need to have

around a large enough radius to include all the points of the members of . The second quantity in equation 4, GlobalizingError, is equal to where is counted in globalized cluster center and the is the nal cluster center to which is assigned. Note that the error is summed over all the points in . If is the number of points in the hypersphere around the globalized cluster center, then . If we take the GlobalizingError = distance function as Euclidean, then GlobalizingError = . Differentiating GlobalizingError with respect to a cluster center , we get . For minimum error, this quantity should be equal to 0. Therefore,

ii) Compute the estimated contribution to the total clustering error if the globalized data point is added to the nal cluster as:

(7)

(5)

iii) Include globalized data point in that nal cluster which results in minimum value for the estimated clustering error. The difference between the above and the traditional clustering algorithm lies primarily in the way the membership of a point in a potential cluster is decided. Here we make this decision to minimize the potential contribution a point would make to the total clustering error when examined for its inclusion in all possible candidate clusters. This decision is weighed by the population associated with each point, which is actually a cluster center from a local database. The next phase is to recompute the nal cluster centers, as per the iterations of the k-means algorithm. This part is also performed entirely at the coordinating site and does not require any communication with the local sites. Recompute Cluster Centers For each nal cluster 1) Find the total population of the cluster. This is the same as described above. 2) Find all the points that belong to the cluster . 3) Recompute the new center of cluster . In this phase we have a deviation from the traditional k-means algorithm. We compute the new centers such that each point is weighed by the population of data points in that is associated with it using equation 5 above. The last two of the above steps are repeated until the decrease in the estimated cluster error remains below a threshold for a number of iterations. The population weight related adaptations in the previous two steps can be shown to result in clusters that actually minimize the estimated cluster error. V. R ESULTS We outline below our results obtained by running this algorithm with a vertically partitioned dataset and a horizontally partitioned dataset. A. Example of Vertical Partitioning We demonstrate a simple test situation consisting of a vertically partitioned global database containing points in an implicit 3-dimensional (attributes A, B, and C) space and two explicit local databases that are projections of the global database. The local databases and from two sites are shown in Figure 1, and Figure 2. The database is the projection on plane and is the projection on plane. For the test we have made the global database explicit in order to compare the performance of our algorithm. In this example the local cluster centers obtained from are shown in Table III and the local cluster centers obtained from are shown in Table IV.

Equation 5 implies that the nal cluster center should be the mean of the globalized cluster centers included in a cluster, weighted by the population of points around each globalized cluster center. This modied rule for nding cluster center is used, and iterations of regular k-means algorithm are applied to arrive at the nal cluster centers. This part is performed entirely at the coordinating site and does not require any communication with the local sites. In a distributed environment we cannot compute the clustherefore, we dene its estimate, called tering error for Estimated Clustering Error (ECE), as an estimated of the total distance between a data point of and its nal cluster center (to be found in this phase of the algorithm), weighted by the population associated with . Step 2.ii of the algorithm below shows the quantitative denition of the estimated clustering error (ECE). The coordinating site now runs a modied version of the k-means algorithm on the globalized cluster centers in set as follows: 1) Randomly choose a set of points as the initial candidates for nal cluster centers s. 2) For each data point having a coordinate do a) For each nal cluster having a coordinate do i) Compute the weighted mid-point between a nal cluster center and a point , weighted by populations around s, as:

where is the population in the hypersphere around point in , and is the population of the cluster at , and

(6)
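A sketch of the coordinating site's population-weighted clustering phase (equations 5-7). The initialization, the running cluster-population bookkeeping, and the simplified membership score are our assumptions where the paper leaves details open:

import numpy as np

def cluster_globalized(G, pops, K, iters=20, seed=0):
    # Population-weighted k-means over globalized centers.
    # G: (m, d) array of globalized centers; pops: (m,) populations m_g.
    rng = np.random.default_rng(seed)
    centers = G[rng.choice(len(G), K, replace=False)]
    for _ in range(iters):
        cluster_pop = np.ones(K)       # running M_f; start value assumed
        labels = np.empty(len(G), dtype=int)
        for idx, (g, m_g) in enumerate(zip(G, pops)):
            best, best_err = 0, np.inf
            for f in range(K):
                # eq. (6): population-weighted mid-point
                mid = (m_g * g + cluster_pop[f] * centers[f]) / (m_g + cluster_pop[f])
                # simplified membership score from eq. (7), first term only
                err = m_g * np.linalg.norm(g - mid)
                if err < best_err:
                    best, best_err = f, err
            labels[idx] = best
            cluster_pop[best] += m_g
        for f in range(K):             # eq. (5): population-weighted mean
            mask = labels == f
            if mask.any():
                centers[f] = np.average(G[mask], axis=0, weights=pops[mask])
    return centers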

V. RESULTS
We outline below our results obtained by running this algorithm with a vertically partitioned dataset and a horizontally partitioned dataset.

A. Example of Vertical Partitioning
We demonstrate a simple test situation consisting of a vertically partitioned global database containing points in an implicit 3-dimensional (attributes $A$, $B$, and $C$) space and two explicit local databases that are projections of the global database. The local databases $DB_1$ and $DB_2$ from the two sites are shown in Figure 1 and Figure 2: $DB_1$ is the projection on the $A$-$B$ plane and $DB_2$ is the projection on the $B$-$C$ plane. For the test we have made the global database explicit in order to compare the performance of our algorithm. In this example the local cluster centers obtained from $DB_1$ are shown in Table III and the local cluster centers obtained from $DB_2$ are shown in Table IV.

TABLE III. Local Cluster Centers from Site 1

A       B
10.5    7.5
10.5    5.5
6.0     5.0
3.67    22.67
5.0     24.0
3.27    5.36
22.0    28.6
23.22   25.56
TABLE VII. Cluster Centers with Regular k-means on the Explicit $DB_G$

A        B        C
10.469   6.489    24.04
22.84    26.94    14.5
4.5      23.433   6.4
5.866    5.283    3.65
TABLE IV. Local Cluster Centers from Site 2

B       C
4.0     3.5
7.2     3.6
26.0    15.0
28.71   13.57
5.5     24.5
7.29    23.86
23.0    6.0
24.33   7.33
The cross product of the local cluster centers is used to obtain the candidates for the globalized cluster centers. Table V shows the set of globalized cluster centers, as members of $G$, after computing the populations around the local cluster-center points and taking the conditions in Exclusions into account.
TABLE V. Globalized Cluster Centers

A       B       C
10.5    7.5     3.6
10.5    7.5     23.86
10.5    5.5     24.5
6.0     5.0     3.5
3.67    22.67   6.0
5.0     24.0    6.0
5.0     24.0    7.33
22.0    28.6    13.57
23.22   25.56   15.0
Fig. 5. Total Error versus Number of Clusters in the Distributed and traditional algorithms (curves: Total ECE for the Distributed Version Algorithm; Total Error for the Distributed Version Algorithm; Total Error for the Traditional k-means Algorithm)
A 10.51 22.489 4.425 8.475 B C 6.3163 24.239 27.386 14.542 23.424 6.310 6.375 3.555 TABLE VI F INAL C LUSTER C ENTERS

We then make the dataset $DB_G$ explicit and run the k-means algorithm to determine the four cluster centers. The resulting cluster centers are shown in Table VII. The two sets of cluster centers are very close. We ran the tests by varying the number of clusters for a $DB_G$ that had in excess of 500 points and was partitioned similarly into two projections, and we measured the clustering error for each case. The plots in Figure 5 show the total clustering error for cluster centers determined by our decomposable algorithm and by direct application of a k-means algorithm on an explicitly created $DB_G$. It can be seen that the difference between the error quantities is very small and follows the same pattern in the two tests. The plots also show the estimated clustering error that is computed by the coordinating site to guide itself towards the final cluster centers. This quantity reduces faster than the actual clustering error but follows the same trend, and thus can guide the search towards the minimum-error cluster centers.
B. Example of Horizontal Partitioning
The case of a horizontally partitioned global database can be looked at as a special case of vertical partitioning. In horizontal partitioning an identical set of attributes is used at each local site. All attributes can, therefore, be considered shared attributes and the above algorithm then applied. We show results of the decomposable algorithm with an example of a horizontally partitioned database. We consider three local databases, each consisting of points in a 2-dimensional space. The layouts of points from the three local databases are shown in Figures 6, 7 and 8, respectively. If the three databases were to be collected at a single site, the collective layout of their points would be as shown in Figure 9. During the first phase of the algorithm, clustering is performed at the local sites using the k-means algorithm. We obtained a total of fifteen local cluster centers at the central coordinating site. Figure 10 shows the locations of these cluster centers. Globalization of local cluster centers is not needed here because all attributes are shared by each database. These
Fig. 6. Dataset at site-1
Fig. 7. Dataset at site-2
Fig. 8. Dataset at site-3
Fig. 9. Data points at all three sites
fifteen points, based on their population information, are used to form the global clusters. The candidate cluster centers and their populations are shown in Table VIII.

TABLE VIII. Candidate cluster centers with their population

X       Y       Population
5.03    52.11   37
52.19   36.68   35
38.26   9.018   33
20.85   24.54   40
4.77    4.08    25
32.06   17.85   12
43.22   12.03   33
10.43   7.22    28
58.18   37.86   62
43.02   26.62   33
8.319   61.21   60
4.01    9.14    38
53.32   40.88   10
57.17   58.12   430
37.36   24.16   38
Just for illustration, we first run the k-means algorithm on the set of local cluster centers, without applying any weighting derived from the populations at the cluster centers, to create six clusters. The result is shown in Figure 11; the linear boundaries around the points show the cluster boundaries as determined by the algorithm. Figure 12 shows the results of clustering when the weights based on the populations of the individual cluster centers are taken into account to minimize the clustering error as described in equation 2 above. It should be noted that the point at the top right corner is now clustered by itself instead of being clustered along with some others as in Figure 11 (the simple k-means algorithm). This occurs because the single point has a large population, and if it were merged with other points a larger contribution would be made to the clustering error. The normal k-means algorithm, working on a collection of all points brought to a single site, also results in this cluster being separate from the other points, because it then minimizes the total error.

VI. ANALYSIS AND COMMENTS
The cost of working with an implicitly specified set of tuples can be measured in various ways. One cost model computes the number of messages that must be exchanged among the various sites. Complexity for distributed query processing in databases has been discussed in [24], and that cost model measures the total data transferred for answering a query. In our case the amount of data transferred is very little (statistical summaries), but the number of messages to be exchanged may grow rapidly with the number of iterations of the clustering algorithm. We derive below an expression for the number of messages that need to be exchanged for our clustering algorithm dealing with the implicit set of tuples. Let us say: (i) there are $n$ relations, $DB_1, \ldots, DB_n$, residing at different network sites; (ii) there are $s$ attributes in the set $S$, and each attribute in this set appears at more than one site; (iii) there are $a$ distinct attributes in all the attribute sets $A_i$ combined; (iv) there are at most $v$ possible discrete values for each attribute in the set $S$; and (v) the number of globalized cluster centers in $G$ is $|G|$. The number of tuples in the relation Shareds is at most $v^s$ because it contains all possible combinations of values for its attributes. Fortunately, $s$ is very small for real-life datasets, as often there are only a few common attributes among different nodes.

A. Complexity
We have three scenarios for computing the complexity of our algorithm:
1) Communication Cost Only: In this cost model we count the number of messages that must be exchanged among all the participating sites in order to complete the execution of the algorithm; that is, one site is asked for its cluster centers, an answer is obtained, and then the request is sent to the next participating database.
Fig. 10. The Local Cluster Centers
Fig. 12. Clustering Using the Population-Weighted Algorithm (panel legend: Population Weighted and Population Variance Weighted Algorithm Results)
Fig. 11. Clustering Using the k-means Algorithm (panel legend: Kmeans Algorithm Results)
Each product term in the expression for counting tuples in equation 3 above requires an exchange of $n$ messages between the coordinating site and the participating sites. The product steps are repeated for each tuple in Shareds, and therefore the total number of messages to be exchanged for determining the count of tuples around one globalized center is on the order of $|Shareds| \cdot n$. In this case the algorithm requires: on the order of $n$ exchanged messages to perform the local clustering, and on the order of $|G| \cdot |Shareds| \cdot n$ exchanged messages to compute the population for every $g \in G$. The total number of exchanged messages will then be on the order of $n + |G| \cdot |Shareds| \cdot n$. This cost is independent of the number of tuples in a database, which is what usually expresses the problem size. The number of attributes is always fixed and is generally very small, especially when compared to the number of records in databases; therefore this cost represents a very efficient algorithm.
2) Communication plus Local Computation Cost: In this model we examine a weighted sum of the number of messages exchanged and the number of local operations performed. For each exchanged message, the algorithm performs a local operation at the responding site. Therefore, the total cost for the algorithm will be $w_m M + w_c L$, where $M$ is the number of messages, $L$ is the number of local operations, and $w_m$ and $w_c$ are the weights representing the relative costs of exchanging a message and performing a local operation.
3) Elapsed Communication plus Local Computation Cost: In this model we examine the same weighted sum while discounting the effects of messages and operations that can be executed in parallel, simultaneously at different sites. Therefore, the total cost for the algorithm will be the weighted sum taken over the longest chain of dependent messages and operations, with $w_m$ and $w_c$ again representing the relative costs of exchanging a message and performing a local operation.

B. Advantages of the Decomposable Algorithm
The above analysis of complexity shows that the number of messages that need to be exchanged among the sites does not depend on the size of the database at each site. The communication complexity, in the case of vertically partitioned data, depends primarily on the number of shared attributes and the manner in which they are shared among the participating sites. This is significant because it shows that as the sizes of the individual databases grow, the communication complexity of the algorithm remains unaffected. The computational cost of the local computations grows with the database size at each individual site, but our decomposable version has an advantage in this regard also over the transport, join, and then cluster alternative. If a k-means algorithm runs $I$ iterations to find $K$ cluster centers over $N$ data points, it must compute on the order of $I \cdot K \cdot N$ distances. If each local database has $t$ tuples, then in the worst case the join of the $n$ local databases would produce a relation containing on the order of $t^n$ tuples, with an additional cost of comparisons for creating the Join. When the k-means algorithm is run with this explicitly created $DB_G$, we would need to compute on the order of $I \cdot K \cdot t^n$ distances. In our decomposable version, each of the sites computes only on the order of $I \cdot k_i \cdot t$ distances. Thus, there is a tremendous saving in the computational cost when the decomposable version is executed instead of moving the data, creating a Join, and then running the clustering algorithm. Also, for the communication cost, the number of partial results that need to be transmitted is far fewer than the messages that would have to be transmitted if entire databases were collected at some central site. Another important gain of the decomposable version is that it preserves the privacy of the data by not requiring any data tuples to be placed on a communication network. It also preserves the integrity of the individual databases, because no site needs to update or write into any of the participating databases. All the queries are strictly reading queries.

C. Privacy and Security Considerations
We have demonstrated above that the k-means algorithm can be very closely approximated for distributed databases without

C. Privacy and Security Considerations

We have demonstrated above that the k-means algorithm can be very closely approximated for distributed databases without having to move the databases to a centralized site. From the point of view of data security and privacy, the following observations can be made:
1) No data tuple is exchanged between the sites.
2) Initial cluster centers from the local sites are transmitted to the coordinating site. Global cluster centers are maintained within the coordinating site and never transmitted.
3) When computing the population around global cluster centers, the coordinating site sends only the locally relevant attributes to each site, and distances from the attribute-value pairs of the local data are returned only for the subset of attributes contained in the global cluster centers.

If information security and privacy are defined as never releasing any data tuple out of a database for transmission over the network, and as the impossibility of reconstructing any data tuple from the released data summaries, then the above algorithm preserves the privacy of the data in each participating database. No data tuple is ever transmitted, and the summaries are not sufficient to reconstruct any individual data tuple. The only loophole arises when a cluster containing a single data point is formed and its contents are released to other sites in the form of that cluster's center. This is easily avoided by setting a minimum threshold t (at least t = 2), so that any summary a local site sends out must cover at least t tuples. The tradeoff is that the algorithm is then unable to form clusters smaller than t in size.

VII. CONCLUSION

In this paper we have presented a decomposable version of the k-means clustering algorithm for vertically and horizontally partitioned datasets that are geographically distributed. The algorithm obtains results very close to those that would be achieved by moving all the databases to one site, joining them, and then executing the k-means clustering algorithm. Our distributed version does so by minimizing the total clustering error, a characteristic property of the k-means algorithm. We use the information about the clusters formed at local sites to determine the approximate locations of the possible global cluster centers. Information about the centers, and an algorithm to count populations of points around cluster centers in an implicitly specified relation, are used by the central coordinating site to minimize a close estimate of the total clustering error. We have demonstrated that our version and the original k-means algorithm converge to centers that are very closely placed, as signified by a very small difference in the total clustering error. Our version achieves these very close results at a very great saving in total communication cost, and it also preserves the privacy and integrity of the individual databases.

Generating Cryptographic Keys from Face Images While Preserving Biometric Secrecy
Alwyn Goh¹, David Chek-Ling Ngo², Andrew Beng-Jin Teoh², Wai-Kuan Yip²

¹ Corentix Technologies Sdn Bhd, B-5-06 Kelana Square, 17 Jln SS 7/26, 47301 Petaling Jaya, Malaysia
alwyn@corentix.com

² Faculty of Information Science and Technology (FIST), Multimedia University, Jln Ayer Keroh Lama, 75450 Melaka, Malaysia
{david.ngo, bjteoh, yip.wai.kuan04}@mmu.edu.my

Abstract
This paper proposes a method of extracting cryptographic keys from face images that does not require storage of the face template and does not reveal statistical information that could be used for template reconstruction. The keys produced are also not permanently linked to the biometric, allowing them to be replaced in the event of key compromise. This is achieved by injecting randomness, which provides high entropy to the naturally low-entropy biometric bitstrings, using the iterative inner-product method of the Biohash scheme of Goh-Ngo [7], together with a user-specific permutation and error correction. We also show that our proposed construction is a secure one-way transformation and has sufficient unpredictability in its key space, as in good block cipher design.

Keywords: Biometric Security, Cryptographic Key, Face Verification

1. Introduction
The poor security of password-based (what you know) access systems and the theft of private keys (what you have) have prompted the use of biometrics (what you are) as an additional, or even replacement, authentication factor. In this paper, we consider utilizing face images as the biometric for key transformation, because face verification is a physically simple and socially accepted method of authentication. In this context, we have identified a few requirements for a secure and good cryptographic key extraction technique: (1) no face template storage, (2) reissuable keys, (3) one-way transformation, (4) biometric secrecy protection, (5) error tolerance, (6) unpredictable key space and (7) secure transformation. Most conventional face verification schemes require a template of the face to be stored for later comparison

either on a server or on offline storage devices, eg smart cards. Unfortunately, this provides no security in the event the template is stolen, as the user cannot re-register a new face. Hence, it is desirable for the transformation method to allow reissuable keys in the event of a key compromise. Also, the transformation from face to cryptographic key should not be easily reversed, to thwart attempts at recovering the biometric data through analysis of compromised keys. Hence, throughout the transformation process, no statistical information that can be used for reconstruction of the biometric data should be revealed. This ensures that an adversary cannot recover the secret biometric from user-specific statistics, which could also leak the keys. In terms of the keys generated, it should not be possible for an adversary to perform statistical extraction of key-space patterns by intercepting multiple keys. We note, however, that this is a non-trivial problem: biometric readings are prone to variation, so it is impossible to derive keys stable enough for cryptographic usage without storing some information that helps reconstruct close-to-exact keys. Some limited form of correction is therefore needed to produce stable keys. In terms of the key space, genuine keys should differ sufficiently, in terms of bits, from non-genuine keys, and keys should be uniformly distributed. Finally, the security design principles of block ciphers, which promote robustness against cryptanalysis, can be adopted, as block ciphers bear strong similarity to the biometric-to-key transformation.

2. Literature Review
The transformation of biometrics into cryptographic bitstrings is a relatively new direction of research, spurred by the need to incorporate biometric data into cryptographic algorithms and protocols. Earlier works in this area were based on fingerprint, face, keystroke and voice data. Key generation from a voice passphrase was proposed by Monrose et al [1][2], which used a scalable


vector of biometric features in conjunction with a randomized lookup table generated using a generalized secret-sharing scheme. The iris identification scheme of Davida et al [3] uses a different approach, in which error-correcting codes are employed: during enrollment, a digital signature that links to the iris biometric is generated and stored on a smartcard distributed by a trusted authority. Chang et al [4] utilised user-dependent statistics to generate multiple bits, allowing more compact and distinguishable keys; the feature space is divided into multiple segments, so that more than one bit can be assigned per feature. The main security issue with these schemes is that the keys are permanently associated with the biometric: when stolen, a new biometric would need to be used, which is not possible for physiological biometrics such as fingerprint, iris and face. An additional token may be combined with the biometric to allow cancelable keys, as shown in Soutar et al [6], which combines the Fourier transforms of biometric images with a random digital key, enabling the key, or bioscrypt, to be modified later in the event of key loss. However, the scheme did not explain the cryptographic security of the transformations in a satisfactory manner, and no results were published. Goh-Ngo [7] introduced cancelable keys via inner products between a randomized token and face data; this is advantageous in comparison to Soutar et al as the step is a one-way process. Juels-Wattenberg [8] extended the work of Davida et al [3] by introducing the idea of a fuzzy commitment: a difference vector is computed by taking the difference (via XOR) between the biometric key and a reissuable secret. At the verification step, the test biometric is added to the difference vector to recover a noisy secret, which is decoded back to the original secret using an error-correcting code. Juels-Sudan [9] improved the earlier version by incorporating polynomial-based secret sharing on the secret message in their fuzzy vault scheme. Clancy et al [10] implemented the fuzzy vault scheme on fingerprints, but also pointed out that a perfect Juels-Sudan vault is not implementable, and presented methods to improve and optimally configure the vault for fingerprint data. Dodis et al [11] gave general definitions of the fuzzy extractor and the secure sketch, whereby the former denotes a method for extracting randomness from the original biometric, while the latter produces public information that allows recovery of the secret without compromising the secret itself. The authors proved how error-correcting codes and permutation groups can be used to achieve these primitives, and showed how the Juels-Wattenberg scheme fits into their models. However, Boyen [12] later pointed out that the earlier method did not address repeated extraction of the secret, which would

leak the secret key itself. The author proposed a fully randomized generic fuzzy sketch and a client-to-server remote biometric authentication protocol that allows a user to store the public information on a server. In Boyen et al [13], the authors extended the scheme to provide two-way authentication. The methods of [11-13], however, did not address the transformation of the extracted biometric vector (in real space) into bitstrings, and assumed binary-space operations on already-extracted bitstrings, which are simpler and easier to implement. However, simpler operations also mean that the biometric-random token mixing process can be easily reversed: for example, as Boyen [12] mentioned, it is easy to recover the biometric bitstrings through the XOR operation in the Juels-Wattenberg scheme. Our proposed scheme, on the contrary, implements the random mixing in the real biometric space, so that the biometric features cannot be recovered exactly, although partial information on the range of the biometric features in real space could be recovered if multiple random tokens and keys were compromised.
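To make the reversibility point concrete, here is a toy sketch (illustrative only, not any cited author's exact construction) of binary-level XOR binding in the Juels-Wattenberg style; holding the commitment together with a recovered codeword yields the biometric bits by a single XOR:

```python
import secrets

n = 16
biometric = [secrets.randbits(1) for _ in range(n)]  # enrolled biometric bits
codeword = [secrets.randbits(1) for _ in range(n)]   # reissuable secret codeword
# Fuzzy-commitment style binding: commitment = biometric XOR codeword.
commitment = [b ^ c for b, c in zip(biometric, codeword)]
# If the commitment and the codeword (key) are both compromised,
# the biometric bits are recovered exactly by one more XOR:
recovered = [w ^ c for w, c in zip(commitment, codeword)]
assert recovered == biometric
```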

3. Proposed Methodology
The proposed methodology is essentially a one-way analogue-to-digital (A/D) transformation of form F : 2^m x V -> 2^n, taking as inputs a token bitstring and raw biometric data from the vector space V. Our scheme (see Fig 1 for an outline) features projection of the biometric data onto multiple independent random subspaces, with subsequent threshold-based discretisation, permutation, and error correction as in Davida et al [3]. We also utilize the Fourier transform for lossy compression into the biometric feature vector, instead of the more common Principal Component Analysis (PCA), so as to avoid storage of large eigenbases as in the original Goh-Ngo [7] construction. The permutation is an extension that enhances the entropy of the final keys, so as to reduce cryptanalysis of the keys to brute-force attack.

Our methodology comprises: (1) biometric extraction and compression, (2) computation of an A/D hash in which biometric data is irreversibly mixed with random token data, and (3) error correction so as to ensure stable biohash outputs. For this paper we use Fourier analysis of face images, as an alternative to the eigenanalysis of [7]. We follow the work of Harmon [14] and Sergent [15] in our extraction of the low-frequency spectra for biometric analysis. This approach is vindicated by Nastar et al [16], who studied the effect of user (different images of the same face) and imposter (different faces) variations on Fourier spectra. The detailed steps follow:


Step 1: Discrete Fourier Transform (DFT) of the input face image into the frequency domain. A two-dimensional input signal f(x, y) is represented by a discrete Fourier integral of form:

b(p, q) = (1/MN) Σ_{x,y} f(x, y) e^{-i2π(px/M + qy/N)}   [Eq 1]

with summation limits x = 0...M-1 and y = 0...N-1. This computation is efficiently implemented using Cooley-Tukey [17], following which all (p, q) spectral components are normalised via division by the total energy content Σ_{p,q} |b(p, q)|.

Step 2: Extraction and normalisation of perceptually significant DFT components, resulting in acceptable performance in terms of the equal error rate (EER), taken as the average of the minimum false acceptance (FA) and minimum false rejection (FR) rates between genuine users and imposters. Our method is to extract the low-frequency components encompassed within equilateral triangular windows of dimension δ in the first two transform-domain quadrants, the latter being sufficient due to the DFT symmetries resulting from real-valued inputs. These elements are then formed into a one-dimensional vector b = (b_p) for p = 1...ℓ < MN, and subsequently normalised via subtraction of the mean vector computed from population-wide image samples. This ensures that the normalised biometric b has essentially equal weight of components on either side of any hyperplane which intersects the transform-space origin.

[Fig 1 (flowchart omitted): Proposed method. Face images and the token enter feature extraction (1. DFT, 2. Normalization against the mean b); biometric hashing follows (3. Inner products against an RNG-derived random basis, 4. Discretization against the threshold, 5. Permutation); finally 6. Error correction, with parity encoding at enrollment and correction at verification, yields the key]

Step 3: Iterative inner-product computation as in [7]. Each inner-product of form:

c_i = ⟨a_i, b⟩ = Σ_p a_{i,p} b_p   [Eq 2]

for i = 1...m < ℓ functions as a random extractor for biometric b. The process projects b onto successive random subspaces, each specified by a random vector a_i, with all vectors mutually orthonormal such that ⟨a_i, a_j⟩ = 0 for i ≠ j. This condition is straightforwardly enforced via the Gram-Schmidt algorithm, with the random vectors taken from some random number generator (RNG) parameterised by the external token θ. Subsequent to this, all c_i are normalised so as to lie in the interval (-1, 1).

Step 4: Discretisation into a bitstring β = (β_i) for i = 1...n < m via the threshold-based decision β_i = 0 if c_i ≤ -ε, β_i = 1 if c_i ≥ ε, and indefinite assignment if c_i ∈ (-ε, ε), with ε being a small empirical parameter as in [7]. Note this step can be regarded as error correcting, due to the subsequent exclusion of indefinite bit values.

Step 5: Permutation x_i = β_{π(i)} of the discretised outputs, so as to diffuse the influence of each inner-product computation throughout the entire binary string. In this case the random permutation π is also obtained from the above-discussed RNG parameterised by token θ.

Step 6: Correction of bitstring x = (x_i) to some specified reference value x̄, with the important stipulation that the secrecy of x̄ is preserved. This is accomplished via application of [n, k, d]_{2^q} error-correcting codes, ie Reed-Solomon (RS) codes with n-symbol codewords carrying k-symbol information content over the Galois field 2^q, with d the minimum symbolic distance. Such codes are able to correct up to (d-1)/2 errors via prior storage of the (n-k)-symbol parity checksums computed from the k-symbol information. The threshold d is chosen empirically so as to correct small bit-differences in x from authentic users, while leaving uncorrected the large bit-differences from imposters.

Note that the proposed system does not require storage of the reference biometric or of statistical information derived thereof. The only stored parameter is the checksum of Step 6, which does not reveal any information about the reference output bitstring. There is an additional level of security from this bitstring resulting from an irreversible compression of the token and biometric inputs. It is also noteworthy that the token and biometric inputs are mutually independent, so that key compromise incidents are straightforwardly handled via token replacement.
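To fix ideas, the following is a minimal numpy sketch of Steps 1-5 (Step 6's Reed-Solomon correction is omitted). It is an illustration under stated simplifications rather than the authors' implementation: a rectangular low-frequency block stands in for the triangular windows, the population mean subtraction of Step 2 is skipped, QR factorisation replaces explicit Gram-Schmidt, and every name and parameter value (m, eps, the 8x8 window) is hypothetical.

```python
import numpy as np

def biohash(image, token_seed, m=40, eps=0.05):
    """Sketch of Steps 1-5: DFT, low-frequency feature vector, random
    orthonormal projection, thresholded discretisation, permutation."""
    # Step 1: 2-D DFT, normalised by the total energy content.
    spectrum = np.fft.fft2(image)
    spectrum = spectrum / np.sum(np.abs(spectrum))
    # Step 2 (simplified): a low-frequency block flattened into vector b;
    # the paper uses triangular windows and subtracts a population mean.
    w = 8
    block = spectrum[:w, :w]
    b = np.concatenate([block.real.ravel(), block.imag.ravel()])
    ell = b.size                      # feature dimension; m < ell required
    # Step 3: m mutually orthonormal random vectors from a token-seeded RNG
    # (QR factorisation of a random matrix in place of Gram-Schmidt).
    rng = np.random.default_rng(token_seed)
    basis, _ = np.linalg.qr(rng.standard_normal((ell, m)))
    c = basis.T @ b                   # the m inner products <a_i, b>
    c = c / np.max(np.abs(c))         # normalise into [-1, 1]
    # Step 4: threshold discretisation; bits with |c_i| < eps are
    # indefinite and excluded (a first layer of error tolerance).
    definite = np.abs(c) >= eps
    bits = (c[definite] > 0).astype(np.uint8)
    # Step 5: token-driven permutation, diffusing each inner product's
    # influence across the entire bitstring.
    return bits[rng.permutation(bits.size)]

key_bits = biohash(np.random.default_rng(0).random((64, 64)), token_seed=1234)
```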

4. Security Considerations and Experimental Results

An unpredictable key space as produced by the above-described method should be: (1) complete, (2) bit-independent, and (3) subject to the avalanche effect [19, 20, 21]. Cryptosystems are deemed complete if outputs are dependent on the entirety of the input (as opposed to subsets thereof), and bit-independent if each output bit is probabilistically uncorrelated with respect to all others. Avalanching, on the other hand, describes an effect in which a substantive change in the input changes approximately half of the output bits. For conventional cryptosystems, substantive denotes a single bit difference, while in our case it refers to biometric inputs sufficiently different from the reference user so as to be regarded as associated with an imposter.

Proposition 1: Output β is dependent on the entirety of input biometric b.

Proof: Each output bit β_i in Step 4 of the previous section depends on the inner-product c_i = ⟨a_i, b⟩, which is defined over the perceptually important spectral components of biometric vector b. Any input variation b -> b + Δb, with Δb = Δb∥ + Δb⊥ decomposed into components parallel and perpendicular to a_i, results in a corresponding output variation c_i -> c_i + Δc_i with Δc_i = ⟨a_i, Δb⟩. This has an excellent chance of affecting outcome β_i if Δb is of the same order as b, and if Δb is randomly distributed between Δb∥ and Δb⊥. The mutual orthonormality of all the a_i then allows such a decomposition of Δb with respect to every a_i, so that any given component of Δb affects one or more inner-products c_i, and consequently one or more outputs β_i. Output β is hence dependent on the entirety of input b.

Proposition 2: Output bits β_i are independent of one another.

Proof: Consider relaxation of the stipulation that a_i and a_j for i ≠ j are orthonormal, such that the decomposition a_j = a_j∥ + a_j⊥, into components parallel and perpendicular to a_i, is possible. Consequently c_j = c_j∥ + c_j⊥, where c_j∥ = ⟨a_j∥, b⟩ is manifestly correlated to c_i, and hence output bit β_j to β_i. Our construction of mutually orthonormal a_i elements ensures that all c_i are mutually uncorrelated, which results in all β_i being mutually independent.

Occurrence of the avalanche effect is demonstrated via experimentation on the Essex database [22] (with faces normalized according to Ngo et al [23]), from which we selected 100 users (10 images each) for testing and 100 users (3 pictures each) for computation of the mean DFT. Comparison of differences between reference and test data is done in terms of Hamming (as opposed to the more common R^n or L_2) distances. We tested two scenarios, ie with: (1) all users matched with the same token θ, and (2) each user matched to a different θ. It is also necessary to determine the optimal configurations of the Fourier-domain window δ and the RS error-correcting code, as respectively illustrated in Figs 2 and 3 below.

[Fig 2 (plot omitted): EER for the different-token (typical) and same-token (worst-case) scenarios before error correction, as a function of window size δ = 12...22; optimal result at δ = 19]

[Fig 3 (plot omitted): mean Hamming distances for genuine and imposter distributions in the different- and same-token scenarios at window size δ = 19 after error correction, across the configurations NoCorrect, Ham(3), RS(4,7), RS(4,9), RS(5,13), RS(6,23), RS(6,25), RS(6,27) and RS(7,55); optimal result for RS(6,23)]
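Before turning to error correction, here is a sketch of how the Hamming-distance comparisons behind Figs 2-7 might be scripted, reusing the hypothetical biohash function from the earlier sketch (the data, seeds and perturbation level are illustrative, not the Essex protocol; eps=0.0 keeps all m bits so that bitstrings remain index-aligned for comparison):

```python
import numpy as np

def hamming_fraction(x, y):
    """Fraction of differing bits between two equal-length bitstrings."""
    return float(np.mean(x != y))

rng = np.random.default_rng(0)
reference = rng.random((64, 64))                             # enrolled image
genuine = reference + 0.02 * rng.standard_normal((64, 64))   # same face, small variation
imposter = rng.random((64, 64))                              # different face

ref_bits = biohash(reference, token_seed=42, eps=0.0)
# Genuine user with own token: distance should be near zero.
print(hamming_fraction(ref_bits, biohash(genuine, token_seed=42, eps=0.0)))
# Imposter holding the stolen token (same-token, worst case).
print(hamming_fraction(ref_bits, biohash(imposter, token_seed=42, eps=0.0)))
# Imposter with a different token: avalanching pushes this toward 50%.
print(hamming_fraction(ref_bits, biohash(imposter, token_seed=7, eps=0.0)))
```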

The objective of cryptographic key generation requires error correction, as can be seen from Figs 4 and 5 below, which depict the user and imposter distributions of Hamming distances for the uncorrected bitstring x.

[Fig 4 (plot omitted): user and imposter distributions for x in the different-token scenario]

[Fig 5 (plot omitted): user and imposter distributions for x in the same-token scenario]

Note the good separation between user and imposter bitstrings, with EERs of 0.41% and 3.4% for the different- and same-token scenarios respectively. On the other hand, bitstrings in the user distributions still differ by averages exceeding 20% of the total bitlength. Rectification of this output instability requires application of error correction, specifically RS(6,23) as indicated in Fig 3. Figs 6 and 7 below illustrate the resulting enhanced user and imposter distributions for the corrected bitstring x̄.

[Fig 6 (plot omitted): user and imposter distributions for x̄ in the different-token scenario]

[Fig 7 (plot omitted): user and imposter distributions for x̄ in the same-token scenario]

Error correction moves the user distribution leftwards to a near-zero mean of 0.03%, with 80% of the user bitstrings successfully corrected. The other noteworthy occurrence is the peak of the imposter distributions at 50% of the bitstring length, which is a demonstration of the avalanche effect. This effect, explained in Teoh et al [24] via a statistical result utilizing Gaussian randomness-based tokens, also holds for all the other configurations shown in Fig 3.

Our scheme can be argued to be a computationally secure one-way transformation via the information-theoretic properties of: (1) confusion, (2) diffusion, and (3) non-commutative composition, as set out by Shannon [5]. Confusion conceals the relationship between biometric b and output x (or even intermediate values), so that the former cannot be deduced from analysis of the latter. Diffusion, on the other hand, dissipates the effect of input b over the entirety of output x, and can be implemented using permutations. Finally, the composition principle states that cascades of sub-cipher operations enhance the overall cipher strength so long as the component operations are associative but not commutative, ie forward-only transformations. All three design considerations can be observed in well-known ciphers [18]. In our scheme, confusion is implemented by the inner-product and discretisation operations of Steps 3 and 4, and is broadly analogous to the S-box construction of DES or the modular multiplication of IDEA. This transformation is irreversible on the basis that resolution of the inner-products to obtain biometric b is an intractable problem, as subsequently explained.

Proposition 3: Determination of biometric vector b from random vectors a_i and inner-products c_i for i = 1...m < ℓ is an intractable problem if the token parameter θ is not known.

Proof: The inner-products c_i = ⟨a_i, b⟩ for i = 1...m, with ⟨a_i, a_j⟩ = 0 for i ≠ j, form a linearly independent system of m equations. On the other hand, b has ℓ > m unknown parameters, and is hence not recoverable from the above-discussed system if the token θ is not known. Determination of b from the a_i and c_i is hence an intractable problem.
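A toy numerical check of this proposition (dimensions, seeds and names are illustrative): with ℓ = 128 unknowns and only m = 40 disclosed inner products, even the best least-squares reconstruction leaves most of b undetermined, and without the token-derived basis the estimate is essentially unrelated to b.

```python
import numpy as np

ell, m = 128, 40                                   # ell > m: more unknowns than equations
b = np.random.default_rng(7).standard_normal(ell)  # secret biometric vector

token_rng = np.random.default_rng(42)              # RNG parameterised by the token
basis, _ = np.linalg.qr(token_rng.standard_normal((ell, m)))  # orthonormal a_i
c = basis.T @ b                                    # the m disclosed inner products

# Attacker without the token can only guess some other random basis:
wrong, _ = np.linalg.qr(np.random.default_rng(99).standard_normal((ell, m)))
b_wrong, *_ = np.linalg.lstsq(wrong.T, c, rcond=None)
print(np.linalg.norm(b_wrong - b) / np.linalg.norm(b))     # ~1: no recovery

# Even knowing the correct basis, m equations fix only an m-dimensional
# projection; the minimum-norm solution still misses ell - m dimensions:
b_min_norm = basis @ c
print(np.linalg.norm(b_min_norm - b) / np.linalg.norm(b))  # still large
```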

Discretisation following the above-discussed inner-product computations is a lossy quantisation, and can be regarded as an irreversible symbol-substitution process. This irreversibility holds even in the adversarial scenario where the token parameter θ and the random sequence a_i are known. Diffusion is implemented via the permutation of Step 5, which is analogous to the P-box construction in basic block ciphers. In our case, the permutations are randomly generated from the token parameter θ. This process is obviously reversible if θ is known, but does not compromise biometric vector b, due to Prop 3.

Proposition 4: The sequence of operations, ie (1) lossy compression of the biometric vector, (2) inner-product randomisation, (3) discretisation, and (4) permutation, is a non-commuting composition, and hence a one-way transformation.

Proof: The biometric extraction used in Step 3 is essentially a lossy compression of form F_1 : V -> R^m, with V being the biometric vector space of interest, ie Fourier in this case, or Euclidean for [7]. This step is clearly irreversible. The subsequent inner-product randomisation F_2,i : R^m x R^m -> (-1, 1) for each random element a_i also entails information loss, with the composition of F_1 and the F_2,i irreversible by Prop 3. The non-commutation of the F_2,i and the discretisation F_3 : (-1, 1)^m -> 2^n can be seen from the dissimilarity between the range of F_3 and the domain of the F_2,i. F_3 also results in additional information loss, hence the non-commutation of F_3 and the permutation F_4 : 2^n x Σ(n) -> 2^n, with Σ(n) the group of permutations on n symbols. F_4 is itself furthermore irreversible so long as the token parameter θ is not known. Putting it together, the composition of F_1, the F_2,i, F_3 and F_4 is non-commutative, and hence a one-way transformation.

One of the assumptions we have made in this section is that the probability of the user's random token being stolen is low. We note that this assumption is weak, as it is possible for a user to lose his or her token a few times within a short period of time. However, because the mixing stage in our scheme, via the iterative inner products, takes place in real space, an adversary collecting the compromised keys cannot directly recover the biometric secret exactly, but only a range of the biometric features. This affords higher security than binary-level mixing via XOR, as in the earlier Juels-Wattenberg scheme.

5. Concluding Remarks & Future Works

The experimental results confirm that the proposed method exhibits high unpredictability of the key space: as long as the tokens are not stolen, cryptanalysis is reduced to a brute-force attack on the key space. We believe that our method is a significant contribution to the design of secure biometric authentication that requires no storage of templates and/or statistical information, and in which the keys are replaceable. The one-way transformation and the non-storage of user-specific statistics guarantee that an adversary cannot recover the biometric feature vector from stolen keys. The biometric-to-key transformation of R^n -> {0,1}^n, instead of the simple {0,1}^n -> {0,1}^n of the fuzzy extractor method, provides a robust shield against recovery of the exact biometric feature bits in a multiple-stolen-key attack. We hope to provide formal definitions of the leakage associated with such attacks, using formal proofs and information-theoretic constructions, in future work.
Acknowledgement: We would like to thank LiWu Chang and various anonymous reviewers for their useful comments and corrections.

6. References
[1] Monrose, F., Reiter, M.K., Li, Q. & Wetzel, S. (2001). Cryptographic Key Generation from Voice. Proceedings of the 2001 IEEE Symposium on Security and Privacy, May 2001.
[2] Monrose, F., Reiter, M.K., Li, Q., Lopresti, D.P. & Shih, C. (2002). Toward Speech-Generated Cryptographic Keys on Resource-Constrained Devices. 11th USENIX Security Symposium, 2002.
[3] Davida, G., Frankel, Y., Matt, B.J. & Peralta, R. (1999). On the Relation of Error Correction and Cryptography to an Off-Line Biometric Based Identification Scheme. WCC99, Workshop on Coding and Cryptography, January 1999, Paris.
[4] Chang, Y.C., Zhang, W. & Chen, T. (2004). Biometric-based Cryptographic Key Generation. IEEE Conference on Multimedia and Expo, Taiwan, 2004.
[5] Shannon, C.E. (1949). Communication Theory of Secrecy Systems. Bell Systems Technical Journal, Vol. 28, pp. 656-715.
[6] Soutar, C., Roberge, D., Stoianov, A., Gilroy, R. & Kumar, B.V.K.V. (1998). Biometric Encryption Using Image Processing. SPIE 3314, pp. 178-188.
[7] Goh, A. & Ngo, D.C.L. (2003). Computation of Cryptographic Keys from Face Biometrics. Seventh IFIP CMS 2003, Springer-Verlag LNCS 2828.
[8] Juels, A. & Wattenberg, M. (1999). A Fuzzy Commitment Scheme. In Proc. 6th ACM Conf. Computer and Communications Security, G. Tsudik, Ed., 1999, pp. 28-36.
[9] Juels, A. & Sudan, M. (2002). A Fuzzy Vault Scheme. In Proc. IEEE Int. Symp. Information Theory, A. Lapidoth & E. Teletar, Eds., 2002, p. 408.
[10] Clancy, T.C., Kiyavash, N. & Lin, D.J. (2003). Secure Smartcard-based Fingerprint Authentication. ACM SIGMM Multimedia, Biometrics Methods & Applications Workshop.
[11] Dodis, Y., Reyzin, L. & Smith, A. (2004). Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. EUROCRYPT 2004, LNCS 3027.
[12] Boyen, X. (2004). Reusable Cryptographic Fuzzy Extractors. 11th ACM Conference on Computer and Communications Security (CCS 2004), pp. 82-91, ACM Press, 2004.
[13] Boyen, X., Dodis, Y., Katz, J., Ostrovsky, R. & Smith, A. (2005). Secure Remote Authentication Using Biometrics. Advances in Cryptology, EUROCRYPT 2005, May 2005.
[14] Harmon, L.D. (1973). The Recognition of Faces. Scientific American, 229, 1973.
[15] Sergent, J. (1986). Microgenesis of Face Perception. In: H.D. Ellis, M.A. Jeeves, F. Newcombe, A. Young (Eds.), Aspects of Face Processing, Nijhoff, Dordrecht, 1986.
[16] Nastar, C., Moghaddam, B. & Pentland, A. (1997). Flexible Images: Matching and Recognition Using Learned Deformations. Computer Vision and Image Understanding, 65(2), 179-191, 1997.
[17] Cooley, J.W. & Tukey, J.W. (1965). An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comp., 19 (1965), 297-301.
[18] Schneier, B. (1996). Applied Cryptography, 2nd Edition, John Wiley, 1996.
[19] Kam, J. & Davida, G. (1979). Structured Design of Substitution-Permutation Encryption Networks. IEEE Transactions on Computers, C-28(10), 747-753.
[20] Feistel, H. (1973). Cryptography and Computer Privacy. Scientific American, 228(5), 15-23.
[21] Webster, A.F. & Tavares, S.E. (1986). On the Design of S-Boxes. Advances in Cryptology: Proceedings of CRYPTO 85, Springer-Verlag, New York, pp. 523-534.
[22] Spacek, L. (2000). Face Recognition Data. Available at http://cswww.essex.ac.uk/allfaces/index.html
[23] Ngo, D.C.L., Goh, A. & Teoh, A.B.J. (2004). Front-View Facial Feature Extraction Using Dynamic Symmetry. Technical Report, Multimedia University, 2004.
[24] Teoh, A.B.J., Ngo, D.C.L. & Goh, A. (2005). Quantized Multispace Random Mapping for Two-Factor Identity Verification. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted for review).


Published by Department of Mathematics and Computing Science

Technical Report Number: 2005-13, November 2005, ISBN 0-9738918-9-0
