
DECLARATION

I hereby declare that the academic research work entitled “Privacy preserving Data mining
using group based Anonymization” has not been presented anywhere for the award of a degree
of M.Tech., to the best of my knowledge and belief.

Date: Rani Srivastava

Certificate

This is to certify that the dissertation (Dissertation code – IS 602) entitled “Privacy preserving
data mining using group based Anonymization”, done by Rani Srivastava, Roll
No. 0021010808, is an authentic work carried out by her at Ambedkar Institute of Technology
under my guidance. The matter embodied in this project work has not been submitted earlier for
the award of any degree or diploma, to the best of my knowledge and belief.

Date: Signature of the Guide


Name of the Guide:
Mr. Vishal Bhatnagar,
Asst. Professor,
CSE Department,
Ambedkar Institute of Technology,
Geeta Colony, Delhi – 110031.

Abstract

Group-based anonymization methods, including k-anonymity, l-diversity and t-closeness, were
introduced for data privacy. The t-closeness model was introduced in order to provide a
safeguard against similarity attacks on a published data set. It requires that the Earth Mover's
Distance (EMD) between the distribution of a sensitive attribute within each equivalence class
and the distribution of that attribute in the whole table not exceed a threshold t. Our aim in this
thesis is to provide logical security of data through data anonymity. There are many other models
that could substitute for group-based anonymization, but none of them removes all of its
shortcomings, and they suffer from limitations of their own; this makes the group-based
approach the efficient choice.

Acknowledgement

Success of any project is the outcome of hard work, dedication and co-ordination. This work
would not have been in its present form without the co-operation and co-ordination of many
people, who not only helped me whenever I was hindered but also kept my morale high. I
express my deep sense of gratitude and sincere thanks to my guide, Mr. Vishal Bhatnagar,
not only for his guidance but also for his enthusiasm towards work, which encouraged me to
bring about something new and never stopped me from experimenting. He was always there
to motivate me whenever needed. I express my gratitude to HOD Mr. Manoj Kumar
for understanding students' needs. I would like to express my gratitude to Mr. Suresh Kumar
for his instant direction. I also express my gratitude to principal Prof. Asok De for providing
me the necessary resources and for his worthy suggestions. I owe a deep debt of gratitude to
the faculty members, librarians, lab technicians, and friends for their constructive suggestions
and for giving me the benefit of their substantial experience.

Rani Srivastava

TABLE OF CONTENTS

A. SYNOPSIS

1. INTRODUCTION
1.1 Objectives
1.2 An Overview of Privacy Preserving Data Mining
1.3 Motivation
1.4 Literature Review
1.5 Problems Identified
1.6 Thesis Outline

2. DATA MINING: AN INTRODUCTION
2.1 What Kind of Information Are We Collecting?
2.2 Data Mining and Knowledge Discovery
2.3 Data Mining in Business
2.4 Privacy, Ethics and Data Mining
2.5 Issues in Data Mining

3. PRIVACY PRESERVING DATA MINING
3.1 Confidentiality Issues in Data Mining

4. ANONYMIZATION TECHNIQUES
4.1 Method Based on Data Reduction
4.2 Method Based on Data Perturbation

5. t-CLOSENESS PRIVACY PRESERVING DATA MINING
5.1 k-Anonymity
5.2 l-Diversity
5.3 t-Closeness
5.4 Proposed Work

6. CONCLUSION AND FUTURE RESEARCH
6.1 Conclusion
6.2 Future Research

B. LIST OF TABLES: Table 1, Table 2, Table 3, Table 4

C. LIST OF FIGURES: Figure 1, Figure 2, Figure 3

ABBREVIATIONS

REFERENCES

REPRINT AND PUBLICATIONS

CHAPTER 1

INTRODUCTION

Privacy is an issue that is hotly debated today, and it is likely to continue to be debated
in the future. In the information age, it sometimes seems as if everyone wants to know
everything about you. The rapid transfer of personal information has led to the rise of identity
theft. Because of privacy concerns, it is likely that data mining will become a well-known topic
of discussion within the next ten years. In the light of developments in technology to analyze
personal data, public concerns regarding privacy are rising. While some believe that statistical
and knowledge discovery and data mining (KDDM) research is detached from this issue, the
debate is certainly gaining momentum as KDDM and statistical tools are more widely adopted
by public and private organizations hosting large databases of personal records. One of the key
requirements of a data mining project is access to the relevant data. Privacy and security
concerns can constrain such access, threatening to derail data mining projects. This area brings
together researchers and practitioners to identify problems and solutions where data mining
interferes with privacy and security. Our objectives are as follows:

1.1 Objectives

a. Techniques for protecting the confidentiality of sensitive information, including work on
statistical databases, and obscuring or restricting data access to prevent violation of privacy and
security policies.

b. Underlying methods and techniques to support data mining while respecting privacy and
security, e.g., secure multi-party computation.

c. The meaning and measurement of “privacy” in privacy-preserving data mining.

d. Use of data mining results to reconstruct private information, and corporate security in the
face of analysis of public data by competitors using KDDM and statistical tools.

e. Use of anonymity techniques to protect privacy in data mining.

1.2 An Overview of Privacy Preserving Data Mining


The significant advances in data collection and data storage technologies have provided the
means for the inexpensive storage of enormous amounts of data in data warehouses that reside in
companies and public organizations. A data warehouse is a repository of data collected from
multiple, often heterogeneous, data sources, intended to be used as a whole under the same
unified schema; it gives the option to analyze data from different sources under the same roof
[1]. Beyond routine uses of this data, such as keeping up-to-date profiles of customers and their
purchases or maintaining a list of the available products, their quantities and prices, the mining
of these datasets with existing data mining tools can reveal invaluable knowledge that was
unknown to the data holder beforehand. The extracted knowledge patterns can provide insight to
the data holders and be invaluable in tasks such as decision making and strategic business
planning. Moreover, companies are often willing to collaborate with other entities that conduct
similar business, towards the mutual benefit of their businesses. Significant knowledge patterns
can be derived and shared among the collaborative partners through the aggregate mining of
their datasets. Furthermore, public sector organizations and civilian federal agencies usually
have to share a portion of their collected data or knowledge with other organizations having a
similar purpose, or even make this data and knowledge public. For example, NIH endorses
research that leads to significant findings which improve human health and provides a set of
guidelines which sanction the sharing of NIH-supported research findings with research
institutions. As becomes evident, there exists an extended set of application scenarios in which
data, or knowledge derived from the data, has to be shared with other, possibly untrusted,
entities. The sharing of data and/or knowledge may come at a cost to privacy, primarily for two
reasons:

• If the data refers to individuals, e.g. customers' market basket data (market basket
analysis is a common mathematical technique used by marketing professionals to reveal
affinities between individual products or product groupings), then the disclosure of this
data, or of any knowledge extracted from it, can potentially violate the privacy of the
individuals if their identity is revealed to untrusted third parties.
• If the data concerns business information, then the disclosure of this data, or of any
knowledge extracted from it, may reveal sensitive trade secrets whose knowledge can
provide a significant advantage to business competitors and thus can cause the data
holder to lose business to his/her peers.

The aforementioned privacy issues in the course of data mining are amplified by the fact that
untrusted entities (i.e. adversaries) may utilize other external and publicly available sources of
information, e.g. public reports, in conjunction with the released data or knowledge in order to
reveal protected sensitive information. Since its inception in 2000 with the pioneering work
of [2] and [3], privacy preserving data mining has gained increasing popularity in the data
mining research community. As a result, a whole new set of approaches has been introduced to
allow the mining of data while prohibiting the leakage of any private and sensitive information.
The majority of the existing approaches can be classified into two broad categories:

• Methodologies that protect the sensitive data itself in the mining process, and
• Methodologies that protect the sensitive data mining results (i.e. the extracted
knowledge) produced by the application of data mining.

The first category refers to methodologies that apply perturbation, sampling,
generalization/suppression, transformation, and similar techniques to the original datasets in
order to generate sanitized counterparts that can be safely disclosed to untrusted third parties.
The goal of this category of approaches is to enable the data miner to get accurate data mining
results without being provided with the real data. As part of this category we highlight Secure
Multiparty Computation methodologies, which have been proposed to enable a number of data
holders to collectively mine their data without having to reveal their datasets to each other. The
second category, on the other hand, deals with techniques that prohibit the disclosure of sensitive
knowledge patterns derived through the application of data mining algorithms, as well as
techniques for downgrading the effectiveness of classifiers in classification tasks so that
they do not reveal any sensitive knowledge. In what follows, we further investigate each of these
two categories of approaches.

1.2.1 Protecting the Sensitive Data

A wide range of methodologies have been proposed in the research literature to effectively shield
the sensitive information contained in a dataset by producing a privacy-aware counterpart that
can be safely released. The goal of all these privacy preserving methodologies is to ensure that
the sanitized dataset

(a) properly shields all the sensitive information that was contained in the original dataset,

(b) has properties similar to those of the original dataset (e.g. first/second-order statistics),
possibly resembling it to a high extent, and

(c) yields reasonably accurate data mining results when compared to those attained by mining
the original dataset.

The protection of sensitive data from disclosure has been extensively studied in the context of
micro data release, where methodologies have been proposed for the protection of sensitive
information regarding individuals that is recorded in a dataset. In micro data we consider each
record of the dataset to represent an individual for whom the values of a number of attributes are
recorded, e.g. name, date of birth, residence, occupation, salary, etc. Among the complete set of
attributes, there exist some attributes that explicitly identify the individual (e.g. name, social
security number), as well as attributes which, once combined together or with publicly available
external resources, may lead to the identification of the individual (e.g. address, gender, age).
The first type of attributes, known as identifiers, must be removed from the data prior to its
publishing. The second type of attributes, known as quasi-identifiers, have to be handled by the
privacy preservation algorithm in such a way that, in the sanitized dataset, knowledge of their
values regarding an individual no longer poses a threat to the identification of his/her identity.
The existing methodologies for the protection of sensitive micro data can be partitioned in two
directions:

(a) data modification approaches, and

(b) synthetic data generation approaches.

In [4], data modification approaches are further partitioned into perturbative and non-
perturbative ones, depending on whether they introduce false information into the attribute
values of the data (e.g. by the addition of noise based on a data distribution) or operate by
altering the precision of the existing attribute values (e.g. by changing a value to an interval that
contains it).
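
As a concrete illustration of these ideas, the following minimal Python sketch (the column names "name", "ssn", "age" and "zip" are hypothetical examples) removes identifiers and applies non-perturbative generalization to two quasi-identifiers, recoding a numeric age into an interval that contains it and truncating a ZIP code:

```python
# Sketch of non-perturbative data modification for micro data release.
# The column names ("name", "ssn", "age", "zip") are hypothetical examples.

def sanitize_record(record, age_width=10, zip_prefix=3):
    """Suppress identifiers and generalize quasi-identifiers of one record."""
    sanitized = dict(record)
    # Identifiers must be removed prior to publishing.
    for identifier in ("name", "ssn"):
        sanitized.pop(identifier, None)
    # Generalize age to an interval that contains it (precision reduction).
    lo = (record["age"] // age_width) * age_width
    sanitized["age"] = f"[{lo}-{lo + age_width - 1}]"
    # Generalize the ZIP code by truncating its trailing digits.
    zip_code = record["zip"]
    sanitized["zip"] = zip_code[:zip_prefix] + "*" * (len(zip_code) - zip_prefix)
    return sanitized

record = {"name": "Alice", "ssn": "123-45-6789", "age": 34, "zip": "110031"}
print(sanitize_record(record))  # {'age': '[30-39]', 'zip': '110***'}
```

A perturbative method would instead replace the true values with noisy ones; here the published values remain truthful, only less precise.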

1.2.2 Secure Multiparty Computation

The approaches discussed so far aim at generating a sanitized dataset from the original one,
which can be safely shared with untrustworthy third parties as it contains only non-sensitive
data. Secure Multiparty Computation (SMC) provides an alternative family of approaches that
effectively protect the sensitive data. SMC considers a set of collaborators who wish to
collectively mine their data but are unwilling to disclose their own datasets to each other. As it
turns out, this distributed privacy preserving data mining problem can be reduced to the secure
computation of a function based on distributed inputs, and is thus solved by using cryptographic
approaches; [5] elaborates on this close relation between privacy-aware data mining and
cryptography. In SMC, each party contributes to the computation of the secure function by
providing its private input. A secure cryptographic protocol that is executed among the
collaborating parties ensures that the private input contributed by each party is not disclosed to
the others. Most of the applied cryptographic protocols for multi-party computation reduce to a
few primitive operations that have to be securely performed: secure sum, secure set union, and
secure scalar product. As a final remark, we should point out that the operation of the secure
protocols in the course of distributed privacy preserving data mining depends highly on how the
data is distributed across the collaborators' sites. Two types of data distribution have been
investigated so far:

1. In a horizontal data distribution, each collaborator holds a number of records and for each
record he/she has knowledge of the same set of attributes as his/her peers.

2. In a vertical partitioning of the data, each collaborator is aware of different attributes referring
to the same set of records.
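
To illustrate the secure sum primitive mentioned above, here is a minimal single-process simulation of the classic ring-based protocol under the semi-honest assumption; it is a sketch of the idea, not a hardened implementation:

```python
import random

# Simulated ring-based secure sum: the initiator masks the running total
# with a random offset, each party adds its private input to the masked
# total, and the initiator removes the mask at the end. No single message
# in the ring reveals any individual input.

MODULUS = 2**32  # all arithmetic is done modulo a public constant

def secure_sum(private_inputs):
    mask = random.randrange(MODULUS)           # known only to the initiator
    running = mask
    for value in private_inputs:               # each party in the ring adds
        running = (running + value) % MODULUS  # its input, forwards the total
    return (running - mask) % MODULUS          # the initiator removes the mask

print(secure_sum([12, 7, 30]))  # 49, with no single input ever exposed
```

In a real deployment each addition would happen at a different site, with only the masked running total travelling between parties.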

1.2.3 Protecting the Sensitive Knowledge

In this section, we focus our attention on privacy preserving methodologies that protect the
sensitive knowledge patterns that would otherwise be revealed by mining the data [4]. Similarly
to the methodologies presented for protecting the sensitive data prior to mining, the approaches
in this category also modify the original dataset, but in such a way that certain sensitive
knowledge patterns are suppressed when the data is mined. In what follows, we briefly discuss
some categories of methodologies that have been proposed for the hiding of sensitive knowledge
in the context of association and classification rule mining.

1.2.4 Association Rules Hiding

The association rule mining framework, along with some computationally efficient heuristic
methodologies for the generation of association rules, was proposed in the work of [2]. Briefly
stated, the goal of association rule mining is to produce a set of interesting and potentially
useful rules that hold in a dataset. Whether a rule holds in a dataset is judged by its statistical
significance, quantified with the aid of two measures: confidence and support [5]. All the
association rules whose confidence and support are above some user-specified thresholds are
mined. However, some of these rules may be sensitive from the owner's perspective. The
association rule hiding methodologies aim to sanitize the original dataset in such a way that:

• All the sensitive rules, as indicated by the data holder, that appear when mining the
original dataset for association rules do not appear when mining the sanitized dataset at
the same or higher levels of support and confidence.
• All the non-sensitive rules can be successfully mined from the sanitized dataset at the
same or higher levels of support and confidence.
• No rule that was not found when mining the original dataset can be found in its sanitized
counterpart, when mining the latter at the same or higher levels of support and
confidence.

The first goal simply states that all the sensitive association rules are properly hidden in the
sanitized dataset. The hiding of the sensitive knowledge comes at a cost to the utility of the
sanitized outcome; the second and third goals aim at minimizing this cost. Specifically, the
second goal requires that only the sensitive knowledge is hidden in the sanitized dataset, and
thus no other, non-sensitive rules are lost as side-effects of the sanitization process. The third
goal requires that no artifacts, i.e. false association rules, are generated by the sanitization
process. To recapitulate, in association rule hiding the sanitization has to be accomplished in a
way that minimally affects the original dataset, preserves the general patterns and trends of the
dataset, and conceals all the sensitive knowledge, as indicated by the data holder. Association
rule hiding has been studied along three principal directions:

(a) Heuristic approaches.

(b) Border-based approaches.

(c) Exact approaches.

The first direction collects time- and memory-efficient algorithms that heuristically select a
portion of the transactions of the original dataset to sanitize, in order to facilitate sensitive
knowledge hiding. Due to their efficiency and scalability, these approaches have been
investigated by the majority of researchers in the knowledge hiding field of privacy
preserving data mining. However, as in all heuristic methodologies, the approaches of this
category take locally best decisions when performing knowledge hiding, which may not be, and
usually are not, globally best. As a result, there are several cases in which these methodologies
suffer from undesirable side-effects and may not identify optimal hiding solutions when such
solutions exist. Heuristic approaches can rely on a distortion scheme (inclusion/exclusion of
items from selected transactions) or on a blocking scheme (replacing some of the original values
in a transaction with question marks). The second class of approaches collects methodologies
that hide the sensitive knowledge by modifying only a selected portion of itemsets which belong
to the border, in the lattice, between the frequent (i.e. statistically significant) and the infrequent
(i.e. statistically insignificant) patterns of the original dataset. In particular, the sensitive
knowledge is hidden by enforcing revised borders which accommodate the hiding of the
sensitive itemsets in the sanitized database. The algorithms in this class differ in the borders they
track, as well as in the methodology they apply to enforce the revised borders in the sanitized
dataset. Finally, the third class of approaches involves non-heuristic algorithms which conceive
the knowledge hiding process as a constraint satisfaction problem, an optimization problem that
is solved through the application of integer or linear programming. This class of approaches
differs from the previous two primarily in that it collects methodologies that can guarantee
optimality in the computed hiding solution, provided that an optimal hiding solution exists, or a
very good approximate solution in the case that an optimal one does not exist. On the negative
side, these approaches are usually several orders of magnitude slower than the heuristic ones,
especially due to the runtime required for the solution of the constraint satisfaction problem by
the integer/linear programming solver.
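
All three directions reason in terms of the support and confidence thresholds introduced above. As a concrete illustration, the following minimal sketch computes both measures for a candidate rule over a toy transaction list:

```python
# Support and confidence of an association rule X -> Y over a toy dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: support 0.5, confidence 2/3.
print(support({"bread", "milk"}), confidence({"bread"}, {"milk"}))
```

A rule is mined when both measures exceed the user-specified thresholds; hiding a sensitive rule amounts to modifying transactions until at least one of the two measures drops below its threshold.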

1.2.5 Classification Rule Hiding

Privacy-aware classification has been studied to a substantially lower extent than privacy
preserving association rule mining. Similarly to association rule hiding, classification rule
hiding algorithms consider a set of classification rules as sensitive and proceed to protect them
from disclosure by using either suppression-based or reconstruction-based techniques. In
suppression-based techniques, the confidence of a classification rule (measured in terms of the
owner's belief that the rule holds, given the data) is reduced by distorting a set of attributes in the
transactions related to the rule's existence. Reconstruction-based approaches, on the other hand,
aim to reconstruct the dataset using only those transactions of the original dataset that support
the non-sensitive classification rules, thus leaving the sensitive rules unsupported.
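
A minimal sketch of the reconstruction-based idea follows, under the simplifying assumption that rules can be modeled as predicates over a record (the data layout and rules here are illustrative, not taken from the literature):

```python
# Reconstruction-based hiding, sketched: rebuild the released dataset from
# only those records that support non-sensitive rules, so that sensitive
# rules are left unsupported. Rules are modeled as boolean predicates.

def supports_any(record, rules):
    return any(rule(record) for rule in rules)

def reconstruct(dataset, non_sensitive_rules, sensitive_rules):
    return [r for r in dataset
            if supports_any(r, non_sensitive_rules)
            and not supports_any(r, sensitive_rules)]

# Toy example: hide the sensitive rule "age < 30 -> risk = high".
sensitive = [lambda r: r["age"] < 30 and r["risk"] == "high"]
non_sensitive = [lambda r: r["age"] >= 30]
data = [{"age": 25, "risk": "high"}, {"age": 45, "risk": "low"}]
print(reconstruct(data, non_sensitive, sensitive))  # only the age-45 record survives
```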

1.3 Motivation

Huge databases exist in society today; they include census data, media data, consumer data,
data gathered by government agencies, and research data, all of which need to be published
publicly. It is now possible for an adversary to learn a great deal about an individual from public
data, such as purchasing patterns, family history, medical records, and business trends. This is
an age of competitiveness, and every organization, government or non-government, wants to
stay ahead of the others. Why, then, allow an adversary to obtain useful information, draw
inferences, and access confidential information? Privacy is becoming an increasingly important
issue in many data-mining applications, and this has triggered the development of many
privacy-preserving data-mining techniques.

1.4 Literature Review

It is estimated that more than half of the data collected by government and non-government
organizations is confidential, so there are substantial risks associated with data breaches. It is
therefore a big question how to publish data publicly, for the purpose of data mining, without
compromising individuals' privacy. There are many models for privacy preserving data mining,
such as the Injector model, the Anatomy model, and masking, but these models do not work
under every condition. Two models, [6] and [7], are better than these. They too have
shortcomings, but enhancing the first model removes those shortcomings.

1.4.1 A.

In a paper titled “Providing k-anonymity in data mining” [6], extended definitions of
k-anonymity were presented and used to show that a given data mining model does not violate
the k-anonymity of the individuals represented in the learning examples. The paper shows that
the model can be applied to various data mining problems, such as classification, association
rule mining and clustering. In the k-anonymity method, the values of the quasi-identifier (QI)
attributes of each tuple in a table are identical to those of at least (k-1) other tuples. The larger
the value of k, the greater the implied privacy, since no individual can be identified through a
linking attack with probability exceeding 1/k. The process of k-anonymization involves data
generalization and cell value suppression. The k-anonymity model makes two major
assumptions:

1. The database owner is able to separate the columns of the table into a set of quasi-identifiers,
which are attributes that may appear in external tables the database owner does not control, and a
set of private columns, the values of which need to be protected. We prefer to term these two sets
as public attributes and private attributes, respectively.

2. The attacker has full knowledge of the public attribute values of individuals, and no
knowledge of their private data. The attacker only performs linking attacks. A linking attack is
executed by taking external tables containing the identities of individuals and some or all of the
public attributes. When the public attributes of an individual match the public attributes that
appear in a row of a table released by the database owner, we say that the individual is linked
to that row; specifically, the individual is linked to the private attribute values that appear in that
row. A linking attack succeeds if the attacker is able to match the identity of an individual
against the value of a private attribute.
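
To make the 1/k guarantee concrete, here is a minimal sketch that checks whether a table satisfies k-anonymity over its public (quasi-identifier) attributes, assuming the table is a list of dictionaries; every combination of QI values must occur at least k times, so a linking attack can pin an individual to a row with probability at most 1/k:

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs >= k times."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(count >= k for count in groups.values())

table = [
    {"age": "[30-39]", "zip": "110***", "disease": "flu"},
    {"age": "[30-39]", "zip": "110***", "disease": "cancer"},
    {"age": "[30-39]", "zip": "110***", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], k=3))  # True: one group of size 3
```

Note that k-anonymity constrains only the quasi-identifiers; the sensitive "disease" column is untouched, which is the source of the homogeneity attack discussed in Section 1.5.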

1.4.2 B.
In [7], a novel privacy preserving algorithm was developed that overcomes all the above
problems. The core of the solution is the concept of transforming a part of the quasi-identifiers
and personalizing the sensitive information, so that privacy preserving micro data can be
released with less information loss. This concept raises three questions to be answered:
first, what transformation is to be done on the QI, and why?
Second, how is personalized privacy to be introduced into the micro data?
And third, how do we prove that the information loss is less? By answering all these queries, an
algorithm is formalized which preserves privacy in real data sets.
The quasi-identifier attributes Aq = {Ax, ..., Ay} are a subset of {A1, A2, ..., Am} whose values
are able to fetch a unique record in the table T(A1, A2, ..., Am). This property is precisely what
leads to the linking attack problem; in simple terms, the QI attributes form a candidate key
(minimal super key) in a data table. All the attributes in a database table, e.g. a Patient table,
may be classified into four categories: identifying attributes Ai (e.g. Name), sensitive attributes
As (e.g. Disease), neutral attributes An (e.g. Length of stay), and quasi-identifier attributes Aq
(e.g. Age, Gender, Zip). To preserve privacy, identifying attributes are not published. Here we
take the assumption that one of the QI members is of numeric data type, and the value of that
attribute is transformed into a fuzzy membership value. If the actual value of any QI member
attribute is not known, it cannot identify a unique record; thereby, the linking attack problem is
solved. The values of a sensitive attribute As may be confidential for an individual and are
therefore published according to his/her preference. For each sensitive attribute, the user is
allowed to set PL (privacy level) and DL (disclosure level) Boolean values. If the individual is
willing to disclose his actual information, he/she sets both PL and DL to true, and no
transformation is done. If he does not mind being linked, with lower probability, to another
sibling value of the taxonomy tree, he sets PL to true and DL to false; his sensitive attribute
value is then replaced with its generalized ancestor value followed by an arbitrary value, to
differentiate among the sibling values.
1.5 Problems identified.

Since k-anonymity does not put any restriction on sensitive attributes, it faces the homogeneity
attack and the background knowledge attack. In other words, k-anonymity creates groups which
leak information due to a lack of diversity within the group. This limitation is overcome by the
l-diversity principle. But l-diversity suffers from the similarity attack and the skewness attack,
so these problems can be overcome by t-closeness. In addition, the following assumptions in (b)
can create problems:
a. It is assumed that one of the quasi-identifiers is numeric and is transformed using a fuzzy-
based transformation. In a fuzzy-based transformation, the number of fuzzy sets is set equal to
the number of linguistic terms by deciding the number of fuzzy sets (k) and the min, max and
mid point m1, ..., mk of each fuzzy set; the actual value x is transformed using the functions
f1(x), f2(x), ..., fk(x), and the actual value is replaced with the transformed value xn plus the
category number (a minimal sketch of this transformation is given after this list). If the actual
value of any quasi-identifier attribute is not known, it cannot identify a unique record; thereby,
the linking attack problem is solved. But what happens if the quasi-identifier has no numeric
attribute, as with Netflix movie ratings? Then this approach cannot solve the linking problem.
b. Identifier attributes are not disclosed and, if present, are replaced by auto-generated ID
numbers. This leads to a loss of truthfulness of information, and also to information loss.

c. For a categorical sensitive attribute, the transformation is performed using a mapping table
prepared with domain knowledge, considering the privacy level (PL) and disclosure level (DL)
set by the user. If an individual is willing to disclose his information, his PL and DL are set to
true and no transformation is done. If he does not mind being linked to another value with lower
probability, PL is set to true and DL to false, and his sensitive value is replaced by a generalized
value plus an arbitrary value; if the privacy level is false, the value is replaced by an overall
general value plus an arbitrary value. For research, if both the disclosure level and the privacy
level are true, there is no problem, because the true data is available; but there is a privacy
breach if the linking problem cannot be solved as above, since attribute disclosure is not
addressed. And if no one wants to disclose his sensitive information, how will research take
place?
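
As referenced in (a), the following is a minimal sketch of fuzzifying a numeric quasi-identifier such as Age, assuming triangular membership functions; the set boundaries and midpoints are illustrative and not those used in [7]:

```python
# Fuzzify a numeric quasi-identifier: the actual value is replaced by its
# membership degree in the best-matching fuzzy set plus the category number,
# so the exact original value can no longer be used for linking.

def triangular(x, left, mid, right):
    """Triangular membership function on [left, right], peaking at mid."""
    if x <= left or x >= right:
        return 0.0
    return (x - left) / (mid - left) if x <= mid else (right - x) / (right - mid)

def fuzzify(x, midpoints=(20, 40, 60), spread=20):
    """Return (membership value, category number) for the best-matching set."""
    degrees = [triangular(x, m - spread, m, m + spread) for m in midpoints]
    category = max(range(len(degrees)), key=degrees.__getitem__) + 1
    return round(degrees[category - 1], 2), category

print(fuzzify(34))  # (0.7, 2): degree 0.7 in fuzzy set 2
```

The published pair (0.7, 2) is consistent with more than one original age, so knowing an individual's exact age no longer suffices to isolate his or her record.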

1.6 Thesis outline.

1. In Chapter 1 we give an overview of privacy preserving data mining, survey the literature
related to our method, and identify problems.

2. In Chapter 2 we study what data mining is, what kind of information can be mined, what the
steps in the knowledge discovery process are, what the privacy and ethics issues in data mining
are, and what the controversial issues in data mining are.

3. In Chapter 3 we study what privacy preserving data mining is, what the different dimensions
of privacy preserving data mining are, and examine confidentiality issues in data mining in
detail.

4. In Chapter 4 we study the categories of statistical disclosure limitation methods and their
divisions.

5. In Chapter 5 we study which anonymity methods together make up a group anonymity
method and how we apply it to privacy preserving data mining.

6. Chapter 6 presents the conclusion and future research.

At the end, abbreviations and references are included.

In this chapter we gave an overview of privacy preserving data mining, surveyed the
literature related to our method, and identified problems. In the next chapter we will study
what data mining is, what kind of information can be mined, what the steps in the knowledge
discovery process are, and what the privacy, ethics, and controversial issues in data mining
are.

CHAPTER 2

DATA MINING: AN
INTRODUCTION

Initially, with the advent of computers and means for mass digital storage, we started collecting
and storing all sorts of data, counting on the power of computers to help sort through this
amalgam of information. Unfortunately, these massive collections of data, stored on different
structures, very rapidly became overwhelming [1]. This initial chaos led to the creation of
structured databases and database management systems (DBMSs). Efficient database
management systems have been very important assets for the management of a large corpus of
data, and especially for effective and efficient retrieval of particular information from a large
collection whenever needed. The proliferation of database management systems has also
contributed to the recent massive gathering of all sorts of information. Today, we have far more
information than we can handle: from business transactions and scientific data to satellite
pictures, text reports and military intelligence. Information retrieval is simply not enough
anymore for decision-making. Confronted with huge collections of data, we have now created
new needs to help us make better managerial choices: automatic summarization of data,
extraction of the “essence” of the information stored, and the discovery of patterns in raw data.

2.1 What kind of information are we collecting?

We have been collecting a myriad of data, from simple numerical measurements and text
documents to more complex information such as spatial data, multimedia channels and
hypertext documents. Here is a non-exclusive list of the variety of information collected in
digital form in databases and flat files.

2.1.1 Business transactions:

Every transaction in the business industry is often “memorized” in perpetuity. Such transactions
are usually time-related and can be inter-business deals, such as purchases, exchanges, banking
and stock, or intra-business operations, such as management of in-house wares and assets. Large
department stores, for example, thanks to the widespread use of bar codes, store millions of
transactions daily, often representing terabytes of data. Storage space is not the major problem,
as the price of hard disks is continuously dropping; the effective use of the data, in a reasonable
time frame, for competitive decision making is definitely the most important problem to solve
for businesses that struggle to survive in a highly competitive world.

2.1.2 Scientific data:

Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest
studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about
oceanic activity, or in an American university investigating human psychology, our society is
amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can
capture and store more new data faster than we can analyze the old data already accumulated.

2.1.3 Medical and personal data:

From government census to personnel and customer files, very large collections of information
are continuously gathered about individuals and groups. Governments, companies and
organizations such as hospitals are stockpiling very important quantities of personal data to help
them manage human resources, better understand a market, or simply assist clientele. Regardless
of the privacy issues this type of data often raises, this information is collected, used and even
shared. When correlated with other data, it can shed light on customer behavior and the like.

2.1.4 Surveillance video and pictures:

With the amazing collapse of video camera prices, video cameras are becoming ubiquitous.
Video tapes from surveillance cameras are usually recycled and thus the content is lost.
However, there is a tendency today to store the tapes and even digitize them for future use and
analysis.

2.1.5 Satellite sensing:

There are a countless number of satellites around the globe: some are geo-stationary above a
region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to
the surface. NASA, which controls a large number of satellites, receives more data every second
than what all NASA researchers and engineers can cope with. Many satellite pictures and data
are made public as soon as they are received in the hopes that other researchers can analyze
them.

2.1.6 Games:

Our society is collecting a tremendous amount of data and statistics about games, players and
athletes. From hockey scores, basketball passes and car-racing laps to swimming times, boxers'
punches and chess positions, all this data is stored. Commentators and journalists use this
information for reporting, but trainers and athletes would want to exploit this data to improve
performance and better understand their opponents.

2.1.7 Digital media:

The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the
causes of the explosion in digital media repositories. In addition, many radio stations, television
channels and film studios are digitizing their audio and video collections to improve the
management of their multimedia assets. Associations such as the NHL and the NBA have
already started converting their huge game collection into digital forms.

2.1.8 CAD and Software engineering data:

There are a multitude of CAD systems for architects to design buildings or engineers to
conceive system components or circuits. These systems are generating a tremendous amount of
data. Moreover, software engineering is a source of considerable similar data with code, function
libraries, objects, etc., which need powerful tools for management and maintenance.

2.1.9 Virtual Worlds:

There are many applications making use of three-dimensional virtual spaces. These spaces and
the objects they contain are described with special languages such as VRML. Ideally, these
virtual spaces are described in such a way that they can share objects and places. There is a
remarkable amount of virtual reality object and space repositories available. Management of
these repositories as well as content-based search and retrieval from these repositories are still
research issues, while the size of the collections continues to grow.

2.1.10 Text reports and memos like e-mail messages:

Most of the communications within and between companies, research organizations, or even
private people are based on reports and memos in textual form, often exchanged by e-mail.
These messages are regularly stored in digital form for future use and reference, creating
formidable digital libraries.

2.1.11 The World Wide Web repositories:

Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content
and description have been collected and inter-connected with hyperlinks, making it the largest
repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous
characteristics, and its frequent redundancy and inconsistency, the World Wide Web is the most
important data collection regularly used for reference, because of the broad variety of topics
covered and the infinite contributions of resources and publishers. Many believe that the World
Wide Web will become the compilation of human knowledge [1].

2.2 Data Mining and Knowledge Discovery

Data mining, also popularly known as knowledge discovery in databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown and potentially useful information from
data in databases. While data mining and KDD are frequently treated as synonyms, data mining
is actually part of the knowledge discovery process. The following figure shows data mining as
a step in an iterative knowledge discovery process.

[Figure 1 depicts the iterative knowledge discovery process: databases undergo data cleaning
and data integration into a data warehouse; selection and transformation yield the task-relevant
data; data mining extracts patterns; and pattern evaluation produces knowledge.]

Figure 1. Data mining is the core of the knowledge discovery process [1]

2.3 Data Mining in Business

Through the use of data mining techniques, businesses are discovering new trends and patterns
of behavior that previously went unnoticed. Once they have uncovered this vital intelligence, it
can be used in a predictive manner for a variety of applications. Data mining can contribute
significantly to customer relationship management applications. Rather than randomly
contacting customers through a call center or by mail, a company can concentrate its efforts on
those customers who are predicted to have a high likelihood of responding to an offer. More
sophisticated methods may be used to optimize resources across campaigns, so that one may
predict which channel and which offer an individual is most likely to respond to, across all
potential offers. Data mining can also help human-resources departments identify the
characteristics of their most successful employees; information obtained, such as the universities
attended by highly successful employees, can help HR focus recruiting efforts
accordingly. Another example of data mining, often called market basket analysis, relates to its
use in retail sales. If a clothing store records the purchases of customers, a data-mining system
could identify those customers who favor silk shirts over cotton ones. Although some
relationships may be difficult to explain, taking advantage of them is easier. This example deals
with association rules within transaction-based data.

2.4 Privacy, Ethics and Data mining

Absolute privacy is not possible to achieve in the information age, because individuals
disseminate data in their common daily activities, such as web browsing, e-commerce and
e-government dealings, e-mail and mobile phone communications, and credit card and ATM
transactions. Major data sources include server and cookie logs, customer information,
intelligent Internet agents, and centralized demographic and other official records. Powerful
data processing, storage and communication technologies allow these data to be manipulated
and used by other people and agencies freely, and in most cases indiscreetly. The reason is that
data mining does not supply information about the social and ethical consequences of its results,
nor does it necessarily discover causal relationships; therefore, how and where to use the results
is largely a choice of the “miner”. This mechanism is inevitably prone to have negative impacts
on privacy and individual rights. In any data processing system, it is fair to expect that sensitive
personal data must be protected from misuse and abuse by outside entities. The position of data
mining is even more critical in this respect because of its power and potential. Naturally, every
individual should have control over his or her personally sensitive data. But what constitutes
“personal” or “sensitive”? What are the limits? Of course, these are questions whose answers
may differ from individual to individual. Another important problem arises when an
individual's rights and the public interest contradict each other. In most cases, public or business
oriented data mining applications are claimed to produce beneficial outcomes for individuals,
organizations and society. Unfortunately, it is very difficult to impose universally acceptable
guidelines and rules for distinguishing right from wrong, because data mining applications are
generally open ended and can as well lead to unpredictable and possibly harmful personal
results. For example, very detailed data about buying habits, times and locations are extracted
from customer transactions and data mined regularly by most supermarket chains. It is
impossible for an average customer to be knowledgeable about the possible uses of sensitive
data and their consequences.
Data mining for crime prevention is based on records of the activities of individuals, such as
travel, electronic and phone communications, shopping, and encounters with other people.
Data mining in these applications can produce inaccurate and faulty results, which usually
constitute breaches of privacy and can be harmful to the individuals. For example, a person may
be classified into a “suspect” short list while he or she has nothing to do with the affair at hand.
The consequences could be serious in such a case, and the whole process is certainly unethical,
if not illegal. The negative impacts, such as litigation, adverse publicity, loss of reputation and
discrimination, could be further aggravated if the data are unreliable, faulty or even fabricated.
This is where an inherently unethical process could also turn into an “illegal” one. The legal
issues related to data mining applications are complex and difficult to evaluate and, as such,
cannot be put into the framework of any particular law. Besides, the challenges of the
information age have not yet been resolved properly by legal doctrines and by the legal systems
of countries. Most countries choose to strengthen the privacy of personal information by a
specific law, but such a law cannot be expected to foresee all possible violations and types of
offences that might come about. In practice, legal court cases are usually resolved by applying
some other available code or legislation which aims to protect the individual's rights in general.
The major problem here is the growth rate of information and information technologies, which
renders obsolete the rules, regulations and even laws in relatively short time spans. In many
situations, it is very difficult to decide what constitutes “legal” or “private”, let alone to establish
consistent court rules. The main reason is that most physical limitations of the hardware and
software technologies lose their meaning rapidly, and newer and more versatile ones replace
them. This phenomenon influences application methodologies and environments, including data
mining. Various privacy preservation methods and measures aim to find solutions for this kind
of legal and ethical problem by reducing their likelihood of occurrence in data mining
applications.

2.5 Issues in Data Mining

Before data mining develops into a conventional, mature and trusted discipline, many issues
have to be addressed. Some of these issues are discussed below.

2.5.1 Security and social issues:

Security is an important issue with any data collection that is shared and/or is intended to be used
for strategic decision-making. When data is collected for customer profiling, user behavior
understanding, correlating personal data with other information, etc., large amounts of sensitive
and private information about individuals or companies is gathered and stored. This becomes
controversial given the confidential nature of some of the data and the potential for illegal
access to the information.

2.5.2 User interface issues:

Data visualization simplifies the interpretation of data mining results and helps users better
understand their needs. There are many visualization ideas and proposals for effective graphical
presentation of data. However, much research is still needed to obtain good visualization tools
for large datasets that could be used to display and manipulate mined knowledge.

2.5.3 Mining methodology issues:

These issues concern the data mining approaches applied and their limitations. Topics such as
the versatility of the mining approaches, the diversity of the data available, the dimensionality of
the domain, the broad analysis needs (when known), the assessment of the knowledge
discovered, the exploitation of background knowledge and metadata, and the control and
handling of noise in data are all examples of factors that can dictate mining methodology
choices.

2.5.4 Performance issues:

Many statistical methods exist for data analysis, but these methods were not designed for the
very large data sets data mining deals with today. This raises the issues of scalability and
efficiency of data mining methods when processing considerably large data.

2.5.5 Data source issues:

There are many issues related to data sources, such as the diversity of data. We are storing
different types of data in a variety of repositories. It is difficult to expect a data mining system to
effectively and efficiently achieve good mining results on all kinds of data and sources; different
kinds of data and sources may require distinct algorithms and methodologies. Currently, there is
a focus on relational databases and data warehouses, but other approaches need to be pioneered
for other specific complex data types. A versatile data mining tool, for all sorts of data, may not
be realistic.

In this chapter we studied what data mining is, what kind of information can be mined,
what the steps in the knowledge discovery process are, and what the privacy, ethics, and
controversial issues in data mining are. In the next chapter we will study what privacy
preserving data mining is, what its different dimensions are, and examine confidentiality
issues in data mining in detail.

CHAPTER 3

PRIVACY PRESERVING
DATA MINING

The problem of privacy-preserving data mining has become more important in recent years
because of the increasing ability to store personal data about users, and the increasing ability of
data mining algorithms to expose this information. It is estimated that more than 50 percent of
the data stored in databases could be classified as confidential, and with IT organizations
collecting and storing more data, and more sensitive data, for longer periods of time, there are
serious risks associated with data breaches. Privacy preserving data mining (PPDM), a
relatively new research area, is focused on preventing privacy violations that might arise during
data mining operations [8]. In pursuing this goal, PPDM algorithms modify original datasets in
order to preserve privacy even after the mining process is activated. The aim is to ensure
minimal data loss and to obtain qualitative data mining results. PPDM approaches can be
described along five dimensions:
1. Data distribution – whether the data is centralized or distributed.
2. Data modification – the modification technique used to transform the data values.
3. Data mining algorithm – the data mining task to which the approach is applied.
4. Data or rule hiding – whether raw data or aggregated data should be hidden.
5. Privacy preservation – the type of selective modification performed on the data as part of the
PPDM technique: heuristic-based, cryptography-based or reconstruction-based.

3.1 Confidentiality issues in data mining.

A key problem that arises in any large collection of data is that of confidentiality. The need for
privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business
interests. However, there are situations where the sharing of data can lead to mutual gain. A key
utility of large databases today is research, whether scientific or economic and market
oriented. Thus, for example [8], the medical field has much to gain by pooling data for research,
as can even competing businesses with mutual interests. Despite the potential gain, this is often
not possible due to the confidentiality issues which arise. We address this question and show
that highly efficient solutions are possible. Our scenario is the following:
Let P1 and P2 be parties owning large private databases D1 and D2. The parties wish to apply a
data-mining algorithm to the joint database D1 ∪ D2 without revealing any unnecessary
information about their individual databases. That is, the only information learned by P1 about
D2 is that which can be learned from the output of the data mining algorithm, and vice versa.
We do not assume any “trusted” third party who computes the joint output.

3.1.1 Semi-honest adversaries.

In any multi-party computation setting, a malicious adversary can always alter its input. In the
data-mining setting, this fact can be very damaging, since the adversary can define its input to be
the empty database; then the output obtained is the result of the algorithm on the other party's
database alone. Although this attack cannot be prevented, we would like to prevent a malicious
party from executing any other attack. However, for this initial work we assume that the
adversary is semi-honest (also known as passive): it correctly follows the protocol
specification, yet attempts to learn additional information by analyzing the transcript of
messages received during the execution. The key directions in the field of privacy-preserving
data mining are as follows:

3.1.2 Privacy-Preserving Data Publishing:

These techniques study different transformation methods associated with privacy. A related
issue is how the perturbed data can be used with classical data mining methods, such as
association rule mining. Other related problems include determining privacy-preserving
methods that keep the underlying data useful, and studying the different definitions of privacy
and how they compare in terms of effectiveness in different scenarios.

3.1.3 Changing the results of Data mining applications to preserve privacy:

There are many cases in which the results of data mining applications, such as association rule
or classification rule mining, can compromise the privacy of data. This has resulted in a field of
privacy research in which data mining algorithms, such as association rule mining, are modified
in order to preserve the privacy of data; association rule hiding methods are one example.

3.1.4 Query Auditing


Here, we are either modifying or restricting the results of queries.

3.1.5 Cryptographic Methods for Distributed Privacy:

There are many cases in which data is distributed across multiple sites whose owners may want
to compute a common function; cryptographic methods make such function computation
possible without revealing sensitive information. SMC for privacy-preserving data mining
considers the problem of running data mining algorithms on confidential data that is not
supposed to be revealed, even to the party running the algorithm. There are two classic
settings for privacy-preserving data mining, although these are by no means the only ones. In
the first, the data is divided amongst two or more different parties, and the aim is to run a data
mining algorithm on the union of the parties' databases without allowing any party to view
anyone else's private data. In the second, some statistical data that is to be released (so that it
can be used for research using statistics and/or data mining) may contain confidential data and
so is first modified so that
(a) the data does not compromise anyone's privacy, and
(b) it is still possible to obtain meaningful results by running data mining algorithms on the
modified data set.
A classical example of a privacy-preserving data mining problem of the first type is from the
field of medical research. Consider the case that a number of different hospitals wish to jointly
mine their patient data for the purpose of medical research. Furthermore, let us assume that
privacy policy and law prevent these hospitals from ever pooling their data or revealing it to
each other, due to the confidentiality of patient records. In such a case, classical data mining
solutions cannot be used; rather, it is necessary to find a solution that enables the hospitals to
compute the desired data mining algorithm on the union of their databases without ever pooling
or revealing their data. Privacy-preserving data mining solutions have the property that the only
information provably learned by the different hospitals is the output of the data mining
algorithm. This problem, whereby different organizations cannot directly share or pool their
databases but must nevertheless carry out joint research via data mining, is quite common. For
example, consider the interaction between different intelligence agencies. For security purposes,
these agencies cannot allow each other free access to their confidential information; if they did,
then a single mole in a single agency would have access to an overwhelming number of sources.
Nevertheless, as we all know, homeland security also mandates the sharing of information! It is
much more likely that suspicious behavior will be detected if the different agencies were able to
run data mining algorithms on their combined data.

3.1.6 Secure computation and privacy-preserving data mining.

There are two distinct problems that arise in the setting of privacy-preserving data mining. The
first is to decide which functions can be safely computed, where safety means that the privacy of
individuals is preserved. For example, is it safe to compute a decision tree on confidential
medical data in a hospital, and publicize the resulting tree? For the most part, we will assume
that the result of the data mining algorithm is either safe or deemed essential. Thus, the question
becomes how to compute the results while minimizing the damage to privacy. For example, it is
always possible to pool all of the data in one place and run the data mining algorithm on the
pooled data. However, this is exactly what we don't want to do hospitals are not allowed to hand
their raw data out, security agencies cannot afford the risk, and governments risk citizen outcry if
they do. Thus, the question we address is how to compute the results without pooling the data,
and in a way that reveals nothing but the final results of the data mining computation. This
question of privacy-preserving data mining is actually a special case of a long-studied problem in
cryptography called secure multiparty computation. This problem deals with a setting where a set
of parties with private inputs wish to jointly compute some function of their inputs. Loosely
speaking, this joint computation should have the property that the parties learn the correct output
and nothing else, even if some of the parties maliciously collude to obtain more information.
Clearly, a protocol that provides this guarantee can be used to solve privacy-preserving data
mining problems of the type discussed above.
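To make the flavor of such a secure computation concrete, the following is a minimal, illustrative sketch of additive secret sharing, one classical building block of secure multiparty computation. It is not any specific protocol from the literature; the party names and values are invented. Each party splits its private count into random shares, so the joint sum can be computed while no single share reveals anything about an individual input.

```python
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Each hospital holds a private count, e.g. patients with some condition.
private_inputs = {"hospital_A": 120, "hospital_B": 75, "hospital_C": 230}

n = len(private_inputs)
# Every party splits its input and sends one share to each other party.
all_shares = [share(v, n) for v in private_inputs.values()]

# Party i only ever sees the i-th share of everyone's input ...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# ... and the partial sums can be published; their total reveals only the answer.
total = sum(partial_sums) % PRIME
print(total)  # 425 -- the joint statistic, with no raw data pooled
```

Real protocols must also withstand collusion and malicious behavior, which this sketch deliberately ignores.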

3.1.7 Privacy-Preserving Data Mining Algorithms:

In recent years, data mining has been viewed as a threat to privacy because of the widespread proliferation of electronic data maintained by corporations. This has led to increased concerns about the privacy of the underlying data. A number of techniques have therefore been proposed for modifying or transforming the data in such a way as to preserve privacy. A survey of some of the techniques used for privacy-preserving data mining may be found in the literature. Most methods for privacy computations use some form of transformation on the data in order to perform the privacy preservation. Typically, such methods reduce the granularity of representation in order to preserve privacy. This reduction in granularity results in some loss of effectiveness of data management or mining algorithms. This is the natural trade-off between information loss and privacy.

The randomization method: The randomization method is a technique for privacy-preserving data mining in which noise is added to the data in order to mask the attribute values of records. The noise added is sufficiently large so that individual record values cannot be recovered.

The k-anonymity model and l-diversity: The k-anonymity model was developed because of the possibility of indirect identification of records from public databases. This is because combinations of record attributes can be used to exactly identify individual records. In the k-anonymity method, we reduce the granularity of data representation with the use of techniques such as generalization and suppression. This granularity is reduced sufficiently that any given record maps onto at least k other records in the data. The l-diversity model was designed to handle some weaknesses in the k-anonymity model, since protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values, especially when there is homogeneity of sensitive values within a group.
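As a minimal sketch of the randomization method just described (the noise spread is an illustrative parameter, not a recommendation), each value is masked independently at collection time, and only aggregate properties of the noisy data remain reliable:

```python
import random

def randomize(value, spread=50.0):
    # Add independent uniform noise in [-spread, +spread] to mask the value.
    return value + random.uniform(-spread, spread)

true_ages = [23, 31, 45, 52, 38]
masked = [randomize(a) for a in true_ages]

# Individual masked values are unreliable, but aggregates remain usable:
print(sum(masked) / len(masked))   # close to the true mean on average
```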

3.1.8 Distributed privacy preservation:

In many cases, individual entities may wish to derive aggregate results from data sets which are partitioned across these entities. Such partitioning may be horizontal, when the records are distributed across multiple entities, or vertical, when the attributes are distributed across multiple entities.

Downgrading application effectiveness: In many cases, even though the data may not be available, the output of applications such as association rule mining, classification or query processing may result in violations of privacy. This has led to research in downgrading the effectiveness of applications by either data or application modifications. Some examples of such techniques include association rule hiding.

In this chapter we studied what privacy preserving data mining is, the different dimensions of privacy preserving data mining, and the confidentiality issues in data mining in detail. In the next chapter we will study the categories of statistical disclosure limitation methods and their subdivisions.

CHAPTER 4

ANONYMIZATION
TECHNIQUES

Anonymization is the process of making data publicly available without compromising individual privacy. One of the functions of a federal statistical agency is to collect individually sensitive data, process it and provide statistical summaries, and/or public use micro data files to the public. Some of the data collected are considered proprietary by respondents. On the other hand, not all data collected and published by the government are subject to disclosure limitation techniques. Some data on businesses that are collected for regulatory purposes are considered public. In addition, some data are not considered sensitive and are not collected under a pledge of confidentiality. The statistical disclosure limitation techniques described below are applied wherever confidentiality is required and data or estimates are made publicly available. All disclosure limitation methods result in some loss of information, and sometimes the publicly available data may not be adequate for certain statistical studies. However, the intention is to provide as much data as possible without revealing individually sensitive data. Statistical disclosure limitation methods can be classified in two categories [9]:

4.1 Methods based on data reduction. Such methods aim at increasing the number of
individuals in the sample population sharing the same or similar identifying characteristics
presented by the investigated statistical unit. Such procedures tend to avoid the presence of
unique or rare recognizable individuals.

4.2 Methods based on data perturbation. Such methods achieve data protection from a twofold perspective. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and more uncertain. Secondly, even when an intruder is able to re-identify a unit, he/she cannot be confident that the disclosed data are consistent with the original data.

4.1 Methods based on data reduction

4.1.1 Removing variables


The first obvious application of this method is the removal of direct identifiers from the data file.
A variable should be removed when it is highly identifying and no other protection methods can
be applied. A variable can also be removed when it is too sensitive for public use or irrelevant
for analytical purpose.

4.1.2 Removing records

Removing records can be adopted as an extreme measure of data protection when the unit is
identifiable in spite of the application of other protection techniques. For example, in an
enterprise survey dataset, a given enterprise may be the only one belonging to a specific industry.
In this case, it may be preferable to remove this particular record rather than removing the
variable "industry" from all records. Since it largely impacts the statistical properties of the
released data, removing records has to be avoided as much as possible.

4.1.3 Global recoding


The global recoding method consists in aggregating the values observed for a variable into pre-defined classes, for example recoding age into five-year age groups, or the number of employees into three size classes: small, medium and large. The method applies to numerical variables, continuous or discrete. It affects all records in the data file.

4.1.4 Top and bottom coding


Top and bottom coding can be referred to as a special case of global recoding that can be applied
to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical
examples. The highest values of these variables are usually very rare and therefore identifiable.
Top coding at certain thresholds introduces new categories such as "monthly salary higher than
6000 dollars" or "age higher than 75", leaving unchanged the other observed values. The same
reasoning applied to the smaller observed values defines bottom coding. When dealing with
ordinal categorical variables, a top or bottom category is defined by aggregating the "highest" or
"smallest" categories.

4.1.5 Local suppression


Local suppression consists in replacing the observed value of one or more variables in a certain
record with a missing value. Local suppression is particularly suitable for the setting of
categorical key variables and when combinations of scores on such variables are at stake. In this
case, local suppression consists in replacing an observed value in a combination with a missing
value. The aim of the method is to reduce the information content of rare combinations.

4.2 Methods based on data perturbation

4.2.1 Micro-aggregation
Micro-aggregation is a perturbation technique first proposed by Eurostat as a statistical disclosure method for numerical variables. The idea is to replace an observed value with the average computed on a small group of units (a small aggregate or micro-aggregate) including the investigated one. The units belonging to the same group will be represented in the released file by the same value. The groups contain a minimum predefined number k of units; the minimum accepted value of k is 3. For a given k, the issue consists in determining the partition of the whole set of units into groups of at least k units (a k-partition) that minimizes the information loss, usually expressed as a loss of variability. Therefore, the groups are constructed according to a criterion of maximum similarity between units. The micro-aggregation mechanism achieves data protection by ensuring that there are at least k units with the same value in the data file.
When micro-aggregation is independently applied to a set of variables, the method is called individual ranking. When all the variables are averaged at the same time for each group, the method is called multivariate micro-aggregation.
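A minimal sketch of univariate micro-aggregation (individual ranking): values are sorted, partitioned into groups of at least k units, and each value is replaced by its group mean. Real implementations search for the loss-minimizing k-partition; this sketch simply groups consecutive sorted values, which is a simplifying assumption.

```python
def microaggregate(values, k=3):
    """Replace each value by the mean of its group of >= k nearest values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        if len(group) < k and start > 0:      # merge a short tail group
            group = order[start - k:]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out

salaries = [3000, 5000, 9000, 6000, 11000, 8000, 4000, 7000, 10000]
print(microaggregate(salaries, k=3))  # each salary replaced by its group mean
```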

4.2.2 Data swapping


Data swapping was initially proposed as a perturbation technique for categorical micro-data, aimed at protecting tabulations stemming from the perturbed micro-data file. Data swapping consists in altering a proportion of the records in a file by swapping the values of a subset of variables between selected pairs of records (swap pairs).
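A minimal sketch of data swapping (the swap rate is illustrative, and no matching criteria are applied when forming swap pairs):

```python
import random

def swap_values(records, field, swap_rate=0.2):
    """Exchange `field` between randomly chosen pairs of records."""
    records = [dict(r) for r in records]          # work on a copy
    idx = [i for i in range(len(records)) if random.random() < swap_rate]
    random.shuffle(idx)
    for a, b in zip(idx[::2], idx[1::2]):         # form swap pairs
        records[a][field], records[b][field] = records[b][field], records[a][field]
    return records

data = [{"zip": "94101", "disease": "flu"},
        {"zip": "94102", "disease": "obesity"},
        {"zip": "94110", "disease": "chest pain"},
        {"zip": "94115", "disease": "hypertension"}]
print(swap_values(data, "zip", swap_rate=0.9))
```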

4.2.3 PRAM
As a statistical disclosure control technique, PRAM (post-randomization) induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be considered a randomized version of data swapping.
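A minimal PRAM sketch: each published category is drawn from a transition (Markov) matrix row conditioned on the original value, so the published value differs from the original with known probability. The matrix below is purely illustrative.

```python
import random

# Illustrative transition matrix: TRANSITION[original][published] probabilities.
TRANSITION = {
    "Asian": {"Asian": 0.8, "black": 0.1, "white": 0.1},
    "black": {"Asian": 0.1, "black": 0.8, "white": 0.1},
    "white": {"Asian": 0.1, "black": 0.1, "white": 0.8},
}

def pram(value):
    """Randomly exchange a category according to its transition row."""
    categories = list(TRANSITION[value])
    weights = [TRANSITION[value][c] for c in categories]
    return random.choices(categories, weights=weights)[0]

print([pram("Asian") for _ in range(5)])  # mostly 'Asian', sometimes not
```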

4.2.4 Adding noise


Adding noise consists in adding a random value ε, with zero mean and predefined variance σ², to all values of the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.

4.2.5 Re-sampling

Re-sampling is a protection method for numerical micro-data that consists in drawing, with replacement, samples of n values from the original data, sorting each sample and averaging the sampled values. The level of data protection guaranteed by this procedure is generally considered quite low.

4.2.6 Synthetic micro-data

Synthetic micro-data are an alternative approach to data protection, and are produced by using
data simulation algorithms. The rationale for this approach is that synthetic data do not pose
problems with regard to statistical disclosure control because they do not contain real data but
preserve certain statistical properties.

In this chapter we studied the categories of statistical disclosure limitation methods and their subdivisions. In the next chapter we will study the anonymity methods that together make up the group based anonymity method, and how we apply them to privacy preserving data mining.

CHAPTER 5

t- CLOSENESS PRIVACY
PRESERVING DATA
MINING

Here we try to achieve the privacy that is required during the data mining process by the use of Group based Anonymization, which includes k-anonymity, l-diversity and t-closeness.

5.1 k-Anonymity

Data holders often remove or encrypt explicit identifiers such as names and social security numbers. De-identifying data, however, provides no guarantee of anonymity. Released information often contains other data, such as race, birth date, sex, and ZIP code, that can be linked to publicly available information to re-identify respondents and to infer information that was not intended for release [10].
k-anonymity is one of the micro-data protection concepts: k-anonymity demands that every tuple in the released micro-data table be indistinguishably related to no fewer than k respondents. Generalization and suppression are used to achieve k-anonymity.
If there are many respondents whose dataset we have to publish, then k-anonymity says that each release of data must satisfy the constraint that every combination of values of the quasi identifier can be indistinctly matched to at least k respondents. Let T(A1,…,An) be a table and QI be a quasi identifier associated with it. T is said to satisfy k-anonymity with respect to the quasi identifier iff each sequence of values in T[QI] appears with at least k occurrences in T[QI].

Race DOB Sex ZIP Disease
Asian 64 F 941** hypertension
Asian 64 F 941** obesity
Asian 64 F 941** chest pain
Asian 63 M 941** obesity
Asian 63 M 941** obesity
black 64 F 941** short breath
black 64 F 941** short breath
white 64 F 941** chest pain
white 64 F 941** short breath

Table 1. Micro data which is 2-Anonymous [11]

Consider an inpatient micro-data table with quasi identifier (Race, DOB, Sex, ZIP) and sensitive attribute Disease. This table satisfies 2-anonymity if each tuple has at least one other tuple in the table with the same quasi-identifier values [6][7][8]. To achieve anonymity, k-anonymity focuses on two techniques known as generalization and suppression, which preserve truthfulness, unlike other techniques such as scrambling and swapping.
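The definition translates directly into a check: group the records by their quasi-identifier values and verify that every group contains at least k records. A minimal sketch, with field names following Table 1:

```python
from collections import Counter

QI = ("race", "dob", "sex", "zip")

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[a] for a in QI) for r in records)
    return all(c >= k for c in counts.values())

table = [
    {"race": "Asian", "dob": "64", "sex": "F", "zip": "941**", "disease": "hypertension"},
    {"race": "Asian", "dob": "64", "sex": "F", "zip": "941**", "disease": "obesity"},
    {"race": "black", "dob": "64", "sex": "F", "zip": "941**", "disease": "short breath"},
    {"race": "black", "dob": "64", "sex": "F", "zip": "941**", "disease": "short breath"},
]
print(is_k_anonymous(table, 2))  # True: each QI combination appears twice
```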

5.2 l-diversity

To overcome the homogeneity attack and the background knowledge attack faced by k-anonymity, l-diversity was introduced [12]. A q⋆-block is l-diverse if it contains at least l “well-represented” values for the sensitive attribute S. A table is l-diverse if every q⋆-block is l-diverse. Table 2 shows the original inpatient micro-data; a 3-diverse version is presented in Table 3.

     Non-Sensitive   Non-Sensitive   Non-Sensitive   Sensitive

     Zip Code        Age             Nationality     Condition

1 13053 28 Russian Heart Disease

2 13068 29 American Heart Disease

3 13068 21 Japanese Viral Infection

4 13053 23 American Viral Infection

5 14853 50 Indian Cancer

6 14853 55 Russian Heart Disease

7 14850 47 American Viral Infection

8 14850 49 American Viral Infection

9 13053 31 American Cancer

10 13053 37 Indian Cancer

11 13068 36 Japanese Cancer

12 13068 35 American Cancer

Table 2. Inpatient Micro-data [13]

Table 2 is the original micro-data table as stored on the computer. We will now see how to calculate anonymity, diversity and closeness from this table.

     Non-Sensitive   Non-Sensitive   Non-Sensitive   Sensitive

     Zip Code        Age             Nationality     Condition
 1   1305*           ≤ 40            *               Heart Disease
 4   1305*           ≤ 40            *               Viral Infection
 9   1305*           ≤ 40            *               Cancer
10   1305*           ≤ 40            *               Cancer
 5   1485*           > 40            *               Cancer
 6   1485*           > 40            *               Heart Disease
 7   1485*           > 40            *               Viral Infection
 8   1485*           > 40            *               Viral Infection
 2   1306*           ≤ 40            *               Heart Disease
 3   1306*           ≤ 40            *               Viral Infection
11   1306*           ≤ 40            *               Cancer
12   1306*           ≤ 40            *               Cancer

Table 3. 3-diverse 4-anonymous Inpatient Micro-data [13]

5.2.1 Homogeneity Attack:

Consider a 4-anonymous table in which the quasi attributes are ZIP code, birth date and gender, and the sensitive attribute is disease. A set of non-sensitive values is called a quasi identifier if these attributes can be linked with external data to uniquely identify at least one individual in the general population. A sensitive attribute is an attribute whose value for any particular individual must be kept secret from people who have no direct access to the original data.
Suppose Alice wants to know Bob's sensitive attribute, so she obtains the 4-anonymous table published by the hospital. Knowing the values of Bob's quasi identifier, Alice can determine which equivalence class Bob belongs to, and if the sensitive attribute values in that equivalence class are all the same, she learns Bob's sensitive attribute. This is called the homogeneity attack.

5.2.2 Background Knowledge Attack:

Consider again a 4-anonymous table in which the quasi attributes are ZIP code, birth date and gender, and the sensitive attribute is disease. Alice knows that Bob is a 21-year-old Japanese male who currently lives in zip code 13068. Based on this information, Alice learns that Bob's information is contained in record number 1, 2, 3 or 4. Without additional information, Alice is not sure whether Bob caught a virus or has heart disease. However, it is well known that Japanese have an extremely low incidence of heart disease. Therefore Alice concludes with near certainty that Bob has a viral infection. l-diversity was introduced to overcome this problem. A q⋆-equivalence class is l-diverse if it contains at least l “well-represented” values for the sensitive attribute S. A table is l-diverse if every q⋆-equivalence class is l-diverse, where the q⋆-equivalence class is the set of tuples in table T⋆ whose non-sensitive attribute values generalize to q⋆. Consider the inpatient micro-data table shown above. It has non-sensitive attributes Zip code, Age and Nationality, and sensitive attribute Condition. It is 3-diverse because each equivalence class contains at least 3 distinct sensitive values.
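Under the simplest, "distinct values" reading of well-represented, the l-diversity check extends the k-anonymity check: every equivalence class must contain at least l distinct sensitive values. A minimal sketch (the generalized values follow Table 3):

```python
from collections import defaultdict

def is_l_diverse(records, qi, sensitive, l):
    """True if every equivalence class has >= l distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[a] for a in qi)].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

# One equivalence class of Table 3: the generalized (zip, age) group
table = [
    {"zip": "1305*", "age": "<=40", "condition": "Heart Disease"},
    {"zip": "1305*", "age": "<=40", "condition": "Viral Infection"},
    {"zip": "1305*", "age": "<=40", "condition": "Cancer"},
    {"zip": "1305*", "age": "<=40", "condition": "Cancer"},
]
print(is_l_diverse(table, ("zip", "age"), "condition", 3))  # True
```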

5.3 t-Closeness

Privacy can be measured by the information gained by an adversary, i.e., by the difference between his posterior belief and his prior belief [14]. Let Q be the distribution of the sensitive attribute in the whole table and P the distribution of the sensitive attribute in an equivalence class. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

S.No.   ZIP Code   Age    Salary   Disease
1       4767*      ≤ 40   3K       gastric ulcer
3       4767*      ≤ 40   5K       stomach cancer
8       4767*      ≤ 40   9K       pneumonia
4       4790*      ≥ 40   6K       gastritis
5       4790*      ≥ 40   11K      flu
6       4790*      ≥ 40   8K       bronchitis
2       4760*      ≤ 40   4K       gastritis
7       4760*      ≤ 40   7K       bronchitis
9       4760*      ≤ 40   10K      stomach cancer

Table 4.

This table is 3-anonymous, 3-diverse, and has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease [14].

5.4 PROPOSED WORK

[Figure 2 shows the steps in privacy preserving data mining as a pipeline: Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Group based Anonymization → Data Mining → Patterns → Interpretation/Evaluation → Knowledge.]

Figure 2. Steps in privacy preserving data mining

These are the steps in privacy preserving data mining. First the target data are selected from the database, then the data are preprocessed, then transformation is applied, then we apply Group based Anonymization, then data mining to find patterns, and finally we evaluate the patterns to generate knowledge.
t-closeness came into focus in order to prevent the skewness attack and the similarity attack; for this it makes calculations based on the EMD between two distributions. Let us take one example in which there are 1000 students enrolled and their corresponding records. Say 1% of the students have failed and the rest have passed. Suppose one equivalence class has an equal number of pass and fail records. Anyone belonging to that equivalence class would be considered to have a 50% chance of having failed, as compared with the 1% initially. This is thus a major privacy risk. Again consider an equivalence class with 49 fail records and 1 pass record. It would satisfy 2-diversity, but there would still be a 98% chance of having failed for someone in that equivalence class, which is much more than the 1% initially. This equivalence class has the same diversity as a class that has 1 fail and 49 pass records, but we can clearly see that both have different levels of sensitivity. This is the skewness attack. Similarity attacks occur when the sensitive attributes are semantically similar. t-closeness requires that the earth mover's distance between the distribution of the sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table does not differ by more than a predefined parameter t. The EMD is based on the minimum amount of work needed to transform one distribution into another by moving distribution mass between them. EMD can be formally defined using the well-studied transportation problem. Let P = (p1, p2, …, pm), Q = (q1, q2, …, qm), and let dij be the ground distance between element i of P and element j of Q. We want to find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work [14]:

\[ \mathrm{WORK}(P,Q,F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij} \]

subject to the following constraints:

\[ f_{ij} \ge 0, \qquad 1 \le i \le m,\; 1 \le j \le m \qquad (c1) \]

\[ p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \qquad 1 \le i \le m \qquad (c2) \]

\[ \sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1 \qquad (c3) \]

These three constraints guarantee that P is transformed to Q by the mass flow F. Once the transportation problem is solved, the EMD is defined to be the total work:

\[ D[P,Q] = \mathrm{WORK}(P,Q,F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij} \]

1. If 0 ≤ dij ≤ 1 for all i, j, then 0 ≤ D[P,Q] ≤ 1. This fact follows directly from constraints (c1) and (c3). It says that if ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

2. Given two equivalence classes E1 and E2, let P1, P2 and P be the distributions of a sensitive attribute in E1, E2 and E1 ∪ E2 respectively. Then

D[P,Q] ≤ (|E1|/(|E1| + |E2|)) D[P1,Q] + (|E2|/(|E1| + |E2|)) D[P2,Q].

In other words, if both E1 and E2 satisfy t-closeness, so does their union; this property makes group based Anonymization efficient when we want to perform data mining in a privacy preserving way.
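For a numerical sensitive attribute with m ordered values and ground distance dij = |i − j|/(m − 1), the transportation problem has a closed form: the EMD is the normalized sum of absolute cumulative differences between the two distributions. The sketch below assumes this ordered-distance special case and reproduces the 0.167 salary closeness quoted for Table 4 (attained by its first equivalence class):

```python
def emd_ordered(p, q):
    """EMD between two distributions over m ordered values,
    assuming ground distance |i - j| / (m - 1)."""
    m = len(p)
    cum, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi          # running mass that must still be moved
        total += abs(cum)
    return total / (m - 1)

# Salaries ordered 3K..11K; Q is the whole-table distribution (uniform over
# 9 records), P is the first equivalence class {3K, 5K, 9K} of Table 4.
q = [1/9] * 9
p = [1/3, 0, 1/3, 0, 0, 0, 1/3, 0, 0]
print(round(emd_ordered(p, q), 3))  # 0.167 -- the salary closeness of Table 4
```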

5.4.1 Group Based Anonymization

The need for Group based Anonymization arises from the following weaknesses in existing systems for privacy preserving data mining. The randomization method is a simple technique which can be easily implemented at data collection time, because the noise added to a given record is independent of the behavior of other data records. This is also a weakness, because outlier records can often be difficult to mask. Clearly, in cases in which the privacy preservation does not need to be performed at data collection time, it is desirable to have a technique in which the level of accuracy depends upon the behavior of the locality of that given record. Another key weakness of the randomization framework is that it does not consider the possibility that publicly available records can be used to identify the owners of a record. The use of publicly available records can lead to privacy being heavily compromised in high-dimensional cases. This is especially true of outlier records, which can be easily distinguished from other records in their locality. Therefore, a broad approach in many privacy transformations is to construct groups of anonymous records which are transformed in a group-specific way. Group based Anonymization is efficient if we want to perform data mining in a privacy preserving way. We apply this integration of k-anonymity, l-diversity and t-closeness to micro-data as shown in Figure 2: it takes micro-data from an organization where protection of sensitive information is needed, runs Group based Anonymization, and then performs the data mining task. It preserves privacy in the following ways:
a. Prevent attribute disclosure. Attribute disclosure occurs when confidential information about a data subject is revealed and can be attributed to the subject. Attribute disclosure may occur when confidential information is revealed exactly or when it can be closely estimated. Thus, attribute disclosure comprises identification of the subject and divulging confidential information pertaining to the subject. Identity disclosure occurs if a third party can identify a subject or respondent from the released data [6]. Revealing that an individual is a respondent or subject of a data collection may or may not violate confidentiality requirements.

b. Prevent Inference disclosure. Inference disclosure occurs when information can be inferred
with high confidence from statistical properties of the released data.

c. It controls the utility measure, which tells how useful a given record is to an adversary. For example, suppose a person's medical information is disclosed: he suffers from diabetes and is insured with a company. A competitor company could then approach him with a more beneficial scheme, gaining an opportunity to grow its own business. t-closeness privacy preserving data mining avoids this problem.

d. It preserves the truthfulness of the data.

5.4.2 Group based Anonymization applied in Secure Multiparty Computation

Suppose there are a number of parties P1, P2, …, Pn, whose data sets are D1, D2, …, Dn respectively, and they want to do data mining on their databases collectively to obtain a function F(x), where x uses sensitive values of each party. If we add a trusted third party to the computation, then it collects all the inputs from the parties and gives the results. But what happens if there is no trusted third party? We can assume two types of abnormal behavior by the parties:

1. The parties follow the protocol, but they can collaborate and try to learn additional information.

2. Malicious parties can collaborate in any way to gather information and disrupt the secure computation.

If we apply Group based Anonymization to the data held by all parties wherever confidentiality is needed, then we achieve data anonymity; we can then carry out the data mining task and obtain results without compromising privacy. This is shown in the following figure.

[Figure 3 shows each party Pi applying Group based Anonymization (GA) to its database Di, with the data mining task (DM) performed on the combined anonymized data.]

Figure 3. Group based Anonymization applied on SMC.
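A hedged sketch of the flow in Figure 3: each party anonymizes its own micro-data before anything is shared, and mining runs only on the union of the anonymized releases. Both `group_anonymize` and `mine` are placeholders standing in for whichever concrete anonymization procedure and mining task are chosen; neither is a prescribed algorithm.

```python
def group_anonymize(records):
    """Placeholder for the chosen group-based anonymization procedure
    (generalization/suppression satisfying k-anonymity, l-diversity,
    t-closeness). Here: trivially generalize the zip code."""
    return [{**r, "zip": r["zip"][:3] + "**"} for r in records]

def mine(records):
    """Placeholder data mining task: frequency of each disease."""
    freq = {}
    for r in records:
        freq[r["disease"]] = freq.get(r["disease"], 0) + 1
    return freq

party_databases = [
    [{"zip": "47677", "disease": "flu"}, {"zip": "47602", "disease": "flu"}],
    [{"zip": "47909", "disease": "gastritis"}],
]

# Each party releases only anonymized data; mining never sees raw records.
released = [rec for db in party_databases for rec in group_anonymize(db)]
print(mine(released))  # {'flu': 2, 'gastritis': 1}
```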

In this chapter we studied which anonymity methods make up the group anonymity method and how we apply it to privacy preserving data mining. The next chapter presents the conclusion and future research.

CHAPTER 6

CONCLUSION AND
FUTURE RESEARCH

6.1 Conclusion

There are a number of privacy preserving models for micro-data release that aim to prevent sensitive information from disclosure, but they are not very promising because they do not guarantee privacy. We also read the most recent publications on privacy preserving data mining, but these too suffer from attacks such as the skewness attack or the attribute disclosure attack. So we narrowed down on the t-closeness privacy preserving model because it guarantees privacy against these attacks. It has its own limitations, however: we cannot choose a threshold value close to zero, because data mining then becomes difficult, and we also want to compute t-closeness effectively for high-dimensional data. In high-dimensional data, a high degree of generalization and suppression is needed for data Anonymization, so more information loss occurs. Slicing is an advanced approach to achieving data anonymity which splits high-dimensional data into low-dimensional data. It now depends on us how we will achieve t-closeness for efficient data Anonymization.

6.2 Future research

The group Anonymization method can be an efficient method for privacy preserving data mining, but for high-dimensional data it is an open question how to apply group Anonymization, because there is no universal rule for which method should be applied within group Anonymization to calculate the values of each individual model. Slicing is therefore a good concept for achieving data Anonymization in high-dimensional data.

ABBREVIATIONS

DM…………….Data mining

KDDM …….. Knowledge Discovery and Data Mining

DBMS …….. Database Management System

KDD ………. Knowledge Discovery in Databases

NHL ………. National Hockey League

NBA ……….. National Basketball Association

NIH …………… National Institutes of Health

PL ……………. Privacy Level

DL …………… Disclosure Level

VRML……………Virtual Reality Modeling Language

PPDM ………… Privacy preserving data mining

PRAM ……….. Post-randomization

EMD …………. Earth mover’s distance

SMC ………. Secure Multiparty Computation

NASA ………. National Aeronautics and Space Administration

CAD …………… Computer-Aided Design

GA…………………Group based Anonymization

REFERENCES

[1] O. R. Zaiane, ‘Principles of Knowledge Discovery in Databases’, CMPUT690 course notes, 1999.

[2] R. Agrawal and R. Srikant, ‘Privacy preserving data mining’, Proceedings of the ACM SIGMOD Conference on Management of Data, 2000, pp. 439-450.

[3] Y. Lindell and B. Pinkas, ‘Privacy preserving data mining’, J. Cryptology, Vol. 15, No. 3, 2000, pp. 36-54.

[4] L. Willenborg and T. de Waal, ‘Elements of Statistical Disclosure Control’, Springer-Verlag, Berlin, Germany, 2001.

[5] H. Mannila and H. Toivonen, ‘Levelwise search and borders of theories in knowledge discovery’, Data Mining and Knowledge Discovery, Vol. 1, No. 3, 1997, pp. 241-258.

[6] A. Friedman, R. Wolff and A. Schuster, ‘Providing k-anonymity in data mining’, The VLDB Journal, Vol. 17, 2008, pp. 789-804.

[7] E. Poovammal and M. Ponnavaikko, ‘An Improved Method for Privacy Preserving Data Mining’, International Advance Computing Conference, 2009, pp. 1453-1458.

[8] C. C. Aggarwal and P. S. Yu, ‘Privacy-Preserving Data Mining: Models and Algorithms’, Kluwer Academic Publishers, Boston/Dordrecht/London.

[9] ‘Anonymization techniques’, International Household Survey Network, http://www.surveynetwork.org/home/index.php?q=tools/anonymization/techniques.

[10] T. Yu and S. Jajodia, ‘k-Anonymity’, Secure Data Management in Decentralized Systems, Springer-Verlag, 2007.

[11] V. Ciriani, S. De Capitani di Vimercati, S. Foresti and P. Samarati, ‘k-Anonymity’, Advances in Information Security, Springer US, 2007.

[12] D. Kifer and J. Gehrke, ‘l-diversity: Privacy Beyond k-Anonymity’, International Conference on Data Engineering (ICDE), 2006.

[13] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, ‘ℓ-diversity: Privacy beyond k-anonymity’, available at http://www.cs.cornell.edu/_mvnak, 2005.

[14] N. Li, T. Li and S. Venkatasubramanian, ‘t-Closeness: Privacy Beyond k-Anonymity and l-Diversity’, International Conference on Data Engineering (ICDE), 2007, pp. 106-115.

Publications and Reprint:

Topic name – t-closeness privacy preserving data mining.

Conference name – to be presented at the 2010 International Conference on the Business and Digital Enterprises (ICBDE 2010).

Place – Gopalan Educational Society, Bangalore, India, in cooperation with the Digital Information Research Foundation (DIRF).

t- Closeness Privacy Preserving Data Mining
Rani Srivastava, Vishal Bhatnagar

Ambedkar Institute of Technology, Geeta Colony, Delhi

srivastava.rani32@gmail.com, vishalbhatnagar@yahoo.com

Abstract

Group based Anonymization methods, including k-anonymity, l-diversity and t-closeness, were introduced for data privacy. The t-closeness model was introduced in order to provide a safeguard against similarity attacks on a published data set. It requires that the earth mover's distance (EMD) between the distribution of a sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table should not differ by more than a threshold t. Our aim in this paper is to provide logical security of data through data Anonymity. There are many other models which can work as a substitute for t-closeness, but they cannot remove all shortcomings and also suffer from some limitations, so t-closeness remains efficient.

Keywords: Data Mining, Anonymization, t-closeness, EMD.

1. Introduction

Data mining methodologies have been widely adopted in research areas, whether economic, medical, or various business oriented domains such as marketing, credit scoring and fraud detection, where data mining has become an indispensable tool for business success. Increasingly, data mining methods are also being applied to industrial process optimization and control. Generally, data mining (sometimes called information or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. The data mining tool is one of the analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Basically, data mining is looking at the large amounts of information that have been collected on computers and using them to provide new information. Privacy in data mining is needed sometimes for business purposes and sometimes due to law, for example in medical databases, where no doctor is allowed to reveal sensitive information about any patient to someone else. Our focus is to do data mining in a privacy preserving way using an anonymization method such that no adversary is able to determine any sensitive information. When we view a database with respect to privacy, there are two types of attributes: sensitive and non-sensitive. An attribute is marked sensitive if an adversary must not be allowed to discover the value of that attribute for any individual in the database. An attribute not marked sensitive is non-sensitive. Sets of attributes that can be linked with external data to uniquely identify individuals in the dataset are called quasi-identifiers. To counter linking attacks using quasi identifiers, there is the concept of k-anonymity. A table satisfies k-anonymity if every record in the table is indistinguishable from at least k − 1 other records. But k-anonymity suffers from the homogeneity attack and the background knowledge attack, so a new concept, l-diversity, was introduced. l-diversity puts constraints on the minimum number of distinct values seen within an equivalence class for any sensitive attribute. An equivalence class has l-diversity if there are l or more well-represented values for the sensitive attribute. A table is said to be l-diverse if each equivalence class of the table is l-diverse. But an l-diverse table has disadvantages due to the skewness attack and the similarity attack, so t-closeness was introduced. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

2. Motivation and Related Research

Huge databases exist in society today; they include census data, media data, consumer data, data gathered by government agencies, and research data that needs to be published publicly. It is now possible for an adversary to learn a lot of information about an individual from public data, such as purchasing patterns, family history, medical data, media data and much more. This is an age of competitiveness, and every organization, whether government or non-government, wants to be ahead of the others. Then why allow an adversary to obtain useful information, make inferences, and share confidential information? Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques.

1. In the paper titled ‘An Improved Method for Privacy Preserving Data Mining’ [1] we noticed the following points.

a. It is assumed that one of the quasi identifier attributes is numeric and is transformed by a fuzzy based transformation. In the fuzzy based transformation, the number of fuzzy sets is calculated, equal to the number of linguistic terms, by deciding the size of each fuzzy set (k) and the minimum, maximum and mid point of each fuzzy set m1, …, mk. The actual value x is transformed using the functions f1(x), f2(x), …, fk(x), and the actual value is replaced with xn plus the category number. If the actual value of any of the quasi identifier member attributes is not known, it cannot identify a unique record; thereby the linking attack problem is solved. But what happens if the quasi identifier does not have a numerical attribute, as in the Netflix movie ratings? Then this approach cannot solve the linking problem.

b. Identifier attributes are not disclosed and are replaced by auto-generated id numbers if present. This leads to loss of truthfulness of the information and also to information loss.

c. For a categorical sensitive attribute, the transformation is performed using a mapping table prepared with domain knowledge, considering the privacy level (PL) and disclosure level (DL) set by the user. If an individual is willing to disclose his information, his PL and DL are set to true and no transformation is done. If he does not mind being linked with a lower probability, then PL is true and DL is false, and his sensitive value is replaced by a generalized value plus an arbitrary value; but if the privacy level is false, the value is replaced by an overall general value plus an arbitrary value. If both the disclosure level and the privacy level are true, there is no problem in doing research, because the true data are available; but there is a privacy breach if the linking problem cannot be solved as above, and attribute disclosure is then not prevented. And if no one wants to disclose his sensitive information, how will research take place?

2. The problem of privacy can be addressed using k-anonymity, and an extended definition of k-anonymity can be used to prove that a given data mining model does not violate the k-anonymity of the individuals represented in the data [2]. This extension provides a tool that measures the amount of anonymity retained during data mining. It is also shown that the model can be applied to various data mining problems, such as classification, association rule mining and clustering. Two data mining algorithms are explained which exploit this extension to guarantee that they will generate only k-anonymous output. Finally, it is shown that this method contributes new and efficient ways to anonymize data and preserve patterns during Anonymization. We have to prevent identity disclosure, attribute disclosure and inference disclosure, and to control the utility measure, which tells how useful a given respondent is, when publishing data publicly; therefore we require t-closeness.

3. Data Mining: An Introduction

Data mining involves the use of sophisticated data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. These tools can include
statistical models, mathematical algorithms, and machine learning methods (algorithms that
improve their performance automatically through experience, such as neural networks or
decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [3] [4]. It is the process of discovering
meaningful new correlations, patterns and trends by sifting through large amounts of data
stored in repositories, using pattern recognition technologies as well as statistical and
mathematical techniques. While data mining is a technology that has a large number of advantages, the biggest problem that needs to be addressed is privacy. In the information age it sometimes seems as if everyone wants to know everything about you, which has led to the rise of identity theft. Whenever you go to a bank to fill out a loan application, the information you put on it will probably be placed in a database. When you conduct an interview over the phone or on the internet, the information that you submit is also placed in a database. Medical data are stored in a database for research purposes.

Many proponents of data mining assume that the information held by an organization will exist in one location. In reality, this information can fall into the hands of anyone; once a single copy of it surfaces on the internet, it can be replicated numerous times, and data also need to be published for research purposes. Many of the consumers who buy products or services are not aware of data mining technology. They may not know that their shopping habits, names, addresses, and other information are being stored in a database. While data mining might be a term that is well understood in certain circles, if you are the owner of a business, do your customers know that you are putting their information in a database? Do you bother to tell them? If you are adding them to a database without their knowledge, how do you think they would feel if they found out? While data mining has a number of advantages, customers should be assured about the privacy of information placed in a database. Large corporations that are fiercely competitive may avoid giving their customers assurance of not publishing their data, because they do not want to lower their chances of having an edge on their competition. Because of this, they are faced with the ethical problem of whether or not they should give customers the option of not publishing their data for data mining. One of the important issues in such processes is how to protect the trade secrecy of corporations and the privacy of customers contained in the data sets collected and used for the purpose of data mining. Detailed person-specific data, present on a centralized server or in a distributed environment, in its original form often contains sensitive information about individuals, and publishing such data immediately violates individual privacy. The main problem in this regard is to develop methods for publishing data in a more hostile environment so that the published data remain practically useful while individual privacy is preserved. Suppose n parties, each having a private database, want to jointly conduct a data mining operation on the union of their databases. How could these parties accomplish this without disclosing their databases to the other parties or any third party? Given all these facts, it seems that privacy preserving data mining will play an increasing role in the future of data privacy in the information age.

4. Anonymization Methods: An Introduction

Anonymization is the process of making data publicly available without compromising individual privacy. One of the functions of a federal statistical agency is to collect individually sensitive data, process it and provide statistical summaries, and/or public use micro data files to the public. Some of the data collected are considered proprietary by respondents. On the other hand, not all data collected and published by the government are subject to disclosure limitation techniques. Some data on businesses that are collected for regulatory purposes are considered public. In addition, some data are not considered sensitive and are not collected under a pledge of confidentiality. The statistical disclosure limitation techniques described below are applied wherever confidentiality is required and data or estimates are made publicly available. All disclosure limitation methods result in some loss of information, and sometimes the publicly available data may not be adequate for certain statistical studies. However, the intention is to provide as much data as possible without revealing individually sensitive data. Statistical disclosure limitation methods can be classified in two categories [5]:

a. Methods based on data reduction. Such methods aim at increasing the number of individuals in the sample/population sharing the same or similar identifying characteristics presented by the investigated statistical unit. Such procedures tend to avoid the presence of unique or rare recognizable individuals.

b. Methods based on data perturbation. Such methods achieve data protection from a twofold perspective. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and more uncertain. Secondly, even when an intruder is able to re-identify a unit, he/she cannot be confident that the disclosed data are consistent with the original data. An alternative solution consists in generating synthetic micro-data.

Data reduction includes removing variables, removing records, global recoding, top and bottom coding and local suppression. Data perturbation includes micro-aggregation, data swapping, post-randomization, adding noise and re-sampling. Synthetic data are an alternative approach to data protection and are produced by using data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control because they do not contain real data but preserve certain statistical properties. Synthetic data can be generated using the bootstrap method, multiple imputation and data distribution by probability.

5. t-closeness privacy preserving data mining

The Group based Anonymization method is the result of the combined use of k-anonymity, l-diversity and t-closeness. k-anonymity places similar data in a group, and each group is then anonymized individually.

5.1 k-anonymity

If there are many respondents whose dataset we have to publish, then k-anonymity says that each release of data must satisfy the constraint that every combination of values of the quasi identifier can be indistinctly matched to at least k respondents. Let T(A1,…,An) be a table and QI be a quasi identifier associated with it. T is said to satisfy k-anonymity with respect to the quasi identifier iff each sequence of values in T[QI] appears with at least k occurrences in T[QI]. Consider an inpatient micro-data table with quasi identifier (Zip code, Age, Nationality) and sensitive attribute Condition. This table satisfies 4-anonymity if each tuple has at least three other tuples in the table with the same quasi-identifier values [6][7][8]. To achieve anonymity, k-anonymity focuses on two techniques known as generalization and suppression, which preserve truthfulness, unlike other techniques such as scrambling and swapping.

5.2 l-diversity

To overcome the homogeneity attack and the background knowledge attack faced by k-anonymity, l-diversity was introduced. A q⋆-equivalence class is l-diverse if it contains at least l “well-represented” values for the sensitive attribute S. A table is l-diverse if every q⋆-equivalence class is l-diverse, where the q⋆-equivalence class is the set of tuples in table T⋆ whose non-sensitive attribute values generalize to q⋆ [9]. Consider a 4-anonymous inpatient micro-data table: it is 3-diverse, i.e., l = 3, if every block has at least 3 distinct sensitive values.

5.3 t-closeness

Privacy can be measured by the information gained by an adversary, i.e., by the difference between his posterior belief and his prior belief. Let Q be the distribution of the sensitive attribute in the whole table and P the distribution of the sensitive attribute in an equivalence class. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness [10].

S.No.   ZIP Code   Age    Salary   Disease
1       4767*      ≤ 40   3K       gastric ulcer
3       4767*      ≤ 40   5K       stomach cancer
8       4767*      ≤ 40   9K       pneumonia
4       4790*      ≥ 40   6K       gastritis
5       4790*      ≥ 40   11K      flu
6       4790*      ≥ 40   8K       bronchitis
2       4760*      ≤ 40   4K       gastritis
7       4760*      ≤ 40   7K       bronchitis
9       4760*      ≤ 40   10K      stomach cancer

Table 1

This table is 3-anonymous, 3-diverse, and has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease [10].

[Figure 1 shows each original database DB1 … DBn (micro-data) passing through t-closeness (t) before the data mining task (DM) is performed.]

Figure 1.

t-closeness came into focus in order to prevent the skewness attack and the similarity attack; for this it makes calculations based on the EMD between two distributions. Let us take one example in which there are 1000 students enrolled and their corresponding records. Say 1% of the students have failed and the rest have passed. Suppose one equivalence class has an equal number of pass and fail records. Anyone belonging to that equivalence class would be considered to have a 50% chance of having failed, as compared with the 1% initially. This is thus a major privacy risk. Again consider an equivalence class with 49 fail records and 1 pass record. It would satisfy 2-diversity, but there would still be a 98% chance of having failed for someone in that equivalence class, which is much more than the 1% initially. This equivalence class has the same diversity as a class that has 1 fail and 49 pass records, but we can clearly see that both have different levels of sensitivity. This is the skewness attack. Similarity attacks occur when the sensitive attributes are semantically similar. t-closeness requires that the earth mover's distance between the distribution of the sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table does not differ by more than a predefined parameter t. The EMD is based on the minimum amount of work needed to transform one distribution into another by moving distribution mass between them. EMD can be formally defined using the well-studied transportation problem. Let P = (p1, p2, …, pm), Q = (q1, q2, …, qm), and let dij be the ground distance between element i of P and element j of Q. We want to find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work [10]:

\[ \mathrm{WORK}(P,Q,F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij} \]

subject to the following constraints:

\[ f_{ij} \ge 0, \qquad 1 \le i \le m,\; 1 \le j \le m \qquad (c1) \]

\[ p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \qquad 1 \le i \le m \qquad (c2) \]

\[ \sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1 \qquad (c3) \]

These three constraints guarantee that P is transformed to Q by the mass flow F. Once the transportation problem is solved, the EMD is defined to be the total work:

\[ D[P,Q] = \mathrm{WORK}(P,Q,F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij} \]

1. If 0 ≤ dij ≤ 1 for all i, j, then 0 ≤ D[P,Q] ≤ 1. This fact follows directly from constraints (c1) and (c3). It says that if ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

2. Given two equivalence classes E1 and E2, let P1, P2 and P be the distributions of a sensitive attribute in E1, E2 and E1 ∪ E2 respectively. Then

D[P,Q] ≤ (|E1|/(|E1| + |E2|)) D[P1,Q] + (|E2|/(|E1| + |E2|)) D[P2,Q].

In other words, if both E1 and E2 satisfy t-closeness, so does their union; this property makes group based Anonymization efficient when we want to perform data mining in a privacy preserving way.

Group based Anonymization is efficient if we want to perform data mining in a privacy preserving way. We apply k-anonymity, l-diversity and t-closeness to micro-data as shown in Figure 1: micro-data are taken from all departments, organizations etc. where protection of sensitive information is needed, t-closeness is run separately on the micro-data of each concerned organization, and then the data mining task is performed. Here DB1, …, DBn represent the original databases, also termed micro-data; DM stands for data mining and t for t-closeness. It preserves privacy in the following ways:

a. Prevent attribute disclosure. Attribute disclosure occurs when confidential information about a data subject is revealed and can be attributed to the subject. Attribute disclosure may occur when confidential information is revealed exactly or when it can be closely estimated. Thus, attribute disclosure comprises identification of the subject and divulging confidential information pertaining to the subject. Identity disclosure occurs if a third party can identify a subject or respondent from the released data [6]. Revealing that an individual is a respondent or subject of a data collection may or may not violate confidentiality requirements.

b. Prevent inference disclosure. Inference disclosure occurs when information can be inferred with high confidence from statistical properties of the released data.

c. It controls the utility measure, which tells how useful a given record is to an adversary. For example, suppose a person's medical information is disclosed: he suffers from diabetes and is insured with a company. A competitor company could then approach him with a more beneficial scheme, gaining an opportunity to grow its own business. t-closeness privacy preserving data mining avoids this problem.

d. It preserves the truthfulness of the data.

6. Conclusion and future research

There are many privacy preserving models for micro-data release that aim to prevent sensitive information from disclosure, but they are not very promising because they do not guarantee privacy. We also read the most recent publications on privacy preserving data mining, but these too suffer from attacks such as the skewness attack or the attribute disclosure attack. So we narrowed down on the t-closeness privacy preserving model because it guarantees privacy against these attacks. It has its own limitations, however: we cannot choose a threshold value close to zero, because data mining then becomes difficult, and we also want to compute t-closeness effectively for high-dimensional data. In high-dimensional data, a high degree of generalization and suppression is needed for data Anonymization, so more information loss occurs. Slicing is an advanced approach to achieving data anonymity which splits high-dimensional data into low-dimensional data. It now depends on us how we will achieve t-closeness for efficient data Anonymization.


Reference

[1] Poovammal E. and Ponnavaikko M., ‘An Improved Method for Privacy Preserving Data Mining’, International Advance Computing Conference, 2009, pp. 1453-1458.

[2] Friedman A., Wolff R. and Schuster A., ‘Providing k-anonymity in data mining’, The VLDB Journal, Vol. 17, 2008, pp. 789-804.

[3] Jeffrey W. S., ‘Data mining: An overview’, CRS report RL31798.

[4] Aggarwal C. C. and Yu P. S., ‘Privacy-Preserving Data Mining: Models and Algorithms’, Springer, 2008.

[5] ‘Anonymization techniques’, International Household Survey Network, http://www.surveynetwork.org/home/index.php?q=tools/anonymization/techniques.

[6] Samarati P., ‘Protecting Respondents’ Identities in Micro-data Release’, IEEE Trans. Knowledge and Data Eng., Vol. 13, No. 6, 2001, pp. 1010-1027.

[7] Jajodia S., Yu T., Bayardo R. J. and Agrawal R., ‘k-Anonymity. Security in Decentralized Data Management’, Springer, 2006.

[8] Yu T. and Jajodia S., ‘k-Anonymity’, Secure Data Management in Decentralized Systems, Springer-Verlag, 2007.

[9] Kifer D. and Gehrke J., ‘l-diversity: Privacy Beyond k-Anonymity’, International Conference on Data Engineering (ICDE), 2006.

[10] Li N., Li T. and Venkatasubramanian S., ‘t-Closeness: Privacy Beyond k-Anonymity and l-Diversity’, International Conference on Data Engineering (ICDE), 2007, pp. 106-115.
