
K-ANONYMITY MODEL - PROTECTING PRIVACY

Project Submitted in Partial Fulfillment of the Requirement for the
Award of Degree of
Bachelor of Technology in Computer Science & Engineering


Submitted By

Anubhav Aggaral (87/CSE/2K7)
Saurav Suman (46/CSE/2K7)
Ravi Rajak (07/CSE/2K7)
Ashutosh Kr. Jha (41/CSE/2K7)

Under Supervision of:
Prof. Binod Kumar


Department of Computer Science and Engineering
Cambridge Institute of Technology, Ranchi
(2011)





ACKNOWLEDGEMENTS

We are pleased to acknowledge Mr Binod Kumar for his invaluable guidance during the
course of this project work.
We extend our sincere thanks to Mr Arshad Usmani who continuously helped us
throughout the project and without his guidance, this project would have been an uphill
task.
We are also grateful to other members of the CIT team who co-operated with us
regarding some issues.

We would also like to thank the 'UC Irvine machine repository' for providing the very
useful databases for different organizations under the Open Source banner, which greatly
helped us in writing the database part.

Last but not the least, we thank our friends, who also co-operated with us nicely for the
smooth development of this project.

July 13th, 2011

Anubhav, Saurav, Ravi, Ashutosh















TABLE OF CONTENTS


Abstract

1. Introduction
   1.1. General k-anonymity model

2. Motivation
   2.1. Statistical databases
   2.2. Multi-level databases
   2.3. Computer security is not privacy protection
   2.4. Multiple queries can leak inference

3. Preliminaries
   3.1. Basic Concepts
   3.2. Existing Techniques

4. Anonymization and Clustering
   4.1. Categorization of major clustering methods

5. Definitions (frequently used terms in clustering)
   5.1 Cluster
   5.2 Distance between Two Clusters
   5.3 Similarity
   5.4 Average Similarity
   5.5 Threshold
   5.6 Similarity Matrix
   5.7 Dissimilarity Coefficient
   5.8 Cluster Seed

6. Real world Applications of Clustering
   6.1 Similarity searching in Medical Image Database
   6.2 Data Mining
   6.3 Windows NT
   6.4 Other applications

7. Existing Theorem used in this project
   7.1 Distance and Cost Function

8. Proposed Anonymization Algorithm

9. Java Code for implementing proposed algorithm

10. Project Output Screen

11. Experimental Results
    11.1 Experimental Setup

12. Conclusions
    12.1 Clustering-Based Approaches

13. References













Abstract

k-anonymity is a model that addresses the question, "How can a data holder
release a version of its private data with scientific guarantees that the
individuals who are the subjects of the data can't be re-identified while the data
remains practically useful?" [13] For instance, a medical institution may want to
release a table of medical records. Even though the names of the individuals
can be replaced with dummy identifiers, a set of attributes called the quasi-
identifier can still leak confidential information. For instance, the birth date,
zip code and gender attributes in the disclosed table can uniquely determine an
individual. By joining such a table with some other publicly available information source,
like a voters' list table, which consists of records containing the attributes that make up
the quasi-identifier as well as the identities of individuals, the medical information can
be easily linked to individuals. k-anonymity prevents such a privacy breach by
ensuring that each individual record can only be released if there are at least k-1
other (distinct) individuals whose associated records are indistinguishable from it in
terms of their quasi-identifier values.
k-anonymization techniques have been the focus of intense research in the last few
years. An important requirement for such techniques is to ensure anonymization of
data while at the same time minimizing the information loss resulting from data
modification.
In this paper we propose an approach that uses the idea of clustering to minimize
information loss and thus ensure good data quality. The key observation here is that
data records that are naturally similar to each other should be part of the same
equivalence class. We thus formulate a specific clustering problem, referred to as the
k-member clustering problem. We prove that this problem is NP-hard and present a
greedy heuristic, the complexity of which is in O(n²).







1. Introduction

A recent approach addressing data privacy relies on the notion of k-anonymity. In this
approach, data privacy is guaranteed by ensuring that any record in the released data is
indistinguishable from at least (k − 1) other records with respect to a set of attributes
called the quasi-identifier. Although the idea of k-anonymity is conceptually
straightforward, the computational complexity of finding an optimal solution for the
k-anonymity problem has been shown to be NP-hard, even when one considers only cell
suppression. The k-anonymity problem has recently drawn considerable interest from the
research community, and a number of algorithms have been proposed. Current
solutions, however, suffer from high information loss, mainly due to their reliance on
predefined generalization hierarchies or a total order imposed on each attribute domain.

The main goal of our work is to develop a new k-anonymization approach that
addresses such limitations. The key idea underlying our approach is that the
k-anonymization problem can be viewed as a clustering problem. Intuitively, the k-anonymity
requirement can be naturally transformed into a clustering problem where we want to
find a set of clusters (i.e., equivalence classes), each of which contains at least k
records. In order to maximize data quality, we also want the records in a cluster to be
as similar to each other as possible. This ensures that less distortion is required when the
records in a cluster are modified to have the same quasi-identifier value. We thus
formulate a specific clustering problem, which we call the k-member clustering problem. We
prove that this problem is NP-hard and present a greedy algorithm which runs in time
O(n²). Although our approach does not rely on generalization hierarchies, if there
exist some natural relations among the values in a domain, our algorithm can
incorporate such information to find more desirable solutions. We note that while
many quality metrics have been proposed for hierarchy-based generalization, a
metric that precisely measures the information loss introduced by hierarchy-free
generalization has not yet been introduced. For this reason, we define a data quality
metric for hierarchy-free generalization, which we call the information loss metric. We
also show that with a small modification, our algorithm is able to reduce classification
errors effectively.


The remainder of this paper is organized as follows. We review the basic concepts
of the k-anonymity model and survey existing techniques. We formally define the
problem of k-anonymization as a clustering problem and introduce our approach. Then
we evaluate our approach based on the experimental results.

1.1. General k-Anonymity Model

k-anonymity is a model that addresses the question, "How can a data holder
release a version of its private data with scientific guarantees that the
individuals who are the subjects of the data can't be re-identified while the data
remains practically useful?" For instance, a medical institution may want to
release a table of medical records. Even though the names of the individuals
can be replaced with dummy identifiers, a set of attributes called the quasi-
identifier can still leak confidential information. For instance, the birth date,
zip code and gender attributes in the disclosed table can uniquely determine an
individual. By joining such a table with some other publicly available information source,
like a voters' list table, which consists of records containing the attributes that make up
the quasi-identifier as well as the identities of individuals, the medical information can
be easily linked to individuals. k-anonymity prevents such a privacy breach by
ensuring that each individual record can only be released if there are at least k-1
other (distinct) individuals whose associated records are indistinguishable from it in
terms of their quasi-identifier values.
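To make the linking attack concrete, the following minimal Java sketch (our own illustration; the table layouts and sample values are hypothetical) joins a de-identified medical table with a public voter list on the quasi-identifier (zip code, gender, birth date):

// Minimal sketch of a linking attack: a de-identified medical table is
// joined with a public voter list on the quasi-identifier columns.
public class LinkingAttack {
    public static void main(String[] args) {
        // {zip, gender, birth date, disease} -- names already removed
        String[][] medical = {
            {"831001", "Male", "1984-02-11", "Flu"},
            {"825303", "Female", "1990-07-02", "Obesity"}};
        // {name, zip, gender, birth date} -- publicly available
        String[][] voters = {{"John Doe", "831001", "Male", "1984-02-11"}};

        for (String[] m : medical)
            for (String[] v : voters)
                // a unique match on (zip, gender, birth date) re-identifies
                if (m[0].equals(v[1]) && m[1].equals(v[2]) && m[2].equals(v[3]))
                    System.out.println(v[0] + " -> " + m[3]);
    }
}

Running the sketch prints "John Doe -> Flu": the disease is linked back to a named individual even though the medical table itself contains no names.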











2. Motivation

The problem of releasing a version of privately held data so that the individuals who
are the subjects of the data cannot be identified is not a new problem. There are
existing works to consider in the statistics community on statistical databases and in the
computer security community on multi-level databases. However, none of these
works provide solutions to the broader problems experienced in today's data-rich
setting.

2.1. Statistical databases

Federal and state statistics offices around the world have traditionally been concerned with
the release of statistical information about all aspects of the populace. But like other
data holders, statistics offices are also facing tremendous demand for person-specific
data for applications such as data mining, cost analysis, fraud detection and
retrospective research. But many of the established statistical database techniques,
which involve various ways of adding noise to the data while still maintaining some
statistical invariant, often destroy the integrity of records, or tuples, and so, for many
new uses of data, these established techniques are not appropriate. Willenborg and De
Waal provide more extensive coverage of traditional statistical techniques.

2.2. Multi-level databases

Another related area is aggregation and inference in multi-level databases which
concerns restricting the release of lower classified information such that higher
classified information cannot be derived. Denning and Lunt described a multilevel
relational database system (MDB) as having data stored at different security
classifications and users having different security clearances. Su and Ozsoyoglu
formally investigated inference in MDB. They showed that eliminating precise
inference compromise due to functional dependencies and multi-valued dependencies
is NP-complete. By extension to this work, the precise elimination of all inferences with


respect to the identities of the individuals whose information is included in person-
specific data is typically impossible to guarantee. Intuitively this makes sense because
the data holder cannot consider a priori every possible attack. In trying to produce
anonymous data, the work that is the subject of this paper seeks to primarily protect
against known attacks. The biggest problems result from inferences that can be
drawn after linking the released data to other knowledge, so in this work, it is the
ability to link the result to foreseeable data sources that must be controlled.

Many aggregation inference problems can be solved by database design, but this
solution is not practical in today's data rich setting. n today's environment, information
is often divided and partially replicated among multiple data holders and the data
holders usually operate autonomously in making decisions about how data will be
released. Such decisions are typically made locally with incomplete knowledge of how
sensitive other holders of the information might consider replicated data. For example,
when somewhat aged information on joint projects is declassified differently by the
Department of Defense than by the Department of Energy, the overall declassification
effort suffers; using the two partial releases, the original may be reconstructed in its
entirety. In general, systems that attempt to produce anonymous data must operate
without the degree of omniscience and level of control typically available in the
traditional aggregation problem.
In both aggregation and MDB, the primary technique used to control the flow of
sensitive information is suppression, where sensitive information and all information
that allows the inference of sensitive information are simply not released.
Suppression can drastically reduce the quality of the data, and in the case of
statistical use, overall statistics can be altered, rendering the data practically
useless.
When protecting national interests, not releasing the information at all may be possible,
but the greatest demand for person-specific data is in situations where the data holder
must provide adequate protections while keeping the data useful, such as sharing
person-specific medical data for research and survey purposes.


2.3. Computer security is not privacy protection

An area that might appear to have a common ancestry with the subject of this paper is
access control and authentication, which are traditional areas associated with
computer security. Work in this area ensures that the recipient of information has the
authority to receive that information. While access control and authentication
protections can safeguard against direct disclosures, they do not address disclosures
based on inferences that can be drawn from released data. The more insidious
problem in the work that is the subject of this paper is not so much whether the
recipient can get access to the information, but rather what values will
constitute the information the recipient will receive. A general doctrine of the work
presented herein is to release all the information but to do so such that the identities of
the people who are the subjects of the data (or other sensitive properties found in the
data) are protected. Therefore, the goal of the work presented in this paper lies
outside of traditional work on access control and authentication.

2.4. Multiple queries can leak inference

Denning and others were among the first to explore inferences realized from
multiple queries to a database. For example, consider a table containing only
(physician, patient, and medication). A query listing the patients seen by each
physician, i.e., a relation R (physician, patient), may not be sensitive. Likewise, a query
itemizing medications prescribed by each physician may also not be sensitive. But
the query associating patients with their prescribed medications may be
sensitive because medications typically correlate with diseases. One common
solution, called query restriction, prohibits queries that can reveal sensitive
information. This is effectively realized by suppressing all inferences to sensitive
data. In contrast, this work poses a real-time solution to this problem by advocating
that the data be first rendered sufficiently anonymous, and then the resulting data
used as the basis on which queries are processed. Doing so typically retains far
more usefulness in the data because the resulting release is often less distorted.


In summary, the dramatic increase in the availability of person-specific data from
autonomous data holders has expanded the scope and nature of inference control
problems and strained established operating practices. The goal of this work is to
provide a model for understanding, evaluating and constructing computational systems
that control inference in this setting.


3. Preliminaries

3.1. Basic Concepts

The k-anonymity model assumes that person-specific data are stored in a table (or a
relation) of columns (or attributes) and rows (or records). The process of anonymizing
such a table starts with removing all the explicit identifiers, such as name and SSN, from
the table. However, even though a table is free of explicit identifiers, some of the
remaining attributes in combination could be specific enough to identify individuals if the
values are already known to the public. For example, as shown by Sweeney [1, 2, 3],
most individuals in the United States can be uniquely identified by a set of attributes
such as {ZIP, gender, date of birth}. Thus, even if each attribute alone is not specific
enough to identify individuals, a group of certain attributes together may identify a
particular individual. The set of such attributes is called a quasi-identifier.

The main objective of the k-anonymity model is thus to transform a table so that no
one can make high-probability associations between records in the table and the
corresponding entities. In order to achieve this goal, the k-anonymity model requires
that any record in a table be indistinguishable from at least (k − 1) other records with
respect to the pre-determined quasi-identifier. A group of records that are
indistinguishable from each other is often referred to as an equivalence class. By
enforcing the k-anonymity requirement, it is guaranteed that even though an
adversary knows that a k-anonymous table contains the record of a particular
individual and also knows some of the quasi-identifier


ZIP      Gender   Age   Disease
831001   Male     25    Flu
825303   Female   12    Obesity
834009   Male     34    Cancer
831001   Male     26    HIV+
825303   Male     16    Cancer
834009   Male     32    Diabetes
825303   Female   26    Obesity
831001   Male     27    Flu
834009   Female   31    Flu

Fig. 1 Patient Table

ZIP      Gender   Age       Diagnosis
83100*   Person   [25-30]   Flu
82530*   Person   [10-15]   Obesity
83400*   Person   [30-35]   Cancer
83100*   Person   [25-30]   HIV+
82530*   Person   [15-20]   Cancer
83400*   Person   [30-35]   Diabetes
82530*   Person   [25-30]   Obesity
83100*   Person   [25-30]   Flu
83400*   Person   [30-35]   Flu

Fig. 2 Anonymous Patient Table

attribute values of the individual, he/she cannot determine which record in the table
corresponds to the individual with a probability greater than 1/k. For example, a
3-anonymous version of the table in Fig. 1 is shown in Fig. 2.


The k-anonymity model is an approach to protect data from individual
identification. It works by ensuring that each record of a table is identical to at
least k − 1 other records with respect to a set of privacy-related attributes,
called quasi-identifiers, that could be potentially used to identify individuals
by linking these attributes to external data sets. For example, consider the
hospital data in Table 1, where the attributes Zip Code, Gender and Age are
regarded as quasi-identifiers. Table 2 gives a 3-anonymization version of the
table in Table 1, where anonymization is achieved via generalization at the
attribute level, i.e., if two records contain the same value at a quasi-identifier,
they will be generalized to the same value at the quasi-identifier as well. Table 3
gives another 3-anonymization version of the table in Table 1, where
anonymization is achieved via generalization at the cell level, i.e., two cells with the
same value could be generalized to different values (e.g., value 75275 in the Zip
Code column and value Male in the Gender column).

Table 1: Patient records of a hospital
Zip Code   Gender   Age   Disease    Expense
75275      Male     22    Flu        100
75277      Male     23    Cancer     3000
75278      Male     24    HIV+       5000
75275      Male     33    Diabetes   2500
75275      Female   38    Diabetes   2800
75275      Female   36    Diabetes   2600

Table 2: Anonymization at attribute level
Zip Code   Gender   Age       Disease    Expense
7527*      Person   [21-30]   Flu        100
7527*      Person   [21-30]   Cancer     3000
7527*      Person   [21-30]   HIV+       5000
7527*      Person   [31-40]   Diabetes   2500
7527*      Person   [31-40]   Diabetes   2800
7527*      Person   [31-40]   Diabetes   2600



Table 3: Anonymization at cell level
Zip Code   Gender   Age       Disease    Expense
7527*      Male     [21-25]   Flu        100
7527*      Male     [21-25]   Cancer     3000
7527*      Male     [21-25]   HIV+       5000
75275      Person   [31-40]   Diabetes   2500
75275      Person   [31-40]   Diabetes   2800
75275      Person   [31-40]   Diabetes   2600
Because anonymization via generalization at the cell level generates data
that contains different generalization levels within a column, utilizing such data
becomes more complicated than utilizing the data generated via
generalization at the attribute level. However, generalization at the cell level
causes less information loss than generalization at the attribute level. Hence,
as far as data quality is concerned, generalization at the cell level seems to
generate better data than generalization at the attribute level.
Anonymization via generalization at the cell level can proceed in two steps.
First, all records are partitioned into several groups such that each group
contains at least k records. Then, the records in each group are generalized
such that their values at each quasi-identifier are identical. To minimize the
information loss incurred by the second step, the first step should place similar
records (with respect to the quasi-identifiers) in the same group. In the
context of data mining, clustering is a useful technique that partitions records into
clusters such that records within a cluster are similar to each other, while

records in different clusters are most distinct from one another. Hence,
clustering could be used for k-anonymization.
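As a concrete illustration of the second step, the sketch below (our own simplified example; the attribute layout is hypothetical) generalizes one group of records to a common quasi-identifier, replacing ages by a [min-max] range and zip codes by their longest common prefix padded with '*':

// Sketch: generalize one group of similar records to a common
// quasi-identifier value, as in Table 2 and Table 3.
public class GeneralizeGroup {
    public static void main(String[] args) {
        String[] zips = {"75275", "75277", "75278"};
        int[] ages = {22, 23, 24};

        // numeric attribute: generalize to the range [min, max]
        int min = ages[0], max = ages[0];
        for (int a : ages) { min = Math.min(min, a); max = Math.max(max, a); }

        // zip code: generalize to the longest common prefix, padded with '*'
        String prefix = zips[0];
        for (String z : zips)
            while (!z.startsWith(prefix))
                prefix = prefix.substring(0, prefix.length() - 1);
        StringBuilder zip = new StringBuilder(prefix);
        while (zip.length() < zips[0].length()) zip.append('*');

        System.out.println("Zip: " + zip + "  Age: [" + min + "-" + max + "]");
        // prints: Zip: 7527*  Age: [22-24]
    }
}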

3.2. Existing Techniques

The k-anonymity requirement is typically enforced through generalization, where real
values are replaced with "less specific but semantically consistent" values. Given a
domain, there are various ways to generalize the values in the domain. Typically,
numeric values are generalized into intervals, and categorical values are generalized into
a set of distinct values (e.g., {USA, Canada}) or a single value that represents such a
set (e.g., North-America).

Various generalization strategies have been proposed. A non-overlapping
generalization hierarchy is first defined for each attribute of the quasi-identifier. Then an
algorithm tries to find an optimal (or good) solution which is allowed by such
generalization hierarchies. Note that in these schemes, if a lower-level domain needs to
be generalized to a higher-level domain, all the values in the lower domain are
generalized to the higher domain. This restriction could be a significant drawback in that it
may lead to relatively high data distortion due to unnecessary generalization. Moreover,
the possible generalizations are still limited by the imposed generalization
hierarchies.

Recently, some schemes that do not rely on generalization hierarchies have been
proposed. For instance, LeFevre et al. transform the k-anonymity problem into a partitioning
problem. Specifically, their approach consists of the following two steps. The first step is
to find a partitioning of the d-dimensional space, where d is the number of attributes in
the quasi-identifier, such that each partition contains at least k records. Then the
records in each partition are generalized so that they all share the same quasi-identifier
value. Although shown to be efficient, these approaches have the disadvantage that they
require a total order for each attribute domain. This makes them impractical in most cases
involving categorical data, which has no meaningful order.



4. Anonymization and Clustering

The key idea underlying our approach is that the k-anonymization problem can be
viewed as a clustering problem. Clustering is the problem of partitioning a set of objects
into groups such that objects in the same group are more similar to each other than
objects in other groups with respect to some defined similarity criteria. Intuitively, an
optimal solution of the k-anonymization problem is indeed a set of equivalence
classes such that records in the same equivalence class are very similar to each
other, thus requiring minimum generalization.


4.1. Categorization of major clustering methods

There exist a large number of clustering algorithms in the literature. The choice of
clustering algorithm depends on both the particular purpose and the application. If cluster
analysis is used as a descriptive or exploratory tool, it is possible to try several
algorithms on the same data to see what the data may disclose.
In general, major clustering methods can be classified into the following categories.

1. Partitioning methods.
2. Hierarchical methods.
3. Density-based methods.
4. Grid-based methods.
5. Model-based methods.







1. Partitioning methods

Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster, and k ≤ n. That is, it
classifies the data into k groups, which together satisfy the following requirements:

(1) Each group must contain at least one object
(2) Each object must belong to exactly one group

Notice that the second requirement is relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial
partitioning. It then uses an iterative relocation technique which attempts to improve the
partitioning by moving objects from one group to another. The general criterion of a
good partitioning is that objects in the same cluster are "close" or related to each other,
whereas objects of different clusters are "far apart" or very different. There are various
other kinds of criteria for judging the quality of partitions.
To achieve global optimality in partitioning-based clustering would require the
exhaustive enumeration of all of the possible partitions. Instead, most applications adopt
one of two popular heuristic methods.

(1) The k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster.
(2) The k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.

These heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium sized databases. For finding clusters with complex shapes and for
clustering very large data sets, partitioning-based methods need to be extended.
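To make the first of these heuristics concrete, here is a minimal one-dimensional k-means sketch (our own illustration; real implementations handle empty clusters, smarter initialization and convergence tolerances):

import java.util.Arrays;

// Minimal 1-D k-means sketch: repeatedly assign each point to the nearest
// mean and recompute the means, until the assignment stabilizes.
public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12, 25, 26};
        double[] means = {1, 10, 25};            // naive initialization
        int[] assign = new int[points.length];

        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < points.length; i++) {   // assignment step
                int best = 0;
                for (int c = 1; c < means.length; c++)
                    if (Math.abs(points[i] - means[c])
                            < Math.abs(points[i] - means[best])) best = c;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < means.length; c++) {    // update step
                double sum = 0; int cnt = 0;
                for (int i = 0; i < points.length; i++)
                    if (assign[i] == c) { sum += points[i]; cnt++; }
                if (cnt > 0) means[c] = sum / cnt;
            }
        }
        System.out.println(Arrays.toString(assign) + " " + Arrays.toString(means));
    }
}

k-medoids proceeds analogously, except that each representative must be one of the cluster's own objects, which makes it more robust to outliers.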




2. Hierarchical methods.

A hierarchical method creates a hierarchical decomposition of the given set of data
objects. A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed. The agglomerative
approach, also called the "bottom-up" approach, starts with each object forming a
separate group. It successively merges the objects or groups close to one another, until
all of the groups are merged into one, or until a termination condition holds. The divisive
approach, also called the "top-down" approach, starts with all the objects in the same
cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step is done, it can never be
undone. This rigidity of the hierarchical method is both the key to its success, because it
leads to smaller computation cost without worrying about a combinatorial number of
different choices, as well as the key to its main problem, because it cannot correct
erroneous decisions.
It can be advantageous to combine iterative relocation and hierarchical agglomeration
by first using a hierarchical agglomerative algorithm and then refining the result using
iterative relocation. Some scalable clustering algorithms, such as BIRCH and CURE,
have been developed based on such an integrated approach.

3. Density-based methods

Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes. Other clustering methods have been developed based on the
notion of density. The general idea is to continue growing the given cluster so long as
the density (number of objects or points) in the "neighborhood" exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points. Such a method can be used to filter out
noise (outliers) and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method which grows clusters according to a density
threshold. OPTICS is a density-based method which computes an augmented
clustering ordering for automatic and interactive cluster analysis.



4. Grid-based methods.

A grid-based method quantizes the object space into a finite number of cells which form
a grid structure. It then performs all of the clustering operations on the grid structure. The
main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells
in each dimension of the quantized space.
STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two
clustering algorithms which are both grid-based and density-based.

5. Model-based methods.

A model-based method hypothesizes a model for each of the clusters and finds the best
fit of the data to that model. A model-based algorithm may locate clusters by
constructing a density function that reflects the spatial distribution of data points. It can also
lead to a way of automatically determining the number of clusters based on standard
statistics, taking "noise" or outliers into account and thus yielding robust clustering
methods.
Data clustering is a method in which we make clusters of objects that are somehow
similar in characteristics. The criterion for checking the similarity is implementation
dependent.

Clustering is often confused with classification, but there is some difference between the
two. In classification the objects are assigned to predefined classes, whereas in
clustering the classes are also to be defined.
Precisely, data clustering is a technique in which the information that is logically similar
is physically stored together. In order to increase the efficiency of database systems,
the number of disk accesses has to be minimized. In clustering, the objects of similar
properties are placed in one class of objects, and a single access to the disk makes the
entire class available.
5. DEFINITIONS (frequently used terms in clustering)
In this section some frequently used terms are defined.
5.1 Cluster
A cluster is an ordered list of objects which have some common characteristics. The
objects belong to an interval [a, b], in our case [0, 1].
5.2 Distance between Two Clusters
The distance between two clusters involves some or all elements of the two clusters.
The clustering method determines how the distance should be computed.
5.3 Similarity
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between
the documents. Typical similarity measures generate values of 0 for documents exhibiting no
agreement among the assigned indexed terms, and 1 when perfect agreement is
detected. Intermediate values are obtained for cases of partial agreement.




5.4 Average Similarity
If the similarity measure is computed for all pairs of documents (Di, Dj) except when
i = j, an average value, Average Similarity, is obtainable. Specifically,

    Average Similarity = CONSTANT · Σ SIMILAR(Di, Dj), where i = 1, 2, ..., n; j = 1, 2, ..., n; and i ≠ j.
5.5 Threshold
The lowest possible input value of similarity required to join two objects in one cluster.

5.6 Similarity Matrix
Similarity between objects calculated by the function SIMILAR(Di, Dj), represented in
the form of a matrix, is called a similarity matrix.
5.7 Dissimilarity Coefficient
The dissimilarity coefficient of two clusters is defined to be the distance between them.
The smaller the value of dissimilarity coefficient, the more similar two clusters are.
5.8 Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. every
incoming object's similarity is compared with the initiator. The initiator is called the
cluster seed.
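The sketch below ties several of these terms together: a similarity function SIMILAR, a similarity threshold, and cluster seeds. The Jaccard-style similarity over index-term sets is our own choice for illustration:

import java.util.*;

// Sketch: threshold-based clustering using the terms defined above.
// Each incoming document is compared only against the cluster seeds.
public class ThresholdClustering {
    // SIMILAR(Di, Dj) in [0, 1]: fraction of shared index terms (Jaccard)
    static double similar(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("flu", "fever"), Set.of("flu", "cough"), Set.of("tumor", "scan"));
        double threshold = 0.3;                  // minimum similarity to join
        List<List<Integer>> clusters = new ArrayList<>();
        List<Set<String>> seeds = new ArrayList<>();

        for (int i = 0; i < docs.size(); i++) {
            boolean placed = false;
            for (int c = 0; c < clusters.size() && !placed; c++)
                if (similar(docs.get(i), seeds.get(c)) >= threshold) {
                    clusters.get(c).add(i);
                    placed = true;
                }
            if (!placed) {                       // document i seeds a new cluster
                clusters.add(new ArrayList<>(List.of(i)));
                seeds.add(docs.get(i));
            }
        }
        System.out.println(clusters);            // prints [[0, 1], [2]]
    }
}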





6. Real world Applications of Clustering
Data clustering has an immense number of applications in every field of life. One has to
cluster a lot of things on the basis of similarity, either consciously or unconsciously. So
the history of data clustering is as old as the history of mankind.
In the computer field also, the use of data clustering has its own value. Especially in the
field of information retrieval, data clustering plays an important role. Some of the
applications are listed below.
6.1 Similarity searching in Medical Image Database
This is a major application of the clustering technique. In order to detect many diseases,
like tumors, the scanned pictures or the x-rays are compared with the existing ones
and the dissimilarities are recognized.
We have clusters of images of different parts of the body. For example, the images of
the CT scan of the brain are kept in one cluster. To further arrange things, the images in
which the right side of the brain is damaged are kept in one cluster. Hierarchical
clustering is used. The stored images have already been analyzed and a record is
associated with each image. In this form a large database of images is maintained using
hierarchical clustering.
Now when a new query image comes, it is first recognized which particular cluster
this image belongs to, and then, by similarity matching with a healthy image of that specific
cluster, the main damaged portion or the diseased portion is recognized. Then the image
is sent to that specific cluster and matched with all the images in that particular cluster.
Now the image with which the query image has the most similarities is retrieved, and
the record associated with that image is also associated with the query image. This means
that the disease of the query image has now been detected.

Using this technique and some really precise methods for pattern matching,
diseases like very fine tumors can also be detected.
So, by using clustering, the enormous amount of time spent finding the exact match in the
database is reduced.
6.2 Data Mining
Another important application of clustering is in the field of data mining. Data mining is
defined as follows.
Definition 1: "Data mining is the process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data, using pattern recognition
technologies as well as statistical and mathematical techniques."
Definition 2: Data mining is a "knowledge discovery process of extracting previously
unknown, actionable information from very large databases."
Use of Clustering in Data Mining: Clustering is often one of the first steps in data
mining analysis. It identifies groups of related records that can be used as a starting
point for exploring further relationships. This technique supports the development of
population segmentation models, such as demographic-based customer segmentation.
Additional analyses using standard analytical and other data mining techniques can
determine the characteristics of these segments with respect to some desired outcome.
For example, the buying habits of multiple population segments might be compared to
determine which segments to target for a new sales campaign.
For example, a company that sells a variety of products may need to know about the
sales of all of its products in order to check which products are selling extensively
and which are lagging. This is done by data mining techniques. But if the system clusters
the products that are giving fewer sales, then only the cluster of such products would
have to be checked, rather than comparing the sales values of all the products. This
actually facilitates the mining process.

6.3 Windows NT
Another major application of clustering is in the new version of Windows NT. Windows
NT uses clustering to determine the nodes that are using the same kind of resources and
accumulate them into one cluster. This new cluster can then be controlled as one node.
6.4 Other applications
Social network analysis
In the study of social networks, clustering may be used to recognize communities
within large groups of people.

Software evolution
Clustering is useful in software evolution as it helps to reduce legacy properties
in code by reforming functionality that has become dispersed. It is a form of
restructuring and hence a way of performing direct preventative maintenance.

Image segmentation
Clustering can be used to divide a digital image into distinct regions for border
detection or object recognition.

Data mining
Many data mining applications involve partitioning data items into related
subsets; the marketing applications discussed above represent some examples.
Another common application is the division of documents, such as World Wide
Web pages, into genres.

Search result grouping
In the process of intelligent grouping of files and websites, clustering may be
used to create a more relevant set of search results compared to normal search
engines like Google. There are currently a number of web-based clustering tools
such as Clusty.


Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number
of markers on a map. This both makes the map faster and reduces the amount of
visual clutter.

IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for
conversion into deliverable fields in MLC-based radiation therapy.

Grouping of Shopping Items
Clustering can be used to group all the shopping items available on the web into
a set of unique products. For example, all the items on eBay can be grouped into
unique products.

Recommender systems
Recommender systems are designed to recommend new items based on a
user's tastes. They sometimes use clustering algorithms to predict a user's
preferences based on the preferences of other users in the user's cluster.

Mathematical chemistry
To find structural similarity, etc., for example, 3000 chemical compounds were
clustered in the space of 90 topological indices.

Climatology
To find weather regimes or preferred sea level pressure atmospheric patterns.

Petroleum Geology
Cluster Analysis is used to reconstruct missing bottom hole core data or missing
log curves in order to evaluate reservoir properties.


Physical Geography
The clustering of chemical properties in different sample locations.

Crime Analysis
Cluster analysis can be used to identify areas where there are greater incidences
of particular types of crime. By identifying these distinct areas or "hot spots"
where a similar crime has happened over a period of time, it is possible to
manage law enforcement resources more effectively.



7. Existing Theorem used in this project

Typical clustering problems require that a specific number of clusters be found in
solutions. However, the k-anonymity problem does not have a constraint on the
number of clusters; instead, it requires that each cluster contains at least k records.
Thus, we pose the k-anonymity problem as a clustering problem, referred to as the
k-member clustering problem.

Definition 1: (k-member clustering problem) The k-member clustering
problem is to find a set of clusters from a given set of n records such that each
cluster contains at least k (k ≤ n) data points and that the sum of all intra-cluster
distances is minimized. Formally, let S be a set of n records and k the specified
anonymization parameter. Then the optimal solution of the k-member clustering
problem is a set of clusters E = {e_1, ..., e_m} such that:

    for all i ≠ j ∈ {1, ..., m}, e_i ∩ e_j = ∅,

    ∪_{i=1,...,m} e_i = S,

    for all e_i ∈ E, |e_i| ≥ k, and

    Σ_{l=1,...,m} |e_l| · MAX_{i,j=1,...,|e_l|} Δ(p_(l,i), p_(l,j)) is minimal.

Here |e| is the size of cluster e, p_(l,i) represents the i-th data point in cluster e_l, and
Δ(x, y) is the distance between two data points x and y.

Note that in Definition 1, we consider the sum of all intra-cluster distances, where the
intra-cluster distance of a cluster is defined as the maximum distance between any two
data points in the cluster (i.e., the diameter of the cluster). As we describe in the
following section, this sum captures the total information loss, which is the amount of
data distortion that generalizations introduce to the entire table.

7.1 Distance and Cost Function

At the heart of every clustering problem are the distance functions that measure the
dissimilarities among data points and the cost function which the clustering problem
tries to minimize. The distance functions are usually determined by the type of data (i.e.,
numeric or categorical) being clustered, while the cost function is defined by the specific
objective of the clustering problem. In this section, we describe our distance and cost
functions, which have been specifically tailored for the k-anonymization problem.

As previously discussed, a distance function in a clustering problem measures how
dissimilar two data points are. As the data we consider in the k-anonymity problem are
person-specific records that typically consist of both numeric and categorical
attributes, we need a distance function that can handle both types of data at the
same time.

For a numeric attribute, the difference between two values (e.g., |x − y|) naturally
describes the dissimilarity (i.e., distance) of the values. This measure is also suitable for
the k-anonymization problem. To see this, recall that when records in the same
equivalence class are generalized, the generalized quasi-identifier must subsume all the
attribute values in the equivalence class. That is, the generalization of two values x
and y in a numeric attribute is typically represented as a range [x, y], provided that x <
y. Thus, the difference captures the amount of distortion caused by the generalization
process to the respective attribute (i.e., the length of the range).


Definition 2. (Distance between two numeric values) Let D be a finite numeric
domain. Then the normalized distance between two values v_1, v_2 ∈ D is defined as:

    δ_N(v_1, v_2) = |v_1 − v_2| / |D|

where |D| is the domain size measured by the difference between the maximum and minimum
values in D.

For categorical attributes, however, the difference is no longer applicable, as most
categorical domains cannot be enumerated in any specific order. The most
straightforward solution is to assume that every value in such a domain is equally
different from every other; e.g., the distance between two values is 0 if they are the same,
and 1 if different. However, some domains may have some semantic relationships among
the values. In such domains, it is desirable to define the distance functions based on
the existing relationships. Such relationships can be easily captured in a taxonomy
tree. We assume that a taxonomy tree of a domain is a balanced tree whose
leaf nodes represent all the distinct values in the domain. For example, Fig. 3
illustrates a natural taxonomy tree for the Country attribute. However, for some
attributes such as Occupation, there may not exist any semantic relationship which can
help in classifying the domain values. For such domains, all the values are classified
under a common value, as in Fig. 4. We now define the distance function for categorical
values as follows:



Definition 3. (Distance between two categorical values) Let D be a categorical
domain and T_D be a taxonomy tree defined for D. The normalized distance
between two values v_1, v_2 ∈ D is defined as [3, 5]:

    δ_C(v_1, v_2) = H(Λ(v_1, v_2)) / H(T_D)

where Λ(x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T)
represents the height of tree T.

A taxonomy tree can be considered similar to the generalization hierarchies discussed
earlier. However, we treat the taxonomy tree not as a restriction, but as a user's preference.


Country
 ├─ America
 │   ├─ North: USA, Canada
 │   └─ South: Brazil, Mexico
 └─ Asia
     ├─ West: Iran, Egypt
     └─ East: India, Pakistan

Fig. 3 Taxonomy Tree of Country

Occupation
 ├─ Armed-Forces
 ├─ Teacher
 ├─ Doctor
 ├─ Salesman
 └─ Tech-Support

Fig. 4 Taxonomy Tree of Occupation



Example 1. Consider the attribute Country and its taxonomy tree in Fig. 3. The
distance between India and USA is 3/3 = 1, while the distance between India and
Iran is 2/3 ≈ 0.66. On the other hand, for the attribute Occupation and its taxonomy
tree in Fig. 4, which goes up only one level, the distance between any two values is
always 1. Combining the distance functions for both numeric and categorical domains,
we define the distance between two records as follows:

Definition 4. (Distance between two records) Let Q_T = {N_1, ..., N_m, C_1, ..., C_n}
be the quasi-identifier of table T, where N_i (i = 1, ..., m) is an
attribute with a numeric domain and C_j (j = 1, ..., n) is an attribute with a
categorical domain. The distance of two records r_1, r_2 ∈ T is defined as:

    Δ(r_1, r_2) = Σ_{i=1,...,m} δ_N(r_1[N_i], r_2[N_i]) + Σ_{j=1,...,n} δ_C(r_1[C_j], r_2[C_j])

where r_i[A] represents the value of attribute A in r_i, and δ_N and δ_C are the distance
functions defined in Definitions 2 and 3, respectively.
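The following sketch combines Definitions 2-4 for a quasi-identifier {Age, Country}. The tree encoding is our own simplification: each leaf of the taxonomy in Fig. 3 stores its path from the root, so the height of the subtree under the lowest common ancestor is the tree height minus the length of the common path prefix:

import java.util.*;

// Sketch of Definitions 3 and 4: categorical distance via the Country
// taxonomy of Fig. 3, plus the per-record sum over all quasi-identifier
// attributes. Leaves are encoded as root-to-leaf paths.
public class RecordDistance {
    static final int TREE_HEIGHT = 3;            // height of the Country tree

    static final Map<String, String[]> PATH = Map.of(
        "USA",   new String[]{"America", "North"},
        "India", new String[]{"Asia", "East"},
        "Iran",  new String[]{"Asia", "West"});

    // Definition 3: H(subtree under the LCA) / H(tree)
    static double categorical(String v1, String v2) {
        if (v1.equals(v2)) return 0;             // LCA is the leaf itself
        String[] p1 = PATH.get(v1), p2 = PATH.get(v2);
        int common = 0;
        while (common < p1.length && p1[common].equals(p2[common])) common++;
        return (double) (TREE_HEIGHT - common) / TREE_HEIGHT;
    }

    // Definition 2
    static double numeric(double v1, double v2, double domainSize) {
        return Math.abs(v1 - v2) / domainSize;
    }

    public static void main(String[] args) {
        // Definition 4: sum the per-attribute distances of two records
        // with quasi-identifier {Age (domain size 100), Country}.
        double d = numeric(25, 34, 100) + categorical("India", "Iran");
        System.out.println(d);                   // 0.09 + 2/3, as in Example 1
    }
}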

Now we discuss the cost function which the k-member clustering problem tries to
minimize. As the ultimate goal of our clustering problem is the k-anonymization of
data, we formulate the cost function to represent the amount of distortion (i.e.,
information loss) caused by the generalization process. Recall that records in each
cluster are generalized to share the same quasi-identifier value that represents every
original quasi-identifier value in the cluster. We assume that numeric values are
generalized into a range [min, max] and categorical values into a set that unions all
distinct values in the cluster. With these assumptions, we define a metric, referred to
as the Information Loss metric (IL), that measures the amount of distortion introduced by
the generalization process to a cluster.


Definition 5. (Information loss) Let e = {r_1, ..., r_k} be a cluster (i.e., equivalence
class) where the quasi-identifier consists of numeric attributes N_1, ..., N_m and
categorical attributes C_1, ..., C_n. Let T_{C_i} be the taxonomy tree defined for the
domain of categorical attribute C_i. Let MIN_{N_i} and MAX_{N_i} be the min and max
values in e with respect to attribute N_i, and let ∪_{C_i} be the union set of values in e
with respect to attribute C_i. Then the amount of information loss incurred by
generalizing e, denoted by IL(e), is defined as:

    IL(e) = |e| · ( Σ_{i=1,...,m} (MAX_{N_i} − MIN_{N_i}) / |N_i| + Σ_{j=1,...,n} H(Λ(∪_{C_j})) / H(T_{C_j}) )

where |e| is the number of records in e, |N_i| represents the size of the numeric domain N_i,
Λ(∪_{C_j}) is the subtree rooted at the lowest common ancestor of every value in ∪_{C_j},
and H(T) is the height of taxonomy tree T. [3, 4, 6]
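A sketch of IL(e) for a cluster over the same simplified quasi-identifier {Age, Country} (the domain size and path encoding from the previous sketch are our own assumptions):

import java.util.*;

// Sketch of Definition 5: information loss of a single cluster with one
// numeric attribute (Age) and one categorical attribute (Country).
public class InformationLoss {
    static final int TREE_HEIGHT = 3;
    static final Map<String, String[]> PATH = Map.of(
        "USA",   new String[]{"America", "North"},
        "India", new String[]{"Asia", "East"},
        "Iran",  new String[]{"Asia", "West"});

    static double il(int[] ages, String[] countries, double ageDomainSize) {
        // numeric part: (MAX - MIN) / |N|
        int min = ages[0], max = ages[0];
        for (int a : ages) { min = Math.min(min, a); max = Math.max(max, a); }
        double numericLoss = (max - min) / ageDomainSize;

        // categorical part: H(subtree under the LCA of all values) / H(tree)
        double categoricalLoss = 0;
        if (Arrays.stream(countries).distinct().count() > 1) {
            int common = PATH.get(countries[0]).length;
            for (String c : countries) {
                String[] p0 = PATH.get(countries[0]), p = PATH.get(c);
                int i = 0;
                while (i < common && p0[i].equals(p[i])) i++;
                common = Math.min(common, i);
            }
            categoricalLoss = (double) (TREE_HEIGHT - common) / TREE_HEIGHT;
        }
        return ages.length * (numericLoss + categoricalLoss);   // |e| * (...)
    }

    public static void main(String[] args) {
        // a 2-record cluster: ages {25, 34}, countries {India, Iran}
        System.out.println(il(new int[]{25, 34},
                new String[]{"India", "Iran"}, 100));  // 2 * (0.09 + 2/3)
    }
}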

Using the definition above, the total information loss of the anonymized table is defined
as follows:

Definition 6. (Total information loss) Let E be the set of all equivalence classes in
the anonymized table AT. Then the amount of total information loss of AT is defined
as:

    Total-IL(AT) = Σ_{e ∈ E} IL(e)
Recall that the cost function of the k-member problem is the sum of all intra-cluster
distances, where the intra-cluster distance of a cluster is defined as the maximum
distance between any two data points in the cluster. Now, if we consider how records in
each cluster are generalized, minimizing the total information loss of the anonymized
table intuitively minimizes the cost function for the k-member clustering problem as
well. Therefore, the cost function that we want to minimize in the clustering process is
Total-IL.

Theorem. The k-member clustering decision problem is NP-complete.

Proof. That the k-member clustering decision problem is in NP follows from the
observation that if such a clustering scheme is given, verifying that it satisfies the two
conditions in Definition 7 can be done in polynomial time.
It has been proved that optimal k-anonymity by suppression is NP-hard, using a reduction
from the Edge Partition into Triangles problem. In the reduction, the table to
be k-anonymized consists of n records; each record has m attributes, and each
attribute takes a value from {0, 1, 2}. The k-anonymization technique used is to
suppress some cells in the table. It was shown that determining whether there exists a
3-anonymization of a table by suppressing a certain number of cells is NP-hard.
We observe that the problem in [1] is a special case of the k-member clustering
problem where each attribute is categorical and has a flat taxonomy tree. It thus
follows that the k-member clustering problem is also NP-hard. When each attribute has
a flat taxonomy tree, the only way to generalize a cell is to the root of the flat taxonomy
tree, and this is equivalent to suppressing the cell. Given such a database, the
information loss of each record in any generalization is the same as the number of cells
in the record that differ from any other record in the equivalence class, which equals the
number of cells to be suppressed. Therefore, there exists a k-anonymization with total
information loss no more than t if and only if there exists a k-anonymization that
suppresses at most t cells.

Faced with the hardness of the problem, we propose a simple and efficient algorithm
that finds a solution in a greedy manner. The idea is as follows. Given a set of n
records, we first randomly pick a record r_i and make it a cluster e_1. Then we
choose a record r_j that makes IL(e_1 ∪ {r_j}) minimal. We repeat this until |e_1| = k.
When |e_1| reaches k, we choose a record that is furthest from r_i and repeat the
clustering process until there are fewer than k records left. We then iterate over these
leftover records and insert each record into a cluster with respect to which the
increment of the information loss is minimal. We provide the core of our greedy
k-member clustering algorithm, leaving out some trivial functions, in Figure 5.


8. Proposed Anonymization Algorithm

Armed with the distance and cost functions, we are now ready to discuss the
k-member clustering algorithm. As in most clustering problems, an exhaustive search for
an optimal solution of the k-member clustering is potentially exponential. In order to
precisely characterize the computational complexity of the problem, we define the
k-member clustering problem as a decision problem as follows.

Definition 7. (k-member clustering decision problem) Given n records, is
there a clustering scheme E = {e_1, ..., e_m} such that:

1. |e_i| ≥ k for every cluster e_i, where 0 < k ≤ n: the size of each cluster is
greater than or equal to a positive integer k, and

2. Σ_{i=1,...,m} IL(e_i) < c, where c > 0: the Total-IL of the clustering scheme is
less than a positive constant c.

Theorem: Let n be the total number of input records and k be the specified
anonymity parameter. Every cluster that the greedy k-member clustering
algorithm finds has at least k records, but no more than 2k − 1 records.

Proof: Let S be the set of input records. As the algorithm finds a cluster with exactly k
records as long as the number of remaining records is equal to or greater than k,
every cluster contains at least k records. If there remain fewer than k records, these
leftover records are distributed to the clusters that are already found. That is, in the
worst case, k − 1 remaining records are added to a single cluster which already contains
k records. Therefore, the maximum size of a cluster is 2k − 1.

Figure 5: Greedy k-member clustering algorithm

Function greedy_k_member_clustering (S, k)
Input: a set of records S and a threshold value k.
Output: a set of clusters each of which contains at least k records.
1. If ( |S| <= k)
2. Return {S};
3. End if;
4. Result = {}; r = a randomly picked record from S;
5. While ( |S| >= k)
6. r = the furthest record from r;
7. S = S - {r};
8. C = {r};
9. While ( |C| < k)
10. r = find_best_record(S, C);
11. S = S - {r};
12. C = C U {r};
13. End while;
14. Result = Result U {C};
15. End while;
16. While ( |S| != 0)
17. r = a randomly picked record from S;
18. S = S - {r};
19. C = find_best_cluster(Result, r);
20. C = C U {r};
21. End while;
22. Return Result;
End;

Function find_best_record (S, c)
Input: a set of records S and a cluster c.
Output: a record r ∈ S such that IL(c U {r}) is minimal.
1. n = |S|; min = ∞; best = null;
2. For (i = 1..n)
3. r = i-th record in S;
4. diff = IL(c U {r}) - IL(c);
5. If (diff < min)
6. min = diff;
7. best = r;
8. End if;
9. End for;
10. Return best;
End;

Function find_best_cluster (C, r)
Input: a set of clusters C and a record r.
Output: a cluster c ∈ C such that IL(c U {r}) is minimal.
1. n = |C|; min = ∞; best = null;
2. For (i = 1..n)
3. c = i-th cluster in C;
4. diff = IL(c U {r}) - IL(c);
5. If (diff < min)
6. min = diff;
7. best = c;
8. End if;
9. End for;
10. Return best;
End;
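For completeness, here is a condensed Java sketch of the algorithm in Figure 5 for purely numeric records (our own simplification: unit domain sizes, records as double arrays, and the random pick replaced by the first record):

import java.util.*;

// Condensed sketch of the greedy k-member clustering of Figure 5 for
// purely numeric records; il() follows Definition 5 with unit domain sizes.
public class GreedyKMember {
    static double il(List<double[]> c) {               // Definition 5
        if (c.isEmpty()) return 0;
        double spread = 0;
        for (int d = 0; d < c.get(0).length; d++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] r : c) { min = Math.min(min, r[d]); max = Math.max(max, r[d]); }
            spread += max - min;
        }
        return c.size() * spread;
    }

    static double dist(double[] a, double[] b) {       // simplified Definition 4
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    static List<List<double[]>> cluster(List<double[]> records, int k) {
        List<double[]> S = new ArrayList<>(records);
        List<List<double[]>> result = new ArrayList<>();
        if (S.size() <= k) { result.add(S); return result; }
        double[] r = S.get(0);                          // stands in for a random pick
        while (S.size() >= k) {
            final double[] prev = r;                    // furthest record from r
            r = Collections.max(S, Comparator.comparingDouble(x -> dist(prev, x)));
            List<double[]> c = new ArrayList<>();
            S.remove(r); c.add(r);
            while (c.size() < k) {                      // find_best_record
                double ilC = il(c), min = Double.MAX_VALUE; double[] pick = null;
                for (double[] x : S) {
                    c.add(x); double diff = il(c) - ilC; c.remove(c.size() - 1);
                    if (diff < min) { min = diff; pick = x; }
                }
                S.remove(pick); c.add(pick);
            }
            result.add(c);
        }
        for (double[] x : S) {                          // leftovers: find_best_cluster
            List<double[]> pick = null; double min = Double.MAX_VALUE;
            for (List<double[]> c : result) {
                double ilC = il(c);
                c.add(x); double diff = il(c) - ilC; c.remove(c.size() - 1);
                if (diff < min) { min = diff; pick = c; }
            }
            pick.add(x);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> recs = new ArrayList<>(Arrays.asList(
            new double[]{22}, new double[]{23}, new double[]{24},
            new double[]{33}, new double[]{36}, new double[]{38}, new double[]{30}));
        for (List<double[]> c : cluster(recs, 3)) {
            StringBuilder sb = new StringBuilder("{ ");
            for (double[] rec : c) sb.append(rec[0]).append(" ");
            System.out.println(sb.append("}"));
        }
    }
}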


Theorem: Let n be the total number of input records and k be the specified
anonymity parameter. The time complexity of the greedy k-member
clustering algorithm is in O(n²).

Proof. Observe that the algorithm spends most of its time selecting records from the
input set S one at a time until fewer than k records remain (Line 9). As the size of the
input set decreases by one at every iteration, the total execution time T is estimated as:

    T = (n − 1) + (n − 2) + (n − 3) + ··· + k ≤ n(n − 1)/2

Therefore, T is in O(n²).
































9. Java Code for implementing proposed algorithm

//CODE FOR GUI
import mypack.Cluster;
import mypack.DataBase;
import java.awt.*;
import javax.swing.*;
import java.awt.event.*;
import java.sql.*;
import java.net.*;
import java.util.StringTokenizer;

public class Project implements ActionListener, ItemListener, KeyListener

{ //for generate
JLabel generate,g1,kam,pmi;
Choice choice;
//for find
JLabel f_l1,f_record,f_display;
JTextField f_t1,temp;
JButton search;
JLabel l1,l2,l3,l4,l5,l6,l7,l8,l9,l10,msginsert,msgdelete,msgupdate;
JTextField t1,t2,t3,t4,t5,t6;
JTextField t7,t8,t9,t10;
JLabel common;
JFrame f;
JPanel p;
List list1,list2;
Connection con;
Statement st;
ResultSet rs;
JLabel banner1,banner2;
JButton insert,update,refresh,delete,find;
JButton b1,b2,b3;
JLabel l_total_record,l_total_cluster;

public Project(String name)
{
f = new JFrame(name);
p = new JPanel();
temp = new JTextField(20);//for key event
list1 = new List(20);
list2 = new List(20);
choice = new Choice();
b1 = new JButton("New Entry");
b2 = new JButton("Update");
b3 = new JButton("Delete");
insert = new JButton("nsert Confirm");
update = new JButton("Update Confirm");
refresh = new JButton("refresh");
delete = new JButton("Delete Confirm");
find = new JButton("Find");
search = new JButton("search");
l_total_record = new JLabel("Total Number of Records ");
l_total_cluster = new JLabel("Total Number of Cluster ");



kam=new JLabel("k-Anonymity Model");
pmi=new JLabel("Protecting Medical nformation");
banner1 = new JLabel("Original Patient Record");
banner2 = new JLabel("Annomized Patient Record");
banner1.setFont(new Font("Sanserrif",Font.BOLD,20));
banner2.setFont(new Font("Sanserrif",Font.BOLD,20));
kam.setFont(new Font("Times New Roman",Font.BOLD,20));
pmi.setFont(new Font("Tahoma",Font.TALC,18));
pmi.setForeground(Color.blue);


msginsert = new JLabel("Press nsert Confirm Button for saving");
msgupdate = new JLabel("Press Update Confirm Button for saving");
msgdelete = new JLabel("Press Delete Confirm Button for saving");
msginsert.setFont(new Font("Sanserrif",Font.BOLD,12));
msgupdate.setFont(new Font("Sanserrif",Font.BOLD,12));
msgdelete.setFont(new Font("Sanserrif",Font.BOLD,12));
generate = new JLabel("Generate");
generate.setFont(new Font("Sanserrif",Font.BOLD,12));
g1 = new JLabel("Annomized Table");
g1.setFont(new Font("Sanserrif",Font.BOLD,12));
choice.add("3");
choice.add("4");
choice.add("5");
choice.add("6");
choice.add("7");
choice.add("8");
choice.add("9");
choice.add("10");
msginsert.setForeground(Color.red);
msgupdate.setForeground(Color.red);
msgdelete.setForeground(Color.red);
l_total_record.setForeground(Color.blue);
l_total_cluster.setForeground(Color.blue);
l_total_record.setFont(new Font("Times New Roman",Font.BOLD,14));
l_total_cluster.setFont(new Font("Times New Roman",Font.BOLD,14));
common = new JLabel("");
common.setFont(new Font("Sanserrif",Font.BOLD,12));
common.setForeground(Color.red);
f_l1=new JLabel("Enter PD");
f_record=new JLabel("Press search button for display record");
f_t1 = new JTextField(10);
f_l1.setFont(new Font("Sanserrif",Font.BOLD,12));
f_record.setFont(new Font("Times New Roman",Font.BOLD,14));
f_l1.setForeground(Color.blue);
f_record.setForeground(Color.blue);
f_display=new JLabel("");
f_display.setForeground(Color.red);


l1 = new JLabel("PD");
l2 = new JLabel("NAME");
l3 = new JLabel("PH NO");
l4 = new JLabel("CTY");


l5 = new JLabel("COMPANY");
l6 = new JLabel("ZPCODE");
l7 = new JLabel("GENDER");
l8 = new JLabel("AGE");
l9 = new JLabel("DSEASE");
l10 = new JLabel("EXPENCES");

t1 = new JTextField(10);
t2 = new JTextField(10);
t3 = new JTextField(10);
t4 = new JTextField(10);
t5 = new JTextField(10);
t6 = new JTextField(10);
t7 = new JTextField(10);
t8 = new JTextField(10);
t9 = new JTextField(10);
t10 = new JTextField(10);
t1.addKeyListener(this);

try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con=DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st=con.createStatement();
String sql ="select * from patient_record order by pid";
rs = st.executeQuery(sql);
list1.add("PD ZPCODE GENDER AGE DSEASE
EXPENCES");
list1.add("-----------------------------------------------------------------------------------------------------------------
-------------");

while(rs.next())

{
list1.add(rs.getString(1)+"    "+rs.getString(2)+"    "+rs.getString(3)+"    "+rs.getString(4)+"    "+rs.getString(5)+"    "+rs.getString(6));
list1.add("-----------------------------------------------------------------------------------------------------------------");
}

}catch(Exception e){System.out.print("\n"+e.getMessage());}

p.setLayout(null);

p.add(kam);
p.add(pmi);
p.add(list1);
p.add(list2);
p.add(banner1);
p.add(banner2);
p.add(insert);
p.add(update);
p.add(delete);
p.add(refresh);
p.add(find);
p.add(search);

p.add(f_l1);
p.add(f_t1);
p.add(f_display);
p.add(f_record);
p.add(l1);
p.add(l2);
p.add(l3);
p.add(l4);
p.add(l5);
p.add(l6);
p.add(l7);
p.add(l8);
p.add(l9);
p.add(l10);
p.add(t1);
p.add(t2);
p.add(t3);
p.add(t4);
p.add(t5);
p.add(t6);
p.add(t7);
p.add(t8);
p.add(t9);
p.add(t10);
p.add(b1);
p.add(b2);
p.add(b3);
p.add(generate);
p.add(g1);
p.add(choice);
p.add(msgupdate);
p.add(msgdelete);
p.add(msginsert);
p.add(common);
p.add(l_total_record);
p.add(l_total_cluster);
f.add(p);
setEntryAdd(false);


choice.addItemListener(this);
insert.addActionListener(this);
delete.addActionListener(this);
update.addActionListener(this);
refresh.addActionListener(this);
find.addActionListener(this);
search.addActionListener(this);
b1.addActionListener(this);
b2.addActionListener(this);
b3.addActionListener(this);
p.setBackground(Color.pink);

kam.setBounds(530,1,250,40);
pmi.setBounds(490,31,250,40);
b1.setBounds(30,90,100,25);
b2.setBounds(135,90,100,25);

b3.setBounds(240,90,100,25);
l1.setBounds(90,130,100,25);
t1.setBounds(210,130,120,25);
l2.setBounds(90,160,100,25);
t2.setBounds(210,160,120,25);
l3.setBounds(90,190,100,25);
t3.setBounds(210,190,120,25);
l4.setBounds(90,220,100,25);
t4.setBounds(210,220,120,25);
l5.setBounds(90,250,100,25);
t5.setBounds(210,250,120,25);
l6.setBounds(90,280,110,25);
t6.setBounds(210,280,120,25);
l7.setBounds(90,310,100,25);
t7.setBounds(210,310,120,25);
l8.setBounds(90,340,100,25);
t8.setBounds(210,340,120,25);
l9.setBounds(90,370,130,25);
t9.setBounds(210,370,120,25);
l10.setBounds(90,400,130,25);
t10.setBounds(210,400,120,25);
common.setBounds(88,430,300,25);
msginsert.setBounds(88,430,300,25);
msginsert.setVisible(false);
insert.setBounds(120,450,150,25);
insert.setVisible(false);
msgupdate.setBounds(88,430,300,25);
msgupdate.setVisible(false);
update.setBounds(120,450,150,25);
update.setVisible(false);
msgdelete.setBounds(88,430,300,25);
msgdelete.setVisible(false);
delete.setBounds(120,450,150,25);
delete.setVisible(false);
find.setBounds(150,490,80,20);
f_l1.setBounds(70,515,100,25);
f_t1.setBounds(150,515,120,20);
f_record.setBounds(50,535,240,25);
f_display.setBounds(50,590,400,30);
search.setBounds(150,560,80,20);
banner2.setBounds(700,80,300,40);
generate.setBounds(720,120,180,25);
choice.setBounds(900,120,70,25);
list2.setBounds(500,140,710,220);
l_total_record.setBounds(600,367,200,20);
l_total_record.setVisible(false);
l_total_cluster.setBounds(930,367,200,20);
l_total_cluster.setVisible(false);
banner1.setBounds(720,410,300,40);
refresh.setBounds(790,450,90,20);
list1.setBounds(500,480,710,220);
addFind(false);
setEntryAdd(false);


f.setSize(1280,960);

f.setVisible(true);
f.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
}
void addFind(boolean t)
{
if(t==true)
{
f_l1.setVisible(true);
f_record.setVisible(true);
f_t1.setVisible(true);
search.setVisible(true);
f_t1.requestFocus();
}
else
{
f_l1.setVisible(false);
f_record.setVisible(false);
f_t1.setVisible(false);
search.setVisible(false);
}
}


void setEntryAdd(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");
t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(true);
t3.setEditable(true);
t4.setEditable(true);
t5.setEditable(true);
t6.setEditable(true);
t7.setEditable(true);
t8.setEditable(true);
t9.setEditable(true);
t10.setEditable(true);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);

t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}

void setEntryDelete(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");
t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}
void setEntryUpdate(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");

t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(true);
t3.setEditable(true);
t4.setEditable(true);
t5.setEditable(true);
t6.setEditable(true);
t7.setEditable(true);
t8.setEditable(true);
t9.setEditable(true);
t10.setEditable(true);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}
public void keyPressed(KeyEvent ke)
{


}
public void keyReleased(KeyEvent ke)
{

try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
String sql1="select * from patient_information where pid='"+t1.getText()+"'";
con = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st = con.createStatement();
ResultSet rs = st.executeQuery(sql1);

if(rs.next())
{
t2.setText(rs.getString(2));
t3.setText(rs.getString(3));
t4.setText(rs.getString(4));
t5.setText(rs.getString(5));
}
else

{
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
}

con.close();
st.close();
}catch(Exception EK1){System.out.print("\n"+EK1.getMessage());}


}
public void keyTyped(KeyEvent ke)
{
String t=t1.getText();

}

public void itemStateChanged(ItemEvent ie)
{
int n=Integer.parseInt(choice.getSelectedItem());

DataBase obj1 = new DataBase();
Cluster obj2 = new Cluster();
obj1.CreateTable();
int r = obj2.totalRow();
int c = obj2.totalRow()/n;

Choice nCluster[] = obj2.k_Member_Cluster(n);
list2.removeAll();

list2.add("PD ZPCODE GENDER AGE DSEASE EXPENCES");
for(int i=0; i<nCluster.length; i++)
{
list2.add("-----------------------------------------------------------------------------------------------------------------
");
for(int j=0; j<nCluster[i].gettemCount(); j++)
{
list2.add(nCluster[i].gettem(j));

}

}
list2.add("-----------------------------------------------------------------------------------------------------------------");

l_total_record.setText("Total Number of Records ");
String t1=l_total_record.getText();
t1+=" : "+String.valueOf(r);
l_total_record.setText(t1);
l_total_record.setVisible(true);
l_total_cluster.setText("Total Number of Cluster ");
String t2 = l_total_cluster.getText();
t2+=" : "+String.valueOf(c);
l_total_cluster.setText(t2);
l_total_cluster.setVisible(true);

obj1.DropTable();

}

public void actionPerformed(ActionEvent e)
{

Object obj=e.getSource();

if(obj==insert)
{
msginsert.setVisible(false);
common.setVisible(false);
Connection con1,con2;
Statement st1,st2;
try
{
Connection con;
Statement st;
int n=0;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

String sql1="select * from patient_information where pid='"+t1.getText()+"'";
con = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
ResultSet rs = st.executeQuery(sql1);
int flag=0;
while(rs.next())
{
flag=1;
}

con.close();
st.close();

if(flag==1)
{
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement();
String sql ="insert into patient_information values
('"+t1.getText()+"','"+t2.getText()+"','"+t3.getText()+"','"+t4.getText()+"','"+t5.getText()+"')";
int r = st1.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is inserted in patient_information table");
common.setText("nserted succesfully");
common.setVisible(true);
}
con1.commit();
con1.close();
st1.close();
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
sql ="insert into patient_record values ('"+t1.getText()+"','"+t6.getText()+"','"+t7.getText()+"','"+t8.getText()+"','"+t9.getText()+"','"+t10.getText()+"')";
r = st2.executeUpdate(sql);
if(r==1)
System.out.print("\n"+r+" Record is inserted in patient_record table");
con2.commit();
con2.close();
st2.close();

}
else
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
String sql ="insert into patient_record values
('"+t1.getText()+"','"+t6.getText()+"','"+t7.getText()+"','"+t8.getText()+"','"+t9.getText()+"','"+t10.getText()+"'
)";
int r = st2.executeUpdate(sql);
if(r==1)
System.out.print("\n"+r+" Record is inserted in patient_record table");
con2.commit();
con2.close();
st2.close();
}
}
catch(Exception e2){System.out.print("\n"+e2.getMessage());common.setText("Record can't be inserted");common.setVisible(true);}
setEntryAdd(false);

}
if(obj==delete)
{
Connection con1,con2;
common.setVisible(false);
msgdelete.setVisible(false);
Statement st1,st2;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
String sql2 ="delete from patient_record where pid='"+t1.getText()+"'";
int r2 = st2.executeUpdate(sql2);

con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");	//patient_information DSN, not patient_record
st1 = con1.createStatement();
String sql1 ="delete from patient_information where pid='"+t1.getText()+"'";
int r1 = st1.executeUpdate(sql1);

if(r1>=1 || r2>=1)
{
common.setText("One record deleted");
common.setVisible(true);
}

else
{
con1.rollback();
con2.rollback();
common.setText("Pid did not match");
common.setVisible(true);
}
con1.close();
con2.close();
st1.close();
st2.close();
}
catch(Exception e3){ System.out.print("\n"+e3.getMessage());common.setText("Record can't be deleted");common.setVisible(true);}
setEntryDelete(false);
}

if(obj==update)
{
msgupdate.setVisible(false);
common.setVisible(false);

Connection con1,con2;
Statement st1,st2;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement();


String sql="update patient_information set pid='"+t1.getText()+"',
name='"+t2.getText()+"',phno='"+t3.getText()+"',city='"+t4.getText()+"',company='"+t5.getText()+"' where
pid='"+t1.getText()+"'";

int r = st1.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is updated in patient_information table");
common.setText("one record has updated");
common.setVisible(true);
}
else
{
common.setText("Record not found");
common.setVisible(true);
}
con1.commit();
con1.close();
st1.close();
}
catch(Exception e4)
{ System.out.print("\n"+e4.getMessage()); common.setText("Update failed in first table");common.setVisible(true);}

try

{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
//String sql="insert into library values
('"+t1.getText()+"','"+t2.getText()+"','"+t3.getText()+"','"+t4.getText()+"')";
String sql="update patient_record set pid='"+t1.getText()+"', zipcode='"+t6.getText()+"',
gender='"+t7.getText()+"', age='"+t8.getText()+"',disease='"+t9.getText()+"',expences='"+t10.getText()+"'
where pid='"+t1.getText()+"'";

int r = st2.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is updated in patient_record table");
common.setText("One recored is updated");
common.setVisible(true);
}
con2.commit();
con2.close();
st2.close();
}
catch(Exception e5){System.out.print("\n"+e5.getMessage());common.setText("Update failed");common.setVisible(true);}
setEntryDelete(false);

}

if(obj==b1)
{
setEntryAdd(true);
msginsert.setVisible(true);
msgdelete.setVisible(false);
msgupdate.setVisible(false);
common.setVisible(false);
insert.setVisible(true);
delete.setVisible(false);
update.setVisible(false);
}
if(obj==b2)
{
setEntryUpdate(true);
msgupdate.setVisible(true);
msgdelete.setVisible(false);
msginsert.setVisible(false);
common.setVisible(false);
update.setVisible(true);
delete.setVisible(false);
insert.setVisible(false);
}
if(obj==b3)
{
setEntryDelete(true);
msgdelete.setVisible(true);
msgupdate.setVisible(false);
msginsert.setVisible(false);
delete.setVisible(true);

update.setVisible(false);
insert.setVisible(false);
common.setVisible(false);
}

if(obj==refresh)
{
setEntryAdd(false);
addFind(false);
msgdelete.setVisible(false);
msgupdate.setVisible(false);
msginsert.setVisible(false);
insert.setVisible(false);
delete.setVisible(false);
update.setVisible(false);
common.setVisible(false);
Connection con;
Statement st;
ResultSet rs;
list1.removeAll();
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con=DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st=con.createStatement();
String sql ="select * from patient_record order by pid";
rs = st.executeQuery(sql);
list1.add("PD ZPCODE GENDER AGE DSEASE
EXPENCES");
list1.add("-----------------------------------------------------------------------------------------------------------------
------------------------------");

while(rs.next())

{
list1.add(rs.getString(1)+"    "+rs.getString(2)+"    "+rs.getString(3)+"    "+rs.getString(4)+"    "+rs.getString(5)+"    "+rs.getString(6));
list1.add("-----------------------------------------------------------------------------------------------------------------");
}
}catch(Exception e2){System.out.print("\n"+e2.getMessage());}
}
if(obj==find)
{
f_t1.setText("");
f_t1.requestFocus();
f_record.setText("Press Search button to display record");
addFind(true);
}

if(obj==search)
{
String t;
Connection con1;
Statement st1;
try

{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql = "select * from patient_information where pid='"+f_t1.getText()+"'";
ResultSet r = st1.executeQuery(sql);


if(r.first())
{
t=r.getString(2)+" "+r.getString(3)+" "+r.getString(4)+" "+r.getString(5);
f_display.setText(t);
}
else
{
f_display.setText("record could not found");
}
con1.close();
st1.close();
}
catch(Exception e6){System.out.print("\n"+e6.getMessage());f_display.setText("Record did not match");}

}

}
public static void main(String args[])
{
new Project("k-Annomity Model(Protecting Privacy)");
}
}


End of GUI Code
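
A remark on the SQL handling above: the insert, update and delete handlers build their statements by concatenating the text-field contents directly into the SQL string, so a stray quote in an input breaks the query and opens the door to SQL injection. The following is a minimal sketch of the same insert written with a PreparedStatement; the DSN, credentials and column order are taken from the listing above, and the sketch is an illustration, not part of the project code:

//Hedged sketch: parameterized version of the string-built
//"insert into patient_record values (...)" used in actionPerformed above.
import java.sql.*;

public class SafeInsert
{
public static void insertPatientRecord(String pid,String zip,String gender,String age,String disease,String expenses) throws Exception
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
Connection con = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
//the ? placeholders are bound to values, so quotes in the input cannot break the SQL
PreparedStatement ps = con.prepareStatement("insert into patient_record values (?,?,?,?,?,?)");
ps.setString(1,pid);
ps.setString(2,zip);
ps.setString(3,gender);
ps.setString(4,age);
ps.setString(5,disease);
ps.setString(6,expenses);
ps.executeUpdate();
ps.close();
con.close();
}
}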














//CODE FOR Cluster Package
package mypack;
import java.util.StringTokenizer;
import java.sql.*;

import java.awt.*;

//Beginning of Cluster


public class Cluster
{


public Choice[] k_Member_Cluster(int k)
{
int r=totalRow()/k;

Choice nCluster1[] = new Choice[r];
Choice nCluster2[] = new Choice[r];
for(int i=0; i<r; i++)
{
nCluster1[i] = new Choice();
nCluster2[i] = new Choice();
}
//*-------------------------------
System.out.print("\n Total cluster :"+r);
try
{
int i=0;
String cluster_age[]=new String[r];

while(totalRow()>= k)
{
int mid = totalRow()/2;
if(mid>0)
{
String t = annomize_getRecord(mid);
nCluster1[i].add(t);
String tt = getRecord(mid);
nCluster2[i].add(tt);
cluster_age[i] = get_ResultSet_Age(mid);

deleteRow(mid);

while(nCluster2[i].getItemCount()<k)
{


int index = find_Best_Record(cluster_age[i]);

String t2 = getRecord(index);

nCluster2[i].add(t2);

String t1= annomize_getRecord(index);
nCluster1[i].add(t1);

deleteRow(index);
}




if(i<r)
{
System.out.print("\n "+i);
i++;

}
else
break;

}
}

for(int p=0; p<nCluster2.length; p++)
{
if(nCluster2[p].getItemCount()==0)
{
int index = totalRow()/2;
String tp = annomize_getRecord(index);
nCluster1[p].add(tp);
String tp1=getRecord(index);
nCluster2[p].add(tp1);	//fill the empty cluster p (not cluster 'index')
deleteRow(index);
}
}


int n=totalRow();
System.out.print("\n At Last :"+n);

for(int idx=1; idx<=n; idx++)
{

String rs_age=get_ResultSet_Age(1);
int c = find_Best_Cluster(nCluster2,cluster_age,rs_age);
String record1=annomize_getRecord(1);
nCluster1[c].add(record1);
String record2=getRecord(1);
nCluster2[c].add(record2);
deleteRow(1);

}

}//end try
catch(Exception e7)
{
System.out.print("\nError2: K_member_Cluster "+e7.getMessage());
}

//*-----------------------------------
return(nCluster1);
}


public int find_Best_Record(String cluster_age)
{

int index=1;
try
{

int n = totalRow();

if(n>=2)
{

int min = difference(cluster_age.trim(),get_ResultSet_Age(1).trim());
index=1;

for(int i=2; i<=n; i++)
{
int diff=difference(cluster_age,get_ResultSet_Age(i));
if(diff<min)
{
min=diff;
index=i;
}

}
}
else
index=1;

}catch(Exception e9){System.out.print("\nError9 Find best record"+e9.getMessage());}
return(index);
}

public String get_ResultSet_Age(int index)
{

String age="0";


try{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
age =rs.getString("AGE");
con.close();
st.close();


}
catch(Exception e8){System.out.print("\nError8 get_ResultSet_age "+e8.getMessage());}





return(age);

}


public int find_Best_Cluster(Choice nCluster[],String cluster_age[],String rs_age)
{


int min=difference(cluster_age[0],rs_age);
int index=0;
int diff=0;


for(int i=1; i<nCluster.length; i++)

{

diff=difference(rs_age,cluster_age[i]);

if(diff<min)
{
index=i;
min=diff;
}

}


return(index);
}





public String get_Cluster_Age(Choice nCluster[],int index)
{

String age="0";



StringTokenizer token = new StringTokenizer(nCluster[index].getItem(0)," ");	//first record of the cluster; the 4th token is the age

while(token.hasMoreTokens())
{
String a1=token.nextToken();
String a2=token.nextToken();
String a3=token.nextToken();
age=token.nextToken();
String a4=token.nextToken();
String a5=token.nextToken();

}


return(age);


}

public int difference(String age1,String age2)
{
//absolute difference between the two ages
int r=Integer.parseInt(age1.trim())-Integer.parseInt(age2.trim());
return(Math.abs(r));
}

public void showTable()
{
try
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
while(rs.next())
{

System.out.print("\n"+rs.getString(1)+"\t"+rs.getString(2)+"\t"+rs.getString(3)+"\t"+rs.getString(4)+"\t"+rs.g
etString(5)+"\t"+rs.getString(6));
}
if(rs.last()){}
System.out.print("\n total "+rs.getRow());
con.close();
st.close();

}
catch(Exception e1){System.out.print("\nError showTable "+e1.getMessage());}
}

public void deleteRow(int index)
{
try
{ if(index>0)
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);


String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
rs.deleteRow();
con.close();
st.close();


}

}
catch(Exception e2){System.out.print("\nError deleteRow "+e2.getMessage());}

}


public int totalRow()
{
int n=0;
ResultSet rs;
try
{
Connection con;
Statement st;

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);

String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);


if(rs.last())
{
n=rs.getRow();
con.close();
st.close();

return(n);
}

con.close();
st.close();


}
catch(Exception e5){System.out.print("\nError totalRow "+e5.getMessage());}
return(0);
}
public String annomize(String t)
{
StringBuffer sb1 = new StringBuffer(t);

//mask the last three characters of the value with '*'
for(int i=sb1.length()-1; i>=sb1.length()-3; i--)
{
sb1.setCharAt(i,'*');
}

return(String.valueOf(sb1));
}

public String annomize_Age(String age)
{
String t="[";
int n=Integer.parseInt(age);
if(n>=1 && n<=10)
t+="1-10";
if(n>=11 && n<=20)
t+="11-20";
if(n>=21 && n<=30)
t+="21-30";
if(n>=31 && n<=40)
t+="31-40";
if(n>=41 && n<=50)
t+="41-50";
if(n>=51 && n<=60)
t+="51-60";
if(n>=61 && n<=70)
t+="61-70";
if(n>=71 && n<=80)
t+="71-80";
if(n>=81 && n<=90)
t+="81-90";
if(n>=91 && n<=100)
t+="91-100";
t+=" ]";
return(t);
}
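
//The chained range tests above can also be computed arithmetically.
//This alternative helper is hypothetical (not part of the original class)
//and assumes, as above, that ages fall in the range 1-100.
public String annomize_Age_v2(String age)
{
int n = Integer.parseInt(age.trim());
int low = ((n-1)/10)*10+1;	//e.g. age 37 -> 31
int high = low+9;	//e.g. age 37 -> 40
return("["+low+"-"+high+" ]");	//same "[31-40 ]" format as annomize_Age
}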

public String annomize_getRecord(int index)
{
String t="";
if(index>0)
{
try
{
Connection con;
Statement st;
ResultSet rs;

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);

String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);



rs.absolute(index);

//gender is suppressed to the generic value "[ person ]" for every record,
//so male and female rows produce the same anonymized output
t+=annomize(rs.getString(1))+"    "+annomize(rs.getString(2))+"    [ person ]    "+annomize_Age(rs.getString(4))+"    "+format(rs.getString(5))+"    "+rs.getString(6);
rs.close();
con.close();
st.close();
return(t);

}
catch(Exception e6){System.out.print("\nError6 getRecord "+e6.getMessage());}
}
return(t);
}
public String format(String t)
{
String t1=t;
System.out.print("\n length of "+t+" is "+t.length());
for(int i=t.length(); i<=20; i++)
t1+=" ";
return(t1);
}
public String getRecord(int index)
{
String t="";
if(index>0)
{
try
{
Connection con;
Statement st;
ResultSet rs;

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);

String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);

rs.absolute(index);
t+=annomize(rs.getString(1))+"    "+rs.getString(2)+"    "+rs.getString(3)+"    "+rs.getString(4)+"    "+rs.getString(5)+"    "+rs.getString(6);
rs.close();
con.close();
st.close();
return(t);

}
catch(Exception e6){System.out.print("\nError6 getRecord "+e6.getMessage());}
}
return(t);
}


}
End of Cluster class






























//CODE FOR DataBase Package
package mypack;
import java.sql.*;


public class DataBase
{
Connection con1;
Statement st1;
Connection con2;
Statement st2;
ResultSet rs1;
int n=0;

public void CreateTable()
{
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st1 = con1.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
rs1 = st1.executeQuery("select * from patient_record");
if(rs1.last())
{
n = rs1.getRow();
}
ResultSetMetaData rsmd = rs1.getMetaData();
int nCols = rsmd.getColumnCount();
con2 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st2 = con2.createStatement();	//statement on the patient_record2 connection (con2, not con1)
String row[] = new String[nCols+1];
for(int i=1; i<=nCols; i++)
{
row[i]=""+rsmd.getColumnName(i)+"
"+rsmd.getColumnTypeName(i)+"("+rsmd.getColumnDisplaySize(i)+")";
}
String addrow="";

for(int i=1; i<=nCols; i++)
{
if(i==1)
{
addrow+=row[i];
}
else
{
addrow+=","+row[i];
}
}

String sql="create table patient_record2"+
"("+addrow+")";
st2.executeUpdate(sql);

//for insert all row in a patient_record2 table

if(rs1.first()){}
for(int i=1; i<=n; i++)
{
String sql1="insert into patient_record2 values (";

for(int j=1; j<=nCols; j++)
{
if(j==1)
sql1+="'"+rs1.getString (j)+"'";
else
sql1+=","+"'"+rs1.getString(j)+"'";
}
sql1+=")";
//System.out.print("\n"+sql1);
if(rs1.next()){}
st2.executeUpdate(sql1);

}
con2.commit();
con2.close();
st2.close();
con1.close();
st1.close();
}
catch(Exception e1)
{
System.out.print("\nDataBase Error1 "+e1.getMessage());
}
}//close createTable

public void DropTable()
{
Connection con3;
Statement st3;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con3 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st3 = con3.createStatement();
String sql ="drop table patient_record2";
int r= st3.executeUpdate(sql);
System.out.print("\n One Table Dropped ");
}
catch(Exception e2)
{
System.out.print("\n DataBase Error2 "+e2.getMessage ());
}

}//close DropTable

public ResultSet getResultSet()
{
Connection con4;
Statement st4;
ResultSet rs4;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con4 =DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st4 = con4.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);

String sql="select * from patient_record2";
rs4 = st4.executeQuery(sql);
con4.close();
st4.close();

return(rs4);
}
catch(Exception e4)
{
rs4=null;
System.out.print("\nDataBase Error3"+e4.getMessage());
return (rs4);
}

}//close getResultSet

public Connection getConnection()
{
/*try
{
con2 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
}
catch(Exception e5){System.out.print("\nDataBase Error5 "+e5.getMessage());} */
return(con2);
}
public void CloseConnection()
{
try
{
con2.close();
st2.close();
con1.close();
st1.close();
}
catch(Exception e6){System.out.print("\n DataBase Error6"+e6.getMessage());}
}
}
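
For reference, the anonymization flow in itemStateChanged above uses these two classes together: DataBase.CreateTable() snapshots patient_record into the working table patient_record2, Cluster.k_Member_Cluster(k) consumes that snapshot row by row, and DataBase.DropTable() discards it afterwards. A minimal standalone driver performing the same steps (illustrative only; in the project these calls are made from the GUI):

//Hedged sketch of the snapshot -> cluster -> drop lifecycle.
package mypack;
import java.awt.*;

public class RunAnonymization
{
public static void main(String args[])
{
int k = 3;	//minimum records per cluster
DataBase db = new DataBase();
Cluster cl = new Cluster();
db.CreateTable();	//copy patient_record into patient_record2
Choice nCluster[] = cl.k_Member_Cluster(k);	//build the k-member clusters
for(int i=0; i<nCluster.length; i++)
for(int j=0; j<nCluster[i].getItemCount(); j++)
System.out.println(nCluster[i].getItem(j));
db.DropTable();	//discard the working copy
}
}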











10. Project Output Screen

Figure: Project output screen

The whole screen is divided into three parts: part one provides the controls for interfacing with the database, part two displays the records of the table in anonymized form, and part three displays the original records of the table.
The Search button retrieves a particular record by its PID. A new record is added with the New Entry button, and the Update and Delete buttons update and delete records in the database.
The Choice object selects the value of k (3, 4, ..., 10) that is passed as the parameter to the clustering algorithm. The value of k determines the total number of clusters and the maximum number of records that a cluster can accommodate.
In the output screen above we selected k = 3, which means that each cluster contains at least 3 records and at most 2k - 1 = 5 records.
The output screen also shows the total number of clusters and the total number of records for the selected value of k; a worked example of this arithmetic follows.
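
As a worked example, assume the table holds 10 records and k = 3. k_Member_Cluster creates r = 10/3 = 3 clusters (integer division); each is first filled to exactly k = 3 records, and the one leftover record is then merged into its closest cluster, which therefore ends with 4 records, still within the 2k - 1 = 5 bound:

//cluster-count arithmetic used by k_Member_Cluster (integer division)
int n = 10, k = 3;
int r = n / k;	//3 clusters
int leftover = n % k;	//1 record, assigned to its closest cluster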
11. Experimental Results


The main goal of the experiment was to investigate the performance of our approach in terms of data quality, efficiency, and scalability. To evaluate our approach accurately, we also compared our implementation against another algorithm, namely the greedy k-member algorithm.

11.1. Experimental Setup

We worked on a 1.60 GHz Intel(R) Pentium M machine with 512 MB of RAM. The operating system was Microsoft Windows XP Professional version 2002, Service Pack 2, and the implementation was built and run on the Java 2 Platform, Standard Edition 5.0.
For our experiments we used the Adult dataset from the UC Irvine Machine Learning Repository, which is considered a de facto benchmark for evaluating the performance of k-anonymity algorithms. Before the experiments the Adult dataset was prepared: we removed records with missing values and retained only nine of the original attributes. For k-anonymization we considered the attributes {age, zip code, gender, disease, expenses, patient name, address}, of which {age, zip code, gender} form the quasi-identifier. Age and zip code were treated as numerical attributes, gender as a categorical attribute, and disease as the sensitive attribute.

We created two tables for keeping patient information:
1. the patient_information table
2. the patient_record table
The patient_information table is the primary table; it holds the identifying information about a patient, with attributes PID, NAME, PHNO, CITY and COMPANY (the fields used by the GUI code above). The patient_record table is the secondary table, with attributes PID, ZIPCODE, GENDER, AGE, DISEASE and EXPENCES.
To design the database schema we used the Oracle 10g Database Management Server; a sketch of the schema is given below.
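
The exact DDL is not reproduced in this report; the following is a minimal sketch of the two tables, issued through JDBC in the same style as the DataBase class above. The column types and lengths are assumptions for illustration, not copied from the project's actual Oracle schema:

//Hedged sketch of the two schemas; types and lengths are illustrative only.
Connection con = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
Statement st = con.createStatement();
st.executeUpdate("create table patient_information (pid varchar2(10) primary key, name varchar2(30), phno varchar2(15), city varchar2(20), company varchar2(30))");
st.executeUpdate("create table patient_record (pid varchar2(10) primary key, zipcode varchar2(10), gender varchar2(6), age varchar2(3), disease varchar2(20), expences varchar2(10))");
st.close();
con.close();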

12. Conclusions


In this thesis we proposed an efficient k-anonymization algorithm by transforming the k-anonymity problem into the k-member clustering problem. We also proposed two important elements of clustering, namely distance and cost functions, which are specifically tailored to the k-anonymization problem. We emphasize that our distance and cost functions naturally capture the data distortion introduced by the generalization process and are general enough to serve as a data-quality measure for any k-anonymized dataset.
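
The precise definitions are given in Section 7.1. Purely to illustrate the kind of measure such a formulation uses, the sketch below computes a distance over the quasi-identifier {age, zip code, gender}: numerical attributes contribute their difference normalized by the attribute's domain range, and the categorical attribute contributes 0 or 1. This is a common construction and an assumption here, not necessarily the exact function of Section 7.1:

//Illustrative record-to-record distance over {age, zipcode, gender}.
//Normalizing by the domain range is an assumption, not necessarily the
//report's exact cost function from Section 7.1.
double distance(int age1,int age2,int zip1,int zip2,String gender1,String gender2,int ageRange,int zipRange)
{
double dAge = Math.abs(age1-age2)/(double)ageRange;
double dZip = Math.abs(zip1-zip2)/(double)zipRange;
double dGender = gender1.equalsIgnoreCase(gender2) ? 0.0 : 1.0;
return dAge+dZip+dGender;
}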

12.1. Clustering-Based Approaches

Byun proposed the greedy k-member clustering algorithm (k-member algorithm for short) for k-anonymization. This algorithm works by first randomly selecting a record r as the seed of a cluster, and then repeatedly selecting and adding the record that incurs the least information loss within the cluster. Once the number of records in the cluster reaches k, the algorithm selects a new record that is furthest from r and repeats the same process to build the next cluster. Eventually, when fewer than k records remain unassigned, the algorithm assigns each of them individually to its closest cluster; a sketch of this strategy is given after the following list. The algorithm has two drawbacks.
O First, it is slow: its time complexity is O(n²).
O Second, if a cluster contains outliers, the information loss increases.
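
The sketch below restates this loop in Java. Record and the helpers pickRandom, furthestFrom, removeMinLossRecord and closestCluster are hypothetical placeholders for the seed selection and information-loss machinery; they are not taken from Byun's published code:

import java.util.*;

//Sketch of the greedy k-member loop described above (hypothetical helpers).
List<List<Record>> kMemberCluster(List<Record> records,int k)
{
List<List<Record>> clusters = new ArrayList<List<Record>>();
Record seed = pickRandom(records);	//first seed: a random record
while(records.size()>=k)
{
List<Record> cluster = new ArrayList<Record>();
records.remove(seed);
cluster.add(seed);
while(cluster.size()<k)	//grow with the record incurring least information loss
cluster.add(removeMinLossRecord(cluster,records));
clusters.add(cluster);
if(records.size()>=k)
seed = furthestFrom(seed,records);	//next seed: furthest remaining record
}
for(Record r : records)	//fewer than k leftovers: each goes to its closest cluster
closestCluster(clusters,r).add(r);
return clusters;
}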

This thesis proposed a greedy algorithm for k-anonymization. Similar to the k-member algorithm, it chooses the seed (i.e., the first selected record) of each cluster randomly. When building a cluster, however, it keeps selecting and adding records until the diversity (similar to information loss) of the cluster exceeds a user-defined threshold. If the number of records in the cluster is then less than k, the entire cluster is deleted.
Thanks to the user-defined threshold, this algorithm is less sensitive to outliers. However, it also has two drawbacks.
O First, it is difficult to decide a proper value for the user-defined threshold.
O Second, the algorithm might delete many records, which in turn causes a significant information loss.
The time complexity of this algorithm is O((n² log n)/c), where c is the average number of records in each cluster.


































13. References

1. L. Sweeney. k-Anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, pp. 557-570, 2002.

2. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In International Conference on Database Theory, pages 246-256, 2005.

3. C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In International Conference on Extending Database Technology, 2002.

4. R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In International Conference on Data Engineering, 2005.

5. B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In International Conference on Data Engineering, 2005.

6. Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 1998.

7. V. S. Iyengar. Transforming data to satisfy privacy constraints. In ACM Conference on Knowledge Discovery and Data Mining, 2002.

8. K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM International Conference on Management of Data, 2005.

9. K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In International Conference on Data Engineering, 2006.

10. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In ACM Symposium on Principles of Database Systems, 2004.

11. P. Samarati. Protecting respondents' privacy in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13, 2001.

12. L. Sweeney. Information explosion. In Confidentiality, Disclosure, and Data Access: Theory and Practical Application for Statistical Agencies, L. Zayatz, P. Doyle, J. Theeuwes and J. Lane (eds.), Urban Institute, Washington, DC, 2001.

13. L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000. Forthcoming book entitled The Identifiability of Data.

14. L. Sweeney. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(7), 2002.

15. T. Dalenius. Finding a needle in a haystack - or identifying anonymous census records. Journal of Official Statistics, 2(3):329-336, 1986.

16. L. Sweeney. Guaranteeing anonymity when sharing medical data, the Datafly system. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc., 1997.

17. A. Hundepool and L. Willenborg. mu- and tau-Argus: Software for statistical disclosure control. Third International Seminar on Statistical Confidentiality, Bled, 1996.

18. J. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press, Rockville, MD, 1988.

19. L. Sweeney. Computational Disclosure Control: A Primer on Data Privacy Protection. Ph.D. Thesis, Massachusetts Institute of Technology, 2001.

20. N. R. Adam and J. C. Wortmann. Security-control methods for statistical databases. ACM Computing Surveys, 1989.

21. F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE Transactions on Software Engineering, 1982.

22. Computer Science and Telecommunications Board. IT Roadmap to a Geospatial Future. The National Academies Press, November 2003.

23. D. E. Denning. Secure statistical databases with random sample queries. ACM Transactions on Database Systems, 1980.

24. D. Dobkin, A. K. Jones, and R. J. Lipton. Secure databases: Protection against user influence. ACM Transactions on Database Systems, 1979.

25. A. D. Friedman and L. J. Hoffman. Towards a fail-safe approach to secure databases. In IEEE Symposium on Security and Privacy, 1980.

26. Global Mapper. http://www.globalmapper.com/, November 2003.

27. M. Gruteser and D. Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In ACM/USENIX MobiSys, 2003.

28. C. K. Liew, W. J. Choi, and C. J. Liew. A data distortion by probability distribution. ACM Transactions on Database Systems, 10(3), 1985.

29. L. Sweeney. k-Anonymity: A model for protecting privacy. IJUFKS, 10(5), 2002.

30. L. Sweeney. k-Anonymity privacy protection using generalization and suppression. IJUFKS, 10(5), 2002.




Thank You
