2010 IEEE 6th World Congress on Services

Value Added Privacy Services for Healthcare Data


Luvai Motiwalla and Xiaobai (Bob) Li
Department of Operations and Information Systems
University of Massachusetts Lowell
Lowell, MA 01854
luvai_motiwalla@uml.edu

Abstract - The widespread use of digital data storage and sharing for data mining has given data snoopers a big opportunity to collect and match records from multiple sources for identity theft and other privacy-invasion activities. While most healthcare organizations do a good job of protecting the data in their databases, very few take enough precautions to protect data that is shared with third-party organizations. This data is vulnerable to hackers, snoopers and rogue employees who want to take advantage of the situation. Only recently has the regulatory environment (e.g., HIPAA) tightened the laws to enforce data and privacy protection. The goal of this project was to explore the use of value-added software services to counter this invasion-of-privacy problem when data is shared with an external organization for data mining, statistical analysis or other purposes. Specifically, the goal of this service is to protect data without removing sensitive or non-sensitive attributes. Sophisticated data masking algorithms are used in these services to intelligently perturb and swap data fields, making it extremely difficult for data snoopers to reveal personal identity, even after linking records with other data sources. Our software service provides value-added data analysis with the masked dataset. Dataset-level properties and statistics remain approximately the same after data masking; however, individual record-level values are changed or perturbed to confuse data snoopers.

Keywords - Privacy Management in Data Collection and Sharing, Privacy Governance Methods and Tools, Healthcare Records, and Data Masking.

I. INTRODUCTION


Privacy has become a major concern in digital environments. When data is collected, the contributors of the data have concerns about the possibility of inappropriate exposure of the collected data. When data is transformed, the same privacy issue requires a governance process to prevent the involved parties from exposing data in an inappropriate manner during the data transformation. In the data access and sharing process, there is the question of how to ensure that only the right receiver can get the privacy-related information. Therefore, systematic value-added privacy protection services must be available for data providers. A set of corresponding tools can help privacy practitioners deliver regular privacy assurance service and serve as platforms for new innovations. New privacy assurance verification and enforcement algorithms can be plugged into existing data tools to provide high-quality service delivery for privacy or security assurance.

For example, the recent widespread use of electronic health records (EHR) has resulted in an explosion of digital patient data being generated and collected by healthcare organizations. In tandem with this unprecedented growth of digital data, techniques for data mining have gained popularity in a wide variety of domains. While the healthcare industry has benefited from information sharing and data mining, patients are increasingly concerned about invasion of their privacy by these practices. Similarly, the public and civil libertarian groups have also been concerned about privacy protection as EHR become mainstream and increasingly popular with medical professionals and patients. These growing privacy concerns led to the passage of the Health Insurance Portability and Accountability Act (HIPAA) in 1996 and have increased compliance requirements for healthcare organizations.

This project is devoted to developing value-added privacy protection services to address these concerns. It explores the use of innovative data masking techniques based on linear programming, Bayes estimation, and kd-trees, and attempts to apply them to protect patient privacy by embedding the best practices of data masking into reusable software systems. The current focus of this project is on the healthcare area. The broader implication of this research is that it allows safe sharing of patient data across organizational boundaries, satisfying compliance requirements while providing the data to analysts for data mining or medical research that benefits both the organizations and society at large.

A. Significance of the Privacy Problem

There is an increasing demand for good health data to improve the patient experience, expand healthcare knowledge on diseases and appropriate treatments, strengthen insights about the effectiveness and efficiency of healthcare systems, and support public health policy requirements. This is evident because over 70 million Americans have some portion of their medical record in electronic format (Kaelber and Jha, 2008), and there is growing interest from patients in electronic tracking and messaging; according to some estimates, 75% of Americans would like e-mail communication with their physicians and 60% would like to track their medical records electronically (My HealtheVet, 2009). Large software vendors like Google and Microsoft are providing eHealth software, and the Obama Administration's stimulus package, which includes $20 billion in funding for the computerization of healthcare records, is going to expand this electronic health (eHealth) data storage and use revolution.

HIPAA is designed to give patients more control over their personal medical information. It explicitly outlines how medical records can be given to third parties and carries stiff penalties for violations. The impact of HIPAA on medical research is beginning to surface in the research community, with some researchers fearing that it could jeopardize studies of drug safety, medical device validation, and disease prediction and prevention (Evfimievski, Srikant, et al., 2002). While HIPAA was intended to protect patient privacy, it has a significant impact on medical studies involving collection of data from a variety of healthcare organizations. Because HIPAA guidelines are so cumbersome and the penalties for violations so steep, many organizations, particularly small community hospitals and clinics, may decide it is safer and easier not to provide data for medical research. Due to this concern, the Association of American Medical Colleges plans to compile a database so it can document the effect of HIPAA on research activities (Markoff and Shane, 2006).
In addition to HIPAA, privacy issues are having an impact on data mining-based counter-terrorism programs developed by the U.S. federal and state governments, such as Terrorism Information Awareness (TIA), the Computer-Assisted Passenger Prescreening System (CAPPS II), and the Multistate Anti-Terrorism Information Exchange System (MATRIX). Some of these programs were terminated due to strong opposition by the public and privacy advocates (Seifert 2006). The public's concern about privacy is also exemplified by the recent case of Google vs. the U.S. Justice Department. A poll conducted by KDnuggets (2006) revealed that 51% of respondents, most of them data mining professionals, believed Google should not release search data to the Justice Department. Concern about privacy threats has caused data quality and integrity to deteriorate. According to Teltzrow and Kobsa (2004), 82% of online users have refused to give personal information and 34% have lied when asked about their personal habits and preferences. A survey by Time/CNN (Greengard, 1996) revealed that 93% of the respondents believed companies selling personal data should be required to gain permission from the individuals.

B. Prior and Current Work

There has been extensive research in the area of statistical databases (SDBs) on how to protect individuals' sensitive data when providing summary statistical information (Adam and Wortmann 1989). The privacy issue arises in SDBs when summary statistics are derived using data on very few individuals. In this case, releasing the summary statistics may result in disclosing sensitive data. The methods for preventing such disclosure can be broadly classified into two categories: (i) query restriction, which prohibits queries that would reveal sensitive data, and (ii) data masking, which alters individual data in such a way that the summary statistics remain approximately the same.

The problems in data mining are somewhat different from those in SDBs. A data-mining task, such as classification or numeric prediction, requires working on individual records contained in a dataset. As a result, query restriction methods are no longer applicable and data masking becomes the primary approach for privacy protection in data mining. Further, while the main purpose of an SDB is to provide summary statistics, data mining essentially focuses on discovering relationships between data attributes. Preserving such relationships may or may not be consistent with preserving summary statistics.
One of the popular data-masking methods is noise-based
perturbation (Traub et al. 1984; Liew et al. 1985). The basic
idea of this approach is to add noise to the sensitive data to
disguise their true values, while preserving the statistical
properties of the data. The noise-based methods can be used
for some data mining applications. However, one limitation
is that they primarily apply to numeric data. Another
limitation is that the perturbation mechanisms typically
depend on some assumptions about the properties of the
data, such as normality. This can cause data utility to
deteriorate when the assumptions are violated.
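As a concrete illustration, the following is a minimal sketch of noise-based perturbation, not the specific algorithms used in our service; the column values, noise scale and random seed are arbitrary assumptions:

import numpy as np

def perturb_with_noise(values, noise_scale=0.1, seed=42):
    # Add zero-mean Gaussian noise whose standard deviation is a
    # fraction of the column's own standard deviation, so the column
    # mean (and roughly its spread) is approximately preserved while
    # individual values are disguised.
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    noise = rng.normal(0.0, noise_scale * values.std(), size=values.shape)
    return values + noise

ages = [23, 45, 31, 62, 54, 38, 29, 47]
masked = perturb_with_noise(ages)
print(sum(ages) / len(ages), masked.mean())  # means stay close; individual values differ

Note that the perturbed values remain numeric, which is precisely why such methods do not carry over directly to categorical data.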
For categorical data, two primary data-masking approaches are data swapping and cell suppression (Dalenius and Reiss 1982; Reiss 1984). Many of these methods, however, are directed at data presented in a summarized contingency table, instead of the dataset of individuals usually required in data mining. In addition, due to high computational cost, these approaches typically aim at preserving lower-order summary statistics from the data, which is not necessarily consistent with preserving data-mining quality.
A method for privacy protection in de-identified data, called k-anonymity (Sweeney, 2002; Samarati, 2001), has recently gained increasing popularity. K-anonymity masks the values of some potentially identifying attributes, called quasi-identifiers (QIs), such that the values of the QI attributes for any individual match those of at least k-1 other individuals in the same dataset. In this way, the identities of individuals are expected to be better protected. However, it is still possible for an intruder to discover the sensitive information of individuals in the k-anonymized data (Machanavajjhala et al. 2006). The problem is that k-anonymity protects against identity disclosure by generalizing different but similar QI values into the same value. If a group of k-anonymized records have the same sensitive values, these individuals' privacy is not adequately protected.
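The k-anonymity property itself is easy to verify: every combination of QI values must occur at least k times. The following sketch (with hypothetical field names and toy records) checks this condition; it does not perform the generalization step that produces a k-anonymous dataset:

from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # Count how often each quasi-identifier combination occurs and
    # require every group to have at least k members.
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"age": "40-49", "gender": "F", "zip": "018**", "test": "yes"},
    {"age": "40-49", "gender": "F", "zip": "018**", "test": "no"},
    {"age": "30-39", "gender": "M", "zip": "018**", "test": "yes"},
]
print(is_k_anonymous(records, ["age", "gender", "zip"], k=2))  # False: one QI group has a single member

As the l-diversity critique above suggests, even when this check passes, a QI group whose members all share the same sensitive value still leaks that value.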
Another popular data-masking approach is microaggregation (Defays and Nanopoulos 1992; Domingo-Ferrer and Mateo-Sanz 2002), which masks data by aggregating attribute values. Univariate microaggregation (UMA) involves sorting records by each attribute, grouping adjacent records, and replacing the individual values in each group with the group average. UMA does not give adequate consideration to the relationships between attributes. Multivariate microaggregation (MMA) groups data using clustering techniques based on multi-dimensional distance measures. Therefore, the relationships between attributes are better preserved. However, this benefit comes with a higher cost in computational complexity. In addition, similar to the noise-based perturbation approaches, microaggregation methods do not naturally apply to categorical data.
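A minimal sketch of UMA for a single numeric attribute, under the assumption of a fixed group size, looks as follows; production methods choose group sizes and partitions more carefully:

def univariate_microaggregate(values, group_size=3):
    # Sort the values, group adjacent ones, and replace each value
    # with its group average; the overall mean is exactly preserved.
    order = sorted(range(len(values)), key=lambda i: values[i])
    masked = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked

salaries = [52000, 61000, 58000, 75000, 49000, 88000]
print(univariate_microaggregate(salaries))
# the lower three values all become 53000.0, the upper three 74666.67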


This NSF-funded project takes a systematic approach to masking data. The software service we develop includes many of the algorithms described above. In addition, we develop a framework that includes two sets of methods, used for masking categorical and numeric data respectively. These methods are nonparametric in nature and do not require any assumptions about the statistical distributions of the data.
II. PRIVACY PROTECTION PROBLEM IN DE-IDENTIFIED DATA



Typically, three parties are involved in the privacy problem in data mining: (i) the data owner (the organization that owns the data), who wants to discover knowledge from the data without compromising its confidentiality; (ii) the individuals who provide their personal information to the data owner and want their privacy protected; and (iii) the third-party data user who has access to the data released by the data owner. This third party can be an individual data miner (either an insider or an outsider to the data owner), or an organization that has a data-sharing agreement with the data owner. In this research, the third party is considered a potential privacy intruder.
From a privacy viewpoint, the attributes of data can be classified into three categories, as follows:
1) Identity attributes, which can be used to directly identify an individual, including name, social security number, credit card number, phone number, and address.
2) Sensitive attributes, which contain private information that an individual typically does not want revealed, such as salary, medical test results, sexual orientation, and academic transcripts.
3) Non-sensitive attributes, which are normally not considered sensitive by individuals; many of these attributes can be found in publicly available sources. Examples of this category include (in a general context) age, gender, race, education, occupation, height, eye color, and so on.
A common practice for many organizations today is to remove identity attributes from customer records (called de-identification) before releasing them to the third party. Many believe that de-identification is adequate for protecting individual privacy. The fact is, however, that an intruder can identify a target subject based on the values of the non-sensitive attributes in the de-identified data, and thus discover the sensitive values of the subject. Sweeney (2002) pointed out that 87% of the population in the United States can be uniquely identified using deterministic record linkage with three non-sensitive attributes (gender, date of birth, and 5-digit zip code), which are accessible from publicly available voter registration records.
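The attack Sweeney describes is essentially a join on quasi-identifiers. The sketch below, with hypothetical field names and toy data, shows how a de-identified record that is unique on {gender, date of birth, zip} is re-identified against a public list:

def link_records(deidentified, public, quasi_identifiers):
    # Index the public data (e.g., a voter registration list) by its
    # quasi-identifier values; a unique match re-identifies a record.
    index = {}
    for person in public:
        key = tuple(person[a] for a in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])
    links = []
    for record in deidentified:
        names = index.get(tuple(record[a] for a in quasi_identifiers), [])
        if len(names) == 1:
            links.append((names[0], record))
    return links

voters = [{"name": "Alice", "gender": "F", "dob": "1970-03-12", "zip": "01854"}]
released = [{"gender": "F", "dob": "1970-03-12", "zip": "01854", "test": "yes"}]
print(link_records(released, voters, ["gender", "dob", "zip"]))  # Alice's test result is disclosed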

B. Value Added Privacy Protection Service


While there are some privacy protection systems
available in the market, our system was designed to give the
control of protecting their data from privacy attacks to the
data owners. Our solution implements the complex masking
technology yet, ensures that it is easily configurable by endusers. Here are the key features of our software service:
9 Empowerment - allows data owners to take control
over their data privacy
9 Powerful Technology - provides a choice of
multiple data masking techniques developed from
years of research and analysis

A. The Data Masking Solution

The key challenge this project addresses is the protection of patient data in storage while it is shared through data clearinghouses. Most current protection efforts have focused on data in transit. Large healthcare organizations have spent millions of dollars on access control and encryption technology. The healthcare industry typically spends more than 5% of its total revenue on IT, roughly $39.5 billion in 2008 (Monegain, 2006). Using encryption to protect data in storage, however, is similar to keeping data locked in a vault: the encrypted data cannot be used for medical or healthcare research. Another solution is data masking, which not only secures the data against identification of individual records (protecting individual rights) but also keeps it useful for research analysis. Data masking technology perturbs or swaps data fields using a masking algorithm to alter the true values of individual records, while preserving the overall data integrity for research and analysis. Encryption and masking are complementary privacy technologies. Encryption is most useful to prevent hackers from interpreting data transmitted over networks, while masking is useful to prevent snoopers from identifying data records in storage (in databases, data marts and warehouses). Masking is also cost-effective and computationally efficient. The other benefit of the masking solution is that it removes two of the unanticipated consequences of HIPAA, namely, the restrictions posed on clinical and public policy research by restricting access to medical records data, and the decreased recruitment/retention of research subjects who have not consented to secondary usage by healthcare organizations or whose investigators are unable to meet the regulatory requirements (Jarrell, Welker, et al., 2008).

Data masking is very different from encryption; it does not change the data via ciphering, nor does it require any keys or digital certificates. Instead, data masking changes the data values using noise perturbation, data aggregation, or data swapping. The properties of the data are generally maintained after masking for statistical analysis and data-mining research. Data masking is not as resource-intensive as encryption, and it is used for preserving the privacy of data before sharing with external organizations, whereas encryption is more useful for protecting data during transmission. Our software service uses a set of masking algorithms to alter the sensitive attribute values of the high-risk records before they are released to the third party. The data are masked such that the properties of the dataset as a whole are preserved. Consequently, data mining and other research results, such as summary statistics and predictive models, based on the masked data will be similar to those based on the original data.

B. Value Added Privacy Protection Service

While there are some privacy protection systems available in the market, our system was designed to give data owners control over protecting their data from privacy attacks. Our solution implements complex masking technology yet ensures that it is easily configurable by end-users. The key features of our software service are:
- Empowerment: allows data owners to take control over their data privacy
- Powerful Technology: provides a choice of multiple data masking techniques developed from years of research and analysis
- Scalable Solution: permits increasing or decreasing the level of masking depending on the privacy desired for the data
- Open Connectivity: ODBC or JDBC support for all databases and file formats
- Seamless Integration: with applications, databases and file formats
- Open Standards: software service with multi-platform O/S support
- Intuitive GUI: browser user interface requires minimal training




C. Privacy Protection Problem: A Scenario

A hospital maintains a database of 1,000 patient records regarding a sensitive disease. Originally, the data set has a few identity-related attributes, such as name and address, and a number of other attributes, such as age, gender, marital status, education and occupation, as well as a set of physical and medical measures (e.g., weight, height, blood pressure, etc.). The data set also includes a confidential attribute, test result, which has three values: yes, no, and maybe. To protect privacy, the identity-related attributes are removed from the dataset. In addition, the values of numeric attributes (e.g., age) are grouped into a few categories. Knowing they were protected in this way, the patients authorized the hospital to make the data available to related professionals and organizations for the purposes of knowledge sharing and medical research. Both the hospital and the patients would believe that no sensitive information could be disclosed from the data. A simple calculation, however, shows a different picture. Assume the numbers of categories in age, gender, marital status, education and occupation are 5, 2, 2, 5 and 10, respectively. The total number of possible category combinations for the five attributes is 5 x 2 x 2 x 5 x 10 = 1,000. Assuming each category combination is present with equal likelihood, each of the 1,000 patients is then a unique record. In real situations, of course, the assumption of a uniform distribution across all attributes is rather unrealistic. More likely, some patients will share the same attribute values, but others will have unique values. The patients with unique values will be exposed to a high risk of confidentiality disclosure.

Now, suppose that an intruder acquired the data under the claim of studying the relationships between the disease and the patient demographic attributes. Suppose the intruder has observed that there is a unique record with {age = 40-49, gender = female, marital status = married, education = bachelor, occupation = engineer}. This profile matches the demographic data of one of his colleagues, Alice, whom he knew had taken the test in the hospital. He has then effectively discovered Alice's test result.

To illustrate the problem in more detail, consider a simplified hospital data set that includes only 16 records with three non-confidential attributes (age, gender, and marital status) and a confidential attribute (test result), as shown in the following table. [Table not reproduced in this copy: 16 sample records.] A record is identifiable if its confidential attribute value can be determined by the combined values of its non-confidential attributes. Further, a record can be uniquely identifiable (e.g., Alice's record) or collectively identifiable. In the table, a record marked U is uniquely identifiable, and a record marked V is collectively identifiable (V1, V2 and V3 denote three different groups).

Collectively identifiable records are subject to the same disclosure risk as uniquely identifiable records. For example, if an intruder knew a patient with {age = 30-39, gender = female, marital status = married}, as shown in records #2 and #3, then he has found the patient's test result (which is Yes). Whether the patient is record #2 or #3 is not important.

Our application allows the data owner to swap the confidential attribute values of the (uniquely and collectively) identifiable records while preserving the properties of the data set as a whole. This masked data is released to users. If an intruder tries to uncover confidential data for an individual, s/he will most likely get a faked test result value.

However, if a data analyst wants to conduct legitimate research on the data (e.g., to study the relationships between the disease and the patient demographic attributes), the results based on the masked data will be very close to those based on the original data.
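The arithmetic behind the scenario, and the resulting disclosure risk, can be checked mechanically; the sketch below uses the scenario's category counts and flags records whose non-confidential profile appears exactly once:

from collections import Counter
from math import prod

category_counts = {"age": 5, "gender": 2, "marital": 2, "education": 5, "occupation": 10}
print(prod(category_counts.values()))  # 1,000 possible profiles for 1,000 patients

def uniquely_identifiable(records, attrs):
    # A record is at high risk if its combination of non-confidential
    # attribute values appears only once in the dataset.
    profiles = Counter(tuple(r[a] for a in attrs) for r in records)
    return [r for r in records if profiles[tuple(r[a] for a in attrs)] == 1]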
III. PILOT STUDY

The pilot project's goal was to develop a consumer electronic privacy protection service to address the concerns of organizations that regularly share their sensitive consumer and patient data for research analysis or data mining. The need for such a service stems from the widespread use of digital data by today's firms and the need to share this data with their external partners. For example, the Massachusetts Health Data Consortium (MHDC) is raising awareness with forums and events trying to answer the critical question "How will Clinicians Share Your Health Records Electronically, and Keep them Private and Secure?" (see the MHDC Website, accessed June 30, 2008), due to growing privacy concerns related to HIPAA. This is a big issue in healthcare, finance and banking, and many other industries that want to improve quality and lower the cost of their services with data mining and statistical data analysis but are restricted due to privacy concerns.
Our pilot study started with a meeting with managers of five healthcare organizations and potential end-users of participating organizations to identify the nature, scope and functional requirements of the application, and to develop a conceptual model and framework for the software service. Based on their recommendations we developed the requirements for our prototype software service. We approached the privacy protection issue from the standpoint of an organization that owns data; our software service can be installed and operated by the data owner organizations. Users can import data, run the data masking algorithms, and then export the data they choose to share with external organizations. External organizations never see the unmasked data. Our prototype application has an easy graphical user interface (GUI) environment for users to import data from a variety of sources, run the data masking algorithms, and export masked data for sharing with external organizations; it runs on Windows or Linux platforms. It has two sets of data masking algorithms: one for numeric data and the other for categorical data.

We conducted several experiments to compare our approach with existing approaches. Our goal was to compare data processed by the prototype with data left unprocessed, to evaluate the effectiveness of the prototype application in terms of both disclosure risk and data quality. We tested our prototype with several large databases of patient data from the New Hampshire Department of Health & Human Services (NH-DHHS). We also let evaluators from NH-DHHS, Blue Cross, Inc. and MassPro, Inc. download and run our prototype on their computers for testing. They are currently testing the software and will give us their detailed feedback. Our software architecture has been designed as a software service, as shown in Fig. 1 below.
[Figure 1: ePrivacy Software Service. Diagram: a participating organization's .CSV data flows into the PrivGuard middleware, which produces masked data for external organizations.]


A. Prototype System Development

The development process included three inter-related stages.

Stage 1: System Analysis

We met with a few end-users who showed interest in the product and gave initial feedback about what they would like to see in it. That helped us determine the user requirements, identify the nature, scope and functional requirements of the product, and develop a conceptual model for the system. We wanted to make sure that we had a user-friendly, portable and flexible user interface that could run on a Web platform as a service or on client operating systems like Windows and Linux as an application. In addition, it had to interface with various databases like Oracle, MySQL and MS Access, and with flat file formats. It also needed to be scalable to support other databases and masking algorithms in the future.

Stage 2: System Design

The application was initially designed as an installable application to simplify client access and testing. The middleware product can be installed as a client system and also as a SaaS (Software as a Service) agent. The prototype architecture was designed with the following goals:

1) Modularization - Separate modules for the GUI and the masking algorithms. This helps in providing the user with the algorithm needed without changing the product. Also, this makes it easier to add or modify algorithms.

2) Usability - The GUI is designed to be intuitive and easy to use for both beginner and advanced users, with a point-and-click environment. After the data is imported, users can specify the fields as identity, sensitive or non-sensitive fields. Also, the user can decide how to process identity and non-sensitive fields.

3) Portability - Universal design principles are utilized to make our application interoperable in a multi-platform client environment. It is developed on the Java platform and runs on Windows, UNIX/Linux, and MacOS.

4) Flexibility - The architecture is designed to work with different flat-file formats and can access major database platforms (e.g., Oracle, SQL Server, MySQL,


DB2, etc.), through the use of ODBC or JDBC technology. It can currently import delimited files (e.g., csv, tab), fixed-width text files, spreadsheets and relational databases.

5) Scalability - Our prototype can handle data with millions of records with reasonable response time. The GUI supports configuration of different data masking algorithms, an undo feature and context-sensitive help.

6) Maintainability - Design documents exist and the source code is documented in-line. The sources are stored in a source code control system to keep track of the various releases.

Stage 3: System Implementation, Testing and Assessment

1) Data - Incoming data from participating organizations, consisting of identity, sensitive and non-sensitive attributes, is imported into the system. This data can be of any type (numbers, text, dates, booleans and others) normally used for data mining or research analysis.

2) Application - Data fields are labeled as non-sensitive by default. Through a Config menu, the user can assign field types as identity, sensitive or non-sensitive based on the metadata descriptions. Our prototype provides a list of available data masking algorithms, adequate help functions, and an administrative interface to select and run masking algorithms. The masked output data is first generated in .CSV format and then converted to the format of the source data for external organizations. An uploading and downloading utility is also provided for the external organization to use the system.

3) System Configuration - In the prototyping stage, the system was implemented as a stand-alone application on the Windows platform using Java. Open-source application development standards are utilized for the prototype application.

4) Implementation - The prototype application was implemented in phases. The key technical tasks involved designing the prototype to support all algorithm configuration and data input and output, and setting up a product development and testing plan. Another important task has been interacting with the external organizations in terms of data requirements.

5) Evaluation - We have tested this prototype with three of the five participating healthcare organizations. We first conducted internal quality assurance tests with sample data before allowing the external participants to test our system. We reviewed and analyzed the log files from the participants before revising and fine-tuning the system. We then re-tested the revised application using hand-generated data and sample data received from the potential users. This was done for every masking algorithm and for the overall product.

B. Pilot Study Evaluation

One of the participating organizations, the New Hampshire State Department of Health and Human Services (NH-DHHS), provided us with a number of large datasets that included all of the New Hampshire inpatient records from 1999 to 2006. We selected the 2006 inpatient dataset for our experimental study. This dataset contains 126,358 patient records. It has 40 attributes, including age, gender, address in county, hospital, length of stay, diagnosis, treatment, disposition, payer type, amount charged, and so on. The data is available for medical and healthcare-related research.

The main purpose of the experiment is to examine whether and to what extent a patient can be re-identified after the data are masked, and how well the statistical properties of the original data are preserved in the masked data. The real identities of the patients, such as name and social security number, are not disclosed in the original data. However, the identity of a record can be easily revealed by linking some attributes in the released data to other publicly available data sources (e.g., a voter registration list) that contain both these attributes and the identities of the subjects. Once a patient is linked and identified, his/her health record will be disclosed. In this study, we selected four attributes (age, gender, address in county, and hospital) as the potentially identifying attributes, which were thus subject to masking.

We use a deterministic record linkage measure to represent re-identification risk. A record in the masked file is said to be "linked" if the record closest to it in the original file is indeed the corresponding unmasked record. A record in the masked file is "second closely linked" if the second closest record in the original file is the corresponding one. The record linkage measure is defined as the percentage of records that are either linked or second closely linked. A small percentage value indicates a low re-identification risk.

The utility of the masked data is measured by how well the summary statistics of the data are preserved. The most relevant summary statistic for categorical data is the categorical frequency (count) distribution of the masked attributes. In this respect, we use the following measure:

Error rate in frequency distribution = (1/J) * sum_{j=1..J} (1/K_j) * sum_{k=1..K_j} |F_jk - F~_jk| / F_jk

where J is the number of masked attributes, K_j (j = 1, ..., J) is the number of categories in the jth masked attribute, F_jk is the frequency count for the kth category of the jth attribute in the original dataset, and F~_jk is the corresponding count in the masked dataset. A small error rate in frequency distribution indicates that the summary statistic is well preserved in the masked data.
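Both measures can be expressed directly in code. The sketch below assumes numeric (or numerically encoded) attribute matrices for the linkage measure and per-attribute category lists for the error rate; it mirrors the definitions above rather than our production implementation:

import numpy as np

def record_linkage_pct(original, masked):
    # Percent of masked records whose closest or second-closest record
    # in the original file (by Euclidean distance) is the true counterpart.
    linked = 0
    for i, row in enumerate(masked):
        distances = np.linalg.norm(original - row, axis=1)
        if i in np.argsort(distances)[:2]:
            linked += 1
    return 100.0 * linked / len(masked)

def frequency_error_rate(original_cols, masked_cols):
    # Average, over attributes j and categories k, of |F_jk - F~_jk| / F_jk.
    rates = []
    for orig, mask in zip(original_cols, masked_cols):
        errs = [abs(orig.count(c) - mask.count(c)) / orig.count(c)
                for c in sorted(set(orig))]
        rates.append(sum(errs) / len(errs))
    return sum(rates) / len(rates)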
We adjusted parameters in the chosen algorithms to create two setups. In setup A, a high percentage of values were masked to yield very low record linkage values at the cost of a higher error rate in summary statistics. In setup B, a relatively low percentage of values were masked to yield a low error rate in summary statistics but higher record linkage values.


Setup   Linkage (%)   Linkage (Count)   Error Rate in Summary Stat (%)
A       0             0                 0.36
B       0.02          25                0



The results of the experiments are shown in the table above. In setup A, we were able to achieve zero record linkage, which implies excellent protection against identity disclosure, while the error rate is still very low at 0.36%. In setup B, we obtained zero error in summary statistics with a linkage of only 0.02% (25 out of 126,358 records). In terms of computation time, it took only about five minutes to complete masking the data. Given that this dataset is fairly large, the algorithms are very efficient.
C. Challenges and Limitations

We faced the following challenges during this project:

1) Developing the GUI: The original versions of the data masking algorithms were operational in a command-driven environment requiring scripting and customization to load the input data, execute the algorithms and export the output data in text-delimited files. The algorithms were developed and run by the researchers using a Java development environment. The process of converting from the command-driven environment to a graphical user interface (GUI) environment was complex, as we had to determine all the parameters required by the masking algorithms and develop GUI prompts to accept this information from end-users who do not understand the operation of the algorithms.

2) Dealing with Different Data Types: Some of the data masking algorithms work with text or categorical data while others work only with numeric data. If a user tried to run categorical masking algorithms on numeric data, this would produce errors. So, we had to develop programs to convert numeric data into text before running these algorithms.

3) Working with Large-Scale Databases: This challenge was to let our application work with large-scale data sources that contain millions of records. This needed quite a lot of memory and also took a long time to compute on large databases. Measures were taken to extend the early small-scale versions of the algorithms to the large scale.
4) Masking Single vs. Multiple Fields: Many algorithms used by our system could initially mask only a single field. We extended these algorithms to mask multiple fields. The basic procedure is to run an original algorithm multiple times, each run masking a single field while keeping the other fields unmasked. At the end, the procedure substitutes the results of the multiple runs into the original data, resulting in multiple fields being masked (see the sketch after this list).
5) Business Rules for Identity and Pseudo-identity Fields: Identity fields are those that can be used to identify a patient, such as social security number and patient name. Pseudo-identity fields are those that can help easily identify a patient, such as address and phone number. Some potential clients wanted different options for processing identity and pseudo-identity fields. For identity fields, they wanted to be able to either remove the fields or generate disguised field values whose original values can be retrieved by the original data owner. For pseudo-identity fields, they wanted to be able to either remove them or substitute them with random pseudo field values of the same data type and format. This required us to add a Preferences menu that allows users to select options for how they want these fields processed before data is given to external organizations.

In general, we have been very successful in accomplishing our pilot project objectives. However, due to time and other constraints, we have the following unresolved issues, which we feel confident of resolving in the next phase:

a) Identifier Mapping: We have developed an algorithm to map between the original identifier and the masked identifier for use by the data owner. This algorithm works only on a single dataset. Some potential clients wanted to be able to do this in a relational database environment, which is a challenging problem for future work.

b) Macros or Control Programs: One of our clients requested the ability to capture the entire process of running the features (such as selecting the data types and masking algorithms) in a control program, like a macro, which can then be applied repeatedly to the data files to be masked and exported to external organizations.

c) Limited Evaluation: Due to time constraints, we were not able to get feedback on our prototype from all our beta release sites. We will work with these clients to get their feedback before moving on to the next phase.
IV. CONCLUSION AND FUTURE DIRECTIONS

This project allowed us to develop a value-added privacy protection software service. We investigated the feasibility of this service, designed and developed a prototype application, and interacted with real-world users from five different healthcare organizations. The direct interaction with healthcare organizations and their feedback was crucial to the success of our prototype design. We found the application was just as easily usable by technically sophisticated as by novice computer users with minimal training. Although we had some good technical accomplishments, the application still requires many more refinements for commercial service. Some of the feedback we received from our clients is presented below; we plan to incorporate it in the next phase.
Complete integration of our application with database, ETL and data mining software - instead of importing and exporting the data, our clients would like us to embed the data masking capabilities in the database, data mining or ETL platforms, or in the data warehouse. We will investigate and determine the best integration process for this tool.

Integrate the system with other security and privacy software - there are many security and privacy vendors without data masking capabilities. We will work with them to create integrated solutions so the end-user can have all features in one software suite.

Implement service-oriented architecture (SOA) capability in the next version, for quick and easy integration with legacy systems and databases. One large healthcare organization we talked with requested this capability. In addition, the software as a service (SaaS) model has made tremendous gains in the marketplace. It is specifically very
useful for many of our small to medium size clients. Therefore, in the next phase we plan to create a web-based GUI using a Java-based development platform and create a SaaS product. This will provide an opportunity for customers to lease the application as a service for an annual fee instead of buying the software.

Develop macro or batch mode capability - as mentioned before, we need to develop this capability to allow our clients to run the system in unattended mode after an initial configuration run with our application.

Implement load balancing capability in the system to handle large-scale databases or files in batch processing mode.

The above list indicates that our prototype application requires more research. The next version must not only support the features mentioned above, but also extend our application to work smoothly with other industries, such as banking and finance, retail, and the government sector.
V. REFERENCES

[1] Adam, N. R., and Wortmann, J. C. 1989. "Security-Control Methods for Statistical Databases: A Comparative Study," ACM Computing Surveys (21:4), pp. 515-556.
[2] Monegain, B. 2006. Healthcare IT News, http://www.healthcareitnews.com/story.cms?id=4242 (accessed January 2009).
[3] Dalenius, T., and Reiss, S. P. 1982. "Data Swapping: A Technique for Disclosure Control," Journal of Statistical Planning and Inference (6:1), pp. 73-85.
[4] Defays, D., and Nanopoulos, P. 1992. "Panels of Enterprises and Confidentiality: The Small Aggregates Method," Proceedings of Statistics Canada Symposium 92 on Design and Analysis of Longitudinal Surveys, Ottawa, Canada, pp. 195-204.
[5] Domingo-Ferrer, J., and Mateo-Sanz, J. M. 2002. "Practical Data-Oriented Microaggregation for Statistical Disclosure Control," IEEE Transactions on Knowledge and Data Engineering (14:1), pp. 189-201.
[6] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. 2002. "Privacy Preserving Mining of Association Rules," Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, pp. 217-228.
[7] Greengard, S. 1996. "Privacy: Entitlement or Illusion?" Personnel Journal (75:5), pp. 74-88.
[8] Kaelber, D., and Jha, A. 2008. "A Research Agenda for Personal Health Records (PHRs)," Journal of the American Medical Informatics Association (15:6), November/December.
[9] KDnuggets. 2006. "Google Subpoena: Child Protection vs. Privacy," http://www.kdnuggets.com/polls/2006/google_subpoena.htm (accessed July 2006).
[10] Jarrell, K., Welker, J., Silsbee, D., and Tucker, F. 2008. "The Unintended Effects of the HIPAA Privacy Protections on Health Care Treatment Team and Patient Outcomes," The Business Review, Cambridge (11:1), pp. 14-25 (retrieved January 2, 2009, from ABI/INFORM Global database).
[11] Liew, C. K., Choi, U. J., and Liew, C. J. 1985. "A Data Distortion by Probability Distribution," ACM Transactions on Database Systems (10:3), pp. 395-411.
[12] My HealtheVet - The Gateway to Veteran Health and Wellness. 2007. http://www.myhealth.va.gov/ (accessed April 10, 2009).
[13] Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. 2006. "l-Diversity: Privacy beyond k-Anonymity," Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, GA.
[14] Markoff, J., and Shane, S. 2006. "Government Looks at Ways to Mine Databases," New York Times (Late Edition, East Coast), February 25, 2006, p. C.1.
[15] Reiss, S. P. 1984. "Practical Data-Swapping: The First Steps," ACM Transactions on Database Systems (9:1), pp. 20-37.
[16] Samarati, P. 2001. "Protecting Respondents' Identities in Microdata Release," IEEE Transactions on Knowledge and Data Engineering (13:6), pp. 1010-1027.
[17] Seifert, J. W. 2006. "Data Mining and Homeland Security: An Overview," CRS Report for Congress, January 27, 2006, http://www.fas.org/sgp/crs/intel/RL31798.pdf (retrieved July 2006).
[18] Sweeney, L. 2002. "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems (10:5), pp. 557-570.
[19] Teltzrow, M., and Kobsa, A. 2004. "Impacts of User Privacy Preferences on Personalized Systems: A Comparative Study," in Designing Personalized User Experiences in eCommerce, C. M. Karat, J. Blom, and J. Karat (eds.), Dordrecht, Netherlands: Kluwer Academic Publishers, pp. 315-332.
[20] Traub, J. F., Yemini, Y., and Wozniakowski, H. 1984. "The Statistical Security of a Statistical Database," ACM Transactions on Database Systems (9:4), pp. 672-679.

