Beruflich Dokumente
Kultur Dokumente
Recent years have witnessed a tremendous growth in data collection thanks to the
exponential development of information technology that not only facilitates our
daily life, but also generates extensive amounts of person-specific data. Data of
different types are generated on a daily basis, such as GPS data, RFID data
m o v i n g objects or trajectory data, health- care data, search queries, and customers’
p u r c h a s e s . There has been a compelling demand in various sectors for collecting
such data, whether by government agencies, transportation authorities, medical
centers, online service providers, or retail corporations. This demand stems from
the need to analyze the collected big data and extract useful information in order to
construct a knowledge base that helps in decision-making operations. Usually, the
collected data is transferred to a third party, e.g., a research center, in order to
conduct the desired analysis.
We call the entity that collects, holds, and publishes the data a data holder
or a data publisher, the individuals from whom the data is collected record owners
(each record in a tabular database belongs to a unique individual), the entity that
receives the published data a data recipient, and data recipients who attempt to
perform a privacy attack adversaries or attackers. Participants in a data collection
and sharing scenario. For example, a search engine company (data holder) may want
to publish queries submitted by users (data owners) to the public (data recipients)
in order to improve the company’s recommendation system. The terms data
sharing and data release refer to the same process and will be used interchangeably
in this research.
Sharing collected data can serve a variety of purposes and can be mutually
beneficial to both the data holder and data recipient. For example, experts mining
released road traffic data can extract interesting patterns, which in turn helps in
improving the city’s road network and, thus, reducing traffic jams. Another inter-
esting example is Netflix, the giant movie rental service, which released a dataset
containing movie ratings that belong to nearly 500,000 subscribers and offered an
enticing prize for anyone who could improve their recommendation service. In some
cases, publishing data is mandated by law. For example, there is a regulation in
California that requires licensed hospitals to publish records of discharged pa-
tients. Generally speaking, whether the published data is shared with a private entity
or is available for public access, it is significantly b e n e fic i a l for researchers and
practitioners to test a proposed solution on real-life data, as opposed to computer-
generated, synthetic data.
In most cases, the collected data belongs to real-life individuals, and sharing
their data may lay their privacy on the line. The shared data must not allow
individuals’ identities to be exposed, nor their private information to be linked to
them, whether accidentally or deliberately. A privacy attack is, thus, the ability to
perform such linkage. Therefore, before releasing the data, data holders tend to
anonymizing their collected data by removing explicit identifiers such as Name,
Date of Birth, and SSN. The objective is to prevent linking sensitive information
to an individual. Though seemingly benign, such data anonymization h a s been
proven inadequate in protecting individuals’ privacy.
Privacy-Protecting Big Data sharing
Privacy-Protecting Big Data sharing (PPBDS) is the study of applying adequate
anonymization me asu r es to the data to be published, taking into consideration two
factors : protecting individuals’ privacy in the published data and maintaining high data
utility for accurate analysis of the anonymous data. Privacy protection is achieved by
transforming the raw data to another anonymous version that adheres to some privacy
requirements dictated by the applied privacy model; i.e., this process is referred to as
anonymizing the data. Protecting data utility concerns the cost, in terms of data quality
degradation, ensued due to anonymizing the raw data.
The importance of PPBDS stems from the need to share real-life data with
researchers, data analysts, or the general public, while safeguarding individuals’ privacy
in the sharing data. Moreover, we do not want data holders to hesitate when publishing
their data because such real-life data is crucial for obtaining meaningful and accurate
data mining results, e.g., when performing classification analysis .Therefore, PPBDS
ensures that useful, yet privacy-preserving, data can be shared and benefited from in
various sectors.
In this research work, raw data refers to the original collected data that includes
person-specific and sensitive pieces of information about some individuals and is kept
in the data holder’s possession. We assume that the data holder wishes to first anonymize,
and then share the collected data. The shared data, which has been anonymized
before sharing, is described as being anonymous or sanitized. We use these two terms
interchangeably in this research work.
APPROACH
Trajectory Streams
Recent advancement in mobile computing and sensory technology has facilitated the
possibility of continuously updating, monitoring, and detecting the latest location
and status of moving individuals. Spatio-temporal data generated and collected on
the fly is described as trajectory streams. This work is motivated by the concern
that publishing individuals’ trajectories on the fly may jeopardize their privacy.
Static Trajectories
In recent years, the collection of spatio-temporal data that captures human move-
ments has increased tremendously due to advancements in hardware and software
systems capable of collecting user-specific data. The bulk of the data collected by
these systems has numerous applications, including general data analysis. Therefore,
publishing such data is greatly beneficial for data recipients. However, in its raw
form, the collected data contains sensitive information pertaining to the individuals
from whom it was collected and must be anonymized before final submission.
Various organizations collect data about individuals for various reasons, such as ser-
vice improvement. In order to mine the collected data for useful information, it has
become a common practice to share the collected data with data analysts, research
institutes, or simply the general public. The quality of published data significantly
affects the accuracy of the data analysis, and thus affects decision making at the
corporate level.
We divide the presentation of the problems in this research according to the applied
privacy model. W e present the problem of anonymizing trajectory streams under a
syntactic privacy model. We then proceed to present the second problem,
anonymizing static trajectories, and the third problem, anonymizing relational data,
under a semantic privacy model. The rest of the research work is organized as
follows:
Task 1:
• Task 2: explores prominent works that have been proposed for protecting
individuals’ privacy in the domain of privacy-protecting data sharing. Particularly,
we present in-depth literature review of existing anonymization methods for four
types of data: data streams, trajectory data, transaction data and relational data.