
REPORT TO THE PRESIDENT

BIG DATA AND PRIVACY:


A TECHNOLOGICAL
PERSPECTIVE
Executive Office of the President
President's Council of Advisors on
Science and Technology
May 2014



About the President's Council of Advisors on Science and Technology

The President's Council of Advisors on Science and Technology (PCAST) is an advisory group of
the Nation's leading scientists and engineers, appointed by the President to augment the science
and technology advice available to him from inside the White House and from cabinet
departments and other Federal agencies. PCAST is consulted about, and often makes policy
recommendations concerning, the full range of issues where understandings from the domains
of science, technology, and innovation bear potentially on the policy choices before the
President.

For more information about PCAST, see www.whitehouse.gov/ostp/pcast

The President's Council of Advisors on Science and Technology

Co-Chairs

John P. Holdren
Assistant to the President for Science and Technology
Director, Office of Science and Technology Policy

Eric S. Lander
President
Broad Institute of Harvard and MIT

Vice Chairs

William Press
Raymer Professor in Computer Science and Integrative Biology
University of Texas at Austin

Maxine Savitz
Vice President
National Academy of Engineering

Members

S. James Gates, Jr.
John S. Toll Professor of Physics
Director, Center for String and Particle Theory
University of Maryland, College Park

Rosina Bierbaum
Dean, School of Natural Resources and Environment
University of Michigan

Christine Cassel
President and CEO
National Quality Forum

Mark Gorenberg
Managing Member
Zetta Venture Partners

Christopher Chyba
Professor, Astrophysical Sciences and International Affairs
Director, Program on Science and Global Security
Princeton University

Susan L. Graham
Pehong Chen Distinguished Professor Emerita in Electrical Engineering and Computer Science
University of California, Berkeley

Shirley Ann Jackson
President
Rensselaer Polytechnic Institute

Craig Mundie
Senior Advisor to the CEO
Microsoft Corporation

Richard C. Levin (through mid-April 2014)
President Emeritus
Frederick William Beinecke Professor of Economics
Yale University

Ed Penhoet
Director, Alta Partners
Professor Emeritus, Biochemistry and Public Health
University of California, Berkeley

Michael McQuade
Senior Vice President for Science and Technology
United Technologies Corporation

Barbara Schaal
Mary-Dell Chilton Distinguished Professor of Biology
Washington University, St. Louis

Chad Mirkin
George B. Rathmann Professor of Chemistry
Director, International Institute for Nanotechnology
Northwestern University

Eric Schmidt
Executive Chairman
Google, Inc.

Daniel Schrag
Sturgis Hooper Professor of Geology
Professor, Environmental Science and Engineering
Director, Harvard University Center for Environment
Harvard University

Mario Molina
Distinguished Professor, Chemistry and Biochemistry
University of California, San Diego
Professor, Center for Atmospheric Sciences at the Scripps Institution of Oceanography

Staff

Marjory S. Blumenthal
Executive Director

Ashley Predith
Assistant Executive Director

Knatokie Ford
AAAS Science & Technology Policy Fellow

PCAST Big Data and Privacy Working Group


Working Group Co-Chairs

Susan L. Graham
Pehong Chen Distinguished Professor Emerita in Electrical Engineering and Computer Science
University of California, Berkeley

William Press
Raymer Professor in Computer Science and Integrative Biology
University of Texas at Austin

Working Group Members

S. James Gates, Jr.
John S. Toll Professor of Physics
Director, Center for String and Particle Theory
University of Maryland, College Park

Eric S. Lander
President
Broad Institute of Harvard and MIT

Craig Mundie
Senior Advisor to the CEO
Microsoft Corporation

Mark Gorenberg
Managing Member
Zetta Venture Partners

Maxine Savitz
Vice President
National Academy of Engineering

John P. Holdren
Assistant to the President for Science and Technology
Director, Office of Science and Technology Policy

Eric Schmidt
Executive Chairman
Google, Inc.

Working Group Staff

Marjory S. Blumenthal
Executive Director
President's Council of Advisors on Science and Technology

Michael Johnson
Assistant Director
National Security and International Affairs

EXECUTIVE OFFICE OF THE PRESIDENT


PRESIDENT'S COUNCIL OF ADVISORS ON SCIENCE AND TECHNOLOGY
WASHINGTON, D.C. 20502

President Barack Obama


The White House
Washington, DC 20502
Dear Mr. President,
We are pleased to send you this report, Big Data and Privacy: A Technological Perspective, prepared for you by the
President's Council of Advisors on Science and Technology (PCAST). It was developed to complement and inform
the analysis of big-data implications for policy led by your Counselor, John Podesta, in response to your requests of
January 17, 2014. PCAST examined the nature of current technologies for managing and analyzing big data and for
preserving privacy, it considered how those technologies are evolving, and it explained what the technological
capabilities and trends imply for the design and enforcement of public policy intended to protect privacy in big-data
contexts.
Big data drives big benefits, from innovative businesses to new ways to treat diseases. The challenges to privacy
arise because technologies collect so much data (e.g., from sensors in everything from phones to parking lots) and
analyze them so efficiently (e.g., through data mining and other kinds of analytics) that it is possible to learn far more
than most people had anticipated or can anticipate given continuing progress. These challenges are compounded by
limitations on traditional technologies used to protect privacy (such as de-identification). PCAST concludes that
technology alone cannot protect privacy, and policy intended to protect privacy needs to reflect what is (and is not)
technologically feasible.
In light of the continuing proliferation of ways to collect and use information about people, PCAST recommends that
policy focus primarily on whether specific uses of information about people affect privacy adversely. It also
recommends that policy focus on outcomes, on the "what" rather than the "how," to avoid becoming obsolete as
technology advances. The policy framework should accelerate the development and commercialization of
technologies that can help to contain adverse impacts on privacy, including research into new technological options.
By using technology more effectively, the Nation can lead internationally in making the most of big data's benefits
while limiting the concerns it poses for privacy. Finally, PCAST calls for efforts to assure that there is enough talent
available with the expertise needed to develop and use big data in a privacy-sensitive way.
PCAST is grateful for the opportunity to serve you and the country in this way and hopes that you and others who read
this report find our analysis useful.
Best regards,

Eric S. Lander
Co-chair, PCAST

John P. Holdren
Co-chair, PCAST


Table of Contents

The President's Council of Advisors on Science and Technology
PCAST Big Data and Privacy Working Group
Table of Contents
Executive Summary
1. Introduction
    1.1 Context and outline of this report
    1.2 Technology has long driven the meaning of privacy
    1.3 What is different today?
    1.4 Values, harms, and rights
2. Examples and Scenarios
    2.1 Things happening today or very soon
    2.2 Scenarios of the near future in healthcare and education
        2.2.1 Healthcare: personalized medicine
        2.2.2 Healthcare: detection of symptoms by mobile devices
        2.2.3 Education
    2.3 Challenges to the home's special status
    2.4 Tradeoffs among privacy, security, and convenience
3. Collection, Analytics, and Supporting Infrastructure
    3.1 Electronic sources of personal data
        3.1.1 "Born digital" data
        3.1.2 Data from sensors
    3.2 Big data analytics
        3.2.1 Data mining
        3.2.2 Data fusion and information integration
        3.2.3 Image and speech recognition
        3.2.4 Social-network analysis
    3.3 The infrastructure behind big data
        3.3.1 Data centers
        3.3.2 The cloud
4. Technologies and Strategies for Privacy Protection
    4.1 The relationship between cybersecurity and privacy
    4.2 Cryptography and encryption
        4.2.1 Well-Established encryption technology
        4.2.2 Encryption frontiers
    4.3 Notice and consent
    4.4 Other strategies and techniques
        4.4.1 Anonymization or de-identification
        4.4.2 Deletion and non-retention
    4.5 Robust technologies going forward
        4.5.1 A Successor to Notice and Consent
        4.5.2 Context and Use
        4.5.3 Enforcement and deterrence
        4.5.4 Operationalizing the Consumer Privacy Bill of Rights
5. PCAST Perspectives and Conclusions
    5.1 Technical feasibility of policy interventions
    5.2 Recommendations
    5.4 Final Remarks
Appendix A. Additional Experts Providing Input
Special Acknowledgment


Executive Summary
The ubiquity of computing and electronic communication technologies has led to the exponential
growth of data from both digital and analog sources. New capabilities to gather, analyze, disseminate,
and preserve vast quantities of data raise new concerns about the nature of privacy and the means by
which individual privacy might be compromised or protected.
After providing an overview of this report and its origins, Chapter 1 describes the changing nature of
privacy as computing technology has advanced and big data has come to the fore. The term "privacy"
encompasses not only the famous "right to be left alone," or keeping one's personal matters and
relationships secret, but also the ability to share information selectively but not publicly. Anonymity
overlaps with privacy, but the two are not identical. Likewise, the ability to make intimate personal
decisions without government interference is considered to be a privacy right, as is protection from
discrimination on the basis of certain personal characteristics (such as race, gender, or genome). Privacy
is not just about secrets.
Conflicts between privacy and new technology have occurred throughout American history. Concern
with the rise of mass media such as newspapers in the 19th century led to legal protections against the
harms or adverse consequences of "intrusion upon seclusion," public disclosure of private facts, and
unauthorized use of name or likeness in commerce. Wire and radio communications led to 20th century
laws against wiretapping and the interception of private communications, laws that, PCAST notes, have
not always kept pace with the technological realities of today's digital communications.
Past conflicts between privacy and new technology have generally related to what is now termed "small
data," the collection and use of data sets by private- and public-sector organizations where the data are
disseminated in their original form or analyzed by conventional statistical methods. Today's concerns
about big data reflect both the substantial increases in the amount of data being collected and
associated changes, both actual and potential, in how they are used.

Big data is big in two different senses. It is big in the quantity and variety of data that are available to be
processed. And, it is big in the scale of analysis (termed "analytics") that can be applied to those data,
ultimately to make inferences and draw conclusions. By data mining and other kinds of analytics,
non-obvious and sometimes private information can be derived from data that, at the time of their
collection, seemed to raise no, or only manageable, privacy issues. Such new information, used
appropriately, may often bring benefits to individuals and society; Chapter 2 of this report gives many
such examples, and additional examples are scattered throughout the rest of the text. Even in principle,
however, one can never know what information may later be extracted from any particular collection of
big data, both because that information may result only from the combination of seemingly unrelated
data sets, and because the algorithm for revealing the new information may not even have been
invented at the time of collection.
The same data and analytics that provide benefits to individuals and society if used appropriately can
also create potential harms: threats to individual privacy according to privacy norms both widely
shared and personal. For example, large-scale analysis of research on disease, together with health data
from electronic medical records and genomic information, might lead to better and timelier treatment
for individuals but also to inappropriate disqualification for insurance or jobs. GPS tracking of individuals
might lead to better community-based public transportation facilities, but also to inappropriate use of
the whereabouts of individuals. A list of the kinds of adverse consequences or harms from which
individuals should be protected is proposed in Section 1.4. PCAST believes strongly that the positive
benefits of big-data technology are (or can be) greater than any new harms.
Chapter 3 of the report describes the many new ways in which personal data are acquired, both from
original sources, and through subsequent processing. Today, although they may not be aware of it,
individuals constantly emit into the environment information whose use or misuse may be a source of
privacy concerns. Physically, these information emanations are of two types, which can be called "born
digital" and "born analog."
When information is "born digital," it is created, by us or by a computer surrogate, specifically for use by
a computer or data processing system. When data are born digital, privacy concerns can arise from
over-collection. Over-collection occurs when a program's design intentionally, and sometimes
clandestinely, collects information unrelated to its stated purpose. Over-collection can, in principle, be
recognized at the time of collection.
When information is "born analog," it arises from the characteristics of the physical world. Such
information becomes accessible electronically when it impinges on a sensor such as a camera,
microphone, or other engineered device. When data are born analog, they are likely to contain more
information than the minimum necessary for their immediate purpose, and for valid reasons. One
reason is for robustness of the desired "signal" in the presence of variable "noise." Another is
technological convergence, the increasing use of standardized components (e.g., cell-phone cameras) in
new products (e.g., home alarm systems capable of responding to gesture).
Data fusion occurs when data from different sources are brought into contact and new facts emerge
(see Section 3.2.2). Individually, each data source may have a specific, limited purpose. Their
combination, however, may uncover new meanings. In particular, data fusion can result in the
identification of individual people, the creation of profiles of an individual, and the tracking of an
individual's activities. More broadly, data analytics discovers patterns and correlations in large corpuses
of data, using increasingly powerful statistical algorithms. If those data include personal data, the
inferences flowing from data analytics may then be mapped back to inferences, both certain and
uncertain, about individuals.
Because of data fusion, privacy concerns may not necessarily be recognizable in born-digital data when
they are collected. Because of signal-processing robustness and standardization, the same is true of
born-analog data, even data from a single source (e.g., a single security camera). Born-digital and
born-analog data can both be combined with data fusion, and new kinds of data can be generated from
data analytics. The beneficial uses of near-ubiquitous data collection are large, and they fuel an
increasingly important set of economic activities. Taken together, these considerations suggest that a
policy focus on limiting data collection will not be a broadly applicable or scalable strategy, nor one
likely to achieve the right balance between beneficial results and unintended negative consequences
(such as inhibiting economic growth).
If collection cannot, in most cases, be limited practically, then what? Chapter 4 discusses in detail a
number of technologies that have been used in the past for privacy protection, and others that may, to a
greater or lesser extent, serve as technology building blocks for future policies.
Some technology building blocks (for example, cybersecurity standards, technologies related to
encryption, and formal systems of auditable access control) are already being utilized and need to be
encouraged in the marketplace. On the other hand, some techniques for privacy protection that have
seemed encouraging in the past are useful as supplementary ways to reduce privacy risk, but do not
now seem sufficiently robust to be a dependable basis for privacy protection where big data is
concerned. For a variety of reasons, PCAST judges anonymization, data deletion, and distinguishing data
from metadata (defined below) to be in this category. The framework of notice and consent is also
becoming unworkable as a useful foundation for policy.
Anonymization is increasingly easily defeated by the very techniques that are being developed for many
legitimate applications of big data. In general, as the size and diversity of available data grows, the
likelihood of being able to re-identify individuals (that is, re-associate their records with their names)
grows substantially. While anonymization may remain somewhat useful as an added safeguard in some
situations, approaches that deem it, by itself, a sufficient safeguard need updating.
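As a rough illustration of why re-identification becomes easier as more data become available, the following Python sketch links a hypothetical "anonymized" data set to a hypothetical public roster through attributes the two share. Every record, field name, and the function name `reidentify` is invented for illustration; none of them comes from this report, and real linkage techniques are considerably more sophisticated.

```python
from collections import defaultdict

# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"zip": "02138", "birth": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth": "1980-01-15", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth": "1980-01-15", "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public roster that carries names plus the same attributes.
public_roster = [
    {"name": "A. Example", "zip": "02138", "birth": "1945-07-31", "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth": "1980-01-15", "sex": "M"},
    {"name": "C. Person", "zip": "02139", "birth": "1980-01-15", "sex": "M"},
]

# Attributes present in both data sets (an assumption of this sketch).
SHARED_FIELDS = ("zip", "birth", "sex")


def key(record):
    """Build the linkage key from the shared attributes."""
    return tuple(record[f] for f in SHARED_FIELDS)


def reidentify(anon_records, roster):
    """Map the index of each anonymized record to a name when exactly
    one roster entry shares its combination of attributes."""
    names_by_key = defaultdict(list)
    for person in roster:
        names_by_key[key(person)].append(person["name"])
    matches = {}
    for i, rec in enumerate(anon_records):
        names = names_by_key.get(key(rec), [])
        if len(names) == 1:  # a unique match re-associates record and name
            matches[i] = names[0]
    return matches


print(reidentify(anonymized, public_roster))  # {0: 'A. Example'}
```

Only the first record is re-identified here, because the other two share their attribute combination with two roster entries. Adding one more shared attribute that separates those two people would be enough to re-identify them as well, which is the sense in which the risk grows with the diversity of available data.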
While it is good business practice that data of all kinds should be deleted when they are no longer of
value, economic or social value often can be obtained from applying big data techniques to masses of
data that were otherwise considered to be worthless. Similarly, archival data may also be important to
future historians, or for later longitudinal analysis by academic researchers and others. As described
above, many sources of data contain latent information about individuals, information that can be
known only if the holder expends analytic resources, or that may become knowable only in the future
with the development of new data-mining algorithms. In such cases it is practically impossible for the
data holder even to surface "all the data about an individual," much less delete it on any specified
schedule or in response to an individual's request. Today, given the distributed and redundant nature of
data storage, it is not even clear that data, even small data, can be destroyed with any high degree of
assurance.
As data sets become more complex, so do the attached metadata. Metadata are ancillary data that
describe properties of the data such as the time the data were created, the device on which they were
created, or the destination of a message. Included in the data or metadata may be identifying
information of many kinds. It cannot today generally be asserted that metadata raise fewer privacy
concerns than data.
Notice and consent is the practice of requiring individuals to give positive consent to the personal data
collection practices of each individual app, program, or web service. Only in some fantasy world do
users actually read these notices and understand their implications before clicking to indicate their
consent.

The conceptual problem with notice and consent is that it fundamentally places the burden of privacy
protection on the individual. Notice and consent creates a non-level playing field in the implicit privacy
negotiation between provider and user. The provider offers a complex, take-it-or-leave-it set of terms,
while the user, in practice, can allocate only a few seconds to evaluating the offer. This is a kind of
market failure.
PCAST believes that the responsibility for using personal data in accordance with the user's preferences
should rest with the provider rather than with the user. As a practical matter, in the private sector, third
parties chosen by the consumer (e.g., consumer-protection organizations, or large app stores) could
intermediate: A consumer might choose one of several "privacy protection profiles" offered by the
intermediary, which in turn would vet apps against these profiles. By vetting apps, the intermediaries
would create a marketplace for the negotiation of community standards for privacy. The Federal
government could encourage the development of standards for electronic interfaces between the
intermediaries and the app developers and vendors.
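One way to picture the intermediary role described above is as a mechanical check of an app's declared data practices against the profile a consumer has chosen. The Python sketch below is a hypothetical illustration only: the classes `PrivacyProfile` and `AppDeclaration`, their fields, and the function `vet` are assumptions made for this example, and the report does not specify any such interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyProfile:
    """A hypothetical profile offered by an intermediary."""
    name: str
    allowed_data: frozenset          # kinds of personal data an app may collect
    allow_third_party_sharing: bool


@dataclass(frozen=True)
class AppDeclaration:
    """A hypothetical machine-readable statement of an app's practices."""
    app: str
    collects: frozenset
    shares_with_third_parties: bool


def vet(app, profile):
    """Return the reasons the app fails the profile (empty list if it passes)."""
    problems = []
    extra = app.collects - profile.allowed_data
    if extra:
        problems.append("collects disallowed data: " + ", ".join(sorted(extra)))
    if app.shares_with_third_parties and not profile.allow_third_party_sharing:
        problems.append("shares data with third parties")
    return problems


strict = PrivacyProfile("strict", frozenset({"email"}), False)
flashlight = AppDeclaration("flashlight", frozenset({"location", "contacts"}), True)
notes = AppDeclaration("notes", frozenset({"email"}), False)

for app in (flashlight, notes):
    issues = vet(app, strict)
    print(app.app, "passes" if not issues else issues)
```

In this sketch the consumer makes one choice (the profile) and the intermediary applies it to every app, which is the shift of burden the paragraph above describes. A standardized electronic interface would, in effect, fix the format of the declaration that apps submit.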
After data are collected, data analytics come into play and may generate an increasing fraction of
privacy issues. Analysis, per se, does not directly touch the individual (it is neither collection nor,
without additional action, use) and may have no external visibility. By contrast, it is the use of a product
of analysis, whether in commerce, by government, by the press, or by individuals, that can cause
adverse consequences to individuals.
More broadly, PCAST believes that it is the use of data (including born-digital or born-analog data and
the products of data fusion and analysis) that is the locus where consequences are produced. This locus
is the technically most feasible place to protect privacy. Technologies are emerging, both in the
research community and in the commercial world, to describe privacy policies, to record the origins
(provenance) of data, their access, and their further use by programs, including analytics, and to
determine whether those uses conform to privacy policies. Some approaches are already in practical
use.
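The three elements named above (a policy attached to data, a record of provenance and access, and a check that each use conforms to the policy) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any existing system: the class `DataItem`, its fields, the function `use`, and the example purposes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    """A value that carries its provenance and its usage policy with it."""
    value: object
    source: str                   # provenance: where the data originated
    permitted_uses: frozenset     # policy: purposes the data may serve
    access_log: list = field(default_factory=list)  # audit trail of attempts


def use(item, program, purpose):
    """Record the attempted use, then release the value only if the
    purpose conforms to the policy attached to the data."""
    allowed = purpose in item.permitted_uses
    item.access_log.append((program, purpose, allowed))
    if not allowed:
        raise PermissionError(
            f"{program}: use '{purpose}' is not permitted "
            f"for data from {item.source}"
        )
    return item.value


heart_rate = DataItem(72, "fitness-tracker", frozenset({"health-coaching"}))

print(use(heart_rate, "coach-app", "health-coaching"))  # 72
try:
    use(heart_rate, "insurer-app", "insurance-pricing")
except PermissionError as err:
    print("blocked:", err)

# Both the permitted and the refused attempt remain available for audit.
print(heart_rate.access_log)
```

The check sits at the moment of use rather than at collection, and refused attempts are logged as well as permitted ones, so the log can later support enforcement.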
Given the statistical nature of data analytics, there is uncertainty that discovered properties of groups
apply to a particular individual in the group. Making incorrect conclusions about individuals may have
adverse consequences for them and may affect members of certain groups disproportionately (e.g., the
poor, the elderly, or minorities). Among the technical mechanisms that can be incorporated in a
use-based approach are methods for imposing standards for data accuracy and integrity and policies for
incorporating useable interfaces that allow an individual to correct the record with voluntary additional
information.
PCAST's charge for this study did not ask it to recommend specific privacy policies, but rather to make a
relative assessment of the technical feasibilities of different broad policy approaches. Chapter 5,
accordingly, discusses the implications of current and emerging technologies for government policies for
privacy protection. The use of technical measures for enforcing privacy can be stimulated by
reputational pressure, but such measures are most effective when there are regulations and laws with
civil or criminal penalties. Rules and regulations provide both deterrence of harmful actions and
incentives to deploy privacy-protecting technologies. Privacy protection cannot be achieved by
technical measures alone.

This discussion leads to five recommendations.
Recommendation 1. Policy attention should focus more on the actual uses of big data and less on its
collection and analysis. By actual uses, we mean the specific events where something happens that can
cause an adverse consequence or harm to an individual or class of individuals. In the context of big
data, these events ("uses") are almost always actions of a computer program or app interacting either
with the raw data or with the fruits of analysis of those data. In this formulation, it is not the data
themselves that cause the harm, nor the program itself (absent any data), but the confluence of the
two. These "use" events (in commerce, by government, or by individuals) embody the necessary
specificity to be the subject of regulation. By contrast, PCAST judges that policies focused on the
regulation of data collection, storage, retention, a priori limitations on applications, and analysis (absent
identifiable actual uses of the data or products of analysis) are unlikely to yield effective strategies for
improving privacy. Such policies would be unlikely to be scalable over time, or to be enforceable by
other than severe and economically damaging measures.
Recommendation 2. Policies and regulation, at all levels of government, should not embed particular
technological solutions, but rather should be stated in terms of intended outcomes.
To avoid falling behind the technology, it is essential that policy concerning privacy protection should
address the purpose (the "what") rather than prescribing the mechanism (the "how").
Recommendation 3. With coordination and encouragement from OSTP,[1] the NITRD agencies[2] should
strengthen U.S. research in privacy-related technologies and in the relevant areas of social science
that inform the successful application of those technologies.
Some of the technology for controlling uses already exists. However, research (and funding for it) is
needed in the technologies that help to protect privacy, in the social mechanisms that influence
privacy-preserving behavior, and in the legal options that are robust to changes in technology and create
appropriate balance among economic opportunity, national priorities, and privacy protection.
Recommendation 4. OSTP, together with the appropriate educational institutions and professional
societies, should encourage increased education and training opportunities concerning privacy
protection, including career paths for professionals.
Programs that provide education leading to privacy expertise (akin to what is being done for security
expertise) are essential and need encouragement. One might envision careers for digital-privacy experts
both on the software development side and on the technical management side.

[1] The White House Office of Science and Technology Policy.
[2] NITRD refers to the Networking and Information Technology Research and Development program, whose
participating Federal agencies support unclassified research in advanced information technologies such as
computing, networking, and software and include both research- and mission-focused agencies such as NSF, NIH,
NIST, DARPA, NOAA, DOE's Office of Science, and the DoD military-service laboratories (see
http://www.nitrd.gov/SUBCOMMITTEE/nitrd_agencies/index.aspx).

Recommendation 5. The United States should take the lead both in the international arena and at
home by adopting policies that stimulate the use of practical privacy-protecting technologies that
exist today. It can exhibit leadership both by its convening power (for instance, by promoting the
creation and adoption of standards) and also by its own procurement practices (such as its own use of
privacy-preserving cloud services).
PCAST is not aware of more effective innovation or strategies being developed abroad; rather, some
countries seem inclined to pursue what PCAST believes to be blind alleys. This circumstance offers an
opportunity for U.S. technical leadership in privacy in the international arena, an opportunity that
should be taken.


1. Introduction
In a widely noted speech on January 17, 2014, President Barack Obama charged his Counselor, John Podesta,
with leading a comprehensive review of big data and privacy, one that would "reach out to privacy experts,
technologists, and business leaders and look at how the challenges inherent in big data are being confronted by
both the public and private sectors; whether we can forge international norms on how to manage this data; and
how we can continue to promote the free flow of information in ways that are consistent with both privacy and
security."[3] The President and Counselor Podesta asked the President's Council of Advisors on Science and
Technology (PCAST) to assist with the technology dimensions of the review.
For this task PCAST's statement of work reads, in part,
    PCAST will study the technological aspects of the intersection of big data with individual privacy, in
    relation to both the current state and possible future states of the relevant technological capabilities
    and associated privacy concerns.
    Relevant big data include data and metadata collected, or potentially collectable, from or about
    individuals by entities that include the government, the private sector, and other individuals. It includes
    both proprietary and open data, and also data about individuals collected incidentally or accidentally in
    the course of other activities (e.g., environmental monitoring or the "Internet of Things").
This is a tall order, especially on the ambitious timescale requested by the President. The literature and public
discussion of big data and privacy are vast, with new ideas and insights generated daily from a variety of
constituencies: technologists in industry and academia, privacy and consumer advocates, legal scholars, and
journalists (among others). Independently of PCAST, but informing this report, the Podesta study sponsored
three public workshops at universities across the country. Limiting this report's charge to technological, not
policy, aspects of the problem narrows PCAST's mandate somewhat, but this is a subject where technology and
policy are difficult to separate. In any case, it is the nature of the subject that this report must be regarded as
based on a momentary snapshot of the technology, although we believe the key conclusions and
recommendations have lasting value.

1.1 Context and outline of this report


The ubiquity of computing and electronic communication technologies has led to the exponential growth of
online data, from both digital and analog sources. New technological capabilities to create, analyze, and
disseminate vast quantities of data raise new concerns about the nature of privacy and the means by which
individual privacy might be compromised or protected.
This report discusses present and future technologies concerning this so-called "big data" as it relates to privacy
concerns. It is not a complete summary of the technology concerning big data, nor a complete summary of the
ways in which technology affects privacy, but focuses on the ways in which big data and privacy interact. As an
example, if Leslie confides a secret to Chris and Chris broadcasts that secret by email or texting, that might be a
privacy-infringing use of information technology, but it is not a big data issue. As another example, if
oceanographic data are collected in large quantities by remote sensing, that is big data, but not, in the first
instance, a privacy concern. Some data are more privacy-sensitive than others, for example, personal medical
data, as distinct from personal data publicly shared by the same individual. Different technologies and policies
will apply to different classes of data.

[3] Remarks by the President on Review of Signals Intelligence, January 17, 2014. http://www.whitehouse.gov/the-press-office/2014/01/17/remarks-president-review-signals-intelligence

The notions of big data and the notions of individual privacy used in this report are intentionally broad and
inclusive. Business consultants Gartner, Inc. define big data as "high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of information processing for enhanced insight
and decision making,"[4] while computer scientists reviewing multiple definitions offer the more technical, "a
term describing the storage and analysis of large and/or complex data sets using a series of techniques
including, but not limited to, NoSQL, MapReduce, and machine learning."[5] (See Sections 3.2.1 and 3.3.1 for
discussion of these technical terms.) In a privacy context, the term "big data" typically means data about one or
a group of individuals, or that might be analyzed to make inferences about individuals. It might include data or
metadata collected by government, by the private sector, or by individuals. The data and metadata might be
proprietary or open, they might be collected intentionally or incidentally or accidentally. They might be text,
audio, video, sensor-based, or some combination. They might be data collected directly from some source, or
data derived by some process of analysis. They might be saved for a long period of time, or they might be
analyzed and discarded as they are streamed. In this report, PCAST usually does not distinguish between "data"
and "information."
The term "privacy" encompasses not only avoiding observation, or keeping one's personal matters and
relationships secret, but also the ability to share information selectively but not publicly. Anonymity overlaps
with privacy, but the two are not identical. Voting is recognized as private, but not anonymous, while
authorship of a political tract may be anonymous, but it is not private. Likewise, the ability to make intimate
personal decisions without government interference is considered to be a privacy right, as is protection from
discrimination on the basis of certain personal characteristics (such as an individual's race, gender, or genome).
So, privacy is not just about secrets.
The promise of big-data collection and analysis is that the derived data can be used for purposes that benefit
both individuals and society. Threats to privacy stem from the deliberate or inadvertent disclosure of collected
or derived individual data, the misuse of the data, and the fact that derived data may be inaccurate or false. The
technologies that address the confluence of these issues are the subject of this report.[6]
The remainder of this introductory chapter gives further context in the form of a summary of how the legal
concept of privacy developed historically in the United States. Interestingly, and relevant to this report, privacy
rights and the development of new technologies have long been intertwined. Today's issues are no exception.
Chapter 2 of this report is devoted to scenarios and examples, some from today, but most anticipating a near
tomorrow. Yogi Berra's much-quoted remark, "It's tough to make predictions, especially about the future," is
germane. But it is equally true for this subject that policies based on out-of-date examples and scenarios are
doomed to failure. Big data technologies are advancing so rapidly that predictions about the future, however
imperfect, must guide today's policy development.

[4] Gartner, Inc., "IT Glossary." https://www.gartner.com/itglossary/bigdata/
[5] Barker, Adam and Jonathan Stuart Ward, "Undefined By Data: A Survey of Big Data Definitions," arXiv:1309.5821.
http://arxiv.org/abs/1309.5821
[6] PCAST acknowledges gratefully the assistance of several contributors at the National Science Foundation, who helped to
identify and distill key insights from the technical literature and research community, as well as other technical experts in
academia and industry that it consulted during this project. See Appendix A.

Chapter 3 examines the technology dimensions of the two great pillars of big data: collection and analysis. In a
certain sense big data is exactly the confluence of these two: big collection meets big analysis (often termed
"analytics"). The technical infrastructure of large-scale networking and computing that enables "big" is also
discussed.
Chapter 4 looks at technologies and strategies for the protection of privacy. Although technology may be part of
the problem, it must also be part of the solution. Many current and foreseeable technologies can enhance
privacy, and there are many additional promising avenues of research.
Chapter 5, drawing on the previous chapters, contains PCAST's perspectives and conclusions. While it is not
within this report's charge to recommend specific policies, it is clear that certain kinds of policies are technically
more feasible and less likely to be rendered irrelevant or unworkable by new technologies than others. These
approaches are highlighted, along with comments on the technical deficiencies of some other approaches. This
chapter also contains PCAST's recommendations in areas that lie within our charge, that is, other than policy.

1.2 Technology has long driven the meaning of privacy


The conflict between privacy and new technology is not new, except perhaps now in its greater scope, degree of
intimacy, and pervasiveness. For more than two centuries, values and expectations relating to privacy have
been continually reinterpreted and rearticulated in light of the impact of new technologies.
The nationwide postal system advocated by Benjamin Franklin and established in 1775 was a new technology
designed to promote interstate commerce. But mail was routinely and opportunistically opened in transit until
Congress made this action illegal in 1782. While the Constitution's Fourth Amendment codified the heightened
privacy protection afforded to people in their homes or on their persons (previously principles of British
common law), it took another century of technological challenges to expand the concept of privacy rights into
more abstract spaces, including the electronic. The invention of the telegraph and, later, telephone created new
tensions that were slow to be resolved. A bill to protect the privacy of telegrams, introduced in Congress in
1880, was never passed.[7]
It was not telecommunications, however, but the invention of the portable, consumer-operable camera (soon
known as the Kodak) that gave impetus to Warren and Brandeis's 1890 article "The Right to Privacy,"[8] then a
controversial title, but now viewed as the foundational document for modern privacy law. In the article, Warren
and Brandeis gave voice to the concern that "[i]nstantaneous photographs and newspaper enterprise have
invaded the sacred precincts of private and domestic life; and numerous mechanical devices threaten to
make good the prediction that 'what is whispered in the closet shall be proclaimed from the house-tops,'"
further noting that "[f]or years there has been a feeling that the law must afford some remedy for the
unauthorized circulation of portraits of private persons."[9]

[7] Seipp, David J., The Right to Privacy in American History, Harvard University, Program on Information Resources Policy,
Cambridge, MA, 1978.
[8] Warren, Samuel D. and Louis D. Brandeis, "The Right to Privacy." Harvard Law Review 4:5, 193, December 15, 1890.
[9] Id. at 195.


Warren and Brandeis sought to articulate the right of privacy between individuals (whose foundation lies in civil
tort law). Today, many states recognize a number of privacy-related harms as causes for civil or criminal legal
action (further discussed in Section 1.4).[10]
From Warren and Brandeis's "right to privacy," it took another 75 years for the Supreme Court to find, in
Griswold v. Connecticut[11] (1965), a right to privacy in the "penumbras" and "emanations" of other constitutional
protections (as Justice William O. Douglas put it, writing for the majority).[12] With a broad perspective, scholars
today recognize a number of different legal meanings for "privacy." Five of these seem particularly relevant to
this PCAST report:
(1) The individual's right to keep secrets or seek seclusion (the famous "right to be left alone" of Brandeis's
1928 dissenting opinion in Olmstead v. United States).[13]
(2) The right to anonymous expression, especially (but not only) in political speech (as in McIntyre v. Ohio
Elections Commission[14]).
(3) The ability to control access by others to personal information after it leaves one's exclusive possession
(for example, as articulated in the FTC's Fair Information Practice Principles).[15]
(4) The barring of some kinds of negative consequences from the use of an individual's personal
information (for example, job discrimination on the basis of personal DNA, forbidden in 2008 by the
Genetic Information Nondiscrimination Act[16]).
(5) The right of the individual to make intimate decisions without government interference, as in the
domains of health, reproduction, and sexuality (as in Griswold).
These are asserted, not absolute, rights. All are supported, but also circumscribed, by both statute and case law.
With the exception of number 5 on the list (a right of "decisional privacy" as distinct from "informational
privacy"), all are applicable in varying degrees both to citizen-government interactions and to citizen-citizen
interactions. Collisions between new technologies and privacy rights have occurred in all five. A patchwork of
state and federal laws has addressed concerns in many sectors, but to date there has not been comprehensive
legislation to handle these issues. Collisions between new technologies and privacy rights should be expected to
continue to occur.

[10] Digital Media Law Project, "Publishing Personal and Private Information." http://www.dmlp.org/legalguide/publishing
personalandprivateinformation
[11] Griswold v. Connecticut, 381 U.S. 479 (1965).
[12] Id. at 483-84.
[13] Olmstead v. United States, 277 U.S. 438 (1928).
[14] McIntyre v. Ohio Elections Commission, 514 U.S. 334, 340-41 (1995). The decision reads in part, "Protections for
anonymous speech are vital to democratic discourse. Allowing dissenters to shield their identities frees them to express
critical minority views... Anonymity is a shield from the tyranny of the majority.... It thus exemplifies the purpose behind
the Bill of Rights and of the First Amendment in particular: to protect unpopular individuals from retaliation... at the hand
of an intolerant society."
[15] Federal Trade Commission, "Privacy Online: Fair Information Practices in the Electronic Marketplace," May 2000.
[16] Genetic Information Nondiscrimination Act of 2008, PL 110-233, May 21, 2008, 122 Stat 881.

1.3 What is different today?


New collisions between technologies and privacy have become evident, as new technological capabilities have
emerged at a rapid pace. It is no longer clear that the five privacy concerns raised above, or their current legal
interpretations, are sufficient in the court of public opinion.
Much of the public's concern is with the harm done by the use of personal data, both in isolation and in
combination. Controlling access to personal data after they leave one's exclusive possession has been seen
historically as a means of controlling potential harm. But today, personal data may never be, or have been,
within one's possession; for instance, they may be acquired passively from external sources such as public
cameras and sensors, or without one's knowledge from public electronic disclosures by others using social
media. In addition, personal data may be derived from powerful data analyses (see Section 3.2) whose use and
output is unknown to the individual. Those analyses sometimes yield valid conclusions that the individual would
not want disclosed. Worse yet, the analyses can produce false positives or false negatives: information that is
a consequence of the analysis but is not true or correct. Furthermore, to a much greater extent than before, the
same personal data have both beneficial and harmful uses, depending on the purposes for which and the
contexts in which they are used. Information supplied by the individual might be used only to derive other
information such as identity or a correlation, after which it is not needed. The derived data, which were never
under the individual's control, might then be used either for good or ill.
In the current discourse, some assert that the issues concerning privacy protection are collective as well as
individual, particularly in the domain of civil rights: for example, identification of certain individuals at a
gathering using facial recognition from videos, and the inference that other individuals at the same gathering,
also identified from videos, have similar opinions or behaviors.
Current circumstances also raise issues of how the right to privacy extends to the public square, or to quasi-
private gatherings such as parties or classrooms. If the observers in these venues are not just people, but also
both visible and invisible recording devices with enormous fidelity and easy paths to electronic promulgation
and analysis, does that change the rules?
Also rapidly changing are the distinctions between government and the private sector as potential threats to
individual privacy. Government is not just a "giant corporation." It has a monopoly in the use of force; it has no
direct competitors who seek market advantage over it and may thus motivate it to correct missteps.
Governments have checks and balances, which can contribute to self-imposed limits on what they may do with
people's information. Companies decide how they will use such information in the context of such factors as
competitive advantages and risks, government regulation, and perceived threats and consequences of lawsuits.
It is thus appropriate that there are different sets of constraints on the public and private sectors. But
government has a set of authorities, particularly in the areas of law enforcement and national security, that
place it in a uniquely powerful position, and therefore the restraints placed on its collection and use of data
deserve special attention. Indeed, the need for such attention is heightened because of the increasingly blurry
line between public and private data.
While these differences are real, big data is to some extent a leveler of the differences between government and
companies. Both governments and companies have potential access to the same sources of data and the same
analytic tools. Current rules may allow government to purchase or otherwise obtain data from the private
sector that, in some cases, it could not legally collect itself,[17] or to outsource to the private sector analyses it
could not itself legally perform.[18] The possibility of government exercising, without proper safeguards, its own
monopoly powers and also having unfettered access to the private information marketplace is unsettling.
What kinds of actions should be forbidden both to government (Federal, state, and local, and including law
enforcement) and to the private sector? What kinds should be forbidden to one but not the other? It is unclear
whether current legal frameworks are sufficiently robust for today's challenges.

1.4 Values, harms, and rights


As was seen in Sections 1.2 and 1.3, new privacy rights usually do not come into being as academic abstractions.
Rather, they arise when technology encroaches on widely shared values. Where there is consensus on values,
there can also be consensus on what kinds of harms to individuals may be an affront to those values. Not all
such harms may be preventable or remediable by government actions, but, conversely, it is unlikely that
government actions will be welcome or effective if they are not grounded to some degree in values that are
widely shared.
In the realm of privacy, Warren and Brandeis in 1890[19] (see Section 1.2) began a dialogue about privacy that led
to the evolution of the right in academia and the courts, later crystalized by William Prosser as four distinct
harms that had come to earn legal protection.[20] A direct result is that, today, many states recognize as causes
for legal action the four harms that Prosser enumerated,[21] and which have become (though varying from state
to state[22]) privacy "rights." The harms are:

• Intrusion upon seclusion. A person who intentionally intrudes, physically or otherwise (now including
electronically), upon the solitude or seclusion of another person or her private affairs or concerns, can
be subject to liability for the invasion of her privacy, but only if the intrusion would be highly offensive to
a reasonable person.
• Public disclosure of private facts. Similarly, a person can be sued for publishing private facts about
another person, even if those facts are true. Private facts are those about someone's personal life that
have not previously been made public, that are not of legitimate public concern, and that would be
offensive to a reasonable person.
• "False light" or publicity. Closely related to defamation, this harm results when false facts are widely
published about an individual. In some states, false light includes untrue implications, not just untrue
facts as such.
• Misappropriation of name or likeness. Individuals have a "right of publicity" to control the use of their
name or likeness in commercial settings.

[17] One Hundred Tenth Congress, "Privacy: The use of commercial information resellers by federal agencies," Hearing before
the Subcommittee on Information Policy, Census, and National Archives of the Committee on Oversight and Government
Reform, House of Representatives, March 11, 2008.
[18] For example, Experian provides much of Healthcare.gov's identity verification component using consumer credit
information not available to the government. See Consumer Reports, "Having trouble proving your identity to
HealthCare.gov? Here's how the process works," December 18, 2013.
http://www.consumerreports.org/cro/news/2013/12/howtoproveyouridentityonhealthcare
gov/index.htm?loginMethod=auto
[19] Warren, Samuel D. and Louis D. Brandeis, "The Right to Privacy." Harvard Law Review 4:5, 193, December 15, 1890.
[20] Prosser, William L., "Privacy," California Law Review 48:383, 389, 1960.
[21] Id.
[22] (1) Digital Media Law Project, "Publishing Personal and Private Information." http://www.dmlp.org/legal
guide/publishingpersonalandprivateinformation. (2) Id., "Elements of an Intrusion Claim." http://www.dmlp.org/legal
guide/elementsintrusionclaim

It seems likely that most Americans today continue to share the values implicit in these harms, even if the legal
language (by now refined in thousands of court decisions) strikes one as archaic and quaint. However, new
technological insults to privacy, actual or prospective, and a century's evolution of social values (for example,
today's greater recognition of the rights of minorities, and of rights associated with gender), may require a
longer list than sufficed in 1960.
Although PCAST's engagement with this subject is centered on technology, not law, any report on the subject of
privacy, including PCAST's, should be grounded in the values of its day. As a starting point for discussion, albeit
only a snapshot of the views of one set of technologically minded Americans, PCAST offers some possible
augmentations to the established list of harms, each of which suggests a possible underlying right in the age of
big data.
PCAST also believes strongly that the positive benefits of technology are (or can be) greater than any new
harms. Almost every new harm is related to or "adjacent to" beneficial uses of the same technology.[23] To
emphasize this point, for each suggested new harm, we describe a related beneficial use.

• Invasion of private communications. Digital communications technologies make social networking
possible across the boundaries of geography, and enable social and political participation on previously
unimaginable scales. An individual's right to private communication, secured for written mail and
wireline telephone in part by the isolation of their delivery infrastructure, may need reaffirmation in the
digital era, however, where all kinds of "bits" share the same pipelines, and the barriers to interception
are often much lower. (In this context, we discuss the use and limitations of encryption in Section 4.2.)
• Invasion of privacy in a person's virtual home. The Fourth Amendment gives special protection against
government intrusion into the home, for example the protection of private records within the home;
tort law offers protection against similar non-government intrusion. The new "virtual home" includes
the Internet, cloud storage, and other services. Personal data in the cloud can be accessible and
organized. Photographs and records in the cloud can be shared with family and friends, and can be
passed down to future generations. The underlying social value, the "home as one's castle," should
logically extend to one's "castle in the cloud," but this protection has not been preserved in the new
virtual home. (We discuss this subject further in Section 2.3.)
• Public disclosure of inferred private facts. Powerful data analytics may infer personal facts from
seemingly harmless input data. Sometimes the inferences are beneficial. At its best, targeted
advertising directs consumers to products that they actually want or need. Inferences about people's
health can lead to better and timelier treatments and longer lives. But before the advent of big data, it
could be assumed that there was a clear distinction between public and private information: either a
fact was "out there" (and could be pointed to), or it was not. Today, analytics may discover facts that
are no less private than yesterday's purely private sphere of life. Examples include inferring sexual
preference from purchasing patterns, or early Alzheimer's disease from key-click streams. In the latter
case, the private fact may not even be known to the individual in question. (Section 3.2 discusses the
technology behind the data analytics that makes such inferences possible.) The public disclosure of such
information (and possibly also some non-public commercial uses) seems offensive to widely shared
values.
• Tracking, stalking, and violations of locational privacy. Today's technologies easily determine an
individual's current or prior location. Useful location-based services include navigation, suggesting
better commuter routes, finding nearby friends, avoiding natural hazards, and advertising the
availability of nearby goods and services. Sighting an individual in a public place can hardly be a private
fact. When big data allows such sightings, or other kinds of passive or active data collection, to be
assembled into the continuous locational track of an individual's private life, however, many Americans
(including Supreme Court Justice Sotomayor, for example[24]) perceive a potential affront to a widely
accepted "reasonable expectation of privacy."
• Harm arising from false conclusions about individuals, based on personal profiles from big-data
analytics. The power of big data, and therefore its benefit, is often correlational. In many cases the
"harms" from statistical errors are small, for example the incorrect inference of a movie preference; or
the suggestion that a health issue be discussed with a physician, following from analyses that may, on
average, be beneficial, even when a particular instance turns out to be a false alarm. Even when
predictions are statistically valid, moreover, they may be untrue about particular individuals, and
mistaken conclusions may cause harm. Society may not be willing to excuse harms caused by the
uncertainties inherent in statistically valid algorithms. These harms may unfairly burden particular
classes of individuals, for example, racial minorities or the elderly.
• Foreclosure of individual autonomy or self-determination. Data analyses about large populations can
discover special cases that apply to individuals within that population. For example, by identifying
differences in "learning styles," big data may make it possible to personalize education in ways that
recognize every individual's potential and optimize that individual's achievement. But the projection of
population factors onto individuals can be misused. It is widely accepted that individuals should be able
to make their own choices and pursue opportunities that are not necessarily typical, and that no one
should be denied the chance to achieve more than some statistical expectation of themselves. It would
offend our values if a child's choices in video games were later used for educational tracking (for
example, college admissions). Similarly offensive would be a future, akin to Philip K. Dick's science-
fiction short story adapted by Steven Spielberg in the film Minority Report, where "pre-crime" is
statistically identified and punished.[25]
• Loss of anonymity and private association. Anonymity is not acceptable as an enabler of committing
fraud, or bullying, or cyber-stalking, or improper interactions with children. Apart from wrongful
behavior, however, the individual's right to choose to be anonymous is a long-held American value (as,
for example, the anonymous authorship of the Federalist papers). Using data to (re-)identify an
individual who wishes to be anonymous (except in the case of legitimate governmental functions, such
as law enforcement) is regarded as a harm. Similarly, individuals have a right of private association with
groups or other individuals, and the identification of such associations may be a harm.

[23] One perspective informed by new technologies and technology-mediated communication suggests that privacy is about
the continual management of boundaries between different spheres of action and degrees of disclosure within those
spheres, with privacy and one's public face being balanced in different ways at different times. See: Leysia Palen and Paul
Dourish, "Unpacking 'Privacy' for a Networked World," Proceedings of CHI 2003, Association for Computing Machinery,
April 5-10, 2003.
[24] "I would ask whether people reasonably expect that their movements will be recorded and aggregated in a manner that
enables the Government to ascertain, more or less at will, their political and religious beliefs, sexual habits, and so on."
United States v. Jones (10-1259), Sotomayor concurrence at http://www.supremecourt.gov/opinions/11pdf/101259.pdf.
[25] Dick, Philip K., "The Minority Report," first published in Fantastic Universe (1956) and reprinted in Selected Stories of
Philip K. Dick, New York: Pantheon, 2002.



While in no sense is the above list intended to be complete, it does have a few intentional omissions. For
example, individuals may want big data to be used "fairly," in the sense of treating people equally, but (apart
from the small number of protected classes already defined by law) it seems impossible to turn this into a right
that is specific enough to be meaningful. Likewise, individuals may want the ability to know what others know
about them; but that is surely not a right from the pre-digital age; and, in the current era of statistical analysis, it
is not so easy to define what "know" means. This important issue is discussed in Section 3.1.2, and again taken
up in Chapter 5, where the attempt is to focus on actual harms done by the use of information, not by a concept
as technically ambiguous as whether information is known.
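
The point made in this section, that a statistically valid prediction may still be untrue about a particular
individual, can be illustrated with a short calculation. The sketch below is not taken from this report. The
function name and the numbers (a classifier with 99% sensitivity and 99% specificity, applied to a trait present
in 0.1% of a population) are hypothetical, chosen only to show how a rare trait can make most flagged
individuals false alarms.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged individual truly has the trait (Bayes' rule)."""
    true_pos = sensitivity * prevalence                    # correctly flagged share
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # wrongly flagged share
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers, for illustration only.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"Share of flagged individuals who truly have the trait: {ppv:.1%}")
```

Under these assumed numbers, only about 9% of flagged individuals actually have the trait, even though the
classifier is right 99% of the time on each group taken separately.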


2. Examples and Scenarios


This chapter seeks to make Chapter 1's introductory discussion more concrete by sketching some examples and
scenarios. While some of these applications of technology are in use today, others comprise PCAST's
technological prognostications about the near future, up to perhaps 10 years from today. Taken together the
examples and scenarios are intended to illustrate both the enormous benefits that big data can provide and also
the privacy challenges that may accompany these benefits.
In the following three sections, it will be useful to develop some scenarios more completely than others, moving
from very brief examples of things happening today to more fully developed scenarios set in the future.

2.1 Things happening today or very soon


Here are some relevant examples:
• Pioneered more than a decade ago, devices mounted on utility poles are able to sense the radio stations
being listened to by passing drivers, with the results sold to advertisers.[26]
• In 2011, automatic license-plate readers were in use by three quarters of local police departments
surveyed. Within 5 years, 25% of departments expect to have them installed on all patrol cars, alerting
police when a vehicle associated with an outstanding warrant is in view.[27] Meanwhile, civilian uses of
license-plate readers are emerging, leveraging cloud platforms and promising multiple ways of using the
information collected.[28]
• Experts at the Massachusetts Institute of Technology and the Cambridge Police Department have used a
machine-learning algorithm to identify which burglaries likely were committed by the same offender,
thus aiding police investigators.[29]
• Differential pricing (offering different prices to different customers for essentially the same goods) has
become familiar in domains such as airline tickets and college costs. Big data may increase the power
and prevalence of this practice and may also decrease even further its transparency.[30]
• The UK firm FeatureSpace offers machine-learning algorithms to the gaming industry that may detect
early signs of gambling addiction or other aberrant behavior among online players.[31]
• Retailers like CVS and AutoZone analyze their customers' shopping patterns to improve the layout of
their stores and stock the products their customers want in a particular location.[32] By tracking cell
phones, RetailNext offers bricks-and-mortar retailers the chance to recognize returning customers, just
as cookies allow them to be recognized by online merchants.[33] Similar WiFi tracking technology could
detect how many people are in a closed room (and in some cases their identities).
• The retailer Target inferred that a teenage customer was pregnant and, by mailing her coupons
intended to be useful, unintentionally disclosed this fact to her father.[34]
• The author of an anonymous book, magazine article, or web posting is frequently "outed" by informal
crowd sourcing, fueled by the natural curiosity of many unrelated individuals.[35]
• Social media and public sources of records make it easy for anyone to infer the network of friends and
associates of most people who are active on the web, and many who are not.[36]
• Marist College in Poughkeepsie, New York, uses predictive modeling to identify college students who are
at risk of dropping out, allowing it to target additional support to those in need.[37]
• The Durkheim Project, funded by the U.S. Department of Defense, analyzes social-media behavior to
detect early signs of suicidal thoughts among veterans.[38]
• LendUp, a California-based startup, sought to use nontraditional data sources such as social media to
provide credit to underserved individuals. Because of the challenges in ensuring accuracy and fairness,
however, it has been unable to proceed.[39],[40]
• Insight into the spread of hospital-acquired infections has been gained through the use of large amounts
of patient data together with personal information about uninfected patients and clinical staff.[41]
• Individuals' heart rates can be inferred from the subtle changes in their facial coloration that occur with
each beat, enabling inferences about their health and emotional state.[42]

[26] ElBoghdady, Dina, "Advertisers Tune In to New Radio Gauge," The Washington Post, October 25, 2004.
http://www.washingtonpost.com/wpdyn/articles/A600132004Oct24.html
[27] American Civil Liberties Union, "You Are Being Tracked: How License Plate Readers Are Being Used To Record Americans'
Movements," July, 2013. https://www.aclu.org/files/assets/071613aclualprreportoptv05.pdf
[28] Hardy, Quentin, "How Urban Anonymity Disappears When All Data Is Tracked," The New York Times, April 19, 2014.
[29] Rudin, Cynthia, "Predictive policing: Using Machine Learning to Detect Patterns of Crime," Wired, August 22, 2013.
http://www.wired.com/insights/2013/08/predictivepolicingusingmachinelearningtodetectpatternsof
crime/
[30] (1) Shiller, Benjamin, "First Degree Price Discrimination Using Big Data," Jan. 30, 2014, Brandeis University.
http://benjaminshiller.com/images/First_Degree_PD_Using_Big_Data_Jan_27,_2014.pdf and
http://www.forbes.com/sites/modeledbehavior/2013/09/01/willbigdatabringmorepricediscrimination/ (2) Fisher,
William W., "When Should We Permit Differential Pricing of Information?" UCLA Law Review 55:1, 2007.
[31] Burn-Murdoch, John, "UK technology firm uses machine learning to combat gambling addiction," The Guardian, August 1,
2013. http://www.theguardian.com/news/datablog/2013/aug/01/ukfirmusesmachinelearningfightgamblingaddiction
[32] Clifford, Stephanie, "Using Data to Stage-Manage Paths to the Prescription Counter," The New York Times, June 19, 2013.
http://bits.blogs.nytimes.com/2013/06/19/usingdatatostagemanagepathstotheprescriptioncounter/
[33] Clifford, Stephanie, "Attention, Shoppers: Store Is Tracking Your Cell," The New York Times, July 14, 2013.
[34] Duhigg, Charles, "How Companies Learn Your Secrets," The New York Times Magazine, February 12, 2012.
http://www.nytimes.com/2012/02/19/magazine/shoppinghabits.html?pagewanted=all&_r=0
[35] Volokh, Eugene, "Outing Anonymous Bloggers," June 8, 2009. http://www.volokh.com/2009/06/08/outinganonymous
bloggers/; A. Narayanan et al., "On the Feasibility of Internet-Scale Author Identification," IEEE Symposium on Security and
Privacy, May 2012. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6234420
[36] Facebook's "The Graph API" (at https://developers.facebook.com/docs/graphapi/) describes how to write computer
programs that can access the Facebook friends' data.
[37] One of four big data applications honored by the trade journal, Computerworld, in 2013. King, Julia, "UN tackles socio-
economic crises with big data," Computerworld, June 3, 2013.
http://www.computerworld.com/s/article/print/9239643/UN_tackles_socio_economic_crises_with_big_data
[38] Ungerleider, Neal, "This May Be The Most Vital Use Of 'Big Data' We've Ever Seen," Fast Company, July 12, 2013.
http://www.fastcolabs.com/3014191/thismaybethemostvitaluseofbigdataweveeverseen.
[39] Center for Data Innovations, 100 Data Innovations, Information Technology and Innovation Foundation, Washington, DC,
January 2014. http://www2.datainnovation.org/2014100datainnovations.pdf
[40] Waters, Richard, "Data open doors to financial innovation," Financial Times, December 13, 2013.
http://www.ft.com/intl/cms/s/2/3c59d58a43fb11e2844c00144feabdc0.html

2.2 Scenarios of the near future in healthcare and education


Here are a few examples of the kinds of scenarios that can readily be constructed.

2.2.1 Healthcare: personalized medicine


Not all patients who have a particular disease are alike, nor do they respond identically to treatment.
Researchers will soon be able to draw on millions of health records (including analog data such as scans in
addition to digital data), vast amounts of genomic information, extensive data on successful and unsuccessful
clinical trials, hospital records, and so forth. In some cases they will be able to discern that among the diverse
manifestations of the disease, a subset of the patients have a collection of traits that together form a variant
that responds to a particular treatment regime.
Since the result of the analysis could lead to better outcomes for particular patients, it is desirable to identify
those individuals in the cohort, contact them, treat their disease in a novel way, and use their experiences in
advancing the research. Their data may have been gathered only anonymously, however, or it may have been
de-identified.
Solutions may be provided by specific new technologies for the protection of database privacy. These may
create a protected query mechanism so individuals can find out whether they are in the cohort, or provide an
alert mechanism based on the cohort characteristics so that, when a medical professional sees a patient in the
cohort, a notice is generated.

2.2.2 Healthcare: detection of symptoms by mobile devices


Many baby boomers wonder how they might detect Alzheimer's disease in themselves. What would be better
to observe their behavior than the mobile device that connects them to a personal assistant in the cloud (e.g.,
Siri or OK Google), helps them navigate, reminds them what words mean, remembers to do things, recalls
conversations, measures gait, and otherwise is in a position to detect gradual declines on traditional and novel
medical indicators that might be imperceptible even to their spouses?
At the same time, any leak of such information would be a damaging betrayal of trust. What are individuals'
protections against such risks? Can the inferred information about individuals' health be sold, without
additional consent, to third parties (e.g., pharmaceutical companies)? What if this is a stated condition of use of

41 (1) Wiens, Jenna, John Guttag, and Eric Horvitz, "A Study in Transfer Learning: Leveraging Data from Multiple Hospitals to
Enhance Hospital-Specific Predictions," Journal of the American Medical Informatics Association, January 2014. (2)
Weitzner, Daniel J., et al., "Consumer Privacy Bill of Rights and Big Data: Response to White House Office of Science and
Technology Policy Request for Information," April 4, 2014.
42 Frazer, Bryant, "MIT Computer Program Reveals Invisible Motion in Video," The New York Times video, February 27, 2013.
https://www.youtube.com/watch?v=3rWycBEHn3s


the app? Should information go to individuals' personal physicians with their initial consent but not a
subsequent confirmation?

2.2.3 Education
Drawing on millions of logs of online courses, including both massive open online courses (MOOCs) and smaller
classes, it will soon be possible to create and maintain longitudinal data about the abilities and learning styles of
millions of students. This will include not just broad aggregate information like grades, but fine-grained profiles
of how individual students respond to multiple new kinds of teaching techniques, how much help they need to
master concepts at various levels of abstraction, what their attention span is in various contexts, and so forth. A
MOOC platform can record how long a student watches a particular video; how often a segment is repeated,
sped up, or skipped; how well a student does on a quiz; how many times he or she misses a particular problem;
and how the student balances watching content to reading a text. As the ability to present different material to
different students materializes in the platforms, the possibility of blind, randomized A/B testing enables the gold
standard of experimental science to be implemented at large scale in these environments.43
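The randomized A/B testing mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any platform's actual method: the function names `assign_groups` and `permutation_p_value` are hypothetical, students are split at random into two groups, and a simple permutation test estimates how often a difference in mean quiz scores at least as large as the observed one would arise by chance.

```python
import random
from statistics import mean


def assign_groups(student_ids, seed=0):
    """Randomly split students into two groups (A and B)."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(scores_a, scores_b, trials=2000, seed=0):
    """Estimate how often a random relabeling of the pooled scores gives a
    mean difference at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / trials
```

A small value returned by `permutation_p_value` suggests that the difference between the two teaching variants is unlikely to be due to chance alone; the random assignment is what allows that conclusion to be drawn.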
Similar data are also becoming available for residential classes, as learning management systems (such as
Canvas, Blackboard, or Desire2Learn) expand their roles to support innovative pedagogy. In many courses one
can now get moment-by-moment tracking of the student's engagement with the course materials and correlate
that engagement with the desired learning outcomes.
With this information, it will be possible not only to greatly improve education, but also to discover what skills,
taught to which individuals at which points in childhood, lead to better adult performance in certain tasks, or to
adult personal and economic success. While these data could revolutionize educational research, the privacy
issues are complex.44
There are many privacy challenges in this vision of the future of education. Knowledge of early performance can
create implicit biases45 that color later instruction and counseling. There is great potential for misuse, ostensibly
for the social good, in the massive ability to direct students into high- or low-potential tracks. Parents and
others have access to sensitive information about children, but mechanisms rarely exist to change those
permissions when the child reaches majority.

2.3 Challenges to the home's special status


The home has special significance as a sanctuary of individual privacy. The Fourth Amendment's list, "persons,
houses, papers, and effects," puts only the physical body in the rhetorically more prominent position; and a
house is often the physical container for the other three, a boundary inside of which enhanced privacy rights
apply.

43 For an overview of MOOCs and associated analytics opportunities, see PCAST's December 2013 letter to the President.
http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_edit_dec2013.pdf
44 There is also uncertainty about how to interpret applicable laws, such as the Family Educational Rights and Privacy Act
(FERPA). Recent Federal guidance is intended to help clarify the situation. See: U.S. Department of Education, "Protecting
Student Privacy While Using Online Educational Services: Requirements and Best Practices," February 2014.
http://ptac.ed.gov/sites/default/files/Student%20Privacy%20and%20Online%20Educational%20Services%20%28February%202014%29.pdf
45 Cukier, Kenneth, and Viktor Mayer-Schoenberger, "How Big Data Will Haunt You Forever," Quartz, March 11, 2014.
http://qz.com/185252/howbigdatawillhauntyouforeveryourhighschooltranscript/


Existing interpretations of the Fourth Amendment are inadequate for the present world, however. We, along
with the "papers and effects" contemplated by the Fourth Amendment, live increasingly in cyberspace, where
the physical boundary of the home has little relevance. In 1980, a family's financial records were paper
documents, located perhaps in a desk drawer inside the house. By 2000, they were migrating to the hard drive
of the home computer, but still within the house. By 2020, it is likely that most such records will be in the
cloud, not just outside the house, but likely replicated in multiple legal jurisdictions because cloud storage
typically uses location diversity to achieve reliability. The picture is the same if one substitutes for financial
records something like "political books we purchase," or "love letters that we receive," or "erotic videos that we
watch." Absent different policy, legislative, and judicial approaches, the physical sanctity of the home's papers
and effects is rapidly becoming an empty legal vessel.
The home is also the central locus of Brandeis' "right to be left alone." This right is also increasingly fragile,
however. Increasingly, people bring sensors into their homes whose immediate purpose is to provide
convenience, safety, and security. Smoke and carbon monoxide alarms are common, and often required by
safety codes.46 Radon detectors are usual in some parts of the country. Integrated air monitors that can detect
and identify many different kinds of pollutants and allergens are readily foreseeable. Refrigerators may soon be
able to "sniff" for gases released from spoiled food, or, as another possible path, may be able to "read" food
expiration dates from radio-frequency identification (RFID) tags in the food's packaging. Rather than today's
annoying cacophony of beeps, tomorrow's sensors (as some already do today) will interface to a family through
integrated apps on mobile devices or display screens. The data will have been processed and interpreted. Most
likely that processing will occur in the cloud. So, to deliver services the consumer wants, much data will need to
have left the home.
Environmental sensors that enable new food and air safety may also be able to detect and characterize tobacco
or marijuana smoke. Health care or health insurance providers may want assurance that self-declared
non-smokers are telling the truth. Might they, as a condition of lower premiums, require the homeowner's consent
for tapping into the environmental monitors' data? If the monitor detects heroin smoking, is an insurance
company obligated to report this to the police? Can the insurer cancel the homeowner's property insurance?
To some, it seems far-fetched that the typical home will foreseeably acquire cameras and microphones in every
room, but that appears to be a likely trend. What can your cell phone (already equipped with front and back
cameras) hear or see when it is on the nightstand next to your bed? Tablets, laptops, and many desktop
computers have cameras and microphones. Motion detector technology for home intrusion alarms will likely
move from ultrasound and infrared to imaging cameras, with the benefit of fewer false alarms and the ability
to distinguish pets from people. Facial-recognition technology will allow further security and convenience. For
the safety of the elderly, cameras and microphones will be able to detect falls or collapses, or calls for help, and
be networked to summon aid.
People naturally communicate by voice and gesture. It is inevitable that people will communicate with their
electronic servants in both such modes (necessitating that they have access to cameras and microphones).

46 Nest, acquired by Google, attracted attention early for its design and its use of big data to adapt to consumer behavior.
See: Aoki, Kenji, "Nest Gives the Lowly Smoke Detector a Brain," Wired, October, 2013.
http://www.wired.com/2013/10/nestsmokedetector/all/


Companies such as PrimeSense, an Israeli firm recently bought by Apple,47 are developing sophisticated
computer-vision software for gesture reading, already a key feature in the consumer computer-game console
market (e.g., Microsoft Kinect). Consumer televisions are already among the first "appliances" to respond to
gesture; already, devices such as the Nest smoke detector respond to gestures.48 The consumer who taps his
temple to signal a spoken command to Google Glass49 may want to use the same gesture for the television, or
for that matter for the thermostat or light switch, in any room at home. This implies omnipresent audio and
video collection within the home.
All of these audio, video, and sensor data will be generated within the supposed sanctuary of the home. But
they are no more likely to stay in the home than the "papers and effects" already discussed. Electronic devices
in the home already invisibly communicate to the outside world via multiple separate infrastructures: The cable
industry's hardwired connection to the home provides multiple types of two-way communication, including
broadband Internet. Wireline phone is still used by some home-intrusion alarms and satellite TV receivers, and
as the physical layer for DSL broadband subscribers. Some home devices use the cell-phone wireless
infrastructure. Many others piggyback on the home Wi-Fi network that is increasingly a necessity of modern
life. Today's smart home-entertainment system knows what a person records on a DVR, what she actually
watches, and when she watches it. Like personal financial records in 2000, this information today is in part
localized inside the home, on the hard drive inside the DVR. As with financial information today, however, it is
on track to move into the cloud. Today, Netflix or Amazon can offer entertainment suggestions based on
customers' past key-click streams and viewing history on their platforms. Tomorrow, even better suggestions
may be enabled by interpreting their minute-by-minute facial expressions as seen by the gesture-reading
camera in the television.
These collections of data are benign, in the sense that they are necessary for products and services that
consumers will knowingly demand. Their challenges to privacy arise both from the fact that their analog sensors
necessarily collect more information than is minimally necessary for their function (see Section 3.1.2), and also
because their data practically cry out for secondary uses ranging from innovative new products to marketing
bonanzas to criminal exploits. As in many other kinds of big data, there is ambiguity as to data ownership, data
rights, and allowed data use. Computer-vision software is likely already able to read the brand labels on
products in its field of view; this is a much easier technology than facial recognition. If the camera in your
television knows what brand of beer you are drinking while watching a football game, and knows whether you
opened the bottle before or after the beer ad, who (if anyone) is allowed to sell this information to the beer
company, or to its competitors? Is the camera allowed to read brand names when the television set is
supposedly off? Can it watch for magazines or political leaflets? If the RFID tag sensor in your refrigerator
usefully detects out-of-date food, can it also report your brand choices to vendors? Is this creepy and strange,
or a consumer financial benefit when every supermarket can offer you relevant coupons?50 Or (the dilemma of

47 Reuters, "Apple acquires Israeli 3D chip developer PrimeSense," November 25, 2013.
http://www.reuters.com/article/2013/11/25/usprimesenseofferappleidUSBRE9AO04C20131125
48 Id.
49 Google, "Glass gestures." https://support.google.com/glass/answer/3064184?hl=en
50 Tene, Omer, and Jules Polonetsky, "A Theory of Creepy: Technology, Privacy and Shifting Social Norms," Yale Journal of
Law and Technology 16:59, 2013, pp. 59-100.


differential pricing51) is it any different if the data are used to offer others a better deal while you pay full price
because your brand loyalty is known to be strong?
About one-third of Americans rent, rather than own, their residences. This number may increase with time as a
result of long-term effects of the 2007 financial crisis, as well as aging of the U.S. population. Today and
foreseeably, renters are less affluent, on average, than homeowners. The law demarcates a fine line between
the property rights of landlords and the privacy rights of tenants. Landlords have the right to enter their
property under various conditions, generally including where the tenant has violated health or safety codes, or
to make repairs. As more data are collected within the home, the rights of tenant and landlord may need new
adjustment. If environmental monitors are fixtures of the landlord's property, does she have an unconditional
right to their data? Can she sell those data? If the lease so provides, can she evict the tenant if the monitor
repeatedly detects cigarette smoke, or a camera sensor is able to distinguish a prohibited pet?
If a third party offers facial recognition services for landlords (no doubt with all kinds of cryptographic
safeguards!), can the landlord use these data to enforce lease provisions against subletting or additional
residents? Can she require such monitoring as a condition of the lease? What if the landlord's cameras are
outside the doors, but keep track of everyone who enters or leaves her property? How is this different from the
case of a security camera across the street that is owned by the local police?

2.4 Tradeoffs among privacy, security, and convenience


Notions of privacy change generationally. One sees today marked differences between the younger generation
of "digital natives" and their parents or grandparents. In turn, the children of today's digital natives will likely
have still different attitudes about the flow of their personal information. Raised in a world with digital
assistants who know everything about them, and (one may hope) with wise policies in force to govern use of the
data, future generations may see little threat in scenarios that individuals today would find threatening, if not
Orwellian. PCAST's final scenario, perhaps at the outer limit of its ability to prognosticate, is constructed to
illustrate this point.
Taylor Rodriguez prepares for a short business trip. She packed a bag the night before and put it outside the
front door of her home for pickup. No worries that it will be stolen: The camera on the streetlight was watching
it; and, in any case, almost every item in it has a tiny RFID tag. Any would-be thief would be tracked and
arrested within minutes. Nor is there any need to give explicit instructions to the delivery company, because the
cloud knows Taylor's itinerary and plans; the bag is picked up overnight and will be in Taylor's destination hotel
room by the time of her arrival.
Taylor finishes breakfast and steps out the front door. Knowing the schedule, the cloud has provided a
self-driving car, waiting at the curb. At the airport, Taylor walks directly to the gate; there is no need to go through
any security. Nor are there any formalities at the gate: A twenty-minute "open door" interval is provided for
passengers to stroll onto the plane and take their seats (which each sees individually highlighted in his or her
wearable optical device). There are no boarding passes and no organized lines. Why bother, when Taylor's
identity (as for everyone else who enters the airport) has been tracked and is known absolutely? When her
known information emanations (phone, RFID tags in clothes, facial recognition, gait, emotional state) are known
to the cloud, vetted, and essentially unforgeable? When, in the unlikely event that Taylor has become deranged
and dangerous, many detectable signs would already have been tracked, detected, and acted on?

51 See references at footnote 30.


Indeed, everything that Taylor carries has been screened far more effectively than any rushed airport search
today. Friendly cameras in every LED lighting fixture in Taylor's house have watched her dress and pack, as they
do every day. Normally these data would be used only by Taylor's personal digital assistants, perhaps to offer
reminders or fashion advice. As a condition of using the airport transit system, however, Taylor has authorized
the use of the data for ensuring airport security and public safety.
Taylor's world seems creepy to us. Taylor has accepted a different balance among the public goods of
convenience, privacy, and security than would most people today. Taylor acts in the unconscious belief
(whether justified or not, depending on the nature and effectiveness of policies in force) that the cloud and its
robotic servants are trustworthy in matters of personal privacy. In such a world, major improvements in the
convenience and security of everyday life become possible.

3. Collection, Analytics, and Supporting Infrastructure


Big data is big in two different senses. It is big in the quantity and variety of data that are available to be
processed. And, it is big in the scale of analysis ("analytics") that can be applied to those data, ultimately to
make inferences. Both kinds of "big" depend on the existence of a massive and widely available computational
infrastructure, one that is increasingly being provided by cloud services. This chapter expands on these basic
concepts.

3.1 Electronic sources of personal data


Since early in the computer age, public and private entities have been assembling digital information about
people. Databases of personal information were created during the days of "batch processing."52 Indeed, early
descriptions of database technology often talk about personnel records used for payroll applications. As
computing power increased, more and more business applications moved to digital form. There now are digital
telephone-call records, credit-card transaction records, bank-account records, email repositories, and so on. As
interactive computing has advanced, individuals have entered more and more data about themselves, both for
self-identification to an online service and for productivity tools such as financial-management systems.
These digital data are normally accompanied by "metadata" or ancillary data that explain the layout and
meaning of the data they describe. Databases have schemas and email has headers,53 as do network packets.54
As data sets become more complex, so do the attached metadata. Included in the data or metadata may be
identifying information such as account numbers, login names, and passwords. There is no reason to believe
that metadata raise fewer privacy concerns than the data they describe.
In recent times, the kinds of electronic data available about people have increased substantially, in part because
of the emergence of social media and in part because of the growth in mobile devices, surveillance devices, and
a diversity of networked sensors. Today, although they may not be aware of it, individuals constantly emit into
the environment information whose use or misuse may be a source of privacy concerns. Physically, these
information emanations are of two types, which can be called "born digital" or "born analog."

3.1.1 Born digital data


When information is "born digital," it is created, by us or by a computer surrogate, specifically for digital use;
that is, for use by a computer or data processing system. Examples of data that are born digital include:

• email and text messaging
• input via mouse clicks, taps, swipes, or keystrokes on a phone, tablet, computer, or video game; that is,
  data that people intentionally enter into a device

52 Such databases endure and form the basis of continuing concern among privacy advocates.
53 Schemas are formal definitions of the configuration of a database: its tables, relations, and indices. Headers are the
sometimes-invisible prefaces to email messages that contain information about the sending and destination addresses and
sometimes the routing of the path between them.
54 In the Internet and similar networks, information is broken up into chunks called packets, which may travel
independently and depend on metadata to be reassembled properly at the destination of the transmission.

• GPS location data
• metadata associated with phone calls: the numbers dialed from or to, the time and duration of calls
• data associated with most commercial transactions: credit-card swipes, bar-code reads, reads of RFID
  tags (as used for anti-theft and inventory control)
• data associated with portal access (key card or ID badge reads) and toll-road access (remote reads of
  RFID tags)
• metadata that our mobile devices use to stay connected to the network, including device location and
  status
• increasingly, data from cars, televisions, appliances: the "Internet of Things"

Consumer tracking data provide an example of born-digital data that has become economically important. It is
generally possible for companies to aggregate large amounts of data and then use those data for marketing,
advertising, or many other activities. The traditional mechanism has been to use cookies, small data files that a
browser can leave on a user's computer (pioneered by Netscape two decades ago). The technique is to leave a
cookie when a user first visits a site and then be able to correlate that visit with a subsequent event. This
information is very valuable to retailers and forms the basis of many of the advertising businesses of the last
decade. There has been a variety of proposals to regulate such tracking,55 and many countries require opt-in
permission before this tracking is done. Cookies involve relatively simple pieces of information that proponents
represent as unlikely to be abused. Although not always aware of the process, people accept such tracking in
return for a free or subsidized service.56 At the same time, cookie-free alternatives are sometimes available.57
Even without cookies, so-called "fingerprinting" techniques can often identify a user's computer or mobile
device uniquely by the information that it exposes publicly, such as the size of its screen, its installed fonts, and
other features.58 Most technologists believe that applications will move away from cookies, that cookies are too
simple an idea, and that there are better analytics coming and better approaches being invented. The economic
incentives for consumer tracking will remain, however, and big data will allow for more precise responses.
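The fingerprinting idea described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a site can observe a handful of device attributes (the attribute names and values below are hypothetical) and shows how hashing them in a canonical order yields a stable identifier with no cookie stored on the device. Real fingerprinting techniques draw on many more attributes and are considerably more elaborate.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Derive a short, stable identifier from publicly exposed attributes."""
    # Sorting the keys makes the result independent of attribute order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical observations of the same device on two separate visits.
first_visit = {"screen": "1920x1080", "fonts": ["Arial", "Garamond"], "timezone": "UTC-5"}
second_visit = {"timezone": "UTC-5", "screen": "1920x1080", "fonts": ["Arial", "Garamond"]}
# A different device that differs only in its installed fonts.
other_device = {"screen": "1920x1080", "fonts": ["Arial"], "timezone": "UTC-5"}
```

Here `fingerprint(first_visit)` and `fingerprint(second_visit)` agree, so the two visits can be correlated, while `fingerprint(other_device)` differs. The more attributes a device exposes, the more likely its combination is to be unique.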
Tracking is also the enabling technology of some more nefarious uses. Unfortunately, many social-networking
apps begin by taking a person's contact list and spamming all the recipients with advertising for the app. This
technique is often abused, especially by small start-ups who may assess the value gained by reaching new
customers as being greater than the value lost to their reputation for honoring privacy.

55 Federal Trade Commission, "FTC Staff Revises Online Behavioral Advertising Principles," Press Release, February 12, 2009.
http://www.ftc.gov/newsevents/pressreleases/2009/02/ftcstaffrevisesonlinebehavioraladvertisingprinciples
56 (1) Cf. The Wall Street Journal's "What they know" series
(http://online.wsj.com/public/page/whattheyknowdigitalprivacy.html). (2) Turow, Joseph, The Daily You: How the
Advertising Industry is Defining your Identity and Your Worth, Yale University Press, 2012.
http://yalepress.yale.edu/book.asp?isbn=9780300165012
57 DuckDuckGo is a non-tracking search engine that, while perhaps yielding fewer results than leading search engines, is
used by those looking for less tracking. See: https://duckduckgo.com/
58 (1) Tanner, Adam, "The Web Cookie Is Dying. Here's The Creepier Technology That Comes Next," Forbes, June 17, 2013.
http://www.forbes.com/sites/adamtanner/2013/06/17/thewebcookieisdyingheresthecreepiertechnologythatcomesnext/
(2) Acar, G. et al., "FPDetective: Dusting the Web for Fingerprinters," 2013.
http://www.cosic.esat.kuleuven.be/publications/article2334.pdf


All information that is born digital shares certain characteristics. It is created in identifiable units for particular
purposes. These units are in most cases "data packets" of one or another standard type. Since they are created
by intent, the information that they contain is usually limited, for reasons of efficiency and good engineering
design, to support the immediate purpose for which they are collected.
When data are born digital, privacy concerns can arise in two different modes, one obvious ("over-collection"),
the other more recent and subtle ("data fusion"). Over-collection occurs when an engineering design
intentionally, and sometimes clandestinely, collects information unrelated to its stated purpose. While your
smartphone could easily photograph and transmit to a third party your facial expression as you type every
keystroke of a text message, or could capture all keystrokes, thereby recording text that you had deleted, these
would be inefficient and unreasonable software design choices for the default text-messaging app. In that
context they would be instances of over-collection.
A recent example of over-collection was the Brightest Flashlight Free phone app, downloaded by more than 50
million users, which passed back to its vendor its location every time the flashlight was used. Not only is location
information unnecessary for the illumination function of a flashlight, but it also discloses personal information
that the user might wish to keep private. The Federal Trade Commission issued a complaint because the fine
print on the notice-and-consent screen (see Section 4.3) had neglected to disclose that location information,
whose collection was disclosed, would be sold to third parties, such as advertisers.59,60 One sees in this example
the limitations of the notice-and-consent framework: A more detailed initial fine-print disclosure by Brightest
Flashlight Free, which almost no one would have actually read, would likely have forestalled any FTC action
without much affecting the number of downloads.
In contrast to over-collection, data fusion occurs when data from different sources are brought into contact and
new, often unexpected, phenomena emerge (see Section 3.1). Individually, each data source may have been
designed for a specific, limited purpose. But when multiple sources are processed by techniques of modern
statistical data mining, pattern recognition, and the combining of records from diverse sources by virtue of
common identifying data, new meanings can be found. In particular, data fusion frequently results in the
identification of individual people (that is, the association of events with unique personal identities), the
creation of data-rich profiles of an individual, and the tracking of an individual's activities over days, months, or
years.
By definition, the privacy challenges from data fusion do not lie in the individual data streams, each of whose
collection, real-time processing, and retention may be wholly necessary and appropriate for its overt, immediate
purpose. Rather, the privacy challenges are emergent properties of our increasing ability to bring into analytical
juxtaposition large, diverse data sets and to process them with new kinds of mathematical algorithms.
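The combining of records "by virtue of common identifying data" can be made concrete with a small Python sketch. Everything in it is hypothetical and chosen for illustration: the two data sets, the field names, and the function `fuse` are assumptions, not data or methods from this report. One source carries a diagnosis but no name; the other carries a name but no diagnosis. Neither identifies a patient's condition on its own, but joining them on shared fields does.

```python
def fuse(records_a, records_b, keys):
    """Link records from two sources that agree on all of the given keys.
    Only unambiguous (exactly one) matches are linked."""
    index = {}
    for rec in records_b:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    fused = []
    for rec in records_a:
        matches = index.get(tuple(rec[k] for k in keys), [])
        if len(matches) == 1:
            fused.append({**matches[0], **rec})
    return fused


# Hypothetical "de-identified" health records (no names).
health = [
    {"zip": "02139", "birth_year": 1961, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]
# Hypothetical retail loyalty records (names, but no health data).
loyalty = [
    {"zip": "02139", "birth_year": 1961, "sex": "F", "name": "Pat Example"},
    {"zip": "94110", "birth_year": 1975, "sex": "M", "name": "Sam Sample"},
]

linked = fuse(health, loyalty, keys=("zip", "birth_year", "sex"))
```

In this toy case `linked` contains one record that associates a name with a diagnosis, even though each source was, in isolation, limited to its own purpose. This is the emergent property the text describes.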

59 Federal Trade Commission, "Android Flashlight App Developer Settles FTC Charges It Deceived Consumers," Press
Release, December 5, 2013.
http://www.ftc.gov/newsevents/pressreleases/2013/12/androidflashlightappdevelopersettlesftcchargesitdeceived
60 (1) FTC File No. 1323087 Decision and order.
http://www.ftc.gov/system/files/documents/cases/140409goldenshoresdo.pdf (2) "FTC Approves Final Order Settling
Charges Against Flashlight App Creator."
http://www.ftc.gov/newsevents/pressreleases/2014/04/ftcapprovesfinalordersettlingchargesagainstflashlightapp

3.1.2 Data from sensors


Turn now to the second broad class of information emanations. One can say that information is "born analog"
when it arises from the characteristics of the physical world. Such information does not become accessible
electronically until it impinges on a "sensor," an engineered device that observes physical effects and converts
them to digital form. The most common sensors are cameras, including video, which sense visible
electromagnetic radiation; and microphones, which sense sound and vibration. There are many other kinds of
sensors, however. Today, cell phones routinely contain not only cameras, microphones, and radios but also
analog sensors for magnetic fields (3D compass) and motion (acceleration). Other kinds of sensors include
those for thermal infrared (IR) radiation; air quality, including the identification of chemical pollutants;
barometric pressure (and altitude); low-level gamma radiation; and many other phenomena.
Examples of born-analog data providing personal information and in use today include:

• the voice and/or video content of a phone call, born analog but immediately converted to digital by the
  phone's microphone and camera
• personal health data such as heartbeat, respiration, and gait, as sensed by special-purpose devices
  (Fitbit has been a leading provider61) or cell-phone apps
• cameras/sensors in televisions and video games that interpret gestures by the user
• video from security surveillance cameras, mobile phones, or overhead drones
• imaging infrared video that can see in what people perceive as total darkness (and also see evanescent
  traces of past events, so-called "heat scars")
• microphone networks in cities, used to detect and locate gunshots and for public safety
• cameras/microphones in classrooms and other meeting rooms
• ultrasonic motion detectors
• medical imaging, CT, and MRI scans, ultrasonic imaging
• opportunistically collected chemical or biological samples, notably trace DNA (today requiring slow,
  off-line analysis, but foreseeably more nimble)
• synthetic aperture radar (SAR), which can image through clouds and, under some conditions, see inside
  of non-metallic structures
• unintended radio-frequency emissions from electrical and electronic devices

When data are born analog, they are likely to contain more information than the minimum necessary for their
immediate purpose, for several valid reasons. One is that the desired information ("signal") must be sensed in
the presence of unwanted extraneous information ("noise"). The technologies typically work by sensing the
environment ("signal plus noise") with high precision, so that mathematical techniques can then be applied that
will separate the two even in the worst anticipated case when the signal is smallest or the noise is largest.
Another reason is technological convergence. For example, as the cameras in cell phones become smaller and
cheaper, the use of identical components in other products becomes a favored design choice, even when full
images are not needed. Where a big-screen television today has separate sensors for its IR remote control,
room brightness, and motion detection (a feature that turns off the picture when no one is in the room), plus a
true video camera in the add-on game console, tomorrow's model may integrate all of these functions in a
single, cheap, high-resolution, IR-sensitive camera, a few millimeters in size.
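The point about sensing "signal plus noise" with more precision than seems necessary can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not a description of any particular sensor: the signal is taken to be a constant, the noise is assumed to be Gaussian, and the function names `sense` and `estimate` are hypothetical. Averaging many readings is one of the simplest mathematical techniques for separating the two.

```python
import random
from statistics import mean


def sense(signal, noise_sd, rng):
    """One raw sensor reading: the true signal plus random noise."""
    return signal + rng.gauss(0.0, noise_sd)


def estimate(signal, noise_sd, n_samples, seed=0):
    """Average many readings; the noise tends to cancel, the signal remains."""
    rng = random.Random(seed)
    return mean(sense(signal, noise_sd, rng) for _ in range(n_samples))
```

With a signal of 0.1 and noise ten times larger, a single reading says almost nothing, whereas `estimate(0.1, 1.0, 10000)` typically lands close to 0.1. A sensor designed for this worst case necessarily gathers far more raw data than the final answer contains, which is why born-analog data tend to carry information beyond their immediate purpose.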

61 See: http://www.fitbit.com/


In addition to the information available from digital and analog sources consciously intended to provide
information about people, inadvertent disclosure abounds from the emerging "Internet of Things," an
amalgamation of sensors whose primary purpose is enhanced by smart network-connected computational
capabilities. Examples include "smart" thermostats that detect human presence and adjust air temperatures
accordingly, "smart" automobile ignition systems, and locking systems that are biometrically triggered.
The privacy challenges of born-analog data are somewhat different from those of born-digital data. Where
over-collection (as was defined above) is an irrational design choice for the principled digital designer, and
therefore an identifiable red flag for privacy issues, over-collection in the analog domain can be a robust and
economical design choice. A consequence is that born-analog data will often contain information that was not
originally expected. Unexpected information could in many cases lead to unanticipated beneficial products and
services, but it could also give opportunities for unanticipated misuse.
As a concrete example, one might consider three key parameters of video imaging: resolution (how many pixels
in the image), contrast ratio (how well can the image see into dark regions), and photometric precision (how
accurate is the image in brightness and color). All three parameters have improved by orders of magnitude and
are likely to keep improving. Today, with special cameras, one can image a cityscape from a high rooftop and
see clearly into every facing house and apartment window within several miles.62 Or, already mentioned, the
ability exists to sense remotely the pulse of an individual, giving information on health status and emotional
state.63
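To give a sense of how a pulse could in principle be recovered from video, the Python sketch below searches a brightness time series for its dominant frequency in the range of plausible heart rates. It is a simplified illustration under stated assumptions and is not the method of the work cited in footnote 63: the input is a synthetic series standing in for the average brightness of a face region in each frame, and the frame rate, signal amplitude, noise level, and function name `dominant_frequency_bpm` are all hypothetical.

```python
import math
import random


def dominant_frequency_bpm(samples, sample_rate_hz, lo_bpm=40, hi_bpm=180):
    """Return the rate (beats per minute) whose sinusoid best matches the
    samples, using a direct Fourier sum over candidate rates."""
    n = len(samples)
    avg = sum(samples) / n
    centered = [s - avg for s in samples]
    best_bpm, best_power = None, -1.0
    for bpm in range(lo_bpm, hi_bpm + 1):
        freq = bpm / 60.0
        re = sum(c * math.cos(2 * math.pi * freq * i / sample_rate_hz)
                 for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * freq * i / sample_rate_hz)
                 for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


# Synthetic stand-in for 10 seconds of video at 30 frames per second:
# a faint 72-beats-per-minute oscillation plus random noise.
rng = random.Random(0)
FPS = 30
brightness = [
    100.0 + 0.5 * math.sin(2 * math.pi * (72 / 60.0) * (i / FPS)) + rng.gauss(0.0, 0.1)
    for i in range(10 * FPS)
]
estimated_bpm = dominant_frequency_bpm(brightness, FPS)
```

The oscillation here is half a percent of the overall brightness, far too small to notice by eye, yet the frequency search should recover a rate close to the 72 beats per minute that was built into the synthetic data. Improvements in photometric precision are what make such faint variations measurable.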
It is foreseeable, perhaps inevitable, that these capabilities will be present in every cell phone and security
surveillance camera, or every wearable computer device. (Imagine the process of negotiating the price for a car,
or negotiating an international trade agreement, when every participant's Google Glass (or security camera or
TV camera) is able to monitor and interpret the autonomic physiological state of every other participant, in real
time.) It is unforeseeable what other unexpected information also lies in signals from the same sensors.
Once they enter the digital world, born-analog data can be fused and mined along with born-digital data. For
example, facial-recognition algorithms, which might be error-prone in isolation, may yield nearly perfect identity
tracking when they can be combined with born-digital data from cell phones (including unintended emanations),
point-of-sale transactions, RFID tags, and so forth; and also with other born-analog data such as vehicle tracking
(e.g., from overhead drones) and automated license-plate reading. Biometric data can provide identity
information that enhances the profile of an individual even more, and data on behavior (as from social
networks) are being used to analyze attitudes or emotions ("sentiment analysis," for individuals or groups64). In
short, more and more information can be captured and put in a quantified format so it can be tabulated and
analyzed.65

62 Koonin, Steven E., Gregory Dobler and Jonathan S. Wurtele, "Urban Physics," American Physical Society News, March,
2014. http://www.aps.org/publications/apsnews/201403/urban.cfm
63 Durand, Fredo, et al., "MIT Computer Program Reveals Invisible Motion in Video," The New York Times, video, February
27, 2013. https://www.youtube.com/watch?v=3rWycBEHn3s
64 Feldman, Ronen, "Techniques and Applications for Sentiment Analysis," Communications of the ACM, 56:4, pp. 82-89.
65 Mayer-Schönberger, Viktor and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and
Think, Boston, NY: Houghton Mifflin Harcourt, 2013.

3.2 Big data analytics


Analytics is what makes big data come alive. Without analytics, big data sets could be stored, and they could be
retrieved, wholly or selectively. But what comes out would be exactly what went in. Analytics, comprising a
number of different computational technologies, is what fuels the big-data revolution.66 Analytics is what
creates the new value in big data sets, vastly more than the sum of the values of the parts.67

3.2.1 Data mining


Data mining, sometimes loosely equated to analytics but actually only a subset of it, refers to a computational
process that discovers patterns in large data sets. It is a convergence of many fields of academic research in
both applied mathematics and computer science, including statistics, databases, artificial intelligence, and
machine learning. Like other technologies, advances in data mining have a research and development stage, in
which new algorithms and computer programs are developed, and they have subsequent phases of
commercialization and application.
Data-mining algorithms can be trained to find patterns either by supervised learning, so-called because the
algorithm is seeded with manually curated examples of the pattern to be recognized, or by unsupervised
learning, where the algorithm tries to find related pieces of data without prior seeding. A recent success of
unsupervised learning algorithms was a program that, searching millions of images on the web, figured out on
its own that "cat" was a much-posted category.68
Thedesiredoutputofdataminingcantakeseveralforms,eachwithitsownspecializedalgorithms.69

• Classification algorithms attempt to assign objects or events to known categories. For example, a
  hospital might want to classify discharged patients as high, medium, or low risk for readmission.
• Clustering algorithms group objects or events into categories by similarity, as in the "cat" example
  above.
• Regression algorithms (also called numerical prediction algorithms) try to predict numerical quantities.
  For example, a bank may want to predict, from the details in a loan application, the probability of a
  default.
• Association techniques try to find relationships between items in their dataset. Amazon's suggested
  products and Netflix's suggested movies are examples.
• Anomaly-detection algorithms look for untypical examples within a dataset, for example, detecting
  fraudulent transactions on a credit card account.
• Summarization techniques attempt to find and present salient features in data. Examples include both
  simple statistical summaries (e.g., average student test scores by school and teacher), and higher-level
  analysis (e.g., a list of key facts about an individual as gleaned from all web postings that mention her).
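
As one concrete illustration of the association techniques listed above, the following minimal sketch counts how often pairs of items appear together in the same purchase and suggests the most frequent companions of a given item. The function names and the purchase data are hypothetical, invented for this example; production recommendation systems are considerably more elaborate, and nothing here describes how any particular company's system works.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair a canonical (alphabetical) order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def suggest(item, baskets, top_n=2):
    """Return the items that most often co-occur with `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase histories, invented for illustration only.
BASKETS = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "fuel"],
]

if __name__ == "__main__":
    print(suggest("tent", BASKETS))
```

Even this simple counting shows how a relationship between items, and hence an inference about a purchaser, can emerge from data that were recorded for another purpose.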

66 National Research Council, Frontiers in Massive Data Analysis, National Academies Press, 2013.
67 (1) Thill, Brent and Nicole Hayashi, Big Data = Big Disruption: One of the Most Transformative IT Trends Over the Next
Decade, UBS Securities LLC, October 2013. (2) McKinsey Global Institute, Center for Government, and Business Technology
Office, Open data: Unlocking innovation and performance with liquid information, McKinsey & Company, October 2013.
68 Le, Q.V. et al., "Building High-level Features Using Large Scale Unsupervised Learning,"
http://static.googleusercontent.com/media/research.google.com/en/us/archive/unsupervised_icml2012.pdf
69 Bramer, M., Principles of Data Mining, Springer, 2013.



Data mining is sometimes confused with machine learning, the latter a broad subfield of computer science in
academic and industrial research.70 Data mining makes use of machine learning, as well as other disciplines,
while machine learning has applications to fields other than data mining, for example, robotics.
There are limitations, both practical and theoretical, to what data mining can accomplish, as well as limits to
how accurate it can be. It may reveal patterns and relationships, but it usually cannot tell the user the value or
significance of these patterns. For example, supervised learning based on the characteristics of known terrorists
might find similar persons, but they might or might not be terrorists; and it would miss different classes of
terrorists who don't fit the profile.
Data mining can identify relationships between behaviors and/or variables, but these relationships do not
always indicate causality. If people who live under high-voltage power lines have higher morbidity, it might
mean that power lines are a hazard to public health; or it might mean that people who live under power lines
tend to be poor and have inadequate access to health care. The policy implications are quite different. While
so-called confounding variables (in this example, income) can be corrected for when they are known and
understood, there is no sure way to know whether all of them have been identified. Imputing true causality in
big data is a research field in its infancy.71
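
The power-line example can be made concrete with a small calculation. The sketch below uses invented counts (they are assumptions for illustration, not data from any study) in which illness depends only on income, yet the pooled figures make living near a power line look hazardous. Stratifying by the known confounder removes the apparent effect.

```python
# Hypothetical counts, invented for illustration:
# (income group, residence) -> (number ill, number of people)
COUNTS = {
    ("low", "near"): (30, 100),
    ("low", "far"): (6, 20),
    ("high", "near"): (2, 20),
    ("high", "far"): (10, 100),
}


def crude_rate(residence):
    """Illness rate for a residence group, ignoring income entirely."""
    ill = sum(i for (_, r), (i, _) in COUNTS.items() if r == residence)
    total = sum(n for (_, r), (_, n) in COUNTS.items() if r == residence)
    return ill / total


def stratified_rate(income, residence):
    """Illness rate within a single income group."""
    ill, total = COUNTS[(income, residence)]
    return ill / total


if __name__ == "__main__":
    print("crude:", crude_rate("near"), crude_rate("far"))
    for income in ("low", "high"):
        print(income, stratified_rate(income, "near"), stratified_rate(income, "far"))
```

In these invented numbers the crude rate near power lines is twice the rate far from them, while within each income group the two rates are identical. The correction works only because income was recorded; a confounder that nobody thought to measure would leave the misleading correlation in place.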
Many data analyses yield correlations that might or might not reflect causation. Some data analyses develop
imperfect information, either because of limitations of the algorithms, or by the use of biased sampling.
Indiscriminate use of these analyses may cause discrimination against individuals or a lack of fairness because of
incorrect association with a particular group.72 In using data analyses, particular care must be taken to protect
the privacy of children and other protected groups.
Real-world data are incomplete and noisy. These data-quality issues lower the performance of data-mining
algorithms and obscure outputs. When economics allow, careful screening and preparation of the input data
can improve the quality of results, but this data preparation is often labor intensive and expensive. Users,
especially in the commercial sector, must trade off cost and accuracy, sometimes with negative consequences
for the individual represented in the data. Additionally, real-world data can contain extreme events or outliers.
Outliers may be real events that, by chance, are overrepresented in the data; or they may be the result of
data-entry or data-transmission errors. In both cases they can skew the model and degrade performance. The study
of outliers is an important research area of statistics.
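
The effect of a single outlier on a simple model can be sketched as follows. The transaction amounts are hypothetical, and the screening rule (a modified z-score based on the median absolute deviation, with the commonly quoted constant 0.6745 and cutoff 3.5) is one rule of thumb among many, chosen here as an assumption for illustration.

```python
from statistics import mean, median


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute
    deviation (MAD), which are far less sensitive to extreme values
    than the mean and standard deviation.
    """
    values = list(values)
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Hypothetical transaction amounts; the last one might be a real but
# unusual purchase or a data-entry error.
AMOUNTS = [42.0, 38.5, 41.0, 40.0, 39.5, 4000.0]

if __name__ == "__main__":
    print("mean:", mean(AMOUNTS))
    print("median:", median(AMOUNTS))
    print("flagged:", flag_outliers(AMOUNTS))
```

Here one extreme value pulls the mean to roughly 700 while the median stays near 40. A screening rule can flag the value, but it cannot say whether it is an error to discard or a genuine event to keep; that judgment is part of the labor-intensive data preparation described above.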

3.2.2 Data fusion and information integration


Data fusion is the merging of multiple heterogeneous datasets into one homogeneous representation so that
they can be better processed for data mining and management. Data fusion is used in a number of technical
domains such as sensor networks, video/image processing, robotics and intelligent systems, and elsewhere.

70 Mitchell, Tom M., "The Discipline of Machine Learning," Technical Report CMU-ML-06-108, Carnegie Mellon University,
July 2006.
71 DARPA, for example, has a project involving machine learning and other technologies to build medical causal models from
analysis of cancer literature, leveraging the greater capacity of a computer than a person to process information from a
large number of sources. See description at http://www.darpa.mil/Our_Work/I2O/Programs/Big_Mechanism.aspx
72 "Data mining breaks the basic intuition that identity is the greatest source of potential harm because it substitutes
inference for identifying information as a bridge to get at additional facts." Barocas, Solon and Helen Nissenbaum, "Big
Data's End Run Around Anonymity and Consent," Chapter II, in Lane, Julia, et al., Privacy, Big Data, and the Public Good,
Cambridge University Press, 2014.



Data integration is differentiated from data fusion in that integration more broadly combines data sets and
retains the larger set of information. In data fusion, there is usually a reduction or replacement technique. Data
fusion is facilitated by data interoperability, the ability for two systems to communicate and exchange data.
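
The distinction between integration and fusion can be sketched in a few lines. In this illustrative example (the function names, record layouts, and values are assumptions made up for the purpose), integration keeps every field from every source, whereas fusion reduces several readings of the same quantity to one representative value.

```python
def integrate(source_a, source_b):
    """Integration: combine two keyed data sets, retaining all fields.

    Every key from either source survives. If both sources define the
    same field for a key, the value from source_b wins.
    """
    merged = {}
    for records in (source_a, source_b):
        for key, fields in records.items():
            merged.setdefault(key, {}).update(fields)
    return merged


def fuse(readings):
    """Fusion: reduce several readings of one quantity to a single value.

    The arithmetic mean is used here; real systems may weight readings
    by sensor precision or discard conflicting ones.
    """
    return sum(readings) / len(readings)


# Hypothetical sales records keyed by household identifier.
STORE = {"h1": {"in_store_total": 120.0}, "h2": {"in_store_total": 35.0}}
ONLINE = {"h1": {"online_total": 80.0}, "h3": {"online_total": 15.0}}

if __name__ == "__main__":
    print(integrate(STORE, ONLINE))
    print(fuse([20.0, 21.0, 22.0]))  # three sensors reporting one temperature
```

After integration, household h1 carries both its in-store and its online totals, a more complete picture than either source held alone. After fusion, the three individual readings are no longer distinguishable in the result.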
Data fusion and data integration are key techniques for business intelligence. Retailers are integrating their
online, in-store, and catalog sales databases to create more complete pictures of their customers. Williams-
Sonoma, for example, has integrated customer databases with information on 60 million households. Variables
including household income, housing values, and number of children are tracked. It is claimed that targeted
emails based on this information yield 10 to 18 times the response rate of emails that are not targeted.73 This is
a simple illustration of how more information can lead to better inferences. Techniques that can help to
preserve privacy are emerging.74
There is a great amount of interest today in multisensor data fusion.75 The biggest technical challenges being
tackled today, generally through development of new and better algorithms, relate to data precision/resolution,
outliers and spurious data, conflicting data, modality (both heterogeneous and homogeneous data) and
dimensionality, data correlation, data alignment, association within data, centralized vs. decentralized
processing, operational timing, and the ability to handle dynamic vs. static phenomena. Privacy concerns may
arise from sensor fidelity and precision as well as correlation of data from multiple sensors. A single sensor's
output might not be sensitive, but the combination from two or more may raise privacy concerns.

3.2.3 Image and speech recognition


Image and speech recognition technologies are able to extract information, in some limited cases approaching
human understanding, from massive corpuses of still images, videos, and recorded or broadcast speech.
Urban-scene extraction can be accomplished using a variety of data sources from photos and videos to ground-
based LiDAR (a remote-sensing technique using lasers).76 In the government sector, city models are becoming
vital for urban planning and visualization. They are equally important for a broad range of academic disciplines
including history, archeology, geography, and computer graphics research. Digital city models are also central to
popular consumer mapping and visualization applications such as Google Earth and Bing Maps, as well as GPS-
enabled navigation systems.77 Scene extraction is an example of the inadvertent capture of personal
information and can be used for data fusion that reveals personal information.
Facial-recognition technologies are beginning to be practical in commercial and law-enforcement applications.78
They are able to acquire, normalize, and recognize moving faces in dynamic scenes. Real-time video surveillance
with single-camera systems (and some with multi-camera systems, which can both recognize objects and
analyze activity) has a wide variety of applications in both public and private environments, such as homeland

73 Manyika, J. et al., "Big Data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute,
2011.
74 Navarro-Arribas, G. and V. Torra, "Information fusion in data privacy: A survey," Information Fusion, 13:4, 2012,
pp. 235-244.
75 Khaleghi, B. et al., "Multisensor data fusion: A review of the state-of-the-art," Information Fusion, 14:1, 2013, pp. 28-44.
76 Lam, J., et al., "Urban scene extraction from mobile ground based lidar data," Proceedings of 3DPVT, 2010.
77 Agarwal, S., et al., "Building Rome in a day," Communications of the ACM, 54:10, 2011, pp. 105-112.
78 Workshop on Frontiers in Image and Video Analysis, National Science Foundation, Federal Bureau of Investigation,
Defense Advanced Research Projects Agency, and University of Maryland Institute for Advanced Computer Studies, January
28-29, 2014. http://www.umiacs.umd.edu/conferences/fiva/



security, crime prevention, traffic control, accident prediction and detection, and monitoring patients, the
elderly, and children at home.79 Depending on the application, use of video surveillance is at varying levels of
deployment.80
Additional capabilities of image recognition include

• Video summarization and scene-change detection (that is, picking the small number of images that
  summarize a period of time)
• Precise geolocation in imagery from satellites or drones
• Image-based biometrics
• Human-in-the-loop surveillance systems
• Re-identification of persons and vehicles, that is, tracking the same person or vehicle as it moves from
  sensor to sensor
• Human-activity recognition of various kinds
• Semantic summarization (that is, converting pictures into text summaries)

Although systems are expected to become able to track objects across camera views and detect unusual
activities in a large area by combining information from multiple sources, re-identification of objects remains
hard to do (a challenge for inter-camera tracking), as is video surveillance in crowded environments.

Although the data they use are often captured in public areas, scene-extraction technologies like Google Street
View have triggered privacy concerns. Photos captured for use in Street View may contain sensitive information
about people who are unaware they are being observed and photographed.81
Social-media data can be used as an input source for scene-extraction techniques. When these data are posted,
however, users are unlikely to know that their data would be used in these aggregated ways and that their social-
media information (although public) might appear synthesized in new forms.82
Automated speech recognition has existed since at least the 1950s,83 but developments over the last 10
years have allowed for novel capabilities. Spoken text (e.g., news broadcasters reading part of a document)
can today be recognized with accuracy higher than 95 percent using state-of-the-art techniques. Spontaneous
speech is much harder to recognize accurately. In recent years there has been a dramatic increase in the
corpuses of spontaneous speech data available to researchers, which has allowed for improved accuracy.

79 For example, Newark Airport recently installed a system of 171 LED lights (from Sensity [http://www.sensity.com/]) that
contain special chips to connect to sensors and cameras over a wireless system. These systems allow for advanced
automatic lighting to improve security in places like parking garages, and in doing so capture a large range of information.
80 This was discussed at the workshop cited in footnote 78.
81 Such concerns are likely to grow as commercial satellite-imagery systems such as Skybox (http://skybox.com/) provide
the basis for more services.
82 Billitteri, Thomas J., et al. "Social Media Explosion: Do social networking sites threaten privacy rights?" CQ Researcher,
January 25, 2013, 23:84-104.
83 Juang, B.H. and Lawrence R. Rabiner, "Automated Speech Recognition - A Brief History of the Technology Development,"
October 8, 2004. http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALIASRHistoryfinal108.pdf



Over the next few years speech-recognition interfaces will be in many more places. For example, multiple
companies are exploring speech recognition to control televisions and cars, to find a show on TV, or to schedule
a DVR recording. Researchers at Nuance say they are actively planning how speech technology would have to
be designed to be available on wearable computers.84 Google has already implemented some of this basic
functionality in its Google Glass product, and Microsoft's Xbox One system already integrates machine vision
and multi-microphone audio input for controlling system functions.

3.2.4 Social-network analysis


Social-network analysis refers to the extraction of information from a variety of interconnecting units under the
assumption that their relationships are important and that the units do not behave autonomously.85 Social
networks often emerge in an online context. The most obvious examples are dedicated online social media
platforms, such as Facebook, LinkedIn and Twitter, which provide new access to social interaction by allowing
users to connect directly with each other over the Internet to communicate and share information. Offline
human social networks may also leave analyzable digital traces, such as in phone-call metadata records that
record which phones have exchanged calls or texts, and for how long. Analysis of social networks is increasingly
enabled by the rising collection of digital data that links people together, especially when it is correlated to other
data or metadata about the individual.86 Tools for such analysis are being developed and made available,87
motivated in part by the growing amount of social network content accessible through open application
programming interfaces to online social media platforms. This sort of analysis is an active arena for research.
Social-network analysis complements analysis of conventional databases, and some of the techniques used (e.g.,
clustering in association networks) can be used in either context. Social-network analysis can be more powerful
because of the easy association of diverse kinds of information (i.e., considerable data fusion is possible). It
lends itself to visualization of the results, which aids in interpreting the results of the analysis. It can be used to
learn about people through their association with others, in a context of people's tendency to associate with
others who have some similarities to themselves.88
Social-network analysis is yielding results that may surprise people. In particular, unique identification of an
individual is easier than from database analysis alone. Moreover, it is achieved through more diverse kinds of

84 "Where Speech Recognition is Going," Technology Review, May 29, 2012.
http://www.kurzweilai.net/wherespeechrecognitionisgoing
85 Wasserman, S. Social network analysis: Methods and applications, Cambridge University Press, 8, 1994.
86 See, for example: (1) Backstrom, Lars, et al., "Inferring Social Ties from Geographic Coincidences," Proceedings of the
National Academy of Sciences, 2010. (2) Backstrom, Lars, et al., "Wherefore Art Thou R3579X? Anonymized Social
Networks, Hidden Patterns, and Structural Steganography," International World Wide Web Conference 2007, Alberta,
Canada, May 12, 2007.
87 A variety of tools exist for managing, analyzing, visualizing and manipulating network (graph) datasets, such as
Allegrograph, GraphVis, R, visone and Wolfram Alpha. Some, such as Cytoscape, Gephi and Netviz, are open source.
88 (1) Getoor, L. and E. Zheleva, "Preserving the privacy of sensitive relationships in graph data," Privacy, security, and trust
in KDD, 153-171, 2008. (2) Mislove, A., et al., "An analysis of social-based network Sybil defenses," ACM SIGCOMM
Computer Communication Review, 2011. (3) Backstrom, Lars, et al., "Find Me If You Can: Improving Geographic Prediction
with Social and Spatial Proximity," Proceedings of the 19th international conference on World Wide Web, 2010. (4)
Backstrom, L. and J. Kleinberg, "Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship
Status on Facebook," Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social
Computing (CSCW), 2014.



data than many people may understand, contributing to the erosion of anonymity.89 The structure of an
individual's network is unique and itself serves as an identifier; co-occurrence in time and space is a significant
means of identification; and, as discussed elsewhere in this report, different kinds of data can be combined to
foster identification.90
Social-network analysis is used in criminal forensic investigations to understand the links, means, and motives of
those who may have committed crimes. In particular, social-network analysis has been used to better
understand covert terrorist networks, whose dynamics may be different from those of overt networks.91
In the realm of commerce, it is well understood that what a person's friends like or buy can influence what he or
she might buy. For example, in 2010, it was reported that having one iPhone-owning friend makes a person
three times more likely to own an iPhone than otherwise. A person with two iPhone-owning friends was five
times more likely to have one.92 Such correlations emerge in social-network analysis and can be used to help
predict product trends, tailor marketing campaigns towards products an individual may be more likely to want,
and target customers (said to have higher "network value") with a central role (and a large amount of influence)
in a social network.93
Because disease is commonly spread via direct contact between individuals (humans or animals), understanding
social networks through whatever proxies are available can suggest possible direct contacts and thereby assist in
monitoring and stemming the outbreak of disease.
A recent study by researchers at Facebook analyzed the relationship between geographic location of individual
users and that of their friends. From this analysis, they were able to create an algorithm to predict the location
of an individual user based upon the locations of a small number of friends in their network, with higher
accuracy than simply looking at the user's IP address.94
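
The general idea of inferring a person's location from the locations of friends can be sketched very simply. The example below is not the algorithm from the cited study; it is a deliberately naive stand-in (a majority vote among friends whose location is known), and the names and data are hypothetical.

```python
from collections import Counter


def predict_location(user, friends_of, known_location):
    """Guess a user's location as the most common location among friends.

    Returns None when no friend has a known location. Ties are broken
    alphabetically so the result is deterministic.
    """
    locations = [
        known_location[friend]
        for friend in friends_of.get(user, ())
        if friend in known_location
    ]
    if not locations:
        return None
    counts = Counter(locations)
    best = max(counts.values())
    return min(place for place, n in counts.items() if n == best)


# Hypothetical friendship graph and self-reported locations.
FRIENDS_OF = {"ana": ["bo", "cy", "di", "ed"]}
KNOWN_LOCATION = {"bo": "Austin", "cy": "Austin", "di": "Boston"}

if __name__ == "__main__":
    print(predict_location("ana", FRIENDS_OF, KNOWN_LOCATION))
```

The point for privacy is that the user "ana" in this example disclosed nothing about her own location; the inference rests entirely on what her friends disclosed about theirs.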
There are many commercial "social listening" services, such as Radian6/Salesforce Cloud, Collective Intellect,
Lithium, and others, that mine data from social-networking feeds for use in business intelligence.95 Coupled

89 (1) Narayanan, A. and V. Shmatikov, "De-anonymizing social networks," 30th IEEE Symposium on Security and Privacy,
173-187, 2009. (2) Crandall, David J., et al., "Inferring social ties from geographic coincidences," Proceedings of the National
Academy of Sciences, 107:52, 2010. (3) Backstrom, L, C. Dwork and J. Kleinberg, "Wherefore Art Thou R3579X? Anonymized
Social Networks, Hidden Patterns, and Structural Steganography," Proceedings of the 16th Intl. World Wide Web
Conference, 2007. (4) Saramäki, Jari, et al., "Persistence of social signatures in human communication," Proceedings of the
National Academy of Sciences, 111.3: 942-947, 2014.
90 Fienberg, S.E., "Is the Privacy of Network Data an Oxymoron?" Journal of Privacy and Confidentiality, 4:2, 2013.
91 Krebs, V.E., "Mapping networks of terrorist cells," Connections, 24.3: 43-52, 2002.
92 Sundsøy, P.R., et al., "Product adoption networks and their growth in a large mobile phone network," Advances in Social
Networks Analysis and Mining (ASONAM), 2010.
93 Hodgson, Bob, "A Vital New Marketing Metric: The Network Value of a Customer," Predictive Marketing: Optimize Your
ROI With Analytics.
http://predictivemarketing.com/index.php/avitalnewmarketingmetricthenetworkvalueofacustomer/
94 Backstrom, Lars et al, "Find me if you can: improving geographical prediction with social and spatial proximity,"
Proceedings of the 19th international conference on World Wide Web, 2010.
95 "Top 20 social media monitoring vendors for business," Socialmedia.biz,
http://socialmedia.biz/2011/01/12/top20socialmediamonitoringvendorsforbusiness/



with social-network analysis, this information can be used to evaluate changing influences and the spread of
trends between individuals and communities to inform marketing strategies.

3.3 The infrastructure behind big data


Big-data analytics requires not just algorithms and data, but also physical platforms where the data are stored
and analyzed. The related security services used for personal data (see Sections 4.1 and 4.2) are also an
essential component of the infrastructure. Once available only to large organizations, this class of infrastructure
is now available through "the cloud" to small businesses and to individuals. To the extent that the software
infrastructure is widely shared, privacy-preserving infrastructure services can also be more readily used.

3.3.1 Data centers


One way to think about big-data platforms is in physical units of data centers. In recent years, data centers
have become almost standard commodities. A typical data center is a large, warehouse-like building on a
concrete slab the size of a few football fields. It is located with good access to cheap electric power and to a
fiber-optic, Internet-backbone connection, usually in a rural or isolated area. The typical center consumes 20-40
megawatts of power (the equivalent of a city with 20,000-40,000 residents) and today houses some tens of
thousands of servers and hard-disk drives, totaling some tens of petabytes.96 Worldwide, there are roughly
6000 data centers of this scale, about half in the United States.97
Data centers are the physical locus of big data in all its forms. Large data collections are often replicated in
multiple data centers to improve both performance and robustness. There is a growing marketplace in selling
data-center services.
Specialized software technology allows the data in multiple data centers (and spread across tens of thousands of
processors and hard-disk drives) to cooperate in performing the tasks of data analytics, thereby providing both
scaling and better performance. For example, MapReduce (originally a proprietary technology of Google, but
now a term used generically) is a programming model for parallel operations across a practically unlimited
number of processors; Hadoop is a popular open-source programming platform and program library based on
the same ideas; NoSQL (the name derived from "not Structured Query Language") is a set of database
technologies that relaxes many of the restrictions of traditional, "relational" databases and allows for better
scalability across the many processors in one or more data centers. Contemporary research is aimed at the next
generation beyond Hadoop. One path is represented by Accumulo, initiated by the National Security Agency
and transitioned to the open-source Apache community.98 Another is the Berkeley Data Analytics Stack, an
open-source platform that outperforms Hadoop by a factor of 100 for memory-intensive data analytics and is
being used by such companies as Foursquare, Conviva, Klout, Quantifind, Yahoo, and Amazon Web Services.99
Sometimes termed "NoHadoop" (to parallel the movement from SQL to NoSQL), technologies that fit this trend
include Google's Dremel, MPI (typically used in supercomputing), Pregel (for graphs), and Cloudscale (for real-
time analytics).
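
The MapReduce programming model mentioned above can be illustrated on a single machine. The sketch below is not Hadoop or Google's implementation and uses no cluster; it only shows the three conceptual phases (map, shuffle, reduce) applied to a word count, with function names chosen for this example. In a real deployment, the map and reduce calls would run in parallel on many processors.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big privacy"]))
```

Because each map call sees only one document and each reduce call sees only one key, the work divides naturally across machines, which is what allows the model to scale to the data-center sizes described in this section.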

96 A petabyte is 10^15 bytes. One petabyte could store the individual genomes of the entire U.S. population. The human
brain has been estimated to have a capacity of 2.5 petabytes.
97 McLellan, Charles, "The 21st Century Data Center: An Overview," ZDNet, April 2, 2013.
http://www.zdnet.com/the21stcenturydatacenteranoverview7000012996/
98 See: http://accumulo.apache.org/
99 See: https://amplab.cs.berkeley.edu/software/


3.3.2 The cloud


The cloud is not just the world inventory of data centers (although much of the public may think of it as such).
Rather, one way of understanding the cloud is as a set of platforms and services made possible by the physical
commoditization of data centers. When one says that data are "in the cloud," one refers not just to the physical
hard-disk drives that exist (somewhere!) with the data, but also to the complex infrastructure of application
programs, middleware, networking protocols, and (not least) business models that allow that data to be
ingested, accessed, and utilized, all with costs that are competitively allocated. The commercial entities that, in
aggregate, provision the cloud exist in an ecosystem that has many hierarchical levels and many different
coexisting models of value added. There may be several handoffs of responsibility between the end user and
the physical data center.
Today's cloud providers offer some security benefits (and through that, privacy benefits) as compared to
yesterday's conventional corporate data centers or small-business computers.100 These services may include
better physical protection and monitoring, as well as centralized support staffing, training, and oversight. Cloud
services also pose new challenges for security, a subject of current research. Both benefits and risks come from
the centralization of resources: More data are held by a given entity (albeit distributed across multiple servers
or sites), and a cloud provider can perform better than separately held data centers by applying high standards
to recruiting and managing people and systems.
Usage of the cloud and individual interactions with it (whether witting or not) are expected to increase
dramatically in coming years. The rise of both mobile apps,101 reinforcing the use of cell phones and tablets as
platforms, and broadly distributed sensors is associated with the growing use of cloud systems for storing,
processing, and otherwise acting on information contributed by dispersed devices. Although progress in the
mobile environment improves the usability of mobile cloud applications, it may be detrimental to privacy to the
extent that it more effectively hides information exchange from the user. As more core mobile functionality is
transitioned to the cloud, larger amounts of information will be exchanged, and users may be surprised by the
nature of the information that no longer remains localized to their cell phone. For example, cloud-based screen
rendering (or "virtualized screens") for cell phones would mean that the images shown on a cell-phone screen
will actually be calculated on the cloud and transmitted to the mobile device. This means all the images on the
screen of the mobile device can be accessed and manipulated from the cloud.
Cloud architectures are also being used increasingly to support big-data analytics, both by large enterprises (e.g.,
Google, Amazon, eBay) and by small entities or individuals who make ad hoc or routine use of public cloud
platforms (e.g., Amazon Web Services, Google Cloud Platform, Microsoft Azure) in lieu of acquiring their own
infrastructure. Social-media services such as Facebook and Twitter are deployed and analyzed by their providers
using cloud systems. These uses represent a kind of democratization of analytics, with the potential to facilitate
new businesses and more. Prospects for the future include exploration of options for federating or

100 Cloud Security Alliance, Big Data Working Group: "Comment on Big Data and the Future of Privacy," March 2014.
https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Comment_on_Big_Data_Future_of_Privacy.pdf
101 Qi, H. and A. Gani, "Research on mobile cloud computing: Review, trend and perspectives," Digital Information and
Communication Technology and it's Applications (DICTAP), 2012 Second International Conference on, 2012.



interconnecting cloud applications and for reducing some of the heterogeneity in application programming
interfaces for cloud applications.102

102 Jeffery, K. et al., "A vision for better cloud applications," Proceedings of the 2013 International Workshop on Multi-Cloud
Applications and Federated Clouds, Prague, Czech Republic, MODAClouds, ACM Digital Library, April 22-23, 2013.


4. Technologies and Strategies for Privacy Protection


Data come into existence, are collected, and are possibly processed immediately (including adding "metadata"),
possibly communicated, possibly stored (locally, remotely, or both), possibly copied, possibly analyzed, possibly
communicated to users, possibly archived, possibly discarded. Technology at any of these stages can affect
privacy positively or negatively.
This chapter focuses on the positive and assesses some of the key technologies that can be used in service of the
protection of privacy. It seeks to clarify the important distinctions between privacy and (cyber-)security, as well
as the vital, but yet limited, role that encryption technology can play. Some older techniques, such as
anonymization, while valuable in the past, are seen as having only limited future potential. Newer technologies,
some entering the marketplace and some requiring further research, are summarized.

4.1 The relationship between cybersecurity and privacy


Cybersecurity is a discipline, or set of technologies, that seeks to enforce policies relating to several different
aspects of computer use and electronic communication.103 A typical list of such aspects would be

• identity and authentication: Are you who you say you are?
• authorization: What are you allowed to do?
• availability: Can attackers interfere with authorized functions?
• confidentiality: Can data or communications be (passively) copied by someone not authorized to do so?
• integrity: Can data or communications be (actively) changed or manipulated by someone not
  authorized?
• non-repudiation, auditability: Can actions (payments may provide the best example) later be shown to
  have occurred?
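
Of the aspects listed above, integrity is among the easiest to illustrate in code. The following minimal sketch uses a keyed hash (HMAC with SHA-256, from the Python standard library) so that a recipient who shares a secret key with the sender can detect whether a message was altered in transit. The function names and the example messages are assumptions for illustration. Note that a shared-key tag of this kind addresses integrity between the two key holders; it does not by itself provide non-repudiation, since either party could have produced the tag.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag for `message` under a shared secret key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Check a received tag, using a constant-time comparison."""
    return hmac.compare_digest(make_tag(key, message), received_tag)


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # shared secret, generated at random
    original = b"pay 10 dollars to account 123"
    tag = make_tag(key, original)

    print(verify(key, original, tag))                            # unmodified message
    print(verify(key, b"pay 1000 dollars to account 123", tag))  # tampered message
```

The policy being enforced here ("only a holder of the key can produce a message the recipient will accept") is precise enough to state mathematically, which is the kind of clarity that the following paragraphs identify as characteristic of good cybersecurity.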

Good cybersecurity enforces policies that are precise and unambiguous. Indeed, such clarity of policy,
expressible in mathematical terms, is a necessary prerequisite for the Holy Grail of cybersecurity, "provably
secure" systems. At present, provable security exists only in very limited domains, for example, for certain
functions on some kinds of computer chips. It is a goal of cybersecurity research to extend the scope of
provably secure systems to larger and larger domains. Meanwhile, practical cybersecurity draws on the
emerging principles of such research, but it is guided even more by practical lessons learned from known failures
of cybersecurity. The realistic goal is that the practice of cybersecurity should be continuously improving so as
to be, in most places and most of the time, ahead of the evolving threat.
Poor cybersecurity is clearly a threat to privacy. Privacy can be breached by failure to enforce confidentiality of
data, by failure of identity and authentication processes, or by more complex scenarios such as those
compromising availability.

103 PCAST has addressed issues in cybersecurity, both in reviewing the NITRD programs and directly in a 2013 report,
Immediate Opportunities for Strengthening the Nation's Cybersecurity.
http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_cybersecurity_nov2013.pdf



Security and privacy share a focus on malice. The security of data can be compromised by inadvertence or
accident, but it can also be compromised because some party acted knowingly to achieve the compromise (in
the language of security, committed an attack). Substituting the words "breach" or "invasion" for "compromise"
or "attack," the same concepts apply to privacy.
Even if there were perfect cybersecurity, however, privacy would remain at risk. Violations of privacy are
possible even when there is no failure in computer security. If an authorized individual chooses to misuse (e.g.,
disclose) data, what is violated is privacy policy, not security policy. Or, as we have discussed (see Section 3.1.1),
privacy may be violated by the fusion of data even if performed by authorized individuals on secure computer
systems.104
Privacy is different from security in other respects. For one thing, it is harder to codify privacy policies precisely.
Arguably this is because the presuppositions and preferences of human beings have greater diversity than the
useful scope of assertions about computer security. Indeed, how to codify human privacy preferences is an
important, nascent area of research.105
When people provide assurance (at some level) that a computer system is secure, they are saying something
about applications that are not yet invented: They are asserting that technological design features already in
the machine today will prevent such application programs from violating pertinent security policies in that
machine, even tomorrow.106 Assurances about privacy are much more precarious. Since not-yet-invented
applications will have access to not-yet-imagined new sources of data, as well as to not-yet-discovered powerful
algorithms, it is much harder to provide, today, technological safeguards against a new route to violation of
privacy tomorrow. Security deals with tomorrow's threats against today's platforms. That is hard enough. But
privacy deals with tomorrow's threats against tomorrow's platforms, since those "platforms" comprise not just
hardware and software, but also new kinds of data and new algorithms.
Computer scientists often work from the basis of a formal policy for security, just as engineers aim to describe
something explicitly so that they can design specific ways to deal with it by purely technical means. As more
computer scientists begin to think about privacy, there is increasing attention to formal articulation of privacy
policy.107 To caricature, you have to know what you are doing to know whether what you are doing is doing the
right thing.108 Research addressing the challenges of aligning regulations and policies with software

104 There are also choices in the design and implementation of security mechanisms that affect privacy. In particular,
authentication or the attempt to demonstrate identity at some level can be done with varying degrees of disclosure. See,
for example: Computer Science and Telecommunications Board, Who Goes There: Authentication Through the Lens of
Privacy, National Academies Press, 2003.
105 Such research can inform efforts to automate the checking of compliance with policies and/or associated auditing.
106 This future-proofing remains hard to achieve; PCAST's cybersecurity report advocated approaches that would be more
durable than the kinds of checklists that are easily rendered obsolete. See:
http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_cybersecurity_nov2013.pdf
107 See, for example: (1) Breaux, Travis D., and Ashwini Rao, "Formal Analysis of Privacy Requirements Specifications for
Multi-Tier Applications," 21st IEEE Requirements Engineering Conference (RE 2013), Rio de Janeiro, Brazil, July 2013.
http://www.cs.cmu.edu/~agrao/paper/Analysis_of_Privacy_Requirements_Facebook_Google_Zynga.pdf (2) Feigenbaum,
Joan, et al., "Towards a Formal Model of Accountability," New Security Paradigms Workshop 2011, Marin County, CA,
September 12-15, 2011. http://www.nspw.org/papers/2011/nspw2011feigenbaum.pdf
108 Landwehr, Carl, "Engineered Controls for Dealing with Big Data," Chapter 10, in Lane, Julia, et al., Privacy, Big Data, and
the Public Good, Cambridge University Press, 2014.

specifications includes formal languages to express policies and system requirements; tools to reason about
conflicts, inconsistencies, and ambiguities within and among policies and software specifications; methods to
enable requirements engineers, business analysts, and software developers to analyze and refine policy into
measurable system specifications that can be monitored over time; formalizing and enforcing privacy through
auditing and accountability systems; privacy compliance in big data systems; and formalizing and enforcing
purpose restrictions.

4.2 Cryptography and encryption


Cryptography comprises a set of algorithms and system design principles, some well-developed and others
nascent, for protecting data. Cryptography is a field of knowledge whose products are encryption technology.
With well-designed protocols, encryption technology is an inhibitor to compromising privacy, but it is not a
"silver bullet."109

4.2.1 Well established encryption technology


Using cryptography, readable data of any kind, termed plaintext, are transformed into what are, for all intents
and purposes, incomprehensible strings of provably random bits, so-called cryptotext. Cryptotext requires no
security protection of any kind. It can be stored in the cloud or sent anywhere that is convenient. It can be sent
promiscuously to both the NSA and Russian FSB. If they have only cryptotext, and if it was properly generated
in a precise mathematical sense, it is useless to them. They can neither read the data nor compute with it.
What is needed to decrypt, to turn cryptotext back into the original plaintext, is a key, which is in practice a
string of bits that is supposed to be known to (or computable by) only authorized users. Only with the key can
encrypted data be used, i.e., their value read.
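
The relationship among plaintext, key, and cryptotext can be sketched in a few lines. The example below is an illustration only: it uses a one-time pad (XOR with a random key as long as the data), chosen because it needs nothing beyond the Python standard library, not because it is what deployed systems use. The function names and the sample plaintext are assumptions made for this sketch; production systems rely on vetted ciphers and libraries.

```python
import secrets


def generate_key(length: int) -> bytes:
    """Return `length` random bytes from the operating system's secure source."""
    return secrets.token_bytes(length)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with an equally long key; the same call encrypts and decrypts."""
    if len(data) != len(key):
        raise ValueError("a one-time pad key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


plaintext = b"PIN=4921"                 # hypothetical sensitive value
key = generate_key(len(plaintext))      # must stay secret and never be reused
cryptotext = xor_bytes(plaintext, key)  # can be stored or sent anywhere
recovered = xor_bytes(cryptotext, key)  # possible only for a holder of the key
```

Note that the plaintext exists in memory before encryption and after decryption, which is exactly where the compromises described next tend to occur.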
In the context of protecting privacy, it is primarily not the cryptography that is of concern.110 Rather,
compromises of data will occur in one of two main ways:

• Data can be stolen, or mistakenly shared, before they have been encrypted or after they have been
decrypted. Many attacks on supposedly encrypted data are actually attacks on machines that contain,
however briefly, unencrypted plaintext. For example, in Target's 2013 breach of one hundred million
debit card numbers and personal identification numbers (PINs), the PINs were present in unencrypted
form only ephemerally. They were stolen nonetheless.111
• Keys must be authorized, generated, distributed, and used. At every stage of a key's life, it is potentially
open to compromise or misuse that can ultimately compromise the data that the key was intended to
protect. No system based on encryption is secure, of course, if persons with access to private keys can
be coerced into sharing them.

109
The use of this term in computing originated with what is now viewed as a classic article: Brooks, Fred P., "No silver
bullet: Essence and Accidents of Software Engineering," IEEE Computer 20:4, April 1987, pp. 10-19.
110
Attacks that compromise the hardware or software that does the encrypting (for example, the promulgation of
intentionally weak cryptography standards) can be considered to be a variant of attacks that reveal plaintext.
111
Krebs on Security, collected posts on Target data breach, 2014. http://krebsonsecurity.com/tag/targetdatabreach/

Until the 1970s, keys were distributed physically, on paper or computer media, protected by registered mail,
armed guards, or anything in between. The invention of public-key cryptography112 changed everything.
Public-key cryptography, as the name implies, allows individuals to broadcast publicly their personal key. But
this public key is only an encryption key, useful for turning plaintext into cryptotext that is meaningless to
others. Its corresponding private key, used to transform cryptotext to plaintext, is still kept secret by the
recipient. Public-key cryptography thus turns the problem of key distribution into a problem of identity
determination. Alice's messages (encrypted data transmissions) to Bob are completely protected by Bob's
public key, but only if Alice is certain that it is really Bob's public key that she is using, and not the public key of
someone merely masquerading as Bob.
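
The split between a public encryption key and a private decryption key can be made concrete with textbook RSA. The sketch below is a toy under stated assumptions: the primes are tiny, there is no padding, and the message is a single small integer, so it offers no real security. The variable and function names are chosen for this illustration and do not come from the report.

```python
# Toy parameters, far too small to be secure; chosen so the arithmetic is visible.
p, q = 61, 53
n = p * q                    # public modulus, shared by both keys
phi = (p - 1) * (q - 1)      # kept secret by Bob; used only to derive d
e = 17                       # Bob's public exponent
d = pow(e, -1, phi)          # Bob's private exponent (modular inverse, Python 3.8+)

bob_public_key = (e, n)      # Bob may broadcast this to anyone
bob_private_key = (d, n)     # Bob keeps this to himself


def encrypt(message: int, public_key: tuple) -> int:
    """Anyone holding the public key can turn plaintext into cryptotext."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(cryptotext: int, private_key: tuple) -> int:
    """Only the private-key holder can turn cryptotext back into plaintext."""
    exponent, modulus = private_key
    return pow(cryptotext, exponent, modulus)


message = 65                                     # must be smaller than n
cryptotext = encrypt(message, bob_public_key)    # Alice's side
recovered = decrypt(cryptotext, bob_private_key) # Bob's side
```

Nothing in this arithmetic tells Alice that `bob_public_key` really belongs to Bob; that is the identity problem the text turns to next.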
Luckily, public-key cryptography also provides some techniques for helping to establish identity, namely the
electronic signing of messages to document their authenticity. Electronic signatures, in turn, enable messages
of the form "I, a person of authority known as X, certify that the following is really the public key of subordinate
person Y. (Signed) X." Messages like this are termed certificates. Certificates can be cascaded, with A certifying
the identity of B, who certifies C, and so on. Certificates essentially transform the identity problem from one of
validating the identity of millions of possible Ys to validating the identity of a much smaller number of top-level
certificate authorities (CAs). Yet it is a matter of concern that more than 100 top-level CAs are widely recognized
(e.g., accepted by almost all web browsers), because there may be several intermediate steps in the hierarchy of
certificates from a CA to a user, and at every step a private key must be protected by some signer on some
computer. The compromise of this private key potentially compromises the privacy of all users lower down the
chain, because forged certificates of identity can now be created. Such exploits have been seen. For example,
the 2011 apparent theft of a Dutch CA's private key compromised the privacy of potentially all government
records in the Netherlands.113,114
Many major companies have recently introduced or strengthened their use of encryption to transmit data.115
Some are now using (perfect) forward secrecy, a variant of public-key cryptography that ensures that the
compromise of an individual's private key can compromise only messages that he receives subsequently, while
the confidentiality of past conversations is maintained, even if their cryptotext was previously recorded by the
same eavesdropper now in possession of the purloined private key.116

4.2.2 Encryption frontiers


The technologies thus far mentioned enable the protection of data both in storage and in transit, allowing those
data to be fully decrypted by users who either (i) have the right key already (as might be the case for persons

112
Public-key encryption originated through the secret work of British mathematicians at the U.K.'s Government
Communications Headquarters (GCHQ), an organization roughly analogous to the NSA, and received broader attention
through the independent work by researchers including Whitfield Diffie and Martin Hellman in the United States.
113
Fisher, Dennis, "Final Report on DigiNotar Hack Shows Total Compromise of CA Servers," ThreatPost, October 31, 2012.
http://threatpost.com/finalreportdiginotarhackshowstotalcompromisecaservers103112/77170.
114
It is not publicly known whether or not the earlier 2010 compromise of servers belonging to VeriSign, a much larger CA,
led to compromises of certificates or signing authorities. Bradley, Tony, "VeriSign Hacked: What We Don't Know Might Hurt
Us," PC World, February 2, 2012.
http://www.pcworld.com/article/249242/verisign_hacked_what_we_dont_know_might_hurt_us.html
115
A sample report card: https://www.eff.org/deeplinks/2013/11/encryptwebreportwhosdoingwhat#cryptochart
116
Diffie, Whitfield, et al., "Authentication and Authenticated Key Exchanges," Designs, Codes and Cryptography 2:2, June
1992, pp. 107-125.

storing data for their own later use), or (ii) are authorized by the data owner and have identities certified by a CA
that is itself trusted by the data owner. A frontier of cryptography research, with some inventions now starting
to make it into practice, is how to create different kinds of keys, ones which give only limited access of various
kinds, or which allow messages to be sent to classes of individuals without knowing in advance exactly who they
may be.
For example, identity-based encryption and attribute-based encryption are ways of sending a message, or
protecting a file of data, for the exclusive use of "a person named Ramona Q. Doe who was born on May 23,
1980," or for "anyone with the job title ombudsman, ombudsperson, or consumer advocate." These techniques
require a trusted third party (essentially a certificate authority), but the messages themselves do not need to
pass through the hands of that third party. These tools are in early stages of adoption.
Zero-knowledge systems allow encrypted data to be queried for certain higher-level abstractions without
revealing the low-level data. For example, a website operator could verify that a user is over age 21 without
learning the user's actual birthdate. What is remarkable is that this can be done in a way that proves
mathematically that the user is not lying about his age: The operator learns with mathematical certainty that a
certificate (signed by some CA, of course!) attests to the user's birthdate, without ever actually seeing that
certificate. Zero-knowledge systems are just beginning to be commercialized in simple cases. They are not
foreseeably extendable to complex and unstructured situations, such as what might be needed for the research
mining of health-record data from nonconsenting patients.
In some simpler domains, for example location privacy, practical cryptographic protection is closer to reality.
The typical case might be that a group of friends want to know when they are close to one another, but without
sharing their actual locations with any third party. Applications like this are, of course, much simpler if there is a
trusted third party, as is de facto the case for most such commercial applications today.
Homomorphic encryption is a research area that goes beyond the mere querying of encrypted databases to
actual computations (e.g., the collection of statistics) using encrypted data without ever decrypting it. These
techniques are far from being practical, and they are unlikely to provide policy options on the time scale relevant
to this report.
In secure multi-party computation, which is related to homomorphic encryption and is of particular interest in
the financial sector, computation may be done on distributed data stores that are encrypted. Although
individual data are kept private using "collusion-robust" encryption algorithms, data can be used to calculate
general statistics. Parties that each know some private data use a protocol that generates useful results based
on both information they know and information they do not know, without revealing to them data they do not
already know.
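
One simple building block of secure multi-party computation is additive secret sharing, sketched below for the case of computing a sum. This is an illustration under assumptions, not a description of any deployed protocol: all parties are simulated inside one process, they are assumed to follow the protocol honestly, and the modulus, function names, and sample balances are hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all share arithmetic is done modulo this value


def make_shares(secret: int, n_parties: int) -> list:
    """Split a secret into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values: list) -> int:
    """Compute the total of all private values without pooling the values themselves."""
    n = len(private_values)
    # Row i holds the shares party i hands out; column j is what party j receives.
    distributed = [make_shares(value, n) for value in private_values]
    # Each party adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


balances = [1200, 350, 9800]   # hypothetical private inputs, one per party
total = secure_sum(balances)   # equals the true sum; no single share reveals an input
```

Each individual share is uniformly random on its own, so a party learns nothing about another party's input beyond what the published total already implies.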
Differential privacy, a comparatively new development related to but different from encryption, aims to
maximize the accuracy of database queries or computations while minimizing the identifiability of individuals
with records in the database, typically via obfuscation of query results (for example, by the addition of spurious
information or "noise").117 As with other obfuscation approaches, there is a tradeoff between data anonymity

117
(1) Dwork, Cynthia, "Differential Privacy," 33rd International Colloquium on Automata, Languages and Programming,
2006. (2) Dwork, Cynthia, "A Firm Foundation for Private Data Analysis," Communications of the ACM, 54.1, 2011.

and the accuracy and utility of the query outputs. These ideas are far from practical application, except insofar
as they may enable the risks of allowing any queries at all to be better assessed.
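
The tradeoff between anonymity and accuracy can be seen in the Laplace mechanism, a standard construction from the differential privacy literature. The sketch below applies it to a counting query. It is a minimal illustration under assumptions: the function names, the sample ages, and the choice of epsilon are hypothetical, and a real deployment would also need to track the cumulative privacy cost of repeated queries.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer 'how many records satisfy predicate?' with noise added to the result."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so the noise scale is 1 / epsilon: smaller epsilon, more privacy, less accuracy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random()
ages = [34, 51, 29, 62, 47, 38, 71, 25]   # hypothetical database column
noisy_answer = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

With `epsilon=0.5` the typical error is a few counts, which swamps a database of eight records but would be negligible for a count in the millions.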

4.3 Notice and consent


Notice and consent is, today, the most widely used strategy for protecting consumer privacy. When the user
downloads a new app to his or her mobile device, or when he or she creates an account for a web service, a
notice is displayed, to which the user must positively indicate consent before using the app or service. In some
fantasy world, users actually read these notices, understand their legal implications (consulting their attorneys if
necessary), negotiate with other providers of similar services to get better privacy treatment, and only then click
to indicate their consent. Reality is different.118
Notice and consent fundamentally places the burden of privacy protection on the individual, exactly the
opposite of what is usually meant by a "right." Worse yet, if it is hidden in such a notice that the provider has
the right to share personal data, the user normally does not get any notice from the next company, much less
the opportunity to consent, even though use of the data may be different. Furthermore, if the provider changes
its privacy notice for the worse, the user is typically not notified in a useful way.
As a useful policy tool, notice and consent is defeated by exactly the positive benefits that big data enables:
new, non-obvious, unexpectedly powerful uses of data. It is simply too complicated for the individual to make
fine-grained choices for every new situation or app. Nevertheless, since notice and consent is so deeply rooted
in current practice, some exploration of how its usefulness might be extended seems warranted.
One way to view the problem with notice and consent is that it creates a non-level playing field in the implicit
privacy negotiation between provider and user. The provider offers a complex take-it-or-leave-it set of terms,
backed by a lot of legal firepower, while the user, in practice, allocates only a few seconds of mental effort to
evaluating the offer, since acceptance is needed to complete the transaction that was the user's purpose, and
since the terms are typically difficult to comprehend quickly. This is a kind of market failure. In other contexts,
market failures like this can be mitigated by the intervention of third parties who are able to represent
significant numbers of users and negotiate on their behalf. Section 4.5.1 below suggests how such intervention
might be accomplished.

4.4 Other strategies and techniques


4.4.1 Anonymization or deidentification
Long used in health-care research and other research areas involving human subjects, anonymization (also
termed de-identification) applies when the data, standing alone and without an association to a specific person,
do not violate privacy norms. For example, you may not mind if your medical record is used in research as long
as you are identified only as Patient X and your actual name and patient identifier are stripped from that record.
Anonymization of a data record might seem easy to implement. Unfortunately, it is increasingly easy to defeat
anonymization by the very techniques that are being developed for many legitimate applications of big data. In

118
Gindin, Susan E., "Nobody Reads Your Privacy Policy or Online Contract: Lessons Learned and Questions Raised by the
FTC's Action against Sears," Northwestern Journal of Technology and Intellectual Property 1:8, 2009-2010.

general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals
(that is, re-associate their records with their names) grows substantially.119
One compelling example comes from Sweeney, Abu, and Winn.120 They showed in a recent paper that, by fusing
public Personal Genome Project profiles containing zip code, birthdate, and gender with public voter rolls, and
mining for names hidden in attached documents, 84-97 percent of the profiles for which names were provided
were correctly identified.
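
The fusion step in such a re-identification can be sketched as an ordinary join on the shared fields. The example below is a simplified illustration and not the method of the cited paper: every record, name, and field name is invented for the sketch, and it covers only the linkage on zip code, birthdate, and gender, not the mining of attached documents.

```python
from collections import defaultdict

# Hypothetical "de-identified" profiles: no names, but quasi-identifiers remain.
profiles = [
    {"profile_id": "P1", "zip": "02139", "birthdate": "1980-05-23", "gender": "F"},
    {"profile_id": "P2", "zip": "02139", "birthdate": "1975-01-02", "gender": "M"},
    {"profile_id": "P3", "zip": "94110", "birthdate": "1990-11-30", "gender": "F"},
]

# Hypothetical public voter roll: names together with the same three fields.
voter_roll = [
    {"name": "Ramona Q. Doe", "zip": "02139", "birthdate": "1980-05-23", "gender": "F"},
    {"name": "John Roe", "zip": "02139", "birthdate": "1975-01-02", "gender": "M"},
    {"name": "James Poe", "zip": "02139", "birthdate": "1975-01-02", "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")


def key_of(record: dict) -> tuple:
    """Project a record onto the fields that both data sets share."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(profiles: list, voter_roll: list) -> dict:
    """Map profile IDs to names wherever the shared fields match exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voter_roll:
        names_by_key[key_of(voter)].append(voter["name"])
    matches = {}
    for profile in profiles:
        candidates = names_by_key.get(key_of(profile), [])
        if len(candidates) == 1:   # a unique match re-associates record and name
            matches[profile["profile_id"]] = candidates[0]
    return matches


matches = reidentify(profiles, voter_roll)
```

In this toy data only the first profile is uniquely matched; the second is ambiguous between two voters and the third has no counterpart. The more auxiliary data sets are available, the fewer profiles stay ambiguous.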
Anonymization remains somewhat useful as an added safeguard, but it is not robust against near-term future
re-identification methods. PCAST does not see it as being a useful basis for policy. Unfortunately, anonymization is
already rooted in the law, sometimes giving a false expectation of privacy where data lacking certain identifiers
are deemed not to be personally identifiable information and therefore not covered by such laws as the Family
Educational Rights and Privacy Act (FERPA).

4.4.2 Deletion and nonretention


It is an evident good business practice that data of all kinds should be deleted when they are no longer of value.
Indeed, well-run companies often mandate the destruction of some kinds of records (both paper and electronic)
after specified periods of time, often because they see little benefit in keeping the records as well as potential
cost in producing them. For example, employee emails, which may be subject to legal process by (e.g.) divorce
lawyers, are often seen as having negative retention value.
Counter to this practice is the new observation that big data is frequently able to find economic or social value in
masses of data that were otherwise considered to be worthless. As the physical cost of retention continues to
decrease exponentially with time (especially in the cloud), there will be a tendency in both government and the
private sector to hold more data for longer, with obvious privacy implications. Archival data may also be
important to future historians, or for later longitudinal analysis by academic researchers.
Only policy interventions will counter this trend. Government can mandate retention policies for itself. To
affect the private sector, government may mandate policies where it has regulatory authorities (as for consumer
protection, for example). But it can also encourage the development of stricter liability standards for companies
whose data, including archived data, cause harm to individuals. A rational response by the private sector would
then be to hold fewer data or to protect their use.
The above holds true for privacy-sensitive data about individuals that are held overtly; that is, the holder knows
that he has the data and to whom they relate. As was discussed in Section 3.1.2, however, sources of data
increasingly contain latent information about individuals, information that becomes known only if the holder
expends analytic resources (beyond what may be economically feasible), or that may become knowable only in
the future with the development of new data-mining algorithms. In such cases it is practically impossible for the
data holder even to surface "all the data about an individual," much less delete those data on any specified
schedule.

119
De-identification can also be seen as a spectrum, rather than a single approach. See: Response to Request for
Information Filed by U.S. Public Policy Council of the Association for Computing Machinery, March 2014.
120
Sweeney, et al., "Identifying Participants in the Personal Genome Project by Name," Harvard University Data Privacy Lab.
White Paper 1021-1, April 24, 2013. http://dataprivacylab.org/projects/pgp/

The concepts of ephemerality (keeping data only on-the-fly or for a brief period), and transparency (enabling the
individual to know what data about him or her are held) are closely related, and with the same practical
limitations. While data that are only streamed, and not archived, may have lower risk of future use, there is no
guarantee that a violator will play by the supposed rules, as in Target's loss of 100 million debit card PINs, each
present only ephemerally (see Section 4.2.1).
Today, given the distributed and redundant nature of data storage, it is not even clear that data can be
destroyed with any useful degree of assurance. Although research on data destruction is ongoing, it is a
fundamental fact that at the moment that data are displayed (in "analog") to a user's eyeballs or ears, they can
also be copied ("re-digitized") without any technical protections. The same holds if data are ever made available
in unencrypted form to a rogue computer program, one designed to circumvent technical safeguards. Some
misinformed public discussion notwithstanding, there is no such thing as automatically self-deleting data, other
than in a fully controlled and rule-abiding environment.
As a current example, SnapChat provides the service of delivering ephemeral snapshots (images), visible for only
a few seconds, to a designated recipient's mobile device. SnapChat promises to delete past-date snaps from
their servers, but it is only a promise. And, they are careful not to promise that the intended recipient may not
contrive to make an uncontrolled and non-expiring copy. Indeed, the success of SnapChat incentivizes the
development of just such copying applications.121
From a policymaking perspective, the only viable assumption today, and for the foreseeable future, is that data,
once created, are permanent. While their use may be regulated, their continued existence is best considered
conservatively as unalterable fact.

4.5 Robust technologies going forward


4.5.1 A Successor to Notice and Consent
The purpose of notice and consent is that the user assents to the collection and use of personal data for a stated
purpose that is acceptable to that individual. Given the large number of programs and Internet-available
devices, both visible and not, that collect and use personal data, this framework is increasingly unworkable and
ineffective. PCAST believes that the responsibility for using personal data in accordance with the user's
preferences should rest with the provider, possibly assisted by a mutually accepted intermediary, rather than
with the user.
How might that be accomplished? Individuals might be encouraged to associate themselves with one of a
standard set of privacy preference profiles (that is, settings or choices) voluntarily offered by third parties. For
example, Jane might choose to associate with a profile offered by the American Civil Liberties Union that gives
particular weight to individual rights, while John might associate with one offered by Consumer Reports that
gives weight to economic value for the consumer. Large app stores (such as Apple App Store, Google Play,
Microsoft Store) for whom reputational value is important, or large commercial sectors such as finance, might
choose to offer competing privacy-preference profiles.

121
See, for example: Ryan Whitwam, "Snap Save for iPhone Defeats the Purpose of Snapchat, Saves Everything Forever," PC
Magazine, August 12, 2013.
http://appscout.pcmag.com/appleiosiphoneipadipod/314653snapsaveforiphonedefeatsthepurposeofsnapchatsaveseverythingforever

In the first instance, an organization offering profiles would vet new apps as acceptable or not acceptable within
each of their profiles. Basically, they would do the close reading of the provider's notice that the user should,
but does not, do. This is not as onerous as it may sound: While there are millions of apps, the most popular
downloads are relatively few and are concentrated in a relatively small number of portals. The "long tail" of
apps with few customers each might initially be left as "unrated."
Simply by vetting apps, the third-party organizations would automatically create a marketplace for the
negotiation of community standards for privacy. To attract market share, providers (especially smaller ones)
could seek to qualify their offerings in as many privacy-preference profiles, offered by as many different third
parties, as they deem feasible. The Federal government (e.g., through the National Institute of Standards and
Technology) could encourage the development of standard, machine-readable interfaces for the communication
of privacy implications and settings between providers and assessors.
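
To make the idea of a machine-readable interface concrete, the sketch below models a privacy-preference profile and an app's declared data practices as simple records, with a vetting function that rates the app within the profile. No such standard exists in the report; every field name, profile name, and rule here is a hypothetical assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrivacyProfile:
    """A hypothetical profile published by a third-party organization."""
    name: str
    allowed_uses: frozenset
    allows_third_party_sharing: bool


@dataclass(frozen=True)
class AppDeclaration:
    """A hypothetical machine-readable statement of an app's data practices."""
    app_name: str
    data_uses: frozenset
    shares_with_third_parties: bool


def vet(app: Optional[AppDeclaration], profile: PrivacyProfile) -> str:
    """Rate an app within a profile; apps with no declaration stay unrated."""
    if app is None:
        return "unrated"
    if not app.data_uses <= profile.allowed_uses:
        return "not acceptable"
    if app.shares_with_third_parties and not profile.allows_third_party_sharing:
        return "not acceptable"
    return "acceptable"


strict = PrivacyProfile("individual-rights", frozenset({"service-delivery"}), False)
value = PrivacyProfile(
    "consumer-value", frozenset({"service-delivery", "personalized-offers"}), True
)
coupon_app = AppDeclaration(
    "CouponFinder", frozenset({"service-delivery", "personalized-offers"}), True
)

ratings = {profile.name: vet(coupon_app, profile) for profile in (strict, value)}
```

The same app is rated differently under the two profiles, which is the marketplace effect described above: a provider can see what it would have to change to qualify under a stricter profile.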
Although human professionals could do the vetting today using policies expressed in natural language, it would
be desirable in the future to automate that process. To do that, it would be necessary to have formalisms to
specify privacy policies and tools to analyze software to determine conformance to those policies. But that is
only part of the challenge. A greater challenge is to make sure the policy language is sufficiently expressive, the
policies are sufficiently rich, and conformance tests are sufficiently powerful. Those requirements lead to a
consideration of context and use.

4.5.2 Context and Use


The previous discussion, particularly that of Sections 3.1 and 3.2, illustrates PCAST's belief that a focus on the
collection, storage, and retention of electronic personal data will not provide a technologically robust
foundation on which to base future policy. Among the many authors that have touched on these issues, Kagal
and Abelson explain why access control does not suffice to protect privacy.122 Mundie gives a cogent and more
complete explanation of this issue and advocates that privacy protection is better served by controlling the use
of personal data, broadly construed, including metadata and data derived from analytics, than by controlling
collection.123 In a complementary vein, Nissenbaum explains that both the context of usage and the prevailing
social norms contribute to acceptable use.124
To implement in a meaningful way the application of privacy policies to the use of personal data for a particular
purpose (i.e., in context), those policies need to be associated both with data and with the code that operates
on the data. For example, it must be possible to ensure that only apps with particular properties can be applied
to certain data. The policies might be expressed in what computer scientists call natural language (plain English
or the equivalent) and the association done by the user, or the policies might be stated formally and their
association and enforcement done automatically. In either case, there must also be policies associated with the
outputs of the computation, since they are data as well. The privacy policies of the output data must be
computed from the policies associated with the inputs, the policies associated with the code, and the intended
use of the outputs (i.e., the context). These privacy properties are a kind of metadata. To achieve a reasonable
level of reliability, their implementation must be tamper-proof and "sticky" when data are copied.
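
The rule that output policies are computed from input policies, code policies, and context can be sketched as follows. The sketch makes one conservative assumption that the report does not prescribe: an output may be used only for purposes permitted by every input and by the code that produced it (a set intersection). All class names, purposes, and values are hypothetical, and the sketch does not attempt the tamper-proofing that a real implementation would need.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, List


@dataclass(frozen=True)
class Tagged:
    """A value that carries its use policy with it as metadata."""
    value: Any
    allowed_uses: FrozenSet[str]


@dataclass(frozen=True)
class App:
    """Code with a declared purpose (context) and a policy for its outputs."""
    name: str
    purpose: str
    output_uses: FrozenSet[str]
    func: Callable[[List[Any]], Any]


def run(app: App, inputs: List[Tagged]) -> Tagged:
    """Apply the app only if every input permits its purpose; tag the output."""
    for item in inputs:
        if app.purpose not in item.allowed_uses:
            raise PermissionError(f"{app.name}: purpose '{app.purpose}' not permitted")
    result = app.func([item.value for item in inputs])
    # Assumed rule: the output inherits only uses allowed by the code and all inputs.
    uses = app.output_uses
    for item in inputs:
        uses = uses & item.allowed_uses
    return Tagged(result, uses)


ages = [
    Tagged(34, frozenset({"research", "billing"})),
    Tagged(51, frozenset({"research"})),
]
mean_age = App(
    "mean_age", "research", frozenset({"research", "publication"}),
    lambda values: sum(values) / len(values),
)
summary = run(mean_age, ages)   # tagged for "research" only
```

Because the output is itself a `Tagged` value, it can be fed to another app and the policy travels with it, which is the "sticky" behavior in miniature.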

122
Abelson, Hal and Lalana Kagal, "Access Control is an Inadequate Framework for Privacy Protection," W3C Workshop on
Privacy for Advanced Web APIs 12/13, July 2010, London. http://www.w3.org/2010/apiprivacyws/papers.html
123
Mundie, Craig, "Privacy Pragmatism: Focus on Data Use, Not Data Collection," Foreign Affairs, March/April, 2014.
124
Nissenbaum, H., Privacy in Context: Technology, Policy, and the Integrity of Social Life, Stanford Law Books, 2009.

There has been considerable research in areas that would contribute to such a capability, some of which is
beginning to be commercialized. There is a history of using metadata ("tags" or "attributes") in database
systems to control use. While the formalization of privacy policies and their synthesis is a research topic,125
manual interpretation of such policies and the human determination of usage tags can be found in recent
products. Identity management systems (to authenticate users and their roles, i.e., their context) are also
evident both in research126 and in practice.127
Commercial privacy systems for implementing use control exist today under the name of Trusted Data Format
(TDF) implementations, developed principally for the United States intelligence community.128 TDF operates at
the file level. The systems are primarily being implemented on a custom basis by large consulting firms, often
assembled from open-source software components. Customers today are primarily government agencies, such
as Federal intelligence agencies or local government criminal intelligence units, or large commercial companies
in vertically integrated industries like financial services and pharmaceutical companies looking to improve their
accountability and auditing capabilities. Consulting services that have expertise in building such systems include,
for example, Booz Allen, Ernst & Young, IBM, Northrop Grumman, and Lockheed; product-based companies like
Palantir and new startups pioneering internal usage auditing, policy analytics, and policy reasoning engines have
such expertise, as well. With sufficient market demand, more widespread market penetration could happen in
the next five years. Market penetration would be further accelerated if the leading cloud-platform providers
like Amazon, Google, and Microsoft implemented usage-controlled system technologies in their offerings.
Wider-scale use through the government would help motivate the creation of off-the-shelf standard software.

4.5.3 Enforcement and deterrence


Privacy policies and the control of use in context are only effective to the extent that they are realized and
enforced. Technical measures that increase the probability that a violator is caught can be effective only when
there are regulations and laws with civil or criminal penalties to deter the violators. Then there is both
deterrence of harmful actions and incentive to deploy privacy-protecting technologies.
It is today straightforward technically to associate metadata with data, with varying degrees of granularity
ranging from an individual datum, to a record, to an entire collection. These metadata can record a wealth of
auditable information, for example, provenance, detailed access and use policies, authorizations, logs of actual
access and use, and destruction dates. Extending such metadata to derived or shared data (secondary use)
together with privacy-aware logging can facilitate auditing. Although the state of the art is still somewhat ad
hoc, and auditing is often not automated, so-called accountable systems are beginning to be deployed (Section

125
See references at footnote 107 and also: (1) Weitzner, D. J., et al., "Information Accountability," Communications of the
ACM, June 2008, pp. 82-87. (2) Tschantz, Michael Carl, Anupam Datta, and Jeannette M. Wing, "Formalizing and Enforcing
Purpose Restrictions in Privacy Policies." http://www.andrew.cmu.edu/user/danupam/TschantzDattaWing12.pdf
126
For example, at Carnegie Mellon University, Lorrie Cranor directs the CyLab Usable Privacy and Security Laboratory
(http://cups.cs.cmu.edu/). Also, see 2nd International Workshop on Accountability: Science, Technology and Policy, MIT
Computer Science and Artificial Intelligence Laboratory, January 29-30, 2014.
http://dig.csail.mit.edu/2014/AccountableSystems2014/
127
Oracle's eXtensible Access Control Markup Language (XACML) has been used to implement attribute-based access
controls for identity management systems. (Personal communication, Mark Gorenberg and Peter Guerra of Booz Allen)
128
Office of the Director of National Intelligence, IC CIO Enterprise Integration & Architecture: Trusted Data Format.
http://www.dni.gov/index.php/about/organization/chiefinformationofficer/trusteddataformat

4.5.2). The ability to detect violations of privacy policies, particularly if the auditing is automated and
continuous, can be used both to deter privacy violations and to ensure that violators are punished.
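
One common way to make logs of access and use auditable is to chain the entries together with cryptographic hashes, so that any later alteration is detectable. The sketch below illustrates that idea only; it is not a description of any accountable system named in this report. The event fields and function names are hypothetical, and the scheme detects tampering only if the most recent hash is also held by a party the log keeper cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, event: dict) -> None:
    """Add an event to the log, linking it to the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


audit_log = []
append(audit_log, {"user": "analyst-7", "record": "patient-X", "purpose": "research"})
append(audit_log, {"user": "analyst-7", "record": "patient-X", "purpose": "marketing"})
intact = verify(audit_log)
```

An automated auditor could then scan the verified log for uses that the data's policy does not permit, such as the second entry above.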
In the next five years, with regulation or market-driven encouragement, the large cloud-based infrastructure
systems (e.g., Google, Amazon, Microsoft, Rackspace) could, as one example, incorporate the data-provenance
and usage-compliance aspects of accountable systems into their cloud application-programming interfaces
(APIs) and additionally provide APIs for policy awareness. These capabilities could then readily be included in
open-source-based systems like OpenStack (associated with Rackspace)129 and other provider platforms.
Applications intended to run on such cloud-based systems could be built with privacy concepts "baked into"
them, even when they are developed by small enterprises or individual developers.

4.5.4 Operationalizing the Consumer Privacy Bill of Rights


In February 2012, the Administration issued a report setting forth a Consumer Privacy Bill of Rights (CPBR). The
CPBR addresses commercial (not public sector) uses of personal data and is a strong statement of American
privacy values.
For purposes of this discussion, the principles embodied in CPBR can be divided into two categories. First, there
are obligations for data holders, analyzers, or commercial users. These are passive from the consumer's
standpoint: the obligations should be met whether or not the consumer knows, cares, or acts. Second, and
different, there are consumer empowerments, things that the consumer should be empowered to initiate
actively. It is useful here to rearrange the CPBR's principles by category.
In the category of obligations are these elements:
• Respect for Context: Consumers have a right to expect that companies will collect, use, and disclose
personal data in ways that are consistent with the context in which consumers provide the data.
• Focused Collection: Consumers have a right to reasonable limits on the personal data that companies
collect and retain.
• Security: Consumers have a right to secure and responsible handling of personal data.
• Accountability: Consumers have a right to have personal data handled by companies with appropriate
measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
In the category of consumer empowerments are these elements:
• Individual Control: Consumers have a right to exercise control over what personal data companies
collect from them and how they use it.
• Transparency: Consumers have a right to easily understandable and accessible information about
privacy and security practices.
• Access and Accuracy: Consumers have a right to access and correct personal data in usable formats, in a
manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to
consumers if the data are inaccurate.

PCAST endorses as sound the principles underlying CPBR. Because of the rapidly changing technologies
associated with big data, however, effective operationalization of CPBR is at risk. Up to now, debate over how
to operationalize CPBR has focused on the collection, storage, and retention of data, with an emphasis on the

129

See: http://www.openstack.org/

smalldatacontextsthatmotivatedCPBRdevelopment.But,asdiscussedatmultipleplacesinthisreport
(e.g.,Sections3.1.2,4.4and4.5.2),PCASTbelievesthatsuchafocuswillnotprovideatechnologicallyrobust
foundationonwhichtobasefuturepolicythatalsoappliestobigdata.Further,theincreasingcomplexityof
applicationsandusesofdataunderminesevenasimpleconceptlikenoticeandconsent.
PCASTbelievesthattheprinciplesofCPBRcanreadilybeadaptedtoamorerobustregimebasedonrecognizing
andcontrollingharmfulusesofthedata.Somespecificsuggestionsfollow.
Turnfirsttotherightsclassifiedaboveasobligationsonthedataholder.
TheprincipleofRespectforContextneedsaugmentation.Asthisreporthasrepeatedlydiscussed,thereare
instancesinwhichpersonaldataarenotprovidedbythecustomer.Suchdatamayemergeasaproductof
analysiswellafterthedatawerecollectedandaftertheymayhavepassedthroughseveralhands.Whilethe
intentoftherightisappropriate,namelythatdatabeusedforlegitimatepurposesthatdonotproducecertain
adverseconsequencesorharmstoindividuals,theCPBRsarticulationinwhichconsumersprovidethedatais
toolimited.Thisrightneedstostateinsomewaythatdataaboutanindividualhoweveracquirednotbe
usedsoastocausecertainadverseconsequencesorharmstothatindividual.(SeeSection1.4forapossiblelist
ofadverseconsequencesandharmsthatmightbesubjecttosomeregulation.)
As initially conceived, the right to Focused Collection was to be achieved by techniques like de-identification and
data deletion. As discussed in Section 4.4.1, however, de-identification (anonymization) is not a robust
technology for big data in the face of data fusion; in some instances, there may be compelling reasons to retain
data for beneficial purposes. This right should be about use rather than collection. It should emphasize utilizing
best practices to prevent inappropriate use of data during the data's whole life cycle, rather than depending on
de-identification. It should not depend on a company's being able itself to recognize all the data about a
consumer that it holds, which is increasingly technically infeasible.
The principles underlying CPBR's Security and Accountability remain valid in a use-based regime. They need to
be applied throughout the value chain that includes data collection, analysis, and use.
Turn next to the rights here classified as consumer empowerments.
Where consumer empowerments have become practically impossible for the consumer to exercise
meaningfully, they need to be recast as obligations of the commercial entity that actually uses the data or
products of data analysis. This applies to the CPBR's principles of Individual Control and of Transparency.
Section 4.3 explained how the non-obvious nature of big data's products of analysis makes it all but impossible
for an individual to make fine-grained privacy choices for every new situation or app. For the principle of
Individual Control to have meaning, PCAST believes that the burden should no longer fall on the consumer to
manage privacy for each company with which the consumer interacts by a framework like notice and consent.
Rather, each company should take responsibility for conforming its uses of personal data to a personal privacy
profile designated by the consumer and made available to that company (including from a third party
designated by the consumer). Section 4.5.1 proposed a mechanism for this change in responsibility.
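As a rough illustration of what conforming a use of personal data to a consumer-designated profile might look like in software, the sketch below checks a proposed use against a profile before the use proceeds. The class names, field names, and purpose labels are hypothetical and are not taken from CPBR or from Section 4.5.1; a working mechanism would also need a shared vocabulary of uses and a trusted way to obtain the profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyProfile:
    """Hypothetical profile: the set of uses a consumer permits."""
    consumer_id: str
    permitted_uses: frozenset


@dataclass(frozen=True)
class UseRequest:
    """A company's proposed use of one consumer's data (illustrative only)."""
    company: str
    consumer_id: str
    purpose: str


def conforms(profile: PrivacyProfile, request: UseRequest) -> bool:
    """Return True only if the request targets this consumer and a permitted purpose."""
    return (
        request.consumer_id == profile.consumer_id
        and request.purpose in profile.permitted_uses
    )


# Assumed example values; the purpose labels are invented for illustration.
profile = PrivacyProfile("c-001", frozenset({"order_fulfillment", "fraud_detection"}))
allowed = conforms(profile, UseRequest("ExampleCo", "c-001", "fraud_detection"))
denied = conforms(profile, UseRequest("ExampleCo", "c-001", "targeted_advertising"))
print(allowed, denied)
```

The check sits at the point of use rather than at collection, so the burden of conformance falls on the company making the use.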
Transparency (in the sense of disclosure of privacy practices) suffers from many of the same problems. Today,
the consumer receives an unhelpful blizzard of privacy-policy notifications, many of which say, in essence, "we
providers can do anything we want."130 As with Individual Control, the burden of conforming to a consumer's
stated personal privacy profile should fall on the company, with notification to the consumers by a company if
their profile precludes that company's accepting their business. Since companies do not like to lose business, a
positive market dynamic for competing privacy practices would thus be created.
For the right of Access and Accuracy to be meaningful, personal data must include the fruits of data analytics,
not just collection. However, as this report has already explained (Section 4.4.2), it is not always possible for a
company to know what it knows about a consumer, since that information may be unrecognized in the data;
or it may become identifiable only in the future, when data sets are combined using new algorithms. When,
however, the personal character of data is apparent to a company by virtue of its use of the data, its obligation
to provide means for the correction of errors should be triggered. Consumers should have an expectation that
companies will validate and correct data stemming from analysis and, since not all errors will be corrected, will
also take steps to minimize the risk of adverse consequences to consumers from the use of inaccurate data.
Again, the primary burden must fall on the commercial user of big data and not on the consumer.

130 Lawyers may encourage companies to use over-inclusive language to cover the unpredictable evolution of possibilities
described elsewhere in this report, even in the absence of specific plans to use specific capabilities.


5. PCAST Perspectives and Conclusions


Breaches of privacy can cause harm to individuals and groups. It is a role of government to prevent such harm
where possible, and to facilitate means of redress when the harm occurs. Technical enhancements of privacy
can be effective only when accompanied by regulations or laws because, unless some penalties are enforced,
there is no end to the escalation of the measures-countermeasures game between violators and protectors.
Rules and regulations provide both deterrence of harmful actions and incentives to deploy privacy-protecting
software technologies.
From everything already said, it should be obvious that new sources of big data are abundant; that they will
continue to grow; and that they can bring enormous economic and social benefits. Similarly, and of comparable
importance, new algorithms, software, and hardware technologies will continue to increase the power of data
analytics in unexpected ways. Given these new capabilities of data aggregation and processing, there is
inevitably new potential both for the unintentional leaking of bulk and fine-grained data about individuals
and for new systematic attacks on privacy by those so minded.
Cameras, sensors, and other observational or mobile technologies raise new privacy concerns. Individuals often
do not knowingly consent to providing data. These devices naturally pull in data unrelated to their primary
purpose. Their data collection is often invisible. Analysis technology (such as facial, scene, speech, and voice
recognition technology) is improving rapidly. Mobile devices provide location information that might not be
otherwise volunteered. The combination of data from those sources can yield privacy-threatening information
unbeknownst to the affected individuals.
It is also true, however, that privacy-sensitive data cannot always be reliably recognized when they are first
collected, because the privacy-sensitive elements may be only latent in the data, made visible only by analytics
(including those not yet invented), or by fusion with other data sources (including those not yet known).
Suppressing the collection of privacy-sensitive data would thus be increasingly difficult, and it would also be
increasingly counterproductive, frustrating the development of big data's socially important and economic
benefits.
Nor would it be desirable to suppress the combining of multiple sources and kinds of data: Much of the power of
big data stems from this kind of data fusion. That said, it remains a matter of concern that considerable
amounts of personal data may be derived from data fusion. In other words, such data can be obtained or
inferred without intentional personal disclosure.
It is an unavoidable fact that particular collections of big data and particular kinds of analysis will often have
both beneficial and privacy-inappropriate uses. The appropriate use of both the data and the analyses is
highly contextual.
Any specific harm or adverse consequence is the result of data, or their analytical product, passing through the
control of three distinguishable classes of actor in the value chain:
First, there are data collectors, who control the interfaces to individuals or to the environment. Data collectors
may collect data from clearly private realms (e.g., a health questionnaire or wearable sensor), from ambiguous
situations (e.g., cell-phone pictures or Google Glass videos taken at a party, or cameras and microphones placed
in a classroom for remote broadcast), or, increasing in both quantity and quality, data from the "public
square," where privacy-sensitive data may be latent and initially unrecognizable.
Second, there are data analyzers. This is where the "big" in big data becomes important. Analyzers may
aggregate data from many sources, and they may share data with other analyzers. Analyzers, as distinct from
collectors, create uses ("products of analysis") by bringing together algorithms and data sets in a large-scale
computational environment. Importantly, analyzers are the locus where individuals may be profiled by data
fusion or statistical inference.
Third, there are users of the analyzed data: business, government, or individual. Users will generally have a
commercial relationship with analyzers; they will be purchasers or licensees (etc.) of the analyzer's products of
analysis. It is the user who creates desirable economic and social outcomes. But it is also the user who is the
locus of producing actual adverse consequences or harms, when such occur.

5.1 Technical feasibility of policy interventions


Policy, as created by new legislation or within existing regulatory authorities, can, in principle, intervene at
various stages in the value chain described above. Not all such interventions are equally feasible from a
technical perspective, or equally desirable if the societal and economic benefits of big data are to be realized.
As indicated in Chapter 4, basing policy on the control of collection is unlikely to succeed, except in very limited
circumstances where there is an explicitly private context (e.g., measurement or disclosure of health data) and
the possibility of meaningful explicit or implicit notice and consent (e.g., by privacy preference profiles, see
Sections 4.3 and 4.5.1), which does not exist today.
There is little technical likelihood that "a right to forget" or similar limits on retention could be meaningfully
defined or enforced (see Section 4.4.2). Increasingly, it will not be technically possible to surface all of the data
about an individual. Policy based on protection by anonymization is futile, because the feasibility of
re-identification increases rapidly with the amount of additional data (see Section 4.4.1). There is little, and
decreasing, meaningful distinction between data and metadata. The capabilities of data fusion, data mining,
and re-identification render metadata not much less problematic than data (see Section 3.1).
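The claim that re-identification becomes easier as more data are combined can be illustrated with a small sketch that counts how many records become unique as attributes are added. The records and attribute names below are invented toy data, not drawn from this report, and the uniqueness fraction is only one simple proxy for re-identification risk.

```python
from collections import Counter

# Toy "anonymized" records: no names, only attributes (all values invented).
records = [
    {"zip": "02139", "birth_year": 1980, "sex": "F"},
    {"zip": "02139", "birth_year": 1980, "sex": "M"},
    {"zip": "02139", "birth_year": 1975, "sex": "F"},
    {"zip": "02140", "birth_year": 1980, "sex": "F"},
    {"zip": "02140", "birth_year": 1975, "sex": "M"},
    {"zip": "02140", "birth_year": 1975, "sex": "M"},
]


def unique_fraction(rows, attributes):
    """Fraction of rows whose combination of the given attributes is unique."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each added attribute can only keep or increase the share of unique rows.
by_zip = unique_fraction(records, ["zip"])
by_zip_year = unique_fraction(records, ["zip", "birth_year"])
by_all = unique_fraction(records, ["zip", "birth_year", "sex"])
print(by_zip, by_zip_year, by_all)
```

A record that is unique on the combined attributes can be linked to any outside data set that carries the same attributes alongside a name, which is why fusion with additional data weakens anonymization.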
Even if direct controls on collection are in most cases infeasible, however, attention to collection practices may
help to reduce risk in some circumstances. Such best practices as tracking provenance, auditing access and use,
and continuous monitoring and control (see Sections 4.5.2 and 4.5.3) could be driven by partnerships between
government and industry (the carrot) and also by clarifying tort law and defining what might constitute
negligence (the stick).
Turn next to data analyzers. On the one hand, it may be difficult to regulate them, because their actions do not
directly touch the individual (it is neither collection nor use) and may have no external visibility. Mere inference
about an individual, absent its publication or use, may not be a feasible target of regulation. On the other hand,
an increasing fraction of privacy issues will surface only with the application of data analytics. Many privacy
challenges will arise from the analysis of data collected unintentionally that was not, at the time of collection,
targeted at any particular individual or even group of individuals. This is because combining data from many
sources will become more and more powerful.
It might be feasible to introduce regulation at the "moment of particularization" of data about an individual, or
when this is done for some minimum number of individuals concurrently. To be effective, such regulation would
need to be accompanied by requirements for tracking provenance, auditing access and use, and using security
measures (e.g., robust encryption infrastructure) at all stages of the evolution of data, and for providing
transparency, and/or notification, at the moment of particularization.
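One way to picture tracking provenance together with auditing access and use is an append-only log in which each entry names its source data sets and is chained to the previous entry by a hash, so that later alteration is detectable. The sketch below is a minimal illustration under that assumption; the class, field names, and data set labels are hypothetical, and it is not a mechanism this report specifies.

```python
import hashlib
import json


class AuditLog:
    """Hypothetical append-only log; each entry is hash-chained to the one before it."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(body):
        # Canonical JSON (sorted keys) so the same body always hashes the same way.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, actor, action, dataset, derived_from=()):
        """Append an access/use event, noting which data sets it was derived from."""
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"actor": actor, "action": action, "dataset": dataset,
                "derived_from": list(derived_from), "prev": prev}
        self.entries.append({**body, "hash": self._digest(body)})

    def verify(self):
        """Return True if no entry has been altered, removed, or reordered."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.record("collector-A", "collect", "sensor_feed")
log.record("analyzer-B", "fuse", "profile_v1", derived_from=["sensor_feed", "purchases"])
log.record("user-C", "use", "profile_v1")
print(log.verify())
```

The `derived_from` field carries provenance forward from collection through analysis to use, so an auditor can trace a product of analysis back to its inputs.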
Big data's "products of analysis" are created by computer programs that bring together algorithms and data so
as to produce something of value. It might be feasible to recognize such programs, or their products, in a legal
sense and to regulate their commerce. For example, they might not be allowed to be used in commerce (sold,
leased, licensed, and so on) unless they are consistent with individuals' privacy elections or other expressions of
community values (see Sections 4.3 and 4.5.1). Requirements might be imposed on conformity to appropriate
standards of provenance, auditability, accuracy, and so on, in the data they use and produce; or that they
meaningfully identify who (licensor vs. licensee) is responsible for correcting errors and liable for various types
of harm or adverse consequence caused by the product.
It is not, however, the mere development of a product of analysis that can cause adverse consequences. Those
occur only with its actual use, whether in commerce, by government, by the press, or by individuals. This seems
the most technically feasible place to apply regulation going forward, focusing at the locus where harm can be
produced, not far upstream from where it may barely (if at all) be identifiable.
When products of analysis produce imperfect information that may misclassify individuals in ways that produce
adverse consequences, one might require that they meet standards for data accuracy and integrity; that there
are useable interfaces that allow an individual to correct the record with voluntary additional information; and
that there exist streamlined options for redress, including financial redress, when adverse consequences reach a
certain level.
Some harms may affect groups (e.g., the poor or minorities) rather than identifiable individuals. Mechanisms for
redress in such cases need to be developed.
There is a need to clarify standards for liability in case of adverse consequences from privacy violations.
Currently there is a patchwork of out-of-date state laws and legal precedents. One could encourage the drafting
of technologically savvy model legislation on cyber-torts for consideration by the states.
Finally, government may be forbidden from certain classes of uses, despite their being available in the private
sector.

5.2 Recommendations
PCAST's charge for this study does not ask it to make recommendations on privacy policies, but rather to make a
relative assessment of the technical feasibility of different broad policy approaches. PCAST's overall conclusions
about that question are embodied in the first two of our recommendations:
Recommendation 1. Policy attention should focus more on the actual uses of big data and less on its
collection and analysis.
By actual uses, we mean the specific events where something happens that can cause an adverse consequence
or harm to an individual or class of individuals. In the context of big data, these events ("uses") are almost
always actions of a computer program or app interacting either with the raw data or with the fruits of analysis of
those data. In this formulation, it is not the data themselves that cause the harm, nor the program itself (absent
any data), but the confluence of the two. These "use" events (in commerce, by government, or by individuals)
embody the necessary specificity to be the subject of regulation. Since the purpose of bringing program and
data together is to accomplish some identifiable desired task, use events also capture some notion of intent, in a
way that data collection by itself or program development by itself may not. The policy question of what kinds
of adverse consequences or harms rise to the level of needing regulation is outside of PCAST's charge, but an
illustrative set that seems grounded in common American values was provided in Section 1.4.
PCAST judges that alternative big-data policies that focus on the regulation of data collection, storage, retention,
a priori limitations on applications, and analysis (absent identifiable actual uses of big data or its products of
analysis) are unlikely to yield effective strategies for improving privacy. Such policies are unlikely to be scalable
over time as it becomes increasingly difficult to ascertain, about any particular data set, what personal
information may be latent in it or in its possible fusion with every other possible data set, present or future.
The related issue is that policies limiting collection and retention are increasingly unlikely to be enforceable by
other than severe and economically damaging measures. While there are certain definable classes of data so
repugnant to society that their mere possession is criminalized,131 the information in big data that may raise
privacy concerns is increasingly inseparable from a vast volume of the data of ordinary commerce, or
government function, or collection in the public square. This dual-use character of information, too, argues for
the regulation of use rather than collection.
Recommendation 2. Policies and regulation, at all levels of government, should not embed particular
technological solutions, but rather should be stated in terms of intended outcomes.
To avoid falling behind the technology, it is essential that policy concerning privacy protection should address
the purpose (the "what") rather than the mechanism (the "how"). For example, regulating disclosure of health
information by regulating the use of anonymization fails to capture the power of data fusion; regulating the
protection of information about minors by controlling inspection of student records held by schools fails to
anticipate the capture of student information by online learning technologies. Regulating control of the
inappropriate disclosure of health information or student performance, no matter how the data are acquired, is
more robust.
PCAST further responds to its charge with the following recommendations, intended to advance the agenda of
strong privacy values and the technological tools needed to support them:
Recommendation 3. With coordination and encouragement from OSTP, the NITRD agencies132 should
strengthen U.S. research in privacy-related technologies and in the relevant areas of social science that inform
the successful application of those technologies.
Some of the technology for controlling uses already exists. Research (and funding for it) is needed, however, in
the technologies that help to protect privacy, in the social mechanisms that influence privacy-preserving

131 Child pornography is the most universally recognized example.
132 NITRD refers to the Networking and Information Technology Research and Development program, whose participating
Federal agencies support unclassified research in advanced information technologies such as computing, networking, and
software and include both research- and mission-focused agencies such as NSF, NIH, NIST, DARPA, NOAA, DOE's Office of
Science, and the DoD military-service laboratories (see http://www.nitrd.gov/SUBCOMMITTEE/nitrd_agencies/index.aspx).
There is research coordination between NITRD and Federal agencies conducting or supporting corresponding classified
research.



behavior, and in the legal options that are robust to changes in technology and create appropriate balance
among economic opportunity, other national priorities, and privacy protection.
Following up on recommendations from PCAST for increased privacy-related research,133 a 2013-2014 internal
government review of privacy-focused research across Federal agencies supporting research on information
technologies suggests that about $80 million supports either research with an explicit focus on enhancing
privacy or research that addresses privacy protection ancillary to some other goal (typically cybersecurity).134
The funded research addresses such topics as an individual's control over his or her information, transparency,
access and accuracy, and accountability. It is typically of a general nature, except for research focusing on the
health domain or (relatively new) consumer energy usage. The broadest and most varied support for privacy
research, in the form of grants to individuals and centers, comes from the National Science Foundation (NSF),
engaging social science as well as computer science and engineering.135,136
Research into privacy as an extension or complement to security is supported by a variety of Department of
Defense agencies (Air Force Research Laboratory, the Army's Telemedicine and Advanced Technology Research
Center, Defense Advanced Research Projects Agency, National Security Agency, and Office of Naval Research)
and the Intelligence Advanced Research Projects Activity (IARPA) within the Intelligence Community. IARPA, for
example, has hosted the Security and Privacy Assurance Research137 program, which has explored a variety of
encryption techniques. Research at the National Institute of Standards and Technology (NIST) focuses on the
development of cryptography and biometric technology to enhance privacy as well as support for federal
standards and programs for identity management.138
Looking to the future, continued investment is needed not only in privacy topics ancillary to security, but also in
automating privacy protection for the broadest aspects of use of data from all sources. Relevant topics include
cryptography, privacy-preserving data mining (including analysis of streaming as well as stored data),139
formalization of privacy policies, tools for automating conformance of software to personal privacy policy and to
legal policy, methods for auditing use in context and identifying violations of policy, and research on enhancing
people's ability to make sense of the results of various big-data analyses. Development of technologies that
support both quality analytics and privacy preservation on distributed data, such as secure multiparty
computation, will become even more important, given the expectation that people will draw increasingly from

133 Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology
(http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcastnitrd2013.pdf [2012] and
http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcastnitrdreport2010.pdf [2010]).
134 Federal Networking and Information Technology Research and Development Program, Report on Privacy Research
Within NITRD [Networking and Information Technology Research and Development], National Coordination Office for
NITRD, April 23, 2014. http://www.nitrd.gov/Pubs/Report_on_Privacy_Research_within_NITRD.pdf
135 The Secure and Trustworthy Cyberspace program is the largest funder of relevant research. See:
http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504709
136 In December 2013, the NSF directorates supporting computer and social science joined in soliciting proposals for
privacy-related research. http://www.nsf.gov/pubs/2014/nsf14021/nsf14021.jsp.
137 See: http://www.iarpa.gov/index.php/researchprograms/spar
138 NIST is responsible for advancing the National Strategy for Trusted Identities in Cyberspace (NSTIC), which is intended to
facilitate secure transactions within and across public and private sectors. See: http://www.nist.gov/nstic/
139 Pike, W.A. et al., PNNL [Pacific Northwest National Laboratory] Response to OSTP Big Data RFI, March 2014.



data stored in multiple locations. The creation of tools that analyze the panoply of national, state, regional, and
international rules and regulations for inconsistencies and differences will be helpful for the definition of new
rules and regulations, as well as for those software developers that need to customize their services for different
markets.
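To make the reference to secure multiparty computation more concrete, the sketch below simulates one simple building block, additive secret sharing, to compute a sum over values held by different parties without any party revealing its own value. This is a single-process simulation under stated assumptions (honest parties, non-negative integer inputs smaller than the modulus); the function names and the choice of modulus are illustrative and not taken from this report.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; inputs and their total must be smaller than this


def share(value, n_parties):
    """Split value into n random-looking shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate parties jointly computing a sum from shares alone."""
    n = len(private_values)
    # Party i splits its value and sends share j to party j.
    distributed = [share(v, n) for v in private_values]
    # Party j adds up the shares it received; each partial sum reveals nothing by itself.
    partials = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Only the combined partial sums reveal the total.
    return sum(partials) % MODULUS


# Three data holders with private counts (invented values).
total = secure_sum([120, 75, 310])
print(total)
```

Each individual share is uniformly random, so a party that sees only the shares sent to it learns nothing about another party's input; practical protocols add protections against dishonest participants that this sketch omits.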
Recommendation 4. OSTP, together with the appropriate educational institutions and professional societies,
should encourage increased education and training opportunities concerning privacy protection, including
professional career paths.
Programs that provide education leading to privacy expertise (akin to what is being done for security expertise)
are essential and need encouragement. One might envision careers for digital-privacy experts both on the
software development side and on the technical management side. Employment opportunities should exist not
only in industry (and government at all levels), where jobs focused on privacy (including but not limited to Chief
Privacy Officers) have been growing, but also for consumer and citizen advocacy and support, perhaps offering
"annual privacy checkups" for individuals. Just as education and training about cybersecurity has advanced over
the past 20 years within the technical community, there is now opportunity to educate and train students about
privacy implications and privacy enhancements, beyond the present small niche area occupied by this focus
within computer science programs.140 Privacy is also an important component of ethics education for
technology professionals.
Recommendation 5. The United States should take the lead both in the international arena and at home by
adopting policies that stimulate the use of practical privacy-protecting technologies that exist today. This
country can exhibit leadership both by its convening power (for instance, by promoting the creation and
adoption of standards) and also by its own procurement practices (such as its own use of privacy-preserving
cloud services).
Section 4.5.2 described a set of privacy-enhancing best practices that already exist today in U.S. markets. PCAST
is not aware of any more effective innovation or strategies being developed abroad; rather, some countries
seem inclined to pursue what PCAST believes to be blind alleys. This circumstance offers an opportunity for U.S.
technical leadership in privacy in the international arena, an opportunity that should be seized. Public policy can
help to nurture the budding commercial potential of privacy-enhancing technologies, both through U.S.
government procurement and through the larger policy framework that motivates private-sector technology
engagement.
As it does for security, cloud computing offers positive new opportunities for privacy. By requiring
privacy-enhancing services from cloud-service providers contracting with the U.S. government, the government should
encourage those providers to make available sophisticated privacy-enhancing technologies to small businesses
and their customers, beyond what the small business might be able to do on its own.141

140 A basis can be found in the newest version of the curriculum guidance of the Association for Computing Machinery
(http://www.acm.org/education/CS2013finalreport.pdf). Given all of the pressures on curriculum, progress, as with
cybersecurity, may hinge on growth in privacy-related research, business opportunities, and occupations.
141 A beginning can be found in the Federal Government's FedRAMP program for certifying cloud services. Initiated to
address Federal agency security concerns, FedRAMP already builds in attention to privacy in the form of a required Privacy
Threshold Analysis and in some situations a Privacy Impact Analysis. The office of the U.S. Chief Information Officer
provides guidance on Federal uses of information technology that addresses privacy along with security (see
http://cloud.cio.gov/). It provides specific guidance on the cloud and FedRAMP (http://cloud.cio.gov/fedramp), including
privacy protection (http://cloud.cio.gov/document/privacythresholdanalysisandprivacyimpactassessment).

5.4 Final Remarks

Privacy is an important human value. The advance of technology both threatens personal privacy and provides
opportunities to enhance its protection. The challenge for the U.S. Government and the larger community, both
within this country and globally, is to understand what the nature of privacy is in the modern world and to find
those technological, educational, and policy avenues that will preserve and protect it.


Appendix A. Additional Experts Providing Input

Yochai Benkler
Harvard

Eleanor Birrell
Cornell University

Courtney Bowman
Palantir

Christopher Clifton
Purdue University

James Costa
Sandia National Laboratory

Lorrie Faith Cranor
Carnegie Mellon University

Deborah Estrin
Cornell NYC

William W. (Terry) Fisher
Harvard Law School

Stephanie Forrest
University of New Mexico

Dan Geer
In-Q-Tel

Deborah K. Gracio
Pacific Northwest National Laboratory

Eric Grosse
Google

Peter Guerra
Booz Allen

Michael Jordan
University of California, Berkeley

Philip Kegelmeyer
Sandia National Laboratory

Angelos Keromytis
Columbia University

Thomas Kalil
OSTP

Jon Kleinberg
Cornell University

Julia Lane
American Institutes for Research

Carl Landwehr
George Washington University

David Moon
Ernst & Young

Keith Marzullo
National Science Foundation

Martha Minow
Harvard Law School

Tom Mitchell
Carnegie Mellon University


Deirdre Mulligan
University of California, Berkeley

Leonard Napolitano
Sandia National Laboratory

Charles Nelson
OSTP

Chris Oehmen
Pacific Northwest National Laboratory

Alex "Sandy" Pentland
Massachusetts Institute of Technology

Rene Peralta
National Institute of Standards and Technology

Anthony Philippakis
Genome Bridge

Timothy Polk
OSTP

Fred B. Schneider
Cornell University

Greg Shipley
In-Q-Tel

Lauren Smith
OSTP

Francis Sullivan
Institute for Defense Analyses

Thomas Vagoun
NITRD National Coordination Office

Konrad Vesey
Intelligence Advanced Research Projects Activity

James Waldo
Harvard

Peter Weinberger
Google, Inc.

Daniel J. Weitzner
Massachusetts Institute of Technology

Nicole Wong
OSTP

Jonathan Zittrain
Harvard Law School


Special Acknowledgment
PCAST is especially grateful for the rapid and comprehensive assistance provided by an ad hoc group of
staff at the National Science Foundation (NSF), Computer and Information Science and Engineering
Directorate. This team was led by Fen Zhao and Emily Grumbling, who were enlisted by Suzanne
Iacono. Drs. Zhao and Grumbling worked tirelessly to review the technical literature, elicit
perspectives and feedback from a range of NSF colleagues, and iterate on descriptions of numerous
technologies relevant to big data and privacy and how those technologies were evolving.

NSF Technology Team Leaders
Fen Zhao, AAAS Fellow, CISE
Emily Grumbling, AAAS Fellow, Office of Cyberinfrastructure

Additional NSF Contributors
Robert Chadduck, Program Director
Almadena Y. Chtchelkanova, Program Director
David Corman, Program Director
James Donlon, Program Director
Jeremy Epstein, Program Director
Joseph B. Lyles, Program Director
Dmitry Maslov, Program Director
Mimi McClure, Associate Program Director
Anita Nikolich, Expert
Amy Walton, Program Director
Ralph Wachter, Program Director

President's Council of Advisors on Science and Technology (PCAST)
www.whitehouse.gov/ostp/pcast
