To give you what you want, companies and governments say,
they need to know everything they can about you.
Are you ready for the era of Big Data?

The digital
By Steven Poole

information about how they are functioning.

This mammoth store of telemetry data can
be analysed to predict part failures before
they happen.
But Big Data is not just an approach that
improves uncontroversially useful systems.
It's also a hype machine. IBM, for instance,
offers to furnish companies with its own
Big Data platform, the PR material for which
is a savoury mix of space-age techspeak and
corporate mumbo-jumbo. "Big data represents a new era of computing," the company
promises, "an inflection point of opportunity
where data in any format may be explored
and utilised for breakthrough insights
whether that data is in-place, in-motion, or
at-rest." It sounds rather cruel to disturb data
that is at-rest, presumably power-napping,
but inflection points of opportunity wait for
no man or megabyte.
Another platform capable of changing the
game, albeit in an unhappily permanent
manner, might be the Big Data skiing goggles
marketed by the tech-shades manufacturer
Oakley. Rather as the computerised spectacles known as Google Glass promise to do for
the whole world, these $600 goggles project
into your eyes all kinds of fascinating information about your skiing, including changes
in speed and altitude, and can even display
incoming messages from your mobile.
Of course, it might happen that while reading a titillating sext from a co-worker you ski
at high speed into a tree. And so the goggles
are sold with a splendidly self-defeating
warning on the box: "Do not operate product
while skiing." Clearly, all the information all
of the time is not always desirable. Still less
so when Big Data's tendrils move out from
cool gadgets or website tools into our personal lives, the workplace and government,
notwithstanding the Panglossian boosters of
global datagasm.
What is data in the first place? The big
lie of much Big Data publicity is that data is
neutral, ambient information that can just be
hoovered up; that it is already "given", which is the
meaning of the classical root of "data". (Because the word is the Latin plural of datum,
some people prefer to say "data are". I am not
one of them.)
But data is not simply collected; it is manufactured. There are always questions about
how you choose what to measure, how you
measure it, and how you analyse it afterwards;
at each step, you make theoretical assumptions. A few years ago Chris Anderson, the
former editor of Wired magazine, wrote an

Data will save us. All we need to do is measure
the world. When we have quantified everything, problems both technical and social will
melt away. That, at least, is the promise of
"Big Data", the buzzphrase for the practice
of collecting mountains of data about a subject and then crunching away on it with shiny
supercomputers. The term has lately become
so ubiquitous that people make wry jokes
about "small data". But Big Data is not only
something geeks do in the science lab or the
start-up company; it affects us all. So we had
better understand what its plans are.
Miraculous things can already be accomplished. By analysing web-searches tied to
geographical location, Google Flu Trends can
track the spread of an influenza epidemic in
near-real time, thus helping to direct medical
resources to the right places. Another of the
company's services, Google Translate, is so
effective not because it understands language
to any degree, but because it holds huge corpuses of written examples in various tongues
and knows statistically which phrase of the
sample text is most often translated by which
phrase in the target language. Meanwhile,
aircraft and other complex engineering projects can be made more reliable once components are able wirelessly to phone home

article claiming that big data meant the
end of theory in science.
But as Viktor Mayer-Schönberger and
Kenneth Cukier point out in their useful recent book, Big Data: "Big data itself is
founded on theory." And once you've manufactured data with instruments that operate
according to certain theories, you then need
to analyse it theoretically. At the Large
Hadron Collider, subatomic smashing generates a million gigabytes of data every second. Automated systems keep just a millionth of this for analysis (discarding the rest
based on theories), but the bit-heap is still
Brobdingnagian. And it needs to be analysed
according to still other theories before scientists will understand what is going on. Until
then, the data itself is just inscrutable numbers. Raw data is not knowledge. According
to IBM, 90 per cent of the world's extant data
has been created in the past two years. Unless
I missed something important, that is not
because the human race has very rapidly become much wiser.
Nor can data always tell us why things are
how they are. In Big Data, Mayer-Schönberger
and Cukier adopt an optimistic meta-theoretical attitude on this point, arguing that
increasing use of Big Data will wean us off
our obsession with causation: finding out
what causes what.
The authors argue that with data-sets that
approach the entirety of the relevant information, rather than mere statistical samples,
correlation (when two things are regularly
associated with each other) will be king.
When we can easily find out that one thing
goes with another, we won't worry too much
about what is causing what.
This seems an odd claim to make when
set beside, say, the history of research into
the harmful effects of smoking. It was
known for a long time that smoking was correlated with lung cancer; but then, so were
many other things. Siddhartha Mukherjee
pointed this out recently in the New York
Times: "Asked about the strikingly concomitant increases in lung cancer and smoking
rates in the 1930s, Evarts Graham, a surgeon,
countered dismissively that the sale of nylon
stockings had also increased." It took another few decades, and careful experimentation, for us to become sure that cigarettes
were carcinogenic. Why mere correlations,
even in large data-sets, should suddenly
have become magical truth machines in the
meantime is not clear.
The quarks and other colourful subatomic
fauna at the Large Hadron Collider presumably don't care all that much that vast quantities of data about them are being recorded, but
human beings might be more nervous, and
with good reason. Big Data also holds out the
promise of, for instance, total supervision
in the workplace. Lest perfect surveillance of
employees sound alarming, this new field
is given the blandly technocratic name of
"workforce science". Every phone call, email
and even mouse-click of an employee can be
stored and analysed to guide management in
making decisions.
So workforce science is a scaled-up and
automated version of the scientific management promoted by Frederick Winslow
Taylor in his highly influential 1911 book,
The Principles of Scientific Management,
which recounted how he performed time-and-motion studies on labourers in order to
get more work out of them. It has since been
alleged that Taylor fiddled the data, but
that didn't stop him becoming an eponym:
taylorisation is the breaking-down of some
activity into discrete repetitive units, supposedly to improve efficiency. Big Data promises
taylorisation on steroids.
Increasingly, workers are also forced to submit to personality tests so that their scores
can be added to the swelling data file. Such
tests are another example of what Mayer-Schönberger and Cukier call "datafication"
(quantifying the previously unquantifiable),
but their reliability is highly controversial.
The Myers-Briggs personality type indicator,
for instance, is derided by psychologists but
widely employed in business.
At the same time, the UK government's
"nudge unit" has designed a psychometric
test for jobseekers called My Strengths. Unfortunately, as bloggers have demonstrated,
its results bear no relation to the answers
given. In the age of Big Data, it begins to seem
as though any kind of data, whether true or
false, is better than no data at all.

[Cartoon caption: "Err . . . OK, just tell them it's another metaphor"]

We don't choose to be targets of surveillance in the office, but we do when we use
internet services such as Facebook. In assiduously entering our personal information,
accepting friends, checking in to bars
and restaurants and liking things, we have
been inveigled into a digital taylorising of our
social lives. Facebook's billion users constitute a global underclass of volunteer labour
in a giant programme of corporate welfare.
Facebook owns this information and sells
advertising against it. What we put in the
"cloud" (a comfortingly fluffy name for a
collection of gargantuan physical data centres, owned and operated by an oligopoly of
corporations including Facebook, Google,
Amazon, Microsoft and Apple) does not belong to us any more, and often has a worryingly long life. Apple keeps the questions you
ask its voice-operated virtual assistant, Siri,
for two years before deleting the data.

Through Big Data analysis, the
cloud comes to know an awful
lot about us. Simply analysing a
person's Facebook likes can
identify a person's sexual orientation or history of drug use. Even just searching for things and filling out online surveys
can lead to personal information about you
being bought and sold by big marketing analytics companies. When the Big Data is data
about you, privacy becomes a faint memory.
And this is true not just on the web. The Data
Privacy Lab at Harvard University recently
managed to identify 40 per cent of individuals who had taken part (again, supposedly
anonymously) in a large-scale DNA study,
the Personal Genome Project.
Depending on your taste, you might be
more or less worried to find that Big Data
about you is held by the state rather than
a profit-seeking internet company. At least
governments are supposed to be democratically accountable. Both David Cameron and
Barack Obama announced open data initiatives on assuming office, though the types
of data released were very carefully chosen.
The British government put online information about all items of local-authority spending over £500, salaries of civil servants, and
a big database of government spending. The
Communities Secretary, Eric Pickles, announced that this move would unleash an
army of armchair auditors among the public
with fresh ideas for savings. Unfortunately,
this citizen army never showed up; from the
comfort of their armchairs, they presumably
preferred to shoot one another repeatedly in
the face on Call of Duty. Or perhaps the rhetoric of transparency in releasing this data to
public scrutiny did not prevent people from
noticing that the government was inviting
citizens to do its own work for free.



Meanwhile, legislators were
preparing to grant themselves
new powers to trawl through
our data. Nick Clegg promised
to block the Communications
Data Bill, aka the "snoopers' charter", which
wants to order services including Facebook
and Skype to record information about every
citizen's communications and to grant access
to the records on demand by the police.
These days, a promise from Nick Clegg
probably ranks quite low on the list of things
that the British public will bank on, but his
statement does indicate that he knows which
way public sentiment on official surveillance
is leaning. When such legislation is proposed, critics' rhetoric about a Big Brother
state is hardly overblown. Indeed, Mayer-Schönberger and Cukier go further, remarking tartly that the Stasi, too, were Big Data
fanatics avant la lettre.
There is no reason in principle why Big
Data cannot be used by the government
(or anyone else) to serve the public good, as
in Google Flu Trends. Although Big Data in
medicine raises privacy issues (as with the
UK government's current plans for data-sharing across the NHS), it can also prevent
needless deaths. Another positive development is the Big Data operation announced in
the US in April by the Consumer Financial
Protection Bureau. Banks and other financial
services companies already collect and pool
colossal volumes of data about their customers; now the official watchdog will force
them to supply the same data to its analysts,
so that the regulator can have detailed oversight of lenders' behaviour. In this instance,
amusingly, it is anti-regulation Republicans
who are crying "Big Brother". They don't
mind the credit companies holding and sharing such data, but heaven forbid that a body
tasked to protect consumers should get its
hands on it, too.
A subtler problem with Big Data, though,
is that it might lead us to downgrade what
can't easily be measured. When you have a
hammer, everything starts to look like a nail.
When you have such vast quantities of information, you might care only about what is
quantifiable. Stuff that can't easily be turned
into a forest of numbers for crunching might
get sidelined.
Big-data analysis in the humanities is called
culturomics. Its findings can be very interesting. Who would have guessed that, as a
Harvard study cited by Mayer-Schönberger
and Cukier has found, less than half the
number of English words that appear in books
are included in dictionaries, the rest being
lexical "dark matter"? Another recent study
by researchers at Bristol, Durham, Sheffield
and Stockholm analysed the appearance of
what it calls "mood words" in a large number
of 20th-century English-language novels.
(The mood words are those semantically
associated with six mood categories: anger,
disgust, fear, joy, sadness and surprise.) The
study concluded that American English has
become decidedly more "emotional" than
British English in the last half-century. The
scare-quotes around "emotional" may be
taken as an acknowledgment of the limits of
such a data-driven approach. After all, prose
can be emotional in more ways than one,
and there are many moods besides the six
named here.
No doubt counting words in books is, for
such researchers, an amusing activity that
harms no one and might offer fascinating results. But when just the same kind of analysis, only on a larger scale and tied to more
personal information, is done by a corporate
giant such as Amazon, we might feel more
squeamish. Thanks to its Kindle devices
and apps, Amazon is sitting on a prodigious
quantity of data about not only what people
read, but which passages they highlight and
during what part of a book they are most
likely to give up reading. In this titanic data-chest lurks an intriguing potential business
plan for Amazon, if not good news for the
culture at large, as Evgeny Morozov notes in
his latest book, To Save Everything, Click Here.
One day, Amazon might be able to build a
system that uses this aggregated mountain
of reading data to write new books automatically: books that readers are statistically
guaranteed to like. At that point, will writers
and readers the world over shrug and admit
that the data knows best?

stuff seems a modern kind of monstrous
egotism. One might look indulgently on the
Selfers as harmless eccentrics, but even they
signal the wider concerns we should have
about Big Data. If your friend is a Selfer
who records all his social interactions with
a webcam strapped to his head (yes, some
people do this) then he is going to be storing
video footage of you. And where exactly is
he putting it? And who else can get at it? The
central question of Big Data, as its invisible
bit-storm blows through every aspect of our
lives, is going to be who owns it and controls
access. Big Data is big power.
At least it is potentially. More heartening,
in the context of such dystopian worries, is a
recent story of human beings' loveable fallibility in the face of their own complex tools.
Willy Brandt International Airport in Berlin,
Germany, is a dazzlingly lit ghost town. It
was supposed to open in 2011; in January this
year it was announced that it still won't be
ready by the latest target of October 2013. So
why keep the lights on? Because officials
can't figure out how to turn them off. The
data-gobbling computer that runs the airport's systems is so complicated that no one
knows how to use it. Sometimes, apparently,
Big Data really is too big.
Steven Poole's latest book is You Aren't
What You Eat (Union Books, £7.99)

The great Victorian scientist Lord
Kelvin once said: "When you can
measure what you are speaking
about, and express it in numbers,
you know something about it; but
when you cannot express it in numbers, your
knowledge is of a meagre and unsatisfactory
kind; it may be the beginning of knowledge,
but you have scarcely, in your thoughts, advanced to the stage of science . . ."
This is a fine nostrum for mathematical
physics and other sciences, and could serve
as a slogan (of much more dubious aptness)
for the Big Data hype today in culture, commerce and politics. It might also be the reasoning behind one of the weirdest modern
tech subcultures, the Quantified Self movement. Using wearable gadgets and smartphone apps, Selfers collect information about
every measurable aspect of their daily lives,
including food intake, numbers of footsteps
walked, variations in heart rate, emails written and received, and even evacuations of
waste matter into porcelain receptacles.
This is perhaps best understood as a kind of
hoarding; the piling-up of terabytes of such
24-30 MAY 2013 | NEW STATESMAN | 25

