
contributed articles

doi:10.1145/2500499

Data Science and Prediction

By Vasant Dhar

Use of the term "data science" is increasingly common, as is "big data." But what does it mean? Is there something unique about it? What skills do "data scientists" need to be productive in a world deluged by data? What are the implications for scientific inquiry? Here, I address these questions from the perspective of predictive modeling.

The term "science" implies knowledge gained through systematic study. In one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions.11 Data science might therefore imply a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.

The short answer is that data science is different from statistics and other existing disciplines in several important ways. To start, the raw material, the "data" part of data science, is increasingly heterogeneous and unstructured—text, images, video—often emanating from networks with complex relationships between their entities. Figure 1 outlines the relative expected volumes of unstructured and structured data from 2008 to 2015 worldwide, projecting a difference of almost 200 petabytes (PB) in 2015 compared to a difference of 50PB in 2012. Analysis, including the combination of the two types of data, requires integration, interpretation, and sense making that is increasingly derived through tools from computer science, linguistics, econometrics, sociology, and other disciplines. The proliferation of markup languages and tags is designed to let computers interpret data automatically, making them active agents in the process of decision making. Unlike early markup languages (such as HTML) that emphasized the display of information for human consumption, most data generated by humans and computers today is for consumption by computers; that is, computers increasingly do background work for each other and make decisions automatically. This scalability in decision making has become possible because of big data that serves as the raw material for the creation of new knowledge; Watson, IBM's "Jeopardy!" champion, is a prime illustration of an emerging machine intelligence fueled by data and state-of-the-art analytics.

key insights
- Data science is the study of the generalizable extraction of knowledge from data.
- A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
- A data scientist requires an integrated skill set spanning mathematics, machine learning, artificial intelligence, statistics, databases, and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions.

64 Communications of the ACM | December 2013 | Vol. 56 | No. 12

From an engineering perspective, scale matters in that it renders the traditional database models somewhat inadequate for knowledge discovery. Traditional database methods are not suited for knowledge discovery because they are optimized for fast access and summarization of data, given what the user wants to ask, or a query, not discovery of patterns in massive swaths of data when users lack a well-formulated query. Unlike database querying, which asks "What data satisfies this pattern (query)?" discovery asks "What patterns satisfy this data?" Specifically, our concern is finding interesting and robust patterns that satisfy the data, where "interesting" is usually something unexpected and actionable and "robust" is a pattern expected to occur in the future. What makes an insight actionable? Other than domain-specific reasons, it is its predictive power; the return distribution associated with an action can be reliably estimated from past data and therefore acted upon with a high degree of confidence.

Figure 1. Projected growth of unstructured and structured data.
Total archived capacity, by content type, worldwide, 2008-2015 (petabytes)

Content type    2008    2009    2010    2011    2012    2013     2014     2015
Unstructured    11,430  16,737  25,127  39,237  59,600  92,536   147,885  226,716
Database        1,952   2,782   4,065   6,179   9,140   13,824   21,532   32,188
Email           1,652   2,552   4,025   6,575   10,411  16,796   27,817   44,091

The emphasis on prediction is particularly strong in the machine learning and knowledge discovery in databases, or KDD, communities. Unless a learned model is predictive, it is generally regarded with skepticism, a position mirroring the view expressed by the 20th-century Austro-British philosopher Karl Popper as a primary criterion for evaluating a theory and for scientific progress in general.24 Popper argued that theories that sought only to explain a phenomenon were weak, whereas those that made "bold predictions" that stand the test of time despite being readily falsifiable should be taken more seriously. In his well-known 1963 treatise on this subject, Conjectures and Refutations, Popper characterized Albert Einstein's theory of relativity as a "good" one since it made bold predictions that could be falsified; all attempts at falsification of the theory have indeed failed. In contrast, Popper argued that theories of psychoanalyst pioneers Sigmund Freud and Alfred Adler could be "bent" to accommodate virtually polar opposite scenarios and are weak in that they are virtually unfalsifiable.a The emphasis on predictive accuracy implicitly favors "simple" theories over more complex theories in that the accuracy of sparser models tends to be more robust on future data.4,20 The requirement on predictive accuracy on observations that

a Popper used opposite cases of a man who pushes a child into water with the intention of drowning the child and that of a man who sacrifices his life in an attempt to save the child. In Adler's view, the first man suffered from feelings of inferiority (producing perhaps the need to prove to himself that he dared to commit the crime), and so did the second man (whose need was to prove to himself that he dared to rescue the child at the expense of his own life).

Figure 2. Health-care-use database snippet. [Timeline chart: for each of 10 patients, medications and costs ($) plotted over a "clean period" before diagnosis, a diagnosis marker, and an "outcome period" after it.]
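The querying-versus-discovery distinction can be made concrete with a toy sketch. The records, field names, and thresholds below are invented for illustration: a query checks a pattern the user already has in mind, while discovery enumerates candidate patterns and keeps the "interesting" ones (here, those highly predictive of a complication).

```python
# Toy contrast between database querying ("What data satisfies this
# pattern?") and discovery ("What patterns satisfy this data?").
# All field names and records are hypothetical.
from itertools import product

records = [
    {"age_group": "over_50", "meds": "many", "complication": True},
    {"age_group": "over_50", "meds": "many", "complication": True},
    {"age_group": "under_50", "meds": "few", "complication": False},
    {"age_group": "under_50", "meds": "many", "complication": False},
    {"age_group": "over_50", "meds": "few", "complication": False},
    {"age_group": "under_50", "meds": "few", "complication": False},
]

def query(pattern):
    """Querying: return the rows that satisfy a pattern the user supplies."""
    return [r for r in records if all(r[k] == v for k, v in pattern.items())]

def discover(min_confidence=0.9):
    """Discovery: enumerate attribute-value conjunctions and keep those
    that predict a complication with high confidence."""
    domains = {"age_group": ["over_50", "under_50", None],
               "meds": ["many", "few", None]}          # None = unconstrained
    found = []
    for v_age, v_meds in product(domains["age_group"], domains["meds"]):
        pattern = {k: v for k, v in (("age_group", v_age), ("meds", v_meds))
                   if v is not None}
        if not pattern:
            continue                                   # empty pattern: skip
        matches = query(pattern)
        if matches:
            confidence = sum(r["complication"] for r in matches) / len(matches)
            if confidence >= min_confidence:
                found.append((pattern, confidence))
    return found

print(query({"age_group": "over_50"}))   # a question we knew to ask
print(discover())                        # questions we did not know to ask
```

On this toy data, discovery surfaces the conjunction {age_group: over_50, meds: many}, which predicts a complication with full confidence, without anyone having formulated that query in advance.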


will occur in the future is a key consideration in data science.

In the rest of this article, I cover the implications of data science from a business and research standpoint, first for skills, or what people in industry need to know and why. How should educators think about designing programs to deliver the skills most efficiently and enjoyably? And what kinds of decision-making skills will be required in the era of big data and how will they differ from the past when data was less plentiful?

The second part of my answer to defining big-data skills is aimed at research. How can scientists exploit the abundance of data and massive computational power to their advantage in scientific inquiry? How does this new line of thinking complement traditional methods of scientific inquiry? And how can it augment the way data scientists think about discovery and innovation?

Implications
A 2011 McKinsey industry report19 said the volume of data worldwide is growing at a rate of approximately 50% per year, or a roughly 40-fold increase since 2001. Hundreds of billions of messages are transmitted through social media daily, and millions of videos are uploaded daily across the Internet. As storage becomes almost free, most of it is stored because businesses generally associate a positive option value with data; that is, since it may turn out to be useful in ways not yet foreseen, why not just keep it? (One indicator of how inexpensive storage is today is the fact that it is possible to store the world's entire stock of music on a $500 device.)

Using large amounts of data for decision making became practical in the 1980s. The field of data mining burgeoned in the early 1990s as relational database technology matured and business processes were increasingly automated. Early books on data mining6,7,17 from the 1990s described how various methods from machine learning could be applied to a variety of business problems. A corresponding explosion involved software tools geared toward leveraging transactional and behavioral data for purposes of explanation and prediction.

An important lesson learned in the 1990s is that machine learning "works" in the sense that these methods detect subtle structure in data relatively easily without having to make strong assumptions about linearity, monotonicity, or parameters of distributions. The downside of these methods is they also pick up the noise in data,31 often with no way to distinguish between signal and noise, a point I return to shortly.

Despite their drawbacks, a lot can be said for methods that do not force us to make assumptions about the nature of the relationship between variables before we begin our inquiry. This is not trivial. Most of us are trained to believe theory must originate in the human mind based on prior theory, with data then gathered to demonstrate the validity of the theory. Machine learning turns this process around. Given a large trove of data, the computer taunts us by saying, "If only you knew what question to ask me, I would give you some very interesting answers based on the data." Such a capability is powerful since we often do not know what question to ask. For example, consider a health-care database of individuals who have been using the health-care system for many years, where among them a group has been diagnosed with Type 2 diabetes, and some subset of this group has developed complications. It could be very useful to know whether there are any patterns to the complications and whether the probability of complications can be predicted and therefore acted upon. However, it is difficult to know what specific query, if any, might reveal such patterns.

To make this scenario more concrete, consider the data emanating from a health-care system that essentially consists of "transactions," or points of contact over time between a patient and the system. Records include services rendered by health-care providers or medication dispensed on a particular date; notes and observations could also be part of the record. Figure 2 outlines what the raw data would look like for 10 individuals where the data is separated into a "clean period" (history prior to diagnosis), a red bar ("diagnosis"), and the "outcome period" (costs and other


outcomes, including complications). Each colored bar in the clean period represents a medication, showing the first individual was on seven different medications prior to diagnosis, the second on nine, the third on six, and so on. The sixth and tenth individuals were the costliest to treat and developed complications, as did the first three, represented by the upward-pointing green arrows.

Extracting interesting patterns is nontrivial, even from a tiny temporal database like this. Are complications associated with the yellow meds or with the gray meds? The yellows in the absence of the blues? Or is it more than three yellows or three blues? The list goes on. Even more significant, perhaps, if we created "useful" features or aggregations from the raw data, could physicians, insurers, or policy makers predict likely complications for individuals or for groups of people?

Feature construction is an important creative step in knowledge discovery. The raw data across individuals typically needs to be aggregated into some sort of canonical form before useful patterns can be discovered; for example, suppose we could count the number of prescriptions an individual is on without regard to the specifics of each prescription as one approximation of the "health status" of the individual prior to diagnosis. Such a feature ignores the "severity" or other characteristics of the individual medications, but such aggregation is nonetheless typical of feature engineering. Suppose, too, a "complications database" would be synthesized from the data, possibly including demographic information (such as patient age and medical history); it could also include health status based on a count of current medications; see Figure 3, in which a learning algorithm, designated by the right-facing blue arrow, could be applied to discover the pattern on the right. The pattern represents an abstraction of the data, or the type of question we should ask the database, if only we knew what to ask. Other data transformations and aggregations could yield other medically insightful patterns.

What makes the pattern on the right side of Figure 3 interesting? Suppose the overall complication rate in the population is 5%; that is, a random sample of the database includes, on average, 5% complications. In this scenario, the snippet on the right side of Figure 3 could be very interesting since its complication rate is many times greater than the average. The critical question is whether this is a pattern that is robust and hence predictive, likely to hold up in other cases in the future. The issue of determining robustness has been addressed extensively in the machine learning literature and is a key consideration for data scientists.23

If Figure 3 is representative of the larger database, the box on the right tells us the interesting question to ask the database: "What is the incidence of complications in Type 2 diabetes for people over age 36 who are on six or more medications?" In terms of actionability, such a pattern might suggest being extra vigilant about people with such a profile who do not currently have a complication in light of their high susceptibility to complications.

The general point is that when data is large and multidimensional, it is practically impossible for us to know a priori that a query (such as the one here concerning patterns in diabetes complications) is a good one, or one that provides a potentially interesting and actionable insight. Suitably designed machine learning algorithms help find such patterns for us. To be useful both practically and scientifically, the patterns must be predictive. The emphasis on predictability typically favors Occam's razor, or succinctness, since simpler models are more likely to hold up on future observations than more complex ones, all else being equal;4 for example, consider the diabetes complication pattern here:

Age > 36 and #Medications >= 6 → Complication_rate = 100%

A simpler competing model might ignore age altogether, stating simply that people on six or more medications tend to develop complications. The reliability of such a model would be more apparent when applied to future data; for example, does simplicity lead to greater future predictive accuracy in terms of fewer false positives and false negatives? If it does, it is favored. The

"A new powerful method is available for theory development not previously practical due to the paucity of data."
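The feature-construction step described here, counting a patient's distinct medications before diagnosis as a crude proxy for "health status," might be sketched as follows. The transaction schema, dates, and drug names are all hypothetical.

```python
# A sketch of feature construction: collapsing raw per-visit transactions
# into one feature per patient. The schema and records are hypothetical.
from collections import defaultdict
from datetime import date

# (patient_id, date, event_type, detail)
transactions = [
    (1, date(2010, 3, 1),  "rx", "metformin"),
    (1, date(2010, 6, 9),  "rx", "lisinopril"),
    (1, date(2011, 1, 5),  "diagnosis", "type2_diabetes"),
    (1, date(2011, 8, 2),  "rx", "insulin"),        # after diagnosis: ignored
    (2, date(2010, 2, 11), "rx", "atorvastatin"),
    (2, date(2012, 4, 30), "diagnosis", "type2_diabetes"),
]

def clean_period_med_count(rows):
    """Count distinct medications dispensed before diagnosis (the 'clean
    period'), a crude proxy for pre-diagnosis health status."""
    by_patient = defaultdict(list)
    for pid, when, kind, detail in rows:
        by_patient[pid].append((when, kind, detail))
    features = {}
    for pid, events in by_patient.items():
        dx = min(when for when, kind, _ in events if kind == "diagnosis")
        meds = {d for when, kind, d in events if kind == "rx" and when < dx}
        features[pid] = len(meds)   # severity of each med deliberately ignored
    return features

print(clean_period_med_count(transactions))   # {1: 2, 2: 1}
```

As in the text, this deliberately throws away the identity and severity of each medication; such lossy aggregation is typical of feature engineering, and other aggregations over the same transactions would yield different candidate features.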


practice of "out of sample" and "out of time" testing is used by data scientists to assess the robustness of patterns from a predictive standpoint.

When predictive accuracy is a primary objective in domains involving massive amounts of data, the computer tends to play a significant role in model building and decision making. The computer itself can build predictive models through an intelligent "generate and test" process, with the end result an assembled model that is the decision maker; that is, it automates Popper's criterion of predictive accuracy for evaluating models at a scale in ways not feasible before.

If we consider one of these patterns—that people with "poor health status" (proxied by number of medications) have high rates of complications—can we say poor health status "causes" complications? If so, perhaps we can intervene and influence the outcome by possibly controlling the number of medications. The answer is: it depends. It could be the case that the real cause is not in our observed set of variables. If we assume we have observed all relevant variables that could be causing complications, algorithms are available for extracting causal structure from data,21 depending on how the data was generated. Specifically, we still need a clear understanding of the "story" behind the data in order to know whether the possibility of causation can and should be entertained, even in principle. In our example of patients over age 36 with Type 2 diabetes, for instance, was it the case that the people on seven or more medications were "inherently sicker" and would have developed complications anyway? If so, it might be incorrect to conclude that large numbers of medications cause complications. If, on the other hand, the observational data followed a "natural experiment" where treatments were assigned randomly to comparable individuals and enough data is available for calculating the relevant conditional probabilities, it might be feasible to extract a causal model that could be used for intervention. This issue of extracting a causal model from data is addressed in the following sections; for a more complete treatment of causal models, see Pearl,21 Sloman,29 and Spirtes et al.30

Skills
Machine learning skills are fast becoming necessary for data scientists as companies navigate the data deluge and try to build automated decision systems that hinge on predictive accuracy.25 A basic course in machine learning is necessary in today's marketplace. In addition, knowledge of text processing and "text mining" is becoming essential in light of the explosion of text and other unstructured data in health-care systems, social networks, and other forums. Knowledge about markup languages like XML and its derivatives is also essential, as content becomes tagged and hence able to be interpreted automatically by computers.

Data scientists' knowledge about machine learning must build on more basic skills that fall into three broad classes. The first is statistics, especially Bayesian statistics, which requires a working knowledge of probability, distributions, hypothesis testing, and multivariate analysis. It can be acquired in a two- or three-course sequence. Multivariate analysis often overlaps with econometrics, which is concerned with fitting robust statistical models to economic data. Unlike machine learning methods, which make no or few assumptions about the functional form of relationships among variables, multivariate analysis and econometrics by and large focus on estimating parameters of linear models where the relationship between the dependent and independent variables is expressed as a linear equality.

The second class of skills comes from computer science and pertains to how data is internally represented and manipulated by computers. This involves a sequence of courses on data structures, algorithms, and systems, including distributed computing, databases, parallel computing, and fault-tolerant computing. Together with scripting languages (such as Python and Perl), systems skills are the fundamental building blocks required for dealing with reasonable-size datasets. For handling very large datasets, however, standard database systems built on the relational data model have severe limitations. The recent move toward cloud computing and nonrelational structures for dealing with enormous datasets in a robust manner signals a new set of required skills for data scientists.

The third class of skills requires knowledge about correlation and causation and is at the heart of virtually any modeling exercise involving data. While observational data generally limits us to correlations, we can get lucky. Sometimes plentiful data might represent natural randomized trials and the possibility of calculating conditional probabilities reliably, enabling discovery of causal structure.22 Building causal models is desirable in domains where one has reasonable confidence as to the completeness of the formulated model and its stability, or whether the causal model "generating" the observed data is stable. At the very least, a data scientist should have a clear idea of the distinction between correlation and causality and the ability to assess which models are feasible, desirable, and practical in different settings.

The final skill set is the least standardized and somewhat elusive and to

Figure 3. Extracting interesting patterns in health outcomes from health-care system use.

Patient   Age   #Medications   Complication
1         52    7              Yes
2         57    9              Yes
3         43    6              Yes
4         33    6              No
5         35    8              No
6         49    8              Yes
7         58    4              No
8         62    3              No
9         48    0              No
10        37    6              Yes

Learned pattern: Age >= 37 AND #Medications >= 6 → Complication = Yes (100% confidence)
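The learning step depicted in Figure 3 can be approximated with a brute-force search for the most general threshold rule that holds with 100% confidence on the table's ten rows. A real system would use a decision-tree or rule learner rather than this exhaustive sketch, and the resulting rule's robustness would then be checked on held-out ("out of sample") data rather than on the rows used to find it.

```python
# Brute-force search for the threshold rule
#   Age >= a AND #Medications >= m -> Complication
# that holds with 100% confidence and covers the most patients.
# The rows below are exactly the Figure 3 table.

patients = [  # (age, n_medications, complication)
    (52, 7, True), (57, 9, True), (43, 6, True), (33, 6, False),
    (35, 8, False), (49, 8, True), (58, 4, False), (62, 3, False),
    (48, 0, False), (37, 6, True),
]

def best_rule(rows):
    best = None  # (coverage, age_cut, med_cut)
    for a in sorted({age for age, _, _ in rows}):
        for m in sorted({meds for _, meds, _ in rows}):
            covered = [c for age, meds, c in rows if age >= a and meds >= m]
            if covered and all(covered):              # 100% confidence
                if best is None or len(covered) > best[0]:
                    best = (len(covered), a, m)
    return best

coverage, age_cut, med_cut = best_rule(patients)
print(f"Age >= {age_cut} and #Medications >= {med_cut} "
      f"-> Complication (covers {coverage} patients)")  # matches Figure 3
```

On this data the search recovers Age >= 37 and #Medications >= 6, covering all five complication cases and none of the others, which is precisely the boxed pattern in Figure 3.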


some extent a craft but also a key differentiator to be an effective data scientist—the ability to formulate problems in a way that results in effective solutions. Herbert Simon, the 20th-century American economist and a pioneer of artificial intelligence, demonstrated that many seemingly different problems are often "isomorphic," or have the identical underlying structure. He demonstrated that many recursive problems could be expressed as the standard Towers of Hanoi problem, having identical initial and goal states and operators. His larger point was it is easy to solve seemingly difficult problems if represented creatively with isomorphism in mind.28

In a broader sense, formulation expertise involves the ability to see commonalities across very different problems; for example, many problems have "unbalanced target classes," usually denoting the dependent variable is interesting only sometimes (such as when people develop diabetes complications or respond to marketing offers or promotions). These are the cases of interest we would like to predict. Such problems are a challenge for models that, in Popperian terms, must go out on a limb to make predictions that are likely to be wrong unless the model is extremely good at discriminating among the classes. Experienced data scientists are familiar with these problems and know how to formulate them in a way that gives a system a chance to make correct predictions under conditions where the priors are stacked heavily against it.

Problem-formulation skills represent core skills for data scientists over the next decade. The term "computational thinking" coined by Papert21 and elaborated by Wing32 is similar in spirit to the skills described here. There is considerable activity in universities to train students in problem-formulation skills and provide electives structured around the core that are more suited to specific disciplines.

The data science revolution also poses serious organizational challenges as to how organizations manage their data scientists. Besides recognizing and nurturing the appropriate skill sets, it requires a shift in managers' mind-sets toward data-driven decision making to replace or augment intuition and past practices. A famous quote by the 20th-century American statistician W. Edwards Deming—"In God we trust, everyone else please bring data"—has come to characterize the new orientation, from intuition-based decision making to fact-based decision making.

From a decision-making standpoint, we are moving into an era of big data where for many types of problems computers are inherently better decision makers than humans, where "better" could be defined in terms of cost, accuracy, and scalability. This shift has already happened in the world of data-intensive finance, where computers make the majority of investment decisions, often in fractions of a second, as new information becomes available. The same holds in areas of online advertising, where millions of auctions are conducted in milliseconds every day, air traffic control, routing of package delivery, and many types of planning tasks that require scale, speed, and accuracy simultaneously, a trend likely to accelerate in the next few years.

Knowledge Discovery
Former editor of Wired magazine Chris Anderson1 drew on the quote by British-born statistician George Box that "All models are wrong, but some are useful," arguing that, with the huge amounts of data available today, we do not need to settle for wrong models or any models at all. Anderson said prediction is of paramount importance to businesses, and data can be used to let such models emerge through machine learning algorithms, largely unaided by humans, pointing to companies like Google as symbolizing the triumph of machine learning over top-down theory development. Google's language translator does not "understand" language, nor do its algorithms know the contents of webpages. IBM's Watson does not "understand" the questions it is asked or use deep causal knowledge to generate questions to the answers it is given. There are dozens of lesser-known companies that likewise are able to predict the odds of someone responding to a display ad without a solid theory but rather based on gobs of data about the behavior of individuals and the similarities and differences in that behavior.

Anderson's 2008 article launched a vigorous debate in academic circles. How can one have science and predictive models without first articulating a theory?

The observation by Dhar and Chou5 that "patterns emerge before reasons for them become apparent" tends to resonate universally among professionals, particularly in financial markets, marketing, health care, and fields that study human behavior. If this is true, Box's observation becomes relevant: If a problem is nonstationary and a model is only an approximation anyway, why not build the best predictive model based on data available until that time and just update it periodically? Why bother developing a detailed causal model if it is poor at prediction and, more important, likely to get worse over time due to "concept drift"?

Some scientists would say there is no theory without causality, that all observational data, except total chaos, must be generated from a causal model. In the earlier health-care example involving medical complications in patients with Type 2 diabetes,

Figure 4. Sources of error in predictive models and their mitigation.
1. Misspecification of the model (big data admits a larger space of functional forms).
2. Using a sample to estimate the model (with big data, the sample is a good estimate of the population).
3. Randomness.
Predictive modeling attempts to minimize the combination of the first two errors.
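The first two error sources in Figure 4 lend themselves to a small simulation. The population below is entirely synthetic (a 10% high-risk group with made-up complication rates chosen so the overall rate is about 5%, echoing the earlier example): a misspecified model that ignores the risk group scores worse than a correctly specified one, and estimates from small samples are noisier than estimates from large ones.

```python
# A sketch of the first two error sources in Figure 4, on a synthetic
# population. All rates here are invented for illustration.
import random

random.seed(7)
HIGH_RISK_SHARE, P_HIGH, P_LOW = 0.10, 0.30, 0.022  # overall rate ~0.05

def draw(n):
    """Sample n patients as (is_high_risk, had_complication)."""
    out = []
    for _ in range(n):
        high = random.random() < HIGH_RISK_SHARE
        out.append((high, random.random() < (P_HIGH if high else P_LOW)))
    return out

def brier(sample, predict):
    """Mean squared error of predicted probabilities (lower is better)."""
    return sum((y - predict(high)) ** 2 for high, y in sample) / len(sample)

test_set = draw(200_000)
pooled_rate = sum(y for _, y in test_set) / len(test_set)  # ~0.05

# Error source 1: misspecification -- one rate for everyone vs. group rates.
err_pooled = brier(test_set, lambda high: pooled_rate)
err_group  = brier(test_set, lambda high: P_HIGH if high else P_LOW)
print(f"misspecified: {err_pooled:.4f}  correctly specified: {err_group:.4f}")

# Error source 2: estimation from a sample -- more data, better estimates.
for n in (100, 100_000):
    est = sum(y for _, y in draw(n)) / n
    print(f"n={n:>7}: estimated overall rate {est:.3f}")
```

The third source in Figure 4, randomness, is the part neither more data nor a better-specified model removes: even the correctly specified model cannot say which individual patients will develop a complication, only the probability that they will.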


this seems obvious; some underlying theory development is often on pro-


mechanism must have been respon- posing theories that embody causality
sible for the observed outcomes. But without serious consideration of their
we may not have observed or been ca- predictive power. When such a theory
pable of observing the causal picture.
Even if we observed the right variables Big data makes claims “A causes B,” data is gathered to
confirm whether the relationship is in-
we would need to know how the obser-
vational data was generated before we
it feasible for a deed causal. But its predictive accuracy
could be poor because the theory is in-
can in principle draw causal connec- machine to ask and complete. Indeed, it is not uncommon
tions. If the observations represent
a natural experiment (such as physi-
validate interesting for two experts in the social sciences to
propose opposite relationships among
cians using a new drug vs. other physi- questions humans the variables and offer diametrically
cians using an old one for comparable
individuals), the data might reveal
might not consider. opposite predictions based on the
same sets of facts; for example, econo-
causality. On the other hand, if the mists routinely disagree on both theory
new drug is prescribed primarily for and prediction, and error rates of fore-
“sicker” individuals, it would repre- casts tend to be high.
sent a specific kind of bias in the data. How could big data put these
Anderson’s point has particular rel- domains on firmer ground? In the
evance in the health, social, and earth “hard” sciences, where models can be
sciences in the era of big data since assumed, for practical purposes, to
these areas are generally character- be complete, there exists the possibil-
ized by a lack of solid theory but where ity of extracting causal models from
we now see huge amounts of data that large amounts of data. In other fields,
can serve as grist for theory building3,12,13 or for understanding large-scale social behavior and attitudes and how they can be altered.14 Contrast physics and the social sciences at opposite ends of the spectrum in terms of the predictive power of their theories. In physics, a theory is expected to be "complete," in the sense that a relationship among certain variables is intended to explain the phenomenon completely, with no exceptions. Such a model is expected to make perfect predictions, subject to measurement error but not to error due to omitted variables or unintended consequences. In such domains, the explanatory and predictive models are synonymous. The behavior of a space shuttle, for example, is explained completely by the causal model describing the physical forces acting on it. This model can also be used to predict what will happen if any input changes. It is not sufficient to have a model that is 95% sure of outcomes and leaves the rest to chance. Engineering follows science.

In contrast, the social sciences are generally characterized by incomplete models intended to be partial approximations of reality, often based on assumptions about human behavior known to be simplistic. A model correct 95% of the time in this world would be considered quite good. Ironically, however, the emphasis in social science has been on explanatory models rather than predictive accuracy, even though machine learning applied to large amounts of data can result in accurate predictive models with no causal insights immediately apparent. As long as their prediction errors are small, such models could still point us in the right direction for theory development. As an example of being pointed in the right direction, a health-care research scientist recently remarked on an observed pattern of coronary failure being preceded months earlier by a serious infection. One of his conjectures was that infections might have caused inflamed arteries and loosened plaque that subsequently caused coronary failure. There could be other explanations, but if the observed pattern is predictive, it might be worthy of publication and deeper inquiry. The question such a case raises for the gatekeepers of science is whether to give more weight to the Popperian test of predictive accuracy on future data and favor simple, accurate predictive models as potential components of future theory, instead of requiring a causal model up front that is then tested against the data.

What makes predictive models accurate? Conversely, where do errors come from?

Hastie et al.10 said errors in prediction come from three sources. The first is misspecification of a model; a linear model that attempts to fit a nonlinear phenomenon, for example, generates error simply because the linear form imposes an inappropriate bias on the problem. The second is the samples used for estimating parameters; the smaller the samples, the greater the bias in the model's estimates. And the third is randomness, even when the model is specified perfectly.
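Hastie et al.'s decomposition is easy to see in a small simulation. The sketch below is illustrative only: the quadratic ground truth, the noise level, and all helper names are assumptions made for this example, not anything from the article. It shows the first and third sources of error: a misspecified linear model retains a large out-of-sample error even with abundant data, while irreducible randomness sets a floor (noise_sd squared) that even a perfectly specified model could not beat.

```python
import random
import statistics

def true_f(x):
    # Nonlinear "phenomenon" the modeler is trying to capture.
    return x * x

def make_data(n, noise_sd, rng):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Closed-form simple linear regression: returns (intercept, slope).
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

rng = random.Random(0)
noise_sd = 0.1                      # source 3: irreducible randomness

# Fit the (misspecified) linear model on plentiful training data.
a, b = fit_line(*make_data(50_000, noise_sd, rng))

# Out-of-sample squared error of the linear model.
xs, ys = make_data(10_000, noise_sd, rng)
mse_linear = statistics.fmean((y - (a + b * x)) ** 2
                              for x, y in zip(xs, ys))

# Source 1: misspecification bias. Even with 50,000 training points the
# linear model's error stays far above the noise floor of
# noise_sd**2 = 0.01, because no straight line can track a parabola.
print(f"linear test MSE: {mse_linear:.3f}  noise floor: {noise_sd**2:.3f}")
```

Running this shows the linear model's test error settling around ten times the noise floor; more data cannot close a gap caused by the wrong functional form.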
Big data allows data scientists to significantly reduce the first two types of error (see Figure 4). Large amounts of data allow us to consider models that make fewer assumptions about functional form than linear or logistic regressions, simply because there is a lot more data with which to test such models and compute reliable error bounds.27 Big data also eliminates the second type of error, as sample estimates become reasonable proxies for the population.
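Why big data nearly eliminates the second error source can be sketched in a few lines; the population (a standard normal) and the sample sizes below are illustrative assumptions, not figures from the article. Estimation error shrinks roughly as one over the square root of the sample size:

```python
import random
import statistics

rng = random.Random(0)

def sample_mean(n):
    # One estimate: the mean of n draws from a fixed population.
    return statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n))

def spread(n, reps=200):
    # Standard deviation of the estimate across repeated samples:
    # a direct measure of estimation error at sample size n.
    return statistics.stdev(sample_mean(n) for _ in range(reps))

small, big = spread(100), spread(10_000)
print(f"spread at n=100: {small:.4f}   spread at n=10,000: {big:.4f}")
```

A 100-fold increase in sample size tightens the estimate roughly 10-fold, which is why statistics computed on very large samples behave as reliable proxies for population values.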
The theoretical limitation of observational data of the sort in these examples, regardless of how big it is, is that the data is generally "passive," representing what actually happened, in contrast to the multitude of things that could have happened had circumstances been different. In health care, it is like having observed the use of the health-care system passively and now having the chance to understand it in retrospect and extract predictive patterns from it. Unless we are fortunate enough that the data provided us the right experiments naturally, it does not tell us what would have happened if some other treatment had been administered to a specific patient or to an identical one; that is, it does not represent a clean, controlled, randomized experiment in which the researcher is able to establish controls and measure the differential effect of treatments on matched pairs.

Interestingly, however, the Internet era is fertile ground for conducting inexpensive large-scale randomized experiments on social behavior; Kohavi et al.15 provide a number of examples. A 2012 controlled experiment by Aral and Walker2 on the adoption of video games asked whether it was "influence" or "homophily" that affected choice, and it uncovered profiles of people who are influential and susceptible. Results include patterns (such as "older men are more influential than younger men" and "people of the same age group have more influence on each other than from other age groups"). While specific to games, these results suggest influence is nuanced, certainly more so than existing theories like Malcolm Gladwell's concept of "super influencers"8 and myriad other popular theories. Big data provides a basis for testing them.
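The gap between passive observation and a randomized experiment can be illustrated with a hypothetical treatment scenario; the effect size, the latent "frailty" confounder, and every number below are invented for illustration. When treatment assignment depends on an unobserved condition, the naive difference in means is biased and the treatment can even look harmful; flipping a coin severs that link and the same comparison recovers the true effect:

```python
import random
import statistics

rng = random.Random(0)
TRUE_EFFECT = 2.0   # treatment truly improves the outcome by 2 units

def outcome(treated, frailty):
    # Outcome worsens with a latent "frailty" the analyst never observes.
    return (TRUE_EFFECT if treated else 0.0) - 10.0 * frailty + rng.gauss(0.0, 1.0)

def effect_estimate(assign):
    # Difference in mean outcomes between treated and untreated groups.
    treated, control = [], []
    for _ in range(20_000):
        frailty = rng.random()
        t = assign(frailty)
        (treated if t else control).append(outcome(t, frailty))
    return statistics.fmean(treated) - statistics.fmean(control)

# Passive data: frailer patients are the ones who received treatment,
# confounding the comparison.
observational = effect_estimate(lambda frailty: frailty > 0.6)

# Randomized experiment: a coin flip breaks the link between frailty
# and treatment.
randomized = effect_estimate(lambda frailty: rng.random() < 0.5)

print(f"observational estimate: {observational:+.2f}")
print(f"randomized estimate:    {randomized:+.2f}  (truth {TRUE_EFFECT:+.2f})")
```

This is the matched-pairs logic the passage describes: randomization, not model sophistication, is what licenses reading the difference in means causally.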

One of the most far-reaching modern applications of big data is in politics, as exemplified by the Democratic National Committee's heavy investment in data and analytics prior to President Barack Obama's winning 2012 campaign, which debunked widely held beliefs (such as that voters in the "middle" are most critical to outcomes, when in fact issues that resonate with some segments of solidly partisan voters can sway them14). In the campaign, the DNC crafted predictive models on the basis of results from large-scale experiments used to manipulate attitudes. The campaign predicted, at the level of individual voters, how each eligible voter would vote, as well as how to "turn someone into the type of person it wanted you to be."14

Social science theory building is also likely to get a good boost from big data and machine learning. Never before have social scientists been able to observe human behavior at the degree of granularity and variability seen today, with increasing amounts of human interaction and economic activity mediated by the Internet. While the inductive method has limitations, the sheer volume of data being generated makes induction not only feasible but productive. That is not to say the traditional scientific method is "dead," as claimed by Anderson.1 On the contrary, it continues to serve us well. However, a new, powerful method is available for theory development that was not previously practical due to the paucity of data. That era of limited data and its associated assumptions is largely over.

Conclusion

Hypothesis-driven research and approaches to theory development have served us well. But a lot of data is emanating around us where these traditional approaches to identifying structure do not scale well or take advantage of observations that would not occur under controlled circumstances; for example, in health care, controlled experiments have helped identify many causes of disease but may not reflect the actual complexities of health.3,18 Indeed, some estimates claim clinical trials exclude as much as 80% of the situations in which a drug might be prescribed, as when a patient is on multiple medications.3 In situations where we are unable to design randomized trials, big data makes it feasible to uncover the causal models generating the data.

As shown earlier in the diabetes-related health-care example, big data makes it feasible for a machine to ask and validate interesting questions humans might not consider. This capability is indeed the foundation for building predictive modeling, which is key to actionable business decision making. For many data-starved areas of inquiry, especially health care and the social, ecological, and earth sciences, data provides an unprecedented opportunity for knowledge discovery and theory development. Never before have these areas had data of the variety and scale available today.

This emerging landscape calls for the integrative skill set identified here as essential for emerging data scientists. Academic programs in computer science, engineering, and business management teach a subset of these skills but have yet to teach the integration of skills needed to function as a data scientist or to manage data scientists productively. Universities are scrambling to address the lacunae and provide a more integrated skill set covering basic skills in computer science, statistics, causal modeling, problem isomorphs and formulation, and computational thinking.

Predictive modeling and machine learning are increasingly central to the business models of Internet-based data-driven businesses. An early success, PayPal, was able to capture and dominate consumer-to-consumer payments due to its ability to predict the distribution of losses for each transaction and act accordingly. This data-driven ability was in sharp contrast to the prevailing practice of treating transactions identically from a risk standpoint. Predictive modeling is also at the heart of Google's search engine and several other products. But the first machine that could arguably be considered to pass the Turing test and create new insights in the course of problem solving is IBM's Watson, which makes extensive use of learning and prediction in its problem-solving process. In a game like "Jeopardy!," where understanding the question itself is often nontrivial and the domain open-ended and nonstationary, it is not practical to succeed through an extensive enumeration of possibilities or top-down theory building. The solution is to endow a computer with the ability to train itself automatically based on large numbers of examples. Watson also demonstrated that the power of machine learning is greatly amplified through the availability of high-quality human-curated data, as in Wikipedia. This trend of combining human knowledge with machine learning also appears to be on the rise. Google's recent foray into the Knowledge Graph16 is intended to enable the system to understand the entities corresponding to the torrent of strings it processes continuously. Google wants to understand "things," not just "strings."26

Organizations and managers face significant challenges in adapting to the new world of data. It is suddenly possible to test many of their established intuitions, experiment cheaply and accurately, and base decisions on data. This opportunity requires a fundamental shift in organizational culture, one seen in organizations that have embraced the emerging world of data for decision making.

References
1. Anderson, C. The end of theory: The data deluge makes the scientific method obsolete. Wired 16, 7 (June 23, 2008).
2. Aral, S. and Walker, D. Identifying influential and susceptible members of social networks. Science 337, 6092 (June 21, 2012).
3. Buchan, I., Winn, J., and Bishop, C. A Unified Modeling Approach to Data-Intensive Healthcare. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
4. Dhar, V. Prediction in financial markets: The case for small disjuncts. ACM Transactions on Intelligent Systems and Technologies 2, 3 (Apr. 2011).
5. Dhar, V. and Chou, D. A comparison of nonlinear models for financial prediction. IEEE Transactions on Neural Networks 12, 4 (June 2001), 907–921.
6. Dhar, V. and Stein, R. Seven Methods for Transforming Corporate Data Into Business Intelligence. Prentice-Hall, Englewood Cliffs, NJ, 1997.
7. Frawley, W. and Piatetsky-Shapiro, G., Eds. Knowledge Discovery in Databases. AAAI/MIT Press, Cambridge, MA, 1991.
8. Gladwell, M. The Tipping Point: How Little Things Can Make a Big Difference. Little Brown, New York, 2000.
9. Goel, S., Watts, D., and Goldstein, D. The structure of online diffusion networks. In Proceedings of the 13th ACM Conference on Electronic Commerce (2012), 623–638.
10. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2009.
11. Heilbron, J.L., Ed. The Oxford Companion to the History of Modern Science. Oxford University Press, New York, 2003.
12. Hey, T., Tansley, S., and Tolle, K., Eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
13. Hunt, J., Baldocchi, D., and van Ingen, C. Redefining Ecological Science Using Data. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
14. Issenberg, S. A more perfect union: How President Obama's campaign used big data to rally individual voters. MIT Technology Review (Dec. 2012).
15. Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
16. Lin, T., Pantel, P., Gamon, M., Kannan, A., and Fuxman, A. Active objects: Actions for entity-centric search. In Proceedings of the 21st International Conference on the World Wide Web (Lyon, France). ACM Press, New York, 2012.
17. Linoff, G. and Berry, M. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, Inc., New York, 1997.
18. Maguire, J. and Dhar, V. Comparative effectiveness for oral anti-diabetic treatments among newly diagnosed Type 2 diabetics: Data-driven predictive analytics in healthcare. Health Systems 2 (2013), 73–92.
19. McKinsey Global Institute. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical Report, June 2011.
20. Meinshausen, N. Relaxed lasso. Computational Statistics & Data Analysis 52, 1 (Sept. 15, 2007), 374–393.
21. Papert, S. An exploration in the space of mathematics educations. International Journal of Computers for Mathematical Learning 1, 1 (1996), 95–123.
22. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, U.K., 2000.
23. Perlich, C., Provost, F., and Simonoff, J. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research 4, 12 (2003), 211–255.
24. Popper, K. Conjectures and Refutations. Routledge, London, 1963.
25. Provost, F. and Fawcett, T. Data Science for Business. O'Reilly Media, New York, 2013.
26. Roush, W. Google gets a second brain, changing everything about search. Xconomy (Dec. 12, 2012); http://www.xconomy.com/san-francisco/2012/12/12/google-gets-a-second-brain-changing-everything-about-search/?single_page=true
27. Shmueli, G. To explain or to predict? Statistical Science 25, 3 (Aug. 2010), 289–310.
28. Simon, H.A. and Hayes, J.R. The understanding process: Problem isomorphs. Cognitive Psychology 8, 2 (Apr. 1976), 165–190.
29. Sloman, S. Causal Models. Oxford University Press, Oxford, U.K., 2005.
30. Spirtes, P., Scheines, R., and Glymour, C. Causation, Prediction and Search. Springer, New York, 1993.
31. Tukey, J.W. Exploratory Data Analysis. Addison-Wesley, Boston, 1977.
32. Wing, J. Computational thinking. Commun. ACM 49, 3 (Mar. 2006), 33–35.

Vasant Dhar (vdhar@stern.nyu.edu) is a professor and co-director of the Center for Business Analytics at the Stern School of Business at New York University, New York.

Copyright held by Owner/Author(s). Publication rights licensed to ACM. $15.00

December 2013 | Vol. 56 | No. 12 | Communications of the ACM