doi:10.1145/2500499

Data Science and Prediction

By Vasant Dhar

Big data promises automated actionable knowledge creation and predictive models for use by both humans and computers.

Key Insights
- Data science is the study of the generalizable extraction of knowledge from data.
- A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.
- A data scientist requires an integrated skill set spanning mathematics, machine learning, artificial intelligence, statistics, databases, and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions.

Use of the term “data science” is increasingly common, as is “big data.” But what does it mean? Is there something unique about it? What skills do “data scientists” need to be productive in a world deluged by data? What are the implications for scientific inquiry? Here, I address these questions from the perspective of predictive modeling.

The term “science” implies knowledge gained through systematic study. In one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions.11 Data science might therefore imply a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.

The short answer is that data science is different from statistics and other existing disciplines in several important ways. To start, the raw material, the “data” part of data science, is increasingly heterogeneous and unstructured—text, images, video—often emanating from networks with complex relationships between their entities. Figure 1 outlines the relative expected volumes of unstructured and structured data from 2008 to 2015 worldwide, projecting a difference of almost 200 petabytes (PB) in 2015 compared to a difference of 50PB in 2012. Analysis, including the combination of the two types of data, requires integration, interpretation, and sense making that is increasingly derived through tools from computer science, linguistics, econometrics, sociology, and other disciplines. The proliferation of markup languages and tags is designed to let computers interpret data automatically, making them active agents in the process of decision making. Unlike early markup languages (such as HTML) that emphasized the display of information for human consumption, most data generated by humans and computers today is for consumption by computers; that is, computers increasingly do background work for each other and make decisions automatically. This scalability in decision making has become possible because of big data that serves as the raw material for the creation of new knowledge; Watson, IBM’s “Jeopardy!” champion, is a prime illustration of an emerging machine intelligence fueled by data and state-of-the-art analytics.
From an engineering perspective, scale matters in that it renders the traditional database models somewhat inadequate for knowledge discovery. Traditional database methods are not suited for knowledge discovery because they are optimized for fast access and summarization of data, given what the user wants to ask, or a query, not discovery of patterns in massive swaths of data when users lack a well-formulated query. Unlike database querying, which asks “What data satisfies this pattern (query)?” discovery asks “What patterns satisfy this data?” Specifically, our concern is finding interesting and robust patterns that satisfy the data, where “interesting” is usually something unexpected and actionable and “robust” is a pattern expected to occur in the future.

What makes an insight actionable? Other than domain-specific reasons, it is its predictive power; the return distribution associated with an action can be reliably estimated from past data and therefore acted upon with a high degree of confidence.
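The query-versus-discovery inversion can be sketched in a few lines of Python; the patient records and thresholds below are hypothetical, in the spirit of the health-care example used later in the article.

```python
# Querying vs. discovery: a query retrieves records matching a known
# pattern; discovery searches the space of patterns the data supports.
# Records and thresholds are hypothetical.
records = [
    {"age": 52, "meds": 7, "complication": True},
    {"age": 57, "meds": 9, "complication": True},
    {"age": 33, "meds": 6, "complication": False},
    {"age": 58, "meds": 4, "complication": False},
]

# Query: "What data satisfies this pattern?"
matches = [r for r in records if r["age"] >= 37 and r["meds"] >= 6]

# Discovery: "What patterns satisfy this data?" -- enumerate candidate
# threshold rules and keep those with perfect confidence on the data.
patterns = []
for a in range(30, 60):
    for m in range(0, 10):
        covered = [r for r in records if r["age"] >= a and r["meds"] >= m]
        if covered and all(r["complication"] for r in covered):
            patterns.append((a, m, len(covered)))
```

Real discovery systems search far larger pattern spaces with pruning (as in decision-tree induction), but the inversion is the same: the data is fixed and the pattern is the unknown.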
Figure 1. Projected growth of unstructured and structured data.

Total archived capacity, by content type, worldwide, 2008–2015 (petabytes):

              2008    2009    2010    2011    2012    2013     2014     2015
Unstructured  11,430  16,737  25,127  39,237  59,600  92,536   147,885  226,716
Database       1,952   2,782   4,065   6,179   9,140  13,824    21,532   32,188
Email          1,652   2,552   4,025   6,575  10,411  16,796    27,817   44,091

The emphasis on prediction is particularly strong in the machine learning and knowledge discovery in databases (KDD) communities. Unless a learned model is predictive, it is generally regarded with skepticism, a position mirroring the view expressed by the 20th-century Austro-British philosopher Karl Popper as a primary criterion for evaluating a theory and for scientific progress in general.24 Popper argued that theories that sought only to explain a phenomenon were weak, whereas those that made “bold predictions” that stand the test of time despite being readily falsifiable should be taken more seriously. In his well-known 1963 treatise on this subject, Conjectures and Refutations, Popper characterized Albert Einstein’s theory of relativity as a “good” one since it made bold predictions that could be falsified; all attempts at falsification of the theory have indeed failed. In contrast, Popper argued that the theories of psychoanalyst pioneers Sigmund Freud and Alfred Adler could be “bent” to accommodate virtually polar opposite scenarios and are weak in that they are virtually unfalsifiable.a The emphasis on predictive accuracy implicitly favors “simple” theories over more complex theories in that the accuracy of sparser models tends to be more robust on future data.4,20 The requirement of predictive accuracy on observations that have yet to occur underlies the practice of “out of sample” and “out of time” testing, which data scientists use to assess the robustness of patterns from a predictive standpoint.

a Popper used the opposite cases of a man who pushes a child into water with the intention of drowning the child and that of a man who sacrifices his life in an attempt to save the child. In Adler’s view, the first man suffered from feelings of inferiority (producing perhaps the need to prove to himself that he dared to commit the crime), and so did the second man (whose need was to prove to himself that he dared to rescue the child at the expense of his own life).

Figure 2. Health-care-use database snippet (per-patient timeline showing a clean period, diagnosis, and outcome period, with marks indicating health-care system use over time).
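The “out of sample” and “out of time” testing used by data scientists can be sketched as follows; the time series and the deliberately crude model are synthetic.

```python
# Out-of-time testing: estimate a model on an earlier window and score
# it only on later observations it never saw. The series is synthetic.
data = [(t, 2.0 * t + (-1) ** t) for t in range(100)]  # (time, value)

train = [(t, v) for t, v in data if t < 80]   # in-sample window
test = [(t, v) for t, v in data if t >= 80]   # out-of-time window

# A deliberately crude model: a single slope fit from the training window.
t_last, v_last = train[-1]
slope = v_last / t_last

errors = [abs(v - slope * t) for t, v in test]
mean_abs_error = sum(errors) / len(errors)
```

A pattern whose accuracy holds up on the out-of-time window is “robust” in the article’s sense; one that only fits the training window is not.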
When predictive accuracy is a primary objective in domains involving massive amounts of data, the computer tends to play a significant role in model building and decision making. The computer itself can build predictive models through an intelligent “generate and test” process, with the end result an assembled model that is the decision maker; that is, it automates Popper’s criterion of predictive accuracy for evaluating models, at a scale and in ways not feasible before.

Figure 3. Extracting interesting patterns in health outcomes from health-care system use.

Patient   Age   #Medications   Complication
1         52    7              Yes
2         57    9              Yes
3         43    6              Yes
4         33    6              No
5         35    8              No
6         49    8              Yes
7         58    4              No
8         62    3              No
9         48    0              No

Rule: Age >= 37 AND #Medications >= 6 → Complication = Yes (100% confidence)

If we consider one of these patterns—that people with “poor health status” (proxied by number of medications) have high rates of complications—can we say poor health status “causes” complications? If so, perhaps we can intervene and influence the outcome, possibly by controlling the number of medications. The answer is: it depends. It could be the case that the real cause is not in our observed set of variables. If we assume we have observed all relevant variables that could be causing complications, algorithms are available for extracting causal structure from data,21 depending on how the data was generated. Specifically, we still need a clear understanding of the “story” behind the data in order to know whether the possibility of causation can and should be entertained, even in principle. In our example of patients over age 36 with Type 2 diabetes, for instance, was it the case that the people on seven or more medications were “inherently sicker” and would have developed complications anyway? If so, it might be incorrect to conclude that large numbers of medications cause complications. If, on the other hand, the observational data followed a “natural experiment” where treatments were assigned randomly to comparable individuals and enough data is available for calculating the relevant conditional probabilities, it might be feasible to extract a causal model that could be used for intervention. This issue of extracting a causal model from data is addressed in the following section.

Skills

Machine learning skills are fast becoming necessary for data scientists as companies navigate the data deluge and try to build automated decision systems that hinge on predictive accuracy.25 A basic course in machine learning is necessary in today’s marketplace. In addition, knowledge of text processing and “text mining” is becoming essential in light of the explosion of text and other unstructured data in health-care systems, social networks, and other forums. Knowledge about markup languages like XML and its derivatives is also essential, as content becomes tagged and hence able to be interpreted automatically by computers.

Data scientists’ knowledge about machine learning must build on more basic skills that fall into three broad classes. The first is statistics, especially Bayesian statistics, which requires a working knowledge of probability, distributions, hypothesis testing, and multivariate analysis. It can be acquired in a two- or three-course sequence. Multivariate analysis often overlaps with econometrics, which is concerned with fitting robust statistical models to economic data. Unlike machine learning methods, which make no or few assumptions about the functional form of relationships among variables, multivariate analysis and econometrics by and large focus on estimating parameters of linear models where the relationship between the dependent and independent variables is expressed as a linear equality.

The second class of skills comes from computer science and pertains to how data is internally represented and manipulated by computers. This involves a sequence of courses on data structures, algorithms, and systems, including distributed computing, databases, parallel computing, and fault-tolerant computing. Together with scripting languages (such as Python and Perl), systems skills are the fundamental building blocks required for dealing with reasonable-size datasets. For handling very large datasets, however, standard database systems built on the relational data model have severe limitations. The recent move toward cloud computing and nonrelational structures for dealing with enormous datasets in a robust manner signals a new set of required skills for data scientists.

The third class of skills requires knowledge about correlation and causation and is at the heart of virtually any modeling exercise involving data. While observational data generally limits us to correlations, we can get lucky. Sometimes plentiful data might represent natural randomized trials and the possibility of calculating conditional probabilities reliably, enabling discovery of causal structure.22 Building causal models is desirable in domains where one has reasonable confidence as to the completeness of the formulated model and its stability, that is, whether the causal model “generating” the observed data is stable. At the very least, a data scientist should have a clear idea of the distinction between correlation and causality and the ability to assess which models are feasible, desirable, and practical in different settings.
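The natural-randomized-trial point can be made concrete: when assignment to a treatment is effectively random, conditional probabilities estimated from simple counts begin to support a causal reading. The counts below are hypothetical.

```python
# Conditional probabilities from observational counts. If treatment
# assignment was effectively random (a "natural experiment"), the risk
# difference can be read causally. Counts are hypothetical.
from collections import Counter

# (treated, complication) observations
obs = [(True, True)] * 30 + [(True, False)] * 70 + \
      [(False, True)] * 10 + [(False, False)] * 90

counts = Counter(obs)
p_comp_given_treated = counts[(True, True)] / (
    counts[(True, True)] + counts[(True, False)])
p_comp_given_untreated = counts[(False, True)] / (
    counts[(False, True)] + counts[(False, False)])
effect = p_comp_given_treated - p_comp_given_untreated  # risk difference
```

Without (near-)random assignment, the same arithmetic yields only a correlation, which is exactly the distinction the text draws.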
The final skill set is the least standardized, somewhat elusive, and to some extent a craft, but it is also a key differentiator for an effective data scientist: the ability to formulate problems in a way that results in effective solutions. Herbert Simon, the 20th-century American economist and a founding figure of artificial intelligence, demonstrated that many seemingly different problems are often “isomorphic,” or have the identical underlying structure. He showed that many recursive problems could be expressed as the standard Towers of Hanoi problem, involving identical initial and goal states and operators. His larger point was that it is easy to solve seemingly difficult problems if they are represented creatively, with isomorphism in mind.28
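The Towers of Hanoi structure Simon used as a canonical problem can be written down directly; this is the standard recursive solution, with peg names chosen arbitrarily.

```python
# Towers of Hanoi: initial state, goal state, and one operator (move a
# disk). Any problem isomorphic to it yields to the same recursion;
# the move count for n disks is 2**n - 1.
def hanoi(n, source, target, spare, moves):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the way
    moves.append((source, target))              # move largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack on top of it

moves = []
hanoi(3, "A", "C", "B", moves)  # produces the 7 moves for 3 disks
```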
In a broader sense, formulation expertise involves the ability to see commonalities across very different problems; for example, many problems have “unbalanced target classes,” usually denoting that the dependent variable is interesting only sometimes (such as when people develop diabetes complications or respond to marketing offers or promotions). These are the cases of interest we would like to predict. Such problems are a challenge for models that, in Popperian terms, must go out on a limb to make predictions that are likely to be wrong unless the model is extremely good at discriminating among the classes. Experienced data scientists are familiar with these problems and know how to formulate them in a way that gives a system a chance to make correct predictions under conditions where the priors are stacked heavily against it.

Problem-formulation skills represent core skills for data scientists over the next decade. The term “computational thinking,” coined by Papert21 and elaborated by Wing,32 is similar in spirit to the skills described here. There is considerable activity in universities to train students in problem-formulation skills and provide electives structured around the core that are more suited to specific disciplines.

The data science revolution also poses serious organizational challenges as to how organizations manage their data scientists. Besides recognizing and nurturing the appropriate skill sets, it requires a shift in managers’ mind-sets toward data-driven decision making to replace or augment intuition and past practices. A quote famously attributed to the 20th-century American statistician W. Edwards Deming—“In God we trust, everyone else please bring data”—has come to characterize the new orientation, from intuition-based decision making to fact-based decision making.

From a decision-making standpoint, we are moving into an era of big data where for many types of problems computers are inherently better decision makers than humans, where “better” could be defined in terms of cost, accuracy, and scalability. This shift has already happened in the world of data-intensive finance, where computers make the majority of investment decisions, often in fractions of a second, as new information becomes available. The same holds in areas of online advertising, where millions of auctions are conducted in milliseconds every day, air traffic control, routing of package delivery, and many types of planning tasks that require scale, speed, and accuracy simultaneously, a trend likely to accelerate in the next few years.

Knowledge Discovery

Chris Anderson,1 former editor of Wired magazine, drew on the quote by British-born statistician George Box that “All models are wrong, but some are useful,” arguing that, with the huge amounts of data available today, we do not need to settle for wrong models or any models at all. Anderson said prediction is of paramount importance to businesses, and data can be used to let such models emerge through machine learning algorithms, largely unaided by humans, pointing to companies like Google as symbolizing the triumph of machine learning over top-down theory development. Google’s language translator does not “understand” language, nor do its algorithms know the contents of webpages. IBM’s Watson does not “understand” the questions it is asked or use deep causal knowledge to generate questions to the answers it is given. There are dozens of lesser-known companies that likewise are able to predict the odds of someone responding to a display ad without a solid theory, but rather based on gobs of data about the behavior of individuals and the similarities and differences in that behavior.

Anderson’s 2008 article launched a vigorous debate in academic circles. How can one have science and predictive models without first articulating a theory?

The observation by Dhar and Chou5 that “patterns emerge before reasons for them become apparent” tends to resonate universally among professionals, particularly in financial markets, marketing, health care, and fields that study human behavior.
Figure 4. Sources of error in predictive models and their mitigation.
1. Misspecification of the model (big data admits a larger space of functional forms).
2. Using a sample to estimate the model (with big data, the sample is a good estimate of the population).
3. Randomness (predictive modeling attempts to minimize the combination of the first two errors).

If this is true, Box’s observation becomes relevant: if a problem is nonstationary and a model is only an approximation anyway, why not build the best predictive model based on data available until that time and just update it periodically? Why bother developing a detailed causal model if it is poor at prediction and, more important, likely to get worse over time due to “concept drift”?
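The concept-drift point can be illustrated with a synthetic drifting series: a model fit once on early data degrades, while periodic refitting on a recent window tracks the drift. The series and window sizes here are arbitrary.

```python
# "Concept drift": the data-generating process shifts over time, so a
# model fit once degrades, while periodic refitting on a recent window
# keeps up. The drifting series is synthetic.
series = [0.01 * t * t for t in range(200)]  # slowly accelerating process

def window_mean(xs):
    return sum(xs) / len(xs)

# Fit once on early data, then score on the late part of the series.
static_pred = window_mean(series[:50])
static_err = sum(abs(v - static_pred) for v in series[150:]) / 50

# Refit on the most recent 10 points before each prediction.
rolling_errs = []
for t in range(150, 200):
    refit = window_mean(series[t - 10:t])
    rolling_errs.append(abs(series[t] - refit))
rolling_err = sum(rolling_errs) / 50
```

The periodically refit model is not “right” in any causal sense; it simply stays a better local approximation, which is exactly the pragmatic argument above.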
Some scientists would say there is no theory without causality, that all observational data, except total chaos, must be generated from a causal model. In the earlier health-care example involving medical complications in patients with Type 2 diabetes, a model of the process generating the data is subject to three sources of error (see Figure 4). The first is misspecification of the model; a linear model fit to a nonlinear phenomenon could generate an error simply because the linear model imposes an inappropriate bias on the problem. The second is the samples used for estimating parameters; the smaller the samples, the greater the bias in the model’s estimates. And the third is randomness, even when the model is specified perfectly.

Big data allows data scientists to significantly reduce the first two types of error (see Figure 4). Large amounts of data allow us to consider models that make fewer assumptions about functional form than linear or logistic regressions, simply because there is a lot more data to test such models and compute reliable error bounds.27 Big data also eliminates the second type of error, as sample estimates become reasonable proxies for the population.

The theoretical limitation of observational data of the sort in these examples, regardless of how big it is, is that the data is generally “passive,” representing what actually happened in contrast to the multitude of things that could have happened had circumstances been different. In health care, it is like having observed the use of the health-care system passively and now having the chance to understand it in retrospect and extract predictive patterns from it. Unless we are fortunate enough that the data provided us the right experiments naturally, it does not tell us what could have happened if some other treatment had been administered to a specific patient or to an identical patient; that is, it does not represent a clean controlled randomized experiment where the researcher is able to establish controls and measure the differential effect of treatments on matched pairs.
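At web scale, the controlled randomized experiment just described reduces to randomly assigning users to variants and comparing outcome rates. Everything below (group assignment, conversion behavior, rates) is simulated.

```python
# A minimal randomized experiment: users are randomly assigned to
# variant A or B and an outcome rate is compared across groups.
# Assignment and behavior are simulated; the rates are hypothetical.
import random

random.seed(7)
users = range(10_000)
group = {u: random.choice("AB") for u in users}  # random assignment

def converted(u):
    # Simulated behavior: variant B converts near 12%, variant A near 10%.
    return random.random() < (0.12 if group[u] == "B" else 0.10)

outcomes = {"A": [], "B": []}
for u in users:
    outcomes[group[u]].append(converted(u))

rate = {g: sum(o) / len(o) for g, o in outcomes.items()}
lift = rate["B"] - rate["A"]  # estimated effect of the variant
```

Because assignment is random, the difference in rates estimates a causal effect, unlike the passive observational comparisons discussed above; a real analysis would also attach a significance test to the lift.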
Interestingly, however, the Internet era is fertile ground for conducting inexpensive large-scale randomized experiments on social behavior; Kohavi et al.15 provide a number of examples. A 2012 controlled experiment by Aral and Walker2 on the adoption of video games, asking whether it was “influence” or “homophily” that affected choice, uncovered profiles of people who are influential and susceptible. Results include patterns such as “older men are more influential than younger men” and “people of the same age group have more influence on each other than people from other age groups.” While specific to games, these results suggest influence is nuanced, certainly more so than existing theories like Malcolm Gladwell’s concept of “super influencers”8 and myriad other popular theories. Big data provides a basis for testing them.

One of the most far-reaching modern applications of big data is in politics, as exemplified by the Democratic National Committee’s heavy investment in data and analytics prior to President Barack Obama’s winning 2012 campaign, debunking widely held beliefs (such as the belief that voters in the “middle” are most critical to outcomes, when in fact issues that resonate with some segments of solidly partisan voters can sway them14). In the campaign, the DNC crafted predictive models on the basis of results from large-scale experiments used to manipulate attitudes. The campaign predicted at the level of individual voters how each eligible voter would vote, as well as how to “turn someone into the type of person it wanted you to be.”14

Social science theory building is also likely to get a good boost from big data and machine learning. Never before have social scientists been able to observe human behavior at the degree of granularity and variability seen today, with increasing amounts of human interaction and economic activity mediated by the Internet. While the inductive method has limitations, the sheer volume of data being generated makes induction not only feasible but productive. That is not to say the traditional scientific method is “dead,” as claimed by Anderson.1 On the contrary, it continues to serve us well. However, a powerful new method is available for theory development, one not previously practical due to the paucity of data. That era of limited data and its associated assumptions is largely over.

Conclusion

Hypothesis-driven research and approaches to theory development have served us well. But a lot of data is emanating around us where these traditional approaches to identifying structure do not scale well or take advantage of observations that would not occur under controlled circumstances; for example, in health care, controlled experiments have helped identify many causes of disease but may not reflect the actual complexities of health.3,18 Indeed, some estimates claim clinical trials exclude as much as 80% of the situations in which a drug might be prescribed, as when a patient is on multiple medications.3 In situations where we are able to design randomized trials, big data makes it feasible to uncover the causal models generating the data.

As shown earlier in the diabetes-related health-care example, big data makes it feasible for a machine to ask and validate interesting questions humans might not consider. This capability is indeed the foundation for building predictive modeling, which is key to actionable business decision making. For many data-starved areas of inquiry, especially health care and the social, ecological, and earth sciences, data provides an unprecedented opportunity for knowledge discovery and theory development. Never before have these areas had data of the variety and scale available today.

This emerging landscape calls for the integrative skill set identified here as essential for emerging data scientists. Academic programs in computer science, engineering, and business management teach a subset of these skills but have yet to teach the integration of skills needed to function as a data scientist or to manage data scientists productively. Universities are scrambling to address the lacunae and provide a more integrated skill set covering basic skills in computer science, statistics, causal modeling, problem isomorphs and formulation, and computational thinking.

Predictive modeling and machine learning are increasingly central to the business models of Internet-based data-driven businesses. An early success, PayPal, was able to capture and dominate consumer-to-consumer payments due to its ability to predict the distribution of losses for each transaction and act accordingly. This data-driven ability was in sharp contrast to the prevailing practice of treating transactions identically from a risk standpoint. Predictive modeling is also at the heart of Google’s search engine and several other products. But the first machine that could arguably be considered to pass the Turing test and create new insights in the course of problem solving is IBM’s Watson, which makes extensive use of learning and prediction in its problem-solving process. In a game like “Jeopardy!,” where understanding the question itself is often nontrivial and the domain open-ended and nonstationary, it is not practical to be successful through an extensive enumeration of possibilities or top-down theory building. The solution is to endow a computer with the ability to train itself automatically based on large numbers of examples. Watson also demonstrated that the power of machine learning is greatly amplified through the availability of high-quality human-curated data, as in Wikipedia. This trend—combining human knowledge with machine learning—also appears to be on the rise. Google’s recent foray into the Knowledge Graph16 is intended to enable the system to understand the entities corresponding to the torrent of strings it processes continuously. Google wants to understand “things,” not just “strings.”26

Organizations and managers face significant challenges in adapting to the new world of data. It is suddenly possible to test many of their established intuitions, experiment cheaply and accurately, and base decisions on data. This opportunity requires a fundamental shift in organizational culture, one seen in organizations that have embraced the emerging world of data for decision making.

References
1. Anderson, C. The end of theory: The data deluge makes the scientific method obsolete. Wired 16, 7 (June 23, 2008).
2. Aral, S. and Walker, D. Identifying influential and susceptible members of social networks. Science 337, 6092 (June 21, 2012).
3. Buchan, I., Winn, J., and Bishop, C. A unified modeling approach to data-intensive healthcare. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
4. Dhar, V. Prediction in financial markets: The case for small disjuncts. ACM Transactions on Intelligent Systems and Technology 2, 3 (Apr. 2011).
5. Dhar, V. and Chou, D. A comparison of nonlinear models for financial prediction. IEEE Transactions on Neural Networks 12, 4 (June 2001), 907–921.
6. Dhar, V. and Stein, R. Seven Methods for Transforming Corporate Data Into Business Intelligence. Prentice-Hall, Englewood Cliffs, NJ, 1997.
7. Frawley, W. and Piatetsky-Shapiro, G., Eds. Knowledge Discovery in Databases. AAAI/MIT Press, Cambridge, MA, 1991.
8. Gladwell, M. The Tipping Point: How Little Things Can Make a Big Difference. Little, Brown, New York, 2000.
9. Goel, S., Watts, D., and Goldstein, D. The structure of online diffusion networks. In Proceedings of the 13th ACM Conference on Electronic Commerce (2012), 623–638.
10. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2009.
11. Heilbron, J.L., Ed. The Oxford Companion to the History of Modern Science. Oxford University Press, New York, 2003.
12. Hey, T., Tansley, S., and Tolle, K., Eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
13. Hunt, J., Baldochi, D., and van Ingen, C. Redefining ecological science using data. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA, 2009.
14. Issenberg, S. A more perfect union: How President Obama’s campaign used big data to rally individual voters. MIT Technology Review (Dec. 2012).
15. Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
16. Lin, T., Patrick, P., Gamon, M., Kannan, A., and Fuxman, A. Active objects: Actions for entity-centric search. In Proceedings of the 21st International Conference on the World Wide Web (Lyon, France). ACM Press, New York, 2012.
17. Linoff, G. and Berry, M. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, Inc., New York, 1997.
18. Maguire, J. and Dhar, V. Comparative effectiveness for oral anti-diabetic treatments among newly diagnosed Type 2 diabetics: Data-driven predictive analytics in healthcare. Health Systems 2 (2013), 73–92.
19. McKinsey Global Institute. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical report, June 2011.
20. Meinshausen, N. Relaxed lasso. Computational Statistics & Data Analysis 52, 1 (Sept. 15, 2007), 374–393.
21. Papert, S. An exploration in the space of mathematics educations. International Journal of Computers for Mathematical Learning 1, 1 (1996), 95–123.
22. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, U.K., 2000.
23. Perlich, C., Provost, F., and Simonoff, J. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research 4, 12 (2003), 211–255.
24. Popper, K. Conjectures and Refutations. Routledge, London, 1963.
25. Provost, F. and Fawcett, T. Data Science for Business. O’Reilly Media, New York, 2013.
26. Roush, W. Google gets a second brain, changing everything about search. Xconomy (Dec. 12, 2012); http://www.xconomy.com/san-francisco/2012/12/12/google-gets-a-second-brain-changing-everything-about-search/?single_page=true
27. Shmueli, G. To explain or to predict? Statistical Science 25, 3 (Aug. 2010), 289–310.
28. Simon, H.A. and Hayes, J.R. The understanding process: Problem isomorphs. Cognitive Psychology 8, 2 (Apr. 1976), 165–190.
29. Sloman, S. Causal Models. Oxford University Press, Oxford, U.K., 2005.
30. Spirtes, P., Scheines, R., and Glymour, C. Causation, Prediction, and Search. Springer, New York, 1993.
31. Tukey, J.W. Exploratory Data Analysis. Addison-Wesley, Boston, 1977.
32. Wing, J. Computational thinking. Commun. ACM 49, 3 (Mar. 2006), 33–35.

Vasant Dhar (vdhar@stern.nyu.edu) is a professor and co-director of the Center for Business Analytics at the Stern School of Business at New York University, New York.

Copyright held by Owner/Author(s). Publication rights licensed to ACM.