
What is Data Science?

The future belongs to the companies and people that turn data into products

An O'Reilly Radar Report
By Mike Loukides
Contents

Where data comes from
Working with data at scale
Making data tell its story
Data scientists

© 2010 O'Reilly Media, Inc. The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. All other trademarks are the property of their respective owners.
What is Data Science?

The Web is full of data-driven apps. Almost any e-commerce application is a data-driven application. There's a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn't really what we mean by "data science." A data application acquires its value from the data itself, and creates more data as a result. It's not just an application with data; it's a data product. Data science enables the creation of data products.

One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track titles, artists, album titles). If you've ever used iTunes to rip a CD, you've taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that's not in the database (including a CD you've made yourself), you can create an entry for an unknown album. While this sounds simple enough, it's revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be "data products"). CDDB arises entirely from viewing a musical problem as a data problem.

Google is a master at creating data products. Here are a few examples:

- Google's breakthrough was realizing that a search engine could use input other than the text on the page. Google's PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient of the company's success.

- Spell checking isn't a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They've built a dictionary of common misspellings, their corrections, and the contexts in which they occur.

- Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they've collected, and has been able to integrate voice search into their core search engine.

- During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
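The CDDB idea described above is easy to see in miniature: turn the track lengths into a signature, and use that signature as a database key. Here is a toy sketch in Python; the real CDDB/FreeDB disc-ID algorithm is more involved, and the track lengths and album names below are invented for illustration.

```python
# Sketch of a CDDB-style lookup: identify an album from its track lengths.
# The actual disc-ID algorithm differs; this only shows the idea of turning
# audio metadata into a database key. All data here is invented.

def disc_signature(track_lengths_sec):
    """Build a simple signature from per-track lengths (in seconds)."""
    return (len(track_lengths_sec), tuple(track_lengths_sec))

# A toy "database" mapping signatures to album metadata.
ALBUMS = {
    disc_signature([125, 212, 341, 198]): {
        "artist": "Example Artist",
        "album": "Example Album",
        "tracks": ["Intro", "Second Song", "Long Jam", "Outro"],
    },
}

def lookup(track_lengths_sec):
    """Return album metadata for a disc, or None if it's unknown."""
    return ALBUMS.get(disc_signature(track_lengths_sec))

print(lookup([125, 212, 341, 198])["album"])  # Example Album
print(lookup([300, 300]))                     # None: an unknown disc
```

An unknown disc returns None, which is exactly the case where CDDB lets a user create a new entry.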


Flu trends

Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control by analyzing searches that people were making in different regions of the country.

Google isn't the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are "data products" that help to drive Amazon's more traditional retail business. They come about because Amazon understands that a book isn't just a book, a camera isn't just a camera, and a customer isn't just a customer; customers generate a trail of "data exhaust" that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers' behavior, the data they leave every time they visit the site.

The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.

In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it. And it's not just companies using their own data, or the data contributed by their users. It's increasingly common to mash up data from a number of sources. Data Mashups in R analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff's office, extracting addresses and using Yahoo! to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and group them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.

The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively: not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business

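The foreclosure mashup described above follows a pattern worth spelling out: extract addresses from a scraped report, geocode them, then group by neighborhood. A schematic sketch follows; the geocoder here is a lookup-table stub standing in for a real service such as Yahoo!'s geocoding API, and the addresses, coordinates, and neighborhoods are invented.

```python
from collections import Counter

# Stub geocoder standing in for a real geocoding service. It maps an
# address to (latitude, longitude, neighborhood). Data is invented.
FAKE_GEOCODER = {
    "100 Main St": (39.95, -75.16, "Center City"),
    "200 Oak Ave": (40.00, -75.20, "Germantown"),
    "300 Oak Ave": (40.00, -75.21, "Germantown"),
}

def geocode(address):
    return FAKE_GEOCODER.get(address)

def foreclosures_by_neighborhood(addresses):
    """Group geocoded foreclosure addresses by neighborhood."""
    counts = Counter()
    for addr in addresses:
        hit = geocode(addr)
        if hit is not None:  # skip addresses the geocoder can't resolve
            counts[hit[2]] += 1
    return dict(counts)

report = ["100 Main St", "200 Oak Ave", "300 Oak Ave", "1 Unknown Rd"]
print(foreclosures_by_neighborhood(report))
# {'Center City': 1, 'Germantown': 2}
```

Swapping the stub for calls to a live geocoding service turns this skeleton into the mashup the text describes.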

suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

To get a sense for what skills are required, let's look at the data life cycle: where it comes from, how you use it, and where it goes.

Where data comes from

Data is everywhere: your government, your web server, your business partners, even your body. While we aren't drowning in a sea of data, we're finding that almost everything can be (or has been) instrumented. At O'Reilly, we frequently combine publishing industry data from Nielsen BookScan with our own sales data, publicly available Amazon data, and even job data to see what's happening in the publishing industry. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics ranging from endocrinologists to hiking trails.

Much of the data we currently work with is the direct consequence of Web 2.0, and of Moore's Law applied to data. The Web has people spending more time online, and leaving a trail of data wherever they go. Mobile applications leave an even richer data trail, since many of them are annotated with geolocation, or involve video or audio, all of which can be mined. Point-of-sale devices and frequent-shopper cards make it possible to capture all of your retail transactions, not just the ones you make online. All of this data would be useless if we couldn't store it, and that's where Moore's Law comes in. Since the early '80s, processor speed has increased from 10 MHz to 3.6 GHz, an increase of 360x (not counting increases in word length and number of cores). But we've seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB, a price reduction of about 40,000x, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.

One of the first commercial disk drives from IBM. It has a 5 MB capacity and is stored in a cabinet roughly the size of a luxury refrigerator. In contrast, a 32 GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram. (Photo: Mike Loukides. Disk drive on display at IBM Almaden Research.)
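The "about 40,000x" price reduction quoted above is worth checking. Using decimal units (1 GB = 1,000 MB), the arithmetic is:

```python
# Price per megabyte of RAM, then and now, from the figures in the text.
old_price_per_mb = 1000.0        # $1,000/MB in the early '80s
new_price_per_mb = 25.0 / 1000   # $25/GB, i.e. $0.025/MB

reduction = old_price_per_mb / new_price_per_mb
print(round(reduction))  # 40000, the "about 40,000" in the text
```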

The importance of Moore's Law as applied to data isn't just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. The data exhaust you leave behind whenever you surf the Web, "friend" someone on Facebook, or make a purchase in your local supermarket, is all carefully collected and analyzed. Increased storage capacity demands increased sophistication in the analysis and use of that data. That's the foundation of data science.

So, how do we make that data useful? The first step of any data analysis project is "data conditioning," or getting data into a state where it's usable. We are seeing more data in formats that are easier to consume: Atom data feeds, web services, microformats, and other newer technologies provide data in formats that are directly machine-consumable. But old-style screen scraping hasn't died, and isn't going to die. Many sources of "wild data" are extremely messy. They aren't well-behaved XML files with all the metadata nicely in place. The foreclosure data used in Data Mashups in R was posted on a public website by the Philadelphia county sheriff's office. This data was presented as an HTML file that was probably generated automatically from a spreadsheet. If you've ever seen the HTML that's generated by Excel, you know that's going to be fun to process.

Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work. You're likely to be dealing with an array of data sources, all in different forms. It would be nice if there were a standard set of tools to do the job, but there isn't. To do data conditioning, you have to be ready for whatever comes, and be willing to use anything from ancient Unix utilities such as awk to XML parsers and machine learning libraries. Scripting languages, such as Perl and Python, are essential.

Once you've parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn't always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It's reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low.[1] In data science, what you have is frequently all you're going to get. It's usually impossible to get "better" data, and you have no alternative but to work with the data at hand.

If the problem involves human language, understanding the data adds another dimension to the problem. Roger Magoulas, who runs the data analysis group at O'Reilly, was recently searching a database for Apple job listings requiring geolocation skills. While that sounds like a simple task, the trick was disambiguating "Apple" from the many job postings in the growing Apple industry. To do it well you need to understand the grammatical structure of a job posting; you need to be able to parse the English. And that problem is showing up more and more frequently. Try using Google Trends to figure out what's happening with the Cassandra database or the Python language, and you'll get a sense of the problem: Google has indexed many, many websites about large snakes. Disambiguation is never an easy task, but tools like the Natural Language Toolkit library can make it simpler.

When natural language processing fails, you can replace artificial intelligence with human intelligence. That's where services like Amazon's Mechanical Turk come in. If you can split your task up into a large number of subtasks that are easily described, you can use Mechanical Turk's marketplace for cheap labor. For example, if you're looking at job listings, and want to know which originated with Apple, you can have real people do the classification for roughly $0.01 each. If you have already reduced the set to 10,000 postings with the word "Apple," paying humans $0.01 to classify each of them only costs $100.
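To make "data conditioning" concrete: Beautiful Soup is the usual tool for cleaning up machine-generated HTML like the sheriff's-office file, but the same idea can be sketched with only Python's standard-library HTML parser. The markup below is a made-up stand-in for that kind of spreadsheet-exported table.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# Messy, machine-generated HTML of the kind Excel produces (invented).
html = """<table>
<tr><td>100 Main St</td><td>$125,000</td></tr>
<tr><td>200 Oak Ave</td><td>$90,000</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)
# [['100 Main St', '$125,000'], ['200 Oak Ave', '$90,000']]
```

Real wild data is far messier than this, but the shape of the job is the same: turn whatever arrives into rows and columns you can actually analyze.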

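The "Apple" problem above is a disambiguation problem: the word only resolves in context. A real solution needs natural language processing (the Natural Language Toolkit is one starting point), but even a crude context-word heuristic shows the shape of the task. The context-word lists and postings below are invented for illustration.

```python
# Crude, rule-based disambiguation: decide whether a job posting's "Apple"
# means Apple Inc. by looking at surrounding context words. A real system
# would parse the posting's grammar; this only illustrates the problem.

COMPANY_CONTEXT = {"cupertino", "ios", "iphone", "macos", "itunes"}
INDUSTRY_CONTEXT = {"orchard", "harvest", "cider", "grower"}

def mentions_apple_inc(posting):
    words = {w.strip(".,").lower() for w in posting.split()}
    if "apple" not in words:
        return False
    score = len(words & COMPANY_CONTEXT) - len(words & INDUSTRY_CONTEXT)
    return score > 0

print(mentions_apple_inc("Apple seeks iOS engineer in Cupertino"))  # True
print(mentions_apple_inc("Apple orchard seeks harvest manager"))    # False
```

Postings that score zero either way are exactly the ambiguous residue you might hand to Mechanical Turk workers at a penny apiece.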

Working with data at scale

We've all heard a lot about "big data," but "big" is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today's "big" is certainly tomorrow's "medium" and next week's "small." The most meaningful definition I've heard: "big data" is when the size of the data itself becomes part of the problem. We're discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

What are we trying to do with data that's different? According to Jeff Hammerbacher[2] (@hackingdata), we're trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with the reality of multiple, unstructured data sources, in which you may not know what's important until after you've analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it's not really necessary for the kind of analysis we're discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.

To store huge datasets effectively, we've seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren't. Many of these databases are the logical descendants of Google's BigTable and Amazon's Dynamo, and are designed to be distributed across many nodes, to provide "eventual consistency" but not absolute consistency, and to have very flexible schemas. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

- Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.

- HBase: Part of the Apache Hadoop project, and modelled on Google's BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the "map" stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single "reduce" task. In hindsight, MapReduce seems like an obvious solution to Google's biggest problem, creating large searches. It's easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What's less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.
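The map/reduce division of labor is easy to see in miniature. Below is a single-process sketch of the canonical word-count job; in a real Hadoop cluster, each document's map task would run on a different node, and the framework would shuffle the intermediate pairs to reducers by key rather than collecting them in one list.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["big data big problems", "big clusters"]

# On a cluster, these map tasks run in parallel on separate nodes.
intermediate = []
for doc in documents:
    intermediate.extend(map_phase(doc))

print(reduce_phase(intermediate))
# {'big': 3, 'data': 1, 'problems': 1, 'clusters': 1}
```

Nothing in the map or reduce functions knows how many machines are involved, which is exactly why the pattern scales from one laptop to thousands of nodes.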


The most popular open source implementation of MapReduce is the Hadoop project. Yahoo!'s claim that they had built the world's largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon's Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it's the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop has been instrumental in enabling "agile" data analysis. In software development, "agile practices" are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on long datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It's easier to consult with clients to figure out whether you're asking the right questions, and it's possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time.

Hadoop is essentially a batch system, but the Hadoop Online Prototype (HOP) is an experimental project that enables stream processing: it processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require "soft real-time"; reports on trending topics don't require millisecond accuracy. As with the number of followers on Twitter, a "trending topics" report only needs to be current to within five minutes, or even an hour. According to Hilary Mason (@hmason), data scientist at bit.ly, it's possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.

Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don't have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell), and even face detection; an ill-advised mobile application lets you take someone's picture with a cell phone, and look up that person's identity using photos available online. Andrew Ng's Machine Learning course is one of the most popular courses in computer science at Stanford, with hundreds of students.

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de facto standard.

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a "training set," or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you've collected your training data (perhaps a large collection of public photos from Twitter), you can have humans classify them inexpensively: possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It's an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.
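A training set of the kind the Turk produces is just labeled examples, and once you have one, even a very simple learner can classify new data. Here is a toy one-nearest-neighbor classifier over invented two-feature points; real recommendation and vision systems are vastly more sophisticated, but the shape (labeled training data in, predicted labels out) is the same.

```python
import math

# Toy training set: (feature vector, label). In practice the labels would
# come from human classification, e.g. via Mechanical Turk. Data invented.
TRAINING = [
    ((1.0, 1.0), "face"),
    ((1.2, 0.8), "face"),
    ((5.0, 5.0), "car"),
    ((5.5, 4.5), "car"),
]

def classify(point):
    """1-nearest-neighbor: copy the label of the closest training example."""
    def dist(example):
        (x, y), _label = example
        return math.hypot(point[0] - x, point[1] - y)
    return min(TRAINING, key=dist)[1]

print(classify((1.1, 0.9)))  # face
print(classify((4.8, 5.2)))  # car
```

The quality of the classifier is bounded by the quality of the training set, which is exactly why cheap, human-labeled data is so valuable.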


While I haven't stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the "grammar of data science." It is crucial to "making data speak coherently." We've all heard the joke that eating pickles causes death, because everyone who dies has eaten pickles. That joke doesn't work if you understand what correlation means. More to the point, it's easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn't just about the existence of data, or making guesses about what that data might mean; it's about testing hypotheses and making sure that the conclusions you're drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google's ad auctions work. Statistics has become a basic skill. It isn't superseded by newer techniques from machine learning and other disciplines; it complements them.

While there are many commercial statistical packages, the open source R language, with its comprehensive package library, CRAN, is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing "one-stop shopping" for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into distributed computing. If there's a single tool that provides an end-to-end solution for statistics work, R is it.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte's Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that's not really what concerns us here. Visualization is crucial to each stage of the data scientist's work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you've gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Casey Reas and Ben Fry's Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM's Many Eyes, many of the visualizations are full-fledged interactive applications.

Nathan Yau's FlowingData blog is a great place to look for creative visualizations. One of my favorites is the animation of the growth of Walmart over time (on FlowingData, at .../growth-of-walmart-now-with-100-more-sams-club/). And this is one place where "art" comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn't just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That's not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It's the kind of question we now ask routinely.
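The "2 percent more conversions" example above is exactly where a significance test earns its keep. Here is a sketch of a two-proportion z-test using only the standard library; the conversion counts are invented, and chosen to show that a 2-point difference on samples this size is suggestive but not conclusive.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Ad A: 50 conversions in 1,000 views (5%); Ad B: 70 in 1,000 (7%).
z, p = two_proportion_z(50, 1000, 70, 1000)
print(round(z, 2), round(p, 3))
```

With these numbers, z is about 1.88 and the p-value is about 0.06: the difference looks real, but it does not clear the conventional 5 percent significance bar, which is precisely the distinction the text says statistics exists to draw.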


Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

"On any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization."[3]

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members' profiles and made recommendations accordingly, asking things like: did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn's data scientists started looking at events that members attended, then at books members had in their libraries. The result was a valuable data product that analyzed a huge database, but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.

This is the heart of what Patil calls "data jiujitsu": using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable; see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.

Entrepreneurship is another piece of the puzzle. Patil's first, flippant answer to "what kind of person are you looking for when you hire a data scientist?" was "someone you would start a company with." That's an important insight: we're entering the era of products that are built on data. We don't yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they're all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they're entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can


Hiring trends for data science

It's not easy to get a handle on jobs in data science. However, data from O'Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the data science market as a whole. This graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.

think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"

The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it's mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian's quote that nobody remembers says it all:

"The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that's going to be a hugely important skill in the next decades."

Data is indeed the new "Intel Inside."

[1] The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the '70s) were real. Whether humans or software decided to ignore anomalous data, it appears that data was ignored.
[2] "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data).
[3] "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data).


O'Reilly Media Data Science Resources

Smart companies are betting on data-driven insight to understand customer behavior, create better products, and gain true competitive advantage in the marketplace. Strata is a new conference focusing on the people, tools, and technologies putting data to work. Happening February 1-3, 2011 in Santa Clara, CA, Strata will bring together decision makers, managers, and data practitioners for three days of training, sessions, discussions, events, and exhibits showcasing the new data ecosystem.

For details, visit the Strata conference site. Use discount code str11clpdf when you register and save 20%.

Data Analysis with Open Source Tools: This book shows you how to think about data and the results you want to achieve with it.

Beautiful Data: Learn from the best data practitioners in the field about how wide-ranging, and beautiful, working with data can be.

Programming Collective Intelligence: Learn how to build web applications that mine the data created by people on the Internet.

Beautiful Visualization: This book demonstrates why visualizations are beautiful not only for their aesthetic design, but also for elegant layers of detail.

R in a Nutshell: A quick and practical reference to learn what is becoming the standard for developing statistical software.

Head First Statistics: This book teaches statistics through puzzles, stories, visual aids, and real-world examples.

Statistics in a Nutshell: An introduction and reference for anyone with no previous background in statistics.

Head First Data Analysis: Learn how to collect your data, sort the distractions from the truth, and find meaningful patterns.