
Data Science with R

Foundations of Data Science


1. Outline of the chapter
   a. Definitions
      - An observation has a unit
         - Individuals are one type of unit
         - Groups are a high-level unit
      - A natural law applies to individual (low-level) observations
      - A statistical law applies to (high-level) groups of observations
      - A unit of analysis is a property of the question you are trying to answer
   b. How data reveals patterns
      - Patterns can be there even if you do not see them
      - You can use patterns to test hypotheses and make predictions
   c. Simple, right? But why do things go wrong so often? Data science is not magic; it relies on similarity
      - Your observations must be similar to the points you want to predict
      - Your group of points must be similar to the population at large
      - Because patterns are like statistics, they describe group behaviour
   d. Data scientists are not the first to try to learn about the world
      - Epistemologists have tread this ground before, and they proved that you cannot know with certainty
      - The problem of induction
      - But data science is the most pragmatic thing that you can do
O'Reilly publishes nine books on data science, and one of them is
named What is Data Science? When you open any of these books,
you should ask yourself what you are getting into. As a term, data
science has come to mean several things.
At one level, data science is a body of knowledge, a collection of
useful information related to a specific task. For example, library
science and managerial science are bodies of knowledge. Library
science collects the best ways to run a library, and managerial
science collects the best ways to run a business. Data science
collects the best ways to store, retrieve, and manage data. As a
result, a data scientist might know how to set up a Hadoop cluster
or run the latest type of non-relational database. This is probably
what most people think of when they think of data science, but
this is not the type of data science that I will teach you.
At another level, data science is a way of doing science. Data
scientists use data, models, and visualizations to make scientific
discoveries, just as other scientists use experiments. In fact, you
can think of data science as a method of science that complements
experimental science. Experimental scientists use the
experimental method to solve scientific problems, and data
scientists use the data science method. Many scientists use both.
This book will teach you the method of data science. You will learn
how to use data to make discoveries, and to justify those
discoveries once they are made. Along the way, you will learn how
to visualize data, build models, and make predictions.
In this chapter, you will learn the strategy behind data science:
data scientists search for evidence of natural laws in the structure
of data. They then judge the strength of the evidence that they
find, and are able to develop insights based on the laws they
discover. This strategy guides the techniques that you will learn in
later chapters: techniques like data wrangling, exploratory data
analysis (EDA), bootstrap sampling, and cross-validation.

The scientific worldview


As a method of science, data science is based on two simple ideas.
First, that the best way to learn about the world is to observe it.
And second, that the universe operates according to natural laws.
These ideas summarize the worldview shared by many scientists,
and they provide the vocabulary that will help us talk about data
science.
A natural law is a rule that describes a part of the natural world,
like E = Mc² or F = MA. Natural laws help scientists understand,
control, and make predictions about natural processes.
You can write down a natural law as a relationship between
variables. For example, E = Mc² is a natural law that states that
the energy content of a system (E) is always equal to the mass of
the system (M) multiplied by the speed of light squared (c²).
F = MA is a natural law that explains that a force (F) exerted
upon an object will cause the object to accelerate (A) at a rate
proportional to the mass of the object (M), an insight that has
many applications in the field of physics.
Natural laws deal with variables, values, and observations. We use
these terms in everyday speech, but they have a technical meaning
when associated with science.
A variable is a quantity, quality, or property that can be
measured.
A value is the apparent state of a variable when you measure
it. The value of a variable may change from measurement to
measurement.
An observation is a set of values that are measured on
multiple variables under similar (ideally identical)
conditions.
You can think of an observation as a snapshot of the world.
An observation shows what a group of variables looked like
together for a brief moment before they changed.
Natural laws deal with variables, but they operate on values that
appear in the same observation. For example, the law F = MA
states that when you measure the force, mass, and acceleration
associated with the same particle at the same time, you will observe
a trio of values such that

f1 = m1 × a1
f2 = m2 × a2
f3 = m3 × a3

and so on.
In the notation above, the lowercase letters denote specific values
of the variables F, M, and A. Throughout the book, I will refer to
variable names with a capital letter and individual values with a
lower case letter.
The subscripts denote which observation each of the values
belongs to. If a set of values belongs to the same observation, it
implies that the values were measured under similar conditions.
To see how this works, consider what the three observations above
may represent. The observations may have been taken at three
different times. For example, f1, m1, and a1 may have been
measured at time one, f2, m2, and a2 measured at time two, and
so on. Alternatively, the observations may describe three different
particles. For example, f1, m1, and a1 may have been measured on
one particle, f2, m2, and a2 may have been measured on a
second particle, and f3, m3, and a3 may have been measured on a
third particle.
Observations play a very important role in science. A natural law
implies that a relationship will exist between values of variables
that appear in the same observation. However, a natural law does
not imply that a relationship will exist between values in different
observations. You wouldn't think that the force you exert on one
object would equal the mass times the acceleration that you
measure on a different object. Or, in the notation above, you
wouldn't think that f1 should equal m2 times a2. You would
expect f1 to equal m1 times a1.
Natural laws provide a goal for science. Scientists attempt to
discover natural laws and thereby explain natural phenomena.
You can think of science as a collection of methods that use
observations to discover natural laws.
Data science is one of those methods. It uses a specific tool to
reveal natural laws, and that tool is data.
Data, or more precisely a data set, is a collection of values that
have been organized in a specific way: each value in a data set is
associated with a variable and an observation.
For example, you can use the values f1, f2, f3, m1, m2, m3, a1, a2,
and a3 to compose a data set, like the one below.
obs   F    M    A
1     f1   m1   a1
2     f2   m2   a2
3     f3   m3   a3

Now that you know the vocabulary of data science, let's look at the
method.

Patterns
Recall that a data set organizes values so that each value is
associated with a variable as well as an observation. This
organization makes data sets particularly useful for discovering
natural laws. If a natural law exists between the variables in a data
set, the law will appear as a pattern that reoccurs in each
observation. Or to put it more simply, a natural law will appear as
a pattern in data.
In our example data set, the relationship described by the law
F = M A will be present in each observation. As a result, the data
set will reveal what the natural law implies:
obs
1     f1 = m1 × a1
2     f2 = m2 × a2
3     f3 = m3 × a3
This is easy to verify if you measure the real forces, masses, and
accelerations associated with several dozen particles, like in the
data set below. Each row of values displays the relationship
F = MA.

obs   F      M      A
1     3.01   0.98   3.08
2     2.35   0.91   2.58
3     5.57   1.01   5.52
4     0.62   1.09   0.56
5     4.15   0.89   4.69
6     5.07   1.05   4.84
7     7.56   0.93   8.12
8     4.04   1.09   3.70
...   ...    ...    ...
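You can mimic a data set like this in R. The sketch below uses invented masses, accelerations, and a sample size of eight; only the law F = M × A itself comes from the text.

```r
# Simulate a small data set where F = M * A holds in every observation.
# The masses, accelerations, and sample size are invented for illustration.
set.seed(1)
m <- runif(8, min = 0.5, max = 1.5)   # masses
a <- runif(8, min = 0.5, max = 8.5)   # accelerations
f <- m * a                            # the law determines each force
particles <- data.frame(obs = 1:8, f = f, m = m, a = a)
particles                             # each row displays the relationship

# The pattern reoccurs in every observation:
all(particles$f == particles$m * particles$a)   # TRUE
```

Printing the data frame shows the same structure as the table above: one row per observation, one column per variable.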

A data set transforms a law into a pattern, which makes data a
very useful tool for science. The tool isn't perfect: patterns can be
very difficult to spot in raw data, but you can optimize how you
search for patterns.
First, you can transform your variables or compute summary
statistics in a way that would make patterns easier to spot. Data
scientists often transform their data, a process known as data
wrangling, to prepare for the steps that follow.
Second, you can visualize raw data to make patterns easier to spot.
Notice how the pattern between F, M, and A becomes obvious
when you visualize the data with a three-dimensional, or even a
two-dimensional, graph. The relationship F = MA appears as a
three-dimensional plane that each of the data points falls on. This
plane resembles a line when it is projected into a two-dimensional
graph.

Third, you can use a computer algorithm to search for patterns
within data, which is exactly what data scientists do when they use
statistical modeling or machine learning techniques.
Moreover, you can count on laws to appear as patterns in data
under a wide variety of conditions. Consider what would happen if
your data failed in some way, for example, if your measurements
were inaccurate, or if your data did not contain all of the variables
in a law.
If your values are contaminated by measurement errors, the errors
will add noise to your data. As long as the errors are relatively
small, laws will still emerge in your data as discernible, but noisy,
patterns.
You can see measurement errors at work in the graphs below. The
graph on the left displays two variables that are related by the law
Y = X. However, the measurements were made in a sloppy
fashion that resulted in inaccurate values. The graph on the right
displays the same data after the measurement errors have been
corrected. Notice that you can still perceive the underlying pattern
even when it has been contaminated by measurement errors.
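A minimal sketch of this idea in R, with an arbitrary sample size and error level: even after adding sloppy measurement error to data generated by the law Y = X, the linear pattern remains visible.

```r
# Sketch: the law Y = X observed with and without measurement error.
# The sample size and error standard deviation are arbitrary choices.
set.seed(2)
x       <- runif(100, 0, 10)
y_true  <- x                          # the law: Y = X, measured perfectly
y_noisy <- x + rnorm(100, sd = 1)     # sloppy, inaccurate measurements
cor(x, y_true)    # a perfect pattern
cor(x, y_noisy)   # noisy, but the pattern is still discernible
```

Plotting `x` against `y_noisy` gives a graph like the noisy one described above; the correlation stays close to one because the errors are small relative to the pattern.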

You will face a similar situation if your data contains some, but
not all, of the variables in a law. When this happens, a pattern will
still appear between the variables in your data set that are
connected by a law. The influence of the missing variables will
appear as noise in the pattern.
You can see this happen in the two-dimensional graph between F
and A. The graph ignores the M variable, as if it were not part of
the data set. As a result, the M variable adds noise to the linear
pattern between F and A, but the pattern is still discernible.
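You can sketch the missing-variable case the same way, again with invented numbers: generate data from F = M × A, then ignore M and look only at the pattern between F and A.

```r
# Sketch: omit M from data generated by F = M * A. The pattern between
# F and A survives; M's influence shows up as noise. Numbers are invented.
set.seed(3)
m <- runif(200, 0.8, 1.2)   # an influential but unrecorded variable
a <- runif(200, 0, 10)
f <- m * a
cor(f, a)   # strong but imperfect: M adds noise to the linear pattern
```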

Noise in your data is not a cause for defeat. As long as you capture
the most influential variables in a law, and do not let
measurement errors get so big that they swamp your data, you are
likely to find a pattern that will point to the law.

Correlations
So far we have considered how data will display F = MA, a
natural law that describes a causal relationship between physical
variables. But now we can see that data will also display patterns
that do not involve causal relationships (or physical variables).
Consider two everyday variables that are strongly correlated. For
example, consider the price of Chevron stock and the price of BP
stock. These two companies compete against each other, but the
prices of their stocks tend to rise and fall at the same times. This is
because the companies both sell oil. As the price of oil rises, so
does the price of each stock.
This correlation forms a relationship between the prices, but the
relationship is not causal. The price of BP stock does not cause the
price of Chevron stock.
Will data display a non-causal relationship between variables? Yes
it will, and it is easy to see why. The price of each stock is caused
by the price of oil plus some company-specific variables that
determine how profitable each company is, i.e.

priceChevron = priceOil + Chevron-specific variables

and

priceBP = priceOil + BP-specific variables

The second equation implies that priceOil = priceBP − BP-specific
variables. Simple algebraic substitution then shows that this
arrangement implies a relationship between the price of Chevron
and BP stock:

priceChevron = priceBP − BP-specific variables + Chevron-specific variables

This relationship will appear as a pattern whenever you collect
data on Chevron and BP stock. Since we do not collect data on the
BP-specific and Chevron-specific variables, these variables will
show up as noise in the pattern. In short, our graph may look
something like the noisy graph between X and Y above.
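This arrangement is easy to mimic in R. The sketch below uses a made-up random-walk "oil price" plus independent company-specific noise; neither simulated stock price causes the other, yet they are strongly correlated.

```r
# Sketch of a non-causal correlation: two made-up stock prices that
# share a common driver (an invented oil price) plus company noise.
set.seed(4)
oil     <- cumsum(rnorm(250))          # a random-walk "oil price"
chevron <- oil + rnorm(250, sd = 2)    # Chevron-specific variables as noise
bp      <- oil + rnorm(250, sd = 2)    # BP-specific variables as noise
cor(chevron, bp)   # strongly correlated, though neither causes the other
```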
To summarize, data will display any sort of relationship between
variables as a pattern, whether or not that relationship involves a
causal association.
To account for this, I will use the term natural law loosely in this
book. A natural law is a causal relationship or reliable correlation
between variables. You might argue that correlations are not
natural laws but are more like natural shorthand rules. You are
right, but correlations do provide valuable information. As you will
see later in the chapter, data scientists tend to make as much use
of correlations as other scientists do of causal laws (and a
correlation can suggest the operation of some unidentified causal
law).

Sample effects
You have seen that data provides a simple way to spot natural
laws, and that this method works in a variety of situations. Why
then, does data science have such a fearsome reputation?
Unfortunately, natural laws are not the only thing that can cause
patterns to appear in data. Sometimes data sets display patterns
that do not exist in real life. These patterns are illusions and lead
to false results. How can you tell whether the patterns that you do
find are real and not an illusion? Before we answer that question,
let's examine why a data set might contain patterns that do not
exist in real life.
Most data sets are much smaller than they could be. For example,
if you wanted to research a question like, "How is an adult's height
related to their age?", you could collect a very big data set: the
measurements of every single adult on the planet. But that
wouldn't be necessary. A pattern between height and age would
become clear well before you finish measuring every adult on the
planet (and if it doesn't, a pattern between your data collection
efforts and your quality of life certainly would).
Data scientists refer to the universe of possible observations that
you could collect as a population, and the set of observations that
you actually collect as a sample. The process of collecting a sample
of data is known as sampling, and it has important consequences
for data science. Sampling opens the door for illusions to creep
into a data set.

Consider the two data sets visualized below.

The graph on the left shows the relationship between the age and
height of 1000 adults. In adults, these two variables are not closely
related. As a result, the points appear as an unstructured cloud,
with no patterns.
The graph on the right displays the relationship between height
and weight for the same adults. An adult's height is related to
their weight, and the data points display a pattern as a result. The
pattern is noisy because other variables (such as diet and exercise)
also play a role in a person's weight, and their effect appears here
as noise in the pattern.
Let's do a simple thought experiment. Imagine that these 1000
adults are the only adults on the planet. In other words, imagine
that these data sets display the entire population of adults. Now
suppose that you only observed 50 of these adults. What would
your data look like?
We can randomly select 50 of the data points above to see. More
than likely, the 50 points would display a less dense, but still
unstructured cloud on the left and a less dense, but still noticeable
pattern on the right. For example, here are 50 points randomly
selected from the original data sets.
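A sketch of this thought experiment in R, with a simulated population (the means, slopes, and noise levels are invented): draw a random sample of 50 from a "population" of 1000 and compare the patterns.

```r
# Sketch: a simulated population of 1000 adults and a random sample of 50.
# The means, slopes, and noise levels are invented for illustration.
set.seed(5)
height <- rnorm(1000, mean = 170, sd = 10)
weight <- 0.9 * height - 80 + rnorm(1000, sd = 8)   # noisy relationship
age    <- runif(1000, 20, 80)                        # unrelated to height
pop    <- data.frame(age, height, weight)

samp <- pop[sample(nrow(pop), 50), ]   # the 50 adults you happen to observe
cor(pop$height,  pop$weight)   # the pattern in the population
cor(samp$height, samp$weight)  # usually similar, occasionally an illusion
cor(samp$height, samp$age)     # usually near zero, occasionally not
```

Re-running the last three lines with different seeds shows how an unlucky sample can weaken, erase, or even invert the population pattern.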

However, by coincidence you might collect 50 observations that
display an illusion. For example, any of the collections below
would suggest that a natural law exists between height and age.
And any of the collections below would suggest that a natural law
does not exist between height and weight. Or worse, the last
pattern suggests that an inverse relationship exists between
height and weight.

These patterns are illusions. They are not caused by natural laws;
they are caused by omission and coincidence. We did not collect
all of the possible observations (which would have revealed the true
law). The observations that we did collect happened to make an
unusual set.

Notice how diabolical this situation can be. The individual
measurements in each of these samples are correct, and yet the
patterns displayed by the measurements do not exist in real life.
Due to sampling effects, data sets often display patterns that do
not exist in real life, which creates a challenge for data scientists.
As a data scientist, your main source of evidence for natural laws
will be patterns (or descriptions of patterns) that you find in data.
Will you be able to tell when your patterns are caused by natural
laws and when they are caused by sampling effects?
In theory, there is no way to use a data set to determine whether
the patterns contained in the data exist in real life. Or, more
precisely, there is no way to determine with absolute certainty
whether the patterns exist in real life.
In practice, there is a way forward. You can calculate the
probability that a pattern is the result of random chance.

Probability
Probability is the branch of mathematics that describes random
behavior. We will take a look at probability later in the book, but
for now let's consider how you can use probability to spot real
patterns.
Recall that sampling is the source of illusions when illusions
appear in your data. In other words, which observations you
decide to collect will determine which patterns you see (if any).
If you use a random method to select observations, then random
chance will be the only mechanism that could cause sampling
effects to appear in the data. You could then calculate the
probability that a pattern in the data is a result of random chance,
and not a natural law.
This system reduces patterns in data from proofs of natural laws to
evidence of natural laws. Each pattern that you find is evidence of a
natural law. If the pattern is likely to be caused by random chance,
then the evidence is weak. If the pattern is not likely to be caused
by random chance, then the evidence is strong.


A probability calculation will tell you exactly how weak or how
strong your evidence is. As a data scientist, you will need to decide
for yourself how strong the evidence must be before you are
willing to believe it.
It is important to realize that probability does not eliminate the
uncertainty associated with patterns. There will always be a small
possibility that even the most striking patterns are caused by
random chance. Probability calculations do not eliminate this
possibility; they quantify it, which makes it easier for you to
reason about your evidence.
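In R, cor.test() performs one such probability calculation. This sketch on made-up data reports the p-value: the probability of seeing a correlation at least as strong as the observed one if random chance alone were at work.

```r
# Sketch: quantifying the evidence for a pattern with a p-value.
# The data are made up; cor.test() is a standard base-R (stats) function.
set.seed(6)
x <- rnorm(50)
y <- x + rnorm(50, sd = 2)       # a real but noisy relationship
result <- cor.test(x, y)
result$estimate   # the observed correlation (the pattern)
result$p.value    # small = strong evidence; large = weak evidence
```

How small the p-value must be before you believe the pattern is the judgment call this chapter describes; the calculation quantifies the uncertainty rather than removing it.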
Data scientists use probability calculations to augment the simple
system of discovery presented by data. This arrangement creates
the method of data science, which can be described with a basic
outline.

The Method of Data Science


Data scientists search for evidence of natural laws in the structure
of data. They then judge the strength of the evidence that they
find. To do this, they:
1. Collect data in a way that minimizes the chance that
patterns will appear by coincidence. Often this involves
some type of random selection.
2. Search for patterns that provide evidence of natural laws.
During this search a data scientist will often:
Wrangle data - make patterns more apparent by
reshaping, subsetting, or transforming the data.
Visualize data - display data in a graph, which
exposes patterns to the human visual system.
Model data - search for patterns with computer
algorithms that can be automated, calibrated, and
optimized.
3. Judge patterns - calculate the probability that a pattern is

due to random chance, and not a natural law. You can view
this step as measuring the strength of the evidence provided
by an analysis.
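As a rough end-to-end illustration of the outline above, here is the wrangle/visualize/model/judge sequence run on R's built-in mtcars data set (chosen here only for convenience; it is not an example from this chapter).

```r
# A rough, end-to-end illustration of the method using R's built-in
# mtcars data set: car weight (wt) vs fuel efficiency (mpg).
cars <- mtcars[, c("wt", "mpg")]            # wrangle: subset the variables
plot(cars$wt, cars$mpg)                     # visualize: expose the pattern
fit  <- lm(mpg ~ wt, data = cars)           # model: search algorithmically
summary(fit)$coefficients["wt", "Pr(>|t|)"] # judge: chance of this pattern
```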
This method involves a level of uncertainty. In many ways, as a
data scientist, you will be a specialist in uncertainty. You will not
work with proofs, like a mathematician, but with evidence that
comes with a certain probability that it might be wrong.
Given this ambiguity, you may wonder why anyone would practice
data science. There are some very good reasons.

Why do Data Science?


Data science complements other methods of scientific inquiry. To
see the strengths of data science, let's compare it to experimental
science, a well-known way to do science. To summarize loosely,
experimental scientists use the following method to learn about
natural laws:
1. Formulate a hypothesis about a natural law.
2. Make a testable prediction deduced from the hypothesis.
3. Conduct an experiment to test the prediction.
4. Reject the hypothesis if the prediction is incorrect.

Discovery and confirmation


You may notice that the experimental method begins with a
hypothesis and then uses observations to test the hypothesis. This
approach makes the experimental method very good for confirming
hypotheses. Experimental scientists can quickly winnow false
hypotheses from very plausible hypotheses.
However, the experimental method does not answer a very
important question: how should scientists think up useful
hypotheses to test? Data science provides the answer. A scientist
can begin with observations and then search them for patterns

that suggest hypotheses about natural laws. In short, data science
provides a system of discovery for scientists to use.

Causation and prediction


Experiments are designed to show causation, a specific type of
relationship between variables. An experimenter manipulates an
explanatory variable to observe the effect that the manipulation
causes in a response variable. This design makes experiments less
effective at discovering non-causal relationships.
Why would you want to discover a non-causal relationship?
Whenever a relationship exists between variables, you can use the
relationship to make better predictions. You can use the value of
one variable to predict the value of another variable that it is
related to. This works even if the relationship is a non-causal
correlation.
Consider, for example, how Netflix knows which movies you will
like. By studying data, Netflix has learned that people who like The
Matrix also tend to like The Terminator and vice versa. This
relationship is very useful, but it is not causal: your opinion of The
Matrix does not cause your opinion of The Terminator.
In contrast to experimental science, data science makes it easy to
spot any type of relationship between variables. Data science will
expose both non-causal and causal relationships as patterns in the
data. Data science will not tell you which relationships are causal
and which are not, but if you are only interested in making
accurate predictions, you may not mind.

Flexibility and control


Consider for a moment why experiments prove causation. An
experimenter does more than manipulate an explanatory variable
to see the effect on a response variable. An experimenter also
holds constant any other variable that could influence the
response during the experiment. For example, an experimenter
will ensure that the temperature, humidity, local magnetic fields,

etc. do not fluctuate during an experiment.
As a result, the experimenter can rule out the possibility that
something other than the explanatory variable caused the effect in
the response variable. This method is almost foolproof, but it
requires a tremendous amount of control over the process being
studied.
In many research settings, this amount of control is impossible or
unethical. For example, you could not control each of the variables
that influences something like the stock market, or a nation's
economy. Nor should you control variables like how much alcohol
a pregnant woman ingests or how much pollution an asthmatic
person inhales if doing so would cause unnecessary harm.
Data science requires much less control than experimental
science, which makes data science adaptable to a broader range of
research questions. As a data scientist, you do not need to
intervene in a process to study it. You can collect data passively by
observing nature as it is, which can free you from the ethical and
logistical burdens that an experimental scientist would face.

Take Caution
We are starting to learn that most published data science findings
may be wrong. In 2012, Amgen determined that only 6 of 53
landmark medical studies had results that could be reproduced.
From a scientific point of view, this means that these studies
should be considered unreliable, if not wrong. In 2011, the Bayer
company found it could only reproduce 25% of published findings
in cardiovascular disease, cancer, and women's health studies.
Bayer shelved development of two thirds of its new drug projects
as a result.
Data science goes wrong in other fields too. The 2008 Financial
Crisis was enabled by a misapplication of the Gaussian copula, a
data analysis technique. In another case, NASA analyzed global
ozone data for seven years without noticing the hole in the ozone

layer. The most famous data analysis failure probably happened in
1986. Engineers at Morton Thiokol, the builder of the space shuttle
booster rockets, predicted that the Challenger would explode on
launch. They had a chance to prevent the launch, but changed
their minds after misreading data that proved them right.
Even famous statisticians can get data wrong. Sir Ronald Fisher
invented much of modern statistics, but he spent the end of his
career using data to show that cigarettes do not cause cancer.
This doesn't mean you should avoid data. Looking at data will
always create better understanding than ignoring it, but remember
that data is not a cure-all. Good science also requires good
reasoning and skepticism.

Summary and Parting Advice


The method of data science is very simple and very effective. Data
scientists search for evidence of natural laws in the structure of
data. If a natural law exists between the variables in a data set, it
will appear as a pattern in the data.
This method is very useful for discovering laws and for collecting
information that can lead to better predictions. Moreover, you can
apply data science to any situation in which you can collect data.
But data can be very deceptive. Patterns can be hidden in noise
and may not appear at first glance. Moreover, coincidences (or
biases) that occur when you collect your data can introduce
patterns into your data that do not occur in real life. These things
cause formidable challenges; let the sidebar serve as a cautionary
tale.
How can you do better than the people mentioned in the sidebar?
You already have one advantage. Many people who practice (and
fail at) data science do not study data science and might not
appreciate how deceptive data can be. You can further protect
yourself by adopting two traits that will safeguard your work. You
can be curious enough to explore a data set thoroughly, exposing
any patterns that are there. Then you can be skeptical enough to
question every pattern that you find and to search for alternative
explanations.
John Tukey, one of the first data scientists, often compared data
science to detective work. I like this metaphor because detectives
are both curious and skeptical. Also, detective work is risky
business, and so is data science. But I would extend the metaphor
further. If you think of yourself as a detective, you should think of
data as the mysterious blonde who walks into your office: sexy on
the surface, murky and treacherous beneath.

Garrett Grolemund. Pre-order Data Science with R at shop.oreilly.com.
