f2 = m2 a2
or
f3 = m3 a3
and so on.
In the notation above, the lowercase letters denote specific values
of the variables F, M, and A. Throughout the book, I will refer to
variable names with a capital letter and individual values with a
lowercase letter.
The subscripts denote which observation each of the values
belongs to. If a set of values belongs to the same observation, it
implies that the values were measured under similar conditions.
To see how this works, consider what the three observations above
may represent. The observations may have been taken at three
different times. For example, f1, m1, and a1 may have been
measured at time one, f2, m2, and a2 measured at time two, and
so on. Alternatively, the observations may describe three different
particles. For example, f1, m1, and a1 may have been measured on
one particle, f2, m2, and a2 may have been measured on a
second particle, and f3, m3, and a3 may have been measured on a
third particle.
Observations play a very important role in science. A natural law
implies that a relationship will exist between values of variables
that appear in the same observation. However, a natural law does
not imply that a relationship will exist between values in different
observations. You wouldn't think that the force you exert on one
object would equal the mass times the acceleration that you
measure on a different object. Or, in the notation above, you
wouldn't think that f1 should equal m2 times a2. You would
expect f1 to equal m1 times a1.
Natural laws provide a goal for science. Scientists attempt to
discover natural laws and thereby explain natural phenomena.
You can think of science as a collection of methods that use
observations to discover natural laws.
Data science is one of those methods. It uses a specific tool to
reveal natural laws, and that tool is data.
Data, or more precisely a data set, is a collection of values that
is organized so that each value is associated with a variable as
well as an observation:

F     M     A
f1    m1    a1
f2    m2    a2
f3    m3    a3
Now that you know the vocabulary of data science, let's look at the
method.
Patterns
Recall that a data set organizes values so that each value is
associated with a variable as well as an observation. This
organization makes data sets particularly useful for discovering
natural laws. If a natural law exists between the variables in a data
set, the law will appear as a pattern that reoccurs in each
observation. Or to put it more simply, a natural law will appear as
a pattern in data.
In our example data set, the relationship described by the law
F = M A will be present in each observation. As a result, the data
set will reveal what the natural law implies:
obs 1:  f1 = m1 a1
obs 2:  f2 = m2 a2
obs 3:  f3 = m3 a3
This is easy to verify if you measure the real forces, masses, and
accelerations associated with several dozen particles, as in the
data set below. Each row of values displays the relationship
F = MA.
obs    F       M       A
1      3.01    0.98    3.08
2      2.35    0.91    2.58
3      5.57    1.01    5.52
4      0.62    1.09    0.56
5      4.15    0.89    4.69
6      5.07    1.05    4.84
7      7.56    0.93    8.12
8      4.04    1.09    3.70
...    ...     ...     ...
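If you would rather check the rows than take my word for it, here is a minimal sketch in Python. It assumes that each group of three numbers in the data set is an (F, M, A) triple in that order, which is my reading of the table rather than something the table states. The names rows, within_gap, and across_gap are invented for this illustration. The sketch compares each force with the mass times acceleration from the same observation, and then with the mass times acceleration from a different observation.

```python
# Rows copied from the data set above, read as (F, M, A) triples.
# The column order is an assumption made for this sketch.
rows = [
    (3.01, 0.98, 3.08),
    (2.35, 0.91, 2.58),
    (5.57, 1.01, 5.52),
    (0.62, 1.09, 0.56),
    (4.15, 0.89, 4.69),
    (5.07, 1.05, 4.84),
    (7.56, 0.93, 8.12),
    (4.04, 1.09, 3.70),
]


def within_gap(rows):
    """Largest |f - m * a| when all three values share an observation."""
    return max(abs(f - m * a) for f, m, a in rows)


def across_gap(rows):
    """Largest |f - m * a| when f is paired with the next row's m and a."""
    shifted = rows[1:] + rows[:1]
    return max(
        abs(f - m2 * a2)
        for (f, _, _), (_, m2, a2) in zip(rows, shifted)
    )


for f, m, a in rows:
    print(f"f = {f:.2f}    m * a = {m * a:.2f}")

print("largest gap within an observation:", round(within_gap(rows), 3))
print("largest gap across observations:  ", round(across_gap(rows), 3))
```

Within an observation the two sides agree to within rounding and measurement error. Across observations they do not, which is why it matters that a data set keeps track of which values belong together.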
You will face a similar situation if your data contains some, but
not all, of the variables in a law. When this happens, a pattern will
still appear between the variables in your data set that are
connected by a law. The influence of the missing variables will
appear as noise in the pattern.
You can see this happen in the two-dimensional graph between F
and A. The graph ignores the M variable, as if it were not part of
the data set. As a result, the M variable adds noise to the linear
pattern between F and A, but the pattern is still discernible.
Noise in your data is not a cause for defeat. As long as you capture
the most influential variables in a law, and do not let
measurement errors get so big that they swamp your data, you are
likely to find a pattern that will point to the law.
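To get a feel for how a missing variable turns into noise, you can simulate it. The sketch below is only an illustration under made-up assumptions: the masses are drawn between 0.9 and 1.1 and the accelerations between 0.5 and 8.0, ranges I chose to resemble the example data, and pearson is a small helper written for this sketch. It builds forces with F = MA, then measures how strong the linear pattern between F and A remains when M is ignored.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up ranges for the simulated particles.
masses = [rng.uniform(0.9, 1.1) for _ in range(200)]
accels = [rng.uniform(0.5, 8.0) for _ in range(200)]
forces = [m * a for m, a in zip(masses, accels)]

# With M in the data set, the law holds exactly in every observation.
exact = all(abs(f - m * a) < 1e-12
            for f, m, a in zip(forces, masses, accels))

# Without M, the ratio F / A is no longer constant, but F and A
# still rise and fall together.
ratios = [f / a for f, a in zip(forces, accels)]
r_fa = pearson(forces, accels)

print("law holds exactly when M is included:", exact)
print("F / A ranges from", round(min(ratios), 2), "to", round(max(ratios), 2))
print("correlation between F and A:", round(r_fa, 3))
```

The correlation falls short of a perfect 1 because M is missing, but it stays high because the masses in this simulation vary much less than the accelerations. If you widen the range of the masses, the pattern gets noisier.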
Correlations
So far we have considered how data will display F = MA, a causal
relationship between physical variables. But now we can see that
data will also display patterns that do not involve causal
relationships (or physical variables).
Consider two everyday variables that are strongly correlated. For
example, consider the price of Chevron stock and the price of BP
stock. These two companies compete against each other, but the
prices of their stocks tend to rise and fall at the same time. This is
because both companies sell oil. As the price of oil rises, so
does the price of each stock.
This correlation forms a relationship between the prices, but the
relationship is not causal. The price of BP stock does not cause the
price of Chevron stock.
Will data display a non-causal relationship between variables? Yes,
it will, and it is easy to see why. The price of each stock is caused
by the price of oil plus some company-specific variables that
determine how profitable each company is, i.e.
priceChevron = priceOil + Chevron-specific variables
and
priceBP = priceOil + BP-specific variables
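The two equations above are easy to simulate. The sketch below uses invented numbers throughout: the oil price, the size of the company-specific terms, and the names oil, chevron_specific, and bp_specific are all assumptions for illustration, not real market data. Each stock price is built as the shared oil price plus its own independent term, and then the correlations are measured.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
days = 500

# Invented values: a shared oil price and two independent
# company-specific terms.
oil = [rng.gauss(80.0, 10.0) for _ in range(days)]
chevron_specific = [rng.gauss(0.0, 5.0) for _ in range(days)]
bp_specific = [rng.gauss(0.0, 5.0) for _ in range(days)]

# priceChevron = priceOil + Chevron-specific variables
price_chevron = [o + c for o, c in zip(oil, chevron_specific)]
# priceBP = priceOil + BP-specific variables
price_bp = [o + b for o, b in zip(oil, bp_specific)]

r_prices = pearson(price_chevron, price_bp)
r_specific = pearson(chevron_specific, bp_specific)

print("correlation between the two stock prices:", round(r_prices, 2))
print("correlation between company-specific terms:", round(r_specific, 2))
```

Neither simulated price is computed from the other, yet the two prices are strongly correlated because both contain the oil price. The company-specific terms, which share nothing, show almost no correlation.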
Sample effects
You have seen that data provides a simple way to spot natural
laws, and that this method works in a variety of situations. Why,
then, does data science have such a fearsome reputation?
Unfortunately, natural laws are not the only thing that can cause
patterns to appear in data. Sometimes data sets display patterns
that do not exist in real life. These patterns are illusions and lead
to false results. How can you tell whether the patterns that you do
find are real and not an illusion? Before we answer that question,
let's examine why a data set might contain patterns that do not
exist in real life.
Most data sets are much smaller than they could be. For example,
if you wanted to research a question like "How is an adult's height
related to their age?", you could collect a very big data set: the
measurements of every single adult on the planet. But that
wouldn't be necessary. A pattern between height and age would
become clear well before you finish measuring every adult on the
planet (and if it doesn't, a pattern between your data collection
efforts and your quality of life certainly would).
Data scientists refer to the universe of possible observations that
you could collect as a population, and the set of observations that
you actually collect as a sample. The process of collecting a sample
of data is known as sampling, and it has important consequences
for data science. Sampling opens the door for illusions to creep
into a data set.
The graph on the left shows the relationship between the age and
height
And any of the collections below would suggest that a natural law
does not exist between height and weight. Worse, the last
pattern suggests that an inverse relationship exists between
height and weight.
These patterns are illusions. They are not caused by natural laws;
they are caused by omission and coincidence. We did not collect
all of the possible observations (which would have revealed the true
law). The observations that we did collect happened to make an
unusual set.
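You can watch these illusions appear by drawing small samples from a simulated population. The sketch below is a toy under stated assumptions: the population of heights and weights is invented (the numbers are not real measurements), and population, sample_correlation, and misleading are names made up for this example. The population is built so that weight really does increase with height. The code then draws many samples of five observations each and counts how often a sample shows no relationship or an inverse one.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample_correlation(sample):
    """Correlation between height and weight in a list of pairs."""
    heights = [h for h, _ in sample]
    weights = [w for _, w in sample]
    return pearson(heights, weights)


rng = random.Random(1)  # fixed seed so the run is repeatable

# An invented population in which taller people tend to weigh more.
population = []
for _ in range(10_000):
    height = rng.gauss(170.0, 10.0)                 # cm, made up
    weight = 0.5 * height + rng.gauss(-15.0, 10.0)  # kg, made up
    population.append((height, weight))

population_r = sample_correlation(population)

# Draw 1,000 small samples and record the pattern each one shows.
sample_rs = [sample_correlation(rng.sample(population, 5))
             for _ in range(1000)]
misleading = sum(1 for r in sample_rs if r <= 0) / len(sample_rs)

print("correlation in the whole population:", round(population_r, 2))
print("share of 5-person samples with no or inverse pattern:",
      round(misleading, 2))
```

Every sample comes from the same population, so the law behind the data never changes. Only the luck of the draw does. Try raising the sample size from 5 to 50 and the share of misleading samples shrinks.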
Probability
Probability is the branch of mathematics that describes random
behavior. We will take a look at probability later in the book, but
for now let's consider how you can use probability to spot real
patterns.
Recall that sampling is the source of illusions when illusions
appear in your data. In other words, which observations you
decide to collect will determine which patterns you see (if any).
If you use a random method to select observations, then random
chance will be the only mechanism that could cause sampling
effects to appear in the data. You could then calculate the
probability that a pattern in the data is a result of random chance,
and not a natural law.
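One simple way to estimate such a probability is a permutation test. This is a standard technique that I am swapping in for illustration; the text above does not name a specific method. The sketch reuses the F and A values from the example data set, assuming the first and third number in each row are F and A. The function chance_probability and its parameters are invented for this example. It shuffles the A values many times, which breaks any real link between F and A, and counts how often chance alone produces a pattern at least as strong as the one observed.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def chance_probability(xs, ys, shuffles=5000, seed=0):
    """Estimate how often shuffled data matches the observed pattern."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # pair each x with a randomly chosen y
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (shuffles + 1)


# F and A values from the example data set (column order assumed).
forces = [3.01, 2.35, 5.57, 0.62, 4.15, 5.07, 7.56, 4.04]
accels = [3.08, 2.58, 5.52, 0.56, 4.69, 4.84, 8.12, 3.70]

p = chance_probability(forces, accels)
print("observed correlation:", round(pearson(forces, accels), 3))
print("estimated probability of a pattern this strong by chance:", p)
```

A small probability means random chance is a poor explanation for the pattern. It does not prove that a natural law is at work; it only tells you how strong the evidence is.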
This system reduces patterns in data from proofs of natural laws to
evidence of natural laws. Each pattern that you find is evidence of a
natural law. If the pattern is likely to be caused by random chance,
then the evidence is weak. If the pattern is not likely to be caused
by random chance, then the evidence is strong. The last step of an
analysis is to calculate the probability that a pattern is
due to random chance, and not a natural law. You can view
this step as measuring the strength of the evidence provided
by an analysis.
This method involves a level of uncertainty. In many ways, as a
data scientist, you will be a specialist in uncertainty. You will not
work with proofs, like a mathematician, but with evidence that
comes with a certain probability that it might be wrong.
Given this ambiguity, you may wonder why anyone would practice
data science. There are some very good reasons.
Take Caution
We are starting to learn that most published data science findings
may be wrong. In 2012, Amgen determined that only 6 of 53
landmark medical studies had results that could be reproduced.
From a scientific point of view, this means that these studies
should be considered unreliable, if not wrong. In 2011, the Bayer
company found it could reproduce only 25% of published findings
in cardiovascular disease, cancer, and women's health studies.
Bayer shelved development of two-thirds of its new drug projects
as a result.
Data science goes wrong in other fields too. The 2008 Financial
Crisis was enabled by a misapplication of the Gaussian copula, a
data analysis technique. In another case, NASA analyzed global
ozone data for seven years without noticing the hole in the ozone