
Significance testing 101: The Z test - Part One

What is Significance testing?

My former statistics professor used to say: "our world is noisy."


In data science, and in science in general, there is a need to distinguish changes and effects caused by chance and random noise from those caused by deliberate actions.
For example, let's imagine we want to know whether changing the background color of a website from red to yellow makes visitors stay longer. One of the most natural ways to find out is to change it and check whether the average visiting time has increased.
The fundamental problem with this is that our world is noisy. Even if we do see an increase in visiting time, how can we know it was caused by the change of background color and not just by chance? This is where significance tests come into play. They let us know the odds that the change we see is just a random fluctuation of the data. And if that is very unlikely, we can say with a high degree of certainty that it was caused by our actions.
Today we will talk about the Z test. While it is not very useful by itself (the reasons will be explained later on), it lays down the ideas behind almost all other significance tests.

The mean

Before we dive into the test itself, there are a few key concepts we need to get familiar with.
The first and most commonly known of them is the mean. The mean is a statistical tool that helps us understand the distribution of data. Simply put, the mean is just the mathematical term for the average, and is the sum of all the data points in the distribution divided by the number of points being averaged:

X̄ = (X1 + X2 + ... + Xn) / n

Where X1, X2, ..., Xn represent all the numbers being averaged and n is the number of them.

The average gives us a very useful and intuitive way to understand the distribution without
having to go through all the observations in it. For example, if you know that the average
test score of a class is 98.5 out of 100, then you do not have to read all the test scores of all
the students to know that most of them did really well.
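Since part two will use Python, here is how the mean from the test-score example might look in code (the individual scores are invented for illustration):

```python
# Made-up test scores for a small class; the mean is the sum of all
# scores divided by how many scores there are.
scores = [99, 97, 100, 98, 98.5]

mean = sum(scores) / len(scores)
print(mean)  # 98.5
```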
The standard deviation

Imagine that I am offering you one million dollars for answering this one question: "Is 300 above the average a lot?". Honestly, my money is safe, because this is a trick question. In order to answer it, more information is required. The standard deviation tells us how stretched out the observations are from the average. If typical deviations from the average are in the thousands, then 300 above it is nothing special; if the standard deviation is small (for example, if the biggest deviation from the average is 3), then 300 is a lot. The formula for the standard deviation is:

SD = √( Σ (X - X̄)² / n )

Where X is a single observation, X̄ is the mean, n is the number of observations, and Σ means "sum up all of the following". The basic idea of this equation is to sum up all of the squared distances of observations from the mean (they are squared so that negative distances do not cancel out the positive ones), and divide by n so that the standard deviation is affected only by the distances from the average, not by the number of observations. The standard deviation is the square root of that number.
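As a quick Python sketch (with made-up data), the steps described above - square the distances from the mean, average them, take the square root - look like this:

```python
import math

# Illustrative data; the standard deviation is the square root of the
# average squared distance from the mean.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
sd = math.sqrt(variance)
print(sd)  # 2.0
```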

The normal distribution

One of the most important distributions in statistics is the normal distribution. Here are
some examples of it:

As you can see, it is symmetrical and has only one peak. The center of the distribution is its mean. Although there are other distributions with these properties, the normal distribution has a very important feature that we can use: given a number of standard deviations from its mean, we can know the area under the curve up to that point. The following graph and example will make this idea clearer:
[Figure: a normal curve with the mean at its center and one, two, and three standard deviations marked on either side.]

For example, if the mean of a normal distribution is 30, and its standard deviation is 3.5,
then 34.1% of all observations will be between 30 and 33.5.

Even though the form of the normal distribution varies according to the mean and standard
deviation, it will always be bell-shaped and those percentages will always stay the same.

It is important to understand that the area under the curve represents the probability of randomly selecting a value in that range. Values near the middle of a normal distribution are more likely to be selected; this is why more of the distribution's area is concentrated in the middle.
On the other hand, values that are more than 2 standard deviations from the average are at
the edges and will rarely be selected.

The sample distribution

The sample distribution is a theoretical distribution in which, instead of basing the distribution on all of the individual observations, the observations are divided into groups - or samples - of the same size. Then, the means of all of the samples are used to create the sample distribution. This is a good place to mention one of the main rules of statistics: the Central Limit Theorem. Without getting into its mathematical proof, it simply states that if
the sample size is big enough, the sample distribution will be approximately normal
regardless of the actual form of the distribution of single observations. The sample
distribution will also be normal if the distribution of the single observations was normal
regardless of sample size. It is important to note that while the mean of a regular distribution and that of the sample distribution built from it are the same, the standard deviation of the sample distribution is lower and is called "the standard error". In order to understand why, ask yourself which is more likely: to meet one person on the street whose height is 2 meters, or 15 people whose average height is 2 meters? Since samples with extreme values are much less likely, the values in the sample distribution are usually less stretched out from the average. The formula for the standard error is:

SE = σ / √n

Where SE is the standard error, σ is the standard deviation of the single observations, and n is the number of observations in one sample.
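A small simulation can make this concrete. The sketch below (the exponential population and the sizes are arbitrary choices) draws many samples, takes their means, and compares the spread of those means with σ/√n:

```python
import random
import statistics

# Draw many samples from a decidedly non-normal population and check
# that the spread of the sample means is close to sigma / sqrt(n).
random.seed(0)
n = 50            # observations per sample
num_samples = 2000

sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

observed_se = statistics.stdev(sample_means)
predicted_se = 1.0 / n ** 0.5   # sigma of expovariate(1.0) is 1.0
print(round(observed_se, 3), round(predicted_se, 3))
```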

The Z score

Before we can start learning to perform the Z test, there is one last (and simple) key idea we need to understand: the Z score. The Z score given to one observation placed inside the distribution of single observations expresses the distance of that observation from the mean in units of standard deviation. More commonly, it is used for expressing the distance between the mean of one sample placed inside the sample distribution and the mean of that distribution, in units of standard error. The formula for generating a Z score for a sample mean is:

Z = (X̄ - μ) / SE

Where Z is the Z score, X̄ is the sample's mean, μ is the mean of all observations and SE is the standard error.
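As a minimal Python sketch, the Z-score formula for a sample mean translates directly into code (every number here is invented for illustration):

```python
import math

# How many standard errors a sample mean lies from the population mean.
def z_score(sample_mean, population_mean, sigma, n):
    se = sigma / math.sqrt(n)              # the standard error
    return (sample_mean - population_mean) / se

# A sample of 25 visits averaging 5.5 minutes, drawn from a population
# with mean 5.0 and standard deviation 1.25:
print(z_score(5.5, 5.0, 1.25, 25))  # 2.0
```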

Let's get familiar with a specific kind of normal distribution: the standard normal distribution. In order to construct it, we can take each of the observations and give it a Z score; for the Z test, this is done with the means of the samples in the sample distribution. With all of those Z scores, we can create the standard normal distribution. This distribution has a few interesting features: its mean is always 0, and its standard deviation is always 1. Also, the standard normal distribution created from any sample distribution is completely identical, regardless of the original mean or SD. The standard normal distribution is defined as a normal distribution with mean = 0 and standard deviation = 1.
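We can verify these two features with a few lines of Python (the data values are illustrative):

```python
import statistics

# Standardizing each observation (subtract the mean, divide by the
# standard deviation) yields Z scores whose mean is 0 and whose
# standard deviation is 1.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)
sd = statistics.pstdev(data)   # population SD, matching the /n formula
z = [(x - mu) / sd for x in data]

print(statistics.mean(z), statistics.pstdev(z))  # 0.0 1.0
```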
The standard normal distribution helps us understand the uniqueness of a certain value. Consider this question: is it less likely to randomly walk into a person (let's call him David) on the street whose height is 2 meters, or into a person (let's call him Adam) whose weight is 200 pounds? This is hard to judge without some kind of standardized scoring system. But if I were to tell you that the Z score of David's height is 2 and the Z score of Adam's weight is 1, it would mean that there are far fewer people taller than David than people weighing more than Adam.
[Figure: a standard normal curve with Adam's weight marked at Z = 1 and David's height marked at Z = 2.]

As you can see, 15.8% of the population weighs more than Adam, while only 2.2% is taller than David.
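These percentages can be reproduced with the same math.erf trick (the small differences from the numbers above are rounding):

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Share of the population above each Z score in the example:
adam_tail = 1 - normal_cdf(1)    # weight Z score of 1
david_tail = 1 - normal_cdf(2)   # height Z score of 2
print(round(adam_tail, 3), round(david_tail, 3))  # 0.159 0.023
```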

The general idea behind the Z test


With the tools we have acquired throughout this post, we can now understand the core idea
behind the Z test and Significance testing in general. It is actually quite simple.
By giving our sample's mean a Z score, we can know how likely it is to get such a value by chance.
In our example from the beginning of the post, we were interested in finding out whether changing the background color of a website from red to yellow makes visitors stay longer. By taking the mean of the visiting time of the users during the first month in which the background was yellow and giving this mean a Z score, we can know how likely it is to get this value by chance. If it is very unlikely, we can argue with reasonable certainty that the increase during that month was caused by the change we made.
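Putting it all together, here is a rough preview of that logic in Python, with every number invented for illustration; part two will cover the real procedure and its caveats:

```python
import math

# Sketch: give the new month's mean visiting time a Z score inside the
# sample distribution, then ask how likely such a value is under chance
# alone. All numbers below are made up.
def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu = 4.0        # historical mean visiting time (minutes)
sigma = 1.5     # historical standard deviation
n = 900         # visits during the yellow-background month
sample_mean = 4.2

se = sigma / math.sqrt(n)        # standard error
z = (sample_mean - mu) / se      # Z score of the new mean
p = 1 - normal_cdf(z)            # chance of a mean this high by luck

print(round(z, 1))   # 4.0
print(p < 0.001)     # True: very unlikely to be chance alone
```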

What's next
In this post we have laid the necessary foundations to understand the Z test and many other
kinds of significance tests. In part 2, we will learn how to perform the Z test. We will also
discuss when it is right to use it and what its limitations are. And finally, we will learn to use
it with Python.
