
There are a few divisions of topics in statistics. One division that quickly comes to
mind is the differentiation between descriptive and inferential statistics. There are
other ways that we can separate out the discipline of statistics. One of these ways is
to classify statistical methods as either parametric or nonparametric.
We will find out what the difference is between parametric and nonparametric
methods by comparing different instances of each type of method.

## Parametric Methods

Methods are classified on the basis of what we know about the population we are
studying. Parametric methods are typically the first methods studied in an
introductory statistics course. The basic idea is that there is a set of fixed parameters
that determine a probability model.
Parametric methods are often those for which we know that the population is
approximately normal, or we can approximate using a normal distribution after we
invoke the central limit theorem. There are two parameters for a normal
distribution: the mean and the standard deviation.
Ultimately the classification of a method as parametric depends upon the
assumptions that are made about a population. A few parametric methods include:

- Confidence interval for a population mean, with known standard deviation.
- Confidence interval for a population mean, with unknown standard deviation.
- Confidence interval for a population variance.
- Confidence interval for the difference of two means, with unknown standard deviation.

## Nonparametric Methods

To contrast with parametric methods, we will define nonparametric methods. These
are statistical techniques for which we do not have to make any assumption of
parameters for the population we are studying.
Indeed, the methods do not depend on the population of interest following any particular distribution. The
set of parameters is no longer fixed, and neither is the distribution that we use. It is
for this reason that nonparametric methods are also referred to as distribution-free
methods.
Nonparametric methods are growing in popularity and influence for a number of
reasons. The main reason is that we are not constrained as much as when we use a
parametric method. We do not need to make as many assumptions about the
population that we are working with as we do with a parametric
method. Many of these nonparametric methods are easy to apply and to understand.
A few nonparametric methods include:

- Sign test for a population median
- Bootstrapping techniques
- Mann-Whitney U test for two independent samples
- Spearman correlation test

## Comparison

There are multiple ways to use statistics to find a confidence interval about a
mean. A parametric method would involve the calculation of a margin of error with
a formula, and the estimation of the population mean with a sample mean. A
nonparametric method to calculate a confidence interval for a mean would involve the use of
bootstrapping.
Why do we need both parametric and nonparametric methods for this type of
problem?
Many times parametric methods are more efficient than the corresponding
nonparametric methods. Although this difference in efficiency is typically not that
much of an issue, there are instances where we do need to consider which method is
more efficient.

A normal distribution is more commonly known as a bell curve. This type of curve
shows up throughout statistics and the real world.
For example, after I give a test in any of my classes, one thing that I like to do is to
make a graph of all the scores. I typically write down 10 point ranges such as 60-69,
70-79, and 80-89, then put a tally mark for each test score in that range. Almost
every time I do this, a familiar shape emerges.
A few students do very well and a few do very poorly. A bunch of scores end up
clumped around the mean score. Different tests may result in different means and
standard deviations, but the shape of the graph is nearly always the same. This shape
is commonly called the bell curve.
Why call it a bell curve? The bell curve gets its name quite simply because its shape
resembles that of a bell. These curves appear throughout the study of statistics, and
their importance cannot be overemphasized.

## What Is a Bell Curve?

To be technical, the kinds of bell curves that we care about the most in statistics are
actually called normal probability distributions. For what follows we’ll just assume
the bell curves we’re talking about are normal probability distributions. Despite the
name “bell curve,” these curves are not defined by their shape. Instead, an
intimidating looking formula is used as the formal definition for bell curves.
But we really don’t need to worry too much about the formula. The only two
numbers that we care about in it are the mean and standard deviation. The bell curve
for a given set of data has the center located at the mean. This is where the highest
point of the curve or “top of the bell” is located. A data set’s standard deviation
determines how spread out our bell curve is.
The larger the standard deviation, the more spread out the curve.

## Important Features of a Bell Curve

There are several features of bell curves that are important and distinguish them
from other curves in statistics:

- A bell curve has one mode, which coincides with the mean and median. This is the center of the curve, where it is at its highest.
- A bell curve is symmetric. If it were folded along a vertical line at the mean, both halves would match perfectly because they are mirror images of each other.
- A bell curve follows the 68-95-99.7 rule, which provides a convenient way to carry out estimated calculations:
  - Approximately 68% of all of the data lies within one standard deviation of the mean.
  - Approximately 95% of all the data lies within two standard deviations of the mean.
  - Approximately 99.7% of the data lies within three standard deviations of the mean.

## An Example

If we know that a bell curve models our data, we can use the above features of the
bell curve to say quite a bit. Going back to the test example, suppose we have 100
students who took a statistics test with a mean score of 70 and standard deviation of
10.
The standard deviation is 10. Subtract and add 10 to the mean. This gives us 60 and
80.
By the 68-95-99.7 rule we would expect about 68% of 100, or 68 students to score
between 60 and 80 on the test.
Two times the standard deviation is 20. If we subtract and add 20 to the mean we
have 50 and 90. We would expect about 95% of 100, or 95 students to score between
50 and 90 on the test.
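
The arithmetic in this example can be checked with a short sketch. It assumes the scores follow an exact normal distribution with mean 70 and standard deviation 10, and it uses `NormalDist` from Python's standard library; the helper name `expected_within` is made up for this illustration.

```python
from statistics import NormalDist

# Model from the example: test scores with mean 70 and standard deviation 10.
scores = NormalDist(mu=70, sigma=10)
n_students = 100


def expected_within(k):
    """Expected number of students scoring within k standard deviations of the mean."""
    low = scores.mean - k * scores.stdev
    high = scores.mean + k * scores.stdev
    share = scores.cdf(high) - scores.cdf(low)  # area under the curve between low and high
    return share * n_students


for k in (1, 2, 3):
    print(k, round(expected_within(k), 1))
```

The three printed counts should land close to the 68, 95 and 99.7 given by the rule. Small differences are expected, since the rule rounds the exact areas.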

The central limit theorem is a result from probability theory. This theorem shows up
in a number of places in the field of statistics. Although the central limit theorem can
seem abstract and devoid of any application, this theorem is actually quite important
to the practice of statistics.
So what exactly is the importance of the central limit theorem? It all has to do with
the distribution of our population.
As we will see, this theorem allows us to simplify problems in statistics by allowing
us to work with a distribution that is approximately normal.

## Statement of the Theorem

The statement of the central limit theorem can seem quite technical but can be
understood if we think through the following steps. We begin with a simple random
sample with n individuals from a population of interest. From this sample, we can
easily form a sample mean that corresponds to the mean of whatever measurement we
are curious about in our population.
A sampling distribution for the sample mean is produced by repeatedly selecting
simple random samples from the same population and of the same size, and then
computing the sample mean for each of these samples. These samples are to be
thought of as being independent of one another.
The central limit theorem concerns the sampling distribution of the sample means.
We may ask about the overall shape of the sampling distribution.
The central limit theorem says that this sampling distribution is approximately
normal - commonly known as a bell curve. This approximation improves as we
increase the size of the simple random samples that are used to produce the
sampling distribution.
There is a very surprising feature concerning the central limit theorem.
The astonishing fact is that this theorem says that a normal distribution arises
regardless of the shape of the initial distribution, provided that the population has a
finite variance. Even if our population has a skewed
distribution, which occurs when we examine things such as incomes or people’s
weights, the sampling distribution of the sample mean for a sufficiently large sample size
will be approximately normal.
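
A small simulation can make this claim concrete. The sketch below is only an illustration under stated assumptions: the skewed population is taken to be exponential with mean 1, the sample sizes 2 and 50 are arbitrary choices, and the helper names `sample_means` and `skewness` are invented here.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable


def sample_means(n, repetitions=2000):
    """Means of `repetitions` independent samples of size n from an exponential population."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(repetitions)]


def skewness(values):
    """Moment-based skewness; it is 0 for a perfectly symmetric distribution."""
    m = mean(values)
    second = mean((v - m) ** 2 for v in values)
    third = mean((v - m) ** 3 for v in values)
    return third / second ** 1.5


small = sample_means(2)   # sampling distribution for samples of size 2
large = sample_means(50)  # sampling distribution for samples of size 50
print(round(skewness(small), 2), round(skewness(large), 2))
```

The skewness of the simulated sampling distribution should shrink toward zero as the sample size grows, which is the approach to a bell curve that the theorem describes.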

## Central Limit Theorem in Practice

The unexpected appearance of a normal distribution from a population distribution
that is skewed (even quite heavily skewed) has some very important applications in
statistical practice. Many practices in statistics, such as those involving hypothesis
testing or confidence intervals, make some assumptions concerning the population
that the data was obtained from. One assumption that is initially made in a statistics
course is that the populations that we work with are normally distributed.
The assumption that data is from a normal distribution simplifies matters but seems
a little unrealistic. Just a little work with some real-world data shows that outliers,
skewness, multiple peaks and asymmetry show up quite routinely. The use of an
appropriate sample size and the central limit theorem helps us to get around the
problem of data from populations that are not normal.
Thus, even though we might not know the shape of the distribution where our data
comes from, the central limit theorem says that we can treat the sampling
distribution as if it were normal. Of course, in order for the conclusions of the
theorem to hold, we do need a sample size that is large enough. Exploratory data
analysis can help us to determine how large of a sample is necessary for a given
situation.

Inferential statistics gets its name from what happens in this branch of statistics.
Rather than simply describe a set of data, inferential statistics seeks to infer
something about a population on the basis of a statistical sample. One specific goal
in inferential statistics involves the determination of the value of an unknown
population parameter. The range of values that we use to estimate this parameter is
called a confidence interval.

## The Form of a Confidence Interval

A confidence interval consists of two parts. The first part is the estimate of the
population parameter. We obtain this estimate by using a simple random sample.
From this sample, we calculate the statistic that corresponds to the parameter that
we wish to estimate. For example, if we were interested in the mean height of all
first-grade students in the United States, we would use a simple random sample of
U.S. first graders, measure all of them and then compute the mean height of our
sample.

The second part of a confidence interval is the margin of error. This is necessary
because our estimate alone may be different from the true value of the population
parameter. In order to allow for other potential values of the parameter, we need to
produce a range of numbers. The margin of error does this.

Estimate ± Margin of Error

The estimate is in the center of the interval, and then we subtract and add the
margin of error from this estimate to obtain a range of values for the parameter.

## Confidence Level

Attached to every confidence interval is a level of confidence. This is a probability or
percent that indicates how much certainty we should attribute to our confidence
interval.

If all other aspects of a situation are identical, the higher the confidence level the
wider the confidence interval.

This level of confidence can lead to some confusion. It is not the probability that any
one computed interval contains the population parameter. Instead, it indicates the
long-run success rate of the process used to construct confidence intervals. For example,
confidence intervals with confidence of 80% will, in the long run, miss the true population
parameter one out of every five times.
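
The long-run reading of the 80% example can be illustrated with a simulation. This is a sketch under assumptions that are not in the text: a hypothetical normal population with mean 50 and known standard deviation 8, samples of size 25, and a z-based interval.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: normal with known mean and standard deviation.
MU, SIGMA, N = 50.0, 8.0, 25
CONFIDENCE = 0.80

# Critical value z* for a two-sided interval at the chosen confidence level.
z_star = NormalDist().inv_cdf((1 + CONFIDENCE) / 2)
margin = z_star * SIGMA / N ** 0.5


def interval_misses():
    """Draw one sample, build its interval, and report whether it misses MU."""
    sample_mean = mean(random.gauss(MU, SIGMA) for _ in range(N))
    return abs(sample_mean - MU) > margin


trials = 5000
miss_rate = sum(interval_misses() for _ in range(trials)) / trials
print(round(miss_rate, 3))  # close to 0.20 in the long run
```

Each individual interval either contains the true mean or it does not. Only the proportion over many repetitions settles near one miss in five.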

Any number from zero to one could, in theory, be used for a confidence level. In
practice 90%, 95% and 99% are all common confidence levels.

## Margin of Error

The margin of error of a confidence interval is determined by a couple of factors. We
can see this by examining the formula for margin of error. A margin of error is of the
form:

Margin of Error = (Statistic for Confidence Level) × (Standard Deviation or Standard Error)

The statistic for the confidence level depends upon what probability distribution is
being used and what level of confidence we have chosen. For example, if C is our
confidence level and we are working with a normal distribution, then C is the area
under the curve between -z* and z*. This number z* is the number in our margin of
error formula.
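
As a rough sketch of how z* follows from C, the helper below (the name `z_critical` is invented for this illustration) uses `NormalDist.inv_cdf` from Python's standard library. Because the area C sits in the middle of the curve, the two tails share the remaining 1 - C equally.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z* such that the area under the standard normal curve
    between -z* and z* equals `confidence`."""
    # The two tails share the remaining area equally, so z* sits at
    # the (1 + C) / 2 quantile of the standard normal distribution.
    return NormalDist().inv_cdf((1 + confidence) / 2)


for c in (0.90, 0.95, 0.99):
    print(c, round(z_critical(c), 3))
```

The same values can be read from a printed table of z-scores. The code only saves the lookup.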

## Standard Deviation or Standard Error

The other term necessary in our margin of error is the standard deviation or
standard error. The standard deviation of the distribution that we are working with
is preferred here. However, typically parameters from the population are unknown.
This number is not usually available when forming confidence intervals in practice.

To deal with this uncertainty in knowing the standard deviation we instead use the
standard error. The standard error that corresponds to a standard deviation is an
estimate of this standard deviation. What makes the standard error so powerful is
that it is calculated from the simple random sample that is used to calculate our
estimate. No extra information is necessary as the sample does all of the estimation
for us.
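
For a sample mean, one common form of the standard error is the sample standard deviation divided by the square root of the sample size. The sketch below assumes that form. The measurements are hypothetical and the function name `standard_error` is chosen only for this example.

```python
from math import sqrt
from statistics import stdev

# Hypothetical measurements from a simple random sample.
sample = [12.1, 11.4, 12.9, 12.3, 11.8, 12.6, 11.9, 12.2]


def standard_error(values):
    """Estimate the standard deviation of the sample mean from the sample alone."""
    # stdev() is the sample standard deviation (it divides by n - 1).
    return stdev(values) / sqrt(len(values))


print(round(standard_error(sample), 3))
```

Nothing beyond the sample itself is passed in, which is the point made above: the sample does all of the estimation.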

## Different Confidence Intervals

There are a variety of different situations that call for confidence intervals.
These confidence intervals are used to estimate a number of different parameters.
Although these aspects are different, all of these confidence intervals are united by
the same overall format. Some common confidence intervals are those for a
population mean, population variance, population proportion, the difference of two
population means and the difference of two population proportions.

In inferential statistics, one of the major goals is to estimate
an unknown population parameter. You start with a statistical sample, and from
this, you can determine a range of values for the parameter. This range of values is
called a confidence interval.

## Confidence Intervals

Confidence intervals are all similar to one another in a few ways. First, many
two-sided confidence intervals have the same form:

Estimate ± Margin of Error

Second, the steps for calculating confidence intervals are very similar, regardless of
the type of confidence interval you are trying to find. The specific type of confidence
interval that will be examined below is a two-sided confidence interval for a
population mean when you know the population standard deviation. Also, assume
that you are working with a population that is normally distributed.

## Confidence Interval for a Mean With a Known Sigma

Below is a process to find the desired confidence interval. Although all of the steps
are important, the first one is particularly so:

1. Check conditions: Begin by ensuring that the conditions for your confidence interval have been met. Assume that you know the value of the population standard deviation, denoted by the Greek letter sigma σ. Also, assume a normal distribution.
2. Calculate estimate: Estimate the population parameter, in this case the population mean, by use of a statistic, which in this problem is the sample mean. This involves forming a simple random sample from the population. Sometimes, you can suppose that your sample is a simple random sample, even if it does not meet the strict definition.
3. Critical value: Obtain the critical value z* that corresponds with your confidence level. These values are found by consulting a table of z-scores or by using software. You can use a z-score table because you know the value of the population standard deviation, and you assume that the population is normally distributed. Common critical values are 1.645 for a 90-percent confidence level, 1.960 for a 95-percent confidence level, and 2.576 for a 99-percent confidence level.
4. Margin of error: Calculate the margin of error z* σ /√n, where n is the size of the simple random sample that you formed.
5. Conclude: Finish by putting together the estimate and margin of error. This can be expressed as either Estimate ± Margin of Error or as Estimate - Margin of Error to Estimate + Margin of Error. Be sure to clearly state the level of confidence that is attached to your confidence interval.
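
Written compactly, and using $\bar{x}$ for the sample mean (a symbol introduced here for illustration, not taken from the text above), the steps describe the interval:

$$
\bar{x} \pm z^{*} \frac{\sigma}{\sqrt{n}}
$$

Here $z^{*}$ is the critical value for the chosen confidence level, $\sigma$ is the known population standard deviation, and $n$ is the sample size.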

## Example

To see how you can construct a confidence interval, work through an example.
Suppose you know that the IQ scores of all incoming college freshmen are normally
distributed with standard deviation of 15. You have a simple random sample of 100
freshmen, and the mean IQ score for this sample is 120. Find a 90-percent
confidence interval for the mean IQ score for the entire population of incoming
college freshmen.

1. Check conditions: The conditions have been met since you have been told that the population standard deviation is 15 and that you are dealing with a normal distribution.
2. Calculate estimate: You have been told that you have a simple random sample of size 100. The mean IQ for this sample is 120, so this is your estimate.
3. Critical value: The critical value for confidence level of 90 percent is given by z* = 1.645.
4. Margin of error: Use the margin of error formula and obtain a margin of error of z* σ /√n = (1.645)(15) /√(100) = 2.4675.
5. Conclude: Conclude by putting everything together. A 90-percent confidence interval for the population’s mean IQ score is 120 ± 2.4675. Alternatively, you could state this confidence interval as 117.5325 to 122.4675.
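
The calculation in this example is short enough to sketch in code. The function name `z_interval` and its parameter names are chosen for this illustration only, and the critical value is passed in as looked up rather than computed.

```python
from math import sqrt


def z_interval(sample_mean, sigma, n, z_star):
    """Two-sided confidence interval for a mean when sigma is known."""
    margin = z_star * sigma / sqrt(n)  # margin of error: z* sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Values from the IQ example; z* = 1.645 is the 90-percent critical value.
low, high = z_interval(sample_mean=120, sigma=15, n=100, z_star=1.645)
print(f"{low:.4f} to {high:.4f}")
```

Swapping in 1.960 or 2.576 for `z_star` would give the 95-percent and 99-percent intervals for the same sample, each wider than the last.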

## Practical Considerations

Confidence intervals of the above type are not very realistic. It is very rare to know
the population standard deviation but not know the population mean. There are
ways that this unrealistic assumption can be removed.

While you have assumed a normal distribution, this assumption does not need to
hold. Well-behaved samples, which exhibit no strong skewness and have no outliers, along
with a large enough sample size, allow you to invoke the central limit theorem.

As a result, you are justified in using a table of z-scores, even for populations that are
not normally distributed.

Inferential statistics concerns the process of beginning with a statistical sample and
then arriving at the value of a population parameter that is unknown. The unknown
value is not determined directly. Rather we end up with an estimate that falls into a
range of values. This range is known in mathematical terms as an interval of real
numbers, and is specifically referred to as a confidence interval.

Confidence intervals are all similar to one another in a few ways. Two-sided
confidence intervals all have the same form:

Estimate ± Margin of Error

Similarities in confidence intervals also extend to the steps used to calculate
confidence intervals. We will examine how to determine a two-sided confidence
interval for a population mean when the population standard deviation is unknown.
An underlying assumption is that we are sampling from a normally distributed
population.

## Process for Confidence Interval for Mean – Unknown Sigma

We will work through a list of steps required to find our desired confidence interval.
Although all of the steps are important, the first one is particularly so:

1. Check Conditions: Begin by making sure that the conditions for our confidence interval have been met. We assume that the value of the population standard deviation, denoted by the Greek letter sigma σ, is unknown and that we are working with a normal distribution. We can relax the assumption that we have a normal distribution as long as our sample is large enough and has no outliers or extreme skewness.
2. Calculate Estimate: We estimate our population parameter, in this case the population mean, by use of a statistic, in this case the sample mean. This involves forming a simple random sample from our population. Sometimes we can suppose that our sample is a simple random sample, even if it does not meet the strict definition.
3. Critical Value: We obtain the critical value t* that corresponds with our confidence level. These values are found by consulting a table of t-scores or by using software. If we use a table, we will need to know the number of degrees of freedom. The number of degrees of freedom is one less than the number of individuals in our sample.
4. Margin of Error: Calculate the margin of error t*s /√n, where n is the size of the simple random sample that we formed and s is the sample standard deviation, which we obtain from our statistical sample.
5. Conclude: Finish by putting together the estimate and margin of error. This can be expressed as either Estimate ± Margin of Error or as Estimate - Margin of Error to Estimate + Margin of Error. In the statement of our confidence interval it is important to indicate the level of confidence. This is just as much a part of our confidence interval as numbers for the estimate and margin of error.
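
In symbols, with $\bar{x}$ standing for the sample mean (notation assumed here for illustration), the steps describe the interval:

$$
\bar{x} \pm t^{*} \frac{s}{\sqrt{n}}, \qquad \text{degrees of freedom} = n - 1
$$

Here $t^{*}$ is the critical value from the t distribution, $s$ is the sample standard deviation, and $n$ is the sample size.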

## Example

To see how we can construct a confidence interval, we will work through an example.
Suppose we know that the heights of a specific species of pea plants are normally
distributed. A simple random sample of 30 pea plants has a mean height of 12 inches
with a sample standard deviation of 2 inches.

What is a 90% confidence interval for the mean height for the entire population of
pea plants?

1. Check Conditions: The conditions have been met as the population standard deviation is unknown and we are dealing with a normal distribution.
2. Calculate Estimate: We have been told that we have a simple random sample of 30 pea plants. The mean height for this sample is 12 inches, so this is our estimate.
3. Critical Value: Our sample has size of 30, and so there are 29 degrees of freedom. The critical value for confidence level of 90% is given by t* = 1.699.
4. Margin of Error: Now we use the margin of error formula and obtain a margin of error of t*s /√n = (1.699)(2) /√(30) = 0.620.
5. Conclude: We conclude by putting everything together. A 90% confidence interval for the population’s mean height is 12 ± 0.62 inches. Alternatively we could state this confidence interval as 11.38 inches to 12.62 inches.

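
The pea plant numbers can be reproduced with a few lines of code. In this sketch the function name `t_interval` is illustrative, and the critical value t* is passed in as read from a t table, since Python's standard library has no t distribution.

```python
from math import sqrt


def t_interval(sample_mean, s, n, t_star):
    """Two-sided confidence interval for a mean when sigma is unknown.

    t_star must be looked up for n - 1 degrees of freedom.
    """
    margin = t_star * s / sqrt(n)  # margin of error: t* s / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Values from the pea plant example; t* = 1.699 for 29 degrees of freedom.
low, high = t_interval(sample_mean=12, s=2, n=30, t_star=1.699)
print(f"{low:.2f} to {high:.2f} inches")
```

Here the sample standard deviation `s` is used where a known σ would otherwise go, and the critical value comes from a t table instead of a z table.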
## Practical Considerations

Confidence intervals of the above type are more realistic than other types that can be
encountered in a statistics course. It is very rare to know the population standard
deviation but not know the population mean. Here we assume that we do not know
either of these population parameters.

Bootstrapping is a statistical technique that falls under the broader heading of
resampling. This technique involves a relatively simple procedure that is repeated so
many times that it is heavily dependent upon computer calculations. Bootstrapping
provides a method other than confidence intervals to estimate a population
parameter. Bootstrapping very much seems to work like magic. Read on to see how it
obtains its interesting name.

## An Explanation of Bootstrapping

One goal of inferential statistics is to determine the value of a parameter of a
population. It is typically too expensive or even impossible to measure this directly.
So we use statistical sampling. We sample a population, measure a statistic of this
sample, and then use this statistic to say something about the corresponding
parameter of the population.

For example, in a chocolate factory, we might want to guarantee that candy bars
have a particular mean weight. It’s not feasible to weigh every candy bar that is
produced, so we use sampling techniques to randomly choose 100 candy bars. We
calculate the mean of these 100 candy bars and say that the population mean falls
within a margin of error from what the mean of our sample is.

Suppose that a few months later we want to know with greater accuracy, or less of a
margin of error, what the mean candy bar weight was on the day that we sampled
the production line.

We cannot use today’s candy bars, as too many variables have entered the picture
(different batches of milk, sugar and cocoa beans, different atmospheric conditions,
different employees on the line, etc.). All that we have from the day that we are
curious about are the 100 weights. Without a time machine back to that day, it would
seem that the initial margin of error is the best that we can hope for.

Fortunately, we can use the technique of bootstrapping. In this situation, we
randomly sample with replacement from the 100 known weights. We then call this a
bootstrap sample. Since we allow for replacement, this bootstrap sample is most likely
not identical to our initial sample. Some data points may be duplicated, and other
data points from the initial 100 may be omitted in a bootstrap sample. With the help
of a computer, thousands of bootstrap samples can be constructed in a relatively
short time.
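
The resampling step can be sketched in a few lines. The code below uses a small illustrative sample of five values instead of the 100 candy bar weights, which are not given. The percentile interval at the end is one common way to summarize the bootstrap means and is an assumption of this sketch, not a method described in the text.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the sketch is repeatable

sample = [2, 4, 5, 6, 6]  # small illustrative sample


def bootstrap_sample(data):
    """Draw len(data) values from data at random, with replacement."""
    return random.choices(data, k=len(data))


# A few individual bootstrap samples, sorted for readability.
for _ in range(3):
    print(sorted(bootstrap_sample(sample)))

# Many bootstrap samples give a picture of how the sample mean varies.
boot_means = sorted(mean(bootstrap_sample(sample)) for _ in range(10_000))
low = boot_means[len(boot_means) * 5 // 100]
high = boot_means[len(boot_means) * 95 // 100 - 1]
print(low, high)  # rough 90% percentile interval for the mean
```

Because values are drawn with replacement, some of the printed samples repeat a value and leave another out, exactly as described above.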

## An Example

As mentioned, to truly use bootstrap techniques we need to use a computer. The
following numerical example will help to demonstrate how the process works. If we
begin with the sample 2, 4, 5, 6, 6, then all of the following are possible bootstrap
samples:

- 2, 5, 5, 6, 6
- 4, 5, 6, 6, 6
- 2, 2, 4, 5, 5
- 2, 2, 2, 4, 6
- 2, 2, 2, 2, 2
- 4, 6, 6, 6, 6

## History of the Technique

Bootstrap techniques are relatively new to the field of statistics. The first use was
published in a 1979 paper by Bradley Efron. As computing power has increased and
become less expensive, bootstrap techniques have become more widespread.

## Why the Name Bootstrapping?

The name “bootstrapping” comes from the phrase, “To lift himself up by his
bootstraps.” This refers to something that is preposterous and impossible.

Try as hard as you can, you cannot lift yourself into the air by tugging at pieces of
leather on your boots.

There is some mathematical theory that justifies bootstrapping techniques.
However, the use of bootstrapping does feel like you are doing the impossible.
Although it does not seem like you would be able to improve upon the estimate of a
population parameter by reusing the same sample over and over again, bootstrapping
can, in fact, do this.