
Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis of
virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics.
With descriptive statistics you are simply describing what is or what the
data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer from the sample data what the
population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups
is a dependable one or one that might have happened by chance in this
study. Thus, we use inferential statistics to make inferences from our
data to more general conditions; we use descriptive statistics simply to
describe what's going on in our data.
Descriptive Statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures.
Or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a
sensible way. Each descriptive statistic reduces lots of data into a
simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting
average. This single number is simply the number of hits divided by the
number of times at bat (reported to three significant digits). A batter who
is hitting .333 is getting a hit one time in every three at bats. One
batting .250 is hitting one time in four. The single number describes a
large number of discrete events. Or, consider the scourge of many
students, the Grade Point Average (GPA). This single number describes
the general performance of a student across a potentially wide range of
course experiences.

Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing important
detail. The batting average doesn't tell you whether the batter is hitting
home runs or singles. It doesn't tell whether she's been in a slump or on
a streak. The GPA doesn't tell you whether the student was in difficult
courses or easy ones, or whether they were courses in their major field
or in other disciplines. Even given these limitations, descriptive statistics
provide a powerful summary that may enable comparisons across
people or other units.

Univariate Analysis
Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:

the distribution

the central tendency

the dispersion

In most situations, we would describe all three of these characteristics
for each of the variables in our study.
The Distribution. The distribution is a summary of the frequency of
individual values or ranges of values for a variable. The simplest
distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the
distribution of college students is by year in college, listing the number or
percent of students at each of the four years. Or, we describe gender by
listing the number or percent of males and females. In these cases, the
variable has few enough values that we can list each one and
summarize how many sample cases had the value. But what do we do
for a variable like income or GPA? With these variables there can be a

large number of possible values, with relatively few people having each
one. In this case, we group the raw scores into categories according to
ranges of values. For instance, we might look at GPA according to the
letter grade ranges. Or, we might group income into four or five ranges of
income values.
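Grouping raw values into ranges before counting can be sketched in a few lines of Python; the GPA values and the band cutoffs below are purely hypothetical examples, not data from any real study:

```python
from collections import Counter

# Hypothetical GPA values for a small sample of students
gpas = [3.7, 2.1, 3.3, 2.8, 3.9, 1.9, 2.5, 3.1, 2.9, 3.5]

# Group raw scores into letter-grade ranges rather than counting each value
def grade_band(gpa):
    if gpa >= 3.5:
        return "A (3.5-4.0)"
    elif gpa >= 2.5:
        return "B (2.5-3.4)"
    elif gpa >= 1.5:
        return "C (1.5-2.4)"
    else:
        return "D or below"

# Count how many sample cases fall in each range
distribution = Counter(grade_band(g) for g in gpas)
for band, count in sorted(distribution.items()):
    print(band, count)
```

The same pattern works for income or any other variable with many distinct values: pick the range boundaries first, then tally.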

Table 1. Frequency distribution table.

One of the most common ways to describe a single variable is with
a frequency distribution. Depending on the particular variable, all of
the data values may be represented, or you may group the values into
categories first (e.g., with age, price, or temperature variables, it would
usually not be sensible to determine the frequencies for each value;
rather, the values are grouped into ranges and the frequencies
determined). Frequency distributions can be depicted in two ways, as a
table or as a graph. Table 1 shows an age frequency distribution with five
categories of age ranges defined. The same frequency distribution can
be depicted in a graph as shown in Figure 2. This type of graph is often
referred to as a histogram or bar chart.

Figure 2. Frequency distribution bar chart.

Distributions may also be displayed using percentages. For example,
you could use percentages to describe the:

percentage of people in different income levels

percentage of people in different age ranges

percentage of people in different ranges of standardized test scores

Central Tendency. The central tendency of a distribution is an estimate
of the "center" of a distribution of values. There are three major types of
estimates of central tendency:
estimates of central tendency:

Mean

Median

Mode

The Mean or average is probably the most commonly used method of
describing central tendency. To compute the mean all you do is add up
all the values and divide by the number of values. For example, the
mean or average quiz score is determined by summing all the scores
and dividing by the number of students taking the quiz. As an example,
consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
The Median is the score found at the exact middle of the set of values.
One way to compute the median is to list all scores in numerical order,
and then locate the score in the center of the sample. For example, if
there are 500 scores in the list, score #250 would be the median. If we
order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
There are 8 scores and score #4 and #5 represent the halfway point.
Since both of these scores are 20, the median is 20. If the two middle
scores had different values, you would have to interpolate to determine
the median.
The mode is the most frequently occurring value in the set of scores. To
determine the mode, you might again order the scores as shown above,
and then count each one. The most frequently occurring value is the
mode. In our example, the value 15 occurs three times and is the mode.
In some distributions there is more than one modal value. For instance,
in a bimodal distribution there are two values that occur most frequently.
Notice that for the same set of 8 scores we got three different values --
20.875, 20, and 15 -- for the mean, median and mode respectively. If the
distribution is truly normal (i.e., bell-shaped), the mean, median and
mode are all equal to each other.
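All three estimates can be computed with Python's standard-library statistics module; a minimal sketch using the same eight test scores:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0 (average of the two middle 20s)
print(statistics.mode(scores))    # 15 (occurs three times)
```

Note that statistics.median sorts the values internally, so the scores do not need to be pre-ordered.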
Dispersion. Dispersion refers to the spread of the values around the
central tendency. There are two common measures of dispersion, the

range and the standard deviation. The range is simply the highest value
minus the lowest value. In our example distribution, the high value is 36
and the low is 15, so the range is 36 - 15 = 21.
The Standard Deviation is a more accurate and detailed estimate of
dispersion because an outlier can greatly exaggerate the range (as was
true in this example, where the single outlier value of 36 stands apart
from the rest of the values). The Standard Deviation shows the relation
that the set of scores has to the mean of the sample. Again, let's take the set
of scores:
15,20,21,20,36,15,25,15
To compute the standard deviation, we first find the distance between
each value and the mean. We know from above that the mean is 20.875.
So, the differences from the mean are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

Notice that values that are below the mean have negative discrepancies
and values above it have positive ones. Next, we square each
discrepancy:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

Now, we take these "squares" and sum them to get the Sum of Squares
(SS) value. Here, the sum is 350.875. Next, we divide this sum by the
number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This
value is known as the variance. To get the standard deviation, we take
the square root of the variance (remember that we squared the
deviations earlier). This would be SQRT(50.125) = 7.079901129253.
Although this computation may seem convoluted, it's actually quite
simple. To see this, consider the formula for the standard deviation:

s = sqrt( sum of (X - M)^2 / (N - 1) )

where X is each score, M is the mean, and N is the number of scores.
In the top part of the ratio, the numerator, we see that each score has
the mean subtracted from it, the difference is squared, and the
squares are summed. In the bottom part, we take the number of scores
minus 1. The ratio is the variance and the square root is the standard
deviation. In English, we can describe the standard deviation as:
the square root of the sum of the squared deviations from the mean
divided by the number of scores minus one

Although we can calculate these univariate statistics by hand, it gets
quite tedious when you have more than a few values and variables.
Every statistics program is capable of calculating them easily for you.
For instance, I put the eight scores into SPSS and got the following table
as a result:
N                8
Mean             20.8750
Median           20.0000
Mode             15.00
Std. Deviation   7.0799
Variance         50.1250
Range            21.00

which confirms the calculations I did by hand above.


The standard deviation allows us to reach some conclusions about
specific scores in our distribution. Assuming that the distribution of
scores is normal or bell-shaped (or close to it!), the following conclusions
can be reached:

approximately 68% of the scores in the sample fall within one standard deviation of the mean

approximately 95% of the scores in the sample fall within two standard deviations of the mean

approximately 99% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard
deviation is 7.0799, we can from the above statement estimate that
approximately 95% of the scores will fall in the range of
20.875 - (2*7.0799) to 20.875 + (2*7.0799), or between 6.7152 and 35.0348. This
kind of information is a critical stepping stone to enabling us to compare
the performance of an individual on one variable with their performance
on another, even when the variables are measured on entirely different
scales.
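As a sketch, the 95% interval above can be reproduced directly from the eight scores with Python's statistics module:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = statistics.mean(scores)  # 20.875
sd = statistics.stdev(scores)   # ~7.0799

# For a roughly normal distribution, about 95% of values fall
# within two standard deviations of the mean
low, high = mean - 2 * sd, mean + 2 * sd
print(round(low, 4), round(high, 4))  # 6.7152 35.0348
```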

Descriptive statistics
From Wikipedia, the free encyclopedia

Descriptive statistics is the discipline of quantitatively describing the main features of a collection
of information, or the quantitative description itself. Descriptive statistics are distinguished
from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize
a sample, rather than use the data to learn about the population that the sample of data is thought to
represent. This generally means that descriptive statistics, unlike inferential statistics, are not
developed on the basis of probability theory. Even when a data analysis draws its main conclusions
using inferential statistics, descriptive statistics are generally also presented. For example in a paper
reporting on a study involving human subjects, there typically appears a table giving the
overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure
group), and demographic or clinical characteristics such as the average age, the proportion of
subjects of each sex, and the proportion of subjects with related comorbidities.[1][2]
Some measures that are commonly used to describe a data set are measures of central
tendency and measures of variability or dispersion. Measures of central tendency include
the mean, median and mode, while measures of variability include the standard
deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]


Use in statistical analysis


Descriptive statistics provides simple summaries about the sample and about the observations that
have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e.
simple-to-understand graphs. These summaries may either form the basis of the initial description of
the data as part of a more extensive statistical analysis, or they may be sufficient in and of
themselves for a particular investigation.

For example, the shooting percentage in basketball is a descriptive statistic that summarizes the
performance of a player or a team. This number is the number of shots made divided by the number
of shots taken. For example, a player who shoots 33% is making approximately one shot in every
three. The percentage summarizes or describes multiple discrete events. Consider also the grade
point average. This single number describes the general performance of a student across the range
of their course experiences.[4]

The use of descriptive and summary statistics has an extensive history and, indeed, the simple
tabulation of populations and of economic data was the first way the topic of statistics appeared.
More recently, a collection of summarisation techniques has been formulated under the heading
of exploratory data analysis: an example of such a technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For
example, investors and brokers may use a historical account of return behavior by performing
empirical and analytical analyses on their investments in order to make better investing decisions in
the future.

Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central
tendency (including the mean, median, and mode) and dispersion (including
the range and quantiles of the data-set, and measures of spread such as the variance and standard
deviation). The shape of the distribution may also be described via indices such
as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf display.

Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe
the relationship between pairs of variables. In this case, descriptive statistics include:

Cross-tabulations and contingency tables

Graphical representation via scatterplots

Quantitative measures of dependence

Descriptions of conditional distributions

The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not
only simple descriptive analysis, but also it describes the relationship between two different
variables. Quantitative measures of dependence include correlation (such as Pearson's r when both
variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects
the scale variables are measured on). The slope, in regression analysis, also reflects the relationship
between variables. The unstandardised slope indicates the unit change in the criterion variable for a
one unit change in the predictor. The standardised slope indicates this change in standardised (z[5]

score) units. Highly skewed data are often transformed by taking logarithms. Use of logarithms
makes graphs more symmetrical and look more similar to the normal distribution, making them
easier to interpret intuitively.
[6]:47

Descriptive statistics
Descriptive statistics provide a concise summary of data. You can summarize data numerically or graphically. For
example, the manager of a fast food restaurant tracks the wait times for customers during the lunch hour for a
week and summarizes the data.
The manager calculates the following numeric descriptive statistics:

Mean

Standard deviation

Range

N (sample size)
The manager examines the following graphs to visualize the wait times:

Histogram of wait times

Boxplot of wait times



Inferential statistics
Inferential statistics use a random sample of data taken from a population to describe and make inferences about
the population. Inferential statistics are valuable when it is not convenient or possible to examine each member of
an entire population. For example, it is impractical to measure the diameter of each nail that is manufactured in a
mill, but you can measure the diameters of a representative random sample of nails and use that information to
make generalizations about the diameters of all the nails produced.

Descriptive and Inferential Statistics


When analysing data, such as the marks achieved by 100 students for a piece of
coursework, it is possible to use both descriptive and inferential statistics in your
analysis of their marks. Typically, in most research conducted on groups of people,
you will use both descriptive and inferential statistics to analyse your results and draw
conclusions. So what are descriptive and inferential statistics? And what are their
differences?

Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show
or summarize data in a meaningful way such that, for example, patterns might emerge
from the data. Descriptive statistics do not, however, allow us to make conclusions
beyond the data we have analysed or reach conclusions regarding any hypotheses we
might have made. They are simply a way to describe our data.

Descriptive statistics are very important because if we simply presented our raw data
it would be hard to visualize what the data was showing, especially if there was a lot
of it. Descriptive statistics therefore enables us to present the data in a more
meaningful way, which allows simpler interpretation of the data. For example, if we
had the results of 100 pieces of students' coursework, we may be interested in the
overall performance of those students. We would also be interested in the distribution
or spread of the marks. Descriptive statistics allow us to do this. How to properly
describe data through statistics and graphs is an important topic and discussed in other
Laerd Statistics guides. Typically, there are two general types of statistic that are used
to describe data:
Measures of central tendency: these are ways of describing the central
position of a frequency distribution for a group of data. In this case, the
frequency distribution is simply the distribution and pattern of marks scored by
the 100 students from the lowest to the highest. We can describe this central
position using a number of statistics, including the mode, median, and mean.
You can read about measures of central tendency here.

Measures of spread: these are ways of summarizing a group of data by


describing how spread out the scores are. For example, the mean score of our
100 students may be 65 out of 100. However, not all students will have scored
65 marks. Rather, their scores will be spread out. Some will be lower and others
higher. Measures of spread help us to summarize how spread out these scores
are. To describe this spread, a number of statistics are available to us, including
the range, quartiles, absolute deviation, variance and standard deviation.
When we use descriptive statistics it is useful to summarize our group of data using a
combination of tabulated description (i.e., tables), graphical description (i.e., graphs
and charts) and statistical commentary (i.e., a discussion of the results).

Inferential Statistics
We have seen that descriptive statistics provide information about our immediate
group of data. For example, we could calculate the mean and standard deviation of the
exam marks for the 100 students and this could provide valuable information about
this group of 100 students. Any group of data like this, which includes all the data you
are interested in, is called a population. A population can be small or large, as long as

it includes all the data you are interested in. For example, if you were only interested
in the exam marks of 100 students, the 100 students would represent your population.
Descriptive statistics are applied to populations, and the properties of populations, like
the mean or standard deviation, are called parameters as they represent the whole
population (i.e., everybody you are interested in).
Often, however, you do not have access to the whole population you are interested in
investigating, but only a limited number of data instead. For example, you might be
interested in the exam marks of all students in the UK. It is not feasible to measure all
exam marks of all students in the whole of the UK so you have to measure a
smaller sample of students (e.g., 100 students), which are used to represent the larger
population of all UK students. Properties of samples, such as the mean or standard
deviation, are not called parameters, but statistics. Inferential statistics are techniques
that allow us to use these samples to make generalizations about the populations from
which the samples were drawn. It is, therefore, important that the sample accurately
represents the population. The process of achieving this is called sampling (sampling
strategies are discussed in detail here on our sister site). Inferential statistics arise out
of the fact that sampling naturally incurs sampling error and thus a sample is not
expected to perfectly represent the population. The methods of inferential statistics are
(1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
We have provided some answers to common FAQs on the next page. Alternatively,
why not now read our guide on Types of Variable?

FAQs - Descriptive and Inferential Statistics


What are the similarities between descriptive and inferential statistics?
Both descriptive and inferential statistics rely on the same set of data. Descriptive statistics rely solely on this
set of data, whilst inferential statistics also rely on this data in order to make generalisations about a larger
population.

What are the strengths of using descriptive statistics to examine a distribution of


scores?
Beyond their ability to summarize large volumes of data clearly, descriptive statistics leave no
uncertainty about the values you get (other than measurement error, etc.).

What are the limitations of descriptive statistics?


Descriptive statistics are limited in so much that they only allow you to make summations about the people or
objects that you have actually measured. You cannot use the data you have collected to generalize to other
people or objects (i.e., using data from a sample to infer the properties/parameters of a population). For
example, if you tested a drug to beat cancer and it worked in your patients, you cannot claim that it would work
in other cancer patients only relying on descriptive statistics (but inferential statistics would give you this
opportunity).


What are the limitations of inferential statistics?


There are two main limitations to the use of inferential statistics. The first, and most important limitation, which
is present in all inferential statistics, is that you are providing data about a population that you have not fully
measured, and therefore, cannot ever be completely sure that the values/statistics you calculate are correct.
Remember, inferential statistics are based on the concept of using the values measured in a sample to
estimate/infer the values that would be measured in a population; there will always be a degree of uncertainty
in doing this. The second limitation is connected with the first limitation. Some, but not all, inferential tests
require the user (i.e., you) to make educated guesses (based on theory) to run the inferential tests. Again, there
will be some uncertainty in this process, which will have repercussions on the certainty of the results of some
inferential statistics.
Why not now read our guide on Types of Variable?

Examples of Descriptive Statistics


MAY 6, 2014 BY APRIL KLAZEMA

In statistics, data is everything. When you collect your data, you can make a conclusion based on
how you use it. Calculating things such as the range, median, and mode of your set of data is all a
part of descriptive statistics. Descriptive statistics can be difficult to deal with when you're dealing
with a large set of data, but the amount of work done for each equation is actually pretty simple.
Learning statistics can be a great asset for you in the work world. In fact, people who master statistics
can get high level jobs, such as an actuary. If you want to start learning more about statistics and
what it can be applied for, check out the Udemy course Introductory Statistics Part 1:
Descriptive Statistics. The following examples will help you understand what descriptive statistics
is and how to utilize it to draw conclusions.

What is Descriptive Statistics?


When put in its simplest terms, descriptive statistics is pretty easy to understand. Descriptive
statistics helps you describe and summarize the data that you have set out before you. You can make
conclusions with that data. When you make these conclusions, they are called parameters. This is a
lot different than conclusions made with inferential statistics, which are called statistics.
Descriptive statistics involves all of the data from a given set, which is also known as a population.
With this form of statistics, you don't make any conclusions beyond what you're given in the set of
data. For example, if you have a data set that involves 20 students in class, you can find the average
of that data set for those 20 students, but you can't find what the possible average is for all the
students in the school using just that data.
Within descriptive statistics there are two key types, and in those types you will find the different
forms of measurements that you will perform with the data that you have.
Descriptive statistics has a lot of variations, and it's all used to help make sense of raw data. Without
descriptive statistics the data that we have would be hard to summarize, especially when it is on the
large side. Imagine finding the mean or the average of hundreds of thousands of numbers for
statistical analysis.

There are simpler ways to do descriptive statistics, such as with computer software. The Udemy
course Descriptive Statistics in SPSS is a great tool to help you with descriptive statistics for
incredibly large amounts.

Exploring the Two Types of Descriptive Statistics


The first type of descriptive statistics that we will discuss is the measure of central tendency. These
are the different ways in which we describe a group based on its central frequency. There are several
ways in which we describe this central position, such as with the median, mean and mode.
When performing statistics, you will find yourself discovering the median, mean, and mode for
various sets of data. You're probably already familiar with discovering the mean of a number, which
is also commonly known as the average, but the median and mode are important as well.

Examples of Finding the Median, Mean, and Mode


It's easy to perform the arithmetic for the mean, median, and mode. In fact, for many of these forms
of descriptive statistics, you don't have to do any arithmetic at all.
For example, finding the median is simply discovering what number falls in the middle of a set. So
let's look at a set of data with 5 numbers: 27, 54, 13, 81, and 6.
Now you would think that the median would be 13, since it sits in the middle of the data set, but this
isn't the case. An important thing to remember about the median is that it can only be found once
you've rearranged the data in order from smallest to largest.
When you rearrange this data set, the order of the numbers becomes 6, 13, 27, 54, and 81. Now the
median number is 27 and not 13.
Another important thing to remember about the median is when you have an even number in your
data set. When the set is even, you take the two numbers that sit in the middle, add them together
and then divide them by two. Your result is the answer.
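Both the odd-count and even-count rules can be checked with Python's statistics.median; a quick sketch using the numbers above plus a hypothetical even-count set:

```python
import statistics

# Odd count: median is the middle value once sorted
data = [27, 54, 13, 81, 6]
print(statistics.median(data))  # 27 (sorts internally: 6, 13, 27, 54, 81)

# Even count: median is the average of the two middle values
even_data = [6, 13, 27, 54]
print(statistics.median(even_data))  # (13 + 27) / 2 = 20.0
```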
Let's add onto the data set from above to find the mode. 6, 6, 13, 27, 53, 53, 53, 81, and 93 will be the
numbers for this data set. Notice that some of the numbers repeat. When you're finding the mode for
a set of numbers, the mode is the number in the data set that appears the most times. In this
instance, 53 is the mode since it appears 3 times in the data set, which is more than any of the other
numbers.
A key factor to remember about data sets is that they should always be placed in order. Finding the
mode was pretty simple in this instance, but if the numbers were scrambled like before, things would
be a lot more difficult. Ordering the numbers is the first thing you should do when you're doing any
sort of descriptive statistics.
The final part of descriptive statistics that you will learn about is finding the mean, or the average.
The average is the sum of all the numbers in the data set divided by how many numbers are in that
set. Let's look at the following data set: 6, 7, 13, 15, 18, 21, 21, and 25 will be the data set that we
use to find the mean. Now in this data set there are 8 numbers.
The first thing we will do is add together all of the numbers within the set.

6 + 7 + 13 + 15 + 18 + 21 + 21 + 25 = 126
Now we divide 126 by the number of numbers in the set, 8, and we get the result. You should have
gotten 15.75 as the mean for this set of data.
In terms of measures of central tendency, this is all there is to descriptive statistics. To make it
easier, you can try to learn about the different statistics formulas for mean, median, and
mode.

The Second Type of Descriptive Statistics


The other type of descriptive statistics is known as the measures of spread. This type of statistics is
used to analyze the way the data spreads out, such as noticing that more of the students in a class got
scores in the 80s than in any other range.
One of the most common types of measure of spread is known as the range. The range is incredibly
simple to calculate, and it requires just the basic knowledge of math. To calculate the range, simply
take the largest number in the data set and subtract the smallest from it. For example, in the set we
used to find the average, we will find the range.
25 - 6 = 19
That's the range for the entire set of data. It's as easy as that. There are other forms of measures of
spread, such as absolute and standard deviation. If you want to learn more about these types of
statistics, then check out the Workshop in Probability and Statistics.

Understanding All Forms of Statistics


Descriptive statistics is only one type. There are several forms of statistical analysis you can perform,
such as inferential statistics, which is used to predict what the data may be in the future. Take your
first step in inferential statistics by checking out the Udemy course Inferential Statistics in SPSS.

Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.[1] In applying statistics to, e.g., a scientific, industrial, or societal
problem, it is conventional to begin with a statistical population or a statistical
model process to be studied. Populations can be diverse topics such as "all persons
living in a country" or "every atom composing a crystal". Statistics deals with all
aspects of data including the planning of data collection in terms of the design
of surveys and experiments.[1]
When census data cannot be collected, statisticians collect data by developing specific
experiment designs and survey samples. Representative sampling assures that
inferences and conclusions can safely extend from the sample to the population as a
whole. An experimental study involves taking measurements of the system under

study, manipulating the system, and then taking additional measurements using the
same procedure to determine if the manipulation has modified the values of the
measurements. In contrast, an observational study does not involve experimental
manipulation.
Two main statistical methodologies are used in data analysis: descriptive statistics,
which summarizes data from a sample using indexes such as the mean or standard
deviation, and inferential statistics, which draws conclusions from data that are
subject to random variation (e.g., observational errors, sampling variation).[2]
Descriptive statistics are most often concerned with two sets of properties of
a distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion (or variability)
characterizes the extent to which members of the distribution depart from its center
and each other. Inferences on mathematical statistics are made under the framework
of probability theory, which deals with the analysis of random phenomena.
A standard statistical procedure involves testing the relationship between two
statistical data sets, or between a data set and synthetic data drawn from an idealized model. A
hypothesis is proposed for the statistical relationship between the two data sets, and
this is compared as an alternative to an idealized null hypothesis of no relationship
between two data sets. Rejecting or disproving the null hypothesis is done using
statistical tests that quantify the sense in which the null can be proven false, given the
data that are used in the test. Working from a null hypothesis, two basic forms of error
are recognized: Type I errors (null hypothesis is falsely rejected giving a "false
positive") and Type II errors (null hypothesis fails to be rejected and an actual
difference between populations is missed giving a "false negative"). [3] Multiple
problems have come to be associated with this framework, ranging from obtaining a
sufficient sample size to specifying an adequate null hypothesis.
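The meaning of a Type I error can be illustrated with a small simulation. The sketch below (an illustrative example, not taken from any source cited here) repeatedly draws two samples from the *same* distribution, so the null hypothesis of "no difference" is true by construction, and counts how often a two-sided z-test falsely rejects it. The false-rejection rate should come out close to the chosen significance level:

```python
import random
import statistics

random.seed(0)
norm = statistics.NormalDist()        # standard normal distribution

n, trials, alpha = 50, 2000, 0.05
z_crit = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for a two-sided test

false_rejections = 0
for _ in range(trials):
    # Both groups are drawn from the SAME normal distribution,
    # so any "significant" difference is a false positive.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # z-statistic for the difference in means (known sigma = 1,
    # so the difference of means has standard deviation sqrt(2/n)).
    z = (statistics.mean(a) - statistics.mean(b)) / (2 / n) ** 0.5
    if abs(z) > z_crit:
        false_rejections += 1         # a Type I error

type1_rate = false_rejections / trials
print(type1_rate)                     # should be close to alpha (0.05)
```

A Type II error rate could be estimated the same way by drawing the two groups from distributions with genuinely different means and counting how often the test fails to reject.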
Measurement processes that generate statistical data are also subject to error. Many of
these errors are classified as random (noise) or systematic (bias), but other important
types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also
be important. The presence of missing data and/or censoring may result in biased
estimates and specific techniques have been developed to address these problems.
Statistics can be said to have begun in ancient civilization, going back at least to the
5th century BC, but it was not until the 18th century that it started to draw more
heavily from calculus and probability theory. Statistics continues to be an area of
active research, for example on the problem of how to analyze Big data.
Important Features of Health Statistics and Data Sets
The Importance of Health Data
Health statistics and data are important because they measure a wide range of health indicators for a community. A community can
be the entire United States, a region, state, county, or city. Health data provide comparisons for clinical studies, can be used to
assess costs of health care, can help identify needed prevention targets for such programs as Healthy People 2010, and are
important for program planning and evaluation by establishing a baseline against which to measure progress in the evaluation phase.

The Context of Health Statistics


Health statistics are influenced by an organization's perspective and bias. These biases can affect the collection instrument and the eventual
outcomes that are reported. They also can determine what data are collected and how the data are collected. Whenever possible,
read the notes describing the reasons for and methods of data collection. Remember that statistics are collected to meet the needs
of the collector.
The populations covered by different data collections systems may not be the same. Data on vital statistics and national
expenditures cover the entire population. Most data on morbidity and utilization of health resources cover only the civilian noninstitutionalized population.
Some information is collected in more than one survey and estimates of the same statistic may vary among surveys. For example,
the National Health Interview Survey, the National Survey on Drug Use and Health, the Monitoring the Future Survey, and the Youth
Risk Behavior Survey all measure cigarette use. But estimates of cigarette use may differ among these surveys because of different
survey methodologies, sampling frames, questionnaires, definitions, and tabulation categories.

Key Features of Health Statistics


Health statistics are population based, and many are collected and analyzed over time. Statistics often use geographic regions such
as zip codes for determining health care coverage and comparisons of specific disease occurrences. Most studies focus on variation
over time, space, and social group.

Health Statistics Come from Diverse Sources


Many studies use administrative data. Administrative data, according to the Centers for Medicare and Medicaid Services (CMS),
include enrollment or eligibility information, claims information, and managed care encounters. The claims and encounters may be
for hospital and other facility services, professional services, prescription drug services, laboratory services, or other services.
Surveys are designed to collect specific data and are often conducted by trained personnel who administer them by telephone or in person.
CDC states that public health surveillance is the systematic collection, analysis, interpretation, and dissemination of health data on
an ongoing basis, to gain knowledge of the pattern of disease occurrence and potential in a community in order to control and
prevent disease in the community.

In this International Year of Statistics, I'd like to describe the major role of statistics in
public health advances. In our modern society, it is sometimes difficult to recall the
huge advances in health and medicine in the 20th century. To name a few: penicillin
was discovered in 1928, risk factors for heart attacks and stroke were established in
the 1950s, and vaccines were created throughout the latter half of the century to
prevent diseases that once killed thousands of children annually.

A few weeks ago, SAS was fortunate to receive a visit from Christy Chuang-Stein,
Vice President and Head of Statistical Research and Consulting Center at Pfizer, and a
candidate for president of the American Statistical Association. One of Christy's slides
mentioned that the Centers for Disease Control and Prevention (CDC) published a list
of the "Top 10 Great Public Health Achievements in the 20th Century." The CDC
articles are fascinating but lengthy, so let me give you the executive summary and
simultaneously emphasize the role of statistics in a few of these achievements:
1. Routine immunization of children: During the 20th century, researchers
developed vaccines that prevent smallpox, measles, polio, and other diseases.
The safety and efficacy of these life-saving vaccines were tested by using
statistically designed clinical trials and statistical quality control during
manufacturing.
Many of these studies were conducted out of the public's eye, but in 1954 there
was a massive public trial to test Jonas Salk's polio vaccine. More than 1.8
million children participated in a randomized, double-blind trial. This is a
statistical design in which subjects are randomly assigned to either the control
group or the vaccine group. Neither the doctors nor the parents knew which
child received the vaccine instead of a placebo. This famous experiment was a
success. Today, pharmaceutical companies run similar statistical studies as they
develop drugs for the treatment and management of a wide range of maladies.
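The core of the randomized, double-blind design described above is that group membership is decided by chance and hidden from doctors and participants. The sketch below is a toy illustration of the assignment step, not the actual Salk trial protocol; the participant labels are made up:

```python
import random

random.seed(1)

# Hypothetical participant labels, purely for illustration.
participants = [f"child-{i:03d}" for i in range(10)]

# Randomly assign each participant to the vaccine or placebo group.
# In a double-blind trial, this table is kept hidden from both the
# doctors administering the shots and the families receiving them.
assignment = {p: random.choice(["vaccine", "placebo"]) for p in participants}

for person, group in assignment.items():
    print(person, "->", group)
```

Real trials typically use blocked or stratified randomization rather than a simple coin flip per subject, so that group sizes stay balanced; this sketch shows only the basic idea.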

2. Motor-vehicle safety: In 1925, about 18 Americans died for every 100 million vehicle miles
traveled. By the 1990s, that mortality rate had dropped to 1.7 deaths
per million miles. Engineering (both in vehicles and on the roads) had a large
part to do with that decrease, but statistics played a role in identifying key risk
factors that contributed to vehicular deaths, including statistics about the use of
seat belts, infant restraint systems, booster seats, and statistics about the value
of graduated licensing for teenage drivers.

3. Declines in deaths from heart disease and stroke: Although heart disease and
stroke are the first and third leading causes of death in the US, respectively,
death rates due to heart disease have decreased 56 percent since the 1950s, and
death rates from stroke have decreased by 70 percent. As I described in
a previous article about Jerome Cornfield, carefully designed statistical studies
such as the Framingham Heart Study established the major risk factors: high
cholesterol, high blood pressure, smoking, and dietary factors such as
cholesterol, fat, and salt.

4. Safer foods: Several times a year, the news tells us about recalls of food that are
linked to a foodborne disease such as Salmonella or E. coli. Tomatoes, spinach,
lettuce, ground beef: these and other food products have recently been in the
news as sources of outbreaks that are geographically diverse, yet are eventually
traced to a common cause, such as poor sanitation at a single processing
facility. Kaiser Fung's book, Numbers Rule Your World, presents a fascinating
look at how statistical methods (coordinated through the CDC) are used to
detect, investigate, trace, and control outbreaks of foodborne diseases.

5. Tobacco as a health hazard: The CDC article notes that smoking is known to be
the "leading preventable cause of death and disability" in the US. But the health
risk of smoking was unknown before epidemiologists and statisticians began
analyzing data in the 1940s and 1950s. By 1964 the evidence from thousands of
studies convinced the US Surgeon General to conclude that "cigarette smoking
is causally related to lung cancer" and to other diseases. The statistics
associated with establishing causality led to a lengthy debate between
statistician Jerome Cornfield (and colleagues) at the National Institutes of
Health and Sir Ronald Fisher, a heavy smoker who was a brilliant statistician
but a bully toward those who disagreed with him. In the end, science prevailed
over intimidation, and the weight of statistical evidence has led to laws that
have decreased the prevalence of smoking in the US.

The other achievements in the top 10 list were workplace safety, control of infectious
diseases, healthier mothers and babies, family planning, and fluoridation of drinking
water.
After her talk, I asked Chuang-Stein whether the 21st century has produced any
advances that compare with those on the CDC list for the 20th century. She replied
that the following two achievements are likely to make the list for the 21st century:
Personalized medicine: In personalized medicine, an individual's genetic profile
and his or her unique biochemistry are used to customize treatment. For
example, which medicines are likely to provide the best results with the fewest
side effects?

Disease modification: If a person is diagnosed early enough, it might be
possible to inhibit the disease so that it never debilitates the person. For
example, current research on Alzheimer's disease aims to inhibit the
progression of the disease. Disease-modifying treatments are available for
several diseases that develop slowly, including multiple
sclerosis and rheumatoid arthritis.
Certainly medicine, chemistry, engineering, and other fields also contributed to these
public health successes. Statistics is a collaborative science, and it does not make its
contributions in isolation. Cornfield called it "the bedfellow of the sciences."

Nevertheless, statistics made these public health advances possible. And that is truly
something to celebrate!
