Descriptive statistics are used to describe the basic features of the data
in a study. They provide simple summaries about the sample and the
measures. Together with simple graphics analysis, they form the basis of
virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics.
With descriptive statistics you are simply describing what is or what the
data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance,
we use inferential statistics to try to infer from the sample data what the
population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups
is a dependable one or one that might have happened by chance in this
study. Thus, we use inferential statistics to make inferences from our
data to more general conditions; we use descriptive statistics simply to
describe what's going on in our data.
Descriptive Statistics are used to present quantitative descriptions in a
manageable form. In a research study we may have lots of measures.
Or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a
sensible way. Each descriptive statistic reduces lots of data into a
simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting
average. This single number is simply the number of hits divided by the
number of times at bat (reported to three significant digits). A batter who
is hitting .333 is getting a hit one time in every three at bats. One
batting .250 is hitting one time in four. The single number describes a
large number of discrete events. Or, consider the scourge of many
students, the Grade Point Average (GPA). This single number describes
the general performance of a student across a potentially wide range of
course experiences.
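As a minimal Python sketch of this arithmetic (the hit and at-bat counts below are made up purely for illustration):

# A batting average reduces many discrete at-bat events to a single number.
# The counts below are hypothetical, chosen only to illustrate the arithmetic.
hits = 50
at_bats = 200
batting_average = hits / at_bats
print(round(batting_average, 3))   # 0.25, i.e. a ".250" hitter: one hit in every four at bats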
Every time you try to describe a large set of observations with a single
indicator you run the risk of distorting the original data or losing important
detail. The batting average doesn't tell you whether the batter is hitting
home runs or singles. It doesn't tell whether she's been in a slump or on
a streak. The GPA doesn't tell you whether the student was in difficult
courses or easy ones, or whether they were courses in their major field
or in other disciplines. Even given these limitations, descriptive statistics
provide a powerful summary that may enable comparisons across
people or other units.
Univariate Analysis
Univariate analysis involves the examination across cases of one
variable at a time. There are three major characteristics of a single
variable that we tend to look at:
the distribution
the central tendency
the dispersion
The Distribution
The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution lists every value of a variable along with the number of people who had each value. But a variable can also have a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.
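A rough sketch of this kind of grouping in Python (the GPA values and the letter-grade cut-offs below are assumptions made only for illustration):

from collections import Counter

# Hypothetical GPAs; the letter-grade cut-offs are illustrative assumptions.
gpas = [3.9, 3.4, 2.8, 2.2, 3.1, 1.7, 3.6, 2.5, 3.0, 3.8]

def letter_range(gpa):
    if gpa >= 3.5:
        return "A range"
    if gpa >= 2.5:
        return "B range"
    if gpa >= 1.5:
        return "C range"
    return "D range or below"

# Count how many raw scores fall into each range of values.
distribution = Counter(letter_range(g) for g in gpas)
print(distribution)   # Counter({'B range': 5, 'A range': 3, 'C range': 2})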
Central Tendency
The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency: the Mean, the Median, and the Mode. For the example set of scores used throughout this section (15, 20, 21, 20, 36, 15, 25, 15), the mean is the sum of the values (167) divided by the number of values (8), or 20.875; the median is the middle value of the sorted scores, 20; and the mode is the most frequently occurring value, 15.
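These three estimates can be computed directly with Python's standard-library statistics module; a short sketch for the example scores:

import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0, the middle of the sorted scores
print(statistics.mode(scores))    # 15, the most frequently occurring value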
Dispersion
Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.
The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The Standard Deviation shows the relation that the set of scores has to the mean of the sample. Again, let's take the set of scores:
15,20,21,20,36,15,25,15
to compute the standard deviation, we first find the distance between
each value and the mean. We know from above that the mean is 20.875.
So, the differences from the mean are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
Notice that values that are below the mean have negative discrepancies
and values above it have positive ones. Next, we square each
discrepancy:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625
Now, we take these "squares" and sum them to get the Sum of Squares
(SS) value. Here, the sum is 350.875. Next, we divide this sum by the
number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This
value is known as the variance. To get the standard deviation, we take
the square root of the variance (remember that we squared the
deviations earlier). This would be SQRT(50.125) = 7.079901129253.
Although this computation may seem convoluted, it's actually quite simple. To see this, consider the formula for the standard deviation:

s = √( Σ(X - M)² / (n - 1) )

where X is each score, M is the mean of the scores, and n is the number of scores. In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1. The ratio is the variance and the square root is the standard deviation. In English, we can describe the standard deviation as:
the square root of the sum of the squared deviations from the mean divided by the number of scores minus one
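The same computation written as a short Python sketch, so each step (deviations, squares, sum of squares, variance, square root) can be checked against the numbers above:

import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(scores)
mean = sum(scores) / n                            # 20.875

deviations = [x - mean for x in scores]           # -5.875, -0.875, +0.125, ...
sum_of_squares = sum(d * d for d in deviations)   # 350.875
variance = sum_of_squares / (n - 1)               # 350.875 / 7 = 50.125
std_dev = math.sqrt(variance)                     # 7.0799...

print(variance, std_dev)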
For the example distribution, the full set of descriptive statistics is:
Mean: 20.8750
Median: 20.0000
Mode: 15.00
Std. Deviation: 7.0799
Variance: 50.1250
Range: 21.00
Assuming that the distribution of scores is normal or bell-shaped (or close to it), the following conclusions can be reached:
approximately 68% of the scores in the sample fall within one standard deviation of the mean
approximately 95% of the scores in the sample fall within two standard deviations of the mean
approximately 99.7% of the scores in the sample fall within three standard deviations of the mean
For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875 - (2*7.0799) to 20.875 + (2*7.0799), or between 6.7152 and 35.0348. This
kind of information is a critical stepping stone to enabling us to compare
the performance of an individual on one variable with their performance
on another, even when the variables are measured on entirely different
scales.
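A couple of lines of Python make that interval explicit, using the mean and standard deviation computed above:

mean = 20.875
std_dev = 7.0799

low = mean - 2 * std_dev
high = mean + 2 * std_dev
print(low, high)   # approximately 6.7152 and 35.0348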
Descriptive statistics
From Wikipedia, the free encyclopedia
Descriptive statistics is the discipline of quantitatively describing the main features of a collection
of information, or the quantitative description itself. Descriptive statistics are distinguished
from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize
a sample, rather than use the data to learn about the population that the sample of data is thought to
represent. This generally means that descriptive statistics, unlike inferential statistics, are not
developed on the basis of probability theory. Even when a data analysis draws its main conclusions
using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities.[1][2]
Some measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]
Use in statistical analysis
For example, the shooting percentage in basketball is a descriptive statistic that summarizes the
performance of a player or a team. This number is the number of shots made divided by the number
of shots taken. For example, a player who shoots 33% is making approximately one shot in every
three. The percentage summarizes or describes multiple discrete events. Consider also the grade
point average. This single number describes the general performance of a student across the range
of their course experiences.[4]
The use of descriptive and summary statistics has an extensive history and, indeed, the simple
tabulation of populations and of economic data was the first way the topic of statistics appeared.
More recently, a collection of summarisation techniques has been formulated under the heading
of exploratory data analysis: an example of such a technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For
example, investors and brokers may use a historical account of return behavior by performing
empirical and analytical analyses on their investments in order to make better investing decisions in
the future.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and dispersion (including the range and quantiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in graphical or tabular format, including histograms and stem-and-leaf displays.
Bivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship between pairs of variables. In this case, descriptive statistics include:
cross-tabulations and contingency tables
graphical representation via scatterplots
quantitative measures of dependence
descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not
only simple descriptive analysis, but also it describes the relationship between two different
variables. Quantitative measures of dependence include correlation (such as Pearson's r when both
variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects
the scale variables are measured on). The slope, in regression analysis, also reflects the relationship
between variables. The unstandardised slope indicates the unit change in the criterion variable for a
one unit change in the predictor. The standardised slope indicates this change in standardised (z[5]
score) units. Highly skewed data are often transformed by taking logarithms. Use of logarithms
makes graphs more symmetrical and look more similar to the normal distribution, making them
easier to interpret intuitively.
[6]:47
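A brief sketch of these measures of dependence, assuming NumPy and SciPy are available (the paired data below are invented purely for illustration):

import numpy as np
from scipy import stats

# Hypothetical paired observations, e.g. hours studied vs. exam mark.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([50.0, 55.0, 62.0, 70.0, 81.0])

pearson_r, _ = stats.pearsonr(x, y)       # linear correlation of continuous variables
spearman_rho, _ = stats.spearmanr(x, y)   # rank-based correlation
covariance = np.cov(x, y)[0, 1]           # reflects the scales the variables are measured on
slope, intercept = np.polyfit(x, y, 1)    # unstandardised regression slope and intercept

print(pearson_r, spearman_rho, covariance, slope)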
Descriptive statistics
Descriptive statistics provide a concise summary of data. You can summarize data numerically or graphically. For
example, the manager of a fast food restaurant tracks the wait times for customers during the lunch hour for a
week and summarizes the data.
The manager calculates numeric descriptive statistics for the wait times, such as the mean, the standard deviation, the range, and N (the sample size). The manager also examines graphs of the wait times to visualize the data.
Inferential statistics
Inferential statistics use a random sample of data taken from a population to describe and make inferences about
the population. Inferential statistics are valuable when it is not convenient or possible to examine each member of
an entire population. For example, it is impractical to measure the diameter of each nail that is manufactured in a
mill, but you can measure the diameters of a representative random sample of nails and use that information to
make generalizations about the diameters of all the nails produced.
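A tiny Python simulation of that idea (the "population" of nail diameters below is generated artificially, since the point is only to show a random sample standing in for a population):

import random
import statistics

random.seed(1)

# Artificial population of nail diameters in mm; in practice you could not measure them all.
population = [random.gauss(5.0, 0.05) for _ in range(100_000)]

# A representative random sample used to estimate the population mean.
sample = random.sample(population, 200)
print(statistics.mean(sample))       # sample estimate of the mean
print(statistics.mean(population))   # the (normally unknown) population mean, close to 5.0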
Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show
or summarize data in a meaningful way such that, for example, patterns might emerge
from the data. Descriptive statistics do not, however, allow us to make conclusions
beyond the data we have analysed or reach conclusions regarding any hypotheses we
might have made. They are simply a way to describe our data.
Descriptive statistics are very important because if we simply presented our raw data
it would be hard to visualize what the data was showing, especially if there was a lot
of it. Descriptive statistics therefore enables us to present the data in a more
meaningful way, which allows simpler interpretation of the data. For example, if we
had the results of 100 pieces of students' coursework, we may be interested in the
overall performance of those students. We would also be interested in the distribution
or spread of the marks. Descriptive statistics allow us to do this. How to properly
describe data through statistics and graphs is an important topic and discussed in other
Laerd Statistics guides. Typically, there are two general types of statistic that are used
to describe data:
Measures of central tendency: these are ways of describing the central
position of a frequency distribution for a group of data. In this case, the
frequency distribution is simply the distribution and pattern of marks scored by
the 100 students from the lowest to the highest. We can describe this central
position using a number of statistics, including the mode, median, and mean.
You can read about measures of central tendency here.
Measures of spread: these are ways of summarizing a group of data by describing how spread out the scores are; common measures of spread include the range, quartiles, variance, and standard deviation.
Inferential Statistics
We have seen that descriptive statistics provide information about our immediate
group of data. For example, we could calculate the mean and standard deviation of the
exam marks for the 100 students and this could provide valuable information about
this group of 100 students. Any group of data like this, which includes all the data you
are interested in, is called a population. A population can be small or large, as long as
it includes all the data you are interested in. For example, if you were only interested
in the exam marks of 100 students, the 100 students would represent your population.
Descriptive statistics are applied to populations, and the properties of populations, like
the mean or standard deviation, are called parameters as they represent the whole
population (i.e., everybody you are interested in).
Often, however, you do not have access to the whole population you are interested in
investigating, but only a limited number of data instead. For example, you might be
interested in the exam marks of all students in the UK. It is not feasible to measure all
exam marks of all students in the whole of the UK so you have to measure a
smaller sample of students (e.g., 100 students), which are used to represent the larger
population of all UK students. Properties of samples, such as the mean or standard
deviation, are not called parameters, but statistics. Inferential statistics are techniques
that allow us to use these samples to make generalizations about the populations from
which the samples were drawn. It is, therefore, important that the sample accurately
represents the population. The process of achieving this is called sampling (sampling
strategies are discussed in detail here on our sister site). Inferential statistics arise out
of the fact that sampling naturally incurs sampling error and thus a sample is not
expected to perfectly represent the population. The methods of inferential statistics are
(1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
There are simpler ways to do descriptive statistics, such as with computer software. The Udemy course Descriptive Statistics in SPSS is a great tool to help you with descriptive statistics for incredibly large amounts of data.
For example, to find the mean of the data set 6, 7, 13, 15, 18, 21, 21, 25, first add all the numbers together:
6 + 7 + 13 + 15 + 18 + 21 + 21 + 25 = 126
Now we divide 126 by the number of values in the set (8), and we get the result. You should have gotten 15.75 as the mean for this set of data.
In terms of measures of central tendency, that is all there is to calculating the mean. To make it easier, you can learn about the different statistics formulas for the mean, median, and mode.
An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.
Two main statistical methodologies are used in data analysis: descriptive statistics,
which summarizes data from a sample using indexes such as the mean or standard
deviation, and inferential statistics, which draws conclusions from data that are
subject to random variation (e.g., observational errors, sampling variation).[2]
Descriptive statistics are most often concerned with two sets of properties of
a distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion (or variability)
characterizes the extent to which members of the distribution depart from its center
and each other. Inferences on mathematical statistics are made under the framework
of probability theory, which deals with the analysis of random phenomena.
A standard statistical procedure involves the test of the relationship between two statistical data sets, or a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (the null hypothesis is falsely rejected, giving a "false positive") and Type II errors (the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a "false negative"). [3] Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.
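As a rough illustration of such a test (a two-sample t test via SciPy; the two groups of measurements are invented), the null hypothesis of no difference between the groups is summarized by a p-value:

from scipy import stats

# Hypothetical measurements for two groups.
group_a = [20.1, 19.8, 21.2, 20.5, 19.9, 20.7]
group_b = [21.0, 21.4, 20.9, 21.8, 21.2, 21.6]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # a small p-value is evidence against the null hypothesis of equal means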
Measurement processes that generate statistical data are also subject to error. Many of
these errors are classified as random (noise) or systematic (bias), but other important
types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also
be important. The presence of missing data and/or censoring may result in biased
estimates and specific techniques have been developed to address these problems.
Statistics can be said to have begun in ancient civilization, going back at least to the 5th century BC, but it was not until the 18th century that it started to draw more heavily on calculus and probability theory.
In this International Year of Statistics, I'd like to describe the major role of statistics in
public health advances. In our modern society, it is sometimes difficult to recall the
huge advances in health and medicine in the 20th century. To name a few: penicillin
was discovered in 1928, risk factors for heart attacks and stroke were established in
the 1950s, and vaccines were created throughout the latter half of the century to
prevent diseases that once killed thousands of children annually.
A few weeks ago, SAS was fortunate to receive a visit from Christy Chuang-Stein,
Vice President and Head of Statistical Research and Consulting Center at Pfizer, and a
candidate for president of the American Statistical Association. One of Christy's slides
mentioned that the Centers for Disease Control and Prevention (CDC) published a list
of the "Top 10 Great Public Health Achievements in the 20th Century." The CDC
articles are fascinating but lengthy, so let me give you the executive summary and
simultaneously emphasize the role of statistics in a few of these achievements:
1. Routine immunization of children: During the 20th century, researchers
developed vaccines that prevent smallpox, measles, polio, and other diseases.
The safety and efficacy of these life-saving vaccines were tested by using
statistically designed clinical trials and statistical quality control during
manufacturing.
Many of these studies were conducted out of the public's eye, but in 1954 there
was a massive public trial to test Jonas Salk's polio vaccine. More than 1.8
million children participated in a randomized, double-blind trial. This is a
statistical design in which subjects are randomly assigned to either the control
group or the vaccine group. Neither the doctors nor the parents knew which
child received the vaccine instead of a placebo. This famous experiment was a
success. Today, pharmaceutical companies run similar statistical studies as they
develop drugs for the treatment and management of a wide range of maladies.
2. Motor-vehicle safety: In 1925, about 18 Americans died for every 100 million miles traveled. By the 1990s, that average mortality rate had dropped to 1.7 deaths per 100 million miles. Engineering (both in vehicles and on the roads) had a large
part to do with that decrease, but statistics played a role in identifying key risk
factors that contributed to vehicular deaths, including statistics about the use of
seat belts, infant restraint systems, booster seats, and statistics about the value
of graduated licensing for teenage drivers.
3. Declines in deaths from heart disease and stroke: Although heart disease and
stroke are the first and third leading causes of death in the US, respectively,
death rates due to heart disease have decreased 56 percent since the 1950s, and
death rates from stroke have decreased by 70 percent. As I described in a previous article about Jerome Cornfield, carefully designed statistical studies such as the Framingham Heart Study established the major risk factors: high blood pressure, high cholesterol, and smoking.
4. Safer foods: Several times a year, the news tells us about recalls of food that are
linked to a foodborne disease such as Salmonella or E. coli. Tomatoes, spinach,
lettuce, ground beef: these and other food products have recently been in the
news as sources of outbreaks that are geographically diverse, yet are eventually
traced to a common cause, such as poor sanitation at a single processing
facility. Kaiser Fung's book, Numbers Rule Your World, presents a fascinating
look at how statistical methods (coordinated through the CDC) are used to
detect, investigate, trace, and control outbreaks of foodborne diseases.
5. Tobacco as a health hazard: The CDC article notes that smoking is known to be
the "leading preventable cause of death and disability" in the US. But the health
risk of smoking was unknown before epidemiologists and statisticians began
analyzing data in the 1940s and 1950s. By 1964 the evidence from thousands of
studies convinced the US Surgeon General to conclude that "cigarette smoking
is causally related to lung cancer" and to other diseases. The statistics
associated with establishing causality led to a lengthy debate between
statistician Jerome Cornfield (and colleagues) at the National Institutes of
Health and Sir Ronald Fisher, a heavy smoker who was a brilliant statistician
but a bully toward those who disagreed with him. In the end, science prevailed
over intimidation, and the weight of statistical evidence has led to laws that have decreased the prevalence of smoking in the US.
The other achievements in the top 10 list were workplace safety, control of infectious
diseases, healthier mothers and babies, family planning, and fluoridation of drinking
water.
After her talk, I asked Chuang-Stein whether the 21st century has produced any
advances that compare with those on the CDC list for the 20th century. She replied
that the following two achievements are likely to make the list for the 21st century:
Personalized medicine: In personalized medicine, an individual's genetic profile
and his or her unique biochemistry are used to customize treatment. For
example, which medicines are likely to provide the best results with the fewest
side effects?
Nevertheless, statistics made these public health advances possible. And that is truly
something to celebrate!