
Chapter 3

Summary Statistics
To describe the characteristics of a data
set, we can use single numbers called
summary statistics.
They are more exact in nature and provide
more meaningful information.
Summary Statistics constitute:

Measure of Central Tendency


Measure of Dispersion
Skewness
Kurtosis


Central Tendency
Central tendency is the middle point of a
distribution.
Measures of central tendency are also
called measures of location.
They describe where the data bunch up.


Dispersion
Dispersion is the spread of the data in a
distribution, that is, the extent to which the
observations are scattered.
It shows the variability present in the data
set.
Dispersion is contrasted with location or
central tendency, and together they are the
most used properties of distributions.


Skewness
Curves representing data points in the data set
may be either symmetrical or skewed.
They are skewed because the values in their
frequency distributions are concentrated at either
the low end or the high end of the scale.
If skewness is positive, the data are positively
skewed or skewed right, meaning that the right
tail of the distribution is longer than the left.
If skewness is negative, the data are negatively
skewed or skewed left, meaning that the left tail is
longer.
If the data are perfectly symmetrical,
skewness = 0.
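
As a rough illustration, the sign of skewness can be checked numerically. The sketch below assumes the moment-based (population) coefficient m3 / m2^1.5, where m2 and m3 are the second and third central moments; the slides do not name a formula, and the sample values are made up for the example.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]  # long right tail (illustrative data)
left_skewed = [-x for x in right_skewed]  # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # 0.0
```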

Kurtosis
The height and sharpness of the peak
relative to the rest of the data are
measured by a number called kurtosis.
Higher values indicate a higher, sharper
peak; lower values indicate a lower, less
distinct peak.
This occurs because higher kurtosis means
more of the variability is due to a few
extreme differences from the mean, rather
than a lot of modest differences from the
mean.

Kurtosis
Distributions with zero excess kurtosis are
called mesokurtic, e.g. the normal distribution.
A distribution with positive excess kurtosis
is called leptokurtic, e.g. the Cauchy
distribution, Student's t-distribution, Poisson
distribution and the logistic distribution.
A distribution with negative excess kurtosis
is called platykurtic, e.g. continuous or
discrete uniform distributions.
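
The three categories can be told apart by the sign of excess kurtosis. The following is a minimal sketch, assuming the moment-based definition m4 / m2^2 - 3 (the 3 is the kurtosis of the normal distribution); the two data sets are invented for illustration.

```python
def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                       # discrete uniform values
heavy_tailed = [-10, -1, 0, 0, 0, 0, 1, 10]  # a few extreme differences from the mean

print(excess_kurtosis(flat))          # negative: platykurtic
print(excess_kurtosis(heavy_tailed))  # positive: leptokurtic
```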


Measure of Central Tendency: Mean
For a data set, the mean is the sum of the
values divided by the number of values.
The mean is a unique measure of central
tendency, because every data set has one
and only one mean. It is the most useful
measure, generally referred to as the average.
It is highly influenced by extreme values.
The mean cannot be calculated for data with
open-ended class intervals.
It can be calculated only for quantitative
measurements.
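
A small sketch of the definition and of the sensitivity to extreme values, using Python's standard `statistics` module. The salary figures are hypothetical.

```python
from statistics import mean

salaries = [30, 32, 35, 38, 40]  # hypothetical values
print(sum(salaries) / len(salaries))  # sum of values divided by number of values
print(mean(salaries))                 # same result from the standard library

# One extreme value pulls the mean far away from the bulk of the data.
with_outlier = salaries + [400]
print(mean(with_outlier))
```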

Measure of Central Tendency: Weighted Mean
The weighted mean enables us to calculate
an average that takes into account the
importance of each value to the overall
total.
The weighted mean is calculated when
several observations share the same value,
so that the values occur with different
frequencies (weights).
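
As a sketch, the weighted mean is the sum of value times weight divided by the sum of the weights. The function name `weighted_mean` and the wage and hour figures are assumptions made for this example.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

hourly_wages = [5, 7, 9]  # hypothetical wage rates
hours_worked = [2, 3, 5]  # weight (frequency) of each wage rate

print(weighted_mean(hourly_wages, hours_worked))  # weighted average wage
print(sum(hourly_wages) / len(hourly_wages))      # unweighted mean, for comparison
```

Because most hours are worked at the highest rate, the weighted mean lies above the unweighted mean of the three rates.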


Measure of Central Tendency: Geometric Mean
When we are dealing with quantities that
change over a period of time, we are
interested in the average rate of
change.
In such cases the geometric mean is preferred
over the arithmetic mean.
The geometric mean is used to show
multiplicative effects over time in
compound interest and inflation
calculations.
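
To illustrate why the geometric mean suits growth rates, the sketch below averages three yearly growth factors. The factors are hypothetical, and `statistics.geometric_mean` assumes Python 3.8 or later.

```python
from math import prod
from statistics import geometric_mean, mean

growth_factors = [1.10, 1.20, 0.90]  # hypothetical: +10%, +20%, -10%

avg_factor = geometric_mean(growth_factors)
print(avg_factor)            # average growth factor per year
print(mean(growth_factors))  # arithmetic mean, slightly larger

# Compounding the geometric mean over three years reproduces the total growth.
print(avg_factor ** 3, prod(growth_factors))
```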


Measure of Central Tendency: Median
The median is a single value from the data set
that measures the central item in the data. It
is the middlemost (most central) value.
The median is not influenced by extreme values.
It is easy to understand and can be calculated for
any kind of data, even for grouped data with
open-ended classes.
It is useful when the data are
qualitative descriptions that can be ranked.
We must array (sort) the data before we can
calculate the median.
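
The procedure can be sketched as follows: sort the data, then take the middle item, or the average of the two middle items when the count is even. The function and the sample values are illustrative only.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)  # the data must be arrayed first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))            # odd count: middle item
print(median([4, 1, 3, 2]))         # even count: average of the two middle items
print(median([1, 2, 3, 4, 1000]))   # the extreme value 1000 does not move the median
```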


Measure of Central Tendency: Mode
The mode is the value that is repeated most often in
the data set.
The mode differs from the mean but is somewhat
similar to the median, because it is not actually
calculated by an arithmetic process.
The mode can be used as a measure of location for
quantitative as well as qualitative data.
The mode is not affected by extreme values and can
be used for open-ended data also.
If a data set has two or more modes, it is difficult
to interpret.
For continuous data, sometimes there is no mode.
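
A short sketch of these cases using `statistics.multimode` (assumes Python 3.8 or later), which returns every value tied for the highest frequency. The data are invented for illustration.

```python
from statistics import multimode

colours = ["red", "blue", "red", "green"]  # qualitative data
print(multimode(colours))                  # single mode

bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))                  # two modes: harder to interpret

measurements = [1.2, 3.4, 5.6]             # continuous data, all values distinct
print(multimode(measurements))             # every value ties, so there is no useful mode
```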

Measure of Dispersion: Range


The range is the difference between the highest
and lowest observed values.
The range is easy to understand and to find, but its
usefulness as a measure of dispersion is limited.
In open-ended distributions the range cannot be
calculated.
The range depends on only two observations of the
data set and fails to take account of all other
observations in the data set.
It is heavily influenced by extreme values but
ignores the nature of variation among all the other
observations.
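
The limitation can be seen in a small sketch. The helper name `value_range` and the three data sets are assumptions made for the example.

```python
def value_range(values):
    """Difference between the highest and lowest observed values."""
    return max(values) - min(values)

evenly_spread = [10, 11, 12, 13, 14]
clustered = [10, 12, 12, 12, 14]     # different variation, same two end points
with_extreme = [10, 11, 12, 13, 90]  # one extreme value

print(value_range(evenly_spread))  # 4
print(value_range(clustered))      # 4: the range ignores the values in between
print(value_range(with_extreme))   # 80: driven entirely by the extreme value
```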


Measure of Dispersion: Variance


Variance and standard deviation are
average deviation measures.
They are based on the average
deviation/distance from the mean of the
distribution.
The variance is the mean (average) of the
squared deviations between the mean and
each item in the population.
It takes into account all values and
gives more weight to large
deviations.
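
A minimal sketch of the population variance as the mean of squared deviations, checked against `statistics.pvariance`. The data set is made up for illustration.

```python
from statistics import pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative population

mu = sum(data) / len(data)                    # population mean
squared_devs = [(x - mu) ** 2 for x in data]  # squared deviation of each item
variance = sum(squared_devs) / len(data)      # mean of the squared deviations

print(variance)         # 4.0
print(pvariance(data))  # same value from the standard library

# Squaring gives large deviations more weight: the single value 9
# accounts for half of the total squared deviation.
print(max(squared_devs) / sum(squared_devs))  # 0.5
```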

Measure of Dispersion: Standard Deviation
The standard deviation is the square root of the
variance.
It is the most useful and popular measure of
dispersion.
It is used to compare distributions and to
compute standard scores.
It cannot be computed from open-ended
classes.
Extreme values in the distribution affect the
value of the standard deviation.
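
As a sketch, the standard deviation and the standard scores mentioned above can be computed as follows. It assumes the population standard deviation (`statistics.pstdev`) and the usual standard score z = (x - mean) / standard deviation; the data are illustrative.

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative population

mu = mean(data)
sigma = pstdev(data)  # square root of the population variance
print(sigma)

# Standard score: how many standard deviations each value lies from the mean.
z_scores = [(x - mu) / sigma for x in data]
print(z_scores)
```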


Measure of Relative Dispersion: Coefficient of Variation
The standard deviation is an absolute measure
of dispersion that expresses variation in the
same units as the original data.
The standard deviation cannot be the sole
basis for comparing two distributions.
The coefficient of variation relates the standard
deviation to the mean by expressing the
standard deviation as a percentage of the mean.
It is useful in comparing the
variability/consistency present in two or
more distributions/data sets.
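
The sketch below computes the coefficient of variation as (standard deviation / mean) * 100 for two made-up data sets that share the same standard deviation. The helper name `coefficient_of_variation` and the use of the population standard deviation are assumptions for the example.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100

set_a = [48, 50, 52]  # mean 50
set_b = [8, 10, 12]   # mean 10, same spread in absolute terms

print(pstdev(set_a), pstdev(set_b))     # identical standard deviations
print(coefficient_of_variation(set_a))  # small relative variability
print(coefficient_of_variation(set_b))  # five times larger relative variability
```

The equal standard deviations hide the fact that the second set varies far more relative to its mean, which is why the standard deviation alone is not enough for the comparison.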

Exploratory data analysis


Exploratory Data Analysis (EDA) uses some
simple techniques and diagrams to
summarize and describe the data.
The stem-and-leaf display is one of the most useful
techniques of EDA.
A stem-and-leaf display gives the rank order
of the items in the data set and the shape
of the distribution.
A stem-and-leaf display is a histogram-like display
that also shows all the original values along
with the frequencies.
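
A stem-and-leaf display can be sketched in a few lines. This version assumes non-negative integers, with the tens as the stem and the units digit as the leaf; the function name and the scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):  # sorting gives the rank order of the items
        stems[v // 10].append(v % 10)
    return dict(stems)

scores = [47, 23, 34, 51, 25, 38, 42, 31, 49, 34, 47]  # hypothetical data

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each row lists the original values for one stem, and the row lengths show the shape of the distribution in the same way as the bars of a histogram.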