Business Statistics
HUMAN PERITUS
ww.humanperitus.com
The following are the measures of average or central tendency that are in common use:
(i) Arithmetic Average or Arithmetic Mean or Simple Mean
(ii) Median
(iii) Mode
(iv) Geometric mean and Harmonic mean
(v) Quartiles, Deciles and Percentile
Arithmetic mean, Geometric mean and Harmonic mean are called Mathematical averages, while Mode, Median,
Quartiles, Deciles and Percentiles are called Positional averages.
Mean
Arithmetic Mean (AM) is the most commonly used measure of central tendency. It is defined as the sum of the values
of all observations divided by the number of observations and is usually denoted by $\bar{X}$. In general, if there are N
observations X1, X2, X3, ..., XN, then the Arithmetic Mean ($\bar{X}$) is given by
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$
which can also be written as
$$\bar{X} = \frac{\sum X}{N}$$
There are three methods to find the arithmetic mean, which we will understand by using Income data of 10 families:
(i) Direct Method: In the direct method, the AM is calculated by the direct formula given below:
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$$
In the above table, Column I shows the calculation of AM by the direct formula:
$$\bar{X} = \frac{11160}{10} = 1116$$
(ii) Assumed Mean Method: If the number of observations in the data is large, it is difficult to compute the arithmetic
mean by the direct method. The computation can be made easier by using the assumed mean method.
Here you assume a particular figure in the data as the arithmetic mean, on the basis of logic/experience. Then you
take the deviation of each observation from the assumed mean, sum these deviations and divide the sum by the
number of observations in the data.
The actual arithmetic mean is obtained by adding this ratio (sum of deviations divided by number of observations)
to the assumed mean.
Let, A = assumed mean
X = individual observations
N = total number of observations
d = deviation of an individual observation from the assumed mean, i.e. d = X – A
Then
$$\bar{X} = A + \frac{\sum d}{N}$$
(iii) Step Deviation Method: The calculations can be further simplified by dividing all the deviations taken from
the assumed mean by a common factor ‘c’. The objective is to avoid large numerical figures, i.e., if
d = X – A is very large, then find d’:
$$d' = \frac{d}{c} = \frac{X - A}{c}$$
Column III of the table (previous section) shows the calculation by the Step Deviation Method.
Here we take the assumed mean A = 850 and the common factor c = 10.
We get ∑d’ = 266, so
$$\bar{X} = A + \frac{\sum d'}{N} \times c = 850 + \frac{266}{10} \times 10 = 1116$$
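The three methods can be compared side by side in a short Python sketch. The income table itself is not reproduced in this text, so the ten figures below are illustrative values (an assumption) chosen to be consistent with the totals quoted above (sum 11160, ∑d’ = 266 with A = 850 and c = 10). The function names are hypothetical.

```python
# Illustrative incomes of 10 families. These figures are an assumption:
# the source table is not reproduced here, so values were chosen to match
# the quoted totals (sum = 11160, sum of step deviations = 266).
incomes = [850, 700, 100, 750, 5000, 80, 420, 2500, 400, 360]


def direct_mean(values):
    """Direct method: sum of observations divided by their count."""
    return sum(values) / len(values)


def assumed_mean_method(values, assumed):
    """Assumed mean method: A + (sum of deviations d = X - A) / N."""
    deviations = [x - assumed for x in values]
    return assumed + sum(deviations) / len(values)


def step_deviation_method(values, assumed, factor):
    """Step deviation method: A + (sum of d' = (X - A) / c) / N * c."""
    step_devs = [(x - assumed) / factor for x in values]
    return assumed + (sum(step_devs) / len(values)) * factor


print(direct_mean(incomes))
print(assumed_mean_method(incomes, assumed=850))
print(step_deviation_method(incomes, assumed=850, factor=10))
```

All three calls should agree, since the assumed mean and the common factor only rearrange the arithmetic and do not change the result.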
The mean of a sample of data organized in a frequency distribution is computed by the following formula:
$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_N x_N}{f_1 + f_2 + \dots + f_N}$$
where fi is the frequency of Class i and xi is the class mid-point of Class i.
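A minimal sketch of the grouped-data formula follows. The class limits and frequencies are hypothetical values invented for illustration, and the function name is an assumption.

```python
# Hypothetical frequency distribution (not from the source):
# each class is (lower limit, upper limit), with a matching frequency.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]


def grouped_mean(class_limits, freqs):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i), x_i = class mid-point."""
    midpoints = [(low + high) / 2 for low, high in class_limits]
    weighted_total = sum(f * x for f, x in zip(freqs, midpoints))
    return weighted_total / sum(freqs)


mean = grouped_mean(classes, frequencies)
print(mean)  # (2*5 + 5*15 + 3*25) / 10 = 16.0
```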
(iii) If each value of a variable X is increased or decreased or multiplied by a constant k, the arithmetic mean also
increases or decreases or multiplies by the same constant k.
(iv) The sum of squares of deviation of set of values about its mean is minimum.
Please note that, in a skewed distribution, the calculated mean can miss the centrality. This problem occurs because
outliers have a substantial impact on the mean. Extreme values in an extended tail pull the mean away from the center.
As the distribution becomes more skewed, the mean is drawn further away from the center. Consequently, it is best to
use the mean as a measure of the central tendency when you have a "symmetric distribution".
Weighted Mean
The Weighted Mean is a measure of central tendency of a set of quantitative observations, used when not all the
observations have the same importance. In other words, weights are assigned to the various data values according
to their importance. For example, if Mohan sold 5 blue balls at a price of Rs
0.50 each, 15 red balls at Rs 0.75 each, 15 grey balls at Rs 0.90 each and 15 black balls at
Rs 1.10 each, the average selling price of a ball is given by the weighted mean, as calculated below.
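The calculation for Mohan's sales can be sketched in Python as follows. It uses the standard weighted-mean formula, sum(w × x) / sum(w), with the quantities sold as the weights. The function name is hypothetical.

```python
# Prices (Rs) and quantities sold, taken from the example in the text.
prices = [0.50, 0.75, 0.90, 1.10]
quantities = [5, 15, 15, 15]  # blue, red, grey, black


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


avg_price = weighted_mean(prices, quantities)
print(round(avg_price, 3))  # total revenue 43.75 over 50 balls
```

The simple (unweighted) mean of the four prices would be Rs 0.8125, which understates the average because only 5 of the 50 balls were sold at the lowest price.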
The Arithmetic Mean and Number of observations for a Population is represented by µ and N, respectively. The
Arithmetic Mean and Number of observations for a Sample is represented by x̅ and n, respectively.
Median
The Median is that positional value of the variable which divides the distribution into two equal parts, one part
comprises all values greater than or equal to the median value and the other comprises all values less than or equal to
it. The Median is the “middle” element when the data set is arranged in order of the magnitude.
Since the median is determined by the position of different values, it remains unaffected if, say, the size of the largest
value increases (unlike mean).
The median can be easily computed by sorting the data from smallest to largest and finding out the middle value.
Suppose we have the following observation in a data set: 5, 7, 6, 1, 8, 10, 12, 4, and 3.
The “middle score” is 6, so the median is 6. Half of the scores are larger than 6 and half of the scores are smaller.
However, if there is an even number of observations in the data, two observations fall in the middle. The median in
this case is computed as the arithmetic mean of the two middle values.
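Both cases (odd and even number of observations) can be sketched in a few lines of Python. The odd-sized data set is the one from the text. The even-sized one adds a hypothetical tenth value (14) purely for illustration.

```python
def median(values):
    """Sort the data, then take the middle value (or mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [5, 7, 6, 1, 8, 10, 12, 4, 3]   # from the text
even_data = odd_data + [14]                # hypothetical extra observation

print(median(odd_data))   # middle of 1, 3, 4, 5, 6, 7, 8, 10, 12
print(median(even_data))  # mean of the two middle values, 6 and 7
```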
For grouped data, the median is computed as
$$\text{Median} = L + \frac{\frac{n}{2} - C}{f} \times h$$
Where:
L is the lower limit of the median class
h is the class interval size of the median class
C is the cumulative frequency before the median class
f is the frequency of the median class
n is the number of values or total frequency.
Mode
Mode is the most frequently observed data value. It is denoted by Mo. It has been derived from the French word “la
Mode” which signifies the most fashionable values of a distribution, because it is repeated the highest number of times
in the series.
Consider the data set 1, 2, 3, 4, 4, 5. The mode for this data is 4 because 4 occurs most frequently (twice) in the data.
In this example, as there is a unique value of mode, the data is unimodal. But, the mode is not necessarily unique,
unlike arithmetic mean and median. You can have data with two modes (bi-modal) or more than two modes (multi-
modal). Example of bi-modal data is 1, 2, 2, 3, 4, 4, 5 (because both 2 and 4 are appearing twice each).
There may be no mode at all if no value appears more frequently than any other value in the distribution.
For example, in the series 1, 1, 2, 2, 3, 3, 4, 4, there is no mode.
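The three situations described above (unimodal, bi-modal, no mode) can be sketched as follows. The helper name `modes` and the convention of returning an empty list when every value is equally frequent are assumptions made to match the text. Python's built-in `statistics.multimode` would instead return all values in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    top = sorted(v for v, c in counts.items() if c == highest)
    # If every distinct value is equally frequent, treat the data as having no mode.
    if len(counts) > 1 and len(top) == len(counts):
        return []
    return top


print(modes([1, 2, 3, 4, 4, 5]))        # unimodal
print(modes([1, 2, 2, 3, 4, 4, 5]))     # bi-modal
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))  # no mode
```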
In the continuous data below, no values repeat, which means there is no mode. With continuous data, it is unlikely that
two or more values will be exactly equal because there are an infinite number of values between any two values.
However, you can find the mode for continuous data by locating the maximum value on a probability distribution plot.
If you can identify a probability distribution that fits your data, find the peak value and use it as the mode.
For a unimodal, moderately skewed distribution, the median usually lies between the arithmetic mean and the mode.
For moderately asymmetrical distribution (or for asymmetrical curve), the relation
Mean – Mode = 3 (Mean - Median), approximately holds. In such a case, first evaluate mean and median and then
mode is determined by
Mode = 3 × Median – 2 × Mean
When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal. In this case,
analysts tend to use the mean because it includes all of the data in the calculations. However, if you have a skewed
distribution, the median is often the best measure of central tendency.
When you have ordinal data, the median or mode is usually the best choice. For categorical data, you have to use the
mode.
The Geometric Mean, G, of a series of values X1, X2, X3, …, Xn (none of them being zero) is defined as the nth root
of the product of the n values.
$$G = \sqrt[n]{X_1 \cdot X_2 \cdot X_3 \cdots X_n}$$
For two observations, the product of Arithmetic Mean (AM) and Harmonic Mean (HM) is equal to square of Geometric
Mean (GM).
$$GM^2 = AM \times HM$$
Also, AM ≥ GM ≥ HM
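A quick numerical check of both statements, using two hypothetical observations (4 and 9) chosen only because their product is a perfect square. The harmonic mean is computed with its standard definition, n divided by the sum of reciprocals, which this text does not state explicitly.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """n divided by the sum of reciprocals (standard definition)."""
    return len(values) / sum(1 / x for x in values)


pair = [4, 9]  # hypothetical observations
am = arithmetic_mean(pair)
gm = geometric_mean(pair)
hm = harmonic_mean(pair)

print(am, gm, round(hm, 4))
print(math.isclose(gm ** 2, am * hm))  # GM^2 = AM x HM for two observations
print(am >= gm >= hm)                  # AM >= GM >= HM
```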
Partition Values
If the values of the variate are arranged in ascending or descending order of magnitudes then we have seen above that
median is that value of the variate which divides the total frequencies in two equal parts.
Similarly, the given series can be divided into 4, 10 and 100 equal parts. The values of the variate dividing it into 4 equal
parts are called Quartiles, into 10 equal parts Deciles and into 100 equal parts Percentiles.
Quartiles
The values of the variate which divide the total frequency into 4 equal parts, are called quartiles. There are three
quartiles.
The first Quartile (denoted by Q1 ) or lower quartile has 25% of the items of the distribution below it and 75% of the
items are greater than it.
The second Quartile (denoted by Q2 ) or median has 50% of items below it and 50% of the observations above it.
The third Quartile (denoted by Q3 ) or upper Quartile has 75% of the items of the distribution below it and 25% of
the items above it. Thus, Q1 and Q3 denote the two limits within which central 50% of the data lies.
Illustration: Calculate the value of lower quartile (Q1) and upper quartile (Q3) from the data of the marks obtained by
ten students in an examination.
22, 26, 14, 30, 18, 11, 35, 41, 12, 32.
Solution: Arranging the data in an ascending order,
11, 12, 14, 18, 22, 26, 30, 32, 35, 41.
$Q_1$ = value of the $\frac{n+1}{4}$th term
= value of the $\frac{10+1}{4}$th term
= value of the 2.75th term
= 2nd term + 0.75 (3rd term – 2nd term)
= 12 + 0.75 (14 – 12)
= 13.5 marks.
Similarly, the upper quartile ($Q_3$) can be calculated by finding the value of the $\frac{3(n+1)}{4}$th term.
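The same positional rule can be sketched in Python to obtain both quartiles for the marks data. The helper name `quartile` is hypothetical. It implements the k(n+1)/4 position rule with linear interpolation between neighbouring terms, as in the illustration above. Other textbooks and software packages use slightly different conventions.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(n+1)/4-th term rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4          # 1-based, possibly fractional
    lower = int(position)               # whole part of the position
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    # Interpolate between the lower-th and (lower+1)-th terms.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


marks = [22, 26, 14, 30, 18, 11, 35, 41, 12, 32]
q1 = quartile(marks, 1)
q3 = quartile(marks, 3)
print(q1)  # 2.75th term: 12 + 0.75 * (14 - 12)
print(q3)  # 8.25th term: 32 + 0.25 * (35 - 32)
```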
Percentiles
Percentiles divide the distribution into 100 equal parts, so you can get 99 dividing positions denoted by P1, P2, P3 , ...,
P99. The P50 is the median value.
For example, if you have scored at the 82nd percentile in the CAT exam, 82 per cent of the candidates who appeared
in the exam scored at or below your score, and only 18 per cent of the candidates scored more than you.
When the population variability is small, all of the scores are clustered close together and any individual score or
sample will necessarily provide a good representation of the entire set. On the other hand, when variability is large and
scores are widely spread, it is easy for one or two extreme scores to give a distorted picture of the general population.
Various measures of dispersion or variation are available like Range, Mean Deviation, Variance or Standard Deviation.
Range
Range (R) is the difference between the largest (L) and the smallest value (S) in a distribution. Thus,
R=L–S
Higher value of Range implies higher dispersion and vice-versa.
Quartile Deviation
The presence of even one extremely high or low value in a distribution can reduce the utility of range as a measure of
dispersion. Thus, you may need a measure which is not unduly affected by the outliers.
In such a situation, if the entire data is divided into 4 equal parts, each containing 25% of the values, we get the values
of Quartiles and Median. The upper and lower quartiles (Q3 and Q1 , respectively) are used to calculate Inter Quartile
Range which is Q3 – Q1. Inter-Quartile Range is based upon middle 50% of the values in a distribution and is, therefore,
not affected by extreme values.
$$\text{Quartile Deviation (QD)} = \frac{Q_3 - Q_1}{2}$$
The Quartile Deviation (QD) is also called Semi inter Quartile Range. Quartile Deviation can generally be
calculated for open-ended distributions and is not unduly affected by extreme values.
Calculate Range and Q.D. of the following observations: 20, 25, 29, 30, 35, 39, 41, 48, 51, 60 and 70.
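One possible working of this exercise is sketched below. It assumes the same (n+1)/4 positional rule for quartiles used in the quartile illustration earlier. With 11 observations the positions are whole numbers (3rd and 9th terms), so no interpolation is needed. The variable names are hypothetical.

```python
observations = [20, 25, 29, 30, 35, 39, 41, 48, 51, 60, 70]
ordered = sorted(observations)
n = len(ordered)

# Range: largest value minus smallest value.
value_range = ordered[-1] - ordered[0]

# Quartile positions by the (n+1)/4 rule; for n = 11 these are exactly 3 and 9.
q1_position = (n + 1) // 4
q3_position = 3 * (n + 1) // 4
q1 = ordered[q1_position - 1]
q3 = ordered[q3_position - 1]

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

print(value_range)         # 70 - 20
print(q1, q3)              # 3rd and 9th terms
print(quartile_deviation)  # (Q3 - Q1) / 2
```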
Since the average is a central value, some deviations are positive and some are negative. If these are added as they are,
the sum will not reveal anything. In fact, the sum of deviations from Arithmetic Mean is always zero.
The Mean Deviation tries to overcome this problem by ignoring the signs of deviations, i.e., it considers all deviations
positive. In case of Standard Deviation, the deviations are first squared and averaged and then square root of the
average is found.
Mean Deviation
The Mean Deviation is the Arithmetic Mean of absolute deviations of the observations from a measure of central
tendency. The steps to calculate mean deviation are given below:
(i) The A.M. of the values is calculated
(ii) Difference between each value and the A.M. is calculated. All differences are considered positive. These are
denoted as |d|
(iii)The A.M. of these differences (called deviations) is the Mean Deviation.
$$\text{Mean Deviation, } MD = \frac{\sum |d|}{n}$$
$$\text{Mean Deviation, } MD = \frac{\sum |d|}{n} = \frac{12}{5} = 2.4$$
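The three steps can be sketched in Python. The worked table is not reproduced in this text, so the five values below are an assumption: they were chosen so that ∑|d| = 12 and n = 5, matching the result quoted above.

```python
# Hypothetical data (assumption): chosen so that sum(|d|) = 12 with n = 5.
values = [2, 4, 7, 8, 9]


def mean_deviation(data):
    """Mean of absolute deviations from the arithmetic mean."""
    mean = sum(data) / len(data)                 # step (i)
    abs_devs = [abs(x - mean) for x in data]     # step (ii): all |d| positive
    return sum(abs_devs) / len(data)             # step (iii)


md = mean_deviation(values)
print(md)  # 12 / 5
```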
So if there are 5 values X1 , X2 , X3, X4 and X5 , first their mean is calculated. Then deviations of the values from mean
are calculated. These deviations are then squared. The mean of these squared deviations is the Variance. Positive
square root of the variance is the Standard deviation.
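A minimal sketch of this procedure follows, using five hypothetical values (an assumption, not data from the source). It divides by n, as described in the paragraph above. Sample formulas that divide by n – 1 would give a slightly larger result.

```python
import math

# Hypothetical five observations (assumption, for illustration only).
values = [2, 4, 7, 8, 9]


def variance(data):
    """Mean of squared deviations from the arithmetic mean (divides by n)."""
    mean = sum(data) / len(data)
    squared_devs = [(x - mean) ** 2 for x in data]
    return sum(squared_devs) / len(data)


def standard_deviation(data):
    """Positive square root of the variance."""
    return math.sqrt(variance(data))


var = variance(values)
sd = standard_deviation(values)
print(round(var, 2))  # squared deviations 16, 4, 1, 4, 9 sum to 34; 34 / 5
print(round(sd, 4))
```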
6. Hypothesis Testing
In the last section, we estimated an unknown population parameter from the corresponding statistic obtained from the
analysis of a sample from the population. Now we shall similarly analyse a sample and then use its statistic to see
whether some claim made about the population is reasonable or not.
In hypothesis testing, we begin by making a tentative assumption about a population parameter. This tentative
assumption is called the null hypothesis and is denoted by H0 . We then define another hypothesis, called the
alternative hypothesis, which is the opposite of what is stated in the null hypothesis. The alternative hypothesis is
denoted by Ha or H1 .
Thus, the Hypothesis testing uses sample data to determine whether a statement about the value of a population
parameter (a) should be rejected or (b) should not be rejected.
In some situations, it is easier to identify the alternative hypothesis first and then develop the null hypothesis. In other
situations, it is easier to identify the null hypothesis first and then develop the alternative hypothesis.
The hypothesis tests generally involve two population parameters: (a) population mean and (b) population proportion.
Depending on the situation, hypothesis tests about a population parameter may take one of the following 3 forms, of
which 2 use inequalities in the null hypothesis and the third uses an equality in the null hypothesis. Here µ0 denotes
the hypothesized value.
Type 1 and Type 2 are One-tailed tests and Type 3 is a Two-tailed test.
In selecting the proper form of H0 and Ha , keep in mind that the alternative hypothesis is often what the test is
attempting to establish. Hence, asking whether the user is looking for evidence to support µ < µ0 , µ > µ0 , or µ ≠ µ0 will
help determine Ha .
You can use either of (a) p value approach or (b) critical value approach for Steps 4 and 5.
These steps have been explained in detail, with examples, in next sections.
The first row of the figure shows what can happen if the conclusion is to accept H0 . If H0 is true, this conclusion is
correct. However, if Ha is true, we make a Type II error; that is, we accept H0 when it is false.
The second row of the figure shows what can happen if the conclusion is to reject H0 . If H0 is true, we make a Type I
error; that is, we reject H0 when it is true. However, if Ha is true, rejecting H0 is correct.
The probability of making a Type I error, when the null hypothesis is true as an equality is called the level of
significance. The Greek symbol α (alpha) is used to denote the level of significance, and common choices for α are
0.05 and 0.01. In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting
α, that person is controlling the probability of making a Type I error. If the cost of making a Type I error is high, small
values of α are preferred. If the cost of making a Type I error is not too high, larger values of α are typically used.
Applications of hypothesis testing that only control for the Type I error are called significance tests.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not
always control for the probability of making a Type II error. Hence, if we decide to accept H0 , we cannot determine
how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when
conducting significance tests, statisticians usually recommend that we use the statement “do not reject H0 ” instead of
“accept H0 ”.
If you wish to reduce the probability of a Type I error, reduce the significance level to a very low value, perhaps α =
0.01, or even α = 0.001. Remember, though, that this implies a higher probability of a Type II error. If the negative
consequences of a Type I error are not severe, it is preferable to provide a better balance of Type I and Type II errors
by adopting a significance level such as 0.05 or 0.10.
Type I errors are also called: 1) Producer’s risk 2) False alarm 3) False positive 4) α error
Type II errors are also called: 1) Consumer’s risk 2) Misdetection 3) False negative 4) β error
Let us understand Hypothesis Testing with few examples. We have grouped the tests in two categories:
(i) Calculation of Mean of Population, when Standard Deviation of Population (𝜎) is Known – using z test
(ii) Calculation of Mean of Population, when Standard Deviation of Population (𝜎) is NOT Known- using t test
Later, we explain another example that tests a Population Proportion, using Hypothesis Testing.
One tailed test (lower tailed) | One tailed test (upper tailed) | Two tailed test
H0 : µ ≥ µ0 | H0 : µ ≤ µ0 | H0 : µ = µ0
Ha : µ < µ0 | Ha : µ > µ0 | Ha : µ ≠ µ0
The students may note down steps of Hypothesis Testing, while we are doing this example. Almost all questions of
Hypothesis testing follow similar steps.
Null Hypothesis H0 : µ ≥ 3
Alternate Hypothesis Ha : µ < 3
If the sample data indicate that H0 cannot be rejected, then the Court should not take any action against the company.
However, if the sample data indicate H0 can be rejected, we will conclude that the alternative hypothesis, Ha : µ < 3,
is true. In this case, punitive action against the company would be justified.
Step 3. Collect the sample data and compute the value of the test statistic
Suppose the Population Standard Deviation (from historic data) is σ = 0.18 and we take a sample of size n = 36.
Then the standard error of the mean is given by
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.18}{\sqrt{36}} = 0.03$$
For hypothesis tests about a population mean in the σ known case, we use the standard normal random variable z as a
test statistic to determine whether x̅ deviates from the hypothesized value of µ enough to justify rejecting the null
hypothesis. The test statistic (z) is as follows:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
A value of z = -1 means that the value of x̅ is one standard error below the hypothesized value of the mean, a value of z
= -2 means that the value of x̅ is two standard errors below the hypothesized value of the mean, and so on. We can use
the standard normal probability table to find the lower tail probability corresponding to any z value.
Let us now compute the value of z in our example, where σ = 0.18 and sample size n = 36. Suppose the mean of the sample
comes out to be x̅ = 2.92 litres.
We calculate the z statistic as below:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{2.92 - 3}{0.18 / \sqrt{36}} = -2.67$$
The key question for a lower tail test is, How small must the test statistic z be before we choose to reject the null
hypothesis? Two approaches can be used to answer this question: the p-value approach and the critical value approach.
p-value approach
The p-value is used to determine whether the null hypothesis should be rejected. A p-value is a probability that provides
a measure of the evidence against the null hypothesis provided by the sample. Smaller p-values indicate more evidence
against H0 .
Step 4. Use the value of the test statistic to compute the p-value
Using the standard normal probability table, we find that the lower tail area (p-value) at z = -2.67 is 0.0038. (Look at
the figure to appreciate relation in Z and p)
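The table lookup can be cross-checked with a short Python sketch using the standard library's `statistics.NormalDist`. The numbers are those of the example. The significance level α = 0.01 is an assumption, since the level for this example is not stated in the surviving text. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """z = (x-bar - mu0) / (sigma / sqrt(n)) for the sigma-known case."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


z = z_statistic(sample_mean=2.92, mu0=3, sigma=0.18, n=36)

# Lower tail test: the p-value is the area to the left of z.
p_value = NormalDist().cdf(z)

alpha = 0.01  # assumed level of significance (not stated in the text)
reject_h0 = p_value <= alpha

print(round(z, 2))
print(round(p_value, 4))
print(reject_h0)
```

The computed area may differ from the table value in the last decimal place, because the table is read at the rounded value z = -2.67 while the code uses the unrounded statistic.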
The region of rejection of the null hypothesis is called the critical region for the hypothesis test. The critical region is
sometimes referred to as the region of rejection of H0 , and the two terms are synonymous.
The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection
decision.
We can use the same general approach to conduct an upper tail test. The test statistic z is still computed using same
equation.
But, for an upper tail test, the p-value is the probability of obtaining a value for the test statistic as large as or larger
than that provided by the sample. Thus, to compute the p-value for the upper tail test, we must find the area under the
standard normal curve to the right of the test statistic.
For lower tail tests, the null hypothesis is rejected if the value of the test statistic is less than or equal to the critical
value. For upper tail tests, the null hypothesis is rejected if the value of the test statistic is greater than or equal to the
critical value.
In other words, for upper tail tests:
Reject H0 if z ≥ zα
Suppose a supplier supplies a metallic panel to Maruti, with mean length of 295 cms. Since the panel will fit in a car,
Maruti will NOT prefer length to be either greater or lesser than 295 cms.
Step 3. Collect the sample data and compute the value of the test statistic
Suppose the Population Standard Deviation (from historic data) is σ = 12 and we take a sample of size n = 50.
Then the standard error of the mean is given by
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{50}} \approx 1.7$$
Suppose the mean of the sample comes out to be x̅ = 297.6 cms.
We calculate the z statistic as below:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{297.6 - 295}{12 / \sqrt{50}} = 1.53$$
Step 4. Use the value of the test statistic to compute the p-value
Recall that the p-value is a probability used to determine whether the null hypothesis should be rejected. For a two-
tailed test, values of the test statistic in either tail provide evidence against the null hypothesis.
Now to compute the p-value we must find the probability of obtaining a value for the test statistic at least as unlikely
as z = 1.53. Clearly values of z ≥ 1.53 are at least as unlikely. But, because this is a two-tailed test, values of z ≤
−1.53 are also at least as unlikely as the value of the test statistic provided by the sample. As shown in the Figure, we
note that the two-tailed p-value in this case is given by P(z ≥ 1.53) + P(z ≤ - 1.53).
The table for the standard normal distribution shows that the area to the left of z = 1.53 is 0.9370. Thus, the area
under the standard normal curve to the right of the test statistic z = 1.53 is 1.0000 - 0.9370 =0.0630. Doubling this,
we find the p-value for our example is 2 × (.0630) = 0.1260.
Please note that, if the value of the test statistic is in the upper tail (z > 0), we find the area under the standard normal
curve to the right of z. If the value of the test statistic is in the lower tail (z < 0), find the area under the standard
normal curve to the left of z.
With a level of significance of α = 0.05, the area in each tail beyond the critical values is α/2 = 0.05/2 = 0.025. Using
the standard normal probability table, we find the critical values for the test statistic are −z0.025 = −1.96 and
z0.025 = 1.96. Thus, using the critical value approach, the two-tailed rejection rule is
Reject H0 if z ≤ −1.96 or z ≥ 1.96
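Both approaches for the two-tailed example can be sketched in Python with `statistics.NormalDist`. The inputs are the figures from the example (µ0 = 295, σ = 12, n = 50, x̅ = 297.6, α = 0.05). The variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, sample_mean = 295, 12, 50, 297.6
alpha = 0.05

z = (sample_mean - mu0) / (sigma / sqrt(n))

# p-value approach: double the area in the tail beyond |z|.
std_normal = NormalDist()
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_by_p = p_value <= alpha

# Critical value approach: reject if |z| >= z_(alpha/2).
critical = std_normal.inv_cdf(1 - alpha / 2)
reject_by_critical = abs(z) >= critical

print(round(z, 2))
print(round(p_value, 3))
print(round(critical, 2))
print(reject_by_p, reject_by_critical)
```

Because the code keeps the unrounded z, its p-value is marginally below the table-based 0.1260. Either way the p-value exceeds α = 0.05 and |z| is below 1.96, so both approaches lead to the same decision: H0 is not rejected.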
Summary of all three types of Z tests (discussed above) is tabled below (using z statistic).
To conduct a hypothesis test about a population mean for the σ unknown case, the sample mean x̅ is used as an estimate
of μ and the sample standard deviation s is used as an estimate of the population standard deviation σ.
The steps of the hypothesis testing procedure for the “σ unknown case” are the same as those for the “σ known case”.
Recall that for the σ known case, the sampling distribution of the test statistic has a standard normal distribution. For
the σ unknown case, however, the sampling distribution of the test statistic follows the t distribution (with n-1 degrees
of freedom); it has slightly more variability because the sample is used to develop estimates of both μ and σ.
Step 3. Collect the sample data and compute the value of the test statistic
We calculate the t statistic as below:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{7.25 - 7}{1.052 / \sqrt{60}} = 1.84$$
The students may note the difference in formula of z statistic (previous examples) and t statistic.
Step 4. Use the value of the test statistic to compute the p-value
The sampling distribution of t has 60-1=59 degrees of freedom. Because the test is an upper tail test, the p-value is the
area under the curve of the t distribution to the right of t = 1.84.
Then we use a t table to compute p-value. The p value comes out to be 0.0354 (from t table).
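Where a t table is not at hand, the upper tail area can be approximated numerically. The sketch below uses only the Python standard library: it integrates the Student t density with Simpson's rule, which is a substitute technique chosen here for illustration, not the table lookup used in the text. The function names and the number of integration steps are assumptions.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_coeff = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_coeff) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t_value, df, steps=2000):
    """P(T >= t_value), via Simpson's rule on [0, |t_value|] (steps must be even)."""
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_from_zero = total * h / 3
    return 0.5 - area_from_zero if t_value >= 0 else 0.5 + area_from_zero


sample_mean, mu0, s, n = 7.25, 7, 1.052, 60
t_stat = (sample_mean - mu0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)

print(round(t_stat, 2))
print(round(p_value, 3))
```

The result should be close to the table-based value quoted above. Small differences in the last digits arise because the code uses the unrounded t statistic.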
The summary of both One tailed tests and the Two tailed test is presented in the table below (using t statistic):
Suppose 20% of the participants at Gold Gym are women. In order to increase the participation of women, Gold Gym
started a marketing campaign. After 3 months of the campaign, Gold Gym wants to find out whether the proportion of women
has increased. A sample of 400 participants was taken, out of which 100 were women.
Step 3. Collect the sample data and compute the value of the test statistic
We calculate z Statistic is below: